
Router

The SIE router is a stateless FastAPI proxy that sits between clients and GPU workers. It handles GPU-aware routing, load balancing, resource pools, and scale-from-zero orchestration.

Not every deployment needs a router. The deciding factor is how many workers you have:

  • Single server (local dev, single Docker container): Connect the SDK directly to the server (http://localhost:8080). No router needed.
  • Multiple workers (Kubernetes, multi-GPU, Docker Compose scale): Use the router. It handles load balancing across workers, GPU-aware routing, and returns 202 Accepted when workers are provisioning so the SDK can retry automatically.

The router is stateless: it discovers workers via the Kubernetes watch API or a static URL list, and holds no data of its own. You can run multiple router replicas behind a standard load balancer for high availability.
| Setup | Use Router? | Why |
| --- | --- | --- |
| Single Docker container | No | Connect the SDK directly to the worker |
| Docker Compose (multi-worker) | Optional | Useful for a unified endpoint across workers |
| Kubernetes (any scale) | Yes | Required for multi-worker routing, scale-from-zero, and pool isolation |

[Figure: Router architecture: SDK/HTTP client → Router → L4 and A100 workers.]

The router is stateless — it discovers workers dynamically and can be horizontally scaled.


The router selects a worker based on:

  1. GPU type — The X-SIE-MACHINE-PROFILE header matches a worker pool
  2. Model affinity — Prefers workers with the requested model already loaded in GPU memory
  3. Queue depth — Routes to the least-loaded healthy worker
  4. Pool isolation — The X-SIE-Pool header routes to reserved workers

Every request must specify a target GPU type:

Terminal window
# HTTP
curl -X POST http://router:8081/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'

# SDK
result = client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")

When no workers are available for the requested GPU type, the router returns:

HTTP/1.1 202 Accepted
Retry-After: 120
Content-Type: application/json

{"status": "provisioning", "gpu": "l4", "message": "Worker provisioning in progress"}

The SDK handles this automatically with wait_for_capacity=True. See Scale-from-Zero for details.
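Under the hood, waiting for capacity amounts to a retry loop that honors Retry-After. A hand-rolled sketch of that loop (the `send` callable and all parameter names here are illustrative, not the SDK's API):

```python
import time

def post_with_capacity_wait(send, max_wait=600.0, sleep=time.sleep):
    """Re-issue a request while the router answers 202 Accepted
    (workers provisioning), honoring the Retry-After header.

    `send` performs the HTTP POST and returns (status, headers, body);
    the SDK does the equivalent for you with wait_for_capacity=True.
    """
    waited = 0.0
    while True:
        status, headers, body = send()
        if status != 202:
            return status, body
        delay = float(headers.get("Retry-After", 30))
        if waited + delay > max_wait:
            raise TimeoutError("no capacity for requested GPU type")
        sleep(delay)
        waited += delay
```

With a real HTTP client, `send` would wrap something like a `requests.post(...)` call to the encode endpoint shown above and unpack its status, headers, and JSON body.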


List worker URLs explicitly:

Terminal window
sie-router serve \
  -w http://worker-1:8080 \
  -w http://worker-2:8080 \
  -w http://worker-3:8080

Auto-discover workers via Kubernetes service endpoints:

Terminal window
sie-router serve \
  --kubernetes \
  --k8s-namespace sie \
  --k8s-service sie-worker \
  --k8s-port 8080

In Kubernetes mode, the router watches endpoint changes and automatically registers/deregisters workers.
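Conceptually, the discovery loop folds endpoint events into a worker set. A simplified sketch (the class and event shape are illustrative; real events come from the Kubernetes watch API):

```python
class WorkerRegistry:
    """Mirror of the Kubernetes endpoint list: apply ADDED/DELETED
    events from a watch stream to keep the worker set current."""

    def __init__(self, port=8080):
        self.port = port
        self.workers = set()

    def apply(self, event_type, ip):
        url = f"http://{ip}:{self.port}"
        if event_type == "ADDED":
            self.workers.add(url)       # new endpoint: register worker
        elif event_type == "DELETED":
            self.workers.discard(url)   # endpoint gone: deregister worker

registry = WorkerRegistry(port=8080)
for event, ip in [("ADDED", "10.0.0.5"), ("ADDED", "10.0.0.6"), ("DELETED", "10.0.0.5")]:
    registry.apply(event, ip)
print(sorted(registry.workers))  # ['http://10.0.0.6:8080']
```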


Resource pools reserve dedicated workers for tenant isolation. Pool workers only serve requests for that pool.

client = SIEClient("http://router:8081")

# Reserve 2 L4 workers for this tenant
client.create_pool("tenant-abc", {"l4": 2})

# Route requests to the pool
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    gpu="tenant-abc/l4",  # pool_name/gpu_type
)

# Check pool status
info = client.get_pool("tenant-abc")

# Cleanup
client.delete_pool("tenant-abc")
  • Pools are created on first request (lazy initialization)
  • The SDK renews pool leases automatically in a background thread
  • Pools are garbage-collected after inactivity (no active lease renewal)
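The lease lifecycle in the bullets above can be sketched as a small manager. The class name, TTL value, and injected clock are illustrative assumptions, not the router's actual implementation:

```python
import time

class PoolManager:
    """Lazy pool creation, lease renewal, and inactivity GC."""

    def __init__(self, lease_ttl=60.0, now=time.monotonic):
        self.lease_ttl = lease_ttl
        self.now = now
        self.pools = {}  # name -> {"gpus": {...}, "expires": float}

    def touch(self, name, gpus=None):
        """Create the pool on first use (lazy init) and renew its lease."""
        pool = self.pools.setdefault(name, {"gpus": gpus or {}})
        pool["expires"] = self.now() + self.lease_ttl
        return pool

    def collect(self):
        """Garbage-collect pools whose lease was not renewed in time."""
        now = self.now()
        for name in [n for n, p in self.pools.items() if p["expires"] <= now]:
            del self.pools[name]
```

The SDK's background renewal thread corresponds to calling `touch` periodically; once it stops, the next `collect` pass reclaims the pool's workers.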

The router aggregates health from all workers:

| Endpoint | Description |
| --- | --- |
| GET /healthz | Router liveness (always 200 if process is alive) |
| GET /readyz | Router readiness (200 if at least one worker is healthy) |
| GET /health | Cluster summary: worker count, GPU count, models loaded |
| GET /v1/models | Aggregated model list across all workers |
| WS /ws/cluster-status | Real-time cluster metrics stream |
Terminal window
curl http://router:8081/health
{
  "status": "healthy",
  "worker_count": 3,
  "gpu_count": 3,
  "models_loaded": 12,
  "configured_gpu_types": ["l4", "a100-80gb"],
  "live_gpu_types": ["l4"]
}
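The aggregation itself is a fold over per-worker health reports. A sketch under assumed report fields (the real payload shape may differ, and models are deduplicated here, which the actual count may not do):

```python
def cluster_summary(worker_reports, configured_gpu_types):
    """Fold per-worker health reports into a /health-style payload:
    counts plus the set of GPU types with at least one live worker."""
    healthy = [r for r in worker_reports if r["status"] == "healthy"]
    models = set()
    live_gpu_types = set()
    gpu_count = 0
    for r in healthy:
        models.update(r["models"])
        live_gpu_types.add(r["gpu_type"])
        gpu_count += r["gpus"]
    return {
        "status": "healthy" if healthy else "unhealthy",
        "worker_count": len(healthy),
        "gpu_count": gpu_count,
        "models_loaded": len(models),
        "configured_gpu_types": configured_gpu_types,
        "live_gpu_types": sorted(live_gpu_types),
    }
```

Note how `live_gpu_types` can be a strict subset of `configured_gpu_types`: a GPU type that is configured but currently has no healthy worker is exactly the case that triggers the 202 provisioning response above.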

Terminal window
sie-router serve [OPTIONS]
| Option | Default | Description |
| --- | --- | --- |
| --port, -p | 8081 | Router listen port |
| --host | 0.0.0.0 | Bind address |
| -w, --worker | None | Worker URL (repeat for multiple) |
| -k, --kubernetes | false | Enable Kubernetes service discovery |
| --k8s-namespace | default | Kubernetes namespace |
| --k8s-service | sie-worker | Kubernetes service name |
| --k8s-port | 8080 | Worker port |
| -l, --log-level | info | Log level |
| --json-logs | false | Structured JSON logging |
| -r, --reload | false | Auto-reload on code changes (dev) |