
Router

The SIE router is a stateless FastAPI proxy that sits between clients and GPU workers. It handles GPU-aware routing, load balancing, resource pools, and scale-from-zero orchestration.

Not every deployment needs a router. The deciding factor is how many workers you have:

  • Single server (local dev, single Docker container): Connect the SDK directly to the server (http://localhost:8080). No router needed.
  • Multiple workers (Kubernetes, multi-GPU, Docker Compose scale): Use the router. It handles load balancing across workers, GPU-aware routing, and returns 202 Accepted when workers are provisioning so the SDK can retry automatically.

The router is stateless: it discovers workers via the Kubernetes watch API or a static URL list, and holds no data of its own. You can run multiple router replicas behind a standard load balancer for high availability.
| Setup | Use Router? | Why |
| --- | --- | --- |
| Single Docker container | No | Connect the SDK directly to the worker |
| Docker Compose (multi-worker) | Optional | Useful for a unified endpoint across workers |
| Kubernetes (any scale) | Yes | Required for multi-worker routing, scale-from-zero, and pool isolation |

[Figure: Router architecture: SDK/HTTP client → Router → L4 and A100 workers.]

The router is stateless — it discovers workers dynamically and can be horizontally scaled.


The router selects a worker based on:

  1. GPU type — The X-SIE-MACHINE-PROFILE header matches a worker pool
  2. Model affinity — Prefers workers with the requested model already loaded in GPU memory
  3. Queue depth — Routes to the least-loaded healthy worker
  4. Pool isolation — The X-SIE-Pool header routes to reserved workers

Every request must specify a target GPU type:

Terminal window
# HTTP
curl -X POST http://router:8081/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'

# SDK
result = client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")

When no workers are available for the requested GPU type, the router returns:

HTTP/1.1 202 Accepted
Retry-After: 120
Content-Type: application/json

{"status": "provisioning", "gpu": "l4", "message": "Worker provisioning in progress"}

The SDK handles this automatically with wait_for_capacity=True. See Scale-from-Zero for details.
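Under the hood, waiting for capacity amounts to a retry loop that honors Retry-After. A hand-rolled sketch of that loop (the `send` callable and all parameter names here are illustrative, not the SDK's API):

```python
import time

def post_with_capacity_wait(send, max_wait=600.0, sleep=time.sleep):
    """Re-issue a request while the router answers 202 Accepted
    (workers provisioning), honoring the Retry-After header.

    `send` performs the HTTP POST and returns (status, headers, body);
    the SDK does the equivalent for you with wait_for_capacity=True.
    """
    waited = 0.0
    while True:
        status, headers, body = send()
        if status != 202:
            return status, body
        delay = float(headers.get("Retry-After", 30))
        if waited + delay > max_wait:
            raise TimeoutError("no capacity for requested GPU type")
        sleep(delay)
        waited += delay
```

With a real HTTP client, `send` would wrap something like a `requests.post(...)` call to the encode endpoint shown above and unpack its status, headers, and JSON body.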


List worker URLs explicitly:

Terminal window
sie-router serve \
  -w http://worker-1:8080 \
  -w http://worker-2:8080 \
  -w http://worker-3:8080

Auto-discover workers via Kubernetes service endpoints:

Terminal window
sie-router serve \
  --kubernetes \
  --k8s-namespace sie \
  --k8s-service sie-worker \
  --k8s-port 8080

In Kubernetes mode, the router watches endpoint changes and automatically registers/deregisters workers.
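Conceptually, the discovery loop folds endpoint events into a worker set. A simplified sketch (the class and event shape are illustrative; real events come from the Kubernetes watch API):

```python
class WorkerRegistry:
    """Mirror of the Kubernetes endpoint list: apply ADDED/DELETED
    events from a watch stream to keep the worker set current."""

    def __init__(self, port=8080):
        self.port = port
        self.workers = set()

    def apply(self, event_type, ip):
        url = f"http://{ip}:{self.port}"
        if event_type == "ADDED":
            self.workers.add(url)       # new endpoint: register worker
        elif event_type == "DELETED":
            self.workers.discard(url)   # endpoint gone: deregister worker

registry = WorkerRegistry(port=8080)
for event, ip in [("ADDED", "10.0.0.5"), ("ADDED", "10.0.0.6"), ("DELETED", "10.0.0.5")]:
    registry.apply(event, ip)
print(sorted(registry.workers))  # ['http://10.0.0.6:8080']
```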


Resource pools reserve dedicated workers for tenant isolation. Pool workers only serve requests for that pool.

client = SIEClient("http://router:8081")

# Reserve 2 L4 workers for this tenant
client.create_pool("tenant-abc", {"l4": 2})

# Route requests to the pool
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    gpu="tenant-abc/l4",  # pool_name/gpu_type
)

# Check pool status
info = client.get_pool("tenant-abc")

# Cleanup
client.delete_pool("tenant-abc")
  • Pools are created on first request (lazy initialization)
  • The SDK renews pool leases automatically in a background thread
  • Pools are garbage-collected after inactivity (no active lease renewal)
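The lease lifecycle in the bullets above can be sketched as a small manager. The class name, TTL value, and injected clock are illustrative assumptions, not the router's actual implementation:

```python
import time

class PoolManager:
    """Lazy pool creation, lease renewal, and inactivity GC."""

    def __init__(self, lease_ttl=60.0, now=time.monotonic):
        self.lease_ttl = lease_ttl
        self.now = now
        self.pools = {}  # name -> {"gpus": {...}, "expires": float}

    def touch(self, name, gpus=None):
        """Create the pool on first use (lazy init) and renew its lease."""
        pool = self.pools.setdefault(name, {"gpus": gpus or {}})
        pool["expires"] = self.now() + self.lease_ttl
        return pool

    def collect(self):
        """Garbage-collect pools whose lease was not renewed in time."""
        now = self.now()
        for name in [n for n, p in self.pools.items() if p["expires"] <= now]:
            del self.pools[name]
```

The SDK's background renewal thread corresponds to calling `touch` periodically; once it stops, the next `collect` pass reclaims the pool's workers.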

The router aggregates health from all workers:

| Endpoint | Description |
| --- | --- |
| GET /healthz | Router liveness (always 200 if process is alive) |
| GET /readyz | Router readiness (200 if at least one worker is healthy) |
| GET /health | Cluster summary: worker count, GPU count, models loaded |
| GET /v1/models | Aggregated model list across all workers |
| WS /ws/cluster-status | Real-time cluster metrics stream |
Terminal window
curl http://router:8081/health
{
  "status": "healthy",
  "worker_count": 3,
  "gpu_count": 3,
  "models_loaded": 12,
  "configured_gpu_types": ["l4", "a100-80gb"],
  "live_gpu_types": ["l4"]
}
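The aggregation itself is a fold over per-worker health reports. A sketch under assumed report fields (the real payload shape may differ, and models are deduplicated here, which the actual count may not do):

```python
def cluster_summary(worker_reports, configured_gpu_types):
    """Fold per-worker health reports into a /health-style payload:
    counts plus the set of GPU types with at least one live worker."""
    healthy = [r for r in worker_reports if r["status"] == "healthy"]
    models = set()
    live_gpu_types = set()
    gpu_count = 0
    for r in healthy:
        models.update(r["models"])
        live_gpu_types.add(r["gpu_type"])
        gpu_count += r["gpus"]
    return {
        "status": "healthy" if healthy else "unhealthy",
        "worker_count": len(healthy),
        "gpu_count": gpu_count,
        "models_loaded": len(models),
        "configured_gpu_types": configured_gpu_types,
        "live_gpu_types": sorted(live_gpu_types),
    }
```

Note how `live_gpu_types` can be a strict subset of `configured_gpu_types`: a GPU type that is configured but currently has no healthy worker is exactly the case that triggers the 202 provisioning response above.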

Terminal window
sie-router serve [OPTIONS]
| Option | Default | Description |
| --- | --- | --- |
| --port, -p | 8081 | Router listen port |
| --host | 0.0.0.0 | Bind address |
| -w, --worker | None | Worker URL (repeat for multiple) |
| -k, --kubernetes | false | Enable Kubernetes service discovery |
| --k8s-namespace | default | Kubernetes namespace |
| --k8s-service | sie-worker | Kubernetes service name |
| --k8s-port | 8080 | Worker port |
| -l, --log-level | info | Log level |
| --json-logs | false | Structured JSON logging |
| -r, --reload | false | Auto-reload on code changes (dev) |