
Scale-from-Zero & Autoscaling

SIE clusters scale GPU workers to zero when idle and provision them on-demand. This page explains the full lifecycle, cold start expectations, and how to handle 202 responses.

When all workers are scaled to zero and a request arrives:

[Diagram: scale-from-zero request flow through Router, KEDA, GKE, and Worker]

Key point: The X-SIE-MACHINE-PROFILE header (or SDK gpu parameter) is required for the router to know which worker pool to target. Without it, you get a 503 instead of a 202.


Cold start from zero has three phases:

| Phase | Duration | What Happens |
| --- | --- | --- |
| Node provisioning | 2-5 min | GKE finds a GPU node (spot takes longer if scarce) |
| Container startup | 20-40 s | Pull image, start process, health checks pass |
| Model loading | 10-120 s | Download weights (if not cached) and load to GPU |

Total cold start: 3-7 minutes depending on model size and spot availability.

Once a worker is warm, subsequent requests for any model on that worker skip node provisioning and container startup: a model loads on demand from the local cache in 10-120 s, or responds immediately if it is already in GPU memory.


When the cluster is scaled to zero, HTTP requests receive a 202 Accepted response:

```bash
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -d '{"items": [{"text": "Hello world"}]}'
# Response: 202 Accepted
# Headers: Retry-After: 120
```

Your HTTP client should retry after the Retry-After interval. Keep retrying for at least 7 minutes on a cold start.
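If you are not using the SDK, the retry loop might look like the following sketch. It uses only the standard library; the endpoint, header, and payload mirror the curl example above, and the `encode_with_retry` helper name is ours, not part of SIE:

```python
import json
import time
import urllib.request

def parse_retry_after(headers, default=30):
    """Read the Retry-After header (seconds), falling back to a default."""
    try:
        return int(headers.get("Retry-After", default))
    except (TypeError, ValueError):
        return default

def encode_with_retry(url, payload, profile="l4", timeout_s=420):
    """POST, retrying on 202 Accepted until a worker is ready or we time out."""
    body = json.dumps(payload).encode()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        req = urllib.request.Request(url, data=body, headers={
            "Content-Type": "application/json",
            "X-SIE-MACHINE-PROFILE": profile,  # required, or the router returns 503
        })
        with urllib.request.urlopen(req) as resp:  # non-2xx raises HTTPError
            if resp.status != 202:
                return json.load(resp)
            wait = parse_retry_after(resp.headers)
        time.sleep(wait)  # worker pool is still scaling from zero
    raise TimeoutError(f"no capacity after {timeout_s}s")
```

For example: `encode_with_retry("http://sie.example.com/v1/encode/BAAI/bge-m3", {"items": [{"text": "Hello world"}]})`.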

Without the GPU header, you get 503:

```bash
# Missing X-SIE-MACHINE-PROFILE → 503
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'
# Response: 503 Service Unavailable
# {"detail": {"message": "No healthy workers available"}}
```

The SDK handles 202 retries automatically with wait_for_capacity=True:

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.example.com", api_key="YOUR_KEY")

# Automatically retries 202s with exponential backoff
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="l4",
    wait_for_capacity=True,
    provision_timeout_s=420,  # 7 minutes for cold start
)
```

Each (machine_profile, bundle) combination has its own KEDA ScaledObject and scales independently.

| Bundle | Models Served | Example ScaledObject |
| --- | --- | --- |
| default | BGE-M3, E5, Stella, ColBERT, rerankers | l4-spot-default |
| gliner | GLiNER, GLiREL, GLiClass | l4-spot-gliner |
| florence2 | Florence-2, Donut | l4-spot-florence2 |
| sglang | Large 4B+ parameter models | a100-80gb-sglang |

What this means in practice: If you have encode and score working on the default bundle worker, but then call extract with a GLiNER model, a separate gliner bundle worker needs to scale up. This is a new cold start — expect another 5-7 minutes.

```python
# This uses the default bundle worker (already warm)
client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")

# This needs the gliner bundle worker (may trigger cold start)
client.extract(
    "urchade/gliner_multi-v2.1",
    Item(text="Tim Cook leads Apple."),
    labels=["person", "org"],
    gpu="l4",
    wait_for_capacity=True,
    provision_timeout_s=420,
)
```

KEDA uses Prometheus metrics to make scaling decisions:

| Metric | Purpose | Used For |
| --- | --- | --- |
| sie_router_pending_demand | Requests waiting for a worker type | Scale-from-zero activation |
| sie_router_worker_queue_depth | Items queued per worker | Scale-up (add more replicas) |

The scaling thresholds are configurable:

```yaml
autoscaling:
  enabled: true
  pollingInterval: 15         # Check metrics every 15 seconds
  cooldownPeriod: 900         # 15 minutes before scaling to zero
  scaleDownStabilization: 300 # 5 minute stabilization window
  queueDepthThreshold: 10     # Scale up at 10 pending requests/pod
  queueDepthActivation: 2     # Activate from zero at 2 requests
```

After no requests arrive for the cooldownPeriod (default: 15 minutes), KEDA scales workers back to zero. The next request triggers a full cold start again.

  • Consistent traffic: Lower cooldown (300s) to keep workers warm
  • Bursty traffic: Higher cooldown (900s) to avoid repeated cold starts
  • Cost-sensitive: Default 900s balances cost and responsiveness
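For example, a deployment with steady traffic could shorten only the cooldown and keep the other thresholds at the defaults shown above (a sketch; adjust to your own values file layout):

```yaml
autoscaling:
  enabled: true
  cooldownPeriod: 300      # scale to zero after 5 idle minutes instead of 15
  queueDepthThreshold: 10
  queueDepthActivation: 2
```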

The X-SIE-MACHINE-PROFILE header (HTTP) or gpu parameter (SDK) determines which worker pool receives the request.

| Profile | GPU | Typical Use |
| --- | --- | --- |
| l4 | NVIDIA L4 (24GB) | Standard inference, best price/performance |
| l4-spot | NVIDIA L4 (spot) | 60-70% cheaper, may be preempted |
| a100-40gb | NVIDIA A100 (40GB) | Large models, high throughput |
| a100-80gb | NVIDIA A100 (80GB) | Very large models (7B+ params) |

Spot instances offer significant cost savings but may take longer to provision if capacity is scarce.
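A common way to handle scarce spot capacity is to fall back to an on-demand profile when provisioning times out. A sketch of that pattern (the `encode` callable and `encode_with_fallback` helper are ours; in practice the callable would wrap an SDK call such as `client.encode(..., gpu=profile, wait_for_capacity=True)`):

```python
def encode_with_fallback(encode, profiles=("l4-spot", "l4")):
    """Try each machine profile in order, falling back on provisioning timeouts."""
    last_err = None
    for profile in profiles:
        try:
            # e.g. client.encode(model, item, gpu=profile,
            #                    wait_for_capacity=True, provision_timeout_s=420)
            return encode(profile)
        except TimeoutError as err:  # capacity for this profile never arrived
            last_err = err
    raise last_err
```

Ordering the tuple spot-first keeps the cheaper capacity as the default while bounding the worst case to one extra provisioning timeout.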


Cause: Missing X-SIE-MACHINE-PROFILE header on HTTP requests, or no worker pool configured for the requested profile.

Fix: Add the GPU header to your request:

```bash
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'
```

Or use the SDK with the gpu parameter:

```python
client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")
```

Possible causes:

  • Timeout too short — Cold starts take 5-7 minutes. Use provision_timeout_s=420 in the SDK
  • Spot GPU unavailable — Try a different machine profile (e.g., l4 instead of l4-spot)
  • KEDA not configured — Check that KEDA is installed and ScaledObjects exist: kubectl get scaledobjects -n sie
  • Prometheus down — KEDA needs Prometheus for metrics. Check: kubectl get pods -n monitoring

Workers scale up then immediately scale down

Cause: Requests stopped before the worker became ready. KEDA sees demand drop to 0 and begins cooldown.

Fix: Keep sending requests (or use the SDK with wait_for_capacity=True) for the full cold start duration. The SDK handles this automatically with retry logic.

Models from different bundles not available

Cause: Each bundle runs in a separate worker. Your encode/score models (default bundle) may be warm, but extract models (gliner bundle) need their own worker.

Fix: Send requests with wait_for_capacity=True and a sufficient timeout. The gliner worker will scale up independently.
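If you know ahead of time that you will need both bundles, you can trigger both cold starts in parallel instead of paying for them back to back. A sketch using a thread pool (the `warm_up` helper is ours; each callable would wrap an SDK call like the `encode`/`extract` examples above):

```python
from concurrent.futures import ThreadPoolExecutor

def warm_up(requests_by_bundle):
    """Fire one request per bundle concurrently so the cold starts overlap.

    `requests_by_bundle` maps a bundle name to a zero-arg callable, e.g.
    {"default": lambda: client.encode(...), "gliner": lambda: client.extract(...)},
    each called with wait_for_capacity=True and a generous provision timeout.
    """
    with ThreadPoolExecutor(max_workers=len(requests_by_bundle)) as pool:
        futures = {name: pool.submit(fn) for name, fn in requests_by_bundle.items()}
        # result() blocks until that bundle's worker has served the request
        return {name: fut.result() for name, fut in futures.items()}
```

Because each (machine_profile, bundle) pool scales independently, the two workers provision concurrently and the total wait is roughly one cold start rather than two.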