
Troubleshooting

Connection refused or timeouts

Symptoms: ConnectionError, ECONNREFUSED, or request timeouts.

Causes and fixes:

  • Server not running — Start with docker run -p 8080:8080 ghcr.io/superlinked/sie:default or sie-server serve
  • Wrong port — Default is 8080. Check with curl http://localhost:8080/healthz
  • Firewall/security group — Ensure port 8080 is open for your network
  • Docker networking — Use --network host or ensure port mapping is correct (-p 8080:8080)
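
A quick way to separate "server not running" from a firewall or port-mapping problem is a raw TCP probe before reaching for curl. A minimal sketch in Python, assuming the default port from above:

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# False means nothing is listening (or a firewall drops the connection),
# so curl-level errors are not an application problem yet.
print(is_port_open("localhost", 8080))
```

If the port is open but /healthz still fails, the problem is inside the server rather than on the network path.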

Requests fail behind the router

Context: Kubernetes deployment with router.

Cause: Missing X-SIE-MACHINE-PROFILE header on HTTP requests. The router doesn’t know which worker pool to target.

Fix: Add the GPU header:

curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'

Or use the SDK gpu parameter:

result = client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")

See Scale-from-Zero for the full autoscaling flow.

Scale-from-zero requests time out

Context: Kubernetes with KEDA scale-to-zero.

Causes:

  • Timeout too short — Cold starts take 5-7 minutes. Set provision_timeout_s=420
  • Spot GPUs unavailable — Try on-demand (l4 instead of l4-spot)
  • KEDA not running — Check: kubectl get pods -n keda
  • Prometheus unreachable — KEDA needs metrics: kubectl get pods -n monitoring

# Recommended: use SDK with generous timeout
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    gpu="l4",
    wait_for_capacity=True,
    provision_timeout_s=420,
)

Model not found

Symptoms: 404 Not Found or “model not available” error.

Causes and fixes:

  • Wrong model name — Use the SIE model ID (e.g., BAAI/bge-m3), not a custom alias. Check available models: curl http://localhost:8080/v1/models
  • Wrong bundle — Some models require specific bundles. GLiNER needs the gliner bundle, Florence-2 needs florence2. See Bundles
  • Model filter active — If SIE_MODEL_FILTER is set, only listed models are available
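
To automate the first check, you can parse the /v1/models response and look for the exact ID. The payload shape below (an OpenAI-style list under "data") is an assumption; adjust it to what your deployment actually returns:

```python
import json

def model_available(models_json: str, model_id: str) -> bool:
    """Check a /v1/models payload for an exact model ID match.

    Assumes an OpenAI-style shape: {"data": [{"id": "..."}]}.
    """
    payload = json.loads(models_json)
    return any(m.get("id") == model_id for m in payload.get("data", []))

# Example payload; fetch the real one with: curl http://localhost:8080/v1/models
sample = '{"data": [{"id": "BAAI/bge-m3"}]}'
print(model_available(sample, "BAAI/bge-m3"))  # exact ID matches
print(model_available(sample, "bge-m3"))       # aliases do not
```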

Slow first request

Context: First request to a model takes a long time.

Expected behavior: Models load on-demand. First request downloads weights (if not cached) and loads to GPU. Subsequent requests are fast.

Scenario                                   Expected Time
Weights cached, loading to GPU             10-30s (small model), 30-120s (large model)
Downloading from HuggingFace               1-10 minutes depending on model size and network
Downloading from cluster cache (S3/GCS)    30s-3 minutes

Speed up loading:

  • Mount a persistent HuggingFace cache: -v ~/.cache/huggingface:/app/.cache/huggingface
  • Use cluster cache: SIE_CLUSTER_CACHE=s3://bucket/weights
  • Pre-warm models by sending a dummy request at startup
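
Pre-warming from the last bullet can be a few lines at startup. This sketch is generic: encode stands in for whatever request function you use (e.g. the SDK's client.encode wrapped to take a model ID and a string), and failures are collected rather than raised so one bad model doesn't block the rest:

```python
def prewarm(encode, model_ids, probe_text="warmup"):
    """Send one tiny request per model so weights download and load
    before real traffic arrives. Returns a per-model status dict."""
    status = {}
    for model_id in model_ids:
        try:
            encode(model_id, probe_text)
            status[model_id] = "loaded"
        except Exception as exc:  # keep warming the remaining models
            status[model_id] = f"failed: {exc}"
    return status
```

Run it once after /healthz goes green; subsequent real requests then skip the cold load.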

GPU not detected

Symptoms: Server falls back to CPU, or --gpus all fails.

Fixes:

  1. Install NVIDIA Container Toolkit:
    # Ubuntu/Debian
    sudo apt-get install -y nvidia-container-toolkit
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
  2. Verify GPU access:
    docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
  3. Use the --gpus all flag:
    docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie:default
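
To verify from inside a running container (or from the host) whether the NVIDIA runtime exposed a GPU at all, a small sketch using only the standard library and the same nvidia-smi binary step 2 relies on:

```python
import shutil
import subprocess

def gpu_visible() -> bool:
    """True if nvidia-smi is on PATH and exits cleanly, i.e. this
    environment can actually see a GPU through the NVIDIA driver."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi"], capture_output=True)
    return result.returncode == 0
```

If this is False inside the container but True on the host, the container toolkit or the --gpus all flag is the problem, not the driver.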

Out of memory

Symptoms: CUDA out of memory, process killed, or pod evicted.

Causes and fixes:

  • Model too large for GPU — Check model size vs GPU VRAM in Resources
  • Too many models loaded — Lower SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT (default: 85) to trigger eviction earlier
  • Batch size too large — Reduce SIE_MAX_BATCH_REQUESTS (default: 64)
  • Memory leak — Restart the server; report the issue if reproducible
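
For the first bullet, a rough fit check before picking a GPU profile can save a failed load. The 20% overhead factor below is a heuristic assumption for activations and runtime buffers, not an SIE constant; see Resources for real numbers:

```python
def fits_in_vram(model_size_gb: float, gpu_vram_gb: float, overhead: float = 1.2) -> bool:
    """Heuristic: model weights plus ~20% runtime overhead must fit in GPU VRAM."""
    return model_size_gb * overhead <= gpu_vram_gb

# A ~7 GB model comfortably fits a 24 GB L4; a ~30 GB model does not.
print(fits_in_vram(7, 24), fits_in_vram(30, 24))
```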

Slow inference

Possible causes:

  • CPU fallback — Server may be running on CPU. Check with sie-top or WebSocket status
  • Wrong attention backend — Flash Attention 2 is fastest on Ampere+ GPUs. Set SIE_ATTENTION_BACKEND=flash_attention_2
  • Small batches — Low concurrency means small batches. Increase SIE_MAX_BATCH_WAIT_MS to wait longer for batch fill
  • Preprocessing bottleneck — For vision models, increase SIE_IMAGE_WORKERS (default: 4)

LoRA requests hang or time out

Symptoms: Request hangs or times out when using a LoRA adapter.

Causes:

  • LoRA too large — Large adapters take longer to download and load
  • Incompatible base model — LoRA must match the base model architecture
  • Cache full — SIE_MAX_LORAS_PER_MODEL (default: 10) exceeded, triggering eviction + reload

Fix: Ensure the LoRA ID is a valid HuggingFace repo:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    options={"lora_id": "username/my-lora-adapter"},
)
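
A cheap client-side sanity check on the ID's shape catches typos before a request hangs. This only validates the namespace/name pattern of a HuggingFace repo ID; it does not confirm the repo exists or matches the base model:

```python
import re

def looks_like_hf_repo(lora_id: str) -> bool:
    """True if lora_id has the "namespace/name" shape of a HuggingFace repo."""
    return re.fullmatch(r"[\w.-]+/[\w.-]+", lora_id) is not None

print(looks_like_hf_repo("username/my-lora-adapter"))  # well-formed
print(looks_like_hf_repo("my-lora-adapter"))           # missing namespace
```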

“Access denied” or 401 for gated models

Cause: Some HuggingFace models require manual approval and a token.

Fixes:

  1. Accept the model’s license on HuggingFace (visit the model page)
  2. Set your HuggingFace token:
    # Docker
    docker run --gpus all -p 8080:8080 \
      -e HF_TOKEN=hf_your_token_here \
      ghcr.io/superlinked/sie:default

    # Local
    export HF_TOKEN=hf_your_token_here
    sie-server serve
  3. For Kubernetes, create a secret:
    kubectl create secret generic hf-token \
      --from-literal=token=hf_your_token_here \
      -n sie

Workers scale back down during cold start

Cause: Requests stopped before the worker finished cold start. KEDA sees demand drop to 0.

Fix: Keep sending requests for the full cold start duration (5-7 minutes), or use the SDK with wait_for_capacity=True.
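
If you are not using the SDK, the same behaviour can be approximated with a probe loop that keeps demand above zero until a worker answers. Here send_probe is a stand-in for any request function you use, and the 420s default mirrors provision_timeout_s above:

```python
import time

def hold_demand(send_probe, timeout_s=420, interval_s=15):
    """Keep issuing probe requests so the autoscaler sees nonzero demand
    during cold start; return the first successful response."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            return send_probe()
        except Exception:  # worker not ready yet; keep the demand signal alive
            time.sleep(interval_s)
    raise TimeoutError("worker did not become ready before the timeout")
```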

extract returns 202s while encode works

Context: encode/score work fine, but extract with GLiNER returns only 202s.

Cause: Each bundle scales independently. The default bundle worker serves encode/score models, but GLiNER needs the gliner bundle worker.

Fix: Send extract requests with wait_for_capacity=True and provision_timeout_s=420. The gliner worker pool scales independently and needs its own cold start.

Worker pods not scheduling

Causes:

  • No GPU quota — Check: kubectl describe pod <pod-name> -n sie
  • Node pool at max — Increase maxReplicas in Helm values
  • Spot unavailable — Switch to on-demand instances

Getting more help

If your issue isn’t covered here:

  1. Check server logs: docker logs <container> or kubectl logs -n sie -l app.kubernetes.io/component=worker
  2. Use sie-top for real-time monitoring
  3. Open an issue on GitHub