# Troubleshooting

## Connection Issues

### Connection refused / timeouts

Symptoms: `ConnectionError`, `ECONNREFUSED`, or request timeouts.
Causes and fixes:

- Server not running — Start with `docker run -p 8080:8080 ghcr.io/superlinked/sie:default` or `sie-server serve`
- Wrong port — Default is 8080. Check with `curl http://localhost:8080/healthz`
- Firewall/security group — Ensure port 8080 is open for your network
- Docker networking — Use `--network host` or ensure port mapping is correct (`-p 8080:8080`)
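To tell these cases apart from client code, a quick probe of the health endpoint helps. A minimal sketch using only the Python standard library; the `/healthz` path matches the check above, and the classification labels are illustrative, not SIE output:

```python
import socket
import urllib.error
import urllib.request


def probe(host: str = "localhost", port: int = 8080, timeout: float = 3.0) -> str:
    """Classify why http://host:port/healthz is (un)reachable."""
    url = f"http://{host}:{port}/healthz"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"ok ({resp.status})"
    except urllib.error.URLError as e:
        if isinstance(e.reason, ConnectionRefusedError):
            return "refused: server not running, or wrong port"
        if isinstance(e.reason, socket.timeout):
            return "timeout: firewall/security group, or wrong host"
        return f"unreachable: {e.reason}"
    except TimeoutError:
        return "timeout: firewall/security group, or wrong host"
```

A refused connection points at the first two bullets; a hang until timeout points at the firewall or networking bullets.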
### 503 “No healthy workers available”

Context: Kubernetes deployment with router.

Cause: Missing `X-SIE-MACHINE-PROFILE` header on HTTP requests. The router doesn’t know which worker pool to target.

Fix: Add the GPU header:

```shell
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'
```

Or use the SDK `gpu` parameter:

```python
result = client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")
```

See Scale-from-Zero for the full autoscaling flow.
### 202 responses that never resolve

Context: Kubernetes with KEDA scale-to-zero.

Causes:

- Timeout too short — Cold starts take 5-7 minutes. Set `provision_timeout_s=420`
- Spot GPUs unavailable — Try on-demand (`l4` instead of `l4-spot`)
- KEDA not running — Check: `kubectl get pods -n keda`
- Prometheus unreachable — KEDA needs metrics: `kubectl get pods -n monitoring`

```python
# Recommended: use the SDK with a generous timeout
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    gpu="l4",
    wait_for_capacity=True,
    provision_timeout_s=420,
)
```
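If you are not using the SDK, the same wait-for-capacity behavior can be approximated client-side. A minimal sketch, assuming the endpoint keeps returning 202 until a worker is ready; `send_request` is a hypothetical stand-in for your actual HTTP call:

```python
import time


def poll_until_ready(send_request, timeout_s: float = 420.0, interval_s: float = 5.0):
    """Re-send the request until it stops returning 202 or the timeout expires.

    `send_request` is a callable returning (status_code, body).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status, body = send_request()
        if status != 202:
            return status, body
        time.sleep(interval_s)  # keep demand visible so KEDA does not scale back down
    raise TimeoutError(f"no capacity after {timeout_s:.0f}s")
```

Re-sending on an interval matters: if the client goes silent, KEDA sees demand drop and may scale the pool back to zero mid cold start.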
## Model Issues

### Model not found

Symptoms: 404 Not Found or “model not available” error.

Causes and fixes:

- Wrong model name — Use the SIE model ID (e.g., `BAAI/bge-m3`), not a custom alias. Check available models: `curl http://localhost:8080/v1/models`
- Wrong bundle — Some models require specific bundles. GLiNER needs the `gliner` bundle, Florence-2 needs `florence2`. See Bundles
- Model filter active — If `SIE_MODEL_FILTER` is set, only listed models are available
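To rule out a wrong model name programmatically, compare your ID against the `/v1/models` listing. A sketch that assumes the endpoint returns JSON with an OpenAI-style `data` list of `{"id": ...}` entries; that response shape is an assumption, so adjust to what your server actually returns:

```python
import json
import urllib.request


def available_models(base_url: str = "http://localhost:8080") -> set[str]:
    """Fetch model IDs from /v1/models (assumes an OpenAI-style 'data' list)."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        payload = json.load(resp)
    return {entry["id"] for entry in payload.get("data", [])}


def check_model(model_id: str, models: set[str]) -> str:
    """Explain a 404 before sending the real request."""
    if model_id in models:
        return "ok"
    return f"'{model_id}' not served: check the name, bundle, and SIE_MODEL_FILTER"
```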
### Model loading is slow

Context: First request to a model takes a long time.

Expected behavior: Models load on demand. The first request downloads weights (if not cached) and loads them onto the GPU; subsequent requests are fast.

| Scenario | Expected Time |
|---|---|
| Weights cached, loading to GPU | 10-30s (small model), 30-120s (large model) |
| Downloading from HuggingFace | 1-10 minutes depending on model size and network |
| Downloading from cluster cache (S3/GCS) | 30s-3 minutes |
Speed up loading:

- Mount a persistent HuggingFace cache: `-v ~/.cache/huggingface:/app/.cache/huggingface`
- Use the cluster cache: `SIE_CLUSTER_CACHE=s3://bucket/weights`
- Pre-warm models by sending a dummy request at startup
## GPU Issues

### Docker GPU not detected

Symptoms: Server falls back to CPU, or `--gpus all` fails.

Fixes:

- Install the NVIDIA Container Toolkit:

  ```shell
  # Ubuntu/Debian
  sudo apt-get install -y nvidia-container-toolkit
  sudo systemctl restart docker
  ```

- Verify GPU access:

  ```shell
  docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
  ```

- Use the `--gpus all` flag:

  ```shell
  docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie:default
  ```
### Out of memory (OOM)

Symptoms: `CUDA out of memory`, process killed, or pod evicted.

Causes and fixes:

- Model too large for GPU — Check model size vs GPU VRAM in Resources
- Too many models loaded — Lower `SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT` (default: 85) to trigger eviction earlier
- Batch size too large — Reduce `SIE_MAX_BATCH_REQUESTS` (default: 64)
- Memory leak — Restart the server; report the issue if reproducible
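For the first bullet, a back-of-envelope check goes a long way: weights alone need roughly parameter count times bytes per parameter, plus runtime overhead for activations and buffers. A sketch; the 20% overhead factor is an assumption for illustration, not an SIE constant:

```python
def fits_in_vram(params_billion: float, vram_gb: float,
                 bytes_per_param: int = 2, overhead: float = 1.2) -> bool:
    """Rough VRAM check: fp16/bf16 weights take 2 bytes/param.

    `overhead` pads for activations and runtime buffers (assumed 20%).
    """
    needed_gb = params_billion * bytes_per_param * overhead
    return needed_gb <= vram_gb
```

For example, a 7B model in fp16 needs roughly 16.8 GB by this estimate, so it fits a 24 GB GPU but not a 16 GB one.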
### Slow inference

Possible causes:

- CPU fallback — Server may be running on CPU. Check with `sie-top` or WebSocket status
- Wrong attention backend — Flash Attention 2 is fastest on Ampere+ GPUs. Set `SIE_ATTENTION_BACKEND=flash_attention_2`
- Small batches — Low concurrency means small batches. Increase `SIE_MAX_BATCH_WAIT_MS` to wait longer for batches to fill
- Preprocessing bottleneck — For vision models, increase `SIE_IMAGE_WORKERS` (default: 4)
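The batch-wait trade-off can be pictured with a toy collector: a longer window yields bigger batches (better GPU utilization) at the cost of added per-request latency. This is a sketch of the general technique, not SIE's actual scheduler:

```python
def form_batches(arrival_ms, max_wait_ms, max_batch):
    """Group request arrival times (ms) into batches.

    A batch closes when it is full, or when max_wait_ms has passed
    since its first request arrived. Toy model of a batching window.
    """
    batches, current, window_start = [], [], None
    for t in sorted(arrival_ms):
        if current and (len(current) == max_batch or t - window_start > max_wait_ms):
            batches.append(current)
            current = []
        if not current:
            window_start = t
        current.append(t)
    if current:
        batches.append(current)
    return batches
```

With sparse traffic, a 10 ms window produces many single-request batches while a 100 ms window merges them, which is why raising `SIE_MAX_BATCH_WAIT_MS` helps throughput at low concurrency.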
## LoRA Issues

### LoRA loading timeout

Symptoms: Request hangs or times out when using a LoRA adapter.

Causes:

- LoRA too large — Large adapters take longer to download and load
- Incompatible base model — LoRA must match the base model architecture
- Cache full — `SIE_MAX_LORAS_PER_MODEL` (default: 10) exceeded, triggering eviction + reload
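The eviction behavior in the last bullet can be pictured as an LRU cache capped at `SIE_MAX_LORAS_PER_MODEL`: once an eleventh adapter is requested, the least recently used one is dropped and must be re-downloaded the next time it is needed. A toy sketch of that pattern, not SIE's internals:

```python
from collections import OrderedDict


class LoraCache:
    """Toy LRU cache for LoRA adapters, capped like SIE_MAX_LORAS_PER_MODEL."""

    def __init__(self, max_loras: int = 10):
        self.max_loras = max_loras
        self._cache = OrderedDict()  # lora_id -> adapter (stand-in value)
        self.reloads = 0             # cache misses = download + load cost

    def get(self, lora_id: str):
        if lora_id in self._cache:
            self._cache.move_to_end(lora_id)     # mark as recently used
        else:
            self.reloads += 1                    # simulate download + load
            self._cache[lora_id] = f"adapter:{lora_id}"
            if len(self._cache) > self.max_loras:
                self._cache.popitem(last=False)  # evict least recently used
        return self._cache[lora_id]
```

If your workload cycles through more adapters than the cap, every request pays the reload cost; raising the cap (or reducing the adapter set) removes the thrashing.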
### LoRA adapter not found

Fix: Ensure the LoRA ID is a valid HuggingFace repo:

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    options={"lora_id": "username/my-lora-adapter"},
)
```
## Gated Model Access

### “Access denied” or 401 for gated models

Cause: Some HuggingFace models require manual approval and a token.

Fixes:

- Accept the model’s license on HuggingFace (visit the model page)
- Set your HuggingFace token:

  ```shell
  # Docker
  docker run --gpus all -p 8080:8080 \
    -e HF_TOKEN=hf_your_token_here \
    ghcr.io/superlinked/sie:default

  # Local
  export HF_TOKEN=hf_your_token_here
  sie-server serve
  ```

- For Kubernetes, create a secret:

  ```shell
  kubectl create secret generic hf-token \
    --from-literal=token=hf_your_token_here \
    -n sie
  ```
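It can also be worth failing fast at startup when the token is missing rather than debugging 401s later. A small sketch that validates the `HF_TOKEN` variable the commands above set; the `hf_` prefix check reflects the usual HuggingFace token format, not anything SIE enforces:

```python
import os


def require_hf_token(env=None) -> str:
    """Return HF_TOKEN or raise with an actionable message."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; gated models will return 401. "
            "Accept the model license on HuggingFace and export HF_TOKEN."
        )
    if not token.startswith("hf_"):
        raise RuntimeError("HF_TOKEN does not look like a HuggingFace token (hf_...)")
    return token
```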
## Kubernetes Issues

### Workers scale up then immediately down

Cause: Requests stopped before the worker finished cold start. KEDA sees demand drop to 0.

Fix: Keep sending requests for the full cold start duration (5-7 minutes), or use the SDK with `wait_for_capacity=True`.
### Different bundles not scaling

Context: `encode`/`score` work fine, but `extract` with GLiNER returns only 202s.

Cause: Each bundle scales independently. The default bundle worker serves encode/score models, but GLiNER needs the `gliner` bundle worker.

Fix: Send extract requests with `wait_for_capacity=True` and `provision_timeout_s=420`. The gliner worker pool scales independently and needs its own cold start.
### Pods stuck in Pending

Causes:

- No GPU quota — Check: `kubectl describe pod <pod-name> -n sie`
- Node pool at max — Increase `maxReplicas` in Helm values
- Spot unavailable — Switch to on-demand instances
## Getting Help

If your issue isn’t covered here:

- Check server logs: `docker logs <container>` or `kubectl logs -n sie -l app.kubernetes.io/component=worker`
- Use `sie-top` for real-time monitoring
- Open an issue on GitHub