
Troubleshooting

Connection refused or timeouts

Symptoms: ConnectionError, ECONNREFUSED, or request timeouts.

Causes and fixes:

  • Server not running — Start with docker run -p 8080:8080 ghcr.io/superlinked/sie:default or sie-server serve
  • Wrong port — Default is 8080. Check with curl http://localhost:8080/healthz
  • Firewall/security group — Ensure port 8080 is open for your network
  • Docker networking — Use --network host or ensure port mapping is correct (-p 8080:8080)
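
A quick way to separate "server not running" from a firewall or port-mapping problem is a raw TCP probe before reaching for curl. A minimal sketch in Python, assuming the default port from above:

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# False means nothing is listening (or a firewall drops the connection),
# so curl-level errors are not an application problem yet.
print(is_port_open("localhost", 8080))
```

If the port is open but /healthz still fails, the problem is inside the server rather than on the network path.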

Requests fail behind the router

Context: Kubernetes deployment with router.

Cause: Missing X-SIE-MACHINE-PROFILE header on HTTP requests. The router doesn’t know which worker pool to target.

Fix: Add the GPU header:

curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'

Or use the SDK gpu parameter:

result = client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")

See Scale-from-Zero for the full autoscaling flow.

Scale-from-zero requests time out

Context: Kubernetes with KEDA scale-to-zero.

Causes:

  • Timeout too short — Cold starts take 5-7 minutes. Set provision_timeout_s=420
  • Spot GPUs unavailable — Try on-demand (l4 instead of l4-spot)
  • KEDA not running — Check: kubectl get pods -n keda
  • Prometheus unreachable — KEDA needs metrics: kubectl get pods -n monitoring

# Recommended: use SDK with generous timeout
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    gpu="l4",
    wait_for_capacity=True,
    provision_timeout_s=420,
)

Model not found

Symptoms: 404 Not Found or “model not available” error.

Causes and fixes:

  • Wrong model name — Use the SIE model ID (e.g., BAAI/bge-m3), not a custom alias. Check available models: curl http://localhost:8080/v1/models
  • Wrong bundle — Some models require specific bundles. GLiNER needs the gliner bundle, Florence-2 needs florence2. See Bundles
  • Model filter active — If SIE_MODEL_FILTER is set, only listed models are available
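
To automate the first check, you can parse the /v1/models response and look for the exact ID. The payload shape below (an OpenAI-style list under "data") is an assumption; adjust it to what your deployment actually returns:

```python
import json

def model_available(models_json: str, model_id: str) -> bool:
    """Check a /v1/models payload for an exact model ID match.

    Assumes an OpenAI-style shape: {"data": [{"id": "..."}]}.
    """
    payload = json.loads(models_json)
    return any(m.get("id") == model_id for m in payload.get("data", []))

# Example payload; fetch the real one with: curl http://localhost:8080/v1/models
sample = '{"data": [{"id": "BAAI/bge-m3"}]}'
print(model_available(sample, "BAAI/bge-m3"))  # exact ID matches
print(model_available(sample, "bge-m3"))       # aliases do not
```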

Slow first request

Context: First request to a model takes a long time.

Expected behavior: Models load on-demand. First request downloads weights (if not cached) and loads to GPU. Subsequent requests are fast.

Scenario                                   Expected Time
Weights cached, loading to GPU             10-30s (small model), 30-120s (large model)
Downloading from HuggingFace               1-10 minutes depending on model size and network
Downloading from cluster cache (S3/GCS)    30s-3 minutes

Speed up loading:

  • Mount a persistent HuggingFace cache: -v ~/.cache/huggingface:/app/.cache/huggingface
  • Use cluster cache: SIE_CLUSTER_CACHE=s3://bucket/weights
  • Pre-warm models by sending a dummy request at startup
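
Pre-warming from the last bullet can be a few lines at startup. This sketch is generic: encode stands in for whatever request function you use (e.g. the SDK's client.encode wrapped to take a model ID and a string), and failures are collected rather than raised so one bad model doesn't block the rest:

```python
def prewarm(encode, model_ids, probe_text="warmup"):
    """Send one tiny request per model so weights download and load
    before real traffic arrives. Returns a per-model status dict."""
    status = {}
    for model_id in model_ids:
        try:
            encode(model_id, probe_text)
            status[model_id] = "loaded"
        except Exception as exc:  # keep warming the remaining models
            status[model_id] = f"failed: {exc}"
    return status
```

Run it once after /healthz goes green; subsequent real requests then skip the cold load.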

GPU not detected

Symptoms: Server falls back to CPU, or --gpus all fails.

Fixes:

  1. Install NVIDIA Container Toolkit:
    # Ubuntu/Debian
    sudo apt-get install -y nvidia-container-toolkit
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
  2. Verify GPU access:
    docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
  3. Use the --gpus all flag:
    docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie:default
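
To verify from inside a running container (or from the host) whether the NVIDIA runtime exposed a GPU at all, a small sketch using only the standard library and the same nvidia-smi binary step 2 relies on:

```python
import shutil
import subprocess

def gpu_visible() -> bool:
    """True if nvidia-smi is on PATH and exits cleanly, i.e. this
    environment can actually see a GPU through the NVIDIA driver."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi"], capture_output=True)
    return result.returncode == 0
```

If this is False inside the container but True on the host, the container toolkit or the --gpus all flag is the problem, not the driver.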

Out of memory

Symptoms: CUDA out of memory, process killed, or pod evicted.

Causes and fixes:

  • Model too large for GPU — Check model size vs GPU VRAM in Resources
  • Too many models loaded — Lower SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT (default: 85) to trigger eviction earlier
  • Batch size too large — Reduce SIE_MAX_BATCH_REQUESTS (default: 64)
  • Memory leak — Restart the server; report the issue if reproducible
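
For the first bullet, a rough fit check before picking a GPU profile can save a failed load. The 20% overhead factor below is a heuristic assumption for activations and runtime buffers, not an SIE constant; see Resources for real numbers:

```python
def fits_in_vram(model_size_gb: float, gpu_vram_gb: float, overhead: float = 1.2) -> bool:
    """Heuristic: model weights plus ~20% runtime overhead must fit in GPU VRAM."""
    return model_size_gb * overhead <= gpu_vram_gb

# A ~7 GB model comfortably fits a 24 GB L4; a ~30 GB model does not.
print(fits_in_vram(7, 24), fits_in_vram(30, 24))
```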

Slow inference

Possible causes:

  • CPU fallback — Server may be running on CPU. Check with sie-top or WebSocket status
  • Wrong attention backend — Flash Attention 2 is fastest on Ampere+ GPUs. Set SIE_ATTENTION_BACKEND=flash_attention_2
  • Small batches — Low concurrency means small batches. Increase SIE_MAX_BATCH_WAIT_MS to wait longer for batch fill
  • Preprocessing bottleneck — For vision models, increase SIE_IMAGE_WORKERS (default: 4)

LoRA requests hang or time out

Symptoms: Request hangs or times out when using a LoRA adapter.

Causes:

  • LoRA too large — Large adapters take longer to download and load
  • Incompatible base model — LoRA must match the base model architecture
  • Cache full — SIE_MAX_LORAS_PER_MODEL (default: 10) exceeded, triggering eviction + reload

Fix: Ensure the LoRA ID is a valid HuggingFace repo:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    options={"lora_id": "username/my-lora-adapter"},
)
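
A cheap client-side sanity check on the ID's shape catches typos before a request hangs. This only validates the namespace/name pattern of a HuggingFace repo ID; it does not confirm the repo exists or matches the base model:

```python
import re

def looks_like_hf_repo(lora_id: str) -> bool:
    """True if lora_id has the "namespace/name" shape of a HuggingFace repo."""
    return re.fullmatch(r"[\w.-]+/[\w.-]+", lora_id) is not None

print(looks_like_hf_repo("username/my-lora-adapter"))  # well-formed
print(looks_like_hf_repo("my-lora-adapter"))           # missing namespace
```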

“Access denied” or 401 for gated models

Cause: Some HuggingFace models require manual approval and a token.

Fixes:

  1. Accept the model’s license on HuggingFace (visit the model page)
  2. Set your HuggingFace token:
    # Docker
    docker run --gpus all -p 8080:8080 \
      -e HF_TOKEN=hf_your_token_here \
      ghcr.io/superlinked/sie:default

    # Local
    export HF_TOKEN=hf_your_token_here
    sie-server serve
  3. For Kubernetes, create a secret:
    kubectl create secret generic hf-token \
      --from-literal=token=hf_your_token_here \
      -n sie

Workers scale back down during cold start

Cause: Requests stopped before the worker finished cold start. KEDA sees demand drop to 0.

Fix: Keep sending requests for the full cold start duration (5-7 minutes), or use the SDK with wait_for_capacity=True.
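
If you are not using the SDK, the same behaviour can be approximated with a probe loop that keeps demand above zero until a worker answers. Here send_probe is a stand-in for any request function you use, and the 420s default mirrors provision_timeout_s above:

```python
import time

def hold_demand(send_probe, timeout_s=420, interval_s=15):
    """Keep issuing probe requests so the autoscaler sees nonzero demand
    during cold start; return the first successful response."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            return send_probe()
        except Exception:  # worker not ready yet; keep the demand signal alive
            time.sleep(interval_s)
    raise TimeoutError("worker did not become ready before the timeout")
```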

extract returns 202s while encode works

Context: encode/score work fine, but extract with GLiNER returns only 202s.

Cause: Each bundle scales independently. The default bundle worker serves encode/score models, but GLiNER needs the gliner bundle worker.

Fix: Send extract requests with wait_for_capacity=True and provision_timeout_s=420. The gliner worker pool scales independently and needs its own cold start.

Worker pods not scheduling

Causes:

  • No GPU quota — Check: kubectl describe pod <pod-name> -n sie
  • Node pool at max — Increase maxReplicas in Helm values
  • Spot unavailable — Switch to on-demand instances

Getting more help

If your issue isn’t covered here:

  1. Check server logs: docker logs <container> or kubectl logs -n sie -l app.kubernetes.io/component=worker
  2. Use sie-top for real-time monitoring
  3. Open an issue on GitHub