# Architecture
SIE is a layered system: SDK clients talk to a router (or directly to a server), which manages GPU workers that load and run models on-demand.
## System Overview

### Components

#### Client SDK

The SDK provides `encode()`, `score()`, and `extract()` methods. It handles:
- msgpack serialization — Binary wire format, faster and smaller than JSON
- Automatic 202 retry — Waits for scale-from-zero when `wait_for_capacity=True` is set
- Pool management — Background lease renewal for resource pools
- Numpy integration — Returns native numpy arrays for embeddings
Framework integrations (LangChain, LlamaIndex, etc.) wrap the SDK with framework-specific interfaces.
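The automatic 202 retry can be sketched with a plain polling loop. This is an illustrative stand-in, not the SDK's actual code: `post_encode` is a hypothetical transport callable, and the back-off constants are assumptions.

```python
import time


def encode_with_retry(post_encode, texts, wait_for_capacity=True,
                      max_wait_s=60.0, poll_interval_s=2.0):
    """Retry an encode request while workers scale up from zero.

    `post_encode` is a stand-in transport returning (status, body);
    a 202 status means "accepted, but no worker capacity yet".
    """
    deadline = time.monotonic() + max_wait_s
    while True:
        status, body = post_encode(texts)
        if status != 202:
            return status, body
        if not wait_for_capacity or time.monotonic() >= deadline:
            raise TimeoutError("no worker capacity before deadline")
        time.sleep(poll_interval_s)  # wait for scale-from-zero
```

With `wait_for_capacity=False`, the first 202 response is surfaced to the caller immediately instead of being retried.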
## Router

The router is a stateless proxy that sits between clients and workers. It’s optional for single-server setups but required for Kubernetes clusters.

Responsibilities:

- Routes requests to the correct GPU pool based on the `X-SIE-MACHINE-PROFILE` header
- Prefers workers with the requested model already loaded (model affinity)
- Returns `202 Accepted` when workers are scaled to zero
- Manages resource pools for tenant isolation
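Model-affinity selection can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the router's real algorithm: the worker-record shape and the `inflight` tiebreaker are hypothetical.

```python
def pick_worker(workers, model):
    """Prefer a worker that already has `model` loaded (model affinity);
    otherwise fall back to the least-loaded worker.

    `workers` is a list of dicts: {"id", "loaded_models", "inflight"}.
    Returns None when the pool is scaled to zero, which the router
    surfaces to the client as 202 Accepted.
    """
    if not workers:
        return None  # scaled to zero -> 202 Accepted
    warm = [w for w in workers if model in w["loaded_models"]]
    candidates = warm or workers  # fall back to the whole pool
    return min(candidates, key=lambda w: w["inflight"])
```

Preferring warm workers avoids a model load on the request path; only when no worker has the model loaded does load balancing alone decide.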
## Worker (sie-server)

Each worker is a single-GPU inference server running the full pipeline:
- Preprocess — Tokenization and image processing (CPU thread pool)
- Batch — Cost-based batching by token count
- GPU Inference — Model forward pass via adapter (PyTorch, Flash Attention, SGLang)
- Postprocess — Quantization, MUVERA transform (CPU thread pool)
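Cost-based batching by token count can be sketched as a greedy fill against a token budget. This is illustrative only; the function name and the budget value are assumptions, not the server's actual implementation.

```python
def batch_by_tokens(requests, max_tokens=8192):
    """Greedily group requests so each batch stays under a token budget.

    `requests` is a list of (request_id, token_count) pairs; a single
    request larger than the budget still gets its own batch.
    """
    batches, current, used = [], [], 0
    for req_id, tokens in requests:
        if current and used + tokens > max_tokens:
            batches.append(current)  # budget exceeded: flush the batch
            current, used = [], 0
        current.append(req_id)
        used += tokens
    if current:
        batches.append(current)
    return batches
```

Batching by token count rather than by request count keeps GPU batches at a roughly constant compute cost, since a few long documents can cost as much as many short ones.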
Workers manage multiple models on one GPU with LRU eviction when memory pressure exceeds the threshold.
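The multi-model LRU behavior can be sketched with an ordered map. A minimal sketch, assuming a memory budget in MB and per-model size estimates; the class name and bookkeeping are hypothetical, not sie-server's code.

```python
from collections import OrderedDict


class ModelCache:
    """LRU cache of loaded models on one GPU: when estimated memory use
    would exceed the budget, evict the least-recently-used model."""

    def __init__(self, memory_budget_mb):
        self.budget = memory_budget_mb
        self.models = OrderedDict()  # name -> size_mb, in LRU order

    def load(self, name, size_mb):
        if name in self.models:
            self.models.move_to_end(name)  # mark as recently used
            return
        while self.models and sum(self.models.values()) + size_mb > self.budget:
            self.models.popitem(last=False)  # evict least-recently-used
        self.models[name] = size_mb
```

A request for an already-loaded model just refreshes its position; only a cold load under memory pressure triggers eviction.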
## Wire Protocol

SIE uses msgpack as the default wire format instead of JSON:
| Format | Encode speed | Decode speed | Size | Numpy support |
|---|---|---|---|---|
| msgpack | Fast | Fast | ~50% of JSON | Native via msgpack-numpy |
| JSON | Slower | Slower | Baseline | Requires list conversion |
The SDK sends and receives msgpack automatically. The OpenAI-compatible `/v1/embeddings` endpoint uses JSON for compatibility.
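The size difference is easy to demonstrate with plain floats. This stdlib-only sketch packs float64 values as raw bytes, the way a binary format represents them (msgpack itself adds small per-value type headers, so exact sizes differ):

```python
import json
import struct

# A 768-dimensional embedding, as JSON text vs. raw float64 bytes.
embedding = [0.0123456789 * i for i in range(768)]

json_bytes = json.dumps(embedding).encode("utf-8")
binary_bytes = struct.pack(f"{len(embedding)}d", *embedding)  # 8 bytes/value

print(len(json_bytes), len(binary_bytes))  # binary is much smaller
```

Text formats also pay a parse cost on decode: every float must be re-parsed from digits, whereas binary values can be copied directly into a numpy buffer, which is what msgpack-numpy exploits.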
## Model Cache Hierarchy

Model weights are resolved through a 3-tier cache: local disk, then the shared cluster cache (S3/GCS), then the upstream HuggingFace download.
Local disk cache uses LRU eviction when disk usage exceeds SIE_DISK_PRESSURE_THRESHOLD_PERCENT (default: 85%).
Cluster cache is useful for Kubernetes deployments where multiple workers share the same S3/GCS bucket, avoiding redundant downloads from HuggingFace.
## Deployment Modes

### Standalone (Direct)

Client → sie-server (single GPU)

Simplest setup. The client connects directly to one server. Good for development and small production deployments.
### Multi-Bundle (Docker Compose)

Client → sie-server:8080 (default bundle)
Client → sie-server:8081 (gliner bundle)

Multiple containers, each with a different bundle. The client routes to the correct port.
### Cluster (Kubernetes)

Client → sie-router → worker pool(s)

Full production setup with GPU routing, autoscaling, and observability. See Kubernetes in GCP or AWS.
## What’s Next

- Request Pipeline - detailed preprocessing, batching, and GPU inference flow
- Router - routing, load balancing, and resource pools
- Adapters - compute engine abstraction layer