
Architecture

SIE is a layered system: SDK clients talk to a router (or directly to a server), which manages GPU workers that load and run models on demand.

SIE system architecture: Client Layer, Router Layer, Worker Layer

The SDK provides encode(), score(), and extract() methods. It handles:

  • msgpack serialization — Binary wire format, faster and smaller than JSON
  • Automatic 202 retry — Waits for scale-from-zero with wait_for_capacity=True
  • Pool management — Background lease renewal for resource pools
  • Numpy integration — Returns native numpy arrays for embeddings

Framework integrations (LangChain, LlamaIndex, etc.) wrap the SDK with framework-specific interfaces.
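The automatic 202 retry can be sketched as a simple poll loop. The function name, callable signature, and timing constants below are illustrative stand-ins, not the actual SDK internals:

```python
import time

def post_with_capacity_wait(send_request, wait_for_capacity=True,
                            max_wait_s=60.0, poll_interval_s=0.05):
    """Retry while the server answers 202 (workers scaling from zero).

    `send_request` is any callable returning (status_code, body).
    """
    deadline = time.monotonic() + max_wait_s
    while True:
        status, body = send_request()
        if status != 202:
            return status, body
        if not wait_for_capacity or time.monotonic() >= deadline:
            return status, body  # caller sees the 202 and can retry later
        time.sleep(poll_interval_s)

# Example: a stub server that needs two polls before capacity is available.
responses = iter([(202, None), (202, None), (200, {"embeddings": [[0.1, 0.2]]})])
status, body = post_with_capacity_wait(lambda: next(responses))
```

With `wait_for_capacity=False`, the first 202 is returned to the caller immediately instead of being retried.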

The router is a stateless proxy that sits between clients and workers. It’s optional for single-server setups but required for Kubernetes clusters.

Responsibilities:

  • Routes requests to the correct GPU pool based on X-SIE-MACHINE-PROFILE
  • Prefers workers with the requested model already loaded (model affinity)
  • Returns 202 Accepted when workers are scaled to zero
  • Manages resource pools for tenant isolation
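The routing decision above can be sketched as: filter workers by machine profile, then prefer a warm worker. The dict-based worker records are hypothetical; the real router also tracks load and pool leases:

```python
def pick_worker(workers, machine_profile, model):
    """Route to the requested GPU pool, preferring a worker that already
    has the model loaded; return None to signal 202 (pool scaled to zero)."""
    pool = [w for w in workers if w["profile"] == machine_profile]
    if not pool:
        return None  # router answers 202 Accepted while the pool scales up
    warm = [w for w in pool if model in w["loaded_models"]]
    return (warm or pool)[0]

workers = [
    {"profile": "gpu-l4", "loaded_models": {"bge-m3"}},
    {"profile": "gpu-l4", "loaded_models": {"gliner"}},
]
w = pick_worker(workers, "gpu-l4", "gliner")  # model affinity picks the warm worker
```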

Each worker is a single-GPU inference server running the full pipeline:

  1. Preprocess — Tokenization and image processing (CPU thread pool)
  2. Batch — Cost-based batching by token count
  3. GPU Inference — Model forward pass via adapter (PyTorch, Flash Attention, SGLang)
  4. Postprocess — Quantization, MUVERA transform (CPU thread pool)
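Cost-based batching by token count amounts to greedily grouping requests until the next one would blow a token budget. This is a minimal sketch; the budget value and function name are assumptions:

```python
def batch_by_tokens(requests, max_tokens=8192):
    """Greedily group requests so each batch stays within a token budget.

    `requests` is a list of (request_id, token_count) pairs.
    """
    batches, current, used = [], [], 0
    for req_id, tokens in requests:
        if current and used + tokens > max_tokens:
            batches.append(current)  # flush the full batch
            current, used = [], 0
        current.append(req_id)
        used += tokens
    if current:
        batches.append(current)
    return batches

# "a" and "b" fit in one 8192-token batch; "c" starts a new one.
batches = batch_by_tokens([("a", 5000), ("b", 3000), ("c", 2000)])
```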

Workers manage multiple models on one GPU with LRU eviction when memory pressure exceeds the threshold.
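The multi-model LRU behavior can be sketched with an `OrderedDict`; the memory accounting and threshold handling here are deliberately simplified assumptions:

```python
from collections import OrderedDict

class ModelCache:
    """Hold models on one GPU; evict least-recently-used models when
    loading another would exceed the memory budget (arbitrary units)."""

    def __init__(self, budget):
        self.budget = budget
        self.models = OrderedDict()  # name -> size, oldest first

    def load(self, name, size):
        if name in self.models:
            self.models.move_to_end(name)  # mark as recently used
            return
        while self.models and sum(self.models.values()) + size > self.budget:
            self.models.popitem(last=False)  # evict the LRU model
        self.models[name] = size

cache = ModelCache(budget=10)
cache.load("bge-m3", 4)
cache.load("gliner", 4)
cache.load("bge-m3", 4)   # touch: bge-m3 is now most recently used
cache.load("colbert", 4)  # over budget: evicts "gliner", the LRU entry
```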


Wire format

SIE uses msgpack as the default wire format instead of JSON:

| Format  | Encode speed | Decode speed | Size         | Numpy support             |
| ------- | ------------ | ------------ | ------------ | ------------------------- |
| msgpack | Fast         | Fast         | ~50% of JSON | Native via msgpack-numpy  |
| JSON    | Slower       | Slower       | Baseline     | Requires list conversion  |

The SDK sends and receives msgpack automatically. The OpenAI-compatible /v1/embeddings endpoint uses JSON for compatibility.


Model cache

Model weights are resolved through a three-tier cache:

Model cache hierarchy: Local Cache, Cluster Cache, HuggingFace Hub

Local disk cache uses LRU eviction when disk usage exceeds SIE_DISK_PRESSURE_THRESHOLD_PERCENT (default: 85%).
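The pressure check itself is straightforward; the `shutil.disk_usage` approach below is a plausible implementation sketch, not the actual one (the env-var name comes from the docs above):

```python
import os
import shutil

def over_disk_pressure(path="/", usage=None):
    """Return True when disk usage exceeds the eviction threshold.

    `usage` can be injected as a (used, total) pair for testing;
    otherwise it is read from the filesystem.
    """
    threshold = float(os.environ.get("SIE_DISK_PRESSURE_THRESHOLD_PERCENT", "85"))
    if usage is None:
        du = shutil.disk_usage(path)
        usage = (du.used, du.total)
    used, total = usage
    return 100.0 * used / total > threshold

# A 90%-full disk exceeds the default 85% threshold; 80% does not.
pressured = over_disk_pressure(usage=(90, 100))
```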

Cluster cache is useful for Kubernetes deployments where multiple workers share the same S3/GCS bucket, avoiding redundant downloads from HuggingFace.
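The three-tier lookup amounts to trying each cache in order and backfilling the faster tiers on a miss. The dicts below stand in for real backends (local disk, S3/GCS bucket, HuggingFace Hub):

```python
def resolve_model(name, local, cluster, hub):
    """Resolve weights via local disk, then the shared cluster bucket,
    then HuggingFace Hub, writing back to faster tiers on a miss."""
    if name in local:
        return local[name]
    if name in cluster:
        local[name] = cluster[name]  # backfill local disk
        return local[name]
    weights = hub[name]              # download from the Hub
    cluster[name] = weights          # share with other workers
    local[name] = weights
    return weights

local, cluster, hub = {}, {}, {"bge-m3": b"weights"}
w = resolve_model("bge-m3", local, cluster, hub)  # miss, miss, Hub download
```

After the first resolve, both the local and cluster tiers hold a copy, so subsequent workers never reach the Hub.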


Deployment topologies

Client → sie-server (single GPU)

Simplest setup. The client connects directly to one server. Good for development and small production deployments.

Client → sie-server:8080 (default bundle)
Client → sie-server:8081 (gliner bundle)

Multiple containers, each with a different bundle. Client routes to the correct port.

Client → sie-router → worker pool(s)

Full production setup with GPU routing, autoscaling, and observability. See Kubernetes in GCP or AWS.


Learn more

  • Request Pipeline - detailed preprocessing, batching, and GPU inference flow
  • Router - routing, load balancing, and resource pools
  • Adapters - compute engine abstraction layer