
Architecture

SIE is a layered system: SDK clients talk to a router (or directly to a server), which manages GPU workers that load and run models on demand.

SIE system architecture: Client Layer, Router Layer, Worker Layer

The SDK provides encode(), score(), and extract() methods. It handles:

  • msgpack serialization — Binary wire format, faster and smaller than JSON
  • Automatic 202 retry — Waits for scale-from-zero with wait_for_capacity=True
  • Pool management — Background lease renewal for resource pools
  • Numpy integration — Returns native numpy arrays for embeddings

Framework integrations (LangChain, LlamaIndex, etc.) wrap the SDK with framework-specific interfaces.
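The automatic 202 retry can be sketched as a simple poll loop. The function name, callable signature, and timing constants below are illustrative stand-ins, not the actual SDK internals:

```python
import time

def post_with_capacity_wait(send_request, wait_for_capacity=True,
                            max_wait_s=60.0, poll_interval_s=0.05):
    """Retry while the server answers 202 (workers scaling from zero).

    `send_request` is any callable returning (status_code, body).
    """
    deadline = time.monotonic() + max_wait_s
    while True:
        status, body = send_request()
        if status != 202:
            return status, body
        if not wait_for_capacity or time.monotonic() >= deadline:
            return status, body  # caller sees the 202 and can retry later
        time.sleep(poll_interval_s)

# Example: a stub server that needs two polls before capacity is available.
responses = iter([(202, None), (202, None), (200, {"embeddings": [[0.1, 0.2]]})])
status, body = post_with_capacity_wait(lambda: next(responses))
```

With `wait_for_capacity=False`, the first 202 is returned to the caller immediately instead of being retried.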

The router is a stateless proxy that sits between clients and workers. It’s optional for single-server setups but required for Kubernetes clusters.

Responsibilities:

  • Routes requests to the correct GPU pool based on X-SIE-MACHINE-PROFILE
  • Prefers workers with the requested model already loaded (model affinity)
  • Returns 202 Accepted when workers are scaled to zero
  • Manages resource pools for tenant isolation
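The routing decision above can be sketched as: filter workers by machine profile, then prefer a warm worker. The dict-based worker records are hypothetical; the real router also tracks load and pool leases:

```python
def pick_worker(workers, machine_profile, model):
    """Route to the requested GPU pool, preferring a worker that already
    has the model loaded; return None to signal 202 (pool scaled to zero)."""
    pool = [w for w in workers if w["profile"] == machine_profile]
    if not pool:
        return None  # router answers 202 Accepted while the pool scales up
    warm = [w for w in pool if model in w["loaded_models"]]
    return (warm or pool)[0]

workers = [
    {"profile": "gpu-l4", "loaded_models": {"bge-m3"}},
    {"profile": "gpu-l4", "loaded_models": {"gliner"}},
]
w = pick_worker(workers, "gpu-l4", "gliner")  # model affinity picks the warm worker
```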

Each worker is a single-GPU inference server running the full pipeline:

  1. Preprocess — Tokenization and image processing (CPU thread pool)
  2. Batch — Cost-based batching by token count
  3. GPU Inference — Model forward pass via adapter (PyTorch, Flash Attention, SGLang)
  4. Postprocess — Quantization, MUVERA transform (CPU thread pool)
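Cost-based batching by token count amounts to greedily grouping requests until the next one would blow a token budget. This is a minimal sketch; the budget value and function name are assumptions:

```python
def batch_by_tokens(requests, max_tokens=8192):
    """Greedily group requests so each batch stays within a token budget.

    `requests` is a list of (request_id, token_count) pairs.
    """
    batches, current, used = [], [], 0
    for req_id, tokens in requests:
        if current and used + tokens > max_tokens:
            batches.append(current)  # flush the full batch
            current, used = [], 0
        current.append(req_id)
        used += tokens
    if current:
        batches.append(current)
    return batches

# "a" and "b" fit in one 8192-token batch; "c" starts a new one.
batches = batch_by_tokens([("a", 5000), ("b", 3000), ("c", 2000)])
```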

Workers manage multiple models on one GPU with LRU eviction when memory pressure exceeds the threshold.
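The multi-model LRU behavior can be sketched with an `OrderedDict`; the memory accounting and threshold handling here are deliberately simplified assumptions:

```python
from collections import OrderedDict

class ModelCache:
    """Hold models on one GPU; evict least-recently-used models when
    loading another would exceed the memory budget (arbitrary units)."""

    def __init__(self, budget):
        self.budget = budget
        self.models = OrderedDict()  # name -> size, oldest first

    def load(self, name, size):
        if name in self.models:
            self.models.move_to_end(name)  # mark as recently used
            return
        while self.models and sum(self.models.values()) + size > self.budget:
            self.models.popitem(last=False)  # evict the LRU model
        self.models[name] = size

cache = ModelCache(budget=10)
cache.load("bge-m3", 4)
cache.load("gliner", 4)
cache.load("bge-m3", 4)   # touch: bge-m3 is now most recently used
cache.load("colbert", 4)  # over budget: evicts "gliner", the LRU entry
```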


Wire format

SIE uses msgpack as the default wire format instead of JSON:

| Format  | Encode speed | Decode speed | Size         | Numpy support             |
| ------- | ------------ | ------------ | ------------ | ------------------------- |
| msgpack | Fast         | Fast         | ~50% of JSON | Native via msgpack-numpy  |
| JSON    | Slower       | Slower       | Baseline     | Requires list conversion  |

The SDK sends and receives msgpack automatically. The OpenAI-compatible /v1/embeddings endpoint uses JSON for compatibility.


Model cache

Model weights are resolved through a three-tier cache:

Model cache hierarchy: Local Cache, Cluster Cache, HuggingFace Hub

Local disk cache uses LRU eviction when disk usage exceeds SIE_DISK_PRESSURE_THRESHOLD_PERCENT (default: 85%).
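The pressure check itself is straightforward; the `shutil.disk_usage` approach below is a plausible implementation sketch, not the actual one (the env-var name comes from the docs above):

```python
import os
import shutil

def over_disk_pressure(path="/", usage=None):
    """Return True when disk usage exceeds the eviction threshold.

    `usage` can be injected as a (used, total) pair for testing;
    otherwise it is read from the filesystem.
    """
    threshold = float(os.environ.get("SIE_DISK_PRESSURE_THRESHOLD_PERCENT", "85"))
    if usage is None:
        du = shutil.disk_usage(path)
        usage = (du.used, du.total)
    used, total = usage
    return 100.0 * used / total > threshold

# A 90%-full disk exceeds the default 85% threshold; 80% does not.
pressured = over_disk_pressure(usage=(90, 100))
```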

Cluster cache is useful for Kubernetes deployments where multiple workers share the same S3/GCS bucket, avoiding redundant downloads from HuggingFace.
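The three-tier lookup amounts to trying each cache in order and backfilling the faster tiers on a miss. The dicts below stand in for real backends (local disk, S3/GCS bucket, HuggingFace Hub):

```python
def resolve_model(name, local, cluster, hub):
    """Resolve weights via local disk, then the shared cluster bucket,
    then HuggingFace Hub, writing back to faster tiers on a miss."""
    if name in local:
        return local[name]
    if name in cluster:
        local[name] = cluster[name]  # backfill local disk
        return local[name]
    weights = hub[name]              # download from the Hub
    cluster[name] = weights          # share with other workers
    local[name] = weights
    return weights

local, cluster, hub = {}, {}, {"bge-m3": b"weights"}
w = resolve_model("bge-m3", local, cluster, hub)  # miss, miss, Hub download
```

After the first resolve, both the local and cluster tiers hold a copy, so subsequent workers never reach the Hub.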


Deployment topologies

Client → sie-server (single GPU)

Simplest setup. The client connects directly to one server. Good for development and small production deployments.

Client → sie-server:8080 (default bundle)
Client → sie-server:8081 (gliner bundle)

Multiple containers, each with a different bundle. Client routes to the correct port.

Client → sie-router → worker pool(s)

Full production setup with GPU routing, autoscaling, and observability. See Kubernetes in GCP or AWS.


Learn more

  • Request Pipeline - detailed preprocessing, batching, and GPU inference flow
  • Router - routing, load balancing, and resource pools
  • Adapters - compute engine abstraction layer