SIE

Overview

SIE processes requests through a multi-stage pipeline. Understanding this flow helps you tune throughput, debug latency, and optimize GPU utilization.

Every request follows this path:

HTTP Request
      │
      ▼
┌──────────────┐
│  Preprocess  │  Tokenization / image processing (CPU thread pool)
└──────────────┘
      │
      ▼
┌──────────────┐
│    Batch     │  Accumulate requests by cost budget
└──────────────┘
      │
      ▼
┌──────────────┐
│  GPU Worker  │  Model inference (encode, score, or extract)
└──────────────┘
      │
      ▼
┌──────────────┐
│ Postprocess  │  MUVERA, quantization (CPU thread pool)
└──────────────┘
      │
      ▼
HTTP Response

The server handles encode, score, and extract operations. All three share this architecture.

Preprocessing runs on CPU in a parallel thread pool. This stage converts raw inputs into tensors.

Text inputs are tokenized before batching. Tokenization runs in a parallel thread pool (up to 8 workers by default) and produces the token count that determines each item's batching cost. Original indices are tracked so results can be routed back to the correct request items.

For multimodal models, images are resized and normalized. The same CPU thread pool handles both text tokenization and image processing, enabling overlap with GPU inference.
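
A minimal sketch of this stage, assuming a Hugging Face tokenizer and a standard thread pool (preprocess_batch and its structure are illustrative, not SIE's internal API):

from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
pool = ThreadPoolExecutor(max_workers=8)  # mirrors the default of up to 8 workers

def preprocess(index, text):
    # Tokenize one item; the token count becomes its batching cost.
    ids = tokenizer(text, truncation=True)["input_ids"]
    return index, ids, len(ids)

def preprocess_batch(texts):
    # Fan out across the pool, keeping each item's original index so results
    # can be routed back to the correct request item after inference.
    futures = [pool.submit(preprocess, i, t) for i, t in enumerate(texts)]
    return [f.result() for f in futures]

for index, ids, cost in preprocess_batch(["Hello", "A longer input to tokenize"]):
    print(index, cost)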

The batcher accumulates requests based on token budget, not request count. This prevents large sequences from monopolizing GPU memory.

The BatchFormer accumulates incoming requests and yields a batch when any of these conditions is met:

  • Cost limit: Total tokens reach max_batch_cost (default: 16,384)
  • Request limit: Request count reaches max_batch_requests (default: 64)
  • Timeout: Wait time exceeds max_batch_wait_ms (default: 10ms)

The timeout ensures low latency under light load, while the cost and request limits maximize throughput under heavy load.
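
A simplified, synchronous sketch of that accumulation logic (the real BatchFormer is asynchronous; the queue of (item, cost) pairs is assumed):

import queue
import time

MAX_BATCH_COST = 16_384      # default max_batch_cost
MAX_BATCH_REQUESTS = 64      # default max_batch_requests
MAX_BATCH_WAIT_MS = 10       # default max_batch_wait_ms

def form_batch(pending: queue.Queue):
    # Accumulate (item, cost) pairs until any of the three limits is hit.
    batch, total_cost = [], 0
    deadline = time.monotonic() + MAX_BATCH_WAIT_MS / 1000
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:                        # timeout: flush for low latency
            break
        try:
            item, cost = pending.get(timeout=remaining)
        except queue.Empty:                       # nothing arrived before the deadline
            break
        batch.append(item)
        total_cost += cost
        if total_cost >= MAX_BATCH_COST:          # cost limit: cap GPU memory per batch
            break
        if len(batch) >= MAX_BATCH_REQUESTS:      # request limit
            break
    return batch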

Cost varies by modality:

Modality   Cost Calculation
Text       Token count
Images     1 per image (fixed dimensions)
Audio      sample_count / chunk_size
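
As a rough sketch, the cost function reduces to a per-modality branch (the field names and chunk_size default here are illustrative, not SIE's schema):

def item_cost(item, chunk_size=16_000):
    # chunk_size is a placeholder; the server derives it from the audio model.
    if "tokens" in item:                           # text: token count from preprocessing
        return len(item["tokens"])
    if "image" in item:                            # images: fixed cost of 1
        return 1
    if "audio_samples" in item:                    # audio: samples per chunk
        return max(1, item["audio_samples"] // chunk_size)
    raise ValueError("unrecognized item modality")

print(item_cost({"tokens": list(range(42))}))   # 42
print(item_cost({"image": b"..."}))             # 1
print(item_cost({"audio_samples": 48_000}))     # 3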

Before inference, items are sorted by cost within each batch. This groups similar-length sequences together, minimizing padding waste. Short sequences batch with short sequences; long sequences batch with long sequences. This simple sort can reduce padding by 20-40% compared to FIFO ordering.
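
A toy padding calculation shows why (illustrative only; it assumes batches pad to their longest sequence):

def padding_waste(costs, batch_size=4):
    # Tokens of padding needed when consecutive items form fixed-size batches,
    # each padded to its longest sequence.
    waste = 0
    for i in range(0, len(costs), batch_size):
        chunk = costs[i:i + batch_size]
        waste += sum(max(chunk) - c for c in chunk)
    return waste

lengths = [512, 8, 480, 16, 500, 12, 490, 20]
print(padding_waste(lengths))           # FIFO order: 2010 padding tokens
print(padding_waste(sorted(lengths)))   # sorted by cost: 90 padding tokens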

The ModelWorker manages a single model’s inference pipeline. It runs inference in a dedicated thread to avoid blocking the async event loop.

Items from different HTTP requests can share a GPU batch if they have matching configuration: same output types, same instruction (for instruction-tuned models), same query/document flag, and same LoRA adapter. This maximizes GPU utilization when multiple clients send concurrent requests with compatible parameters.

The worker loop waits for batches from the BatchFormer, groups items by inference configuration, runs the model forward pass in a dedicated thread pool, and fans results back to waiting request futures. A single inference thread serializes GPU work, preventing memory fragmentation from concurrent CUDA allocations.
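
In effect, cross-request batching is a group-by on a configuration key before the forward pass (a sketch; the field names are assumptions, not SIE's schema):

from collections import defaultdict

def group_by_config(items):
    # Items from different HTTP requests share a GPU batch only when their
    # inference configuration matches exactly.
    groups = defaultdict(list)
    for item in items:
        key = (
            tuple(item["output_types"]),    # e.g. ("dense",) or ("dense", "colbert")
            item.get("instruction"),        # instruction-tuned models
            item.get("is_query", False),    # query vs. document flag
            item.get("lora_adapter"),       # LoRA adapter, if any
        )
        groups[key].append(item)
    return groups

requests = [
    {"output_types": ["dense"], "is_query": True, "text": "what is MUVERA"},
    {"output_types": ["dense"], "is_query": True, "text": "lru eviction"},
    {"output_types": ["dense"], "is_query": False, "text": "a document to index"},
]
for key, group in group_by_config(requests).items():
    print(key, len(group))   # two groups: the queries batch together, the document alone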

After inference, optional transforms run on CPU:

For ColBERT models, MUVERA converts variable-length token embeddings into fixed-size vectors. This enables standard vector search on multi-vector outputs.

The output_dtype option reduces embedding precision for storage efficiency. Postprocessors run in the same CPU thread pool as preprocessing, and quantization applies last after all model-specific transforms.
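
At its core this is a dtype cast of the final embeddings; a NumPy sketch (the int8 scaling shown is one common scheme, not necessarily the exact one SIE uses):

import numpy as np

def quantize(embeddings: np.ndarray, output_dtype: str) -> np.ndarray:
    # Applied last, after MUVERA or any other model-specific transform.
    if output_dtype == "float32":
        return embeddings.astype(np.float32)
    if output_dtype == "float16":
        return embeddings.astype(np.float16)
    if output_dtype == "int8":
        # Scale each vector into [-127, 127] before rounding.
        scale = np.abs(embeddings).max(axis=-1, keepdims=True) + 1e-12
        return np.round(embeddings / scale * 127).astype(np.int8)
    raise ValueError(f"unsupported output_dtype: {output_dtype}")

vectors = np.random.randn(2, 1024).astype(np.float32)
print(quantize(vectors, "float16").dtype, quantize(vectors, "int8").dtype)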

SIE uses reactive LRU eviction to manage GPU memory without static VRAM budgets.

The memory manager monitors device utilization and triggers eviction when usage exceeds the pressure threshold (default: 85%). The least-recently-used model is evicted first.

Eviction happens at two points:

  1. Pre-load: Before loading a new model, evict LRU models until below threshold
  2. Background monitor: Periodic checks catch memory growth during inference

Every request updates the model’s last-used timestamp. Models are tracked in an ordered structure where the front contains the least-recently-used model (first eviction candidate) and the back contains the most-recently-used.
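
An ordered mapping captures both behaviors: each use moves the model to the back, and eviction pops from the front until utilization drops below the threshold (a sketch; get_utilization and unload stand in for the real memory manager hooks):

from collections import OrderedDict

PRESSURE_THRESHOLD = 0.85    # default pressure threshold

class LRUModelTracker:
    def __init__(self):
        self._models = OrderedDict()   # front = least recently used

    def record_use(self, model_id, model):
        # Every request refreshes the model's position at the back (MRU).
        self._models[model_id] = model
        self._models.move_to_end(model_id)

    def evict_until_below(self, get_utilization, unload):
        # Pop LRU models until device utilization falls below the threshold;
        # the same loop serves both pre-load eviction and the background monitor.
        while get_utilization() > PRESSURE_THRESHOLD and self._models:
            model_id, model = self._models.popitem(last=False)
            unload(model_id, model)

tracker = LRUModelTracker()
tracker.record_use("BAAI/bge-m3", object())
tracker.evict_until_below(lambda: 0.90, lambda mid, m: print("evicting", mid))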

The SDK returns timing information with each response:

from sie_sdk import SIEClient
from sie_sdk.types import Item
client = SIEClient("http://localhost:8080")
result = client.encode("BAAI/bge-m3", Item(text="Hello"))
print(result["timing"])
# {"queue_ms": 2.1, "inference_ms": 15.3, "total_ms": 18.5}

Metric                       Description
queue_ms / queueMs           Time waiting in batch queue
inference_ms / inferenceMs   GPU inference time
total_ms / totalMs           End-to-end server latency

High queue time indicates batching is working effectively. Very low values may mean requests are processed one at a time (low concurrency).
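
One way to observe this is to send concurrent requests and compare queue_ms (same SDK calls as above; the concurrency level is arbitrary and thread safety of the client is assumed):

from concurrent.futures import ThreadPoolExecutor
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

def encode_one(i):
    return client.encode("BAAI/bge-m3", Item(text=f"sentence {i}"))["timing"]

with ThreadPoolExecutor(max_workers=32) as pool:
    timings = list(pool.map(encode_one, range(128)))

# Under concurrency, queue_ms should rise as requests wait to share a batch.
print(sum(t["queue_ms"] for t in timings) / len(timings))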