
What is SIE?

SIE is an inference server for small models (85+ supported). It exposes three primitives: encode (text and images to vectors), score (query-document relevance), and extract (entities and structure).
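To make the score primitive's contract concrete, here is a toy illustration in plain Python. This is not the SIE API: it stands in a word-overlap heuristic where a real reranker model would run, but the shape is the same — a query and candidate documents go in, relevance-ordered `(index, score)` pairs come out.

```python
# Toy illustration of the score primitive's contract: given a query and
# candidate documents, return (index, score) pairs sorted by relevance.
# Jaccard word overlap is a stand-in for a real reranker model.
def score(query: str, documents: list[str]) -> list[tuple[int, float]]:
    q = set(query.lower().split())
    results = []
    for i, doc in enumerate(documents):
        d = set(doc.lower().split())
        overlap = len(q & d) / max(len(q | d), 1)
        results.append((i, overlap))
    return sorted(results, key=lambda r: r[1], reverse=True)

docs = ["the cat sat on the mat", "inference servers for small models"]
ranked = score("small model inference", docs)  # doc 1 ranks first
```

The encode and extract primitives follow the same pattern: inputs in, vectors or structured entities out.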

Start with the Quickstart to get your first vectors in 2 minutes. The API Reference and SDK Reference cover the full interface.

| I want to… | Go to |
| --- | --- |
| Embed text or images | Encode Overview |
| Rerank search results | Score Overview |
| Extract entities | Extract Overview |
| Choose the right model | Model Selection Guide |
| See all 85+ models | Model Catalog |
| Deploy to production | Deployment Overview |
| Migrate from OpenAI | Integrations |
| Use LangChain / LlamaIndex | Integrations |

LLM inference tools optimize for serving one large model across multiple GPUs. Small model inference is the opposite problem: you run many models (encoders, rerankers, extractors) on one GPU and need fast switching between them.

What makes SIE different:

  1. Compute engine abstraction. SIE wraps PyTorch, SGLang, and Flash Attention behind three primitives. The server picks the best engine per model automatically.

  2. Multi-model GPU sharing. Load many models on one GPU with LRU eviction. One server instance serves any model at query time.

  3. Laptop to cloud. Same codebase runs locally and in production Kubernetes.

  4. Validated correctness. Every model has quality and latency targets checked in CI.
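A minimal sketch of the LRU eviction policy point 2 describes. `ModelCache` and `load_model` are illustrative names, not SIE internals; a real implementation would account for per-model GPU memory rather than a fixed model count.

```python
from collections import OrderedDict

# Sketch of multi-model GPU sharing with LRU eviction: keep up to
# `capacity` models resident, evicting the least recently used one
# when a new model must be loaded.
class ModelCache:
    def __init__(self, capacity: int, load_model):
        self.capacity = capacity
        self.load_model = load_model  # stand-in for real weight loading
        self.cache = OrderedDict()    # model name -> loaded model, LRU order

    def get(self, name: str):
        if name in self.cache:
            self.cache.move_to_end(name)  # mark as most recently used
            return self.cache[name]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        self.cache[name] = self.load_model(name)
        return self.cache[name]

cache = ModelCache(capacity=2, load_model=lambda name: f"weights:{name}")
cache.get("encoder-a")
cache.get("reranker-b")
cache.get("encoder-a")    # refreshes encoder-a
cache.get("extractor-c")  # evicts reranker-b, the least recently used
```

This is why one server instance can serve any model at query time: a request for a cold model triggers a load (and possibly an eviction) rather than an error.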