# What is SIE?
SIE is an inference server for small models (85+ supported). It exposes three primitives: encode (text and images to vectors), score (query-document relevance), and extract (entities and structure).
Start with the Quickstart to get your first vectors in 2 minutes. The API Reference and SDK Reference cover the full interface.
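The three primitives map naturally onto request payloads. The sketch below shows what inputs to each primitive might look like; the endpoint paths (`/v1/encode`, `/v1/score`, `/v1/extract`), model ids, and field names here are illustrative assumptions, not SIE's documented interface — see the API Reference for the real shapes.

```python
import json

# Hypothetical request payloads for SIE's three primitives.
# Paths, model ids, and field names are assumptions for illustration;
# consult the API Reference for the actual interface.

encode_req = {
    "model": "some-encoder",      # placeholder model id
    "inputs": ["hello world"],    # text (or image refs) to embed
}

score_req = {
    "model": "some-reranker",     # placeholder model id
    "query": "best pizza in town",
    "documents": ["Pizza place A", "Sushi bar B"],
}

extract_req = {
    "model": "some-extractor",    # placeholder model id
    "text": "Ada Lovelace was born in London.",
}

for path, payload in [
    ("/v1/encode", encode_req),
    ("/v1/score", score_req),
    ("/v1/extract", extract_req),
]:
    print(path, json.dumps(payload))
```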
## Get Started

| I want to… | Go to |
|---|---|
| Embed text or images | Encode Overview |
| Rerank search results | Score Overview |
| Extract entities | Extract Overview |
| Choose the right model | Model Selection Guide |
| See all 85+ models | Model Catalog |
| Deploy to production | Deployment Overview |
| Migrate from OpenAI | Integrations |
| Use LangChain / LlamaIndex | Integrations |
## Why SIE Exists

LLM inference tools optimize for serving one large model across multiple GPUs. Small-model inference is the opposite problem: you run many models (encoders, rerankers, extractors) on one GPU with fast switching.
What makes SIE different:
- **Compute engine abstraction.** SIE wraps PyTorch, SGLang, and Flash Attention behind three primitives. The server picks the best engine per model automatically.
- **Multi-model GPU sharing.** Load many models on one GPU with LRU eviction. One server instance serves any model at query time.
- **Laptop to cloud.** The same codebase runs locally and in production Kubernetes.
- **Validated correctness.** Every model has quality and latency targets checked in CI.
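The multi-model sharing described above is, at its core, an LRU cache keyed by model id. A minimal sketch of that policy follows, assuming a hypothetical `ModelCache` with a stand-in loader; it illustrates the eviction behavior, not SIE's actual implementation.

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most `capacity` models resident; evict the least recently used.

    A sketch of the LRU policy described above, not SIE's implementation.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._models = OrderedDict()  # model_id -> loaded model object

    def get(self, model_id: str):
        if model_id not in self._models:
            # Cache miss: evict the LRU entry if full, then load.
            if len(self._models) >= self.capacity:
                self._models.popitem(last=False)
            self._models[model_id] = self._load(model_id)
        # Mark as most recently used.
        self._models.move_to_end(model_id)
        return self._models[model_id]

    def _load(self, model_id: str):
        # Stand-in for loading weights onto the GPU.
        return f"<model {model_id}>"

cache = ModelCache(capacity=2)
cache.get("encoder-a")
cache.get("reranker-b")
cache.get("encoder-a")      # refresh: encoder-a is now most recent
cache.get("extractor-c")    # evicts reranker-b (least recently used)
print(list(cache._models))  # ['encoder-a', 'extractor-c']
```

Because any model can be requested at query time, a miss triggers a load rather than an error — which is what lets one server instance serve the whole catalog.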