What is SIE?
SIE is an inference server for small models (80+ supported). It exposes three primitives: encode (text and images to vectors), score (query-document relevance), and extract (entities and structure).
Start with the Quickstart to get your first vectors in 2 minutes. The API Reference and SDK Reference cover the full interface.
```python
# docker run -p 8080:8080 ghcr.io/superlinked/sie:latest
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# encode: text → vectors
result = client.encode("BAAI/bge-m3", Item(text="your text"))
print(result["dense"].shape)  # (1024,)

# score: query + items → ranked results
query = Item(text="What is machine learning?")
items = [Item(text="ML learns from data."), Item(text="The weather is nice.")]
scores = client.score("BAAI/bge-reranker-v2-m3", query, items)

# extract: text → entities
result = client.extract(
    "urchade/gliner_multi-v2.1",
    Item(text="Tim Cook leads Apple."),
    labels=["person", "org"],
)
```
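The quickstart does not show what the score call returns. As a minimal sketch, assuming `scores` is a sequence of floats aligned one-to-one with `items` and that `Item` exposes a `.text` attribute (both assumptions, not confirmed by the snippet above), you could rank the candidates like this:

```python
# Sketch only: assumes `scores` is a list of floats, one per item, in order.
ranked = sorted(zip(items, scores), key=lambda pair: pair[1], reverse=True)
for item, score in ranked:
    print(f"{score:.3f}  {item.text}")  # assumes Item has a .text attribute
```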
Why SIE Exists
LLM inference tools optimize for serving one large model across multiple GPUs. Small-model inference is the opposite problem: you run many models (encoders, rerankers, extractors) on one GPU with fast switching.
What makes SIE different:
- Compute engine abstraction. SIE wraps PyTorch, SGLang, and Flash Attention behind three primitives. The server picks the best engine per model automatically.
- Multi-model GPU sharing. Load many models on one GPU with LRU eviction; one server instance serves any model at query time (see the sketch after this list).
- Laptop to cloud. The same codebase runs locally and in production Kubernetes.
- Validated correctness. Every model has quality and latency targets checked in CI.
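The GPU-sharing point is easiest to picture as an LRU cache keyed by model ID. The sketch below is a hypothetical illustration of that eviction policy, not SIE's actual implementation; `LRUModelCache`, `load`, and `unload` are made-up names standing in for whatever the server really does to move weights on and off the GPU.

```python
from collections import OrderedDict
from typing import Any, Callable

class LRUModelCache:
    """Hypothetical sketch of LRU eviction for models sharing one GPU."""

    def __init__(self, max_models: int,
                 load: Callable[[str], Any],
                 unload: Callable[[Any], None]):
        self.max_models = max_models
        self._load = load      # placeholder: load weights onto the GPU
        self._unload = unload  # placeholder: free GPU memory
        self._models: "OrderedDict[str, Any]" = OrderedDict()  # model_id -> handle

    def get(self, model_id: str) -> Any:
        if model_id in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        if len(self._models) >= self.max_models:
            # Cache full: evict the least recently used model first.
            _, evicted = self._models.popitem(last=False)
            self._unload(evicted)
        handle = self._load(model_id)
        self._models[model_id] = handle
        return handle
```

Any model requested at query time is loaded on demand this way, which is why one server instance can serve the whole catalog without pinning every model in GPU memory at once.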