# Choosing a Model

SIE supports 85+ models. This guide helps you pick the right one based on your use case, language requirements, and performance needs.

## Quick recommendations

| Use Case | Recommended Model | Why |
|---|---|---|
| English-only, balanced | NovaSearch/stella_en_400M_v5 | Strong MTEB scores, efficient size |
| English-only, max quality | nvidia/NV-Embed-v2 | Top MTEB scores, 4096 dims |
| Speed-optimized | sentence-transformers/all-MiniLM-L6-v2 | 22M params, 384 dims, very fast |
| Multilingual | BAAI/bge-m3 | 100+ languages, also supports sparse + multi-vector |
| Hybrid search | BAAI/bge-m3 or naver/splade-v3 | Dense + sparse from one model, or dedicated sparse |
| Late interaction (ColBERT) | jinaai/jina-colbert-v2 | Best ColBERT quality, multilingual |
| Vision / image search | google/siglip-so400m-patch14-384 | Image-text similarity |
| Multilingual, fast | Qwen/Qwen3-Embedding-0.6B | 1024 dims, 32K context, 100+ languages |
| Document vision (PDF) | vidore/colpali-v1.3-hf | Visual document retrieval |
| ColBERT reranking | answerdotai/answerai-colbert-small-v1 | Fast MaxSim reranking; also jina-colbert-v2, GTE-ModernColBERT-v1 |
| Reranking (multilingual) | BAAI/bge-reranker-v2-m3 | Strong cross-language reranking |
| Reranking (English) | mixedbread-ai/mxbai-rerank-large-v2 | High quality, 8192 max length |
| Entity extraction | urchade/gliner_multi-v2.1 | Zero-shot NER, multilingual |
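Switching between embedding models is typically just a change of model ID. A minimal sketch, assuming a running SIE endpoint and the `client`/`Item` API used in the examples later in this guide (the import path and constructor here are hypothetical; adjust them to your client setup):

```python
# Hypothetical import and constructor: substitute your actual SIE client setup.
from sie_client import Client, Item

client = Client("http://localhost:8000")  # assumed local SIE endpoint

# The call shape stays the same for any embedding model; only the ID changes.
result = client.encode(
    "NovaSearch/stella_en_400M_v5",  # the English-only default from the table
    Item(text="What is machine learning?"),
)
```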

## Models by use case

| Use Case | Scenario | Recommended Models |
|---|---|---|
| Semantic Search / RAG | English-only | stella_en_400M_v5, NV-Embed-v2, all-MiniLM-L6-v2 |
| | Multilingual | BAAI/bge-m3 |
| | Hybrid (dense + sparse) | BAAI/bge-m3 + naver/splade-v3 |
| Image Search | Text ↔ Image | SigLIP, CLIP |
| | Visual docs | ColPali |
| Reranking | Multilingual | BAAI/bge-reranker-v2-m3 |
| | English | mixedbread-ai/mxbai-rerank-large-v2 |
| Entity Extraction | NER | GLiNER |
| | Relations | GLiREL |
| | Classification | GLiClass |

## Performance comparison

| Model | Params | Dims | VRAM | Relative Speed | Quality |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | 384 | ~200MB | Fastest | Good |
| stella_en_400M_v5 | 400M | 1024 | ~1.5GB | Fast | Very good |
| bge-m3 | 568M | 1024 | ~2GB | Fast | Very good |
| NV-Embed-v2 | 7B | 4096 | ~14GB | Slow | Best |

**Rule of thumb:** For English, start with stella_en_400M_v5. For multilingual or hybrid search, use BAAI/bge-m3. Only move to 7B+ models if benchmarks show a meaningful gap on your data.
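Measuring that gap does not require a full benchmark harness: recall on a small labeled set of query/document pairs is often enough to decide. A rough sketch, reusing the `client`/`Item` setup from the sketch above and assuming the encode response exposes its dense vector as `.dense` (an assumption; adjust to your client's response shape):

```python
import numpy as np

# Tiny illustrative eval set: relevant_idx[i] is the index of the doc
# that should rank highly for queries[i].
queries = ["how do transformers work", "reset a forgotten password"]
docs = ["An overview of transformer attention.", "Steps to reset your password."]
relevant_idx = [0, 1]

def recall_at_k(model_id: str, k: int = 5) -> float:
    # Assumption: result.dense holds the dense embedding as a list of floats.
    q = np.array([client.encode(model_id, Item(text=t)).dense for t in queries])
    d = np.array([client.encode(model_id, Item(text=t)).dense for t in docs])
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    topk = np.argsort(-(q @ d.T), axis=1)[:, :k]  # top-k doc indices per query
    return float(np.mean([r in row for r, row in zip(relevant_idx, topk)]))

for model_id in ["NovaSearch/stella_en_400M_v5", "nvidia/NV-Embed-v2"]:
    print(model_id, recall_at_k(model_id))
```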

## Output types

| Output Type | Storage | Search Speed | Quality | Best For |
|---|---|---|---|---|
| Dense | Small (1024 floats) | Fast | Good | Standard semantic search |
| Sparse | Variable | Fast | Good for keywords | Hybrid search, keyword matching |
| Multi-vector (ColBERT) | Large (N × 128 floats) | Slower | Best | When accuracy is critical |

**Recommendation:** Use dense for most cases. Add sparse for hybrid search if you need keyword matching. Use multi-vector only when you need the best possible retrieval quality and can afford the storage.
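For intuition on where the multi-vector cost comes from: ColBERT-style models score a query/document pair by comparing every query token vector with every document token vector and summing the per-query-token maxima (MaxSim). A self-contained NumPy sketch with illustrative shapes:

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT MaxSim: match each query token to its best document token,
    then sum those maxima. Inputs are L2-normalized token embeddings."""
    sims = query_vecs @ doc_vecs.T        # (query_tokens, doc_tokens) cosines
    return float(sims.max(axis=1).sum())  # best doc token per query token

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))    # 8 query tokens x 128 dims
d = rng.normal(size=(300, 128))  # a 300-token doc stores 300 x 128 floats
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))
```

The storage column above follows directly: a document keeps one 128-dim vector per token instead of a single pooled vector.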


## Language support

| Language Need | Models |
|---|---|
| English only | Stella, NV-Embed-v2, all-MiniLM, GTE-Qwen2 |
| Multilingual (100+ languages) | BGE-M3, multilingual-e5-large, Qwen3-Embedding-0.6B |
| Chinese-focused | GTE-Qwen2, BGE-M3 |

## GPU memory requirements

| GPU | VRAM | Models That Fit |
|---|---|---|
| T4 | 16GB | Most models up to ~1B params |
| L4 | 24GB | All standard models, 2-3 loaded simultaneously |
| A100 40GB | 40GB | Large models, 5+ loaded simultaneously |
| A100 80GB | 80GB | 7B+ parameter models (NV-Embed-v2, e5-mistral-7b) |

With LRU eviction, you can serve all 85+ models from a single GPU — only the most recently used models stay in memory.


## Should you use a reranker?

Almost always. Two-stage retrieval (retrieve with embeddings, then rerank with a cross-encoder) consistently improves quality:

  1. Retrieve 20-50 candidates with dense embeddings (fast)
  2. Rerank to top 5-10 with a cross-encoder (more accurate)

The reranker sees both query and document together, enabling deeper semantic matching than embedding similarity alone.

```python
# Stage 1: fast candidate retrieval with dense embeddings
# (vector_db and query_embedding come from your existing search setup)
results = vector_db.search(query_embedding, k=20)

# Stage 2: accurate reranking with a cross-encoder
reranked = client.score(
    "mixedbread-ai/mxbai-rerank-large-v2",
    query=Item(text="What is machine learning?"),
    items=[Item(text=r.text) for r in results],
)
```

## When to add sparse embeddings

Add sparse embeddings when your data has:

  • Domain-specific terminology that dense models might miss
  • Exact keyword matching requirements (product codes, identifiers)
  • Mixed content where some queries are keyword-like and others are semantic

```python
# Get both dense and sparse embeddings from a single bge-m3 call
result = client.encode(
    "BAAI/bge-m3",
    Item(text="your text"),
    output_types=["dense", "sparse"],
)
# Use the dense vector for semantic search and the sparse weights for
# keyword matching, then combine the scores for hybrid retrieval.
```
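One common way to combine the two signals is a weighted sum of the dense cosine similarity and the sparse dot product. A sketch, assuming the sparse output can be read as a token-id-to-weight mapping (the 0.7/0.3 split below is just an illustrative starting point to tune on your data):

```python
def hybrid_score(dense_q, dense_d, sparse_q, sparse_d, alpha=0.7):
    """Weighted blend of dense and sparse relevance.

    dense_*: L2-normalized embedding vectors (sequences of floats).
    sparse_*: dicts mapping token id -> weight (assumed sparse format).
    alpha weights the dense (semantic) signal against the sparse (lexical) one.
    """
    dense_sim = sum(a * b for a, b in zip(dense_q, dense_d))
    sparse_sim = sum(w * sparse_d.get(tok, 0.0) for tok, w in sparse_q.items())
    return alpha * dense_sim + (1 - alpha) * sparse_sim
```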