# Choosing a Model
SIE supports 85+ models. This guide helps you pick the right one based on your use case, language requirements, and performance needs.
## Quick Recommendations

| Use Case | Recommended Model | Why |
|---|---|---|
| English-only, balanced | NovaSearch/stella_en_400M_v5 | Strong MTEB scores, efficient size |
| English-only, max quality | nvidia/NV-Embed-v2 | Top MTEB scores, 4096 dims |
| Speed-optimized | sentence-transformers/all-MiniLM-L6-v2 | 22M params, 384 dims, very fast |
| Multilingual | BAAI/bge-m3 | 100+ languages, also supports sparse + multivector |
| Hybrid search | BAAI/bge-m3 or naver/splade-v3 | Dense + sparse from one model, or dedicated sparse |
| Late interaction (ColBERT) | jinaai/jina-colbert-v2 | Best ColBERT quality, multilingual |
| Vision / image search | google/siglip-so400m-patch14-384 | Image-text similarity |
| Multilingual, fast | Qwen/Qwen3-Embedding-0.6B | 1024 dims, 32K context, 100+ languages |
| Document vision (PDF) | vidore/colpali-v1.3-hf | Visual document retrieval |
| ColBERT reranking | answerdotai/answerai-colbert-small-v1 | Fast MaxSim reranking; also jina-colbert-v2, GTE-ModernColBERT-v1 |
| Reranking (multilingual) | BAAI/bge-reranker-v2-m3 | Strong cross-language reranking |
| Reranking (English) | mixedbread-ai/mxbai-rerank-large-v2 | High quality, 8192 max length |
| Entity extraction | urchade/gliner_multi-v2.1 | Zero-shot NER, multilingual |
## Decision Guide

| Use Case | Scenario | Recommended Models |
|---|---|---|
| Semantic Search / RAG | English-only | stella_en_400M_v5, NV-Embed-v2, all-MiniLM-L6-v2 |
| | Multilingual | BAAI/bge-m3 |
| | Hybrid (dense + sparse) | BAAI/bge-m3 + naver/splade-v3 |
| Image Search | Text ↔ Image | SigLIP, CLIP |
| | Visual docs | ColPali |
| Reranking | Multilingual | BAAI/bge-reranker-v2-m3 |
| | English | mixedbread-ai/mxbai-rerank-large-v2 |
| Entity Extraction | NER | GLiNER |
| | Relations | GLiREL |
| | Classification | GLiClass |
## Tradeoff Axes

### Quality vs Speed vs Memory

| Model | Params | Dims | VRAM | Relative Speed | Quality |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | 384 | ~200MB | Fastest | Good |
| stella_en_400M_v5 | 400M | 1024 | ~1.5GB | Fast | Very good |
| bge-m3 | 568M | 1024 | ~2GB | Fast | Very good |
| NV-Embed-v2 | 7B | 4096 | ~14GB | Slow | Best |
Rule of thumb: For English, start with stella_en_400M_v5. For multilingual or hybrid search, use BAAI/bge-m3. Only move to 7B+ models if benchmarks show a meaningful gap on your data.
### Dense vs Sparse vs Multi-vector

| Output Type | Storage | Search Speed | Quality | Best For |
|---|---|---|---|---|
| Dense | Small (1024 floats) | Fast | Good | Standard semantic search |
| Sparse | Variable | Fast | Good for keywords | Hybrid search, keyword matching |
| Multi-vector (ColBERT) | Large (N * 128 floats) | Slower | Best | When accuracy is critical |
Recommendation: Use dense for most cases. Add sparse for hybrid search if you need keyword matching. Use multi-vector only when you need the best possible retrieval quality and can afford the storage.
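The storage column above is simple arithmetic. A quick sketch, using fp32 (4 bytes per float) and the sizes from the table; the token and non-zero counts below are illustrative:

```python
FLOAT_BYTES = 4  # fp32

def dense_bytes(dims=1024):
    # One vector per document.
    return dims * FLOAT_BYTES

def multivector_bytes(num_tokens, dims=128):
    # ColBERT-style: one small vector per token.
    return num_tokens * dims * FLOAT_BYTES

def sparse_bytes(num_nonzero):
    # One (token_id, weight) pair per non-zero term, ~4 bytes each.
    return num_nonzero * 2 * 4

print(dense_bytes())           # 4096  (~4 KB per document)
print(multivector_bytes(256))  # 131072 (~128 KB for a 256-token document)
print(sparse_bytes(50))        # 400   (~0.4 KB for 50 active terms)
```

The roughly 30x gap between dense and multi-vector storage is why the table reserves ColBERT for accuracy-critical workloads.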
## Language Support

| Language Need | Models |
|---|---|
| English only | Stella, NV-Embed-v2, all-MiniLM, GTE-Qwen2 |
| Multilingual (100+ languages) | BGE-M3, multilingual-e5-large, Qwen3-Embedding-0.6B |
| Chinese-focused | GTE-Qwen2, BGE-M3 |
## GPU Memory Planning

| GPU | VRAM | Models That Fit |
|---|---|---|
| T4 | 16GB | Most models up to ~1B params |
| L4 | 24GB | All standard models, 2-3 loaded simultaneously |
| A100 40GB | 40GB | Large models, 5+ loaded simultaneously |
| A100 80GB | 80GB | 7B+ parameter models (NV-Embed-v2, e5-mistral-7b) |
With LRU eviction, you can serve all 85+ models from a single GPU — only the most recently used models stay in memory.
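A rough floor for the VRAM figures above is parameter count times bytes per parameter (2 for fp16/bf16 weights); activations and framework overhead add more on top. A back-of-envelope sketch:

```python
def weight_vram_gib(params: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM floor (GiB) for model weights alone.

    fp16/bf16 weights take 2 bytes per parameter; actual usage is
    higher once activations and runtime overhead are included.
    """
    return params * bytes_per_param / 1024**3

# 7B model (e.g. NV-Embed-v2): ~13 GiB of weights before overhead
print(round(weight_vram_gib(7e9), 1))    # 13.0
# bge-m3 (568M params): ~1.1 GiB of weights
print(round(weight_vram_gib(568e6), 1))  # 1.1
```

This is why 7B+ models in the table need a 16 GB card to themselves, while sub-1B models leave room to co-load several at once.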
## When to Add Reranking

Almost always. Two-stage retrieval (retrieve with embeddings, then rerank with a cross-encoder) consistently improves quality:
1. Retrieve 20-50 candidates with dense embeddings (fast)
2. Rerank to top 5-10 with a cross-encoder (more accurate)
The reranker sees both query and document together, enabling deeper semantic matching than embedding similarity alone.
```python
# Stage 1: Fast retrieval
results = vector_db.search(query_embedding, k=20)

# Stage 2: Accurate reranking
reranked = client.score(
    "mixedbread-ai/mxbai-rerank-large-v2",
    query=Item(text="What is machine learning?"),
    items=[Item(text=r.text) for r in results],
)
```
## When to Use Hybrid Search

Add sparse embeddings when your data has:
- Domain-specific terminology that dense models might miss
- Exact keyword matching requirements (product codes, identifiers)
- Mixed content where some queries are keyword-like and others are semantic
```python
# Get both dense and sparse from one model
result = client.encode(
    "BAAI/bge-m3",
    Item(text="your text"),
    output_types=["dense", "sparse"],
)

# Use dense for semantic search, sparse for keyword matching
# Combine scores for hybrid retrieval
```
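One common way to combine the two score sets is a min-max-normalized weighted sum. A sketch, not part of the SIE API; the helper names, `alpha`, and the scores below are illustrative:

```python
def normalize(scores):
    # Min-max normalize a {doc_id: score} dict into [0, 1].
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_scores(dense, sparse, alpha=0.7):
    # alpha weights the dense (semantic) score; 1 - alpha the sparse one.
    dense, sparse = normalize(dense), normalize(sparse)
    return {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
            for d in set(dense) | set(sparse)}

# Illustrative scores from the two retrievers
dense = {"doc1": 0.92, "doc2": 0.85, "doc3": 0.40}
sparse = {"doc1": 3.1, "doc3": 7.5}
ranked = sorted(hybrid_scores(dense, sparse).items(), key=lambda kv: -kv[1])
# doc1 ranks first: strong on both the semantic and keyword signals
```

Tuning `alpha` toward 1.0 favors semantic matches; toward 0.0 it favors exact keyword hits. Reciprocal rank fusion is a common alternative that avoids score normalization entirely.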
## What's Next

- Model Catalog - full list of all supported models
- Sparse Embeddings - hybrid search patterns
- Multi-vector / ColBERT - late interaction retrieval
- Quantization - reduce embedding size for storage