# Multi-vector & ColBERT
Multi-vector embeddings assign a vector to each token instead of pooling into a single vector. This enables “late interaction” scoring where query and document tokens interact during search.
## Quick Example

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

result = client.encode(
    "jinaai/jina-colbert-v2",
    Item(text="What is machine learning?"),
    output_types=["multivector"],
    is_query=True,
)

# Per-token embeddings: [num_tokens, dim]
mv = result["multivector"]
print(f"Tokens: {mv.shape[0]}, Dim: {mv.shape[1]}")
# Tokens: 7, Dim: 128
```

```typescript
import { SIEClient } from "@sie/sdk";

const client = new SIEClient("http://localhost:8080");

const result = await client.encode(
  "jinaai/jina-colbert-v2",
  { text: "What is machine learning?" },
  { outputTypes: ["multivector"], isQuery: true }
);

// Per-token embeddings: Float32Array[] (one per token)
const mv = result.multivector;
console.log(`Tokens: ${mv?.length}, Dim: ${mv?.[0]?.length}`);
// Tokens: 7, Dim: 128

await client.close();
```
## When to Use Multi-Vector

Use multi-vector when:
- Retrieval quality is critical (RAG, legal/medical search)
- You can afford ~12x storage vs dense (see Storage Considerations)
- Sub-100ms latency is acceptable
Stick to dense when:
- Storage is constrained
- You need sub-millisecond latency
- Quality difference does not justify the storage cost
## How Late Interaction Works

ColBERT-style models use MaxSim scoring:
- Encode query → N query token vectors
- Encode document → M document token vectors
- For each query token, find max similarity with any document token
- Sum max similarities = final score
```
Query:    [q1, q2, q3, q4]        # 4 tokens
Document: [d1, d2, d3, d4, d5]    # 5 tokens

MaxSim = max(sim(q1,d*)) + max(sim(q2,d*)) + max(sim(q3,d*)) + max(sim(q4,d*))
```

This captures fine-grained term matching that dense embeddings miss.
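For intuition, here is a minimal NumPy sketch of MaxSim over L2-normalized token vectors. It is an illustration of the formula above, not the SDK's implementation; the SDK's `maxsim` helper is shown in the next section.

```python
import numpy as np

def maxsim_score(query_mv: np.ndarray, doc_mv: np.ndarray) -> float:
    """MaxSim: for each query token, take the maximum similarity
    against any document token, then sum over query tokens.

    query_mv: [N, dim] query token vectors
    doc_mv:   [M, dim] document token vectors
    """
    # L2-normalize rows so dot products are cosine similarities
    q = query_mv / np.linalg.norm(query_mv, axis=1, keepdims=True)
    d = doc_mv / np.linalg.norm(doc_mv, axis=1, keepdims=True)
    sim = q @ d.T                        # [N, M] token-to-token similarities
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query
```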
## MaxSim Scoring

The SDK provides client-side MaxSim scoring:

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item
from sie_sdk.scoring import maxsim

client = SIEClient("http://localhost:8080")

# Encode query (is_query=True enables query expansion with MASK tokens)
query_result = client.encode(
    "jinaai/jina-colbert-v2",
    Item(text="What is ColBERT?"),
    output_types=["multivector"],
    is_query=True,
)

# Encode documents (no is_query - documents are not expanded)
documents = [
    Item(text="ColBERT is a late interaction retrieval model."),
    Item(text="The weather is sunny today."),
]
doc_results = client.encode(
    "jinaai/jina-colbert-v2",
    documents,
    output_types=["multivector"],
)

# Compute MaxSim scores
query_mv = query_result["multivector"]
doc_mvs = [r["multivector"] for r in doc_results]
scores = maxsim(query_mv, doc_mvs)

for i, score in enumerate(scores):
    print(f"Doc {i}: {score:.3f}")
```

```typescript
import { SIEClient, maxsim } from "@sie/sdk";

const client = new SIEClient("http://localhost:8080");

// Encode query (isQuery=true enables query expansion with MASK tokens)
const queryResult = await client.encode(
  "jinaai/jina-colbert-v2",
  { text: "What is ColBERT?" },
  { outputTypes: ["multivector"], isQuery: true }
);

// Encode documents (no isQuery - documents are not expanded)
const documents = [
  { text: "ColBERT is a late interaction retrieval model." },
  { text: "The weather is sunny today." },
];
const docResults = await client.encode(
  "jinaai/jina-colbert-v2",
  documents,
  { outputTypes: ["multivector"] }
);

// Compute MaxSim scores using the SDK helper
const queryMv = queryResult.multivector!;
const scores = docResults.map((r) => maxsim(queryMv, r.multivector!));

scores.forEach((score, i) => {
  console.log(`Doc ${i}: ${score.toFixed(3)}`);
});

await client.close();
```

## Query Expansion
ColBERT models pad queries with `[MASK]` tokens that become “virtual” query terms:
```python
# Short query gets expanded with MASK tokens
result = client.encode(
    "jinaai/jina-colbert-v2",
    Item(text="python"),
    output_types=["multivector"],
    is_query=True,
)

# Produces 32 tokens (1 real + 31 MASK) instead of just 1
print(f"Tokens: {result['multivector'].shape[0]}")
```

```typescript
// Short query gets expanded with MASK tokens
const result = await client.encode(
  "jinaai/jina-colbert-v2",
  { text: "python" },
  { outputTypes: ["multivector"], isQuery: true }
);

// Produces 32 tokens (1 real + 31 MASK) instead of just 1
console.log(`Tokens: ${result.multivector?.length}`);
```

Documents are NOT expanded; only queries are.
## MUVERA: Multi-Vector to Fixed-Dimension Embeddings

MUVERA (Multi-Vector Retrieval via Fixed Dimensional Encodings) converts variable-length multi-vector embeddings into fixed-dimension dense vectors. This enables ColBERT-quality retrieval using standard HNSW vector search instead of specialized multi-vector indexes:
```python
result = client.encode(
    "jinaai/jina-colbert-v2",
    Item(text="document text"),
    output_types=["dense"],
    options={"profile": "muvera"},
)

# Result is a fixed-dimension dense vector (10240 dims)
print(f"FDE dimensions: {len(result['dense'])}")
```

```typescript
// Note: MUVERA profile support may require server-side configuration
const result = await client.encode(
  "jinaai/jina-colbert-v2",
  { text: "document text" },
  { outputTypes: ["dense"] }
);

// Result is a fixed-dimension dense vector (10240 dims)
console.log(`FDE dimensions: ${result.dense?.length}`);
```

**Trade-off:** MUVERA incurs ~5-10% quality loss compared to true MaxSim scoring, but enables use of standard vector databases (Qdrant, Pinecone, pgvector) without multi-vector support.
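To build intuition for how a fixed-dimensional encoding (FDE) can stand in for MaxSim, here is a deliberately simplified sketch of the core idea: hash token vectors into buckets with random hyperplanes (SimHash-style), sum each bucket, and concatenate. The real MUVERA algorithm adds repetitions, projections, and fill rules for empty buckets, and the server's actual encoding may differ; `simple_fde`, `num_planes`, and `seed` are illustrative names, not SDK API.

```python
import numpy as np

def simple_fde(token_vectors: np.ndarray, num_planes: int = 4, seed: int = 0) -> np.ndarray:
    """Toy fixed-dimensional encoding (FDE).

    Hashes each token vector into one of 2**num_planes buckets using
    random hyperplanes, sums the vectors in each bucket, and
    concatenates the bucket sums into one fixed-size vector.
    """
    rng = np.random.default_rng(seed)
    dim = token_vectors.shape[1]
    planes = rng.standard_normal((num_planes, dim))

    # Bucket index per token: bit i is set when the token lies on the
    # positive side of hyperplane i
    bits = (token_vectors @ planes.T) > 0                      # [num_tokens, num_planes]
    buckets = bits.astype(int) @ (1 << np.arange(num_planes))  # [num_tokens]

    fde = np.zeros((2 ** num_planes, dim))
    np.add.at(fde, buckets, token_vectors)  # sum token vectors per bucket
    return fde.reshape(-1)                  # length: 2**num_planes * dim

# Toy setting: 100 tokens x 128 dims -> 16 buckets x 128 dims = 2048-dim FDE
fde = simple_fde(np.random.randn(100, 128))
print(fde.shape)  # (2048,)
```

Because similar token vectors tend to land in the same bucket, the dot product of a query FDE with a document FDE roughly approximates the sum-of-max similarities that MaxSim computes, which is why a single HNSW search over FDEs can recover near-ColBERT ranking quality.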
## Storage Considerations

Multi-vector embeddings are significantly larger than dense embeddings because they store one vector per token rather than a single pooled vector:
| Representation | Dimensions | Storage per 1M docs (float32) |
|---|---|---|
| Dense (bge-m3) | 1024 | ~4 GB |
| Multi-vector (ColBERT 128-dim) | ~100 tokens x 128 dim | ~51 GB |
| Multi-vector (ColBERT 96-dim) | ~100 tokens x 96 dim | ~38 GB |
| MUVERA FDE | 10240 | ~41 GB |
The multi-vector storage calculation: 1M docs x 100 tokens x 128 dims x 4 bytes = ~51 GB. Actual token counts vary by document length (max 512 for most models, 8192 for long-context models).
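As a quick sanity check, the table's figures fall out of simple arithmetic (a throwaway helper using the corpus assumptions above; `storage_gb` is illustrative, not part of the SDK):

```python
def storage_gb(num_docs: int, vectors_per_doc: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw embedding storage in GB (float32 by default, no index overhead)."""
    return num_docs * vectors_per_doc * dim * bytes_per_value / 1e9

print(storage_gb(1_000_000, 1, 1024))    # dense:              ~4.1 GB
print(storage_gb(1_000_000, 100, 128))   # multi-vector (128): ~51.2 GB
print(storage_gb(1_000_000, 100, 96))    # multi-vector (96):  ~38.4 GB
print(storage_gb(1_000_000, 1, 10240))   # MUVERA FDE:         ~41.0 GB
```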
## Multi-Vector Models

| Model | Token Dim | Max Length | Notes |
|---|---|---|---|
| `jinaai/jina-colbert-v2` | 128 | 8192 | Long context, rotary embeddings |
| `answerdotai/answerai-colbert-small-v1` | 96 | 512 | Smallest, fastest |
| `colbert-ir/colbertv2.0` | 128 | 512 | Original ColBERT |
| `mixedbread-ai/mxbai-colbert-large-v1` | 128 | 512 | Large model, standard dim |
| `lightonai/GTE-ModernColBERT-v1` | 128 | 8192 | ModernBERT architecture |
## HTTP API

The server defaults to msgpack. For JSON, set the `Accept` header:
```bash
curl -X POST http://localhost:8080/v1/encode/jinaai/jina-colbert-v2 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"items": [{"text": "ColBERT query"}], "params": {"output_types": ["multivector"]}}'
```
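The same call works from Python without the SDK, using the `requests` library (a minimal sketch; the exact response shape is server-defined, so this just prints the parsed JSON):

```python
import requests

# Same request as the curl example above, as plain HTTP
resp = requests.post(
    "http://localhost:8080/v1/encode/jinaai/jina-colbert-v2",
    headers={"Accept": "application/json"},  # server defaults to msgpack
    json={
        "items": [{"text": "ColBERT query"}],
        "params": {"output_types": ["multivector"]},
    },
)
resp.raise_for_status()
print(resp.json())  # inspect the response shape before parsing in earnest
```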
## What’s Next

- Dense embeddings - simpler, smaller storage
- Reranking - alternative quality improvement via cross-encoders