Multi-vector & ColBERT

Multi-vector embeddings assign a vector to each token instead of pooling into a single vector. This enables “late interaction” scoring where query and document tokens interact during search.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

result = client.encode(
    "jinaai/jina-colbert-v2",
    Item(text="What is machine learning?"),
    output_types=["multivector"],
    is_query=True,
)
# Per-token embeddings: [num_tokens, dim]
mv = result["multivector"]
print(f"Tokens: {mv.shape[0]}, Dim: {mv.shape[1]}")
# Tokens: 7, Dim: 128

Use multi-vector when:

  • Retrieval quality is critical (RAG, legal/medical search)
  • You can afford ~12x storage vs dense (see Storage Considerations)
  • Sub-100ms latency is acceptable

Stick to dense when:

  • Storage is constrained
  • You need sub-millisecond latency
  • Quality difference does not justify the storage cost

ColBERT-style models use MaxSim scoring:

  1. Encode query → N query token vectors
  2. Encode document → M document token vectors
  3. For each query token, find max similarity with any document token
  4. Sum max similarities = final score
Query:    [q1, q2, q3, q4]           # 4 tokens
Document: [d1, d2, d3, d4, d5]       # 5 tokens

MaxSim = max(sim(q1, d*)) + max(sim(q2, d*)) + max(sim(q3, d*)) + max(sim(q4, d*))

This captures fine-grained term matching that dense embeddings miss.
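
The scoring rule is small enough to write out directly. A minimal NumPy sketch (illustrative only, not the SDK's implementation; it assumes token vectors are L2-normalized so dot products are cosine similarities):

import numpy as np

def maxsim_score(query_mv: np.ndarray, doc_mv: np.ndarray) -> float:
    # query_mv: [N, dim] query token vectors; doc_mv: [M, dim] document token vectors
    sim = query_mv @ doc_mv.T            # [N, M] pairwise token similarities
    return float(sim.max(axis=1).sum())  # best document token per query token, summed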

The SDK provides client-side MaxSim scoring:

from sie_sdk import SIEClient
from sie_sdk.types import Item
from sie_sdk.scoring import maxsim

client = SIEClient("http://localhost:8080")

# Encode query (is_query=True enables query expansion with MASK tokens)
query_result = client.encode(
    "jinaai/jina-colbert-v2",
    Item(text="What is ColBERT?"),
    output_types=["multivector"],
    is_query=True,
)

# Encode documents (no is_query - documents are not expanded)
documents = [
    Item(text="ColBERT is a late interaction retrieval model."),
    Item(text="The weather is sunny today."),
]
doc_results = client.encode(
    "jinaai/jina-colbert-v2",
    documents,
    output_types=["multivector"],
)

# Compute MaxSim scores
query_mv = query_result["multivector"]
doc_mvs = [r["multivector"] for r in doc_results]
scores = maxsim(query_mv, doc_mvs)
for i, score in enumerate(scores):
    print(f"Doc {i}: {score:.3f}")

ColBERT models pad queries with [MASK] tokens that become “virtual” query terms:

# Short query gets expanded with MASK tokens
result = client.encode(
    "jinaai/jina-colbert-v2",
    Item(text="python"),
    output_types=["multivector"],
    is_query=True,
)
# Produces 32 tokens (1 real + 31 MASK) instead of just 1
print(f"Tokens: {result['multivector'].shape[0]}")

Documents are NOT expanded—only queries.

MUVERA: Multi-Vector to Fixed-Dimension Embeddings

MUVERA (Multi-Vector Retrieval via Fixed Dimensional Encodings) converts variable-length multi-vector embeddings into fixed-dimension dense vectors called FDEs. This enables near-ColBERT-quality retrieval using standard HNSW vector search instead of specialized multi-vector indexes:

result = client.encode(
    "jinaai/jina-colbert-v2",
    Item(text="document text"),
    output_types=["dense"],
    options={"profile": "muvera"},
)
# Result is fixed-dimension dense vector (10240 dims)
print(f"FDE dimensions: {len(result['dense'])}")

Trade-off: MUVERA incurs ~5-10% quality loss compared to true MaxSim scoring, but enables use of standard vector databases (Qdrant, Pinecone, pgvector) without multi-vector support.
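
A common way to claw back that quality is to retrieve candidates by FDE, then rerank the top hits with exact MaxSim. A sketch using brute-force NumPy dot products in place of an HNSW index; fde_query, fde_docs, query_mv, and doc_mvs are placeholder arrays you would have stored, not SDK names:

import numpy as np

# Placeholders: fde_query is the [10240] query FDE, fde_docs is
# [num_docs, 10240] stored document FDEs, query_mv / doc_mvs are the
# multivector arrays from the earlier examples.
candidates = np.argsort(fde_docs @ fde_query)[::-1][:20]   # top-20 by FDE dot product
reranked = sorted(
    candidates,
    key=lambda i: float((query_mv @ doc_mvs[i].T).max(axis=1).sum()),  # exact MaxSim
    reverse=True,
)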

Multi-vector embeddings are significantly larger than dense embeddings because they store one vector per token rather than a single pooled vector:

| Representation | Dimensions | Storage per 1M docs (float32) |
|---|---|---|
| Dense (bge-m3) | 1024 | ~4 GB |
| Multi-vector (ColBERT, 128-dim) | ~100 tokens x 128 dim | ~51 GB |
| Multi-vector (ColBERT, 96-dim) | ~100 tokens x 96 dim | ~38 GB |
| MUVERA FDE | 10240 | ~41 GB |

The multi-vector storage calculation: 1M docs x 100 tokens x 128 dims x 4 bytes = ~51 GB. Actual token counts vary by document length (max 512 for most models, 8192 for long-context models).
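
The same calculation as a quick sanity check:

# 1M docs x 100 tokens x 128 dims x 4 bytes, in GB
docs, tokens, dim, bytes_per_float = 1_000_000, 100, 128, 4
print(docs * tokens * dim * bytes_per_float / 1e9)  # 51.2 -> ~51 GB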

| Model | Token Dim | Max Length | Notes |
|---|---|---|---|
| jinaai/jina-colbert-v2 | 128 | 8192 | Long context, rotary embeddings |
| answerdotai/answerai-colbert-small-v1 | 96 | 512 | Smallest, fastest |
| colbert-ir/colbertv2.0 | 128 | 512 | Original ColBERT |
| mixedbread-ai/mxbai-colbert-large-v1 | 128 | 512 | Large model, standard dim |
| lightonai/GTE-ModernColBERT-v1 | 128 | 8192 | ModernBERT architecture |

The server defaults to msgpack. For JSON, set the Accept header:

curl -X POST http://localhost:8080/v1/encode/jinaai/jina-colbert-v2 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"items": [{"text": "ColBERT query"}], "params": {"output_types": ["multivector"]}}'