# Multi-vector & ColBERT
Multi-vector embeddings assign a vector to each token instead of pooling into a single vector. This enables “late interaction” scoring where query and document tokens interact during search.
## Quick Example

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

result = client.encode(
    "jinaai/jina-colbert-v2",
    Item(text="What is machine learning?"),
    output_types=["multivector"],
    is_query=True,
)

# Per-token embeddings: [num_tokens, dim]
mv = result["multivector"]
print(f"Tokens: {mv.shape[0]}, Dim: {mv.shape[1]}")
# Tokens: 7, Dim: 128
```

```typescript
import { SIEClient } from "@sie/sdk";

const client = new SIEClient("http://localhost:8080");

const result = await client.encode(
  "jinaai/jina-colbert-v2",
  { text: "What is machine learning?" },
  { outputTypes: ["multivector"], isQuery: true }
);

// Per-token embeddings: Float32Array[] (one per token)
const mv = result.multivector;
console.log(`Tokens: ${mv?.length}, Dim: ${mv?.[0]?.length}`);
// Tokens: 7, Dim: 128

await client.close();
```
## When to Use Multi-Vector

Use multi-vector when:
- Retrieval quality is critical (RAG, legal/medical search)
- You can afford ~12x storage vs dense (see Storage Considerations)
- Sub-100ms latency is acceptable
Stick to dense when:
- Storage is constrained
- You need sub-millisecond latency
- Quality difference does not justify the storage cost
## How Late Interaction Works

ColBERT-style models use MaxSim scoring:
- Encode query → N query token vectors
- Encode document → M document token vectors
- For each query token, find max similarity with any document token
- Sum max similarities = final score
```
Query:    [q1, q2, q3, q4]        # 4 tokens
Document: [d1, d2, d3, d4, d5]    # 5 tokens

MaxSim = max(sim(q1,d*)) + max(sim(q2,d*)) + max(sim(q3,d*)) + max(sim(q4,d*))
```

This captures fine-grained term matching that dense embeddings miss.
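For intuition, here is a minimal NumPy sketch of MaxSim over L2-normalized token vectors. It is an illustration of the formula above, not the SDK's implementation; the SDK's `maxsim` helper is shown in the next section.

```python
import numpy as np

def maxsim_score(query_mv: np.ndarray, doc_mv: np.ndarray) -> float:
    """MaxSim: for each query token, take the maximum similarity
    against any document token, then sum over query tokens.

    query_mv: [N, dim] query token vectors
    doc_mv:   [M, dim] document token vectors
    """
    # L2-normalize rows so dot products are cosine similarities
    q = query_mv / np.linalg.norm(query_mv, axis=1, keepdims=True)
    d = doc_mv / np.linalg.norm(doc_mv, axis=1, keepdims=True)
    sim = q @ d.T                        # [N, M] token-to-token similarities
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query
```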
## MaxSim Scoring

The SDK provides client-side MaxSim scoring:

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item
from sie_sdk.scoring import maxsim

client = SIEClient("http://localhost:8080")

# Encode query (is_query=True enables query expansion with MASK tokens)
query_result = client.encode(
    "jinaai/jina-colbert-v2",
    Item(text="What is ColBERT?"),
    output_types=["multivector"],
    is_query=True,
)

# Encode documents (no is_query - documents are not expanded)
documents = [
    Item(text="ColBERT is a late interaction retrieval model."),
    Item(text="The weather is sunny today."),
]
doc_results = client.encode(
    "jinaai/jina-colbert-v2",
    documents,
    output_types=["multivector"],
)

# Compute MaxSim scores
query_mv = query_result["multivector"]
doc_mvs = [r["multivector"] for r in doc_results]
scores = maxsim(query_mv, doc_mvs)

for i, score in enumerate(scores):
    print(f"Doc {i}: {score:.3f}")
```

```typescript
import { SIEClient, maxsim } from "@sie/sdk";

const client = new SIEClient("http://localhost:8080");

// Encode query (isQuery=true enables query expansion with MASK tokens)
const queryResult = await client.encode(
  "jinaai/jina-colbert-v2",
  { text: "What is ColBERT?" },
  { outputTypes: ["multivector"], isQuery: true }
);

// Encode documents (no isQuery - documents are not expanded)
const documents = [
  { text: "ColBERT is a late interaction retrieval model." },
  { text: "The weather is sunny today." },
];
const docResults = await client.encode(
  "jinaai/jina-colbert-v2",
  documents,
  { outputTypes: ["multivector"] }
);

// Compute MaxSim scores using the SDK helper
const queryMv = queryResult.multivector!;
const scores = docResults.map((r) => maxsim(queryMv, r.multivector!));

scores.forEach((score, i) => {
  console.log(`Doc ${i}: ${score.toFixed(3)}`);
});

await client.close();
```

## Query Expansion
ColBERT models pad queries with `[MASK]` tokens that become “virtual” query terms:
```python
# Short query gets expanded with MASK tokens
result = client.encode(
    "jinaai/jina-colbert-v2",
    Item(text="python"),
    output_types=["multivector"],
    is_query=True,
)

# Produces 32 tokens (1 real + 31 MASK) instead of just 1
print(f"Tokens: {result['multivector'].shape[0]}")
```

```typescript
// Short query gets expanded with MASK tokens
const result = await client.encode(
  "jinaai/jina-colbert-v2",
  { text: "python" },
  { outputTypes: ["multivector"], isQuery: true }
);

// Produces 32 tokens (1 real + 31 MASK) instead of just 1
console.log(`Tokens: ${result.multivector?.length}`);
```

Documents are NOT expanded; only queries are.
## MUVERA: Multi-Vector to Fixed-Dimension Embeddings

MUVERA (Multi-Vector Retrieval via Fixed Dimensional Encodings) converts variable-length multi-vector embeddings into fixed-dimension dense vectors. This enables ColBERT-quality retrieval using standard HNSW vector search instead of specialized multi-vector indexes:
```python
result = client.encode(
    "jinaai/jina-colbert-v2",
    Item(text="document text"),
    output_types=["dense"],
    options={"profile": "muvera"},
)

# Result is a fixed-dimension dense vector (10240 dims)
print(f"FDE dimensions: {len(result['dense'])}")
```

```typescript
// Note: MUVERA profile support may require server-side configuration
const result = await client.encode(
  "jinaai/jina-colbert-v2",
  { text: "document text" },
  { outputTypes: ["dense"] }
);

// Result is a fixed-dimension dense vector (10240 dims)
console.log(`FDE dimensions: ${result.dense?.length}`);
```

**Trade-off:** MUVERA incurs ~5-10% quality loss compared to true MaxSim scoring, but enables use of standard vector databases (Qdrant, Pinecone, pgvector) without multi-vector support.
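To build intuition for how a fixed-dimensional encoding (FDE) can stand in for MaxSim, here is a deliberately simplified sketch of the core idea: hash token vectors into buckets with random hyperplanes (SimHash-style), sum each bucket, and concatenate. The real MUVERA algorithm adds repetitions, projections, and fill rules for empty buckets, and the server's actual encoding may differ; `simple_fde`, `num_planes`, and `seed` are illustrative names, not SDK API.

```python
import numpy as np

def simple_fde(token_vectors: np.ndarray, num_planes: int = 4, seed: int = 0) -> np.ndarray:
    """Toy fixed-dimensional encoding (FDE).

    Hashes each token vector into one of 2**num_planes buckets using
    random hyperplanes, sums the vectors in each bucket, and
    concatenates the bucket sums into one fixed-size vector.
    """
    rng = np.random.default_rng(seed)
    dim = token_vectors.shape[1]
    planes = rng.standard_normal((num_planes, dim))

    # Bucket index per token: bit i is set when the token lies on the
    # positive side of hyperplane i
    bits = (token_vectors @ planes.T) > 0                      # [num_tokens, num_planes]
    buckets = bits.astype(int) @ (1 << np.arange(num_planes))  # [num_tokens]

    fde = np.zeros((2 ** num_planes, dim))
    np.add.at(fde, buckets, token_vectors)  # sum token vectors per bucket
    return fde.reshape(-1)                  # length: 2**num_planes * dim

# Toy setting: 100 tokens x 128 dims -> 16 buckets x 128 dims = 2048-dim FDE
fde = simple_fde(np.random.randn(100, 128))
print(fde.shape)  # (2048,)
```

Because similar token vectors tend to land in the same bucket, the dot product of a query FDE with a document FDE roughly approximates the sum-of-max similarities that MaxSim computes, which is why a single HNSW search over FDEs can recover near-ColBERT ranking quality.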
## Storage Considerations

Multi-vector embeddings are significantly larger than dense embeddings because they store one vector per token rather than a single pooled vector:
| Representation | Dimensions | Storage per 1M docs (float32) |
|---|---|---|
| Dense (bge-m3) | 1024 | ~4 GB |
| Multi-vector (ColBERT 128-dim) | ~100 tokens x 128 dim | ~51 GB |
| Multi-vector (ColBERT 96-dim) | ~100 tokens x 96 dim | ~38 GB |
| MUVERA FDE | 10240 | ~41 GB |
The multi-vector storage calculation: 1M docs x 100 tokens x 128 dims x 4 bytes = ~51 GB. Actual token counts vary by document length (max 512 for most models, 8192 for long-context models).
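As a quick sanity check, the table's figures fall out of simple arithmetic (a throwaway helper using the corpus assumptions above; `storage_gb` is illustrative, not part of the SDK):

```python
def storage_gb(num_docs: int, vectors_per_doc: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw embedding storage in GB (float32 by default, no index overhead)."""
    return num_docs * vectors_per_doc * dim * bytes_per_value / 1e9

print(storage_gb(1_000_000, 1, 1024))    # dense:              ~4.1 GB
print(storage_gb(1_000_000, 100, 128))   # multi-vector (128): ~51.2 GB
print(storage_gb(1_000_000, 100, 96))    # multi-vector (96):  ~38.4 GB
print(storage_gb(1_000_000, 1, 10240))   # MUVERA FDE:         ~41.0 GB
```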
## Multi-Vector Models

| Model | Token Dim | Max Length | Notes |
|---|---|---|---|
| `jinaai/jina-colbert-v2` | 128 | 8192 | Long context, rotary embeddings |
| `answerdotai/answerai-colbert-small-v1` | 96 | 512 | Smallest, fastest |
| `colbert-ir/colbertv2.0` | 128 | 512 | Original ColBERT |
| `mixedbread-ai/mxbai-colbert-large-v1` | 128 | 512 | Large model, standard dim |
| `lightonai/GTE-ModernColBERT-v1` | 128 | 8192 | ModernBERT architecture |
## HTTP API

The server defaults to msgpack. For JSON, set the `Accept` header:
```bash
curl -X POST http://localhost:8080/v1/encode/jinaai/jina-colbert-v2 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"items": [{"text": "ColBERT query"}], "params": {"output_types": ["multivector"]}}'
```
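The same call works from Python without the SDK, using the `requests` library (a minimal sketch; the exact response shape is server-defined, so this just prints the parsed JSON):

```python
import requests

# Same request as the curl example above, as plain HTTP
resp = requests.post(
    "http://localhost:8080/v1/encode/jinaai/jina-colbert-v2",
    headers={"Accept": "application/json"},  # server defaults to msgpack
    json={
        "items": [{"text": "ColBERT query"}],
        "params": {"output_types": ["multivector"]},
    },
)
resp.raise_for_status()
print(resp.json())  # inspect the response shape before parsing in earnest
```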
## What’s Next

- Dense embeddings - simpler, smaller storage
- Reranking - alternative quality improvement via cross-encoders