
vidore/colqwen2.5-v0.2

ColQwen is a model built on a novel architecture and training strategy that uses Vision Language Models (VLMs) to efficiently index documents from their visual features.

Architecture: Qwen2
Parameters: 7.0B
Tasks: Encode
Outputs: Multi-Vec
Dimensions: 128 (Multi-Vec)
Max Sequence Length: 2,048 tokens
License: MIT
Languages: en
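
The Multi-Vec output means each query and each page is encoded as a set of 128-dimensional vectors rather than a single embedding. This card does not state how those sets are compared; the sketch below assumes ColBERT-style late interaction (MaxSim), the scoring commonly used with ColPali-family models. The function names `maxsim_score` and `rank_documents` are hypothetical, and the toy vectors are 2-dimensional only to keep the example readable.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Assumed late-interaction score: for every query vector take its best
    match among the document vectors, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(query_vecs: Sequence[Vector],
                   docs: Sequence[Sequence[Vector]]) -> List[int]:
    """Return document indices ordered from highest to lowest MaxSim score."""
    scores = [maxsim_score(query_vecs, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy data: real embeddings would have 128 dimensions per vector.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches both query vectors
doc_b = [[1.0, 0.0]]               # matches only the first query vector
order = rank_documents(query, [doc_a, doc_b])
```

Because every query vector is matched independently, a page can score well when different regions of it answer different parts of the query, which is the usual motivation for multi-vector retrieval over single-vector pooling.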

Benchmarks

Vidore3ComputerScienceRetrieval

technology retrieval en

Visual document retrieval on computer science papers and slides

Quality
nDCG@10: 0.7680
MAP@10: 0.6543
MRR@10: 0.8726
Performance L4 b1 c4
Corpus TPS: 2
Corpus p50: 1.9s
Query TPS: 139
Query p50: 527.1ms
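
For readers unfamiliar with the three quality figures (nDCG, MAP, and MRR, each cut off at rank 10), here is a minimal per-query sketch. It assumes linear gain for nDCG and normalizes average precision by the total number of relevant documents, which follows common trec_eval-style conventions; the exact evaluation code behind the numbers on this page is not shown here, so treat the definitions as an assumption. The reported values would be the mean of each metric over all queries. Function names and document IDs are illustrative.

```python
import math
from typing import Dict, List


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """Discounted cumulative gain of the ranking divided by the ideal DCG."""
    dcg = sum(relevance.get(doc_id, 0.0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def average_precision_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """Mean of precision values at each relevant hit within the top k,
    normalized by the total number of relevant documents (assumed convention)."""
    n_relevant = sum(1 for rel in relevance.values() if rel > 0)
    if n_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if relevance.get(doc_id, 0.0) > 0:
            hits += 1
            total += hits / rank
    return total / n_relevant


def reciprocal_rank_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """1 / rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if relevance.get(doc_id, 0.0) > 0:
            return 1.0 / rank
    return 0.0


# One hypothetical query: pages p1 and p2 are relevant, retrieved at ranks 2 and 4.
ranked = ["p3", "p1", "p7", "p2"]
qrels = {"p1": 1.0, "p2": 1.0}
ndcg = ndcg_at_k(ranked, qrels)
ap = average_precision_at_k(ranked, qrels)
rr = reciprocal_rank_at_k(ranked, qrels)
```

MRR only rewards the first relevant hit, so it tends to be the highest of the three; MAP and nDCG also penalize relevant pages that appear lower in the list.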

Vidore3FinanceEnRetrieval

finance retrieval en

Visual document retrieval on financial reports

Quality
nDCG@10: 0.6207
MAP@10: 0.5008
MRR@10: 0.7416
Performance L4 b1 c4
Corpus TPS: 2
Corpus p50: 1.9s
Query TPS: 150
Query p50: 547.2ms

english

general retrieval en

Visual document retrieval on HR-related documents

Quality
nDCG@10: 0.6034
MAP@10: 0.4666
MRR@10: 0.7046
Performance L4 b1 c4
Corpus TPS: 2
Corpus p50: 2.0s
Query TPS: 110
Query p50: 769.7ms

Vidore3PharmaceuticalsRetrieval

medical retrieval en

Visual document retrieval on pharmaceutical documents

Quality
nDCG@10: 0.6274
MAP@10: 0.5173
MRR@10: 0.7234
Performance L4 b1 c4
Corpus TPS: 2
Corpus p50: 1.9s
Query TPS: 138
Query p50: 565.3ms
