
Boost performance & reduce cost by self-hosting specialized AI models


The open-source opportunity

For years we have built terabyte-scale search systems, from e-commerce to enterprise, always relying on open-source models: models for multi-vector and multi-modal embeddings, re-ranking, classification, transcription, OCR, VQA and more. No matter the domain or language, a good OSS model either already exists or is just a fine-tuning run away.

We are living through a Cambrian explosion of open source AI. Hugging Face went from zero to one million models in about 1,000 days. The second million took just 335. By mid-2025, models added that year had already surpassed the entire 2024 total — and the pace keeps accelerating.

[Chart: open-source models on Hugging Face, 2022–2026, reaching 2.745M by March 2026; annotated with notable releases such as ColBERTv2, E5, SigLIP, BGE-M3, ColPali, GOT-OCR2, GLiNER, ReLiK, ModernColBERT, Qwen3, Jina v3, GLM-OCR, Nemotron and Voxtral. Source: Hugging Face Hub]

Even the largest proprietary models now get an equally capable open-source alternative within a couple of months of launch. The building blocks are there. So why aren't most companies running their search, data processing and agents on dozens of task-focused open-source models?

The multi-model problem

Real-world AI pipelines aren’t one model — they are many specialized models working together. We think of them in three categories:

Encode — turn text, images, and documents into vectors for search. This includes dense embeddings, sparse (BM25-style) vectors, multi-vector representations like ColBERT that capture token-level interactions, and vision-capable embeddings like SigLIP and ColQwen2.5 that understand both text and images natively.

Score — re-rank and score candidate results for precision. Cross-encoder rerankers compare query-document pairs head-to-head for maximum accuracy. Late interaction models like ColBERT offer a middle ground — better than dense retrieval, faster than cross-encoders — by comparing individual token representations.

Extract — pull structured data from unstructured inputs. Named entity recognition with GLiNER, relationship extraction with GLiREL, zero-shot classification with GLiClass, OCR and visual question answering — working across text and image inputs. A single document processing pipeline might chain three or four of these models together.
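The late-interaction scoring mentioned in the Score category is compact enough to sketch. Below is a minimal, dependency-free illustration of ColBERT-style MaxSim: for each query token, take the best-matching document token, then sum those maxima. The token vectors here are toy stand-ins; a real pipeline would get them from the embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style MaxSim: for each query token embedding, find the
    most similar document token embedding, then sum those maxima."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

# Toy 2-token query against a relevant and an off-topic document.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_relevant = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
doc_offtopic = [[-1.0, 0.0], [-0.7, -0.7]]
assert maxsim_score(query, doc_relevant) > maxsim_score(query, doc_offtopic)
```

Because each query token only needs a max over precomputed document token vectors, this sits between dense retrieval (one dot product) and a cross-encoder (a full forward pass per pair) in cost.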

Consider a law firm processing 100k PDFs a month: it needs an OCR model, a document classifier, an embedding model for semantic search and a reranker for precision. Or an e-commerce platform that runs product tagging, query-intent extraction and multi-language search across 20 markets. Both end up stitching together five or more separate model deployments, each with its own scaling and monitoring.

Most companies can’t take advantage of this because single-model inference servers weren’t designed for multi-model workloads.

When every model requires a separate deployment, you end up with five models on five GPUs, each running at 1–3% utilization: you pay for roughly 50x the compute you actually use. GPUs are provisioned for peak load and sit idle 97% of the time. Query-time models and batch pipelines run on separate infrastructure, duplicating GPU spend everywhere. And most deployments use default Hugging Face pipelines (no flash attention, no continuous batching, no quantization), so the GPU is barely touched.
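The 50x figure follows directly from the utilization numbers. A back-of-envelope check, where the hourly GPU price is a hypothetical placeholder (the waste factor is independent of it):

```python
# Back-of-envelope check of the waste from one-GPU-per-model deployments.
GPU_HOURLY_USD = 2.50   # assumed on-demand GPU price; a placeholder, not a quote
NUM_MODELS = 5
AVG_UTILIZATION = 0.02  # mid-point of the 1-3% range

fleet_cost = NUM_MODELS * GPU_HOURLY_USD        # five dedicated GPUs per hour
useful_compute = fleet_cost * AVG_UTILIZATION   # compute actually consumed
waste_factor = fleet_cost / useful_compute      # 1 / utilization
print(f"paying {waste_factor:.0f}x the compute actually used")  # prints 50x
```

Note the waste factor is simply 1 / utilization, so at 1% utilization it is 100x and at 3% it is about 33x; 50x is the mid-point.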

Mainstream inference providers chase the latest general-purpose LLM while neglecting small models; open-source inference projects still require building home-grown infrastructure around them. Between the two, there hasn't been a simple way to self-host hundreds of OSS models in your own cloud, at scale, reliably, securely and cost-efficiently.

Introducing Superlinked Inference Engine

The Superlinked Inference Engine (SIE) is a multi-model inference cluster for search and document-processing workloads. Instead of spinning up one service per model, SIE runs all of them (encode, score and extract) behind a single API.

We are proud to release it under the Apache 2.0 license, with deployment support for AWS & GCP:

  1. 85+ state-of-the-art models for encoding, scoring and extraction — including multi-vector embeddings (ColBERT, ColPali), vision models (SigLIP, ColQwen2.5), cross-encoder and late-interaction rerankers, NER, classification, OCR and more. We continuously add top-ranked models from MTEB, BEIR and document processing benchmarks — see all models here.
  2. High GPU utilization through lazy-loading multiple models on shared GPUs with elastic scaling. Share GPU capacity across your entire encode/score/extract pipeline instead of dedicating one GPU per model.
  3. One cluster, one API with built-in auth and workflow partitioning. The legal team runs their entire PDF pipeline on one cluster; the e-commerce team runs product tagging, intent classification and search side-by-side.
85+ supported models · 50x cost reduction vs APIs · 80%+ GPU utilization
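To make the lazy-loading idea in point 2 concrete, here is a minimal sketch of the general technique: models are loaded on first request and the least-recently-used one is evicted when GPU memory runs out. This is an illustration only, not SIE's actual implementation; the capacity stands in for available GPU memory, and the model names are examples.

```python
from collections import OrderedDict

class LazyModelCache:
    """Sketch of lazy model loading with LRU eviction on a shared GPU."""

    def __init__(self, loader, capacity: int):
        self.loader = loader        # callable: model name -> loaded model
        self.capacity = capacity    # how many models fit in GPU memory
        self.cache = OrderedDict()  # insertion order tracks recency

    def get(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)  # mark as most recently used
            return self.cache[name]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        self.cache[name] = self.loader(name)  # lazy load on first request
        return self.cache[name]

# Track which models actually get loaded.
loads = []
cache = LazyModelCache(loader=lambda n: loads.append(n) or f"<{n}>", capacity=2)
cache.get("colbert")
cache.get("gliner")
cache.get("colbert")   # cache hit: no reload
cache.get("siglip")    # evicts "gliner", the least recently used
assert loads == ["colbert", "gliner", "siglip"]
```

Only two loads hit the GPU for the first three requests; a one-service-per-model setup would instead keep all three models resident on three separate GPUs.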

Self-hosting with SIE is dramatically cheaper than managed APIs — and the savings apply across all model categories:

Embedding — per 1M tokens

| Provider | Price |
| --- | --- |
| Voyage large | $0.18 |
| Gemini | $0.15 |
| OpenAI large | $0.13 |
| Cohere v4 | $0.12 |
| SIE self-hosted | ~$0.01 |

Sources: OpenAI, Cohere, Voyage AI, Google pricing pages — Mar 2026

Reranking — per 1M tokens

| Provider | Price |
| --- | --- |
| Cohere Rerank | $1.00 |
| Voyage Rerank | $0.05 |
| SIE self-hosted | ~$0.01 |

Sources: Cohere, Voyage AI pricing pages — Mar 2026

Extraction — per 1K pages

| Provider | Price |
| --- | --- |
| AWS Textract | $1.50 |
| Google Doc AI | $1.50 |
| Azure Doc Intel | $1.00 |
| SIE self-hosted | ~$0.05 |

Sources: AWS, Google Cloud, Azure pricing pages — Mar 2026
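At the list prices above, the savings multiple per category is simple division; the SIE figures are the approximate self-hosted estimates quoted above, so treat the results as rough:

```python
def savings_factor(api_price: float, self_hosted_price: float) -> float:
    """How many times cheaper self-hosting is at a given pair of price points."""
    return api_price / self_hosted_price

# Price points taken from the tables above (per 1M tokens / per 1K pages).
print(round(savings_factor(0.13, 0.01)))  # OpenAI large embeddings: 13x
print(round(savings_factor(1.00, 0.01)))  # Cohere Rerank: 100x
print(round(savings_factor(1.50, 0.05)))  # AWS Textract: 30x
```

The headline 50x sits within this 13x–100x spread; where your workload lands depends on the mix of embedding, reranking and extraction traffic.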

Get your own cluster up and running today by following our quickstart, browse projects built by our community to see what's possible, or drop it into your existing Chroma, Weaviate, LangChain, LlamaIndex, DSPy, Haystack or CrewAI projects via our native integrations.

Over the next few months we plan to add hundreds more models and keep improving the performance-per-dollar you can get from your GPUs. We are excited to see what you build; let us know!

Here's to an epic 2026!

Daniel, Ben and the Superlinked team
