google/siglip-so400m-patch14-224

SigLIP model pre-trained on WebLI at a resolution of 224x224. It was introduced in the paper "Sigmoid Loss for Language Image Pre-Training" by Zhai et al. and first released in this repository.

Architecture: SigLIP

Parameters: 877M

Tasks: Encode

Outputs: Dense

Dimensions (dense): 1,152

Max Sequence Length: 64 tokens

License: apache-2.0


Benchmarks

Flickr30kI2TRetrieval

general retrieval en

Image-to-text retrieval: retrieve captions from images

Corpus: 31,783 Queries: 1,000
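As a rough illustration of the image-to-text retrieval task described above, the sketch below ranks caption embeddings against one image embedding. It assumes cosine similarity as the scoring function, which this page does not confirm. The helper names `cosine` and `rank_captions` are hypothetical, and the 3-dimensional vectors are toy stand-ins for the model's 1,152-dimensional dense outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_captions(image_vec, caption_vecs, k=10):
    """Return the indices of the k captions most similar to the image."""
    scored = [(cosine(image_vec, c), i) for i, c in enumerate(caption_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy vectors (assumption): real embeddings would come from the model itself.
image = [0.9, 0.1, 0.0]
captions = [
    [0.0, 1.0, 0.0],  # unrelated caption
    [1.0, 0.0, 0.0],  # close match
    [0.7, 0.7, 0.0],  # partial match
]

top = rank_captions(image, captions, k=2)
print(top)
```

In a real setup the corpus embeddings would be computed once and stored, so only the query image needs to be encoded at search time.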

Quality

nDCG@10: 0.8382

MAP@10: 0.7479

MRR@10: 0.9353
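The three quality metrics above (nDCG, MAP and MRR, each cut off at rank 10) can be sketched for a single query as follows. This is a minimal version assuming binary relevance and a common normalisation for average precision. The benchmark harness behind these numbers may define the metrics slightly differently, and the reported figures are presumably averages over all queries. The function names and the caption IDs are hypothetical.

```python
import math

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance nDCG: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def average_precision_at_k(ranked, relevant, k=10):
    """Mean of precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)  # assumed normalisation; others exist
    return total / denom if denom else 0.0

def reciprocal_rank_at_k(ranked, relevant, k=10):
    """1 / rank of the first relevant item, or 0 if none is in the top k."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# One query image with two correct captions (illustrative IDs).
ranked = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4"}

ndcg = ndcg_at_k(ranked, relevant)
ap = average_precision_at_k(ranked, relevant)
rr = reciprocal_rank_at_k(ranked, relevant)
print(round(ndcg, 4), ap, rr)
```

MRR only rewards the first correct caption, whereas MAP and nDCG also account for where the remaining correct captions land, which is one reason MRR can sit well above the other two on a dataset with several captions per image.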

Performance L4-SPOT b1 c8

Corpus TPS: 223

Corpus p50: 395.0 ms

Query TPS: 11

Query p50: 392.1 ms

Performance L4 b1 c16

Corpus TPS: 473

Corpus p50: 484.7 ms

Query TPS: 22

Query p50: 425.9 ms

