google/siglip-so400m-patch14-384

SigLIP model pre-trained on WebLI at a resolution of 384x384. It was introduced in the paper "Sigmoid Loss for Language Image Pre-Training" by Zhai et al. and first released in this repository.

Architecture: SigLIP
Parameters: 878M
Tasks: Encode
Outputs: Dense
Dimensions: 1,152 (dense)
Max Sequence Length: 64 tokens
License: apache-2.0


Benchmarks

Flickr30kI2TRetrieval

general retrieval en

Image-to-text retrieval: retrieve captions from images

Corpus: 31,783 Queries: 1,000
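As a rough illustration of what image-to-text retrieval over dense embeddings involves, the sketch below ranks candidate captions against one image embedding by cosine similarity. It is a minimal sketch under stated assumptions, not the benchmark's implementation: the function names `cosine` and `retrieve_captions` are hypothetical, and the 3-dimensional vectors are toy stand-ins for the model's 1,152-dimensional outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_captions(image_vec, caption_vecs, k=10):
    """Return the k caption ids most similar to the image embedding."""
    scored = [(cosine(image_vec, vec), cid) for cid, vec in caption_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:k]]

# Toy embeddings (hypothetical values); real ones would come from the model.
image_vec = [0.9, 0.1, 0.0]
caption_vecs = {
    "a dog on a beach": [0.8, 0.2, 0.1],
    "a city at night": [0.0, 0.1, 0.9],
    "a puppy playing in sand": [0.7, 0.3, 0.0],
}

top = retrieve_captions(image_vec, caption_vecs, k=2)
print(top)
```

In this setup the image is the query and the captions form the corpus; a production system would typically replace the linear scan with a vector index.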

Quality

nDCG@10: 0.9001

MAP@10: 0.8364

MRR@10: 0.9663
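To make the three quality metrics concrete, here is a small sketch that computes them for a single query with binary relevance. The function names and the toy ranking are hypothetical. The average-precision normalization (dividing by the total number of relevant items) is one common convention and is an assumption here; the page does not state which variant the benchmark uses. Benchmark scores are these per-query values averaged over all queries.

```python
import math

def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant item in the top k, else 0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ap_at_k(ranked, relevant, k=10):
    """Average precision at k, normalized by the number of relevant items
    (an assumed convention; other normalizations exist)."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """nDCG at k with binary gains and a log2 position discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the relevant captions appear at ranks 2 and 4.
ranked = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}

mrr = mrr_at_k(ranked, relevant)
ap = ap_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
print(f"MRR@10={mrr:.4f} AP@10={ap:.4f} nDCG@10={ndcg:.4f}")
```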

Performance L4-SPOT b1 c8

Corpus TPS: 202

Corpus p50: 523.6 ms

Query TPS: 10

Query p50: 711.3 ms

Performance L4 b1 c16

Corpus TPS: 508

Corpus p50: 452.9 ms

Query TPS: 18

Query p50: 551.4 ms

