google/siglip-so400m-patch14-384
SigLIP model pre-trained on WebLI at a resolution of 384x384. It was introduced in the paper "Sigmoid Loss for Language Image Pre-Training" by Zhai et al. and first released in this repository.
Benchmarks
Flickr30kI2TRetrieval
Image-to-text retrieval: retrieve captions from images
Corpus: 31,783, Queries: 1,000
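To make the retrieval task concrete, here is a minimal sketch of ranking captions for one image query. It assumes the image and the captions have already been embedded into a shared vector space and that candidates are ranked by cosine similarity, which is a common way to use dual-encoder models; the benchmark's actual scoring function is not stated here. The function names, caption IDs, and 3-dimensional vectors are hypothetical and for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return list(v)
    return [x / norm for x in v]


def rank_captions(image_emb, caption_embs, k=10):
    """Return the IDs of the k captions most similar to the image embedding.

    Assumption: similarity is the cosine between embeddings.
    """
    q = l2_normalize(image_emb)
    scores = {
        cid: sum(a * b for a, b in zip(q, l2_normalize(emb)))
        for cid, emb in caption_embs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy embeddings, invented for illustration; real SigLIP embeddings are much larger.
image_emb = [0.9, 0.1, 0.0]
caption_embs = {
    "cap_a": [1.0, 0.0, 0.0],
    "cap_b": [0.0, 1.0, 0.0],
    "cap_c": [0.7, 0.7, 0.0],
}

top = rank_captions(image_emb, caption_embs, k=2)
print(top)
```

In this toy setup the caption whose vector points most nearly in the same direction as the image vector is ranked first. A real evaluation would repeat this for each of the 1,000 queries against the full corpus.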
Quality
nDCG@10: 0.9001
MAP@10: 0.8364
MRR@10: 0.9663
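For readers unfamiliar with these three metrics, the sketch below computes each of them for a single query with binary relevance and a cutoff of 10. It is an illustration under stated assumptions, not the evaluator used to produce the numbers above: normalization conventions vary between toolkits (for example, the denominator used for average precision), and the function names and sample IDs are hypothetical. Reported scores are typically the mean over all queries.

```python
import math


def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant item within the top k (0 if none)."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ap_at_k(ranked, relevant, k=10):
    """Average precision at k.

    Assumption: the sum of precisions at each hit is divided by
    min(number of relevant items, k); other conventions exist.
    """
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """Normalized discounted cumulative gain at k with binary gains."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0


# Hypothetical ranking for one query: two relevant captions, found at ranks 2 and 4.
ranked = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}

print(mrr_at_k(ranked, relevant))
print(ap_at_k(ranked, relevant))
print(ndcg_at_k(ranked, relevant))
```

MRR only rewards the position of the first relevant result, while MAP and nDCG also account for where the remaining relevant results appear. This is one reason the three numbers can differ noticeably for the same ranking.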
Performance (L4-SPOT, b1, c8)
Corpus TPS: 202
Corpus p50: 523.6 ms
Query TPS: 10
Query p50: 711.3 ms
Performance (L4, b1, c16)
Corpus TPS: 508
Corpus p50: 452.9 ms
Query TPS: 18
Query p50: 551.4 ms