SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Paper page · arXiv:2502.14786 · Published on Feb 20 · Submitted by akhaliq on Feb 21 · #2 Paper of the day · Upvote 87

Authors: Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai

Abstract

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe — this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness.
To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).

Resources: arXiv page · PDF

Community

akhaliq (paper submitter, 1 day ago): https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/README_siglip2.md

AntonioNocerino99 (about 16 hours ago): It is truly magnificent! Thank you for sharing this work.

librarian-bot (about 10 hours ago): Similar papers, recommended via the Semantic Scholar API:
- FLAVARS: A Multimodal Foundational Language and Vision Alignment Model for Remote Sensing (2025)
- Generate, Transduct, Adapt: Iterative Transduction with VLMs (2025)
- Efficient Few-Shot Continual Learning in Vision-Language Models (2025)
- MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders (2025)
- Centurio: On Drivers of Multilingual Ability of Large Vision-Language Models (2025)
- Unifying Specialized Visual Encoders for Video Language Models (2025)
- SCOT: Self-Supervised Contrastive Pretraining for Zero-Shot Compositional Retrieval (2025)

franchesoni (about 1 hour ago): Not compared against dinov2reg or other SSL backbones for vision tasks...
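The "original image-text training objective" the abstract extends is SigLIP's pairwise sigmoid loss, which scores each image-text pair independently instead of normalizing over the whole batch as CLIP's softmax does. A minimal NumPy sketch of that loss (function and variable names are illustrative; the scale `t` and bias `b` are learnable parameters in actual training, shown here with fixed example values):

```python
import numpy as np

def sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch of L2-normalized embeddings.

    Every one of the n*n image-text pairs is scored independently:
    matching pairs (the diagonal) get label +1, all others -1.
    In training, t and b are learned; the values here are examples.
    """
    n = img_emb.shape[0]
    logits = img_emb @ txt_emb.T * t + b   # (n, n) pairwise similarity logits
    labels = 2.0 * np.eye(n) - 1.0         # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(labels * logits), computed stably as softplus(-z)
    z = labels * logits
    return np.mean(np.logaddexp(0.0, -z))

# Toy usage: identical image/text embeddings form perfectly matched pairs.
rng = np.random.default_rng(0)
v = rng.normal(size=(2, 8))
v /= np.linalg.norm(v, axis=1, keepdims=True)
print(sigmoid_loss(v, v))
```

Because each pair is scored independently, the loss decomposes over pairs, which is what makes the objective easy to scale across devices without gathering a global softmax denominator.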
Models citing this paper (79 total), including:
- google/siglip2-base-patch16-224 (zero-shot image classification)
- google/siglip2-giant-opt-patch16-384 (zero-shot image classification)
- google/siglip2-large-patch16-512 (zero-shot image classification)
- google/siglip2-so400m-patch16-naflex (zero-shot image classification)

Datasets citing this paper: none. Cite arxiv.org/abs/2502.14786 in a dataset README.md to link it from this page.

Spaces citing this paper: google/zero-shot-sg1-sg2, adorkin/siglip2-clothes

Collections including this paper: 9, including the SigLIP2 collection (36 items).