DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab; Timothée Darcet; Théo Moutakanni; Huy Vo; Marc Szafraniec; Vasil Khalidov; Pierre Fernandez; Daniel Haziza; Francisco Massa; Alaaeldin El-Nouby; Mahmoud Assran; Nicolas Ballas; Wojciech Galuba; Russell Howes; Po-Yao Huang; Shang-Wen Li; Ishan Misra; Michael Rabbat; Vasu Sharma; Gabriel Synnaeve; Hu Xu; Hervé Jégou; Julien Mairal; Patrick Labatut; Armand Joulin; Piotr Bojanowski

doi:10.48550/arxiv.2304.07193


Maxime Oquab (Institut national de recherche en informatique et en automatique), Timothée Darcet (Institut national de recherche en informatique et en automatique), Théo Moutakanni (Institut national de recherche en informatique et en automatique), Huy Vo (Institut national de recherche en informatique et en automatique), Marc Szafraniec (Institut national de recherche en informatique et en automatique), Vasil Khalidov (Institut national de recherche en informatique et en automatique), Pierre Fernandez (Institut national de recherche en informatique et en automatique), Daniel Haziza (Institut national de recherche en informatique et en automatique), Francisco Massa (Institut national de recherche en informatique et en automatique), Alaaeldin El-Nouby (Institut national de recherche en informatique et en automatique), Mahmoud Assran (Institut national de recherche en informatique et en automatique), Nicolas Ballas (Institut national de recherche en informatique et en automatique), Wojciech Galuba (Institut national de recherche en informatique et en automatique), Russell Howes (Institut national de recherche en informatique et en automatique), Po-Yao Huang (Institut national de recherche en informatique et en automatique), Shang-Wen Li (Institut national de recherche en informatique et en automatique), Ishan Misra (Institut national de recherche en informatique et en automatique), Michael Rabbat (Institut national de recherche en informatique et en automatique), Vasu Sharma (Institut national de recherche en informatique et en automatique), Gabriel Synnaeve (Institut national de recherche en informatique et en automatique), Hu Xu (Institut national de recherche en informatique et en automatique), Hervé Jégou (Institut national de recherche en informatique et en automatique), Julien Mairal (Institut national de recherche en informatique et en automatique), Patrick Labatut (Institut national de recherche en informatique et en automatique), Armand Joulin (Institut national de recherche en informatique et en automatique), Piotr Bojanowski (Institut national de recherche en informatique et en automatique)

arXiv (Cornell University)

April 14, 2023


Cited by 1,029

Open Access


Abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.

