Maxime Oquab

Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks

Maxime Oquab, Léon Bottou, Ivan Laptev et al.|Unknown|2014

Cited by 3.2k

Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large- scale visual recognition challenge (ILSVRC2012). The success of CNNs is attributed to their ability to learn rich mid-level image representations as opposed to hand-designed low-level features used in other image classification methods. Learning CNNs, however, amounts to estimating millions of parameters and requires a very large number of annotated image samples. This property currently prevents application of CNNs to problems with limited training data. In this work we show how image representations learned with CNNs on large-scale annotated datasets can be efficiently transferred to other visual recognition tasks with limited amount of training data. We design a method to reuse layers trained on the ImageNet dataset to compute mid-level image representation for images in the PASCAL VOC dataset. We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets. We also show promising results for object and action localization.

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni et al.|arXiv (Cornell University)|2023

Cited by 1kOpen Access

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

Is object localization for free? - Weakly-supervised learning with convolutional neural networks

Maxime Oquab, Léon Bottou, Ivan Laptev et al.|Unknown|2015

Cited by 921

Successful methods for visual object recognition typically rely on training datasets containing lots of richly annotated images. Detailed image annotation, e.g. by object bounding boxes, however, is both expensive and often subjective. We describe a weakly supervised convolutional neural network (CNN) for object classification that relies only on image-level labels, yet can learn from cluttered scenes containing multiple objects. We quantify its object classification and object location prediction performance on the Pascal VOC 2012 (20 object classes) and the much larger Microsoft COCO (80 object classes) datasets. We find that the network (i) outputs accurate image-level labels, (ii) predicts approximate locations (but not extents) of objects, and (iii) performs comparably to its fully-supervised counterparts using object bounding box annotation for training.

ContextLocNet: Context-Aware Deep Network Models for Weakly Supervised Localization

Vadim Kantorov, Maxime Oquab, Minsu Cho et al.|Lecture notes in computer science|2016

Cited by 281Open Access

Weakly supervised object recognition with convolutional neural networks

Maxime Oquab, Léon Bottou, Ivan Laptev et al.|Unknown|2014

Cited by 69

Successful visual object recognition methods typically rely on training datasets containing lots of richly annotated images. Annotating object bounding boxes is both expensive and subjective. We describe a weakly supervised convolutional neural network (CNN) for object recognition that does not rely on detailed object annotation and yet returns 86.3% mAP on the Pascal VOC classification task, outperforming previous fully-supervised systems by a sizable margin. Despite the lack of bounding box supervision, the network produces maps that clearly localize the objects in cluttered scenes. We also show that adding fully supervised object examples to our weakly supervised setup does not increase the classification performance.

Is this you? Claim your profile.

Top publicationsby citations