Super-resolution from a single imageMethods for super-resolution can be broadly classified into two families of methods: (i) The classical multi-image super-resolution (combining images obtained at subpixel misalignments), and (ii) Example-Based super-resolution (learning correspondence between low and high resolution image patches from a database). In this paper we propose a unified framework for combining these two families of methods. We further show how this combined approach can be applied to obtain super resolution from as little as a single image (with no database or prior examples). Our approach is based on the observation that patches in a natural image tend to redundantly recur many times inside the image, both within the same scale, as well as across different scales. Recurrence of patches within the same image scale (at subpixel misalignments) gives rise to the classical super-resolution, whereas recurrence of patches across different scales of the same image gives rise to example-based super-resolution. Our approach attempts to recover at each pixel its best possible resolution increase based on its patch redundancy within and across scales.
Understanding Robustness of Transformers for Image ClassificationSrinadh Bhojanapalli, Ayan Chakrabarti, Daniel Gläsner et al.|2021 IEEE/CVF International Conference on Computer Vision (ICCV)|2021 Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture –such as the use of non-overlapping patches– lead one to wonder whether these networks are as robust. In this paper, we perform an extensive study of a variety of different measures of robustness of ViT models and compare the findings to ResNet baselines. We investigate robustness to input perturbations as well as robustness to model perturbations. We find that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations. We also find that Transformers are robust to the removal of almost any single layer, and that while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification.
Accurate Blur Models vs. Image Priors in Single Image Super-resolutionOver the past decade, single image Super-Resolution (SR) research has focused on developing sophisticated image priors, leading to significant advances. Estimating and incorporating the blur model, that relates the high-res and low-res images, has received much less attention, however. In particular, the reconstruction constraint, namely that the blurred and down sampled high-res output should approximately equal the low-res input image, has been either ignored or applied with default fixed blur models. In this work, we examine the relative importance of the image prior and the reconstruction constraint. First, we show that an accurate reconstruction constraint combined with a simple gradient regularization achieves SR results almost as good as those of state-of-the-art algorithms with sophisticated image priors. Second, we study both empirically and theoretically the sensitivity of SR algorithms to the blur model assumed in the reconstruction constraint. We find that an accurate blur model is more important than a sophisticated image prior. Finally, using real camera data, we demonstrate that the default blur models of various SR algorithms may differ from the camera blur, typically leading to over-smoothed results. Our findings highlight the importance of accurately estimating camera blur in reconstructing raw lowers images acquired by an actual camera.
Rethinking FID: Towards a Better Evaluation Metric for Image GenerationAs with many machine learning problems, the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Frechet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images, and those of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, it does not reflect gradual improvement of iterative text-to-image models, it does not capture distortion levels, and that it produces inconsistent results when varying the sample size. We also propose an alternative new metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that does not make any assumptions on the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD of-fers a more robust and reliable assessment of image quality. A reference implementation of CMMD is available at: https://github.com/google-research/google-research/tree/master/cmmd.
Viewpoint-aware object detection and pose estimationWe describe an approach to category-level detection and viewpoint estimation for rigid 3D objects from single 2D images. In contrast to many existing methods, we directly integrate 3D reasoning with an appearance-based voting architecture. Our method relies on a nonparametric representation of a joint distribution of shape and appearance of the object class. Our voting method employs a novel parametrization of joint detection and viewpoint hypothesis space, allowing efficient accumulation of evidence. We combine this with a re-scoring and refinement mechanism, using an ensemble of view-specific Support Vector Machines. We evaluate the performance of our approach in detection and pose estimation of cars on a number of benchmark datasets.