SegFormer: Simple and Efficient Design for Semantic Segmentation with\n TransformersEnze Xie, Wenhai Wang, Zhiding Yu et al.|arXiv (Cornell University)|2021 We present SegFormer, a simple, efficient yet powerful semantic segmentation\nframework which unifies Transformers with lightweight multilayer perception\n(MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a\nnovel hierarchically structured Transformer encoder which outputs multiscale\nfeatures. It does not need positional encoding, thereby avoiding the\ninterpolation of positional codes which leads to decreased performance when the\ntesting resolution differs from training. 2) SegFormer avoids complex decoders.\nThe proposed MLP decoder aggregates information from different layers, and thus\ncombining both local attention and global attention to render powerful\nrepresentations. We show that this simple and lightweight design is the key to\nefficient segmentation on Transformers. We scale our approach up to obtain a\nseries of models from SegFormer-B0 to SegFormer-B5, reaching significantly\nbetter performance and efficiency than previous counterparts. For example,\nSegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5x\nsmaller and 2.2% better than the previous best method. Our best model,\nSegFormer-B5, achieves 84.0% mIoU on Cityscapes validation set and shows\nexcellent zero-shot robustness on Cityscapes-C. Code will be released at:\ngithub.com/NVlabs/SegFormer.\n
SphereFace: Deep Hypersphere Embedding for Face RecognitionThis paper addresses deep face recognition (FR) problem under open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space. However, few existing algorithms can effectively achieve this criterion. To this end, we propose the angular softmax (A-Softmax) loss that enables convolutional neural networks (CNNs) to learn angularly discriminative features. Geometrically, A-Softmax loss can be viewed as imposing discriminative constraints on a hypersphere manifold, which intrinsically matches the prior that faces also lie on a manifold. Moreover, the size of angular margin can be quantitatively adjusted by a parameter m. We further derive specific m to approximate the ideal feature criterion. Extensive analysis and experiments on Labeled Face in the Wild (LFW), Youtube Faces (YTF) and MegaFace Challenge 1 show the superiority of A-Softmax loss in FR tasks.
Unsupervised Domain Adaptation for Semantic Segmentation via Class-Balanced Self-trainingYang Zou, Zhiding Yu, B. V. K. Vijaya Kumar et al.|Lecture notes in computer science|2018 Joint Discriminative and Generative Learning for Person Re-IdentificationPerson re-identification (re-id) remains challenging due to significant intra-class variations across different cameras. Recently, there has been a growing interest in using generative models to augment training data and enhance the invariance to input changes. The generative pipelines in existing methods, however, stay relatively separate from the discriminative re-id learning stages. Accordingly, re-id models are often trained in a straightforward manner on the generated data. In this paper, we seek to improve learned re-id embeddings by better leveraging the generated data. To this end, we propose a joint learning framework that couples re-id learning and data generation end-to-end. Our model involves a generative module that separately encodes each person into an appearance code and a structure code, and a discriminative module that shares the appearance encoder with the generative module. By switching the appearance or structure codes, the generative module is able to generate high-quality cross-id composed images, which are online fed back to the appearance encoder and used to improve the discriminative module. The proposed joint learning framework renders significant improvement over the baseline without using generated data, leading to the state-of-the-art performance on several benchmark datasets.
SegFormer: Simple and Efficient Design for Semantic Segmentation with TransformersEnze Xie, Wenhai Wang, Zhiding Yu et al.|CaltechAUTHORS (California Institute of Technology)|2021 We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perception (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes which leads to decreased performance when the testing resolution differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, and thus combining both local attention and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation on Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5x smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C. Code will be released at: github.com/NVlabs/SegFormer.