Swin Transformer V2: Scaling Up Capacity and Resolution|Ze Liu, Han Hu, Yutong Lin et al.|2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)|2022
We present techniques for scaling Swin Transformer [35] up to 3 billion parameters and making it capable of training with images of up to 1,536×1,536 resolution. By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1 / 54.4 box / mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification. We tackle issues of training instability, and study how to effectively transfer models pre-trained at low resolutions to higher resolution ones. To this aim, several novel technologies are proposed: 1) a residual post normalization technique and a scaled cosine attention approach to improve the stability of large vision models; 2) a log-spaced continuous position bias technique to effectively transfer models pre-trained at low-resolution images and windows to their higher-resolution counterparts. In addition, we share our crucial implementation details that lead to significant savings of GPU memory consumption and thus make it feasible to train large vision models with regular GPUs. Using these techniques and self-supervised pre-training, we successfully train a strong 3 billion Swin Transformer model and effectively transfer it to various vision tasks involving high-resolution images or windows, achieving the state-of-the-art accuracy on a variety of benchmarks. Code is available at https://github.com/microsoft/Swin-Transformer.
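The abstract names scaled cosine attention and log-spaced position coordinates without giving formulas. The NumPy sketch below assumes the commonly cited forms: attention logits equal to the cosine similarity of query and key divided by a temperature `tau`, plus an optional position bias, and relative offsets mapped by sign(Δ)·log(1 + |Δ|). The function names, the fixed `tau` value, and the omission of the small network that turns coordinates into bias values are simplifications made here for illustration. This is not the authors' implementation.

```python
import numpy as np


def scaled_cosine_attention(q, k, v, tau=0.1, bias=None, eps=1e-12):
    """Single-head attention with cosine-similarity logits (illustrative sketch).

    q, k, v: arrays of shape (n_tokens, dim).
    tau: temperature; assumed fixed here for simplicity.
    bias: optional (n_tokens, n_tokens) position bias added to the logits.
    """
    # L2-normalise rows so that qn @ kn.T holds cosine similarities in [-1, 1].
    qn = q / np.maximum(np.linalg.norm(q, axis=-1, keepdims=True), eps)
    kn = k / np.maximum(np.linalg.norm(k, axis=-1, keepdims=True), eps)
    logits = (qn @ kn.T) / tau
    if bias is not None:
        logits = logits + bias
    # Numerically stable softmax over the key axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


def log_spaced_coords(delta):
    """Map relative offsets to log-spaced coordinates: sign(d) * log(1 + |d|)."""
    delta = np.asarray(delta, dtype=float)
    return np.sign(delta) * np.log1p(np.abs(delta))
```

Because the queries and keys are normalised, rescaling either one leaves the attention weights unchanged, which is one plausible reading of why this form helps stability in large models. The log mapping compresses large offsets, so a bias learned on small windows needs less extrapolation when applied to larger ones.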
Dual Super-Resolution Learning for Semantic Segmentation
Current state-of-the-art semantic segmentation methods often apply high-resolution input to attain high performance, which brings large computation budgets and limits their applications on resource-constrained devices. In this paper, we propose a simple and flexible two-stream framework named Dual Super-Resolution Learning (DSRL) to effectively improve the segmentation accuracy without introducing extra computation costs. Specifically, the proposed method consists of three parts: Semantic Segmentation Super-Resolution (SSSR), Single Image Super-Resolution (SISR) and Feature Affinity (FA) module, which can keep high-resolution representations with low-resolution input while simultaneously reducing the model computation complexity. Moreover, it can be easily generalized to other tasks, e.g., human pose estimation. This simple yet effective method leads to strong representations and is evidenced by promising performance on both semantic segmentation and human pose estimation. Specifically, for semantic segmentation on CityScapes, we can achieve ≥2% higher mIoU with similar FLOPs, and keep the performance with 70% FLOPs. For human pose estimation, we can gain ≥2% mAP with the same FLOPs and maintain mAP with 30% fewer FLOPs. Code and models are available at https://github.com/wanglixilinx/DSRL.
Adaptive downsampling to improve image compression at low bit rates|Weisi Lin, Li Dong|IEEE Transactions on Image Processing|2006
At low bit rates, better coding quality can be achieved by downsampling the image prior to compression and estimating the missing portion after decompression. This paper presents a new algorithm in such a paradigm, based on the adaptive decision of appropriate downsampling directions/ratios and quantization steps, in order to achieve higher coding quality at low bit rates while taking local visual significance into account. The full-resolution image can be restored from the DCT coefficients of the downsampled pixels, so that the spatial interpolation otherwise required is avoided. The proposed algorithm significantly raises the critical bit rate to approximately 1.2 bpp, from 0.15-0.41 bpp in the existing downsample-prior-to-JPEG schemes and, therefore, outperforms the standard JPEG method over a much wider bit-rate range. The experiments have demonstrated better PSNR improvement over the existing techniques below the critical bit rate. In addition, the adaptive mode decision not only makes the critical bit rate less image-dependent, but also automates the switching between coders in variable bit-rate applications, since the algorithm falls back to the standard JPEG method whenever necessary at higher bit rates.
Visual distortion gauge based on discrimination of noticeable contrast changes|Weisi Lin, Li Dong, Ping Xue|IEEE Transactions on Circuits and Systems for Video Technology|2005
This paper presents a method to discriminate pixel differences according to their impact on perceived visual quality. Noticeable local contrast changes are formulated first, since contrast is the basic sensory feature in human visual system (HVS) perception. The analysis aims at quantifying the actual impact of such changes (further divided into increases and decreases on edges) in different signal contexts. An associated full-reference distortion metric, proposed next, provides a better match with HVS viewing. Experiments have used two independent visual data sets and the related subjective viewing results, and demonstrated the performance improvement of the proposed metric over the relevant existing ones with various videos and images and under diversified test conditions. The proposed metric is particularly effective for visual signals with blurring and luminance fluctuations as the major artifacts, and brings a fundamental improvement when sharpened image edges are involved.
Quantum Secure Communication Using a Class of Three-Particle W State|Li Dong, Xiu Xiao-Ming, Gao Ya-Jun et al.|Communications in Theoretical Physics|2008
A theoretical scheme of quantum secure communication using a class of three-particle W states is proposed. In the scheme, the two communicators may communicate after they have tested the security of the quantum channel. The receiver can obtain the secret message deterministically if the quantum channel is safe. The present scheme can be realized without using teleportation.