Z

Zi-Yi Dou

University of California, Los Angeles

Publishes on Topic Modeling, Natural Language Processing Techniques, Multimodal Machine Learning Applications. 77 papers and 2.5k citations.

77Publications
2.5kTotal Citations

Is this you? Claim your profile.

Add your photo, update your bio, and get notified when your ranking changes.

Top publicationsby citations

An Empirical Study of Training End-to-End Vision-and-Language Transformers
Zi-Yi Dou, Yichong Xu, Zhe Gan et al.|2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)|2022
Cited by 315

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present Meter, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, De-BERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and provide insights on how to train a performant VL transformer. Meterachieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based model by 1.04%, and outperforming the previous best fully transformer-based model by 1.6%. Notably, when further scaled up, our best VQA model achieves an accuracy of 80.54%. Code and pre-trained models are released at https://github.com/zdou0830/METER.

Generalized Decoding for Pixel, Image, and Language
Xueyan Zou, Zi-Yi Dou, Jianwei Yang et al.|Unknown|2023
Cited by 162

We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Without any pseudo-labeling, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level understanding. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on seven datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient fine-tuning and novel task composition (e.g., referring captioning and image editing shown in Fig. 1). Code, demo, video and visualization are available at: https://x-decoder-vl.github.io.

Word Alignment by Fine-tuning Embeddings on Parallel Corpora
Zi-Yi Dou, Graham Neubig|Unknown|2021
Cited by 122Open Access

Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs. The great majority of past work on word alignment has worked by performing unsupervised learning on parallel text. Recently, however, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data. In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs but finetuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models. We perform experiments on five language pairs and demonstrate that our model can consistently outperform previous state-of-the-art models of all varieties. In addition, we demonstrate that we are able to train multilingual word aligners that can obtain robust performance on different language pairs. Our aligner, AWE-SOME (Aligning Word Embedding Spaces Of Multilingual Encoders), with pre-trained models is available at https://github. com/neulab/awesome-align.