Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting

Su Wang; Chitwan Saharia; Ceslee Montgomery; Jordi Pont-Tuset; Shai Noy; Stefano Pellegrini; Yasumasa Onoe; Sarah Laszlo; David J. Fleet; Radu Soricut; Jason Baldridge; Mohammad Norouzi; Peter J. Anderson; William Chan

doi:10.1109/cvpr52729.2023.01761

Affiliation (all authors): Google (United States)

June 1, 2023

Cited by 142

Abstract

Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to input text prompts while remaining consistent with input images. We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen [36] on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high-resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images, exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object masking during training leads to across-the-board improvements in text-image alignment, such that Imagen Editor is preferred over DALL-E 2 [31] and Stable Diffusion [33]. We also find that, as a cohort, these models are better at object rendering than text rendering, and handle material/color/size attributes better than count/shape attributes.
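The object-masking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the detector, the mask shape, or any fallback policy, so the box format `(x0, y0, x1, y1)`, the random-box fallback, and the helper names `box_mask`, `propose_mask`, and `apply_mask` are all hypothetical. Detected boxes are assumed to come from some external object detector.

```python
import random

def box_mask(height, width, box):
    """Return a height x width grid with 1 inside box = (x0, y0, x1, y1), else 0."""
    x0, y0, x1, y1 = box
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
            for y in range(height)]

def propose_mask(height, width, detected_boxes, rng):
    """Pick one detected object box as the inpainting mask.

    Assumption: when the (hypothetical) detector returns nothing, fall back
    to a random box covering half of each image dimension.
    """
    if detected_boxes:
        box = rng.choice(detected_boxes)
    else:
        x0 = rng.randrange(0, width // 2)
        y0 = rng.randrange(0, height // 2)
        box = (x0, y0, x0 + width // 2, y0 + height // 2)
    return box_mask(height, width, box)

def apply_mask(image, mask, fill=0):
    """Replace masked pixels with `fill`; the model would be trained to restore them."""
    return [[fill if m else px for px, m in zip(row, mask_row)]
            for row, mask_row in zip(image, mask)]

# Toy usage: an 8x8 single-channel image and one "detected" object box.
rng = random.Random(0)
image = [[5] * 8 for _ in range(8)]
mask = propose_mask(8, 8, [(2, 2, 6, 5)], rng)
masked_image = apply_mask(image, mask)
```

Under these assumptions, masking a whole object rather than an arbitrary patch means the surrounding context alone does not reveal what was removed, so the text prompt is needed to fill the region. That is one reading of why the abstract credits object masking with better text-image alignment.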
