Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Bin Xiao; Haiping Wu; Weijian Xu; Xiyang Dai; Houdong Hu; Yumao Lu; Michael Zeng; Ce Liu; Lu Yuan

doi:10.1109/cvpr52733.2024.00461

Affiliation (all authors): Microsoft Research (United Kingdom)

June 16, 2024

Cited by 197

Abstract

We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform diverse tasks from simple instructions, a capability that requires handling varied spatial hierarchies and levels of semantic granularity. Florence-2 was designed to take text prompts as task instructions and generate the desired results in text form, whether the task is captioning, object detection, grounding, or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, which consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
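
To make the idea of "results in text form" concrete, the following is a minimal sketch of how a detection result could be serialized into a text target for a sequence-to-sequence model and parsed back. The abstract does not specify the encoding, so everything here is an assumption for illustration: the `<loc_N>` token format, the bin count `NUM_BINS`, the example prompt string, and the helper names `quantize`, `box_to_text`, and `text_to_boxes` are all hypothetical.

```python
import re
from typing import List, Tuple

# Assumed number of coordinate bins; the abstract does not state one.
NUM_BINS = 1000

# (x_min, y_min, x_max, y_max) in pixels
Box = Tuple[float, float, float, float]


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to a bin index in [0, num_bins - 1]."""
    bin_index = int(value / size * num_bins)
    return min(max(bin_index, 0), num_bins - 1)


def box_to_text(label: str, box: Box, width: int, height: int) -> str:
    """Serialize one labeled box as a label followed by four location tokens."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    return label + "".join(f"<loc_{b}>" for b in bins)


def text_to_boxes(text: str) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Parse generated text back into (label, quantized box) pairs."""
    pattern = re.compile(r"([^<]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")
    results = []
    for match in pattern.finditer(text):
        label = match.group(1)
        bins = tuple(int(match.group(i)) for i in range(2, 6))
        results.append((label, bins))
    return results


# Hypothetical prompt/target pair for a 640x480 image with two objects.
prompt = "Locate the objects in the image."
objects = [("cat", (0, 0, 320, 240)), ("dog", (320, 240, 640, 480))]
target = "".join(box_to_text(label, box, 640, 480) for label, box in objects)
parsed = text_to_boxes(target)
```

Under this assumed scheme, captioning, detection, and grounding differ only in the prompt and in the text the model is trained to emit, so a single sequence-to-sequence model can cover all of them without task-specific output heads.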
