UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation|Huaishao Luo, Ming Zhou, Botian Shi et al.|arXiv (Cornell University)|2020|Cited by 169