ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks


We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks (visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval) by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models, achieving state-of-the-art results on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.

https://arxiv.org/abs/1908.02265

In this paper we present ViLBERT, a task-agnostic model of joint image and natural-language representations. We extend the well-known BERT model into a multi-modal two-stream architecture that processes visual and textual inputs separately and fuses them through co-attentional transformer layers. We pretrain the model with two proxy tasks on the large, automatically collected Conceptual Captions dataset, and then transfer it to several existing vision-and-language tasks: visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval. We observe clear performance gains on all four of these tasks. Our work shows that visual grounding does not have to be learned only as part of task-specific training; it can instead be treated as a pretrainable and transferable capability.
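
To make the two-stream design more concrete, below is a minimal PyTorch sketch of one co-attentional transformer block, in which queries from the visual stream attend over linguistic keys and values and vice versa. This is an illustrative reconstruction based only on the abstract above, not the authors' released implementation; the hidden size, head count, and module layout are assumptions.

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Sketch of a ViLBERT-style co-attentional transformer block.

    Each stream keeps its own attention, layer norms, and feed-forward
    network; only the keys/values come from the other stream. Sizes here
    are illustrative, not the paper's configuration.
    """

    def __init__(self, hidden_dim=768, num_heads=8):
        super().__init__()
        # Cross-attention: queries from one stream, keys/values from the other.
        self.vis_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.vis_norm = nn.LayerNorm(hidden_dim)
        self.txt_norm = nn.LayerNorm(hidden_dim)
        self.vis_ffn = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim), nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim))
        self.txt_ffn = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim), nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim))
        self.vis_ffn_norm = nn.LayerNorm(hidden_dim)
        self.txt_ffn_norm = nn.LayerNorm(hidden_dim)

    def forward(self, vis, txt):
        # Visual queries attend over linguistic keys/values, and vice versa.
        vis_ctx, _ = self.vis_attn(query=vis, key=txt, value=txt)
        txt_ctx, _ = self.txt_attn(query=txt, key=vis, value=vis)
        vis = self.vis_norm(vis + vis_ctx)
        txt = self.txt_norm(txt + txt_ctx)
        # Each stream then passes through its own feed-forward sub-layer.
        vis = self.vis_ffn_norm(vis + self.vis_ffn(vis))
        txt = self.txt_ffn_norm(txt + self.txt_ffn(txt))
        return vis, txt


# Example: a batch of 36 image-region features and 20 token embeddings.
vis = torch.randn(2, 36, 768)
txt = torch.randn(2, 20, 768)
vis_out, txt_out = CoAttentionLayer()(vis, txt)
print(vis_out.shape, txt_out.shape)  # torch.Size([2, 36, 768]) torch.Size([2, 20, 768])
```

Keeping the two streams separate, with interaction only through these cross-attention exchanges, is what lets each modality retain its own depth of processing while still learning a joint visiolinguistic representation.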
