ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

Vision-and-Language Pretraining (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches to VLP rely heavily on image feature extraction processes, most of which involve region supervision (e.g., object detection) and convolutional architectures (e.g., ResNet). Although disregarded in the literature, we find this problematic in terms of both (1) efficiency/speed, in that simply extracting input features requires much more computation than the actual multimodal interaction steps; and (2) expressive power, as the model is upper-bounded by the expressive power of the visual encoder and its predefined visual vocabulary. In this paper, we present a minimal VLP model, the Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which we process textual inputs. We show that ViLT is up to 60 times faster than previous VLP models, yet achieves competitive or better downstream task performance.

Vision-and-Language Pretraining improves performance on downstream tasks that combine vision and language. Current methods all rely heavily on an image feature extraction pipeline, and the vast majority involve region supervision (e.g., object detection) and convolutional architectures (e.g., ResNet). These methods overlook two problems: (1) efficiency/speed: simply extracting input features requires far more computation than the multimodal fusion model itself; (2) expressive power: such a model's performance is upper-bounded by the expressive power of the visual encoder, which is trained on a predefined visual vocabulary. In this paper, we introduce a minimal VLP model, the Vision-and-Language Transformer (ViLT). Its monolithic pipeline processes visual inputs in a convolution-free manner, greatly simplifying the model while handling textual inputs in the same way. Our model is up to 60 times faster than previous VLP models and performs as well as or better than them on downstream tasks.
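The "convolution-free" treatment of images described above can be illustrated with a linear patch embedding, in the style popularized by ViT: the image is cut into non-overlapping patches, each patch is flattened, and a single linear projection maps it into the embedding space, exactly analogous to word embeddings for text. The sketch below (NumPy, with illustrative shapes and a hypothetical `patch_embed` helper; it is not ViLT's actual implementation) shows the idea:

```python
import numpy as np

def patch_embed(image, patch_size, proj):
    """Split an image into non-overlapping patches, flatten each patch,
    and apply one linear projection -- the visual analogue of a word
    embedding lookup. Shapes here are illustrative only."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image must tile evenly into patches"
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (num_patches, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
    # One matrix multiply replaces the entire CNN/region-detector pipeline.
    return patches @ proj  # (num_patches, embed_dim)

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))          # toy 32x32 RGB image
W_proj = rng.standard_normal((4 * 4 * 3, 64))   # projection to 64-dim tokens
tokens = patch_embed(img, patch_size=4, proj=W_proj)
print(tokens.shape)  # (64, 64): 8x8 = 64 patch tokens, 64-dim each
```

The resulting patch tokens can then be concatenated with text token embeddings and fed into a single transformer, which is where the efficiency gain over detector-based feature extraction comes from.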
