Toward Transformer-Based Object Detection

Transformers have become the dominant model in natural language processing, owing to their ability to pretrain on massive amounts of data, then transfer to smaller, more specific tasks via fine-tuning. The Vision Transformer was the first major attempt to apply a pure transformer model directly to images as input, demonstrating that as compared to convolutional networks, transformer-based architectures can achieve competitive results on benchmark classification tasks. However, the computational complexity of the attention operator means that we are limited to low-resolution inputs. For more complex tasks such as detection or segmentation, maintaining a high input resolution is crucial to ensure that models can properly identify and reflect fine details in their output. This naturally raises the question of whether or not transformer-based architectures such as the Vision Transformer are capable of performing tasks other than classification. In this paper, we determine that Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results. The model that we propose, ViT-FRCNN, demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance. We also investigate improvements over a standard detection backbone, including superior performance on out-of-domain images, better performance on large objects, and a lessened reliance on non-maximum suppression. We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.

https://arxiv.org/abs/2012.09958

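Transformer-style models have come to dominate NLP: after pretraining on large amounts of data, they can be transferred to smaller, more specialized tasks via fine-tuning. ViT was the first model to feed images directly into a Transformer, and its results show that it can match the performance of CNNs. However, the computational complexity of attention limits the input image resolution, which is a significant drawback for tasks such as object detection or segmentation. In this paper, we propose ViT-FRCNN, a general object detection model with ViT as its backbone, which achieves competitive performance on the COCO dataset. It also retains the known strengths of Transformers: large pretraining capacity and fast fine-tuning. We further examine where the ViT-based model improves over a standard detection architecture, including better performance on out-of-domain images, better performance on large objects, and less reliance on non-maximum suppression. We view ViT-FRCNN as an important step toward applying Transformers to general computer vision tasks.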
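
To make the resolution limit concrete, here is a minimal sketch of how the size of the self-attention matrix grows with input resolution. It assumes non-overlapping 16x16 patches and ignores the class token; the function names, the patch size, and the example resolutions are illustrative assumptions, not values taken from the paper.

```python
def patch_tokens(height: int, width: int, patch: int = 16) -> int:
    """Count non-overlapping patch tokens for an image (class token ignored).

    The patch size of 16 is an assumption chosen for illustration.
    """
    return (height // patch) * (width // patch)


def attention_entries(height: int, width: int, patch: int = 16) -> int:
    """Entries in one attention matrix (per head, per layer): tokens squared."""
    n = patch_tokens(height, width, patch)
    return n * n


if __name__ == "__main__":
    # Hypothetical square resolutions, from classification-sized to detection-sized.
    for side in (224, 384, 800):
        n = patch_tokens(side, side)
        entries = attention_entries(side, side)
        print(f"{side}x{side}: {n} tokens, {entries:,} attention entries per head")
```

Because the token count scales with image area and the attention matrix scales with the square of the token count, doubling the side length of the image multiplies the attention cost by roughly sixteen. That is why detection-sized inputs are much more expensive than classification-sized ones.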
