Do We Really Need Explicit Position Encodings for Vision Transformers?

Almost all visual transformers such as ViT or DeiT rely on predefined positional encodings to incorporate the order of each input token. These encodings are often implemented as learnable fixed-dimension vectors or sinusoidal functions of different frequencies, which are not possible to accommodate variable-length input sequences. This inevitably limits a wider application of transformers in vision, where many tasks require changing the input size on-the-fly. 
In this paper, we propose to employ a conditional position encoding scheme, which is conditioned on the local neighborhood of the input token. It is effortlessly implemented as what we call Position Encoding Generator (PEG), which can be seamlessly incorporated into the current transformer framework. Our new model with PEG is named Conditional Position encoding Visual Transformer (CPVT) and can naturally process the input sequences of arbitrary length. We demonstrate that CPVT can result in visually similar attention maps and even better performance than those with predefined positional encodings. We obtain state-of-the-art results on the ImageNet classification task compared with visual Transformers to date.


在本文中,我们提出了一种利用条件位置编码机制,这种机制以本地邻域输入token作为条件。我们由此提出位置编码生成器 (PEG), 它可以与现有的transformer架构无缝协作。另外,我们将提出的模型命名为条件位置编码视觉Transformer (CPVT). 它可以处理可变长度的输入序列。我们展示了CPVT可以获得视觉上相似的注意力图以及更优的性能相对于现有的预定义位置编码的方法,并且我们的模型在ImageNet分类任务上获得了SOTA的评价。


邮箱地址不会被公开。 必填项已用*标注