Vision transformers (ViTs) have recently been successfully applied to image classification tasks. In this paper, we show that, unlike convolutional neural networks (CNNs), whose performance can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper. More specifically, we empirically observe that this scaling difficulty is caused by an attention collapse issue: as the transformer goes deeper, the attention maps gradually become more similar to one another and, beyond certain layers, nearly identical. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This indicates that in the deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning, which prevents the model from achieving the expected performance gain. Based on the above observation, we propose a simple yet effective method, named Re-attention, which regenerates the attention maps to increase their diversity across layers at negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modifications to existing ViT architectures. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy on ImageNet improves by 1.6%.
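To make the core idea concrete, Re-attention can be loosely sketched as mixing the per-head attention maps with a small learnable head-to-head matrix before they are applied to the values, so that collapsed heads are recombined into more diverse maps. The sketch below is an illustrative assumption, not the paper's exact implementation: the mixing matrix `theta`, the simple row re-normalization, and the tiny dimensions are all placeholders (the actual method operates inside a full transformer block and uses its own normalization).

```python
def re_attention(attn, theta):
    """Illustrative sketch of the Re-attention idea.

    attn:  list of H attention maps, each an N x N list of lists
           whose rows sum to 1 (i.e., post-softmax attention).
    theta: H x H mixing matrix (learnable in a real model); here it
           is just a plain list of lists, an assumed stand-in.

    Each output head is a linear combination of all input heads'
    attention maps, followed by a simple row re-normalization so the
    result is again a valid attention map. The normalization choice
    is an assumption for this sketch.
    """
    H = len(attn)
    N = len(attn[0])
    # Mix attention maps across the head dimension: for output head h,
    # entry (i, j) is sum_g theta[h][g] * attn[g][i][j].
    mixed = [[[sum(theta[h][g] * attn[g][i][j] for g in range(H))
               for j in range(N)]
              for i in range(N)]
             for h in range(H)]
    # Re-normalize each row so every output map stays row-stochastic.
    out = []
    for h in range(H):
        rows = []
        for i in range(N):
            s = sum(mixed[h][i])
            rows.append([v / s for v in mixed[h][i]])
        out.append(rows)
    return out
```

For example, given two heads with maps `[[0.9, 0.1], [0.2, 0.8]]` and `[[0.5, 0.5], [0.6, 0.4]]` and a mixing matrix `[[0.8, 0.2], [0.3, 0.7]]`, each output head is a distinct blend of both input heads while remaining a valid attention map, which conveys how mixing across heads can restore diversity among otherwise similar maps.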