Transformer in Transformer

Transformer is a type of self-attention-based neural network originally applied to NLP tasks. Recently, pure transformer-based models have been proposed to solve computer vision problems. These visual transformers usually view an image as a sequence of patches but ignore the intrinsic structural information inside each patch. In this paper, we propose a novel Transformer-iN-Transformer (TNT) model that learns both patch-level and pixel-level representations. In each TNT block, an outer transformer block processes the patch embeddings, while an inner transformer block extracts local features from the pixel embeddings. The pixel-level features are projected to the patch-embedding space by a linear transformation layer and then added to the patch embeddings. By stacking TNT blocks, we build the TNT model for image recognition. Experiments on the ImageNet benchmark and downstream tasks demonstrate the superiority and efficiency of the proposed TNT architecture. For example, our TNT achieves 81.3% top-1 accuracy on ImageNet, which is 1.5% higher than that of DeiT with a similar computational cost.

https://arxiv.org/abs/2103.00112
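The data flow of one TNT block described above can be sketched as follows. This is a minimal, single-head NumPy illustration, not the paper's implementation: the patch/pixel counts, embedding sizes, and random weight matrices are all hypothetical, and the MLP sub-layers, layer normalization, and multi-head attention of the real model are omitted for brevity.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention; x has shape (seq_len, dim).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n_patches, n_pixels = 4, 16        # hypothetical: 4 patches, 16 "pixels" each
pixel_dim, patch_dim = 8, 32       # hypothetical embedding sizes

# Pixel embeddings (per patch) and patch embeddings.
pixel_emb = rng.normal(size=(n_patches, n_pixels, pixel_dim))
patch_emb = rng.normal(size=(n_patches, patch_dim))

def rand_w(d):
    return rng.normal(size=(d, d)) * 0.1

inner_w = [rand_w(pixel_dim) for _ in range(3)]   # Q, K, V of inner block
outer_w = [rand_w(patch_dim) for _ in range(3)]   # Q, K, V of outer block
# Linear projection from flattened pixel embeddings to the patch space.
w_proj = rng.normal(size=(n_pixels * pixel_dim, patch_dim)) * 0.1

def tnt_block(pixel_emb, patch_emb):
    # Inner transformer: attention over the pixels within each patch.
    inner = np.stack([x + self_attention(x, *inner_w) for x in pixel_emb])
    # Project flattened pixel features and add them to the patch embeddings.
    patch = patch_emb + inner.reshape(n_patches, -1) @ w_proj
    # Outer transformer: attention over the patch embeddings.
    return inner, patch + self_attention(patch, *outer_w)

pixel_out, patch_out = tnt_block(pixel_emb, patch_emb)
print(pixel_out.shape, patch_out.shape)  # (4, 16, 8) (4, 32)
```

Stacking several such blocks (each with its own weights) and attaching a classification head to the patch-level output gives the overall shape of the TNT model.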

Transformer is a self-attention-based neural network architecture for NLP tasks. Recently, several purely transformer-based models have been proposed to solve computer vision problems. These models typically treat an image as a sequence of image patches and ignore the intrinsic structural information within each patch. In this paper, we propose an architecture called TNT that models both the patch level and the pixel level. In each TNT block, an outer transformer block processes the patch embeddings. Pixel-level features are mapped to the patch-embedding space by a linear transformation layer and then added to the patch embeddings. By stacking TNT blocks, we build the TNT model for image recognition. Experiments on ImageNet and downstream tasks demonstrate the advantages of TNT.
