Tag Archive: Network Architecture

Not All Images are Worth 16×16 Words: Dynamic Vision Transformers with Adaptive Sequence Length

Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens would lead to higher prediction accuracy, while it also results in drastically increased computational cost. To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16×16. In this paper, we argue that every image has its own characteristics, and ideally the token number should be conditioned on each individual input. In fact, we have observed that there exist a considerable number of “easy” images which can be accurately predicted with a mere number of 4×4 tokens, while only a small fraction of “hard” ones need a finer representation. Inspired by this phenomenon, we propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image. This is achieved by cascading multiple Transformers with increasing numbers of tokens, which are sequentially activated in an adaptive fashion at test time, i.e., the inference is terminated once a sufficiently confident prediction is produced. We further design efficient feature reuse and relationship reuse mechanisms across different components of the Dynamic Transformer to reduce redundant computations. Extensive empirical results on ImageNet, CIFAR-10, and CIFAR-100 demonstrate that our method significantly outperforms the competitive baselines in terms of both theoretical computational efficiency and practical inference speed.

https://arxiv.org/abs/2105.15075

ViT has achieved remarkable success in large-scale image recognition. It splits each 2D image into a fixed number of patches, each of which is treated as a token. In general, representing an image with more tokens leads to higher accuracy, but it also drastically increases the computational cost, so the number of tokens is typically set to 16×16 as a trade-off between efficiency and performance. In this paper, we argue that every image has its own characteristics, so the number of tokens should be chosen per image. In fact, we observe that a considerable number of “easy” images can be accurately predicted with only 4×4 tokens, while only a small fraction of “hard” images require a finer representation. Motivated by this observation, we propose a Dynamic Transformer that automatically configures a proper number of tokens for each input image. The architecture cascades multiple Transformers with increasing token numbers, which are activated sequentially and adaptively at test time: inference terminates as soon as a sufficiently confident prediction is produced. We also design efficient feature-reuse and relationship-reuse mechanisms to cut redundant computation. Experiments show that the model achieves a strong accuracy/efficiency trade-off on several public datasets.
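To make the adaptive-inference idea concrete, here is a minimal PyTorch sketch of the confidence-based early-exit cascade. The stage list `vit_stages`, the threshold `eta`, and the fallback behaviour are illustrative assumptions; the paper's feature-reuse and relationship-reuse mechanisms are not reproduced.

```python
import torch

@torch.no_grad()
def dynamic_vit_inference(image, vit_stages, eta=0.9):
    """Sketch of cascaded early-exit inference for a single image (batch size 1).

    `vit_stages` is assumed to be a list of ViT classifiers whose patch
    embeddings use progressively more tokens (e.g. 4x4, 7x7, 14x14).
    Inference stops at the first stage whose softmax confidence exceeds `eta`.
    """
    logits = None
    for stage in vit_stages:
        logits = stage(image)                       # (1, num_classes)
        probs = torch.softmax(logits, dim=-1)
        confidence, prediction = probs.max(dim=-1)
        if confidence.item() >= eta:                # confident enough: early exit
            return prediction, logits
    return logits.argmax(dim=-1), logits            # fall back to the last stage
```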

MLP-Mixer: An all-MLP Architecture for Vision

The strong performance of vision transformers on image classification and other vision tasks is often attributed to the design of their multi-head attention layers. However, the extent to which attention is responsible for this strong performance remains unclear. In this short report, we ask: is the attention layer even necessary? Specifically, we replace the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension. The resulting architecture is simply a series of feed-forward layers applied over the patch and feature dimensions in an alternating fashion. In experiments on ImageNet, this architecture performs surprisingly well: a ViT/DeiT-base-sized model obtains 74.9% top-1 accuracy, compared to 77.9% and 79.9% for ViT and DeiT respectively. These results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought. We hope these results prompt the community to spend more time trying to understand why our current models are as effective as they are.

This paper proposes an attention-layer-free, all-MLP architecture: the attention layer of a vision transformer is replaced with a feed-forward layer applied over the patch dimension.
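As a rough illustration of replacing attention with a feed-forward layer over the patch dimension, here is a minimal PyTorch sketch of one such block. The hidden sizes, LayerNorm placement, and residual connections are assumptions for illustration, not the exact configuration reported in the paper.

```python
import torch
import torch.nn as nn

class FeedForwardMixerBlock(nn.Module):
    """One block that alternates feed-forward layers over the patch and
    feature dimensions, replacing the attention layer of a ViT block."""

    def __init__(self, num_patches, dim, hidden_patches=256, hidden_dim=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Applied over the patch (token) dimension: mixes information across patches.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, hidden_patches), nn.GELU(),
            nn.Linear(hidden_patches, num_patches),
        )
        self.norm2 = nn.LayerNorm(dim)
        # Applied over the feature dimension: the usual per-token MLP.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):                  # x: (batch, num_patches, dim)
        y = self.norm1(x).transpose(1, 2)  # (batch, dim, num_patches)
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x
```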

Escaping the Big Data Paradigm with Compact Transformers


With the rise of Transformers as the standard for language processing, and their advancements in computer vision, along with their unprecedented size and amounts of training data, many have come to believe that they are not suitable for small sets of data. This trend leads to great concerns, including but not limited to: limited availability of data in certain scientific domains and the exclusion of those with limited resources from research in the field. In this paper, we dispel the myth that transformers are “data hungry” and therefore can only be applied to large sets of data. We show for the first time that with the right size and tokenization, transformers can perform head-to-head with state-of-the-art CNNs on small datasets. Our model eliminates the requirement for class token and positional embeddings through a novel sequence pooling strategy and the use of convolutions. We show that compared to CNNs, our compact transformers have fewer parameters and MACs, while obtaining similar accuracies. Our method is flexible in terms of model size, and can have as little as 0.28M parameters and achieve reasonable results. It can reach an accuracy of 94.72% when training from scratch on CIFAR-10, which is comparable with modern CNN based approaches, and a significant improvement over previous Transformer based models. Our simple and compact design democratizes transformers by making them accessible to those equipped with basic computing resources and/or dealing with important small datasets.

https://arxiv.org/abs/2104.05704

Conventional ViT models need large amounts of training data. To address this, we propose the CCT architecture, which can be trained on small datasets and still match the performance of CNNs. The model removes the dependence on a class token and positional embeddings through a novel sequence pooling strategy and the use of convolutions. Experiments show that it reaches accuracy comparable to state-of-the-art models with fewer parameters and less computation (MACs).
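A minimal sketch of the sequence pooling idea (a learned softmax-weighted sum over the token sequence in place of a class token); the module name and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SequencePooling(nn.Module):
    """Pools a token sequence into one vector via learned attention weights,
    removing the need for a dedicated class token."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # one scalar importance score per token

    def forward(self, tokens):           # tokens: (batch, num_tokens, dim)
        weights = torch.softmax(self.score(tokens), dim=1)   # (batch, num_tokens, 1)
        pooled = (weights * tokens).sum(dim=1)               # (batch, dim)
        return pooled                                        # fed to the classifier head
```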

DeepViT: Towards Deeper Vision Transformer

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates fast when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting the expected performance gain. Based on the above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modifications to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet.

https://arxiv.org/abs/2103.11886

Vision transformers (ViTs) have recently been applied successfully to image classification. In this paper, we find that, unlike CNNs, whose performance can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly as they are made deeper. We observe that this scaling difficulty is caused by attention collapse: as the transformer grows deeper, the attention maps after certain layers gradually become similar or even identical. In other words, the feature maps in the top layers of deep ViTs tend to be identical. This finding shows that in the deeper layers of ViTs, the self-attention mechanism fails to learn effective features for representation learning and therefore cannot deliver additional performance gains. Based on this observation, we propose a simple yet effective method called Re-attention, which restores the diversity of the attention maps across layers at only a small cost in computation and memory. It makes it feasible to train deeper ViT models while maintaining performance gains; notably, our model with 32 Transformer blocks improves Top-1 accuracy on ImageNet by 1.6%.
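A minimal sketch of the re-attention idea: the per-head attention maps are mixed by a learnable head-to-head matrix before being applied to the values. The normalization the paper applies after the mixing is omitted, and the module layout is an assumption for illustration.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Sketch of re-attention: standard multi-head attention maps are mixed
    across heads by a learnable H x H matrix to restore their diversity."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        # Assumes dim is divisible by num_heads.
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.theta = nn.Parameter(torch.eye(num_heads))  # learnable head-mixing matrix
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each: (B, H, N, head_dim)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        # Mix attention maps across heads: new[b,h] = sum_k theta[h,k] * attn[b,k]
        attn = torch.einsum('hk,bknm->bhnm', self.theta, attn)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```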

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards “token uniformity”. Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.

https://arxiv.org/abs/2103.03404

Attention-based architectures have become ubiquitous in machine learning, yet the source of their effectiveness is still unclear. In this paper we look at self-attention networks from a new angle: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that the self-attention mechanism has a strong inductive bias towards token uniformity. In particular, without skip connections or multi-layer perceptrons, the output converges doubly exponentially to a rank-1 matrix. Conversely, skip connections and MLPs prevent the output from degenerating. Our experiments verify this convergence phenomenon on several standard transformer architectures.
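As a toy numerical illustration of this degeneration (not the paper's formal analysis), one can iterate pure self-attention without skip connections or MLPs and watch the relative distance to the nearest rank-1 matrix of the form 1x^T shrink; the sizes and random weights below are arbitrary.

```python
import torch

torch.manual_seed(0)
n, d = 32, 64                                    # tokens, feature dim (toy sizes)
X = torch.randn(n, d)
W_q, W_k, W_v = (torch.randn(d, d) / d ** 0.5 for _ in range(3))

for layer in range(12):
    # Pure single-head self-attention: no skip connection, no MLP.
    A = torch.softmax((X @ W_q) @ (X @ W_k).T / d ** 0.5, dim=-1)
    X = A @ (X @ W_v)
    # Relative distance to the closest matrix with all rows identical (rank 1).
    rel = (X - X.mean(dim=0, keepdim=True)).norm() / X.norm()
    print(f"layer {layer + 1:2d}: relative distance to rank-1 = {rel.item():.3e}")
```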

OmniNet: Omnidirectional Representations from Transformers

This paper proposes Omnidirectional Representations from Transformers (OmniNet). In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network. This process can also be interpreted as a form of extreme or intensive attention mechanism that has the receptive field of the entire width and depth of the network. To this end, the omnidirectional attention is learned via a meta-learner, which is essentially another self-attention based model. In order to mitigate the computationally expensive costs of full receptive field attention, we leverage efficient self-attention models such as kernel-based (Choromanski et al.), low-rank attention (Wang et al.) and/or Big Bird (Zaheer et al.) as the meta-learner. Extensive experiments are conducted on autoregressive language modeling (LM1B, C4), Machine Translation, Long Range Arena (LRA), and Image Recognition. The experiments show that OmniNet achieves considerable improvements across these tasks, including achieving state-of-the-art performance on LM1B, WMT’14 En-De/En-Fr, and Long Range Arena. Moreover, using omnidirectional representation in Vision Transformers leads to significant improvements on image recognition tasks on both few-shot learning and fine-tuning setups.

https://arxiv.org/abs/2103.01075

This paper proposes Omnidirectional Representations from Transformers (OmniNet). In OmniNet, instead of maintaining a strictly horizontal receptive field, every token is allowed to attend to all tokens in the entire network. This process can be viewed as an extreme form of attention whose receptive field spans the full width and depth of the network. The omnidirectional attention is learned via a meta-learner, which is itself another self-attention based model. To reduce the heavy computation of such full-receptive-field attention, efficient self-attention models such as kernel-based attention, low-rank attention, and Big Bird are used as the meta-learner. Experiments show that OmniNet performs well on both NLP and vision tasks.
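A very rough sketch of the omnidirectional idea, under the assumption that the meta-learner is simply a standard self-attention layer attending over the tokens collected from every layer; the efficient-attention variants used in the paper and its exact aggregation scheme are not reproduced.

```python
import torch
import torch.nn as nn

class OmnidirectionalPooler(nn.Module):
    """Sketch: a meta-learner attends over the tokens of *all* layers,
    giving each position a receptive field spanning the network's depth."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.meta_attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, layer_outputs):
        # layer_outputs: list of L tensors, each (batch, num_tokens, dim)
        all_tokens = torch.cat(layer_outputs, dim=1)      # (batch, L * num_tokens, dim)
        omni, _ = self.meta_attention(all_tokens, all_tokens, all_tokens)
        # Combine with the final layer's tokens (one simple choice among many).
        num_tokens = layer_outputs[-1].shape[1]
        return layer_outputs[-1] + omni[:, -num_tokens:, :]
```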

Training Generative Adversarial Networks in One Stage

Generative Adversarial Networks (GANs) have demonstrated unprecedented success in various image generation tasks. The encouraging results, however, come at the price of a cumbersome training process, during which the generator and discriminator are alternately updated in two stages. In this paper, we investigate a general training scheme that enables training GANs efficiently in only one stage. Based on the adversarial losses of the generator and discriminator, we categorize GANs into two classes, Symmetric GANs and Asymmetric GANs, and introduce a novel gradient decomposition method to unify the two, allowing us to train both classes in one stage and hence alleviate the training effort. Computational analysis and experimental results on several datasets and various network architectures demonstrate that, the proposed one-stage training scheme yields a solid 1.5× acceleration over conventional training schemes, regardless of the network architectures of the generator and discriminator. Furthermore, we show that the proposed method is readily applicable to other adversarial-training scenarios, such as data-free knowledge distillation.

https://arxiv.org/pdf/2103.00430.pdf

Generative Adversarial Networks (GANs) have shown unprecedented success in various image generation tasks. However, this success comes at the price of a cumbersome training procedure in which the generator and discriminator are updated alternately in two stages. In this paper, we propose a one-stage training scheme for GANs. Based on the form of their adversarial losses, we divide GANs into Symmetric GANs and Asymmetric GANs, and introduce a novel gradient decomposition method that unifies the two, so that both classes can be trained in a single stage. Computational analysis and experiments show that the one-stage scheme yields about a 1.5× speed-up over conventional training.
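For reference, here is a minimal sketch of the conventional two-stage alternating update that the paper sets out to replace; the paper's gradient-decomposition one-stage scheme itself is not reproduced. `G`, `D`, the optimizers, and `loss_fn` (e.g. `nn.BCEWithLogitsLoss()`) are assumed to be supplied by the caller.

```python
import torch

def two_stage_gan_step(G, D, opt_G, opt_D, real, noise, loss_fn):
    """Conventional two-stage GAN update: D and G are updated alternately,
    each with its own forward/backward pass."""
    # Stage 1: update the discriminator on real and (detached) fake samples.
    opt_D.zero_grad()
    fake = G(noise).detach()
    d_real, d_fake = D(real), D(fake)
    d_loss = loss_fn(d_real, torch.ones_like(d_real)) + \
             loss_fn(d_fake, torch.zeros_like(d_fake))
    d_loss.backward()
    opt_D.step()

    # Stage 2: update the generator to fool the (now fixed) discriminator.
    opt_G.zero_grad()
    d_on_fake = D(G(noise))
    g_loss = loss_fn(d_on_fake, torch.ones_like(d_on_fake))
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```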

Transformer in Transformer

The Transformer is a type of self-attention-based neural network originally applied to NLP tasks. Recently, pure transformer-based models have been proposed to solve computer vision problems. These visual transformers usually view an image as a sequence of patches while they ignore the intrinsic structure information inside each patch. In this paper, we propose a novel Transformer-iN-Transformer (TNT) model for modeling both patch-level and pixel-level representation. In each TNT block, an outer transformer block is utilized to process patch embeddings, and an inner transformer block extracts local features from pixel embeddings. The pixel-level feature is projected to the space of patch embedding by a linear transformation layer and then added into the patch. By stacking the TNT blocks, we build the TNT model for image recognition. Experiments on ImageNet benchmark and downstream tasks demonstrate the superiority and efficiency of the proposed TNT architecture. For example, our TNT achieves 81.3% top-1 accuracy on ImageNet, which is 1.5% higher than that of DeiT with similar computational cost.

https://arxiv.org/abs/2103.00112

The Transformer is a self-attention-based neural architecture originally used for NLP tasks. Recently, pure transformer-based models have been proposed for computer vision. These models usually treat an image as a sequence of patches and ignore the intrinsic structure information inside each patch. In this paper, we propose the Transformer-iN-Transformer (TNT) architecture, which models both patch-level and pixel-level representations. In each TNT block, an outer transformer block processes the patch embeddings, while an inner transformer block extracts local features from the pixel embeddings; the pixel-level features are projected into the patch-embedding space by a linear layer and added to the patch embeddings. Stacking TNT blocks yields the TNT model for image recognition. Experiments on ImageNet and downstream tasks demonstrate the advantages of TNT.
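A minimal sketch of one TNT block as described above (inner transformer on pixel embeddings, linear projection into the patch space, outer transformer on patch embeddings). The use of `nn.TransformerEncoderLayer` and the dimension choices are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TNTBlock(nn.Module):
    """Sketch of a Transformer-iN-Transformer block: an inner transformer refines
    pixel-level embeddings inside each patch, which are projected and added to the
    patch embeddings processed by an outer transformer."""

    def __init__(self, patch_dim, pixel_dim, pixels_per_patch, num_heads=4):
        super().__init__()
        # Assumes patch_dim and pixel_dim are divisible by num_heads.
        self.inner = nn.TransformerEncoderLayer(pixel_dim, num_heads, batch_first=True)
        self.outer = nn.TransformerEncoderLayer(patch_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(pixels_per_patch * pixel_dim, patch_dim)

    def forward(self, patch_emb, pixel_emb):
        # patch_emb: (batch, num_patches, patch_dim)
        # pixel_emb: (batch * num_patches, pixels_per_patch, pixel_dim)
        pixel_emb = self.inner(pixel_emb)
        b, n, _ = patch_emb.shape
        local = self.proj(pixel_emb.reshape(b, n, -1))   # fold pixel info into patch space
        patch_emb = self.outer(patch_emb + local)
        return patch_emb, pixel_emb
```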

Do We Really Need Explicit Position Encodings for Vision Transformers?

Almost all visual transformers such as ViT or DeiT rely on predefined positional encodings to incorporate the order of each input token. These encodings are often implemented as learnable fixed-dimension vectors or sinusoidal functions of different frequencies, which cannot accommodate variable-length input sequences. This inevitably limits a wider application of transformers in vision, where many tasks require changing the input size on-the-fly.
In this paper, we propose to employ a conditional position encoding scheme, which is conditioned on the local neighborhood of the input token. It is effortlessly implemented as what we call Position Encoding Generator (PEG), which can be seamlessly incorporated into the current transformer framework. Our new model with PEG is named Conditional Position encoding Visual Transformer (CPVT) and can naturally process the input sequences of arbitrary length. We demonstrate that CPVT can result in visually similar attention maps and even better performance than those with predefined positional encodings. We obtain state-of-the-art results on the ImageNet classification task compared with visual Transformers to date.

https://arxiv.org/abs/2102.10882

Almost all vision transformers, such as ViT and DeiT, rely on predefined positional encodings to incorporate the order of the input tokens. These encodings are usually implemented as learnable fixed-length vectors or as sinusoidal functions of different frequencies, and therefore cannot handle variable-length input sequences. This inevitably limits the application of vision transformers to tasks where the input size changes on the fly.

In this paper, we propose a conditional position encoding scheme that is conditioned on the local neighborhood of the input tokens. We implement it as a Position Encoding Generator (PEG), which can be seamlessly incorporated into existing transformer frameworks. The resulting model, the Conditional Position encoding Visual Transformer (CPVT), naturally handles input sequences of arbitrary length. We show that CPVT produces visually similar attention maps and even better performance than methods with predefined positional encodings, and it achieves state-of-the-art results among vision Transformers on ImageNet classification.
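A minimal sketch of a PEG as described: a depthwise convolution over the tokens reshaped into their 2D grid generates position information conditioned on each token's local neighborhood. The kernel size and the way the result is added back are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PositionEncodingGenerator(nn.Module):
    """Sketch of a PEG: positional information is generated on the fly from
    each token's local 2D neighborhood via a depthwise convolution, so the
    model handles arbitrary input resolutions without predefined encodings."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, tokens, height, width):
        # tokens: (batch, height * width, dim), i.e. the patch tokens in raster order
        b, n, c = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, c, height, width)
        return tokens + self.dwconv(grid).flatten(2).transpose(1, 2)
```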

LambdaNetworks: Modeling Long-Range Interactions Without Attention

We present lambda layers — an alternative framework to self-attention — for capturing long-range interactions between an input and structured contextual information (e.g. a pixel surrounded by other pixels). Lambda layers capture such interactions by transforming available contexts into linear functions, termed lambdas, and applying these linear functions to each input separately. Similar to linear attention, lambda layers bypass expensive attention maps, but in contrast, they model both content and position-based interactions which enables their application to large structured inputs such as images. The resulting neural network architectures, LambdaNetworks, significantly outperform their convolutional and attentional counterparts on ImageNet classification, COCO object detection and COCO instance segmentation, while being more computationally efficient. Additionally, we design LambdaResNets, a family of hybrid architectures across different scales, that considerably improves the speed-accuracy tradeoff of image classification models. LambdaResNets reach excellent accuracies on ImageNet while being 3.2 – 4.4x faster than the popular EfficientNets on modern machine learning accelerators. When training with an additional 130M pseudo-labeled images, LambdaResNets achieve up to a 9.5x speed-up over the corresponding EfficientNet checkpoints.

https://arxiv.org/abs/2102.08602

In this paper we present lambda layers, an alternative to self-attention for capturing long-range interactions between an input and structured contextual information (e.g. a pixel and its surrounding pixels). A lambda layer transforms the available context into linear functions, termed lambdas, and applies these linear functions to each input separately. Like linear attention, lambda layers avoid computing expensive attention maps; unlike it, however, they model both content-based and position-based interactions, which makes them applicable to large structured inputs such as images. The resulting LambdaNetworks deliver strong results on image classification, object detection, and instance segmentation while being more computationally efficient.
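A minimal sketch of a lambda layer restricted to the content lambda (position lambdas and multi-query details are omitted): the context is summarized into a small k x v linear map that is applied to every query, so no full attention map is ever formed. The dimension choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContentLambdaLayer(nn.Module):
    """Simplified lambda layer (content lambda only): the context is turned into
    a single k x v linear function shared by all queries."""

    def __init__(self, dim, dim_k=16, dim_v=64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim_k, bias=False)
        self.to_k = nn.Linear(dim, dim_k, bias=False)
        self.to_v = nn.Linear(dim, dim_v, bias=False)
        self.out = nn.Linear(dim_v, dim)

    def forward(self, x, context=None):
        context = x if context is None else context          # (batch, m, dim)
        q = self.to_q(x)                                      # (batch, n, k)
        k = torch.softmax(self.to_k(context), dim=1)          # normalize over context positions
        v = self.to_v(context)                                # (batch, m, v)
        content_lambda = torch.einsum('bmk,bmv->bkv', k, v)   # (batch, k, v)
        y = torch.einsum('bnk,bkv->bnv', q, content_lambda)   # apply the lambda to each query
        return self.out(y)
```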