Tag archive: Network Architecture

Escaping the Big Data Paradigm with Compact Transformers


With the rise of Transformers as the standard for language processing, and their advancements in computer vision, along with their unprecedented size and amounts of training data, many have come to believe that they are not suitable for small sets of data. This trend leads to great concerns, including but not limited to: limited availability of data in certain scientific domains and the exclusion of those with limited resource from research in the field. In this paper, we dispel the myth that transformers are “data hungry” and therefore can only be applied to large sets of data. We show for the first time that with the right size and tokenization, transformers can perform head-to-head with state-of-the-art CNNs on small datasets. Our model eliminates the requirement for class token and positional embeddings through a novel sequence pooling strategy and the use of convolutions. We show that compared to CNNs, our compact transformers have fewer parameters and MACs, while obtaining similar accuracies. Our method is flexible in terms of model size, and can have as little as 0.28M parameters and achieve reasonable results. It can reach an accuracy of 94.72% when training from scratch on CIFAR-10, which is comparable with modern CNN based approaches, and a significant improvement over previous Transformer based models. Our simple and compact design democratizes transformers by making them accessible to those equipped with basic computing resources and/or dealing with important small datasets.


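Conventional ViT models need large amounts of data for training. To address this, we propose the CCT architecture, which matches the performance of CNNs when trained on small datasets. Our model removes the dependence on the class token and positional embeddings through a new sequence pooling strategy and the use of convolutions. Experiments show that our model achieves performance similar to SOTA models with fewer parameters and fewer MACs.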
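
To make the sequence pooling idea concrete, here is a minimal sketch in which the class token is replaced by a learned, softmax-weighted average over the encoder's output tokens. The scoring vector `w` and the name `seq_pool` are assumptions for illustration; the abstract does not spell out the exact formulation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def seq_pool(tokens, w):
    """Pool a sequence of token vectors into a single vector.

    tokens: list of n token vectors, each of length d (encoder output).
    w: hypothetical learned scoring vector of length d.
    """
    # One importance score per token, normalized over the sequence.
    scores = [sum(t_i * w_i for t_i, w_i in zip(t, w)) for t in tokens]
    weights = softmax(scores)
    d = len(tokens[0])
    # Weighted sum of the tokens gives one d-dimensional representation.
    return [sum(weights[i] * tokens[i][j] for i in range(len(tokens)))
            for j in range(d)]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = seq_pool(tokens, w=[0.0, 0.0])  # zero scores reduce to a plain mean
```

With a zero scoring vector every token gets the same weight, so the pooled vector is just the mean of the tokens; a trained `w` would instead emphasize the more informative tokens.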

DeepViT: Towards Deeper Vision Transformer

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolution neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates fast when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting expected performance gain. Based on the above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modification to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet.


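Vision transformers (ViTs) have recently been applied successfully to image classification. In this paper, we find that, unlike CNNs, which improve as more convolutional layers are stacked, the performance of ViTs saturates quickly as they are made deeper. We observe that this is caused by attention collapse: as the transformer gets deeper, the attention maps after a certain layer gradually become similar or even identical. In other words, the feature maps in the top layers of a deep ViT tend to be the same. This shows that in deeper ViTs the self-attention mechanism fails to learn effective features for representation learning, and so brings no further performance gain. Based on this finding, we propose a simple yet effective method called Re-attention, which restores the diversity of the attention maps at different layers at negligible computation and memory cost. The method makes it possible to train deeper ViT models with consistent performance improvements. Notably, a model with 32 transformer blocks gains 1.6% Top-1 accuracy on ImageNet.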
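
One simple way to quantify attention collapse is to compare the attention maps of two layers directly. The sketch below uses cosine similarity over the flattened maps; the choice of metric and the name `attention_similarity` are assumptions for illustration, not necessarily the measure used in the paper.

```python
import math

def attention_similarity(a, b):
    """Cosine similarity between two attention maps of the same shape.

    a, b: n x n attention maps (each row sums to 1) from two layers.
    Values near 1 mean the two layers attend in almost the same way.
    """
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    dot = sum(x * y for x, y in zip(fa, fb))
    norm_a = math.sqrt(sum(x * x for x in fa))
    norm_b = math.sqrt(sum(y * y for y in fb))
    return dot / (norm_a * norm_b)

uniform = [[0.5, 0.5], [0.5, 0.5]]   # every token attends to all tokens equally
diagonal = [[1.0, 0.0], [0.0, 1.0]]  # every token attends only to itself

collapsed = attention_similarity(uniform, uniform)   # identical maps
diverse = attention_similarity(diagonal, uniform)    # different maps
```

Tracking this score between consecutive layers of a trained model would show where the maps stop changing, which is the regime Re-attention is meant to avoid.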

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards “token uniformity”. Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.
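
The "token uniformity" bias can be observed in a toy setting. The sketch below stacks single-head self-attention layers with no skip connections and no MLPs, and tracks how far apart the tokens remain. Using identity matrices for the query, key, and value projections is a simplifying assumption made here so that the effect is easy to follow; it is not the setup analyzed in the paper.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pure_attention_layer(x):
    """Single-head self-attention with identity projections, no skip, no MLP."""
    d = len(x[0])
    out = []
    for q in x:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        attn = softmax(logits)
        # Each output token is a convex combination of all input tokens.
        out.append([sum(attn[j] * x[j][c] for j in range(len(x)))
                    for c in range(d)])
    return out

def token_spread(x):
    """Largest per-feature gap between tokens; 0 means all tokens are equal."""
    return max(max(col) - min(col) for col in zip(*x))

random.seed(0)
x = [[random.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(6)]
spreads = [token_spread(x)]
for _ in range(8):
    x = pure_attention_layer(x)
    spreads.append(token_spread(x))
```

Because every output token is a strictly positive weighted average of the inputs, the spread can only shrink from layer to layer. Once all tokens coincide, the output matrix has identical rows and is therefore rank 1.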



OmniNet: Omnidirectional Representations from Transformers

This paper proposes Omnidirectional Representations from Transformers (OmniNet). In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network. This process can also be interpreted as a form of extreme or intensive attention mechanism that has the receptive field of the entire width and depth of the network. To this end, the omnidirectional attention is learned via a meta-learner, which is essentially another self-attention based model. In order to mitigate the computationally expensive costs of full receptive field attention, we leverage efficient self-attention models such as kernel-based (Choromanski et al.), low-rank attention (Wang et al.) and/or Big Bird (Zaheer et al.) as the meta-learner. Extensive experiments are conducted on autoregressive language modeling (LM1B, C4), Machine Translation, Long Range Arena (LRA), and Image Recognition. The experiments show that OmniNet achieves considerable improvements across these tasks, including achieving state-of-the-art performance on LM1B, WMT’14 En-De/En-Fr, and Long Range Arena. Moreover, using omnidirectional representation in Vision Transformers leads to significant improvements on image recognition tasks on both few-shot learning and fine-tuning setups.


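This paper proposes Omnidirectional Representations from Transformers (OmniNet). Instead of imposing a strictly horizontal receptive field, OmniNet lets every token attend to all tokens in the entire network. This can also be viewed as an extended attention mechanism whose receptive field spans the whole network. The omnidirectional attention is learned via a meta-learner, which is essentially another self-attention-based model. To reduce the heavy computation that full-receptive-field attention entails, we use efficient self-attention models, such as kernel-based attention, low-rank attention, and Big Bird, as the meta-learner. Experiments show that OmniNet performs well on both NLP and vision tasks.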

Training Generative Adversarial Networks in One Stage

Generative Adversarial Networks (GANs) have demonstrated unprecedented success in various image generation tasks. The encouraging results, however, come at the price of a cumbersome training process, during which the generator and discriminator are alternately updated in two stages. In this paper, we investigate a general training scheme that enables training GANs efficiently in only one stage. Based on the adversarial losses of the generator and discriminator, we categorize GANs into two classes, Symmetric GANs and Asymmetric GANs, and introduce a novel gradient decomposition method to unify the two, allowing us to train both classes in one stage and hence alleviate the training effort. Computational analysis and experimental results on several datasets and various network architectures demonstrate that, the proposed one-stage training scheme yields a solid 1.5× acceleration over conventional training schemes, regardless of the network architectures of the generator and discriminator. Furthermore, we show that the proposed method is readily applicable to other adversarial-training scenarios, such as data-free knowledge distillation.



Transformer in Transformer

Transformer is a type of self-attention-based neural networks originally applied for NLP tasks. Recently, pure transformer-based models are proposed to solve computer vision problems. These visual transformers usually view an image as a sequence of patches while they ignore the intrinsic structure information inside each patch. In this paper, we propose a novel Transformer-iN-Transformer (TNT) model for modeling both patch-level and pixel-level representation. In each TNT block, an outer transformer block is utilized to process patch embeddings, and an inner transformer block extracts local features from pixel embeddings. The pixel-level feature is projected to the space of patch embedding by a linear transformation layer and then added into the patch. By stacking the TNT blocks, we build the TNT model for image recognition. Experiments on ImageNet benchmark and downstream tasks demonstrate the superiority and efficiency of the proposed TNT architecture. For example, our TNT achieves 81.3% top-1 accuracy on ImageNet which is 1.5% higher than that of DeiT with similar computational cost.
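
The fusion step described above, where pixel-level features are projected into the patch embedding space and added to the patch, can be sketched as follows. The inner and outer transformer blocks are omitted, and the function name and the shape of the projection matrix `proj` are assumptions for illustration.

```python
def fuse_pixels_into_patch(patch_emb, pixel_embs, proj):
    """Project a patch's pixel embeddings to patch space and add them in.

    patch_emb: patch embedding, a vector of length D.
    pixel_embs: list of p pixel vectors of length c (inner block output).
    proj: hypothetical linear projection, a D x (p * c) matrix.
    """
    # Concatenate all pixel features of this patch into one long vector.
    flat = [v for pix in pixel_embs for v in pix]
    # Linear transformation into the patch embedding space.
    projected = [sum(w * v for w, v in zip(row, flat)) for row in proj]
    # Add the projected pixel-level feature to the patch embedding.
    return [a + b for a, b in zip(patch_emb, projected)]

patch_emb = [1.0, 1.0]
pixel_embs = [[1.0, 2.0], [3.0, 4.0]]
proj = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0]]
fused = fuse_pixels_into_patch(patch_emb, pixel_embs, proj)
```

In a full TNT block the fused patch embeddings would then be passed through the outer transformer block.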



Do We Really Need Explicit Position Encodings for Vision Transformers?

Almost all visual transformers such as ViT or DeiT rely on predefined positional encodings to incorporate the order of each input token. These encodings are often implemented as learnable fixed-dimension vectors or sinusoidal functions of different frequencies, which are not possible to accommodate variable-length input sequences. This inevitably limits a wider application of transformers in vision, where many tasks require changing the input size on-the-fly. 
In this paper, we propose to employ a conditional position encoding scheme, which is conditioned on the local neighborhood of the input token. It is effortlessly implemented as what we call Position Encoding Generator (PEG), which can be seamlessly incorporated into the current transformer framework. Our new model with PEG is named Conditional Position encoding Visual Transformer (CPVT) and can naturally process the input sequences of arbitrary length. We demonstrate that CPVT can result in visually similar attention maps and even better performance than those with predefined positional encodings. We obtain state-of-the-art results on the ImageNet classification task compared with visual Transformers to date.



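In this paper, we propose a conditional position encoding scheme that is conditioned on the local neighborhood of the input token. We implement it as a Position Encoding Generator (PEG), which plugs seamlessly into existing transformer architectures. We name the resulting model the Conditional Position encoding Visual Transformer (CPVT); it can process input sequences of variable length. We show that CPVT yields visually similar attention maps and even better performance than models with predefined positional encodings, and that it obtains SOTA results on the ImageNet classification task.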
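
A position encoding conditioned on a token's local neighborhood can be sketched as a small convolution over the tokens arranged back into their 2D grid. The code below assumes a depthwise 3x3 convolution with zero padding, which the abstract itself does not specify; the function name `peg` and the `kernel` layout are hypothetical.

```python
def peg(tokens, h, w, kernel):
    """Add a position encoding computed from each token's 3x3 neighborhood.

    tokens: h * w token vectors in row-major order, each of length d.
    kernel: hypothetical learned depthwise weights, kernel[c][dy][dx].
    """
    d = len(tokens[0])
    out = []
    for y in range(h):
        for x in range(w):
            enc = [0.0] * d
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    # Zero padding: neighbors outside the grid add nothing.
                    if 0 <= ny < h and 0 <= nx < w:
                        nb = tokens[ny * w + nx]
                        for c in range(d):
                            enc[c] += kernel[c][dy + 1][dx + 1] * nb[c]
            tok = tokens[y * w + x]
            out.append([tok[c] + enc[c] for c in range(d)])
    return out

ones_kernel = [[[1.0] * 3 for _ in range(3)]]  # one channel, all-ones 3x3
small = peg([[1.0]] * 4, 2, 2, ones_kernel)    # 2x2 grid of tokens
large = peg([[1.0]] * 9, 3, 3, ones_kernel)    # 3x3 grid, same weights
```

The same weights handle both grid sizes, since nothing in the encoding depends on the sequence length. In the 3x3 case, identical input tokens receive different encodings at the corners, edges, and center because the padded border changes how many neighbors contribute.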

LambdaNetworks: Modeling Long-Range Interactions Without Attention

We present lambda layers — an alternative framework to self-attention — for capturing long-range interactions between an input and structured contextual information (e.g. a pixel surrounded by other pixels). Lambda layers capture such interactions by transforming available contexts into linear functions, termed lambdas, and applying these linear functions to each input separately. Similar to linear attention, lambda layers bypass expensive attention maps, but in contrast, they model both content and position-based interactions which enables their application to large structured inputs such as images. The resulting neural network architectures, LambdaNetworks, significantly outperform their convolutional and attentional counterparts on ImageNet classification, COCO object detection and COCO instance segmentation, while being more computationally efficient. Additionally, we design LambdaResNets, a family of hybrid architectures across different scales, that considerably improves the speed-accuracy tradeoff of image classification models. LambdaResNets reach excellent accuracies on ImageNet while being 3.2 – 4.4x faster than the popular EfficientNets on modern machine learning accelerators. When training with an additional 130M pseudo-labeled images, LambdaResNets achieve up to a 9.5x speed-up over the corresponding EfficientNet checkpoints.
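
The idea of turning a context into a linear function and applying it to each input can be sketched for the content-based part alone. The code below omits the position-based lambdas, and the function names and the softmax normalization of the keys across context positions are assumptions made for this illustration.

```python
import math

def content_lambda(keys, values):
    """Summarize a context of m elements into a k x v linear map (a lambda).

    keys: m vectors of length k; values: m vectors of length v.
    Keys are softmax-normalized across the m context positions.
    """
    m, k, v = len(keys), len(keys[0]), len(values[0])
    lam = [[0.0] * v for _ in range(k)]
    for a in range(k):
        col = [keys[i][a] for i in range(m)]
        mx = max(col)
        exps = [math.exp(s - mx) for s in col]
        z = sum(exps)
        for i in range(m):
            for b in range(v):
                lam[a][b] += (exps[i] / z) * values[i][b]
    return lam

def apply_lambda(lam, query):
    """Apply the lambda to one query of length k; returns a length-v output."""
    return [sum(query[a] * lam[a][b] for a in range(len(lam)))
            for b in range(len(lam[0]))]

keys = [[0.0, 0.0], [0.0, 0.0]]    # uniform keys: each lambda row is the mean value
values = [[1.0, 3.0], [3.0, 5.0]]
lam = content_lambda(keys, values)  # built once per context
outputs = [apply_lambda(lam, q) for q in ([1.0, 0.0], [1.0, 1.0])]
```

The lambda is computed once from the context and then reused for every query, so no attention map between queries and context positions is ever materialized.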



Bottleneck Transformers for Visual Recognition

We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and no other changes, our approach improves upon the baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency. Through the design of BoTNet, we also point out how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without any bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO Instance Segmentation benchmark using the Mask R-CNN framework; surpassing the previous best published single model and single scale results of ResNeSt evaluated on the COCO validation set. Finally, we present a simple adaptation of the BoTNet design for image classification, resulting in models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark while being up to 2.33x faster in compute time than the popular EfficientNet models on TPU-v3 hardware. We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.


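We present BoTNet, a simple yet powerful self-attention-based backbone architecture that applies to many computer vision tasks, including image classification, object detection, and instance segmentation. Merely replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet yields significant gains on instance segmentation and object detection while also reducing the number of parameters. Through BoTNet, we also show how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without any bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO instance segmentation benchmark using the Mask R-CNN framework…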

Visual Concept Reasoning Networks

A split-transform-merge strategy has been broadly used as an architectural constraint in convolutional neural networks for visual recognition tasks. It approximates sparsely connected networks by explicitly defining multiple branches to simultaneously learn representations with different visual concepts or properties. Dependencies or interactions between these representations are typically defined by dense and local operations, however, without any adaptiveness or high-level reasoning. In this work, we propose to exploit this strategy and combine it with our Visual Concept Reasoning Networks (VCRNet) to enable reasoning between high-level visual concepts. We associate each branch with a visual concept and derive a compact concept state by selecting a few local descriptors through an attention module. These concept states are then updated by graph-based interaction and used to adaptively modulate the local descriptors. We describe our proposed model by split-transform-attend-interact-modulate-merge stages, which are implemented by opting for a highly modularized architecture. Extensive experiments on visual recognition tasks such as image classification, semantic segmentation, object detection, scene recognition, and action recognition show that our proposed model, VCRNet, consistently improves the performance by increasing the number of parameters by less than 1%.


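The split-transform-merge strategy has been widely used in the design of convolutional neural networks for visual recognition. It approximates sparsely connected networks with a set of explicitly defined branches that learn, in parallel, representations of different visual concepts or properties. Dependencies and interactions between these representations are defined by dense, local operations, without any adaptiveness or high-level reasoning. This work combines the strategy with Visual Concept Reasoning Networks (VCRNet) so that it can reason between high-level visual concepts. Each branch is associated with one visual concept, and an attention module selects a few local descriptors to derive a compact concept state. These concept states are updated through graph-based interaction and then used to adaptively modulate the local descriptors. The proposed model consists of split-transform-attend-interact-modulate-merge stages, implemented as a highly modularized architecture. Experiments show that the model performs well across a range of visual recognition tasks.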