Tag archive: Transformer

Do Transformers Really Perform Bad for Graph Representation?

The Transformer architecture has become a dominant choice in many domains, such as natural language processing and computer vision. Yet, it has not achieved competitive performance on popular leaderboards of graph-level prediction compared to mainstream GNN variants. Therefore, it remains a mystery how Transformers could perform well for graph representation learning. In this paper, we solve this mystery by presenting Graphormer, which is built upon the standard Transformer architecture, and could attain excellent results on a broad range of graph representation learning tasks, especially on the recent OGB Large-Scale Challenge. Our key insight to utilizing Transformer in the graph is the necessity of effectively encoding the structural information of a graph into the model. To this end, we propose several simple yet effective structural encoding methods to help Graphormer better model graph-structured data. Besides, we mathematically characterize the expressive power of Graphormer and exhibit that with our ways of encoding the structural information of graphs, many popular GNN variants could be covered as the special cases of Graphormer.

https://arxiv.org/abs/2106.05234

The Transformer architecture has become one of the dominant models for tasks such as natural language processing and computer vision. However, it has not yet shown competitive performance on graph-level prediction leaderboards, so applying Transformers to graph representation learning remains an open problem. In this paper, we present Graphormer, an architecture built on the standard Transformer that achieves excellent performance on graph representation learning tasks. The key is how to effectively encode the structural information of a graph into the model. To this end, we propose several simple yet effective structural encodings that help Graphormer better model graph-structured data. In addition, we mathematically characterize the expressive power of Graphormer and our encodings, showing that many popular GNN variants can be expressed as special cases of Graphormer.
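
To make the structural-encoding idea concrete, here is a minimal PyTorch sketch of a spatial-encoding bias: a learnable scalar, indexed by the shortest-path distance between two nodes, is added to the attention scores before the softmax. The class name, the single-head layout, and the `max_dist` clipping are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SpatialEncodingAttention(nn.Module):
    def __init__(self, dim, max_dist=20):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        # One learnable scalar bias per shortest-path-distance bucket.
        self.dist_bias = nn.Embedding(max_dist + 1, 1)
        self.scale = dim ** -0.5

    def forward(self, x, spd):
        # x: (n_nodes, dim) node features; spd: (n_nodes, n_nodes) integer
        # shortest-path distances, assumed precomputed and clipped to max_dist.
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = (q @ k.t()) * self.scale                    # (n, n)
        scores = scores + self.dist_bias(spd).squeeze(-1)    # add structural bias
        return scores.softmax(dim=-1) @ v

x = torch.randn(5, 32)                 # 5 nodes, 32-dim features
spd = torch.randint(0, 4, (5, 5))      # toy shortest-path distances
out = SpatialEncodingAttention(32)(x, spd)   # (5, 32)
```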

Not All Images are Worth 16×16 Words: Dynamic Vision Transformers with Adaptive Sequence Length

Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens would lead to higher prediction accuracy, while it also results in drastically increased computational cost. To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16×16. In this paper, we argue that every image has its own characteristics, and ideally the token number should be conditioned on each individual input. In fact, we have observed that there exist a considerable number of “easy” images which can be accurately predicted with a mere number of 4×4 tokens, while only a small fraction of “hard” ones need a finer representation. Inspired by this phenomenon, we propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image. This is achieved by cascading multiple Transformers with increasing numbers of tokens, which are sequentially activated in an adaptive fashion at test time, i.e., the inference is terminated once a sufficiently confident prediction is produced. We further design efficient feature reuse and relationship reuse mechanisms across different components of the Dynamic Transformer to reduce redundant computations. Extensive empirical results on ImageNet, CIFAR-10, and CIFAR-100 demonstrate that our method significantly outperforms the competitive baselines in terms of both theoretical computational efficiency and practical inference speed.

https://arxiv.org/abs/2105.15075

ViT has achieved remarkable success in large-scale image recognition. It splits a 2D image into a fixed number of patches, each treated as a token. In general, using more tokens improves accuracy, but it also sharply increases the computational cost. To balance accuracy and efficiency, the number of tokens is usually set to 16×16. In this paper, we argue that every image has its own characteristics, so the number of tokens should depend on the individual input. In fact, we observe that a considerable number of images can be represented well with only 4×4 tokens, while only a small fraction of hard images need a finer representation. Motivated by this observation, we propose a Dynamic Transformer that automatically decides a suitable number of tokens for each image. The architecture cascades multiple Transformers with increasing token numbers, which are activated sequentially and adaptively at test time; inference stops once a sufficiently confident prediction is produced. We also design efficient feature and relationship reuse mechanisms to cut redundant computation. Experiments show that our model obtains a strong trade-off between efficiency and classification accuracy on several public datasets.
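
The adaptive part of the pipeline is essentially confidence-based early exiting. Below is a minimal sketch of that inference loop, assuming a list of already-built classifiers ordered from fewest to most tokens; the stand-in models, the threshold value, and the omitted feature/relationship reuse are placeholders rather than the paper's implementation.

```python
import torch

@torch.no_grad()
def dynamic_inference(image, cascaded_models, threshold=0.9):
    """Run cheaper models first; stop as soon as one is confident enough."""
    for model in cascaded_models:          # ordered: fewest tokens first
        logits = model(image)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:       # confident enough: exit early
            break
    return pred.item(), conf.item()

# Toy usage with stand-in "models" that just return random logits for 10 classes.
dummy_models = [lambda img: torch.randn(10) for _ in range(3)]
print(dynamic_inference(torch.randn(3, 224, 224), dummy_models))
```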

Are Pre-trained Convolutions Better than Pre-trained Transformers?

In the era of pre-trained language models, Transformers are the de facto choice of model architectures. While recent research has shown promise in entirely convolutional, or CNN, architectures, they have not been explored using the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive to Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterpart in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that both advances should be considered independently. We believe our research paves the way for a healthy amount of optimism in alternative architectures.

https://arxiv.org/abs/2105.03322

Transformers have become the default choice for natural language processing tasks. However, purely convolutional architectures have not been well studied under the pre-train-and-fine-tune paradigm. Given sufficient pre-training, can convolutional networks achieve performance comparable to Transformers? This paper compares CNNs and Transformers on eight datasets/tasks and finds that, in certain scenarios, CNN-based pre-trained models are competitive with and can even outperform Transformer-based pre-trained models.
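
For intuition on what "convolutional instead of attention-based" means here, the sketch below is a generic depthwise-separable 1D convolution block that mixes information along the token sequence. It is only a rough illustration under assumed hyperparameters, not one of the specific convolution variants evaluated in the paper.

```python
import torch
import torch.nn as nn

class ConvSeqBlock(nn.Module):
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Depthwise conv mixes information along the sequence axis ...
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        # ... pointwise conv then mixes information across channels.
        self.pointwise = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                  # x: (batch, seq_len, dim)
        y = self.norm(x).transpose(1, 2)   # (batch, dim, seq_len) for Conv1d
        y = self.pointwise(torch.relu(self.depthwise(y)))
        return x + y.transpose(1, 2)       # residual connection

x = torch.randn(2, 128, 256)               # batch of 2, sequences of 128 tokens
print(ConvSeqBlock(256)(x).shape)          # torch.Size([2, 128, 256])
```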

Escaping the Big Data Paradigm with Compact Transformers


With the rise of Transformers as the standard for language processing, and their advancements in computer vision, along with their unprecedented size and amounts of training data, many have come to believe that they are not suitable for small sets of data. This trend leads to great concerns, including but not limited to: limited availability of data in certain scientific domains and the exclusion of those with limited resource from research in the field. In this paper, we dispel the myth that transformers are “data hungry” and therefore can only be applied to large sets of data. We show for the first time that with the right size and tokenization, transformers can perform head-to-head with state-of-the-art CNNs on small datasets. Our model eliminates the requirement for class token and positional embeddings through a novel sequence pooling strategy and the use of convolutions. We show that compared to CNNs, our compact transformers have fewer parameters and MACs, while obtaining similar accuracies. Our method is flexible in terms of model size, and can have as little as 0.28M parameters and achieve reasonable results. It can reach an accuracy of 94.72% when training from scratch on CIFAR-10, which is comparable with modern CNN based approaches, and a significant improvement over previous Transformer based models. Our simple and compact design democratizes transformers by making them accessible to those equipped with basic computing resources and/or dealing with important small datasets.

https://arxiv.org/abs/2104.05704

Conventional ViT models need large amounts of training data. To address this, we propose the CCT architecture, which can be trained on small amounts of data while matching the performance of CNNs. Our model removes the dependence on a class token and positional embeddings through a novel sequence pooling strategy. Experiments show that our model reaches accuracy comparable to state-of-the-art models with fewer parameters and faster inference.
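
The sequence pooling mentioned above fits in a few lines: instead of reading out a class token, the encoder's output tokens are combined with a learned softmax weighting. A minimal sketch, with illustrative names and shapes:

```python
import torch
import torch.nn as nn

class SeqPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, 1)      # scores one importance weight per token

    def forward(self, tokens):             # tokens: (batch, n_tokens, dim)
        weights = self.attn(tokens).softmax(dim=1)             # (batch, n_tokens, 1)
        return (weights.transpose(1, 2) @ tokens).squeeze(1)   # (batch, dim)

tokens = torch.randn(4, 64, 256)           # encoder output, no class token
pooled = SeqPool(256)(tokens)              # (4, 256), fed to the classifier head
```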

Can Vision Transformers Learn without Natural Images?

Can we complete pre-training of Vision Transformers (ViT) without natural images and human-annotated labels? Although a pre-trained ViT seems to heavily rely on a large-scale dataset and human-annotated labels, recent large-scale datasets contain several problems in terms of privacy violations, inadequate fairness protection, and labor-intensive annotation. In the present paper, we pre-train ViT without any image collections and annotation labor. We experimentally verify that our proposed framework partially outperforms sophisticated Self-Supervised Learning (SSL) methods like SimCLRv2 and MoCov2 without using any natural images in the pre-training phase. Moreover, although the ViT pre-trained without natural images produces some different visualizations from ImageNet pre-trained ViT, it can interpret natural image datasets to a large extent. For example, the performance rates on the CIFAR-10 dataset are as follows: our proposal 97.6 vs. SimCLRv2 97.4 vs. ImageNet 98.0.

https://arxiv.org/abs/2103.13023

Can we pre-train Vision Transformers without natural images and human annotations? Although ViT pre-training appears to depend heavily on large-scale datasets and human-annotated labels, recent large-scale datasets suffer from privacy violations, inadequate fairness protection, and labor-intensive annotation. In this paper, we pre-train ViT without any large-scale annotated natural-image data. We verify that our framework partially outperforms self-supervised learning methods even though no natural images are used during pre-training. Moreover, although our ViT pre-trained without natural images produces somewhat different visualizations from an ImageNet pre-trained ViT, it can still interpret natural image datasets to a large extent.

An Image is Worth 16×16 Words, What is a Video Worth?

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy, usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach 78.8 top-1 accuracy with ×30 less frames per video, and ×40 faster inference than the current leading method. 

https://arxiv.org/abs/2103.13915

Leading methods in action recognition extract both spatial and temporal information from a video. Methods that reach SOTA accuracy usually rely on 3D convolutional layers to capture temporal information. Using such convolutions requires cutting the input video into short clips, each consisting of closely sampled frames. Because every clip covers only a small fraction of the video, multiple clips must be sampled at inference to cover its whole temporal extent, which increases the computational load and makes real-world deployment impractical. We address this bottleneck by greatly reducing the number of frames required for inference. Our approach uses a temporal transformer that applies global attention over video frames and thus better exploits the salient information in each frame. As a result, it is highly input-efficient and reaches SOTA performance with a fraction of the data, computation, and latency.
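
A hedged sketch of the overall recipe described above: each sampled frame is embedded independently (by a 2D backbone, omitted here), and a temporal transformer attends globally across the resulting frame tokens before classification. The layer sizes, the mean pooling over time, and the use of PyTorch's stock TransformerEncoder are assumptions for illustration, not the paper's released model.

```python
import torch
import torch.nn as nn

class TemporalTransformerHead(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_layers=2, n_classes=400):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, frame_feats):        # frame_feats: (batch, n_frames, dim)
        mixed = self.temporal(frame_feats) # global attention across frames
        return self.classifier(mixed.mean(dim=1))  # average over time

feats = torch.randn(2, 16, 512)            # 16 per-frame embeddings per video
logits = TemporalTransformerHead()(feats)  # (2, 400)
```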

DeepViT: Towards Deeper Vision Transformer

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolution neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturate fast when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting expected performance gain. Based on above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modification to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet.

https://arxiv.org/abs/2103.11886

Recently, vision transformers (ViTs) have been successfully applied to image classification. In this paper, we find that, unlike CNNs, whose performance can be improved by stacking more convolutional layers, ViTs saturate quickly as they are made deeper. We observe that this difficulty is caused by attention collapse: as the transformer gets deeper, the attention maps after certain layers become increasingly similar and even identical. In other words, the feature maps in the top layers of deep ViTs tend to be the same. This finding shows that in the deeper layers of ViTs, the self-attention mechanism fails to learn effective features for representation learning and therefore cannot deliver the expected performance gain. Based on this observation, we propose a simple yet effective method called Re-attention, which restores the diversity of the attention maps at different layers with only a small amount of extra computation and memory. The proposed method makes it feasible to train deeper ViT models while maintaining performance improvements; in particular, our model with 32 transformer blocks gains 1.6% Top-1 accuracy on ImageNet.
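
A minimal sketch of the Re-attention idea: the per-head attention maps are mixed by a learnable head-to-head matrix before being applied to the values, which helps keep them diverse across layers. The normalization the paper applies after the mixing is omitted, and all names and shapes are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.theta = nn.Parameter(torch.eye(n_heads))  # learnable head-mixing matrix
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                  # x: (batch, n_tokens, dim)
        b, n, _ = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.n_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)
        # Re-attention: mix the per-head attention maps with the learned matrix
        # (the extra normalization of the mixed maps is omitted here).
        attn = torch.einsum('gh,bhij->bgij', self.theta, attn)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.proj(out)

x = torch.randn(2, 197, 384)
print(ReAttention(384)(x).shape)           # torch.Size([2, 197, 384])
```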

Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision

Transformer architectures have brought about fundamental changes to computational linguistic field, which had been dominated by recurrent neural networks for many years. Its success also implies drastic changes in cross-modal tasks with language and vision, and many researchers have already tackled the issue. In this paper, we review some of the most critical milestones in the field, as well as overall trends on how transformer architecture has been incorporated into visuolinguistic cross-modal tasks. Furthermore, we discuss its current limitations and speculate upon some of the prospects that we find imminent.

https://arxiv.org/abs/2103.04037

The Transformer architecture has brought fundamental changes to computational linguistics, overturning the long dominance of recurrent neural networks. Its success also signals significant changes in vision-and-language cross-modal tasks, and many researchers are already working on these problems. In this paper, we review some of the milestones in the field and the overall trends in how the transformer architecture has been incorporated into visuolinguistic cross-modal tasks. We then discuss the current limitations of the transformer architecture and offer an outlook on the future.

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards “token uniformity”. Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.

https://arxiv.org/abs/2103.03404

Attention-based architectures are now widely used in machine learning, yet the source of their effectiveness remains unclear. In this paper, we study self-attention networks from a new angle: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention has a strong inductive bias toward "token uniformity". In particular, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix; conversely, skip connections and MLPs prevent this degeneration. Our experiments verify the identified convergence phenomenon on several variants of standard transformer architectures.
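
The rank-collapse claim is easy to probe numerically. The toy experiment below repeatedly applies single-head, random-weight self-attention with no skip connections or MLPs and tracks the relative residual to the best rank-1 approximation. It is an illustrative setup under assumed sizes and initialization, not the paper's exact experiment, but the residual typically shrinks very quickly with depth.

```python
import torch

def rank1_residual(x):
    # Relative distance between x and its best rank-1 approximation (via SVD).
    u, s, vh = torch.linalg.svd(x, full_matrices=False)
    rank1 = s[0] * torch.outer(u[:, 0], vh[0])
    return ((x - rank1).norm() / x.norm()).item()

torch.manual_seed(0)
n, d = 32, 64
x = torch.randn(n, d)                       # n tokens of dimension d
for depth in range(1, 11):
    wq, wk, wv = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
    attn = ((x @ wq) @ (x @ wk).t() / d ** 0.5).softmax(dim=-1)
    x = attn @ (x @ wv)                     # pure attention: no skip, no MLP
    print(f"depth {depth:2d}  rank-1 residual {rank1_residual(x):.2e}")
```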

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

Medical image segmentation is an essential prerequisite for developing healthcare systems, especially for disease diagnosis and treatment planning. On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard and achieved tremendous success. However, due to the intrinsic locality of convolution operations, U-Net generally demonstrates limitations in explicitly modeling long-range dependency. Transformers, designed for sequence-to-sequence prediction, have emerged as alternative architectures with innate global self-attention mechanisms, but can result in limited localization abilities due to insufficient low-level details. In this paper, we propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation. On one hand, the Transformer encodes tokenized image patches from a convolution neural network (CNN) feature map as the input sequence for extracting global contexts. On the other hand, the decoder upsamples the encoded features which are then combined with the high-resolution CNN feature maps to enable precise localization. 
We argue that Transformers can serve as strong encoders for medical image segmentation tasks, with the combination of U-Net to enhance finer details by recovering localized spatial information. TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation.

https://arxiv.org/abs/2102.04306

Medical image segmentation is an essential prerequisite for healthcare systems, especially for disease diagnosis and treatment planning. Across many medical image segmentation tasks, U-Net-like architectures have become the standard approach and achieved tremendous success. However, because convolution operations are inherently local, U-Net struggles to model long-range dependencies. Transformers, which were designed for sequence-to-sequence prediction, provide an innate global self-attention mechanism, but insufficient low-level detail can limit their localization ability. In this paper, we propose TransUNet, an effective medical image segmentation method that combines the merits of Transformers and U-Net. On the one hand, the Transformer encodes tokenized CNN feature maps to extract global context. On the other hand, the decoder upsamples the encoded features and fuses them with high-resolution CNN feature maps, ensuring accurate localization at both global and local scales. TransUNet achieves strong performance on several medical applications, including multi-organ and cardiac segmentation.
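
A minimal sketch of the hybrid encoder idea: a small CNN produces a low-resolution feature map, its spatial positions are flattened into tokens for a transformer encoder, and the output is reshaped back into a feature map for an upsampling decoder (the decoder and skip connections are omitted). All module sizes and the use of PyTorch's stock TransformerEncoder are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    def __init__(self, dim=256, n_heads=8, n_layers=2):
        super().__init__()
        self.cnn = nn.Sequential(                      # downsamples by 8x
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, img):                            # img: (b, 3, H, W)
        feat = self.cnn(img)                           # (b, dim, H/8, W/8)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)       # (b, h*w, dim)
        tokens = self.transformer(tokens)              # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)  # back to a feature map

feat_map = HybridEncoder()(torch.randn(1, 3, 224, 224))
print(feat_map.shape)                                  # torch.Size([1, 256, 28, 28])
```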