Tag archive: Vision Transformer

Zero-Shot Text-to-Image Generation

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.

https://arxiv.org/abs/2102.12092

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions may involve complex network architectures, auxiliary loss functions, or side information that aids training, such as object part labels or segmentation masks. We propose a simple transformer-based autoregressive text-to-image model that treats text and image tokens as a single stream of data. Trained with sufficient data, our approach matches the performance of existing domain-specific methods when evaluated zero-shot.
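
To make the single-stream idea concrete, here is a minimal PyTorch sketch of a decoder-only transformer that models concatenated text and image tokens autoregressively. It assumes the image tokens already come from a discrete codebook (the paper obtains them with a discrete VAE); the class name, vocabulary sizes, and layer counts below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TextImageAutoregressor(nn.Module):
    """Minimal decoder-only transformer over a single stream of
    text tokens followed by image tokens (sizes are illustrative)."""
    def __init__(self, text_vocab=16384, image_vocab=8192,
                 text_len=256, image_len=1024, dim=512, heads=8, layers=6):
        super().__init__()
        self.vocab = text_vocab + image_vocab        # shared token space
        self.seq_len = text_len + image_len
        self.tok_emb = nn.Embedding(self.vocab, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, self.seq_len, dim))
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, self.vocab)

    def forward(self, tokens):                       # tokens: (B, T) int64
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        x = self.tok_emb(tokens) + self.pos_emb[:, :T]
        x = self.blocks(x, mask=causal)              # causal attention over the stream
        return self.head(x)                          # next-token logits

# usage: image tokens are predicted conditioned on the preceding text tokens
model = TextImageAutoregressor()
stream = torch.randint(0, 16384 + 8192, (2, 1280))
logits = model(stream)                               # (2, 1280, 24576)
```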

TransGAN: Two Transformers Can Make One Strong GAN

The recent explosive interest in transformers has suggested their potential to become powerful “universal” models for computer vision tasks, such as classification, detection, and segmentation. However, how much further can transformers go – are they ready to take some more notoriously difficult vision tasks, e.g., generative adversarial networks (GANs)? Driven by that curiosity, we conduct the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures. Our vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator that progressively increases feature resolution while decreasing embedding dimension, and a patch-level discriminator that is also transformer-based. We then demonstrate TransGAN to notably benefit from data augmentations (more than standard GANs), a multi-task co-training strategy for the generator, and a locally initialized self-attention that emphasizes the neighborhood smoothness of natural images. Equipped with those findings, TransGAN can effectively scale up with bigger models and high-resolution image datasets. Specifically, our best architecture achieves highly competitive performance compared to current state-of-the-art GANs based on convolutional backbones. In particular, TransGAN sets new state-of-the-art IS score of 10.10 and FID score of 25.32 on STL-10. It also reaches competitive 8.64 IS score and 11.89 FID score on Cifar-10, and 12.23 FID score on CelebA 64×64, respectively. We also conclude with a discussion of the current limitations and future potential of TransGAN.

https://arxiv.org/abs/2102.07074v2

The recent explosive interest in transformers has shown their potential to become universal models for computer vision tasks such as classification, detection, and segmentation. But how far can transformers go? Can they already handle harder vision tasks such as GANs? Driven by this curiosity, we built the first completely convolution-free GAN, composed entirely of transformers. Our architecture, called TransGAN, consists of a memory-friendly transformer-based generator that progressively increases feature resolution while decreasing the embedding dimension, and a patch-level transformer-based discriminator. We then show that TransGAN benefits from data augmentation more than other GANs, propose a multi-task co-training strategy that improves generator training, and use a locally initialized self-attention that captures the neighborhood smoothness of natural images. With these findings, TransGAN scales to larger models and higher-resolution datasets, and experiments show that it achieves state-of-the-art performance.
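
As a rough illustration of the generator design (resolution goes up while the embedding dimension goes down), here is a sketch of one hypothetical generator stage in PyTorch. PixelShuffle is used here as a stand-in for the paper's upsampling scheme, and the class name and sizes are made up for the example.

```python
import torch
import torch.nn as nn

class UpStage(nn.Module):
    """One illustrative TransGAN-style generator stage: transformer blocks
    over a token grid, then 2x spatial upsampling with a reduced embedding
    dimension (here via PixelShuffle, which trades channels for resolution)."""
    def __init__(self, dim, heads=4, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.upsample = nn.PixelShuffle(2)           # (B, C, H, W) -> (B, C/4, 2H, 2W)

    def forward(self, tokens, hw):
        B, N, C = tokens.shape
        H = W = hw
        x = self.blocks(tokens)                      # attend over the token grid
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.upsample(x)                         # resolution up, embedding dim down
        return x.flatten(2).transpose(1, 2), H * 2   # back to (B, 4N, C/4)

# usage: an 8x8 grid of 256-d tokens becomes a 16x16 grid of 64-d tokens
stage = UpStage(dim=256)
tokens, hw = stage(torch.randn(2, 64, 256), hw=8)
print(tokens.shape, hw)                              # torch.Size([2, 256, 64]) 16
```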

TransReID: Transformer-based Object Re-Identification

In this paper, we explore the Vision Transformer (ViT), a pure transformer-based model, for the object re-identification (ReID) task. With several adaptations, a strong baseline ViT-BoT is constructed with ViT as backbone, which achieves comparable results to convolutional neural network (CNN) based frameworks on several ReID benchmarks. Furthermore, two modules are designed in consideration of the specialties of ReID data: (1) It is super natural and simple for Transformer to encode non-visual information such as camera or viewpoint into vector embedding representations. Plugging into these embeddings, ViT holds the ability to eliminate the bias caused by diverse cameras or viewpoints. (2) We design a Jigsaw branch, parallel with the Global branch, to facilitate the training of the model in a two-branch learning framework. In the Jigsaw branch, a jigsaw patch module is designed to learn robust feature representation and help the training of transformer by shuffling the patches. With these novel modules, we propose a pure-transformer framework dubbed as TransReID, which is the first work to use a pure Transformer for ReID research to the best of our knowledge. Experimental results of TransReID are superior and promising, which achieve state-of-the-art performance on both person and vehicle ReID benchmarks.

https://arxiv.org/abs/2102.04378

In this paper we explore the Vision Transformer (ViT) and propose a pure transformer-based model for object re-identification (ReID). With several adaptations, we construct the ViT-BoT baseline with ViT as the backbone and achieve performance comparable to CNN-based frameworks on several ReID benchmarks. We design two modules specifically for ReID data: (1) non-visual information such as camera or viewpoint can be encoded into the embedding space together with visual information, and by learning from these embeddings the model can eliminate the bias caused by different cameras and viewpoints; (2) we design a Jigsaw branch, parallel to the Global branch, to facilitate training. In the Jigsaw branch, a jigsaw patch module learns robust feature representations and helps train the transformer by shuffling the patches. With these modules we propose TransReID, the first model to tackle ReID with a pure transformer. Experiments demonstrate its superiority, with state-of-the-art performance on both person and vehicle ReID benchmarks.
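
The two ReID-specific ideas are easy to sketch: adding a learnable embedding for non-visual side information to every patch token, and shuffling patches before grouping them for the Jigsaw branch. The snippet below is a simplified illustration, not the paper's implementation; `SideInfoEmbedding`, `jigsaw_shuffle`, and all sizes are hypothetical.

```python
import torch
import torch.nn as nn

class SideInfoEmbedding(nn.Module):
    """Add a learnable embedding for non-visual side information
    (camera / viewpoint ID) to every patch token."""
    def __init__(self, num_cameras, dim):
        super().__init__()
        self.cam_emb = nn.Embedding(num_cameras, dim)

    def forward(self, patch_tokens, cam_ids):        # (B, N, D), (B,)
        return patch_tokens + self.cam_emb(cam_ids).unsqueeze(1)

def jigsaw_shuffle(patch_tokens, groups=4):
    """Shuffle the patch order, then split the patches into groups for
    local features, roughly in the spirit of the jigsaw patch module."""
    B, N, D = patch_tokens.shape
    perm = torch.randperm(N)
    shuffled = patch_tokens[:, perm]
    return shuffled.chunk(groups, dim=1)             # list of (B, N/groups, D)

# usage with made-up sizes
tokens = torch.randn(2, 128, 768)
tokens = SideInfoEmbedding(num_cameras=6, dim=768)(tokens, torch.tensor([0, 3]))
local_groups = jigsaw_shuffle(tokens)
print(len(local_groups), local_groups[0].shape)      # 4 torch.Size([2, 32, 768])
```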

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

Vision-and-Language Pretraining (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches for VLP heavily rely on image feature extraction processes, most of which involve region supervisions (e.g., object detection) and the convolutional architecture (e.g., ResNet). Although disregarded in the literature, we find it problematic in terms of both (1) efficiency/speed, that simply extracting input features requires much more computation than the actual multimodal interaction steps; and (2) expressive power, as it is upper bounded to the expressive power of the visual encoder and its predefined visual vocabulary. In this paper, we present a minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic in the sense that processing of visual inputs is drastically simplified to just the same convolution-free manner that we process textual inputs. We show that ViLT is up to 60 times faster than previous VLP models, yet with competitive or better downstream task performance.

https://arxiv.org/abs/2102.03334

Vision-and-Language Pretraining (VLP) improves performance on downstream tasks that combine vision and language. Current methods rely heavily on the image feature extraction process, and most involve region supervision (e.g., object detection) and convolutional architectures (e.g., ResNet). These methods overlook two problems: (1) efficiency/speed, since simply extracting the input features costs far more computation than the multimodal fusion itself; and (2) expressive power, since such a model is upper-bounded by the expressive power of the visual encoder and its predefined visual vocabulary. In this paper we present a minimal VLP model, the Vision-and-Language Transformer (ViLT), which processes visual inputs in the same convolution-free way as text, greatly simplifying the pipeline. Our model is up to 60 times faster than previous VLP models with competitive or better performance on downstream tasks.
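
A minimal sketch of the convolution-free design: flattened image patches are linearly projected exactly like word embeddings and fed, together with the text tokens, into a single transformer. The class below is an illustrative toy with made-up vocabulary size, patch size, and depth, not the released ViLT model.

```python
import torch
import torch.nn as nn

class MinimalViLT(nn.Module):
    """Image patches are linearly projected just like word embeddings,
    then one transformer processes the concatenated multimodal sequence."""
    def __init__(self, vocab=30522, patch_dim=3 * 32 * 32, dim=384, heads=6, depth=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, dim)
        self.patch_proj = nn.Linear(patch_dim, dim)   # replaces any CNN/region backbone
        self.type_emb = nn.Embedding(2, dim)          # 0 = text, 1 = image
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, text_ids, patches):             # (B, T), (B, N, patch_dim)
        t = self.word_emb(text_ids) + self.type_emb.weight[0]
        v = self.patch_proj(patches) + self.type_emb.weight[1]
        return self.encoder(torch.cat([t, v], dim=1)) # joint multimodal attention

model = MinimalViLT()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 49, 3 * 32 * 32))
print(out.shape)                                      # torch.Size([2, 65, 384])
```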

Bottleneck Transformers for Visual Recognition

We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and no other changes, our approach improves upon the baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency. Through the design of BoTNet, we also point out how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without any bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO Instance Segmentation benchmark using the Mask R-CNN framework; surpassing the previous best published single model and single scale results of ResNeSt evaluated on the COCO validation set. Finally, we present a simple adaptation of the BoTNet design for image classification, resulting in models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark while being up to 2.33x faster in compute time than the popular EfficientNet models on TPU-v3 hardware. We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.

https://arxiv.org/abs/2101.11605

We present BoTNet, a simple yet effective self-attention-based backbone that can be widely used across computer vision tasks, including image classification, object detection, and instance segmentation. Simply replacing the spatial convolutions in the bottleneck blocks of a ResNet with global self-attention yields significant gains on instance segmentation and object detection while reducing parameters. Through BoTNet we also show how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO instance segmentation benchmark with Mask R-CNN as the base framework…
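
The core change is small enough to sketch: inside a ResNet bottleneck block, the 3×3 spatial convolution is replaced by global multi-head self-attention over the feature map. The block below is a simplified PyTorch illustration that omits the paper's relative position encodings and exact normalization layout; names and channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class BotBlock(nn.Module):
    """Bottleneck block where the 3x3 spatial convolution is replaced by
    global multi-head self-attention over all spatial positions."""
    def __init__(self, in_ch, bottleneck_ch, heads=4):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, bottleneck_ch, 1, bias=False)
        self.attn = nn.MultiheadAttention(bottleneck_ch, heads, batch_first=True)
        self.expand = nn.Conv2d(bottleneck_ch, in_ch, 1, bias=False)
        self.norm = nn.BatchNorm2d(in_ch)

    def forward(self, x):                             # (B, C, H, W)
        B, _, H, W = x.shape
        h = self.reduce(x)                            # 1x1 conv down
        tokens = h.flatten(2).transpose(1, 2)         # (B, H*W, C')
        tokens, _ = self.attn(tokens, tokens, tokens) # global self-attention
        h = tokens.transpose(1, 2).reshape(B, -1, H, W)
        return torch.relu(self.norm(self.expand(h)) + x)  # 1x1 conv up + residual

block = BotBlock(in_ch=2048, bottleneck_ch=512)
print(block(torch.randn(1, 2048, 14, 14)).shape)      # torch.Size([1, 2048, 14, 14])
```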

Video Transformer Network

This paper presents VTN, a transformer-based framework for video recognition. Inspired by recent developments in vision transformers, we ditch the standard approach in video action recognition that relies on 3D ConvNets and introduce a method that classifies actions by attending to the entire video sequence information. Our approach is generic and builds on top of any given 2D spatial network. In terms of wall runtime, it trains 16.1× faster and runs 5.1× faster during inference while maintaining competitive accuracy compared to other state-of-the-art methods. It enables whole video analysis, via a single end-to-end pass, while requiring 1.5× fewer GFLOPs. We report competitive results on Kinetics-400 and present an ablation study of VTN properties and the trade-off between accuracy and inference speed. We hope our approach will serve as a new baseline and start a fresh line of research in the video recognition domain.

https://arxiv.org/abs/2102.00719

In this paper we present VTN, a transformer-based framework for video recognition. Inspired by recent vision transformers, we move away from existing action recognition methods that rely on 3D ConvNets and introduce a method that classifies actions by attending to information from the entire video sequence. Our approach can be built on top of any 2D spatial network. It trains 16.1× faster and runs 5.1× faster at inference than existing state-of-the-art methods while maintaining comparable accuracy, and it analyzes the whole video in a single end-to-end pass while requiring 1.5× fewer GFLOPs. We report competitive results on Kinetics-400 and present an ablation study of VTN's properties and the trade-off between accuracy and inference speed. We hope our approach will serve as a new baseline and open a fresh line of research in the video recognition domain.
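
A minimal sketch of the overall structure, assuming torchvision is available: a 2D backbone (ResNet-18 here, standing in for whatever spatial network is chosen) embeds each frame, and a temporal transformer attends over the whole sequence of frame features before classification. Class name, sizes, and depths are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torchvision

class MinimalVTN(nn.Module):
    """A 2D backbone embeds each frame; a temporal transformer then
    attends over the whole sequence of frame features via a [CLS] token."""
    def __init__(self, num_classes=400, dim=512, heads=8, depth=3):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()                   # frame -> 512-d feature
        self.backbone = backbone
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                         # (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1)).reshape(B, T, -1)
        seq = torch.cat([self.cls.expand(B, -1, -1), feats], dim=1)
        return self.head(self.temporal(seq)[:, 0])    # classify from [CLS]

model = MinimalVTN()
print(model(torch.randn(2, 8, 3, 224, 224)).shape)    # torch.Size([2, 400])
```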

SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation

In this paper we introduce a Transformer-based approach to video object segmentation (VOS). To address compounding error and scalability issues of prior work, we propose a scalable, end-to-end method for VOS called Sparse Spatiotemporal Transformers (SST). SST extracts per-pixel representations for each object in a video using sparse attention over spatiotemporal features. Our attention-based formulation for VOS allows a model to learn to attend over a history of multiple frames and provides suitable inductive bias for performing correspondence-like computations necessary for solving motion segmentation. We demonstrate the effectiveness of attention-based over recurrent networks in the spatiotemporal domain. Our method achieves competitive results on YouTube-VOS and DAVIS 2017 with improved scalability and robustness to occlusions compared with the state of the art.

https://arxiv.org/abs/2101.08833

In this paper we introduce a transformer-based approach to video object segmentation (VOS). To address the compounding-error and scalability issues of prior work, we propose a scalable, end-to-end VOS method called Sparse Spatiotemporal Transformers (SST). SST uses sparse attention over spatiotemporal features to extract per-pixel representations for each object in a video. Our attention-based formulation of VOS allows the model to learn from a history of multiple frames and provides a suitable inductive bias for the correspondence-like computations involved in motion segmentation. We demonstrate the advantage of attention over recurrent networks in the spatiotemporal domain. Our method achieves competitive results on YouTube-VOS and DAVIS 2017 while being more scalable and more robust to occlusions than other state-of-the-art methods.
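
As a very rough illustration of sparse spatiotemporal attention, the module below restricts each pixel to attend only to the same spatial location across the frame history, rather than to every position in every frame. This is just one possible sparse pattern, sketched for intuition; the paper combines richer patterns and operates per object.

```python
import torch
import torch.nn as nn

class TemporalOnlyAttention(nn.Module):
    """Each pixel attends only to the same spatial location across the
    frame history, instead of to all T*H*W spatiotemporal positions."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                         # (B, T, C, H, W)
        B, T, C, H, W = feats.shape
        x = feats.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        x, _ = self.attn(x, x, x)                     # attend along time only
        return x.reshape(B, H, W, T, C).permute(0, 3, 4, 1, 2)

attn = TemporalOnlyAttention(dim=64)
print(attn(torch.randn(1, 5, 64, 16, 16)).shape)      # torch.Size([1, 5, 64, 16, 16])
```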

Trans2Seg: Transparent Object Segmentation with Transformer

This work presents a new fine-grained transparent object segmentation dataset, termed Trans10K-v2, extending Trans10K-v1, the first large-scale transparent object segmentation dataset. Unlike Trans10K-v1 that only has two limited categories, our new dataset has several appealing benefits. (1) It has 11 fine-grained categories of transparent objects, commonly occurring in the human domestic environment, making it more practical for real-world application. (2) Trans10K-v2 brings more challenges for the current advanced segmentation methods than its former version. Furthermore, a novel transformer-based segmentation pipeline termed Trans2Seg is proposed. Firstly, the transformer encoder of Trans2Seg provides the global receptive field in contrast to CNN’s local receptive field, which shows excellent advantages over pure CNN architectures. Secondly, by formulating semantic segmentation as a problem of dictionary look-up, we design a set of learnable prototypes as the query of Trans2Seg’s transformer decoder, where each prototype learns the statistics of one category in the whole dataset. We benchmark more than 20 recent semantic segmentation methods, demonstrating that Trans2Seg significantly outperforms all the CNN-based methods, showing the proposed algorithm’s potential ability to solve transparent object segmentation.

https://arxiv.org/abs/2101.08461

This paper presents Trans10K-v2, a fine-grained transparent object segmentation dataset that extends Trans10K-v1, the first large-scale transparent object segmentation dataset. Unlike Trans10K-v1, which has only two categories, Trans10K-v2 offers the following advantages: (1) it contains 11 fine-grained categories of transparent objects commonly found in human domestic environments; (2) it poses more challenges for current advanced segmentation methods. We also propose a transformer-based segmentation model called Trans2Seg. First, the transformer encoder provides a global receptive field, in contrast to the local receptive field of CNNs. Second, we view segmentation as a dictionary-lookup process and design a set of learnable prototypes as the queries of Trans2Seg's transformer decoder, where each prototype learns the statistics of one category. Benchmarked against more than 20 recent segmentation methods, our approach outperforms all the CNN-based ones, showing its potential for solving transparent object segmentation.
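
The dictionary-lookup view can be sketched directly: a set of learnable per-category prototypes serves as the queries of a transformer decoder over the encoder's pixel features, and dot products between the refined prototypes and the pixel features give per-class masks. The class name and sizes below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PrototypeDecoder(nn.Module):
    """Learnable per-category prototypes act as decoder queries; their dot
    products with pixel features yield per-class mask logits."""
    def __init__(self, num_classes=12, dim=256, heads=8, depth=2):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim))
        layer = nn.TransformerDecoderLayer(dim, heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, depth)

    def forward(self, pixel_feats):                   # (B, H*W, dim) from the encoder
        B = pixel_feats.size(0)
        q = self.prototypes.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(q, pixel_feats)              # prototypes attend to pixels
        return torch.einsum('bkd,bnd->bkn', q, pixel_feats)  # per-class masks

dec = PrototypeDecoder()
masks = dec(torch.randn(2, 32 * 32, 256))
print(masks.shape)                                    # torch.Size([2, 12, 1024])
```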

VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search

Text-to-image retrieval is an essential task in multi-modal information retrieval, i.e. retrieving relevant images from a large and unlabelled image dataset given textual queries. In this paper, we propose VisualSparta, a novel text-to-image retrieval model that shows substantial improvement over existing models on both accuracy and efficiency. We show that VisualSparta is capable of outperforming all previous scalable methods in MSCOCO and Flickr30K. It also shows substantial retrieving speed advantages, i.e. for an index with 1 million images, VisualSparta gets over 391x speed up compared to standard vector search. Experiments show that this speed advantage even gets bigger for larger datasets because VisualSparta can be efficiently implemented as an inverted index. To the best of our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model that can achieve real-time searching for very large dataset, with significant accuracy improvement compared to previous state-of-the-art methods.

https://arxiv.org/abs/2101.00265

Text-to-image retrieval is an essential task in multi-modal information retrieval: given a textual query, retrieve the relevant images from a large, unlabelled image dataset. In this paper we propose VisualSparta, a novel text-to-image retrieval model that outperforms existing models in both accuracy and efficiency. It surpasses all previous scalable methods on MSCOCO and Flickr30K. In terms of speed, for an index of one million images it is over 391× faster than standard vector search, and experiments show that this advantage grows with dataset size because VisualSparta can be implemented efficiently as an inverted index. To our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model to achieve real-time search on very large datasets while improving accuracy over previous state-of-the-art methods.
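
Why fragment-level matching enables an inverted index is easiest to see in plain Python: if each image is indexed by a sparse weight per vocabulary term at indexing time, then scoring a query reduces to looking up its terms and summing weights. The weights below are made up; the real model derives them from transformer fragment scores.

```python
from collections import defaultdict

# made-up per-image term weights (image id -> {term: weight})
image_term_weights = {
    "img_001": {"dog": 2.1, "grass": 0.8},
    "img_002": {"cat": 1.7, "sofa": 1.2},
    "img_003": {"dog": 0.9, "beach": 1.5},
}

# build the inverted index: term -> [(image_id, weight), ...]
inverted = defaultdict(list)
for img, weights in image_term_weights.items():
    for term, w in weights.items():
        inverted[term].append((img, w))

def search(query_terms):
    """Score each image by summing the weights of the query terms it matches."""
    scores = defaultdict(float)
    for term in query_terms:
        for img, w in inverted.get(term, []):
            scores[img] += w
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search(["dog", "beach"]))                       # img_003 then img_001
```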

Self-Attention Based Context-Aware 3D Object Detection

Most existing point-cloud based 3D object detectors use convolution-like operators to process information in a local neighbourhood with fixed-weight kernels and aggregate global context hierarchically. However, recent work on non-local neural networks and self-attention for 2D vision has shown that explicitly modeling global context and long-range interactions between positions can lead to more robust and competitive models. In this paper, we explore two variants of self-attention for contextual modeling in 3D object detection by augmenting convolutional features with self-attention features. We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors and show consistent improvement over strong baseline models while simultaneously significantly reducing their parameter footprint and computational cost. We also propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations. This not only allows us to scale explicit global contextual modeling to larger point-clouds, but also leads to more discriminative and informative feature descriptors. Our method can be flexibly applied to most state-of-the-art detectors with increased accuracy and parameter and compute efficiency. We achieve new state-of-the-art detection performance on KITTI and nuScenes datasets.

https://arxiv.org/pdf/2101.02672.pdf

Most existing point-cloud-based 3D object detectors use convolution-like operators with fixed-weight kernels to process information in a local neighborhood and aggregate global context hierarchically. However, recent work on non-local neural networks and self-attention in 2D vision shows that explicitly modeling global context and long-range interactions between positions leads to more robust and competitive models. In this paper we study two self-attention variants for contextual modeling in 3D object detection, both of which augment convolutional features with self-attention features. We first incorporate pairwise self-attention into current state-of-the-art BEV, voxel-based, and point-based detectors, achieving consistent gains over strong baselines while greatly reducing parameters and computation. We also propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations. This allows explicit global context modeling to scale to larger point clouds and yields more discriminative and informative features. Our method can be applied to most state-of-the-art detectors and achieves state-of-the-art results on the KITTI and nuScenes datasets.
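
A minimal sketch of the first variant's idea, augmenting per-point (or per-voxel) convolutional features with a pairwise self-attention branch; the deformable sampling of a representative subset used in the second variant is omitted. The module name and sizes are illustrative.

```python
import torch
import torch.nn as nn

class PointwiseSelfAttention(nn.Module):
    """Augment convolutional point/voxel features with a pairwise
    self-attention branch: the attention output is added back to the
    input features to inject global context."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):                         # (B, N, dim) per-point features
        ctx, _ = self.attn(feats, feats, feats)       # global pairwise context
        return self.norm(feats + ctx)                 # augment, don't replace

module = PointwiseSelfAttention(dim=128)
print(module(torch.randn(2, 1024, 128)).shape)        # torch.Size([2, 1024, 128])
```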