Category Archive: Daily Paper Review

HumanGAN: A Generative Model of Human Images

Generative adversarial networks achieve great performance in photorealistic image synthesis in various domains, including human images. However, they usually employ latent vectors that encode the sampled outputs globally. This does not allow convenient control of semantically-relevant individual parts of the image, and is not able to draw samples that only differ in partial aspects, such as clothing style. We address these limitations and present a generative model for images of dressed humans offering control over pose, local body part appearance and garment style. This is the first method to solve various aspects of human image generation such as global appearance sampling, pose transfer, parts and garment transfer, and parts sampling jointly in a unified framework. As our model encodes part-based latent appearance vectors in a normalized pose-independent space and warps them to different poses, it preserves body and clothing appearance under varying posture. Experiments show that our flexible and general generative method outperforms task-specific baselines for pose-conditioned image generation, pose transfer and part sampling in terms of realism and output resolution.

https://arxiv.org/abs/2103.06902

Generative adversarial networks have achieved impressive results in photorealistic image synthesis across many domains, including human images. However, they usually rely on latent vectors that encode the sampled output globally, which makes it inconvenient to edit semantically meaningful individual parts of the image and impossible to draw samples that differ only in partial aspects such as clothing style. We address these limitations with a generative model of dressed humans that offers control over pose, local body-part appearance, and garment style. It is the first method to handle global appearance sampling, pose transfer, part and garment transfer, and part sampling jointly in a unified framework. Because our model encodes part-based latent appearance vectors in a normalized, pose-independent space and then warps them to different poses, body and clothing appearance are preserved under changing posture. Experiments show that our model achieves excellent performance on pose-conditioned image generation, pose transfer, and part sampling.
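
To make the part-based idea concrete, here is a minimal PyTorch sketch (my illustration, not the authors' implementation): each body part gets its own appearance latent in a pose-independent space, and the latents are spread into a pose-dependent layout before a small decoder renders the image. All names, shapes, and the mask-based "warping" are illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_PARTS, LATENT_DIM, H, W = 8, 64, 128, 128  # assumed sizes

class PartBasedGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # toy decoder that turns the assembled feature map into an RGB image
        self.decoder = nn.Sequential(
            nn.Conv2d(LATENT_DIM, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, part_latents, part_masks):
        # part_latents: (B, NUM_PARTS, LATENT_DIM)  pose-independent appearance codes
        # part_masks:   (B, NUM_PARTS, H, W)        soft pose-dependent part layout
        # broadcast each part code over its spatial support, then sum over parts
        feat = torch.einsum('bpc,bphw->bchw', part_latents, part_masks)
        return self.decoder(feat)

gen = PartBasedGenerator()
z_parts = torch.randn(2, NUM_PARTS, LATENT_DIM)                 # sample appearance per part
masks = torch.softmax(torch.randn(2, NUM_PARTS, H, W), dim=1)   # stand-in pose layout
img = gen(z_parts, masks)
print(img.shape)  # (2, 3, 128, 128); swapping z_parts[:, k] between samples changes only part k
```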

CheXseen: Unseen Disease Detection for Deep Learning Interpretation of Chest X-rays

We systematically evaluate the performance of deep learning models in the presence of diseases not labeled for or present during training. First, we evaluate whether deep learning models trained on a subset of diseases (seen diseases) can detect the presence of any one of a larger set of diseases. We find that models tend to falsely classify diseases outside of the subset (unseen diseases) as “no disease”. Second, we evaluate whether models trained on seen diseases can detect seen diseases when co-occurring with diseases outside the subset (unseen diseases). We find that models are still able to detect seen diseases even when co-occurring with unseen diseases. Third, we evaluate whether feature representations learned by models may be used to detect the presence of unseen diseases given a small labeled set of unseen diseases. We find that the penultimate layer of the deep neural network provides useful features for unseen disease detection. Our results can inform the safe clinical deployment of deep learning models trained on a non-exhaustive set of disease classes.

https://arxiv.org/abs/2103.04590

We systematically evaluate the performance of deep learning models in the presence of diseases that were not labeled for or present during training. First, we evaluate whether models trained on a subset of diseases (seen diseases) can detect the presence of any disease from a larger set, and find that they tend to falsely classify unseen diseases as "no disease". Second, we evaluate whether models trained on seen diseases can still detect them when they co-occur with unseen diseases; we find that seen diseases remain detectable even in the presence of unseen ones. Finally, we evaluate whether the learned feature representations can be used to detect unseen diseases given only a small labeled set, and find that the penultimate layer of the network provides useful features for unseen disease detection. Our results can inform the safe clinical deployment of deep learning models trained on a non-exhaustive set of disease classes.
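
The third experiment boils down to a linear probe on frozen features. Below is a minimal sketch of that setup (not the authors' code): the backbone choice, the way the penultimate features are exposed, and the dummy data are all assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumed backbone standing in for a chest X-ray classifier trained on seen diseases.
backbone = models.densenet121(weights=None)   # in practice, load weights trained on seen diseases
backbone.classifier = nn.Identity()           # expose the 1024-d penultimate features
backbone.eval()

probe = nn.Linear(1024, 1)                    # tiny probe for one unseen disease
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# small labeled set of unseen-disease cases (random tensors stand in for real X-rays)
x_small = torch.randn(16, 3, 224, 224)
y_small = torch.randint(0, 2, (16, 1)).float()

with torch.no_grad():
    feats = backbone(x_small)                 # (16, 1024) penultimate-layer features

for _ in range(100):                          # fit only the probe; the backbone stays frozen
    opt.zero_grad()
    loss = loss_fn(probe(feats), y_small)
    loss.backward()
    opt.step()
```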

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards “token uniformity”. Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.

https://arxiv.org/abs/2103.03404

Attention-based architectures have become ubiquitous in machine learning, yet the source of their effectiveness remains poorly understood. In this work we look at self-attention networks from a new angle: the output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention has a strong inductive bias towards "token uniformity". In particular, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix; conversely, skip connections and MLPs stop the output from degenerating. Our experiments verify this convergence phenomenon on several variants of standard transformer architectures.
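
The rank-collapse claim is easy to probe numerically. The sketch below (mine, not the paper's code) stacks pure single-head self-attention layers with random weights, no skip connections, and no MLPs, and tracks how far the output is from its best rank-1 approximation; the residual should shrink rapidly with depth.

```python
import torch

torch.manual_seed(0)
n_tokens, d = 32, 64

def pure_attention_layer(X, Wq, Wk, Wv):
    # single-head self-attention with no skip connection and no MLP
    A = torch.softmax((X @ Wq) @ (X @ Wk).T / d ** 0.5, dim=-1)
    return A @ (X @ Wv)

def rank1_residual(X):
    # Frobenius norm of X minus its best rank-1 approximation, relative to ||X||
    _, S, _ = torch.linalg.svd(X, full_matrices=False)
    return (S[1:] ** 2).sum().sqrt() / X.norm()

X = torch.randn(n_tokens, d)
for layer in range(10):
    Wq, Wk, Wv = [torch.randn(d, d) / d ** 0.5 for _ in range(3)]
    X = pure_attention_layer(X, Wq, Wk, Wv)
    print(f"layer {layer + 1}: relative rank-1 residual = {rank1_residual(X).item():.3e}")
```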

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

Medical image segmentation is an essential prerequisite for developing healthcare systems, especially for disease diagnosis and treatment planning. On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard and achieved tremendous success. However, due to the intrinsic locality of convolution operations, U-Net generally demonstrates limitations in explicitly modeling long-range dependency. Transformers, designed for sequence-to-sequence prediction, have emerged as alternative architectures with innate global self-attention mechanisms, but can result in limited localization abilities due to insufficient low-level details. In this paper, we propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation. On one hand, the Transformer encodes tokenized image patches from a convolution neural network (CNN) feature map as the input sequence for extracting global contexts. On the other hand, the decoder upsamples the encoded features which are then combined with the high-resolution CNN feature maps to enable precise localization. 
We argue that Transformers can serve as strong encoders for medical image segmentation tasks, with the combination of U-Net to enhance finer details by recovering localized spatial information. TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation.

https://arxiv.org/abs/2102.04306

Medical image segmentation is an essential prerequisite for building healthcare systems, especially for disease diagnosis and treatment planning. Across many medical segmentation tasks, the U-shaped architecture (U-Net) has become the de-facto standard and achieved great success, but because of the intrinsic locality of convolution, U-Net struggles to explicitly model long-range dependencies. Transformers, designed for sequence-to-sequence prediction, offer an alternative with built-in global self-attention, yet their localization ability suffers from insufficient low-level detail. In this paper we propose TransUNet, which combines the strengths of Transformers and U-Net for medical image segmentation. On one hand, the Transformer encodes tokenized patches from a CNN feature map as the input sequence to extract global context; on the other hand, the decoder upsamples the encoded features and fuses them with high-resolution CNN feature maps, so that both global context and precise localization are preserved. TransUNet achieves strong results on several tasks, including multi-organ and cardiac segmentation.
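
The hybrid design can be summarized in a few lines. The sketch below is an illustrative toy version under assumed shapes, not the released TransUNet: a CNN produces features, their spatial positions are flattened into tokens for a transformer encoder, and the decoder upsamples the result while concatenating high-resolution CNN features.

```python
import torch
import torch.nn as nn

class TinyTransUNet(nn.Module):
    """Toy hybrid CNN + Transformer encoder-decoder (assumed sizes, single skip)."""
    def __init__(self, in_ch=1, n_classes=2, d_model=256):
        super().__init__()
        self.cnn1 = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU())              # high-res features
        self.cnn2 = nn.Sequential(nn.Conv2d(64, d_model, 3, stride=4, padding=1), nn.ReLU())  # downsampled map
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)                         # global context
        self.up = nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False)
        self.head = nn.Sequential(nn.Conv2d(d_model + 64, 64, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(64, n_classes, 1))

    def forward(self, x):
        skip = self.cnn1(x)                            # (B, 64, H, W)
        feat = self.cnn2(skip)                         # (B, d_model, H/4, W/4)
        B, C, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)       # tokenize the CNN feature map
        tokens = self.transformer(tokens)              # self-attention over all positions
        feat = tokens.transpose(1, 2).reshape(B, C, h, w)
        feat = self.up(feat)                           # back to full resolution
        return self.head(torch.cat([feat, skip], dim=1))  # fuse with high-res CNN features

seg = TinyTransUNet()(torch.randn(1, 1, 64, 64))
print(seg.shape)  # (1, 2, 64, 64)
```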

OmniNet: Omnidirectional Representations from Transformers

This paper proposes Omnidirectional Representations from Transformers (OmniNet). In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network. This process can also be interpreted as a form of extreme or intensive attention mechanism that has the receptive field of the entire width and depth of the network. To this end, the omnidirectional attention is learned via a meta-learner, which is essentially another self-attention based model. In order to mitigate the computationally expensive costs of full receptive field attention, we leverage efficient self-attention models such as kernel-based (Choromanski et al.), low-rank attention (Wang et al.) and/or Big Bird (Zaheer et al.) as the meta-learner. Extensive experiments are conducted on autoregressive language modeling (LM1B, C4), Machine Translation, Long Range Arena (LRA), and Image Recognition. The experiments show that OmniNet achieves considerable improvements across these tasks, including achieving state-of-the-art performance on LM1B, WMT’14 En-De/En-Fr, and Long Range Arena. Moreover, using omnidirectional representation in Vision Transformers leads to significant improvements on image recognition tasks on both few-shot learning and fine-tuning setups.

https://arxiv.org/abs/2103.01075

This paper proposes Omnidirectional Representations from Transformers (OmniNet). Instead of maintaining a strictly horizontal (within-layer) receptive field, every token in OmniNet may attend to all tokens in the entire network. This can be viewed as an extreme form of attention whose receptive field spans the full width and depth of the network. The omnidirectional attention is learned through a meta-learner, which is itself another self-attention based model. To mitigate the heavy cost of such a full receptive field, efficient self-attention models such as kernel-based attention, low-rank attention, and Big Bird are used as the meta-learner. Experiments show that OmniNet brings solid gains on both NLP and vision tasks.
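
The sketch below shows one reading of the omnidirectional idea under simplifying assumptions (mine, not the paper's code): token representations from every layer of a small transformer are pooled into one long sequence, and a second attention module, standing in for the efficient meta-learner, attends over all of them.

```python
import torch
import torch.nn as nn

d_model, n_layers, seq_len = 128, 4, 16   # assumed sizes

# base transformer whose per-layer token representations we collect
base_layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
                             for _ in range(n_layers)])

# stand-in for the meta-learner; the paper uses efficient attention (kernel-based,
# low-rank, Big Bird), while plain multi-head attention keeps this sketch short
meta_attention = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

x = torch.randn(2, seq_len, d_model)
all_layer_tokens = []
h = x
for blk in base_layers:
    h = blk(h)
    all_layer_tokens.append(h)                 # keep every layer's tokens

memory = torch.cat(all_layer_tokens, dim=1)    # (2, n_layers * seq_len, d_model)
# each final token attends to tokens across the entire width and depth of the network
omni, _ = meta_attention(h, memory, memory)
output = h + omni                              # assumed fusion of omnidirectional context
print(output.shape)                            # (2, 16, 128)
```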

Training Generative Adversarial Networks in One Stage

Generative Adversarial Networks (GANs) have demonstrated unprecedented success in various image generation tasks. The encouraging results, however, come at the price of a cumbersome training process, during which the generator and discriminator are alternately updated in two stages. In this paper, we investigate a general training scheme that enables training GANs efficiently in only one stage. Based on the adversarial losses of the generator and discriminator, we categorize GANs into two classes, Symmetric GANs and Asymmetric GANs, and introduce a novel gradient decomposition method to unify the two, allowing us to train both classes in one stage and hence alleviate the training effort. Computational analysis and experimental results on several datasets and various network architectures demonstrate that, the proposed one-stage training scheme yields a solid 1.5× acceleration over conventional training schemes, regardless of the network architectures of the generator and discriminator. Furthermore, we show that the proposed method is readily applicable to other adversarial-training scenarios, such as data-free knowledge distillation.

https://arxiv.org/pdf/2103.00430.pdf

Generative Adversarial Networks (GANs) have shown unprecedented success on a variety of image generation tasks. These encouraging results, however, come at the price of a cumbersome training process in which the generator and discriminator are alternately updated in two stages. In this paper we propose a scheme for training GANs in a single stage. Based on the form of their adversarial losses, we divide GANs into Symmetric GANs and Asymmetric GANs, and introduce a novel gradient decomposition method that unifies the two, allowing both classes to be trained in one stage. Computational analysis and experimental results show that one-stage training yields a solid 1.5x speed-up over the conventional scheme.
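
For intuition, here is a minimal sketch of one-stage training for the symmetric case only (a WGAN-style critic, where the generator loss is the negative of the fake-sample part of the critic loss). This is my simplified illustration, not the paper's general gradient-decomposition scheme: one backward pass produces gradients for both networks, and the generator's gradients are negated before stepping.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))   # toy generator
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))    # toy critic
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)

for step in range(100):
    real = torch.randn(32, 2) * 0.5 + 1.0        # stand-in real data
    fake = G(torch.randn(32, 16))

    # Symmetric (WGAN-style) losses: L_D = D(fake) - D(real), L_G = -D(fake).
    loss_D = D(fake).mean() - D(real).mean()

    opt_D.zero_grad()
    opt_G.zero_grad()
    loss_D.backward()                            # a single backward pass for both networks

    # Since L_G is the negation of the fake term of L_D, the generator's true
    # gradients are the negation of what this backward pass stored for it.
    for p in G.parameters():
        if p.grad is not None:
            p.grad.neg_()

    opt_D.step()
    opt_G.step()
```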

Transformer in Transformer

Transformer is a type of self-attention-based neural networks originally applied for NLP tasks. Recently, pure transformer-based models are proposed to solve computer vision problems. These visual transformers usually view an image as a sequence of patches while they ignore the intrinsic structure information inside each patch. In this paper, we propose a novel Transformer-iN-Transformer (TNT) model for modeling both patch-level and pixel-level representation. In each TNT block, an outer transformer block is utilized to process patch embeddings, and an inner transformer block extracts local features from pixel embeddings. The pixel-level feature is projected to the space of patch embedding by a linear transformation layer and then added into the patch. By stacking the TNT blocks, we build the TNT model for image recognition. Experiments on ImageNet benchmark and downstream tasks demonstrate the superiority and efficiency of the proposed TNT architecture. For example, our TNT achieves 81.3% top-1 accuracy on ImageNet which is 1.5% higher than that of DeiT with similar computational cost.

https://arxiv.org/abs/2103.00112

Transformer is a self-attention-based neural architecture originally applied to NLP tasks. Recently, pure transformer models have been proposed for computer vision, but they usually treat an image as a sequence of patches and ignore the intrinsic structure information inside each patch. In this paper we propose a Transformer-iN-Transformer (TNT) architecture that models both patch-level and pixel-level representations. In each TNT block, an outer transformer block processes the patch embeddings, while an inner transformer block extracts local features from the pixel embeddings; the pixel-level features are projected into the patch-embedding space by a linear layer and added to the patch embedding. Stacking TNT blocks yields the TNT model for image recognition. Experiments on ImageNet and downstream tasks demonstrate the advantages of TNT.
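
A rough sketch of a single TNT block under assumed dimensions (not the official implementation): the inner transformer runs over each patch's pixel embeddings, the flattened result is linearly projected into the patch-embedding space and added to the patch embedding, and the outer transformer then runs over the patches.

```python
import torch
import torch.nn as nn

class TNTBlock(nn.Module):
    """Illustrative Transformer-iN-Transformer block (assumed sizes)."""
    def __init__(self, patch_dim=384, pixel_dim=24, pixels_per_patch=16):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(pixel_dim, nhead=4, batch_first=True)  # pixel level
        self.outer = nn.TransformerEncoderLayer(patch_dim, nhead=6, batch_first=True)  # patch level
        self.proj = nn.Linear(pixel_dim * pixels_per_patch, patch_dim)                 # pixel -> patch space

    def forward(self, patch_emb, pixel_emb):
        # patch_emb: (B, num_patches, patch_dim)
        # pixel_emb: (B * num_patches, pixels_per_patch, pixel_dim)
        B, N, _ = patch_emb.shape
        pixel_emb = self.inner(pixel_emb)                    # local features inside each patch
        local = self.proj(pixel_emb.reshape(B, N, -1))       # project into patch-embedding space
        patch_emb = self.outer(patch_emb + local)            # inject local detail, then global attention
        return patch_emb, pixel_emb

blk = TNTBlock()
patches = torch.randn(2, 196, 384)
pixels = torch.randn(2 * 196, 16, 24)
patches, pixels = blk(patches, pixels)
print(patches.shape, pixels.shape)
```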

Convolution-Free Medical Image Segmentation using Transformers

Like other applications in computer vision, medical image segmentation has been most successfully addressed using deep learning models that rely on the convolution operation as their main building block. Convolutions enjoy important properties such as sparse interactions, weight sharing, and translation equivariance. These properties give convolutional neural networks (CNNs) a strong and useful inductive bias for vision tasks. In this work we show that a different method, based entirely on self-attention between neighboring image patches and without any convolution operations, can achieve competitive or better results. Given a 3D image block, our network divides it into n3 3D patches, where n=3 or 5 and computes a 1D embedding for each patch. The network predicts the segmentation map for the center patch of the block based on the self-attention between these patch embeddings. We show that the proposed model can achieve segmentation accuracies that are better than the state of the art CNNs on three datasets. We also propose methods for pre-training this model on large corpora of unlabeled images. Our experiments show that with pre-training the advantage of our proposed network over CNNs can be significant when labeled training data is small.

https://arxiv.org/abs/2102.13645

As in other computer vision tasks, deep learning models built around the convolution operation have achieved great success in medical image segmentation. Convolutions have useful properties such as sparse interactions, weight sharing, and translation equivariance, which give convolutional neural networks a strong and useful inductive bias and have led to their wide adoption in vision. In this paper we propose a different method, based entirely on self-attention between neighboring image patches and requiring no convolution, that achieves comparable or better performance than convolutional models. Our model takes a 3D image block, splits it into n^3 patches (with n = 3 or 5), and computes a 1D embedding for each patch; the network then predicts the segmentation of the center patch from the self-attention between these neighboring patch embeddings. We find that the proposed model outperforms state-of-the-art CNNs on segmentation. It can also be pre-trained on large corpora of unlabeled images, and with such pre-training it clearly outperforms CNNs when labeled training data is scarce.
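
The following schematic sketch (assumed sizes, not the authors' network) walks through the patch pipeline described above: a 3D block is split into n^3 patches with n = 3, each patch gets a linear 1D embedding, a transformer attends over the 27 embeddings, and a head predicts the segmentation map of the center patch.

```python
import torch
import torch.nn as nn

n, patch_size, d_model, n_classes = 3, 8, 256, 2         # assumed sizes

embed = nn.Linear(patch_size ** 3, d_model)               # 1D embedding per 3D patch
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
head = nn.Linear(d_model, n_classes * patch_size ** 3)    # segmentation logits for the center patch

block = torch.randn(1, n * patch_size, n * patch_size, n * patch_size)  # one 3D image block
# split the block into n^3 non-overlapping 3D patches and flatten each into a vector
patches = (block
           .unfold(1, patch_size, patch_size)
           .unfold(2, patch_size, patch_size)
           .unfold(3, patch_size, patch_size)
           .reshape(1, n ** 3, patch_size ** 3))

tokens = encoder(embed(patches))                           # self-attention between patch embeddings
center = tokens[:, n ** 3 // 2]                            # representation of the center patch
logits = head(center).reshape(1, n_classes, patch_size, patch_size, patch_size)
print(logits.shape)                                        # (1, 2, 8, 8, 8)
```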

Do We Really Need Explicit Position Encodings for Vision Transformers?

Almost all visual transformers such as ViT or DeiT rely on predefined positional encodings to incorporate the order of each input token. These encodings are often implemented as learnable fixed-dimension vectors or sinusoidal functions of different frequencies, which are not possible to accommodate variable-length input sequences. This inevitably limits a wider application of transformers in vision, where many tasks require changing the input size on-the-fly. 
In this paper, we propose to employ a conditional position encoding scheme, which is conditioned on the local neighborhood of the input token. It is effortlessly implemented as what we call Position Encoding Generator (PEG), which can be seamlessly incorporated into the current transformer framework. Our new model with PEG is named Conditional Position encoding Visual Transformer (CPVT) and can naturally process the input sequences of arbitrary length. We demonstrate that CPVT can result in visually similar attention maps and even better performance than those with predefined positional encodings. We obtain state-of-the-art results on the ImageNet classification task compared with visual Transformers to date.

https://arxiv.org/abs/2102.10882

Almost all vision transformers, such as ViT and DeiT, rely on predefined positional encodings to incorporate the order of the input tokens. These encodings are usually implemented as learnable fixed-length vectors or as sinusoidal functions of different frequencies, and therefore cannot accommodate variable-length input sequences. This inevitably limits the wider application of vision transformers, especially in tasks where the input size changes on the fly.

In this paper, we propose a conditional position encoding scheme that is conditioned on the local neighborhood of each input token. We implement it as a Position Encoding Generator (PEG), which can be plugged seamlessly into existing transformer architectures. We name the resulting model Conditional Position encoding Visual Transformer (CPVT); it naturally processes input sequences of arbitrary length. We show that CPVT produces attention maps visually similar to, and performance even better than, methods with predefined positional encodings, and it achieves state-of-the-art results among vision Transformers on ImageNet classification.
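
The paper describes PEG as a simple convolution applied to the tokens reshaped back to their 2D layout. The sketch below follows that description with assumed sizes (a 3x3 depthwise convolution plus a residual, class token omitted) and is not the official code; note that the same module works for any input resolution.

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Position Encoding Generator: positions conditioned on a token's 2D neighborhood.
    Sketch of the described design (3x3 depthwise conv + residual); sizes are assumed."""
    def __init__(self, dim=384):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise

    def forward(self, tokens, h, w):
        # tokens: (B, h*w, dim), class token omitted for simplicity
        B, N, C = tokens.shape
        feat = tokens.transpose(1, 2).reshape(B, C, h, w)   # back to the 2D token grid
        feat = self.conv(feat) + feat                        # neighborhood-conditioned encoding
        return feat.flatten(2).transpose(1, 2)               # (B, h*w, dim)

peg = PEG()
x14 = peg(torch.randn(2, 14 * 14, 384), 14, 14)   # works for a 14x14 token grid...
x20 = peg(torch.randn(2, 20 * 20, 384), 20, 20)   # ...and for any other input resolution
print(x14.shape, x20.shape)
```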

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Although using convolutional neural networks (CNNs) as backbones achieves great successes in computer vision, this work investigates a simple backbone network useful for many dense prediction tasks without convolutions. Unlike the recently-proposed Transformer model (e.g., ViT) that is specially designed for image classification, we propose Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to prior arts. (1) Different from ViT that typically has low-resolution outputs and high computational and memory cost, PVT can be not only trained on dense partitions of the image to achieve high output resolution, which is important for dense predictions but also using a progressive shrinking pyramid to reduce computations of large feature maps. (2) PVT inherits the advantages from both CNN and Transformer, making it a unified backbone in various vision tasks without convolutions by simply replacing CNN backbones. (3) We validate PVT by conducting extensive experiments, showing that it boosts the performance of many downstream tasks, e.g., object detection, semantic, and instance segmentation. For example, with a comparable number of parameters, RetinaNet+PVT achieves 40.4 AP on the COCO dataset, surpassing RetinaNet+ResNet50 (36.3 AP) by 4.1 absolute AP. We hope PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.

https://arxiv.org/abs/2102.12122

Although models with convolutional neural network (CNN) backbones have achieved great success in computer vision, in this paper we present a simple convolution-free backbone that is useful for many dense prediction tasks. Unlike the recently proposed Transformer models (e.g., ViT), which are designed for image classification, our Pyramid Vision Transformer (PVT) overcomes the difficulties of porting Transformers to dense prediction. Compared with prior work, PVT has several merits: (1) unlike ViT, which typically produces low-resolution outputs at high computational and memory cost, PVT can be trained on dense partitions of the image to obtain high-resolution outputs, and it uses a progressively shrinking pyramid to reduce the computation on large feature maps; (2) PVT inherits the advantages of both CNNs and Transformers, so it can serve as a unified, convolution-free backbone for many vision tasks simply by replacing the CNN backbone; (3) we validate PVT with extensive experiments on downstream tasks such as object detection and semantic and instance segmentation; for example, with a comparable number of parameters, RetinaNet+PVT reaches 40.4 AP on COCO, 4.1 points higher than RetinaNet+ResNet50 (36.3 AP).
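
To illustrate the progressively shrinking pyramid, here is a rough sketch with assumed stage settings (not the released model): each stage embeds patches at a coarser stride, so the token count shrinks while the channel width grows, yielding CNN-like multi-scale feature maps that a dense-prediction head can consume. The real PVT additionally uses spatial-reduction attention to cut the attention cost at the high-resolution stages, which this sketch omits.

```python
import torch
import torch.nn as nn

class PVTStage(nn.Module):
    """One pyramid stage: strided patch embedding + transformer encoder (assumed settings)."""
    def __init__(self, in_ch, dim, stride, depth=2, heads=1):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=stride, stride=stride)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=heads, batch_first=True), num_layers=depth)

    def forward(self, x):
        x = self.patch_embed(x)                               # shrink spatial size, widen channels
        B, C, H, W = x.shape
        tokens = self.encoder(x.flatten(2).transpose(1, 2))   # attention over the (smaller) token set
        return tokens.transpose(1, 2).reshape(B, C, H, W)

stages = nn.ModuleList([
    PVTStage(3, 64, stride=4, heads=1),     # 1/4 resolution
    PVTStage(64, 128, stride=2, heads=2),   # 1/8
    PVTStage(128, 320, stride=2, heads=5),  # 1/16
    PVTStage(320, 512, stride=2, heads=8),  # 1/32
])
x = torch.randn(1, 3, 224, 224)
for stage in stages:
    x = stage(x)
    print(x.shape)    # multi-scale feature maps, like a CNN backbone
```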