Author archives: Given Jiang

Escaping the Big Data Paradigm with Compact Transformers


With the rise of Transformers as the standard for language processing, and their advancements in computer vision, along with their unprecedented size and amounts of training data, many have come to believe that they are not suitable for small sets of data. This trend leads to great concerns, including but not limited to: limited availability of data in certain scientific domains and the exclusion of those with limited resources from research in the field. In this paper, we dispel the myth that transformers are “data hungry” and therefore can only be applied to large sets of data. We show for the first time that with the right size and tokenization, transformers can perform head-to-head with state-of-the-art CNNs on small datasets. Our model eliminates the requirement for class token and positional embeddings through a novel sequence pooling strategy and the use of convolutions. We show that compared to CNNs, our compact transformers have fewer parameters and MACs, while obtaining similar accuracies. Our method is flexible in terms of model size, and can have as few as 0.28M parameters and achieve reasonable results. It can reach an accuracy of 94.72% when training from scratch on CIFAR-10, which is comparable with modern CNN based approaches, and a significant improvement over previous Transformer based models. Our simple and compact design democratizes transformers by making them accessible to those equipped with basic computing resources and/or dealing with important small datasets.

https://arxiv.org/abs/2104.05704

Conventional ViT models require large amounts of training data. To address this, we propose the CCT architecture, which can match the performance of CNNs while training on small datasets. Our model removes the dependence on a class token and positional embeddings through a novel sequence pooling strategy. Experiments show that our model achieves performance similar to SOTA models with fewer parameters and faster inference.
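The sequence pooling idea can be sketched in a few lines: instead of prepending a class token, the encoder's output tokens are reduced to a single vector by a learned, softmax-normalized weighting. Below is a minimal NumPy illustration; the scoring vector `w` stands in for the paper's learned linear layer, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sequence_pool(tokens, w):
    """Attention-based sequence pooling (SeqPool, sketched).

    tokens: (n, d) encoder output; w: (d,) weights of a linear layer
    scoring each token's importance. Returns a single (d,) vector:
    the softmax-weighted sum of tokens, replacing the class token
    used by a standard ViT.
    """
    scores = softmax(tokens @ w)   # (n,) importance weights, sum to 1
    return scores @ tokens         # (d,) weighted combination

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))  # 16 tokens, embedding dim 8
w = rng.normal(size=8)
pooled = sequence_pool(tokens, w)
print(pooled.shape)  # (8,)
```

In the full model this pooled vector feeds the classification head, so the network learns which tokens matter for the label rather than reserving a dedicated token for it.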

InfinityGAN: Towards Infinite-Resolution Image Synthesis

We present InfinityGAN, a method to generate arbitrary-resolution images. The problem is associated with several key challenges. First, scaling existing models to a high resolution is resource-constrained, both in terms of computation and availability of high-resolution training data. InfinityGAN trains and infers patch-by-patch seamlessly with low computational resources. Second, large images should be locally and globally consistent, avoid repetitive patterns, and look realistic. To address these, InfinityGAN takes global appearance, local structure and texture into account. With this formulation, we can generate images with resolution and level of detail not attainable before. Experimental evaluation supports that InfinityGAN generates images with superior global structure compared to baselines, at the same time featuring parallelizable inference. Finally, we show several applications unlocked by our approach, such as fusing styles spatially, multi-modal outpainting and image inbetweening at arbitrary input and output resolutions.

https://arxiv.org/abs/2104.03963

Arbitrary-resolution image synthesis faces several challenges: (1) generating high-resolution images demands heavy computational resources; (2) the parts of a high-resolution image should stay consistent with one another, avoid repetitive patterns, and look realistic. To address these issues, this paper presents InfinityGAN, a method that can generate images at arbitrary resolution. Our method jointly accounts for global appearance, local structure, and texture, and can therefore generate high-resolution images that previous methods could not.

Few-shot Semantic Image Synthesis Using StyleGAN Prior

This paper tackles a challenging problem of generating photorealistic images from semantic layouts in few-shot scenarios where annotated training pairs are hardly available but pixel-wise annotation is quite costly. We present a training strategy that performs pseudo labeling of semantic masks using the StyleGAN prior. Our key idea is to construct a simple mapping between the StyleGAN feature and each semantic class from a few examples of semantic masks. With such mappings, we can generate an unlimited number of pseudo semantic masks from random noise to train an encoder for controlling a pre-trained StyleGAN generator. Although the pseudo semantic masks might be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images from not only dense semantic masks but also sparse inputs such as landmarks and scribbles. Qualitative and quantitative results with various datasets demonstrate improvement over previous approaches with respect to layout fidelity and visual quality in as few as one- or five-shot settings.

https://arxiv.org/abs/2103.14877

This paper tackles few-shot semantic image synthesis: generating high-quality images from semantic layouts when pixel-wise labels are hard to obtain. We propose a training strategy that uses a StyleGAN prior to generate pseudo labels. The key idea is to build a mapping from StyleGAN features to each semantic class using only a few annotated masks. With these mappings, we can generate an unlimited number of pseudo semantic masks from random noise to train an encoder, which is then used to control a pre-trained StyleGAN generator. Although such pseudo masks may be too coarse for previous approaches that require pixel-aligned labels, our framework can synthesize high-quality images not only from dense semantic masks but also from sparse inputs such as landmarks and scribbles. Experiments demonstrate our method's performance gains in one- and few-shot synthesis settings.

ViViT: A Video Vision Transformer

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial and temporal dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we will release code and models.

https://arxiv.org/abs/2103.15691

We present a pure-transformer model for video classification, building on the success such models have had in image classification. Our model extracts spatio-temporal tokens from the input video and encodes them with a series of transformer layers. To handle the long token sequences that videos produce, we propose several variants of our model that factorize the input along the spatial and temporal dimensions. Although transformer-based models are generally believed to be usable only with large-scale training data, our model can match large-dataset results on small datasets with the help of regularization and pre-trained image models. Evaluations on several benchmarks show that our model outperforms 3D convolutional networks.
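One way to extract the spatio-temporal tokens described above is "tubelet" embedding: split the video into non-overlapping space-time blocks and flatten each into a token. The NumPy sketch below shows only the reshaping step; the learned linear projection that would follow is omitted, and the function name and sizes are illustrative.

```python
import numpy as np

def tubelet_embed(video, t, h, w):
    """Split a video into non-overlapping spatio-temporal tubelets
    and flatten each one into a token vector.

    video: (T, H, W, C) array; t, h, w: tubelet extent per dimension.
    Returns (num_tokens, t*h*w*C).
    """
    T, H, W, C = video.shape
    assert T % t == 0 and H % h == 0 and W % w == 0
    v = video.reshape(T // t, t, H // h, h, W // w, w, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)  # group the tubelet axes together
    return v.reshape(-1, t * h * w * C)

video = np.zeros((8, 32, 32, 3))          # 8 frames of 32x32 RGB
tokens = tubelet_embed(video, t=2, h=16, w=16)
print(tokens.shape)  # (16, 1536): 4*2*2 tubelets, each 2*16*16*3 values
```

Because each token already spans several frames, temporal information is fused at tokenization time, before any transformer layer runs.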

Can Vision Transformers Learn without Natural Images?

Can we complete pre-training of Vision Transformers (ViT) without natural images and human-annotated labels? Although a pre-trained ViT seems to heavily rely on a large-scale dataset and human-annotated labels, recent large-scale datasets contain several problems in terms of privacy violations, inadequate fairness protection, and labor-intensive annotation. In the present paper, we pre-train ViT without any image collections and annotation labor. We experimentally verify that our proposed framework partially outperforms sophisticated Self-Supervised Learning (SSL) methods like SimCLRv2 and MoCov2 without using any natural images in the pre-training phase. Moreover, although the ViT pre-trained without natural images produces some different visualizations from ImageNet pre-trained ViT, it can interpret natural image datasets to a large extent. For example, the performance rates on the CIFAR-10 dataset are as follows: our proposal 97.6 vs. SimCLRv2 97.4 vs. ImageNet 98.0.

https://arxiv.org/abs/2103.13023

Can we pre-train Vision Transformers without natural images and human annotation? Although ViT pre-training appears to rely heavily on large-scale datasets and human-annotated labels, recent large-scale datasets suffer from privacy violations, inadequate fairness protection, and labor-intensive annotation. In this paper, we train ViT without any large-scale annotated data. We verify that our network partially outperforms several self-supervised learning methods while using no natural images during pre-training. Moreover, although our network is pre-trained without natural images, its visualizations differ somewhat from those of an ImageNet pre-trained ViT, yet it can still interpret natural image datasets to a large extent.

Is Medical Chest X-ray Data Anonymous?

With the rise and ever-increasing potential of deep learning techniques in recent years, publicly available medical data sets became a key factor to enable reproducible development of diagnostic algorithms in the medical domain. Medical data contains sensitive patient-related information and is therefore usually anonymized by removing patient identifiers, e.g., patient names before publication. To the best of our knowledge, we are the first to show that a well-trained deep learning system is able to recover the patient identity from chest X-ray data. We demonstrate this using the publicly available large-scale ChestX-ray14 dataset, a collection of 112,120 frontal-view chest X-ray images from 30,805 unique patients. Our verification system is able to identify whether two frontal chest X-ray images are from the same person with an AUC of 0.9940 and a classification accuracy of 95.55%. We further highlight that the proposed system is able to reveal the same person even ten and more years after the initial scan. When pursuing a retrieval approach, we observe an mAP@R of 0.9748 and a precision@1 of 0.9963. Based on this high identification rate, a potential attacker may leak patient-related information and additionally cross-reference images to obtain more information. Thus, there is a great risk of sensitive content falling into unauthorized hands or being disseminated against the will of the concerned patients. Especially during the COVID-19 pandemic, numerous chest X-ray datasets have been published to advance research. Therefore, such data may be vulnerable to potential attacks by deep learning-based re-identification algorithms.

https://arxiv.org/abs/2103.08562

With the development of deep learning in recent years, publicly available medical datasets have become one of the key factors in the success of diagnostic algorithms. Medical data contains sensitive personal information, which is therefore usually removed before publication, e.g., patient names. To the best of our knowledge, we are the first group to show that a well-trained deep learning model can recover a patient's identity from chest X-ray data. We test on the widely used ChestX-ray14 dataset, a collection of 112,120 frontal-view chest X-ray images from 30,805 unique patients. Our system can reliably identify whether two X-ray images come from the same person, even when the two scans were taken many years apart. Given this high identification rate, a potential attacker could leak personal information and cross-reference images to obtain more. Sensitive information is therefore at high risk of being leaked. Especially during the COVID-19 pandemic, numerous chest X-ray datasets have been published, so the privacy of such data should be considered and effectively protected.

An Image is Worth 16×16 Words, What is a Video Worth?

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy, usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach 78.8 top-1 accuracy with ×30 less frames per video, and ×40 faster inference than the current leading method. 

https://arxiv.org/abs/2103.13915

Leading methods in action recognition distill information from both the spatial and temporal dimensions of a video. Methods that reach SOTA accuracy usually use 3D convolution layers to extract temporal information. Using such convolutions means the video must be cut into short clips before processing, each clip being a set of closely sampled frames. Since each clip covers only a small fraction of the video, multiple clips must be sampled at inference to cover its whole length, which increases the computational load and makes real-world deployment impractical. We greatly reduce this computation by cutting the number of frames required for inference. Our approach uses a temporal transformer that applies global attention over video frames and can therefore better exploit the salient information in each frame. As a result, our method is far more input-efficient and still reaches SOTA performance.

Paint by Word

We investigate the problem of zero-shot semantic image painting. Instead of painting modifications into an image using only concrete colors or a finite set of semantic concepts, we ask how to create semantic paint based on open full-text descriptions: our goal is to be able to point to a location in a synthesized image and apply an arbitrary new concept such as “rustic” or “opulent” or “happy dog.” To do this, our method combines a state-of-the-art generative model of realistic images with a state-of-the-art text-image semantic similarity network. We find that, to make large changes, it is important to use non-gradient methods to explore latent space, and it is important to relax the computations of the GAN to target changes to a specific region. We conduct user studies to compare our methods to several baselines.

https://arxiv.org/abs/2103.10951

In this paper we study zero-shot semantic image painting. Instead of painting discrete colors or a finite set of semantic concepts onto an image, we pose the problem of semantic painting from open full-text descriptions: the goal is to point at a region of a synthesized image and apply an arbitrary new concept described in text, such as "rustic", "opulent", or a specific pattern. To achieve this, our method combines an existing SOTA image generation model with a text-image semantic similarity network. We find that, to make large changes, non-gradient exploration of the latent space matters, and relaxing the GAN's computation to target a specific region becomes very important. We compare our method against several baselines through user studies.
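The "non-gradient exploration of the latent space" point can be illustrated with a toy hill-climbing loop: repeatedly perturb the best latent found so far and keep perturbations that raise a score. This is only a schematic stand-in; the `score` callable here plays the role of the text-image similarity network, and all names and parameters are hypothetical.

```python
import numpy as np

def latent_search(score, dim, iters=200, sigma=0.5, seed=0):
    """Toy non-gradient latent search (random hill climbing).

    Perturb the current latent with Gaussian noise and accept a
    candidate only if it improves `score`. Returns the best latent
    and its score.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=dim)
    best = score(z)
    for _ in range(iters):
        cand = z + sigma * rng.normal(size=dim)
        s = score(cand)
        if s > best:          # greedy acceptance, no gradients needed
            z, best = cand, s
    return z, best

# Stand-in objective: negative distance to a fixed "target" latent.
target = np.ones(16)
z, s = latent_search(lambda z: -np.sum((z - target) ** 2), dim=16)
print(z.shape)  # (16,)
```

Because the acceptance test only needs score evaluations, the same loop works with any black-box similarity model, which is the practical appeal of non-gradient search here.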

DeepViT: Towards Deeper Vision Transformer

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolution neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates fast when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting expected performance gain. Based on the above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modification to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet.

https://arxiv.org/abs/2103.11886

Recently, vision transformers (ViTs) have been successfully applied to image classification. In this paper, we find that the performance of ViTs, unlike that of CNNs, cannot be improved by stacking more layers; instead it saturates as depth grows. We observe that this problem is caused by attention collapse: as the number of transformer layers increases, the attention maps after a certain layer gradually become similar and even identical. In other words, the feature maps in the top layers of a deep ViT tend to be the same. This finding shows that for deeper ViTs the self-attention mechanism fails to learn effective features for representation learning, and thus yields no further performance gain. Based on this observation, we propose a simple yet effective method called Re-attention, which restores the diversity of attention maps across layers while consuming only a small amount of compute and memory. Our method makes it possible to train deeper ViT models while maintaining performance gains. Notably, our model with 32 transformer blocks improves Top-1 accuracy on ImageNet by 1.6%.
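The core of Re-attention is to mix the attention maps across heads with a small learnable matrix before applying them to the values, so deeper layers receive more diverse maps. A minimal NumPy sketch follows; the per-head normalization from the paper is omitted and the mixing matrix `theta` is left untrained, so this shows the mechanism only, not a faithful implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(q, k, v, theta):
    """Re-attention (sketched): blend per-head attention maps with a
    learnable (H x H) matrix theta before multiplying by the values.

    q, k, v: (H, n, d) per-head queries/keys/values. Returns (H, n, d).
    """
    H, n, d = q.shape
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (H, n, n)
    # each output head's map is a learned combination of all heads' maps
    mixed = np.einsum('hg,gnm->hnm', theta, attn)
    return mixed @ v

rng = np.random.default_rng(1)
H, n, d = 4, 6, 8
q, k, v = (rng.normal(size=(H, n, d)) for _ in range(3))
theta = np.eye(H)  # identity recovers ordinary multi-head attention
out = re_attention(q, k, v, theta)
print(out.shape)  # (4, 6, 8)
```

With `theta` set to the identity this reduces to standard multi-head attention, which is why the paper can describe Re-attention as a minor modification to existing ViT models.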

Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision

Transformer architectures have brought about fundamental changes to computational linguistic field, which had been dominated by recurrent neural networks for many years. Its success also implies drastic changes in cross-modal tasks with language and vision, and many researchers have already tackled the issue. In this paper, we review some of the most critical milestones in the field, as well as overall trends on how transformer architecture has been incorporated into visuolinguistic cross-modal tasks. Furthermore, we discuss its current limitations and speculate upon some of the prospects that we find imminent.

https://arxiv.org/abs/2103.04037

Transformer architectures have brought fundamental changes to computational linguistics, a field long dominated by recurrent neural networks. Their success also signals drastic changes in cross-modal tasks combining language and vision, and many researchers have already taken up the problem. In this paper we review some of the milestones in this field, as well as the overall trends in how transformer architectures have been incorporated into cross-modal tasks. We then discuss the current limitations of the transformer architecture and offer an outlook on its future.