Category Archive: Daily Paper Review

Few-shot Semantic Image Synthesis Using StyleGAN Prior

This paper tackles a challenging problem of generating photorealistic images from semantic layouts in few-shot scenarios where annotated training pairs are hardly available but pixel-wise annotation is quite costly. We present a training strategy that performs pseudo labeling of semantic masks using the StyleGAN prior. Our key idea is to construct a simple mapping between the StyleGAN feature and each semantic class from a few examples of semantic masks. With such mappings, we can generate an unlimited number of pseudo semantic masks from random noise to train an encoder for controlling a pre-trained StyleGAN generator. Although the pseudo semantic masks might be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images from not only dense semantic masks but also sparse inputs such as landmarks and scribbles. Qualitative and quantitative results with various datasets demonstrate improvement over previous approaches with respect to layout fidelity and visual quality in as few as one- or five-shot settings.

https://arxiv.org/abs/2103.14877

This paper tackles the task of generating high-quality images from semantic layouts in few-shot settings, where annotated training pairs are hard to obtain because pixel-wise annotation is costly. We propose a training strategy that uses the StyleGAN prior to perform pseudo labeling of semantic masks. The key idea is to build, from only a few annotated examples, a simple mapping between StyleGAN features and each semantic class. With such mappings, an unlimited number of pseudo semantic masks can be generated from random noise to train an encoder that controls a pre-trained StyleGAN generator. Although these pseudo masks may be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images not only from dense semantic masks but also from sparse inputs such as landmarks and scribbles. Experiments show that our method improves over previous approaches in one- and five-shot settings.
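
The pseudo-labeling step can be pictured as a nearest-prototype classifier over StyleGAN feature maps. Below is a minimal sketch of that idea, not the authors' implementation: `stylegan_features` is a hypothetical handle to a pre-trained StyleGAN that returns per-pixel features, and the class prototypes are simply the mean features of the annotated pixels in the few labeled examples.

```python
import torch
import torch.nn.functional as F

# Hypothetical handle to a pre-trained StyleGAN: given latent codes z, it is
# assumed to return (images, per-pixel feature maps upsampled to image size).
def stylegan_features(z):  # -> images [B,3,H,W], feats [B,C,H,W]
    raise NotImplementedError("plug in a pre-trained StyleGAN here")

def build_prototypes(feats, masks, num_classes):
    """Average the features of annotated pixels per class (few-shot examples).

    feats: [N, C, H, W] StyleGAN features of the few labeled images
    masks: [N, H, W]    integer semantic masks for those images
    """
    C = feats.shape[1]
    protos = torch.zeros(num_classes, C)
    for k in range(num_classes):
        sel = (masks == k).unsqueeze(1)            # [N,1,H,W]
        denom = sel.sum().clamp(min=1)
        protos[k] = (feats * sel).sum(dim=(0, 2, 3)) / denom
    return F.normalize(protos, dim=1)              # [K, C]

@torch.no_grad()
def pseudo_mask(feats, protos):
    """Assign every pixel to its nearest class prototype (cosine similarity)."""
    B, C, H, W = feats.shape
    f = F.normalize(feats.permute(0, 2, 3, 1).reshape(-1, C), dim=1)  # [BHW, C]
    sim = f @ protos.t()                                              # [BHW, K]
    return sim.argmax(dim=1).reshape(B, H, W)

# Usage sketch: sample random latents, synthesize images and pseudo masks,
# then train the encoder on the resulting (image, pseudo mask) pairs.
# z = torch.randn(8, 512); imgs, feats = stylegan_features(z)
# masks = pseudo_mask(feats, protos)
```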

ViViT: A Video Vision Transformer

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several, efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we will release code and models.

https://arxiv.org/abs/2103.15691

We present a pure-transformer model for video classification, building on the recent success of such models in image classification. The model extracts spatio-temporal tokens from the input video and encodes them with a series of transformer layers. To handle the long token sequences that arise in video, we propose several efficient variants of the model that factorise the spatial and temporal dimensions of the input. Although transformer-based models are generally thought to be effective only with large training datasets, we show that with proper regularisation and pretrained image models our model can be trained on comparatively small datasets. Evaluations on several video classification benchmarks show that our model outperforms prior methods based on deep 3D convolutional networks.
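
One of the factorised variants can be pictured as a spatial transformer applied within each frame followed by a temporal transformer across frames. The sketch below is a simplified illustration, not the authors' model: tubelet embedding is reduced to a single `Conv3d`, the spatial output is mean-pooled per frame instead of using per-frame class tokens, and both encoders are plain `nn.TransformerEncoder` stacks.

```python
import torch
import torch.nn as nn

class FactorisedViViT(nn.Module):
    """Simplified factorised-encoder sketch: spatial attention within frames,
    then temporal attention over per-frame tokens."""
    def __init__(self, num_classes, dim=192, patch=16, tube=2, heads=3, depth=4):
        super().__init__()
        # Tubelet embedding: a 3D conv that tokenises (tube x patch x patch) blocks.
        self.embed = nn.Conv3d(3, dim, kernel_size=(tube, patch, patch),
                               stride=(tube, patch, patch))
        enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True), depth)
        self.spatial, self.temporal = enc(), enc()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                      # video: [B, 3, T, H, W]
        x = self.embed(video)                      # [B, D, T', H', W']
        B, D, T, Hp, Wp = x.shape
        x = x.permute(0, 2, 3, 4, 1).reshape(B * T, Hp * Wp, D)
        x = self.spatial(x).mean(dim=1)            # spatial encoder, pooled per frame
        x = x.reshape(B, T, D)
        cls = self.cls.expand(B, -1, -1)
        x = self.temporal(torch.cat([cls, x], 1))  # temporal encoder over frames
        return self.head(x[:, 0])

# logits = FactorisedViViT(num_classes=400)(torch.randn(2, 3, 16, 224, 224))
```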

Can Vision Transformers Learn without Natural Images?

Can we complete pre-training of Vision Transformers (ViT) without natural images and human-annotated labels? Although a pre-trained ViT seems to heavily rely on a large-scale dataset and human-annotated labels, recent large-scale datasets contain several problems in terms of privacy violations, inadequate fairness protection, and labor-intensive annotation. In the present paper, we pre-train ViT without any image collections and annotation labor. We experimentally verify that our proposed framework partially outperforms sophisticated Self-Supervised Learning (SSL) methods like SimCLRv2 and MoCov2 without using any natural images in the pre-training phase. Moreover, although the ViT pre-trained without natural images produces some different visualizations from ImageNet pre-trained ViT, it can interpret natural image datasets to a large extent. For example, the performance rates on the CIFAR-10 dataset are as follows: our proposal 97.6 vs. SimCLRv2 97.4 vs. ImageNet 98.0.

https://arxiv.org/abs/2103.13023

Can we pre-train Vision Transformers (ViT) without natural images and human-annotated labels? Although ViT pre-training seems to rely heavily on large-scale datasets and human annotation, recent large-scale datasets raise problems of privacy violation, inadequate fairness protection, and labor-intensive annotation. In this paper, we pre-train ViT without any collected images or annotation labor. We verify experimentally that, without using any natural images in the pre-training phase, our framework partially outperforms sophisticated self-supervised learning methods such as SimCLRv2 and MoCov2. Moreover, although the ViT pre-trained without natural images produces visualizations that differ somewhat from an ImageNet pre-trained ViT, it can still interpret natural image datasets to a large extent.

Is Medical Chest X-ray Data Anonymous?

With the rise and ever-increasing potential of deep learning techniques in recent years, publicly available medical data sets became a key factor to enable reproducible development of diagnostic algorithms in the medical domain. Medical data contains sensitive patient-related information and is therefore usually anonymized by removing patient identifiers, e.g., patient names before publication. To the best of our knowledge, we are the first to show that a well-trained deep learning system is able to recover the patient identity from chest X-ray data. We demonstrate this using the publicly available large-scale ChestX-ray14 dataset, a collection of 112,120 frontal-view chest X-ray images from 30,805 unique patients. Our verification system is able to identify whether two frontal chest X-ray images are from the same person with an AUC of 0.9940 and a classification accuracy of 95.55%. We further highlight that the proposed system is able to reveal the same person even ten and more years after the initial scan. When pursuing a retrieval approach, we observe an mAP@R of 0.9748 and a precision@1 of 0.9963. Based on this high identification rate, a potential attacker may leak patient-related information and additionally cross-reference images to obtain more information. Thus, there is a great risk of sensitive content falling into unauthorized hands or being disseminated against the will of the concerned patients. Especially during the COVID-19 pandemic, numerous chest X-ray datasets have been published to advance research. Therefore, such data may be vulnerable to potential attacks by deep learning-based re-identification algorithms.

https://arxiv.org/abs/2103.08562

With the development of deep learning in recent years, publicly available medical datasets have become a key factor in the success of diagnostic algorithms. Medical data contains sensitive patient information, so identifiers such as patient names are usually removed before publication. To the best of our knowledge, we are the first to show that a well-trained deep learning system can recover patient identity from chest X-ray data. We demonstrate this on the publicly available ChestX-ray14 dataset, which contains 112,120 frontal-view chest X-ray images from 30,805 unique patients. Our verification system can reliably identify whether two frontal chest X-ray images come from the same person, even when the two scans were taken many years apart. Given such a high identification rate, a potential attacker could leak patient-related information and cross-reference images to obtain even more. Sensitive content is therefore at great risk of falling into unauthorized hands or being disseminated against patients' will. Especially during the COVID-19 pandemic, numerous chest X-ray datasets have been published, so the privacy of such data needs to be carefully protected.
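
To make the threat concrete, the sketch below shows only the decision rule of such a verification system: embed two frontal X-rays and threshold their cosine similarity. The embedding network (an off-the-shelf ResNet-50 with its classifier removed) and the threshold are placeholders, not the trained model evaluated in the paper.

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Placeholder embedding network: an ImageNet ResNet-50 with the classifier
# removed. The paper trains a dedicated verification network; this only
# illustrates the same-patient decision rule.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

prep = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.Grayscale(num_output_channels=3),   # X-rays are single channel
    transforms.ToTensor(),
])

@torch.no_grad()
def same_patient(path_a, path_b, threshold=0.9):
    """Return (similarity, decision) for two frontal chest X-ray files."""
    a = prep(Image.open(path_a)).unsqueeze(0)
    b = prep(Image.open(path_b)).unsqueeze(0)
    ea, eb = backbone(a), backbone(b)              # [1, 2048] embeddings
    sim = F.cosine_similarity(ea, eb).item()
    return sim, sim > threshold                    # threshold is a placeholder
```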

An Image is Worth 16×16 Words, What is a Video Worth?

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy, usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach 78.8 top-1 accuracy with ×30 less frames per video, and ×40 faster inference than the current leading method. 

https://arxiv.org/abs/2103.13915

Leading methods in action recognition distill information from both the spatial and temporal dimensions of an input video. Methods that reach state-of-the-art accuracy usually rely on 3D convolution layers to capture temporal information. Using such convolutions requires sampling short clips from the input video, each clip being a set of closely sampled frames. Since each clip covers only a small fraction of the video, multiple clips must be sampled at inference to cover its whole temporal extent, which increases the computational load and makes real-world deployment impractical. We address this bottleneck by significantly reducing the number of frames required for inference. Our approach uses a temporal transformer that applies global attention over video frames and thus better exploits the salient information in each frame. It is therefore very input-efficient and reaches state-of-the-art results on Kinetics with only a fraction of the frames, computation, and latency.
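
The input-efficiency argument can be made concrete with a small sketch: sample a handful of frames uniformly across the whole video, embed each frame with a 2D image backbone, and let a temporal transformer attend globally over those frame tokens in a single forward pass instead of averaging many densely sampled clips. The backbone and dimensions below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class GlobalTemporalClassifier(nn.Module):
    """Sketch: per-frame 2D features + a temporal transformer with global
    attention over all sampled frames, classified from a [CLS] token."""
    def __init__(self, num_classes, num_frames=16, dim=512, heads=8, depth=6):
        super().__init__()
        resnet = models.resnet34(weights=None)
        resnet.fc = nn.Identity()                   # frame embedder -> 512-d
        self.frame_encoder = resnet
        self.pos = nn.Parameter(torch.zeros(1, num_frames + 1, dim))
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frames):                      # frames: [B, T, 3, H, W]
        B, T = frames.shape[:2]
        f = self.frame_encoder(frames.flatten(0, 1))     # [B*T, 512]
        f = f.reshape(B, T, -1)
        x = torch.cat([self.cls.expand(B, -1, -1), f], 1) + self.pos[:, :T + 1]
        return self.head(self.temporal(x)[:, 0])

def sample_uniform_frames(video, num_frames=16):
    """Pick frames spread over the whole video: [B, C, T, H, W] -> [B, T', C, H, W]."""
    idx = torch.linspace(0, video.shape[2] - 1, num_frames).long()
    return video[:, :, idx].permute(0, 2, 1, 3, 4)

# logits = GlobalTemporalClassifier(400)(
#     sample_uniform_frames(torch.randn(1, 3, 300, 224, 224)))
```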

Paint by Word

We investigate the problem of zero-shot semantic image painting. Instead of painting modifications into an image using only concrete colors or a finite set of semantic concepts, we ask how to create semantic paint based on open full-text descriptions: our goal is to be able to point to a location in a synthesized image and apply an arbitrary new concept such as “rustic” or “opulent” or “happy dog.” To do this, our method combines a state-of-the art generative model of realistic images with a state-of-the-art text-image semantic similarity network. We find that, to make large changes, it is important to use non-gradient methods to explore latent space, and it is important to relax the computations of the GAN to target changes to a specific region. We conduct user studies to compare our methods to several baselines.

https://arxiv.org/abs/2103.10951

In this paper we study zero-shot semantic image painting. Instead of painting modifications into an image with only concrete colors or a finite set of semantic concepts, we ask how to paint semantically from open full-text descriptions: the goal is to point to a location in a synthesized image and apply an arbitrary new concept such as "rustic", "opulent", or "happy dog". To do this, our method combines a state-of-the-art generative model of realistic images with a state-of-the-art text-image semantic similarity network. We find that, to make large changes, it is important to explore the latent space with non-gradient methods, and to relax the GAN's computation so that changes are targeted at a specific region. We conduct user studies comparing our method with several baselines.
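
A minimal way to picture the search is gradient-free hill climbing in the generator's latent space, scoring each candidate by the text-image similarity of the edited region only. Everything below is a hypothetical sketch: `generator` and `image_text_score` stand in for the paper's GAN and its CLIP-style similarity network, and the simple mask compositing is a crude substitute for the paper's region-targeted relaxation of the GAN.

```python
import torch

def paint_by_word(generator, image_text_score, z0, region_mask, text,
                  iters=200, pop=32, sigma=0.5):
    """Gradient-free random search ("hill climbing") in latent space.

    generator(z)            -> image [1, 3, H, W]   (hypothetical GAN)
    image_text_score(i, t)  -> scalar semantic similarity (hypothetical, CLIP-like)
    z0                      -> starting latent, shape [1, latent_dim]
    region_mask             -> [1, 1, H, W] in {0, 1}, the area to repaint
    """
    base = generator(z0)
    best_z, best_score = z0.clone(), -float("inf")
    for _ in range(iters):
        # Propose candidates around the current best latent (non-gradient step).
        cand = best_z + sigma * torch.randn(pop, z0.shape[1])
        for z in cand:
            img = generator(z.unsqueeze(0))
            # Composite: keep the original outside the region, edit inside it,
            # so the score only rewards changes in the targeted area.
            edited = region_mask * img + (1 - region_mask) * base
            score = image_text_score(edited, text)
            if score > best_score:
                best_score, best_z = score, z.unsqueeze(0).clone()
    return generator(best_z), best_z
```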

DeepViT: Towards Deeper Vision Transformer

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolution neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturate fast when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting expected performance gain. Based on above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modification to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet.

https://arxiv.org/abs/2103.11886

Vision transformers (ViTs) have recently been applied successfully to image classification. In this paper, we show that, unlike CNNs, whose performance can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly as they are made deeper. We observe empirically that this scaling difficulty is caused by attention collapse: as the transformer goes deeper, the attention maps beyond a certain layer gradually become similar or even nearly identical. In other words, the feature maps in the top layers of deep ViTs tend to be the same. This shows that in deeper layers the self-attention mechanism fails to learn effective representations and prevents the expected performance gain. Based on this observation, we propose a simple yet effective method called Re-attention, which regenerates the attention maps to increase their diversity across layers at negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modifications to existing ViTs. Notably, a deep ViT with 32 transformer blocks improves Top-1 accuracy on ImageNet by 1.6%.
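
The Re-attention idea can be sketched as a learnable mixing of the per-head attention maps before they are applied to the values, which is what restores diversity between layers. The module below follows that description at a high level but is not the reference implementation; in particular, the normalisation choice is an assumption.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Sketch of Re-attention: exchange information across attention heads
    with a learnable head-mixing matrix before applying attention to V."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.mix = nn.Parameter(torch.eye(heads) + 0.01 * torch.randn(heads, heads))
        self.norm = nn.BatchNorm2d(heads)      # normalisation choice is an assumption
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: [B, N, D]
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, D // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)    # each [B, H, N, d]
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)  # [B, H, N, N]
        # Re-attention: mix the per-head attention maps with a learnable matrix.
        attn = self.norm(torch.einsum("gh,bhij->bgij", self.mix, attn))
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

# y = ReAttention(dim=256, heads=8)(torch.randn(2, 197, 256))
```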

Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision

Transformer architectures have brought about fundamental changes to computational linguistic field, which had been dominated by recurrent neural networks for many years. Its success also implies drastic changes in cross-modal tasks with language and vision, and many researchers have already tackled the issue. In this paper, we review some of the most critical milestones in the field, as well as overall trends on how transformer architecture has been incorporated into visuolinguistic cross-modal tasks. Furthermore, we discuss its current limitations and speculate upon some of the prospects that we find imminent.

https://arxiv.org/abs/2103.04037

Transformer architectures have brought fundamental changes to computational linguistics, a field long dominated by recurrent neural networks. Their success also implies drastic changes in cross-modal tasks involving language and vision, and many researchers are already tackling these problems. In this paper, we review some of the most important milestones in the field as well as the overall trends in how transformer architectures have been incorporated into visuolinguistic cross-modal tasks. We also discuss the current limitations of the transformer architecture and speculate on some prospects we consider imminent.

HumanGAN: A Generative Model of Human Images

Generative adversarial networks achieve great performance in photorealistic image synthesis in various domains, including human images. However, they usually employ latent vectors that encode the sampled outputs globally. This does not allow convenient control of semantically-relevant individual parts of the image, and is not able to draw samples that only differ in partial aspects, such as clothing style. We address these limitations and present a generative model for images of dressed humans offering control over pose, local body part appearance and garment style. This is the first method to solve various aspects of human image generation such as global appearance sampling, pose transfer, parts and garment transfer, and parts sampling jointly in a unified framework. As our model encodes part-based latent appearance vectors in a normalized pose-independent space and warps them to different poses, it preserves body and clothing appearance under varying posture. Experiments show that our flexible and general generative method outperforms task-specific baselines for pose-conditioned image generation, pose transfer and part sampling in terms of realism and output resolution.

https://arxiv.org/abs/2103.06902

Generative adversarial networks achieve impressive photorealistic image synthesis across many domains, including human images. However, they usually employ latent vectors that encode the sampled outputs globally, which makes it inconvenient to control semantically relevant individual parts of the image and impossible to draw samples that differ only in partial aspects such as clothing style. We address these limitations with a generative model of images of dressed humans that offers control over pose, local body-part appearance, and garment style. It is the first method to solve global appearance sampling, pose transfer, part and garment transfer, and part sampling jointly in a unified framework. Because the model encodes part-based latent appearance vectors in a normalized, pose-independent space and warps them to different poses, body and clothing appearance are preserved under varying posture. Experiments show that our flexible and general generative method outperforms task-specific baselines for pose-conditioned image generation, pose transfer, and part sampling in terms of realism and output resolution.
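
One way to picture "part-based latent appearance vectors in a normalized pose-independent space, warped to different poses" is to pool one appearance code per body part and then paint those codes back into a feature map laid out by a target-pose part segmentation before decoding. The sketch below is purely conceptual and is not the HumanGAN architecture; the tiny encoder/decoder and the mask-based placement are stand-ins for the paper's warping module.

```python
import torch
import torch.nn as nn

class PartAppearanceSketch(nn.Module):
    """Conceptual sketch: per-part appearance codes placed according to a
    target-pose part segmentation, then decoded to an image."""
    def __init__(self, num_parts=8, part_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(                 # image -> per-pixel features
            nn.Conv2d(3, part_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(part_dim, part_dim, 3, padding=1))
        self.decoder = nn.Sequential(                 # part feature map -> image
            nn.Conv2d(part_dim, part_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(part_dim, 3, 3, padding=1), nn.Tanh())

    def encode_parts(self, image, part_mask):
        """Pool a pose-independent appearance vector per part.
        image: [B,3,H,W]; part_mask: [B,P,H,W] soft/one-hot part segmentation."""
        feats = self.encoder(image)                                  # [B,C,H,W]
        area = part_mask.sum(dim=(2, 3)).clamp(min=1)                # [B,P]
        codes = torch.einsum("bchw,bphw->bpc", feats, part_mask) / area.unsqueeze(-1)
        return codes                                                 # [B,P,C]

    def render(self, codes, target_mask):
        """Scatter the part codes into the target pose's part layout and decode."""
        fmap = torch.einsum("bpc,bphw->bchw", codes, target_mask)
        return self.decoder(fmap)

# model = PartAppearanceSketch()
# codes = model.encode_parts(src_img, src_parts)   # appearance from the source pose
# out   = model.render(codes, tgt_parts)           # re-rendered in the target pose
```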

CheXseen: Unseen Disease Detection for Deep Learning Interpretation of Chest X-rays

We systematically evaluate the performance of deep learning models in the presence of diseases not labeled for or present during training. First, we evaluate whether deep learning models trained on a subset of diseases (seen diseases) can detect the presence of any one of a larger set of diseases. We find that models tend to falsely classify diseases outside of the subset (unseen diseases) as “no disease”. Second, we evaluate whether models trained on seen diseases can detect seen diseases when co-occurring with diseases outside the subset (unseen diseases). We find that models are still able to detect seen diseases even when co-occurring with unseen diseases. Third, we evaluate whether feature representations learned by models may be used to detect the presence of unseen diseases given a small labeled set of unseen diseases. We find that the penultimate layer of the deep neural network provides useful features for unseen disease detection. Our results can inform the safe clinical deployment of deep learning models trained on a non-exhaustive set of disease classes.

https://arxiv.org/abs/2103.04590

We systematically evaluate how deep learning models behave in the presence of diseases that were not labeled or present during training. First, we evaluate whether models trained on a subset of diseases (seen diseases) can detect the presence of any disease from a larger set; we find that they tend to falsely classify diseases outside the subset (unseen diseases) as "no disease". Second, we evaluate whether models trained on seen diseases can still detect them when they co-occur with unseen diseases; we find that they can. Third, we evaluate whether the learned feature representations can be used to detect unseen diseases given only a small labeled set of them; we find that the penultimate layer of the deep neural network provides useful features for unseen disease detection. Our results can inform the safe clinical deployment of deep learning models trained on a non-exhaustive set of disease classes.
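
The third finding suggests a simple recipe: freeze the trained classifier, read out its penultimate-layer activations, and fit a small probe on the few labeled examples of the unseen disease. The sketch below assumes a torchvision DenseNet-121 as a stand-in for the chest X-ray classifier and a scikit-learn logistic regression as the probe; neither is the paper's exact setup.

```python
import torch
from torchvision import models
from sklearn.linear_model import LogisticRegression

# Stand-in for a classifier trained on the *seen* diseases.
model = models.densenet121(weights=None)
model.classifier = torch.nn.Identity()     # keep penultimate (1024-d) features
model.eval()

@torch.no_grad()
def penultimate_features(images):
    """images: [N, 3, 224, 224] -> [N, 1024] penultimate-layer features."""
    return model(images).cpu().numpy()

def fit_unseen_probe(support_images, support_labels):
    """Fit a linear probe on a small labeled set of the unseen disease.
    support_labels: 1 = unseen disease present, 0 = absent."""
    feats = penultimate_features(support_images)
    return LogisticRegression(max_iter=1000).fit(feats, support_labels)

# probe  = fit_unseen_probe(small_labeled_images, small_labels)
# scores = probe.predict_proba(penultimate_features(test_images))[:, 1]
```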