Category Archive: Daily Paper Review

On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation

We investigate the sensitivity of the Fréchet Inception Distance (FID) score to inconsistent and often incorrect implementations across different image processing libraries. FID score is widely used to evaluate generative models, but each FID implementation uses a different low-level image processing process. Image resizing functions in commonly-used deep learning libraries often introduce aliasing artifacts. We observe that numerous subtle choices need to be made for FID calculation and a lack of consistencies in these choices can lead to vastly different FID scores. In particular, we show that the following choices are significant: (1) selecting what image resizing library to use, (2) choosing what interpolation kernel to use, (3) what encoding to use when representing images. We additionally outline numerous common pitfalls that should be avoided and provide recommendations for computing the FID score accurately. We provide an easy-to-use optimized implementation of our proposed recommendations in the accompanying code.

https://arxiv.org/abs/2104.11222

We find that the FID score is sensitive to inconsistent, and often incorrect, implementations across different image-processing libraries. Although FID is a widely used metric for evaluating generative models, each library implements it with different low-level image processing. We observe that the image resizing functions in common deep learning libraries introduce aliasing artifacts, so several choices in the FID computation must be made carefully to avoid such distortions: (1) which library to use for image resizing; (2) which interpolation kernel to use; (3) which encoding to use when saving images.
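
To illustrate the kind of discrepancy the paper is concerned with, here is a minimal sketch (my own toy example, not the authors' clean-fid code) that resizes the same random image with PIL's bicubic filter and with PyTorch's bicubic interpolation. PIL widens its filter support when downscaling, PyTorch by default does not, so two nominally identical resize calls disagree, and such differences propagate into the Inception features and hence the FID value.

import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image

rng = np.random.default_rng(0)
img = (rng.random((512, 512, 3)) * 255).astype(np.uint8)  # stand-in for a generated image

# PIL bicubic resize: the kernel support is widened when downscaling,
# which acts as an antialiasing prefilter.
pil_resized = np.asarray(
    Image.fromarray(img).resize((299, 299), resample=Image.BICUBIC), dtype=np.float32
)

# PyTorch bicubic resize: no antialiasing prefilter by default, so downscaling can alias
# (recent PyTorch versions add an antialias flag to F.interpolate to mitigate this).
t = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0).float()
torch_resized = (
    F.interpolate(t, size=(299, 299), mode="bicubic", align_corners=False)
    .squeeze(0).permute(1, 2, 0).numpy()
)

# The two results differ even though both claim to be "bicubic resizing to 299x299".
print("mean absolute difference per pixel:", np.abs(pil_resized - torch_resized).mean())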

Cetacean Translation Initiative: a roadmap to deciphering the communication of sperm whales

The past decade has witnessed a groundbreaking rise of machine learning for human language analysis, with current methods capable of automatically accurately recovering various aspects of syntax and semantics – including sentence structure and grounded word meaning – from large data collections. Recent research showed the promise of such tools for analyzing acoustic communication in nonhuman species. We posit that machine learning will be the cornerstone of future collection, processing, and analysis of multimodal streams of data in animal communication studies, including bioacoustic, behavioral, biological, and environmental data. Cetaceans are unique non-human model species as they possess sophisticated acoustic communications, but utilize a very different encoding system that evolved in an aquatic rather than terrestrial medium. Sperm whales, in particular, with their highly-developed neuroanatomical features, cognitive abilities, social structures, and discrete click-based encoding make for an excellent starting point for advanced machine learning tools that can be applied to other animals in the future. This paper details a roadmap toward this goal based on currently existing technology and multidisciplinary scientific community effort. We outline the key elements required for the collection and processing of massive bioacoustic data of sperm whales, detecting their basic communication units and language-like higher-level structures, and validating these models through interactive playback experiments. The technological capabilities developed by such an undertaking are likely to yield cross-applications and advancements in broader communities investigating non-human communication and animal behavioral research.

https://arxiv.org/abs/2104.08614

Recent machine learning methods can accurately recover syntax and semantics, including sentence structure and grounded word meaning, from large-scale data collections, and recent research shows that such techniques can also be used to analyse acoustic communication between animals. We apply machine learning to the communication of sperm whales, which possess highly developed neuroanatomy, cognitive abilities, and social structures; this will serve as a reference for future research on other species. The paper presents a detailed roadmap for collecting and processing the bioacoustic signals of sperm whales, detecting their basic communication units and language-like higher-level structures, and validating these models through interactive playback experiments.

DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort

We introduce DatasetGAN: an automatic procedure to generate massive datasets of high-quality semantically segmented images requiring minimal human effort. Current deep networks are extremely data-hungry, benefiting from training on large-scale datasets, which are time consuming to annotate. Our method relies on the power of recent GANs to generate realistic images. We show how the GAN latent code can be decoded to produce a semantic segmentation of the image. Training the decoder only needs a few labeled examples to generalize to the rest of the latent space, resulting in an infinite annotated dataset generator! These generated datasets can then be used for training any computer vision architecture just as real datasets are. As only a few images need to be manually segmented, it becomes possible to annotate images in extreme detail and generate datasets with rich object and part segmentations. To showcase the power of our approach, we generated datasets for 7 image segmentation tasks which include pixel-level labels for 34 human face parts, and 32 car parts. Our approach outperforms all semi-supervised baselines significantly and is on par with fully supervised methods, which in some cases require as much as 100x more annotated data as our method.

https://arxiv.org/abs/2104.06490

In this paper, we propose DatasetGAN, which can generate large amounts of labeled data for semantic segmentation tasks. The GAN latent code can be decoded into a segmentation map, and training the decoder requires only a small number of annotated examples.
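
As a rough sketch of the idea, assume we already have per-pixel StyleGAN feature vectors (upsampled and concatenated across layers) for a few manually segmented synthetic images; a tiny per-pixel classifier can then be trained on those few labels and afterwards applied to the features of any newly sampled image. The dimensions and the small MLP below are my own illustrative stand-ins, not the actual DatasetGAN implementation.

import torch
import torch.nn as nn

FEATURE_DIM = 6080   # assumed channel count of the concatenated StyleGAN feature maps
NUM_CLASSES = 34     # e.g., face-part classes

class PixelClassifier(nn.Module):
    """Classifies each pixel independently from its stacked GAN feature vector."""
    def __init__(self, feature_dim, num_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, feats):            # feats: (num_pixels, feature_dim)
        return self.mlp(feats)           # logits: (num_pixels, num_classes)

# Few-shot training: pixel features and labels come from a handful of annotated GAN samples.
clf = PixelClassifier(FEATURE_DIM, NUM_CLASSES)
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
feats = torch.randn(1024, FEATURE_DIM)               # placeholder pixel features
labels = torch.randint(0, NUM_CLASSES, (1024,))      # placeholder pixel labels
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(clf(feats), labels)
    loss.backward()
    opt.step()

# Sampling a new latent, reading off its features, and running the classifier then
# yields a synthetic image together with its segmentation, i.e. a labeled dataset.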

Escaping the Big Data Paradigm with Compact Transformers


With the rise of Transformers as the standard for language processing, and their advancements in computer vision, along with their unprecedented size and amounts of training data, many have come to believe that they are not suitable for small sets of data. This trend leads to great concerns, including but not limited to: limited availability of data in certain scientific domains and the exclusion of those with limited resource from research in the field. In this paper, we dispel the myth that transformers are “data hungry” and therefore can only be applied to large sets of data. We show for the first time that with the right size and tokenization, transformers can perform head-to-head with state-of-the-art CNNs on small datasets. Our model eliminates the requirement for class token and positional embeddings through a novel sequence pooling strategy and the use of convolutions. We show that compared to CNNs, our compact transformers have fewer parameters and MACs, while obtaining similar accuracies. Our method is flexible in terms of model size, and can have as little as 0.28M parameters and achieve reasonable results. It can reach an accuracy of 94.72% when training from scratch on CIFAR-10, which is comparable with modern CNN based approaches, and a significant improvement over previous Transformer based models. Our simple and compact design democratizes transformers by making them accessible to those equipped with basic computing resources and/or dealing with important small datasets.

https://arxiv.org/abs/2104.05704

Conventional ViT models require large amounts of training data. To address this, we propose the CCT architecture, which can be trained with little data and still match the performance of CNNs. Our model removes the dependence on a class token and positional embeddings through a novel sequence pooling strategy and the use of convolutions. Experiments show that our model achieves accuracy similar to state-of-the-art models with fewer parameters and MACs.
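
The sequence pooling mentioned above can be sketched as a learned attention over the token sequence that replaces the class token; the shapes below are my own assumptions rather than the official CCT code.

import torch
import torch.nn as nn

class SeqPool(nn.Module):
    """Pools a token sequence into one vector via a learned softmax weighting."""
    def __init__(self, embed_dim):
        super().__init__()
        self.attn = nn.Linear(embed_dim, 1)            # one importance score per token

    def forward(self, x):                              # x: (batch, num_tokens, embed_dim)
        w = torch.softmax(self.attn(x), dim=1)         # (batch, num_tokens, 1)
        return (w.transpose(1, 2) @ x).squeeze(1)      # (batch, embed_dim)

tokens = torch.randn(8, 64, 256)    # e.g., conv-tokenized image patches after the transformer
pooled = SeqPool(256)(tokens)       # (8, 256), fed to a plain linear classifier head
print(pooled.shape)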

InfinityGAN: Towards Infinite-Resolution Image Synthesis

We present InfinityGAN, a method to generate arbitrary-resolution images. The problem is associated with several key challenges. First, scaling existing models to a high resolution is resource-constrained, both in terms of computation and availability of high-resolution training data. InfinityGAN trains and infers patch-by-patch seamlessly with low computational resources. Second, large images should be locally and globally consistent, avoid repetitive patterns, and look realistic. To address these, InfinityGAN takes global appearance, local structure and texture into account. With this formulation, we can generate images with resolution and level of detail not attainable before. Experimental evaluation supports that InfinityGAN generates images with superior global structure compared to baselines, at the same time featuring parallelizable inference. Finally, we show several applications unlocked by our approach, such as fusing styles spatially, multi-modal outpainting and image inbetweening at arbitrary input and output resolutions.

https://arxiv.org/abs/2104.03963

Arbitrary-resolution image generation faces several challenges: (1) generating high-resolution images demands large amounts of computation and high-resolution training data; (2) the different parts of a high-resolution image should be consistent with each other, avoid repetitive patterns, and look realistic. To address these problems, this paper proposes InfinityGAN, a method that can generate images of arbitrary resolution by training and inferring patch by patch. Our method takes global appearance, local structure, and texture into account simultaneously, and can therefore generate high-resolution images that previous methods could not.
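
To make the patch-by-patch formulation concrete, here is a toy sketch (not the paper's architecture): every patch is generated from the same global appearance code plus its own local latents and spatial coordinates, so adjacent patches can stay globally consistent while varying locally, and the image size is limited only by how many patches are generated.

import torch
import torch.nn as nn

class PatchGenerator(nn.Module):
    """Toy per-pixel generator conditioned on a global code, local codes and coordinates."""
    def __init__(self, global_dim=64, local_dim=16):
        super().__init__()
        in_dim = global_dim + local_dim + 2           # +2 for the (x, y) coordinates
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 3),                        # RGB value per pixel
        )

    def forward(self, z_global, z_local, coords):
        # z_global: (global_dim,), z_local: (H, W, local_dim), coords: (H, W, 2)
        h, w, _ = coords.shape
        g = z_global.expand(h, w, -1)                 # same global code for every pixel
        return self.net(torch.cat([g, z_local, coords], dim=-1))   # (H, W, 3)

gen = PatchGenerator()
z_global = torch.randn(64)                            # shared across the whole large image
ys, xs = torch.meshgrid(torch.arange(32), torch.arange(32), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).float() / 100.0
# Two horizontally adjacent 32x32 patches: same global code, shifted coordinates.
left = gen(z_global, torch.randn(32, 32, 16), coords)
right = gen(z_global, torch.randn(32, 32, 16), coords + torch.tensor([0.32, 0.0]))
print(left.shape, right.shape)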

Few-shot Semantic Image Synthesis Using StyleGAN Prior

This paper tackles a challenging problem of generating photorealistic images from semantic layouts in few-shot scenarios where annotated training pairs are hardly available but pixel-wise annotation is quite costly. We present a training strategy that performs pseudo labeling of semantic masks using the StyleGAN prior. Our key idea is to construct a simple mapping between the StyleGAN feature and each semantic class from a few examples of semantic masks. With such mappings, we can generate an unlimited number of pseudo semantic masks from random noise to train an encoder for controlling a pre-trained StyleGAN generator. Although the pseudo semantic masks might be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images from not only dense semantic masks but also sparse inputs such as landmarks and scribbles. Qualitative and quantitative results with various datasets demonstrate improvement over previous approaches with respect to layout fidelity and visual quality in as few as one- or five-shot settings.

https://arxiv.org/abs/2103.14877

This paper focuses on the task of generating high-quality images from semantic layouts in few-shot scenarios, where pixel-wise labels are costly to obtain. We propose a training strategy that uses the StyleGAN prior to generate pseudo labels. Our central idea is to build a mapping from StyleGAN features to each semantic class using only a few annotated masks. With this mapping, we can generate an unlimited number of pseudo semantic masks from random noise to train an encoder, which is then used to control a pre-trained StyleGAN generator. Although such pseudo labels may be too coarse for previous approaches, which require pixel-aligned masks, our method can synthesize high-quality images not only from dense semantic masks but also from sparse inputs such as landmarks and scribbles. Experiments demonstrate the performance gains of our method in one-shot and five-shot settings.
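
Below is a hedged, toy sketch of the training loop implied by this summary: pseudo mask/latent pairs are produced from random noise and used to train an encoder against a frozen generator. The tiny stand-in modules are purely illustrative and are not the StyleGAN architecture or the authors' code.

import torch
import torch.nn as nn

NUM_CLASSES, LATENT, SIZE = 8, 64, 32

frozen_generator = nn.Linear(LATENT, 3 * SIZE * SIZE)              # stand-in for a pre-trained StyleGAN
pseudo_labeler = nn.Linear(LATENT, NUM_CLASSES * SIZE * SIZE)      # stand-in for the feature-to-class mapping
encoder = nn.Sequential(nn.Flatten(), nn.Linear(NUM_CLASSES * SIZE * SIZE, LATENT))
for p in list(frozen_generator.parameters()) + list(pseudo_labeler.parameters()):
    p.requires_grad_(False)

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for _ in range(100):
    w = torch.randn(16, LATENT)                                    # random noise -> unlimited pseudo pairs
    mask = pseudo_labeler(w).view(16, NUM_CLASSES, SIZE, SIZE)     # pseudo semantic mask
    w_hat = encoder(mask)                                          # encoder learns mask -> latent
    loss = nn.functional.mse_loss(frozen_generator(w_hat), frozen_generator(w))
    opt.zero_grad()
    loss.backward()
    opt.step()

# At test time a (possibly coarse or sparse) semantic mask is passed through the encoder,
# and the frozen generator synthesizes the corresponding photorealistic image.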

ViViT: A Video Vision Transformer

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several, efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we will release code and models.

https://arxiv.org/abs/2103.15691

We present a pure-transformer model for video classification, building on the success such models have had in image classification. Our model extracts spatio-temporal tokens from the input video and encodes them with a series of transformer layers. To handle the resulting long token sequences, we propose several variants of our model that factorise the input along the spatial and temporal dimensions. Although transformer-based models are generally believed to work only with large-scale training sets, our model, with the help of regularisation and pre-trained image models, achieves results on small datasets that rival those obtained with large-scale training. Evaluations on several benchmarks show that our model outperforms methods based on 3D convolutional networks.
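
One common way to realize the spatio-temporal token extraction described above is a 3D-convolutional ("tubelet") embedding followed by a standard transformer encoder; the patch sizes and layer counts below are my assumptions, not the paper's exact configuration (whose factorised variants further split attention across space and time).

import torch
import torch.nn as nn

B, C, T, H, W = 2, 3, 16, 224, 224             # a small batch of short RGB clips
embed_dim, t_patch, s_patch = 768, 2, 16

# Each (2 x 16 x 16) tubelet of the video becomes one token.
to_tokens = nn.Conv3d(C, embed_dim, kernel_size=(t_patch, s_patch, s_patch),
                      stride=(t_patch, s_patch, s_patch))

video = torch.randn(B, C, T, H, W)
tokens = to_tokens(video)                       # (B, 768, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)      # (B, 8*14*14, 768) -> transformer input

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=2,
)
out = encoder(tokens)                           # joint space-time attention over all tokens
print(out.shape)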

Can Vision Transformers Learn without Natural Images?

Can we complete pre-training of Vision Transformers (ViT) without natural images and human-annotated labels? Although a pre-trained ViT seems to heavily rely on a large-scale dataset and human-annotated labels, recent large-scale datasets contain several problems in terms of privacy violations, inadequate fairness protection, and labor-intensive annotation. In the present paper, we pre-train ViT without any image collections and annotation labor. We experimentally verify that our proposed framework partially outperforms sophisticated Self-Supervised Learning (SSL) methods like SimCLRv2 and MoCov2 without using any natural images in the pre-training phase. Moreover, although the ViT pre-trained without natural images produces some different visualizations from ImageNet pre-trained ViT, it can interpret natural image datasets to a large extent. For example, the performance rates on the CIFAR-10 dataset are as follows: our proposal 97.6 vs. SimCLRv2 97.4 vs. ImageNet 98.0.

https://arxiv.org/abs/2103.13023

Can we train Vision Transformers without natural images and human annotations? Although ViT pre-training appears to depend heavily on large-scale datasets and manual labels, recent large-scale datasets suffer from problems such as privacy violations, inadequate fairness protection, and labor-intensive annotation. In this paper, we pre-train ViT without any large-scale annotated data. We verify that our network partially outperforms several self-supervised learning methods while using no natural images during pre-training. Moreover, although our network is pre-trained without natural images and produces somewhat different visualizations from an ImageNet pre-trained ViT, it can still interpret natural image datasets to a large extent.

Is Medical Chest X-ray Data Anonymous?

With the rise and ever-increasing potential of deep learning techniques in recent years, publicly available medical data sets became a key factor to enable reproducible development of diagnostic algorithms in the medical domain. Medical data contains sensitive patient-related information and is therefore usually anonymized by removing patient identifiers, e.g., patient names before publication. To the best of our knowledge, we are the first to show that a well-trained deep learning system is able to recover the patient identity from chest X-ray data. We demonstrate this using the publicly available large-scale ChestX-ray14 dataset, a collection of 112,120 frontal-view chest X-ray images from 30,805 unique patients. Our verification system is able to identify whether two frontal chest X-ray images are from the same person with an AUC of 0.9940 and a classification accuracy of 95.55%. We further highlight that the proposed system is able to reveal the same person even ten and more years after the initial scan. When pursuing a retrieval approach, we observe an mAP@R of 0.9748 and a precision@1 of 0.9963. Based on this high identification rate, a potential attacker may leak patient-related information and additionally cross-reference images to obtain more information. Thus, there is a great risk of sensitive content falling into unauthorized hands or being disseminated against the will of the concerned patients. Especially during the COVID-19 pandemic, numerous chest X-ray datasets have been published to advance research. Therefore, such data may be vulnerable to potential attacks by deep learning-based re-identification algorithms.

https://arxiv.org/abs/2103.08562

With the development of deep learning in recent years, publicly available medical datasets have become one of the key factors behind successful diagnostic algorithms. Medical data contains sensitive personal information, which is therefore usually removed before publication, for example the patient's name. To the best of our knowledge, we are the first group to show that a well-trained deep learning model can recover a patient's identity from chest X-ray data. We test on the public ChestX-ray14 dataset, which contains 112,120 frontal-view chest X-ray images collected from 30,805 unique patients. Our system can reliably identify whether two X-ray images come from the same person, even when the two scans were taken many years apart. Given such a high identification rate, a potential attacker could leak this personal information and obtain even more by cross-referencing images. Sensitive information is therefore at high risk of being leaked. In particular, many chest X-ray datasets have been made public during the COVID-19 pandemic, so the privacy of such data should be considered and effectively protected.

An Image is Worth 16×16 Words, What is a Video Worth?

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy, usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach 78.8 top-1 accuracy with ×30 less frames per video, and ×40 faster inference than the current leading method. 

https://arxiv.org/abs/2103.13915

Leading methods in action recognition need to extract both spatial and temporal information from a video. Methods that reach state-of-the-art accuracy usually use 3D convolution layers to obtain temporal information, which means the video must be cut into short clips before processing, each clip being a set of closely sampled frames. To cover the whole video, many short clips have to be sampled, which increases the computational load and makes deployment in real applications impractical. We greatly reduce the computation by reducing the number of sampled frames. Our approach uses a temporal transformer that applies global attention over the video frames and can thus better exploit the salient information in each frame. As a result, our method is much more input efficient and can reach state-of-the-art performance.
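
A hedged sketch of the overall recipe, with an untrained ResNet-18 standing in for the per-frame 2D backbone and a generic transformer encoder providing global attention over the few sampled frames; the sizes and the 400-way head are assumptions matching Kinetics-400, not the paper's exact configuration.

import torch
import torch.nn as nn
from torchvision.models import resnet18

B, T = 2, 16                                    # batch of videos, 16 sampled frames each
frames = torch.randn(B, T, 3, 224, 224)

# 2D backbone applied per frame (fc head removed), giving one 512-d feature per frame.
backbone = nn.Sequential(*list(resnet18().children())[:-1])
feats = backbone(frames.flatten(0, 1)).flatten(1).view(B, T, 512)   # (B, T, 512)

# Temporal transformer: every frame attends to every other frame.
temporal_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)
video_tokens = temporal_encoder(feats)

logits = nn.Linear(512, 400)(video_tokens.mean(dim=1))               # e.g., Kinetics-400 classes
print(logits.shape)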