Tag Archive: Image Synthesis

InfinityGAN: Towards Infinite-Resolution Image Synthesis

We present InfinityGAN, a method to generate arbitrary-resolution images. The problem is associated with several key challenges. First, scaling existing models to a high resolution is resource-constrained, both in terms of computation and availability of high-resolution training data. InfinityGAN trains and infers patch-by-patch seamlessly with low computational resources. Second, large images should be locally and globally consistent, avoid repetitive patterns, and look realistic. To address these, InfinityGAN takes global appearance, local structure and texture into account. With this formulation, we can generate images with resolution and level of detail not attainable before. Experimental evaluation supports that InfinityGAN generates images with superior global structure compared to baselines while featuring parallelizable inference. Finally, we show several applications unlocked by our approach, such as fusing styles spatially, multi-modal outpainting and image inbetweening at arbitrary input and output resolutions.

https://arxiv.org/abs/2104.03963

Arbitrary-resolution image synthesis faces several challenges: (1) generating high-resolution images demands substantial computational resources; (2) the different parts of a large image should stay consistent with each other, avoid repetitive patterns, and look realistic. To address these problems, this paper presents InfinityGAN, a method that can generate images at arbitrary resolution. The method takes global appearance, local structure and texture into account, and can therefore generate high-resolution images that previous methods could not.
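As a rough illustration of the patch-by-patch inference described above, the sketch below fixes one global appearance code for the whole image and synthesizes each patch from its coordinates plus a local latent before placing it on an arbitrarily large canvas. The `generate_patch` function, latent sizes and patch size are hypothetical placeholders, not the authors' released interface.

```python
import torch

def generate_patch(global_z, local_z, coords):
    # Hypothetical stand-in for InfinityGAN's patch generator: any network
    # conditioned on (global appearance, local latent, coordinates) that
    # returns a fixed-size RGB patch would fit here.
    return torch.rand(3, 64, 64)

def synthesize(canvas_h, canvas_w, patch=64, seed=0):
    torch.manual_seed(seed)
    global_z = torch.randn(256)            # one global appearance code for the whole image
    canvas = torch.zeros(3, canvas_h, canvas_w)
    for y in range(0, canvas_h, patch):
        for x in range(0, canvas_w, patch):
            local_z = torch.randn(256)     # independent local structure/texture code
            coords = torch.tensor([y / canvas_h, x / canvas_w])
            canvas[:, y:y + patch, x:x + patch] = generate_patch(global_z, local_z, coords)
    return canvas

image = synthesize(1024, 4096)             # output size is limited only by the canvas
```

Because each patch depends only on its own latents and coordinates, the loop can be parallelized or run at different output sizes without retraining.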

Few-shot Semantic Image Synthesis Using StyleGAN Prior

This paper tackles a challenging problem of generating photorealistic images from semantic layouts in few-shot scenarios where annotated training pairs are hardly available but pixel-wise annotation is quite costly. We present a training strategy that performs pseudo labeling of semantic masks using the StyleGAN prior. Our key idea is to construct a simple mapping between the StyleGAN feature and each semantic class from a few examples of semantic masks. With such mappings, we can generate an unlimited number of pseudo semantic masks from random noise to train an encoder for controlling a pre-trained StyleGAN generator. Although the pseudo semantic masks might be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images from not only dense semantic masks but also sparse inputs such as landmarks and scribbles. Qualitative and quantitative results with various datasets demonstrate improvement over previous approaches with respect to layout fidelity and visual quality in as few as one- or five-shot settings.

https://arxiv.org/abs/2103.14877

This paper focuses on generating high-quality images from semantic layouts in few-shot scenarios, where pixel-wise annotations are costly and hard to obtain. We propose a training strategy that uses a StyleGAN prior to produce pseudo labels. The key idea is to build, from only a few annotated masks, a mapping from StyleGAN features to each semantic class. With this mapping we can generate an unlimited number of pseudo semantic masks from random noise, and use them to train an encoder that controls a pre-trained StyleGAN generator. Although such pseudo masks may be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images not only from dense semantic masks but also from sparse inputs such as landmarks and scribbles. Experiments demonstrate improvements over previous approaches in one-shot and five-shot settings.
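A minimal sketch of the pseudo-labeling idea, assuming a hypothetical `stylegan_features` hook that returns upsampled per-pixel StyleGAN features: a simple per-pixel classifier is fitted on a handful of annotated masks and then used to label arbitrarily many generated samples. The feature resolution, channel count and class count below are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FEAT_H = FEAT_W = 64   # hypothetical resolution of upsampled StyleGAN feature maps
FEAT_C = 64            # hypothetical channel count (real StyleGAN features are wider)

def stylegan_features(w):
    # Hypothetical stand-in: per-pixel features obtained by hooking and
    # upsampling intermediate StyleGAN layers for the latent code w.
    rng = np.random.default_rng(abs(hash(w.tobytes())) % (2 ** 32))
    return rng.random((FEAT_H, FEAT_W, FEAT_C))

# 1. Fit a simple per-pixel classifier from a handful of annotated masks.
few_latents = [np.random.randn(512) for _ in range(5)]
few_masks = [np.random.randint(0, 8, (FEAT_H, FEAT_W)) for _ in range(5)]   # toy masks
X = np.concatenate([stylegan_features(w).reshape(-1, FEAT_C) for w in few_latents])
y = np.concatenate([m.reshape(-1) for m in few_masks])
clf = LogisticRegression(max_iter=200).fit(X, y)

# 2. Label an unlimited stream of generated samples to supervise the encoder.
def pseudo_mask(w):
    feats = stylegan_features(w).reshape(-1, FEAT_C)
    return clf.predict(feats).reshape(FEAT_H, FEAT_W)

mask = pseudo_mask(np.random.randn(512))
```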

Paint by Word

We investigate the problem of zero-shot semantic image painting. Instead of painting modifications into an image using only concrete colors or a finite set of semantic concepts, we ask how to create semantic paint based on open full-text descriptions: our goal is to be able to point to a location in a synthesized image and apply an arbitrary new concept such as “rustic” or “opulent” or “happy dog.” To do this, our method combines a state-of-the-art generative model of realistic images with a state-of-the-art text-image semantic similarity network. We find that, to make large changes, it is important to use non-gradient methods to explore latent space, and it is important to relax the computations of the GAN to target changes to a specific region. We conduct user studies to compare our methods to several baselines.

https://arxiv.org/abs/2103.10951

In this paper we study zero-shot semantic image painting. Instead of painting modifications into an image with concrete colors or a finite set of semantic concepts, we ask how to paint semantics from open full-text descriptions: the goal is to point at a region of a synthesized image and apply an arbitrary new concept there, such as "rustic", "opulent" or "happy dog". To do this, our method combines a state-of-the-art image generation model with a state-of-the-art text-image semantic similarity network. We find that, to make large changes, it is important to explore the latent space with non-gradient methods, and to relax the GAN's computation so that changes can be targeted at a specific region. We compare our method against several baselines through user studies.
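The region-targeted, non-gradient latent exploration can be pictured with the toy loop below: candidate latents are scored by a text-image similarity restricted to the user-selected mask, and only improvements are kept. The `generator` and `similarity_fn` arguments are placeholders for any image generator and any CLIP-like similarity network; this is not the paper's actual optimizer.

```python
import torch

def region_score(image, mask, text, similarity_fn):
    # Score only the user-selected region against the text prompt.
    return similarity_fn(image * mask, text)

def random_search(generator, similarity_fn, text, mask, z_dim=512, steps=200, sigma=0.5):
    # Derivative-free exploration of latent space: keep perturbations that
    # improve the masked similarity. Only a toy version of the idea that
    # non-gradient search helps make large semantic changes.
    z = torch.randn(z_dim)
    best = region_score(generator(z), mask, text, similarity_fn)
    for _ in range(steps):
        cand = z + sigma * torch.randn(z_dim)
        score = region_score(generator(cand), mask, text, similarity_fn)
        if score > best:
            z, best = cand, score
    return z

# Toy usage with dummy stand-ins (replace with a real generator and similarity model):
toy_gen = lambda z: torch.tanh(z[:3 * 8 * 8].reshape(3, 8, 8))
toy_sim = lambda img, txt: -((img - 0.5) ** 2).mean()   # pretends "txt" wants mid-gray pixels
mask = torch.zeros(3, 8, 8); mask[:, :4, :4] = 1.0
z_star = random_search(toy_gen, toy_sim, "rustic", mask)
```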

HumanGAN: A Generative Model of Human Images

Generative adversarial networks achieve great performance in photorealistic image synthesis in various domains, including human images. However, they usually employ latent vectors that encode the sampled outputs globally. This does not allow convenient control of semantically-relevant individual parts of the image, and is not able to draw samples that only differ in partial aspects, such as clothing style. We address these limitations and present a generative model for images of dressed humans offering control over pose, local body part appearance and garment style. This is the first method to solve various aspects of human image generation such as global appearance sampling, pose transfer, parts and garment transfer, and parts sampling jointly in a unified framework. As our model encodes part-based latent appearance vectors in a normalized pose-independent space and warps them to different poses, it preserves body and clothing appearance under varying posture. Experiments show that our flexible and general generative method outperforms task-specific baselines for pose-conditioned image generation, pose transfer and part sampling in terms of realism and output resolution.

https://arxiv.org/abs/2103.06902

Generative adversarial networks have been extended to many image synthesis applications with impressive results. However, they usually employ latent vectors that encode the sampled output globally, which makes it inconvenient to edit semantically relevant individual parts of the image and impossible to vary only partial aspects such as clothing style. We address these limitations with a new generative model for images of dressed humans that offers control over pose, local body part appearance and garment style. It is the first method to solve multiple aspects of human image generation, namely global appearance sampling, pose transfer, parts and garment transfer, and parts sampling, jointly in a unified framework. Our model encodes part-based latent appearance vectors in a normalized, pose-independent space and warps them to different poses, so body and clothing appearance are preserved under varying posture. Experiments show that our model achieves superior realism and output resolution on pose-conditioned image generation, pose transfer and part sampling.
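The pose-independent appearance encoding can be illustrated with a small warping sketch: a per-part feature map in the normalized space is resampled into a target pose with a sampling grid, so the appearance code itself never changes. The feature sizes and warp fields below are invented for illustration; the paper's actual warping module is more involved.

```python
import torch
import torch.nn.functional as F

def warp_part_features(part_feat, grid):
    # part_feat: (B, C, H, W) appearance features of one body part in the
    # normalized, pose-independent space; grid: (B, H, W, 2) sampling grid in
    # [-1, 1] derived from the target pose. grid_sample moves the appearance
    # into the target pose without altering what it looks like.
    return F.grid_sample(part_feat, grid, align_corners=False)

# Toy usage: one appearance code rendered under two hypothetical poses.
appearance = torch.randn(1, 16, 32, 32)
identity_grid = F.affine_grid(torch.eye(2, 3).unsqueeze(0), (1, 16, 32, 32), align_corners=False)
shifted_grid = identity_grid + 0.1            # stand-in for a pose-dependent warp field
same_pose = warp_part_features(appearance, identity_grid)
new_pose = warp_part_features(appearance, shifted_grid)
```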

K-Hairstyle: A Large-scale Korean hairstyle dataset for virtual hair editing and hairstyle classification

The hair and beauty industry is one of the fastest growing industries. This has led to the development of various applications, such as virtual hair dyeing or hairstyle translation, to satisfy the needs of customers. Although there are several public hair datasets available for these applications, they consist of a limited number of images with low resolution, which restricts their performance on high-quality hair editing. Therefore, we introduce K-hairstyle, a novel large-scale Korean hairstyle dataset with 256,679 high-resolution images. In addition, K-hairstyle contains various hair attributes annotated by Korean expert hair stylists as well as hair segmentation masks. We validate the effectiveness of our dataset on several applications, such as hairstyle translation, hair classification and hair retrieval. Furthermore, we will release K-hairstyle soon.

https://arxiv.org/abs/2102.06288

The hair and beauty industry is one of the fastest growing industries in recent years, and its growth has driven applications such as virtual hair dyeing and hairstyle translation. Although several public hairstyle datasets already exist, they suffer from small size and low resolution, which limits progress in hair editing. We therefore introduce K-hairstyle, a large-scale Korean hairstyle dataset with 256,679 high-resolution images. The dataset additionally contains various hair attributes labeled by Korean hair stylists as well as segmentation masks. We validate the dataset on applications such as hairstyle translation, hairstyle classification and hairstyle retrieval.

Crop mapping from image time series: deep learning with multi-scale label hierarchies

The aim of this paper is to map agricultural crops by classifying satellite image time series. Domain experts in agriculture work with crop type labels that are organised in a hierarchical tree structure, where coarse classes (like orchards) are subdivided into finer ones (like apples, pears, vines, etc.). We develop a crop classification method that exploits this expert knowledge and significantly improves the mapping of rare crop types. The three-level label hierarchy is encoded in a convolutional, recurrent neural network (convRNN), such that for each pixel the model predicts three labels at different levels of granularity. This end-to-end trainable, hierarchical network architecture allows the model to learn joint feature representations of rare classes (e.g., apples, pears) at a coarser level (e.g., orchard), thereby boosting classification performance at the fine-grained level. Additionally, labelling at different granularities also makes it possible to adjust the output according to the classification scores, as coarser labels with high confidence are sometimes more useful for agricultural practice than fine-grained but very uncertain labels. We validate the proposed method on a new, large dataset that we make public. ZueriCrop covers an area of 50 km x 48 km in the Swiss cantons of Zurich and Thurgau with a total of 116,000 individual fields spanning 48 crop classes, and 28,000 (multi-temporal) image patches from Sentinel-2. We compare our proposed hierarchical convRNN model with several baselines, including methods designed for imbalanced class distributions. The hierarchical approach outperforms them by at least 9.9 percentage points in F1-score.

https://arxiv.org/abs/2102.08820

The goal of this paper is to map agricultural crops by classifying satellite image time series. Domain experts in agriculture organise crop type labels in a hierarchical tree, where coarse classes (such as orchards) are subdivided into finer ones (such as apples, pears or vines). We build a crop classification model that exploits this expert knowledge and significantly improves the mapping of rare crop types. The three-level label hierarchy is encoded in a convolutional recurrent neural network (convRNN), so that for each pixel the model predicts three labels at different levels of granularity. This end-to-end trainable hierarchical architecture lets the model learn joint feature representations of rare fine-grained classes (e.g., apples, pears) at a coarser level (e.g., orchard), which boosts classification performance at the fine-grained level. Labelling at different granularities also makes it possible to adjust the output according to the classification scores: a coarse label with high confidence is often more useful in agricultural practice than a fine-grained but very uncertain one. We validate the method on ZueriCrop, a new large public dataset covering a 50 km × 48 km area in the Swiss cantons of Zurich and Thurgau, with 116,000 individual fields annotated with 48 crop classes and 28,000 multi-temporal image patches from Sentinel-2. We compare the proposed hierarchical convRNN with several baselines, including methods designed for imbalanced class distributions, and outperform them by at least 9.9 percentage points in F1-score.
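A minimal sketch of the three-level supervision, with a placeholder feature extractor and made-up class counts (only the 48 fine classes match the dataset): three linear heads share one per-pixel feature and the losses at all granularities are summed, so rare fine classes still receive gradient through their coarse ancestors.

```python
import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    # Three classifiers over a shared per-pixel feature, one per level of the
    # label tree (coarse -> medium -> fine). Class counts here are placeholders.
    def __init__(self, feat_dim=128, n_coarse=6, n_medium=20, n_fine=48):
        super().__init__()
        self.coarse = nn.Linear(feat_dim, n_coarse)
        self.medium = nn.Linear(feat_dim, n_medium)
        self.fine = nn.Linear(feat_dim, n_fine)

    def forward(self, feats):                    # feats: (N, feat_dim) per-pixel features
        return self.coarse(feats), self.medium(feats), self.fine(feats)

def hierarchical_loss(logits, targets):
    # Joint supervision at every granularity.
    ce = nn.CrossEntropyLoss()
    return sum(ce(l, t) for l, t in zip(logits, targets))

# Toy usage: in the paper the features would come from the convRNN over the time series.
head = HierarchicalHead()
feats = torch.randn(32, 128)
targets = (torch.randint(0, 6, (32,)), torch.randint(0, 20, (32,)), torch.randint(0, 48, (32,)))
loss = hierarchical_loss(head(feats), targets)
```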

TransGAN: Two Transformers Can Make One Strong GAN

The recent explosive interest in transformers has suggested their potential to become powerful “universal” models for computer vision tasks, such as classification, detection, and segmentation. However, how much further can transformers go – are they ready to take on some more notoriously difficult vision tasks, e.g., generative adversarial networks (GANs)? Driven by that curiosity, we conduct the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures. Our vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator that progressively increases feature resolution while decreasing embedding dimension, and a patch-level discriminator that is also transformer-based. We then demonstrate that TransGAN notably benefits from data augmentations (more than standard GANs), a multi-task co-training strategy for the generator, and a locally initialized self-attention that emphasizes the neighborhood smoothness of natural images. Equipped with those findings, TransGAN can effectively scale up with bigger models and high-resolution image datasets. Specifically, our best architecture achieves highly competitive performance compared to current state-of-the-art GANs based on convolutional backbones: TransGAN sets a new state-of-the-art IS score of 10.10 and FID score of 25.32 on STL-10, and also reaches a competitive IS score of 8.64 and FID score of 11.89 on CIFAR-10, and an FID score of 12.23 on CelebA 64×64. We conclude with a discussion of the current limitations and future potential of TransGAN.

https://arxiv.org/abs/2102.07074v2

The recent explosion of interest in transformers suggests their potential to become "universal" models for computer vision tasks such as classification, detection and segmentation. But how much further can transformers go: are they ready for notoriously difficult vision tasks such as GANs? Driven by this curiosity, we conduct the first pilot study on building a GAN that is completely free of convolutions and consists purely of transformers. Our architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator that progressively increases feature resolution while decreasing the embedding dimension, and a patch-level discriminator that is also transformer-based. We then show that TransGAN benefits notably from data augmentation (more than standard GANs), from a multi-task co-training strategy for the generator, and from a locally initialized self-attention that emphasizes the neighborhood smoothness of natural images. With these findings, TransGAN scales up to larger models and higher-resolution datasets, and experiments show that it achieves state-of-the-art performance.
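A toy sketch of the generator idea, not the released TransGAN configuration: transformer blocks operate on a token grid, and a pixel-shuffle-style upsampling doubles the spatial resolution while cutting the embedding dimension, which keeps memory manageable as the image grows. All widths and depths below are illustrative.

```python
import torch
import torch.nn as nn

class UpStage(nn.Module):
    # One generator stage: transformer blocks over the token grid, then a
    # pixel-shuffle upsample that doubles resolution and divides the
    # embedding dimension by 4.
    def __init__(self, dim, depth=2, heads=4):
        super().__init__()
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True), depth)

    def forward(self, tokens, side):              # tokens: (B, side*side, dim)
        x = self.blocks(tokens)
        b, n, d = x.shape
        grid = x.transpose(1, 2).reshape(b, d, side, side)
        grid = nn.functional.pixel_shuffle(grid, 2)   # (B, d//4, 2*side, 2*side)
        return grid.flatten(2).transpose(1, 2), side * 2

# Toy forward pass: start from an 8x8 token grid produced from the latent,
# grow to 32x32 while the embedding width shrinks 256 -> 64 -> 16.
b, dim, side = 2, 256, 8
tokens = torch.randn(b, side * side, dim)
for _ in range(2):
    stage = UpStage(dim)
    tokens, side = stage(tokens, side)
    dim //= 4
to_rgb = nn.Linear(dim, 3)                         # map final tokens to RGB pixels
image = to_rgb(tokens).transpose(1, 2).reshape(b, 3, side, side)
```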

SWAGAN: A Style-based Wavelet-driven Generative Model

In recent years, considerable progress has been made in the visual quality of Generative Adversarial Networks (GANs). Even so, these networks still suffer from degradation in quality for high-frequency content, stemming from a spectrally biased architecture, and similarly unfavorable loss functions. To address this issue, we present a novel general-purpose Style and WAvelet based GAN (SWAGAN) that implements progressive generation in the frequency domain. SWAGAN incorporates wavelets throughout its generator and discriminator architectures, enforcing a frequency-aware latent representation at every step of the way. This approach yields enhancements in the visual quality of the generated images, and considerably increases computational performance. We demonstrate the advantage of our method by integrating it into the StyleGAN2 framework, and verifying that content generation in the wavelet domain leads to higher quality images with more realistic high-frequency content. Furthermore, we verify that our model’s latent space retains the qualities that allow StyleGAN to serve as a basis for a multitude of editing tasks, and show that our frequency-aware approach also induces improved downstream visual quality.

https://arxiv.org/abs/2102.06108

The visual quality of images generated by GANs has improved markedly in recent years. Even so, GANs still struggle with high-frequency content, which stems from spectrally biased architectures and unfavourable loss functions. To address this, we propose SWAGAN, a general-purpose style- and wavelet-based GAN that performs progressive generation in the frequency domain. SWAGAN incorporates wavelets throughout its generator and discriminator, enforcing a frequency-aware latent representation at every step. This architecture improves the visual quality of the generated images and considerably reduces the computational cost. We demonstrate the advantage of the method by integrating it into the StyleGAN2 framework, and verify that content generation in the wavelet domain yields higher-quality images with more realistic high-frequency content. We also verify that our model's latent space retains the properties that let StyleGAN serve as the basis for many editing tasks, showing that our frequency-aware approach improves downstream visual quality as well.
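A minimal (unnormalized) Haar analysis/synthesis pair illustrates what generating in the wavelet domain means: the generator predicts the four sub-bands and the image is recovered with the exact inverse transform. This is only a sketch of the wavelet machinery, not SWAGAN's generator.

```python
import torch

def haar_forward(x):
    # One level of a 2D Haar transform: split an image (B, C, H, W) into a
    # low-frequency band (LL) and three high-frequency bands (LH, HL, HH).
    a, b = x[..., ::2, :], x[..., 1::2, :]
    lo, hi = (a + b) / 2, (a - b) / 2
    a, b = lo[..., ::2], lo[..., 1::2]
    c, d = hi[..., ::2], hi[..., 1::2]
    ll, lh = (a + b) / 2, (a - b) / 2
    hl, hh = (c + d) / 2, (c - d) / 2
    return ll, lh, hl, hh

def haar_inverse(ll, lh, hl, hh):
    # Exact inverse of haar_forward, reconstructing the full-resolution image.
    lo = torch.zeros(*ll.shape[:-1], ll.shape[-1] * 2)
    hi = torch.zeros_like(lo)
    lo[..., ::2], lo[..., 1::2] = ll + lh, ll - lh
    hi[..., ::2], hi[..., 1::2] = hl + hh, hl - hh
    x = torch.zeros(*lo.shape[:-2], lo.shape[-2] * 2, lo.shape[-1])
    x[..., ::2, :], x[..., 1::2, :] = lo + hi, lo - hi
    return x

img = torch.randn(1, 3, 64, 64)
bands = haar_forward(img)                      # each band is (1, 3, 32, 32)
assert torch.allclose(haar_inverse(*bands), img, atol=1e-6)
```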

Training Generative Adversarial Networks with Limited Data

Training generative adversarial networks (GAN) using too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited data regimes. The approach does not require changes to loss functions or network architectures, and is applicable both when training from scratch and when fine-tuning an existing GAN on another dataset. We demonstrate, on several datasets, that good results are now possible using only a few thousand training images, often matching StyleGAN2 results with an order of magnitude fewer images. We expect this to open up new application domains for GANs. We also find that the widely used CIFAR-10 is, in fact, a limited data benchmark, and improve the record FID from 5.59 to 2.42.

https://arxiv.org/abs/2006.06676

Training a GAN with too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited-data regimes. The approach requires no changes to loss functions or network architectures and applies both when training from scratch and when fine-tuning an existing GAN on another dataset. Experiments on several datasets show that our method can match StyleGAN2 results with an order of magnitude fewer images, often using only a few thousand training images. We expect this to open up new application domains for GANs. We also show that on the widely used, and in fact limited-data, CIFAR-10 benchmark we improve the record FID from 5.59 to 2.42.
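The adaptive part can be sketched with the paper's overfitting heuristic r_t = E[sign(D(real))]: when the discriminator becomes too confident on real images, the augmentation probability p is raised, otherwise lowered. The step size and target below are placeholders, not the official schedule.

```python
import torch

def update_augment_p(p, d_real_logits, target_rt=0.6, adjust=5e-4, batch_size=64):
    # r_t = E[sign(D(real))] measures how confidently the discriminator
    # separates real images; overfitting pushes r_t toward 1. Raise the
    # augmentation probability p when r_t exceeds the target, lower it otherwise.
    rt = torch.sign(d_real_logits).mean().item()
    p += adjust * batch_size * (1 if rt > target_rt else -1)
    return min(max(p, 0.0), 1.0)

# Toy usage inside a training loop:
p = 0.0
for _ in range(100):
    d_real_logits = torch.randn(64) + 1.0      # stand-in for D outputs on real images
    p = update_augment_p(p, d_real_logits)
```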

This Face Does Not Exist … But It Might Be Yours! Identity Leakage in Generative Models

Generative adversarial networks (GANs) are able to generate high resolution photo-realistic images of objects that “do not exist.” These synthetic images are rather difficult to detect as fake. However, the manner in which these generative models are trained hints at a potential for information leakage from the supplied training data, especially in the context of synthetic faces. This paper presents experiments suggesting that identity information in face images can flow from the training corpus into synthetic samples without any adversarial actions when building or using the existing model. This raises privacy-related questions, but also stimulates discussions of (a) the face manifold’s characteristics in the feature space and (b) how to create generative models that do not inadvertently reveal identity information of real subjects whose images were used for training. We used five different face matchers (face_recognition, FaceNet, ArcFace, SphereFace and Neurotechnology MegaMatcher) and the StyleGAN2 synthesis model, and show that this identity leakage does exist for some, but not all methods. So, can we say that these synthetically generated faces truly do not exist? Databases of real and synthetically generated faces are made available with this paper to allow full replicability of the results discussed in this work.

https://arxiv.org/abs/2101.05084

GANs are widely believed to generate high-resolution faces of people who "do not exist". However, the way these models are trained hints at potential information leakage from the training data, especially in the context of synthetic faces. This paper shows that identity information in face images can flow from the training corpus into synthetic samples without any adversarial action when building or using the model. This raises privacy concerns, and also prompts two questions: (a) what are the characteristics of the face manifold in the feature space, and (b) how can we build generative models that do not inadvertently reveal the identities of the real subjects whose images were used for training? Using five face matchers (face_recognition, FaceNet, ArcFace, SphereFace and Neurotechnology MegaMatcher) and the StyleGAN2 synthesis model, we show that this identity leakage exists for some, but not all, of the methods. So can we really say that these synthetically generated faces do not exist? The databases of real and synthetic faces used in this work are released to allow full replicability of the results.
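A sketch of how such leakage can be measured, assuming hypothetical precomputed embeddings from any of the face matchers above: each synthetic face is compared to its nearest training face, and similarities above the matcher's verification threshold count as a potential identity match. The threshold value here is a placeholder; each matcher defines its own.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def leaked_identities(synthetic_embs, training_embs, threshold=0.4):
    # For every synthetic face, find its closest training face in the
    # matcher's embedding space; a similarity above the verification
    # threshold suggests the "non-existent" face matches a real identity.
    hits = []
    for i, s in enumerate(synthetic_embs):
        sims = [cosine(s, t) for t in training_embs]
        j = int(np.argmax(sims))
        if sims[j] > threshold:
            hits.append((i, j, sims[j]))
    return hits

# Toy usage with random vectors standing in for face-matcher embeddings
# (e.g. from FaceNet or ArcFace):
synthetic = np.random.randn(10, 512)
training = np.random.randn(100, 512)
print(leaked_identities(synthetic, training))
```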