Tag Archives: Conditional GAN

Few-shot Semantic Image Synthesis Using StyleGAN Prior

This paper tackles the challenging problem of generating photorealistic images from semantic layouts in few-shot scenarios, where annotated training pairs are hardly available because pixel-wise annotation is quite costly. We present a training strategy that performs pseudo labeling of semantic masks using the StyleGAN prior. Our key idea is to construct a simple mapping between the StyleGAN feature and each semantic class from a few examples of semantic masks. With such mappings, we can generate an unlimited number of pseudo semantic masks from random noise to train an encoder for controlling a pre-trained StyleGAN generator. Although the pseudo semantic masks might be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images from not only dense semantic masks but also sparse inputs such as landmarks and scribbles. Qualitative and quantitative results with various datasets demonstrate improvement over previous approaches with respect to layout fidelity and visual quality in as few as one- or five-shot settings.
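The "simple mapping" between StyleGAN features and semantic classes can be illustrated with nearest-prototype matching. The sketch below is a minimal numpy illustration of that idea, not the paper's exact construction: the prototype averaging, the feature shapes, and the squared-distance metric are all assumptions.

```python
import numpy as np

def class_prototypes(feats, masks, n_classes):
    """Average the feature vectors under each annotated class to obtain
    one prototype per semantic class, from only a few labelled masks.
    feats: (H, W, C) feature map; masks: (H, W) integer class labels."""
    C = feats.shape[-1]
    protos = np.zeros((n_classes, C))
    for k in range(n_classes):
        protos[k] = feats[masks == k].mean(axis=0)
    return protos

def pseudo_mask(feats, protos):
    """Label every spatial position with its nearest class prototype,
    yielding a pseudo semantic mask for an unlabelled generated image."""
    d = ((feats[..., None, :] - protos) ** 2).sum(axis=-1)  # (H, W, K)
    return d.argmin(axis=-1)                                # (H, W)
```

Because the generator can produce unlimited images (and features) from random noise, this cheap labelling step yields unlimited pseudo training pairs for the encoder.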



ArrowGAN : Learning to Generate Videos by Learning Arrow of Time

Training GANs on videos is even more challenging than on images because videos have a distinguished dimension: time. While recent methods have designed dedicated architectures that consider time, generated videos are still far from indistinguishable from real videos. In this paper, we introduce the ArrowGAN framework, in which the discriminator learns to classify the arrow of time as an auxiliary task and the generator tries to synthesize forward-running videos. We argue that the auxiliary task should be carefully chosen with respect to the target domain. In addition, we explore categorical ArrowGAN, built upon the ArrowGAN framework with recent techniques in conditional image generation, achieving state-of-the-art performance on categorical video generation. Our extensive experiments validate the effectiveness of the arrow of time as a self-supervisory task, and demonstrate that all components of categorical ArrowGAN lead to improvements in video inception score and Fréchet video distance on three datasets: Weizmann, UCFsports, and UCF-101.


Training GANs on videos is more complex than training on images, because videos have an extra axis: time. Although recent dedicated methods take time into account, the generated videos are still far from perfect. In this paper, we introduce a video generation framework called ArrowGAN, in which the discriminator classifies the arrow of time as an auxiliary task while the generator synthesizes forward-running videos. We argue that the auxiliary task should be chosen according to the target domain. In addition, we study categorical ArrowGAN, which applies recent conditional image generation techniques to the video generation task. Extensive experiments validate the effectiveness of the arrow of time as a self-supervisory task and show that every component of categorical ArrowGAN contributes to performance, as verified on the Weizmann, UCFsports, and UCF-101 datasets.
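The arrow-of-time auxiliary task amounts to showing the discriminator both a clip and its time-reversed copy and asking an auxiliary head to tell them apart. A minimal numpy sketch follows; `aux_head` is a hypothetical classifier head returning a "forward" probability, and the 0.5 weighting is an illustrative choice, not the paper's exact loss.

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy between predicted probability p and label y."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def arrow_loss(aux_head, clip):
    """Auxiliary arrow-of-time loss: the head should predict 1 for the
    forward clip and 0 for its time-reversed copy. clip: (T, H, W, C)."""
    forward = clip
    backward = clip[::-1]          # reverse along the time axis
    return 0.5 * (bce(aux_head(forward), 1.0) + bce(aux_head(backward), 0.0))
```

Since the reversed clip comes for free from the data itself, the task is self-supervised: no extra annotation is needed.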

Taming Transformers for High-Resolution Image Synthesis

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers.


Transformers were designed to learn long-range interactions on sequential data and achieve SOTA performance on many tasks. CNNs, by contrast, carry an inductive bias that prioritizes local interactions; transformers lack this bias, which makes them highly expressive but computationally infeasible for long sequences such as high-resolution images. In this paper we show how to combine the strengths of CNNs and transformers to improve high-resolution image synthesis: (i) use CNNs to learn a context-rich vocabulary of image constituents; (ii) then use transformers to efficiently model how these constituents compose into high-resolution images. Our method can be applied to conditional image synthesis tasks, where both non-spatial information (e.g., object class labels) and spatial information (e.g., segmentation maps) control the generated image. In particular, this is the first work on semantically guided synthesis of megapixel images with transformers.
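Step (i), the "vocabulary of image constituents", can be pictured as vector quantization: each CNN feature vector is snapped to its nearest codebook entry, and the resulting index grid is the token sequence the transformer models. The numpy sketch below is an illustrative simplification; the codebook size and distance metric are assumptions, and the real method learns the codebook end to end.

```python
import numpy as np

def quantize(z, codebook):
    """Map each CNN feature vector in z (H, W, C) to the index of its
    nearest codebook entry (K, C). The flattened index grid is the
    discrete token sequence a transformer can model autoregressively."""
    d = ((z[..., None, :] - codebook) ** 2).sum(axis=-1)  # (H, W, K)
    idx = d.argmin(axis=-1)                               # (H, W)
    return idx, codebook[idx]                             # tokens, quantized z
```

Because a 256x256 image becomes, say, a 16x16 grid of tokens rather than 65,536 pixels, the transformer's quadratic attention cost stays tractable.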

Semantic Image Synthesis via Efficient Class-Adaptive Normalization

Spatially-adaptive normalization (SPADE) has recently been remarkably successful in conditional semantic image synthesis: it modulates the normalized activations with spatially-varying transformations learned from semantic layouts, to prevent the semantic information from being washed away. Despite its impressive performance, a more thorough understanding of the advantages inside the box is still highly demanded to help reduce the significant computation and parameter overhead introduced by this novel structure. In this paper, from a return-on-investment point of view, we conduct an in-depth analysis of the effectiveness of this spatially-adaptive normalization and observe that its modulation parameters benefit more from semantic-awareness than from spatial-adaptiveness, especially for high-resolution input masks. Inspired by this observation, we propose class-adaptive normalization (CLADE), a lightweight but equally effective variant that is only adaptive to the semantic class. In order to further improve spatial-adaptiveness, we introduce an intra-class positional map encoding calculated from semantic layouts to modulate the normalization parameters of CLADE, and propose a truly spatially-adaptive variant of CLADE, namely CLADE-ICPE. Benefiting from this design, CLADE greatly reduces the computation cost while preserving the semantic information in the generation. Through extensive experiments on multiple challenging datasets, we demonstrate that the proposed CLADE can be generalized to different SPADE-based methods while achieving generation quality comparable to SPADE, but it is much more efficient, with fewer extra parameters and lower computational cost.


SPADE has achieved remarkable results in conditional semantic image synthesis: it modulates the normalized activations with spatially-varying transformations learned from segmentation labels, avoiding the loss of semantic information during generation. Beyond its excellent performance, a deeper investigation of the model helps improve its computational efficiency. In this paper, we analyze the efficiency of spatially-adaptive normalization from a return-on-investment perspective and find that the model gains more from semantic awareness than from spatial adaptiveness, a gap that becomes more pronounced with high-resolution input conditions. Based on this finding, we propose CLADE, a lightweight but equally effective variant that is influenced only by the semantic class. To further improve spatial adaptiveness, we modulate CLADE's normalization parameters with an intra-class positional map computed from the semantic segmentation map. Experiments on different datasets show that CLADE achieves performance similar to SPADE while being far more efficient, with fewer parameters and lower computational cost.
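CLADE's core observation translates into very little code: instead of SPADE's per-pixel modulation maps produced by convolutions over the layout, each semantic class owns one learned scale and shift, looked up from the mask. Below is a simplified single-layer numpy sketch; the per-channel normalization and parameter shapes are illustrative assumptions.

```python
import numpy as np

def clade(x, mask, gamma, beta, eps=1e-5):
    """Class-adaptive normalization: normalize the activation x (H, W, C)
    per channel, then modulate it with per-class scale/shift parameters
    gamma, beta (K, C) looked up from the semantic mask (H, W).
    The lookup replaces SPADE's costly per-pixel conv-generated maps."""
    x_norm = (x - x.mean(axis=(0, 1))) / np.sqrt(x.var(axis=(0, 1)) + eps)
    return gamma[mask] * x_norm + beta[mask]  # (H, W, C)
```

The lookup costs O(1) per pixel regardless of resolution, which is why the efficiency gap over SPADE widens for high-resolution masks.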

MPG: A Multi-ingredient Pizza Image Generator with Conditional StyleGANs

Multilabel conditional image generation is a challenging problem in computer vision. In this work we propose the Multi-ingredient Pizza Generator (MPG), a conditional Generative Adversarial Network (GAN) framework for synthesizing multilabel images. We design MPG based on a state-of-the-art GAN structure called StyleGAN2, in which we develop a new conditioning technique by enforcing intermediate feature maps to learn scale-wise label information. Because of the complex nature of the multilabel image generation problem, we also regularize the synthesized images by predicting the corresponding ingredients, and encourage the discriminator to distinguish between matched and mismatched images. To verify the efficacy of MPG, we test it on Pizza10, a carefully annotated multi-ingredient pizza image dataset. MPG can successfully generate photo-realistic pizza images with the desired ingredients. The framework can be easily extended to other multilabel image generation scenarios.


Multilabel conditional image generation is a challenging task in computer vision. In this paper we introduce the Multi-ingredient Pizza Generator (MPG), a conditional GAN designed for multilabel image generation. MPG is built on the current state-of-the-art StyleGAN2 architecture, for which we design a new conditioning mechanism that forces the intermediate features to learn multi-scale label information. We also regularize the synthesized images by predicting their ingredients, and encourage the discriminator to distinguish matched from mismatched images.
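The matched/mismatched discriminator objective is a standard conditional-GAN trick that this paper builds on: pair real images with wrong labels and penalize the discriminator for scoring those pairs highly. The hinge form, the deterministic label roll, and the `d_score` interface below are illustrative assumptions, not MPG's exact loss.

```python
import numpy as np

def matching_loss(d_score, real_imgs, labels):
    """Matching-aware discriminator term: real images paired with their
    own multi-hot ingredient labels should score above +1, while the
    same images paired with another sample's labels should score below
    -1. d_score is a hypothetical scorer taking (images, labels)."""
    mismatched = np.roll(labels, 1, axis=0)   # deterministic label shuffle
    l_match = np.maximum(0.0, 1.0 - d_score(real_imgs, labels)).mean()
    l_mismatch = np.maximum(0.0, 1.0 + d_score(real_imgs, mismatched)).mean()
    return l_match + l_mismatch
```

This forces the discriminator to judge not only realism but also whether the image actually contains the conditioned ingredients.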

A Note on Data Biases in Generative Models

It is tempting to think that machines are less prone to unfairness and prejudice. However, machine learning approaches compute their outputs based on data. While biases can enter at any stage of the development pipeline, models are particularly receptive to mirror biases of the datasets they are trained on and therefore do not necessarily reflect truths about the world but, primarily, truths about the data. To raise awareness about the relationship between modern algorithms and the data that shape them, we use a conditional invertible neural network to disentangle the dataset-specific information from the information which is shared across different datasets. In this way, we can project the same image onto different datasets, thereby revealing their inherent biases. We use this methodology to (i) investigate the impact of dataset quality on the performance of generative models, (ii) show how societal biases of datasets are replicated by generative models, and (iii) present creative applications through unpaired transfer between diverse datasets such as photographs, oil portraits, and anime.



SHAD3S: A model to Sketch, Shade and Shadow

Hatching is a common method used by artists to accentuate the third dimension of a sketch, and to illuminate the scene. Our system SHAD3S attempts to compete with a human at hatching generic three-dimensional (3D) shapes, and also tries to assist her in a form exploration exercise. The novelty of our approach lies in the fact that we make no assumptions about the input other than that it represents a 3D shape, and yet, given contextual information about illumination and texture, we synthesise an accurate hatch pattern over the sketch, without access to 3D or pseudo 3D. In the process, we contribute: a) a cheap yet effective method to synthesise a sufficiently large high-fidelity dataset pertinent to the task; b) a pipeline built on a conditional generative adversarial network (CGAN); and c) an interactive utility in GIMP, a tool for artists to engage with automated hatching or a form-exploration exercise. User evaluation of the tool suggests that the model performance generalises satisfactorily over diverse input, both in terms of style and shape. A simple comparison of inception scores suggests that the generated distribution is as diverse as the ground truth.


Hatching is a technique artists use to emphasize the three-dimensional character of a sketch. Our SHAD3S system attempts to assist artists in hatching generic 3D objects and to support further exploration of form. The novelty of this work lies in using contextual information (illumination and texture) to generate accurate hatching patterns from a sketch without requiring any 3D input. Our contributions are: a) a cost-effective method for synthesizing a high-fidelity sketch dataset; b) a pipeline based on a conditional GAN (cGAN); and c) an interactive utility that integrates with GIMP.

Teaching a GAN What Not to Learn

Generative adversarial networks (GANs) were originally envisioned as unsupervised generative models that learn to follow a target distribution. Variants such as conditional GANs and auxiliary-classifier GANs (ACGANs) project GANs onto supervised and semi-supervised learning frameworks by providing labelled data and using multi-class discriminators. In this paper, we approach the supervised GAN problem from a different perspective, one that is motivated by the philosophy of the famous Persian poet Rumi who said, “The art of knowing is knowing what to ignore.” In the GAN framework, we not only provide the GAN positive data that it must learn to model, but also present it with so-called negative samples that it must learn to avoid – we call this “The Rumi Framework.” This formulation allows the discriminator to represent the underlying target distribution better by learning to penalize generated samples that are undesirable – we show that this capability accelerates the learning process of the generator. We present a reformulation of the standard GAN (SGAN) and least-squares GAN (LSGAN) within the Rumi setting. The advantage of the reformulation is demonstrated by means of experiments conducted on MNIST, Fashion MNIST, CelebA, and CIFAR-10 datasets. Finally, we consider an application of the proposed formulation to address the important problem of learning an under-represented class in an unbalanced dataset. The Rumi approach results in substantially lower FID scores than the standard GAN frameworks while possessing better generalization capability.
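The Rumi idea changes only the discriminator objective: alongside fakes, real-but-negative samples are also pushed toward the "reject" side. The numpy sketch below is a simplified unweighted version of a Rumi-SGAN discriminator loss; the paper's actual formulation may weight the negative and fake terms differently.

```python
import numpy as np

def rumi_d_loss(d_pos, d_neg, d_fake):
    """Rumi-SGAN style discriminator loss sketch: push D(x) toward 1 on
    positive data and toward 0 on BOTH negative real samples and
    generated samples. Inputs are arrays of D's output probabilities."""
    eps = 1e-7
    d_pos, d_neg, d_fake = (np.clip(v, eps, 1 - eps)
                            for v in (d_pos, d_neg, d_fake))
    return -(np.log(d_pos).mean()        # positives: must be modelled
             + np.log(1 - d_neg).mean()  # negatives: must be avoided
             + np.log(1 - d_fake).mean())
```

Treating negatives as a third category gives the generator an explicit gradient away from undesirable modes, which is what accelerates learning in the unbalanced-class setting.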



Data Augmentation Using Generative Adversarial Network

Effective training of neural networks requires much data. In the low-data regime, parameters are underdetermined, and learnt networks generalise poorly. Data augmentation alleviates this by using existing data more effectively. However, standard data augmentation produces only limited plausible alternative data. Given there is potential to generate a much broader set of augmentations, we design and train a generative model to do data augmentation. The model, based on image-conditional Generative Adversarial Networks, takes data from a source domain and learns to take any data item and generalise it to generate other within-class data items. As this generative process does not depend on the classes themselves, it can be applied to novel unseen classes of data. We show that a Data Augmentation Generative Adversarial Network (DAGAN) augments standard vanilla classifiers well. We also show a DAGAN can enhance few-shot learning systems such as Matching Networks. We demonstrate these approaches on Omniglot, on EMNIST (having learnt the DAGAN on Omniglot), and on VGG-Face data. In our experiments we see over a 13% increase in accuracy in the low-data regime experiments on Omniglot (from 69% to 82%), EMNIST (73.9% to 76%) and VGG-Face (4.5% to 12%); in Matching Networks for Omniglot we observe an increase of 0.5% (from 96.9% to 97.4%) and an increase of 1.8% on EMNIST (from 59.5% to 61.3%).
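Once a DAGAN is trained, using it is a simple loop: feed each real item plus fresh noise to the generator, and give every synthetic output the source item's class label. The sketch below assumes a hypothetical `generator(x, z)` interface and noise dimensionality; it is an illustration of the augmentation loop, not the paper's training code.

```python
import numpy as np

def augment_with_dagan(generator, x_train, y_train, per_sample, rng):
    """Low-data augmentation: for each real item, sample noise and let
    the (hypothetical) DAGAN generator produce extra within-class items
    that inherit the source item's label. Returns the enlarged set."""
    new_x, new_y = [], []
    for x, y in zip(x_train, y_train):
        for _ in range(per_sample):
            z = rng.normal(size=8)   # assumed noise dimensionality
            new_x.append(generator(x, z))
            new_y.append(y)          # class label is preserved
    return (np.concatenate([x_train, np.array(new_x)]),
            np.concatenate([y_train, np.array(new_y)]))
```

Because the generator never sees class identities, the same loop works unchanged on novel classes that were absent from the DAGAN's own training data.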



SRFlow: Learning the Super-Resolution Space with Normalizing Flow


Super-resolution is an ill-posed problem, since it allows for multiple predictions for a given low-resolution image. This fundamental fact is largely ignored by state-of-the-art deep learning based approaches. These methods instead train a deterministic mapping using combinations of reconstruction and adversarial losses. In this work, we therefore propose SRFlow: a normalizing flow based super-resolution method capable of learning the conditional distribution of the output given the low-resolution input. Our model is trained in a principled manner using a single loss, namely the negative log-likelihood. SRFlow therefore directly accounts for the ill-posed nature of the problem, and learns to predict diverse photo-realistic high-resolution images. Moreover, we utilize the strong image posterior learned by SRFlow to design flexible image manipulation techniques, capable of enhancing super-resolved images by, e.g., transferring content from other images. We perform extensive experiments on faces, as well as on super-resolution in general. SRFlow outperforms state-of-the-art GAN-based approaches in terms of both PSNR and perceptual quality metrics, while allowing for diversity through the exploration of the space of super-resolved solutions.


Conventional cGAN-based super-resolution models have two problems: (1) a single condition maps to a single output, so sample diversity is poor; (2) the input noise is usually ignored. To address these issues, the authors propose SRFlow, a model that can generate diverse outputs from a single input. The main idea is to map LR-HR input pairs into a Gaussian latent space and sample noise there; this lets the model fully exploit the variation carried by the noise and produce diverse outputs. Since the model is trained with a single loss function (the negative log-likelihood), training is stable and converges well.
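The single negative-log-likelihood loss comes from the change-of-variables formula behind all normalizing flows. The numpy sketch below shows it for one scalar affine layer with a standard-normal base distribution; SRFlow stacks many conditional invertible layers, but each contributes the same two terms (base log-density plus log-determinant) shown here.

```python
import numpy as np

def affine_flow_nll(x, scale, shift):
    """NLL of a one-layer affine flow z = scale * x + shift with a
    standard-normal base: log p(x) = log N(z; 0, 1) + log|scale|.
    Minimizing this pushes the flow to map the data to N(0, 1)."""
    z = scale * x + shift
    log_pz = -0.5 * (z ** 2 + np.log(2 * np.pi))  # base log-density
    log_det = np.log(np.abs(scale))               # change-of-variables term
    return -(log_pz + log_det).mean()
```

At test time the direction is reversed: sample z from the Gaussian and invert the flow, which is exactly how SRFlow draws diverse HR images for one LR input.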