Tag archive: Text-to-Image

Paint by Word

We investigate the problem of zero-shot semantic image painting. Instead of painting modifications into an image using only concrete colors or a finite set of semantic concepts, we ask how to create semantic paint based on open full-text descriptions: our goal is to be able to point to a location in a synthesized image and apply an arbitrary new concept such as “rustic” or “opulent” or “happy dog.” To do this, our method combines a state-of-the-art generative model of realistic images with a state-of-the-art text-image semantic similarity network. We find that, to make large changes, it is important to use non-gradient methods to explore latent space, and it is important to relax the computations of the GAN to target changes to a specific region. We conduct user studies to compare our methods to several baselines.

https://arxiv.org/abs/2103.10951

In this paper we study the problem of zero-shot semantic image painting. Rather than painting onto an image with only concrete colors or a limited set of semantic concepts, we ask how to paint with semantics given by free-form text: the goal is to point at a region of a synthesized image and paint an arbitrary new concept there, such as "rustic", "opulent", or a particular pattern. To achieve this, our method combines a state-of-the-art generative image model with a state-of-the-art text-image semantic similarity network. We find that making large changes requires exploring the latent space with non-gradient methods, and that it is important to relax the GAN's computation so that changes are targeted at a specific region. We conduct user studies comparing our method against several baselines.
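For intuition, here is a minimal sketch (PyTorch, not the authors' code) of the recipe the abstract describes: a zero-order (non-gradient) search over GAN latents, scored by a text-image similarity model, with the score restricted to a masked region. The generator G, the scorer score_fn, and all sizes are assumptions.

```python
import torch

def paint_by_word(G, score_fn, text, mask, z0, iters=200, sigma=0.5, pop=16):
    """Zero-order search over GAN latents, scored by a text-image similarity
    model, restricted to a masked region (hypothetical interfaces; see lead-in).

    G        : callable, latents (N, D) -> images (N, 3, H, W) in [0, 1]
    score_fn : callable, (images, text) -> similarity scores (N,)
    mask     : (1, 1, H, W) binary tensor, 1 inside the region to repaint
    z0       : (1, D) starting latent
    """
    base = G(z0)                                   # content to keep outside the mask
    z_best, s_best = z0.clone(), None
    for _ in range(iters):
        # propose a small population of perturbed latents; an evolutionary
        # search would be a drop-in replacement for this Gaussian proposal
        cand = z_best + sigma * torch.randn(pop, z0.shape[1])
        imgs = G(cand)
        # composite: keep the original content outside the target region
        blended = mask * imgs + (1 - mask) * base
        scores = score_fn(blended, text)
        i = scores.argmax()
        if s_best is None or scores[i] > s_best:
            s_best, z_best = scores[i].item(), cand[i:i + 1].clone()
    return mask * G(z_best) + (1 - mask) * base, z_best
```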

Zero-Shot Text-to-Image Generation

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.

https://arxiv.org/abs/2102.12092

Text-to-image generation has traditionally focused on finding better modeling assumptions for a fixed training dataset. These assumptions may involve complex network architectures, auxiliary losses, or side information that aids training, such as object part labels or segmentation masks. We present a simple transformer-based autoregressive text-to-image model that treats text and image tokens as a single stream of data. Trained with sufficient data and scale, our approach is competitive with existing domain-specific methods when evaluated in a zero-shot setting.
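A toy sketch of the single-stream idea: text tokens and image tokens (assumed here to come from some discrete image tokenizer, which the abstract does not spell out) are concatenated into one sequence and modeled with a causal transformer under a next-token loss. Vocabulary sizes, depth, and names are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleStreamAR(nn.Module):
    """Toy autoregressive model over a [text tokens | image tokens] stream."""
    def __init__(self, text_vocab=16384, image_vocab=8192, d=256, n_layers=4, n_heads=8):
        super().__init__()
        self.image_offset = text_vocab              # shift image ids into a shared vocabulary
        vocab = text_vocab + image_vocab
        self.tok = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(4096, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d, vocab)

    def forward(self, text_ids, image_ids):
        # one stream: text first, then image tokens from the discrete tokenizer
        x = torch.cat([text_ids, image_ids + self.image_offset], dim=1)
        inp, tgt = x[:, :-1], x[:, 1:]              # next-token prediction
        h = self.tok(inp) + self.pos(torch.arange(inp.shape[1], device=inp.device))
        causal = nn.Transformer.generate_square_subsequent_mask(inp.shape[1]).to(inp.device)
        h = self.blocks(h, mask=causal)
        logits = self.head(h)
        return F.cross_entropy(logits.reshape(-1, logits.shape[-1]), tgt.reshape(-1))
```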

Fine-grained Semantic Constraint in Image Synthesis

In this paper, we propose a multi-stage and high-resolution model for image synthesis that uses fine-grained attributes and masks as input. With a fine-grained attribute, the proposed model can constrain the features of the generated image in detail through the rich and fine-grained semantic information in the attribute. With the mask as a prior, the model is constrained so that the generated images conform to visual sense, which reduces the unexpected diversity of samples generated from the generative adversarial network. This paper also proposes a scheme to improve the discriminator of the generative adversarial network by simultaneously discriminating the total image and sub-regions of the image. In addition, we propose a method for optimizing the labeled attributes in datasets, which reduces manual labeling noise. Extensive quantitative results show that our image synthesis model generates more realistic images.

https://arxiv.org/abs/2101.04558

In this paper, we propose a multi-stage, high-resolution image synthesis model that takes fine-grained attribute labels and masks as input. Through the semantic information carried by the fine-grained attributes, the model can constrain the details of the generated image. With the mask as input, the model generates images that agree with visual intuition and produces fewer unexpected, anomalous samples. We also propose a scheme that improves the discriminator by feeding it both the whole image and its sub-regions. In addition, our method optimizes the attribute labels in the dataset to reduce the noise introduced by manual annotation. Experiments show that our model generates more realistic images.
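A hedged sketch of the "whole image plus sub-regions" discriminator idea: a shared convolutional backbone feeds one head that scores the full image and one head that scores each spatial region. Layer sizes and heads are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GlobalLocalDiscriminator(nn.Module):
    """Scores the whole image and its sub-regions from a shared backbone."""
    def __init__(self, ch=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, ch, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, 2 * ch, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(2 * ch, 4 * ch, 4, 2, 1), nn.LeakyReLU(0.2),
        )
        self.local_head = nn.Conv2d(4 * ch, 1, 1)    # one real/fake score per sub-region
        self.global_head = nn.Linear(4 * ch, 1)      # one score for the whole image

    def forward(self, img):
        f = self.backbone(img)                        # (B, 4ch, H/8, W/8)
        local_scores = self.local_head(f)             # patch-wise (sub-region) scores
        global_score = self.global_head(f.mean(dim=(2, 3)))
        return global_score, local_scores
```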

VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search

Text-to-image retrieval is an essential task in multi-modal information retrieval, i.e. retrieving relevant images from a large and unlabelled image dataset given textual queries. In this paper, we propose VisualSparta, a novel text-to-image retrieval model that shows substantial improvement over existing models on both accuracy and efficiency. We show that VisualSparta is capable of outperforming all previous scalable methods in MSCOCO and Flickr30K. It also shows substantial retrieval speed advantages, i.e. for an index with 1 million images, VisualSparta gets over 391x speed up compared to standard vector search. Experiments show that this speed advantage gets even bigger for larger datasets because VisualSparta can be efficiently implemented as an inverted index. To the best of our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model that can achieve real-time searching for very large datasets, with significant accuracy improvement compared to previous state-of-the-art methods.

https://arxiv.org/abs/2101.00265

Text-to-image retrieval is an important task in multi-modal information retrieval: given a textual query, retrieve the relevant images from a large, unlabeled image collection. In this paper we present VisualSparta, a new text-to-image retrieval model that outperforms existing models in both accuracy and efficiency. We evaluate it on MSCOCO and Flickr30K, where it surpasses all previous scalable methods. In terms of speed, on an index of one million images it is over 391x faster than standard vector search, and experiments show this advantage grows with dataset size because the model can be implemented efficiently as an inverted index. To our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model that achieves real-time search over very large collections while improving accuracy over previous state-of-the-art methods.
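The speed claim rests on the inverted-index formulation: relevance weights between vocabulary terms and image fragments can be precomputed offline, so a query reduces to sparse lookups and additions. A small self-contained sketch of that lookup (the weights themselves, produced by the model, are assumed as given, and all names are hypothetical):

```python
from collections import defaultdict

def build_inverted_index(term_weights):
    """term_weights: dict image_id -> dict term -> weight, where each weight is a
    precomputed relevance of that vocabulary term to the image's fragments."""
    index = defaultdict(list)
    for image_id, weights in term_weights.items():
        for term, w in weights.items():
            if w > 0:                       # sparse: keep only positive-weight terms
                index[term].append((image_id, w))
    return index

def search(index, query_terms, top_k=5):
    """Score each candidate image by summing its weights over the query terms."""
    scores = defaultdict(float)
    for term in query_terms:
        for image_id, w in index.get(term, []):
            scores[image_id] += w
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# toy usage with made-up weights
idx = build_inverted_index({
    "img1": {"dog": 2.1, "grass": 0.7},
    "img2": {"cat": 1.8, "sofa": 1.2},
})
print(search(idx, ["dog", "grass"]))        # -> [('img1', 2.8)]
```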

ManiGAN: Text-Guided Image Manipulation


The goal of our paper is to semantically edit parts of an image matching a given text that describes desired attributes (e.g., texture, colour, and background), while preserving other contents that are irrelevant to the text. To achieve this, we propose a novel generative adversarial network (ManiGAN), which contains two key components: text-image affine combination module (ACM) and detail correction module (DCM). The ACM selects image regions relevant to the given text and then correlates the regions with corresponding semantic words for effective manipulation. Meanwhile, it encodes original image features to help reconstruct text-irrelevant contents. The DCM rectifies mismatched attributes and completes missing contents of the synthetic image. Finally, we suggest a new metric for evaluating image manipulation results, in terms of both the generation of new attributes and the reconstruction of text-irrelevant contents. Extensive experiments on the CUB and COCO datasets demonstrate the superior performance of the proposed method. Code is available at https://github.com/mrlibw/ManiGAN.

https://arxiv.org/pdf/1912.06203.pdf

The goal of this paper is to use text to semantically edit specific parts of an image (e.g., texture, color, or background) while preserving the content that is irrelevant to the text. To do this, we propose a novel GAN (ManiGAN) with two key components: a text-image affine combination module (ACM) and a detail correction module (DCM). The ACM selects the image regions relevant to the given text and edits those regions according to the corresponding words, while also encoding the original image features to help reconstruct the text-irrelevant content. The DCM rectifies mismatched attributes and completes missing content in the synthesized image. Finally, we propose a new metric for evaluating image manipulation that reflects both the generation of new attributes and the reconstruction of text-irrelevant content. Experiments on the CUB and COCO datasets demonstrate the strong performance of the proposed method.
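A sketch of the affine-combination idea behind the ACM: text-conditioned hidden features are scaled and shifted by parameters predicted from the original image features, so text-relevant regions can be edited while image content is carried through. Channel sizes and the exact fusion are assumptions, not ManiGAN's published layers (see the released code for the real module).

```python
import torch
import torch.nn as nn

class AffineCombination(nn.Module):
    """Text-image affine combination in the spirit of ManiGAN's ACM (sketch)."""
    def __init__(self, text_ch, img_ch):
        super().__init__()
        self.scale = nn.Conv2d(img_ch, text_ch, 3, padding=1)
        self.shift = nn.Conv2d(img_ch, text_ch, 3, padding=1)

    def forward(self, h_text, v_img):
        # h_text: (B, text_ch, H, W) features already fused with the word/sentence embeddings
        # v_img : (B, img_ch,  H, W) features encoded from the input image
        return h_text * self.scale(v_img) + self.shift(v_img)
```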

Generative adversarial text to image synthesis

Code (third-party implementation): https://github.com/zsdonghao/text-to-image

Automatic synthesis of realistic images from text would be interesting and useful, but current AI systems are still far from this goal. However, in recent years generic and powerful recurrent neural network architectures have been developed to learn discriminative text feature representations. Meanwhile, deep convolutional generative adversarial networks (GANs) have begun to generate highly compelling images of specific categories, such as faces, album covers, and room interiors. In this work, we develop a novel deep architecture and GAN formulation to effectively bridge these advances in text and image modeling, translating visual concepts from characters to pixels. We demonstrate the capability of our model to generate plausible images of birds and flowers from detailed text descriptions.

https://arxiv.org/pdf/1605.05396.pdf

This paper proposes a GAN-based text-to-image synthesis method. The generator takes a text embedding concatenated with a random noise vector as input and synthesizes an image from it. In the discriminator, the intermediate image features are concatenated with the text embedding before the real/fake decision is made.
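A compact sketch of that conditioning scheme in PyTorch; the embedding dimensions, projection sizes, and output resolution are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class TextConditionalGenerator(nn.Module):
    """Concatenate a projected text embedding with noise and decode to an image."""
    def __init__(self, z_dim=100, t_dim=1024, t_proj=128, ch=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(t_dim, t_proj), nn.LeakyReLU(0.2))
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + t_proj, 4 * ch, 4, 1, 0), nn.BatchNorm2d(4 * ch), nn.ReLU(),
            nn.ConvTranspose2d(4 * ch, 2 * ch, 4, 2, 1), nn.BatchNorm2d(2 * ch), nn.ReLU(),
            nn.ConvTranspose2d(2 * ch, ch, 4, 2, 1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1), nn.Tanh(),          # 32x32 toy output
        )

    def forward(self, z, text_emb):
        x = torch.cat([z, self.proj(text_emb)], dim=1)      # (B, z_dim + t_proj)
        return self.net(x[:, :, None, None])                # treat as a 1x1 feature map

class TextConditionalDiscriminator(nn.Module):
    """Tile the text embedding and concatenate it with the image feature map
    before the final real/fake decision."""
    def __init__(self, t_dim=1024, t_proj=128, ch=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(t_dim, t_proj), nn.LeakyReLU(0.2))
        self.conv = nn.Sequential(
            nn.Conv2d(3, ch, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, 2 * ch, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(2 * ch, 4 * ch, 4, 2, 1), nn.LeakyReLU(0.2),  # 4x4 feature map
        )
        self.judge = nn.Conv2d(4 * ch + t_proj, 1, 4)               # real/fake logit

    def forward(self, img, text_emb):
        f = self.conv(img)                                           # (B, 4ch, 4, 4)
        t = self.proj(text_emb)[:, :, None, None].expand(-1, -1, f.shape[2], f.shape[3])
        return self.judge(torch.cat([f, t], dim=1)).view(-1)
```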