Modality Disentangled Discriminator for Text-to-Image Synthesis

Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 24, pp. 2112-2124
Main Authors: Feng, Fangxiang; Niu, Tianrui; Li, Ruifan; Wang, Xiaojie
Format: Journal Article
Language: English
Published: Piscataway, NJ: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2022
Description
Summary: Text-to-image (T2I) synthesis aims at generating photo-realistic images from text descriptions and is a particularly important task for bridging vision and language. Each generated image consists of two parts: a content part related to the text and a style part irrelevant to the text. Existing discriminators do not distinguish between the content part and the style part. This not only prevents T2I synthesis models from generating the content part effectively but also makes it difficult to manipulate the style of the generated image. In this paper, we propose a modality disentangled discriminator that separates the content part from the style part at a specific layer. Specifically, we enforce a certain number of early layers in the discriminator to become a disentangled representation extractor through two losses. The extracted common representation for the content part makes the discriminator more effective at capturing the text-image correlation, while the extracted modality-specific representation for the style part can be directly transferred to other images. Combining these two representations also improves the quality of the generated images. Our proposed discriminator replaces the discriminator of each stage in the representative model AttnGAN and the state-of-the-art model DM-GAN. Extensive experiments on three widely used T2I datasets, i.e., CUB, Oxford-102, and COCO, demonstrate the superior performance of the modality disentangled discriminator over the base models. Code for DM-GAN with our modality disentangled discriminator is available at https://github.com/FangxiangFeng/DM-GAN-MDD.
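As a reading aid only, the PyTorch sketch below illustrates one plausible way a discriminator's early layers could be split into a text-aligned content head and a text-irrelevant style head, as the summary describes. All class names, layer sizes, and both auxiliary losses here are illustrative assumptions, not the authors' method; their actual implementation is in the linked repository.

```python
# Minimal sketch of a "modality disentangled" discriminator.
# Everything below (names, dimensions, losses) is a hypothetical
# instantiation of the idea in the abstract, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDisentangledD(nn.Module):
    def __init__(self, text_dim=256, feat_dim=512):
        super().__init__()
        # Early conv layers act as the shared representation extractor.
        self.early = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, feat_dim, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
        )
        # Two heads disentangle the shared feature at a specific layer:
        # one for text-related content, one for text-irrelevant style.
        self.content_head = nn.Conv2d(feat_dim, feat_dim // 2, 1)
        self.style_head = nn.Conv2d(feat_dim, feat_dim // 2, 1)
        self.text_proj = nn.Linear(text_dim, feat_dim // 2)
        self.real_fake = nn.Linear(feat_dim, 1)  # adversarial decision

    def forward(self, image, text_emb):
        h = self.early(image)
        content = self.content_head(h).mean(dim=(2, 3))  # pooled content code
        style = self.style_head(h).mean(dim=(2, 3))      # pooled style code
        t = self.text_proj(text_emb)
        # Hypothetical "common" loss: align the content code with the
        # matching text embedding (cosine alignment is one possible choice).
        common_loss = 1 - F.cosine_similarity(content, t).mean()
        # Hypothetical "specific" loss: discourage the style code from
        # carrying text information, approximated by penalizing alignment.
        specific_loss = F.cosine_similarity(style, t).abs().mean()
        logit = self.real_fake(torch.cat([content, style], dim=1))
        return logit, common_loss, specific_loss
```

In such a setup, the real/fake logit would feed the usual adversarial loss, and the two auxiliary terms would be added to the discriminator objective to encourage disentanglement; the paper's two losses may well differ from these stand-ins.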
ISSN: 1520-9210
EISSN: 1941-0077
DOI: 10.1109/TMM.2021.3075997