Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing
Main Authors:
Format: Journal Article
Language: English
Published: 05-03-2024
Subjects:
Summary: Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have recently gained significant popularity for creative text-to-image generation. Yet, for domain-specific scenarios, tuning-free Text-guided Image Editing (TIE) is of greater importance to application developers; TIE methods modify objects or object properties in images by manipulating feature components in attention layers during the generation process. However, little is known about what semantic meanings these attention layers have learned and which parts of the attention maps contribute to the success of image editing. In this paper, we conduct an in-depth probing analysis and demonstrate that cross-attention maps in Stable Diffusion often contain object attribution information, which can result in editing failures. In contrast, self-attention maps play a crucial role in preserving the geometric and shape details of the source image during the transformation to the target image. Our analysis offers valuable insights into understanding cross- and self-attention maps in diffusion models. Moreover, based on our findings, we simplify popular image editing methods and propose a more straightforward, yet more stable and efficient, tuning-free procedure that only modifies the self-attention maps of specified attention layers during the denoising process. Experimental results show that our simplified method consistently surpasses the performance of popular approaches on multiple datasets.
DOI: 10.48550/arxiv.2403.03431
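The summary above describes a tuning-free procedure that replaces only the self-attention maps of selected attention layers during denoising. The sketch below is not the authors' released code; it is a minimal PyTorch illustration of that general idea, where the class name `SelfAttnInjector`, the layer-selection rule, and the record/inject mode switch are all illustrative assumptions.

```python
# Minimal sketch (assumed interface, not the paper's implementation):
# cache self-attention probability maps from the source-image denoising pass
# and re-inject them into chosen layers during the target-prompt pass.
import torch


class SelfAttnInjector:
    """Records self-attention maps in the source pass and swaps them into
    selected layers/timesteps in the target pass."""

    def __init__(self, inject_layers):
        self.inject_layers = set(inject_layers)  # e.g. {"up_block_1"} (hypothetical names)
        self.cache = {}                          # (layer, timestep) -> attention map
        self.mode = "record"                     # "record" or "inject"

    def attention(self, q, k, v, layer, t):
        # q, k, v: (batch * heads, tokens, dim) tensors of one self-attention layer
        scale = q.shape[-1] ** -0.5
        probs = torch.softmax(q @ k.transpose(-1, -2) * scale, dim=-1)
        if self.mode == "record":
            self.cache[(layer, t)] = probs.detach()
        elif self.mode == "inject" and layer in self.inject_layers:
            # Replace the target self-attention map with the stored source map,
            # which is what preserves the source image's geometry and shape.
            probs = self.cache[(layer, t)]
        return probs @ v


# Usage (schematic): run the source-image denoising pass in "record" mode,
# then set injector.mode = "inject" and rerun denoising with the target
# prompt, routing each selected layer's self-attention through
# injector.attention(...).
```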