Lazy Diffusion Transformer for Interactive Image Editing

Abstract

We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to be generated. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a “lazy” fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder’s runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10× speedup for typical user interactions, where the editing mask represents 10% of the image.

How does it work?

Our diffusion transformer decoder (bottom) reduces synthesis computation using two strategies.
First, we compress the image context using a separate encoder (not shown) that runs outside the diffusion loop. Second, we only generate the tokens corresponding to the masked region.
In contrast, typical diffusion transformers (top) maintain tokens for the entire image throughout the diffusion process to preserve global context. When performing inpainting, such a model generates a full-size image, most of which is discarded because only the hole region needs to be filled in. Existing convolutional diffusion models for inpainting suffer from the same drawbacks.
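
The second strategy can be illustrated with a minimal sketch that maps a binary pixel mask to the patch tokens a lazy decoder would need to generate. The function name `masked_token_indices`, the patch size, and the rule that a token counts as masked when any pixel in its patch is masked are assumptions made for illustration; they are not taken from the LazyDiffusion implementation.

```python
from typing import List


def masked_token_indices(mask: List[List[int]], patch: int) -> List[int]:
    """Return flat indices of patch tokens that overlap a binary pixel mask.

    Illustrative assumption: a token is selected if ANY pixel in its patch
    is masked. The actual model may use a different rule.
    """
    height, width = len(mask), len(mask[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    cols = width // patch
    indices = []
    for ty in range(height // patch):
        for tx in range(cols):
            patch_pixels = (
                mask[y][x]
                for y in range(ty * patch, (ty + 1) * patch)
                for x in range(tx * patch, (tx + 1) * patch)
            )
            if any(patch_pixels):
                indices.append(ty * cols + tx)
    return indices


# Toy 8x8 canvas split into 2x2 patches (16 tokens in total).
mask = [[0] * 8 for _ in range(8)]
for y in range(1, 4):
    for x in range(1, 4):
        mask[y][x] = 1  # the user masks a 3x3 pixel square

lazy_tokens = masked_token_indices(mask, patch=2)
total_tokens = (8 // 2) * (8 // 2)
print(f"{len(lazy_tokens)} of {total_tokens} tokens would be generated: {lazy_tokens}")
```

A full-image diffusion transformer would carry all `total_tokens` through every denoising step, whereas a lazy decoder would carry only `lazy_tokens`.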

Runtime

We compare LazyDiffusion to the two existing inpainting approaches: regenerating a smaller crop or regenerating the entire image. All methods use a PixArt-based architecture. LazyDiffusion is consistently faster than regenerating the entire image, especially for the small mask ratios typical of interactive edits, reaching a speedup of 10×. Similarly, LazyDiffusion is faster than regenerating a crop when the mask is sufficiently small. For larger masks (dashed), regenerating the crop is technically faster, but it generates at low resolution and naively upsamples to the desired resolution, harming image quality.
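
To build intuition for why runtime tracks the mask size, here is a deliberately crude cost model. It is a sketch under strong assumptions that are not from the paper: cost is taken to be linear in the number of tokens processed, the context encoder is charged as a single pass over all image tokens outside the diffusion loop, and attention's quadratic term and the cost of conditioning on the context are ignored. The token count and step count are arbitrary example values, and the resulting numbers are not measurements.

```python
def full_image_cost(num_tokens: int, steps: int) -> int:
    """Token-steps when every image token goes through every diffusion step."""
    return num_tokens * steps


def lazy_cost(num_tokens: int, mask_ratio: float, steps: int) -> int:
    """Token-steps when only masked tokens go through the diffusion loop.

    Assumption: the encoder costs one pass over all tokens, paid once.
    """
    masked_tokens = max(1, round(num_tokens * mask_ratio))
    encoder_pass = num_tokens
    return encoder_pass + masked_tokens * steps


def estimated_speedup(num_tokens: int, mask_ratio: float, steps: int) -> float:
    """Ratio of full-image cost to lazy cost under the toy model above."""
    return full_image_cost(num_tokens, steps) / lazy_cost(num_tokens, mask_ratio, steps)


# Example values only: 1024 tokens and 50 diffusion steps are assumptions.
for ratio in (0.1, 0.25, 0.5, 1.0):
    speedup = estimated_speedup(num_tokens=1024, mask_ratio=ratio, steps=50)
    print(f"mask ratio {ratio:.2f}: estimated speedup {speedup:.2f}x")
```

Under this model the estimated speedup grows as the mask shrinks, and it drops slightly below 1 when the mask covers the whole image, because the one-time encoder pass is then pure overhead.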

Progressive editing and generation

Each panel shows one generation step, to be compared with the preceding state of the canvas to its left.
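
The progression can be pictured as a loop in which each edit replaces only the masked pixels and leaves the rest of the canvas untouched. The sketch below shows only that bookkeeping: `apply_edit`, `square_mask`, and the constant-valued stand-in for the generator are hypothetical names introduced for illustration, and no diffusion model is involved.

```python
from typing import Callable, List

Canvas = List[List[int]]


def square_mask(size: int, top: int, left: int, side: int) -> Canvas:
    """Binary mask with a filled square; a stand-in for a user-drawn mask."""
    return [
        [1 if top <= y < top + side and left <= x < left + side else 0
         for x in range(size)]
        for y in range(size)
    ]


def apply_edit(canvas: Canvas, mask: Canvas,
               generate: Callable[[int, int], int]) -> Canvas:
    """Return a new canvas where only masked pixels are replaced.

    `generate(y, x)` is a hypothetical placeholder for the generator output.
    """
    return [
        [generate(y, x) if mask[y][x] else canvas[y][x]
         for x in range(len(row))]
        for y, row in enumerate(canvas)
    ]


canvas: Canvas = [[0] * 4 for _ in range(4)]  # blank canvas
edits = [(square_mask(4, 0, 0, 2), 1), (square_mask(4, 2, 2, 2), 2)]

history = [canvas]
for mask, value in edits:
    # Each edit sees the result of the previous one, as in the panels.
    canvas = apply_edit(canvas, mask, lambda y, x, v=value: v)
    history.append(canvas)

for state in history:
    print(state)
```

Each entry in `history` corresponds to one panel: the canvas after an edit, next to the state that preceded it.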

BibTeX


@misc{nitzan2024lazy,
  title={Lazy Diffusion Transformer for Interactive Image Editing},
  author={Yotam Nitzan and Zongze Wu and Richard Zhang and Eli Shechtman and Daniel Cohen-Or and Taesung Park and Michaël Gharbi},
  year={2024},
  eprint={2404.12382},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Incremental image generation using LazyDiffusion. The model generates content according to a text prompt in an area specified by a mask. Each update generates only the masked pixels, with a runtime that depends chiefly on the size of the mask, rather than that of the image.
