
Diffusion models have exhibited superior performance compared to previous state-of-the-art generative models, spurring researchers to explore their remarkable generative capabilities across a spectrum of downstream tasks. Notably, conditional generation using pretrained diffusion models in a zero-shot or few-shot fashion has emerged as a focal point of research. Consequently, conditional diffusion models leveraging diverse conditioning inputs, including text, class labels, degraded images, segmentation maps, landmarks, hand-drawn sketches, and more, have been introduced. This post provides a concise overview of select works in this domain.


Zero-shot Applications

These training-free methods exploit the iterative denoising process inherent to diffusion models, leveraging various techniques to achieve conditional generation for certain tasks without any additional training.

SDEdit: Guided Image Synthesis & Editing

Stochastic differential editing (SDEdit) enables user-guided image generation and editing with diffusion models, without task-specific training or loss functions. Any form of RGB-pixel manipulation, such as a stroke painting or an image with stroke edits, can be injected into SDE-based generative models, yielding realistic and faithful images from guides of varying fidelity.

Fig 1. SDEdit can generate realistic, faithful and diverse images for a given stroke input drawn by a human (Meng et al. 2022)


The fundamental insight behind SDEdit is to hijack the generative process of SDE-based generative models: an appropriate amount of noise is added to the user's guidance input to smooth out undesirable artifacts and distortions while retaining its overall structure. The reverse diffusion process is then run to remove that noise, yielding an output that is both realistic and faithful to the initial guidance input.

Fig 2. Synthesizing images from strokes with SDEdit (Meng et al. 2022)


Formally, consider the following perturbing forward process of SDE-based generative models for the time interval $t \in [0, 1]$:

$$x(t) = \alpha(t)\, x(0) + \sigma(t)\, z, \quad \text{where } x(0) \sim p_{\text{data}},\ z \sim \mathcal{N}(0, \mathbf{I})$$

Note that we generally have $\alpha(t) \to 0$ and $\sigma(t) \to 1$ as $t \to 1$. For instance, DDPM adopts $\alpha(t) = \sqrt{\bar{\alpha}_t}$ and $\sigma(t) = \sqrt{1 - \bar{\alpha}_t}$. Given a user guidance image $x^{(g)}$, SDEdit samples $x^{(g)}(t_0) \sim \mathcal{N}(x^{(g)}, \sigma^2(t_0)\mathbf{I})$ and obtains the denoised output $x(0) = \text{SDEdit}(x^{(g)}; t_0, \theta)$ by iterating the reverse SDE. The key hyperparameter of SDEdit is $t_0 \in (0, 1)$, the time at which the reverse SDE is initiated, which also indicates how much forward perturbation is applied to the guide. It therefore determines the realism-faithfulness trade-off. In practice, the authors found that $t_0 \in [0.3, 0.6]$ works well.

Fig 3. Trade-off between faithfulness and realism for SDEdit. (Meng et al. 2022)
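
To make this concrete, below is a minimal sketch of SDEdit under a DDPM-style discretization, using the same forward process as above. The `eps_model` interface and the `alpha_bar` schedule are hypothetical stand-ins rather than the authors' code; the sketch only illustrates perturbing the guide up to $t_0$ and then running the reverse process from there.

```python
import torch

@torch.no_grad()
def sdedit(guide, eps_model, alpha_bar, t0_ratio=0.5):
    """Minimal SDEdit sketch with a DDPM-style sampler (hypothetical eps_model).

    guide:     user guidance image x^(g), shape (B, C, H, W)
    eps_model: eps_model(x_t, t) -> predicted noise            (assumed interface)
    alpha_bar: 1-D tensor of cumulative products alpha_bar_t, t = 0..T-1
    t0_ratio:  t0 in (0, 1); values in [0.3, 0.6] trade faithfulness for realism
    """
    T = len(alpha_bar)
    t0 = int(t0_ratio * T) - 1

    # 1) Perturb the guide to time t0 with the forward process.
    x = alpha_bar[t0].sqrt() * guide + (1 - alpha_bar[t0]).sqrt() * torch.randn_like(guide)

    # 2) Run the reverse process from t0 down to 0 (ancestral DDPM update).
    for t in reversed(range(t0 + 1)):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[t - 1] if t > 0 else alpha_bar.new_tensor(1.0)
        beta_t = 1 - a_t / a_prev
        eps = eps_model(x, torch.full((guide.shape[0],), t, device=guide.device))
        mean = (x - beta_t / (1 - a_t).sqrt() * eps) / (1 - beta_t).sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + beta_t.sqrt() * noise
    return x
```

Increasing `t0_ratio` discards more of the guide's fine detail (more realism, less faithfulness), matching the trade-off shown in Fig 3.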


RePaint: Inpainting with DDPM

RePaint, a DDPM-based inpainting algorithm proposed by Lugmayr et al. 2022, demonstrates that unconditionally pre-trained image diffusion models excel at inpainting without any additional training or tuning, effectively filling in the missing foreground regions while preserving the given background.

Fig 4. Visualization of samples generated from RePaint (Lugmayr et al. 2022)


Formally, the goal of inpainting is to predict the missing pixels $(1 - m) \odot x$ of a ground-truth image $x$, conditioned on the known background pixels $m \odot x$, where $m$ denotes the background mask. At each iteration of the reverse process, RePaint combines the denoised foreground region to be filled with a noised version of the known background:

  1. Sampling the foreground image $x^{\text{unknown}}$
    Starting from $x_T$, at each timestep $t$, denoise $x_t$ one step with the reverse process, yielding $x^{\text{unknown}}_{t-1} \sim \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$.
  2. Sampling the background image $x^{\text{known}}$
    Perturb the known background image $x_0$ via the forward process with the corresponding noise scale, yielding $x^{\text{known}}_{t-1} \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t)\mathbf{I})$.
  3. Inpainting
    Combine $x^{\text{unknown}}_{t-1}$ and $x^{\text{known}}_{t-1}$ using the background mask $m$, yielding $x_{t-1} = m \odot x^{\text{known}}_{t-1} + (1 - m) \odot x^{\text{unknown}}_{t-1}$.

    Fig 5. Overview of sampling step of RePaint (Lugmayr et al. 2022)

  4. Resampling
    The sampling of the background pixels $x^{\text{known}}$ is performed independently of the generated parts of the image, which can result in disharmony. To address this semantic disparity, RePaint diffuses the output $x_{t-1}$ back to $x_t$ with $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t \mathbf{I})$ and resamples $x_{t-1}$ again. Qualitatively, additional resampling steps lead to more harmonized images.

    Fig 6. The effect of applying $n$ sampling steps, where $n = 2$ corresponds to DDPM with one resampling (Lugmayr et al. 2022)


The following pseudo-code summarizes the pipeline of the RePaint algorithm:

Fig 7. Pseudo-code of RePaint algorithm (Lugmayr et al. 2022)
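
The same sampling step can be sketched directly in code. The snippet below is a minimal, hedged illustration of one RePaint reverse step with resampling under a DDPM-style sampler; `eps_model` and the `alpha_bar` schedule are hypothetical stand-ins, not the authors' implementation.

```python
import torch

@torch.no_grad()
def repaint_step(x_t, x0, mask, t, eps_model, alpha_bar, n_resample=2):
    """One RePaint reverse step at timestep t (minimal sketch, hypothetical eps_model).

    x_t:  current noisy image
    x0:   ground-truth image (only its masked background pixels are trusted)
    mask: 1 where pixels are known (background), 0 where they must be inpainted
    """
    B = x_t.shape[0]
    for r in range(n_resample):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[t - 1] if t > 0 else alpha_bar.new_tensor(1.0)
        beta_t = 1 - a_t / a_prev

        # 1) Unknown region: one reverse (denoising) step from x_t.
        eps = eps_model(x_t, torch.full((B,), t, device=x_t.device))
        mean = (x_t - beta_t / (1 - a_t).sqrt() * eps) / (1 - beta_t).sqrt()
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_unknown = mean + beta_t.sqrt() * noise

        # 2) Known region: forward-diffuse the ground truth to the same noise level.
        x_known = a_t.sqrt() * x0 + (1 - a_t).sqrt() * torch.randn_like(x_t)

        # 3) Inpainting: combine with the background mask m.
        x_prev = mask * x_known + (1 - mask) * x_unknown

        # 4) Resampling: diffuse x_{t-1} back to x_t and repeat (skipped on last pass).
        if r < n_resample - 1 and t > 0:
            x_t = (1 - beta_t).sqrt() * x_prev + beta_t.sqrt() * torch.randn_like(x_prev)
    return x_prev
```

Calling this for $t = T-1, \dots, 0$ reproduces the loop in Fig 7; a larger `n_resample` yields more harmonized results at the cost of proportionally more network evaluations.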


Few-shot Adaptations

Training-based methods can achieve robust, controllable generation by fine-tuning on paired data. One of the most prominent applications is personalizing text-to-image models, which can be accomplished with few-shot fine-tuning.

Text-to-Image Few-shot Personalization

Adapting Stable Diffusion to a particular subject or style is typically done by prompt engineering or by fine-tuning on a set of target images; Textual Inversion and DreamBooth are two representative approaches.

Textual Inversion (TI)

Instead of fine-tuning the diffusion model, a natural way to personalize the generated outputs with exemplar images is prompt engineering: describing the desired style directly in the textual prompt. Textual Inversion (TI) introduces a new form of prompt engineering that learns new "words" from a small set of exemplar images.

To inject the specific concept or style of the exemplar images into the textual embedding space of a pre-trained text-to-image model, Textual Inversion optimizes an embedding vector $v_*$ for a new pseudo-word $S_*$ so that it represents the concept depicted in the given images.

Fig 8. Textual Inversion learns to generate specific concepts by describing them using new "words" in the embedding space of pre-trained text-to-image models. (Gal et al. 2023)


  1. Text Embedding
    Each word or sub-word in an input string is converted to a token, i.e., an index in some pre-defined dictionary. Each token is then linked to a unique embedding vector $v_n$ that can be retrieved via an index lookup. The embedding vectors are transformed into a single conditioning code $c_\theta(y)$ by the text encoder $c_\theta$, which subsequently guides the generative model.
  2. Textual Inversion
    To find the new embedding, a small set of images (typically 3-5) depicting the target concept across multiple settings, such as varied backgrounds or poses, is used. The embedding vector $v_*$ of $S_*$ is directly optimized by minimizing the LDM loss, while freezing both $c_\theta$ and $\epsilon_\theta$:
    $$v_* = \arg\min_{v} \mathbb{E}_{z \sim \mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0, 1),\, t} \left[ \| \epsilon - \epsilon_\theta(z_t, t, c_\theta(y)) \|_2^2 \right]$$


Fig 9. Outline of the text-embedding and inversion process (Gal et al. 2023)
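
As a rough illustration of the optimization above, the loop can be sketched as follows. Every interface here (`encode_image`, `encode_prompt`, `noise_model`, and the `new_token_id` bookkeeping) is a hypothetical placeholder, not the paper's code; the point is simply that everything stays frozen and only the single embedding row $v_*$ receives gradient updates.

```python
import torch

def train_textual_inversion(images, token_embedding, new_token_id, noise_model,
                            encode_prompt, encode_image, alpha_bar,
                            steps=3000, lr=5e-3):
    """Minimal Textual Inversion sketch; every interface is a hypothetical stand-in.

    token_embedding: nn.Embedding of the (frozen) text encoder; only the row for the
                     new pseudo-word S_* is allowed to change.
    noise_model:     eps_theta(z_t, t, cond) -> predicted noise   (assumed signature)
    encode_prompt:   prompt string containing S_* -> conditioning code c_theta(y)
    encode_image:    x -> latent z (the LDM encoder E)
    alpha_bar:       1-D tensor of cumulative products alpha_bar_t
    """
    token_embedding.weight.requires_grad_(True)
    optimizer = torch.optim.Adam([token_embedding.weight], lr=lr)
    T = len(alpha_bar)

    for step in range(steps):
        x = images[step % len(images)]
        z0 = encode_image(x)                                  # z ~ E(x)
        t = torch.randint(0, T, (1,), device=z0.device)
        eps = torch.randn_like(z0)
        z_t = alpha_bar[t].sqrt() * z0 + (1 - alpha_bar[t]).sqrt() * eps  # forward process
        cond = encode_prompt("a photo of S_*")                # c_theta(y), uses the new token
        loss = torch.nn.functional.mse_loss(noise_model(z_t, t, cond), eps)

        optimizer.zero_grad()
        loss.backward()
        # Keep only the gradient of the new embedding row v_*; everything else stays frozen.
        grad = token_embedding.weight.grad
        mask = torch.zeros_like(grad)
        mask[new_token_id] = 1.0
        token_embedding.weight.grad = grad * mask
        optimizer.step()

    return token_embedding.weight[new_token_id].detach()
```

At inference time, the learned row stays in the embedding table and the pseudo-word $S_*$ can be used in ordinary prompts.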


However, akin to prompt engineering, this approach is limited by the capacity of the text embedding to capture the target's characteristics and by the expressiveness of the frozen diffusion model: it struggles to closely mimic the visual appearance of subjects in a given reference set while synthesizing novel renditions of them in different contexts.

DreamBooth

With only 3-5 exemplar images, DreamBooth (Ruiz et al. 2023) fine-tunes a text-to-image diffusion model to a specific object or person, thereby expanding the language-vision dictionary of the pre-trained model. Typically, it requires about 1,000 fine-tuning iterations.

Given a few images of a subject, DreamBooth aims to implant the subject into the output domain of the model so that it can be synthesized with a unique identifier [V]. The text-to-image diffusion model is fine-tuned on the input images paired with a text prompt containing a unique identifier and the name of the class the subject belongs to, e.g., "A [V] dog". In parallel, to prevent language drift that would cause the model to associate the class name with the specific instance, a class-specific prior preservation loss encourages the model to keep generating diverse instances of the subject's class when only the class name appears in the prompt.

Fig 10. With just a few images of a subject, DreamBooth can generate images of the subject in different contexts, using the guidance of a text prompt. (Ruiz et al. 2023)


  1. Personalization Prompts
    To bypass the overhead of writing detailed image descriptions, DreamBooth labels all input images of the subject as "a [identifier] [class noun]". The class descriptor helps tether the class prior to the unique subject of the exemplar images. To prevent entanglement with other words and with the class descriptor, the identifier should have a weak prior in both the language model and the diffusion model.

    To find such tokens, the authors perform a rare-token lookup in the vocabulary, obtaining rare tokens $f(\hat{V})$, where $f$ is the tokenizer, and then recover the character sequence $\hat{V}$ by de-tokenizing $f(\hat{V})$.
  2. Class-specific Prior Preservation Loss
    There are two major challenges in fine-tuning text-to-image models:
    • Language drift
      Fine-tuning on a specific task progressively erodes the syntactic and semantic language knowledge of the pre-trained model.
    • Reduced output diversity
      The model should be able to render the subject in novel viewpoints, poses, and articulations. However, fine-tuning may diminish the variability of output poses and perspectives, collapsing them to the limited few-shot views.
    To retain the pre-trained prior and generate diverse images of the class, DreamBooth supervises the model with its own generated samples $x_{\text{pr}} = \hat{x}(z_{t_1}, c_{\text{pr}})$ from the frozen pre-trained diffusion model $\hat{x}$, with random initial noise $z_{t_1} \sim \mathcal{N}(0, \mathbf{I})$ and conditioning vector $c_{\text{pr}} := \Gamma(f(\text{"a [class noun]"}))$. The model is then fine-tuned with the following loss:
    $$\mathbb{E}_{x, c, \epsilon, \epsilon', t}\left[ w_t \| \hat{x}_\theta(\alpha_t x + \sigma_t \epsilon, c) - x \|_2^2 + \lambda\, w_{t'} \| \hat{x}_\theta(\alpha_{t'} x_{\text{pr}} + \sigma_{t'} \epsilon', c_{\text{pr}}) - x_{\text{pr}} \|_2^2 \right]$$
    where the first term is the standard reconstruction loss on the subject images, the second term is the prior-preservation term that supervises the model with its own generated images, and $\lambda$ controls the relative weight of the latter.


Fig 11. Outline of DreamBooth fine-tuning (Ruiz et al. 2023)
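
A hedged sketch of this objective is shown below. The `x_hat_model` interface and the schedule tensors are assumptions for illustration, not the paper's code; the sketch only shows the reconstruction term on the subject image plus the $\lambda$-weighted prior-preservation term on a sample generated by the frozen model.

```python
import torch
import torch.nn.functional as F

def dreambooth_loss(x, c, x_pr, c_pr, x_hat_model, alpha, sigma, w, lam=1.0):
    """Minimal sketch of the DreamBooth objective (hypothetical x_hat_model interface).

    x, c      : subject image and its "a [V] [class noun]" conditioning vector
    x_pr, c_pr: prior image generated by the frozen model and its class-only conditioning
    x_hat_model(x_t, cond) -> predicted clean image x_hat   (assumed signature)
    alpha, sigma, w : per-timestep schedule tensors (alpha_t, sigma_t, weight w_t)
    """
    T = len(alpha)
    t = torch.randint(0, T, (1,), device=x.device)
    t_pr = torch.randint(0, T, (1,), device=x.device)
    eps, eps_pr = torch.randn_like(x), torch.randn_like(x_pr)

    # Reconstruction term on the subject image.
    x_t = alpha[t] * x + sigma[t] * eps
    rec = w[t] * F.mse_loss(x_hat_model(x_t, c), x)

    # Class-specific prior-preservation term on the model's own generation.
    x_t_pr = alpha[t_pr] * x_pr + sigma[t_pr] * eps_pr
    prior = w[t_pr] * F.mse_loss(x_hat_model(x_t_pr, c_pr), x_pr)

    return rec + lam * prior
```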


ControlNet: Adding Conditional Control to Text-to-Image Diffusion

Text-to-image models often lack precise control over the spatial composition of images; accurately expressing complex layouts, poses, shapes, and forms via text prompts alone is challenging. Zhang et al. 2023 proposed ControlNet, which efficiently adds image-based conditional controls to a large pretrained text-to-image diffusion model and remains robust even when trained on relatively small sets of input-output condition pairs (<50k).

Fig 12. Controlling Stable Diffusion with learned conditions. (Zhang et al. 2023)


ControlNet is an end-to-end neural network architecture that attaches to a frozen, large pretrained text-to-image diffusion model and enables conditional control of it. It does so by injecting additional conditions into the blocks of the network. Specifically, consider a trained neural block $\mathcal{F}(\cdot\,; \Theta)$ with parameters $\Theta$ that transforms an input feature map $x$ into another feature map $y$:

$$y = \mathcal{F}(x; \Theta)$$

To preserve the quality and capabilities of the large model, ControlNet locks $\Theta$ and clones the neural block into a trainable copy with parameters $\Theta_c$, which takes an external conditioning vector $c$ as an additional input. The trainable copy is connected to the locked model through two zero convolution layers $\mathcal{Z}(\cdot\,; \cdot)$, i.e., $1 \times 1$ convolutions whose weights $w$ and biases $b$ are initialized to zero:

$$\mathcal{Z}(x; w, b) = w \cdot x + b$$

The complete ControlNet then computes:

$$y_c = \mathcal{F}(x; \Theta) + \mathcal{Z}(\mathcal{F}(x + \mathcal{Z}(c; w_1, b_1); \Theta_c); w_2, b_2)$$

Fig 13. ControlNet block (Zhang et al. 2023)


Note that $y_c = y$ at initialization, since the zero convolutions output zero when their parameters are zero. That is, ControlNet starts from a state identical to the unconditional case, ignoring the conditional information at first, and introduces it gradually during training. This design ensures that harmful noise is not injected into the deep features of the large diffusion model at the beginning of fine-tuning, so the model retains the capabilities of the large pretrained backbone and can build on it.
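
Below is a minimal sketch of this block structure in PyTorch. It is a generic toy block, not the actual ControlNet implementation (which clones the Stable Diffusion U-Net encoder and processes the conditioning image with its own small encoder); it only demonstrates the frozen block, the trainable copy, and the two zero convolutions.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels):
    """1x1 convolution with weight and bias initialized to zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlNetBlock(nn.Module):
    """Wrap a frozen block F(.; Theta) with a trainable copy and two zero convolutions."""

    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.locked = block
        self.copy = copy.deepcopy(block)            # trainable copy, parameters Theta_c
        for p in self.locked.parameters():          # lock Theta
            p.requires_grad_(False)
        self.zero_in = zero_conv(channels)          # Z(.; w1, b1)
        self.zero_out = zero_conv(channels)         # Z(.; w2, b2)

    def forward(self, x, c):
        # y_c = F(x; Theta) + Z(F(x + Z(c; w1, b1); Theta_c); w2, b2)
        y = self.locked(x)
        return y + self.zero_out(self.copy(x + self.zero_in(c)))

# At initialization both zero convolutions output zero, so the block behaves
# exactly like the frozen backbone before any fine-tuning.
block = nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.SiLU())
cn = ControlNetBlock(block, channels=8)
x, c = torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16)
assert torch.allclose(cn(x, c), block(x))
```

The conditional branch only starts contributing once the zero-convolution weights move away from zero during fine-tuning, which is exactly the "gradual introduction" described above.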

This modification is extended to the entire text-to-image diffusion model by adapting its neural network blocks with zero convolutions:

  • For encoding the conditioning image, clone the pre-trained U-Net encoder parameters into a trainable copy that is updated during fine-tuning, while the original weights stay frozen.
  • Integrate the encoded conditioning information with the noisy image features through zero convolutions.

Given an input image $z_0$ and its noised version $z_t$, the diffusion model $\epsilon_\theta$ of ControlNet is fine-tuned by optimizing the following loss, given text prompts $c_t$ and a task-specific condition $c_f$:

$$\mathcal{L} = \mathbb{E}_{z_0, t, c_t, c_f, \epsilon \sim \mathcal{N}(0, \mathbf{I})}\left[ \| \epsilon - \epsilon_\theta(z_t, t, c_t, c_f) \|_2^2 \right]$$

Fig 14. Overview of ControlNet (Zhang et al. 2023)


Experimentally, ControlNet has demonstrated robust interpretation of content semantics in various input conditioning images, such as edge maps, segmentation maps, depth maps, etc.

Fig 15. Controlling Stable Diffusion with various conditions without prompts. (Zhang et al. 2023)


References

[1] Meng et al., SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, ICLR 2022.
[2] Lugmayr et al., RePaint: Inpainting using Denoising Diffusion Probabilistic Models, CVPR 2022.
[3] Ruiz et al., DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, CVPR 2023.
[4] Gal et al., An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion, ICLR 2023.
[5] Zhang et al., Adding Conditional Control to Text-to-Image Diffusion Models, ICCV 2023.
[6] Yu et al., FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model, ICCV 2023.
