Introduction
In this project, I implement and deploy diffusion models for image generation. The project is divided into two parts. In Part A, I use Stability AI's DeepFloyd model to implement diffusion sampling loops and apply them to tasks such as inpainting and creating optical illusions. In Part B, I implement and train UNet-based diffusion models from scratch.
Part A: The Power of Diffusion Models!
Diffusion Model: Forward Process
The forward process gradually adds noise to a clean image \(x_0\), resulting in a noisy image \(x_t\).
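Concretely, the forward process scales the clean image and adds Gaussian noise according to the cumulative noise schedule. The sketch below is a minimal version, assuming a precomputed 1-D tensor `alphas_cumprod` of cumulative noise-schedule products; the names are illustrative rather than DeepFloyd's actual API.

```python
import torch

def forward_process(x0, t, alphas_cumprod):
    """Sample a noisy image x_t ~ q(x_t | x_0) from a clean image x0."""
    abar_t = alphas_cumprod[t]                      # cumulative noise-schedule term for step t
    eps = torch.randn_like(x0)                      # epsilon ~ N(0, I)
    x_t = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
    return x_t, eps
```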
Original Image (x₀)
Noisy Images at Different Timesteps
Classical Denoising
In this section, we apply classical Gaussian blur filtering to the noisy images generated in the forward process. Gaussian blur is a simple denoising technique that removes high-frequency noise.
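A minimal sketch of this baseline using torchvision is shown below; the kernel size and sigma are illustrative choices, not necessarily the values used for the figures.

```python
import torchvision.transforms.functional as TF

def gaussian_denoise(x_t, kernel_size=5, sigma=2.0):
    """Classical denoising: low-pass filter the noisy image with a Gaussian blur."""
    return TF.gaussian_blur(x_t, kernel_size=kernel_size, sigma=sigma)
```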
Noisy Images at Different Timesteps
Gaussian Blur Denoising at Different Timesteps
One Step Denoising
In this section, we use a pretrained diffusion model to denoise noisy images generated in the forward process. The pretrained UNet model is conditioned on Gaussian noise levels, allowing it to estimate and remove noise from noisy images at different timesteps.
Additionally, this model is text-conditioned, so we pass a text prompt "a high quality photo"
to help guide the denoising process. This prompt embedding provides additional context to the UNet.
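Given the UNet's noise estimate, a single-step estimate of the clean image follows by inverting the forward-process equation. The sketch below is a minimal version; `unet(x_t, t, prompt_emb)` is a stand-in for the actual DeepFloyd call, whose signature differs.

```python
def one_step_denoise(x_t, t, unet, prompt_emb, alphas_cumprod):
    """Estimate the clean image x0 from x_t with a single UNet call."""
    eps_hat = unet(x_t, t, prompt_emb)              # predicted noise at timestep t
    abar_t = alphas_cumprod[t]
    x0_hat = (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
    return x0_hat
```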
Noisy Images at Different Timesteps
One Step Denoising at Different Timesteps
Iterative Denoising
In the previous section, we used a pretrained UNet to perform one-step denoising. However, diffusion models are designed for iterative denoising, progressively refining noisy images until we obtain an estimate of the original clean image \(x_0\).
Instead of performing 1000 steps (which is computationally expensive), we use a technique called strided timesteps to skip steps and speed up the process. By choosing a regular stride (e.g., 30), we reduce the number of timesteps while maintaining denoising quality.
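The sketch below shows the strided schedule and the shape of the loop. For the per-step update it uses a simple deterministic (DDIM-style) step that re-noises the current clean-image estimate down to the next timestep; the project's exact update rule may differ, so treat this as a sketch under that assumption.

```python
def iterative_denoise(x, unet, prompt_emb, alphas_cumprod, i_start=0, stride=30, T=990):
    """Denoise x along a strided timestep schedule (T, T - stride, ..., 0)."""
    timesteps = list(range(T, -1, -stride))
    for i in range(i_start, len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        eps_hat = unet(x, t, prompt_emb)
        abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        x0_hat = (x - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()   # current clean estimate
        # Deterministic (DDIM-style) step: re-noise the clean estimate down to t_next.
        x = abar_next.sqrt() * x0_hat + (1 - abar_next).sqrt() * eps_hat
    return x
```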
Noisy to Clean Image Transition
Below, we show the iterative denoising process using strided timesteps (\(t = 690, 540, 390, 240, 90\)):
Predicted Clean Images
Diffusion Model Sampling
In this section, we use the diffusion model to generate images from random noise. By starting at \(i_{\text{start}} = 0\), we begin the denoising process with pure noise and iterate until the final clean image is obtained. This effectively allows the model to create new images from scratch.
For this example, we use the text prompt "a high quality photo"
to guide the image generation.
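Sampling from scratch then amounts to starting the same loop from pure Gaussian noise. The snippet below reuses the hypothetical `iterative_denoise` helper sketched above, along with the assumed `unet`, `prompt_emb`, and `alphas_cumprod`; the batch size and image resolution are illustrative.

```python
import torch

x_T = torch.randn(5, 3, 64, 64)   # pure noise; batch of 5, illustrative image size
samples = iterative_denoise(x_T, unet, prompt_emb, alphas_cumprod, i_start=0)
```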
Below are 5 samples generated by the model:
Classifier-Free Guidance (CFG)
Classifier-Free Guidance (CFG) is a technique that enhances image quality by balancing conditional and unconditional noise estimates. In CFG, we compute both a noise estimate conditioned on a text prompt (\( \epsilon_c \)) and an unconditional noise estimate (\( \epsilon_u \)). The final noise estimate is given by:
\[ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \]
Here, \( \gamma \) controls the strength of the guidance. When \( \gamma = 0 \), we use the unconditional noise estimate, and when \( \gamma = 1 \), we fully condition on the text prompt. For \( \gamma > 1 \), we obtain higher-quality images by amplifying the guidance signal. For more details, refer to this blog post.
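In code, this amounts to running the UNet twice per step, once with the prompt embedding and once with the embedding of the empty prompt, and blending the two estimates. The sketch below is minimal; `null_emb` is an assumed name for the empty-prompt embedding.

```python
def cfg_noise_estimate(x_t, t, unet, prompt_emb, null_emb, gamma=7.0):
    """Classifier-free guidance: blend conditional and unconditional noise estimates."""
    eps_c = unet(x_t, t, prompt_emb)   # conditioned on the text prompt
    eps_u = unet(x_t, t, null_emb)     # conditioned on the empty prompt
    return eps_u + gamma * (eps_c - eps_u)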
Generated Images
Below are 5 images generated using CFG with \( \gamma = 7 \):
Image-to-Image Translation
Image-to-image translation involves taking an existing image, adding controlled amounts of noise to it, and using a diffusion model to denoise it back into the natural image manifold. This process allows us to make creative "edits" to the original image, where the degree of noise determines the extent of the edits.
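This is the SDEdit idea: noise the original image to an intermediate timestep, then run the usual iterative denoising from there. The sketch below reuses the hypothetical `forward_process` and `iterative_denoise` helpers from earlier sections; the mapping from `i_start` to a timestep is illustrative.

```python
def edit_image(x_orig, i_start, unet, prompt_emb, alphas_cumprod, stride=30, T=990):
    """SDEdit-style editing: partially noise x_orig, then denoise it back."""
    timesteps = list(range(T, -1, -stride))
    t_start = timesteps[i_start]                 # larger i_start => less noise => smaller edit
    x_t, _ = forward_process(x_orig, t_start, alphas_cumprod)
    return iterative_denoise(x_t, unet, prompt_emb, alphas_cumprod, i_start=i_start)
```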
Editing Hand-Drawn and Web Images
This procedure is particularly effective when applied to non-realistic images, such as hand-drawn sketches, paintings, or abstract scribbles. By applying diffusion models, we can project these images onto the natural image manifold, resulting in creative transformations. Below, we experiment with various hand-drawn and web images, running them through iterative denoising processes to see how they evolve into realistic outputs.
Inpainting
Inpainting involves restoring or editing specific parts of an image while preserving the rest. This method is inspired by the RePaint paper, which utilizes diffusion models to fill in missing or altered parts of an image.
Formula
At each timestep \(t\), we update the image as follows:
\[ x_t \leftarrow m \odot x_t + (1 - m) \odot \text{forward}(x_{\text{orig}}, t) \]
where \(m\) is a binary mask that is 1 in the region to be inpainted and 0 where the original image should be preserved.
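Inside the denoising loop, this becomes one extra step after each update. A minimal sketch, assuming the `forward_process` helper from above and a mask tensor `m` with 1s in the region to generate:

```python
def inpaint_step(x_t, x_orig, m, t, alphas_cumprod):
    """Keep generated content where m == 1; restore the re-noised original elsewhere."""
    x_orig_t, _ = forward_process(x_orig, t, alphas_cumprod)
    return m * x_t + (1 - m) * x_orig_t
```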
Text-Conditioned Image-to-Image Translation
In this section, we perform the same image-to-image translation as in the previous section, but now we guide the translation with a text prompt. Instead of simply projecting the image onto the natural image manifold, we introduce text-based control, allowing the model to generate results aligned with specific language-based instructions.
Visual Anagrams
In this section, we create Visual Anagrams using diffusion models to generate optical illusions. The goal is to create an image that resembles one thing when viewed normally, but transforms into another when flipped upside down.
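Following the Visual Anagrams approach, we denoise with two prompts at once: one noise estimate is computed on the image as-is, the other on the vertically flipped image (then flipped back), and the two are averaged. The sketch below shows that combination step, assuming the `cfg_noise_estimate` helper sketched earlier.

```python
import torch

def anagram_noise_estimate(x_t, t, unet, emb_upright, emb_flipped, null_emb, gamma=7.0):
    """Average an upright noise estimate with a flipped one to form an optical illusion."""
    eps1 = cfg_noise_estimate(x_t, t, unet, emb_upright, null_emb, gamma)
    eps2 = cfg_noise_estimate(torch.flip(x_t, dims=[-2]), t, unet, emb_flipped, null_emb, gamma)
    eps2 = torch.flip(eps2, dims=[-2])           # undo the flip so the two estimates align
    return (eps1 + eps2) / 2
```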
Hybrid Images
In this section, we create Hybrid Images using a technique called Factorized Diffusion. Hybrid images are created by blending low and high-frequency components derived from two separate noise estimates using diffusion models. To create a composite image, we combine the low frequencies from one noise estimate with the high frequencies of another. The resulting hybrid image shows different characteristics depending on the viewing distance.
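In Factorized Diffusion, the composite noise estimate takes its low frequencies from one prompt's estimate and its high frequencies from the other's. A minimal sketch using a Gaussian blur as the low-pass filter is shown below; the kernel size and sigma are illustrative assumptions.

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(eps_low, eps_high, kernel_size=33, sigma=2.0):
    """Low frequencies from eps_low plus high frequencies from eps_high."""
    low = TF.gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
    return low + high
```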
Training a Single-Step Denoising UNet
The Single-Step Denoising UNet is designed to denoise a noisy image \(z\), approximating the clean image \(x\). The process combines a specialized architecture (the UNet) and a noise-injection process for training. Here's an overview:
Forward Process (Generating Training Data)
To train the denoiser, we simulate noisy data by adding noise to clean images \(x\):
\[ z = x + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]
Here:
- \( \sigma \) is the noise level, and we vary it across a range (e.g., \( \sigma = [0.0, 0.2, 0.4, \ldots] \)).
- This process creates pairs \((z, x)\), where \(z\) is the noisy input and \(x\) is the target (see the sketch after this list).
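A minimal sketch of generating these training pairs on the fly from a batch of clean images:

```python
import torch

def make_training_pair(x, sigma):
    """Create a (noisy, clean) pair at noise level sigma."""
    z = x + sigma * torch.randn_like(x)
    return z, x
```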
Training the UNet
The training objective is to minimize the reconstruction loss between the predicted clean image \(\hat{x}\) and the ground truth \(x\):
\[ L = \mathbb{E}_{z, x} \left\| D_\theta(z) - x \right\|^2 \]
Key Steps:
- Input: The noisy image \(z\) and its corresponding clean image \(x\).
- Prediction: The UNet outputs \(\hat{x} = D_\theta(z)\), an estimate of the clean image.
- Loss: The L2 loss measures the difference between \(\hat{x}\) and \(x\).
- Optimization: Gradients are backpropagated to update the UNet's parameters.
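These steps translate into a standard PyTorch training step. The sketch below is minimal and assumes a `unet` module that maps noisy images to clean estimates; the default noise level and names are illustrative.

```python
import torch
import torch.nn.functional as F

def train_step(unet, optimizer, x, sigma=0.5):
    """One optimization step of the single-step denoiser."""
    z = x + sigma * torch.randn_like(x)    # noisy input
    x_hat = unet(z)                        # predicted clean image
    loss = F.mse_loss(x_hat, x)            # L2 reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```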
High-Level Overview of the Time-Conditioned UNet
The Time-Conditioned UNet is an enhancement of the Single-Step Denoising UNet, designed to model the diffusion process iteratively across timesteps. This model is essential in diffusion models, where denoising at each timestep depends on both the input and the timestep itself. Here's an overview:
Forward Process (Training Data Generation)
To train the model, we simulate noisy images at various timesteps using the forward diffusion process:
\[ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]
Here:
- \(x_0\): Clean image.
- \(x_t\): Noisy image at timestep \(t\).
- \(\bar{\alpha}_t\): Cumulative product of noise scheduling terms.
- \(\epsilon\): Gaussian noise.

This process creates pairs \((x_t, t)\), where the model learns to predict \(\epsilon\).
Training the Time-Conditioned UNet
The objective of training is to minimize the error in predicting the noise term \(\epsilon\):
\[ L = \mathbb{E}_{x_0, t, \epsilon} \left\| \epsilon_\theta(x_t, t) - \epsilon \right\|^2 \]
Key Steps:
- Input: Noisy image \(x_t\) and its timestep \(t\).
- Output: The UNet predicts \(\epsilon_\theta(x_t, t)\), the noise added to \(x_0\) at timestep \(t\).
. - Loss: L2 loss measures the difference between the predicted and actual noise.
- Optimization: Backpropagation updates the UNet's parameters.
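Put together, one training iteration looks roughly like the sketch below. It assumes a `unet(x_t, t)` module that takes the timestep as an extra input and a precomputed `alphas_cumprod` schedule; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def train_step_time_conditioned(unet, optimizer, x0, alphas_cumprod):
    """One optimization step of the time-conditioned denoiser (predicts the noise)."""
    b = x0.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)           # random timestep per image
    abar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps          # forward process
    eps_hat = unet(x_t, t)                                    # predicted noise
    loss = F.mse_loss(eps_hat, eps)                           # L2 loss on the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```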
Iterative Denoising (Inference)
During inference, the Time-Conditioned UNet performs iterative denoising, starting from pure noise \(x_T\) and progressing toward the clean image \(x_0\):
\[ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sqrt{\beta_t}\, z \]
Here:
- \(\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t-1}\) and \(\beta_t = 1 - \alpha_t\): per-step noise schedule terms.
- \(z\): Gaussian noise added for intermediate steps (set to zero at the final step).

This formula ensures that the image is gradually denoised step-by-step.
Summary
The Time-Conditioned UNet builds on the standard UNet by conditioning on timesteps, making it suitable for iterative denoising tasks. Its strengths include:
- Timestep Adaptation: Learns to predict noise specific to each timestep.
- Iterative Denoising: Gradually refines the noisy image to reconstruct the clean image.
- Generalization: Trained on varying noise levels, enabling robustness across timesteps.
High-Level Overview of the Class-Conditioned UNet
The Class-Conditioned UNet builds on the foundation of the Time-Conditioned UNet by introducing an additional conditioning mechanism based on image class labels. This model is designed to denoise images while incorporating semantic information about the image class, making it suitable for class-conditional diffusion models. Here's an overview:
Forward Process (Training Data Generation)
To train the model, we simulate noisy images at various timesteps using the forward diffusion process:
\[ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]
Here:
- \(x_0\): Clean image.
- \(x_t\): Noisy image at timestep \(t\).
- \(\bar{\alpha}_t\): Cumulative product of noise scheduling terms.
- \(\epsilon\): Gaussian noise.

This process creates triplets \((x_t, t, c)\), where \(c\) is the class label, and the model learns to predict \(\epsilon\).
Training the Class-Conditioned UNet
The objective of training is to minimize the error in predicting the noise term \(\epsilon\):
\[ L = \mathbb{E}_{x_0, t, c, \epsilon} \left\| \epsilon_\theta(x_t, t, c) - \epsilon \right\|^2 \]
Key Steps:
- Input: Noisy image \(x_t\), its timestep \(t\), and class label \(c\).
- Output: The UNet predicts \(\epsilon_\theta(x_t, t, c)\), the noise added to \(x_0\) at timestep \(t\) for class \(c\).
- Loss: L2 loss measures the difference between the predicted and actual noise.
- Optimization: Backpropagation updates the UNet's parameters.
Classifier-Free Guidance (CFG)
During inference, the Class-Conditioned UNet can leverage Classifier-Free Guidance (CFG) to balance fidelity and diversity:
\[ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \]
Here:
- \(\epsilon_u\): Unconditional noise estimate (no class conditioning).
- \(\epsilon_c\): Class-conditional noise estimate.
- \(\gamma\): CFG scale that controls the strength of class conditioning.
Iterative Denoising (Inference)
During inference, the Class-Conditioned UNet performs iterative denoising, starting from pure noise \(x_T\) and progressing toward the clean image \(x_0\):
\[ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon \right) + \sqrt{\beta_t}\, z \]
where \(\epsilon\) is the CFG-combined noise estimate from above. Here:
- \(z\): Gaussian noise added for intermediate steps (set to zero at the final step).

This formula ensures that the image is gradually denoised step-by-step while adhering to the specified class.
Summary
The Class-Conditioned UNet builds on the Time-Conditioned UNet by incorporating class-conditioning, making it capable of generating or denoising images conditioned on a specific class. Its strengths include:
- Class Awareness: Learns to predict noise specific to each class label.
- Timestep Adaptation: Adapts denoising to the current timestep.
- Versatility: Suitable for class-conditional image generation and denoising tasks.