The Power of Diffusion Models

Diffusion models are trained to denoise by estimating the noise in a given input. They are widely used for image denoising as well as image generation.

In this project, we use the DeepFloyd IF model, available through Hugging Face.

Sampling from the Model

Here are some images generated from the following prompts with different numbers of inference steps. The random seed is set to 180.

Prompts: "An oil painting of a snowy mountain village", "A man wearing a hat", and "A rocket ship". Results are shown at 5, 10, 20, and 40 inference steps.

Notice how the image quality improves as the number of inference steps increases, especially in the fine details. The images also become more faithful to the text prompt and more coherent overall with more inference steps.

Forward Process

In the forward process, we take a clean image and gradually add noise to it. More specifically, we add zero-mean, i.i.d. Gaussian noise to each pixel of the image. This is equivalent to computing $$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$$ where $\epsilon \sim \mathcal{N}(0, I)$. In the DeepFloyd model, the timestep ranges over $t \in \{0, 1, \dots, 999\}$.
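In code, one step of the forward process can be sketched as follows (a minimal sketch; `alphas_cumprod` is assumed to be the $\bar{\alpha}$ table from the pretrained model's noise schedule):

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)                        # eps ~ N(0, I), i.i.d. per pixel
    x_t = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
    return x_t, eps
```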

We take an image of the Campanile as an example and show the noisy images at $t \in \{250, 500, 750\}$. This example shows that the image gradually becomes pure noise as $t$ increases.

Forward process at various noise levels.

Classical Denoising

Recall that a Gaussian filter removes the high-frequency components of an image. Since the added Gaussian noise is white, much of its energy lies at higher frequencies than the dominant content of the original image, so we can mitigate the noise by applying a Gaussian filter.
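This classical baseline is just a low-pass filter applied to the noisy image; a minimal sketch using torchvision (the kernel size and sigma here are illustrative, not tuned values):

```python
import torchvision.transforms.functional as TF

def blur_denoise(x_t, kernel_size=5, sigma=2.0):
    """Classical "denoising": Gaussian-blur the noisy (B, C, H, W) image."""
    return TF.gaussian_blur(x_t, kernel_size=kernel_size, sigma=sigma)
```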

Denoising with Gaussian Blurring

The result is clearly not ideal, especially at the higher noise levels, because the Gaussian filter also destroys the high-frequency details of the original image.

One-Step Denoising

Now, we use the pretrained diffusion model for the denoising task. The DeepFloyd model is trained with both time conditioning and text conditioning: it takes the timestep $t$ and a text embedding as conditioning signals. Here, we simply use the embedding of the prompt "a high quality photo" as an "empty" prompt.

We apply the model at various timesteps $t$ to estimate the noise in the image, and recover the clean image by inverting the forward-process formula: $$\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\hat{\epsilon}}{\sqrt{\bar{\alpha}_t}}$$
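A sketch of this one-step denoising, assuming a diffusers-style UNet whose call returns an object with a `.sample` field (the stage-1 UNet also predicts extra variance channels, which we drop here):

```python
import torch

@torch.no_grad()
def one_step_denoise(unet, x_t, t, alphas_cumprod, prompt_embeds):
    """Estimate the clean image from x_t in a single step."""
    eps_hat = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    abar_t = alphas_cumprod[t]
    # invert x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps for x0
    return (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
```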

The results are as follows. Notice that the model also "edits" the image a little, producing another plausible version of the Campanile.

One-step denoising with the DeepFloyd Model.

Iterative Denoising

Although one-step denoising does a reasonable job, the result is still not ideal when the image is very noisy. Diffusion models, however, are trained to denoise iteratively: instead of trying to remove all the noise in one step, we reverse the forward process by removing the noise little by little.

In the iteration at time $t$, we use one-step denoising to estimate the clean image $\hat{x}_0$ from the noisy image $x_t$ and the noise level at time $t$. Then, we obtain the estimate at the previous (less noisy) timestep $t'$ as a linear combination of $\hat{x}_0$ and $x_t$, $$x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}} \beta_t}{1 - \bar{\alpha}_t}\hat{x}_0 + \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t}x_t + v_\sigma$$ where $\bar{\alpha}_t$ comes from the noise schedule (we keep DeepFloyd's design choice), $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}$, $\beta_t = 1 - \alpha_t$, and $v_\sigma$ is the estimated variance added back into the image.

Also, to speed up the process, it turns out we can skip some of the timesteps. Here, we use strided timesteps with a stride of 30.
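Putting this together, a minimal sketch of the iterative denoising loop (reusing the interface assumptions from the one-step sketch above; the learned-variance term $v_\sigma$ is omitted):

```python
import torch

@torch.no_grad()
def iterative_denoise(unet, x, timesteps, alphas_cumprod, prompt_embeds):
    """Denoise x along a decreasing, strided list of timesteps (e.g. stride 30)."""
    for i in range(len(timesteps) - 1):
        t, t_prev = timesteps[i], timesteps[i + 1]
        abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = abar_t / abar_prev
        beta_t = 1 - alpha_t

        eps_hat = unet(x, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
        x0_hat = (x - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()

        # linear combination of the clean-image estimate and the noisy image
        x = (abar_prev.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
            + (alpha_t.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x
    return x
```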

The result of the iterative denoising is as follows. Compared side by side with the results of the previous methods, iterative denoising performs by far the best.

Iterative denoising process animated.
Intermediate estimates at $t = 90, 240, 390, 540, 690$.
Comparison: original image, iteratively denoised Campanile, one-step denoised Campanile, and Gaussian-blurred Campanile.

Diffusion Model Sampling

Similar to what we did in the beginning, we can also use the diffusion model to "denoise" an image that is nothing but pure noise, which amounts to generating an image from scratch.

Below are some results.

Five samples generated from pure noise.

We can see that some of the images generated from pure noise are not very realistic.

Classifier-free Guidance

To improve the quality of the generated images, we borrow an idea reminiscent of caricature generation. In each iteration of the denoising process, we additionally estimate the noise with a truly empty prompt and use it to "push" the image towards the "realistic" manifold, i.e., $$\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u)$$ where $\epsilon_u$ is the noise estimated with the empty prompt (the unconditional estimate) and $\epsilon_c$ is the noise estimated with our prompt (the conditional estimate). This trick is known as Classifier-free Guidance (CFG), and its strength is controlled by $\gamma$.
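A sketch of the CFG noise estimate, under the same UNet interface assumptions as before (the default guidance scale shown here is illustrative):

```python
import torch

@torch.no_grad()
def cfg_noise_estimate(unet, x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
    """eps = eps_u + gamma * (eps_c - eps_u); gamma > 1 strengthens the prompt."""
    eps_c = unet(x_t, t, encoder_hidden_states=cond_embeds).sample[:, :3]
    eps_u = unet(x_t, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    return eps_u + gamma * (eps_c - eps_u)
```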

Below are some results generated with CFG.

Five samples generated with CFG.

Image-to-Image Translation

As mentioned in the previous section, the denoiser doesn't usually recover the exact clean image, but also performs some reasonable "editing". Exploiting this fact, we can intentionally add some noise to an image and pass it through the diffusion model. This follows the SDEdit algorithm.
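A sketch of this SDEdit-style procedure, reusing the `iterative_denoise` sketch from above; `timesteps` is the strided schedule and `i_start` indexes into it (a larger `i_start` means less added noise, so the result stays closer to the original):

```python
import torch

@torch.no_grad()
def sdedit(unet, x_orig, i_start, timesteps, alphas_cumprod, prompt_embeds):
    """Noise the input image to timesteps[i_start], then denoise it back."""
    abar = alphas_cumprod[timesteps[i_start]]
    x = abar.sqrt() * x_orig + (1 - abar).sqrt() * torch.randn_like(x_orig)
    return iterative_denoise(unet, x, timesteps[i_start:], alphas_cumprod,
                             prompt_embeds)
```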

Here we inject different levels of noise into the Campanile image; the results are shown below. Not surprisingly, the larger the noise level, the more significant the edit.

Results at i_start = 1, 3, 5, 7, 10, and 20.

Editing Hand-Drawn and Web Images

Let's apply this to some other images found on the web.

Results at i_start = 1, 3, 5, 7, 10, and 20, alongside the original image.

It is even possible to draw a sketch by hand and have it rendered into a realistic image. Here are some examples.

Results at i_start = 1, 3, 5, 7, 10, and 20, alongside the original sketch.

Inpainting

We can also limit the editing to a specific region, following the idea in the inpainting paper. We define a binary mask $\mathbf{m}$ with the same spatial dimensions as the image ($\mathbf{m} = 1$ inside the region to edit). In each iteration of the CFG denoising loop, we "force" the pixels outside the editing region to be the ground truth plus the correct amount of noise. In other words, $$x_t \gets \mathbf{m}\, x_t + (1 - \mathbf{m})\, \mathrm{forward}(x_{orig}, t)$$ This maintains the correct noise level at each timestep, while ensuring that only the editing region gets modified.
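A sketch of this projection step, applied after each denoising update (`mask` is 1 inside the region to edit):

```python
import torch

def inpaint_project(x_t, x_orig, mask, t, alphas_cumprod):
    """Replace pixels outside the mask with the original image noised to level t."""
    abar = alphas_cumprod[t]
    noised_orig = abar.sqrt() * x_orig + (1 - abar).sqrt() * torch.randn_like(x_orig)
    return mask * x_t + (1 - mask) * noised_orig
```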

Below are some examples of the output of the inpainting algorithm.

Columns: original image, mask, region to edit, and result.

Text-Conditional Image-to-image Translation

So far, we have only been using the "empty" prompt "a high quality photo". However, if the prompt contains a specific description, we can steer the content of the edited photo.

We adopt the same algorithm as SDEdit, but this time with various different prompts. Some example results are shown below.

Prompts used, each shown at i_start = 1, 3, 5, 7, 10, and 20 alongside the original image:
a rocket ship
A violently erupting volcano
A tropical forest
A Steinway nine-foot grand piano

Visual Anagrams

With the diffusion model, we can also create multi-view optical illusions using ideas from the Visual Anagrams paper.

Starting from pure noise, we run the denoising algorithm on the noisy image using the first prompt, and at the same time apply a second prompt to the same image flipped upside down. The noise estimate at each timestep is the average of the two: $$\epsilon_1 = \mathrm{UNet}(x_t, t, p_1)$$ $$\epsilon_2 = \mathrm{flip}(\mathrm{UNet}(\mathrm{flip}(x_t), t, p_2))$$ $$\epsilon = (\epsilon_1 + \epsilon_2) / 2$$
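A sketch of the combined noise estimate (flipping along the height dimension; same UNet interface assumptions as before):

```python
import torch

@torch.no_grad()
def anagram_noise_estimate(unet, x_t, t, embeds_1, embeds_2):
    """Average the upright and flipped noise estimates."""
    eps_1 = unet(x_t, t, encoder_hidden_states=embeds_1).sample[:, :3]
    x_flip = torch.flip(x_t, dims=[-2])                      # flip upside down
    eps_2 = unet(x_flip, t, encoder_hidden_states=embeds_2).sample[:, :3]
    eps_2 = torch.flip(eps_2, dims=[-2])                     # flip back
    return (eps_1 + eps_2) / 2
```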

Below are some visual anagrams created with this algorithm. Hover over an image to rotate it upside down.

Right side up: an oil painting of people around a campfire; upside down: an oil painting of an old man.
Right side up: a photo of a hipster barista; upside down: an oil painting of a snowy mountain village.
Right side up: a photo of a dog; upside down: a photo of a man.

Hybrid Images

Another type of visual illusion is an image that looks different when viewed up close and from afar. For this type of illusion, we implement Factorized Diffusion. The algorithm is $$\epsilon_1 = \mathrm{UNet}(x_t, t, p_1)$$ $$\epsilon_2 = \mathrm{UNet}(x_t, t, p_2)$$ $$\epsilon = f_{\mathrm{lowpass}}(\epsilon_1) + f_{\mathrm{highpass}}(\epsilon_2)$$ where $f_{\mathrm{lowpass}}$ and $f_{\mathrm{highpass}}$ are low-pass and high-pass filters, respectively. Here, the low-pass filter is a Gaussian blur with kernel size 33 and sigma 2, and the high-pass filter is a Laplacian filter with the same kernel size and sigma. $p_1$ and $p_2$ are the two prompts for the hybrid image.
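A sketch of the factorized noise estimate; here the high-pass is taken as the residual of the Gaussian low-pass, which plays the same role as the Laplacian filter described above:

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def hybrid_noise_estimate(unet, x_t, t, embeds_1, embeds_2,
                          kernel_size=33, sigma=2.0):
    """Low frequencies follow prompt 1, high frequencies follow prompt 2."""
    eps_1 = unet(x_t, t, encoder_hidden_states=embeds_1).sample[:, :3]
    eps_2 = unet(x_t, t, encoder_hidden_states=embeds_2).sample[:, :3]
    low = TF.gaussian_blur(eps_1, kernel_size=kernel_size, sigma=sigma)
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```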

Below are some results.

Hybrid images: a skull and a waterfall; the Amalfi coast and people around a campfire; a snowy mountain village and a dog.

Diffusion Models from scratch!

Now let's implement a diffusion model of our own. In this part of the project, we train a diffusion model on the MNIST digit dataset to generate handwritten digits.

Implementing the UNet

Our denoiser is a simple UNet, consisting of an encoder and a decoder connected by skip connections. The architecture we implemented is shown in the figure below.

The UNet architecture implemented in this section

Preparing the dataset

To train our denoiser, we first need to generate a dataset by adding noise to the images in the MNIST dataset. For each training batch, we take the clean images $x$ and compute $$z = x + \sigma \epsilon$$ to get the noisy images, where $\sigma$ is a noise level that we can choose and $\epsilon \sim \mathcal{N}(0, I)$ is pixel-wise Gaussian noise.
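In code, this noising step is a one-liner (shown here for clarity):

```python
import torch

def add_noise(x, sigma):
    """z = x + sigma * eps with eps ~ N(0, I); x is a (B, 1, 28, 28) MNIST batch."""
    return x + sigma * torch.randn_like(x)
```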

Below are some examples of noisy images with different noise levels.

Visualization of various noise levels on MNIST digits.

Training the Denoiser

Now, we train the model on this dataset. For this section, we set $\sigma = 0.5$ in the noising process. Note that the images are noised as they are fetched from the dataloader, so the noise is different every time. The dataset is shuffled and batched with a batch size of 256, and we train for 5 epochs. The optimizer is Adam with a learning rate of $10^{-4}$.
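A minimal sketch of this training loop; `UNet()` stands for the architecture in the figure above, which is not reproduced here:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

sigma = 0.5
loader = DataLoader(
    datasets.MNIST("./data", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=256, shuffle=True)

model = UNet()                                   # architecture from the figure (assumed)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(5):
    for x, _ in loader:                          # labels are unused here
        z = x + sigma * torch.randn_like(x)      # fresh noise on every fetch
        loss = F.mse_loss(model(z), x)           # L2 loss against the clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```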

Training loss curve for the denoiser. The loss is reported every 10 mini-batches.

Here are the denoised outputs on noised digits from the test set.

Results on the digits from the test set after 1 epoch of training.
Results on the digits from the test set after 5 epochs of training.

Out-of-Distribution Testing

We can also test the denoiser on images with other noise levels to see how it performs on out-of-distribution inputs.

Results on the digits from the test set with various noise levels.

We can see that the model does reasonably well at noise levels below $\sigma = 0.5$, but the performance starts to degrade as the noise level increases. In the following sections, we will explore ways to remedy this.

Adding Time Conditioning to UNet

To enable denoising at various noise levels, we need to somehow inject the time $t$ as a conditioning signal into our UNet. One way to modify the model is as follows.

Time-conditioned UNet

Here, since our dataset is simple, we can get decent performance with $T = 300$. The time $t \in [0, 300]$ is normalized to a value in $[0, 1]$ before being embedded.
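One possible way to embed the normalized timestep is a small fully connected block whose output modulates a decoder feature map (a sketch; the exact block layout and injection points follow the architecture diagram):

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps the normalized timestep t in [0, 1] to a conditioning vector."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(),
            nn.Linear(out_dim, out_dim))

    def forward(self, t):
        return self.net(t)

# Inside the UNet decoder, one conditioning site might look like:
#   t_emb = fc_block(t.view(-1, 1))          # (B, C)
#   feat  = feat * t_emb.view(-1, C, 1, 1)   # scale a (B, C, H, W) feature map
```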

Training the time-conditioned UNet

The training process is mostly the same as for the unconditioned UNet described above, except that the model now predicts the noise in the image instead of the clean image directly. We use the forward process described earlier to add noise to the image, and train the model to regress the added noise.

Additionally, we use an exponential learning rate schedule, with a gamma of $0.1^{(1.0 / \mathrm{num\_epochs})}$.
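Putting the pieces together, a sketch of the time-conditioned training loop (the beta schedule, batch size, and learning rate here are illustrative; the model is assumed to take the noisy image and the normalized timestep):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

T, num_epochs = 300, 20
betas = torch.linspace(1e-4, 0.02, T)                 # illustrative schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

loader = DataLoader(
    datasets.MNIST("./data", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=128, shuffle=True)

model = TimeConditionedUNet()                         # architecture from the figure (assumed)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer, gamma=0.1 ** (1.0 / num_epochs))

for epoch in range(num_epochs):
    for x, _ in loader:
        t = torch.randint(0, T, (x.shape[0],))
        abar = alphas_cumprod[t].view(-1, 1, 1, 1)
        eps = torch.randn_like(x)
        x_t = abar.sqrt() * x + (1 - abar).sqrt() * eps      # forward process
        loss = F.mse_loss(model(x_t, t.float() / T), eps)    # predict the noise
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```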

Time-Conditioned UNet training loss curve. The loss is reported every 10 mini-batches.

Sampling from the Time-conditioned UNet

The sampling process is also similar to the one described earlier. Here are some results at various epochs. Hover the mouse over an image to freeze the frame on the results.

Epoch 1
Epoch 5
Epoch 10
Epoch 15
Epoch 20

Adding Class-Conditioning to UNet

Borrowing the idea from Classifier-free Guidance, we can also pass in the digit class as an additional conditioning signal to generate images that look more like real digits. We therefore add class-embedding layers to the model, and treat the embeddings as another conditioning signal on the decoder side of the UNet.

Recall that CFG also requires the model to be able to perform unconditional inference, i.e. without any class-conditioning signal. So, in addition to the embedding layers, we randomly set the conditioning signal to all zeros during training.
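A sketch of building the class-conditioning vector with random label dropout (the drop probability here is an assumed value; the write-up only states that the signal is sometimes set to all zeros):

```python
import torch
import torch.nn.functional as F

def class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot class vectors, randomly zeroed so the model also learns
    unconditional generation (needed for CFG at sampling time)."""
    c = F.one_hot(labels, num_classes).float()            # (B, num_classes)
    keep = (torch.rand(labels.shape[0], 1) > p_uncond).float()
    return c * keep                                       # zeroed rows = "no class"
```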

Class-conditioned UNet training loss curve. The loss is reported every 10 mini-batches.

Sampling from the Class-Conditioned UNet

The sampling process is similar to the one in the previous section. Here, we use $\gamma = 5$ and visualize the sampled images at various epochs.
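A sketch of the class-conditioned sampling loop with CFG (the model is assumed to take the image, the normalized timestep, and the class vector; `alphas_cumprod` is the training-time schedule):

```python
import torch

@torch.no_grad()
def sample_cfg(model, digit, alphas_cumprod, T=300, gamma=5.0, n=16):
    """Generate n samples of a given digit with classifier-free guidance."""
    alphas = torch.cat([alphas_cumprod[:1],
                        alphas_cumprod[1:] / alphas_cumprod[:-1]])
    betas = 1 - alphas

    c = torch.zeros(n, 10)
    c[:, digit] = 1.0
    c_null = torch.zeros_like(c)                    # "no class" signal

    x = torch.randn(n, 1, 28, 28)                   # start from pure noise
    for t in range(T - 1, -1, -1):
        t_norm = torch.full((n,), t / T)
        eps_c = model(x, t_norm, c)
        eps_u = model(x, t_norm, c_null)
        eps = eps_u + gamma * (eps_c - eps_u)       # CFG

        abar = alphas_cumprod[t]
        abar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        x0_hat = (x - (1 - abar).sqrt() * eps) / abar.sqrt()
        x = (abar_prev.sqrt() * betas[t] / (1 - abar)) * x0_hat \
            + (alphas[t].sqrt() * (1 - abar_prev) / (1 - abar)) * x
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```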

Hint: hover the mouse over an image to freeze the frame on the result.

Epoch 1
Epoch 5
Epoch 10
Epoch 15
Epoch 20
