In this project I implement diffusion loops and leverage them for various image generation tasks like inpainting and creating visual anagrams or hybrid images.
"DeepFloyd" is a two stage Diffusion model trained by Stability AI. The first stage generates 64 x 64 images and then upsamples them 256 x 256. Below are some outputs with varrying numbers of iteratives steps which dictate how many denoising steps are taken. As supported by the images, as you take more steps you get clearer more detailed imagse but the runtime increases quite a bit, consequently.
"an oil painting of a snowy mountain village" 10 iteratitions 256x256
"a man wearing a hat" 10 iterations 256x256
"a rocket ship" 10 iterations 256x256
"an oil painting of a snowy mountain village" 20 iteratitions 256x256
"a man wearing a hat" 20 iterations 256x256
"a rocket ship" 20 iterations 256x256
"an oil painting of a snowy mountain village" 500 iteratitions 256x256
"a man wearing a hat" 500 iterations 256x256
"a rocket ship" 500 iterations 256x256
The idea of a sampling loop is to begin with a clean image x0 and progressively add noise to it, giving xt, until at timestep t = T we have an image of essentially pure noise. The goal of a diffusion model is to remove this noise by predicting the noise present in an image.
The first step is the forward process, which adds noise to a clean image. This process is defined by the formula:
\( x_t = \sqrt{\overline{\alpha_t}}\, x_0 + \sqrt{1 - \overline{\alpha_t}}\, \epsilon \)
where \( \overline{\alpha_t} \) is the noise coefficient at timestep t and \( \epsilon \sim \mathcal{N}(0, 1) \) is random Gaussian noise. Below are noisy images at various t values for a sample image of the Campanile.
Berkeley Campanile
Campanile at t = 250
Campanile at t = 500
Campanile at t = 750
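A minimal sketch of this forward process; `alphas_cumbar` stands in for the precomputed \( \overline{\alpha_t} \) schedule (that name, and treating the image as a (C, H, W) tensor in [0, 1], are my assumptions).

```python
import torch

def forward(im, t, alphas_cumbar):
    """Add noise to a clean image im to produce x_t (sketch of the formula above).

    im:            clean image tensor x_0, shape (C, H, W)
    t:             integer timestep
    alphas_cumbar: 1-D tensor of cumulative products alpha-bar_t (assumed precomputed)
    """
    abar = alphas_cumbar[t]
    eps = torch.randn_like(im)                                   # epsilon ~ N(0, 1)
    x_t = torch.sqrt(abar) * im + torch.sqrt(1 - abar) * eps
    return x_t, eps
```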
A basic approach to denoising is to use a Gaussian blur to remove the noise in the image; however, as seen below, the results are not great.
Campanile at t = 250
Campanile at t = 500
Campanile at t = 750
Campanile at t = 250 Denoised
Campanile at t = 500 Denoised
Campanile at t = 750 Denoised
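The blur baseline is essentially a one-liner with torchvision; the kernel size and sigma below are illustrative guesses, not necessarily the values used for the images above.

```python
import torch
import torchvision.transforms.functional as TF

noisy_im = torch.rand(3, 64, 64)  # stand-in for a noisy Campanile image x_t
# Classical "denoising": low-pass filter the noisy image with a Gaussian blur
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```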
For this part we use a pretrained UNet to estimate the Gaussian noise in the image, remove it, and in theory get something close to the original image. I ran this for t = [250, 500, 750] and displayed the images below.
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
One-Step Denoised Campanile at t=250
One-Step Denoised Campanile at t=500
One-Step Denoised Campanile at t=750
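A sketch of the one-step estimate, assuming a helper `estimate_noise(x_t, t)` that wraps the pretrained UNet (conditioned on the prompt "a high quality photo"); the helper name is mine.

```python
import torch

def one_step_denoise(x_t, t, alphas_cumbar, estimate_noise):
    """Invert the forward-process formula using the UNet's noise estimate."""
    abar = alphas_cumbar[t]
    eps_hat = estimate_noise(x_t, t)                 # UNet prediction of the noise in x_t
    # Solve x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps for x_0
    x0_hat = (x_t - torch.sqrt(1 - abar) * eps_hat) / torch.sqrt(abar)
    return x0_hat
```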
In the last part we saw that the UNet struggles as more noise is added. For this part we iteratively denoise, meaning that we denoise from x1000 down to x0 step by step; to speed this up we take strided timesteps of size 30. The formula for computing the next step is:
\( x_{t'} = \frac{\sqrt{\overline{\alpha_{t'}}}\, \beta_t}{1 - \overline{\alpha_t}} x_0 + \frac{\sqrt{\alpha_t}\, (1 - \overline{\alpha_{t'}})}{1 - \overline{\alpha_t}} x_t + v_\sigma \)
where \( \alpha_t = \overline{\alpha_t} / \overline{\alpha_{t'}} \), \( \beta_t = 1 - \alpha_t \), \( x_0 \) is the current estimate of the clean image, and \( v_\sigma \) is added random noise.
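A sketch of a single update of this loop, using the same assumed helpers as above; the \( v_\sigma \) term is omitted for brevity.

```python
import torch

def iterative_denoise_step(x_t, t, t_prime, alphas_cumbar, estimate_noise):
    """One step of iterative denoising: move from timestep t to the less-noisy t' < t."""
    abar_t, abar_tp = alphas_cumbar[t], alphas_cumbar[t_prime]
    alpha_t = abar_t / abar_tp                       # alpha_t as defined above
    beta_t = 1 - alpha_t

    eps_hat = estimate_noise(x_t, t)                 # UNet noise estimate
    x0_hat = (x_t - torch.sqrt(1 - abar_t) * eps_hat) / torch.sqrt(abar_t)

    # Formula above; the v_sigma noise term is left out of this sketch
    x_tp = (torch.sqrt(abar_tp) * beta_t / (1 - abar_t)) * x0_hat \
         + (torch.sqrt(alpha_t) * (1 - abar_tp) / (1 - abar_t)) * x_t
    return x_tp
```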
Noisy Campanile at t=90
Noisy Campanile at t=240
Noisy Campanile at t=390
Noisy Campanile at t=540
Noisy Campanile at t=690
Original Campanile
Iteratively Denoised Campanile
One-step Denoised Campanile
Gaussian Blurred Campanile
We can use our iterative denoising to generate images from scratch by passing in random noise with the text prompt "a high quality photo". Here are 5 results below.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
The images from the previous section lack quality and sometimes even appear nonsensical. To improve the results we'll use classifier-free guidance (CFG) to refine the images by computing two different noise estimates: a conditional noise estimate and an unconditional noise estimate. By combining these two estimates, we can control the balance between diversity and quality in the images we generate.
In CFG, we denote the conditional noise estimate as \( \epsilon_{\text{cond}} \) and the unconditional noise estimate as \( \epsilon_{\text{uncond}} \). Our new noise estimate \( \epsilon \) is then given by:
\[ \epsilon = \epsilon_{\text{uncond}} + s \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}}) \]
Here, \( s \) is a scaling factor that controls the strength of the guidance. By adjusting \( s \), we can achieve different effects: \( s = 0 \) gives the unconditional estimate, \( s = 1 \) gives the conditional estimate, and \( s > 1 \) pushes beyond the conditional estimate.
The true "magic" of CFG happens when \( s > 1 \). In this case, the generated images are often much higher in quality, providing clearer and more meaningful visuals compared to the unguided generation process.
Sample 1 with CFG
Sample 2 with CFG
Sample 3 with CFG
Sample 4 with CFG
Sample 5 with CFG
We can take what we did in 1.4 and, by adding more noise, force our algorithm to make a larger edit in the hope that it will be "creative", i.e. hallucinate a bit and produce something cool. This can be thought of as forcing a noisy image back onto the manifold of natural images. I tried this with varying values of the starting index i_start, and you can see that as i_start increases the edits get smaller and the image looks more like the original.
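A sketch of this procedure, reusing the assumed `forward` helper from earlier; `denoise_from(x_t, i_start)` stands in for the iterative CFG denoising loop started at index i_start of the strided timestep list.

```python
def sdedit(im, i_start, strided_timesteps, alphas_cumbar, forward, denoise_from):
    """Noise a clean image to timestep strided_timesteps[i_start], then iteratively denoise it back.

    Small i_start -> lots of noise added  -> larger, more "creative" edits.
    Large i_start -> little noise added   -> the output stays close to the input.
    """
    t = strided_timesteps[i_start]
    x_t, _ = forward(im, t, alphas_cumbar)    # re-noise the clean image (see the forward-process sketch)
    return denoise_from(x_t, i_start)         # project back onto the natural-image manifold
```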
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
The procedure above works even better if you start with a nonrealistic image, like a drawing or painting, and project it onto the manifold of natural images. I've done this with some images from the web as well as a few hand-drawn scribbles.
Avocado i_start = 1
Avocado i_start = 3
Avocado i_start = 5
Avocado i_start = 7
Avocado i_start = 10
Avocado i_start = 20
Avocado original
Apple scribble i_start = 1
Apple scribble i_start = 3
Apple scribble i_start = 5
Apple scribble i_start = 7
Apple scribble i_start = 10
Apple scribble i_start = 20
Apple scribble original
Man scribble i_start = 1
Man scribble i_start = 3
Man scribble i_start = 5
Man scribble i_start = 7
Man scribble i_start = 10
Man scribble i_start = 20
Man original
We can use a mask to create an image that keeps the original content in some regions but has generated content in others. For this I put a mask over the upper portion of the Campanile and generated a few results.
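The masking is a one-line modification of the denoising loop: after every step, everything outside the masked region is overwritten with an appropriately re-noised copy of the original image. A sketch, with the same assumed helpers as before.

```python
def inpaint_step(x_t, t, t_prime, im_orig, mask, alphas_cumbar, denoise_step, forward):
    """One inpainting step: denoise everywhere, then restore the known pixels.

    mask is 1 where new content should be generated, 0 where the original image is kept.
    """
    x_tp = denoise_step(x_t, t, t_prime)                   # ordinary iterative-denoising update
    x_known, _ = forward(im_orig, t_prime, alphas_cumbar)  # original image noised to level t'
    return mask * x_tp + (1 - mask) * x_known
```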
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Original
Mask
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Original
Mask
We can add control to our previous results by changing the text prompt from "a high quality photo" to something more descriptive like "a rocket ship". The results below are for increasing values of i_start; as i_start increases the model gets less "creative" and the output looks more like the input image.
Rocket ship at i_start=1
Rocket ship at i_start=3
Rocket ship at i_start=5
Rocket ship at i_start=7
Rocket ship at i_start=10
Rocket ship at i_start=20
Pencil at i_start=1
Pencil at i_start=3
Pencil at i_start=5
Pencil at i_start=7
Pencil at i_start=10
Pencil at i_start=20
Waterfall at i_start=1
Waterfall at i_start=3
Waterfall at i_start=5
Waterfall at i_start=7
Waterfall at i_start=10
Waterfall at i_start=20
Getting even more creative, we can create an image that looks like one thing right side up and another thing upside down, similar to Salvador Dali's Swans Reflecting Elephants. The idea is to compute two noise estimates: one for a given prompt, for example "an oil painting of an old man", and another for the same image flipped upside down with the prompt "an oil painting of people around a campfire". Then average these two noise estimates and use the result as your noise.
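A sketch of the combined noise estimate, assuming `estimate_noise(x_t, t, embeds)` wraps the UNet (with CFG) and that images are (C, H, W) tensors, so flipping dimension -2 turns them upside down.

```python
import torch

def anagram_noise(x_t, t, estimate_noise, embeds_upright, embeds_flipped):
    """Average a noise estimate for the upright image with a flipped estimate for the flipped image."""
    eps_1 = estimate_noise(x_t, t, embeds_upright)                       # e.g. "an oil painting of an old man"
    eps_2 = torch.flip(                                                  # e.g. "...people around a campfire"
        estimate_noise(torch.flip(x_t, dims=[-2]), t, embeds_flipped),
        dims=[-2],
    )
    return (eps_1 + eps_2) / 2
```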
"Oil painting of an old man"
"A photo of a man""
"An oil painting of a snowy mountain village"
"a photo of the amalfi cost"
"An oil painting of people around a campfire"
"A photo of a dog"
"A lithograph of a waterfall"
"A lithograph of a skull"
Similar to project 2, where I took advantage of the way humans perceive high- and low-frequency visual data, we can use a diffusion model to make factorized images that look like one thing from afar and another from close up. The idea is similar to the previous part, but instead of flipping the image we run a low-pass filter on one noise estimate and a high-pass filter on the other, each computed with respect to a different text prompt, then sum them to get our overall noise. The results from this part are not as pleasing as the visual anagrams; I played around with weighting the noise estimates differently, with emphasis on the "harder" prompt, but many of my images still look unsatisfying.
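A sketch of the combined estimate; a Gaussian blur serves as the low-pass filter and its residual as the high-pass. The kernel size, sigma, and weights are illustrative assumptions (the weights are what I adjusted to emphasize the harder prompt).

```python
import torchvision.transforms.functional as TF

def hybrid_noise(x_t, t, estimate_noise, embeds_far, embeds_near, w_far=1.0, w_near=1.0):
    """Low frequencies follow one prompt (seen from afar), high frequencies follow the other (seen up close)."""
    eps_far = estimate_noise(x_t, t, embeds_far)
    eps_near = estimate_noise(x_t, t, embeds_near)
    low = TF.gaussian_blur(eps_far, kernel_size=33, sigma=2.0)                # low-pass of the "far" estimate
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size=33, sigma=2.0)   # high-pass of the "near" estimate
    return w_far * low + w_near * high                                        # weights emphasize one prompt
```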
"A lithograph of a skull" and "A lithograph of a waterfall"
"An oil painting of a snowy mountain village" and "A lithograph of a waterfall"
"A lithograph of a skull" and "a photo of the amalfi cost"
"A lithograph of a skull" and "a photo of the amalfi cost" w/ emphasis on amalfi.
This project was quite challenging and involved lots of trial and error with different input images and prompts and playing around with how to use the noise appropriately. However, it was quite rewarding and provided great insight into the capabilities of diffusion models and how to get creative and create some fun images that mess with human perception.
In this next part of the project, I'll be training my own diffusion model on MNIST.
The first step is to implement the denoiser as a UNet. A UNet has an encoder-decoder structure with skip connections: it captures the context of the input image by progressively reducing its spatial dimensions, then reconstructs the spatial information back to the original resolution while preserving fine-grained details. My UNet follows the structure below.
To begin, we'll just visualize varying levels of noise on a few MNIST samples.
Now it's time to train the model to denoise. I trained the denoiser on noisy images z, obtained by adding noise with σ = 0.5 to clean images x. I used a batch size of 256, 5 epochs, 128 hidden channels, and the Adam optimizer with a learning rate of 1e-4.
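A sketch of the training loop under those settings; `UnconditionalUNet` is a stand-in name for the UNet described above.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=256, shuffle=True,
)

model = UnconditionalUNet(in_channels=1, num_hiddens=128).to(device)  # stand-in for the UNet above
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(5):
    for x, _ in train_loader:                       # labels are unused for plain denoising
        x = x.to(device)
        z = x + 0.5 * torch.randn_like(x)           # noisy input, sigma = 0.5
        loss = torch.nn.functional.mse_loss(model(z), x)   # L2 loss against the clean image
        opt.zero_grad()
        loss.backward()
        opt.step()
```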
Training Loss Curve
Results on digits from the test set after 1 epoch
Results on digits from the test set after 5 epochs
The denoiser was trained on σ = 0.5, but here I visualize how it performs at other values of σ.
Results on digits from the test set with varying noise levels.
To condition on time we need to inject a scalar t into the UNet like so.
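In code, the injection looks roughly like the sketch below: the scalar t (normalized to [0, 1]) is pushed through a small fully connected block and added to an intermediate feature map. The block structure and layer names are my assumptions rather than the exact architecture shown above.

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that maps a scalar conditioning signal to a per-channel vector."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, t):
        return self.net(t)

# Inside the UNet's forward pass (sketch): broadcast the embedding over H and W
# and add it to a decoder feature map, e.g.
#   t1 = self.fc1_t(t).view(-1, D, 1, 1)
#   unflatten = unflatten + t1
```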
It's now time to train the UNet. Training looks the same as before, but this time we use our time-conditioned model to predict the noise, passing in a random t. We train for 20 epochs this time, because this task is more difficult, and use 64 hidden channels. Below is the training loss curve.
Training loss curve for time-conditioned UNet
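A sketch of one training step for the time-conditioned model, assuming a precomputed `alphas_cumbar` schedule of length T and a model that takes the noisy image and a normalized timestep.

```python
import torch
import torch.nn.functional as F

def train_step(model, opt, x, alphas_cumbar, T):
    """One DDPM-style training step: noise a clean batch to a random t and regress the noise."""
    t = torch.randint(0, T, (x.shape[0],), device=x.device)          # random timestep per image
    abar = alphas_cumbar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = torch.sqrt(abar) * x + torch.sqrt(1 - abar) * eps          # forward process
    loss = F.mse_loss(model(x_t, t.float() / T), eps)                # predict the injected noise
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```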
For this part, we sample similarly to how we did with the DeepFloyd model in part A; however, this time we don't need to predict the variance and can instead use a precomputed list β of length T, from which \( \alpha_t = 1 - \beta_t \) and \( \overline{\alpha_t} = \prod_{s=1}^{t} \alpha_s \).
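A sketch of the corresponding sampling loop; the linspace endpoints for β are the standard DDPM choice and, like T and the image shape, an assumption on my part.

```python
import torch

@torch.no_grad()
def sample(model, T=300, shape=(16, 1, 28, 28), device="cpu"):
    """DDPM ancestral sampling with a fixed (not predicted) variance."""
    betas = torch.linspace(1e-4, 0.02, T, device=device)        # assumed linear beta schedule
    alphas = 1.0 - betas
    alphas_cumbar = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)                       # start from pure noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device)
        eps_hat = model(x, t_batch.float() / T)                 # predicted noise at step t
        coef = (1 - alphas[t]) / torch.sqrt(1 - alphas_cumbar[t])
        x = (x - coef * eps_hat) / torch.sqrt(alphas[t])        # posterior mean
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # add fixed-variance noise
    return x
```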
Samples for time-conditioned UNet at epoch 5
Samples for time-conditioned UNet at epoch 20
To condition the UNet on classes, we do the same thing as time conditioning but now also condition on a class tensor c. We replace the original unflatten with unflatten = c1 * unflatten + t1 and the original up1 with up1 = c2 * up1 + t1, where c1 and c2 are fc_1(c) and fc_2(c) respectively, using the same FCBlock we used for time conditioning.
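A small self-contained sketch of that modulation, following the equations above; the module and layer names are mine, and c is assumed to be a one-hot class vector whose embedding dimension matches the feature channels.

```python
import torch
import torch.nn as nn

class ClassTimeModulation(nn.Module):
    """Sketch of the conditioning above: scale features by a class embedding, shift by a time embedding."""
    def __init__(self, num_classes, hidden_dim):
        super().__init__()
        self.fc_c = nn.Sequential(nn.Linear(num_classes, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim))
        self.fc_t = nn.Sequential(nn.Linear(1, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim))

    def forward(self, feats, c_onehot, t):
        # feats: (N, D, H, W) feature map from the UNet (e.g. `unflatten` or `up1`), with D == hidden_dim
        c = self.fc_c(c_onehot).view(-1, feats.shape[1], 1, 1)   # class scale, i.e. c1 or c2
        tt = self.fc_t(t).view(-1, feats.shape[1], 1, 1)         # time shift, i.e. t1
        return c * feats + tt                                    # c1 * unflatten + t1  /  c2 * up1 + t1
```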
Training loss curve for class-conditioned UNet
Below are samples at epoch 5 and epoch 20, 4 samples for each digit.
Epoch 5 Samples
Epoch 20 Samples