Dan Collins

What Stable Diffusion is and How to Run It

What is Stable Diffusion

Stable Diffusion (SD) is a text-to-image model that can generate or modify images from the text and image prompts you provide.

Generated Astronauts

Diffusion models are claimed to outperform other generative models like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) at image generation. Stable Diffusion adds autoencoding and text-encoding steps alongside its diffusion model, making it even more efficient than standalone pixel-space diffusion.

SD is similar to other text-to-image models such as DALL-E and DALL-E 2, which garnered widespread awe and attention for their ability to generate impressive custom images. Stable Diffusion differs from DALL-E in that it is open source, while access to DALL-E is provided by OpenAI through their own API. Stable Diffusion is also more computationally efficient than DALL-E because SD is a latent diffusion model, running its diffusion process in a compressed latent space rather than over full-resolution pixels. This video from Two Minute Papers about Stable Diffusion goes into more depth about the model and its relation to DALL-E.

Stable Diffusion also resembles Google's recent Imagen[1] model in that both use diffusion models in their pipelines, though Imagen is not a latent diffusion model (its diffusion runs in pixel space).

Latent Diffusion

Stable Diffusion was trained on images from the LAION-5B dataset. The training process is above my level of comprehension, but it "combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder"[2]. My understanding is that training images are encoded into latent representations, and the de-noising process works in that latent space to generate images. Noise is added and removed in many small, normally-distributed steps (see Markov Chain), so a partially noised image looks like a pixelated, fuzzy picture. The encoded user prompt describes what the de-noised output should look like, and the diffusion model de-noises its way to an image. During training, the loss is computed between "the noise that was added to the latent and the prediction made by the UNet"[2].
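
To make that a little more concrete, here is my rough mental model of a single training step, written as a diffusers-style PyTorch sketch. The vae, unet, noise_scheduler, and text_embedding objects are stand-ins, and the 0.18215 latent scaling factor is the value I believe SD v1 uses; this is not the actual training code.

import torch
import torch.nn.functional as F

def training_step(image, text_embedding, vae, unet, noise_scheduler):
  # Encode the training image into the autoencoder's latent space
  latents = vae.encode(image).latent_dist.sample() * 0.18215  # SD v1's latent scaling factor (assumption)

  # Sample Gaussian noise and a random timestep, then noise the latents (the forward Markov chain)
  noise = torch.randn_like(latents)
  timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device)
  noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

  # The U-Net predicts the added noise, conditioned on the text embedding via cross-attention
  noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embedding).sample

  # The loss compares the noise that was added with the U-Net's prediction
  return F.mse_loss(noise_pred, noise)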

Stable Diffusion, then, is essentially the combination of de-noising autoencoders and diffusion models. By borrowing Latent Diffusion's latent-space representation and cross-attention layers, SD is exceedingly efficient at text-to-image generation.

Compared to previous diffusion models, latent diffusion is able to "reach a near-optimal point between complexity reduction and detail preservation", as the Latent Diffusion research paper[3] puts it. The efficiency gains show up in SD's ability to run on mid-to-high-end personal computers (a GPU with 10GB of VRAM is recommended). Stable Diffusion implements latent diffusion with three components: an autoencoder (VAE), a U-Net, and a text encoder (CLIP ViT-L/14). Using these components, Stable Diffusion takes a text prompt and outputs an image matching that prompt.
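
If you're curious, the diffusers library lets you load those three components individually. This is just a sketch to show how I understand the checkpoint to be organized on Hugging Face (the subfolder names are my assumption), not something you need to do to generate images:

from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

# The three pieces of the Stable Diffusion v1-4 checkpoint
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_auth_token=True)
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", use_auth_token=True)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# At inference time they fit together roughly like this:
# 1. tokenizer + text_encoder turn the prompt into an embedding
# 2. the U-Net repeatedly de-noises a random latent, guided by that embedding
# 3. the VAE decoder turns the final latent into a full-resolution image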

How to Run Stable Diffusion and Generate Images

To run the Stable Diffusion 'inference pipeline' and try its text-to-image capabilities you should have a computer with a GPU and at least 10GB of VRAM[4]. I personally ran out of memory on my 8GB MacBook, so I used a Google Colab notebook to run the model instead. The Colab notebook comes from Hugging Face and uses the v1-4 model weights; check it out here
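
If you do try running locally and bump into memory limits like I did, loading the fp16 weights and enabling attention slicing can shrink the memory footprint (how much depends on your diffusers version); a quick sketch:

import torch
from diffusers import StableDiffusionPipeline

# Half-precision weights roughly halve memory versus full precision
pipe = StableDiffusionPipeline.from_pretrained(
  "CompVis/stable-diffusion-v1-4",
  revision="fp16",
  torch_dtype=torch.float16,
  use_auth_token=True,
).to("cuda")

# Trades a little speed for a smaller peak memory footprint
pipe.enable_attention_slicing()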

The Code

The following code is available in the previously linked Colab notebook, and it basically comes down to this:

The string variable prompt provides the inspiration for the content or style of the generated image. The use_auth_token parameter means you're required to provide a Hugging Face token to run the pipeline.

import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

# Load the v1-4 weights in half precision; use_auth_token requires a Hugging Face token
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16, use_auth_token=True)
pipe = pipe.to("cuda") # pipe to 'mps' on Apple Silicon

# generator = torch.Generator("cuda").manual_seed(1024)  # optional: fix the seed for reproducible output
prompt = "calgary flames winning the stanley cup in the dome with moustaches all around"
with autocast("cuda"):
  image = pipe(prompt).images[0]
  # image = pipe(prompt, num_inference_steps=75, generator=generator).images[0]

image.save("output.png") # save the generated PIL image

Change the prompt to whatever you want and then run the code, but know that there is an NSFW checker in the library (look for the function check_safety()). My prompt created this:

My Stable Diffusion Image

I must remind you that the prompt I provided was somewhat obscure and random. With some prompt engineering and more inference steps I would expect Stable Diffusion to generate a higher quality image.

You can also pass some inference parameters to the pipeline, like:

  • num_inference_steps takes an integer number of de-noising steps; generally more steps means a higher quality image (default 50).
  • guidance_scale is a number that controls how closely the generated image follows the prompt (higher values stick closer to the prompt at the cost of diversity); I read that 7-8.5 usually works well.
  • generator lets you fix the random seed so the output image is reproducible, by passing torch.Generator("cuda").manual_seed(<int variable>), as in the snippet below.
generator = torch.Generator("cuda").manual_seed(1024) # fixed seed for reproducible output
with autocast("cuda"):
  image = pipe(prompt, num_inference_steps=80, generator=generator, guidance_scale=7).images[0]
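
Putting those parameters together, one way I'd iterate on a prompt (an illustrative sketch, not code from the notebook) is to keep the prompt fixed and sweep a few seeds, saving each candidate to compare:

# Generate a few candidates for the same prompt, one per seed
for seed in [7, 42, 1024]:
  generator = torch.Generator("cuda").manual_seed(seed)
  with autocast("cuda"):
    image = pipe(prompt, num_inference_steps=80, guidance_scale=7.5, generator=generator).images[0]
  image.save(f"candidate_seed_{seed}.png")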

Thoughts

The 'skills' of this model may seem threatening to artistic creatives, but I see it as empowering more than anything: this is a tool for creatives. The business world is forever chasing a future with AI helpers that 'assist' people... Well, Stable Diffusion is as compelling as any AI software 'tool' I've ever used. Using this model alongside thoughtful prompts and image inputs enables fast prototyping of high quality image outputs. The iterative workflow shown here of an artist using After Effects and SD looks incredible. The back-and-forth of a person making some visual edits and then prompting a text-to-image model to further develop the image seems empowering and even natural.

State-of-the-art text-to-image models have shown an incredible ability to use words, images, and custom vectors as inputs for producing images, and I suspect this ability will be one of the most important tools in any AI assistant's toolbox in the not-too-distant future. To their credit, text-to-image models are among the best AI models for having an embedded fun factor and obvious application value, even in simple examples. What also excites me about Stable Diffusion is its lower computational requirements, enabling text-to-image generation locally thanks to the public weights[5]. I will be looking for other ways to work with Stable Diffusion to improve my prompts, inputs, and ideations.

Resources

  1. Imagen Research Paper
  2. Model Card
  3. Latent Diffusion Research Paper
  4. Stable Diffusion Git Repo
  5. HuggingFace SD v1-4 Training Checkpoint Repo