Introduction
Stable Diffusion has opened up new possibilities for creating custom images, transforming simple text prompts into striking visuals. In the field of AI-based image generation, DreamBooth is a significant innovation built on top of it, allowing people to create unique visuals based on their own subjects and ideas.
In this article, we'll first walk through how Stable Diffusion generates images, and then explore DreamBooth, a fine-tuning technique that lets users turn a handful of everyday images into unique works of art using Stable Diffusion.
What is Stable Diffusion?
Stable Diffusion is a text-to-image model that transforms a text prompt into a high-resolution image. For example, if you type in "a cute and adorable bunny," Stable Diffusion generates a high-resolution image depicting exactly that in a few seconds. Click "Select another prompt" in Diffusion Explainer to change prompts and check the fascinating images generated from each one!
How does Stable Diffusion work?
Stable Diffusion first changes the text prompt into a text representation, numerical values that summarize the prompt. The text representation is used to generate an image representation, which summarizes an image depicted in the text prompt. This image representation is then upscaled into a high-resolution image.
You may wonder why Stable Diffusion introduces an image representation instead of directly generating high-resolution images. The reason is computational efficiency. Doing most of the computation on a compact image representation instead of a full high-resolution image significantly reduces time and cost while maintaining high image quality.
The image representation, which starts as a random noise, is refined over multiple timesteps to reach the image representation for a high-quality image with strong adherence to the text prompt. The number of refining timesteps is typically set as 50 or 100; we fix it to 50 in Diffusion Explainer.
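To make the overall flow concrete, here is a minimal end-to-end sketch using the Hugging Face diffusers library. The CompVis/stable-diffusion-v1-4 checkpoint, the CUDA device, and the output file name are assumptions for illustration; any Stable Diffusion v1.x checkpoint follows the same flow.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained Stable Diffusion pipeline (tokenizer, text encoder,
# UNet, scheduler, and decoder bundled together).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Fixing the seed fixes the initial random noise of the image representation.
generator = torch.Generator("cuda").manual_seed(42)

image = pipe(
    "a cute and adorable bunny",
    num_inference_steps=50,  # 50 refining timesteps, as in Diffusion Explainer
    generator=generator,
).images[0]
image.save("bunny.png")
```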
We break down the image generation process of Stable Diffusion into three main steps: text representation generation, image representation refining, and image upscaling.
Now, let's look closer into each process.
Text Representation Generation
1. Tokenizing:
Tokenization is a technique used to convert text into numbers so it can be processed by neural networks. When you give Stable Diffusion a text prompt like "a cute and adorable bunny," it splits the prompt into individual words or pieces called tokens: "a," "cute," "and," "adorable," and "bunny." To signal the start and end of the prompt, it also adds special markers called start and end tokens.
To make computations easier, Stable Diffusion ensures that every token sequence is exactly 77 tokens long. If your text has fewer than 77 tokens, it adds padding tokens to fill the remaining positions; if it has more, the extra tokens are cut off.
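As an illustration, here is a small sketch of the tokenization step using the CLIP tokenizer from the Hugging Face transformers library. The openai/clip-vit-large-patch14 checkpoint is the tokenizer used by Stable Diffusion v1.x, but treat the exact checkpoint name as an assumption if you work with a different model.

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "a cute and adorable bunny",
    padding="max_length",  # pad shorter prompts up to the fixed length
    max_length=77,         # Stable Diffusion's fixed sequence length
    truncation=True,       # cut off anything beyond 77 tokens
    return_tensors="pt",
)

print(tokens.input_ids.shape)  # torch.Size([1, 77])
print(tokenizer.convert_ids_to_tokens(tokens.input_ids[0].tolist())[:8])
# ['<|startoftext|>', 'a</w>', 'cute</w>', 'and</w>', 'adorable</w>',
#  'bunny</w>', '<|endoftext|>', '<|endoftext|>']
```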
2. Text encoding:
Next, Stable Diffusion converts the sequence of tokens into a text representation. To guide image generation, it needs to make sure this representation contains information about the image described in the prompt. To do this, it uses a special neural network called CLIP.
CLIP has two parts: an image encoder and a text encoder. It's trained to transform an image and its description into similar sets of numbers, called vectors. This way, if you have a text prompt, the text encoder creates a representation that is likely to contain information about the images described in the prompt. By using these text representations, Stable Diffusion can generate images that match the text description. If you'd like to see visual explanations, you can click on the Text Encoder.
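Continuing the sketch above, the text encoder half of CLIP maps those 77 token IDs to a 77-by-768 text representation, one vector per token (the checkpoint name is again an assumption matching Stable Diffusion v1.x).

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("a cute and adorable bunny", padding="max_length",
                   max_length=77, truncation=True, return_tensors="pt")

with torch.no_grad():
    # One 768-dimensional vector per token: the text representation that
    # later guides image generation.
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(text_embeddings.shape)  # torch.Size([1, 77, 768])
```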
Image Representation Refining
Stable Diffusion creates a representation of an image, which is like a set of numbers that summarizes what a high-resolution image looks like based on the text prompt. It does this by starting with a random pattern of "noise" and gradually refining it over several steps to improve the quality of the image and make it match the prompt more closely. You can change the initial random pattern by adjusting the seed in the Diffusion Explainer. To see each step of the refining process, click "Image Representation Refiner," which shows how noise is predicted and removed as the image takes shape.
1. Noise Prediction:
At each step of the process, a neural network called UNet estimates how much noise is in the image it's working on. UNet uses three things to do this:
The current version of the image, which contains some level of noise.
The text of your prompt, which tells UNet what the final image should look like. This helps guide what kind of noise should be removed.
The step number, which indicates how much noise is still in the current image.
Put simply, UNet predicts the noise in the current image representation, guided by the text prompt and the step number.
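Here is a rough sketch of a single noise prediction with the diffusers UNet. The CompVis/stable-diffusion-v1-4 checkpoint is an assumption, and random tensors stand in for the real image representation and text representation.

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

latents = torch.randn(1, 4, 64, 64)        # 1. current noisy image representation
text_embeddings = torch.randn(1, 77, 768)  # 2. text representation (stand-in for CLIP output)
timestep = torch.tensor(981)               # 3. step number, indicating the noise level

with torch.no_grad():
    noise_pred = unet(latents, timestep,
                      encoder_hidden_states=text_embeddings).sample

print(noise_pred.shape)  # torch.Size([1, 4, 64, 64]): predicted noise, same shape as the latents
```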
However, even though the text prompt helps guide UNet's noise prediction, the final image might not always align closely with it. To address this, Stable Diffusion also predicts a generic amount of noise, conditioned on a blank text prompt (just an empty string), and subtracts this generic noise from the prompt-conditioned prediction. The difference, called the impact of the prompt, is what steers the image toward the text prompt:
impact of prompt = prompt-conditioned noise - generic noise
In Stable Diffusion, the final noise prediction used at each step combines two components:
Generic noise
The prompt's impact
These two noise components are combined using a "guidance scale," which controls how much the text prompt influences the final image. The formula for this is:
generic noise + guidance scale x impact of prompt
The guidance scale controls how strongly the prompt steers generation: a higher guidance scale makes the image adhere more closely to the text prompt, while a lower one gives the model more freedom and can produce images that drift from the prompt. You can adjust the guidance scale in Diffusion Explainer to see how it affects the generated images.
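The combination itself is plain arithmetic on the two noise predictions. Below is a minimal sketch with random tensors standing in for the UNet outputs; the guidance scale of 7.5 is a commonly used default, not a requirement.

```python
import torch

def combine_noise(generic_noise: torch.Tensor,
                  prompt_noise: torch.Tensor,
                  guidance_scale: float = 7.5) -> torch.Tensor:
    # impact of prompt = prompt-conditioned noise - generic noise
    impact_of_prompt = prompt_noise - generic_noise
    # final noise = generic noise + guidance scale x impact of prompt
    return generic_noise + guidance_scale * impact_of_prompt

# Stand-ins for the UNet's two predictions on a 4x64x64 image representation.
generic_noise = torch.randn(1, 4, 64, 64)  # predicted with a blank prompt
prompt_noise = torch.randn(1, 4, 64, 64)   # predicted with the actual prompt

guided_noise = combine_noise(generic_noise, prompt_noise, guidance_scale=7.5)
print(guided_noise.shape)  # torch.Size([1, 4, 64, 64])
```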
2. Noise Removal
After predicting the noise in an image, Stable Diffusion uses a process called "noise removal" to decide how much of that noise to take out at each timestep. This process is managed by an algorithm called a "scheduler." Removing noise bit by bit helps make the image clearer and sharper.
Here's how it works:
image representation of timestep t+1 = image representation of timestep t - downscaled noise
By slowly removing noise, the scheduler helps the model create a high-quality final image that aligns with the text prompt.
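Here is a minimal sketch of one noise-removal step using a DDIM scheduler from diffusers as a stand-in; the beta settings shown match Stable Diffusion v1.x defaults, but the scheduler type and its exact configuration depend on the model you use.

```python
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(
    beta_start=0.00085, beta_end=0.012,
    beta_schedule="scaled_linear", clip_sample=False,
)
scheduler.set_timesteps(50)  # 50 refining timesteps, as in Diffusion Explainer

latents = torch.randn(1, 4, 64, 64)     # current noisy image representation
noise_pred = torch.randn(1, 4, 64, 64)  # guided noise predicted by UNet (stand-in)

t = scheduler.timesteps[0]
# The scheduler downscales the predicted noise and subtracts it, producing
# the image representation for the next timestep.
latents = scheduler.step(noise_pred, t, latents).prev_sample
print(latents.shape)  # torch.Size([1, 4, 64, 64])
```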
Image Upscaling
After all denoising steps have been completed, Stable Diffusion uses a neural network called the Decoder to upscale the image representation into a high-resolution image. Because the representation was refined under the guidance of the text representation, the resulting high-resolution image adheres closely to the text prompt.
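As a sketch, the decoder half of Stable Diffusion's variational autoencoder turns a 4x64x64 image representation into a 3x512x512 RGB image. The stabilityai/sd-vae-ft-mse checkpoint and the 0.18215 scaling factor are assumptions specific to v1.x models.

```python
import torch
from diffusers import AutoencoderKL

decoder = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# Stand-in for a fully denoised image representation.
latents = torch.randn(1, 4, 64, 64)

with torch.no_grad():
    # Undo the latent scaling factor, then decode to pixel space.
    image = decoder.decode(latents / 0.18215).sample

print(image.shape)  # torch.Size([1, 3, 512, 512]): a 512x512 RGB image in [-1, 1]
```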
The effect of prompt keywords on image generation
When you write text prompts to generate images, you often have to experiment and change things repeatedly to get the image you want. For example, if your initial prompt is "a cute bunny," you might try adding or removing keywords, like "in the style of Pixar," to see how it changes the outcome.
Understanding how different words affect the generated image can really help you refine your prompts. By experimenting with specific keywords, you can figure out what works best. Click the highlighted words in the prompt to compare how these changes impact the image. This way, you can learn which keywords make the biggest difference and use them to create the images you envision.
You have control over the text prompt and two hyperparameters, the random seed and the guidance scale, in our Diffusion Explainer to change the generated images.
Additionally, there are other hyperparameters that are not included in the Diffusion Explainer, such as the total number of timesteps, image size, and the type of scheduler.
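If you prefer experimenting in code rather than in Diffusion Explainer, a sketch like the one below keeps the seed fixed while varying the prompt, so any difference in the output comes from the added keyword alone. The checkpoint and file names are assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

for prompt in ["a cute bunny", "a cute bunny in the style of Pixar"]:
    # Reusing the same seed fixes the initial noise, so the prompt (and any
    # other hyperparameter you change) is the only thing that differs.
    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5,
                 generator=generator).images[0]
    image.save(prompt.replace(" ", "_") + ".png")
```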
DreamBooth
DreamBooth takes the power of Stable Diffusion and places it in the hands of users, allowing them to fine-tune pre-trained models to create custom images based on their unique concepts. What sets DreamBooth apart is its ability to achieve this customization with just a handful of images—typically 10 to 20—making it accessible and efficient.
The core idea behind DreamBooth is to teach the model a new concept through a process called fine-tuning. You start with a pre-existing Stable Diffusion model and provide it with a set of images that represent your concept. This could be anything from photos of your pet dog to a specific artistic style. DreamBooth then guides the model to generate images that align with your concept, using a designated name token (often written as [V], in square brackets) to represent it.
Selecting the right name token for your concept is crucial for successful fine-tuning. The name token serves as a unique identifier for your concept within the model, so it's important to choose one that won't clash with concepts the model already knows. A rare, made-up identifier (a short, uncommon string such as "sks") works well, especially when paired with the class noun for your subject, for example "a [V] dog" rather than just "[V]".
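Once fine-tuning is done, you use the name token in prompts like any other word. Here is a sketch assuming a DreamBooth-fine-tuned checkpoint saved to a hypothetical ./dreambooth-bunny directory, with "sks" chosen as the name token.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load your fine-tuned checkpoint instead of the original base model.
pipe = StableDiffusionPipeline.from_pretrained(
    "./dreambooth-bunny", torch_dtype=torch.float16
).to("cuda")

# "sks" now stands for your concept inside the model.
prompt = "a photo of sks bunny in the style of Pixar"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("dreambooth_bunny.png")
```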
Start creating
As you've seen throughout this article, Stable Diffusion and DreamBooth offer incredible possibilities for creating custom AI-generated images. Now, it's your turn to put these powerful tools to use. PfpicMaker provides you with the tools to experiment, customize, and create unique portraits, profile pictures, and headshots that fit your vision.
You can create personalized images, fine-tune the models, and bring your unique style to any scene.
Sign up today and discover how you can transform your creative ideas into stunning visuals with Stable Diffusion and DreamBooth. Get started now and unlock a world of possibilities!