But you don't want just any image. You want the image you specified, typically with a text prompt. And so the diffusion model is paired with a second model (such as a large language model, or LLM, trained to match images with text descriptions) that guides each step of the cleanup process, pushing the diffusion model toward images that the large language model considers a good match to the prompt.
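To make the idea of guidance concrete, here is a toy sketch in Python. It is not how any production system is built. The "image" is a short list of numbers, the prompt is represented by a hypothetical target vector, and `match_score` is an invented stand-in for the second model's judgment of how well the image fits the prompt. Every name and parameter here is an assumption made for illustration.

```python
import random

def match_score(image, target):
    """Stand-in for the text-image matching model: higher means a better match.
    The prompt is represented by a hypothetical target vector."""
    return -sum((x - t) ** 2 for x, t in zip(image, target))

def guided_cleanup(target, steps=50, guidance=0.2, seed=0):
    """Start from random noise and nudge it, step by step, toward a better match."""
    rng = random.Random(seed)
    # Begin with pure noise, one value per "pixel".
    image = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        # The leftover randomness shrinks to zero by the final step.
        noise_scale = 0.1 * (1 - (step + 1) / steps)
        # Move each pixel in the direction that raises match_score
        # (proportional to its gradient, 2 * (t - x)), plus a little noise.
        image = [
            x + guidance * (t - x) + rng.gauss(0.0, noise_scale)
            for x, t in zip(image, target)
        ]
    return image

target = [1.0, -1.0, 0.5, 0.0]   # hypothetical encoding of the prompt
result = guided_cleanup(target)
print(match_score(result, target))
```

In a real system the nudge comes from neural networks rather than a fixed target, but the loop has the same shape: clean up a little, check against the prompt, repeat.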
An aside: This LLM isn't pulling the links between text and images out of thin air. Most text-to-image and text-to-video models today are trained on large data sets that contain billions of pairings of text and images or text and video scraped from the internet (a practice many creators are very unhappy about). This means that what you get from such models is a distillation of the world as it's represented online, distorted by prejudice (and pornography).
It's easiest to imagine diffusion models working with images. But the technique can be used with many kinds of data, including audio and video. To generate movie clips, a diffusion model must clean up sequences of images (the consecutive frames of a video) instead of just one image.
What's a latent diffusion model?
All this takes a huge amount of compute (read: energy). That's why most diffusion models used for video generation use a technique called latent diffusion. Instead of processing raw data (the millions of pixels in each video frame), the model works in what's known as a latent space, in which the video frames (and text prompt) are compressed into a mathematical code that captures just the essential features of the data and throws out the rest.
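A rough sense of the savings can be had from a toy example. The sketch below compresses a 64-by-64 frame by averaging each 4-by-4 patch of pixels into a single number, then expands it back. Real latent diffusion models use learned encoders and decoders, not simple averaging. The `encode` and `decode` functions and the block size are hypothetical stand-ins, and the code assumes the frame dimensions divide evenly by the block size.

```python
def encode(frame, block=4):
    """Compress a frame (a list of pixel rows) by averaging block x block patches.
    Averaging is a stand-in for a learned encoder."""
    height, width = len(frame), len(frame[0])
    latent = []
    for r in range(0, height, block):
        row = []
        for c in range(0, width, block):
            patch = [frame[i][j]
                     for i in range(r, r + block)
                     for j in range(c, c + block)]
            row.append(sum(patch) / len(patch))
        latent.append(row)
    return latent

def decode(latent, block=4):
    """Expand each latent value back into a block x block patch of pixels."""
    frame = []
    for row in latent:
        expanded = [value for value in row for _ in range(block)]
        frame.extend([list(expanded) for _ in range(block)])
    return frame

# A synthetic 64 x 64 frame with a simple diagonal pattern.
frame = [[(r + c) % 256 for c in range(64)] for r in range(64)]
latent = encode(frame)
restored = decode(latent)

pixels = len(frame) * len(frame[0])            # 4096 values in the raw frame
latent_values = len(latent) * len(latent[0])   # 256 values in the compressed code
print(pixels, latent_values)
```

Here the cleanup process would only have to handle 256 numbers per frame instead of 4,096. The restored frame loses fine detail, which is the trade-off: the code keeps the broad structure and throws out the rest.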
A similar thing happens whenever you stream a video over the internet: A video is sent from a server to your screen in a compressed format to make it get to you faster, and when it arrives, your computer or TV converts it back into a watchable video.