With the recent launch of OpenAI's Sora [1] and Google's Gemini [2], not without its controversies [3], I believe it's time to talk a little about one of the technologies behind some of the components of these projects: diffusion models, and how they relate to image generation tools like Stable Diffusion [4], Midjourney [5], and Dall-E [6], among others.
In this article, we will explore what diffusion models are, how they are related to an ancient deep-learning technique, and what their applications are, especially in the area of image generation, where they currently thrive.
What are Diffusion Models?
Diffusion models became prominent after Song & Ermon's seminal paper was published in 2019 [7], and their popularity has skyrocketed ever since.
Diffusion models are a type of generative model [8] specialized in noise reduction. Thanks to their high representational power, they became an alternative to GANs: they don't require adversarial training and provide better coverage of the data distribution. They also became a faster alternative to autoregressive models, which are slow to sample from. As a consequence, they have become the de facto models for producing perceptual signals such as images and sound.
In simple terms, these models work by gradually learning to remove noise from the dataset they were trained on. To accomplish this, the first part of the process gradually adds noise to the data (e.g., images) at increasing levels of intensity, creating a dataset that maps images to their noisy versions. This dataset is used to train the first part of the model, a sort of encoder that learns to predict the noise that was added to each image.
After that, the next part of the process runs the model in reverse to produce data by removing noise. This is done step by step, gradually removing layers of noise to recover a data point resembling the original dataset. The crucial part is that the model doesn't necessarily reproduce an exact data point from the noise; it samples a data point from the data distribution it learned to generate. In the case of images, for example, it samples from the learned distribution over pixels. With enough data and steps, the model learns to generate data from any noisy input.
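To make this forward/reverse structure concrete, here is a minimal, DDPM-style sketch in PyTorch. The denoiser network, the linear noise schedule, and the number of steps are illustrative assumptions, not the setup of any particular published model.

```python
# Minimal DDPM-style sketch: forward noising, training loss, reverse sampling.
# `denoiser` is a hypothetical network that predicts the added noise.
import torch

T = 1000                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule (assumed linear)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Forward process: corrupt a clean image x0 up to step t."""
    eps = torch.randn_like(x0)
    a = alphas_bar[t].sqrt()
    b = (1.0 - alphas_bar[t]).sqrt()
    return a * x0 + b * eps, eps           # noisy image and the noise we added

def training_loss(denoiser, x0):
    """Train the model to predict the noise that was added."""
    t = torch.randint(0, T, (1,))
    xt, eps = add_noise(x0, t)
    eps_pred = denoiser(xt, t)
    return torch.mean((eps - eps_pred) ** 2)

@torch.no_grad()
def sample(denoiser, shape):
    """Reverse process: start from pure noise and denoise step by step."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps_pred = denoiser(x, torch.tensor([t]))
        alpha_t = 1.0 - betas[t]
        x = (x - betas[t] / (1.0 - alphas_bar[t]).sqrt() * eps_pred) / alpha_t.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```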
I think I've heard about this process before
If the process of taking an input, corrupting it with noise, and then trying to reproduce the original input rings a bell, it's because, like me, you are probably from the old school and have heard of denoising autoencoders [9], an old companion of deep learning researchers.
Well, the concept of diffusion is somewhat similar to that of denoising autoencoders, although there are some evident differences between them. Notably, autoencoders are more limited than diffusion models; if autoencoders could achieve what diffusion models do, we wouldn't need the latter at all. There's an excellent post by Sander Dieleman that explores this in much more detail [10].
The extra representation power of diffusion models
Regarding image generation, one thing that differentiated diffusion models from previous approaches was their ability to generate high-resolution images. If you think about it, a 4K image has 3840 × 2160 ≈ 8.3 million pixels, so a model working directly in pixel space needs layers with more than 8 million units, something that becomes unmanageable when trying to train deep models.
The thing is, noise doesn't require a large dimension to be represented; after all, it's noise. So, in their work, Rombach et al. [11] came up with the excellent idea that the diffusion process could be done in a smaller, latent space. They first project the image into that space with an autoencoder, which makes the training process much faster.
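To make the size reduction concrete, here is a small sketch using the open-source Hugging Face diffusers library; the VAE checkpoint name is just an example, and the random tensor stands in for a real, preprocessed image.

```python
# The latent-space idea in practice: Stable Diffusion's VAE compresses a
# 512x512x3 image into a 4x64x64 latent, so diffusion runs on ~48x fewer values.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")  # example VAE

image = torch.randn(1, 3, 512, 512)        # stand-in for a real, normalized image
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()

print(image.shape)    # torch.Size([1, 3, 512, 512]) -> 786,432 values
print(latents.shape)  # torch.Size([1, 4, 64, 64])   ->  16,384 values
```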
Stable Diffusion is an application of diffusion models to text-to-image generation that takes advantage of this projection between vector spaces, combining it with large-language-model representations of the input text to generate images from prompts. Jay Alammar's excellent "The Illustrated Stable Diffusion" [12] explores this entire image generation process in more detail, with beautiful visual aids.
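For completeness, here is what the full text-to-image pipeline looks like through the same diffusers library; the checkpoint identifier, prompt, and output file are only illustrative.

```python
# Text-to-image with Stable Diffusion via Hugging Face's `diffusers` library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",      # example checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Internally the pipeline encodes the prompt with a text encoder, runs the
# denoising loop in the VAE's latent space, and decodes the result to pixels.
image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```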
Applications of Diffusion Models
Most of us know the classical application of diffusion models, i.e., text-to-image generation, or its more recent variant, text-to-video generation, like OpenAI's Sora [1], or even more recently, the unveiling of Google Genie for the generation of 2D video games [13]. These are the "cool" applications: visually stunning ones that catch your attention very quickly and are very useful for marketing purposes. However, in all these cases, the diffusion model is only one component of the system architecture; there are others, such as the large language models required to understand the user's input.
These, however, aren't the only applications. The technique is relatively new and offers various paths of exploration to researchers who want to try it out. Essentially, these are generative models [8], so they can be used for any application tied to data generation. Although they have limitations, they are powerful generative models with exciting applications beyond image generation:
Inpainting: Useful for restoring old images or removing unwanted elements, such as the people that appear in the background of the photos you take with your phone (see the sketch after this list).
Super-resolution: To enhance or upscale the resolution of an image.
Text generation and summarisation: Yes, under certain conditions these models can even generate and summarise text (of course, they require a specific autoregressive structure to do so correctly).
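As a concrete illustration of the inpainting use case mentioned above, the diffusers library also exposes an inpainting pipeline; the checkpoint, file names, and prompt below are placeholders.

```python
# Inpainting with a diffusion model: regenerate only the masked region.
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"   # example inpainting checkpoint
)

image = Image.open("photo.png").convert("RGB")   # original photo
mask = Image.open("mask.png").convert("RGB")     # white where content should be replaced
result = pipe(
    prompt="empty street, no people",
    image=image,
    mask_image=mask,
).images[0]
result.save("photo_inpainted.png")
```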
The State of the Art for Diffusion Models in Image Generation
As we said before, the most prominent use of diffusion models is image generation. The following are some of the models developed by top companies in the field:
OpenAI: Their model Dall-E was among the first publicly released. It's currently in version 3 [14]. They offer it through their ChatGPT application, and the free version is limited.
Google: With the launch of Gemini, Google has had some rough edges to polish when it comes to text-to-image generation [3]. It's also not clear whether the model behind its image generation is based on Imagen [15], which came out of the Google Brain project.
Adobe: The company behind giants like Photoshop and Lightroom has its own image generation model, Firefly [16].
Midjourney: An independent research lab that currently provides an open beta [5] of its eponymous model, which has even won prizes and produced some viral images.
StabilityAI: The company behind the only model on this list that is currently open source, Stable Diffusion; it recently released an early preview of version 3 of the model [17].
Controversy and challenges of image generation models
Bias
As these models are in everyone's Internet feed, they are no strangers to controversy. The highlight of recent weeks has been the series of problems that drove Google to shut down the generation of images of people with its Gemini model, after it ended up generating "racially diverse Nazis" [3]. This is a problem of bias, but as Keras creator François Chollet pointed out, it's not just a problem of data but also of the model itself, a complex problem in and of itself [18].
The challenge of bias has been a major headache for AI companies, which constantly apply different techniques to "tame" their rogue models into something that doesn't become a major PR nightmare.
Copyright infringement and fair use
Another very common problem with all these kinds of generative models, not just diffusion models but large language models as well, is that they require massive amounts of data for training. In the case of images, there's a limit to what someone can use from the public domain to train a model.
In particular, suppose you want your model to produce images in a style beyond photographic realism or old art movements like the Renaissance, romanticism, or impressionism, i.e., styles that are free of copyright. If you want something more modern, based on things like Japanese anime or American comics, you need more recent images, which are most likely covered by copyright law.
This challenge has resulted in legal battles that might limit the progress of text-to-image models [19].
Malicious usage
A couple of years ago, a user published in a subreddit a pornographic video edited with convincing faces of celebrities [20]. From that moment on, "deepfakes," as these types of edits were called, began to flood the internet. Those were based on GANs; with the surge of hyperrealistic diffusion models, the internet became a hotbed for deepfakes again. Last month, for example, X (formerly Twitter) was flooded with an influx of AI-generated images of the pop singer Taylor Swift [21], which prompted the site to block searches for her name.
A slightly less malicious but still very controversial use of these models was the first prize awarded to an image generated with Midjourney: "Théâtre D'opéra Spatial" [22]. This is the image presented at the beginning of this article.
This has become another PR nightmare, not just for the companies that provide the models but also for the companies that end up distributing the generated content, such as social networks. The challenge of marking content as model-generated before it spreads is a very difficult one that also requires significant investment from these companies.
The Ouroboros problem
Finally, there is a direct consequence of the Internet being flooded with artificially generated data. This problem applies not just to image generation but also to text generation.
The Ouroboros is an ancient symbol of a serpent eating its tail. It appears in many ancient mythologies, and it perfectly represents a problem AI is facing nowadays, one I came across multiple times while researching semi-supervised approaches that use generated data to train a new model. It is a severe problem because errors and noise propagate, getting worse and worse the more a model is trained on its own generated data (or on another model's generated data, for that matter) to produce the next version.
As people begin to abandon regular platforms like StackOverflow, Reddit, etc., and these become flooded with bot-generated data, it becomes difficult to see how the new versions of these models will be trained and tuned.
Final Thoughts
Diffusion models are an area of active development, with many applications and constant research being done on them. Their most popular application has revolutionized the Internet in the past couple of years, for both good and bad reasons.
There is still untapped potential in these tools, not just for image generation but also for generating other kinds of data. Platforms like Hugging Face provide open-source versions of some of these models. They are worth examining and studying, not just for their capabilities but also for the ways we can address their challenges.
References
[1] OpenAI's Sora: Creating video from text. https://openai.com/sora
[2] Pichai S., Hassabis D. (2024). Our next-generation model: Gemini 1.5 https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/
[3] Robertson A. (2024). "Google apologizes for ‘missing the mark’ after Gemini generated racially diverse Nazis" https://www.theverge.com/2024/2/21/24079371/google-ai-gemini-generative-inaccurate-historical
[4] Stability AI: Activating humanity's potential through generative AI. https://stability.ai/
[5] Midjourney. https://www.midjourney.com/home
[6] OpenAI's Dall-E: Creating images from text. https://openai.com/research/dall-e
[7] Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32.
[8] Cardellino, C. (2024). "Generative vs. Discriminative Models in Machine Learning" https://www.transcendent-ai.com/post/generative-vs-discriminative-models-in-machine-learning
[9] Hinton, G. E., & Zemel, R. (1993). Autoencoders, minimum description length and Helmholtz free energy. Advances in neural information processing systems, 6.
[10] Dieleman, S. (2022). "Diffusion models are autoencoders" https://sander.ai/2022/01/31/diffusion.html
[11] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695).
[12] Alammar, J. (2022). "The Illustrated Stable Diffusion". https://jalammar.github.io/illustrated-stable-diffusion/
[13] Townsend, C. (2024). "Google Genie lets users generate AI outputs resembling video games" https://mashable.com/article/google-genie-can-create-video-games-2d-platformers
[14] OpenAI's Dall-E 3. https://openai.com/dall-e-3
[15] Tirumalashetty, V. (2023). "Imagen 2 on Vertex AI is now generally available" https://cloud.google.com/blog/products/ai-machine-learning/imagen-2-on-vertex-ai-is-now-generally-available
[16] Adobe Firefly. https://firefly.adobe.com/
[17] Stability AI: Stable Diffusion 3. https://stability.ai/news/stable-diffusion-3
[18] @fchollet on Twitter https://twitter.com/fchollet/status/1761162270204465612
[19] Toolify (2024) "Getty Images' Lawsuit Threatens Future of Text-to-Image AI". https://www.toolify.ai/ai-news/getty-images-lawsuit-threatens-future-of-texttoimage-ai-970607
[20] Cole, Samantha (24 January 2018). "We Are Truly Fucked: Everyone Is Making AI-Generated Fake Porn Now". Vice. Archived from the original on 7 September 2019. Retrieved 4 May 2019.
[21] Weatherbed, J. (2024) "Trolls have flooded X with graphic Taylor Swift AI fakes" https://www.theverge.com/2024/1/25/24050334/x-twitter-taylor-swift-ai-fake-images-trending
[22] Roose, Kevin (2022-09-02). "An A.I.-Generated Picture Won an Art Prize. Artists Aren't Happy". The New York Times. ISSN 0362-4331. Retrieved 2023-03-26.