Yige from Aofei Temple
Quantum bits | Official account QbitAI
The first comprehensive survey of the recently popular diffusion models is here!
Diffusion models are the new SOTA among deep generative models. Their theory and practice are still in a stage of "wild growth," and a systematic review has been lacking.
To reflect the progress in this fast-developing field, the survey presents a refined taxonomy of improved diffusion models, their connections with five other generative models, and their applications across seven major fields; finally, it discusses the current limitations of diffusion models and future research directions.
The authors include Ming-Hsuan Yang of the University of California & Google Research, Bin Cui's laboratory at Peking University, and research teams from CMU, UCLA, and Mila in Montreal. The first author, Ling Yang, is a Ph.D. student at Peking University.
Practitioners who have read it commented that many of the cited papers are from 2022, which shows how hard it is to keep up with the SOTA and how valuable such surveys are.
It is worth mentioning that the authors have also released a GitHub repository containing the survey's taxonomy of diffusion-model papers (link at the end of this article~)
Without further ado, let’s take a closer look.
1. Introduction
Diffusion models are the new SOTA among deep generative models. They have surpassed the previous SOTA, GANs, on image generation tasks and perform well in many application fields, such as computer vision, NLP, waveform signal processing, multimodal modeling, molecular graph modeling, time series modeling, and adversarial purification.
In addition, diffusion models are closely related to other research fields, such as robust learning, representation learning, and reinforcement learning.
However, the original diffusion model also has drawbacks: sampling is slow, usually requiring thousands of evaluation steps to draw a single sample; its maximum likelihood estimates are inferior to those of likelihood-based models; and it generalizes poorly to other data types.
Many recent studies have worked to overcome these limitations from the perspective of practical applications, or have analyzed model capabilities theoretically. However, a systematic review of the latest advances in diffusion models, from algorithms to applications, has been lacking.
To reflect the progress in this fast-developing field, we conducted the first comprehensive survey of diffusion models. We hope our work will shed light on design considerations and advanced methods for diffusion models, demonstrate their applications in different fields, and point out future research directions.
The summary of this review is shown in the figure below:
Although diffusion models perform excellently on a variety of tasks, they still have shortcomings, and many studies have sought to improve them.
To systematically chart this research progress, we summarize three main shortcomings of the original diffusion model: slow sampling speed, poor maximum likelihood, and weak data-generalization ability. Accordingly, we divide the research on improved diffusion models into three categories: sampling-speed improvement, maximum-likelihood enhancement, and data-generalization enhancement.
We first explain the motivation for each improvement, then further classify the studies in each direction by the characteristics of their methods, so as to clearly show the connections and differences between them.
Here we select only some important methods as examples; in our work, each method is introduced in detail, as shown in the figure:
After analyzing these three types of improved diffusion models, we introduce the five other generative models: GANs, VAEs, autoregressive models, normalizing flows, and energy-based models.
Given the excellent properties of diffusion models, researchers have combined them with other generative models according to their characteristics. To further illustrate the features of diffusion models and the related improvement work, we describe in detail the studies that combine diffusion models with other generative models, and explain how the original generative models are improved.
Diffusion models perform excellently in many fields, and they take different forms in different application domains, so we systematically review their applications in the following fields: computer vision, NLP, waveform signal processing, multimodal modeling, molecular graph modeling, time series modeling, and adversarial purification.
For each task, we define the task and describe the work that applies diffusion models to it. We summarize the main contributions of this work as follows:
New classification method: We propose a new and systematic taxonomy for diffusion models and their applications. Specifically, we divide the improved models into three categories: sampling-speed enhancement, maximum-likelihood-estimation enhancement, and data-generalization enhancement.
Further, we divide the applications of diffusion models into seven categories: computer vision, NLP, waveform signal processing, multimodal modeling, molecular graph modeling, time series modeling, and adversarial purification.
Comprehensive review: We provide the first comprehensive overview of modern diffusion models and their applications. We present the major improvements of each diffusion model, make the necessary comparisons with the original model, and summarize the corresponding papers.
For each type of application, we present the main problems that diffusion models aim to solve and explain how they solve them.
Future research directions: We raise open questions for future research and offer suggestions for the future development of diffusion models in both algorithms and applications.
2. Basics of diffusion models
One of the core issues of generative modeling is the trade-off between model flexibility and tractability. The basic idea of the diffusion model is to systematically perturb the structure in the data distribution through a forward diffusion process, and then learn a reverse diffusion process to restore it, yielding a highly flexible and tractable generative model.
1. Denoising Diffusion Probabilistic Models (DDPM)
A DDPM consists of two parameterized Markov chains and uses variational inference to generate, after a finite number of steps, samples consistent with the original data distribution.
The forward chain perturbs the data: it gradually adds Gaussian noise according to a pre-designed noise schedule until the data distribution approaches the prior distribution, i.e., a standard Gaussian.
The reverse chain starts from the given prior and uses a parameterized Gaussian transition kernel, learning to gradually restore the original data distribution.
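As a concrete illustration, the forward chain can be sampled in closed form at any step t. This is a minimal NumPy sketch, not the survey's code; the linear β schedule and T = 1000 follow the original DDPM paper:

```python
import numpy as np

def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    # Linear beta schedule from the DDPM paper; alpha_bar_t = prod_s (1 - beta_s)
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    # Closed-form forward perturbation:
    #   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
_, abar = make_schedule()
x0 = rng.standard_normal((16, 8))
x_T = q_sample(x0, 999, abar, rng)
# alpha_bar at the final step is tiny, so x_T is essentially standard Gaussian
```

This is why the chain ends near the prior: almost no signal coefficient survives after 1000 steps.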
2. Score-Based Generative Models (SGM)
The DDPM above can be regarded as a discrete form of an SGM. An SGM constructs a stochastic differential equation (SDE) to smoothly perturb the data distribution, transforming the original data distribution into a known prior distribution:
and a corresponding reverse SDE to transform the prior distribution back into the original data distribution:
Therefore, to reverse the diffusion process and generate data, the only information we need is the score function at each time point. Using score-matching techniques, we can learn the score function through the following loss function:
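For reference, the standard score-SDE forms (from Song et al.) that these sentences describe can be written as:

```latex
% Forward SDE that perturbs the data x(t), t \in [0, T]:
\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w

% Corresponding reverse-time SDE (\bar{w} is a reverse-time Wiener process):
\mathrm{d}x = \left[ f(x, t) - g(t)^{2}\,\nabla_{x} \log p_{t}(x) \right] \mathrm{d}t
            + g(t)\,\mathrm{d}\bar{w}

% Denoising score-matching loss for the score network s_\theta:
\mathbb{E}_{t}\,\lambda(t)\,
\mathbb{E}_{x(0)}\,\mathbb{E}_{x(t)\mid x(0)}
\big\| s_{\theta}(x(t), t) - \nabla_{x(t)} \log p_{0t}\!\left(x(t) \mid x(0)\right) \big\|^{2}
```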
A more detailed introduction to the two formulations and the relationship between them is given in the paper.
The three main shortcomings of the original diffusion model are slow sampling speed, poor maximum likelihood, and weak data-generalization ability. Many recent studies address these shortcomings, so we divide the improved diffusion models into three categories:
sampling-speed improvement, maximum-likelihood enhancement, and data-generalization enhancement.
In sections three, four, and five, we introduce these three types of models in detail.
3. Sampling acceleration methods
In practice, to achieve the best sample quality, diffusion models often require thousands of steps to draw a single new sample. This limits their practical value, since real applications often need large numbers of samples as material for downstream processing.
Researchers have therefore studied how to improve the sampling speed of diffusion models extensively. We elaborate on these studies in detail and classify them into three approaches: Discretization Optimization, Non-Markovian Process, and Partial Sampling.
Discretization Optimization methods improve the numerical solution of the diffusion SDE. In practice, complex SDEs can only be solved approximately by discretization, so this type of method optimizes the discretization scheme, reducing the number of discrete steps while preserving sample quality. SGM proposes a general approach to solving the reverse process: adopt the same discretization for the forward and reverse processes. Given a discretization of the forward SDE:
we can discretize the reverse SDE in the same way:
This method is slightly better than naive DDPM sampling. Furthermore, SGM adds a corrector to the SDE solver, so that the samples generated at each step have the correct distribution.
At each step of the solution, after the solver produces a sample, the corrector uses the Markov chain Monte Carlo method to correct the distribution of the newly generated sample. Experiments show that adding a corrector to the solver is more efficient than directly increasing the number of solver steps.
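The corrector idea can be sketched on a toy 1-D example with a known score (this is my own minimal NumPy illustration, not the SGM implementation): a few Langevin MCMC steps pull mis-distributed samples toward the target density.

```python
import numpy as np

def langevin_corrector(x, score_fn, step=0.1, n_steps=50, rng=None):
    # Langevin MCMC update: x <- x + step * score(x) + sqrt(2 * step) * noise
    rng = rng or np.random.default_rng(0)
    for _ in range(n_steps):
        x = x + step * score_fn(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x

# Toy target: standard Gaussian, whose score is exactly -x
rng = np.random.default_rng(1)
x = 3.0 * rng.standard_normal(5000)            # badly scaled "predictor" output
x = langevin_corrector(x, lambda z: -z, rng=rng)
# After correction the sample std is close to the target's std of 1
```

In a real sampler, `score_fn` would be the trained score network at the current noise level rather than an analytic score.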

The main work in the Non-Markovian Process category is DDIM, which no longer assumes that the forward process is a Markov process, but instead has it obey the following distribution:
The DDIM sampling process can be regarded as a discretized ordinary differential equation; it is more efficient and supports interpolation between samples. Further research found that DDIM can be regarded as a special case of the on-manifold diffusion model PNDM.
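The deterministic DDIM update can be sketched as follows (a minimal NumPy illustration; `abar` denotes the cumulative signal level ᾱ, and the noise prediction here is an oracle rather than a trained network):

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    # DDIM (eta = 0): predict x_0 from x_t, then jump to the earlier noise level
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(4), rng.standard_normal(4)
abar_t, abar_prev = 0.5, 0.9
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)
# With a perfect noise prediction, the step lands exactly on the abar_prev level
```

Because the update is deterministic, large jumps between noise levels are possible, which is what makes DDIM sampling efficient.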

Partial Sampling methods reduce sampling time by performing only part of the sampling steps. For example, Progressive Distillation distills a more efficient diffusion model from a trained one: it trains a new diffusion model in which one step corresponds to two steps of the trained teacher model, so the new model needs only half of the teacher's sampling steps. The specific algorithm is as follows:

Iterating this distillation procedure reduces the number of sampling steps exponentially.
4. Maximum likelihood enhancement
Diffusion models perform worse in maximum likelihood estimation than likelihood-based generative models, yet maximum likelihood estimation matters in many application scenarios, such as image compression, semi-supervised learning, and adversarial purification.
Since the log-likelihood is difficult to compute directly, research mainly focuses on optimizing and analyzing the variational lower bound (VLB). We elaborate on the models that improve the maximum likelihood estimation of diffusion models.
We classify them into three categories: Objectives Designing, Noise Schedule Optimization, and Learnable Reverse Variance.
Objectives Designing methods use the diffusion SDE to derive the relationship between the log-likelihood of the generated data and the score-matching loss, so that a properly designed loss function maximizes the VLB and hence the log-likelihood. Song et al. prove that the weight function of the loss can be designed so that the negative log-likelihood of samples generated by the plug-in reverse SDE is less than or equal to the loss value; that is, the loss is an upper bound on the negative log-likelihood. The loss function for fitting the score function is as follows:
We only need to set the weight function λ(t) to the squared diffusion coefficient g(t)² to make the loss function a VLB of the likelihood, that is:
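Concretely, the likelihood-weighting bound from Song et al. can be sketched (up to a constant C independent of θ) as:

```latex
% With the likelihood weighting \lambda(t) = g(t)^2, the weighted
% score-matching loss upper-bounds the negative log-likelihood:
-\mathbb{E}_{p_{0}}\!\left[ \log p_{\theta}(x) \right]
\;\le\;
\frac{1}{2} \int_{0}^{T} g(t)^{2}\,
\mathbb{E}\!\left[ \big\| s_{\theta}(x_t, t) - \nabla_{x_t} \log p_{t}(x_t) \big\|^{2} \right]
\mathrm{d}t \;+\; C
```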

Noise Schedule Optimization methods improve the VLB by designing or learning the forward noise schedule. When the number of discretization steps approaches infinity, the VLB can be optimized by learning the endpoints of the signal-to-noise ratio function SNR(t), while other aspects of the model can be improved by learning the intermediate values of SNR(t).
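For intuition, here is a small NumPy check of my own, assuming the discrete variance-preserving schedule where SNR(t) = ᾱ_t / (1 − ᾱ_t):

```python
import numpy as np

# SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for a linear-beta VP schedule
betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1.0 - betas)
snr = abar / (1.0 - abar)
# A valid forward schedule has a monotonically decreasing SNR; in the
# continuous-time limit the VLB depends only on the SNR endpoints, while
# interior values affect other aspects of the model
```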

Learnable Reverse Variance methods learn the variance of the reverse process, reducing fitting error. Using the above formula together with the trained score function, the optimal VLB can be approximately achieved under the given forward-process conditions.
5. Data generalization enhancement
Diffusion models assume that the data live in Euclidean space, i.e., on a manifold with flat geometry, and adding Gaussian noise inevitably converts the data into a continuous state space. Therefore, diffusion models could initially only handle continuous data such as images, and applying them directly to discrete data or other data types gives poor results.
This limits the application scenarios of diffusion models. Several research efforts generalize diffusion models to other data types, and we elaborate on these methods in detail. We classify them into two categories: Feature Space Unification and Data-Dependent Transition Kernels.
Feature Space Unification methods map the data into a unified latent space and then diffuse in that latent space. LSGM proposes converting the data into a continuous latent space via the VAE framework before diffusing on it. The difficulty of this method lies in training the VAE and the diffusion model jointly. LSGM shows that the score-matching loss no longer applies, since the latent prior is intractable. It therefore directly uses the traditional VAE objective, the ELBO, as its loss function, and derives the relationship between the ELBO and score matching (the formula holds up to a constant). By parameterizing the score function of the samples during diffusion, LSGM can efficiently learn and optimize the ELBO.

Data-Dependent Transition Kernels methods design transition kernels suited to the data type. D3PM designs a transition kernel for discrete data, which can be set to a lazy random walk, an absorbing state, and so on.
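An absorbing-state kernel can be sketched as follows (my own minimal NumPy construction in the spirit of D3PM, not its actual code; index K−1 plays the role of the [MASK] state):

```python
import numpy as np

def absorbing_kernel(K, beta):
    # Each real token keeps its value with prob 1 - beta and jumps to the
    # absorbing [MASK] state (index K - 1) with prob beta; [MASK] never leaves.
    Q = (1.0 - beta) * np.eye(K)
    Q[:, -1] += beta
    Q[-1, :] = 0.0
    Q[-1, -1] = 1.0
    return Q

Q = absorbing_kernel(K=5, beta=0.1)
p = np.array([0.25, 0.25, 0.25, 0.25, 0.0])  # uniform over 4 real tokens
for _ in range(200):
    p = p @ Q                                 # one forward diffusion step
# Probability mass drains into the absorbing state, mimicking masking noise
```

Each row of `Q` is a valid categorical distribution, and repeated application converges to the all-[MASK] prior, the discrete analogue of the standard Gaussian endpoint.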
GeoDiff designs a translation- and rotation-invariant graph neural network for 3D molecular graph data, and proves that an invariant initial distribution together with an equivariant transition kernel yields an invariant marginal distribution. Let T be a translation-rotation transformation, such as:
then the generated sample distribution is also translation-rotation invariant:
6. Relationship with other generative models
In each subsection below, we first introduce the other five important generative models and analyze their advantages and limitations. We then describe how diffusion models relate to them and explain how these generative models are improved by incorporating diffusion models.
The relationships between VAEs, GANs, autoregressive models, normalizing flows, energy-based models, and diffusion models are shown in the figure below:




7. Applications of diffusion models
In this section, we introduce the applications of diffusion models in seven major directions, including computer vision, natural language processing, waveform signal processing, multimodal learning, molecular graph generation, time series, and adversarial learning, and we subdivide and analyze the methods in each type of application. For example, in computer vision, diffusion models can be used for image inpainting (RePaint):
In multimodal tasks, diffusion models can be used for text-to-image generation (GLIDE):
Diffusion models can also generate drug molecules and protein molecules in molecular graph generation (GeoDiff):
A summary of the application taxonomy is shown in the table:
8. Future research directions
1. Re-examining application assumptions.
We need to re-examine assumptions that are generally accepted in applications. For example, it is generally believed in practice that the forward process of a diffusion model converts the data into a standard Gaussian distribution, but this is not exactly the case. More forward diffusion steps bring the final sample distribution closer to the standard Gaussian, consistent with the sampling process; but more forward steps also make the score function harder to estimate. The conditions required by the theory are hard to attain, which in practice leads to a mismatch between theory and practice. We should be aware of this and design diffusion models accordingly.
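This mismatch can be made concrete with a quick NumPy calculation (my own sketch, assuming the linear DDPM schedule): the fraction of signal remaining at the end of the forward chain vanishes only when the number of steps T is large.

```python
import numpy as np

def terminal_signal(T, beta_min=1e-4, beta_max=0.02):
    # sqrt(alpha_bar_T): the coefficient of x_0 remaining at the end of the chain
    betas = np.linspace(beta_min, beta_max, T)
    return float(np.sqrt(np.cumprod(1.0 - betas)[-1]))

# Few steps: x_T keeps substantial signal, so q(x_T) is far from N(0, I);
# many steps: the residual signal is negligible
short, full = terminal_signal(50), terminal_signal(1000)
```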
2. From discrete time to continuous time.
Thanks to the flexibility of diffusion models, many empirical methods can be strengthened by further analysis. A promising research direction is to convert discrete-time models into their corresponding continuous-time models, and then design more and better discretization methods.
3. New generation processes.
Diffusion models generate samples through two main methods: one discretizes the reverse diffusion SDE and then generates samples via the discretized reverse SDE; the other uses the Markov property to gradually denoise samples in the reverse process. However, for some tasks it is difficult to apply these methods in practice, so new generation processes and perspectives need further study.
4. Generalize to more complex scenarios and more research areas.
Although diffusion models have been applied to multiple scenarios, most are limited to single-input, single-output settings. In the future, they could be applied to more complex scenarios, such as text-to-audiovisual speech synthesis, and combined with more research areas.
Paper link:
https://arxiv.org/pdf/2209.00796.pdf
GitHub link:
https://github.com/YangLing0818/Diffusion-Models-Papers-Survey-Taxonomy#application-taxonomy-1
— End —
QbitAI · Toutiao signed author
Follow us and learn about cutting-edge technology dynamics