Diffusion-based planners have been widely used in robot motion planning thanks to their ability to handle multi-modal demonstrations, their scalability to high-dimensional state spaces, and their stability in training.
However, most diffusion-based planners are fully data-driven: they require high-quality data and, being model-free, cannot adapt effectively to new dynamics.
The goal of this tutorial is to introduce a model-based diffusion (MBD) planner that leverages model information to generate high-quality trajectories without data, while also enabling the planner to use data of diverse quality more effectively.

Background: Diffusion Model in Planning

Diffusion model as a powerful sampler

A diffusion model is a generative model capable of generating samples from a given distribution. It serves as a powerful tool for distribution matching, extensively utilized in image generation, text generation, and other creative tasks.

diffusion examples

However, sampling from the target distribution is hard due to its high-dimensional and sparse nature. The diffusion model introduces a forward process that corrupts the distribution with noise, making it smooth and easier to sample from.

A reverse process is then introduced to recover the original distribution by removing the noise.
In the standard diffusion model, the reverse process is achieved by learning the score function (the arrows in the figure below) from data.

By running the reverse SDE, the diffusion model recovers the original distribution iteratively.
At the core of the diffusion model is the score function, which gives the direction used to update samples toward the target distribution. By learning this score function, the diffusion model can generate samples from the specified distribution.
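To make the role of the score function concrete, here is a minimal sketch in NumPy. It samples a simple 1-D Gaussian target by running Langevin dynamics driven by an analytic score; in a real diffusion model a learned network would stand in for this closed-form score, and the reverse SDE would anneal through a noise schedule rather than use a fixed step. The target parameters `mu` and `sigma` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution: a 1-D Gaussian N(mu, sigma^2), whose score is
# available in closed form (a learned network would replace this).
mu, sigma = 3.0, 1.0

def score(y):
    # d/dy log N(y; mu, sigma^2) = -(y - mu) / sigma^2
    return -(y - mu) / sigma**2

# Start from pure noise and run Langevin dynamics driven by the score:
# each step moves samples uphill in log-density, plus injected noise.
y = rng.normal(size=2000)
step = 0.05
for _ in range(1000):
    y = y + step * score(y) + np.sqrt(2 * step) * rng.normal(size=y.shape)

print(y.mean(), y.std())  # approaches mu = 3 and sigma = 1
```

After enough steps, the particles are distributed approximately according to the target: the score alone is enough to steer noise into samples.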

The diffusion model has several advantages over other generative models, including the following:

Multimodal: It effectively handles multimodal distributions, a challenge often encountered when directly predicting distributions.

Scalable: This approach scales well with high-dimensional distribution matching problems, making it versatile for various applications.

Stable: Grounded in solid math and a standard multi-stage diffusion training procedure, the model ensures stability during training.

Non-autoregressive: It predicts entire trajectory sequences in one shot rather than autoregressively, which makes long-horizon, multimodal distribution matching efficient.

Diffusion solving trajectory optimization as a sampling problem

The diffuser operates by concatenating the state and action at each timestep, allowing us to diffuse the whole state-action sequence. This process is akin to diffusing a single-channel image.

For the task of throwing a ball over a wall to maximize the landing distance, the cost function is visualized below:

throw ball task

cost function


The cost function is non-convex, nonlinear, and discontinuous, which makes it hard to solve with traditional optimization methods.

Converting optimization into sampling: the diffusion-based planner bypasses this difficulty by transforming the optimization problem into a sampling problem.

The target distribution is defined as $p_0(Y) \propto \exp(-\frac{J(Y)}{\lambda})$. When the temperature $\lambda$ is low, sampling from it is equivalent to solving the optimization problem (the darkest red curve in the figure below).
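A small numerical sketch of this temperature effect, using a hypothetical 1-D nonconvex objective: as λ shrinks, the Boltzmann distribution exp(−J/λ) concentrates on the global minimizer, so its mean converges to the solution of the optimization problem.

```python
import numpy as np

# A nonconvex toy objective; its global minimum is near y ≈ 2.17
def J(y):
    return (y - 2.0)**2 + 0.5 * np.sin(5.0 * y)

ys = np.linspace(-2.0, 6.0, 4001)
y_star = ys[np.argmin(J(ys))]          # grid-search minimizer for reference

means = []
for lam in [1.0, 0.1, 0.01]:
    logp = -J(ys) / lam                # log of unnormalized p0 ∝ exp(-J/λ)
    p = np.exp(logp - logp.max())      # subtract max for numerical stability
    p /= p.sum()
    means.append((p * ys).sum())       # mean of the Boltzmann distribution

print(y_star, means)  # the λ = 0.01 mean essentially coincides with y_star
```

At λ = 1 the distribution spreads mass across several basins; at λ = 0.01 it is sharply peaked at the global minimum, so sampling and optimizing coincide.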

target distribution

Back to the throwing-ball example: the forward and backward processes look like the following, where the sharp peak is the original distribution and the smooth curve is the noise-corrupted, near-Gaussian distribution.

throw ball forward

The backward process then starts from the corrupted distribution and recovers the original distribution iteratively via the reverse SDE. With that, we can sample from the original sparse distribution effectively.

The key ingredient of the reverse process is the score function $\nabla_{Y^{(i)}} \log p_i(Y^{(i)})$, which model-free diffusion learns purely from data, without any model information.

Introduction: Why Model-based Diffusion?

Limitation of Model-Free Diffusion: not generalizable

Model-free diffusion learns the score function purely from data. This means that even if the target distribution changes, model-free diffusion will still generate trajectories based on the old dynamics and tasks.

Consequently, if the dynamics change or the data itself is not high-quality, model-free diffusion fails to generate high-quality trajectories. The following plot shows how the target distribution shifts when the dynamics change:

throw ball dynamics change

For text or image generation, the target distribution is unknown, so it is hard to predict the target distribution for a new task. For trajectory optimization, however, the target distribution is known once the dynamics and objective are given; it is just hard to sample from.

Model-based Diffusion Makes the Score Function Conditional on the Model

Model-based Diffusion (MBD) is introduced to leverage readily available model information to conduct the reverse process more effectively. (By model information, we mean that we can evaluate the dynamics function f and the cost function l at any state-action pair. With that, we can evaluate the target distribution p0 at any state-action pair.)

MBD achieves this with a novel data-free score computation method and a new backward process designed for the optimization problem. With this design, MBD can use data of diverse quality and adapt to dynamics changes more effectively.

The following is a comparison between model-based diffusion and model-free diffusion:

| Aspect | Model-Based Diffusion (MBD) | Model-Free Diffusion (MFD) |
| --- | --- | --- |
| Target distribution | Known, but hard to sample from | Unknown, but data from it is available |
| Objective | Sample the high-likelihood solution | Generate diverse samples |

Method: What is Model-based Diffusion?

Score Function Computation with Model

Recall that the score function is the key to the reverse process in the diffusion model: it guides samples toward the target distribution.

diffusion reverse

But given the model, how can we compute the score? The major challenge is that the score function $\nabla_{Y^{(i)}} \log p_i(Y^{(i)})$, where $p_i = \int p_{i|0}\, p_0 \, dY^{(0)}$, is intractable to compute.

To address this challenge, we propose a novel method to compute the score function with the model information. The key idea is to use Monte Carlo estimation to approximate the score function.
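The estimate below is a sketch of this idea in 1-D, assuming a standard variance-preserving forward process $Y^{(i)} = \sqrt{\bar\alpha_i}\,Y^{(0)} + \sqrt{1-\bar\alpha_i}\,\epsilon$. Tweedie's formula reduces the score of the noised density to the posterior mean $\mathbb{E}[Y^{(0)} \mid Y^{(i)}]$, which we approximate by importance sampling: draw candidate clean points from the Gaussian implied by the forward process and weight them by the target density $\exp(-J/\lambda)$, which the model lets us evaluate anywhere. The quadratic objective `J` is an illustrative choice so the answer can be checked analytically.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.1

def J(y):
    return (y - 2.0)**2       # simple objective, so p0(y) ∝ exp(-J(y)/lam)

def mc_score(y_i, alpha_bar, n_samples=4096):
    """Monte Carlo estimate of the score of the noised density p_i.

    Under Y_i = sqrt(a)*Y_0 + sqrt(1-a)*eps, Tweedie's formula gives
    score = (sqrt(a) * E[Y0|Yi] - Yi) / (1 - a). We estimate E[Y0|Yi]
    by sampling Y0 candidates from the forward-process Gaussian and
    weighting them by the (unnormalized) target density exp(-J/lam).
    """
    std = np.sqrt((1.0 - alpha_bar) / alpha_bar)
    y0 = y_i / np.sqrt(alpha_bar) + std * rng.normal(size=n_samples)
    logw = -J(y0) / lam
    w = np.exp(logw - logw.max())          # stabilized importance weights
    w /= w.sum()
    y0_bar = (w * y0).sum()                # ≈ E[Y0 | Yi]
    return (np.sqrt(alpha_bar) * y0_bar - y_i) / (1.0 - alpha_bar)

# For this quadratic J the noised score is analytic, so we can sanity-check:
s = mc_score(0.0, alpha_bar=0.5)
print(s)   # analytic value is ≈ 2.694 for this setup
```

No gradients of the dynamics or cost are needed; only function evaluations, which is what makes this estimator attractive for discontinuous objectives.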

For the backward process, instead of using the standard reverse SDE, we propose Monte Carlo Score Ascent (MCSA), which recovers the original distribution faster by leveraging the fact that we only care about high-likelihood solutions. Compared to the reverse SDE, MCSA uses:

- a larger step size, chosen according to the smoothness of the intermediate distribution;

- no noise term in the update.

To see how MBD behaves differently from MFD, here we use the two methods to optimize a synthetic, highly non-convex objective function.

MCSA

Given each intermediate distribution, the backward process with MCSA converges faster thanks to its larger step size and lower sampling noise, while still capturing the multimodality.

MBD v.s. MFD

With the above two key components, MBD can generate high-quality trajectories without relying on data. Here is the full MBD algorithm:

General MBD Algorithm
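A compact 1-D sketch of the whole loop, under the same assumptions as above (variance-preserving schedule, Monte Carlo posterior-mean estimate, and a hypothetical nonconvex objective): at each noise level we estimate the clean sample from weighted model evaluations, then deterministically re-noise it to the next level, with no injected noise, in the spirit of Monte Carlo Score Ascent. The schedule and sample counts are illustrative choices, not tuned values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.02
N_STEPS, N_SAMPLES = 100, 2048

def J(y):
    # Highly nonconvex 1-D objective; global minimum near y ≈ 2.17
    return (y - 2.0)**2 + 0.5 * np.sin(5.0 * y)

# Variance-preserving noise schedule: alpha_bar decays from ~1 toward 0
betas = np.linspace(1e-4, 0.1, N_STEPS)
alpha_bars = np.cumprod(1.0 - betas)

y = rng.normal()                          # start from pure noise
for i in reversed(range(N_STEPS)):
    ab = alpha_bars[i]
    # Monte Carlo estimate of E[Y0 | Yi] under p0(y) ∝ exp(-J(y)/lam):
    # sample candidates from the forward-process Gaussian, weight by exp(-J/lam)
    y0 = y / np.sqrt(ab) + np.sqrt((1 - ab) / ab) * rng.normal(size=N_SAMPLES)
    logw = -J(y0) / lam
    w = np.exp(logw - logw.max())
    w /= w.sum()
    y0_bar = (w * y0).sum()
    # Score-ascent step: deterministically re-noise the clean estimate to
    # the next (lower) noise level -- no injected noise, unlike reverse SDE
    ab_prev = alpha_bars[i - 1] if i > 0 else 1.0
    y = np.sqrt(ab_prev) * y0_bar

print(y, J(y))   # y lands near the global minimum ≈ 2.17
```

Early steps use very wide proposals, so the sampler explores all basins; as the noise level anneals down, the proposals narrow and the iterate commits to and refines the global basin.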

Here is the full comparison between MBD and MFD:

| Aspect | Model-Based Diffusion (MBD) | Model-Free Diffusion (MFD) |
| --- | --- | --- |
| Target distribution | Known, but hard to sample from | Unknown, but data from it is available |
| Objective | Sample the high-likelihood solution | Generate diverse samples |
| Score approximation | From model + data (optional) | From data |
| Backward process | Monte Carlo Score Ascent | Reverse SDE |

Solve Full Trajectory Optimization Problem with MBD

The discussion so far has assumed an unconstrained optimization problem. We now turn to the standard trajectory optimization problem, where the goal is to find a trajectory that minimizes a cost function subject to dynamics and constraints.

The target distribution $p_0(Y^{(0)})$ is proportional to the product of dynamical feasibility $p_d(Y) \propto \prod_{t=1}^{T} \mathbf{1}(x_t = f_{t-1}(x_{t-1},u_{t-1}))$, optimality $p_J(Y) \propto \exp{(-\frac{J(Y)}{\lambda})}$, and constraint satisfaction $p_g(Y) \propto \prod_{t=1}^{T} \mathbf{1}(g_t(x_t, u_t)\leq 0)$, i.e.,

$$p_0(Y) \propto p_d(Y)\, p_J(Y)\, p_g(Y)$$

Directly sampling from this target distribution is hard due to its high-dimensional and sparse nature. MBD sidesteps this challenge by sampling only from the space of dynamically feasible trajectories, similar to the shooting method: it samples control sequences and rolls them out through the dynamics.
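A sketch of this shooting-style parameterization on a hypothetical double-integrator task (the dynamics, horizon, and cost are illustrative, not from the paper). Because states are produced by rolling the model forward, every sampled trajectory satisfies $x_t = f_{t-1}(x_{t-1}, u_{t-1})$ by construction, so the feasibility factor $p_d(Y)$ is always 1. The naive best-of-N selection below merely stands in for the diffusion update, which would instead move the control sequences along the Monte Carlo score.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dt = 20, 0.1

def rollout(u_seq, x0=(0.0, 0.0)):
    """Roll the dynamics forward under a control sequence.

    States come from the model itself, so every sampled trajectory is
    dynamically feasible by construction -- the shooting-method trick.
    """
    xs = [np.array(x0)]
    for u in u_seq:
        pos, vel = xs[-1]
        xs.append(np.array([pos + dt * vel, vel + dt * u]))  # double integrator
    return np.stack(xs)

def cost(u_seq):
    xs = rollout(u_seq)
    # reach position 1.0 with zero final velocity, small control penalty
    return (xs[-1, 0] - 1.0)**2 + xs[-1, 1]**2 + 1e-3 * np.sum(u_seq**2)

# Zeroth-order stand-in for the diffusion update: sample control sequences,
# evaluate them through the model, keep the best
u_candidates = rng.normal(scale=2.0, size=(256, T))
costs = np.array([cost(u) for u in u_candidates])
best_u = u_candidates[costs.argmin()]
print(costs.min())   # well below the cost of applying zero control (1.0)
```

Sampling in control space rather than state-action space removes $p_d$ from the target entirely, which is what makes the remaining sampling problem tractable.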

Up to now, we have introduced the full version of MBD for trajectory optimization. However, one remaining issue persists: due to the combinatorial nature of pg(Y), we might still sample from infeasible regions of trajectory space. To address this issue, we introduce data to guide the diffusion process.

Augment MBD with Diverse Data

Even though MBD is data-free, it can leverage data of diverse quality more effectively with the help of the model. The key idea is to use the data as a regularization term that guides the diffusion process toward the high-likelihood solution, while still using the model to refine the solution further.

With the demonstration data, we redefine our target distribution as:

Here η is a constant that balances the model and the demonstration.
We also introduce two extra constant terms, pJ(Ydemo) and pg(Ydemo), to ensure that the demonstration likelihood is properly scaled to match the model likelihood p0(Y(0)).
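The sketch below illustrates the balancing role of η on a 1-D toy problem. The mixing rule (a weighted sum of the model log-likelihood and a Gaussian log-likelihood around the demonstration) is an assumption chosen for illustration, not the paper's exact formula, and `y_demo`, `eta`, and `sigma_demo` are hypothetical values.

```python
import numpy as np

lam, eta, sigma_demo = 0.02, 0.5, 0.5
y_demo = 1.0      # a suboptimal demonstration (hypothetical value)

def J(y):
    return (y - 2.0)**2 + 0.5 * np.sin(5.0 * y)   # global minimum near 2.17

def log_target(y):
    # Illustrative mixing (an assumption, not the paper's exact formula):
    # eta interpolates between the model likelihood exp(-J/lam) and a
    # Gaussian likelihood centered on the demonstration
    model_term = -J(y) / lam
    demo_term = -(y - y_demo)**2 / (2.0 * sigma_demo**2)
    return (1.0 - eta) * model_term + eta * demo_term

ys = np.linspace(-1.0, 4.0, 5001)
mode = ys[np.argmax(log_target(ys))]
print(mode)   # near the true optimum ≈ 2.17, gently pulled toward the demo
```

Because the model term dominates near the optimum, the suboptimal demonstration biases exploration without moving the solution away from the true minimizer, which is exactly the regularization behavior described above.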

Example: MBD with data vs. without data on a nonconvex function with constraints.
We want MBD to converge to the optimal point with the help of demonstration data. Although the demonstration point is not optimal, MBD still converges to the optimal point under the guidance of the demonstration data. Here the data serves as a regularization term that guides the diffusion process toward the optimal point, while the model is used to refine the solution further.

Results: How does MBD Perform?

MBD for Zeroth Order Optimization

Performance of MBD on high-dimensional black-box optimization benchmarks. MBD
outperforms other Gaussian Process-based Bayesian Optimization methods by 23%.

Black Box Optimization

MBD for Trajectory Optimization

We evaluate our method on a set of challenging tasks with discontinuous dynamics, including the long-horizon push-T task, which is generally considered challenging even for RL.

non-continuous task

Our benchmark shows that MBD outperforms PPO by 59%:

Reward Comparison

To achieve that performance, MBD requires only 10% of the computation time:

Time Comparison

For more visual results, please check our website.

MBD with Diverse Data

We also evaluate MBD with data augmentation on the Car UMaze Navigation and Humanoid Jogging tasks, to see how partial and dynamically infeasible data can aid MBD's exploration and regularize the solution by steering the diffusion process.

reference task

Code Example

Takeaway

MBD achieves generalizable trajectory optimization by leveraging model information (i.e., the ability to evaluate the cost function and the dynamics).

MBD can use data of diverse quality more effectively to generate high-quality trajectories.