User input (sketch or physical correction) overwrites the agent (red) states in real time (click and drag the mouse). OP maximizes policy alignment at the cost of potential distribution shift. Predictions in collision turn white. A minimal code sketch of this overwrite follows the exploration captions below.
Exploring the learned motion manifold of ACT (execute user inputs and visualize policy predictions)
Exploring the learned motion manifold of DP (execute user inputs and visualize policy predictions)
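The overwrite itself is simple; the sketch below shows one way it could look, assuming fixed-horizon trajectories and a boolean mask marking the user-edited timesteps (illustrative conventions, not the project's actual interface code).

```python
import torch

def overwrite_prediction(pred_traj: torch.Tensor,
                         user_traj: torch.Tensor,
                         user_mask: torch.Tensor) -> torch.Tensor:
    """OP: user-dragged states replace the policy's predicted states
    wherever the mask is set; the perturbed trajectory is then executed.

    pred_traj: (T, D) trajectory predicted by the policy (e.g., ACT or DP).
    user_traj: (T, D) states captured from the click-and-drag input.
    user_mask: (T,) boolean, True at timesteps the user has overwritten.
    """
    out = pred_traj.clone()
    out[user_mask] = user_traj[user_mask]
    return out
```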
User input (point or sketch) is used to rank policy outputs by L2 similarity. PR introduces minimal distribution shift but only improves alignment if aligned samples already exist among the unconditional predictions.
As seen above, ACT does not produce a diverse set of predictions, leading to limited alignment improvement with PR. DP, however, exhibits a higher degree of distribution multimodality and constraint satisfaction after being driven to OOD locations. Hence, PR can improve alignment, but it cannot modify unconditional samples to be more similar to the user input.
PR selects the best DP output based on sketch |
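A minimal sketch of the ranking step, assuming the sketch has been resampled to the policy's prediction horizon (an illustrative convention, not the project's actual code):

```python
import torch

def rank_by_sketch(candidates: torch.Tensor, sketch: torch.Tensor) -> torch.Tensor:
    """PR: among unconditional policy samples, return the trajectory with
    the smallest L2 distance to the user's point or sketch.

    candidates: (B, T, D) batch of trajectories sampled from the policy.
    sketch:     (T, D) user input resampled to the prediction horizon.
    """
    dists = ((candidates - sketch.unsqueeze(0)) ** 2).sum(dim=(1, 2))
    return candidates[dists.argmin()]
```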
|
User input (point or sketch) is used to initialize the noise distribution (in place of a standard Gaussian) for DP. Similar to PR, BI offers the user limited control, as the diffusion sampling process itself remains unconditional.
BI biases the noise distribution input of DP |
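A minimal sketch of the biased initialization, with `noise_scale` as an assumed tuning knob; everything downstream of this initial draw is the ordinary unconditional sampler:

```python
import torch

def biased_init(sketch: torch.Tensor, noise_scale: float = 1.0) -> torch.Tensor:
    """BI: start reverse diffusion from the user sketch plus Gaussian noise
    instead of a pure N(0, I) draw. The denoising loop itself is unchanged,
    which is why BI only weakly steers the final sample.

    sketch: (T, D) user input resampled to the prediction horizon.
    """
    return sketch + noise_scale * torch.randn_like(sketch)
```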
|
User input (point or sketch) is used to guide the diffusion process with gradients of the L2 similarity between the user input and the policy output. Unlike PR and BI, GD can discover new trajectories close to the user input that do not necessarily lie on the original motion manifold (see the stacking experiments). Hence, there is no guarantee that the execution will still satisfy the original constraints and ultimately succeed. In fact, sampling with a weighted sum of denoising and alignment gradients can lead to OOD samples because it is equivalent to sampling from an unnormalized sum of the policy distribution and the objective distribution.
GD guides sampling with gradients of L2 similarity |
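A minimal sketch of one guided denoising step; `denoiser(x_t, t)` stands in for the policy's single-step reverse update and is an assumed interface, not the project's actual API:

```python
import torch

def alignment_grad(x_t: torch.Tensor, sketch: torch.Tensor) -> torch.Tensor:
    """Gradient of the negative squared L2 distance between the current
    noisy trajectory and the user sketch (points toward the sketch)."""
    x = x_t.detach().requires_grad_(True)
    similarity = -((x - sketch) ** 2).sum()
    return torch.autograd.grad(similarity, x)[0]

def guided_step(x_t, t, denoiser, sketch, guidance_weight=1.0):
    """GD: the usual reverse-diffusion update plus a weighted alignment
    gradient. Summing the two at every step is what can push samples OOD."""
    return denoiser(x_t, t) + guidance_weight * alignment_grad(x_t, sketch)
```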
|
SS is an improved version of GD that can generate trajectories closer to the user input while maintaining the original motion constraints. Since a reverse diffusion sampling step at a fixed noise level (from timestep t back to timestep t) is equivalent to ULA MCMC sampling at that noise level, we can still approximately sample from the product distribution of the policy and the user-input objective with a weighted sum of denoising and alignment gradients by repeating each guided diffusion step M times in an annealed MCMC fashion.
SS achieves the best trade-off between alignment and constraint satisfaction.
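A minimal sketch of the annealed-MCMC sampler, assuming `score_fn(x, t)` returns the score of the noisy policy distribution at level t and `denoiser(x, t)` performs one reverse step to the next level (both are assumed interfaces; M and the step size are illustrative values):

```python
import torch

def ss_sample(x_T, timesteps, denoiser, score_fn, sketch,
              guidance_weight=1.0, M=4, step_size=1e-2):
    """SS: at each noise level, run M ULA MCMC steps whose drift is the
    weighted sum of the policy score and the alignment gradient, which
    approximates sampling from the product of the policy distribution and
    the user-input objective; then take one reverse step to the next level."""
    x = x_T
    for t in timesteps:                          # high noise -> low noise
        for _ in range(M):                       # MCMC at fixed noise level t
            align = -2.0 * (x - sketch)          # grad of -||x - sketch||^2
            drift = score_fn(x, t) + guidance_weight * align
            x = x + step_size * drift + (2 * step_size) ** 0.5 * torch.randn_like(x)
        x = denoiser(x, t)                       # anneal to the next noise level
    return x
```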
|
Unconditional MCMC sampling from a pre-trained distribution with Langevin dynamics |
Conditional GD sampling with a weighted sum of denoising and alignment gradients
Conditional SS sampling with a weighted sum of denoising and alignment gradients
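For reference, a minimal sketch of the unconditional Langevin chain visualized above, with `score_fn(x)` as an assumed handle to the pre-trained score; the conditional GD and SS variants add the weighted alignment gradient to the drift:

```python
import torch

def langevin_sample(x0: torch.Tensor, score_fn, n_steps=500, step_size=1e-3):
    """Unconditional ULA MCMC: drift along the learned score plus Gaussian
    noise, approximately sampling from the pre-trained distribution without
    any user conditioning."""
    x = x0
    for _ in range(n_steps):
        x = x + step_size * score_fn(x) + (2 * step_size) ** 0.5 * torch.randn_like(x)
    return x
```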
|
Denoising evolution under BI |
Denoising evolution under GD |
Denoising evolution under SS |
|
PR cannot generate user-aligned plans if they are missing from initial samples |
SS generates OOD plans close to user input if alignment gradients are added to all diffusion steps |
SS generates plans closer to training distribution if alignment gradients are only added to early diffusion steps |
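One way to restrict guidance to the early steps is to gate the alignment weight by noise level, as sketched below; `t_cutoff` is an assumed tuning knob rather than a value from the project:

```python
def gated_weight(t: int, t_cutoff: int, weight: float = 1.0) -> float:
    """Apply the alignment gradient only at early (high-noise) diffusion
    steps, assuming larger t means higher noise; later steps run unguided,
    pulling the plan back toward the training distribution."""
    return weight if t >= t_cutoff else 0.0

# Inside a guided sampling loop (names from the sketches above):
#   drift = score_fn(x, t) + gated_weight(t, t_cutoff) * (-2.0 * (x - sketch))
```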
|
Exploring pre-trained DP and ACT by visualizing predictions at mouse positions |
Compositing positive (gray line) and negative (black line) sketches to shape policy output |
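A minimal sketch of how positive and negative sketches could be composed into a single alignment gradient (the weights and signs are illustrative assumptions):

```python
import torch

def composite_alignment_grad(x: torch.Tensor,
                             positive_sketch: torch.Tensor,
                             negative_sketch: torch.Tensor,
                             w_pos: float = 1.0,
                             w_neg: float = 1.0) -> torch.Tensor:
    """Pull the trajectory toward the positive (gray) sketch and push it
    away from the negative (black) sketch by summing the two L2 gradients
    with opposite signs."""
    grad_pos = -2.0 * (x - positive_sketch)   # attract toward positive sketch
    grad_neg = 2.0 * (x - negative_sketch)    # repel from negative sketch
    return w_pos * grad_pos + w_neg * grad_neg
```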