Inference-Time Policy Steering through Human Interactions


Yanwei Wang1     Lirui Wang1     Yilun Du1     Balakumar Sundaralingam2     Xuning Yang2     Yu-Wei Chao2     Claudia Pérez-D’Arpino2     Dieter Fox2     Julie Shah1
1Interactive Robotics Lab / MIT CSAIL      2NVIDIA Seattle Robotics Lab




Abstract - What is ITPS?



Generative policies trained on human demonstrations can accomplish multimodal, long-horizon tasks autonomously. However, at inference time humans are typically removed from the policy execution loop, limiting the ability to steer a pre-trained policy toward a specific subgoal or trajectory shape among its multimodal predictions. Naive human intervention may inadvertently exacerbate covariate shift, leading to downstream constraint violations or execution failures. To better align policy output with human intent without causing out-of-distribution errors, we propose an Inference-Time Policy Steering (ITPS) framework that uses human interactions to bias the generative sampling process instead of fine-tuning the policy on interaction data. We experiment with three simulated and real-world benchmarks to test three forms of human interaction and their associated alignment distance metrics. Out of the six sampling strategies we evaluate, our proposed stochastic sampling method combined with a diffusion policy achieves the best trade-off between alignment and distribution shift.


Paper


Inference-Time Policy Steering through Human Interactions
Yanwei Wang, Lirui Wang, Yilun Du, Balakumar Sundaralingam, Xuning Yang, Yu-Wei Chao, Claudia Pérez-D’Arpino, Dieter Fox, Julie Shah
arXiv / code (in submission)



Motivation - What is the challenge?




Method - How to steer policy without adding distribution shift?


Three interaction inputs: point, sketch, physical correction

Six sampling strategies

We consider six sampling strategies: Random Sampling (RS), Output Perturbation (OP), Post-Hoc Ranking (PR), Biased Initialization (BI), Guided Diffusion (GD), and Guided Stochastic Sampling (SS). RS, OP, and PR operate on the policy output and are agnostic to the policy class; in this work we experiment with two policy classes, Action Chunking Transformer (ACT) and Diffusion Policy (DP). BI, GD, and SS are unique to diffusion: BI modifies the noise input, while GD and SS modify the diffusion process itself. To implement conditional sampling based on user interactions, we use a hand-crafted L2 distance metric to convert point and sketch inputs into inference-time cost objectives ξ(·) that are composed with the frozen policy. Physical-correction input overwrites the policy output directly and is therefore only compatible with OP. We use a maze navigation task to illustrate how the six sampling strategies balance inference-time user alignment against constraint satisfaction, where alignment is measured by the L2 distance between the user input and the policy output and constraint satisfaction corresponds to remaining collision-free.
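To make this concrete, here is a minimal numpy sketch of one plausible form of the L2 alignment metrics for point and sketch inputs. The function names and the resampling scheme are illustrative assumptions, not the exact implementation used in the paper.

```python
import numpy as np

def point_cost(traj, point):
    """L2 alignment cost for a point input: distance from the closest
    predicted waypoint to the clicked point.
    traj: (T, D) array of predicted waypoints, point: (D,) array."""
    return np.min(np.linalg.norm(traj - point, axis=-1))

def sketch_cost(traj, sketch):
    """L2 alignment cost for a sketch input: resample the sketch to the
    prediction horizon, then average pointwise distances.
    traj: (T, D), sketch: (S, D)."""
    T, D = traj.shape
    idx = np.linspace(0, len(sketch) - 1, T)
    resampled = np.stack(
        [np.interp(idx, np.arange(len(sketch)), sketch[:, d]) for d in range(D)],
        axis=-1)
    return np.mean(np.linalg.norm(traj - resampled, axis=-1))
```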

Output Perturbation

User input (sketch or physical correction) overwrites the agent (red) states in real time (click and drag the mouse). OP maximizes alignment with the user at the cost of potential distribution shift. Predictions in collision turn white.

Exploring the learned motion manifold of ACT (execute user inputs and visualize policy predictions)

Exploring the learned motion manifold of DP (execute user inputs and visualize policy predictions)
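For reference, OP needs no sampling machinery at all; a minimal sketch of the overwrite (argument names are hypothetical) could look like this.

```python
import numpy as np

def output_perturbation(policy_actions, user_waypoints, start):
    """OP: directly overwrite a segment of the predicted action chunk (or the
    executed states) with user-provided waypoints from a sketch or physical
    correction. Alignment is maximal, but nothing keeps the result in
    distribution for the pre-trained policy."""
    steered = np.array(policy_actions, copy=True)
    end = min(start + len(user_waypoints), len(steered))
    steered[start:end] = user_waypoints[: end - start]
    return steered
```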

Post-Hoc Ranking

User input (point or sketch) is used to rank policy outputs by L2 similarity. PR introduces minimal distribution shift but only improves alignment if aligned samples already exist among the unconditional predictions.

As seen above, ACT does not produce a diverse set of predictions, so PR yields limited alignment improvement. DP, however, exhibits a higher degree of multimodality and better constraint satisfaction after being driven to OOD locations. Hence, PR can improve alignment, but it cannot modify unconditional samples to be more similar to the user input.

PR selects the best DP output based on sketch
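A minimal sketch of PR, assuming a hypothetical policy.sample(obs) that draws one unconditional trajectory and an alignment cost like the ones sketched above:

```python
import numpy as np

def post_hoc_ranking(policy, obs, user_input, cost_fn, num_samples=32):
    """PR: draw a batch of unconditional samples from the frozen policy and
    return the one with the lowest alignment cost to the user input."""
    candidates = [policy.sample(obs) for _ in range(num_samples)]
    costs = [cost_fn(traj, user_input) for traj in candidates]
    return candidates[int(np.argmin(costs))]
```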

Biased Initialization

User input (point or sketch) is used to bias the initial noise distribution (instead of a standard Gaussian) of DP. Similar to PR, BI offers the user limited control because the diffusion sampling process itself remains unconditional.

BI biases the noise distribution input of DP
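A minimal sketch of BI under DDPM-style noising, assuming the user input has already been resampled to the trajectory shape and assuming hypothetical accessors policy.sqrt_alpha_cumprod, policy.sigma (i.e., sqrt(1 - ᾱ)), and policy.denoise_step:

```python
import torch

def biased_initialization(policy, obs, user_traj, num_steps):
    """BI: start the reverse diffusion from a noised copy of the user input
    instead of a pure Gaussian, then run the standard (unconditional)
    reverse process."""
    t0 = num_steps - 1
    noise = torch.randn_like(user_traj)
    # Forward-diffuse the user trajectory to the highest noise level:
    # x_T = sqrt(alpha_bar_T) * x_0 + sqrt(1 - alpha_bar_T) * eps
    x = policy.sqrt_alpha_cumprod[t0] * user_traj + policy.sigma[t0] * noise
    for t in reversed(range(num_steps)):
        x = policy.denoise_step(x, t, obs)  # unconditional reverse step
    return x
```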

Guided Diffusion

User input (point or sketch) guides the diffusion process with gradients of the L2 similarity between the user input and the policy output. Unlike PR and BI, GD can discover new trajectories close to the user input that do not necessarily lie on the original motion manifold (see the stacking experiments). Hence, there is no guarantee that execution will still satisfy the original constraints and eventually succeed. In fact, sampling with a weighted sum of denoising and alignment gradients can produce OOD samples because it is equivalent to sampling from an unnormalized sum of the policy distribution and the objective distribution.

GD guides sampling with gradients of L2 similarity
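A minimal sketch of GD, assuming a differentiable (torch) version of the alignment cost and hypothetical policy.horizon, policy.action_dim, and policy.denoise_step accessors:

```python
import torch

def guided_diffusion(policy, obs, user_input, cost_fn, num_steps, guide_scale=1.0):
    """GD: classifier-guidance-style sampling. After every reverse step, nudge
    the sample with the gradient of the L2 alignment cost w.r.t. the trajectory."""
    x = torch.randn(policy.horizon, policy.action_dim)
    for t in reversed(range(num_steps)):
        x = policy.denoise_step(x, t, obs)        # standard reverse step
        x = x.detach().requires_grad_(True)
        cost = cost_fn(x, user_input)             # differentiable alignment cost
        grad = torch.autograd.grad(cost, x)[0]
        x = (x - guide_scale * grad).detach()     # move toward the user input
    return x
```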

Stochastic Sampling

SS is an improved version of GD that generates trajectories closer to the user input while maintaining the original motion constraints. Because a reverse diffusion step held at a fixed noise level (from timestep t back to timestep t) is equivalent to a ULA MCMC step at that noise level, we can approximately sample from the product distribution of the policy and the user-input objective with the same weighted sum of denoising and alignment gradients, by repeating each guided diffusion step M times in an annealed MCMC fashion.

SS achieves the best alignment-constraint satisfaction trade-off.
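A minimal sketch of SS, assuming a hypothetical policy.score(x, t, obs) that returns the learned score at noise level t in addition to the accessors above; the inner loop runs M ULA steps at a fixed noise level before annealing to the next one:

```python
import torch

def stochastic_sampling(policy, obs, user_input, cost_fn,
                        num_steps, M=4, guide_scale=1.0, step_size=0.1):
    """SS: annealed MCMC. At each noise level t, run M inner ULA steps whose
    drift is the weighted sum of the denoising (score) gradient and the
    alignment gradient, so samples approximate the *product* of the policy
    distribution and the user-input objective."""
    x = torch.randn(policy.horizon, policy.action_dim)
    for t in reversed(range(num_steps)):
        for _ in range(M):
            x = x.detach().requires_grad_(True)
            cost = cost_fn(x, user_input)
            align_grad = torch.autograd.grad(cost, x)[0]
            drift = policy.score(x, t, obs) - guide_scale * align_grad
            # ULA step at fixed noise level t.
            x = (x + 0.5 * step_size * drift
                 + (step_size ** 0.5) * torch.randn_like(x)).detach()
        x = policy.denoise_step(x, t, obs)  # anneal to the next noise level
    return x
```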

Guided Diffusion vs. Stochastic Sampling


Consider the toy example of composing a pre-trained policy distribution with an inference-time point input. The goal is to sample points that are highly likely under the pre-trained distribution while maximizing alignment with the constructed point objective. Since we cannot sample from the composed distribution directly, we perform MCMC using Langevin dynamics: the sampling process is a random walk driven by the weighted sum of denoising and alignment gradients. Once it converges, we can estimate the probability distribution we are effectively sampling from (shown by contour lines) using all the collected samples. As shown on the right, GD approximately samples from an unnormalized sum distribution, which can yield OOD samples that violate the likelihood constraints learned by the pre-trained distribution. In contrast, SS approximately samples from the product distribution via annealed MCMC. See below for videos of Langevin dynamics with combined denoising and alignment gradients in GD and SS, and a minimal toy sketch after the videos.

Unconditional MCMC sampling from a pre-trained distribution with Langevin dynamics

Conditional GD sampling with weighted sum of denoising and alignment gradients

Conditional SS sampling with weighted sum of denoising and alignment gradients
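Below is a minimal numpy sketch of this toy composition, with a two-mode Gaussian mixture standing in for the pre-trained distribution and a Gaussian around the clicked point standing in for the alignment objective. Summing the two score functions inside Langevin dynamics corresponds to sampling from their product, which is what SS approximates. This is an illustrative toy, not the code behind the videos.

```python
import numpy as np

def grad_log_gaussian(x, mu, sigma):
    """Score of an isotropic Gaussian (toy point-alignment objective)."""
    return -(x - mu) / sigma**2

def grad_log_mixture(x, mus, sigma):
    """Score of an equal-weight Gaussian mixture (toy pre-trained density)."""
    diffs = mus - x                                         # (K, 2)
    w = np.exp(-0.5 * np.sum(diffs**2, axis=1) / sigma**2)
    w = w / (w.sum() + 1e-12)                               # posterior responsibilities
    return (w[:, None] * diffs).sum(axis=0) / sigma**2

def langevin(grad_fn, x0, steps=2000, eps=1e-2, seed=0):
    """Unadjusted Langevin dynamics: x <- x + eps/2 * grad log p(x) + sqrt(eps) * noise."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(steps):
        x = x + 0.5 * eps * grad_fn(x) + np.sqrt(eps) * rng.standard_normal(2)
        samples.append(x.copy())
    return np.array(samples)

mus = np.array([[-1.0, 0.0], [1.0, 0.0]])   # toy pre-trained modes
user_point = np.array([1.2, 0.5])           # inference-time point input

# Product composition (what SS approximates): add the two score functions.
product_grad = lambda x: grad_log_mixture(x, mus, 0.3) + grad_log_gaussian(x, user_point, 0.5)
samples = langevin(product_grad, x0=[0.0, 0.0])
```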



Experiments


Maze2D Navigation Task - Continuous Motion Alignment

Qualitative results of benchmarking six steering strategies with two pre-trained policies on a 2D maze navigation task. The central challenge of inference-time policy steering is trading off alignment with user input against staying within the distribution of the pre-trained policy to maintain constraint satisfaction. We show that SS is the best at producing aligned successes.


Denoising evolution under BI

Denoising evolution under GD

Denoising evolution under SS


Block Stacking Task - Discrete Task Alignment

Given a pre-trained block-stacking DP policy that repeatedly picks a random block and places it on another random block until a tower is built, we show that a user can guide the policy with sketches to build a specific tower in a particular order (alignment) without exacerbating covariate shift (failed grasps or placements). We show the SS strategy below, where 2D sketches are projected onto the Y-Z plane through the end-effector to compute the L2 similarity metric. One can also use VR controllers to draw 3D sketches to guide the policy, as shown in the paper.


To highlight the difference between PR and SS, we show that PR can only improve alignment if the policy predictions already contain user-aligned trajectories, whereas SS can generate novel plans with user-defined shapes even when they are missing from the initial samples. Additionally, we show that one can choose whether to sample plans closer to the user input or to the policy's training distribution by adjusting the number of diffusion steps guided by the user input, as sketched after the videos below.

PR cannot generate user-aligned plans if they are missing from initial samples

SS generates OOD plans close to user input if alignment gradients are added to all diffusion steps

SS generates plans closer to training distribution if alignment gradients are only added to early diffusion steps
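One simple way to expose this knob in the SS sketch from the method section is a step-dependent guidance weight, multiplied into guide_scale at each reverse step. The schedule below is an illustrative assumption, not the paper's exact schedule.

```python
def guide_weight(t, num_steps, guided_fraction):
    """Guidance weight for reverse diffusion step t, where t = num_steps - 1 is
    the noisiest (earliest) reverse step. guided_fraction = 1.0 guides every
    step and tracks the user sketch closely (possibly OOD); smaller values
    leave the later, low-noise steps unconditional, pulling samples back
    toward the policy's training distribution."""
    return 1.0 if t >= num_steps * (1.0 - guided_fraction) else 0.0
```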


PushT Task - Composing Multiple Inference-Time Objectives

To illustrate our framework's flexibility in composing multiple inference-time objectives, we show that a user can provide both positive sketches as preferences and negative sketches as collision constraints to shape the policy output. We take pre-trained PushT policies and a real-time interactive version of PushT from this LeRobot branch, and we use gray lines to represent positive sketches that bias sampling toward them and black lines to represent negative sketches that bias sampling away from them. Note that the original PushT domain has no constraints, and the policy is trained only to push the block to the goal location.

Exploring pre-trained DP and ACT by visualizing predictions at mouse positions

Composing positive (gray line) and negative (black line) sketches to shape policy output
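A minimal sketch of how positive and negative sketch objectives could be composed into a single inference-time cost; the hinge penalty and weights are illustrative assumptions rather than the exact objective used here.

```python
import numpy as np

def composed_cost(traj, positive_sketches, negative_sketches, margin=0.1, w_neg=1.0):
    """Compose multiple inference-time objectives: positive (gray) sketches pull
    predictions toward them via an L2 alignment term; negative (black) sketches
    push predictions away via a hinge penalty on proximity."""
    def nearest_dist(traj, sketch):
        # Distance from each trajectory waypoint to its nearest sketch point.
        d = np.linalg.norm(traj[:, None, :] - sketch[None, :, :], axis=-1)
        return d.min(axis=1)

    cost = 0.0
    for s in positive_sketches:
        cost += nearest_dist(traj, s).mean()                                  # attract
    for s in negative_sketches:
        cost += w_neg * np.maximum(margin - nearest_dist(traj, s), 0.0).mean()  # repel
    return cost
```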


Real-World Kitchen Task - Discrete Task Alignment

Visualizing the learned motion manifold of the pre-trained DP shows multimodality in only limited parts of the state space. We show that OP can improve alignment with physical corrections but suffers from covariate shift and thus grasp failures. Meanwhile, it is hard to tune the guidance ratio for point input with GD: a small guidance ratio does not change the policy output, while a large guidance ratio can lead to incoherent OOD plans. The best steering strategy is SS with point input, which generates aligned plans close to the user input while staying on the original motion manifold.