Temporal Logic Imitation: Learning Plan-Satisficing Motion Policies from Demonstrations

Yanwei Wang     Nadia Figueroa     Shen Li     Ankit Shah     Julie Shah
Interactive Robotics Lab
Massachusetts Institute of Technology


Abstract



Learning from demonstration (LfD) has succeeded in solving tasks with long time horizons. However, when the problem complexity also includes human-in-the-loop perturbations, state-of-the-art approaches do not guarantee successful reproduction of the task. In this work, we identify the root of this challenge as the failure of a learned continuous policy to satisfy the discrete plan implicit in the demonstration. By using modes (rather than subgoals) as the discrete abstraction and motion policies with both mode-invariance and goal-reachability properties, we prove that our learned continuous policy can simulate any discrete plan specified by a linear temporal logic (LTL) formula. Consequently, the imitator is robust to both task- and motion-level perturbations and is guaranteed to achieve task success.


Paper


Temporal Logic Imitation: Learning Plan-Satisficing Motion Policies from Demonstrations
Yanwei Wang, Nadia Figueroa, Shen Li, Ankit Shah, Julie Shah
arxiv / review / code / PBS News Coverage
CoRL 2022 (Oral, acceptance rate: 6.5%)
IROS 2023 Learning Meets Model-based Methods for Manipulation and Grasping Workshop (Best Student Paper)



Teaser


Our method (LTL-DS) takes as input (1) an LTL formula that specifies all valid mode transitions for a task and (2) demonstrations that successfully complete the task, and outputs (1) a task automaton that reactively sequences (2) a set of learned per-mode dynamical system (DS) policies [Khansari-Zadeh and Billard 2011], guaranteeing constraint satisfaction and goal reachability despite arbitrary external perturbations.
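
For intuition, the closed loop can be sketched roughly as follows (the function and attribute names are illustrative placeholders, not the released implementation): the automaton reactively tracks the sensed mode and hands control to the DS policy of the next mode it needs to reach.

    def run_ltl_ds(automaton, ds_policies, sense_mode, step, x, max_steps=10000):
        # Illustrative sketch of the LTL-DS execution loop; all interfaces are placeholders.
        q = automaton.initial_state
        for _ in range(max_steps):
            mode = sense_mode(x)                  # discrete abstraction of the continuous state
            q = automaton.advance(q, mode)        # reactive transition on the sensed mode
            if automaton.is_accepting(q):
                return x                          # the LTL specification is satisfied
            target = automaton.next_mode(q)       # mode the task plan asks us to reach next
            x = step(x, ds_policies[target](x))   # one integration step of that mode's DS policy
        raise RuntimeError("step budget exceeded before task completion")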




Talk


Main Question: Given a discrete task plan encoded by an LTL formula that is reactive to perturbations, how do we ensure the plan is feasible for continuous policies learned from demonstrations, i.e., how do we guarantee that motion imitation satisfies the LTL specification?

Main Takeaway: Any discrete task plan over mode sequences is achievable by a continuous motion imitation system, provided every learned per-mode policy satisfies both mode invariance and goal reachability.
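
For concreteness, an illustrative formula (not the paper's exact specification) for visiting three modes m1, m2, m3 in order can be written with the eventually (◇) and until (U) operators as

    φ = ◇(m1 ∧ ◇(m2 ∧ ◇ m3)) ∧ (¬m2 U m1) ∧ (¬m3 U m2),

which requires each mode to be reached eventually and forbids entering a later mode before its predecessor has been visited. The learned continuous policy then has to realize whichever mode sequence such a formula induces.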


How TLI (yellow box) relates to prior work (gray boxes)




    Generically Learned Motion Policy / Motion Policy with Stability Guarantee

    Given a few demonstrations (red trajectories), a generically learned motion policy (state-based behavior cloning) does not guarantee that rollouts reach the goal under perturbations (left), whereas a dynamical systems (DS) policy (a BC variant with the globally asymptotically stable, G.A.S., property) guarantees goal reachability (right).
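
    To make the G.A.S. property concrete, here is a minimal sketch in NumPy (a plain linear DS, not the paper's learned DS): for any negative-definite matrix A, the flow ẋ = A(x − x*) converges to the attractor x* from every start state, which is exactly the goal-reachability guarantee above.

        import numpy as np

        A = -np.eye(2)                        # any negative-definite A yields global asymptotic stability
        x_star = np.array([1.0, 1.0])         # mode goal (attractor)

        def ds_policy(x):
            return A @ (x - x_star)           # velocity command pointing back toward the attractor

        x, dt = np.array([-2.0, 0.5]), 0.01   # arbitrary (or perturbed) start state
        for _ in range(2000):
            x = x + dt * ds_policy(x)
        print(np.round(x, 3))                 # ~[1., 1.]: the goal is reached
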
    Motion Policy without Mode Invariance / with Mode Invariance

    The task is to transition through the white, yellow, pink, and green regions consecutively. The pink region can only be entered from the yellow region, and the green region can only be entered from the pink region. Motion policies without mode invariance (the property that policy rollouts do not leave a mode prematurely) lead to looping despite the LTL's reactivity (left), while motion policies with mode invariance (achieved by boundary estimation and modulation) ensure both constraint satisfaction and goal reachability (right).
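
    A minimal sketch of the modulation step (the half-space boundary and margin below are illustrative assumptions, not the paper's exact construction): near the estimated boundary, the component of the DS velocity pointing out of the mode is projected away, so rollouts stay inside the mode while still flowing toward the attractor.

        import numpy as np

        def modulate(x, v, n, b, margin=0.05):
            # Keep velocity v tangent to the estimated boundary {x : n.x = b}
            # whenever x is close to it and v points outward.
            n = n / np.linalg.norm(n)
            near_boundary = n @ x >= b - margin
            exiting = n @ v > 0.0
            if near_boundary and exiting:
                v = v - (n @ v) * n           # remove the outward normal component
            return v
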
    Iterative Boundary Estimation of Unknown Mode with Cutting Planes

    To modulate motion policies so that they become mode-invariant, the unknown mode boundary is first estimated. Invariance failures detected by sensors are used to find cutting planes that bound the mode, within which the DS flows are then modulated to stay. Note that flows which have left the mode will re-enter it thanks to the LTL's reactivity, so the boundary estimate improves with each iteration.
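
    The iterative estimation can be sketched as simple half-space bookkeeping (an assumed interface, not the paper's code): each sensed invariance failure contributes one cutting plane, and the intersection of all recorded planes is the current mode estimate that the modulation above keeps flows inside of.

        import numpy as np

        planes = []                                    # list of (normal, offset) half-spaces

        def add_cutting_plane(x_fail, v_exit):
            # Record a cutting plane at a sensed invariance failure x_fail,
            # assuming the exit velocity v_exit approximates the outward normal.
            n = v_exit / np.linalg.norm(v_exit)
            planes.append((n, float(n @ x_fail)))

        def inside_mode_estimate(x):
            return all(n @ x <= b for n, b in planes)  # intersection of all half-spaces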


More Videos


    Generalization to New Tasks by Reusing Learned Skills

    LTL-DS generalizes to new task structures (encoded by LTL) by flexibly recombining the individual skills learned from demonstrations; a code sketch of this recombination follows the list of videos below. Consider a demonstration of adding chicken (visiting the yellow region) and then broccoli (visiting the green region) to a pot (visiting the gray region). Once individual DS policies for visiting the yellow, green, and gray regions are learned, they can be recombined under a new LTL formula (see the paper) to solve new tasks such as (1) adding broccoli and then chicken, (2) adding only chicken, or (3) continuously adding chicken. Note that the white region represents an empty spoon, and crossing from yellow/green into white means spilling the food.

    Line Inspection Task


    Color Tracing Task


    Scooping Task
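
    As referenced in the first item of this list, here is a rough sketch of the skill reuse (region centers, orderings, and the reach test are made-up placeholders): the same per-mode DS policies are stored once and simply re-sequenced according to whichever mode ordering a new LTL formula induces.

        import numpy as np

        def make_ds(attractor):
            # Illustrative stable DS that flows toward a region center.
            return lambda x: -(x - attractor)

        learned_skills = {                             # one DS per mode, learned once from demonstrations
            "yellow": make_ds(np.array([0.0, 1.0])),   # chicken region (made-up coordinates)
            "green":  make_ds(np.array([1.0, 1.0])),   # broccoli region
            "gray":   make_ds(np.array([0.5, 0.0])),   # pot region
        }

        task_orderings = {                             # different LTL formulas induce different orderings
            "chicken_then_broccoli": ["yellow", "gray", "green", "gray"],
            "broccoli_then_chicken": ["green", "gray", "yellow", "gray"],
            "only_chicken":          ["yellow", "gray"],
        }

        def execute(ordering, x, dt=0.01, tol=0.05):
            for target in ordering:                    # reuse the same skills in a new sequence
                ds = learned_skills[target]
                while np.linalg.norm(ds(x)) > tol:
                    x = x + dt * ds(x)                 # flow until the region center is reached
            return x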



MIT Museum Demo



A permanent interactive exhibit at the MIT Museum on programming robots through demonstrations





Grounding Language Plans in Demonstrations through Counterfactual Perturbations
Yanwei Wang, Tsun-Hsuan Wang, Jiayuan Mao, Michael Hagenow, Julie Shah
arxiv / code (coming soon) / project page
ICLR 2024 (Spotlight, acceptance rate: 5%)

This work learns grounding classifiers for LLM planning. By locally perturbing a few human demonstrations, we augment the dataset with additional successful executions and failing counterfactuals. Our end-to-end, explanation-based network is trained to differentiate successes from failures and, as a by-product, learns classifiers that ground continuous states into discrete manipulation mode families without dense labeling.
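
Very roughly, the idea can be sketched as follows (the architecture below is an assumption for illustration and may differ from the released code): a per-state mode classifier is trained end-to-end from only trajectory-level success/failure labels, obtained from the original demonstrations and their locally perturbed counterfactuals, so mode grounding emerges without dense per-state annotation.

    import torch
    import torch.nn as nn

    class ModeGrounder(nn.Module):
        # Conceptual sketch: states -> soft mode assignments -> trajectory success score.
        def __init__(self, state_dim=4, n_modes=3):
            super().__init__()
            self.mode_classifier = nn.Sequential(
                nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_modes))
            self.success_head = nn.Linear(n_modes, 1)   # scores a trajectory from its mode profile

        def forward(self, traj):                        # traj: (T, state_dim)
            mode_probs = self.mode_classifier(traj).softmax(dim=-1)
            pooled = mode_probs.mean(dim=0)             # trajectory-level mode usage
            return self.success_head(pooled), mode_probs

    # Training: binary cross-entropy between the predicted score and the
    # success/failure label of each (possibly perturbed) trajectory; the
    # per-state mode_probs are the grounding classifiers recovered as a by-product.
    model, loss_fn = ModeGrounder(), nn.BCEWithLogitsLoss()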