Parse-Augment-Distill:
Learning Generalizable Bimanual Visuomotor Policies from Single Human Video

Interactive Robot Learning Lab
Department of Artificial Intelligence, University of Groningen

Abstract

Learning visuomotor policies from expert demonstrations is an important frontier in modern robotics research; however, most popular methods require considerable effort to collect teleoperation data and struggle to generalize out of distribution. Scaling data collection has been explored both by leveraging human videos and through demonstration augmentation techniques. The latter approach typically requires expensive simulation rollouts and trains policies on synthetic image data, thereby introducing a sim-to-real gap. In parallel, alternative state representations such as keypoints have shown great promise for category-level generalization. In this work, we bring these avenues together in a unified framework, PAD (Parse-Augment-Distill), for learning generalizable bimanual policies from a single human video. Our method consists of three steps: (a) parsing a human video demo into a robot-executable keypoint-action trajectory, (b) employing bimanual task-and-motion planning (TAMP) to augment the demonstration at scale without simulators, and (c) distilling the augmented trajectories into a keypoint-conditioned policy. Empirically, we show that PAD outperforms state-of-the-art bimanual demonstration augmentation methods that rely on image policies and simulation rollouts, both in success rate and in sample/cost efficiency. We deploy our framework on six diverse real-world bimanual tasks such as pouring drinks, cleaning trash and opening containers, producing one-shot policies that generalize to unseen spatial arrangements, object instances and background distractors.

Parse-Augment-Distill Framework

Given one human video demonstration, PAD executes three sequential steps: (a) parsing the video into a robot-executable state-action trajectory, (b) spatially augmenting the demo at scale via bimanual TAMP, and (c) distilling the generated data into a closed-loop keypoint policy. The obtained policy generalizes to unseen spatial arrangements, object instances and background scene noise.

Parse

Parsing the source video into a robot-executable keypoint-action trajectory. Keypoints are annotated manually and tracked across frames with a point tracker. End-effector actions are produced by mapping hand points, estimated by a hand pose estimator, to 6-DoF gripper actions.
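
To make the mapping concrete, below is a minimal Python/NumPy sketch of how tracked 3D hand points could be converted into 6-DoF gripper actions. The specific points used (thumb tip, index tip, wrist), the frame construction and the placeholder estimate_hand function are illustrative assumptions, not the paper's exact procedure.

import numpy as np

def hand_points_to_gripper(thumb_tip, index_tip, wrist):
    """Map three 3D hand points (in a calibrated frame) to a gripper action."""
    pos = 0.5 * (thumb_tip + index_tip)            # grasp centre between fingertips
    width = np.linalg.norm(thumb_tip - index_tip)  # gripper opening width

    # Build an orthonormal frame: x along the finger-closing axis,
    # z roughly along the wrist-to-fingers (approach) direction.
    x = (index_tip - thumb_tip) / (np.linalg.norm(index_tip - thumb_tip) + 1e-8)
    approach = pos - wrist
    z = approach - np.dot(approach, x) * x
    z = z / (np.linalg.norm(z) + 1e-8)
    y = np.cross(z, x)
    rot = np.stack([x, y, z], axis=1)              # 3x3 rotation with columns x, y, z
    return pos, rot, width

# One action per frame and per hand, e.g.:
# trajectory = [hand_points_to_gripper(*estimate_hand(frame)) for frame in video]
# where estimate_hand stands in for the hand pose estimator plus depth lifting.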

Augment

Grounding trajectory segments with bimanual task templates. The video is abstracted into a symbolic template denoting task stages, hand-object assignments and synchronization requirements. For each arm and stage, exact timestamps in the demo are grounded into motion (orange), idle (gray), asynchronous-skill (blue) and synchronous-skill (green) segments.
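
A possible data layout for such a template is sketched below; the field names and segment labels mirror the caption (motion, idle, asynchronous skill, synchronous skill), but the concrete structure is an assumption for illustration, not the paper's code.

from dataclasses import dataclass
from typing import Literal, Optional

SegmentType = Literal["motion", "idle", "async_skill", "sync_skill"]

@dataclass
class Segment:
    arm: Literal["left", "right"]
    seg_type: SegmentType
    obj: Optional[str]            # hand-object assignment, None for idle/motion
    t_start: int                  # grounded start frame in the source demo
    t_end: int                    # grounded end frame in the source demo

@dataclass
class TaskTemplate:
    stages: list[list[Segment]]   # per stage, one segment per arm
    sync_stages: set[int]         # stage indices requiring arm synchronization

# Example: stage 0 is asynchronous (each arm grasps its own object),
# stage 1 is synchronous (both arms act together).
template = TaskTemplate(
    stages=[
        [Segment("left", "async_skill", "bottle", 10, 80),
         Segment("right", "async_skill", "cup", 25, 95)],
        [Segment("left", "sync_skill", "bottle", 100, 160),
         Segment("right", "sync_skill", "cup", 100, 160)],
    ],
    sync_stages={1},
)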

Augmenting keypoint-action trajectories with SE(3)-equivariant transforms. EE actions are produced such that the relative pose between robot and object remains the same. Keypoint tracks are produced under the assumption that they move rigidly with the EE, preserving their relative pose during grasping.
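
The sketch below illustrates the core operation in NumPy: a sampled rigid transform taking an object from its demo pose to a new pose is applied to the EE poses of the segment that manipulates it, and the grasped keypoints are moved rigidly along with it, so all relative poses are preserved. Matrix conventions and function names are assumptions, not the released code.

import numpy as np

def augment_segment(T_new_old, ee_poses, keypoints):
    """Apply a rigid transform to one trajectory segment.

    T_new_old: (4, 4) homogeneous transform from demo object pose to new pose.
    ee_poses:  (N, 4, 4) end-effector poses of the segment.
    keypoints: (N, K, 3) tracked 3D keypoints on the grasped object.
    """
    # EE poses are left-multiplied, keeping the relative robot-object pose fixed.
    new_poses = np.einsum("ij,njk->nik", T_new_old, ee_poses)

    # Keypoints move rigidly with the object/EE: transform homogeneous points.
    kp_h = np.concatenate([keypoints, np.ones((*keypoints.shape[:2], 1))], axis=-1)
    new_kp = np.einsum("ij,nkj->nki", T_new_old, kp_h)[..., :3]
    return new_poses, new_kp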

Handling bimanual issues during augmentation, such as novel hand-object assignments (top) and re-synchronization between arms before synchronous stages (bottom).
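
As a simple illustration of the re-synchronization case, the sketch below pads the arm that finishes its preceding segment earlier with hold-pose (idle) actions, so both arms enter a synchronous stage at the same timestep. This is one plausible realization under our assumptions, not necessarily the paper's exact mechanism.

def resynchronize(left_traj, right_traj):
    """Pad the shorter per-arm action list by repeating its last pose (idle)."""
    n = max(len(left_traj), len(right_traj))
    left = left_traj + [left_traj[-1]] * (n - len(left_traj))
    right = right_traj + [right_traj[-1]] * (n - len(right_traj))
    return left, right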

Augment the demo trajectory with bimanual TAMP.

Our spatial augmentation framework is:

  • general, as it doesn't depend on object semantics and doesn't require skill labels.
  • simulation-free, as it doesn't require simulation rollouts with digital twins, drastically reducing collection time (1000 demos in under 1 min.).
  • view-invariant, as generated data are expressed w.r.t. a calibrated frame, independent of downstream camera placement (see the sketch after this list).
  • bimanual-aware, as it handles bimanual issues such as out-of-range hand-object assignments and arm de-synchronization.
  • embodiment-agnostic, as all planning is done in EE space, so any robot morphology and motion planner can be plugged in.
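
For the view-invariance point above, the sketch below shows the assumed operation of expressing camera-frame keypoints in a calibrated base frame via the camera extrinsics; the symbol names (T_base_cam) are illustrative.

import numpy as np

def to_base_frame(T_base_cam, points_cam):
    """points_cam: (K, 3) keypoints in camera frame -> (K, 3) in base frame."""
    pts_h = np.concatenate([points_cam, np.ones((len(points_cam), 1))], axis=-1)
    return (T_base_cam @ pts_h.T).T[:, :3]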

Distill

Kp-RDT architecture

Example rollout from robot-view

Distill the generated data into a closed-loop, keypoint-conditioned diffusion policy.
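
The snippet below is a heavily simplified, hypothetical sketch of keypoint conditioning in a diffusion-style denoiser (the actual Kp-RDT architecture is more involved): current 3D keypoints and robot proprioception form the condition for predicting the noise on an action chunk. All module names and sizes are assumptions for illustration.

import torch
import torch.nn as nn

class KeypointConditionedDenoiser(nn.Module):
    def __init__(self, num_kp=16, proprio_dim=14, act_dim=14, horizon=16, hidden=256):
        super().__init__()
        self.cond = nn.Sequential(
            nn.Linear(num_kp * 3 + proprio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.denoise = nn.Sequential(
            nn.Linear(horizon * act_dim + hidden + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * act_dim),
        )
        self.horizon, self.act_dim = horizon, act_dim

    def forward(self, noisy_actions, t, keypoints, proprio):
        # keypoints: (B, num_kp, 3), proprio: (B, proprio_dim),
        # noisy_actions: (B, horizon, act_dim), t: (B,) diffusion timestep.
        c = self.cond(torch.cat([keypoints.flatten(1), proprio], dim=-1))
        x = torch.cat([noisy_actions.flatten(1), c, t.float().unsqueeze(-1)], dim=-1)
        eps = self.denoise(x)
        return eps.view(-1, self.horizon, self.act_dim)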

Sample-efficient Policy Learning without Simulators

PAD achieves higher success rates, better sample efficiency and faster data collection, as it doesn't require expensive simulation rollouts to generate data.

Generalizable One-Shot Bimanual Policies

BibTeX

@misc{tziafas2025parseaugmentdistilllearninggeneralizablebimanual,
      title={Parse-Augment-Distill: Learning Generalizable Bimanual Visuomotor Policies from Single Human Video}, 
      author={Georgios Tziafas and Jiayun Zhang and Hamidreza Kasaei},
      year={2025},
      eprint={2509.20286},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.20286}, 
}