H2R-BM: Can Leveraging Human Videos Enhance Performance and Generalizability in Bimanual Manipulation?

1Xiaomi EV, Beijing, China, 2Institute of Automation, CAS, Beijing, China, 3Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, 4Inner Mongolia University, Hohhot, China, 5School of Computer Science, Wuhan University, Wuhan, China
*Corresponding author

Motivation of H2R-BM

Teaser
Acquiring robot videos via teleoperation is costly and inefficient, whereas human videos from wearable devices provide a cheaper, more efficient alternative. H2R-BM leverages human videos to minimize dependence on robot data for bimanual manipulation, reducing data collection costs.

Abstract

Bimanual manipulation (BM) is a critical challenge in robotic learning, requiring precise coordination of two arms for complex tasks. However, training robots for BM relies heavily on costly and time-consuming robot demonstration data. In contrast, human videos, being more cost-effective and easier to collect, have gained significant attention. This paper explores whether leveraging human videos can enhance performance and generalizability in BM. We propose H2R-BM, a framework that uses human demonstration videos as a scalable alternative to robot data. By transferring knowledge from human to robot domains, H2R-BM reduces reliance on expensive robot data while maintaining task performance. Central to our approach is the Spatial-Temporal Alignment (STA) module, which ensures consistency between human and robot demonstrations. H2R-BM not only lowers the barrier to robotic learning but also advances the development of versatile and adaptive robotic systems. Extensive experiments show that H2R-BM offers three key advantages: (1) Performance Improvement: Human videos significantly boost performance across diverse long-horizon BM tasks at varying difficulty levels. (2) Enhanced Generalization: Human videos improve generalization capabilities, including object, viewpoint, and positional generalization. (3) Human Video Scaling Law: H2R-BM effectively leverages the efficiency of human data collection, exhibiting a pronounced scaling effect that substantially boosts task performance beyond what is achievable with robot data alone. H2R-BM introduces a novel paradigm in robotic learning, paving the way for future bimanual manipulation research.

Retry Phenomenon Example

With H2R-BM

Without H2R-BM

With H2R-BM

Without H2R-BM

Framework
Overview of H2R-BM. Visual observations and proprioceptive states from both human and robot demonstrations are processed by the Spatial-Temporal Alignment (STA) module and a shared visuomotor backbone. The resulting feature representations are used in two ways: (1) both human and robot features are fed to the 12 DoF Trajectory Head for general skill learning, and (2) only robot features are fed to the 14 DoF Action Head for robot control learning.
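The key structural idea above — a shared backbone whose features feed a trajectory head for both domains but an action head for robot data only — can be sketched in a few lines. This is a minimal illustrative sketch, not the released implementation: the class and variable names are assumptions, and the real STA module and visuomotor backbone are far more elaborate than the placeholder layers used here.

```python
import torch
import torch.nn as nn


class H2RBMSketch(nn.Module):
    """Illustrative sketch of the H2R-BM dual-head design (names assumed)."""

    def __init__(self, obs_dim: int = 128, feat_dim: int = 64):
        super().__init__()
        # Placeholder for the STA module + shared visuomotor backbone.
        self.backbone = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        # 12 DoF Trajectory Head: trained on both human and robot features.
        self.traj_head = nn.Linear(feat_dim, 12)
        # 14 DoF Action Head: trained on robot features only.
        self.action_head = nn.Linear(feat_dim, 14)

    def forward(self, obs: torch.Tensor, is_robot: bool):
        feat = self.backbone(obs)           # aligned features (STA in the real system)
        traj = self.traj_head(feat)         # general skill learning (both domains)
        # Human clips carry no robot action labels, so the action head is skipped.
        action = self.action_head(feat) if is_robot else None
        return traj, action


model = H2RBMSketch()
human_obs = torch.randn(4, 128)
traj, action = model(human_obs, is_robot=False)   # action is None for human videos
robot_obs = torch.randn(4, 128)
traj_r, action_r = model(robot_obs, is_robot=True)
```

In this sketch, gating the action head on `is_robot` is what lets cheap human videos supervise the shared backbone and trajectory head while robot control learning still depends only on robot data.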

Example Training Data

Human Video

Robot Video

Performance Results

Generalization Results

Scaling Law