Bimanual manipulation (BM) is a critical challenge in robotic learning, requiring precise coordination of two arms to complete complex tasks. However, training robots for BM relies heavily on robot demonstration data that is costly and time-consuming to collect. In contrast, human videos, which are more cost-effective and easier to gather, have attracted significant attention. This paper explores whether leveraging human videos can enhance performance and generalization in BM. We propose H2R-BM, a framework that uses human demonstration videos as a scalable alternative to robot data. By transferring knowledge from the human to the robot domain, H2R-BM reduces reliance on expensive robot data while maintaining task performance. Central to our approach is the Spatial-Temporal Alignment (STA) module, which ensures consistency between human and robot demonstrations. H2R-BM not only lowers the barrier to robotic learning but also advances the development of versatile and adaptive robotic systems. Extensive experiments show that H2R-BM offers three key advantages: (1) Performance Improvement: Human videos significantly boost performance across diverse long-horizon BM tasks at varying difficulty levels. (2) Enhanced Generalization: Human videos improve generalization capabilities, including object, viewpoint, and positional generalization. (3) Human Video Scaling Law: H2R-BM effectively exploits the efficiency of human data collection, exhibiting a pronounced scaling effect that boosts task performance substantially beyond what is achievable with robot data alone. H2R-BM introduces a novel paradigm in robotic learning, paving the way for future bimanual manipulation research.