Do as I Do: Turning Everyday Human Videos Into Dexterous Robot Data (2026)

Do as I Do is a two-step algorithm that reconstructs and retargets hand-object interactions from monocular RGB videos onto multi-fingered dexterous robot hands.

Our hand-object reconstruction process outperforms the state of the art on relevant metrics and handles diverse videos, whether egocentric or exocentric, ranging from in-the-wild internet clips to outputs of generative video models.

Our retargeting process improves upon existing scalable dynamics-aware retargeting techniques by introducing novel components that robustify noisy reconstructed reference trajectories.

The resulting robot data is playable on a dexterous robot hand and arm, completing the first pipeline that can go from an internet video to real dexterous hand rollouts.

Reconstruction Method

The reconstruction process takes monocular RGB video as input and outputs a full hand-object trajectory. It works across camera viewpoints and video quality levels, from professional recordings to casual smartphone clips.

Figure: The Do as I Do pipeline, from input video through reconstruction to robot execution.

Retargeting Method

The retargeting step aims to reproduce the reconstructed hand-object trajectory on a robot hand. However, human and robot morphologies differ, and the kinematic signal carries no contact or force information. Prior works address this with kinematic solvers or robotic heuristics, but these approaches either fail to ensure physical plausibility or lose general-purpose expressiveness.

Do as I Do performs dynamics-aware retargeting, which follows the reference while ensuring physical realism inside a physics simulation. Building on the Model Predictive Path Integral (MPPI) framework, the method uses sampling-based optimization with a kernel that is annealed across both iterations and the prediction horizon, shifting from broad exploration to local refinement.
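To make the annealing idea concrete, here is a minimal sketch of MPPI-style sampling with a noise scale that varies over both iterations and the prediction horizon. It is not the Do as I Do implementation: the toy 1-D integrator, the tracking cost, the exponential schedule in `annealed_sigma`, and every parameter name and value (`sigma0`, `beta_iter`, `beta_horizon`, `temperature`) are assumptions chosen for illustration. In particular, the direction of the horizon annealing (less noise on near-term controls, more on far-term ones) is a guess, since the text does not specify it.

```python
import numpy as np


def annealed_sigma(n_iters, horizon, sigma0=0.5, beta_iter=0.5, beta_horizon=2.0):
    """Hypothetical noise schedule (an assumption, not from the source).

    The noise scale shrinks as iterations progress and is smaller for
    near-term controls than for controls late in the horizon.
    """
    i = np.arange(n_iters)[:, None]   # iteration index, shape (n_iters, 1)
    t = np.arange(horizon)[None, :]   # horizon index, shape (1, horizon)
    iter_decay = np.exp(-i / (beta_iter * n_iters))
    horizon_scale = np.exp(-(horizon - 1 - t) / (beta_horizon * horizon))
    return sigma0 * iter_decay * horizon_scale  # shape (n_iters, horizon)


def rollout_cost(x0, controls, reference):
    """Toy dynamics x_{t+1} = x_t + u_t with a squared tracking cost.

    A real system would roll out a physics simulator here instead.
    """
    states = x0 + np.cumsum(controls, axis=-1)
    return np.sum((states - reference) ** 2, axis=-1)


def annealed_mppi(x0, reference, n_iters=8, n_samples=256, temperature=0.1, seed=0):
    """Refine a control sequence by repeated sampling and softmax reweighting."""
    rng = np.random.default_rng(seed)
    horizon = reference.shape[0]
    mean = np.zeros(horizon)          # nominal control sequence
    sigmas = annealed_sigma(n_iters, horizon)
    for i in range(n_iters):
        # Sample control sequences around the current mean with annealed noise.
        noise = rng.standard_normal((n_samples, horizon)) * sigmas[i]
        candidates = mean + noise
        costs = rollout_cost(x0, candidates, reference)
        # MPPI update: exponentially weight low-cost samples and average them.
        weights = np.exp(-(costs - costs.min()) / temperature)
        weights /= weights.sum()
        mean = weights @ candidates
    return mean


if __name__ == "__main__":
    reference = np.linspace(0.1, 1.0, 10)   # toy reference trajectory
    plan = annealed_mppi(0.0, reference)
    print("cost with zero controls:", rollout_cost(0.0, np.zeros(10), reference))
    print("cost after annealed MPPI:", rollout_cost(0.0, plan, reference))
```

Early iterations sample widely, so the weighted average can move far from a poor initial guess. Later iterations sample narrowly around the current mean, which corresponds to the local-refinement phase described in the text.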

Experimental Setup

Across all tasks, the 22-DoF Sharpa Wave hand is used. Real-world deployment results are demonstrated on a bimanual setup with Sharpa Wave hands and UR3e arms, both commanded at 50 Hz.
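A fixed 50 Hz command rate means any reference trajectory has to be expressed on a 20 ms time grid before it is sent to the hardware. The sketch below shows one simple way to do that, by linearly interpolating a joint trajectory from a source rate onto a 50 Hz grid. The function name `resample_trajectory`, the 30 Hz source rate in the usage example, and the choice of linear interpolation are all assumptions for illustration; the text does not describe how Do as I Do times its trajectories.

```python
import numpy as np


def resample_trajectory(joint_positions, source_hz, target_hz=50.0):
    """Linearly interpolate a (T, D) joint trajectory onto a target_hz time grid.

    Hypothetical helper: per-joint linear interpolation is a simplification
    and ignores joint limits, velocity limits, and angle wrap-around.
    """
    joint_positions = np.asarray(joint_positions, dtype=float)
    n_frames, n_joints = joint_positions.shape
    source_times = np.arange(n_frames) / source_hz
    duration = (n_frames - 1) / source_hz
    # Number of target samples that fit inside the source duration.
    n_targets = int(np.floor(duration * target_hz + 1e-9)) + 1
    target_times = np.arange(n_targets) / target_hz
    resampled = np.stack(
        [np.interp(target_times, source_times, joint_positions[:, j])
         for j in range(n_joints)],
        axis=1,
    )
    return target_times, resampled


if __name__ == "__main__":
    # One second of a 22-DoF trajectory at an assumed 30 Hz source rate.
    rng = np.random.default_rng(0)
    trajectory = np.cumsum(rng.normal(scale=0.01, size=(31, 22)), axis=0)
    times, commands = resample_trajectory(trajectory, source_hz=30.0)
    print(commands.shape, times[1] - times[0])
```

Each row of `commands` would then be sent at one 20 ms tick. A deployed controller would typically also enforce joint and velocity limits, which this sketch omits.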

Retargeting Results

On reconstructed in-the-wild data, Do as I Do reaches a 71% success rate, up from 25% for the baseline. The main differentiator is warmup, which discovers initial states that are much more stable and natural than the noisy initial frame, thereby leading to successful tracking in subsequent timesteps. Perturbation noticeably improves the qualitative results (e.g., more natural grasps) even though it only marginally affects the quantitative metrics. The transition reward encourages successful picks and places for trajectories that would otherwise have missed the object during crucial transition timesteps.

Validating the method on OakInk2 also shows consistent improvement with each component, moving from a baseline of 72% up to 81%. This demonstrates that the retargeting approach, despite being designed for imperfect reconstructed references, produces effective gains even with clean MoCap trajectories and scales well to the 1,000+ bimanual tasks in this benchmark.

Conclusion

Do as I Do provides a framework for reconstructing and retargeting everyday human videos onto dexterous robot hands. The method is effective across egocentric, exocentric, and online video sources, showing a path towards scaling robot data by simply observing humans.

Limitations. The approach assumes rigid objects and semi-accurate metric depth predictions from monocular RGB, and may fail when either assumption does not hold. Monocular observations also suffer from ambiguity in the true hand-object distance, making it difficult to distinguish physical contact from mere visual occlusion. The method reconstructs only the hand and an object, rather than the full scene, and cannot reason about environmental constraints such as obstacles or articulations. Finally, current physics simulators model real-world dynamics only approximately, which places an upper bound on achievable real-world performance.

Frequently Asked Questions

What video types does Do as I Do support? The method handles egocentric, exocentric, and in-the-wild internet videos, as well as outputs from generative video models.

How does the retargeting handle differences between human and robot hands? It uses dynamics-aware retargeting with MPPI-style sampling-based optimization, plus novel components such as warmup, perturbation, and a transition reward that make it robust to noisy reconstructed references.

What hardware was used for real-world validation? All tasks used the 22-DoF Sharpa Wave hand. Real-world deployment was demonstrated on a bimanual setup with Sharpa Wave hands and UR3e arms, both commanded at 50 Hz.

What are the main limitations of the current approach? The method assumes rigid objects, requires semi-accurate metric depth from monocular RGB, and cannot reconstruct full scenes or reason about environmental constraints.