MOCHI Cleans Up Noisy Multi-Human Object Interaction Data

Jiye Lee, Yonghun Choi, Jungdam Won

6 min read · Jun 17, 2026

Two people lifting a table, handing a tool back and forth, or assembling a piece of furniture — these collaborative human-object interactions are messy to capture. Motion capture systems struggle with hand-object contact misalignment, temporal jitter, and missing finger details when multiple people and a shared object are involved. MOCHI (MOtion Enhancement of Collaborative Human-object Interactions) is a two-stage framework that takes that noisy data and outputs clean, physically plausible multi-human object interaction sequences.

What the Researchers Built

MOCHI is a two-stage pipeline that enhances noisy motion capture data of multiple humans interacting with the same object (multi-human object interaction, or MHOI). The first stage focuses on hand-object contact: given noisy body pose input, it optimizes hand grasps to be both physically plausible (no penetration, stable contact) and semantically consistent with the body motion. These optimized grasps are then extended into full hand-object interaction sequences.

The second stage refines the entire motion of all participants using a diffusion-based noise optimization framework. Because diffusion models typically work with single-person motion priors, the researchers introduced new optimization objectives that encode human-object and human-human interaction information directly into those single-person priors. The result is a complete, temporally consistent, and physically coherent multi-person animation.

MOCHI works on data from any source — captured by existing motion capture systems or synthesized by generative models — and can handle varying numbers of participants and interaction types. It also enables practical applications like keyframe-based MHOI creation and data augmentation by swapping object geometries.

Key Results

The abstract reports no numerical benchmarks; instead, the researchers demonstrate the pipeline on diverse MHOI datasets. The qualitative results show reductions in three kinds of artifact:

  • Contact misalignment – hands no longer float near or pass through objects.
  • Motion jitter – temporal inconsistencies are smoothed without losing dynamic detail.
  • Missing finger articulation – finger-level motion is recovered and synchronized with body pose.
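
Because no numbers are reported, it can help to see how an artifact like jitter might be measured. The sketch below uses mean jerk (the third time derivative of a joint position), a common smoothness proxy. It is not a metric taken from the MOCHI paper, and the function name `mean_jerk` and the synthetic trajectories are assumptions made for illustration.

```python
import math
import random

def mean_jerk(track, fps):
    """Mean jerk magnitude for one joint.

    track: list of (x, y, z) positions, one per frame.
    Jerk is approximated by the third finite difference divided by dt**3.
    """
    if len(track) < 4:
        raise ValueError("need at least four frames")
    dt = 1.0 / fps
    total = 0.0
    for i in range(len(track) - 3):
        p0, p1, p2, p3 = track[i:i + 4]
        jerk = [(d - 3.0 * c + 3.0 * b - a) / dt**3
                for a, b, c, d in zip(p0, p1, p2, p3)]
        total += math.sqrt(sum(j * j for j in jerk))
    return total / (len(track) - 3)

# Synthetic data (hypothetical): a smooth path and the same path with sensor-like noise.
fps = 30.0
rng = random.Random(0)
smooth = [(math.sin(i / fps), 0.0, 0.0) for i in range(60)]
noisy = [tuple(v + rng.gauss(0.0, 0.005) for v in p) for p in smooth]

smooth_jerk = mean_jerk(smooth, fps)
noisy_jerk = mean_jerk(noisy, fps)
print(f"smooth: {smooth_jerk:.2f}  noisy: {noisy_jerk:.2f}")
```

Even millimeter-scale noise dominates the jerk of the underlying motion, which is why jitter is so visible in raw captures. A refinement method has to lower this value without flattening motion that is genuinely fast.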

The system is robust to varying participant counts (dyads, triads, and larger groups) and interaction types (lifting, handing over, assembling). It also has two practical uses. In keyframe-based MHOI creation, an animator specifies a few key poses and the system generates the full interaction. In data augmentation, the object shape is changed while natural human-object contact is maintained.

How It Works

MOCHI works in two sequential stages. Stage one addresses hand-object contact. Given noisy body motion (positions and rotations of bones, but missing or noisy hand data), the system formulates an optimization problem that searches for hand poses satisfying two criteria: physical plausibility (minimal object penetration, stable grasp points) and semantic consistency (the grasp looks natural for the body configuration, e.g., a power grip vs. precision pinch when lifting a heavy box). The optimizer uses a physics-inspired cost function that penalizes interpenetration and rewards surface contact area. It outputs a smooth, temporally consistent sequence of hand poses that match the object motion inferred from the body.
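
The idea of a cost that penalizes interpenetration and rewards contact can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the authors' formulation: the object is a sphere with a known signed distance function, the hand is reduced to free-floating fingertip points, and the weights `w_pen` and `w_contact` are hypothetical. A real system would optimize articulated hand pose parameters against an arbitrary object mesh.

```python
import math

def sphere_sdf(p, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(p, center) - radius

def grasp_cost(tips, center, radius, w_pen=10.0, w_contact=1.0):
    """Penetration is weighted more heavily than a gap between finger and surface."""
    cost = 0.0
    for p in tips:
        d = sphere_sdf(p, center, radius)
        cost += (w_pen if d < 0.0 else w_contact) * d * d
    return cost

def refine_tips(tips, center, radius, steps=200, lr=0.04, w_pen=10.0, w_contact=1.0):
    """Plain gradient descent on grasp_cost, moving each fingertip independently."""
    tips = [list(p) for p in tips]
    for _ in range(steps):
        for p in tips:
            dist = math.dist(p, center)
            d = dist - radius
            w = w_pen if d < 0.0 else w_contact
            # d(cost)/dp = 2 * w * d * n, where n is the outward unit normal.
            scale = lr * 2.0 * w * d / max(dist, 1e-9)
            for k in range(3):
                p[k] -= scale * (p[k] - center[k])
    return [tuple(p) for p in tips]

center, radius = (0.0, 0.0, 0.0), 0.1
tips = [
    (0.15, 0.0, 0.0),   # floating 5 cm off the surface
    (0.0, 0.07, 0.0),   # penetrating 3 cm into the object
    (0.0, 0.0, -0.1),   # already in contact
]
refined = refine_tips(tips, center, radius)
cost_before = grasp_cost(tips, center, radius)
cost_after = grasp_cost(refined, center, radius)
print(cost_before, cost_after)
```

The asymmetric weights reflect a common design choice: a finger passing through an object looks worse than a finger hovering slightly above it, so penetration is corrected more aggressively.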

Stage two refines the full-body motion of all participants. This stage treats motion refinement as a diffusion-based noise optimization problem. It starts with the raw noisy sequence and iteratively denoises it using a pretrained single-person diffusion model. The key innovation is a pair of interaction-aware objectives injected into the denoising loop:

  • Human-object objective: ensures that each person’s hands stay properly aligned with the object without violating contact constraints.
  • Human-human objective: prevents penetrations and maintains plausible spatial relationships between participants (e.g., two people facing each other during a handoff).

Because these objectives are applied as optimization terms inside the diffusion sampling process, the final output is a clean, multi-person motion that respects all physical and interaction constraints. No additional multi-person diffusion model training is required.
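
A toy version of this scheme shows how interaction terms can steer a frozen single-person prior. Everything here is an assumption made for illustration: the "prior" is a 3-frame moving average standing in for a deterministic diffusion sampler, each person is reduced to a single 1-D hand coordinate, the latents are initialized from the noisy capture, and the names and weights (`decode`, `w_obj`, `w_hh`, `min_sep`) are hypothetical. The sketch optimizes the latent inputs by gradient descent while the decoder stays fixed.

```python
import random

def decode(z):
    """Stand-in for a frozen single-person prior (3-frame moving average)."""
    n = len(z)
    return [(z[max(i - 1, 0)] + z[i] + z[min(i + 1, n - 1)]) / 3.0 for i in range(n)]

def decode_transpose(g):
    """Back-propagate a gradient through decode (its linear transpose)."""
    n = len(g)
    out = [0.0] * n
    for i in range(n):
        for j in (max(i - 1, 0), i, min(i + 1, n - 1)):
            out[j] += g[i] / 3.0
    return out

def loss_and_grads(z_a, z_b, obs_a, obs_b, obj, half_width, min_sep, w_obj, w_hh):
    x_a, x_b = decode(z_a), decode(z_b)
    loss, g_a, g_b = 0.0, [], []
    for xa, xb, oa, ob, c in zip(x_a, x_b, obs_a, obs_b, obj):
        ra, rb = xa - oa, xb - ob                  # fidelity to the noisy capture
        ca = xa - (c - half_width)                 # human-object: hand A on left handle
        cb = xb - (c + half_width)                 # human-object: hand B on right handle
        gap = max(0.0, min_sep - (xb - xa))        # human-human: keep a minimum separation
        loss += ra * ra + rb * rb + w_obj * (ca * ca + cb * cb) + w_hh * gap * gap
        g_a.append(2.0 * ra + 2.0 * w_obj * ca + 2.0 * w_hh * gap)
        g_b.append(2.0 * rb + 2.0 * w_obj * cb - 2.0 * w_hh * gap)
    return loss, decode_transpose(g_a), decode_transpose(g_b)

def optimize(obs_a, obs_b, obj, half_width=0.5, min_sep=0.6,
             w_obj=5.0, w_hh=5.0, steps=300, lr=0.02):
    z_a, z_b = list(obs_a), list(obs_b)            # initialize latents from the capture
    history = []
    for _ in range(steps):
        loss, g_a, g_b = loss_and_grads(z_a, z_b, obs_a, obs_b, obj,
                                        half_width, min_sep, w_obj, w_hh)
        history.append(loss)
        z_a = [z - lr * g for z, g in zip(z_a, g_a)]
        z_b = [z - lr * g for z, g in zip(z_b, g_b)]
    return decode(z_a), decode(z_b), history

# Synthetic capture (hypothetical): two people carry an object that drifts to the right.
rng = random.Random(0)
frames = 40
obj = [0.02 * t for t in range(frames)]
obs_a = [c - 0.5 + rng.gauss(0.0, 0.1) for c in obj]
obs_b = [c + 0.5 + rng.gauss(0.0, 0.1) for c in obj]

x_a, x_b, history = optimize(obs_a, obs_b, obj)
print(f"loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

The two people share nothing except the loss terms. That is the appeal of the approach: each person's motion still passes through a single-person model, and the interaction is enforced entirely by the objective.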

Component | Input | Output | Method
Stage 1 (Hand Grasp Optimization) | Noisy body pose | Optimized hand grasps + full hand sequence | Physics-inspired cost minimization
Stage 2 (Full-Body Refinement) | Body + hand motion from Stage 1 | Clean multi-person motion | Diffusion-based noise optimization with interaction objectives

Why This Matters for Robotics

High-quality motion data of humans handling objects fuels many robotic systems: imitation learning, human-robot collaboration, and synthetic training data generation. Most existing mocap datasets involve a single human interacting with objects, but real-world tasks such as shipping, warehousing, and assembly involve collaborative manipulation. MOCHI lowers the barrier to acquiring such data by cleaning up the inherently noisy recordings.

For companies deploying warehouse robots or cobots that need to work alongside multiple people, having realistic interaction data is critical for training perception and control policies. MOCHI also enables data augmentation (varying object geometry) which helps simulation-to-real transfer. And for humanoid robots learning from human demonstrations, the refined motion can serve as high-quality reference trajectories.

Limitations and Open Questions

The framework depends on the quality of the single-person motion priors used in the diffusion stage. If the priors were trained only on simple, single-person motions (e.g., walking, running), they may struggle to generalize to the complex, coordinated movements of MHOI. The authors address this by injecting interaction objectives during inference, but the robustness to entirely novel interaction types remains untested.

The computational cost of the two-stage optimization is not reported, but iterative diffusion sampling is typically slow, so real-time use is unlikely with current methods. Additionally, MOCHI refines existing noisy data and does not generate entirely new interactions from scratch; the exception is keyframe-based creation, which still requires manually specified key poses.

Frequently Asked Questions

What problem does MOCHI solve? MOCHI cleans up noisy motion capture data of multiple people interacting with the same object, such as lifting a table or handing a tool.

Does MOCHI work with any number of people? Yes, the framework is robust to varying numbers of participants and different interaction types, from two people to larger groups.

Can MOCHI be used to create new motion data? It can generate full interactions from user-specified keyframes, and it supports data augmentation by changing the object geometry while preserving natural contact.

Is MOCHI a generative model or a denoiser? It is primarily a denoising and refinement framework rather than a standalone generative model: it takes noisy input motion and outputs a cleaner version using optimization and diffusion.

Conclusion

MOCHI offers a practical, two-stage solution for cleaning up the messy reality of multi-human object interaction motion capture. By combining hand grasp optimization with interaction-aware diffusion refinement, it produces physically plausible and temporally consistent animations from noisy data. This makes better training data available for collaborative robotics and animation, and the keyframe-based creation and augmentation features make MOCHI a versatile tool.
