Zero-Shot Long-Horizon Dexterous Manipulation with Multi-View 3D Grounded VLM Reasoning (2026)

A long-standing goal in robotics is to build general-purpose systems that perform long-horizon manipulation from high-level language instructions. Beyond recognizing objects, such systems must ground instructions in task-relevant 3D geometry: where to place an object, which part to contact, and how to orient and move a tool during execution. This requirement is especially stringent for dexterous hands, where small 3D grounding errors cause unstable grasps, collisions, inverse-kinematics failures, or contact on the wrong functional region of a tool.

The inferred 3D grounding is coupled with a library of reusable atomic primitives. Tool-use behaviors are represented as a Bag of Atomic Actions, a library of short 6D object trajectories indexed by interaction type. For a new scene, the appropriate primitive is retrieved and aligned to the grounded task geometry. To support dexterous-hand execution, the same multi-view grounding is applied to estimate functional contact regions, generate candidate grasps on those regions, and filter them by inverse-kinematics and collision feasibility over the full tool-use trajectory. For long-horizon tasks, closed-loop verification and retry let the system re-ground or replan after execution failures.
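The retrieve-and-align step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each primitive stores object poses as 4x4 homogeneous transforms in a canonical task frame, and the names `BAG`, `make_pose`, and `retrieve_and_align` are hypothetical.

```python
import numpy as np

def make_pose(yaw_rad, translation):
    """Build a 4x4 homogeneous transform from a yaw angle (about z) and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

# Hypothetical library: interaction type -> object poses in a canonical task frame.
BAG = {
    "sweep": [
        make_pose(0.0, [0.00, 0.0, 0.0]),
        make_pose(0.0, [0.05, 0.0, 0.0]),
        make_pose(0.0, [0.10, 0.0, 0.0]),
    ],
}

def retrieve_and_align(interaction_type, T_world_task):
    """Look up a primitive and re-express its waypoints in the world frame."""
    primitive = BAG[interaction_type]
    return [T_world_task @ T_task_obj for T_task_obj in primitive]

# Assumed grounded task frame: rotated 90 degrees about z, origin at (0.4, 0.2, 0.0).
T_world_task = make_pose(np.pi / 2, [0.4, 0.2, 0.0])
aligned = retrieve_and_align("sweep", T_world_task)
```

Because the primitive is stored relative to the task frame, a single rigid transform obtained from grounding is enough to reuse the same short trajectory in a new scene. Grasp candidates would then be checked for inverse-kinematics and collision feasibility at every aligned waypoint.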

Experiments

Figure: Long-horizon manipulation sequence showing a robot arm with a dexterous hand performing multi-step tool tasks.

The framework is evaluated on zero-shot robot manipulation in a real-world tabletop setting to assess how it scales from simple tasks to long-horizon scenarios. The evaluation covers four key capabilities: (1) target grounding among distractors and robustness to collisions (e.g., placing inferred trash into a basket), (2) spatial-relation reasoning (e.g., placing tools on a stove), (3) affordance-aware tool use (e.g., sweeping objects with a broom), and (4) long-horizon sequencing (e.g., cooking and organizing three to four objects). Additional tool-use scenarios are provided in the supplementary material.

Hardware Setup

The system features an xArm equipped with an Inspire dexterous hand. The tabletop environment is monitored by multiple calibrated RGB cameras, including a stereo pair. FoundationStereo is used for stereo depth estimation and FoundationPose for multi-object 6D pose estimation.

Baselines

The zero-shot framework is compared against an RGB-D grounding baseline and two Vision-Language-Action (VLA) models. The RGB-D baseline predicts a 2D keypoint from a single view and lifts it to 3D using the aligned depth map. The VLA models are pretrained models fine-tuned on 30 teleoperation demonstrations per task, whereas our method operates entirely zero-shot, relying solely on VLM reasoning for 3D grounding and manipulation.
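The lifting step of the RGB-D baseline can be illustrated with standard pinhole back-projection. This is a sketch under assumptions: the intrinsics (`fx`, `fy`, `cx`, `cy`), the pixel, and the depth value below are made-up numbers, and `lift_keypoint_to_3d` is a hypothetical helper name.

```python
def lift_keypoint_to_3d(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    if depth_m <= 0.0:
        raise ValueError("invalid depth at keypoint (hole or missing measurement)")
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# Illustrative values only: a keypoint 100 px right of the principal point at 0.8 m.
point = lift_keypoint_to_3d(u=420, v=240, depth_m=0.8,
                            fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

The 3D target depends entirely on one pixel and one depth reading, so a keypoint that lands on an occluder or a depth hole moves the whole target.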

Metrics

Success Rate. A trial is considered successful if the robot completes the task according to the text instruction. For tasks with a specified target object or target location, we check whether the target object is placed at the desired location after execution.

Collision Error. We evaluate whether the predicted waypoint or placement grounding causes a collision when the manipulated object is placed at the corresponding location. The metric reports the mean, over trials, of the maximum penetration depth between the manipulated object and the surrounding environment.
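One way to compute such a metric is sketched below. It assumes that each trial provides signed distances (in meters, negative inside the environment) sampled on the manipulated object's surface. That input representation, the function names, and the numbers are illustrative assumptions, not taken from the paper.

```python
def max_penetration_depth(signed_distances_m):
    """Deepest penetration for one trial; a negative signed distance means inside the environment."""
    return max(0.0, -min(signed_distances_m))

def collision_error(trials):
    """Mean over trials of the per-trial maximum penetration depth."""
    depths = [max_penetration_depth(sd) for sd in trials]
    return sum(depths) / len(depths)

# Made-up signed distances for three trials.
trials = [
    [0.020, 0.010, 0.005],    # collision-free placement
    [0.010, -0.004, -0.012],  # deepest penetration: 12 mm
    [-0.003, 0.030, 0.015],   # deepest penetration: 3 mm
]
error_m = collision_error(trials)
```

Clamping at zero keeps collision-free trials from lowering the mean below zero, so the reported value reflects penetration only.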

Long-Horizon Success Rate. For sequential tasks, a trial is considered successful only if all required steps are completed in the correct order. Because long-horizon real-robot trials are time-consuming, the number of trials may differ across tasks. We report both the number of trials and the success rate. When retries are used, a trial is counted as successful if the task is completed within the retry budget.
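The success criterion with retries can be sketched as a simple loop. This assumes a single retry budget shared across all steps of a trial; the paper does not state whether the budget is per step or per trial. The `execute` and `verify` callbacks are hypothetical stand-ins for primitive execution (including any re-grounding or replanning) and the verification check.

```python
def run_with_retries(steps, execute, verify, retry_budget):
    """Run steps in order; re-attempt a failed step until the shared retry budget is spent."""
    retries_left = retry_budget
    for step in steps:
        while True:
            execute(step)
            if verify(step):
                break
            if retries_left == 0:
                return False  # budget exhausted: the whole trial counts as a failure
            retries_left -= 1
    return True

def success_rate(outcomes):
    """Fraction of successful trials."""
    return sum(outcomes) / len(outcomes)

# Simulated executor: "place_pan" fails verification on its first attempt only.
attempts = {}

def execute(step):
    attempts[step] = attempts.get(step, 0) + 1

def verify(step):
    return not (step == "place_pan" and attempts[step] == 1)

steps = ["pick_pan", "place_pan", "pick_spatula"]
ok_with_retry = run_with_retries(steps, execute, verify, retry_budget=1)
attempts.clear()
ok_without_retry = run_with_retries(steps, execute, verify, retry_budget=0)
rate = success_rate([ok_with_retry, ok_without_retry])
```

Steps advance only after verification passes, so a trial that returns `True` has completed every step in the required order.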

Discussion

Figure: Bag of Atomic Actions alignment, showing how tool-use primitives are matched to task geometry.

We present a zero-shot, long-horizon manipulation framework that bridges VLM reasoning with physical execution via multi-view 3D grounding. The system decomposes language instructions into sequences of 3D-grounded manipulation primitives and supports both standard pick-and-place and complex tool-use tasks by spatially aligning object-centric atomic actions to the target scene. Experimental results show that the multi-view fusion strategy outperforms single-view RGB-D baselines in spatial accuracy and robustness to occlusion. Furthermore, the primitive-level formulation enables closed-loop execution: the system can verify task progress after each primitive and recover from intermediate failures during long-horizon tasks.

Comparison of 3D Grounding Methods

We further analyze the behavior of the single-view RGB-D grounding baseline and the multi-view grounding approach in cluttered real-world scenes. Due to its reliance on a single observation, the RGB-D baseline is sensitive to occlusion and incomplete geometry, often resulting in misplaced 3D targets. In contrast, the multi-view approach aggregates semantic grounding cues across views and produces more consistent task-relevant 3D estimates in cluttered environments.
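This section does not spell out the fusion rule, so the following is only one plausible illustration of why several calibrated views help: linear (DLT) triangulation of the same keypoint from two cameras. Triangulation is a stand-in technique chosen for the sketch, and the camera matrices, baseline, and test point are made-up values.

```python
import numpy as np

def triangulate(projections, pixels):
    """Linear (DLT) triangulation of one 3D point from two or more calibrated views."""
    rows = []
    for P, (u, v) in zip(projections, pixels):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    # The homogeneous solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 camera matrix and return pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Assumed intrinsics and a 10 cm horizontal baseline between two views.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

X_true = np.array([0.05, -0.02, 0.8])
pixels = [project(P_left, X_true), project(P_right, X_true)]
X_est = triangulate([P_left, P_right], pixels)
```

Unlike single-view lifting, the estimate here needs no depth reading at the keypoint. Additional views add rows to the same linear system, so a target occluded in one view can still be recovered from the others.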

Cylindrical-Template-Based Grasp Generation

For tool-use tasks, directly optimizing fingertip contacts can be insufficient because successful tool use requires grasps that remain stable and action-consistent throughout motion execution. Many household tools contain approximately cylindrical grasp affordances, such as broom handles, bottles, and pan handles. When the estimated affordance region corresponds to such a cylindrical region, this structural prior is exploited to initialize palm poses.

A surface vertex is sampled near the region center and its outward surface normal is used to define a palm pose anchor, controlling the palm reference point, desired palm normal, and palm-to-surface offset. To cover diverse grasp styles, different palm orientations are sampled around the approach direction while preserving the normal alignment. For each sampled palm pose, finger closure is optimized using simulation-based grasp refinement. The resulting candidates are validated in simulation by applying external forces and torques along all six axes to assess grasp stability.
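The palm-pose sampling described above can be sketched as follows. The sketch assumes the palm normal opposes the outward surface normal and that grasp styles differ by a roll angle about that normal. The function name, the uniform roll spacing, and the numeric values are illustrative assumptions, and finger-closure optimization and stability validation are left out.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def sample_palm_poses(vertex, outward_normal, offset_m, num_rolls):
    """Return (position, palm_normal, finger_dir) tuples rolled about the surface normal."""
    n = normalize(outward_normal)
    # Palm reference point sits offset_m away from the surface along the normal.
    position = tuple(p + offset_m * c for p, c in zip(vertex, n))
    palm_normal = tuple(-c for c in n)  # palm faces the surface (assumed convention)
    # Any unit vector perpendicular to n can serve as the zero-roll finger direction.
    helper = (1.0, 0.0, 0.0) if abs(n[0]) < 0.9 else (0.0, 1.0, 0.0)
    e1 = normalize(cross(n, helper))
    e2 = cross(n, e1)  # unit length because n and e1 are orthonormal
    poses = []
    for k in range(num_rolls):
        theta = 2.0 * math.pi * k / num_rolls
        finger_dir = tuple(math.cos(theta) * a + math.sin(theta) * b
                           for a, b in zip(e1, e2))
        poses.append((position, palm_normal, finger_dir))
    return poses

poses = sample_palm_poses(vertex=(0.30, 0.00, 0.10), outward_normal=(0.0, 0.0, 2.0),
                          offset_m=0.02, num_rolls=8)
```

Every candidate shares the same position and palm normal, so normal alignment is preserved by construction and only the in-plane finger direction varies between samples.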

Implementation Details

The following hyperparameters are used in all experiments.

Frequently Asked Questions

How does the system handle occlusion during 3D grounding? The multi-view fusion strategy aggregates semantic cues from multiple calibrated RGB cameras, so a target that is occluded or incomplete in one view can still be grounded from the others. In the reported experiments it outperforms single-view RGB-D baselines in spatial accuracy and robustness to occlusion in cluttered environments.

What types of tool-use tasks can the framework perform? The system supports diverse tasks including placing objects into baskets, positioning tools on a stove, sweeping with a broom, and long-horizon sequences like cooking and organizing multiple objects.

How are grasps generated for dexterous tool manipulation? The system uses a cylindrical-template-based approach that exploits structural priors from household tools, followed by simulation-based finger closure optimization and stability validation under external forces.

Can the system recover from failures during execution? Yes, the primitive-level formulation enables closed-loop execution with verification and retry mechanisms, allowing the system to re-ground or replan after intermediate failures within a retry budget.