Zero-Shot Long-Horizon Dexterous Manipulation
via Multi-View 3D-Grounded VLM Reasoning

1 Seoul National University 2 RLWRLD
Preprint 2026

TL;DR: We present a zero-shot framework for long-horizon dexterous manipulation that grounds language instructions into executable 3D plans from calibrated multi-view RGB images.

Abstract

We present a zero-shot framework for long-horizon dexterous manipulation that grounds language instructions into executable 3D plans from calibrated multi-view RGB images. Rather than training an end-to-end policy, our system uses a vision-language model (VLM) to produce reference-frame task grounding and primitive-level 2D keypoints, then lifts them into 3D via multi-view fusion. This uplifting combines triangulation of view-wise VLM groundings with reference-view ray voting, which searches along a semantic camera ray for candidates consistent across neighboring views. The resulting 3D keypoints support both pick-and-place and tool-use: for tool-use, we retrieve an object-centric atomic action indexed by the inferred skill category and align its stored 6D tool trajectory to the scene; for dexterous execution, we expand the lifted grasp keypoint into a task-conditioned affordance region and generate feasible grasp-motion pairs with an arm-hand motion generator. Real-world experiments show improved 3D grounding accuracy and execution reliability over single-view RGB-D grounding and fine-tuned VLA baselines. We further demonstrate long-horizon manipulation with closed-loop retry, enabling zero-shot execution of tasks involving unseen objects and tool use in novel scenes.

Method Overview

[Figure: Method overview]



Our pipeline takes calibrated multi-view RGB images and a high-level language instruction as input and outputs a physically feasible arm-hand execution plan. It consists of four main stages: (1) reference-view semantic grounding, (2) multi-view fusion-based 3D uplifting, (3) object-centric atomic action alignment for tool-use, and (4) affordance-guided dexterous grasp and motion generation.
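To make the triangulation part of the 3D uplifting stage concrete, the following is a minimal sketch, not the authors' implementation. It assumes each view-wise 2D keypoint has already been back-projected through its camera calibration into a ray (camera center plus direction) in a shared world frame, and it finds the 3D point closest to all rays in the least-squares sense. The function names (`triangulate_rays`, `solve3`, and so on) are hypothetical, and the reference-view ray voting step is not covered.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n < 1e-12:
        raise ValueError("zero-length vector")
    return [c / n for c in v]


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(a, b):
    """Solve the 3x3 system a @ x = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("rays are (nearly) parallel; point is not determined")
    x = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        x.append(det3(m) / d)
    return x


def triangulate_rays(origins, directions):
    """Least-squares 3D point closest to a set of camera rays.

    Each ray i has origin o_i (camera center) and direction d_i.
    With P_i = I - d_i d_i^T (projection orthogonal to the ray), the
    minimizer of sum_i ||P_i (p - o_i)||^2 solves
        (sum_i P_i) p = sum_i P_i o_i.
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        d = normalize(d)
        for i in range(3):
            for j in range(3):
                p_ij = (1.0 if i == j else 0.0) - d[i] * d[j]
                a[i][j] += p_ij
                b[i] += p_ij * o[j]
    return solve3(a, b)
```

With noise-free rays that all pass through one point, the function recovers that point; with noisy VLM groundings it returns a compromise between views, which is one reason a consistency check across neighboring views is useful on top of plain triangulation.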


Zero-Shot Execution Result


Zero-Shot Execution Result - Tool Related

Object-Centric Atomic Action Alignment
Tool-related atomic execution

Leveraging 3D-grounded keypoints derived from VLM reasoning and multi-view 3D lifting, we align the stored 6D tool trajectory to the scene, yielding a task-aligned trajectory.
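One way such an alignment could work is sketched below. This is an illustrative assumption rather than the method described here: it supposes the stored trajectory is a list of 4x4 tool poses expressed in an object-centric frame, and that this frame can be placed in the scene from two lifted 3D keypoints (an origin point and a point the x-axis should face) plus a world up direction. The names `frame_from_keypoints` and `align_trajectory` are hypothetical.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n < 1e-12:
        raise ValueError("degenerate keypoint configuration")
    return [c / n for c in v]


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def frame_from_keypoints(origin, toward, up=(0.0, 0.0, 1.0)):
    """Pose (4x4) of an object-centric frame in the scene.

    Assumed convention: the frame sits at `origin`, its x-axis points
    from `origin` to `toward`, and its z-axis is as close to `up` as
    orthogonality allows. Raises ValueError if x is parallel to `up`.
    """
    x = normalize([t - o for t, o in zip(toward, origin)])
    y = normalize(cross(list(up), x))
    z = cross(x, y)  # right-handed: x cross y = z
    return [[x[0], y[0], z[0], origin[0]],
            [x[1], y[1], z[1], origin[1]],
            [x[2], y[2], z[2], origin[2]],
            [0.0, 0.0, 0.0, 1.0]]


def matmul4(a, b):
    """Product of two 4x4 homogeneous transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def align_trajectory(stored_poses, scene_from_object):
    """Map object-frame tool poses into the scene frame."""
    return [matmul4(scene_from_object, pose) for pose in stored_poses]
```

Because the stored poses are expressed relative to the object, the same trajectory can be reused wherever the keypoints land: only `scene_from_object` changes between scenes.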


Long-horizon Manipulation Result


Tool-Use Cases

BibTeX

Coming soon.