Zero-Shot Long-Horizon Dexterous Manipulation
via Multi-View 3D-Grounded VLM Reasoning
Abstract
We present a zero-shot framework for long-horizon dexterous manipulation that grounds language instructions into executable 3D plans from calibrated multi-view RGB images. Rather than training an end-to-end policy, our system uses a vision-language model (VLM) to produce reference-frame task grounding and primitive-level 2D keypoints, then lifts them into 3D via multi-view fusion. This uplifting combines triangulation of view-wise VLM groundings with reference-view ray voting, which searches along a semantic camera ray for candidates consistent across neighboring views. The resulting 3D keypoints support both pick-and-place and tool-use: for tool-use, we retrieve an object-centric atomic action indexed by the inferred skill category and align its stored 6D tool trajectory to the scene; for dexterous execution, we expand the lifted grasp keypoint into a task-conditioned affordance region and generate feasible grasp-motion pairs with an arm-hand motion generator. Real-world experiments show improved 3D grounding accuracy and execution reliability over single-view RGB-D grounding and fine-tuned VLA baselines. We further demonstrate long-horizon manipulation with closed-loop retry, enabling zero-shot manipulation of unseen objects and zero-shot execution of tool-use tasks in novel scenes.
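As a rough illustration of the triangulation step mentioned in the abstract, the sketch below recovers a 3D point from several camera rays by least squares: it finds the point that minimizes the summed squared distance to all rays. This is a generic textbook formulation, not necessarily the one used in our system. The function names (`triangulate_rays`, `solve3`, `normalize`) are hypothetical, and the rays are assumed to have already been obtained by back-projecting each view's 2D keypoint with that view's calibration.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def solve3(A, b):
    """Solve the 3x3 linear system A x = b with Cramer's rule."""
    def det(M):
        return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
                - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
                + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))
    d = det(A)
    if abs(d) < 1e-12:
        raise ValueError("rays are (nearly) parallel; point is not determined")
    x = []
    for k in range(3):
        # Replace column k of A with b.
        Mk = [[b[i] if j == k else A[i][j] for j in range(3)] for i in range(3)]
        x.append(det(Mk) / d)
    return x

def triangulate_rays(origins, directions):
    """Least-squares point closest to a set of rays (assumed inputs).

    Each ray is origin + s * direction in a shared world frame.
    Solves sum_i (I - d_i d_i^T) p = sum_i (I - d_i d_i^T) o_i.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        d = normalize(d)
        for i in range(3):
            for j in range(3):
                m = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += m
                b[i] += m * o[j]
    return solve3(A, b)

# Three hypothetical camera centers whose rays all pass through one point.
target = [0.2, -0.1, 1.0]
centers = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
rays = [[target[i] - c[i] for i in range(3)] for c in centers]
point = triangulate_rays(centers, rays)
```

With noisy VLM keypoints the rays will not meet exactly, and the same solve then returns the point that best balances all views.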
Method Overview
Our pipeline takes as input calibrated multi-view RGB images and a high-level language instruction, and outputs a physically feasible arm-hand execution plan. It consists of four main stages:
(1) reference-view semantic grounding, (2) multi-view fusion-based 3D uplifting, (3) object-centric atomic action alignment for tool-use, and (4) affordance-guided dexterous grasp and motion generation.
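To make stage (2) more concrete, here is a minimal sketch of one way reference-view ray voting could work. It samples candidate points along the reference camera ray, projects each into the neighboring views, and keeps the candidate that the most views agree with. Several details are assumptions for illustration only: the pinhole camera dictionaries (`fx`, `fy`, `cx`, `cy`, `R`, `t`), the use of reprojection distance to each neighbor's 2D keypoint as the consistency test, and the `near`, `far`, `steps`, and `radius_px` parameters. None of these are specified in the text above.

```python
import math

def project(point, cam):
    """Project a world point with a pinhole camera (world-to-camera R, t).

    Returns pixel (u, v), or None if the point is behind the camera.
    """
    R, t = cam["R"], cam["t"]
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    if z <= 1e-9:
        return None
    return (cam["fx"] * x / z + cam["cx"], cam["fy"] * y / z + cam["cy"])

def ray_vote(origin, direction, neighbors, keypoints,
             near=0.2, far=2.0, steps=181, radius_px=8.0):
    """Pick the sample along a reference ray that most neighbor views support.

    A neighbor view votes for a candidate if the candidate reprojects within
    radius_px of that view's 2D keypoint (an assumed consistency criterion).
    Ties are broken by the lower summed pixel error.
    """
    best = None
    for s in range(steps):
        depth = near + (far - near) * s / (steps - 1)
        cand = [origin[i] + depth * direction[i] for i in range(3)]
        votes, err = 0, 0.0
        for cam, kp in zip(neighbors, keypoints):
            uv = project(cand, cam)
            if uv is None:
                continue
            e = math.hypot(uv[0] - kp[0], uv[1] - kp[1])
            if e <= radius_px:
                votes += 1
                err += e
        key = (-votes, err)
        if votes > 0 and (best is None or key < best[0]):
            best = (key, depth, cand, votes)
    if best is None:
        return None
    return {"depth": best[1], "point": best[2], "votes": best[3]}

# Hypothetical setup: reference camera at the origin looking along +z,
# two neighbors shifted 0.3 m left and right with identity rotation.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intr = {"fx": 500.0, "fy": 500.0, "cx": 320.0, "cy": 240.0}
neighbors = [dict(intr, R=I3, t=[-0.3, 0.0, 0.0]),
             dict(intr, R=I3, t=[0.3, 0.0, 0.0])]
keypoints = [(170.0, 240.0), (470.0, 240.0)]  # where each neighbor sees the target
result = ray_vote([0.0, 0.0, 0.0], [0.0, 0.0, 1.0], neighbors, keypoints)
```

Here `depth` is the parameter along `direction`, so it equals metric range only when `direction` has unit length. A denser sampling or a final local refinement would trade run time for accuracy.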
Zero-Shot Execution Result
"Throw away Pepsi to a basket."
"Put the object outside the baskets into the matching basket."
"Put the pot on the stove."
"Place the apple on the mug."
Zero-Shot Execution Result - Tool Related
Long-horizon Manipulation Result
Tool-Use Cases
"Hit sth on the cutting board with hammer."
"Pour water into the dripper"
"Sweep the object on the cutting board toward the matching category object"
"Toss the food in the wok."
BibTeX
Coming Soon!