GeneralVLA-2: Geometry-Aware Reconstruction and
Governed Memory for Robot Planning

Haoyu Wang1*    Guoqing Ma2*    Zeyu Zhang1*†    Yandong Guo3    Boxin Shi1    Hao Tang1‡
1School of Computer Science, Peking University    2CASIA    3AI2 Robotics
*Equal contribution.   †Project lead.   ‡Corresponding author.
[Paper]
[Code]
[Model]

Abstract: Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, whereas manipulation benefits from the stable object shape that calibrated multi-view observations can provide when they are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first bottleneck, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues against input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. We evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified. Relative to the MV-SAM3D baseline, GeoFuse-MV3D reduces CD and LPIPS by 2.20% and 2.02%, respectively, and increases PSNR and SSIM by 2.36% and 1.03%. Relative to ReasoningBank, KnowledgeBank improves Terminal-Bench SR by 4.53% and SWE-Bench resolve rate by 3.73%, while reducing AS by 4.95% and 5.65% on the two benchmarks, respectively.



Method Overview

Overview of GeneralVLA-2 for robot manipulation. When calibrated multi-view object observations are available, GeoFuse-MV3D converts the views, masks, and camera poses into refined object-centric 3D evidence; the 3D-capable planning agent combines this evidence with governed KnowledgeBank retrieval; and the robot execution module follows the resulting multi-stage end-effector trajectory.


GeoFuse-MV3D Reconstruction

GeoFuse-MV3D reconstruction branch in GeneralVLA-2. The branch keeps the same multi-view inputs, masks, and poses as MV-SAM3D, then refines the baseline with two complementary geometry sources and conservative geometry-only fusion.
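One way to picture soft visual-hull support is as a per-point score: the fraction of calibrated views whose object mask contains the point's projection. The sketch below is a minimal illustration under stated assumptions, not the GeoFuse-MV3D implementation. The `View` container, the pinhole projection, the function names, and the mean-over-views score are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mask = Sequence[Sequence[int]]  # binary mask indexed as mask[row][col]


@dataclass
class View:
    """One calibrated view (hypothetical container, not the paper's API)."""
    R: Sequence[Sequence[float]]  # world-to-camera rotation, 3x3
    t: Vec3                       # world-to-camera translation
    focal: float                  # focal length in pixels
    cx: float                     # principal point, x
    cy: float                     # principal point, y
    mask: Mask                    # object mask for this view


def project(view: View, p: Vec3) -> Optional[Tuple[int, int]]:
    """Pinhole-project a world point to (col, row); None if behind the camera."""
    xc, yc, zc = (
        sum(view.R[i][j] * p[j] for j in range(3)) + view.t[i] for i in range(3)
    )
    if zc <= 1e-9:
        return None
    u = view.focal * xc / zc + view.cx
    v = view.focal * yc / zc + view.cy
    return int(round(u)), int(round(v))


def soft_hull_support(views: Sequence[View], p: Vec3) -> float:
    """Fraction of views (0.0 to 1.0) whose mask contains the projection of p."""
    if not views:
        return 0.0
    hits = 0
    for view in views:
        pixel = project(view, p)
        if pixel is None:
            continue
        col, row = pixel
        in_bounds = 0 <= row < len(view.mask) and 0 <= col < len(view.mask[0])
        if in_bounds and view.mask[row][col]:
            hits += 1
    return hits / len(views)


# Toy setup: two cameras on the same axis at different distances,
# both seeing a 3x3 object silhouette centered in a 5x5 image.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
center_mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)]
               for r in range(5)]
views = [
    View(identity, (0.0, 0.0, 2.0), 2.0, 2.0, 2.0, center_mask),
    View(identity, (0.0, 0.0, 4.0), 2.0, 2.0, 2.0, center_mask),
]

inside = soft_hull_support(views, (0.0, 0.0, 0.0))    # inside both silhouettes
boundary = soft_hull_support(views, (2.0, 0.0, 0.0))  # inside only the far view
```

A hard visual hull would keep a point only when every view agrees. The soft score leaves room for a tolerance threshold, which is one plausible reading of "conservative" refinement when masks are imperfect.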


Governed KnowledgeBank

Architecture of the governed KnowledgeBank module used by GeneralVLA-2. The module writes verifier-labeled memories, retrieves high-quality records, and manages their lifecycle before conditioning the 3DAgent planner.
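To make the idea of governed, precision-oriented retrieval concrete, the sketch below gates memories on their metadata before ranking them. It is an illustration under assumptions, not the KnowledgeBank implementation: the `MemoryRecord` schema, the thresholds, the scoring rule, and the token-overlap similarity (a stand-in for a semantic embedding) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class MemoryRecord:
    """Hypothetical record schema; field names are illustrative assumptions."""
    text: str
    quality: float               # 0..1, e.g. assigned when the memory is written
    confidence: float            # 0..1
    lifecycle: str = "active"    # e.g. "active" or "deprecated"
    verified: bool = False       # set by a verifier
    conflicts_with: Set[str] = field(default_factory=set)  # unresolved conflicts


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, used here in place of an embedding model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(bank: List[MemoryRecord], query: str, k: int = 2,
             min_conf: float = 0.6, min_sim: float = 0.3) -> List[MemoryRecord]:
    """Gate on metadata first, then rank the survivors; favors precision."""
    scored = []
    for m in bank:
        if m.lifecycle != "active" or not m.verified:
            continue  # lifecycle and verifier gates
        if m.confidence < min_conf or m.conflicts_with:
            continue  # confidence and conflict gates
        sim = jaccard(query, m.text)
        if sim < min_sim:
            continue  # drop weak matches instead of padding the result list
        scored.append((sim * m.quality * m.confidence, m))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]


bank = [
    MemoryRecord("grasp mug by handle from the side", 0.9, 0.9, verified=True),
    MemoryRecord("grasp mug by rim from above", 0.5, 0.7, verified=True),
    MemoryRecord("grasp mug by handle quickly", 0.9, 0.3, verified=True),
    MemoryRecord("grasp mug by handle from the top", 0.9, 0.9,
                 lifecycle="deprecated", verified=True),
    MemoryRecord("open drawer by pulling the knob", 0.9, 0.9, verified=True),
]

results = retrieve(bank, "grasp the mug by its handle")
texts = [m.text for m in results]
```

In this toy bank the low-confidence record, the deprecated record, and the off-topic record are all filtered out, so `texts` holds the side-handle grasp followed by the rim grasp. Returning fewer than `k` records when few pass the gates is a deliberate choice in this sketch.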


Simulation



Real World



MV-SAM3D vs GeoFuse-MV3D

Additional qualitative comparison between MV-SAM3D and GeoFuse-MV3D on the first set of GSO-30 objects. Each row uses the same five input views, masks, and camera poses for both methods.
Additional qualitative comparison between MV-SAM3D and GeoFuse-MV3D on the second set of GSO-30 objects. GeoFuse-MV3D applies mask-verified, appearance-preserving geometry refinement, improving object completeness and pose consistency without changing the input protocol.
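Mask verification can be pictured as a silhouette-agreement test: a candidate geometry is rendered into each input view and accepted only if its silhouette overlaps the input mask well enough. The sketch below is a minimal illustration under assumptions, not the method's actual criterion. The use of IoU, the 0.5 threshold, the all-views-must-pass rule, and the function names are hypothetical, and the silhouettes are supplied directly rather than rendered.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask indexed as mask[row][col]


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            inter += 1 if (x and y) else 0
            union += 1 if (x or y) else 0
    # Two empty masks agree trivially.
    return inter / union if union else 1.0


def verify_geometry(silhouettes: Sequence[Mask], masks: Sequence[Mask],
                    min_iou: float = 0.5) -> bool:
    """Accept a candidate only if every view's silhouette matches its mask.

    The threshold and the all-views rule are assumptions for illustration.
    """
    if not silhouettes or len(silhouettes) != len(masks):
        return False
    return all(mask_iou(s, m) >= min_iou for s, m in zip(silhouettes, masks))


# Toy data: one input mask and two candidate silhouettes.
mask = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
good = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]   # 3 of 4 mask pixels covered, nothing outside the mask
bad = [[1, 0, 0],
       [1, 0, 0],
       [0, 0, 0]]    # entirely outside the mask

accepted = verify_geometry([good], [mask])
rejected = verify_geometry([good, bad], [mask, mask])
```

Requiring agreement in every view is the strict end of the design space; a softer rule could average IoU across views or tolerate a small number of failing views.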


Bibtex

@article{wang2026generalvla2,
    title={GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning},
    author={Wang, Haoyu and Ma, Guoqing and Zhang, Zeyu and Guo, Yandong and Shi, Boxin and Tang, Hao},
    journal={arXiv preprint arXiv:2606.17480},
    year={2026}
}