GeneralVLA-2: Geometry-Aware Reconstruction and
Governed Memory for Robot Planning

Haoyu Wang¹* Guoqing Ma²* Zeyu Zhang¹*† Yandong Guo³ Boxin Shi¹ Hao Tang¹‡

¹School of Computer Science, Peking University ²CASIA ³AI² Robotics

*Equal contribution. †Project lead. ‡Corresponding author.

Abstract: Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, whereas manipulation benefits from a stable object shape, which calibrated multi-view observations can provide when they are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first bottleneck, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues against input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. We evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified. GeoFuse-MV3D improves over the MV-SAM3D baseline, reducing CD and LPIPS by 2.20% and 2.02% and increasing PSNR and SSIM by 2.36% and 1.03%, respectively. The governed KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively.

Method Overview

Overview of GeneralVLA-2 for robot manipulation. When calibrated multi-view object observations are available, GeoFuse-MV3D converts the views, masks, and camera poses into refined object-centric 3D evidence; the 3D-capable planning agent combines this evidence with governed KnowledgeBank retrieval; and the robot execution module follows the resulting multi-stage end-effector trajectory.

GeoFuse-MV3D Reconstruction
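
The abstract lists soft visual-hull support as one step of GeoFuse-MV3D without giving its formulation. The sketch below shows one common way such a score could be computed, under stated assumptions: each candidate 3D point is projected into every calibrated view, and its support is the fraction of views in which it lands inside the object mask. The function name `soft_hull_support`, the 3x4 projection-matrix convention, and the plain averaging rule are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def soft_hull_support(points, projections, masks):
    """Fraction of views in which each 3D point projects inside the mask.

    points:      (N, 3) array of 3D points in world coordinates.
    projections: list of (3, 4) camera matrices, assumed to be K @ [R | t].
    masks:       list of (H, W) boolean object masks, one per view.
    """
    points = np.asarray(points, dtype=float)
    homog = np.hstack([points, np.ones((len(points), 1))])  # (N, 4)
    votes = np.zeros(len(points))

    for P, mask in zip(projections, masks):
        mask = np.asarray(mask, dtype=bool)
        h, w = mask.shape

        uvw = homog @ np.asarray(P, dtype=float).T  # (N, 3)
        depth = uvw[:, 2]
        in_front = depth > 1e-9
        # Use a dummy depth for points behind the camera to avoid dividing by zero.
        safe_depth = np.where(in_front, depth, 1.0)
        u = np.floor(uvw[:, 0] / safe_depth).astype(int)  # pixel column
        v = np.floor(uvw[:, 1] / safe_depth).astype(int)  # pixel row

        in_image = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(points), dtype=bool)
        hit[in_image] = mask[v[in_image], u[in_image]]
        votes += hit

    # A hard visual hull would require a hit in every view (score == 1.0);
    # the soft variant keeps the fractional score instead.
    return votes / len(projections)
```

Keeping the fractional score rather than a hard intersection means that a single noisy mask lowers a point's support instead of carving it away entirely; any downstream threshold is a separate design choice.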

Governed KnowledgeBank
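
The abstract describes the governed KnowledgeBank as attaching quality, confidence, lifecycle, verifier, and conflict metadata to memory entries and retrieving with a precision-oriented policy. The sketch below illustrates how such metadata could gate retrieval. Every name here (`MemoryItem`, its fields, `retrieve`, the thresholds) is hypothetical, and token-overlap Jaccard similarity stands in for whatever relevance model the system actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    item_id: str
    text: str
    quality: float                 # hypothetical score in [0, 1]
    confidence: float              # hypothetical score in [0, 1]
    status: str = "active"         # lifecycle state, e.g. "active" or "deprecated"
    verified: bool = False         # set once a verifier has accepted the item
    conflicts_with: set = field(default_factory=set)  # ids of conflicting items

def jaccard(a, b):
    """Token-overlap similarity; a stand-in for a learned relevance score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def retrieve(query, bank, k=3, min_sim=0.2, min_conf=0.5):
    """Return up to k items, preferring precision over recall."""
    scored = []
    for item in bank:
        # Governance filters: lifecycle, verifier, and confidence gates.
        if item.status != "active" or not item.verified:
            continue
        if item.confidence < min_conf:
            continue
        sim = jaccard(query, item.text)
        if sim < min_sim:
            continue  # drop weak matches rather than pad the result
        scored.append((sim * item.quality * item.confidence, item))

    scored.sort(key=lambda pair: pair[0], reverse=True)

    selected = []
    for _, item in scored:
        # Skip an item that conflicts with a higher-ranked one already chosen.
        clashes = any(
            item.item_id in s.conflicts_with or s.item_id in item.conflicts_with
            for s in selected
        )
        if clashes:
            continue
        selected.append(item)
        if len(selected) == k:
            break
    return selected
```

Because every filter only removes candidates, the function may return fewer than `k` items or none at all, which is the intended trade-off in a precision-oriented policy.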

Simulation

Real World

MV-SAM3D vs GeoFuse-MV3D

Additional qualitative comparison between MV-SAM3D and GeoFuse-MV3D on the first set of GSO-30 objects. Each row uses the same five input views, masks, and camera poses for both methods.

Additional qualitative comparison between MV-SAM3D and GeoFuse-MV3D on the second set of GSO-30 objects. GeoFuse-MV3D applies mask-verified, appearance-preserving geometry refinement, improving object completeness and pose consistency without changing the input protocol.

BibTeX

@article{wang2026generalvla2,
  title={GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning},
  author={Wang, Haoyu and Ma, Guoqing and Zhang, Zeyu and Guo, Yandong and Shi, Boxin and Tang, Hao},
  journal={arXiv preprint arXiv:2606.17480},
  year={2026}
}