GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning

1CASIA 2Peking University
*Equal contribution. †Project lead. ‡Corresponding author.

TL;DR

GeneralVLA integrates ASM with Knowledge-Guided Trajectory Planning, enabling generalizable zero-shot robotic manipulation and scalable, demonstration-free data generation across tasks.

Real World Results

Selected real-world episodes (videos 7–10) showcase the model acting in unconstrained physical settings. Each clip is labeled beneath the footage for quick reference.

Pick up the spray bottle and place it on the blue pad, taking into account the bottle's height during handling.
Open the drawer using its handle, and be mindful of the correct direction for opening.
Uncap the transparent bottle.
Pick up the cabbage and place it into the box.

Simulation Results

Controlled simulation episodes (clips 1–6) highlight the model’s behavior in the virtual testbed, with motion traces rendered as GIFs for quick comparison.

Play Jenga simulation
Pull out the Jenga block carefully, taking care to note and apply the correct direction of pull.
Lamp on simulation
Turn on the lamp. Press the green button to power it on.
Pick up cup simulation
Pick up the red cup by grasping its edge securely.
Open jar simulation
Uncap the red jar and place its cap on the green area.
Push buttons simulation
Press the button located on the red pad.
Stack blocks simulation
Stack the two blocks together within the green area, aligning them carefully while considering their height for stability.

Method

GeneralVLA overview illustration
Overview of GeneralVLA compared with prior VLAs and earlier imitation learning methods. GeneralVLA’s hierarchical design yields better generalization: it enables a 3D trajectory planning framework that fully exploits the prior knowledge of foundation models.
GeneralVLA inference pipeline
Inference workflow of GeneralVLA. (a) The high-level ASM is called to generate 2D points and the corresponding semantic information. (b) The mid-level Knowledge-Guided Trajectory Planning performs task understanding, 3D reasoning, and planning to produce a 3D path specifying the desired robot end-effector trajectory. (c) The intermediate 3D path prediction then serves as guidance to the low-level, 3D-aware control policy, enhanced by HGM, for precise manipulation.
ASM and 3DAgent framework
Detailed framework of ASM and 3DAgent. (a) Given the input image and task text as the query, the multimodal LLM (e.g., LLaVA [36]) generates text output. The last-layer embedding of the <SEG> token is then decoded into the segmentation mask via the decoder. We use LoRA [19] for efficient fine-tuning. The choice of vision backbone is flexible (e.g., SAM3 [7]).
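The three-stage inference flow above can be sketched as follows. This is a minimal illustrative mock-up, not the actual GeneralVLA implementation: all names (`high_level_asm`, `mid_level_plan`, `low_level_policy`), the depth lookup, the start pose, and the linear waypoint interpolation are hypothetical stand-ins for the real ASM, Knowledge-Guided Trajectory Planning, and HGM-enhanced policy.

```python
# Illustrative sketch of the hierarchical inference pipeline (a)-(c).
# Every component here is a hypothetical placeholder, not the paper's code.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]

@dataclass
class ASMOutput:
    points_2d: List[Point2D]  # 2D points on the target object
    semantics: str            # associated semantic information

def high_level_asm(image, task_text: str) -> ASMOutput:
    # (a) Stand-in for the multimodal LLM + mask decoder producing
    # 2D points and semantics from the image/text query.
    return ASMOutput(points_2d=[(0.40, 0.55)], semantics=task_text)

def mid_level_plan(asm_out: ASMOutput,
                   depth_at: Callable[[float, float], float]) -> List[Point3D]:
    # (b) Stand-in for knowledge-guided planning: lift the 2D point to 3D
    # with a depth lookup, then interpolate a coarse end-effector path
    # (simple linear interpolation here, for brevity).
    u, v = asm_out.points_2d[0]
    goal: Point3D = (u, v, depth_at(u, v))
    start: Point3D = (0.0, 0.0, 0.3)  # assumed current gripper pose
    n = 5
    return [tuple(s + (g - s) * i / (n - 1) for s, g in zip(start, goal))
            for i in range(n)]

def low_level_policy(path_3d: List[Point3D]) -> List[Point3D]:
    # (c) Stand-in control policy: track the guidance path waypoint by waypoint.
    return list(path_3d)

def run_inference(image, task_text: str,
                  depth_at: Callable[[float, float], float]) -> List[Point3D]:
    asm_out = high_level_asm(image, task_text)   # (a) 2D points + semantics
    path_3d = mid_level_plan(asm_out, depth_at)  # (b) 3D guidance path
    return low_level_policy(path_3d)             # (c) executed trajectory

actions = run_inference(image=None,
                        task_text="uncap the transparent bottle",
                        depth_at=lambda u, v: 0.62)
```

The sketch only captures the information flow (image + text → 2D points → 3D path → actions); the real system replaces each stub with a learned model.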

BibTeX

@article{ma2026generalvla,
  title={GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning},
  author={Ma, Guoqing and Wang, Siheng and Zhang, Zeyu and Yu, Shan and Tang, Hao},
  journal={arXiv preprint arXiv:2602.04315},
  year={2026}
}