EvoVLA: Self-Evolving Vision-Language-Action Model
Peking University
*Equal contribution.
†Project lead.
‡Corresponding author.
TL;DR: EvoVLA mitigates long-horizon stage hallucination with self-supervised rewards, pose-grounded exploration, and selective memory, achieving substantially higher success rates on Discoverse-L and strong Sim2Real robustness.
Real-World Results
Eye-in-hand
Eye-to-hand
Simulation Results
Eye-in-hand
Eye-to-hand
EvoVLA
EvoVLA overview. Built on the OpenVLA-OFT backbone, EvoVLA integrates three modules: Stage-Aligned Reward (SAR) with hard negatives and temporal smoothing, Pose-Based Object Exploration (POE) via world models, and Long-Horizon Memory with context selection and gated fusion. The framework is trained in Discoverse-L and deploys to real robots.
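To make the SAR component in the overview more concrete, the sketch below shows one way a stage-aligned reward with hard negatives and temporal smoothing could be computed. It is a minimal illustration under assumed interfaces: the class and argument names (`StageAlignedReward`, `obs_emb`, `stage_emb`, `hard_neg_embs`, `smoothing`) are hypothetical and do not reflect the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StageAlignedReward(nn.Module):
    """Illustrative stage-aligned reward head (assumed design, not EvoVLA's code).

    Scores how well the current observation embedding matches the target stage
    embedding, contrasts it against hard-negative stages, and smooths the
    reward over time with an exponential moving average.
    """

    def __init__(self, dim: int, smoothing: float = 0.9):
        super().__init__()
        self.proj = nn.Linear(dim, dim)   # shared projection for observation/stage embeddings
        self.smoothing = smoothing        # EMA coefficient for temporal smoothing
        self.register_buffer("ema_reward", torch.zeros(1))

    def forward(self, obs_emb, stage_emb, hard_neg_embs):
        # obs_emb: (B, D), stage_emb: (B, D), hard_neg_embs: (B, K, D)
        q = F.normalize(self.proj(obs_emb), dim=-1)
        pos = F.normalize(self.proj(stage_emb), dim=-1)
        neg = F.normalize(self.proj(hard_neg_embs), dim=-1)

        pos_sim = (q * pos).sum(dim=-1, keepdim=True)       # similarity to the correct stage, (B, 1)
        neg_sim = torch.einsum("bd,bkd->bk", q, neg)         # similarities to hard negatives, (B, K)

        # Contrastive reward: probability mass assigned to the correct stage
        logits = torch.cat([pos_sim, neg_sim], dim=-1)
        reward = F.softmax(logits, dim=-1)[:, 0]              # (B,)

        # Temporal smoothing across consecutive policy steps
        smoothed = self.smoothing * self.ema_reward + (1 - self.smoothing) * reward.mean()
        self.ema_reward = smoothed.detach()
        return reward, smoothed
```

The hard negatives here would come from visually similar but incorrect stages, so the reward penalizes premature stage completion (stage hallucination) rather than rewarding any plausible-looking frame.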
EvoVLA Data Engine
EvoVLA Data Engine. The data engine is aligned with Discoverse-L and the video-driven stage-discovery pipeline, closing the data–reward–policy loop.
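As a rough illustration of how this loop could be organized, the outline below wires stage discovery, rollout collection, stage-aligned reward labeling, pose-based exploration, and policy updates together. All function and argument names are hypothetical placeholders for illustration, not the EvoVLA codebase.

```python
from typing import Any, Callable, List


def self_evolving_loop(
    policy: Any,
    env: Any,
    demo_videos: List[Any],
    discover_stages: Callable,            # video-driven stage discovery
    rollout_policy: Callable,             # collect trajectories in simulation
    compute_stage_reward: Callable,       # self-supervised stage-aligned reward
    propose_exploration_scenes: Callable, # pose-based object exploration via world model
    update_policy: Callable,              # policy optimization step
    num_iters: int = 10,
):
    """Hypothetical outline of the data–reward–policy loop (illustrative only)."""
    # Segment demonstration videos into task stages once, up front
    stages = discover_stages(demo_videos)

    for _ in range(num_iters):
        # Roll out the current policy in simulation (e.g., Discoverse-L)
        trajectories = rollout_policy(policy, env)

        # Label rollouts with self-supervised, stage-aligned rewards
        for traj in trajectories:
            traj["rewards"] = compute_stage_reward(traj["observations"], stages)

        # Pose-based object exploration proposes new scenes for the next round
        env = propose_exploration_scenes(env, trajectories)

        # Policy update closes the loop
        policy = update_policy(policy, trajectories)

    return policy
```

The point of the sketch is the data flow: stage discovery supplies the reward signal, the reward labels newly collected rollouts, and exploration keeps feeding fresh scenes back into training without manual annotation.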