EvoVLA: Self-Evolving Vision-Language-Action Model
Peking University
*Equal contribution.
†Project lead.
‡Corresponding author.
TL;DR: EvoVLA mitigates long-horizon stage hallucination with self-supervised rewards, pose-grounded exploration, and selective memory, achieving substantially higher success rates on Discoverse-L and strong Sim2Real robustness.
Real-World Results
Eye-in-hand
Eye-to-hand
Simulation Results
Eye-in-hand
Eye-to-hand
EvoVLA
EvoVLA overview. Built on the OpenVLA-OFT backbone, EvoVLA integrates three modules: Stage-Aligned Reward (SAR) with hard negatives and temporal smoothing, Pose-Based Object Exploration (POE) via world models, and Long-Horizon Memory with context selection and gated fusion. The framework is trained in Discoverse-L and deploys to real robots.
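To make the SAR component in the overview more concrete, the sketch below shows one way a stage-aligned reward with hard negatives and temporal smoothing could be computed. It is a minimal illustration under assumed interfaces: the class and argument names (`StageAlignedReward`, `obs_emb`, `stage_emb`, `hard_neg_embs`, `smoothing`) are hypothetical and do not reflect the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StageAlignedReward(nn.Module):
    """Illustrative stage-aligned reward head (assumed design, not EvoVLA's code).

    Scores how well the current observation embedding matches the target stage
    embedding, contrasts it against hard-negative stages, and smooths the
    reward over time with an exponential moving average.
    """

    def __init__(self, dim: int, smoothing: float = 0.9):
        super().__init__()
        self.proj = nn.Linear(dim, dim)   # shared projection for observation/stage embeddings
        self.smoothing = smoothing        # EMA coefficient for temporal smoothing
        self.register_buffer("ema_reward", torch.zeros(1))

    def forward(self, obs_emb, stage_emb, hard_neg_embs):
        # obs_emb: (B, D), stage_emb: (B, D), hard_neg_embs: (B, K, D)
        q = F.normalize(self.proj(obs_emb), dim=-1)
        pos = F.normalize(self.proj(stage_emb), dim=-1)
        neg = F.normalize(self.proj(hard_neg_embs), dim=-1)

        pos_sim = (q * pos).sum(dim=-1, keepdim=True)       # similarity to the correct stage, (B, 1)
        neg_sim = torch.einsum("bd,bkd->bk", q, neg)         # similarities to hard negatives, (B, K)

        # Contrastive reward: probability mass assigned to the correct stage
        logits = torch.cat([pos_sim, neg_sim], dim=-1)
        reward = F.softmax(logits, dim=-1)[:, 0]              # (B,)

        # Temporal smoothing across consecutive policy steps
        smoothed = self.smoothing * self.ema_reward + (1 - self.smoothing) * reward.mean()
        self.ema_reward = smoothed.detach()
        return reward, smoothed
```

The hard negatives here would come from visually similar but incorrect stages, so the reward penalizes premature stage completion (stage hallucination) rather than rewarding any plausible-looking frame.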
EvoVLA Data Engine
EvoVLA Data Engine. The data engine is aligned with Discoverse-L and the video-driven stage-discovery pipeline, closing the data–reward–policy loop.
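As a rough illustration of how this loop could be organized, the outline below wires stage discovery, rollout collection, stage-aligned reward labeling, pose-based exploration, and policy updates together. All function and argument names are hypothetical placeholders for illustration, not the EvoVLA codebase.

```python
from typing import Any, Callable, List


def self_evolving_loop(
    policy: Any,
    env: Any,
    demo_videos: List[Any],
    discover_stages: Callable,            # video-driven stage discovery
    rollout_policy: Callable,             # collect trajectories in simulation
    compute_stage_reward: Callable,       # self-supervised stage-aligned reward
    propose_exploration_scenes: Callable, # pose-based object exploration via world model
    update_policy: Callable,              # policy optimization step
    num_iters: int = 10,
):
    """Hypothetical outline of the data–reward–policy loop (illustrative only)."""
    # Segment demonstration videos into task stages once, up front
    stages = discover_stages(demo_videos)

    for _ in range(num_iters):
        # Roll out the current policy in simulation (e.g., Discoverse-L)
        trajectories = rollout_policy(policy, env)

        # Label rollouts with self-supervised, stage-aligned rewards
        for traj in trajectories:
            traj["rewards"] = compute_stage_reward(traj["observations"], stages)

        # Pose-based object exploration proposes new scenes for the next round
        env = propose_exploration_scenes(env, trajectories)

        # Policy update closes the loop
        policy = update_policy(policy, trajectories)

    return policy
```

The point of the sketch is the data flow: stage discovery supplies the reward signal, the reward labels newly collected rollouts, and exploration keeps feeding fresh scenes back into training without manual annotation.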