MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
TL;DR: MobileVLA-R1 enables robust real-world quadruped control by unifying language reasoning and continuous action through structured CoT alignment and GRPO training.
Real World Results
Simulation Results
MobileVLA-R1
Architecture of MobileVLA-R1. MobileVLA-R1 is an end-to-end framework that integrates natural-language instructions with multimodal perception. It processes RGB, depth, and point cloud observations together with textual commands to generate continuous locomotion actions, enabling mobile robots to follow complex instructions and adapt to diverse environments in real time.
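To make the interface concrete, below is a minimal sketch of how a multimodal policy of this kind can fuse RGB, depth, point-cloud, and text inputs into a continuous action. Module names, feature sizes, and the action dimension are illustrative assumptions, not the released MobileVLA-R1 implementation (which builds on a VLM backbone).

```python
import torch
import torch.nn as nn

class MobileVLAPolicySketch(nn.Module):
    """Illustrative multimodal policy: encodes each modality, fuses the
    features, and regresses a continuous locomotion command."""

    def __init__(self, feat_dim=256, action_dim=12, vocab_size=32000):
        super().__init__()
        # Per-modality encoders (placeholders; the real model uses a VLM backbone).
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1),
                                     nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                                     nn.Flatten(), nn.Linear(16, feat_dim))
        self.depth_enc = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1),
                                       nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                                       nn.Flatten(), nn.Linear(16, feat_dim))
        self.pcd_enc = nn.Sequential(nn.Linear(3, feat_dim), nn.ReLU())  # per-point MLP
        self.text_enc = nn.Embedding(vocab_size, feat_dim)               # token embeddings
        self.fusion = nn.Sequential(nn.Linear(4 * feat_dim, feat_dim), nn.ReLU())
        self.action_head = nn.Linear(feat_dim, action_dim)  # continuous action output

    def forward(self, rgb, depth, points, tokens):
        f_rgb = self.rgb_enc(rgb)                      # (B, feat_dim)
        f_depth = self.depth_enc(depth)                # (B, feat_dim)
        f_pcd = self.pcd_enc(points).mean(dim=1)       # (B, feat_dim), pooled over points
        f_txt = self.text_enc(tokens).mean(dim=1)      # (B, feat_dim), pooled over tokens
        fused = self.fusion(torch.cat([f_rgb, f_depth, f_pcd, f_txt], dim=-1))
        return self.action_head(fused)                 # (B, action_dim)

# Example call with dummy tensors (batch of 1).
policy = MobileVLAPolicySketch()
action = policy(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224),
                torch.randn(1, 1024, 3), torch.randint(0, 32000, (1, 16)))
```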
CoT Data Engine
CoT Data Engine. We construct the MobileVLA-CoT dataset by defining navigation-level and step-level instructions, integrating RGB-Depth visual inputs, and specifying structured reasoning prompts. These inputs are fed to Gemini-2.5-Flash, which generates multi-granularity Chain-of-Thought (CoT) annotations with corresponding action outputs.
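The sketch below shows one way such an annotation pipeline can be organized: a structured reasoning prompt is assembled from the navigation and step-level instructions, then sent together with the RGB-Depth observation to an annotation model. The prompt wording, field names, and the injected `query_annotator` callable are assumptions for illustration; the actual MobileVLA-CoT prompts and Gemini-2.5-Flash client code may differ.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CoTSample:
    navigation_instruction: str   # high-level goal, e.g. "go to the red door"
    step_instruction: str         # step-level sub-goal, e.g. "turn left at the hallway"
    rgb_path: str                 # path to the RGB frame
    depth_path: str               # path to the aligned depth frame

# Hypothetical structured reasoning prompt, not the paper's exact template.
PROMPT_TEMPLATE = """You are annotating a mobile-robot trajectory step.
Navigation instruction: {nav}
Current step instruction: {step}
Given the attached RGB and depth observations, write a structured Chain-of-Thought:
1) scene description, 2) relevant objects and obstacles,
3) reasoning toward the sub-goal, 4) the resulting continuous action.
"""

def build_prompt(sample: CoTSample) -> str:
    """Fill the structured reasoning prompt for one trajectory step."""
    return PROMPT_TEMPLATE.format(nav=sample.navigation_instruction,
                                  step=sample.step_instruction)

def annotate(samples: List[CoTSample], query_annotator: Callable) -> List[dict]:
    """Run the annotation model (e.g. Gemini-2.5-Flash wrapped by the caller in
    `query_annotator(prompt, image_paths)`) and collect CoT + action pairs."""
    records = []
    for s in samples:
        prompt = build_prompt(s)
        cot_and_action = query_annotator(prompt, [s.rgb_path, s.depth_path])
        records.append({"prompt": prompt, "annotation": cot_and_action})
    return records
```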
RLVR
The pipeline of the RL policy. The model generates N responses for a given input, and a reward is computed for each response. After normalization and clipping, these rewards are combined with a KL-divergence term, which prevents the model from over-updating, and the result is used to update the policy.
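Below is a minimal sketch of a GRPO-style objective consistent with this description: rewards for the N sampled responses are normalized within the group, a clipped policy-ratio surrogate is formed, and a KL penalty toward a frozen reference policy discourages over-updating. The exact reward shaping, clipping scheme, and coefficients used in MobileVLA-R1 are assumptions here.

```python
import torch

def grpo_loss(logp, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    """GRPO-style loss sketch.

    logp, logp_old, logp_ref: (N,) summed log-probs of the N sampled responses
    under the current, behavior, and frozen reference policies.
    rewards: (N,) scalar rewards, one per response.
    """
    # Normalize rewards within the group to obtain relative advantages.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped surrogate on the ratio between current and behavior policy.
    ratio = torch.exp(logp - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_term = torch.min(unclipped, clipped).mean()

    # KL penalty toward the reference policy keeps the update conservative.
    kl = (logp - logp_ref).mean()

    return -(policy_term - kl_coef * kl)

# Example with N = 8 dummy responses; in training, logp carries gradients.
logp = torch.randn(8, requires_grad=True)
loss = grpo_loss(logp, torch.randn(8), torch.randn(8), torch.rand(8))
loss.backward()
```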