MotionVLA

Vision-Language-Action Model for Humanoid Motion

Nonghai Zhang1* Siyu Zhai1* Yanjun Li1* Zeyu Zhang1*† Zhihan Yin1 Yandong Guo2 Boxin Shi1 Hao Tang1‡
1 Peking University   2 AI2 Robotics
*Equal contribution. †Project lead. ‡Corresponding author.

Real-World Deployment

Real-robot deployment of MotionVLA on a Unitree G1 EDU humanoid robot.

The person walks straight ahead to the other end of the room.

The person turns and then walks to the end of the room.

The person walks straight ahead and then turns.

MuJoCo Simulation

All qualitative motion visualizations in this paper are produced using MuJoCo, a physics engine widely used in locomotion and character animation research.
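For readers who want to reproduce this style of visualization, a minimal sketch of rendering a rollout with the MuJoCo Python bindings might look as follows. The model path and the way the state is driven are illustrative placeholders, not the pipeline used in this work.

```python
# Minimal sketch: rendering frames with the MuJoCo Python bindings.
# The XML path and the passive stepping are stand-ins; in this paper the
# joint states would come from the decoded MotionVLA motion sequence.
import mujoco

model = mujoco.MjModel.from_xml_path("humanoid.xml")  # any humanoid MJCF
data = mujoco.MjData(model)
renderer = mujoco.Renderer(model, height=480, width=640)

frames = []
for t in range(300):
    mujoco.mj_step(model, data)        # advance the physics one step
    renderer.update_scene(data)
    frames.append(renderer.render())   # (H, W, 3) uint8 RGB frame
```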

Text to Humanoid Motion

The ballerina extends her arms and turns to the left, then rises onto her toes and performs a series of ballet movements.

The man walks into the room, approaches a table, picks up an object, then walks towards a projection on the wall. He interacts with the object in his hands while standing near the projection.

Scene-Conditioned Motion

Scene-conditioned motion scene 1

Generate motion for: The person takes off a shirt and puts it on their head, then bends down to pick up something from the ground.

Scene-conditioned motion scene 2

Generate motion for: The man walks towards the camera.

MotionVLA & DSFT

Overview of MotionVLA

Overview of MotionVLA. (a) DSFT performs dual-stream frequency tokenization by decomposing motion into Base and Phys components and converting them into discrete tokens. (b) During training, MotionVLA learns to autoregressively predict the unified motion token sequence under text and scene-image conditioning, supervised by DSFT tokens derived from ground-truth motion. (c) At inference time, the model generates Base and Phys tokens conditioned on multimodal inputs, which are then decoded and recombined to reconstruct the final motion sequence.
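To make the dual-stream idea concrete, the following is a hedged sketch of how a frequency split of this kind could be implemented. The FFT cutoff, the codebook sizes, and the function names (`split_frequency`, `quantize`) are our illustrative assumptions; the caption above only fixes the structure: decompose into Base and Phys, tokenize each stream, and recombine at decoding time.

```python
# Illustrative sketch of a dual-stream frequency split in the spirit of DSFT.
# The cutoff and the nearest-neighbour quantizer are hypothetical stand-ins
# for the learned tokenizer described in the paper.
import numpy as np

def split_frequency(motion: np.ndarray, cutoff: int = 4):
    """Split a (T, D) motion sequence into a low-frequency Base stream and
    a high-frequency Phys residual via an FFT low-pass along time."""
    spec = np.fft.rfft(motion, axis=0)
    spec[cutoff:] = 0.0                            # keep only slow components
    base = np.fft.irfft(spec, n=motion.shape[0], axis=0)
    phys = motion - base                           # residual carries fast dynamics
    return base, phys

def quantize(stream: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Nearest-neighbour vector quantization: map each frame to the index of
    its closest codebook entry (a stand-in for a learned VQ tokenizer)."""
    dists = np.linalg.norm(stream[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)                    # (T,) discrete token ids

# Usage sketch: decompose, tokenize each stream; decoding would map tokens
# back to continuous streams and sum base + phys to recover the motion.
T, D, K = 120, 69, 512                             # frames, DoF, codebook size (assumed)
motion = np.random.randn(T, D).astype(np.float32)
base, phys = split_frequency(motion)
base_tokens = quantize(base, np.random.randn(K, D).astype(np.float32))
phys_tokens = quantize(phys, np.random.randn(K, D).astype(np.float32))
```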

Citation

@misc{motionvla2026,
  title={MotionVLA: Vision-Language-Action Model for Humanoid Motion},
  author={Nonghai Zhang and Siyu Zhai and Yanjun Li and Zeyu Zhang and Zhihan Yin and Yandong Guo and Boxin Shi and Hao Tang},
  journal={Tech Report},
  year={2026}
}