ReMoMask

ReMoMask: Retrieval-Augmented Masked Motion Generation

Zhengdao Li¹*, Siheng Wang²*, Zeyu Zhang¹*†, Hao Tang¹‡

¹Peking University ²Jiangsu University

*Equal contribution. †Project lead. ‡Corresponding author.

TLDR: ReMoMask is a retrieval-augmented masked text-to-motion generation model that enhances retrieval precision, improves temporal-spatial alignment, and achieves state-of-the-art performance on motion generation benchmarks.

🔁 Bidirectional Momentum Contrastive Learning: Enables larger negative sample pools to boost retrieval quality beyond batch-size limitations.

🧠 Semantic Spatial-Temporal Attention (SSTA): Fuses retrieved motion and text features with part-level spatiotemporal awareness to improve coherence and realism.

⚡ Efficient & High-Quality Motion Generation: Built on RVQ-VAE, ReMoMask achieves strong zero-shot performance on HumanML3D and KIT-ML with fewer decoding steps.
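The momentum-queue idea in the first highlight can be illustrated with a short NumPy sketch. This is a generic MoCo-style formulation written for illustration, not the authors' implementation: the names `ema_update`, `FeatureQueue`, `info_nce`, and `bidirectional_loss`, the momentum value, and the temperature are all assumptions. It shows how negatives drawn from a queue of past momentum features let the number of negatives exceed the batch size, and how the loss is applied in both the text-to-motion and motion-to-text directions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale feature vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def ema_update(momentum_params, online_params, m=0.99):
    """Move the momentum encoder's weights a small step toward the online encoder's."""
    for name in momentum_params:
        momentum_params[name] = m * momentum_params[name] + (1.0 - m) * online_params[name]
    return momentum_params


class FeatureQueue:
    """Fixed-size FIFO of momentum features, used as extra negatives (hypothetical helper)."""

    def __init__(self, size, dim, rng):
        # Start from random unit vectors; real training would fill this over time.
        self.feats = l2_normalize(rng.standard_normal((size, dim)))
        self.ptr = 0

    def enqueue(self, batch):
        # Overwrite the oldest entries, wrapping around at the end of the buffer.
        size = self.feats.shape[0]
        idx = (self.ptr + np.arange(batch.shape[0])) % size
        self.feats[idx] = batch
        self.ptr = int((self.ptr + batch.shape[0]) % size)


def info_nce(query, positive_key, queue_feats, temperature=0.07):
    """InfoNCE loss: one positive per query, all queue entries as negatives."""
    pos = np.sum(query * positive_key, axis=1, keepdims=True)  # shape (B, 1)
    neg = query @ queue_feats.T                                # shape (B, K)
    logits = np.concatenate([pos, neg], axis=1) / temperature
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[:, 0].mean())


def bidirectional_loss(text_q, motion_q, text_k, motion_k,
                       text_queue, motion_queue, temperature=0.07):
    """Average the text-to-motion and motion-to-text contrastive losses."""
    t2m = info_nce(text_q, motion_k, motion_queue.feats, temperature)
    m2t = info_nce(motion_q, text_k, text_queue.feats, temperature)
    return 0.5 * (t2m + m2t)
```

In this sketch the `_q` features would come from the online encoders and the `_k` features from the momentum encoders; after each step the `_k` features are enqueued, so the number of negatives is set by the queue size rather than by the batch size.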

Method

ReMoMask is a unified framework that integrates three key innovations: (1) a Bidirectional Momentum Text-Motion Model that decouples the negative sample scale from the batch size via momentum queues, substantially improving cross-modal retrieval precision; (2) a Semantic Spatiotemporal Attention mechanism that enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; and (3) RAG-Classifier-Free Guidance, which incorporates a small amount of unconditional generation to enhance generalization.
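As a rough illustration of the third component, the sketch below shows standard classifier-free guidance applied to a model conditioned on both text and retrieved-motion features. It is a minimal sketch under stated assumptions, not the paper's exact formulation: the function names `drop_conditions` and `guided_logits`, the choice to zero out both conditions together, and the linear logit combination are all assumptions made for the example.

```python
import numpy as np


def drop_conditions(text_emb, retrieved_emb, drop_prob, rng):
    """Training-time condition dropout (assumed scheme).

    With probability `drop_prob`, a sample has both its text and retrieved-motion
    features zeroed, so the model also learns an unconditional prediction.
    Returns the masked features and a per-sample keep flag (1.0 = kept).
    """
    batch = text_emb.shape[0]
    keep = (rng.random(batch) >= drop_prob).astype(text_emb.dtype)[:, None]
    return text_emb * keep, retrieved_emb * keep, keep[:, 0]


def guided_logits(cond_logits, uncond_logits, scale):
    """Inference-time guidance: extrapolate from unconditional toward conditional logits.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 pushes further in the direction of the conditions.
    """
    return uncond_logits + scale * (cond_logits - uncond_logits)
```

At inference the model would be run twice per decoding step, once with the conditions and once with them zeroed, and the two token-logit tensors combined with `guided_logits` before sampling.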

Qualitative Results

a man is pretending to be a chicken. constantly pecking at the ground and waving his arms like a chicken.

a man is walking forward, favoring his left leg and shifting his walk. he is possibly drunk.

a man kicks something or someone with his left leg.

a person is looking around, turns to the left, then looks around again.

A person is walking on a circle.

a person jumps forward three times and then walks a few steps.

a person jumps forward with both legs and then continues walking until reaching the other side.

a person raises its arms then puts them back down 3 times.

a person raises their arms towards their shoulders.

a person walks forward then turns to the right and continues to walk.

a person walks slowly forward then toward the left hand side and stands facing that direction.

a person walks to the right in a partial circle.

a person walks toward the front, turns to the right, bounces into a squat, and places both arms.

person lifts their hands up twice on the same spot.

person turns one direction then other direction standing feet apart and arms side to side.

someone working on the construction site.

Comparison Results

ReMoMask

a man walks forwards and then stops.

a person is balancing on something.

a person walks forward in a clumsy way.

a person walks forward rather slowly.

MoGenTS

ReMoDiffuse

TMR

Citation

@article{li2025remomask,
  title={ReMoMask: Retrieval-Augmented Masked Motion Generation},
  author={Li, Zhengdao and Wang, Siheng and Zhang, Zeyu and Tang, Hao},
  journal={arXiv preprint arXiv:2508.02605},
  year={2025}
}