ReMoMask: Retrieval-Augmented Masked Motion Generation


Zhengdao Li1*     Siheng Wang2*     Zeyu Zhang1*†     Hao Tang1‡    
1Peking University     2Jiangsu University    

*Equal contribution. †Project lead. ‡Corresponding author.




TLDR: ReMoMask is a retrieval-augmented masked text-to-motion generation model that enhances retrieval precision, improves temporal-spatial alignment, and achieves state-of-the-art performance on motion generation benchmarks.
  • 🔁 Bidirectional Momentum Contrastive Learning: Enables larger negative sample pools to boost retrieval quality beyond batch-size limitations.
  • 🧠 Semantic Spatial-Temporal Attention (SSTA): Fuses retrieved motion and text features with part-level spatiotemporal awareness to improve coherence and realism.
  • Efficient & High-Quality Motion Generation: Built on RVQ-VAE, ReMoMask achieves strong zero-shot performance on HumanML3D and KIT-ML with fewer decoding steps.
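To make the first point concrete, here is a minimal sketch of how a momentum queue can supply negatives beyond the current batch. It is an illustration in the style of MoCo, not the authors' implementation: the function and class names, the EMA coefficient `m`, the temperature, and the reading of "bidirectional" as averaging the text-to-motion and motion-to-text losses are all assumptions.

```python
import math
from collections import deque


def ema_update(key_params, query_params, m=0.99):
    """Momentum (EMA) update: the slow key encoder trails the query encoder.
    m=0.99 is an assumed value, not taken from the paper."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query; the positive is placed at index 0."""
    q = normalize(query)
    logits = [dot(q, normalize(positive)) / temperature]
    logits += [dot(q, normalize(n)) / temperature for n in negatives]
    mx = max(logits)  # subtract the max for numerical stability
    log_sum = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_sum - logits[0]


class NegativeQueue:
    """Fixed-size FIFO of past key embeddings; its size is independent of batch size."""

    def __init__(self, max_size):
        self.items = deque(maxlen=max_size)

    def enqueue(self, embeddings):
        self.items.extend(embeddings)  # oldest entries are evicted automatically

    def __len__(self):
        return len(self.items)


def bidirectional_loss(text_emb, motion_emb, motion_queue, text_queue, temperature=0.07):
    """Average of text-to-motion and motion-to-text InfoNCE (assumed formulation)."""
    t2m = info_nce(text_emb, motion_emb, list(motion_queue.items), temperature)
    m2t = info_nce(motion_emb, text_emb, list(text_queue.items), temperature)
    return 0.5 * (t2m + m2t)
```

The point of the queue is that the number of negatives seen by each query equals the queue length, so it can be made much larger than a batch without increasing memory for gradient computation.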


    Method


    ReMoMask is a unified framework integrating three key innovations: (1) a Bidirectional Momentum Text-Motion Model that decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; (2) a Semantic Spatiotemporal Attention mechanism that enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; and (3) RAG-Classifier-Free Guidance, which incorporates minor unconditional generation to enhance generalization.
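The third component can be illustrated with the standard classifier-free guidance recipe. The sketch below is generic and assumes details the page does not state: that both the text and retrieved-motion conditions are dropped together during training, the drop probability `p_drop`, the guidance `scale`, and the helper names are all hypothetical.

```python
import random


def drop_conditions(text_feat, retrieved_feat, p_drop=0.1, rng=random):
    """Training time: with a small probability, drop both conditions so the
    model also learns an unconditional prediction (p_drop is an assumed value)."""
    if rng.random() < p_drop:
        return None, None
    return text_feat, retrieved_feat


def guided_logits(cond_logits, uncond_logits, scale=4.0):
    """Inference time: move the unconditional logits toward the conditional
    ones; scale=1 recovers the conditional logits, scale=0 the unconditional."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
```

With `scale` greater than 1, token logits are extrapolated past the conditional prediction, which typically strengthens adherence to the text and retrieved motion at some cost in diversity.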


    Qualitative Results


    a man is pretending to be a chicken. constantly pecking at the ground and waving his arms like a chicken.

    a man is walking forward, favoring his left leg and shifting his walk. he is possibly drunk.

    a man kicks something or someone with his left leg.

    a person is looking around, turns to the left, then looks around again.

    A person is walking on a circle.

    a person jumps forward three times and then walks a few steps.

    a person jumps forward with both legs and the continues walking until reaching the other side.

    a person raises its arms then puts them back down 3 times.

    a person raises their arms towards their shoulders.

    a person walks forward then turns to the right and continues to walk.

    a person walks slowly forward then toward the left hand side and stands facing that direction.

    a person walks to the right in a partial circle.

    a person walks toward the front, turns to the right, bounces into a squat , and places both arms.

    person lifts their hands up twice on the same spot.

    person turns one direction then other direction standing feet apart and arms side to side.

    someone working on the construction site.


    Comparison Results


    ReMoMask

    a man walks forwards and then stops.

    a person is balancing on something.

    a person walks forward in a clumsy way.

    a person walks forward rather slowly.

    MoGenTS
    ReMoDiffuse
    TMR

    Citation


    @article{li2025remomask,
      title={ReMoMask: Retrieval-Augmented Masked Motion Generation},
      author={Li, Zhengdao and Wang, Siheng and Zhang, Zeyu and Tang, Hao},
      journal={arXiv preprint arXiv:2508.02605},
      year={2025}
    }