TLDR:
ReMoMask is a retrieval-augmented masked text-to-motion generation model that enhances retrieval precision, improves temporal-spatial alignment, and achieves state-of-the-art performance on motion generation benchmarks.
🔁 Bidirectional Momentum Contrastive Learning: Expands the negative sample pool beyond the batch size, improving retrieval quality.
🧠 Semantic Spatial-Temporal Attention (SSTA): Fuses retrieved motion and text features with part-level spatiotemporal awareness to improve coherence and realism.
⚡ Efficient & High-Quality Motion Generation: Built on RVQ-VAE, ReMoMask achieves strong zero-shot performance on HumanML3D and KIT-ML with fewer decoding steps.
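
To make the first bullet concrete, here is a minimal sketch of how a momentum-based contrastive objective can draw negatives from outside the current batch. It is not ReMoMask's implementation. It assumes a MoCo-style design: slowly updated (momentum) encoders produce keys, and a fixed-size FIFO queue per modality stores past keys as negatives. All names (`BidirectionalMomentumQueues`, `momentum_update`, `info_nce`), the temperature, and the momentum coefficient are hypothetical, and plain Python lists stand in for learned embeddings.

```python
import math
from collections import deque


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def momentum_update(online_params, momentum_params, m=0.99):
    """EMA update: momentum <- m * momentum + (1 - m) * online."""
    return [m * k + (1.0 - m) * q for q, k in zip(online_params, momentum_params)]


def info_nce(query, positive, negatives, temperature=0.07):
    """Cross-entropy of the positive key against the positive plus all negatives."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    mx = max(logits)  # subtract the max for numerical stability
    log_sum = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_sum - logits[0]


class BidirectionalMomentumQueues:
    """Hypothetical helper: one negative queue per modality (assumed design)."""

    def __init__(self, queue_size):
        # deque(maxlen=...) drops the oldest key once the queue is full.
        self.text_queue = deque(maxlen=queue_size)
        self.motion_queue = deque(maxlen=queue_size)

    def loss(self, text_q, motion_q, text_k, motion_k, temperature=0.07):
        # *_q come from the online encoders, *_k from the momentum encoders.
        t2m = info_nce(text_q, motion_k, list(self.motion_queue), temperature)
        m2t = info_nce(motion_q, text_k, list(self.text_queue), temperature)
        return 0.5 * (t2m + m2t)  # average of both retrieval directions

    def enqueue(self, text_k, motion_k):
        self.text_queue.append(text_k)
        self.motion_queue.append(motion_k)


if __name__ == "__main__":
    queues = BidirectionalMomentumQueues(queue_size=4)
    text = normalize([1.0, 0.2])
    motion = normalize([0.9, 0.3])
    # Store a past (text, motion) key pair; it acts as a negative for the new pair.
    queues.enqueue(normalize([0.0, 1.0]), normalize([-1.0, 0.5]))
    print(queues.loss(text, motion, text, motion))
```

In this sketch the number of negatives is set by `queue_size` rather than by the batch size, which is the property the bullet describes. A real training loop would compute the embeddings with learned encoders and apply `momentum_update` to the key encoders' parameters after each optimizer step.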