Nav-R1: Reasoning and Navigation in Embodied Scenes

1Shanghai University of Engineering Science      2Peking University
*Equal contribution.    Project lead.    Corresponding author.

Overview

TL;DR: Nav-R1 is an embodied foundation model that integrates dialogue, reasoning, planning, and navigation capabilities to enable intelligent interaction and task execution in 3D environments.

Real World Navigation

Real World Understanding

Embodied Dialogue

Embodied Reasoning

Embodied Planning

Simulated Navigation

Multi-Task Generalist

Multimodal Understanding

Nav-R1 demonstrates strong multimodal understanding, effectively aligning visual, language, and action inputs for navigation.

Detailed Planning

Nav-R1 enables detailed planning by generating precise, step-by-step trajectories for complex navigation tasks.

Robust Navigation

Nav-R1 achieves robust navigation, maintaining reliable performance across diverse and challenging environments.

Figure 2: Nav-R1 is an embodied foundation model that integrates dialogue, reasoning, planning, and navigation capabilities to enable intelligent interaction and task execution in 3D environments.

RL Policy

Understanding Reward

Nav-R1 employs an understanding reward to enhance semantic grounding and improve instruction comprehension.
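As a rough illustration of such a term, an understanding reward can score the model's answer against a reference: full credit for an exact match, otherwise a soft token-overlap score as a semantic-alignment proxy. This is a minimal sketch, not the paper's exact formulation.

```python
from collections import Counter

def understanding_reward(pred_answer: str, ref_answer: str) -> float:
    """Illustrative understanding reward: 1.0 for an exact match,
    otherwise a token-level F1 score as a soft semantic-alignment proxy."""
    pred = pred_answer.lower().split()
    ref = ref_answer.lower().split()
    if pred == ref:
        return 1.0
    if not pred or not ref:
        return 0.0
    # Count shared tokens (multiset intersection).
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```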

Navigation Reward

Nav-R1 incorporates a navigation reward to promote accurate trajectory following and successful task completion.
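One way to realize the two ingredients named in Figure 3 (path fidelity and endpoint accuracy) is sketched below: fidelity as the mean distance from predicted waypoints to the nearest ground-truth waypoint, mapped into (0, 1], plus a binary endpoint-success term. The 50/50 weighting and the 3 m success radius are illustrative assumptions, not the paper's values.

```python
import math

def navigation_reward(pred_path, gt_path, success_radius=3.0):
    """Illustrative navigation reward combining path fidelity and
    endpoint accuracy; both terms and weights are assumptions."""
    # Path fidelity: mean distance from each predicted point to its
    # nearest ground-truth point, squashed into (0, 1] via exp(-d).
    mean_dev = sum(
        min(math.dist(p, q) for q in gt_path) for p in pred_path
    ) / len(pred_path)
    fidelity_score = math.exp(-mean_dev)
    # Endpoint accuracy: success if the final position lands within
    # success_radius of the ground-truth goal.
    endpoint_score = 1.0 if math.dist(pred_path[-1], gt_path[-1]) <= success_radius else 0.0
    return 0.5 * fidelity_score + 0.5 * endpoint_score
```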

Format Reward

Nav-R1 leverages a format reward to ensure well-structured reasoning chains and action outputs during navigation.
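A format reward of this kind is typically a binary structural check on the raw model output. The sketch below assumes a reasoning-then-answer tag layout (`<think>…</think><answer>…</answer>`); the actual tags Nav-R1 uses may differ.

```python
import re

# Assumed tag layout; the tags used by Nav-R1 may differ.
FORMAT_PATTERN = re.compile(
    r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL
)

def format_reward(output: str) -> float:
    """Return 1.0 when the output follows the required
    reasoning-then-answer structure, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(output.strip()) else 0.0
```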

RL
Figure 3: The pipeline of the RL policy. The policy model generates N outputs from the text-image input. An understanding reward (answer correctness and semantic alignment), a navigation reward (path fidelity and endpoint accuracy), and a format reward (structure adherence) are then computed, normalized within the group, and combined with a KL penalty against a frozen reference model to update the policy.
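The grouping step in Figure 3 matches the GRPO-style recipe: score each of the N sampled outputs with a weighted sum of the three rewards, then normalize within the group to obtain advantages (the KL term to the frozen reference model enters the loss separately). The weights below are illustrative, not the paper's values.

```python
from statistics import mean, pstdev

def combined_reward(r_understand, r_nav, r_format, w=(1.0, 1.0, 0.5)):
    """Weighted sum of the three reward terms; weights are assumptions."""
    return w[0] * r_understand + w[1] * r_nav + w[2] * r_format

def group_advantages(rewards):
    """GRPO-style group-relative advantages: normalize each sampled
    output's reward by the group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]
```

In training, each advantage weights the log-probability of its sampled output, and the KL penalty to the frozen reference keeps the policy from drifting.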

Nav-CoT-110K

Data Engine

Nav-CoT-110K is built with a Gemini 2.5 Pro data engine that systematically generates large-scale, diverse navigation trajectories and instructions.

CoT Annotations

Nav-CoT-110K provides high-quality chain-of-thought annotations that deliver explicit step-by-step reasoning for navigation tasks.

Diverse Modality Coverage

Nav-CoT-110K offers diverse modality coverage, spanning language, vision, and action signals for robust navigation learning.

Figure 4: CoT Data Engine. We construct the Nav-CoT dataset by defining navigation instructions, integrating egocentric visual inputs, providing action options, and specifying the output format. These components are fed into Gemini 2.5 Pro, which generates step-by-step reasoning and action decisions aligned with the navigation goals.
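The four components named in the Figure 4 caption can be assembled into a single query and the reply validated before it enters the dataset. The prompt wording, JSON fields, and helper names below are illustrative assumptions, not the engine's actual implementation.

```python
import json

def build_cot_prompt(instruction, frame_descriptions, action_options):
    """Assemble a Nav-CoT-style query for a reasoning LLM
    (e.g. Gemini 2.5 Pro). Wording and fields are assumptions."""
    return (
        f"Navigation instruction: {instruction}\n"
        "Egocentric observations:\n"
        + "\n".join(f"  Step {i}: {d}" for i, d in enumerate(frame_descriptions))
        + "\nAvailable actions: " + ", ".join(action_options) + "\n"
        'Respond in JSON: {"reasoning": "step-by-step thoughts", '
        '"action": "one of the available actions"}'
    )

def parse_cot_response(raw, action_options):
    """Validate the model's JSON reply; malformed or off-menu
    samples would be discarded by the data engine."""
    data = json.loads(raw)
    if data["action"] not in action_options:
        raise ValueError("action not in the provided options")
    return data["reasoning"], data["action"]
```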

Citation

@article{liu2025navr1,
  title={Nav-R1: Reasoning and Navigation in Embodied Scenes},
  author={Liu, Qingxiang and Huang, Ting and Zhang, Zeyu and Tang, Hao},
  journal={arXiv preprint arXiv:2509.10884},
  year={2025}
}