Nav-R1: Reasoning and Navigation in Embodied Scenes

1Shanghai University of Engineering Science      2Peking University
*Equal contribution.    Project lead.    Corresponding author.

Overview

TL;DR: Nav-R1 is an embodied foundation model that integrates dialogue, reasoning, planning, and navigation capabilities to enable intelligent interaction and task execution in 3D environments.

Real World Navigation

Real World Understanding

Embodied Dialogue

Embodied Reasoning

Embodied Planning

Simulated Navigation

Multi-Task Generalist

Multimodal Understanding

Nav-R1 demonstrates strong multimodal understanding, effectively aligning visual, language, and action inputs for navigation.

Detailed Planning

Nav-R1 enables detailed planning by generating precise, step-by-step trajectories for complex navigation tasks.

Robust Navigation

Nav-R1 achieves robust navigation, maintaining reliable performance across diverse and challenging environments.

Figure 2: Nav-R1 is an embodied foundation model that integrates dialogue, reasoning, planning, and navigation capabilities to enable intelligent interaction and task execution in 3D environments.

RL Policy

Understanding Reward

Nav-R1 employs an understanding reward to enhance semantic grounding and improve instruction comprehension.
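As a rough illustration of such a term, an understanding reward can score the model's answer against a reference: full credit for an exact match, otherwise a soft token-overlap score as a semantic-alignment proxy. This is a minimal sketch, not the paper's exact formulation.

```python
from collections import Counter

def understanding_reward(pred_answer: str, ref_answer: str) -> float:
    """Illustrative understanding reward: 1.0 for an exact match,
    otherwise a token-level F1 score as a soft semantic-alignment proxy."""
    pred = pred_answer.lower().split()
    ref = ref_answer.lower().split()
    if pred == ref:
        return 1.0
    if not pred or not ref:
        return 0.0
    # Count shared tokens (multiset intersection).
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```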

Navigation Reward

Nav-R1 incorporates a navigation reward to promote accurate trajectory following and successful task completion.
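One way to realize the two ingredients named in Figure 3 (path fidelity and endpoint accuracy) is sketched below: fidelity as the mean distance from predicted waypoints to the nearest ground-truth waypoint, mapped into (0, 1], plus a binary endpoint-success term. The 50/50 weighting and the 3 m success radius are illustrative assumptions, not the paper's values.

```python
import math

def navigation_reward(pred_path, gt_path, success_radius=3.0):
    """Illustrative navigation reward combining path fidelity and
    endpoint accuracy; both terms and weights are assumptions."""
    # Path fidelity: mean distance from each predicted point to its
    # nearest ground-truth point, squashed into (0, 1] via exp(-d).
    mean_dev = sum(
        min(math.dist(p, q) for q in gt_path) for p in pred_path
    ) / len(pred_path)
    fidelity_score = math.exp(-mean_dev)
    # Endpoint accuracy: success if the final position lands within
    # success_radius of the ground-truth goal.
    endpoint_score = 1.0 if math.dist(pred_path[-1], gt_path[-1]) <= success_radius else 0.0
    return 0.5 * fidelity_score + 0.5 * endpoint_score
```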

Format Reward

Nav-R1 leverages a format reward to ensure well-structured reasoning chains and action outputs during navigation.
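A format reward of this kind is typically a binary structural check on the raw model output. The sketch below assumes a reasoning-then-answer tag layout (`<think>…</think><answer>…</answer>`); the actual tags Nav-R1 uses may differ.

```python
import re

# Assumed tag layout; the tags used by Nav-R1 may differ.
FORMAT_PATTERN = re.compile(
    r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL
)

def format_reward(output: str) -> float:
    """Return 1.0 when the output follows the required
    reasoning-then-answer structure, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(output.strip()) else 0.0
```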

RL
Figure 3: The pipeline of the RL policy. The policy model generates N outputs from the text-image input. An understanding reward (answer correctness and semantic alignment), a navigation reward (path fidelity and endpoint accuracy), and a format reward (structure adherence) are then computed, normalized within the group, and combined with a KL penalty against a frozen reference model to update the policy.
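The grouping step in Figure 3 matches the GRPO-style recipe: score each of the N sampled outputs with a weighted sum of the three rewards, then normalize within the group to obtain advantages (the KL term to the frozen reference model enters the loss separately). The weights below are illustrative, not the paper's values.

```python
from statistics import mean, pstdev

def combined_reward(r_understand, r_nav, r_format, w=(1.0, 1.0, 0.5)):
    """Weighted sum of the three reward terms; weights are assumptions."""
    return w[0] * r_understand + w[1] * r_nav + w[2] * r_format

def group_advantages(rewards):
    """GRPO-style group-relative advantages: normalize each sampled
    output's reward by the group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]
```

In training, each advantage weights the log-probability of its sampled output, and the KL penalty to the frozen reference keeps the policy from drifting.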

Nav-CoT-110K

Data Engine

Nav-CoT-110K is built with a Gemini 2.5 Pro data engine that systematically generates large-scale, diverse navigation trajectories and instructions.

CoT Annotations

Nav-CoT-110K provides high-quality chain-of-thought annotations that deliver explicit step-by-step reasoning for navigation tasks.

Diverse Modality Coverage

Nav-CoT-110K offers diverse modality coverage, spanning language, vision, and action signals for robust navigation learning.

Figure 4: CoT Data Engine. We construct the Nav-CoT dataset by defining navigation instructions, integrating egocentric visual inputs, providing action options, and specifying the output format. These components are fed into Gemini 2.5 Pro, which generates step-by-step reasoning and action decisions aligned with the navigation goals.
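The four components named in the Figure 4 caption can be assembled into a single query and the reply validated before it enters the dataset. The prompt wording, JSON fields, and helper names below are illustrative assumptions, not the engine's actual implementation.

```python
import json

def build_cot_prompt(instruction, frame_descriptions, action_options):
    """Assemble a Nav-CoT-style query for a reasoning LLM
    (e.g. Gemini 2.5 Pro). Wording and fields are assumptions."""
    return (
        f"Navigation instruction: {instruction}\n"
        "Egocentric observations:\n"
        + "\n".join(f"  Step {i}: {d}" for i, d in enumerate(frame_descriptions))
        + "\nAvailable actions: " + ", ".join(action_options) + "\n"
        'Respond in JSON: {"reasoning": "step-by-step thoughts", '
        '"action": "one of the available actions"}'
    )

def parse_cot_response(raw, action_options):
    """Validate the model's JSON reply; malformed or off-menu
    samples would be discarded by the data engine."""
    data = json.loads(raw)
    if data["action"] not in action_options:
        raise ValueError("action not in the provided options")
    return data["reasoning"], data["action"]
```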

Citation

@article{liu2025navr1,
  title={Nav-R1: Reasoning and Navigation in Embodied Scenes},
  author={Liu, Qingxiang and Huang, Ting and Zhang, Zeyu and Tang, Hao},
  journal={arXiv preprint arXiv:2509.10884},
  year={2025}
}