PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps
TL;DR: PlatonicNav enables training-free embodied navigation through blind semantic matching between vision and language, unifying VLN and ObjNav using self-supervised visual representations.
Real World Evaluation: ObjNav
Demo 1: Pre-exploration
Demo 1: Navigation
Demo 2: Pre-exploration
Demo 2: Navigation
Demo 3: Pre-exploration
Demo 3: Navigation
Real World Evaluation: VLN
Find the Plant
Go to the Chair
Go to the Lamp
Simulation: ObjNav (OVON)
Demo 1: Refrigerator
Demo 2: TV Stand
Demo 3: Desk
Demo 4: Sofa Chair
Demo 5: Dining Chair
Demo 6: Chair
Demo 7: Photo
Simulation: VLN (R2R-CE)
Demo 1: Fireplace
From here, walk to the front of the fireplace.
Demo 2: Stairs
From here, head towards the stairs. Stop on the round rug next to the flowers.
Demo 3: Couch
From here, turn left and go straight until you get to three tables with chairs. Turn left and wait near the couch.
Demo 4: Island
From here, walk into the dining room area. Stop in front of the island.
Demo 5: Table
From here, walk into the kitchen, around the dining table to the buffet. Stop and wait there.
Demo 6: Desk
From here, walk towards the desk in the office area. Stop next to the desk.
Demo 7: Chair
From here, move ahead in between bar and table to the chair.
Demo 8: Table
From here, turn left and go straight until you get to a large table.
Demo 9: Stairs
From here, walk down the first set of stairs. Wait there.
Demo 10: Stairs
From here, turn left continue down the hallway until you get to the stairs. Wait there.
Demo 11: Stairs
From here, exit the living room, turn left, wait at the bottom of the stairs.
Demo 12: Stairs
From here, then turn left again and go down the stairs. Stop before going outside.
Demo 13: Stairs
From here, walk down stairs. Wait at bottom of stairs.
Method
PlatonicNav Pipeline. (a) Mapping: We construct a Platonic Topological Map as a semantic scene graph in which image segments serve as object nodes and edges are weighted by both geometric distance and semantic distance computed in the vision embedding space. (b) Goal Selection: Given a natural-language instruction, we blind-match the language embedding of the goal object category against the visual embeddings of segment clusters, using pairwise relations, and select the candidate goal nodes in the Platonic Topological Map. (c) Execution: Given the map and the candidate goal nodes, we compute paths to the goal node that can be reached with the lowest total edge weight; the resulting path lengths are assigned to segmentation masks to form a PlatonicObject Costmap for control prediction.
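The mapping and execution steps described in this caption can be illustrated with a small sketch. It is not the PlatonicNav implementation: the linear blend `alpha`, the use of cosine distance as the semantic distance, Dijkstra's algorithm as the path solver, the 2D positions, and all node names are assumptions chosen for illustration. The caption only states that edges combine geometric and semantic distance and that path lengths are assigned to segments.

```python
import heapq
import math


def cosine_distance(u, v):
    """1 - cosine similarity; stands in for the semantic distance (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def edge_weight(node_a, node_b, alpha=0.5):
    """Blend geometric and semantic distance. The linear blend is hypothetical."""
    geometric = math.dist(node_a["pos"], node_b["pos"])
    semantic = cosine_distance(node_a["emb"], node_b["emb"])
    return alpha * geometric + (1.0 - alpha) * semantic


def build_graph(nodes, edges, alpha=0.5):
    """Undirected weighted graph stored as {node: {neighbor: weight}}."""
    graph = {name: {} for name in nodes}
    for a, b in edges:
        weight = edge_weight(nodes[a], nodes[b], alpha)
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


def costs_to_goal(graph, goal):
    """Dijkstra from the goal: lowest total edge weight from each node to the goal."""
    costs = {goal: 0.0}
    heap = [(0.0, goal)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > costs.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, weight in graph[node].items():
            new_cost = cost + weight
            if new_cost < costs.get(neighbor, math.inf):
                costs[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return costs


# Hypothetical object nodes: a 2D position and a toy visual embedding each.
nodes = {
    "door": {"pos": (0.0, 0.0), "emb": (1.0, 0.0)},
    "table": {"pos": (3.0, 4.0), "emb": (0.0, 1.0)},
    "sofa": {"pos": (6.0, 8.0), "emb": (0.0, 1.0)},
}
graph = build_graph(nodes, [("door", "table"), ("table", "sofa")], alpha=0.5)

# Per-node cost to the goal; in the pipeline these values would be written
# onto the matching segmentation masks to form the costmap.
node_costs = costs_to_goal(graph, "sofa")
```

In this sketch a lower cost marks a segment that lies closer to the goal along the graph, so a controller could steer toward the lowest-cost visible segment.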
Blind Matching of Vision and Language in Navigation
Blind matching of vision and language in navigation scenes. Text and images are both abstractions of the same underlying world. The vision and language encoders f_v and f_l learn similar pairwise relations between concepts. We exploit these pairwise relations in a matching solver to recover valid correspondences between vision and language representations without requiring any paired data.
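A toy example may help make the idea concrete. The sketch below compares the pairwise similarity structure inside each modality and searches for the permutation that aligns the two structures, never comparing a vision vector to a language vector directly. The brute-force search over permutations, the cosine kernel, the squared-error objective, and the toy embeddings are all assumptions for illustration; the text above only states that pairwise relations are fed to a matching solver.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pairwise_similarity(embeddings):
    """Similarity matrix computed within a single modality."""
    return [[cosine(a, b) for b in embeddings] for a in embeddings]


def blind_match(vision_embs, language_embs):
    """Return (perm, cost), where perm[i] is the language index matched to vision item i.

    Brute force over all permutations, so only practical for a handful of items.
    """
    k_vision = pairwise_similarity(vision_embs)
    k_language = pairwise_similarity(language_embs)
    n = len(vision_embs)
    best_perm, best_cost = None, math.inf
    for perm in permutations(range(n)):
        # Squared mismatch between the two pairwise structures under this matching.
        cost = sum(
            (k_vision[i][j] - k_language[perm[i]][perm[j]]) ** 2
            for i in range(n)
            for j in range(n)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# Toy vision embeddings (2D) for three concepts, in a fixed order.
vision = [(1.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]

# Toy language embeddings (3D) for the same concepts, shuffled and living in a
# different space: only the angles between concepts are shared with `vision`.
language = [(4.0, 0.0, -2.0), (0.0, 0.0, 2.0), (2.0, 0.0, 2.0)]

perm, cost = blind_match(vision, language)
# perm == (1, 2, 0): vision item 0 matches language item 1, and so on.
```

The two embedding sets have different dimensions, so no direct cross-modal distance is defined; the correspondence is recovered from within-modality relations alone. The factorial cost of exhaustive search means a practical solver would need a scalable approximation.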
Citation
@techreport{long2026platonicnav,
title={PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps},
author={Junlin Long and Zeyu Zhang and Xu Deng and Yiran Wang and Yue Yang and Luke Borgnolo and Maxwell Twelftree and Yang Zhao},
institution={Technical Report},
year={2026}
}