
What Limits Vision-and-Language Navigation?

Yunheng Wang1 · Yuetong Fang1 · Taowen Wang1 · Lusong Li2 · Kun Liu2 · Junzhe Xu1,2
Zizhao Yuan1 · Yixiao Feng1 · Jiaxi Zhang1 · Wei Lu1
Zecui Zeng2,† · Renjing Xu1,†
1HKUST(GZ)    2JD Explore Academy

Abstract

Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances such as motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R-CE and RxR-CE demonstrate that StereoNav achieves state-of-the-art egocentric RGB performance, with SR/SPL of 81.1%/68.3% on R2R-CE and 67.5%/52.0% on RxR-CE, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments.
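For reference, the SR and SPL numbers quoted above follow the standard VLN metric definitions: SR is the fraction of successful episodes, and SPL weights each success by the ratio of shortest-path length to the max of actual and shortest path length (Anderson et al., 2018). A minimal sketch (function names are illustrative, not from the StereoNav codebase):

```python
def success_rate(successes):
    """SR: fraction of episodes that reach the goal (successes are 0/1 flags)."""
    return sum(successes) / len(successes)

def spl(successes, path_lengths, shortest_lengths):
    """SPL: success weighted by inverse normalized path length.

    SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)
    where S_i is the success flag, p_i the agent's path length,
    and l_i the shortest-path length for episode i.
    """
    total = 0.0
    for s, p, l in zip(successes, path_lengths, shortest_lengths):
        total += s * l / max(p, l)
    return total / len(successes)
```

SPL therefore equals SR only when every successful episode follows the shortest path; any detour on a successful run lowers SPL but not SR, which is why the two numbers are reported together.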

Pilot Study

The Impact of Visual Uncertainty


Impact of visual uncertainty on VLN agents. (a) Top: Visual examples of four common perturbations during embodied navigation. (b) Bottom: Performance degradation of representative open-source VLN methods, where the LLaVA-based and Qwen-based methods correspond to StreamVLN and JanusVLN, respectively. Although existing agents perform competitively in the ideal setting, their navigation performance degrades markedly under severe visual perturbations.
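Two of the perturbations named above, illumination shift and motion blur, can be modeled as simple image-space operations. The sketch below is illustrative only: the pilot study's exact perturbation parameters are not given here, so the gain, bias, and kernel size are assumptions.

```python
import numpy as np

def illumination_shift(img, gain=1.4, bias=20.0):
    """Scale and offset pixel intensities (img in [0, 255]), then clip.

    A crude model of a lighting change; gain/bias values are illustrative.
    """
    return np.clip(img * gain + bias, 0.0, 255.0)

def horizontal_motion_blur(img, kernel_size=7):
    """Average each pixel with its horizontal neighbours (box kernel).

    A cheap stand-in for camera motion blur along the x-axis; edge padding
    keeps the output the same shape as the input.
    """
    pad = kernel_size // 2
    padded = np.pad(img, ((0, 0), (pad, pad)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    for k in range(kernel_size):
        out += padded[:, k:k + img.shape[1]]
    return out / kernel_size
```

Applying such perturbations to evaluation frames is one way to reproduce the degradation trend the pilot study reports for existing agents.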

The Impact of Instruction Ambiguity


Impact of instructional under-specification on VLN agents. (a) Top: Representative cases of Directional Ambiguity, where under-specified route or orientation cues permit multiple feasible paths, and Docking Ambiguity, where vague goal descriptions permit multiple plausible stopping targets. (b) Bottom: Distributions of ambiguity scores across representative VLMs, where lower scores indicate stronger ambiguity effects.

Overview


Overview of StereoNav. StereoNav takes stereo RGB observations, a navigation instruction, and a target-location prior as input. The target prior is rendered as persistent visual guidance, while stereo observations are encoded into unified semantic, structural, and geometric tokens through 2D semantic, 2D structural, and 3D geometry encoders. These tokens are then processed by the MLLM for joint action and depth prediction. The right panels illustrate detailed designs of selected modules.
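The dataflow described in this overview can be sketched as follows. Everything here is a stubbed stand-in (random projections and a mean-pool in place of the learned encoders and the MLLM); only the token routing — stereo frames and a tokenized target prior into semantic/structural/geometry tokens, then joint action and depth prediction — mirrors the description. All names and dimensions are placeholders, not StereoNav's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # token dimension (placeholder)

def encode(frames, n_tokens):
    # Stand-in for a learned encoder: flatten the input and apply a
    # fixed random projection to produce n_tokens tokens of width D.
    flat = frames.reshape(-1)
    proj = rng.standard_normal((n_tokens * D, flat.size)) / np.sqrt(flat.size)
    return (proj @ flat).reshape(n_tokens, D)

def stereonav_step(left, right, instruction_tokens, target_prior_tokens):
    sem = encode(left, 16)                    # 2D semantic tokens
    struct = encode(right, 16)                # 2D structural tokens
    geo = encode(np.stack([left, right]), 8)  # 3D geometry tokens (stereo pair)
    tokens = np.concatenate(
        [instruction_tokens, target_prior_tokens, sem, struct, geo])
    fused = tokens.mean(axis=0)               # stand-in for the MLLM
    action_logits = fused[:4]                 # e.g. {forward, left, right, stop}
    depth_pred = fused[4:8]                   # joint depth prediction head
    return action_logits, depth_pred
```

The point of the sketch is the joint output: action and depth are predicted from the same fused token sequence, which is how the overview describes depth awareness entering action prediction.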

Simulation Results and Visualization

| Method | Size | R2R NE↓ | R2R OSR↑ | R2R SR↑ | R2R SPL↑ | RxR NE↓ | RxR SR↑ | RxR SPL↑ |
|---|---|---|---|---|---|---|---|---|
| *Panoramic RGB-D Agent* | | | | | | | | |
| ETPNav | – | 4.7 | 65.0 | 57.0 | 49.0 | 5.6 | 54.8 | 44.9 |
| CLASH | – | 4.1 | 73.0 | 65.0 | 55.0 | – | – | – |
| D3D-VLP | 2B | 4.7 | 67.2 | 61.3 | 56.1 | – | – | – |
| ETP-R1 | – | 3.9 | 72.0 | 65.0 | 56.0 | 5.2 | 59.9 | 49.0 |
| P³Nav | – | 4.4 | 69.0 | 62.0 | 52.0 | 5.4 | 58.0 | 47.9 |
| *Panoramic RGB Agent* | | | | | | | | |
| AO-Planner | – | 5.6 | 59.0 | 47.0 | 33.0 | 7.1 | 43.3 | 30.5 |
| NavFoM | 7B | 4.6 | 72.1 | 61.7 | 55.3 | 4.7 | 64.4 | 56.2 |
| ABot-N0 | 4B | 3.8 | 70.8 | 66.4 | 63.9 | 3.8 | 69.3 | 60.0 |
| NavForesee | 3B | 3.9 | 78.4 | 66.2 | 59.7 | 4.2 | 66.3 | 53.2 |
| SPAN-Nav | – | 4.1 | 75.3 | 66.3 | 59.3 | 4.2 | 69.7 | 60.1 |
| *Egocentric RGB-D Agent* | | | | | | | | |
| NaVid-4D | 7B | 6.0 | 55.7 | 43.8 | 37.1 | – | – | – |
| Dynam3D | 7B | 5.3 | 62.1 | 52.9 | 45.7 | – | – | – |
| NavMorph | – | 5.8 | 56.9 | 47.9 | 33.2 | 8.9 | 30.8 | 22.8 |
| InternVLA-N1 | 8B | 4.8 | 63.3 | 58.2 | 54.0 | 5.9 | 53.5 | 46.1 |
| AgentVLN | 3B | 3.9 | 73.5 | 67.2 | 64.7 | 3.9 | 69.5 | 61.3 |
| *Egocentric RGB Agent* | | | | | | | | |
| NaVid | 7B | 5.5 | 49.1 | 37.4 | 35.9 | – | – | – |
| Uni-NaVid | 7B | 5.6 | 53.3 | 47.0 | 42.7 | 6.2 | 48.7 | 40.9 |
| NaVILA | 8B | 5.2 | 62.5 | 54.0 | 49.0 | 6.8 | 49.3 | 44.0 |
| StreamVLN | 7B | 5.0 | 64.2 | 56.9 | 51.9 | 6.2 | 52.9 | 46.0 |
| InternVLA-N1 | 8B | 4.9 | 60.6 | 55.4 | 52.1 | 6.4 | 49.5 | 41.8 |
| NavFoM | 7B | 5.0 | 64.9 | 56.2 | 51.2 | 5.5 | 57.4 | 49.4 |
| DualVLN | 8B | 4.1 | 70.7 | 64.3 | 58.5 | 4.6 | 61.4 | 51.8 |
| Efficient-VLN | 4B | 4.2 | 73.7 | 64.2 | 55.9 | 3.9 | 67.0 | 54.3 |
| JanusVLN | 8B | 4.8 | 65.2 | 60.5 | 56.8 | 6.1 | 56.2 | 47.5 |
| PROSPECT | 9B | 4.9 | 65.2 | 58.9 | 54.0 | 5.7 | 54.6 | 46.2 |
| SACA | 8B | 4.2 | 69.3 | 64.7 | 56.9 | 4.8 | 62.1 | 51.7 |
| NaVIDA | 3B | 4.3 | 69.5 | 61.4 | 54.7 | 5.2 | 57.4 | 49.6 |
| DyGeoVLN | 9B | 4.4 | 70.1 | 60.8 | 55.8 | – | – | – |
| DecoVLN | 7B | 5.0 | 63.5 | 56.3 | 50.5 | 5.7 | 54.2 | 46.3 |
| StereoNav (w/o ext. data) | 3B | 3.0 (−1.1) | 76.6 (+2.9) | 72.8 (+8.1) | 56.4 (+2.1) | 5.9 (+2.0) | 58.0 (−9.0) | 43.5 (−8.5) |
| StereoNav (w/ ext. data) | 3B | 2.1 (−2.0) | 82.4 (+8.7) | 81.1 (+16.4) | 68.3 (+9.8) | 4.6 (+0.7) | 67.5 (+0.5) | 52.0 (−2.3) |

Note: The two StereoNav rows (light/dark blue in the rendered table) report results without and with external training data, respectively. Top-3 results per column are marked in the rendered table by bold, underline, and dotted underline.

Visualization on R2R-CE


Stereo Egocentric View

Depth Prediction

Visualization on RxR-CE


Stereo Egocentric View

Depth Prediction

Real-World Visualization

Lobby Scene

Outdoor Scene

Gym Scene

Office Scene

Citation