
What Limits Vision-and-Language Navigation?

Yunheng Wang1 · Yuetong Fang1 · Taowen Wang1 · Lusong Li2 · Kun Liu2 · Junzhe Xu1,2
Zizhao Yuan1 · Yixiao Feng1 · Jiaxi Zhang1 · Wei Lu1
Zecui Zeng2,† · Renjing Xu1,†
1HKUST(GZ)    2JD Explore Academy

Abstract

Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances such as motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R-CE and RxR-CE demonstrate that StereoNav achieves state-of-the-art egocentric RGB performance, with SR/SPL of 81.1%/68.3% on R2R-CE and 67.5%/52.0% on RxR-CE, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments.
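For reference, the SR and SPL numbers quoted above follow the standard VLN metric definitions: SR is the fraction of successful episodes, and SPL weights each success by the ratio of shortest-path length to the max of actual and shortest path length (Anderson et al., 2018). A minimal sketch (function names are illustrative, not from the StereoNav codebase):

```python
def success_rate(successes):
    """SR: fraction of episodes that reach the goal (successes are 0/1 flags)."""
    return sum(successes) / len(successes)

def spl(successes, path_lengths, shortest_lengths):
    """SPL: success weighted by inverse normalized path length.

    SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)
    where S_i is the success flag, p_i the agent's path length,
    and l_i the shortest-path length for episode i.
    """
    total = 0.0
    for s, p, l in zip(successes, path_lengths, shortest_lengths):
        total += s * l / max(p, l)
    return total / len(successes)
```

SPL therefore equals SR only when every successful episode follows the shortest path; any detour on a successful run lowers SPL but not SR, which is why the two numbers are reported together.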

Pilot Study

The Impact of Visual Uncertainty


Impact of visual uncertainty on VLN agents. (a) Top: Visual examples of four common perturbations during embodied navigation. (b) Bottom: Performance degradation of representative open-source VLN methods, where the LLaVA-based and Qwen-based methods correspond to StreamVLN and JanusVLN, respectively. Although existing agents perform competitively in the ideal setting, their navigation performance degrades markedly under severe visual perturbations.
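Two of the perturbations named above, illumination shift and motion blur, can be modeled as simple image-space operations. The sketch below is illustrative only: the pilot study's exact perturbation parameters are not given here, so the gain, bias, and kernel size are assumptions.

```python
import numpy as np

def illumination_shift(img, gain=1.4, bias=20.0):
    """Scale and offset pixel intensities (img in [0, 255]), then clip.

    A crude model of a lighting change; gain/bias values are illustrative.
    """
    return np.clip(img * gain + bias, 0.0, 255.0)

def horizontal_motion_blur(img, kernel_size=7):
    """Average each pixel with its horizontal neighbours (box kernel).

    A cheap stand-in for camera motion blur along the x-axis; edge padding
    keeps the output the same shape as the input.
    """
    pad = kernel_size // 2
    padded = np.pad(img, ((0, 0), (pad, pad)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    for k in range(kernel_size):
        out += padded[:, k:k + img.shape[1]]
    return out / kernel_size
```

Applying such perturbations to evaluation frames is one way to reproduce the degradation trend the pilot study reports for existing agents.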

The Impact of Instruction Ambiguity


Impact of instructional under-specification on VLN agents. (a) Top: Representative cases of Directional Ambiguity, where under-specified route or orientation cues permit multiple feasible paths, and Docking Ambiguity, where vague goal descriptions permit multiple plausible stopping targets. (b) Bottom: Distributions of ambiguity scores across representative VLMs, where lower scores indicate stronger ambiguity effects.

Overview


Overview of StereoNav. StereoNav takes stereo RGB observations, a navigation instruction, and a target-location prior as input. The target prior is rendered as persistent visual guidance, while stereo observations are encoded into unified semantic, structural, and geometric tokens through 2D semantic, 2D structural, and 3D geometry encoders. These tokens are then processed by the MLLM for joint action and depth prediction. The right panels illustrate detailed designs of selected modules.
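The dataflow described in this overview can be sketched as follows. Everything here is a stubbed stand-in (random projections and a mean-pool in place of the learned encoders and the MLLM); only the token routing — stereo frames and a tokenized target prior into semantic/structural/geometry tokens, then joint action and depth prediction — mirrors the description. All names and dimensions are placeholders, not StereoNav's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # token dimension (placeholder)

def encode(frames, n_tokens):
    # Stand-in for a learned encoder: flatten the input and apply a
    # fixed random projection to produce n_tokens tokens of width D.
    flat = frames.reshape(-1)
    proj = rng.standard_normal((n_tokens * D, flat.size)) / np.sqrt(flat.size)
    return (proj @ flat).reshape(n_tokens, D)

def stereonav_step(left, right, instruction_tokens, target_prior_tokens):
    sem = encode(left, 16)                    # 2D semantic tokens
    struct = encode(right, 16)                # 2D structural tokens
    geo = encode(np.stack([left, right]), 8)  # 3D geometry tokens (stereo pair)
    tokens = np.concatenate(
        [instruction_tokens, target_prior_tokens, sem, struct, geo])
    fused = tokens.mean(axis=0)               # stand-in for the MLLM
    action_logits = fused[:4]                 # e.g. {forward, left, right, stop}
    depth_pred = fused[4:8]                   # joint depth prediction head
    return action_logits, depth_pred
```

The point of the sketch is the joint output: action and depth are predicted from the same fused token sequence, which is how the overview describes depth awareness entering action prediction.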

Simulation Results and Visualization

| Method | Size | R2R NE↓ | R2R OSR↑ | R2R SR↑ | R2R SPL↑ | RxR NE↓ | RxR SR↑ | RxR SPL↑ |
|---|---|---|---|---|---|---|---|---|
| *Panoramic RGB-D Agent* | | | | | | | | |
| ETPNav | – | 4.7 | 65.0 | 57.0 | 49.0 | 5.6 | 54.8 | 44.9 |
| CLASH | – | 4.1 | 73.0 | 65.0 | 55.0 | – | – | – |
| D3D-VLP | 2B | 4.7 | 67.2 | 61.3 | 56.1 | – | – | – |
| ETP-R1 | – | 3.9 | 72.0 | 65.0 | 56.0 | 5.2 | 59.9 | 49.0 |
| P³Nav | – | 4.4 | 69.0 | 62.0 | 52.0 | 5.4 | 58.0 | 47.9 |
| *Panoramic RGB Agent* | | | | | | | | |
| AO-Planner | – | 5.6 | 59.0 | 47.0 | 33.0 | 7.1 | 43.3 | 30.5 |
| NavFoM | 7B | 4.6 | 72.1 | 61.7 | 55.3 | 4.7 | 64.4 | 56.2 |
| ABot-N0 | 4B | 3.8 | 70.8 | 66.4 | 63.9 | 3.8 | 69.3 | 60.0 |
| NavForesee | 3B | 3.9 | 78.4 | 66.2 | 59.7 | 4.2 | 66.3 | 53.2 |
| SPAN-Nav | – | 4.1 | 75.3 | 66.3 | 59.3 | 4.2 | 69.7 | 60.1 |
| *Egocentric RGB-D Agent* | | | | | | | | |
| NaVid-4D | 7B | 6.0 | 55.7 | 43.8 | 37.1 | – | – | – |
| Dynam3D | 7B | 5.3 | 62.1 | 52.9 | 45.7 | – | – | – |
| NavMorph | – | 5.8 | 56.9 | 47.9 | 33.2 | 8.9 | 30.8 | 22.8 |
| InternVLA-N1 | 8B | 4.8 | 63.3 | 58.2 | 54.0 | 5.9 | 53.5 | 46.1 |
| AgentVLN | 3B | 3.9 | 73.5 | 67.2 | 64.7 | 3.9 | 69.5 | 61.3 |
| *Egocentric RGB Agent* | | | | | | | | |
| NaVid | 7B | 5.5 | 49.1 | 37.4 | 35.9 | – | – | – |
| Uni-NaVid | 7B | 5.6 | 53.3 | 47.0 | 42.7 | 6.2 | 48.7 | 40.9 |
| NaVILA | 8B | 5.2 | 62.5 | 54.0 | 49.0 | 6.8 | 49.3 | 44.0 |
| StreamVLN | 7B | 5.0 | 64.2 | 56.9 | 51.9 | 6.2 | 52.9 | 46.0 |
| InternVLA-N1 | 8B | 4.9 | 60.6 | 55.4 | 52.1 | 6.4 | 49.5 | 41.8 |
| NavFoM | 7B | 5.0 | 64.9 | 56.2 | 51.2 | 5.5 | 57.4 | 49.4 |
| DualVLN | 8B | 4.1 | 70.7 | 64.3 | 58.5 | 4.6 | 61.4 | 51.8 |
| Efficient-VLN | 4B | 4.2 | 73.7 | 64.2 | 55.9 | 3.9 | 67.0 | 54.3 |
| JanusVLN | 8B | 4.8 | 65.2 | 60.5 | 56.8 | 6.1 | 56.2 | 47.5 |
| PROSPECT | 9B | 4.9 | 65.2 | 58.9 | 54.0 | 5.7 | 54.6 | 46.2 |
| SACA | 8B | 4.2 | 69.3 | 64.7 | 56.9 | 4.8 | 62.1 | 51.7 |
| NaVIDA | 3B | 4.3 | 69.5 | 61.4 | 54.7 | 5.2 | 57.4 | 49.6 |
| DyGeoVLN | 9B | 4.4 | 70.1 | 60.8 | 55.8 | – | – | – |
| DecoVLN | 7B | 5.0 | 63.5 | 56.3 | 50.5 | 5.7 | 54.2 | 46.3 |
| StereoNav (w/o ext. data) | 3B | 3.0 (−1.1) | 76.6 (+2.9) | 72.8 (+8.1) | 56.4 (+2.1) | 5.9 (+2.0) | 58.0 (−9.0) | 43.5 (−8.5) |
| StereoNav (w/ ext. data) | 3B | 2.1 (−2.0) | 82.4 (+8.7) | 81.1 (+16.4) | 68.3 (+9.8) | 4.6 (+0.7) | 67.5 (+0.5) | 52.0 (−2.3) |

Note: The two StereoNav rows (light/dark blue in the rendered table) report results without and with external training data, respectively. Top-3 results per column are marked in the rendered table by bold, underline, and dotted underline.

Visualization on R2R-CE


Stereo Egocentric View

Depth Prediction

Visualization on RxR-CE


Stereo Egocentric View

Depth Prediction

Real-World Visualization

Lobby Scene

Outdoor Scene

Gym Scene

Office Scene

Citation