GroundControl: Anticipating Navigation Failures in Vision-Language Agents Using Trajectory-Consistent Uncertainty (2026)

For reliable deployment, navigation systems require an uncertainty signal indicating whether an ongoing trajectory is deviating from successful goal-directed execution. Existing uncertainty proxies, however, are poorly suited for this setting. Most rely on instantaneous signals such as predictive entropy over action distributions or token-level confidence. These measures capture ambiguity in individual decisions but do not reflect whether the resulting trajectory remains consistent with geometric progress toward the goal. An agent may therefore maintain high step-wise confidence while repeatedly executing actions that lead to oscillation, stagnation, or inefficient detours.

This suggests that uncertainty in embodied navigation should reflect trajectory-level consistency of goal-directed dynamics. In successful episodes, the distance-to-goal signal typically follows a structured evolution characterized by sustained progress with bounded variation. Systematic violations of this structure, such as oscillation, stagnation, divergence, or low path efficiency relative to displacement, provide quantitative evidence that execution is deviating from the intended navigation objective. Under this view, uncertainty estimation becomes the problem of detecting statistically significant deviations from expected goal-directed motion.
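To make the four violation types concrete, the following is a minimal sketch of how they could be measured from a single episode's distance-to-goal trace. It is not the GroundControl estimator itself. The function name `trajectory_diagnostics`, the threshold `stall_tol`, and the specific definitions of each quantity are illustrative assumptions.

```python
import numpy as np

def trajectory_diagnostics(dist_to_goal, positions, stall_tol=0.05):
    """Illustrative per-episode diagnostics (all definitions are assumptions).

    dist_to_goal: distances to the goal at steps 0..T.
    positions:    agent positions at steps 0..T, shape (T+1, 2).
    stall_tol:    hypothetical threshold below which a step counts as no progress.
    """
    d = np.asarray(dist_to_goal, dtype=float)
    p = np.asarray(positions, dtype=float)
    delta = np.diff(d)  # per-step change in distance to goal

    # Stagnation: fraction of steps with negligible change in distance.
    stagnation = float(np.mean(np.abs(delta) < stall_tol))

    # Oscillation: fraction of consecutive non-stalled steps whose
    # progress direction flips (approach followed by retreat, or vice versa).
    moving = delta[np.abs(delta) >= stall_tol]
    if moving.size >= 2:
        oscillation = float(np.mean(np.sign(moving[1:]) != np.sign(moving[:-1])))
    else:
        oscillation = 0.0

    # Divergence: net increase in distance to goal over the episode.
    divergence = float(max(d[-1] - d[0], 0.0))

    # Path efficiency: net displacement divided by total path length.
    path_len = float(np.linalg.norm(np.diff(p, axis=0), axis=1).sum())
    displacement = float(np.linalg.norm(p[-1] - p[0]))
    efficiency = displacement / path_len if path_len > 0 else 0.0

    return {
        "stagnation": stagnation,
        "oscillation": oscillation,
        "divergence": divergence,
        "efficiency": efficiency,
    }
```

Under these assumed definitions, an agent that walks straight toward the goal has zero oscillation and efficiency 1, while an agent that steps back and forth between two cells has oscillation 1 and efficiency 0, even if each individual action was chosen with high confidence.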

To evaluate uncertainty independently of raw task success, we introduce Selective Risk–Coverage Navigation (SRCN), a protocol for evaluating trajectory-level uncertainty signals. SRCN measures how effectively an uncertainty score ranks navigation episodes by failure, using risk–coverage curves and summary metrics including the area under the risk–coverage curve (AURC) and excess-AURC. This formulation isolates ranking quality without modifying the underlying navigation policy and enables direct comparison between entropy-based and behavioral estimators.

Contributions

We formalize trajectory-level consistency of distance-to-goal dynamics as a foundation for uncertainty estimation in VLN-based embodied navigation.

We introduce GroundControl, a lightweight trajectory-consistent estimator that detects statistically significant deviations from nominal goal-directed motion.

Figure: Comparison of successful and failed trajectories under complex instructions.

Across five splits of EB-Navigation, a large dataset of navigation episodes, our trajectory-consistent uncertainty achieves near-oracle ordering under success-based selective risk, as measured by the weighted-average area under the risk–coverage curve. It outperforms entropy-based, conformal, and heuristic baselines while remaining competitive under SPL-based selective evaluation.

An episode is considered successful if the agent reaches the target within a distance threshold epsilon; this outcome is recorded as a binary success indicator. In addition to Success Rate, we report Success weighted by Path Length (SPL).
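As a point of reference, the sketch below computes the two metrics using the standard SPL definition, in which each successful episode contributes the ratio of shortest-path length to the larger of the executed and shortest path lengths. The function names, argument names, and the handling of zero-length paths are illustrative assumptions; the paper's exact implementation is not shown in this section.

```python
def is_success(final_distance, epsilon):
    """Binary success indicator: 1 if the agent ends within epsilon of the target."""
    return 1 if final_distance <= epsilon else 0

def success_rate(successes):
    """Fraction of episodes with success indicator 1."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes:        per-episode success indicators (0 or 1).
    shortest_lengths: per-episode shortest-path length from start to goal.
    path_lengths:     per-episode length of the path the agent actually took.
    """
    terms = []
    for s, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        # Assumption: an episode with zero-length paths contributes 0.
        terms.append(s * shortest / denom if denom > 0 else 0.0)
    return sum(terms) / len(terms)
```

For example, with three episodes where the first succeeds along the shortest path, the second succeeds along a path twice as long, and the third fails, the Success Rate is 2/3 while SPL is (1 + 0.5 + 0) / 3 = 0.5.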

For each episode i, an uncertainty estimator produces a scalar score u_i, where lower values indicate higher confidence. The score may originate from internal state statistics such as posterior covariance or innovation energy, model-internal signals including attention entropy or belief dispersion, or post hoc behavioral measures such as action entropy, plan instability, invalid-action rates, or conformal nonconformity.

This abstraction allows heterogeneous uncertainty estimators to be evaluated within a common framework while isolating the quality of their episode-level ranking. In particular, the SRCN evaluation introduced later depends only on the ordering induced by u_i through thresholding.
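Because only the ordering of the scores matters, a risk–coverage curve can be sketched in a few lines: sort episodes from most to least confident, then track the mean loss of the retained prefix as coverage grows. The code below is a minimal sketch under stated assumptions, not the SRCN reference implementation. AURC is approximated as the mean selective risk over all coverage levels, and excess-AURC is taken as the gap to an oracle that sorts episodes by their true loss, following the common E-AURC convention. All function names are hypothetical.

```python
import numpy as np

def risk_coverage_curve(u, loss):
    """Selective risk at each coverage level k/n.

    u:    per-episode uncertainty scores (lower means more confident).
    loss: per-episode loss, e.g. 1 for failure and 0 for success.
    """
    u = np.asarray(u, dtype=float)
    loss = np.asarray(loss, dtype=float)
    order = np.argsort(u, kind="stable")  # most confident episodes first
    n = len(u)
    kept = np.arange(1, n + 1)
    coverage = kept / n
    risk = np.cumsum(loss[order]) / kept  # mean loss among retained episodes
    return coverage, risk

def aurc(u, loss):
    """Area under the risk-coverage curve (mean risk over coverage levels)."""
    _, risk = risk_coverage_curve(u, loss)
    return float(risk.mean())

def excess_aurc(u, loss):
    """AURC minus the AURC of an oracle that ranks episodes by true loss."""
    return aurc(u, loss) - aurc(loss, loss)

# Usage: four episodes, two failures (loss 1) that receive the highest scores.
loss = [0, 0, 1, 1]
good_scores = [0.1, 0.2, 0.8, 0.9]
print(aurc(good_scores, loss), excess_aurc(good_scores, loss))
```

Applying any strictly increasing transformation to the scores leaves both metrics unchanged, which is what allows estimators with different scales to be compared directly.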

Baseline Uncertainty Estimators

We compare against seven representative uncertainty baselines spanning conformal, entropy-based, trajectory-based, and heuristic signals. Each baseline produces an episode-level score u_i evaluated under the SRCN protocol.

Predictive Entropy. Normalized Shannon entropy of the episode action histogram, H, measuring dispersion in action usage.
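A minimal sketch of this baseline follows. It assumes that normalization divides by the logarithm of the action-space size, so that the score lies in [0, 1]; the section does not state the normalizer, and the function and parameter names are hypothetical.

```python
import math
from collections import Counter

def normalized_action_entropy(actions, num_actions):
    """Shannon entropy of the empirical action histogram, scaled to [0, 1].

    actions:     sequence of action labels executed in one episode.
    num_actions: size of the action space (assumed normalizer: log(num_actions)).
    """
    total = len(actions)
    if total == 0 or num_actions < 2:
        return 0.0
    probs = [count / total for count in Counter(actions).values()]
    entropy = sum(-p * math.log(p) for p in probs)
    return entropy / math.log(num_actions)
```

An episode that repeats a single action scores 0, and one that uses every action equally often scores 1. The histogram discards ordering, so alternating between two actions and executing them in two long runs yield the same score.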

Self-Consistency. Plan instability, measured as 1 minus the mean Jaccard similarity between consecutive executable plans extracted from VLM reasoning.
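The score can be sketched as follows, assuming each plan is represented as a collection of action tokens and compared as a set, so that ordering and repetition within a plan are ignored. That representation, the treatment of empty plans, and the function names are assumptions made for illustration.

```python
def jaccard(plan_a, plan_b):
    """Jaccard similarity between two plans treated as sets of action tokens."""
    a, b = set(plan_a), set(plan_b)
    if not a and not b:
        return 1.0  # assumption: two empty plans count as identical
    return len(a & b) / len(a | b)

def plan_instability(plans):
    """1 minus the mean Jaccard similarity between consecutive plans.

    plans: list of plans, one per replanning step, each a list of action tokens.
    """
    if len(plans) < 2:
        return 0.0  # assumption: a single plan shows no instability
    sims = [jaccard(p, q) for p, q in zip(plans, plans[1:])]
    return 1.0 - sum(sims) / len(sims)
```

An agent that keeps the same plan at every step scores 0, while one whose consecutive plans share no actions scores 1.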

Invalid-action rate. Fraction of steps where the executed action is rejected by the environment.

Random. Uncertainty scores drawn independently from Uniform(0,1), serving as a lower bound on ranking quality.

Experimental Protocol and Results

Table I presents baseline navigation performance across three LLM backbones: GPT-4o, GPT-5-mini, and Gemini-1.5-Flash. For GPT-4o, success rates exceed 53% on four splits but fall sharply to 16.7% on long_horizon, where lengthy execution chains compound errors. With GPT-5-mini, success exceeds 65% on four splits, yet long_horizon improves only marginally, to 18.9%. The resulting degradation in both Success Rate and SPL makes this split a stringent test of trajectory-level uncertainty ranking.

LLM Backbone	Base SR	Common Sense SR	Complex Instr. SR	Long Horizon SR	Avg. SR	Avg. SPL
GPT-4o	53.4%	56.7%	56.7%	16.7%	48.3%	0.33
GPT-5-mini	65.6%	68.9%	65.6%	18.9%	56.1%	0.40
Gemini-1.5-Flash	50.0%	47.8%	38.9%	14.4%	38.3%	0.25

Risk–Coverage Curves and Diagnostic Plots

Figure: Risk–coverage curves showing success rate as a function of coverage for different uncertainty estimators on the base navigation split.

Figure 4 shows risk–coverage curves under SPL-based loss, which penalizes inefficient trajectories in addition to outright failures. Trajectory-consistent uncertainty maintains low selective risk across coverage levels, indicating sensitivity to gradual degradation in navigation efficiency rather than only terminal failure. This behavior is particularly relevant for robotic navigation, where inefficient wandering, oscillatory motion, or repeated backtracking often precede failure and consume limited execution time or energy.

Frequently Asked Questions

What makes GroundControl different from existing uncertainty methods for navigation? GroundControl focuses on trajectory-level consistency of distance-to-goal dynamics rather than instantaneous action-level signals, allowing it to detect systematic deviations like oscillation or stagnation that step-wise confidence measures miss.

How does the SRCN protocol evaluate uncertainty quality independently of navigation policy? SRCN uses risk–coverage curves and summary metrics (AURC, excess-AURC) to measure how effectively uncertainty scores rank episodes by failure, without modifying the underlying navigation policy.

Which baselines does GroundControl outperform in the experiments? GroundControl achieves near-oracle ordering under success-based selective risk, outperforming predictive entropy, self-consistency, invalid-action rate, random baselines, and conformal methods across all five EB-Navigation splits.

Why does the long_horizon split pose a particular challenge for uncertainty estimation? The long_horizon split has sharply lower success rates (16.7% for GPT-4o, 18.9% for GPT-5-mini) due to compounding errors in lengthy execution chains, making it a stringent test of trajectory-level uncertainty ranking.

Selective Risk–Coverage Navigation Protocol