Slow Brain, Fast Planner: Keeping Robots Safe When AI Vision Takes Its Time (2026)

We propose a hybrid architecture that combines a fast local planner with a slow Vision-Language Model (VLM). The planner generates dynamically feasible candidate trajectories at high frequency, while the VLM provides semantic judgment asynchronously with 1–2 s latency. The key challenge is bridging this temporal mismatch: how can stale VLM advice improve real-time trajectory selection?

Visual Trajectory Selection

We render candidate trajectories onto the current camera image as numbered, colored annotations and let an off-the-shelf VLM pick an index.

Visual Overlay. We project each candidate trajectory in the robot body frame onto the camera image using the known camera extrinsics and intrinsics. Each trajectory is rendered as a colored polyline with its index labeled at the endpoint. The goal direction is optionally marked with an arrow. This visual representation lets the VLM reason directly in pixel space: it can see where each trajectory leads relative to sidewalk boundaries, pedestrians, and obstacles.
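The projection step can be illustrated with a minimal sketch. It assumes an ideal pinhole camera without lens distortion (the deployed system is described later as using a fisheye projection, which would need an additional distortion model), body-frame waypoints on the ground plane (z = 0), and hypothetical names `R_cam_body`, `t_cam_body`, `fx`, `fy`, `cx`, `cy` for the extrinsics and intrinsics. None of these names come from the paper.

```python
from typing import List, Optional, Sequence, Tuple

Pixel = Tuple[float, float]


def project_point(
    p_body: Tuple[float, float, float],
    R_cam_body: Sequence[Sequence[float]],  # 3x3 rotation, body -> camera (assumed name)
    t_cam_body: Sequence[float],            # translation, body -> camera (assumed name)
    fx: float, fy: float, cx: float, cy: float,
) -> Optional[Pixel]:
    """Project one body-frame point to pixel coordinates with a pinhole model."""
    x, y, z = p_body
    # Rigid transform into the camera frame: p_cam = R * p_body + t
    xc = R_cam_body[0][0] * x + R_cam_body[0][1] * y + R_cam_body[0][2] * z + t_cam_body[0]
    yc = R_cam_body[1][0] * x + R_cam_body[1][1] * y + R_cam_body[1][2] * z + t_cam_body[1]
    zc = R_cam_body[2][0] * x + R_cam_body[2][1] * y + R_cam_body[2][2] * z + t_cam_body[2]
    if zc <= 1e-6:
        return None  # point is behind (or on) the image plane; cannot be drawn
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def project_trajectory(
    waypoints_xy: Sequence[Tuple[float, float]],
    R_cam_body: Sequence[Sequence[float]],
    t_cam_body: Sequence[float],
    fx: float, fy: float, cx: float, cy: float,
) -> List[Pixel]:
    """Project ground-plane waypoints (z = 0) and drop those behind the camera."""
    pixels = []
    for x, y in waypoints_xy:
        uv = project_point((x, y, 0.0), R_cam_body, t_cam_body, fx, fy, cx, cy)
        if uv is not None:
            pixels.append(uv)
    return pixels
```

The resulting pixel list is what a drawing routine would connect into a polyline; the last visible pixel is where the index label would be placed.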

Training-Free Deployment. We use off-the-shelf VLMs (Gemini, GPT-5, Qwen) without any fine-tuning. The visual prompting interface transforms trajectory selection into a visual reasoning task that general-purpose VLMs can solve zero-shot. This eliminates the need for VLA training data, domain adaptation, or specialized model architectures.

Figure: Visual overlay showing candidate trajectories as colored polylines with index labels on a camera image.

Experiments

The evaluation has three components: (1) offline trajectory selection on real-world navigation logs; (2) closed-loop simulation under VLM latency with a controlled corrupted planner, isolating the effect of latency and fusion policy; (3) a real-world deployment on campus sidewalks under realistic cellular latency.

Closed-Loop Simulation Under VLM Latency

We study whether VLMs remain helpful in closed loop under 1–3 s of latency, and whether fusion preserves the headroom that is lost when stale VLM trajectories are executed directly.

Figure: Comparison of latency-handling policies in simulation, showing trajectory selection performance.

Conclusion

We have presented a latency-resilient approach to VLM-augmented navigation that enables continuous robot control despite 1–2 s VLM inference delays. Our key insight is that a fast planner and a slow VLM provide complementary capabilities that can be fused rather than forced into a single system. Off-the-shelf VLMs excel at trajectory selection in semantically challenging scenarios (30% ADE reduction), while learned planners remain competitive in routine situations, which motivates a fusion approach rather than VLM-only control. Score Fusion and Probability Fusion enable continuous control under latency. Real-world deployment with Probability Fusion and VLM Streaming substantially reduces human interventions compared to both planner-only and naive VLM execution.

Our approach inherits the planner's candidate set; if no good candidate exists, the VLM cannot help. We find VLM selection does not universally outperform the planner: in routine scenarios, the planner's learned scoring often suffices, and VLM queries consume computational resources without benefit. Our closed-loop simulator also models the VLM as a delayed oracle, which cannot exhibit real-world scene drift. Next steps include adaptive querying that invokes the VLM only when the planner is uncertain, studying the interaction between VLM and planner beyond trajectory selection, and testing the whole system in more realistic simulators.

Interface Overview

Our VLM does trajectory selection rather than low-level control: at each step, a fast local planner proposes a discrete set of short-horizon candidate trajectories (4 s horizon), and the VLM returns either (i) the index of one candidate to execute next or (ii) a stop decision when none appear safe. This constrains VLM outputs to dynamically feasible motions and enables safety fallbacks.

Local Planner: Anchor-Based Candidate Generation

We adopt S2E as our local planner. Unlike diffusion-based navigation models (e.g., NoMaD), S2E uses anchor-guided distribution matching to produce a structured set of candidate trajectories.

Anchor set. The model defines 64 anchor points obtained via k-means clustering over trajectory endpoints in the training data. Each anchor represents a prototypical behavioral mode (e.g., go straight, veer left, slow down, sharp turn). These anchors are fixed after training and serve as queries in a cross-attention decoder.

Architecture and outputs. Given the current RGB observation (past 4 frames) and a goal coordinate, an EfficientNet encoder and a Transformer encoder produce scene context embeddings. A Transformer decoder then cross-attends from the 64 anchor queries to these context embeddings, producing per-anchor features. Three lightweight heads decode each anchor feature into:

a score (softmax-normalized), representing the model's confidence that the anchor is the best behavioral mode for the current situation;
a regression trajectory: a sequence of 20 waypoints, expressed as normalized offsets from the anchor, forming a 4 s polyline in the robot frame;
a velocity scale, converting the normalized trajectory into metric coordinates.

The result is 64 candidate trajectories, each with an associated planner score. In our pipeline, we select the top-k candidates by score (default 8) for presentation to the VLM.
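The top-k step is simple enough to sketch directly. The function name `top_k_candidates` and the tie-breaking rule (lower anchor index first) are assumptions for illustration, not details from the paper.

```python
from typing import List, Sequence


def top_k_candidates(scores: Sequence[float], k: int = 8) -> List[int]:
    """Return the indices of the k highest-scoring candidates, best first.

    Ties are broken by the lower anchor index (an assumption of this sketch).
    """
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]
```

With 64 planner scores and the default k = 8, the returned indices identify the candidates that would be rendered on the overlay and listed in the prompt table.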

Candidate Visualization (Overlay Design)

What is rendered. We render an overlay on the front camera image with:

candidate trajectories as colored polylines (or optionally as swept-footprint corridors);
a small dot at each candidate endpoint;
an integer index label near each endpoint (the label text is the authoritative ID);
an optional goal cue (magenta GOAL marker and/or a "hanging" arrow).

Projection and geometry. Candidates are defined in the robot body frame on the ground plane and projected into the camera image using a lightweight fisheye projection. The overlay legend reminds the VLM that fisheye distortion is normal near image borders.

Label-to-line disambiguation. To reduce index confusion when trajectories overlap, each label is drawn with a background color matching its trajectory color; if a label must be moved for readability, a thin leader line connects the label to the endpoint dot.

Prompt Design

Separation of concerns. The system prompt enforces a safety-first policy and defines the output format. The user prompt provides per-step state: the goal (if any), the candidate count, and a table of the displayed candidates with their geometry (and optionally planner confidence, which we often hide to avoid anchoring).

Short-horizon semantics. The prompt explicitly states that candidates cover only 4 s and that the goal can be off-screen and far beyond the horizon; therefore the correct behavior is to pick a locally safe candidate that makes progress, not to "reach the goal" in one step.

Output Validation and Robust Parsing

To make execution and evaluation robust, we validate VLM outputs and normalize common formatting deviations. The parser handles the following cases in order:

Code-fence stripping: JSON wrapped in Markdown code fences (e.g., triple-backtick json blocks) is extracted before parsing.
JSON object extraction: the first {...} block is parsed; action-field values select_trajectory, select, stop, and halt are all accepted.
Bare integer fallback: if the response is a single integer (no JSON), it is treated as a trajectory index.
Index validation: if the returned index is not in the set of displayed labels, the parser attempts a rank-based mapping (interpreting the integer as a 0-based row index into the candidate table). If mapping also fails, the output is treated as invalid.

If parsing fails entirely or the index is out of range after mapping, we treat the step as invalid and fall back to a safe behavior (planner argmax or stop) in deployment.
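The parsing order above can be sketched as follows. This is a minimal illustration, not the authors' parser: the JSON key names `action` and `index`, the flat (non-nested) JSON object, and the return format are assumptions. The fence marker is built programmatically only to keep the example readable.

```python
import json
import re
from typing import Dict, List, Optional

FENCE = "`" * 3  # Markdown code-fence marker
SELECT_ACTIONS = {"select_trajectory", "select"}
STOP_ACTIONS = {"stop", "halt"}


def parse_vlm_output(text: str, displayed_labels: List[int]) -> Optional[Dict]:
    """Return {"action": "stop"}, {"action": "select", "index": label}, or None if invalid."""
    cleaned = text.strip()

    # 1. Code-fence stripping: keep only the content of the first fenced block.
    fence = re.search(re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE),
                      cleaned, re.DOTALL)
    if fence:
        cleaned = fence.group(1).strip()

    index = None
    # 2. JSON object extraction: parse the first {...} block (assumed flat).
    obj_match = re.search(r"\{.*?\}", cleaned, re.DOTALL)
    if obj_match:
        try:
            obj = json.loads(obj_match.group(0))
        except json.JSONDecodeError:
            return None
        action = str(obj.get("action", "")).lower()
        if action in STOP_ACTIONS:
            return {"action": "stop"}
        if action not in SELECT_ACTIONS:
            return None
        index = obj.get("index")
    # 3. Bare integer fallback: the whole response is a single integer.
    elif re.fullmatch(r"-?\d+", cleaned):
        index = int(cleaned)

    if not isinstance(index, int) or isinstance(index, bool):
        return None  # non-integer or missing index

    # 4. Index validation, then rank-based mapping (0-based row in the table).
    if index in displayed_labels:
        return {"action": "select", "index": index}
    if 0 <= index < len(displayed_labels):
        return {"action": "select", "index": displayed_labels[index]}
    return None
```

A return value of `None` corresponds to the invalid case, where the caller would apply the safe fallback (planner argmax or stop).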

Policies and Latency Handling

We evaluate three families of policies: (i) direct execution of stale VLM trajectories (VLM Hold and VLM Stream); (ii) matching the stale VLM trajectory to the closest current candidate (VLM Match); and (iii) fusion policies that bias the planner selection toward the stale VLM intention while still choosing among current candidates (Score Fusion / Probability Fusion).
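As an illustration of the matching family, the sketch below maps a stale VLM-selected trajectory to the closest current candidate. The distance measure (mean Euclidean distance between corresponding waypoints) is an assumption; this section does not specify the metric. The sketch also assumes that the stale trajectory has already been expressed in the current robot frame and that all trajectories have the same number of waypoints.

```python
import math
from typing import List, Sequence, Tuple

Trajectory = Sequence[Tuple[float, float]]


def trajectory_distance(a: Trajectory, b: Trajectory) -> float:
    """Mean Euclidean distance between corresponding waypoints (assumed metric)."""
    if len(a) != len(b) or not a:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def match_to_closest(stale: Trajectory, candidates: List[Trajectory]) -> int:
    """Return the index of the current candidate closest to the stale trajectory."""
    distances = [trajectory_distance(stale, c) for c in candidates]
    return min(range(len(candidates)), key=distances.__getitem__)
```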

Request scheduling and pipelining. We distinguish sequential request policies (submit the next query only after receiving the previous response; single in-flight request) from streaming request policies (submit at a fixed cadence; multiple pipelined in-flight requests). This separation isolates the effect of latency from throughput limitations.

System Architecture

The real-world system follows a two-rate architecture: a fast onboard local planner continuously proposes short-horizon, dynamically feasible candidate trajectories, while a slower VLM is queried asynchronously to provide high-level intent in the form of trajectory selection. Crucially, control and planning never block on the VLM response. Instead, the system (i) executes the planner in a receding-horizon loop and (ii) incorporates the most recent available VLM intent using the latency-handling policies described in the main paper (direct execution, matching, or fusion).

Asynchronous execution and time alignment. Each VLM request is tagged with a monotonically increasing request ID and a timestamp corresponding to the camera frame used for the overlay. When the response arrives, the policy aligns it to the current planning tick using (a) the request ID and (b) the current candidate set, applying either: (i) hold-style execution (execute the stale intent directly when feasible), (ii) match (map the stale intent to the closest current candidate), or (iii) fusion (bias the current planner selection toward the stale VLM intent while still selecting among up-to-date candidates). If no valid VLM output is available, the system falls back to a safe default (planner-only with conservative stopping).
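The fusion branch can be illustrated with a generic sketch. The exact formulas for Score Fusion and Probability Fusion are not given in this section, so the rule below is only one plausible form and should not be read as the authors' method: each current candidate's planner probability is multiplied by a similarity weight that decays with its distance to the stale VLM-selected trajectory, and the result is renormalized. The `temperature` parameter and the function name are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def fuse_with_stale_intent(
    planner_probs: Sequence[float],
    distances_to_stale: Optional[Sequence[float]],
    temperature: float = 0.5,  # hypothetical decay scale, in meters
) -> Tuple[int, List[float]]:
    """Bias the planner's choice toward candidates near the stale VLM intent.

    Returns (index of the selected candidate, fused probabilities).
    Falls back to the planner argmax when no VLM intent is available.
    """
    n = len(planner_probs)
    if distances_to_stale is None:
        best = max(range(n), key=planner_probs.__getitem__)
        return best, list(planner_probs)

    # Assumed fusion rule: planner probability times an exponential similarity.
    weights = [p * math.exp(-d / temperature)
               for p, d in zip(planner_probs, distances_to_stale)]
    total = sum(weights)
    if total <= 0.0:
        best = max(range(n), key=planner_probs.__getitem__)
        return best, list(planner_probs)

    fused = [w / total for w in weights]
    best = max(range(n), key=fused.__getitem__)
    return best, fused
```

Because the selection is always made among the current candidates, the executed trajectory stays up to date even when the VLM intent is one or two seconds old.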

VLM Query, Staleness Handling, and Safety Mechanisms

Query scheduling. The VLM is queried asynchronously using the overlay image plus the text prompt. We support two scheduling modes:

Sequential (used by vlm_hold and vlm_hold_match): a single request is in flight at any time; the next request is submitted only after the previous response is received and processed. This maximizes freshness per response but limits throughput.
Streaming (used by vlm_stream, score_fusion_stream, prob_fusion_stream): requests are submitted at a fixed cadence (default 1 Hz) regardless of whether previous responses have arrived, allowing multiple in-flight requests. When responses return (potentially out of order due to variable network latency), the system adopts the newest advice by query timestamp.
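The "adopt the newest advice by query timestamp" rule for streaming mode can be sketched as a small buffer that ignores responses older than the one already held. The class and field names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Advice:
    request_id: int         # monotonically increasing request ID
    query_timestamp: float  # time of the camera frame used for the overlay
    index: int              # trajectory index selected by the VLM


class LatestAdviceBuffer:
    """Keeps only the advice with the newest query timestamp."""

    def __init__(self) -> None:
        self.latest: Optional[Advice] = None

    def offer(self, advice: Advice) -> bool:
        """Store the advice if it is newer than the current one; return True if adopted."""
        if self.latest is None or advice.query_timestamp > self.latest.query_timestamp:
            self.latest = advice
            return True
        return False  # out-of-order (older) response is discarded

    def age(self, now: float) -> Optional[float]:
        """Staleness of the held advice in seconds, or None if nothing is held."""
        if self.latest is None:
            return None
        return now - self.latest.query_timestamp
```

If a response to request 2 arrives before the response to request 1, the late response to request 1 is rejected by `offer` and the held advice is unchanged.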

Output validation. All VLM outputs are parsed and validated. Invalid outputs (non-integer index, out-of-range index, or unparsable formatting) are discarded and treated as missing.

Human-in-the-loop safety. All real-world runs include a trained safety operator with an immediate override capability (teleoperation or emergency stop). Any intervention immediately cancels the current VLM intent and returns control to a safe mode; the resulting takeover event is logged and used to compute the safety metrics reported in the main paper. Speed limits are always enforced, and the robot is operated only in pedestrian environments where a safe stop is feasible at all times.

Evaluation Protocol and Metrics

Environments and routes. Evaluation is conducted on outdoor pedestrian routes (e.g., sidewalks and campus pathways) containing natural obstacles such as pedestrians, curb cuts, surface boundaries (grass/planters), and intersections/forks. Each route is executed multiple times per method under similar conditions; all sensor streams, planner candidates, chosen indices, and operator interventions are logged with timestamps.

Runs and completion. A run begins at a fixed start pose and continues until the robot reaches the route endpoint (within a small tolerance). In our experimental protocol, every trial is completed regardless of the number of takeovers: when a takeover occurs, the safety operator manually guides the robot back to a safe pose and returns control to the autonomous policy. The run then continues from that point. This ensures that all metrics (takeover rate, trajectory smoothness, completion time) are comparable across methods.

Frequently Asked Questions

How does the system handle VLM latency without blocking robot control? The robot uses a two-rate architecture in which a fast local planner runs continuously in a receding-horizon loop, and the most recent VLM advice is incorporated asynchronously through latency-handling policies (direct execution, matching, or fusion), so control never blocks on the VLM response.

What fusion policies did the authors evaluate for combining stale VLM advice with fresh planner output? They evaluated Score Fusion and Probability Fusion, which bias the current planner's trajectory selection toward the stale VLM intention while still choosing among up-to-date candidate trajectories.

Does the VLM always outperform the learned planner in trajectory selection? No, the VLM excels in semantically challenging scenarios but the learned planner is often competitive in routine situations, which motivates the fusion approach rather than VLM-only control.

How are VLM outputs validated and made robust to formatting errors? The parser handles code-fence stripping, JSON extraction, bare integer fallback, and index validation with rank-based mapping, falling back to a safe behavior if parsing fails entirely.