Fast Human Attention Prediction Enables Real-Time Fixation-Guided Drone Navigation

Fatma Youssef Mohammed, Grzegorz Malczyk, Kostas Alexis

6 min read · 21 Jun 2026

Researchers at the Norwegian University of Science and Technology developed GazeLNN, a lightweight neural network that predicts where humans look in real time, then uses those predictions to guide a drone’s camera control. This work bridges human visual attention with autonomous flight, enabling drones to automatically focus on the same objects a human pilot would.

What the Researchers Built

The team created two tightly integrated components: GazeLNN, a fast bottom-up visual attention prediction network based on Legendre Memory Units (LMUs), and a reinforcement learning (RL) policy that uses GazeLNN’s real-time fixation heatmaps to actively control a drone’s camera gimbal during flight.

GazeLNN processes each video frame and outputs a fixation heatmap—a probability distribution of where a human would look next. This heatmap is then fed into the RL policy, which decides how to tilt and pan the camera so that the drone’s viewpoint mimics human gaze patterns. The entire pipeline runs onboard a small embedded computer (NVIDIA Jetson Orin NX) at frame rate, without any cloud dependency.
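To make the idea of a fixation heatmap as a probability distribution concrete, here is a minimal sketch in plain Python. The function names and the tiny 3×3 map are illustrative assumptions, not part of GazeLNN; the sketch only shows how a non-negative map can be normalized and summarized by its most likely pixel and its expected fixation point.

```python
def normalize_heatmap(heatmap):
    """Scale a non-negative 2D heatmap (list of rows) so its values sum to 1."""
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        raise ValueError("heatmap must contain positive mass")
    return [[v / total for v in row] for row in heatmap]


def peak_and_centroid(prob):
    """Return the most likely pixel (row, col) and the expected fixation (row, col)."""
    peak = max(
        ((r, c) for r, row in enumerate(prob) for c in range(len(row))),
        key=lambda rc: prob[rc[0]][rc[1]],
    )
    cy = sum(r * v for r, row in enumerate(prob) for v in row)
    cx = sum(c * v for row in prob for c, v in enumerate(row))
    return peak, (cy, cx)


# Hypothetical raw network output for a 3x3 image.
raw = [
    [0.0, 1.0, 0.0],
    [1.0, 4.0, 1.0],
    [0.0, 1.0, 0.0],
]
prob = normalize_heatmap(raw)
peak, centroid = peak_and_centroid(prob)
print(peak, centroid)
```

A downstream controller could consume either summary: the peak is sharper but jumps between frames, while the centroid moves more smoothly when several regions compete for attention.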

The system was trained entirely in simulation (Aerial Gym) using proxy heatmaps generated from obstacle meshes, then transferred zero-shot to real-world flights. No human gaze data was needed during RL training—only during supervised pretraining of GazeLNN itself.

Qualitative comparison of scanpath predictions: GazeLNN vs ground truth vs tSPM-Net

Key Results

GazeLNN achieves state-of-the-art performance in bottom-up fixation scanpath prediction, outperforming previous methods such as tSPM-Net and other LSTM-based models. In quantitative comparisons, GazeLNN’s predicted scanpaths more closely match human ground-truth gaze trajectories across standard metrics including Normalized Scanpath Saliency (NSS), Area Under Curve (AUC), and Scanpath Similarity (Sim).
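For readers unfamiliar with NSS, the sketch below follows the commonly used definition: standardize the predicted map to zero mean and unit variance, then average its values at the human fixation locations. It assumes the population standard deviation and (row, col) fixation coordinates; the paper's exact evaluation code may differ in such details.

```python
from statistics import fmean, pstdev


def nss(saliency, fixations):
    """Normalized Scanpath Saliency of a 2D map at the given (row, col) fixations."""
    values = [v for row in saliency for v in row]
    mu = fmean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("saliency map is constant; NSS is undefined")
    return fmean([(saliency[r][c] - mu) / sigma for r, c in fixations])


# Toy 2x2 prediction with all its mass in the bottom-right pixel.
saliency = [
    [0.0, 0.0],
    [0.0, 4.0],
]
on_target = nss(saliency, [(1, 1)])   # fixation lands on the predicted peak
off_target = nss(saliency, [(0, 0)])  # fixation lands on an empty region
print(on_target, off_target)
```

A positive score means fixations fall on above-average saliency; a score near zero means the prediction is no better than chance.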

Specifically, GazeLNN achieves a Sim score of 0.72 versus 0.66 for tSPM-Net, and an NSS of 2.41 compared to 2.15—improvements of 9% and 12%, respectively. The model runs at 45 FPS on a single NVIDIA Jetson Orin NX, enabling real-time operation on a flying drone.
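The percentages above are relative improvements over the tSPM-Net baseline. A quick check in Python, using only the figures quoted in this article, also gives the per-frame time budget implied by 45 FPS:

```python
def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


sim_gain = relative_improvement(0.72, 0.66)  # Sim: GazeLNN vs tSPM-Net
nss_gain = relative_improvement(2.41, 2.15)  # NSS: GazeLNN vs tSPM-Net
frame_budget_ms = 1000.0 / 45                # time available per frame at 45 FPS

print(f"Sim gain: {sim_gain:.1f}%")
print(f"NSS gain: {nss_gain:.1f}%")
print(f"Frame budget: {frame_budget_ms:.1f} ms")
```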

In real-world flight tests, the integrated system (GazeLNN + RL policy) successfully maintained human-like fixation behavior while navigating toward a goal and avoiding obstacles. The drone consistently pointed its camera at salient objects (e.g., trees, buildings, people) without explicit instruction—behavior that qualitatively matches human pilot attention.

How It Works

GazeLNN uses a lightweight encoder-decoder architecture built on Legendre Memory Units (LMUs), a recurrent cell designed to capture long-range dependencies with fewer parameters than LSTM or GRU. The encoder extracts features from each video frame; the decoder processes those features over time to produce a per-pixel fixation heatmap for the current frame.
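As a rough illustration of what an LMU memory does, the sketch below builds the continuous-time state matrices from the original LMU formulation (Voelker, Kajić, and Eliasmith, 2019) and steps them with forward Euler. The order, step size, and Euler discretization are assumptions chosen for brevity; the article does not say how GazeLNN configures or discretizes its cells, and a real LMU layer adds learned input and output weights around this memory.

```python
def lmu_matrices(order):
    """Continuous-time LMU matrices (A, B) for `order` Legendre coefficients."""
    A = [[0.0] * order for _ in range(order)]
    B = [0.0] * order
    for i in range(order):
        B[i] = (2 * i + 1) * (-1.0) ** i
        for j in range(order):
            sign = -1.0 if i < j else (-1.0) ** (i - j + 1)
            A[i][j] = (2 * i + 1) * sign
    return A, B


def lmu_step(m, u, A, B, dt_over_theta):
    """One forward-Euler update of theta * dm/dt = A m + B u."""
    order = len(m)
    return [
        m[i] + dt_over_theta * (sum(A[i][j] * m[j] for j in range(order)) + B[i] * u)
        for i in range(order)
    ]


A, B = lmu_matrices(4)
m = [0.0] * 4
# Feed a constant input of 1.0 for many window lengths.
for _ in range(3000):
    m = lmu_step(m, 1.0, A, B, dt_over_theta=0.01)
print(m)
```

With a constant input, the memory settles to a first coefficient of 1 and the rest near 0, which is the Legendre representation of a flat signal over the window. The matrices are fixed rather than learned, which is one reason the cell needs fewer trainable parameters than an LSTM of comparable memory length.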

Diagram of the reinforcement learning loop for active camera control

During RL training, the reward is computed from fixation heatmaps, but on the real drone those heatmaps come from GazeLNN, which is trained separately and offline on human gaze data. To bridge this gap, the authors generate proxy heatmaps in simulation: they sample face indices from the simulated obstacle meshes, randomly perturb the resulting points, and convolve them with a Gaussian kernel. This noisy but scene-grounded signal is used in place of true human gaze data during RL rollouts.
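The proxy-heatmap recipe (sample mesh faces, perturb, blur) can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes each mesh face has already been projected to a pixel coordinate, and the parameter names (`n_samples`, `jitter`, `sigma`) and their defaults are hypothetical. Adding one Gaussian bump per point is equivalent to convolving point impulses with a Gaussian kernel.

```python
import math
import random


def proxy_heatmap(face_pixels, height, width,
                  n_samples=5, jitter=1.0, sigma=2.0, seed=0):
    """Build a proxy attention map from projected mesh-face pixels (row, col)."""
    rng = random.Random(seed)
    heat = [[0.0] * width for _ in range(height)]
    for _ in range(n_samples):
        # Step 1: sample a face index from the obstacle mesh.
        py, px = face_pixels[rng.randrange(len(face_pixels))]
        # Step 2: randomly perturb the sampled point.
        py += rng.gauss(0.0, jitter)
        px += rng.gauss(0.0, jitter)
        # Step 3: spread the point with a Gaussian kernel.
        for r in range(height):
            for c in range(width):
                d2 = (r - py) ** 2 + (c - px) ** 2
                heat[r][c] += math.exp(-d2 / (2.0 * sigma ** 2))
    peak = max(max(row) for row in heat)
    return [[v / peak for v in row] for row in heat]  # scale to [0, 1]


# Hypothetical obstacle whose visible faces project near the image centre.
faces = [(4, 4), (4, 5), (5, 4)]
heat = proxy_heatmap(faces, height=10, width=10)
print(round(heat[4][4], 3), round(heat[0][0], 3))
```

Because the points are jittered before blurring, the policy never sees a perfectly clean target during training, which plausibly helps it tolerate the imperfect heatmaps GazeLNN produces on real footage.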

The RL policy takes as input the drone’s state (pose, velocity, goal direction) and the current GazeLNN heatmap. It outputs a continuous action: the desired camera pan and tilt angles. The reward function encourages the camera to point toward high-attention regions (per the heatmap) while simultaneously making progress toward the navigation goal and avoiding collisions.
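A reward with these three ingredients might look like the sketch below. The article does not give the actual formula, so everything here is an assumption made for illustration: the attention term rewards keeping the heatmap's centre of mass near the image centre, progress is the reduction in goal distance per step, and the weights and penalty are placeholder values.

```python
import math


def attention_term(heatmap):
    """1.0 when the heatmap's centre of mass sits at the image centre, 0.0 at a corner."""
    h, w = len(heatmap), len(heatmap[0])
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        return 0.0  # no attention signal in view
    cy = sum(r * v for r, row in enumerate(heatmap) for v in row) / total
    cx = sum(c * v for row in heatmap for c, v in enumerate(row)) / total
    center_y, center_x = (h - 1) / 2.0, (w - 1) / 2.0
    max_dist = math.hypot(center_y, center_x)
    if max_dist == 0:
        return 1.0
    return 1.0 - math.hypot(cy - center_y, cx - center_x) / max_dist


def reward(heatmap, prev_goal_dist, goal_dist, collided,
           w_attention=1.0, w_progress=1.0, collision_penalty=10.0):
    """Hypothetical per-step reward: attention + goal progress - collision penalty."""
    r = w_attention * attention_term(heatmap)
    r += w_progress * (prev_goal_dist - goal_dist)
    if collided:
        r -= collision_penalty
    return r


centered = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(reward(centered, prev_goal_dist=5.0, goal_dist=4.5, collided=False))
```

The relative weights determine the trade-off: a large attention weight would let the camera linger on salient objects at the cost of slower progress, while a large collision penalty dominates both.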

After RL training in simulation, the entire policy is deployed on a real drone with no fine-tuning. GazeLNN and the policy run on the Jetson Orin NX, communicating with the PX4 flight controller via ROS. The camera control loop operates at 30 Hz, which fits within the 45 FPS inference rate reported for GazeLNN.

Why This Matters for Robotics

Most autonomous navigation systems rely on geometric or semantic scene understanding (e.g., depth maps, object detections). This work introduces a fundamentally different approach: using computationally cheap predictions of human visual attention as a high-level guide for camera control. The result is a drone that naturally focuses on the same regions a human pilot would—without needing explicit object models or scene priors.

This has immediate implications for search-and-rescue, surveillance, cinematography, and inspection tasks where mimicking human gaze can improve situational awareness. It also suggests a new paradigm for human-robot collaboration: robots that share our visual priorities can be more predictable and trustworthy partners.

For warehouse operations, similar attention-guided perception could help warehouse robots focus on high-value areas like package labels or safety hazards. The lightweight architecture also makes it suitable for deployment on used industrial robots with limited onboard compute.

Limitations and Open Questions

GazeLNN was trained on a dataset of static images (likely SALICON or similar) and fine-tuned on video clips—but real-world human gaze depends heavily on task context. The current bottom-up model cannot capture top-down influences like “look for a red door.” The proxy heatmap strategy used in RL training introduces noise that may degrade policy quality in cluttered environments.

Additionally, the system assumes a single camera and no moving obstacles. Dynamic scenes with multiple moving agents could break the static-saliency assumption. Generalizing to diverse camera poses and lighting conditions remains an open challenge.

Frequently Asked Questions

What is GazeLNN? A lightweight neural network that predicts where a human would look in a video frame, running at 45 FPS on an embedded GPU.

Does the system need real human gaze data during training? Only for pretraining. GazeLNN is pretrained on human fixation datasets, but the RL policy learns from proxy heatmaps generated from obstacle meshes in simulation, so RL training itself uses no human gaze data.

What hardware does it run on? An NVIDIA Jetson Orin NX 16GB module onboard a drone, with a PX4 flight controller for low-level control.

Can this be used for ground robots or cars? Yes—the method is platform-agnostic. Any robot with a controllable camera and sufficient compute could benefit from attention-guided perception.

Conclusion

GazeLNN demonstrates that lightweight, biologically inspired attention models can be effectively deployed on resource-constrained robots for real-time gaze-guided navigation. By combining fast bottom-up prediction with reinforcement learning, the system enables drones to autonomously mimic human visual behavior—without expensive sensors or cloud processing. This work opens the door to more intuitive and efficient human-robot collaboration in the wild.
