Vision-Language Model Gives Warehouse Robots Context-Aware Semantic Maps

Marvin Rüdt, Hao Pang, Constantin Enke, Zäzilia Seibold, Kai Furmans

8 min read | 24 Jun 2026

Researchers have built a pipeline that lets autonomous mobile robots (AMRs) in warehouses understand not just what objects are in a scene, but whether they can be moved. By combining SLAM, Segment Anything (SAM), and a vision-language model, the system creates semantic maps that distinguish static shelves from movable pallets and mobile forklifts without any task-specific training.

What the Researchers Built

The team from Karlsruhe Institute of Technology developed a contextual semantic mapping pipeline for intralogistics environments that runs on a standard industrial robot sensor suite: two 2D laser scanners and a forward-facing RGB camera. The pipeline works in five stages. First, it builds a 2D geometric map using GMapping SLAM. Second, it runs SAM’s automatic mask generation on every camera frame to produce class-agnostic segmentation masks. Third, it projects those masks into the map coordinate frame and clusters overlapping instances across frames to create persistent object representations. Fourth, a vision-language model (VLM) reasons over the aggregated multi-view observations of each object cluster to infer its semantic class (e.g., “shelf”, “pallet”, “forklift”) and its movability — the critical property that determines whether the object is static infrastructure or a potentially dynamic obstacle. The VLM returns structured JSON with class, movability, and an explanation. Finally, a map fusion module attaches these semantic attributes to the geometric map points, producing a 6-dimensional point cloud (x, y, class, movability, and two auxiliary fields). The system works entirely zero-shot and open-vocabulary — no predefined object categories needed.
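The final fusion stage can be pictured with a minimal Python sketch. It assumes each geometric map point has already been assigned to an object cluster and that each cluster carries a VLM-inferred label. The article does not name the two auxiliary fields, so `instance_id` and `view_count` below are hypothetical placeholders, as are the function and type names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticPoint:
    """One fused map point: x, y, class, movability, plus two auxiliary fields."""
    x: float
    y: float
    semantic_class: str
    movability: str
    instance_id: int   # hypothetical auxiliary field, not named in the article
    view_count: int    # hypothetical auxiliary field, not named in the article


def fuse_map(points, point_to_cluster, cluster_labels):
    """Attach cluster-level labels to geometric map points.

    points: list of (x, y) map coordinates.
    point_to_cluster: dict mapping point index -> cluster id.
    cluster_labels: dict mapping cluster id -> (class, movability, view_count).
    Points without a cluster are labeled "unknown".
    """
    fused = []
    for index, (x, y) in enumerate(points):
        cluster_id = point_to_cluster.get(index, -1)
        semantic_class, movability, views = cluster_labels.get(
            cluster_id, ("unknown", "unknown", 0)
        )
        fused.append(
            SemanticPoint(x, y, semantic_class, movability, cluster_id, views)
        )
    return fused
```

Keeping the labels at cluster level and copying them onto points only at the end means a corrected cluster label changes every point of that object at once.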

Key Results

The pipeline was evaluated in a real intralogistics test environment against ground-truth semantic labels. The best-performing VLM configuration — Gemini 3.1 Flash Lite with direct JSON prompting — achieved a mean Intersection over Union (mIoU) of 98.93% for semantic segmentation and a panoptic quality (PQ) of 56.82%. Movability classification reached a balanced per-class accuracy (mAcc) of 84.86%. These numbers are striking because they come from a zero-shot setup — the model never saw the environment before.

The researchers also conducted a thorough component analysis. Removing multi-view reasoning (i.e., using single-frame observations) dropped mIoU by over 10 points and caused fragmented, inconsistent object labels across the map. The VLM reasoning step proved to be the primary bottleneck for movability estimation, while instance association errors were the main limitation for panoptic performance. A simple baseline that propagated labels from the nearest mask failed entirely, confirming that VLM reasoning is essential.

Example input images shown to the VLM: a panoramic scene segmentation mask with a highlighted object, and a cropped close-up of the same object.

Table: Performance of best VLM configuration on key metrics

Metric | Score
Semantic segmentation mIoU | 98.93%
Movability classification mAcc | 84.86%
Panoptic Quality (PQ) | 56.82%
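For readers unfamiliar with these metrics, the sketch below shows their standard textbook definitions on per-point labels. It is an illustration only: the article does not describe the authors' evaluation code, and the function names and the choice of which classes to average over are assumptions.

```python
def mean_iou(y_true, y_pred):
    """Mean per-class Intersection over Union over all classes that occur."""
    classes = sorted(set(y_true) | set(y_pred))
    ious = []
    for c in classes:
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        ious.append(inter / union)  # union > 0 because c occurs at least once
    return sum(ious) / len(ious)


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, so rare classes weigh as much as common ones."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = sum of IoUs of matched instances / (TP + 0.5*FP + 0.5*FN)."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom else 0.0
```

Because PQ penalizes every unmatched or spurious instance, it can stay low even when per-point class labels are almost all correct, which is consistent with the gap between the mIoU and PQ figures reported above.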

How It Works

The system’s key innovation is the way it integrates multi-view observations with VLM reasoning inside the mapping pipeline, rather than applying language models as a post-processing step. After building a 2D geometric map from laser scans via GMapping SLAM, the pipeline runs SAM on every RGB frame to produce fine-grained, class-agnostic masks. A point-to-pixel correspondence — established by temporal synchronization between the 2D laser scanners and the camera — allows each mask to be projected into the geometric map coordinate frame.
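One way to picture the point-to-pixel correspondence is a pinhole projection of each laser point into the synchronized camera frame, followed by a lookup of the SAM mask at that pixel. The sketch below assumes this model; the article does not give the calibration details, so the intrinsics (`fx`, `fy`, `cx`, `cy`), the function names, and the `mask_ids` grid are all hypothetical, and the points are assumed to be already transformed into the camera's optical frame (X right, Y down, Z forward).

```python
def project_to_pixel(point_cam, fx, fy, cx, cy):
    """Project a 3D point in the camera optical frame to integer pixel (u, v).

    Returns None for points at or behind the image plane.
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    return int(round(u)), int(round(v))


def label_points(points_cam, mask_ids, fx, fy, cx, cy):
    """Look up the mask id under each projected laser point.

    mask_ids: 2D list (rows of pixels) holding a mask index per pixel.
    Points that fall outside the image or behind the camera get None.
    """
    height, width = len(mask_ids), len(mask_ids[0])
    labels = []
    for point in points_cam:
        pixel = project_to_pixel(point, fx, fy, cx, cy)
        if pixel is None:
            labels.append(None)
            continue
        u, v = pixel
        if 0 <= u < width and 0 <= v < height:
            labels.append(mask_ids[v][u])  # row index is v, column index is u
        else:
            labels.append(None)
    return labels
```

Since each laser point also has a known position in the SLAM map, the mask id found here is what lets a 2D image mask be carried over into map coordinates.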

Instance clustering then groups projected masks across frames using pairwise Intersection-over-Union (IoU). Any two instances with IoU exceeding a threshold (set to 0.5 in the experiments) are considered observations of the same physical object. This clustering serves two purposes: it creates persistent object-level representations for the final map, and it aggregates all camera views of that object for the VLM reasoning step.
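The clustering rule can be sketched as follows, assuming each projected mask is represented as a set of map cell (or map point) indices. Merging matches transitively with a union-find structure is an assumption made for this illustration; the article only states the pairwise IoU test and the 0.5 threshold.

```python
def iou(a, b):
    """Intersection over Union of two sets of map cell indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cluster_instances(instances, threshold=0.5):
    """Group instances whose pairwise IoU exceeds the threshold.

    instances: list of sets, one per projected mask observation.
    Returns a sorted list of clusters, each a sorted list of instance indices.
    """
    parent = list(range(len(instances)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(instances)):
        for j in range(i + 1, len(instances)):
            if iou(instances[i], instances[j]) > threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(instances)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Each returned cluster is a list of observation indices, which is the set of camera views that would be handed to the VLM for that object. The pairwise loop is quadratic in the number of masks, which is acceptable for an offline pipeline but would need spatial indexing at larger scale.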

The VLM receives a composite input: a full-scene segmentation mask showing the object’s location with a bounding-box overlay, plus a cropped close-up of the object itself. The researchers found this composite format was critical — it provides spatial context while focusing the VLM’s attention on the target object, avoiding distraction from visually dominant elements (e.g., large shelves). The prompt includes an explicit movability ontology: immovable (attached to floor/structure), movable (can be relocated by robot but stays still when empty), and mobile (self-moving vehicles like forklifts). The VLM returns structured JSON with class, movability, and a short explanation for traceability. If confidence is low, it falls back to “unknown” for both fields.
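On the consuming side, a structured reply of this kind might be validated as sketched below. The key names `class`, `movability`, and `explanation` follow the fields named in the article, but the exact schema is an assumption, as is the choice to treat malformed replies or out-of-ontology values as "unknown" for both fields.

```python
import json

# Movability ontology described in the article, plus the fallback value.
MOVABILITY_VALUES = {"immovable", "movable", "mobile", "unknown"}


def unknown_result(explanation=""):
    """Fallback used when the reply cannot be trusted."""
    return {"class": "unknown", "movability": "unknown", "explanation": explanation}


def parse_vlm_reply(reply_text):
    """Parse a JSON reply string into a validated label dictionary."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return unknown_result("reply was not valid JSON")
    if not isinstance(data, dict):
        return unknown_result("reply was not a JSON object")

    semantic_class = str(data.get("class", "")).strip().lower()
    movability = str(data.get("movability", "")).strip().lower()
    if not semantic_class or movability not in MOVABILITY_VALUES:
        return unknown_result("missing class or movability outside the ontology")

    return {
        "class": semantic_class,
        "movability": movability,
        "explanation": str(data.get("explanation", "")),
    }
```

Rejecting values outside the ontology at parse time keeps free-form model output from leaking into the map, while the explanation string is preserved for traceability.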

The whole pipeline runs offline on pre-recorded data. The authors used Gemini 3.1 Flash Lite for their best results, but the architecture is model-agnostic.

Why This Matters for Robotics

For warehouses and fulfillment centers, the ability to distinguish static infrastructure from movable or mobile objects is the difference between a robot that gets stuck and one that adapts. A classic occupancy grid map tells the robot only that something is in the way; it does not record that the obstacle is a pallet that can be pushed aside or a forklift that will move on its own. This contextual semantic map enables higher-level operations: “transport the pallet from the transfer station to the shelf” requires knowing both what and where, plus whether the pallet is movable.

The zero-shot, open-vocabulary nature means these maps can be generated without creating a training dataset for every new warehouse layout. That lowers the barrier for deploying AMRs in facilities that constantly reconfigure their layouts. The system also supports natural language queries: a warehouse manager could ask “where are all the movable pallets?” and the robot can answer because the map encodes that attribute.
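Once a question such as "where are all the movable pallets?" has been turned into attribute filters, answering it reduces to a lookup over the map. The sketch below covers only that lookup, with map points represented as plain dictionaries; the translation from natural language to filters is assumed to happen elsewhere, and the function and key names are hypothetical.

```python
def query_map(points, semantic_class=None, movability=None):
    """Return map points matching the given class and/or movability filter.

    points: list of dicts with keys "x", "y", "class", and "movability".
    A filter left as None matches every point.
    """
    return [
        p for p in points
        if (semantic_class is None or p["class"] == semantic_class)
        and (movability is None or p["movability"] == movability)
    ]
```

For example, `query_map(points, semantic_class="pallet", movability="movable")` would return the coordinates of every map point labeled as a movable pallet.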

This technology applies directly to warehouse robots and industrial robots that need to operate safely alongside dynamic objects like forklifts and workers. For collaborative robots (cobots), a similar approach could enable them to avoid or interact with movable objects without reprogramming.

A visualization of the final contextual semantic map showing different object classes and movability statuses overlaid on the geometric map.

Limitations and Open Questions

The biggest limitation is that the pipeline currently runs offline on recorded data. For real-time operations, the system would need to incrementally update the map as new objects appear, move, or disappear — a challenge the authors acknowledge as future work. The evaluation was also performed in a single controlled test environment; generalizing to the full messiness of real industrial sites (dust, poor lighting, occlusions) remains an open question.

The VLM reasoning step is the primary bottleneck for movability estimation. While Gemini 3.1 Flash Lite performed well, the authors note that the model’s reasoning can be brittle — it sometimes confuses “movable” with “mobile” for objects like pallet trucks that share characteristics of both. The 56.82% panoptic quality indicates that instance association (grouping the same object across frames) is still a weak link. Finally, the system uses only 2D laser data; extending to 3D LiDAR would provide richer geometric context for more robust reasoning.

Frequently Asked Questions

What is a contextual semantic map? It’s a geometric map (e.g., occupancy grid) that attaches semantic attributes — object class, movability status — to each mapped point, enabling the robot to understand not just where objects are, but what they are and how they behave.

Which vision-language model did the researchers use? The best results came from Gemini 3.1 Flash Lite with a direct JSON prompting strategy. However, the pipeline is model-agnostic and could use other VLMs.

How does the system handle objects it has never seen before? It uses a zero-shot, open-vocabulary approach — the VLM can classify any object and infer its movability without needing a predefined list of categories or task-specific training data.

Can this system run in real time? Currently, it runs offline on pre-recorded data. Enabling online incremental map updates is flagged as future work.

Conclusion

By combining geometric SLAM, SAM segmentation, and vision-language model reasoning, researchers have built a pipeline that gives warehouse robots rich contextual understanding of their environment, distinguishing static fixtures from movable or mobile objects without any task-specific training data. The 98.93% semantic segmentation mIoU and zero-shot flexibility make this a promising step toward truly adaptive intralogistics automation, although the 56.82% panoptic quality and the offline-only operation show where work remains.
