OpenReLoc: Object-Level Camera Relocalization with Open-Vocabulary Understanding

Zhaopeng Cui, Jiarui Hu, Jingbo Liu, Boming Zhao, Xiyue Guo +5 others

6 min read · 24 Jun 2026

OpenReLoc is a new camera relocalization system that uses object-level representations and open-vocabulary understanding to estimate camera pose from a single RGB image. Unlike prior methods that rely on closed-vocabulary object matching, OpenReLoc can recognize and match any object—even never-before-seen categories—making it far more practical for real-world indoor environments.

What the Researchers Built

OpenReLoc is a complete indoor camera relocalization system that estimates the 6-DOF camera pose of a query RGB image using a pre-built map of object landmarks. The map is constructed from posed RGB-D images of the scene and stores, for each object, its semantic label, shape, neighbour relationships, and a natural language description generated by a large language model (LLM). When a new query image arrives, OpenReLoc detects objects, matches them against the map using open-vocabulary embeddings (CLIP), and then refines the pose through a coarse-to-fine optimisation pipeline. Two components are highlighted as key innovations: a dual-path 2D ICP loss that combines geometric alignment with semantic supervision, and a scene graph analysis step that resolves ambiguous matches caused by repeated or similar objects. The researchers present OpenReLoc as the first object-level relocalization system that scales to real-world scenes without being limited to a fixed vocabulary of objects.

Example of open-vocabulary object matching results across different scenes

Key Results

The researchers evaluated OpenReLoc on the challenging ScanNet and ScanNet++ datasets, which contain diverse, real-world indoor scenes with long-tail object distributions. Compared to the prior state‑of‑the‑art GoReloc, OpenReLoc achieved a dramatically higher success rate—GoReloc frequently failed to identify valid matching objects because the scene contained objects outside its closed vocabulary. OpenReLoc’s open‑vocabulary matching succeeded across all scenes. In terms of accuracy, even when GoReloc did find matches, it suffered from drift due to the lack of a dedicated optimisation loss, whereas OpenReLoc’s dual‑path ICP loss delivered stable, precise poses.

Ablation studies confirmed the importance of each component:

- Removing either the coarse or fine stage degraded performance, proving the coarse-to-fine mechanism essential.
- Without scene graph analysis, the system confused repeated objects (e.g., multiple chairs).
- Dropping the LLM-generated language descriptions hurt robustness under occlusion or visual noise.
- The DIOU-based retrieval for pose priors outperformed naive visibility-based strategies.
- Filtering out invalid objects (walls, floors) improved landmark association and scene graph quality.

Scene graph visualization showing object relationships and matched landmarks

How It Works

OpenReLoc operates in two stages: a coarse stage that retrieves a rough pose hypothesis, and a fine stage that refines it precisely.

Map Building (Offline): From posed RGB-D images, objects are detected, segmented, and assigned a semantic label. For each object, its 3D point cloud, bounding box, and relationships to neighbouring objects are stored. A pretrained LLM (queried via API) generates a natural language description of each object (e.g., “a red office chair with armrests”). These descriptions are encoded into a shared open‑vocabulary embedding space using CLIP.
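The map described above can be pictured as a collection of per-object records. The following is a minimal sketch only: the class name `ObjectLandmark`, its field names, the `INVALID_LABELS` set, and the distance-based neighbour rule with a `radius` parameter are all assumptions made for illustration, not OpenReLoc's actual data structures. Detection, segmentation, the LLM call, and CLIP encoding are assumed to have already filled in the fields.

```python
import math
from dataclasses import dataclass, field

# Hypothetical list of structural labels to drop; the article names
# walls, floors and ceilings as examples of invalid objects.
INVALID_LABELS = {"wall", "floor", "ceiling"}


@dataclass
class ObjectLandmark:
    """One object in the map. Field names are illustrative assumptions."""
    object_id: int
    label: str
    description: str   # LLM-generated text, e.g. "a red office chair with armrests"
    centroid: tuple    # (x, y, z) in world coordinates
    bbox_min: tuple    # axis-aligned 3D bounding box corners
    bbox_max: tuple
    points: list = field(default_factory=list)      # 3D point cloud
    embedding: list = field(default_factory=list)   # embedding of the description
    neighbours: set = field(default_factory=set)    # ids of nearby landmarks


def link_neighbours(landmarks, radius):
    """Connect landmarks whose centroids lie within `radius` (assumed rule)."""
    for a in landmarks:
        for b in landmarks:
            if a.object_id < b.object_id and math.dist(a.centroid, b.centroid) <= radius:
                a.neighbours.add(b.object_id)
                b.neighbours.add(a.object_id)


def build_map(raw_objects, radius=2.0):
    """Drop invalid objects first, then link the remaining landmarks."""
    keep = [o for o in raw_objects if o.label not in INVALID_LABELS]
    link_neighbours(keep, radius)
    return {o.object_id: o for o in keep}
```

Filtering before linking means a wall never appears in any landmark's neighbour set, which matches the article's point that such objects connect to too many others.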

Coarse Stage (Query): The query RGB image undergoes object detection. Each detected object is encoded into the same CLIP space and matched to the most similar object in the map. To produce a pose prior, the system uses a DIOU (Distance‑Intersection over Union) retrieval method that considers both 2D bounding box overlap and 3D distance between matched object pairs. This yields a reliable initial camera pose.
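The two ingredients of the coarse stage can be sketched in plain Python. The first function pair does nearest-neighbour matching by cosine similarity, which is a common way to compare CLIP embeddings; the embeddings themselves are assumed to be precomputed lists of floats. The `diou` function implements the standard 2D Distance-IoU formula (IoU minus the squared centre distance divided by the squared diagonal of the smallest enclosing box). The article does not specify how OpenReLoc combines 2D overlap with 3D distance, so this is the textbook 2D form rather than the authors' variant, and all function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match_detections(query_embs, map_embs):
    """For each query embedding, return (best map id, similarity).

    `map_embs` maps a landmark id to its embedding.
    """
    matches = []
    for q in query_embs:
        best = max(
            ((mid, cosine_similarity(q, m)) for mid, m in map_embs.items()),
            key=lambda pair: pair[1],
        )
        matches.append(best)
    return matches


def diou(box_a, box_b):
    """Standard 2D Distance-IoU for boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection and union areas
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distance between box centres
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2
    return iou - d2 / c2 if c2 > 0 else iou
```

Unlike plain IoU, DIoU still distinguishes between non-overlapping boxes: the further apart their centres, the lower the score, which is useful when ranking candidate poses whose projected boxes do not yet overlap the detections.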

Fine Stage (Refinement): A dual‑path 2D ICP loss is minimised. Path 1 aligns the 2D projections of map object centroids to detected object centres using a chamfer distance. Path 2 adds a semantic consistency term—projected map points that fall inside a query detection should have the same object label. Many candidate matches exist; a scene graph analysis filters out geometrically inconsistent ones by checking the neighbour relationships between candidate pairs. Invalid objects (walls, ceilings, floors) are pre‑filtered because they connect to too many objects and distort the graph.
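Path 1 of the loss can be illustrated with a small sketch: project the 3D centroids of map objects into the image under a candidate pose, then measure a symmetric chamfer distance to the detected object centres. This assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)` and a world-to-camera pose `(R, t)`; the function names are hypothetical, and the semantic term of Path 2 and the optimiser itself are not shown. The exact form of OpenReLoc's loss is not given in the article, so this is only one plausible reading.

```python
import math


def project(point_w, R, t, fx, fy, cx, cy):
    """Pinhole projection of a world point.

    R is a 3x3 world-to-camera rotation (nested lists), t a 3-vector.
    Returns None for points behind the camera.
    """
    xc = [sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        return None
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)


def chamfer_2d(points_a, points_b):
    """Symmetric chamfer distance: mean nearest-neighbour distance, both ways."""
    if not points_a or not points_b:
        return float("inf")
    a_to_b = sum(min(math.dist(p, q) for q in points_b) for p in points_a) / len(points_a)
    b_to_a = sum(min(math.dist(q, p) for p in points_a) for q in points_b) / len(points_b)
    return a_to_b + b_to_a


def centroid_alignment_loss(centroids_w, detected_centres, R, t, intrinsics):
    """Chamfer distance between projected map centroids and detected centres."""
    projected = [
        p for p in (project(c, R, t, *intrinsics) for c in centroids_w)
        if p is not None
    ]
    return chamfer_2d(projected, detected_centres)
```

A pose that projects every centroid exactly onto its detection gives a loss of zero; a small translation error shifts every projection and raises the loss, which is the signal a non-linear least squares solver would reduce.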

The final pose is obtained by non-linear least squares optimisation. The query pipeline runs in real time on a standard GPU; the current dependence on a closed-source LLM API adds latency to offline map building, where each object description requires an API call.

Why This Matters for Robotics

Reliable camera relocalization is a foundational capability for any mobile robot operating indoors, from autonomous warehouse pallet movers to service robots in hospitals. Traditional methods either require visual features that break under lighting changes or rely on a pre-defined set of object categories that cannot handle novel objects. OpenReLoc targets both problems: it matches at the object level rather than on low-level visual features, it is not tied to a fixed object vocabulary, and it uses LLM-generated descriptions to cope with occlusion.

For warehouse robots, this means a robot that has mapped an aisle once can relocalise itself even when the scene contains new boxes, misplaced pallets, or different equipment. The open‑vocabulary aspect is especially powerful in dynamic environments where object inventories change frequently. For used industrial robots being redeployed into new workspaces, a system like OpenReLoc could drastically reduce setup time by eliminating the need to manually label objects.

Limitations and Open Questions

The main limitation is handling extreme object repetition. In a room with hundreds of identical chairs, the scene graph and object descriptions become indistinguishable, leading to matching ambiguity. The researchers note this is an open challenge. Another practical issue is latency: the current system depends on a closed‑source LLM for generating object descriptions. Each description requires an API call, making the offline map building slow. The authors plan to replace the remote LLM with a local model in future work. Additionally, OpenReLoc currently requires posed RGB-D input for mapping; relaxing this to monocular video would be a natural next step.
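The repetition problem can be made concrete with a toy neighbour check. The sketch below scores each candidate map object by the Jaccard overlap between the labels seen around a query detection and the labels of the candidate's map neighbours. This is a deliberately simplified stand-in using labels only; it is not the paper's scene graph analysis, and the function names and example data are invented for illustration.

```python
def neighbour_label_overlap(query_neighbour_labels, cand_id, map_labels, map_neighbours):
    """Jaccard overlap between query-side and map-side neighbour labels."""
    cand = {map_labels[n] for n in map_neighbours[cand_id]}
    query = set(query_neighbour_labels)
    union = query | cand
    return len(query & cand) / len(union) if union else 1.0


def best_candidates(query_neighbour_labels, candidate_ids, map_labels, map_neighbours):
    """Return every candidate that ties for the highest overlap score."""
    scored = [
        (neighbour_label_overlap(query_neighbour_labels, cid, map_labels, map_neighbours), cid)
        for cid in candidate_ids
    ]
    top = max(score for score, _ in scored)
    return [cid for score, cid in scored if score == top]


# Office: two chairs with different surroundings, so context disambiguates.
office_labels = {0: "chair", 1: "desk", 2: "monitor", 3: "chair", 4: "sofa"}
office_neighbours = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
print(best_candidates(["desk", "monitor"], [0, 3], office_labels, office_neighbours))
# prints [0]

# Lecture hall: five identical chairs that only neighbour other chairs.
hall_labels = {i: "chair" for i in range(5)}
hall_neighbours = {i: {j for j in range(5) if j != i} for i in range(5)}
print(best_candidates(["chair"], list(range(5)), hall_labels, hall_neighbours))
# prints [0, 1, 2, 3, 4]
```

In the office case the neighbourhood singles out one chair. In the lecture hall every candidate scores the same, so neighbour context adds no information and the match stays ambiguous.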

Frequently Asked Questions

What exactly does OpenReLoc do?
It estimates the 6-DOF camera pose of an RGB image by matching detected objects to a pre-built map, using language descriptions to recognise objects that were never seen during training.

How is it different from older methods like GoReloc?
OpenReLoc uses open-vocabulary matching (via CLIP and LLM descriptions) so it can handle any object, not just a fixed list. It also includes a dedicated ICP loss and scene graph analysis for better accuracy and robustness.

What kind of hardware does OpenReLoc require?
A standard RGB or RGB-D camera for the query image, and a GPU for running the neural networks. The offline mapping step uses posed RGB-D images, which can come from any SLAM pipeline.

Why is open-vocabulary understanding important for relocalization?
Indoor scenes contain countless object types (tools, packaging, personal items) that no closed vocabulary can cover. Open-vocabulary matching lets the system recognise and match these objects, making relocalization possible in real-world environments where objects change frequently.

Conclusion

OpenReLoc demonstrates that object-level camera relocalization can achieve practical, scalable performance by combining open‑vocabulary language understanding with a carefully designed coarse‑to‑fine optimisation pipeline. It overcomes the closed‑vocabulary limitations of prior work and handles real‑world scene diversity. The main open challenges—handling extreme repetition and reducing LLM latency—are clear targets for future work.
