Where Am I? Semantic Map Grounding via Vision-Language Models for Multi-Modal Localization

Aarav Shah; Madhu Vadali; Suraj Borate

arxiv: 2607.01079 · v1 · pith:CLPOLG3Snew · submitted 2026-07-01 · 💻 cs.RO

Pith reviewed 2026-07-02 11:12 UTC · model grok-4.3

keywords robot localization · vision-language models · semantic maps · multi-modal sensing · pose estimation · indoor navigation · cross-modal fusion

The pith

A vision-language model predicts continuous robot pose from a camera image, LiDAR scan and semantic grid map.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes indoor robot localization as a semantic reasoning task that a vision-language model can solve from three inputs at once. It shows that after fine-tuning, the model outputs position and orientation values directly while maintaining high accuracy even when objects or map sections are unfamiliar. A reader would care because the approach supplies a fallback when one sensor fails and adapts to partial maps without requiring perfect geometric alignment. The evidence centers on sustained performance across test scenes and ablations that isolate each modality's contribution.

Core claim

The authors establish that a vision-language model with an added regression head can infer robot pose (x, y, theta) from a front camera image, a polar LiDAR scan and a top-down semantic grid map. Training on a large simulated dataset yields 98.23 percent position accuracy, 98.00 percent direction accuracy and 96.75 percent full pose accuracy on in-distribution tests, with mean errors of 0.11 m and 5.7 degrees. Position accuracy on seven unseen object categories remains 90.99 percent, and fine-tuning restores 93.72 percent position accuracy on incomplete maps. Ablations demonstrate that LiDAR sustains 92.33 percent position accuracy in views containing no visible objects, while the camera-plus-map combination alone reaches 95.06 percent position accuracy.

What carries the argument

A vision-language model with a lightweight regression head attached to its final hidden state that directly outputs continuous pose coordinates, trained with a composite position-and-direction loss on multi-modal inputs.
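
The text available here names a regression head and a composite position-and-direction loss but gives no formulas. The following is a minimal sketch of one plausible form, not the authors' implementation: the single linear layer, the squared-distance position term, the `1 - cos` direction term, and the weights `w_pos` and `w_dir` are all assumptions introduced for illustration.

```python
import math


def linear_pose_head(hidden, weights, bias):
    """Map a hidden-state vector to (x, y, theta) with one linear layer.

    Hypothetical head: `weights` is a 3 x len(hidden) matrix and `bias`
    has 3 entries. The paper only says the head is lightweight.
    """
    return tuple(
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    )


def composite_pose_loss(pred, target, w_pos=1.0, w_dir=1.0):
    """Weighted sum of a position term and a direction term (assumed form)."""
    px, py, ptheta = pred
    tx, ty, ttheta = target
    # Position term: squared Euclidean distance in the map frame.
    pos_term = (px - tx) ** 2 + (py - ty) ** 2
    # Direction term: 0 for aligned headings, 2 for opposite headings,
    # and unaffected by 2*pi wrap-around of theta (radians).
    dir_term = 1.0 - math.cos(ptheta - ttheta)
    return w_pos * pos_term + w_dir * dir_term
```

A cosine-based direction term is used in this sketch because a plain squared difference on theta would penalize headings of 359 and 1 degrees as if they were far apart.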

If this is right

Position accuracy remains 95.06 percent when LiDAR input is removed, showing that camera and map inputs alone suffice for most scenes.
LiDAR input alone sustains 92.33 percent position accuracy in camera views that contain no visible objects.
Fine-tuning recovers 93.72 percent position accuracy when the supplied semantic map is incomplete or stale.
Accuracy on completely unseen object categories drops by only 7.2 percentage points, indicating the model reasons about spatial relations rather than memorizing appearances.
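
The percentage-point gaps quoted in this list can be re-derived from the reported position accuracies. The short check below uses only figures stated on this page; the dictionary keys and helper names are hypothetical labels chosen for the example.

```python
# Position accuracies (percent) as reported for the paper.
ACCURACY = {
    "full_system": 98.23,
    "unseen_categories": 90.99,
    "no_lidar": 95.06,
}


def gap_points(reference, condition):
    """Percentage-point drop from `reference` to `condition`."""
    return round(ACCURACY[reference] - ACCURACY[condition], 2)


def error_rate(name):
    """Complement of accuracy, in percent."""
    return round(100.0 - ACCURACY[name], 2)


unseen_gap = gap_points("full_system", "unseen_categories")
no_lidar_gap = gap_points("full_system", "no_lidar")
error_ratio = error_rate("unseen_categories") / error_rate("full_system")

print(f"unseen-category gap: {unseen_gap} points")
print(f"no-LiDAR gap: {no_lidar_gap} points")
print(f"error-rate ratio (unseen / in-distribution): {error_ratio:.2f}")
```

The gaps come out at 7.24 and 3.17 points, matching the rounded 7.2 and 3.2 in the text. Read as error rates, the same figures are 1.77 percent in-distribution and 9.01 percent on unseen categories, roughly a fivefold increase, which is a useful second view alongside the point-drop framing.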

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same architecture could support localization in environments whose maps change over time without requiring full retraining each time a new object appears.
Pairing the semantic predictor with a conventional geometric filter might reduce residual error in cases where the model outputs are uncertain.
Running controlled trials that gradually increase real-world sensor noise and lighting variation would quantify how much of the reported accuracy depends on simulation fidelity.

Load-bearing premise

The custom simulation environment produces sensor data and map variations representative enough of real indoor conditions that the learned semantic reasoning transfers outside the training distribution.

What would settle it

Deploying the trained model on a physical robot in a real indoor space containing novel objects and measuring whether full pose accuracy stays above 85 percent on held-out scenes would directly test the transfer claim.
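
A check of this kind could be scored with a small evaluation routine such as the sketch below. The tolerances `pos_tol_m` and `dir_tol_deg` are hypothetical placeholders, since the thresholds the paper uses to count a prediction as correct are not given here; only the 85 percent bar comes from the paragraph above.

```python
import math


def angle_error_deg(pred_theta, true_theta):
    """Absolute heading error in degrees, wrapped to [0, 180]."""
    d = (pred_theta - true_theta + math.pi) % (2 * math.pi) - math.pi
    return abs(math.degrees(d))


def evaluate(preds, targets, pos_tol_m=0.5, dir_tol_deg=15.0):
    """Score (x, y, theta) predictions; tolerances are assumed values."""
    n = len(preds)
    pos_errs = [math.hypot(p[0] - t[0], p[1] - t[1]) for p, t in zip(preds, targets)]
    dir_errs = [angle_error_deg(p[2], t[2]) for p, t in zip(preds, targets)]
    pos_ok = [e <= pos_tol_m for e in pos_errs]
    dir_ok = [e <= dir_tol_deg for e in dir_errs]
    return {
        "position_accuracy": 100.0 * sum(pos_ok) / n,
        "direction_accuracy": 100.0 * sum(dir_ok) / n,
        # A sample counts toward full pose accuracy only if both checks pass.
        "full_pose_accuracy": 100.0 * sum(a and b for a, b in zip(pos_ok, dir_ok)) / n,
        "mean_position_error_m": sum(pos_errs) / n,
        "mean_orientation_error_deg": sum(dir_errs) / n,
    }


def passes_transfer_bar(metrics, bar_pct=85.0):
    """True if full pose accuracy stays above the bar named in the text."""
    return metrics["full_pose_accuracy"] > bar_pct
```

For example, `passes_transfer_bar(evaluate(preds, targets))` would be run on poses collected from the physical robot, with ground truth from an external tracking source.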

Figures

Figures reproduced from arXiv: 2607.01079 by Aarav Shah, Madhu Vadali, Suraj Borate.

**Figure 2.** Example multi-modal input (scene 1357, cell C4, …

**Figure 3.** (a) Loss converges cleanly; validation closely tracks …

**Figure 4.** Testing across environment change. Scene appearance variation is achieved using Google NanoBanana, altering …

Original abstract

We address robot localization in GPS-denied indoor environments by reframing it as a semantic reasoning task rather than a geometric estimation problem. Motivated by how humans localize using object-level cues and labeled maps, we ask whether a vision-language model, given a front camera image, a polar LiDAR scan, and a top-down semantic grid map, can infer the robot pose. We fine-tune Qwen2.5-VL-7B with LoRA and attach a lightweight regression head that predicts continuous pose coordinates (x, y, theta) directly from the final hidden state, bypassing text generation. Training uses a composite position-and-direction loss with curriculum learning on a custom Gazebo dataset of 120,112 samples and 527 scenes. On the in-distribution test set of 18,017 samples, the model achieves 98.23 percent position accuracy, 98.00 percent direction accuracy, 96.75 percent full pose accuracy, a mean position error of 0.11 m, and a mean orientation error of 5.7 degrees at 0.62 s per sample. Position accuracy drops by only 7.2 percentage points on seven unseen object categories, reaching 90.99 percent, supporting semantic spatial reasoning rather than appearance memorization. With incomplete maps, fine-tuning recovers performance to 93.72 percent position accuracy, showing adaptability to stale or partial map information. Two ablations highlight cross-modal complementarity. Without LiDAR, using only camera and map inputs, position accuracy remains 95.06 percent, only 3.2 percentage points below the full system. However, when the camera sees no visible objects in a wall-facing view, LiDAR sustains 92.33 percent position accuracy, compared with 70.74 percent when neither LiDAR nor visible objects are available. This shows that LiDAR becomes the primary localization signal when camera semantics are unavailable and provides a reliable fallback under occlusion or sparse layouts.
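
A few derived quantities help put the abstract's dataset and timing figures in context. The sketch below uses only numbers from the abstract. It assumes that the 120,112 figure covers the whole dataset and that inference runs one sample at a time, neither of which the abstract states.

```python
# Figures quoted in the abstract.
TOTAL_SAMPLES = 120_112
NUM_SCENES = 527
TEST_SAMPLES = 18_017
SECONDS_PER_SAMPLE = 0.62

# Share of the dataset used for the in-distribution test set (assumes
# the test set is drawn from the 120,112 samples).
test_share_pct = 100 * TEST_SAMPLES / TOTAL_SAMPLES

# Average number of samples per scene.
samples_per_scene = TOTAL_SAMPLES / NUM_SCENES

# Pose update rate implied by the per-sample latency.
poses_per_second = 1 / SECONDS_PER_SAMPLE

# Wall-clock time for one sequential pass over the test set.
test_pass_hours = TEST_SAMPLES * SECONDS_PER_SAMPLE / 3600

print(f"test share: {test_share_pct:.2f} %")
print(f"samples per scene: {samples_per_scene:.1f}")
print(f"update rate: {poses_per_second:.2f} poses/s")
print(f"test pass: {test_pass_hours:.2f} h")
```

Under those assumptions the test set is almost exactly 15 percent of the data, and the reported latency corresponds to about 1.6 pose estimates per second.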

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows a VLM can regress pose from camera, LiDAR and semantic map in Gazebo with solid numbers and some generalization, but everything stays in simulation.

The core result is that fine-tuning Qwen2.5-VL-7B with LoRA and a regression head produces direct (x, y, theta) predictions from the three inputs. On their 18k held-out Gazebo samples it reaches 96.75 percent full pose accuracy, mean errors of 0.11 m and 5.7 degrees, and only a 7.2-point drop to 90.99 percent on seven unseen object categories.

The direct regression head and the curriculum training are the concrete additions. The ablations are straightforward and useful: removing LiDAR drops performance modestly, while LiDAR alone keeps 92.33 percent when the camera sees no objects. The incomplete-map recovery to 93.72 percent is also a practical check.

Those numbers and the explicit unseen-category test are the parts done cleanly. The dataset size and the split are large enough to make the reported figures credible inside the simulator.

The obvious limitation is that every result comes from the same custom Gazebo environment with perfect semantic grids and controlled noise. No real-robot data, no lighting variation, no map staleness from actual deployment, and no comparison against standard methods such as AMCL or visual-inertial odometry. The interpretation that the small accuracy drop proves semantic reasoning rather than simulation-specific fitting therefore rests on an untested transfer assumption.

This is for robotics researchers already working on multi-modal indoor localization who want a worked example of a VLM regression pipeline. A reader can pull the training setup and run the same ablations.

Send it to peer review. The experimental structure is clear and the numbers are reported in enough detail to be checked, even if the sim-to-real step will need substantial additional work.

Referee Report

2 major / 2 minor

Summary. The paper reframes indoor robot localization as a semantic reasoning task using a fine-tuned Qwen2.5-VL-7B VLM with LoRA and a regression head to predict continuous pose (x, y, theta) from camera image, polar LiDAR, and top-down semantic grid map inputs. On a custom Gazebo dataset (120k training samples, 18k test), it reports 98.23% position accuracy, 96.75% full pose accuracy, 0.11 m mean position error, and 5.7° orientation error in-distribution; accuracy drops only 7.2 points to 90.99% on seven unseen object categories, with ablations showing LiDAR fallback value and recovery to 93.72% on incomplete maps.

Significance. If the results transfer, the work demonstrates that VLMs can ground semantic maps for localization via object-level cues, with explicit evidence of generalization beyond appearance memorization via the unseen-category test and cross-modal ablations. Credit is due for the large held-out test set, concrete generalization numbers on novel categories, and modality ablations with quantitative results. Significance is limited by the simulation-only evaluation.

major comments (2)

[Abstract / Results (generalization test)] Abstract and results on generalization: the central interpretive claim—that the 7.2-point accuracy drop on seven unseen object categories demonstrates semantic spatial reasoning rather than memorization—depends on the custom Gazebo simulator (perfect semantic grids, controlled noise) being representative of real indoor conditions; no real-robot experiments or real-sensor validation are reported, making transferability load-bearing for this claim.
[Dataset and Experiments sections] Evaluation setup: all metrics (including 90.99% on unseen categories and 93.72% on incomplete maps) are obtained exclusively inside the Gazebo simulation with idealized maps; this is load-bearing because the paper's reframing as semantic reasoning (vs. geometric estimation) and the cross-modal complementarity conclusions rest on the assumption that simulation artifacts do not drive the reported performance.

minor comments (2)

[Abstract] The abstract states inference at 0.62 s per sample but provides no hardware specification or batch-size details for this timing.
[Methods] Notation for the composite loss and regression head attachment could be clarified with an equation or diagram in the methods.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive review and for recognizing the scale of the held-out test set, the unseen-category generalization results, and the modality ablations. We respond point-by-point to the major comments below.

Referee: [Abstract / Results (generalization test)] Abstract and results on generalization: the central interpretive claim—that the 7.2-point accuracy drop on seven unseen object categories demonstrates semantic spatial reasoning rather than memorization—depends on the custom Gazebo simulator (perfect semantic grids, controlled noise) being representative of real indoor conditions; no real-robot experiments or real-sensor validation are reported, making transferability load-bearing for this claim.

Authors: We agree that the interpretive claim of semantic spatial reasoning would be strengthened by real-robot validation. The simulation environment was deliberately chosen to enable controlled, large-scale tests of generalization to unseen categories and incomplete maps that are difficult to replicate precisely in the real world. We will revise the abstract and results sections to explicitly qualify all reported generalization numbers as holding within the Gazebo simulator and will add a dedicated limitations paragraph discussing the sim-to-real gap together with the need for future physical validation. These changes will prevent overstatement while preserving the concrete evidence the experiments provide inside the evaluated domain.
Revision: yes
Referee: [Dataset and Experiments sections] Evaluation setup: all metrics (including 90.99% on unseen categories and 93.72% on incomplete maps) are obtained exclusively inside the Gazebo simulation with idealized maps; this is load-bearing because the paper's reframing as semantic reasoning (vs. geometric estimation) and the cross-modal complementarity conclusions rest on the assumption that simulation artifacts do not drive the reported performance.

Authors: All quantitative results are indeed obtained inside the Gazebo simulator, as already stated in the manuscript. To address the concern that simulation artifacts may drive performance, we will expand the Dataset and Experiments sections with further details on the sensor noise models, semantic map generation procedure, and scene randomization. We will also add new ablation experiments that systematically increase simulated sensor noise and map incompleteness to quantify robustness. These revisions will provide additional evidence that the observed semantic-reasoning and cross-modal effects are not artifacts of idealized conditions.
Revision: yes

standing simulated objections not resolved

Real-robot or real-sensor validation experiments, which would require physical hardware, new data collection, and calibration effort beyond the scope and timeline of the current work.

Circularity Check

0 steps flagged

No circularity: empirical train/test evaluation on disjoint splits with no equations or self-citations reducing results to inputs by construction.

The paper reports standard supervised fine-tuning of Qwen2.5-VL-7B on a custom Gazebo dataset (120k samples) followed by evaluation on held-out test splits (in-distribution and unseen categories). All accuracy figures (98.23% position, 90.99% on unseen, etc.) are direct empirical measurements from this process. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would make any reported number equivalent to its inputs by definition. The interpretive claim that the 7.2-point drop demonstrates semantic reasoning is an external inference from the numbers, not a mathematical reduction. The simulation-to-real transfer assumption is a validity concern, not circularity.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The claim rests on the untested transfer from Gazebo simulation to reality and on the assumption that the semantic labels in the grid map are perfectly registered with the robot's camera and LiDAR views; no new physical entities are introduced.

free parameters (2)

composite loss weights
Position and direction loss terms are combined; their relative weighting is chosen during training and affects the reported accuracy numbers.
LoRA rank and alpha
Standard but unspecified hyperparameters that control how much the base VLM is adapted.
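
For readers unfamiliar with the two LoRA hyperparameters, the sketch below shows how rank and alpha enter the adapted weight, following the standard LoRA formulation W = W0 + (alpha / r) · B · A. It is a pure-Python illustration on tiny matrices. The function names are hypothetical, and the values used by the paper are not stated here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w0, a, b, alpha):
    """Return W0 + (alpha / r) * B @ A.

    `w0` is d_out x d_in (frozen), `a` is r x d_in, `b` is d_out x r.
    The rank r is read from the number of rows of `a`.
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w0[i][j] + scale * delta[i][j] for j in range(len(w0[0]))]
        for i in range(len(w0))
    ]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters added by one LoRA pair (A and B)."""
    return r * (d_in + d_out)
```

A larger rank adds trainable capacity, while alpha rescales the update relative to the frozen weight. With B initialized to zero, as in the original LoRA recipe, the adapted model starts out identical to the base model.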

axioms (2)

Domain assumption: Gazebo-generated sensor data and semantic maps are statistically close enough to real indoor environments for the learned mapping to generalize.
Invoked when claiming that 90.99 percent accuracy on unseen objects demonstrates semantic reasoning transferable beyond the training distribution.
Domain assumption: The regression head attached to the final hidden state can produce continuous pose values without the discretization or tokenization artifacts that would arise from text generation.
Stated when the authors bypass text generation in favor of direct numeric output.

Reference graph

Works this paper leans on

15 extracted references · 1 canonical work pages · 1 internal anchor

[1] F. Dellaert, D. Fox, W. Burgard, and S. Thrun, “Monte Carlo Localization for Mobile Robots,” Proc. IEEE ICRA, 1999.
[2] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. MIT Press, 2005.
[3] R. Arandjelović et al., “NetVLAD: CNN Architecture for Weakly Supervised Place Recognition,” Proc. IEEE CVPR, 2016.
[4] P.-E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk, “From Coarse to Fine: Robust Hierarchical Localization at Large Scale,” Proc. IEEE CVPR, 2019.
[5] N. Atanasov, M. Zhu, K. Daniilidis, and G. J. Pappas, “Localization from Semantic Observations via the Matrix Permanent,” Int. J. Robotics Research, vol. 35, no. 1-3, 2016.
[6] N. Hughes, Y. Chang, and L. Carlone, “Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimization,” Proc. RSS, 2022.
[7] OpenAI, “GPT-4 Technical Report,” arXiv:2303.08774, 2023.
[8] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual Instruction Tuning,” Proc. NeurIPS, 2024.
[9] D. Shah, B. Osiński, B. Ichter, and S. Levine, “LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action,” Proc. CoRL, 2023.
[10] A. Brohan et al., “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control,” Proc. CoRL, 2023.
[11] A. Vaswani et al., “Attention Is All You Need,” Proc. NeurIPS, 2017.
[12] J. Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” Proc. NeurIPS, 2022.
[13] T. Brown et al., “Language Models are Few-Shot Learners,” Proc. NeurIPS, 2020.
[14] E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” Proc. ICLR, 2022.
[15] Qwen Team, “Qwen2.5-VL Technical Report,” Alibaba Group, 2024.
