GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap

Task & Motivation

Most evaluation pipelines for navigation instructions still rely on reference-based text metrics borrowed from machine translation and captioning: BLEU, ROUGE, METEOR, and CIDEr. These metrics assume that there is a single correct way to describe a route and that lexical similarity correlates with functional utility. In navigation, this assumption breaks down: "Turn left at the bank" and "Turn right at the bank" share all but one token yet describe opposite trajectories, while "Go past the drug store and turn left" and "Take a left before the light on your right" share no bigrams or longer n-grams yet can describe the same valid action.
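The mismatch can be reproduced with a few lines of code. The sketch below computes clipped n-gram precision, the building block of BLEU-style metrics, on the two sentence pairs quoted above. It is an illustration only: whitespace tokenisation and the helper names `ngrams` and `ngram_precision` are our assumptions, not part of any official metric implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Opposite trajectories, near-identical wording.
opposite = ngram_precision("Turn left at the bank", "Turn right at the bank", 1)

# Potentially equivalent action, no shared bigrams.
equivalent = ngram_precision(
    "Go past the drug store and turn left",
    "Take a left before the light on your right",
    2,
)

print(f"opposite routes, unigram precision: {opposite:.2f}")
print(f"equivalent routes, bigram precision: {equivalent:.2f}")
```

The opposite pair scores high and the equivalent pair scores zero, which is the inverse of what a navigability measure should report.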

Pragmatic alternatives evaluate instructions by sending a follower agent through a high-fidelity visual simulator (Matterport3D, Google Street View) and measuring its success rate. This conflates linguistic quality with visual recognition, hits the licensing and bandwidth ceiling of proprietary panoramas, and limits reproducibility to well-funded labs.

GROKE inverts the standard VLN question. Instead of asking "how well did the agent perform?", we ask "how navigable is this instruction?" — and treat agent execution metrics (Navigation Error, Success Rate, SDTW) as proxy scores for the input text. By replacing pixels with OpenStreetMap graphs (nodes, edges, POIs), GROKE removes the visual recognition bottleneck and enables scalable, vision-free assessment of instruction quality.

Figure 1: Comparison of traditional textual metrics and the proposed pragmatic evaluation metrics.

Method

We formulate VLN over OpenStreetMap as sequential decision-making on a graph $\mathcal{G} = (V, E, P)$, where $V$ is the set of navigable nodes, $E$ the directed edges with associated headings, and $P$ the points of interest with semantic tags. Given an instruction $I$, the goal is a policy $\pi: (I, v_t, h_t, \mathcal{G}_t) \rightarrow v_{t+1}$ that selects the next waypoint from the local map context.
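To make the formulation concrete, here is a minimal sketch of one way to hold $\mathcal{G} = (V, E, P)$ in memory, together with a stand-in for the policy signature. The class names (`POI`, `NavGraph`), the field layout, and the trivial `straight_ahead_policy` are hypothetical; the placeholder policy ignores the instruction and is not the GROKE navigator.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class POI:
    name: str
    tags: tuple      # semantic tags, e.g. ("amenity=bank",)
    node_id: int     # nearest navigable node (assumed association)


@dataclass
class NavGraph:
    nodes: dict = field(default_factory=dict)  # V: node_id -> (x, y) in metres
    edges: dict = field(default_factory=dict)  # E: node_id -> {neighbour_id: heading_deg}
    pois: list = field(default_factory=list)   # P: list of POI

    def add_edge(self, u, v, heading):
        """Add a directed edge u -> v with its heading in degrees."""
        self.edges.setdefault(u, {})[v] = heading % 360

    def neighbours(self, v):
        return self.edges.get(v, {})


def straight_ahead_policy(instruction, v_t, h_t, graph):
    """Placeholder for pi(I, v_t, h_t, G_t) -> v_{t+1}.

    Picks the neighbour whose edge heading deviates least from h_t.
    """
    options = graph.neighbours(v_t)
    if not options:
        return v_t

    def deviation(item):
        diff = abs(item[1] - h_t) % 360
        return min(diff, 360 - diff)

    return min(options.items(), key=deviation)[0]


g = NavGraph(nodes={0: (0, 0), 1: (0, 50), 2: (50, 0)})
g.add_edge(0, 1, 0)    # heading north
g.add_edge(0, 2, 90)   # heading east
g.pois.append(POI("Chase Bank", ("amenity=bank",), node_id=2))

next_node = straight_ahead_policy("Go straight", v_t=0, h_t=10, graph=g)
print(next_node)
```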

GROKE is a training-free, multi-agent hierarchical system composed of three modules:

Sub-Goal & POI Extraction. An LLM parser decomposes the instruction into atomic sub-goals (MOVE_FORWARD, TURN_LEFT, TURN_RIGHT) and extracts landmark mentions, which are grounded onto OSM POIs via fuzzy string matching (RapidFuzz).
Visible Area Construction. Starting from the current node and heading, we traverse forward to the next intersections, modelling the human "line of sight" along the street. The result is a structured local graph annotated with POI proximity and relative direction (Forward / Left / Right / Back).
Navigator Agent. Given the current sub-goal, position, heading, and the structured visible area (encoded as JSON), the agent selects the next waypoint and updates the sub-goal status (IN_PROGRESS, COMPLETED).
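The landmark-grounding step of the first module can be sketched as follows. GROKE uses RapidFuzz for this; to keep the example dependency-free we substitute `difflib.SequenceMatcher` from the Python standard library, which scores strings differently. The `ground_landmark` helper, the 0.6 threshold, and the POI names are illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] (difflib stand-in for RapidFuzz)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def ground_landmark(mention, poi_names, threshold=0.6):
    """Map a landmark mention to the closest POI name.

    Returns (name, score), with name set to None when the best score
    falls below the (assumed) acceptance threshold.
    """
    if not poi_names:
        return None, 0.0
    best = max(poi_names, key=lambda name: similarity(mention, name))
    score = similarity(mention, best)
    return (best if score >= threshold else None), score


# Hypothetical POI names near the route.
poi_names = ["Chase Bank", "Duane Reade Pharmacy", "Starbucks"]

match, score = ground_landmark("chase bank branch", poi_names)
print(match, round(score, 2))

unmatched, _ = ground_landmark("xyzzy", poi_names)
print(unmatched)
```

Rejecting low-scoring matches matters here: a landmark grounded to the wrong POI would steer the Navigator Agent toward the wrong node.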

Figure 2: Overall GROKE architecture, comprising Sub-Goal & POI Extraction, Visible Area Construction, and Navigator Agent.
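The Visible Area Construction module can be approximated with a short forward walk over the graph. The sketch below is our reading of the description, under stated assumptions: headings are compass degrees (0 = north, clockwise), an intersection is any node with more than two outgoing edges, and relative directions are binned into 90° sectors. The function names and the toy edge map are hypothetical.

```python
def angular_gap(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    diff = abs(a - b) % 360
    return min(diff, 360 - diff)


def relative_direction(heading, bearing):
    """Bin a bearing into Forward / Right / Back / Left relative to the heading."""
    delta = (bearing - heading) % 360
    if delta < 45 or delta >= 315:
        return "Forward"
    if delta < 135:
        return "Right"
    if delta < 225:
        return "Back"
    return "Left"


def walk_to_next_intersection(edges, start, heading, max_steps=50):
    """Follow the street ahead until the next intersection or a dead end."""
    path, current = [start], start
    for _ in range(max_steps):
        ahead = {n: h for n, h in edges.get(current, {}).items()
                 if relative_direction(heading, h) == "Forward"}
        if not ahead:
            break  # dead end, or the street forces a turn
        current = min(ahead, key=lambda n: angular_gap(heading, ahead[n]))
        heading = ahead[current]
        path.append(current)
        if len(edges.get(current, {})) > 2:
            break  # reached an intersection
    return path, heading


# Toy street: 0 -> 1 -> 2, where node 2 is an intersection.
edges = {
    0: {1: 0},
    1: {0: 180, 2: 0},
    2: {1: 180, 3: 0, 4: 90},
    3: {2: 180},
    4: {2: 270},
}

path, heading = walk_to_next_intersection(edges, start=0, heading=0)
exits = {n: relative_direction(heading, h) for n, h in edges[path[-1]].items()}
print(path, exits)
```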

Contributions

Spatial representation study. GROKE systematically compares four spatial encodings (textual narratives, structured JSON, Graphviz-style notation, and grid matrices) and shows that structured JSON performs best, reaching a Navigation Error of 41.3 m and a Success Rate of 74.0% on the 100-instance validation slice.
Agent-as-Judge formalisation. We frame instruction evaluation as agentic execution and provide an experimental protocol that validates the judge itself by correlating its metrics with human navigability ratings.
Open implementation. We release a detailed implementation of the framework, including graph serialisation algorithms, prompt-engineering strategies for hierarchical reasoning, and statistical analyses of agent trajectories.

Experimental Results

We evaluate GROKE on two 700-instance test splits of Map2Seq (Test_A and Test_B, corresponding to the Test_Seen and Test_Unseen splits of the original release). Following standard VLN protocols we report Navigation Error (NE), Success Rate (SR), Oracle Success Rate (OSR), and Success weighted by Dynamic Time Warping (SDTW). The Navigator Agent runs on Gemini-3 Pro with the default thinking configuration.
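For readers unfamiliar with these metrics, the sketch below computes them for a single trajectory using their standard VLN definitions, plus nDTW, which SDTW builds on. Several details are assumptions made for illustration: positions are planar coordinates in metres, distances are Euclidean rather than shortest-path, and the 25 m success threshold is a placeholder because the value used in the experiments is not stated here.

```python
import math


def dist(a, b):
    """Euclidean distance between two (x, y) points in metres."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def dtw(path, ref):
    """Dynamic Time Warping cost between two point sequences."""
    inf = float("inf")
    n, m = len(path), len(ref)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(path[i - 1], ref[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def evaluate(path, ref, threshold=25.0):
    """Return NE, SR, OSR, nDTW and SDTW for one predicted path."""
    goal = ref[-1]
    ne = dist(path[-1], goal)                                  # Navigation Error
    sr = 1.0 if ne <= threshold else 0.0                       # Success Rate (per instance)
    osr = 1.0 if min(dist(p, goal) for p in path) <= threshold else 0.0
    ndtw = math.exp(-dtw(path, ref) / (len(ref) * threshold))  # normalised DTW
    return {"NE": ne, "SR": sr, "OSR": osr, "nDTW": ndtw, "SDTW": sr * ndtw}


ref = [(0, 0), (0, 50), (0, 100)]
perfect = evaluate(ref, ref)
overshoot = evaluate(ref + [(0, 150)], ref)  # passes the goal, then keeps walking

print(perfect)
print(overshoot)
```

The overshooting path shows why OSR and SR diverge: the agent passes within the threshold of the goal (OSR = 1) but stops 50 m beyond it (SR = 0, and therefore SDTW = 0).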

Method	Test_A				Test_B
	NE (m) ↓	SR ↑	OSR ↑	SDTW ↑	NE (m) ↓	SR ↑	OSR ↑	SDTW ↑
Random Walker	259.0	4.4%	5.7%	0.026	244.3	6.1%	7.1%	0.029
Action Sampling	250.1	5.1%	6.0%	0.037	241.6	7.4%	8.1%	0.039
Heuristic Agent	180.6	18.0%	18.9%	0.155	173.0	17.9%	19.1%	0.159
GROKE (Ours)	56.8	66.4%	78.4%	0.634	59.8	63.3%	78.0%	0.609

Table 1: Overall navigation execution results. The best baseline in every column is the Heuristic Agent.

GROKE reduces Navigation Error by roughly 68.5% relative to the strongest baseline (Heuristic Agent: 180.6 m → 56.8 m) and lifts Success Rate from 18.0% to 66.4% on Test_A. The gains carry over to the unseen split (Test_B: SR 17.9% → 63.3%), which is consistent with a training-free system whose graph reasoning does not depend on having seen the area before.
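The relative reductions follow directly from the NE columns of Table 1. A short check, using only the reported numbers (the helper name `relative_reduction` is ours):

```python
def relative_reduction(baseline, ours):
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    return (baseline - ours) / baseline


# NE in metres, Heuristic Agent vs. GROKE, taken from Table 1.
test_a = relative_reduction(180.6, 56.8)
test_b = relative_reduction(173.0, 59.8)

print(f"Test_A NE reduction: {test_a:.1%}")
print(f"Test_B NE reduction: {test_b:.1%}")
```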

Correlation with human navigability judgments

On 100 randomly sampled instructions rated by human annotators, Navigation Error shows the strongest agreement with human ratings (Pearson r = −0.31, p < 0.01; Spearman ρ = −0.32, p < 0.01); the negative sign means that instructions rated as more navigable yield lower error. SR, SDTW, and normalised Dynamic Time Warping (nDTW) are also significantly correlated; OSR alone fails to reach significance. These results support using NE and nDTW as primary proxies when human-in-the-loop validation is infeasible.
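The judge-validation step amounts to correlating a per-instruction agent metric with the corresponding human rating. The sketch below implements Pearson and Spearman coefficients in plain Python. The five data points are invented purely to exercise the code and are not drawn from the study; significance testing is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up toy data: agent NE in metres and human navigability ratings (1-5).
toy_ne = [10.0, 35.0, 80.0, 150.0, 20.0]
toy_rating = [5, 4, 2, 1, 4]

print(f"Pearson r = {pearson(toy_ne, toy_rating):.2f}")
print(f"Spearman rho = {spearman(toy_ne, toy_rating):.2f}")
```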

Ablation Studies

1. Spatial representation matters

We compare four ways of feeding the visible area to the LLM: a Textual narrative, a Structured JSON graph, Graphviz-style notation, and a Grid (matrix) rendering. Structured JSON and Textual encodings clearly outperform the visual-style formats; the Grid representation collapses to near-baseline performance (NE 175.4 m, SR 10%). On hard instructions JSON pulls further ahead of Textual, suggesting that hierarchical structure helps the LLM recover from compounding errors.
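To give a feel for the difference between the two strongest encodings, the sketch below serialises the same toy visible area as structured JSON and as a textual narrative. The field names (`current_node`, `exits`, `pois`, `distance_m`) and the sentence templates are hypothetical; the actual GROKE schema and prompts may differ.

```python
import json

# Hypothetical visible area at an intersection.
visible_area = {
    "current_node": 2,
    "heading": "north",
    "exits": [
        {"to": 3, "direction": "Forward", "distance_m": 48},
        {"to": 4, "direction": "Right", "distance_m": 52},
    ],
    "pois": [{"name": "Chase Bank", "direction": "Right", "distance_m": 20}],
}


def to_json(area):
    """Structured encoding: nesting keeps exits and POIs explicitly separated."""
    return json.dumps(area, indent=2)


def to_text(area):
    """Textual encoding: the same facts flattened into sentences."""
    parts = [f"You are at node {area['current_node']} facing {area['heading']}."]
    for e in area["exits"]:
        parts.append(f"{e['direction']}: node {e['to']} is {e['distance_m']} m away.")
    for p in area["pois"]:
        parts.append(f"{p['name']} lies {p['distance_m']} m away ({p['direction']}).")
    return " ".join(parts)


print(to_json(visible_area))
print(to_text(visible_area))
```

Both strings carry the same information; only the JSON version can be parsed back losslessly, which may be part of why the model recovers more reliably on hard instructions.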

Figure 4: Overall performance across spatial-information representations.

2. Sub-instruction decomposition is essential

We compare three instruction-feeding strategies: complete instruction, rule-based sentence split, and our LLM Divider that produces semantically grounded atomic sub-goals. The LLM Divider reduces NE by 42.6% relative to the complete-instruction baseline (41.3 m vs. 71.9 m) and raises SR from 51.0% to 74.0%. The advantage is largest on hard instructions, where SR jumps from 23.1% to 53.8%.

Method	NE ↓	SR ↑	OSR ↑	nDTW ↑
Complete Instruction	71.9	51.0%	55.0%	0.496
Rule-based Split	58.4	54.0%	69.0%	0.509
LLM Divider (Ours)	41.3	74.0%	82.0%	0.714

Table 2: Impact of sub-instruction decomposition (100 validation instances).
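A rule-based split of the kind used as the middle baseline could be as simple as cutting at sentence boundaries. The sketch below is one plausible version, not the exact rule used in the experiments; the regular expression and the example instruction are our assumptions.

```python
import re


def rule_based_split(instruction):
    """Split an instruction at sentence-ending punctuation followed by whitespace."""
    pieces = re.split(r"(?<=[.!?;])\s+", instruction.strip())
    return [p for p in pieces if p]


subgoals = rule_based_split(
    "Go straight to the light. Turn left at the bank and stop at the bus stop."
)
for s in subgoals:
    print(s)
```

The second piece still bundles a turn and a stop into one unit. Producing atomic sub-goals such as TURN_LEFT followed by a separate stopping condition requires semantic parsing, which is the role of the LLM Divider.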

3. Thinking budget vs. cost

Higher thinking budgets help on hard instructions but show diminishing returns. The Low setting consumes ~33k tokens on average, High ~41k, and Auto ~44k. Each additional metre of NE reduction costs roughly 1,334 thought tokens (~1.6¢ at $12 / M tokens), and a 0.01 increase in nDTW costs ~1,731 tokens (~2.1¢ at the same rate).
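The token-to-cost conversion is simple arithmetic and can be checked as follows. The price of $12 per million tokens is the figure quoted above; the helper name `tokens_to_cents` is ours.

```python
PRICE_PER_MILLION_USD = 12.0  # price quoted in the text


def tokens_to_cents(tokens, price_per_million=PRICE_PER_MILLION_USD):
    """Convert a thought-token count into a cost in US cents."""
    return tokens * price_per_million / 1_000_000 * 100


per_metre_ne = tokens_to_cents(1334)   # cost of one metre of NE reduction
per_ndtw_step = tokens_to_cents(1731)  # cost of a 0.01 gain in nDTW

print(f"1 m NE reduction: {per_metre_ne:.2f} cents")
print(f"+0.01 nDTW:       {per_ndtw_step:.2f} cents")
```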

Figure 5: Marginal cost of performance improvements across navigation metrics, measured in thought tokens and cents.

Case study

A common failure pattern is overshooting the goal: when a landmark (e.g. a bus stop) is mentioned as a stopping reference, the agent occasionally treats it as a navigation destination and walks past the intended stopping intersection.

BibTeX

@misc{shami2026grokevisionfreenavigationinstruction,
      title={GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap},
      author={Farzad Shami and Subhrasankha Dey and Nico Van de Weghe and Henrikki Tenkanen},
      year={2026},
      eprint={2601.07375},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.07375},
}