DART-VLN: Test-Time Memory Decay and Anti-Loop Regularization for Discrete Vision-Language Navigation

Jie Mei; Shaoheng Zhang; Zhichen Li

arxiv: 2607.01043 · v1 · pith:KY5ZJS6Gnew · submitted 2026-07-01 · 💻 cs.RO · cs.AI

Pith reviewed 2026-07-02 11:14 UTC · model grok-4.3

classification: cs.RO, cs.AI

keywords: discrete vision-language navigation, test-time control, memory decay, anti-loop regularization, VLN, R2R, REVERIE

The pith

Test-time memory reweighting and reversal penalties improve reliability and efficiency in frozen discrete VLN agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that two lightweight test-time rules can correct common failure modes in memory-based discrete vision-language navigation without retraining the backbone. Test-Time Memory Decay downweights stale evidence at readout, while Anti-Loop Regularization adds a penalty against immediate reversals when choosing the next action. If these rules work, agents achieve shorter trajectories, lower runtime, and higher navigation success on benchmarks such as R2R and REVERIE by using only the existing frozen model.

Core claim

DART-VLN is a training-free framework that applies Test-Time Memory Decay, a read-side reweighting rule suppressing stale and redundant stored evidence, together with Anti-Loop Regularization, a next-hop penalty discouraging immediate reversals, to produce shorter trajectories, reduced runtime, and better navigation metrics while leaving the learned backbone unchanged.
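The read-side reweighting can be pictured with a small sketch. This page does not give the paper's functional form, so the exponential decay, the `half_life` hyperparameter, and the function name `decayed_readout_weights` below are all assumptions made for illustration. Only the readout weights change; the stored memory content is not rewritten.

```python
import math

def decayed_readout_weights(scores, ages, half_life=5.0):
    """Softmax readout weights with an age-based decay (hypothetical form).

    scores: raw readout logits per memory slot, as produced by the frozen backbone.
    ages: steps since each slot was last written (0 means fresh).
    half_life: assumed hyperparameter; a slot's weight halves every `half_life` steps.
    """
    if len(scores) != len(ages):
        raise ValueError("scores and ages must have the same length")
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    # Subtracting rate * age from a logit multiplies its softmax weight
    # by 0.5 ** (age / half_life) before renormalization.
    rate = math.log(2.0) / half_life
    adjusted = [s - rate * a for s, a in zip(scores, ages)]
    top = max(adjusted)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in adjusted]
    total = sum(exps)
    return [e / total for e in exps]

# Two slots with equal logits; the second is one half-life old.
weights = decayed_readout_weights([0.0, 0.0], ages=[0, 5], half_life=5.0)
print(weights)
```

With equal logits, the slot that is one half-life old receives half the weight of the fresh slot after renormalization.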

What carries the argument

Test-Time Memory Decay (read-side reweighting) combined with Anti-Loop Regularization (next-hop penalty rule)
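A next-hop penalty of this kind might look like the sketch below. The additive form, the `penalty` value, and the name `select_next_hop` are assumptions; the page does not state whether the paper's penalty is additive or multiplicative or how it depends on step count.

```python
def select_next_hop(logits, candidates, prev_node, penalty=1.0):
    """Pick the next node after penalizing an immediate reversal (hypothetical form).

    logits: action scores from the frozen backbone, one per candidate.
    candidates: node identifiers aligned with `logits`.
    prev_node: node visited just before the current one, or None at the first step.
    penalty: assumed additive penalty subtracted from the reversal candidate.
    """
    if len(logits) != len(candidates):
        raise ValueError("logits and candidates must have the same length")
    adjusted = [
        (score - penalty) if node == prev_node else score
        for score, node in zip(logits, candidates)
    ]
    best = max(range(len(candidates)), key=adjusted.__getitem__)
    return candidates[best]

# The backbone slightly prefers going back to "A"; the penalty tips the choice to "B".
print(select_next_hop([2.0, 1.5], ["A", "B"], prev_node="A", penalty=1.0))
```

Because the rule only adjusts scores at selection time, the backbone's logits and weights are left unchanged.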

Load-bearing premise

The two failure modes of stale historical evidence at memory readout and inefficient local backtracking are the dominant problems that can be fixed by these reweighting and penalty rules without creating new failure modes.

What would settle it

Applying the decay and anti-loop rules to an R2R or REVERIE evaluation run and observing no reduction in average trajectory length or no gain in success rate relative to the unmodified frozen backbone would falsify the central claim.
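The falsification condition above can be written as a simple check over per-episode results. This is a minimal sketch: the input layout (a list of `(trajectory_length, success)` tuples) and the function name are assumptions, and it compares raw means without any significance test.

```python
from statistics import mean

def central_claim_survives(baseline, treated):
    """Return True only if the treated run is both shorter and more successful.

    baseline, treated: lists of (trajectory_length, success) tuples, one per
    episode, for the unmodified frozen backbone and for the run with the
    decay and anti-loop rules applied (hypothetical data layout).
    """
    tl_base = mean(tl for tl, _ in baseline)
    tl_new = mean(tl for tl, _ in treated)
    sr_base = mean(1.0 if ok else 0.0 for _, ok in baseline)
    sr_new = mean(1.0 if ok else 0.0 for _, ok in treated)
    # No reduction in trajectory length, or no gain in success rate, falsifies the claim.
    return tl_new < tl_base and sr_new > sr_base
```

A stricter version would add confidence intervals or a paired test, which the abstract does not mention.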

Figures

Figures reproduced from arXiv: 2607.01043 by Jie Mei, Shaoheng Zhang, Zhichen Li.

**Figure 1.** DART-VLN test-time pipeline for discrete VLN. The environment maintains explicit grid memory with optional write-side updates (update …

**Figure 2.** Qualitative local behavior under anti-loop regularization. Compared with the baseline, …

Original abstract

Memory-based discrete vision-language navigation (VLN) agents must act under partial observability, yet even strong frozen backbones remain vulnerable at test time. Two common failure modes are stale historical evidence at memory readout and inefficient local backtracking during action selection. We present DART-VLN, a training-free test-time control framework for discrete VLN. DART-VLN combines Test-Time Memory Decay, a read-side memory reweighting rule that suppresses stale and redundant evidence without rewriting stored content, with Anti-Loop Regularization, a lightweight next-hop penalty that discourages immediate reversals during action selection. The framework introduces no new learnable parameters and leaves the learned backbone unchanged. Experiments on R2R and REVERIE show a consistent pattern: decay-only provides stable read-side gains, while decay+anti-loop achieves the best overall quality-efficiency trade-off, yielding shorter trajectories, lower runtime, and improved navigation performance in key settings. Behavioral analysis further confirms that anti-loop regularization reduces local backtracking and improves path efficiency under frozen backbones. Overall, the results show that modest test-time control can make memory-based discrete VLN more reliable and efficient without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DART-VLN adds test-time memory decay and anti-loop penalties to frozen VLN agents, but the abstract gives no numbers, so the size of any gains and the risk of new failure modes remain unclear.

The main thing here is that DART-VLN is a training-free test-time framework that reweights memory readout to drop stale evidence and adds a next-hop penalty to discourage immediate reversals in discrete VLN.

The new element is the named pairing of these two rules for this setting. Memory decay works on the read side without altering stored content, and anti-loop regularization is a lightweight action penalty. The abstract states that decay alone gives stable read-side gains while the combination yields shorter trajectories, lower runtime, and better performance on R2R and REVERIE, backed by behavioral checks that show less local backtracking.

This setup has the practical merit of leaving the backbone untouched and introducing no new parameters, so it can sit on top of existing models.

The soft spots are straightforward. The abstract supplies no quantitative results, baselines, or error bars, so it is impossible to tell how large the improvements are or whether they hold across the full test distribution. The stress-test point about anti-loop regularization is worth taking seriously: in dead-ends, narrow corridors, or noisy observations an immediate reversal may be the only fix, and a fixed penalty could trap the agent or lengthen paths. The paper's claim that behavioral analysis confirms efficiency gains does not automatically rule out new failure modes on the complete test set.

The work is aimed at researchers already working on memory-based discrete VLN who want simple inference-time adjustments. A reader in that niche might pick up the ideas to test themselves.

It deserves peer review so the full experiments, implementation equations, and robustness checks can be examined.

Referee Report

2 major / 2 minor

Summary. The paper proposes DART-VLN, a training-free test-time framework for discrete VLN that applies Test-Time Memory Decay (read-side reweighting to suppress stale/redundant memory evidence) and Anti-Loop Regularization (next-hop penalty to discourage immediate reversals) to frozen backbones. It claims these address two failure modes, yielding shorter trajectories, lower runtime, improved navigation metrics on R2R and REVERIE, and reduced backtracking per behavioral analysis, all without new parameters or retraining.

Significance. If the results hold, the work shows that lightweight test-time interventions with no new learnable parameters can improve the reliability and efficiency of memory-based VLN agents in partially observable settings without modifying the learned backbone. The explicit separation of decay-only and decay+anti-loop ablations and the behavioral analysis of backtracking are strengths.

major comments (2)

[Abstract] Abstract and Experiments section: the central claim of 'consistent improvements' and 'best overall quality-efficiency trade-off' is asserted without any reported success rate, SPL, trajectory length, or runtime numbers, baselines, or statistical significance; this prevents verification that the gains are load-bearing rather than marginal and that anti-loop does not degrade performance on the full test distribution.
[Behavioral analysis] Behavioral analysis and Anti-Loop Regularization description: the assumption that the fixed next-hop penalty avoids new failure modes is load-bearing for the claim of no side effects on the frozen backbone, yet the manuscript does not report results on environments containing dead-ends, narrow corridors, or high observation noise where immediate reversal is the only corrective action; without such targeted evaluation the risk that the penalty traps agents or forces longer detours remains unaddressed.
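The dead-end concern can be made concrete with a toy case. The scores, the action names, and the additive penalty below are invented for illustration and are not taken from the paper; the sketch only shows how a fixed penalty can flip a corrective reversal into a premature stop once it exceeds the score margin.

```python
def chosen_action(logits, prev_node, penalty):
    """Return the highest-scoring action after an additive reversal penalty.

    logits: dict mapping action name to score (hypothetical values).
    prev_node: the action that would undo the last move.
    penalty: assumed fixed penalty applied only to that action.
    """
    adjusted = {
        action: (score - penalty) if action == prev_node else score
        for action, score in logits.items()
    }
    return max(adjusted, key=adjusted.get)

# In a dead-end the only options are to stop or to go back the way the agent came.
dead_end = {"STOP": 0.4, "node_prev": 1.0}
for penalty in (0.0, 0.5, 1.0):
    print(penalty, chosen_action(dead_end, "node_prev", penalty))
# prints:
# 0.0 node_prev
# 0.5 node_prev
# 1.0 STOP
```

In this toy case the reversal survives while the penalty stays below the 0.6 margin between the two scores, which is the kind of sensitivity a targeted evaluation would need to probe.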

minor comments (2)

[Method] Clarify the exact functional form of the memory decay reweighting rule and the anti-loop penalty (e.g., additive vs. multiplicative, dependence on step count) with pseudocode or equations to enable reproducibility.
[Experiments] Add a table comparing decay-only vs. decay+anti-loop vs. baseline across all standard VLN metrics (SR, SPL, NE, TL) on both R2R and REVERIE to make the trade-off explicit.
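For reference, one row of such a table could be computed as in the sketch below. It uses the standard definitions of SR, SPL, NE, and TL; the `Episode` layout, the function name, and the 3 m success radius (the usual R2R convention) are assumptions rather than details from the manuscript.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Episode:
    path_length: float      # length of the path the agent actually took (m)
    shortest_length: float  # geodesic shortest-path length from start to goal (m)
    final_distance: float   # distance from the stop position to the goal (m)

def summarize(episodes, success_radius=3.0):
    """Compute SR, SPL, NE, and TL for one method on one split."""
    success = [e.final_distance < success_radius for e in episodes]
    spl_terms = [
        (e.shortest_length / max(e.path_length, e.shortest_length)) if ok else 0.0
        for e, ok in zip(episodes, success)
    ]
    return {
        "SR": mean(1.0 if ok else 0.0 for ok in success),
        "SPL": mean(spl_terms),
        "NE": mean(e.final_distance for e in episodes),
        "TL": mean(e.path_length for e in episodes),
    }

episodes = [Episode(10.0, 10.0, 1.0), Episode(20.0, 10.0, 2.0), Episode(8.0, 10.0, 5.0)]
print(summarize(episodes))
```

Running `summarize` once per configuration (baseline, decay-only, decay+anti-loop) and per benchmark would give the rows the comment asks for.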

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on result presentation and evaluation of potential side effects. We address each major comment below.

Referee: [Abstract] Abstract and Experiments section: the central claim of 'consistent improvements' and 'best overall quality-efficiency trade-off' is asserted without any reported success rate, SPL, trajectory length, or runtime numbers, baselines, or statistical significance; this prevents verification that the gains are load-bearing rather than marginal and that anti-loop does not degrade performance on the full test distribution.

Authors: The experiments section reports success rate, SPL, trajectory length, and runtime metrics for decay-only, decay+anti-loop, and baselines on R2R and REVERIE, with direct comparisons showing that anti-loop does not degrade performance relative to decay-only. We agree the abstract would benefit from explicit quantitative support for the claims. We will revise the abstract to include key metrics (e.g., SPL gains and trajectory length reductions) from the reported tables.

Revision: yes

Referee: [Behavioral analysis] Behavioral analysis and Anti-Loop Regularization description: the assumption that the fixed next-hop penalty avoids new failure modes is load-bearing for the claim of no side effects on the frozen backbone, yet the manuscript does not report results on environments containing dead-ends, narrow corridors, or high observation noise where immediate reversal is the only corrective action; without such targeted evaluation the risk that the penalty traps agents or forces longer detours remains unaddressed.

Authors: The behavioral analysis and all quantitative results are performed on the full R2R and REVERIE test sets, which contain dead-ends, narrow corridors, and varying observation conditions; these results show reduced backtracking with no increase in trajectory length or drop in success rate when anti-loop is added. We acknowledge that isolated, controlled tests on high-noise reversal-only scenarios would provide additional reassurance. We will add a limitations paragraph noting this scope and confirming that no trapping or forced detours were observed in the standard benchmarks.

Revision: partial

Circularity Check

0 steps flagged

No circularity: test-time rules are direct additions independent of training

The paper introduces DART-VLN as a training-free test-time framework consisting of explicit memory reweighting (decay) and next-hop penalty (anti-loop) rules applied to a frozen backbone. No equations, fitted parameters, or predictions derived from data subsets are presented; the methods are described as modifications that add no new learnable parameters and leave the learned model unchanged. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The central claims rest on empirical evaluation on R2R and REVERIE rather than on any self-referential derivation, so the approach can be checked directly against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no mathematical derivations, free parameters, or background axioms; it gives only high-level natural-language descriptions of the two rules.
