Rethinking Generic Object Tracking Toward Human-Level Perceptual Intelligence

Shih-Fang Chen

arXiv: 2607.01395 · v1 · pith:Q7B2J4SQnew · submitted 2026-07-01 · 💻 cs.CV · cs.AI · cs.LG · cs.MM · eess.IV

Pith reviewed 2026-07-03 20:59 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.MM · eess.IV

keywords generic object tracking · visual object tracking · human visual perception · target discrimination · robust adaptation · geometric reasoning · computer vision

The pith

Enhancing target discrimination, robust adaptation, and geometric reasoning narrows the gap between machine trackers and human visual perception.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to make generic object tracking more like human vision, which maintains coherent understanding by integrating prior knowledge, spatial geometry, and semantic context. Current models struggle with unpredictable events, leading to failures under deformation, distractors, or novel categories. The proposed methods aim to fix this by boosting three key capabilities: better distinguishing the target, adapting online to changes, and reasoning about geometry. A sympathetic reader would care because this could make automated tracking reliable in real-world dynamic environments where humans succeed naturally.

Core claim

Generic object tracking can be advanced toward human-level performance by a series of methods that systematically enhance the target discrimination, robust adaptation, and geometric reasoning capabilities of tracking models, thereby addressing bottlenecks in generalization and online adaptation for unpredictable future events and variations.

What carries the argument

A series of methods that strengthen three capabilities of trackers initialized from a single bounding box: target discrimination against distractors, robust online adaptation to variations, and geometric reasoning about spatial context.

If this is right

Trackers maintain visual continuity despite severe target deformation.
Models better resist complex distractors and significant environmental changes.
Performance improves on object categories unseen during training.
Reliable localization continues from an initial bounding box in dynamic streams.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Stronger adaptation and geometry modules might reduce reliance on massive labeled training sets.
The same three enhancements could transfer to related tasks like video object segmentation.
Success would suggest that targeted capability boosts, rather than full scene semantics, suffice for human-like tracking.

Load-bearing premise

That the main bottlenecks of generalization and online adaptation can be addressed by systematically enhancing target discrimination, robust adaptation, and geometric reasoning.

What would settle it

A sequence of test videos where a tracker using the proposed enhancements still loses the target on a novel combination of severe deformation and unseen-category distractors.
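One way to make "loses the target" checkable is sketched below. It is an illustration under stated assumptions, not the paper's evaluation protocol: boxes are taken to be `(x, y, width, height)`, and the helper names `iou` and `first_loss_frame`, the `loss_iou` threshold, and the `patience` window are all hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple

# Assumed box convention for this sketch: (x, y, width, height).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def first_loss_frame(
    predictions: List[Box],
    ground_truth: List[Box],
    loss_iou: float = 0.0,  # hypothetical threshold: at or below this counts as a miss
    patience: int = 3,      # hypothetical window: this many consecutive misses counts as a loss
) -> Optional[int]:
    """Return the index where the first run of `patience` misses starts, or None."""
    run = 0
    for i, (pred, truth) in enumerate(zip(predictions, ground_truth)):
        if iou(pred, truth) <= loss_iou:
            run += 1
            if run >= patience:
                return i - patience + 1
        else:
            run = 0
    return None
```

A sequence would count against the claim if `first_loss_frame` returns an index for a tracker that uses the proposed enhancements; how strict `loss_iou` and `patience` should be is a judgment call the source does not settle.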

Figures

Figures reproduced from arXiv: 2607.01395 by Shih-Fang Chen.

**Figure 3.1.** Teaser of our method PiVOT. Given the features of [PITH_FULL_IMAGE:figures/full_fig_p033_3_1.png]

**Figure 3.2.** Overview of PiVOT. During the (a) training phase, we [PITH_FULL_IMAGE:figures/full_fig_p036_3_2.png]

**Figure 3.3.** Success plots of the proposed and competing methods. [PITH_FULL_IMAGE:figures/full_fig_p041_3_3.png]

**Figure 3.4.** Attribute analysis on AVisT compares PiVOT with [PITH_FULL_IMAGE:figures/full_fig_p046_3_4.png]

**Figure 3.5.** Failure cases of PiVOT. Visual comparison of track [PITH_FULL_IMAGE:figures/full_fig_p049_3_5.png]

**Figure 3.6.** Attribute-based analysis of LaSOT and AVisT, com [PITH_FULL_IMAGE:figures/full_fig_p050_3_6.png]

**Figure 3.7.** Attribute-based analysis of OTB-100 and UAV123, [PITH_FULL_IMAGE:figures/full_fig_p050_3_7.png]

**Figure 3.8.** Visualization of visual prompting through PiVOT. [PITH_FULL_IMAGE:figures/full_fig_p053_3_8.png]

**Figure 3.9.** Visualization results of PiVOT. Visual comparison of [PITH_FULL_IMAGE:figures/full_fig_p053_3_9.png]

**Figure 4.1.** Teaser of our method GOT-JEPA. (a) GOT-JEPA ex [PITH_FULL_IMAGE:figures/full_fig_p061_4_1.png]

**Figure 4.2.** Overview of the proposed framework. (a) We pre-train a [PITH_FULL_IMAGE:figures/full_fig_p064_4_2.png]

**Figure 4.3.** Attribute analysis of OTB-100, AVisT, and LaSOT [PITH_FULL_IMAGE:figures/full_fig_p075_4_3.png]

**Figure 4.4.** Comparison of methods using NPr, Pr, and SUC plots [PITH_FULL_IMAGE:figures/full_fig_p078_4_4.png]

**Figure 4.5.** Comparison of methods using NPr, Pr, and SUC plots [PITH_FULL_IMAGE:figures/full_fig_p078_4_5.png]

**Figure 4.6.** Comparison of methods using NPr, Pr, and SUC plots [PITH_FULL_IMAGE:figures/full_fig_p078_4_6.png]

**Figure 4.7.** An analysis of the validation curve: how tracker pre [PITH_FULL_IMAGE:figures/full_fig_p083_4_7.png]

**Figure 4.8.** Visual comparisons of tracking results from raw annota [PITH_FULL_IMAGE:figures/full_fig_p085_4_8.png]

**Figure 4.9.** An ablation study investigates the frame gap between [PITH_FULL_IMAGE:figures/full_fig_p086_4_9.png]

**Figure 4.10.** Comparison of relative learning rates between ProjNet [PITH_FULL_IMAGE:figures/full_fig_p086_4_10.png]

**Figure 4.11.** Point refinement visualization. Col 1: Initial frame [PITH_FULL_IMAGE:figures/full_fig_p087_4_11.png]

**Figure 5.1.** The GOT-Edit framework. GOT-Edit facilitates the [PITH_FULL_IMAGE:figures/full_fig_p094_5_1.png]

**Figure 5.2.** From left to right, success plots of competing methods [PITH_FULL_IMAGE:figures/full_fig_p103_5_2.png]

**Figure 5.3.** Attribute analysis of OTB, AVisT, and LaSOT from [PITH_FULL_IMAGE:figures/full_fig_p105_5_3.png]

**Figure 5.4.** Comparison of methods using NPr, Pr, and SUC on [PITH_FULL_IMAGE:figures/full_fig_p110_5_4.png]

**Figure 5.5.** Comparison of methods using NPr, Pr, and SUC on [PITH_FULL_IMAGE:figures/full_fig_p110_5_5.png]

**Figure 5.6.** Comparison of methods using NPr, Pr, and SUC on [PITH_FULL_IMAGE:figures/full_fig_p110_5_6.png]

**Figure 5.7.** Comparison of methods using NPr, Pr, and SUC on [PITH_FULL_IMAGE:figures/full_fig_p111_5_7.png]

**Figure 5.8.** Visual comparisons of tracking results from GOT-Edit, [PITH_FULL_IMAGE:figures/full_fig_p112_5_8.png]

Original abstract

At the heart of human visual perception lies the ability to maintain a continuous and coherent understanding of the external world. By integrating observations with accumulated experience, the human visual system can continuously adapt to variations in both the target and its surrounding environment, while preserving robust visual continuity as scene dynamics evolve. Human vision can therefore integrate prior knowledge, spatial geometry, and semantic context to understand complex scenes and their changes. As a core problem in computer vision, visual object tracking aims to bring machine perception closer to human visual perception. These capabilities are central to the task of Generic Object Tracking (GOT). In this task, a visual tracker is initialized only with the bounding box of an arbitrarily specified target in the first frame, and must continuously localize the target in subsequent dynamic visual streams. However, future events, observations, and real-world variations are inherently unpredictable; therefore, the model's generalization and online adaptation capabilities remain bottlenecks. Tracking reliability can deteriorate when the target undergoes severe deformation, is affected by complex distractors, encounters significant environmental changes, or belongs to a category unseen during training. This dissertation aims to narrow the gap between machine visual tracking systems and human visual perception by proposing a series of methods that systematically enhance the target discrimination, robust adaptation, and geometric reasoning capabilities of tracking models.
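The task setup in the abstract (a box is given only in the first frame, then the tracker must localize the target in every later frame) can be sketched as a small interface. This is a minimal illustration and not the dissertation's code: the `Tracker` protocol, the `(x, y, width, height)` box convention, `run_sequence`, and the trivial `StaticBoxTracker` baseline are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

# Assumed box convention for this sketch: (x, y, width, height).
Box = Tuple[float, float, float, float]


class Tracker(Protocol):
    """Hypothetical minimal interface for a generic object tracker."""

    def init(self, frame: object, box: Box) -> None: ...

    def update(self, frame: object) -> Box: ...


@dataclass
class StaticBoxTracker:
    """Placeholder baseline: repeats the first-frame box and never adapts."""

    box: Box = (0.0, 0.0, 0.0, 0.0)

    def init(self, frame: object, box: Box) -> None:
        self.box = box

    def update(self, frame: object) -> Box:
        return self.box


def run_sequence(tracker: Tracker, frames: List[object], first_box: Box) -> List[Box]:
    """Run the one-shot protocol: supervision exists only for frames[0]."""
    tracker.init(frames[0], first_box)
    predictions = [first_box]
    for frame in frames[1:]:
        # No ground truth is available here; the tracker relies on its own state.
        predictions.append(tracker.update(frame))
    return predictions
```

The static baseline makes the bottleneck named in the abstract visible: a tracker that never updates its state after `init` has no way to follow deformation or scene change, which is the gap that online adaptation is meant to close.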

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a dissertation proposal that flags real tracking problems but contains no methods, experiments, or results.

The core fact here is that this document is a plan, not a completed piece of work. It correctly names the usual failure modes in generic object tracking—severe deformation, distractors, unseen categories, and lack of online adaptation—but stops at stating the intention to fix them through better target discrimination, adaptation, and geometric reasoning.

What it does reasonably is lay out the human-vision analogy in plain terms and tie it to the standard GOT setup. That framing is standard in the field and the bottlenecks it lists match what most tracking papers already acknowledge.

The obvious limitation is the complete absence of any concrete proposal. No architecture, no loss function, no training regime, no dataset, and no numbers. Without those, there is nothing to evaluate for soundness or novelty. The abstract itself says the work “aims to” do these things; it does not claim to have done them.

A reader looking for new algorithms or empirical findings will find none. Someone writing a survey or planning their own thesis might skim the motivation section for the problem list, but that is about the extent of the value.

I would not bring this to a reading group and would not cite it. A serious editor should desk-reject rather than send it out for review, because there is no technical contribution to referee.

Referee Report

1 major / 0 minor

Summary. The manuscript is a dissertation proposal that identifies limitations in generic object tracking (GOT), including poor generalization and online adaptation to unpredictable events such as target deformation, distractors, environmental changes, and unseen categories. It claims that integrating human-like capabilities for target discrimination, robust adaptation, and geometric reasoning will narrow the gap to human-level perceptual intelligence, but provides no specific methods, derivations, experiments, or results.

Significance. Advancing visual tracking toward human-level robustness would be significant for computer vision applications. However, because the manuscript supplies no methods, data, or evidence, no assessment of achieved significance is possible; the contribution remains aspirational.

major comments (1)

[Abstract] The central claim that 'a series of methods' will systematically enhance target discrimination, robust adaptation, and geometric reasoning is unsupported by any description of those methods, any equations, any experimental design, or any preliminary results. This renders the claim an intention rather than a testable or verifiable contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review. Our manuscript is a dissertation proposal that frames open challenges in generic object tracking and outlines a research agenda; it does not claim to deliver completed methods or results. We address the single major comment below.

Referee: [Abstract] The central claim that 'a series of methods' will systematically enhance target discrimination, robust adaptation, and geometric reasoning is unsupported by any description of those methods, any equations, any experimental design, or any preliminary results. This renders the claim an intention rather than a testable or verifiable contribution.

Authors: We agree that no concrete methods, equations, experimental designs, or results are supplied. The manuscript is explicitly a dissertation proposal whose abstract states the intended research program ('This dissertation aims to narrow the gap ... by proposing a series of methods'). The contribution at this stage is the identification of the three core bottlenecks (target discrimination, robust adaptation, geometric reasoning) and the argument that addressing them would move tracking closer to human-level robustness. Because the document is a proposal rather than a completed study, the absence of implementation details is by design; the abstract accurately describes the scope of the planned dissertation work.

revision: no

Circularity Check

0 steps flagged

No circularity: proposal without derivations or equations

The document is a dissertation proposal whose abstract states an intention to propose methods for target discrimination, robust adaptation, and geometric reasoning to approach human-level tracking. No equations, parameter fits, self-citations, uniqueness theorems, or ansatzes are supplied. The central text contains no derivation chain that could reduce to its own inputs by construction; the claim is aspirational rather than a completed result. This matches the default expectation of no significant circularity for a high-level goal statement.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical details, equations, parameters, or specific assumptions are described in the abstract.

Reference graph

Works this paper leans on

212 extracted references · 212 canonical work pages · 1 internal anchor

[1]

Picture perception reveals mental geometry of 3d scene inferences,

E. Koch, F. Baig, and Q. Zaidi, “Picture perception reveals mental geometry of 3d scene inferences,” Proceedings of the National Academy of Sciences of the United States of America (PNAS) , 2018. 1, 79

work page 2018
[2]

Knowledge in perception and illusion,

R. L. Gregory, “Knowledge in perception and illusion,” Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences (PHILOS T R SOC B), 1997. 1, 79

work page 1997
[3]

Learning discriminative model prediction for tracking,

G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte, “Learning discriminative model prediction for tracking,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV) , 2019. 1, 10, 19, 25, 26, 31, 47, 51, 61, 79, 83, 87

work page 2019
[4]

Siamrpn++: Evolution of siamese visual tracking with very deep networks,

B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, “Siamrpn++: Evolution of siamese visual tracking with very deep networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2019. 1, 11, 19, 32, 34, 47, 79

work page 2019
[5]

Visual object tracking with discriminative ﬁlters and siamese networks: a survey and outlook,

S. Javed, M. Danelljan, F. S. Khan, M. H. Khan, and J. Matas, “Visual object tracking with discriminative ﬁlters and siamese networks: a survey and outlook,” IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) , 2022. 1, 10, 11, 27, 47, 51, 79, 83

work page 2022
[6]

A vist: A benchmark for visual object tracking in adverse visibility,

M. Noman, W. A. Ghallabi, D. Najiha, C. Mayer, A. Dudhane, M. Danelljan, H. Cholakkal, S. Khan, L. Van Gool, and F. S. Khan, “A vist: A benchmark for visual object tracking in adverse visibility,” in Proc. Brit. Mach. Vis. Conf. (BMVC) ,

work page
[7]

2, 16, 28, 29, 33, 43, 59, 68, 88

work page
[8]

Improving visual object tracking through visual prompting,

S.-F. Chen, J.-C. Chen, I.-H. Jhuo, and Y.-Y. Lin, “Improving visual object tracking through visual prompting,” IEEE Transactions on Multimedia (TMM) , 2025. 4, 8, 47, 49, 51, 52, 53, 55, 58, 59, 60, 61, 62, 66, 67, 83, 87, 89, 90, 96

work page 2025
[9]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn. (ICML) , 2021. 4, 8, 19, 22, 29, 35, 36, 41, 66 106

work page 2021
[10]

GOT-JEPA: Generic object tracking with model adaptation and occlusion handling using joint-embedding pre- dictive architecture,

S.-F. Chen, J.-C. Chen, I.-H. Jhuo, and Y.-Y. Lin, “GOT-JEPA: Generic object tracking with model adaptation and occlusion handling using joint-embedding pre- dictive architecture,” IEEE Trans. Circ. Syst. Video Tech. (TCSVT) , 2026. 5, 46

work page 2026
[11]

A path towards autonomous machine intelligence,

Y. LeCun, “A path towards autonomous machine intelligence,” https:// openreview.net/forum?id=BZ5a1r-kVsf, 2022. 5, 6, 8, 13, 49, 52

work page 2022
[12]

Cotracker: It is better to track together,

N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht, “Cotracker: It is better to track together,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2024. 6, 8, 14, 15, 49, 55, 56, 75, 76, 79

work page 2024
[13]

GOT-Edit: Geometry-aware generic object tracking via online model editing,

S.-F. Chen, J.-C. Chen, I. hong Jhuo, and Y.-Y. Lin, “GOT-Edit: Geometry-aware generic object tracking via online model editing,” in The Fourteenth International Conference on Learning Representations , 2026. 6, 78

work page 2026
[14]

Vggt: Visual geometry grounded transformer,

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025. 6, 8, 15, 76, 79, 80, 84, 89

work page 2025
[15]

Dinov2: Learning robust visual features without supervision,

M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fer- nandez, D. Haziza, F. Massa, A. El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” Trans. Mach. Learn. Res. (TMLR) , 2023. 6, 8, 12, 21, 28, 41, 53, 61, 62, 67, 80, 84, 89, 90

work page 2023
[16]

Alphaedit: Null-space constrained knowledge editing for language models,

J. Fang, H. Jiang, K. Wang, Y. Ma, S. Jie, X. Wang, X. He, and T.-S. Chua, “Alphaedit: Null-space constrained knowledge editing for language models,” in Proc. Int. Conf. Learn. Represent. (ICLR) , 2025. 7, 8, 80, 82

work page 2025
[17]

Visual object tracking using adaptive correlation ﬁlters,

D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, “Visual object tracking using adaptive correlation ﬁlters,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2010. 10

work page 2010
[18]

Exploiting the circulant structure of tracking-by-detection with kernels,

J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant structure of tracking-by-detection with kernels,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2012. 10, 83

work page 2012
[19]

Learning multi-domain convolutional neural networks for vi- sual tracking,

H. Nam and B. Han, “Learning multi-domain convolutional neural networks for vi- sual tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) ,

work page
[20]

Learning background-aware correlation ﬁlters for visual tracking,

H. Kiani Galoogahi, A. Fagg, and S. Lucey, “Learning background-aware correlation ﬁlters for visual tracking,” in Proc. Eur. Conf. Comput. Vis. (ECCV) , 2017. 10 107

work page 2017
[21]

Joint representation and truncated inference learning for correlation ﬁlter based tracking,

Y. Yao, X. Wu, S. Shan, and W. Zuo, “Joint representation and truncated inference learning for correlation ﬁlter based tracking,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018. 10

work page 2018
[22]

Discriminative correlation ﬁlter with channel and spatial reliability,

A. Lukezic, T. Vojir, L. Cehovin Zajc, J. Matas, and M. Kristan, “Discriminative correlation ﬁlter with channel and spatial reliability,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2017. 10

work page 2017
[23]

Eco: Eﬃcient con- volution operators for tracking,

M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg, “Eco: Eﬃcient con- volution operators for tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017. 10

work page 2017
[24]

Beyond correlation ﬁlters: Learning continuous convolution operators for visual tracking,

M. Danelljan, A. Robinson, F. Shahbaz Khan, and M. Felsberg, “Beyond correlation ﬁlters: Learning continuous convolution operators for visual tracking,” in Proc. Eur. Conf. Comput. Vis. (ECCV) , 2016. 10, 66

work page 2016
[25]

Learning a novel ensemble tracker for robust visual tracking,

K. Nai and S. Chen, “Learning a novel ensemble tracker for robust visual tracking,” IEEE Trans. Multimedia (TMM) , 2023. 10, 32, 42

work page 2023
[26]

Robust tracking against adversarial attacks,

S. Jia, C. Ma, Y. Song, and X. Yang, “Robust tracking against adversarial attacks,” in Proc. Eur. Conf. Comput. Vis. (ECCV) , 2024. 10

work page 2024
[27]

Occlusion-aware real-time object tracking,

X. Dong, J. Shen, D. Yu, W. Wang, J. Liu, and H. Huang, “Occlusion-aware real-time object tracking,” IEEE Trans. Image Process. (TIP) , 2016. 10

work page 2016
[28]

Semantics-aware visual object tracking,

R. Yao, G. Lin, C. Shen, Y. Zhang, and Q. Shi, “Semantics-aware visual object tracking,” IEEE Trans. Circ. Syst. Video Tech. (TCSVT) , 2019. 10

work page 2019
[29]

Mining spatial-temporal similarity for visual tracking,

Y. Zhang, X. Gao, Z. Chen, H. Zhong, H. Xie, and C. Yan, “Mining spatial-temporal similarity for visual tracking,” IEEE Trans. Image Process. (TIP) , 2020. 10

work page 2020
[30]

Tracking-learning-detection,

Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) , 2011. 10

work page 2011
[31]

Rtracker: Recoverable tracking via pn tree structured memory,

Y. Huang, X. Li, Z. Zhou, Y. Wang, Z. He, and M.-H. Yang, “Rtracker: Recoverable tracking via pn tree structured memory,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024. 10

work page 2024
[32]

High-speed tracking with kernelized correlation ﬁlters,

J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation ﬁlters,” IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) ,

work page
[33]

Joint spatio-temporal similarity and discrimination learning for visual tracking,

Y. Liang, H. Chen, Q. Wu, C. Xia, and J. Li, “Joint spatio-temporal similarity and discrimination learning for visual tracking,” IEEE Trans. Circ. Syst. Video Tech. (TCSVT), 2025. 10 108

work page 2025
[34]

Deformable object tracking with gated fusion,

W. Liu, Y. Song, D. Chen, S. He, Y. Yu, T. Yan, G. P. Hancke, and R. W. Lau, “Deformable object tracking with gated fusion,” IEEE Trans. Image Process. (TIP),

work page
[35]

Transformer meets tracker: Exploiting temporal context for robust visual tracking,

N. Wang, W. Zhou, J. Wang, and H. Li, “Transformer meets tracker: Exploiting temporal context for robust visual tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2021. 11, 32, 34, 43, 91, 96

work page 2021
[36]

Transforming model prediction for tracking,

C. Mayer, M. Danelljan, G. Bhat, M. Paul, D. P. Paudel, F. Yu, and L. Van Gool, “Transforming model prediction for tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2022. 10, 14, 22, 24, 25, 26, 31, 32, 34, 42, 43, 47, 49, 51, 52, 53, 55, 58, 59, 60, 61, 63, 66, 67, 80, 83, 84, 85, 87, 89, 91

work page 2022
[37]

Model-agnostic meta-learning for fast adapta- tion of deep networks,

C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adapta- tion of deep networks,” in Proc. Int. Conf. Mach. Learn. (ICML) , 2017. 10

work page 2017
[38]

Meta-learning via hypernetworks,

D. Zhao, S. Kobayashi, J. Sacramento, and J. von Oswald, “Meta-learning via hypernetworks,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) , 2020. 10, 53

work page 2020
[39]

A simple neural attentive meta-learner,

N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, “A simple neural attentive meta-learner,” in Proc. Int. Conf. Learn. Represent. (ICLR) , 2018. 10

work page 2018
[40]

Fully- convolutional siamese networks for object tracking,

L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully- convolutional siamese networks for object tracking,” in Proc. Eur. Conf. Comput. Vis. (ECCV) , 2016. 11

work page 2016
[41]

High performance visual tracking with siamese region proposal network,

B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, “High performance visual tracking with siamese region proposal network,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018. 11

work page 2018
[42]

Siamcar: Siamese fully convolu- tional classiﬁcation and regression for visual tracking,

D. Guo, J. Wang, Y. Cui, Z. Wang, and S. Chen, “Siamcar: Siamese fully convolu- tional classiﬁcation and regression for visual tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2020. 11

work page 2020
[43]

Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines,

Y. Xu, Z. Wang, Z. Li, Y. Yuan, and G. Yu, “Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines,” in Proc. AAAI Conf. Artif. Intell. (AAAI) , 2020. 11

work page 2020
[44]

Siam r-cnn: Visual tracking by re-detection,

P. Voigtlaender, J. Luiten, P. H. Torr, and B. Leibe, “Siam r-cnn: Visual tracking by re-detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) ,

work page
[45]

Deformable siamese attention net- works for visual object tracking,

Y. Yu, Y. Xiong, W. Huang, and M. R. Scott, “Deformable siamese attention net- works for visual object tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020. 11 109

work page 2020
[46]

Ocean: Object-aware anchor-free tracking,

Z. Zhang, H. Peng, J. Fu, B. Li, and W. Hu, “Ocean: Object-aware anchor-free tracking,” in Proc. Eur. Conf. Comput. Vis. (ECCV) , 2020. 11, 32

work page 2020
[47]

Siamese instance search for tracking,

R. Tao, E. Gavves, and A. W. Smeulders, “Siamese instance search for tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2016. 11

work page 2016
[48]

Learning spatio-temporal transformer for visual tracking,

B. Yan, H. Peng, J. Fu, D. Wang, and H. Lu, “Learning spatio-temporal transformer for visual tracking,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV) , 2021. 11, 34, 42, 43, 66

work page 2021
[49]

Transformer tracking,

X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2021. 11, 32, 34, 43, 66

work page 2021
[50]

Joint feature learning and relation modeling for tracking: A one-stream framework,

B. Ye, H. Chang, B. Ma, S. Shan, and X. Chen, “Joint feature learning and relation modeling for tracking: A one-stream framework,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2022. 11, 32, 42, 58, 63

work page 2022
[51]

Learning target- aware representation for visual tracking via informative interactions,

M. Guo, Z. Zhang, H. Fan, L. Jing, Y. Lyu, B. Li, and W. Hu, “Learning target- aware representation for visual tracking via informative interactions,” in Proc. Eur. Conf. Comput. Vis. (ECCV) , 2022. 11

work page 2022
[52]

Robust object modeling for visual tracking,

Y. Cai, J. Liu, J. Tang, and G. Wu, “Robust object modeling for visual tracking,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV) , 2023. 11, 34, 42, 43, 58, 66, 96

work page 2023
[53]

Aiatrack: Attention in attention for transformer visual tracking,

S. Gao, C. Zhou, C. Ma, X. Wang, and J. Yuan, “Aiatrack: Attention in attention for transformer visual tracking,” in Proc. Eur. Conf. Comput. Vis. (ECCV) , 2022. 11, 32, 34

work page 2022
[54]

Target-aware tracking with long-term context attention,

K. He, C. Zhang, S. Xie, Z. Li, and Z. Wang, “Target-aware tracking with long-term context attention,” in Proc. AAAI Conf. Artif. Intell. (AAAI) , 2023. 11, 21, 34, 42

work page 2023
[55]

Reading relevant feature from global representation memory for visual object tracking,

X. Zhou, P. Guo, L. Hong, J. Li, W. Zhang, W. Ge, and W. Zhang, “Reading relevant feature from global representation memory for visual object tracking,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) , 2023. 11

work page 2023
[56]

Mixformer: End-to-end tracking with iter- ative mixed attention,

Y. Cui, C. Jiang, L. Wang, and G. Wu, “Mixformer: End-to-end tracking with iter- ative mixed attention,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022. 11, 32, 34, 43, 66, 91

work page 2022
[57]

Dropmae: Masked au- toencoders with spatial-attention dropout for tracking tasks,

Q. Wu, T. Yang, Z. Liu, B. Wu, Y. Shan, and A. B. Chan, “Dropmae: Masked au- toencoders with spatial-attention dropout for tracking tasks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2023. 11, 42 110

work page 2023
[58]

Representation learning for visual object tracking by masked appearance transfer,

H. Zhao, D. Wang, and H. Lu, “Representation learning for visual object tracking by masked appearance transfer,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023. 11, 42

work page 2023
[59]

Context-guided black-box attack for visual tracking,

X. Huang, D. Miao, H. Wang, Y. Wang, and X. Li, “Context-guided black-box attack for visual tracking,” IEEE Trans. Multimedia (TMM) , 2024. 11

work page 2024
[60]

Seqtrack: Sequence to sequence learning for visual object tracking,

X. Chen, H. Peng, D. Wang, H. Lu, and H. Hu, “Seqtrack: Sequence to sequence learning for visual object tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023. 11, 21, 32, 34, 42, 43, 59, 61, 63, 66, 87, 90

work page 2023
[61]

Autoregressive visual tracking,

X. Wei, Y. Bai, Y. Zheng, D. Shi, and Y. Gong, “Autoregressive visual tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2023. 11, 32, 42

work page 2023
[62]

Autoregressive queries for adaptive tracking with spatio-temporal trans- formers,

X. Jinxia, Z. Bineng, M. Zhiyi, Z. Shengping, S. Liangtao, S. Shuxiang, and J. Rongrong, “Autoregressive queries for adaptive tracking with spatio-temporal trans- formers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2024. 11, 42

work page 2024
[63]

Explicit visual prompts for visual object tracking,

L. Shi, B. Zhong, Q. Liang, N. Li, S. Zhang, and X. Li, “Explicit visual prompts for visual object tracking,” in Proc. AAAI Conf. Artif. Intell. (AAAI) , 2024. 11, 42, 66

work page 2024
[64]

Hiptrack: Visual tracking with historical prompts,

W. Cai, Q. Liu, and Y. Wang, “Hiptrack: Visual tracking with historical prompts,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2024. 11, 42, 58, 66

[65] Z. Zhang, L. Xu, D. Peng, H. Rahmani, and J. Liu, “Diff-tracker: Text-to-image diffusion models are unsupervised trackers,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2024.

[66] F. Xie, Z. Wang, and C. Ma, “Diffusiontrack: Point set diffusion model for visual object tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024.

[67] M. Guo, W. Tan, W. Ran, L. Jing, and Z. Zhang, “Dreamtrack: Dreaming the future for multimodal visual object tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025.

[68] C. Xu, B. Zhong, Q. Liang, Y. Zheng, G. Li, and S. Song, “Less is more: Token context-aware learning for object tracking,” in Proc. AAAI Conf. Artif. Intell. (AAAI), 2025.

[69] X. Li, B. Zhong, Q. Liang, G. Li, Z. Mo, and S. Song, “Mambalct: Boosting tracking via long-term context state space model,” in Proc. AAAI Conf. Artif. Intell. (AAAI), 2025.

[70] B. Kang, X. Chen, S. Lai, Y. Liu, Y. Liu, and D. Wang, “Exploring enhanced contextual information for video-level object tracking,” in Proc. AAAI Conf. Artif. Intell. (AAAI), 2025.

[71] J. Xie, B. Zhong, Q. Liang, N. Li, Z. Mo, and S. Song, “Robust tracking via mamba-based context-aware token learning,” in Proc. AAAI Conf. Artif. Intell. (AAAI),

[72] S. Li, T. Fischer, L. Ke, H. Ding, M. Danelljan, and F. Yu, “Ovtrack: Open-vocabulary multiple object tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023.

[73] X. Li, Y. Huang, Z. He, Y. Wang, H. Lu, and M.-H. Yang, “Citetracker: Correlating image and text for visual tracking,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023.

[74] L. Hong, S. Yan, R. Zhang, W. Li, X. Zhou, P. Guo, K. Jiang, Y. Chen, J. Li, Z. Chen, et al., “Onetracker: Unifying visual object tracking with foundation models and efficient tuning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024.

[75] J. Zhu, S. Lai, X. Chen, D. Wang, and H. Lu, “Visual prompt multi-modal tracking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023.

[76] M. Guo, Z. Zhang, H. Fan, and L. Jing, “Divert more attention to vision-language tracking,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2022.

[77] C. Fifty, D. Duan, R. G. Junkins, E. Amid, J. Leskovec, C. Ré, and S. Thrun, “Context-aware meta-learning,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2024.

[78] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023.

[79] X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Wang, L. Wang, J. Gao, and Y. J. Lee, “Segment everything everywhere all at once,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2023.

[80] F. Rajič, L. Ke, Y.-W. Tai, C.-K. Tang, M. Danelljan, and F. Yu, “Segment anything meets point tracking,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), 2025.

