arXiv: 2607.01395 · v1 · pith:Q7B2J4SQ · submitted 2026-07-01 · cs.CV · cs.AI · cs.LG · cs.MM · eess.IV

Rethinking Generic Object Tracking Toward Human-Level Perceptual Intelligence

Pith reviewed 2026-07-03 20:59 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.LG · cs.MM · eess.IV
keywords: generic object tracking, visual object tracking, human visual perception, target discrimination, robust adaptation, geometric reasoning, computer vision

The pith

Enhancing target discrimination, robust adaptation, and geometric reasoning narrows the gap between machine trackers and human visual perception.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to make generic object tracking more like human vision, which maintains coherent understanding by integrating prior knowledge, spatial geometry, and semantic context. Current models struggle with unpredictable events, leading to failures under deformation, distractors, or novel categories. The proposed methods aim to fix this by boosting three key capabilities: better distinguishing the target, adapting online to changes, and reasoning about geometry. A sympathetic reader would care because this could make automated tracking reliable in real-world dynamic environments where humans succeed naturally.

Core claim

Generic object tracking can be advanced toward human-level performance by a series of methods that systematically enhance the target discrimination, robust adaptation, and geometric reasoning capabilities of tracking models, thereby addressing bottlenecks in generalization and online adaptation for unpredictable future events and variations.

What carries the argument

A series of methods that strengthen target discrimination against distractors, robust online adaptation to target and scene variations, and geometric reasoning about spatial context, applied to trackers initialized from a single bounding box.

If this is right

  • Trackers maintain visual continuity despite severe target deformation.
  • Models better resist complex distractors and significant environmental changes.
  • Performance improves on object categories unseen during training.
  • Reliable localization continues from an initial bounding box in dynamic streams.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Stronger adaptation and geometry modules might reduce reliance on massive labeled training sets.
  • The same three enhancements could transfer to related tasks like video object segmentation.
  • Success would suggest that targeted capability boosts, rather than full scene semantics, suffice for human-like tracking.

Load-bearing premise

That the main bottlenecks of generalization and online adaptation can be addressed by systematically enhancing target discrimination, robust adaptation, and geometric reasoning.

What would settle it

A sequence of test videos where a tracker using the proposed enhancements still loses the target on a novel combination of severe deformation and unseen-category distractors.
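
To make "loses the target" checkable, one possible operationalization is sketched below. It is an illustrative assumption, not a criterion taken from the paper: the target counts as lost once the overlap (IoU) between predicted and ground-truth boxes stays below a threshold for several consecutive frames. The names `iou` and `first_loss_frame`, the `(x, y, width, height)` box format, and the default `threshold` and `patience` values are all hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x, y, width, height)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def first_loss_frame(
    predicted: List[Box],
    ground_truth: List[Box],
    threshold: float = 0.1,  # hypothetical overlap threshold
    patience: int = 5,       # hypothetical number of consecutive low-overlap frames
) -> Optional[int]:
    """Return the index of the first frame of a sustained loss, or None."""
    run = 0
    for i, (p, g) in enumerate(zip(predicted, ground_truth)):
        if iou(p, g) < threshold:
            run += 1
            if run >= patience:
                return i - patience + 1
        else:
            run = 0
    return None
```

Under this sketch, a video would count against the claim if `first_loss_frame` returns an index for a tracker that uses the proposed enhancements. Stricter or looser settings for the two parameters would change which sequences qualify.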

Figures

Figures reproduced from arXiv: 2607.01395 by Shih-Fang Chen.

Figure 3.1: Teaser of our method PiVOT. Given the features of …
Figure 3.2: Overview of PiVOT. During the (a) training phase, we …
Figure 3.3: Success plots of the proposed and competing methods. …
Figure 3.4: Attribute analysis on AVisT compares PiVOT with …
Figure 3.5: Failure cases of PiVOT. Visual comparison of track …
Figure 3.6: Attribute-based analysis of LaSOT and AVisT, com …
Figure 3.7: Attribute-based analysis of OTB-100 and UAV123, …
Figure 3.8: Visualization of visual prompting through PiVOT. …
Figure 3.9: Visualization results of PiVOT. Visual comparison of …
Figure 4.1: Teaser of our method GOT-JEPA. (a) GOT-JEPA ex …
Figure 4.2: Overview of the proposed framework. (a) We pre-train a …
Figure 4.3: Attribute analysis of OTB-100, AVisT, and LaSOT …
Figure 4.4: Comparison of methods using NPr, Pr, and SUC plots …
Figure 4.5: Comparison of methods using NPr, Pr, and SUC plots …
Figure 4.6: Comparison of methods using NPr, Pr, and SUC plots …
Figure 4.7: An analysis of the validation curve: how tracker pre …
Figure 4.8: Visual comparisons of tracking results from raw annota …
Figure 4.9: An ablation study investigates the frame gap between …
Figure 4.10: Comparison of relative learning rates between ProjNet …
Figure 4.11: Point refinement visualization. Col 1: Initial frame …
Figure 5.1: The GOT-Edit framework. GOT-Edit facilitates the …
Figure 5.2: From left to right, success plots of competing methods …
Figure 5.3: Attribute analysis of OTB, AVisT, and LaSOT from …
Figure 5.4: Comparison of methods using NPr, Pr, and SUC on …
Figure 5.5: Comparison of methods using NPr, Pr, and SUC on …
Figure 5.6: Comparison of methods using NPr, Pr, and SUC on …
Figure 5.7: Comparison of methods using NPr, Pr, and SUC on …
Figure 5.8: Visual comparisons of tracking results from GOT-Edit, …
Original abstract

At the heart of human visual perception lies the ability to maintain a continuous and coherent understanding of the external world. By integrating observations with accumulated experience, the human visual system can continuously adapt to variations in both the target and its surrounding environment, while preserving robust visual continuity as scene dynamics evolve. Human vision can therefore integrate prior knowledge, spatial geometry, and semantic context to understand complex scenes and their changes. As a core problem in computer vision, visual object tracking aims to bring machine perception closer to human visual perception. These capabilities are central to the task of Generic Object Tracking (GOT). In this task, a visual tracker is initialized only with the bounding box of an arbitrarily specified target in the first frame, and must continuously localize the target in subsequent dynamic visual streams. However, future events, observations, and real-world variations are inherently unpredictable; therefore, the model's generalization and online adaptation capabilities remain bottlenecks. Tracking reliability can deteriorate when the target undergoes severe deformation, is affected by complex distractors, encounters significant environmental changes, or belongs to a category unseen during training. This dissertation aims to narrow the gap between machine visual tracking systems and human visual perception by proposing a series of methods that systematically enhance the target discrimination, robust adaptation, and geometric reasoning capabilities of tracking models.
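
The task setup described in the abstract (initialize from one first-frame bounding box, then localize the target in every later frame) can be sketched as a small interface. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the `Tracker` protocol, its `init` and `update` methods, `run_sequence`, and the placeholder `StaticTracker` are hypothetical names, and boxes are assumed to be `(x, y, width, height)` tuples.

```python
from typing import Any, Iterable, List, Protocol, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x, y, width, height)


class Tracker(Protocol):
    """Hypothetical interface for a generic object tracker."""

    def init(self, frame: Any, box: Box) -> None: ...

    def update(self, frame: Any) -> Box: ...


def run_sequence(tracker: Tracker, frames: Iterable[Any], init_box: Box) -> List[Box]:
    """Run the one-shot protocol: the tracker sees only the first-frame box."""
    it = iter(frames)
    try:
        first = next(it)
    except StopIteration:
        raise ValueError("sequence must contain at least one frame")
    tracker.init(first, init_box)
    boxes = [init_box]
    for frame in it:
        # No ground truth is available here; the tracker must adapt online.
        boxes.append(tracker.update(frame))
    return boxes


class StaticTracker:
    """Placeholder baseline that always returns the initial box."""

    def init(self, frame: Any, box: Box) -> None:
        self._box = box

    def update(self, frame: Any) -> Box:
        return self._box
```

A real tracker would replace `StaticTracker` with a model that updates its target representation inside `update`. The loop makes the constraint explicit: everything after the first frame has to be inferred without further supervision.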

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript is a dissertation proposal that identifies limitations in generic object tracking (GOT), including poor generalization and online adaptation to unpredictable events such as target deformation, distractors, environmental changes, and unseen categories. It claims that integrating human-like capabilities for target discrimination, robust adaptation, and geometric reasoning will narrow the gap to human-level perceptual intelligence, but provides no specific methods, derivations, experiments, or results.

Significance. Advancing visual tracking toward human-level robustness would be significant for computer vision applications. However, because the manuscript supplies no methods, data, or evidence, no assessment of achieved significance is possible; the contribution remains aspirational.

major comments (1)
  1. [Abstract] The central claim that 'a series of methods' will systematically enhance target discrimination, robust adaptation, and geometric reasoning is unsupported by any description of those methods, any equations, any experimental design, or any preliminary results. This renders the claim an intention rather than a testable or verifiable contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review. Our manuscript is a dissertation proposal that frames open challenges in generic object tracking and outlines a research agenda; it does not claim to deliver completed methods or results. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'a series of methods' will systematically enhance target discrimination, robust adaptation, and geometric reasoning is unsupported by any description of those methods, any equations, any experimental design, or any preliminary results. This renders the claim an intention rather than a testable or verifiable contribution.

    Authors: We agree that no concrete methods, equations, experimental designs, or results are supplied. The manuscript is explicitly a dissertation proposal whose abstract states the intended research program ('This dissertation aims to narrow the gap ... by proposing a series of methods'). The contribution at this stage is the identification of the three core bottlenecks (target discrimination, robust adaptation, geometric reasoning) and the argument that addressing them would move tracking closer to human-level robustness. Because the document is a proposal rather than a completed study, the absence of implementation details is by design; the abstract accurately describes the scope of the planned dissertation work.

    Revision: no

Circularity Check

0 steps flagged

No circularity: proposal without derivations or equations

The document is a dissertation proposal whose abstract states an intention to propose methods for target discrimination, robust adaptation, and geometric reasoning to approach human-level tracking. No equations, parameter fits, self-citations, uniqueness theorems, or ansatzes are supplied. The central text contains no derivation chain that could reduce to its own inputs by construction; the claim is aspirational rather than a completed result. This matches the default expectation of no significant circularity for a high-level goal statement.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical details, equations, parameters, or specific assumptions are described in the abstract.

pith-pipeline@v0.9.1-grok · 5755 in / 1022 out tokens · 21614 ms · 2026-07-03T20:59:58.100154+00:00 · methodology
