arxiv: 2607.02284 · v1 · pith:S5CORFGSnew · submitted 2026-07-02 · 💻 cs.CV

FlowCIR: Semantic Transport via Flow Matching for Zero-Shot Composed Image Retrieval

Pith reviewed 2026-07-03 15:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot composed image retrieval · flow matching · semantic transport · vision-language models · negation handling · image retrieval · conditional transport

The pith

FlowCIR casts zero-shot composed image retrieval as conditional semantic transport learned by flow matching on fixed vision-language model embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that converting a reference image and text instruction into a target query can be done by training a transport field that moves the instruction embedding toward the correct target embedding, conditioned on the reference. This replaces the textual-inversion step used in earlier methods, which the authors view as lossy for fine details. Because the transport module trains only on pre-extracted embeddings and leaves the encoders untouched, the approach requires far less compute than inversion-based training. The work further introduces an inference-time correction that steers away from negated concepts when the instruction contains removal or negation language.

Core claim

The paper reformulates zero-shot composed image retrieval as learning a conditional flow-matching transport field that maps an instruction representation, given the reference image, directly to a target-aligned query embedding. The resulting lightweight module is reported to reach competitive retrieval accuracy on standard benchmarks while using roughly ten times fewer training resources than textual-inversion baselines. A separate inference-only Multi-Negative Steering procedure offsets vision-language model weaknesses on negation.

What carries the argument

Conditional flow matching transport field that maps the instruction representation toward a target-aligned query embedding conditioned on the reference image.
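
As a rough illustration of what such a transport field involves, the sketch below trains against a straight-line path between the instruction embedding and the target embedding, conditioned on the reference embedding. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the linear `velocity` model, the `params` layout, the function names, and the straight-line path are all hypothetical choices, since the abstract does not specify the objective or the architecture. The single-step update at the end mirrors the "one-step transport" mentioned in the Figure 3 caption.

```python
import numpy as np

def sample_path(instr, target, rng):
    """Sample points on the straight path from instruction to target embeddings."""
    t = rng.uniform(size=(instr.shape[0], 1))   # one time value per example
    x_t = (1.0 - t) * instr + t * target        # linear interpolation
    v_target = target - instr                   # constant velocity along the path
    return t, x_t, v_target

def velocity(params, x_t, t, ref):
    """Hypothetical linear velocity field v(x_t, t | ref); a real model would be an MLP."""
    feats = np.concatenate([x_t, ref, t], axis=1)   # shape (batch, 2d + 1)
    return feats @ params["W"] + params["b"]

def cfm_loss(params, instr, target, ref, rng):
    """Mean squared error between predicted and path velocities."""
    t, x_t, v_target = sample_path(instr, target, rng)
    v_pred = velocity(params, x_t, t, ref)
    return float(np.mean(np.sum((v_pred - v_target) ** 2, axis=1)))

def one_step_transport(params, instr, ref):
    """Single Euler step from t=0 to t=1, then L2-normalise the query embedding."""
    t0 = np.zeros((instr.shape[0], 1))
    query = instr + velocity(params, instr, t0, ref)
    return query / np.linalg.norm(query, axis=1, keepdims=True)
```

In this sketch, `instr`, `target`, and `ref` are arrays of shape `(batch, d)` holding pre-extracted embeddings, and `params["W"]` has shape `(2d + 1, d)`. Only `params` would be updated during training, which is consistent with the paper's description of leaving the encoders untouched.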

If this is right

  • The method reaches strong performance on existing CIR benchmarks without requiring domain-specific triplet annotations.
  • Training cost is reduced by a factor of roughly ten compared with prior textual-inversion pipelines.
  • Multi-Negative Steering at inference improves results on queries that contain negation or removal instructions.
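
One way to picture inference-time steering is as removing the part of the instruction embedding that lies along the negated concepts. The sketch below is an illustration under assumptions only: the abstract says the instruction is steered "away from its negated semantics" but does not give the rule, so the projection-removal step, the `alpha` strength parameter, and the function name `steer_away` are hypothetical.

```python
import numpy as np

def steer_away(instr, negatives, alpha=1.0):
    """Push an instruction embedding away from the span of negated-concept embeddings.

    instr:     (d,) instruction embedding
    negatives: (K, d) embeddings of negated concepts, assumed linearly independent, K <= d
    alpha:     hypothetical steering strength; 1.0 removes the full projection
    """
    # Orthonormal basis (columns of q) for the span of the negated concepts.
    q, _ = np.linalg.qr(negatives.T)
    projection = q @ (q.T @ instr)
    steered = instr - alpha * projection
    # Guard against a zero vector when instr lies entirely in the negated span.
    norm = max(np.linalg.norm(steered), 1e-12)
    return steered / norm
```

With `alpha=1.0` the returned embedding is orthogonal to every negated concept; smaller values give a softer correction. Whether the paper uses a projection, a subtraction, or something else is not stated in the abstract.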

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same transport formulation could be tested on other vision-language tasks that currently rely on token inversion or simple concatenation for composition.
  • Because the approach never updates the underlying encoders, it may allow reuse of the same transport module across different vision-language model backbones.

Load-bearing premise

A lightweight transport module trained solely on fixed pre-extracted vision-language model embeddings can capture the fine-grained semantics needed for accurate target retrieval without any encoder updates.

What would settle it

A controlled experiment showing that FlowCIR retrieval accuracy on standard benchmarks drops below that of a textual-inversion baseline when both methods receive identical training compute and the same pre-trained encoders.
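
Retrieval accuracy in a comparison of this kind is commonly reported as Recall@K over cosine similarities between query and gallery embeddings. The sketch below shows that computation in NumPy; the function name and argument layout are assumptions for illustration, and benchmark-specific protocol details (such as how the reference image is treated in the gallery) are left out.

```python
import numpy as np

def recall_at_k(queries, gallery, target_idx, k=1):
    """Fraction of queries whose target appears among the top-k gallery items.

    queries:    (n, d) composed query embeddings
    gallery:    (m, d) candidate image embeddings
    target_idx: length-n sequence of ground-truth gallery indices
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                                  # cosine similarity matrix (n, m)
    topk = np.argsort(-sims, axis=1)[:, :k]         # indices of the k most similar items
    hits = (topk == np.asarray(target_idx)[:, None]).any(axis=1)
    return float(hits.mean())
```

Running both methods through the same function on the same gallery keeps the comparison limited to the query embeddings each method produces.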

Figures

Figures reproduced from arXiv: 2607.02284 by Long Chen, Teng Wang, Yanghao Wang, Yuanpei Liu, Zhenqi He, Ziqi Jiang.

Figure 1. (a) Illustration of Composed Image Retrieval. (b) Prior methods are composed by textual inversion and token-level fusion in text space. (c) FlowCIR composes via conditional flow matching to produce a target-oriented retrieval query.
Figure 2. Performance on CIRCO (mAP@5) and CIRR (Recall@1), with training hours on a single GPU as bubble size. Recent zero-shot CIR methods [4, 19, 44, 47] largely build on textual inversion over well-aligned vision-language models [30, 41], effectively reducing cross-modal composition to a text-only manipulation problem. Concretely, they learn a projector to transfer the reference image I_r into a small set of pseudo…
Figure 3. Framework overview of FlowCIR. (a) In training, FlowCIR learns a conditional flow-matching transport from relative-instruction embeddings to target-oriented text embeddings under the reference-image condition, together with top-K hard-negative retrieval supervision. (b) In inference, an inference-only Multi-Negative Steering module adjusts negation-containing instructions before one-step transport for ta…
Figure 4. Qualitative retrieval results illustrating the effect of reference-conditioned transport and Multi-Negative Steering. Predicted targets are marked as correct or incorrect retrievals. … pose; when the reference is a fluffy white dog, the retrieved images remain visually consistent with that appearance and mainly modify the head orientation. A similar phenomenon appears in the third and four…
Original abstract

Zero-shot composed image retrieval (ZS-CIR) aims to retrieve a target image by editing a reference image with a natural-language instruction, without relying on domain-specific annotated triplets. Most existing ZS-CIR methods rely on textual inversion to translate the reference image into pseudo-text tokens and then compose them with the instruction via simple concatenation in the text space, which can be lossy and brittle for fine-grained semantics. In this work, we propose a new paradigm, namely FlowCIR, that casts ZS-CIR as conditional semantic transport between reference and target embeddings. Leveraging conditional flow matching, our model learns a lightweight transport field that maps the instruction representation toward a target-aligned query embedding conditioned on the reference image. Since FlowCIR operates on pre-extracted VLM embeddings and trains only a small transport module without updating the image or text encoder, it offers a computationally efficient training protocol compared with prior textual-inversion-based approaches. The resulting framework is training-efficient, requiring roughly 10× fewer training resources than prior textual-inversion-based approaches. We further identify negation and removal as a major failure mode of VLM-based composition. To address this, we propose an inference-only Multi-Negative Steering strategy that steers a negation-containing relative instruction away from its negated semantics, mitigating the limited negation handling of VLMs and improving robustness on negation-heavy queries. Extensive experiments on standard CIR benchmarks demonstrate that FlowCIR achieves strong and competitive performance compared with recent ZS-CIR methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes FlowCIR for zero-shot composed image retrieval (ZS-CIR), framing the task as conditional semantic transport via flow matching. A lightweight transport module is trained on fixed pre-extracted VLM embeddings to map an instruction representation (conditioned on a reference image) to a target-aligned query embedding, without updating encoders or using task-specific triplets. It claims this yields competitive performance on standard CIR benchmarks while requiring roughly 10× fewer training resources than textual-inversion baselines, and introduces an inference-only Multi-Negative Steering heuristic to address VLM limitations on negation/removal.

Significance. If the performance and efficiency claims hold, the work offers a paradigm shift from textual inversion to flow-based transport on frozen embeddings, potentially lowering the barrier for ZS-CIR research. The explicit identification of negation as a VLM failure mode and the proposed mitigation are constructive contributions.

major comments (2)
  1. [Abstract / Method] Abstract and method description: The central claim that a small conditional flow-matching network trained solely on fixed VLM embeddings can accurately recover fine-grained target semantics (including directional composition) is load-bearing, yet the paper itself identifies negation/removal as a major VLM failure mode and resorts to a separate inference-time heuristic; this indicates the learned transport field may not fully compensate for information lost in the embedding space without additional supervision or adaptation.
  2. [Abstract] Abstract: The efficiency claim of 'roughly 10× fewer training resources' is presented without concrete metrics (e.g., parameter count of the transport module, GPU-hours, epochs, or side-by-side comparison tables), which is required to substantiate the load-bearing advantage over textual-inversion approaches.
minor comments (1)
  1. [Abstract] The abstract would benefit from a one-sentence description of the specific conditional flow-matching objective or network architecture to clarify how the transport field is parameterized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential paradigm shift offered by FlowCIR. We address each major comment below with point-by-point responses.

  1. Referee: [Abstract / Method] Abstract and method description: The central claim that a small conditional flow-matching network trained solely on fixed VLM embeddings can accurately recover fine-grained target semantics (including directional composition) is load-bearing, yet the paper itself identifies negation/removal as a major VLM failure mode and resorts to a separate inference-time heuristic; this indicates the learned transport field may not fully compensate for information lost in the embedding space without additional supervision or adaptation.

    Authors: We agree that negation and removal constitute a notable VLM limitation, which is why the manuscript explicitly identifies this failure mode and introduces Multi-Negative Steering as a targeted inference-time mitigation. The conditional flow-matching transport is trained to learn directional semantic mappings on the fixed embeddings for general compositional instructions, and benchmark results indicate it recovers target semantics effectively in most cases. The steering heuristic specifically addresses residual negation handling issues that are not fully resolved in the VLM embedding space. We will revise the abstract and method sections to more clearly separate the scope of the learned transport from the additional steering strategy and to discuss this distinction as a limitation. revision: partial

  2. Referee: [Abstract] Abstract: The efficiency claim of 'roughly 10× fewer training resources' is presented without concrete metrics (e.g., parameter count of the transport module, GPU-hours, epochs, or side-by-side comparison tables), which is required to substantiate the load-bearing advantage over textual-inversion approaches.

    Authors: We acknowledge that the efficiency claim requires concrete supporting metrics to be fully substantiated. In the revised manuscript we will add explicit details on the transport module's parameter count, training epochs, approximate GPU-hours, and a side-by-side resource comparison against textual-inversion baselines. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper defines FlowCIR as a conditional flow-matching transport module trained on fixed pre-extracted VLM embeddings to map reference+instruction to target-aligned queries, with an added inference-time Multi-Negative Steering heuristic. All load-bearing steps (embedding extraction, flow training, and benchmark evaluation) operate on external VLM features and standard CIR datasets without any reduction of the reported performance metrics to quantities defined by the fitted parameters themselves or to self-citations. The efficiency claim (10× fewer resources) follows directly from the architectural choice of freezing encoders rather than from any definitional equivalence. No self-definitional, fitted-input-as-prediction, or uniqueness-imported patterns appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, axioms, or invented entities; the transport field and Multi-Negative Steering are presented at the level of high-level technique names without further decomposition.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages

  [1] Agnolucci, L., Baldrati, A., Bertini, M., Del Bimbo, A.: isearle: Improving textual inversion for zero-shot composed image retrieval. arXiv preprint arXiv:2405.02951 (2024)
  [2] Albergo, M.S., Vanden-Eijnden, E.: Building normalizing flows with stochastic interpolants. In: ICLR (2023)
  [3] Alhamoud, K., Alshammari, S., Tian, Y., Li, G., Torr, P.H., Kim, Y., Ghassemi, M.: Vision-language models do not understand negation. In: CVPR (2025)
  [4] Baldrati, A., Agnolucci, L., Bertini, M., Del Bimbo, A.: Zero-shot composed image retrieval with textual inversion. In: ICCV (2023)
  [5] Baldrati, A., Bertini, M., Uricchio, T., Del Bimbo, A.: Effective conditioned and composed image retrieval combining clip-based features. In: CVPR (2022)
  [6] Bogensperger, L., Narnhofer, D., Falk, A., Schindler, K., Pock, T.: Flowsdf: Flow matching for medical image segmentation using distance transforms. IJCV (2025)
  [7] Byun, J., Jeong, S., Kim, W., Chun, S., Moon, T.: An efficient post-hoc framework for reducing task discrepancy of text encoders for composed image retrieval. In: ICCV (2025)
  [8] Chen, H., Dong, Y., Wang, Z., Yang, X., Duan, C., Su, H., Zhu, J.: Robust classification via a single diffusion model. arXiv preprint arXiv:2305.15241 (2023)
  [9] Clark, K., Jaini, P.: Text-to-image diffusion models are zero shot classifiers. In: NeurIPS (2023)
  [10] Cohen, N., Gal, R., Meirom, E.A., Chechik, G., Atzmon, Y.: “this is my unicorn, fluffy”: Personalizing frozen vision-language representations. In: ECCV (2022)
  [11] Delmas, G., Sampaio de Rezende, R., Csurka, G., Larlus, D.: Artemis: Attention-based retrieval with text-explicit matching and implicit similarity. In: ICLR (2022)
  [12] Du, Y., Wang, M., Zhou, W., Hui, S., Li, H.: Image2sentence based asymmetrical zero-shot composed image retrieval. In: ICLR (2024)
  [13] Duan, S., Sun, Y., Peng, D., Liu, Z., Song, X., Hu, P.: Fuzzy multimodal learning for trusted cross-modal retrieval. In: CVPR (2025)
  [14] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024)
  [15] Fu, Y., Chen, C., Qiao, Y., Yu, Y.: Dreamda: Generative data augmentation with diffusion models. arXiv preprint arXiv:2403.12803 (2024)
  [16] Geng, Z., Deng, M., Bai, X., Kolter, J.Z., He, K.: Mean flows for one-step generative modeling. In: NeurIPS (2025)
  [17] Goenka, S., Zheng, Z., Jaiswal, A., Chada, R., Wu, Y., Hedau, V., Natarajan, P.: Fashionvlp: Vision language transformer for fashion retrieval with feedback. In: CVPR (2022)
  [18] Gu, G., Chun, S., Kim, W., Jun, H., Kang, Y., Yun, S.: Compodiff: Versatile composed image retrieval with latent diffusion. TMLR (2024)
  [19] Gu, G., Chun, S., Kim, W., Kang, Y., Yun, S.: Language-only training of zero-shot composed image retrieval. In: CVPR (2024)
  [20] Gui, M., Schusterbauer, J., Prestel, U., Ma, P., Kotovenko, D., Grebenkova, O., Baumann, S.A., Hu, V.T., Ommer, B.: Depthfm: Fast generative monocular depth estimation with flow matching. In: AAAI (2025)
  [21] He, J., Yu, Q., Liu, Q., Chen, L.C.: Flowtok: Flowing seamlessly across text and image tokens. In: ICCV (2025)
  [22] He, Z., Li, L., Chen, L.: Flowcomposer: Composable flows for compositional zero-shot learning. In: CVPR (2026)
  [23] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS (2020)
  [24] Hui, M., Yang, S., Zhao, B., Shi, Y., Wang, H., Wang, P., Zhou, Y., Xie, C.: Hq-edit: A high-quality dataset for instruction-based image editing. In: ICLR (2025)
  [25] Islam, K., Zaheer, M.Z., Mahmood, A., Nandakumar, K.: Diffusemix: Label-preserving data augmentation with diffusion models. In: CVPR (2024)
  [26] Jiang, Z., Wang, Y., Chen, L.: Exploring cross-modal flows for few-shot learning. In: ICLR (2026)
  [27] Karthik, S., Roth, K., Mancini, M., Akata, Z.: Vision-by-language for training-free compositional image retrieval. In: ICLR (2024)
  [28] Koh, G., Oh, H.J., Noh, J., Jeong, W.K.: Synthetic data augmentation using pre-trained diffusion models for long-tailed food image classification. In: CVPR (2025)
  [29] Li, A.C., Prabhudesai, M., Duggal, S., Brown, E., Pathak, D.: Your diffusion model is secretly a zero-shot classifier. In: ICCV (2023)
  [30] Li, J., et al.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML (2022)
  [31] Li, L., Jiang, Z., Ye, G., He, Z., Li, J., Xiao, J., Cheng, K.T., Chen, L.: Path-decoupled hyperbolic flow matching for few-shot adaptation. In: ICML (2026)
  [32] Li, T., Tian, Y., Li, H., Deng, M., He, K.: Autoregressive image generation without vector quantization. In: NeurIPS (2024)
  [33] Li, W., Fan, H., Wong, Y., Yang, Y., Kankanhalli, M.S.: Improving context understanding in multimodal large language models via multimodal composition learning. In: ICML (2024)
  [34] Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: ICLR (2023)
  [35] Liu, Q., Yin, X., Yuille, A., Brown, A., Singh, M.: Flowing from words to pixels: A noise-free framework for cross-modality evolution. In: CVPR (2025)
  [36] Liu, X., Pu, N., Zheng, H., Li, W., Sebe, N., Zhong, Z.: Generate, refine, and encode: Leveraging synthesized novel samples for on-the-fly fine-grained category discovery. In: ICCV (2025)
  [37] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: ICLR (2023)
  [38] Liu, Z., Rodriguez-Opazo, C., Teney, D., Gould, S.: Image retrieval on real-life images with pre-trained vision-and-language models. In: CVPR (2021)
  [39] Ning, W., Chang, D., Tong, Y., He, Z., Liang, K., Ma, Z.: Hierarchical prompting for diffusion classifiers. In: ACCV (2024)
  [40] Qi, Z., Liu, B., Zhang, S., Li, B., Xu, Z., Xiong, H., Xie, Z.: A simple and efficient baseline for zero-shot generative classification. arXiv preprint arXiv:2412.12594 (2024)
  [41] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
  [42] Ranjbar, S.K., Alhamoud, K., Ghassemi, M.: Spacevlm: Sub-space modeling of negation in vision-language models. arXiv preprint arXiv:2511.12331 (2025)
  [43] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
  [44] Saito, K., Sohn, K., Zhang, X., Li, C.L., Lee, C.Y., Saenko, K., Pfister, T.: Pic2word: Mapping pictures to words for zero-shot composed image retrieval. In: CVPR (2023)
  [45] Sun, Z., Jing, D., Lu, Z.: Cotmr: Chain-of-thought multi-scale reasoning for training-free zero-shot composed image retrieval. In: ICCV (2025)
  [46] Tang, H., Wang, J., Zhao, M., Meng, G., Luo, R., Chen, L., Xia, S.T.: Heterogeneous uncertainty-guided composed image retrieval with fine-grained probabilistic learning. In: AAAI (2026)
  [47] Tang, Y., Yu, J., Gai, K., Zhuang, J., Xiong, G., Hu, Y., Wu, Q.: Context-i2w: Mapping images to context-dependent words for accurate zero-shot composed image retrieval. In: AAAI (2024)
  [48] Vo, N., Jiang, L., Sun, C., Murphy, K., Li, L.J., Fei-Fei, L., Hays, J.: Composing text and image for image retrieval-an empirical odyssey. In: CVPR (2019)
  [49] Wang, L., Ao, W., Boddeti, V.N., Lim, S.N.: Generative zero-shot composed image retrieval. In: CVPR (2025)
  [50] Wang, Y., Chen, H., Liu, J., He, Z., Liu, R., Wang, Z., Chen, L.: Lisa: Likelihood score alignment for visual-condition controllable generation (2026)
  [51] Wang, Y., Chen, L.: Inversion circle interpolation: Diffusion-based image augmentation for data-scarce classification. In: CVPR (2025)
  [52] Wang, Y., Chen, L.: Noise matters: Optimizing matching noise for diffusion classifiers. In: NeurIPS (2025)
  [53] Wang, Z., Jiang, D., Li, L., Dang, S., Li, C., Yang, H., Dai, G., Wang, M., Wang, J.: Deforming videos to masks: Flow matching for referring video segmentation. arXiv preprint arXiv:2510.06139 (2025)
  [54] Wu, H., Gao, Y., Guo, X., Al-Halah, Z., Rennie, S., Grauman, K., Feris, R.: Fashion iq: A new dataset towards retrieving images by natural language feedback. In: CVPR (2021)
  [55] Yang, Z., Xue, D., Qian, S., Dong, W., Xu, C.: Ldre: Llm-based divergent reasoning and ensemble for zero-shot composed image retrieval. In: Proceedings of the 47th International ACM SIGIR conference on research and development in information retrieval (2024)
  [56] Yue, Z., Zhou, P., Hong, R., Zhang, H., Sun, Q.: Few-shot learner parameterization by diffusion time-steps. In: CVPR (2024)
  [57] Zhan, G., Liu, Y., Han, K., Xie, W., Zisserman, A.: Elip: Enhanced visual-language foundation models for image retrieval. In: 2025 International Conference on Content-Based Multimedia Indexing (CBMI). pp. 1–8. IEEE (2025)
  [58] Zhang, K., Luan, Y., Hu, H., Lee, K., Qiao, S., Chen, W., Su, Y., Chang, M.W.: MagicLens: Self-supervised image retrieval with open-ended instructions. In: ICML (2024)
  [59] Zhang, P., Zeng, G., Wang, T., Lu, W.: Tinyllama: An open-source small language model (2024)
  [60] Zhao, J., Li, J., Lian, D., Sun, L., Lv, P.: Dualcir: Enhancing training-free composed image retrieval via dual-directional descriptions. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2025)