FlowCIR: Semantic Transport via Flow Matching for Zero-Shot Composed Image Retrieval

Long Chen; Teng Wang; Yanghao Wang; Yuanpei Liu; Zhenqi He; Ziqi Jiang

arxiv: 2607.02284 · v1 · pith:S5CORFGSnew · submitted 2026-07-02 · 💻 cs.CV

Pith reviewed 2026-07-03 15:39 UTC · model grok-4.3

classification 💻 cs.CV

keywords zero-shot composed image retrieval · flow matching · semantic transport · vision-language models · negation handling · image retrieval · conditional transport

The pith

FlowCIR casts zero-shot composed image retrieval as conditional semantic transport learned by flow matching on fixed vision-language model embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that converting a reference image and text instruction into a target query can be done by training a transport field that moves the instruction embedding toward the correct target embedding, conditioned on the reference. This replaces the textual-inversion step used in earlier methods, which the authors view as lossy for fine details. Because the transport module trains only on pre-extracted embeddings and leaves the encoders untouched, the approach requires far less compute than inversion-based training. The work further introduces an inference-time correction that steers away from negated concepts when the instruction contains removal or negation language.

Core claim

Zero-shot composed image retrieval is reformulated as learning a conditional flow-matching transport field that maps an instruction representation, given the reference image, directly to a target-aligned query embedding; the resulting lightweight module produces competitive retrieval accuracy on standard benchmarks while using roughly ten times fewer training resources than textual-inversion baselines and incorporates a Multi-Negative Steering procedure to offset vision-language model weaknesses on negation.

What carries the argument

Conditional flow matching transport field that maps the instruction representation toward a target-aligned query embedding conditioned on the reference image.
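To make the mechanism concrete, here is a minimal sketch of a conditional flow-matching loss on pre-extracted embeddings. It assumes a straight-line path between the instruction embedding and the target embedding, which is a common choice in the flow-matching literature but is not confirmed by this page as the paper's exact objective. The names `cfm_loss` and `velocity_fn`, and the toy data, are illustrative only.

```python
import numpy as np

def cfm_loss(velocity_fn, instr_emb, target_emb, ref_emb, rng):
    """Conditional flow-matching loss with a straight-line path (an assumed choice)."""
    batch = instr_emb.shape[0]
    t = rng.uniform(size=(batch, 1))                # one time value per sample in [0, 1)
    x_t = (1.0 - t) * instr_emb + t * target_emb    # point on the straight path at time t
    target_velocity = target_emb - instr_emb        # constant velocity of that path
    pred = velocity_fn(x_t, t, ref_emb)             # field conditioned on the reference
    return float(np.mean(np.sum((pred - target_velocity) ** 2, axis=1)))

rng = np.random.default_rng(0)
instr = rng.normal(size=(4, 8))     # toy instruction embeddings
target = rng.normal(size=(4, 8))    # toy target embeddings
ref = rng.normal(size=(4, 8))       # toy reference-image embeddings

# An oracle field that returns the true displacement incurs zero loss;
# a field that predicts no motion pays the full squared displacement.
oracle = lambda x_t, t, c: target - instr
zero_field = lambda x_t, t, c: np.zeros_like(x_t)
oracle_loss = cfm_loss(oracle, instr, target, ref, rng)
zero_loss = cfm_loss(zero_field, instr, target, ref, rng)
```

In a real setup `velocity_fn` would be a small trainable network and the encoders that produced the embeddings would stay frozen, which is the source of the claimed training savings.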

If this is right

The method reaches strong performance on existing CIR benchmarks without requiring domain-specific triplet annotations.
Training cost is reduced by a factor of roughly ten compared with prior textual-inversion pipelines.
Multi-Negative Steering at inference improves results on queries that contain negation or removal instructions.
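This page does not describe how Multi-Negative Steering is computed. As a rough illustration of what steering an instruction embedding away from several negated concepts could look like, the sketch below removes the component of the instruction embedding that lies in the span of the negated-concept embeddings. This projection is a stand-in chosen for clarity, not the paper's procedure; `steer_away`, `alpha`, and the toy data are hypothetical.

```python
import numpy as np

def steer_away(instr_emb, negated_embs, alpha=1.0):
    """Subtract (a fraction alpha of) the part of instr_emb lying in span(negated_embs)."""
    # Reduced QR gives orthonormal columns spanning the negated-concept embeddings.
    basis, _ = np.linalg.qr(negated_embs.T)          # shape (d, K), assumes K <= d
    component = basis @ (basis.T @ instr_emb)        # projection onto that span
    steered = instr_emb - alpha * component
    return steered / np.linalg.norm(steered)         # renormalise for cosine retrieval

rng = np.random.default_rng(0)
dim = 16
negatives = rng.normal(size=(3, dim))                # toy embeddings of negated concepts
instr = rng.normal(size=dim) + 2.0 * negatives[0]    # instruction contaminated by a negated concept
steered = steer_away(instr, negatives)
```

With `alpha=1.0` the steered vector is orthogonal to every negated-concept embedding; smaller values trade that off against staying close to the original instruction.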

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same transport formulation could be tested on other vision-language tasks that currently rely on token inversion or simple concatenation for composition.
Because the approach never updates the underlying encoders, it may allow reuse of the same transport module across different vision-language model backbones.

Load-bearing premise

A lightweight transport module trained solely on fixed pre-extracted vision-language model embeddings can capture the fine-grained semantics needed for accurate target retrieval without any encoder updates.

What would settle it

A controlled experiment showing that FlowCIR retrieval accuracy on standard benchmarks drops below that of a textual-inversion baseline when both methods receive identical training compute and the same pre-trained encoders.
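Retrieval accuracy in a comparison like this is typically reported as Recall@K (the Figure 2 caption reports recall at rank 1 on CIRR). A minimal sketch of that metric follows, assuming one ground-truth target index per query; the function name and toy rankings are illustrative and not taken from the paper.

```python
import numpy as np

def recall_at_k(rankings, target_idx, k):
    """Fraction of queries whose ground-truth target appears in the top-k ranked results."""
    hits = (rankings[:, :k] == target_idx[:, None]).any(axis=1)
    return float(hits.mean())

# Each row is one query's ranked gallery indices, best match first.
rankings = np.array([[2, 0, 1],
                     [1, 2, 0],
                     [0, 1, 2]])
targets = np.array([2, 2, 1])   # ground-truth gallery index per query

r1 = recall_at_k(rankings, targets, k=1)   # only the first query hits at rank 1
r2 = recall_at_k(rankings, targets, k=2)   # all three targets fall within the top 2
```

Running both methods through the same metric, with the same encoders and gallery, is what makes the compute-matched comparison meaningful.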

Figures

Figures reproduced from arXiv: 2607.02284 by Long Chen, Teng Wang, Yanghao Wang, Yuanpei Liu, Zhenqi He, Ziqi Jiang.

**Figure 1.** (a) Illustration of Composed Image Retrieval. (b) Prior methods are composed by textual inversion and token-level fusion in text space. (c) FlowCIR composes via conditional flow matching to produce a target-oriented retrieval query.

**Figure 2.** Performance on CIRCO (mAP@5) and CIRR (Recall@1) with training hours (on a single GPU) as bubble size. Recent zero-shot CIR methods [4, 19, 44, 47] largely build on textual inversion over well-aligned vision–language models [30, 41], effectively reducing cross-modal composition to a text-only manipulation problem. Concretely, they learn a projector to transfer the reference image $I_r$ into a small set of pseudo…

**Figure 3.** Framework overview of FlowCIR. (a) In training, FlowCIR learns a conditional flow-matching transport from relative-instruction embeddings to target-oriented text embeddings under the reference-image condition, together with top-K hard-negative retrieval supervision. (b) In inference, an inference-only Multi-Negative Steering module adjusts negation-containing instructions before one-step transport for ta…

**Figure 4.** Qualitative retrieval results illustrating the effect of reference-conditioned transport and Multi-Negative Steering. Predicted targets are marked with for correct retrievals and for incorrect ones. … pose; when the reference is a fluffy white dog, the retrieved images remain visually consistent with that appearance and mainly modify the head orientation. A similar phenomenon appears in the third and four…
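The Figure 3 caption describes inference as a one-step transport followed by retrieval. A minimal sketch of that pipeline is given below, assuming the step is a single Euler update of size 1 from time 0 and that the gallery is ranked by cosine similarity. Both are assumptions made for illustration; `one_step_transport`, `rank_gallery`, and the toy velocity field are hypothetical names, not the paper's code.

```python
import numpy as np

def one_step_transport(velocity_fn, instr_emb, ref_emb):
    """Single Euler step of size 1 from t=0: query = x0 + v(x0, 0, ref)."""
    t0 = np.zeros((instr_emb.shape[0], 1))
    return instr_emb + velocity_fn(instr_emb, t0, ref_emb)

def rank_gallery(query_emb, gallery_emb):
    """Return gallery indices sorted by descending cosine similarity, per query."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    return np.argsort(-(q @ g.T), axis=1)

rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 8))   # toy gallery embeddings
instr = rng.normal(size=(1, 8))     # toy instruction embedding
ref = rng.normal(size=(1, 8))       # toy reference-image embedding

# Hypothetical "perfect" field that points from the current point to gallery item 3.
toy_field = lambda x, t, c: gallery[3][None, :] - x
query = one_step_transport(toy_field, instr, ref)
ranking = rank_gallery(query, gallery)   # ranking[0, 0] is the top-1 retrieval
```

A learned field will not be exact like the toy one, so the quality of the single step is precisely what the benchmark numbers would have to demonstrate.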

Original abstract

Zero-shot composed image retrieval (ZS-CIR) aims to retrieve a target image by editing a reference image with a natural-language instruction, without relying on domain-specific annotated triplets. Most existing ZS-CIR methods rely on textual inversion to translate the reference image into pseudo-text tokens and then compose them with the instruction via simple concatenation in the text space, which can be lossy and brittle for fine-grained semantics. In this work, we propose a new paradigm, namely FlowCIR, that casts ZS-CIR as conditional semantic transport between reference and target embeddings. Leveraging conditional flow matching, our model learns a lightweight transport field that maps the instruction representation toward a target-aligned query embedding conditioned on the reference image. Since FlowCIR operates on pre-extracted VLM embeddings and trains only a small transport module without updating the image or text encoder, it offers a computationally efficient training protocol compared with prior textual-inversion-based approaches. The resulting framework is training-efficient, requiring roughly $10\times$ fewer training resources than prior textual-inversion-based approaches. We further identify negation and removal as a major failure mode of VLM-based composition. To address this, we propose an inference-only Multi-Negative Steering strategy that steers a negation-containing relative instruction away from its negated semantics, mitigating the limited negation handling of VLMs and improving robustness on negation-heavy queries. Extensive experiments on standard CIR benchmarks demonstrate that FlowCIR achieves strong and competitive performance compared with recent ZS-CIR methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FlowCIR swaps textual inversion for a small conditional flow-matching transport on frozen VLM embeddings and adds an inference heuristic for negation, but the abstract supplies no equations, ablations, or error breakdowns to confirm the transport actually recovers the claimed semantics.

The core move is treating ZS-CIR as learning a conditional transport field with flow matching instead of inverting the reference image into text tokens and concatenating. The model trains only a lightweight module on fixed embeddings, which the abstract says cuts training cost by roughly 10x. That efficiency claim is the clearest practical difference from prior work.

It also flags negation and removal as a persistent VLM weakness and counters it with a separate Multi-Negative Steering step at inference. That is a reasonable engineering patch, but it sits outside the learned transport.

The soft spot is verification. Without equations for the flow-matching objective, without ablation tables on the transport module size or conditioning, and without breakdowns on negation-heavy queries, it is impossible to tell whether the reported competitive numbers come from the flow field itself or from the steering heuristic and the underlying VLM. The weakest assumption in the abstract—that a small network on static embeddings can reconstruct fine-grained target semantics the VLM already dropped—remains untested in the supplied text.

This is a narrow but self-contained idea aimed at the ZS-CIR subfield. Readers already running retrieval experiments on CLIP-style embeddings could extract the method and check the efficiency numbers themselves. The work is coherent on its own terms and shows clear engagement with the limitations of current VLM composition, so it clears the bar for a serious referee even if the final verdict depends on the missing tables.

Referee Report

2 major / 1 minor

Summary. The paper proposes FlowCIR for zero-shot composed image retrieval (ZS-CIR), framing the task as conditional semantic transport via flow matching. A lightweight transport module is trained on fixed pre-extracted VLM embeddings to map an instruction representation (conditioned on a reference image) to a target-aligned query embedding, without updating encoders or using task-specific triplets. It claims this yields competitive performance on standard CIR benchmarks while requiring roughly 10× fewer training resources than textual-inversion baselines, and introduces an inference-only Multi-Negative Steering heuristic to address VLM limitations on negation/removal.

Significance. If the performance and efficiency claims hold, the work offers a paradigm shift from textual inversion to flow-based transport on frozen embeddings, potentially lowering the barrier for ZS-CIR research. The explicit identification of negation as a VLM failure mode and the proposed mitigation are constructive contributions.

major comments (2)

[Abstract / Method] Abstract and method description: The central claim that a small conditional flow-matching network trained solely on fixed VLM embeddings can accurately recover fine-grained target semantics (including directional composition) is load-bearing, yet the paper itself identifies negation/removal as a major VLM failure mode and resorts to a separate inference-time heuristic; this indicates the learned transport field may not fully compensate for information lost in the embedding space without additional supervision or adaptation.
[Abstract] Abstract: The efficiency claim of 'roughly 10× fewer training resources' is presented without concrete metrics (e.g., parameter count of the transport module, GPU-hours, epochs, or side-by-side comparison tables), which is required to substantiate the load-bearing advantage over textual-inversion approaches.
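One of the metrics requested above, the parameter count of the transport module, is straightforward to report. The sketch below counts weights and biases for a fully connected network. The architecture is purely hypothetical (the page gives no layer sizes): an MLP whose input concatenates the current point, the reference embedding, and a scalar time, with an assumed embedding width of 768 and hidden width of 1024.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

embed_dim = 768    # hypothetical VLM embedding width
hidden = 1024      # hypothetical hidden width

# Input concatenates the current point, the reference embedding, and a scalar time.
sizes = [2 * embed_dim + 1, hidden, hidden, embed_dim]
total = mlp_param_count(sizes)
print(f"{total:,} trainable parameters")
```

Reporting a figure like this next to GPU-hours and epochs for each baseline would let readers check the resource comparison directly.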

minor comments (1)

[Abstract] The abstract would benefit from a one-sentence description of the specific conditional flow-matching objective or network architecture to clarify how the transport field is parameterized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential paradigm shift offered by FlowCIR. We address each major comment below with point-by-point responses.

Referee: [Abstract / Method] Abstract and method description: The central claim that a small conditional flow-matching network trained solely on fixed VLM embeddings can accurately recover fine-grained target semantics (including directional composition) is load-bearing, yet the paper itself identifies negation/removal as a major VLM failure mode and resorts to a separate inference-time heuristic; this indicates the learned transport field may not fully compensate for information lost in the embedding space without additional supervision or adaptation.

Authors: We agree that negation and removal constitute a notable VLM limitation, which is why the manuscript explicitly identifies this failure mode and introduces Multi-Negative Steering as a targeted inference-time mitigation. The conditional flow-matching transport is trained to learn directional semantic mappings on the fixed embeddings for general compositional instructions, and benchmark results indicate it recovers target semantics effectively in most cases. The steering heuristic specifically addresses residual negation handling issues that are not fully resolved in the VLM embedding space. We will revise the abstract and method sections to more clearly separate the scope of the learned transport from the additional steering strategy and to discuss this distinction as a limitation. revision: partial
Referee: [Abstract] Abstract: The efficiency claim of 'roughly 10× fewer training resources' is presented without concrete metrics (e.g., parameter count of the transport module, GPU-hours, epochs, or side-by-side comparison tables), which is required to substantiate the load-bearing advantage over textual-inversion approaches.

Authors: We acknowledge that the efficiency claim requires concrete supporting metrics to be fully substantiated. In the revised manuscript we will add explicit details on the transport module's parameter count, training epochs, approximate GPU-hours, and a side-by-side resource comparison against textual-inversion baselines. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper defines FlowCIR as a conditional flow-matching transport module trained on fixed pre-extracted VLM embeddings to map reference+instruction to target-aligned queries, with an added inference-time Multi-Negative Steering heuristic. All load-bearing steps (embedding extraction, flow training, and benchmark evaluation) operate on external VLM features and standard CIR datasets without any reduction of the reported performance metrics to quantities defined by the fitted parameters themselves or to self-citations. The efficiency claim (10× fewer resources) follows directly from the architectural choice of freezing encoders rather than from any definitional equivalence. No self-definitional, fitted-input-as-prediction, or uniqueness-imported patterns appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, axioms, or invented entities; the transport field and Multi-Negative Steering are presented at the level of high-level technique names without further decomposition.

pith-pipeline@v0.9.1-grok · 5816 in / 1123 out tokens · 26643 ms · 2026-07-03T15:39:06.080601+00:00 · methodology

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages

[1] Agnolucci, L., Baldrati, A., Bertini, M., Del Bimbo, A.: isearle: Improving textual inversion for zero-shot composed image retrieval. arXiv preprint arXiv:2405.02951 (2024)
[2] Albergo, M.S., Vanden-Eijnden, E.: Building normalizing flows with stochastic interpolants. In: ICLR (2023)
[3] Alhamoud, K., Alshammari, S., Tian, Y., Li, G., Torr, P.H., Kim, Y., Ghassemi, M.: Vision-language models do not understand negation. In: CVPR (2025)
[4] Baldrati, A., Agnolucci, L., Bertini, M., Del Bimbo, A.: Zero-shot composed image retrieval with textual inversion. In: ICCV (2023)
[5] Baldrati, A., Bertini, M., Uricchio, T., Del Bimbo, A.: Effective conditioned and composed image retrieval combining clip-based features. In: CVPR (2022)
[6] Bogensperger, L., Narnhofer, D., Falk, A., Schindler, K., Pock, T.: Flowsdf: Flow matching for medical image segmentation using distance transforms. IJCV (2025)
[7] Byun, J., Jeong, S., Kim, W., Chun, S., Moon, T.: An efficient post-hoc framework for reducing task discrepancy of text encoders for composed image retrieval. In: ICCV (2025)
[8] Chen, H., Dong, Y., Wang, Z., Yang, X., Duan, C., Su, H., Zhu, J.: Robust classification via a single diffusion model. arXiv preprint arXiv:2305.15241 (2023)
[9] Clark, K., Jaini, P.: Text-to-image diffusion models are zero shot classifiers. In: NeurIPS (2023)
[10] Cohen, N., Gal, R., Meirom, E.A., Chechik, G., Atzmon, Y.: “this is my unicorn, fluffy”: Personalizing frozen vision-language representations. In: ECCV (2022)
[11] Delmas, G., Sampaio de Rezende, R., Csurka, G., Larlus, D.: Artemis: Attention-based retrieval with text-explicit matching and implicit similarity. In: ICLR (2022)
[12] Du, Y., Wang, M., Zhou, W., Hui, S., Li, H.: Image2sentence based asymmetrical zero-shot composed image retrieval. In: ICLR (2024)
[13] Duan, S., Sun, Y., Peng, D., Liu, Z., Song, X., Hu, P.: Fuzzy multimodal learning for trusted cross-modal retrieval. In: CVPR (2025)
[14] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024)
[15] Fu, Y., Chen, C., Qiao, Y., Yu, Y.: Dreamda: Generative data augmentation with diffusion models. arXiv preprint arXiv:2403.12803 (2024)
[16] Geng, Z., Deng, M., Bai, X., Kolter, J.Z., He, K.: Mean flows for one-step generative modeling. In: NeurIPS (2025)
[17] Goenka, S., Zheng, Z., Jaiswal, A., Chada, R., Wu, Y., Hedau, V., Natarajan, P.: Fashionvlp: Vision language transformer for fashion retrieval with feedback. In: CVPR (2022)
[18] Gu, G., Chun, S., Kim, W., Jun, H., Kang, Y., Yun, S.: Compodiff: Versatile composed image retrieval with latent diffusion. TMLR (2024)
[19] Gu, G., Chun, S., Kim, W., Kang, Y., Yun, S.: Language-only training of zero-shot composed image retrieval. In: CVPR (2024)
[20] Gui, M., Schusterbauer, J., Prestel, U., Ma, P., Kotovenko, D., Grebenkova, O., Baumann, S.A., Hu, V.T., Ommer, B.: Depthfm: Fast generative monocular depth estimation with flow matching. In: AAAI (2025)
[21] He, J., Yu, Q., Liu, Q., Chen, L.C.: Flowtok: Flowing seamlessly across text and image tokens. In: ICCV (2025)
[22] He, Z., Li, L., Chen, L.: Flowcomposer: Composable flows for compositional zero-shot learning. In: CVPR (2026)
[23] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS (2020)
[24] Hui, M., Yang, S., Zhao, B., Shi, Y., Wang, H., Wang, P., Zhou, Y., Xie, C.: Hq-edit: A high-quality dataset for instruction-based image editing. In: ICLR (2025)
[25] Islam, K., Zaheer, M.Z., Mahmood, A., Nandakumar, K.: Diffusemix: Label-preserving data augmentation with diffusion models. In: CVPR (2024)
[26] Jiang, Z., Wang, Y., Chen, L.: Exploring cross-modal flows for few-shot learning. In: ICLR (2026)
[27] Karthik, S., Roth, K., Mancini, M., Akata, Z.: Vision-by-language for training-free compositional image retrieval. In: ICLR (2024)
[28] Koh, G., Oh, H.J., Noh, J., Jeong, W.K.: Synthetic data augmentation using pre-trained diffusion models for long-tailed food image classification. In: CVPR (2025)
[29] Li, A.C., Prabhudesai, M., Duggal, S., Brown, E., Pathak, D.: Your diffusion model is secretly a zero-shot classifier. In: ICCV (2023)
[30] Li, J., et al.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML (2022)
[31] Li, L., Jiang, Z., Ye, G., He, Z., Li, J., Xiao, J., Cheng, K.T., Chen, L.: Path-decoupled hyperbolic flow matching for few-shot adaptation. In: ICML (2026)
[32] Li, T., Tian, Y., Li, H., Deng, M., He, K.: Autoregressive image generation without vector quantization. In: NeurIPS (2024)
[33] Li, W., Fan, H., Wong, Y., Yang, Y., Kankanhalli, M.S.: Improving context understanding in multimodal large language models via multimodal composition learning. In: ICML (2024)
[34] Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: ICLR (2023)
[35] Liu, Q., Yin, X., Yuille, A., Brown, A., Singh, M.: Flowing from words to pixels: A noise-free framework for cross-modality evolution. In: CVPR (2025)
[36] Liu, X., Pu, N., Zheng, H., Li, W., Sebe, N., Zhong, Z.: Generate, refine, and encode: Leveraging synthesized novel samples for on-the-fly fine-grained category discovery. In: ICCV (2025)
[37] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: ICLR (2023)
[38] Liu, Z., Rodriguez-Opazo, C., Teney, D., Gould, S.: Image retrieval on real-life images with pre-trained vision-and-language models. In: CVPR (2021)
[39] Ning, W., Chang, D., Tong, Y., He, Z., Liang, K., Ma, Z.: Hierarchical prompting for diffusion classifiers. In: ACCV (2024)
[40] Qi, Z., Liu, B., Zhang, S., Li, B., Xu, Z., Xiong, H., Xie, Z.: A simple and efficient baseline for zero-shot generative classification. arXiv preprint arXiv:2412.12594 (2024)
[41] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
[42] Ranjbar, S.K., Alhamoud, K., Ghassemi, M.: Spacevlm: Sub-space modeling of negation in vision-language models. arXiv preprint arXiv:2511.12331 (2025)
[43] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
[44] Saito, K., Sohn, K., Zhang, X., Li, C.L., Lee, C.Y., Saenko, K., Pfister, T.: Pic2word: Mapping pictures to words for zero-shot composed image retrieval. In: CVPR (2023)
[45] Sun, Z., Jing, D., Lu, Z.: Cotmr: Chain-of-thought multi-scale reasoning for training-free zero-shot composed image retrieval. In: ICCV (2025)
[46] Tang, H., Wang, J., Zhao, M., Meng, G., Luo, R., Chen, L., Xia, S.T.: Heterogeneous uncertainty-guided composed image retrieval with fine-grained probabilistic learning. In: AAAI (2026)
[47] Tang, Y., Yu, J., Gai, K., Zhuang, J., Xiong, G., Hu, Y., Wu, Q.: Context-i2w: Mapping images to context-dependent words for accurate zero-shot composed image retrieval. In: AAAI (2024)
[48] Vo, N., Jiang, L., Sun, C., Murphy, K., Li, L.J., Fei-Fei, L., Hays, J.: Composing text and image for image retrieval-an empirical odyssey. In: CVPR (2019)
[49] Wang, L., Ao, W., Boddeti, V.N., Lim, S.N.: Generative zero-shot composed image retrieval. In: CVPR (2025)
[50] Wang, Y., Chen, H., Liu, J., He, Z., Liu, R., Wang, Z., Chen, L.: Lisa: Likelihood score alignment for visual-condition controllable generation (2026)
[51] Wang, Y., Chen, L.: Inversion circle interpolation: Diffusion-based image augmentation for data-scarce classification. In: CVPR (2025)
[52] Wang, Y., Chen, L.: Noise matters: Optimizing matching noise for diffusion classifiers. In: NeurIPS (2025)
[53] Wang, Z., Jiang, D., Li, L., Dang, S., Li, C., Yang, H., Dai, G., Wang, M., Wang, J.: Deforming videos to masks: Flow matching for referring video segmentation. arXiv preprint arXiv:2510.06139 (2025)
[54] Wu, H., Gao, Y., Guo, X., Al-Halah, Z., Rennie, S., Grauman, K., Feris, R.: Fashion iq: A new dataset towards retrieving images by natural language feedback. In: CVPR (2021)
[55] Yang, Z., Xue, D., Qian, S., Dong, W., Xu, C.: Ldre: Llm-based divergent reasoning and ensemble for zero-shot composed image retrieval. In: Proceedings of the 47th International ACM SIGIR conference on research and development in information retrieval (2024)
[56] Yue, Z., Zhou, P., Hong, R., Zhang, H., Sun, Q.: Few-shot learner parameterization by diffusion time-steps. In: CVPR (2024)
[57] Zhan, G., Liu, Y., Han, K., Xie, W., Zisserman, A.: Elip: Enhanced visual-language foundation models for image retrieval. In: 2025 International Conference on Content-Based Multimedia Indexing (CBMI). pp. 1–8. IEEE (2025)
[58] Zhang, K., Luan, Y., Hu, H., Lee, K., Qiao, S., Chen, W., Su, Y., Chang, M.W.: MagicLens: Self-supervised image retrieval with open-ended instructions. In: ICML (2024)
[59] Zhang, P., Zeng, G., Wang, T., Lu, W.: Tinyllama: An open-source small language model (2024)
[60] Zhao, J., Li, J., Lian, D., Sun, L., Lv, P.: Dualcir: Enhancing training-free composed image retrieval via dual-directional descriptions. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2025)

[1] [1]

arXiv preprint arXiv:2405.02951 (2024)

Agnolucci, L., Baldrati, A., Bertini, M., Del Bimbo, A.: isearle: Improving textual inversion for zero-shot composed image retrieval. arXiv preprint arXiv:2405.02951 (2024)

work page arXiv 2024

[2] [2]

In: ICLR (2023)

Albergo, M.S., Vanden-Eijnden, E.: Building normalizing flows with stochastic interpolants. In: ICLR (2023)

work page 2023

[3] [3]

In: CVPR (2025) 16 He et al

Alhamoud, K., Alshammari, S., Tian, Y., Li, G., Torr, P.H., Kim, Y., Ghassemi, M.: Vision-language models do not understand negation. In: CVPR (2025) 16 He et al

work page 2025

[4] [4]

In: ICCV (2023)

Baldrati, A., Agnolucci, L., Bertini, M., Del Bimbo, A.: Zero-shot composed image retrieval with textual inversion. In: ICCV (2023)

work page 2023

[5] [5]

In: CVPR (2022)

Baldrati, A., Bertini, M., Uricchio, T., Del Bimbo, A.: Effective conditioned and composed image retrieval combining clip-based features. In: CVPR (2022)

work page 2022

[6] [6]

IJCV (2025)

Bogensperger, L., Narnhofer, D., Falk, A., Schindler, K., Pock, T.: Flowsdf: Flow matching for medical image segmentation using distance transforms. IJCV (2025)

work page 2025

[7] [7]

In: ICCV (2025)

Byun, J., Jeong, S., Kim, W., Chun, S., Moon, T.: An efficient post-hoc framework for reducing task discrepancy of text encoders for composed image retrieval. In: ICCV (2025)

work page 2025

[8] [8]

arXiv preprint arXiv:2305.15241 (2023)

Chen, H., Dong, Y., Wang, Z., Yang, X., Duan, C., Su, H., Zhu, J.: Robust classi- fication via a single diffusion model. arXiv preprint arXiv:2305.15241 (2023)

work page arXiv 2023

[9] [9]

In: NeurIPS (2023)

Clark, K., Jaini, P.: Text-to-image diffusion models are zero shot classifiers. In: NeurIPS (2023)

work page 2023

[10] [10]

this is my unicorn, fluffy

Cohen, N., Gal, R., Meirom, E.A., Chechik, G., Atzmon, Y.: “this is my unicorn, fluffy”: Personalizing frozen vision-language representations. In: ECCV (2022)

work page 2022

[11] [11]

In: ICLR (2022)

Delmas, G., Sampaio de Rezende, R., Csurka, G., Larlus, D.: Artemis: Attention- based retrieval with text-explicit matching and implicit similarity. In: ICLR (2022)

work page 2022

[12] [12]

In: ICLR (2024)

Du, Y., Wang, M., Zhou, W., Hui, S., Li, H.: Image2sentence based asymmetrical zero-shot composed image retrieval. In: ICLR (2024)

work page 2024

[13] [13]

In: CVPR (2025)

Duan, S., Sun, Y., Peng, D., Liu, Z., Song, X., Hu, P.: Fuzzy multimodal learning for trusted cross-modal retrieval. In: CVPR (2025)

work page 2025

[14] [14]

In: ICML (2024)

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024)

work page 2024

[15] [15]

arXiv preprint arXiv:2403.12803 (2024)

Fu, Y., Chen, C., Qiao, Y., Yu, Y.: Dreamda: Generative data augmentation with diffusion models. arXiv preprint arXiv:2403.12803 (2024)

work page arXiv 2024

[16] [16]

In: NeurIPS (2025)

Geng,Z.,Deng,M.,Bai,X.,Kolter,J.Z.,He,K.:Meanflowsforone-stepgenerative modeling. In: NeurIPS (2025)

work page 2025

[17] [17]

In: CVPR (2022)

Goenka, S., Zheng, Z., Jaiswal, A., Chada, R., Wu, Y., Hedau, V., Natarajan, P.: Fashionvlp: Vision language transformer for fashion retrieval with feedback. In: CVPR (2022)

work page 2022

[18] [18]

TMLR (2024)

Gu, G., Chun, S., Kim, W., Jun, H., Kang, Y., Yun, S.: Compodiff: Versatile composed image retrieval with latent diffusion. TMLR (2024)

work page 2024

[19] [19]

In: CVPR (2024)

Gu, G., Chun, S., Kim, W., Kang, Y., Yun, S.: Language-only training of zero-shot composed image retrieval. In: CVPR (2024)

work page 2024

[20] [20]

In: AAAI (2025)

Gui, M., Schusterbauer, J., Prestel, U., Ma, P., Kotovenko, D., Grebenkova, O., Baumann, S.A., Hu, V.T., Ommer, B.: Depthfm: Fast generative monocular depth estimation with flow matching. In: AAAI (2025)

work page 2025

[21] [21]

In: ICCV (2025)

He, J., Yu, Q., Liu, Q., Chen, L.C.: Flowtok: Flowing seamlessly across text and image tokens. In: ICCV (2025)

work page 2025

[22] [22]

In: CVPR (2026)

He, Z., Li, L., Chen, L.: Flowcomposer: Composable flows for compositional zero- shot learning. In: CVPR (2026)

work page 2026

[23] [23]

NeurIPS (2020)

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS (2020)

work page 2020

[24] [24]

In: ICLR (2025)

Hui, M., Yang, S., Zhao, B., Shi, Y., Wang, H., Wang, P., Zhou, Y., Xie, C.: Hq- edit: A high-quality dataset for instruction-based image editing. In: ICLR (2025)

work page 2025

[25] Islam, K., Zaheer, M.Z., Mahmood, A., Nandakumar, K.: Diffusemix: Label-preserving data augmentation with diffusion models. In: CVPR (2024)

[26] Jiang, Z., Wang, Y., Chen, L.: Exploring cross-modal flows for few-shot learning. In: ICLR (2026)

[27] Karthik, S., Roth, K., Mancini, M., Akata, Z.: Vision-by-language for training-free compositional image retrieval. In: ICLR (2024)

[28] Koh, G., Oh, H.J., Noh, J., Jeong, W.K.: Synthetic data augmentation using pre-trained diffusion models for long-tailed food image classification. In: CVPR (2025)

[29] Li, A.C., Prabhudesai, M., Duggal, S., Brown, E., Pathak, D.: Your diffusion model is secretly a zero-shot classifier. In: ICCV (2023)

[30] Li, J., et al.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML (2022)

[31] Li, L., Jiang, Z., Ye, G., He, Z., Li, J., Xiao, J., Cheng, K.T., Chen, L.: Path-decoupled hyperbolic flow matching for few-shot adaptation. In: ICML (2026)

[32] Li, T., Tian, Y., Li, H., Deng, M., He, K.: Autoregressive image generation without vector quantization. In: NeurIPS (2024)

[33] Li, W., Fan, H., Wong, Y., Yang, Y., Kankanhalli, M.S.: Improving context understanding in multimodal large language models via multimodal composition learning. In: ICML (2024)

[34] Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: ICLR (2023)

[35] Liu, Q., Yin, X., Yuille, A., Brown, A., Singh, M.: Flowing from words to pixels: A noise-free framework for cross-modality evolution. In: CVPR (2025)

[36] Liu, X., Pu, N., Zheng, H., Li, W., Sebe, N., Zhong, Z.: Generate, refine, and encode: Leveraging synthesized novel samples for on-the-fly fine-grained category discovery. In: ICCV (2025)

[37] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: ICLR (2023)

[38] Liu, Z., Rodriguez-Opazo, C., Teney, D., Gould, S.: Image retrieval on real-life images with pre-trained vision-and-language models. In: CVPR (2021)

[39] Ning, W., Chang, D., Tong, Y., He, Z., Liang, K., Ma, Z.: Hierarchical prompting for diffusion classifiers. In: ACCV (2024)

[40] Qi, Z., Liu, B., Zhang, S., Li, B., Xu, Z., Xiong, H., Xie, Z.: A simple and efficient baseline for zero-shot generative classification. arXiv preprint arXiv:2412.12594 (2024)

[41] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)

[42] Ranjbar, S.K., Alhamoud, K., Ghassemi, M.: Spacevlm: Sub-space modeling of negation in vision-language models. arXiv preprint arXiv:2511.12331 (2025)

[43] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

[44] Saito, K., Sohn, K., Zhang, X., Li, C.L., Lee, C.Y., Saenko, K., Pfister, T.: Pic2word: Mapping pictures to words for zero-shot composed image retrieval. In: CVPR (2023)

[45] Sun, Z., Jing, D., Lu, Z.: Cotmr: Chain-of-thought multi-scale reasoning for training-free zero-shot composed image retrieval. In: ICCV (2025)

[46] Tang, H., Wang, J., Zhao, M., Meng, G., Luo, R., Chen, L., Xia, S.T.: Heterogeneous uncertainty-guided composed image retrieval with fine-grained probabilistic learning. In: AAAI (2026)

[47] Tang, Y., Yu, J., Gai, K., Zhuang, J., Xiong, G., Hu, Y., Wu, Q.: Context-i2w: Mapping images to context-dependent words for accurate zero-shot composed image retrieval. In: AAAI (2024)

[48] Vo, N., Jiang, L., Sun, C., Murphy, K., Li, L.J., Fei-Fei, L., Hays, J.: Composing text and image for image retrieval - an empirical odyssey. In: CVPR (2019)

[49] Wang, L., Ao, W., Boddeti, V.N., Lim, S.N.: Generative zero-shot composed image retrieval. In: CVPR (2025)

[50] Wang, Y., Chen, H., Liu, J., He, Z., Liu, R., Wang, Z., Chen, L.: Lisa: Likelihood score alignment for visual-condition controllable generation (2026)

[51] Wang, Y., Chen, L.: Inversion circle interpolation: Diffusion-based image augmentation for data-scarce classification. In: CVPR (2025)

[52] Wang, Y., Chen, L.: Noise matters: Optimizing matching noise for diffusion classifiers. In: NeurIPS (2025)

[53] Wang, Z., Jiang, D., Li, L., Dang, S., Li, C., Yang, H., Dai, G., Wang, M., Wang, J.: Deforming videos to masks: Flow matching for referring video segmentation. arXiv preprint arXiv:2510.06139 (2025)

[54] Wu, H., Gao, Y., Guo, X., Al-Halah, Z., Rennie, S., Grauman, K., Feris, R.: Fashion iq: A new dataset towards retrieving images by natural language feedback. In: CVPR (2021)

[55] Yang, Z., Xue, D., Qian, S., Dong, W., Xu, C.: Ldre: Llm-based divergent reasoning and ensemble for zero-shot composed image retrieval. In: Proceedings of the 47th International ACM SIGIR conference on research and development in information retrieval (2024)

[56] Yue, Z., Zhou, P., Hong, R., Zhang, H., Sun, Q.: Few-shot learner parameterization by diffusion time-steps. In: CVPR (2024)

[57] Zhan, G., Liu, Y., Han, K., Xie, W., Zisserman, A.: Elip: Enhanced visual-language foundation models for image retrieval. In: 2025 International Conference on Content-Based Multimedia Indexing (CBMI). pp. 1–8. IEEE (2025)

[58] Zhang, K., Luan, Y., Hu, H., Lee, K., Qiao, S., Chen, W., Su, Y., Chang, M.W.: MagicLens: Self-supervised image retrieval with open-ended instructions. In: ICML (2024)

[59] Zhang, P., Zeng, G., Wang, T., Lu, W.: Tinyllama: An open-source small language model (2024)

[60] Zhao, J., Li, J., Lian, D., Sun, L., Lv, P.: Dualcir: Enhancing training-free composed image retrieval via dual-directional descriptions. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2025)