FlowCIR: Semantic Transport via Flow Matching for Zero-Shot Composed Image Retrieval
Pith reviewed 2026-07-03 15:39 UTC · model grok-4.3
The pith
FlowCIR casts zero-shot composed image retrieval as conditional semantic transport learned by flow matching on fixed vision-language model embeddings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Zero-shot composed image retrieval is reformulated as learning a conditional flow-matching transport field that maps an instruction representation, given the reference image, directly to a target-aligned query embedding; the resulting lightweight module produces competitive retrieval accuracy on standard benchmarks while using roughly ten times fewer training resources than textual-inversion baselines and incorporates a Multi-Negative Steering procedure to offset vision-language model weaknesses on negation.
What carries the argument
Conditional flow matching transport field that maps the instruction representation toward a target-aligned query embedding conditioned on the reference image.
If this is right
- The method reaches strong performance on existing CIR benchmarks without requiring domain-specific triplet annotations.
- Training cost is reduced by a factor of roughly ten compared with prior textual-inversion pipelines.
- Multi-Negative Steering at inference improves results on queries that contain negation or removal instructions.
Where Pith is reading between the lines
- The same transport formulation could be tested on other vision-language tasks that currently rely on token inversion or simple concatenation for composition.
- Because the approach never updates the underlying encoders, it may allow reuse of the same transport module across different vision-language model backbones.
Load-bearing premise
A lightweight transport module trained solely on fixed pre-extracted vision-language model embeddings can capture the fine-grained semantics needed for accurate target retrieval without any encoder updates.
What would settle it
A controlled experiment showing that FlowCIR retrieval accuracy on standard benchmarks drops below that of a textual-inversion baseline when both methods receive identical training compute and the same pre-trained encoders.
Figures
read the original abstract
Zero-shot composed image retrieval (ZS-CIR) aims to retrieve a target image by editing a reference image with a natural-language instruction, without relying on domain-specific annotated triplets. Most existing ZS-CIR methods rely on textual inversion to translate the reference image into pseudo-text tokens and then compose them with the instruction via simple concatenation in the text space, which can be lossy and brittle for fine-grained semantics. In this work, we propose a new paradigm, namely FlowCIR, that casts ZS-CIR as conditional semantic transport between reference and target embeddings. Leveraging \emph{conditional flow matching}, our model learns a lightweight transport field that maps the instruction representation toward a target-aligned query embedding conditioned on the reference image. Since FlowCIR operates on pre-extracted VLM embeddings and trains only a small transport module without updating the image or text encoder, it offers a computationally efficient training protocol compared with prior textual-inversion-based approaches. The resulting framework is training-efficient, requiring roughly $10\times$ fewer training resources than prior textual-inversion-based approaches. We further identify negation and removal as a major failure mode of VLM-based composition. To address this, we propose an inference-only Multi-Negative Steering strategy that steers a negation-containing relative instruction away from its negated semantics, mitigating the limited negation handling of VLMs and improving robustness on negation-heavy queries. Extensive experiments on standard CIR benchmarks demonstrate that FlowCIR achieves strong and competitive performance compared with recent ZS-CIR methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FlowCIR for zero-shot composed image retrieval (ZS-CIR), framing the task as conditional semantic transport via flow matching. A lightweight transport module is trained on fixed pre-extracted VLM embeddings to map an instruction representation (conditioned on a reference image) to a target-aligned query embedding, without updating encoders or using task-specific triplets. It claims this yields competitive performance on standard CIR benchmarks while requiring roughly 10× fewer training resources than textual-inversion baselines, and introduces an inference-only Multi-Negative Steering heuristic to address VLM limitations on negation/removal.
Significance. If the performance and efficiency claims hold, the work offers a paradigm shift from textual inversion to flow-based transport on frozen embeddings, potentially lowering the barrier for ZS-CIR research. The explicit identification of negation as a VLM failure mode and the proposed mitigation are constructive contributions.
major comments (2)
- [Abstract / Method] Abstract and method description: The central claim that a small conditional flow-matching network trained solely on fixed VLM embeddings can accurately recover fine-grained target semantics (including directional composition) is load-bearing, yet the paper itself identifies negation/removal as a major VLM failure mode and resorts to a separate inference-time heuristic; this indicates the learned transport field may not fully compensate for information lost in the embedding space without additional supervision or adaptation.
- [Abstract] Abstract: The efficiency claim of 'roughly 10× fewer training resources' is presented without concrete metrics (e.g., parameter count of the transport module, GPU-hours, epochs, or side-by-side comparison tables), which is required to substantiate the load-bearing advantage over textual-inversion approaches.
minor comments (1)
- [Abstract] The abstract would benefit from a one-sentence description of the specific conditional flow-matching objective or network architecture to clarify how the transport field is parameterized.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential paradigm shift offered by FlowCIR. We address each major comment below with point-by-point responses.
read point-by-point responses
-
Referee: [Abstract / Method] Abstract and method description: The central claim that a small conditional flow-matching network trained solely on fixed VLM embeddings can accurately recover fine-grained target semantics (including directional composition) is load-bearing, yet the paper itself identifies negation/removal as a major VLM failure mode and resorts to a separate inference-time heuristic; this indicates the learned transport field may not fully compensate for information lost in the embedding space without additional supervision or adaptation.
Authors: We agree that negation and removal constitute a notable VLM limitation, which is why the manuscript explicitly identifies this failure mode and introduces Multi-Negative Steering as a targeted inference-time mitigation. The conditional flow-matching transport is trained to learn directional semantic mappings on the fixed embeddings for general compositional instructions, and benchmark results indicate it recovers target semantics effectively in most cases. The steering heuristic specifically addresses residual negation handling issues that are not fully resolved in the VLM embedding space. We will revise the abstract and method sections to more clearly separate the scope of the learned transport from the additional steering strategy and to discuss this distinction as a limitation. revision: partial
-
Referee: [Abstract] Abstract: The efficiency claim of 'roughly 10× fewer training resources' is presented without concrete metrics (e.g., parameter count of the transport module, GPU-hours, epochs, or side-by-side comparison tables), which is required to substantiate the load-bearing advantage over textual-inversion approaches.
Authors: We acknowledge that the efficiency claim requires concrete supporting metrics to be fully substantiated. In the revised manuscript we will add explicit details on the transport module's parameter count, training epochs, approximate GPU-hours, and a side-by-side resource comparison against textual-inversion baselines. revision: yes
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper defines FlowCIR as a conditional flow-matching transport module trained on fixed pre-extracted VLM embeddings to map reference+instruction to target-aligned queries, with an added inference-time Multi-Negative Steering heuristic. All load-bearing steps (embedding extraction, flow training, and benchmark evaluation) operate on external VLM features and standard CIR datasets without any reduction of the reported performance metrics to quantities defined by the fitted parameters themselves or to self-citations. The efficiency claim (10× fewer resources) follows directly from the architectural choice of freezing encoders rather than from any definitional equivalence. No self-definitional, fitted-input-as-prediction, or uniqueness-imported patterns appear in the provided text.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2405.02951 (2024)
Agnolucci, L., Baldrati, A., Bertini, M., Del Bimbo, A.: isearle: Improving textual inversion for zero-shot composed image retrieval. arXiv preprint arXiv:2405.02951 (2024)
-
[2]
Albergo, M.S., Vanden-Eijnden, E.: Building normalizing flows with stochastic interpolants. In: ICLR (2023)
work page 2023
-
[3]
Alhamoud, K., Alshammari, S., Tian, Y., Li, G., Torr, P.H., Kim, Y., Ghassemi, M.: Vision-language models do not understand negation. In: CVPR (2025) 16 He et al
work page 2025
-
[4]
Baldrati, A., Agnolucci, L., Bertini, M., Del Bimbo, A.: Zero-shot composed image retrieval with textual inversion. In: ICCV (2023)
work page 2023
-
[5]
Baldrati, A., Bertini, M., Uricchio, T., Del Bimbo, A.: Effective conditioned and composed image retrieval combining clip-based features. In: CVPR (2022)
work page 2022
-
[6]
Bogensperger, L., Narnhofer, D., Falk, A., Schindler, K., Pock, T.: Flowsdf: Flow matching for medical image segmentation using distance transforms. IJCV (2025)
work page 2025
-
[7]
Byun, J., Jeong, S., Kim, W., Chun, S., Moon, T.: An efficient post-hoc framework for reducing task discrepancy of text encoders for composed image retrieval. In: ICCV (2025)
work page 2025
-
[8]
arXiv preprint arXiv:2305.15241 (2023)
Chen, H., Dong, Y., Wang, Z., Yang, X., Duan, C., Su, H., Zhu, J.: Robust classi- fication via a single diffusion model. arXiv preprint arXiv:2305.15241 (2023)
-
[9]
Clark, K., Jaini, P.: Text-to-image diffusion models are zero shot classifiers. In: NeurIPS (2023)
work page 2023
-
[10]
Cohen, N., Gal, R., Meirom, E.A., Chechik, G., Atzmon, Y.: “this is my unicorn, fluffy”: Personalizing frozen vision-language representations. In: ECCV (2022)
work page 2022
-
[11]
Delmas, G., Sampaio de Rezende, R., Csurka, G., Larlus, D.: Artemis: Attention- based retrieval with text-explicit matching and implicit similarity. In: ICLR (2022)
work page 2022
-
[12]
Du, Y., Wang, M., Zhou, W., Hui, S., Li, H.: Image2sentence based asymmetrical zero-shot composed image retrieval. In: ICLR (2024)
work page 2024
-
[13]
Duan, S., Sun, Y., Peng, D., Liu, Z., Song, X., Hu, P.: Fuzzy multimodal learning for trusted cross-modal retrieval. In: CVPR (2025)
work page 2025
-
[14]
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024)
work page 2024
-
[15]
arXiv preprint arXiv:2403.12803 (2024)
Fu, Y., Chen, C., Qiao, Y., Yu, Y.: Dreamda: Generative data augmentation with diffusion models. arXiv preprint arXiv:2403.12803 (2024)
-
[16]
Geng,Z.,Deng,M.,Bai,X.,Kolter,J.Z.,He,K.:Meanflowsforone-stepgenerative modeling. In: NeurIPS (2025)
work page 2025
-
[17]
Goenka, S., Zheng, Z., Jaiswal, A., Chada, R., Wu, Y., Hedau, V., Natarajan, P.: Fashionvlp: Vision language transformer for fashion retrieval with feedback. In: CVPR (2022)
work page 2022
-
[18]
Gu, G., Chun, S., Kim, W., Jun, H., Kang, Y., Yun, S.: Compodiff: Versatile composed image retrieval with latent diffusion. TMLR (2024)
work page 2024
-
[19]
Gu, G., Chun, S., Kim, W., Kang, Y., Yun, S.: Language-only training of zero-shot composed image retrieval. In: CVPR (2024)
work page 2024
-
[20]
Gui, M., Schusterbauer, J., Prestel, U., Ma, P., Kotovenko, D., Grebenkova, O., Baumann, S.A., Hu, V.T., Ommer, B.: Depthfm: Fast generative monocular depth estimation with flow matching. In: AAAI (2025)
work page 2025
-
[21]
He, J., Yu, Q., Liu, Q., Chen, L.C.: Flowtok: Flowing seamlessly across text and image tokens. In: ICCV (2025)
work page 2025
-
[22]
He, Z., Li, L., Chen, L.: Flowcomposer: Composable flows for compositional zero- shot learning. In: CVPR (2026)
work page 2026
-
[23]
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS (2020)
work page 2020
-
[24]
Hui, M., Yang, S., Zhao, B., Shi, Y., Wang, H., Wang, P., Zhou, Y., Xie, C.: Hq- edit: A high-quality dataset for instruction-based image editing. In: ICLR (2025)
work page 2025
-
[25]
Islam, K., Zaheer, M.Z., Mahmood, A., Nandakumar, K.: Diffusemix: Label- preserving data augmentation with diffusion models. In: CVPR (2024)
work page 2024
-
[26]
Jiang, Z., Wang, Y., Chen, L.: Exploring cross-modal flows for few-shot learning. In: ICLR (2026) FlowCIR 17
work page 2026
-
[27]
Karthik, S., Roth, K., Mancini, M., Akata, Z.: Vision-by-language for training-free compositional image retrieval. In: ICLR (2024)
work page 2024
-
[28]
Koh, G., Oh, H.J., Noh, J., Jeong, W.K.: Synthetic data augmentation using pre- trained diffusion models for long-tailed food image classification. In: CVPR (2025)
work page 2025
-
[29]
Li, A.C., Prabhudesai, M., Duggal, S., Brown, E., Pathak, D.: Your diffusion model is secretly a zero-shot classifier. In: ICCV (2023)
work page 2023
-
[30]
Li, J., et al.: Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation. In: ICML (2022)
work page 2022
-
[31]
Li, L., Jiang, Z., Ye, G., He, Z., Li, J., Xiao, J., Cheng, K.T., Chen, L.: Path- decoupled hyperbolic flow matching for few-shot adaptation. In: ICML (2026)
work page 2026
-
[32]
Li, T., Tian, Y., Li, H., Deng, M., He, K.: Autoregressive image generation without vector quantization. In: NeurIPS (2024)
work page 2024
-
[33]
Li, W., Fan, H., Wong, Y., Yang, Y., Kankanhalli, M.S.: Improving context under- standing in multimodal large language models via multimodal composition learn- ing. In: ICML (2024)
work page 2024
-
[34]
Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: ICLR (2023)
work page 2023
-
[35]
Liu, Q., Yin, X., Yuille, A., Brown, A., Singh, M.: Flowing from words to pixels: A noise-free framework for cross-modality evolution. In: CVPR (2025)
work page 2025
-
[36]
Liu, X., Pu, N., Zheng, H., Li, W., Sebe, N., Zhong, Z.: Generate, refine, and encode: Leveraging synthesized novel samples for on-the-fly fine-grained category discovery. In: ICCV (2025)
work page 2025
-
[37]
Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: ICLR (2023)
work page 2023
-
[38]
Liu, Z., Rodriguez-Opazo, C., Teney, D., Gould, S.: Image retrieval on real-life images with pre-trained vision-and-language models. In: CVPR (2021)
work page 2021
-
[39]
Ning, W., Chang, D., Tong, Y., He, Z., Liang, K., Ma, Z.: Hierarchical prompting for diffusion classifiers. In: ACCV (2024)
work page 2024
-
[40]
arXiv preprint arXiv:2412.12594 (2024)
Qi, Z., Liu, B., Zhang, S., Li, B., Xu, Z., Xiong, H., Xie, Z.: A simple and efficient baseline for zero-shot generative classification. arXiv preprint arXiv:2412.12594 (2024)
-
[41]
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
work page 2021
-
[42]
arXiv preprint arXiv:2511.12331 (2025)
Ranjbar, S.K., Alhamoud, K., Ghassemi, M.: Spacevlm: Sub-space modeling of negation in vision-language models. arXiv preprint arXiv:2511.12331 (2025)
-
[43]
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
work page 2022
-
[44]
Saito, K., Sohn, K., Zhang, X., Li, C.L., Lee, C.Y., Saenko, K., Pfister, T.: Pic2word: Mapping pictures to words for zero-shot composed image retrieval. In: CVPR (2023)
work page 2023
-
[45]
Sun, Z., Jing, D., Lu, Z.: Cotmr: Chain-of-thought multi-scale reasoning for training-free zero-shot composed image retrieval. In: ICCV (2025)
work page 2025
-
[46]
Tang, H., Wang, J., Zhao, M., Meng, G., Luo, R., Chen, L., Xia, S.T.: Heteroge- neous uncertainty-guided composed image retrieval with fine-grained probabilistic learning. In: AAAI (2026)
work page 2026
-
[47]
Tang, Y., Yu, J., Gai, K., Zhuang, J., Xiong, G., Hu, Y., Wu, Q.: Context-i2w: Mapping images to context-dependent words for accurate zero-shot composed im- age retrieval. In: AAAI (2024)
work page 2024
-
[48]
Vo, N., Jiang, L., Sun, C., Murphy, K., Li, L.J., Fei-Fei, L., Hays, J.: Composing text and image for image retrieval-an empirical odyssey. In: CVPR (2019) 18 He et al
work page 2019
-
[49]
Wang, L., Ao, W., Boddeti, V.N., Lim, S.N.: Generative zero-shot composed image retrieval. In: CVPR (2025)
work page 2025
-
[50]
Wang, Y., Chen, H., Liu, J., He, Z., Liu, R., Wang, Z., Chen, L.: Lisa: Likelihood score alignment for visual-condition controllable generation (2026)
work page 2026
-
[51]
Wang, Y., Chen, L.: Inversion circle interpolation: Diffusion-based image augmen- tation for data-scarce classification. In: CVPR (2025)
work page 2025
-
[52]
Wang, Y., Chen, L.: Noise matters: Optimizing matching noise for diffusion clas- sifiers. In: NeurIPS (2025)
work page 2025
-
[53]
arXiv preprint arXiv:2510.06139 (2025)
Wang, Z., Jiang, D., Li, L., Dang, S., Li, C., Yang, H., Dai, G., Wang, M., Wang, J.: Deforming videos to masks: Flow matching for referring video segmentation. arXiv preprint arXiv:2510.06139 (2025)
-
[54]
Wu,H.,Gao,Y.,Guo,X.,Al-Halah,Z.,Rennie,S.,Grauman,K.,Feris,R.:Fashion iq: A new dataset towards retrieving images by natural language feedback. In: CVPR (2021)
work page 2021
-
[55]
Yang, Z., Xue, D., Qian, S., Dong, W., Xu, C.: Ldre: Llm-based divergent reasoning and ensemble for zero-shot composed image retrieval. In: Proceedings of the 47th International ACM SIGIR conference on research and development in information retrieval (2024)
work page 2024
-
[56]
Yue, Z., Zhou, P., Hong, R., Zhang, H., Sun, Q.: Few-shot learner parameterization by diffusion time-steps. In: CVPR (2024)
work page 2024
-
[57]
In: 2025 International Conference on Content-Based Multimedia Indexing (CBMI)
Zhan, G., Liu, Y., Han, K., Xie, W., Zisserman, A.: Elip: Enhanced visual- language foundation models for image retrieval. In: 2025 International Conference on Content-Based Multimedia Indexing (CBMI). pp. 1–8. IEEE (2025)
work page 2025
-
[58]
Zhang, K., Luan, Y., Hu, H., Lee, K., Qiao, S., Chen, W., Su, Y., Chang, M.W.: MagicLens: Self-supervised image retrieval with open-ended instructions. In: ICML (2024)
work page 2024
-
[59]
Zhang, P., Zeng, G., Wang, T., Lu, W.: Tinyllama: An open-source small language model (2024)
work page 2024
-
[60]
In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2025)
Zhao, J., Li, J., Lian, D., Sun, L., Lv, P.: Dualcir: Enhancing training-free com- posed image retrieval via dual-directional descriptions. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2025)
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.