Assessing VLM Reliability for Medical Image Quality Evaluation Under Corruption and Bias

Kevin Vorwalder; Nico Pfeifer; Sofiane Ouaari

arxiv: 2607.01973 · v1 · pith:R35ZTPFVnew · submitted 2026-07-02 · 💻 cs.CV · cs.AI · cs.LG

Pith reviewed 2026-07-03 15:41 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords Vision-Language Models · Medical Image Quality Assessment · Image Corruption · Contextual Bias · Privacy Preservation · Zero-shot Evaluation · Embedding Geometry · Multimodal Reliability

The pith

Quality scores from vision-language models for medical image quality assessment drop under pixelation and shift when metadata is added.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests sixteen VLMs zero-shot on medical images across seven modalities using a dataset that applies seven corruption types at five severity levels each. It measures how these degradations change quality scores and embedding positions, then checks whether adding textual details about patient demographics, clinician expertise, equipment, or institution changes the outputs. The authors report that pixelation produces the biggest score reductions while brightness barely affects results, that embedding shifts track the score changes, and that prestige-related metadata can raise scores by more than 17 percent on average. They conclude that these patterns reveal both a privacy-reliability tension and limited objectivity in the models. A reader would care because reliable automated quality checks could ease clinical workloads only if the systems remain stable when images are degraded for privacy or when routine context is present.

Core claim

Current VLMs show limitations for medical image quality assessment. Pixelation, a privacy-preserving transformation, reduces performance, indicating a trade-off between patient privacy and reliability. Sensitivity to contextual metadata indicates limited objectivity and marks metadata as a privacy and bias source.

What carries the argument

Zero-shot benchmarking of VLMs on the MediMeta-C dataset under controlled corruptions and textual attribute perturbations, tracking both numerical score changes and embedding displacement.
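
To make the corruption setup concrete, the following is a minimal sketch of a pixelation transform applied at five severity levels. It is an illustration only: the `BLOCK_SIZES` mapping, the `pixelate` function, and the random test image are assumptions, and the source does not state how MediMeta-C actually parameterizes pixelation.

```python
import numpy as np

# Hypothetical severity-to-block-size mapping; the actual MediMeta-C
# parameters are not given in the source.
BLOCK_SIZES = {1: 2, 2: 4, 3: 8, 4: 16, 5: 32}


def pixelate(image: np.ndarray, severity: int) -> np.ndarray:
    """Pixelate a 2D grayscale image by averaging over square blocks."""
    block = BLOCK_SIZES[severity]
    h, w = image.shape
    # Crop so both dimensions divide evenly by the block size.
    h_crop, w_crop = h - h % block, w - w % block
    cropped = image[:h_crop, :w_crop].astype(np.float64)
    # Average each block, then repeat the block mean back to full resolution.
    blocks = cropped.reshape(h_crop // block, block, w_crop // block, block)
    means = blocks.mean(axis=(1, 3))
    return np.repeat(np.repeat(means, block, axis=0), block, axis=1)


# Synthetic stand-in for a medical image (random values, not real data).
rng = np.random.default_rng(0)
demo = rng.random((64, 64))
for severity in range(1, 6):
    out = pixelate(demo, severity)
    print(severity, out.shape, len(np.unique(out)))
```

Higher severities leave fewer distinct intensity values, which is the loss of detail that both protects identity and, according to the reported results, lowers the models' quality scores.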

If this is right

Pixelation produces mean score reductions of 20.58 percent and up to 34.4 percent on OCT images.
Embedding displacement under corruption is associated with the observed score changes.
Models from the same family exhibit score correlations between 0.67 and 0.83, though some increase scores on corrupted mammography images.
Institutional prestige raises quality scores by 17.15 percent on average while equipment age lowers them by 14.7 percent.
The largest single-model shifts reach +95.62 percent and -37.7 percent when metadata is altered.
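
The percentages above are relative score changes. As a rough sketch of the arithmetic, assuming the change is computed per image against the clean-image score and then averaged (the source does not specify the aggregation), one could write the following. The helper names and the example scores are invented for illustration and are not taken from the paper.

```python
def percent_change(clean: float, perturbed: float) -> float:
    """Relative change of a perturbed score against its clean baseline, in percent."""
    if clean == 0:
        raise ValueError("clean score must be non-zero")
    return 100.0 * (perturbed - clean) / clean


def mean_percent_change(pairs):
    """Average the per-image relative changes over (clean, perturbed) pairs."""
    changes = [percent_change(clean, perturbed) for clean, perturbed in pairs]
    return sum(changes) / len(changes)


# Invented example scores on a 1-10 scale, not results from the paper.
example_pairs = [(8.0, 6.0), (5.0, 6.0)]
print(percent_change(8.0, 6.0))            # a drop from 8 to 6
print(mean_percent_change(example_pairs))  # average over both pairs
```

Averaging per-image changes and taking the change of the averaged scores generally give different numbers, so the aggregation choice matters when comparing against the reported figures.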

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Clinics considering automated quality screening would need separate privacy methods that avoid pixelation if they also rely on these models.
The observed metadata sensitivity suggests that any deployed system should log and audit the exact textual context supplied with each image.
Future work could test whether fine-tuning on corrupted examples or metadata-ablated prompts reduces the reported shifts.
The privacy-reliability tension identified here may apply to other VLM medical tasks that use degraded images for data protection.
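
The logging and metadata-ablation ideas above can be sketched as follows. This is a hypothetical illustration: the base prompt, the attribute sentences, and the record fields are assumptions, not the prompts used in the paper.

```python
import hashlib
import json

# Hypothetical base prompt and metadata sentences, for illustration only.
BASE_PROMPT = "Rate the technical quality of this medical image from 1 to 10."
ATTRIBUTES = {
    "institution": "The image was acquired at a leading university hospital.",
    "equipment": "The image was acquired on a scanner that is 15 years old.",
}


def build_prompt(attribute=None):
    """Return the base prompt, optionally prefixed with one metadata sentence."""
    if attribute is None:
        return BASE_PROMPT  # metadata-ablated control condition
    return f"{ATTRIBUTES[attribute]} {BASE_PROMPT}"


def audit_record(image_id, attribute):
    """Capture the exact text sent with an image so it can be audited later."""
    prompt = build_prompt(attribute)
    return {
        "image_id": image_id,
        "attribute": attribute,
        "prompt": prompt,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }


records = [audit_record("img_0001", a) for a in (None, "institution", "equipment")]
print(json.dumps(records, indent=2))
```

Keeping the ablated control alongside each metadata variant makes it possible to attribute a score shift to the added sentence rather than to the image.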

Load-bearing premise

The MediMeta-C dataset together with the seven chosen corruption types, five severity levels, and specific textual attributes tested are representative of real-world clinical image degradations and contextual biases.

What would settle it

A replication on an independent clinical collection that finds VLMs maintain stable quality scores under pixelation and remain unaffected by the same metadata additions would falsify the reported limitations.

Original abstract

Vision-Language Models (VLMs) are increasingly applied in medical tasks such as pathology description, report generation, and visual question answering. Medical Image Quality Assessment (MIQA) supports diagnostic accuracy and patient safety by determining whether images meet the standards required for clinical decision-making. Automating MIQA with VLMs may reduce workload, but their behavior under real-world conditions, where images may be degraded or textual context may affect judgments, should be further explored before deployment. We benchmark VLMs on medical image quality using the MediMeta-C dataset zero-shot across seven corruption types and five severity levels. We evaluate sensitivity to degradation patterns, the effect of corruptions on embedding geometry, and whether textual attributes (demographics, expertise, infrastructure, institution) alter scores. Across 16 VLMs and seven modalities, pixelation produced the largest score reductions (mean -20.58%, up to -34.4% for OCT), whereas brightness had limited effect (-0.81%). Embedding displacement was associated with score changes. Same-family models showed correlations of 0.67-0.83; some produced increases up to +31% for corrupted mammography. Textual attributes affected scores: institutional prestige raised them +17.15%, and equipment age lowered them -14.7%. The largest changes were +95.62% (InternVL-8B) and -37.7% (MedGemma). Current VLMs show limitations for medical image quality assessment. Pixelation, a privacy-preserving transformation, reduces performance, indicating a trade-off between patient privacy and reliability. Sensitivity to contextual metadata indicates limited objectivity and marks metadata as a privacy and bias source. Privacy protection and objective quality assessment are related requirements for use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

This benchmark quantifies VLM score drops from pixelation and shifts from metadata in medical image quality assessment, but the chosen corruptions and attributes lack shown ties to real clinical data.

The paper measures how 16 VLMs respond to seven image corruptions and added textual attributes when scoring medical image quality across seven modalities on the MediMeta-C dataset. Pixelation produces the largest average drop at 20.58 percent, although some models' scores rise on corrupted mammography, while prestige text lifts scores by 17.15 percent and equipment age lowers them by 14.7 percent. Embedding shifts track some of the score changes, and same-family models correlate at 0.67-0.83.

It supplies new zero-shot numbers on these specific effects and notes the privacy angle with pixelation. That coverage across models and modalities is the useful part.

The main gap is that the seven corruptions, their severities, and the textual attributes are not shown to match distributions from actual clinical image logs or radiologist failure reports. Without that link the claims about reliability trade-offs and bias sources rest on an unanchored set. The abstract also gives no sample sizes per modality, no statistical tests, and no controls for prompt wording, so the reported percentages need more verification.

This is for groups working on VLM use in medical imaging who want concrete sensitivity data. A reader focused on deployment robustness would find the measurements worth seeing.

It deserves peer review so the methods and dataset choices can be checked directly.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks 16 VLMs zero-shot for medical image quality assessment (MIQA) on the MediMeta-C dataset across seven modalities. It measures score changes under seven corruption types at five severity levels, analyzes associated embedding displacements, reports same-family model correlations (0.67-0.83), and tests sensitivity to textual attributes (demographics, expertise, infrastructure, institution). Key quantitative results include largest mean score reduction from pixelation (-20.58%, up to -34.4% for OCT), minimal effect from brightness (-0.81%), metadata-driven shifts (e.g., +17.15% for institutional prestige, -14.7% for equipment age), and extreme per-model changes (+95.62% to -37.7%). The authors conclude that current VLMs exhibit limitations for MIQA, that pixelation reveals a privacy-reliability trade-off, and that metadata sensitivity indicates limited objectivity and introduces bias.

Significance. If the benchmark conditions prove representative, the work supplies concrete empirical evidence on VLM robustness failures under realistic degradations and contextual influences, which is relevant for assessing deployability in clinical MIQA pipelines. The zero-shot multi-model, multi-modality design and joint examination of corruption effects on both scores and embeddings are strengths that could inform future reliability testing protocols.

major comments (2)

[Abstract] Abstract: The inference that 'pixelation, a privacy-preserving transformation, reduces performance, indicating a trade-off between patient privacy and reliability' and that 'current VLMs show limitations for medical image quality assessment' is load-bearing on the assumption that the seven MediMeta-C corruptions and five severity levels are representative of clinical image degradations. The manuscript supplies no validation of MediMeta-C against clinical image logs, radiologist-reported failure modes, or hospital QA data.
[Abstract] Abstract: Reported aggregate score changes (mean -20.58% for pixelation; +17.15% for institutional prestige) and model-family correlations are presented without reference to statistical significance testing, per-modality sample sizes, variance estimates, or controls for prompt variation, leaving the quantitative support for the central claims on performance reduction and metadata sensitivity difficult to evaluate.
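
One way to supply the variance estimates requested above is a paired bootstrap over images. The sketch below is an assumption about how such a check could look, not the authors' analysis; the function name is hypothetical and the scores are synthetic.

```python
import numpy as np


def paired_bootstrap_ci(clean, perturbed, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-image percent change."""
    clean = np.asarray(clean, dtype=float)
    perturbed = np.asarray(perturbed, dtype=float)
    diffs = 100.0 * (perturbed - clean) / clean
    rng = np.random.default_rng(seed)
    # Resample images with replacement, keeping each clean/perturbed pair intact.
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    low, high = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(low), float(high)


# Synthetic scores only: perturbed scores are about 20 percent lower plus noise.
rng = np.random.default_rng(1)
clean_scores = rng.uniform(5.0, 9.0, size=200)
perturbed_scores = 0.8 * clean_scores + rng.normal(scale=0.3, size=200)

mean_change, ci_low, ci_high = paired_bootstrap_ci(clean_scores, perturbed_scores)
print(f"mean change {mean_change:.2f}% (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

An interval that excludes zero would support a claimed reduction for a given modality, and its width gives a direct sense of how much the per-modality sample size limits the estimate.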

minor comments (2)

[Abstract] The abstract lists extreme per-model changes (+95.62% for InternVL-8B) but does not indicate whether these are averaged across modalities or corruptions or tied to specific conditions.
[Abstract] The description of embedding displacement being 'associated with score changes' would benefit from a quantitative measure or correlation coefficient to clarify the strength of the reported association.
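
A quantitative measure of the kind requested in the last comment could be as simple as a correlation between per-image embedding displacement and score change. The following sketch assumes cosine distance as the displacement measure and Pearson correlation as the association measure. Both are illustrative choices rather than the paper's stated method, and all data here is synthetic.

```python
import numpy as np


def cosine_displacement(clean_emb, corrupted_emb):
    """Row-wise cosine distance (1 - cosine similarity) between paired embeddings."""
    a = clean_emb / np.linalg.norm(clean_emb, axis=1, keepdims=True)
    b = corrupted_emb / np.linalg.norm(corrupted_emb, axis=1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=1)


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic embeddings: stronger corruption moves the embedding further
# and, by construction, lowers the score more.
rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 32))
noise_scale = rng.uniform(0.0, 1.0, size=(100, 1))
corrupted = clean + noise_scale * rng.normal(size=(100, 32))

displacement = cosine_displacement(clean, corrupted)
score_change = -20.0 * noise_scale[:, 0] + rng.normal(scale=2.0, size=100)

r = pearson_r(displacement, score_change)
print(f"Pearson r between displacement and score change: {r:.2f}")
```

Reporting such a coefficient per model, ideally with a rank-based alternative for non-linear relationships, would show how strongly displacement and score change are linked.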

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the manuscript.

Referee: [Abstract] Abstract: The inference that 'pixelation, a privacy-preserving transformation, reduces performance, indicating a trade-off between patient privacy and reliability' and that 'current VLMs show limitations for medical image quality assessment' is load-bearing on the assumption that the seven MediMeta-C corruptions and five severity levels are representative of clinical image degradations. The manuscript supplies no validation of MediMeta-C against clinical image logs, radiologist-reported failure modes, or hospital QA data.

Authors: We agree that the privacy-reliability trade-off and VLM limitations claims depend on the relevance of the chosen corruptions. MediMeta-C applies standard synthetic corruptions drawn from established computer vision robustness benchmarks at multiple severity levels. We did not validate these against hospital QA logs or radiologist-reported modes. We will revise the abstract and add a limitations paragraph to qualify that results apply to these synthetic degradations and to note the value of future clinical validation. revision: yes

Referee: [Abstract] Abstract: Reported aggregate score changes (mean -20.58% for pixelation; +17.15% for institutional prestige) and model-family correlations are presented without reference to statistical significance testing, per-modality sample sizes, variance estimates, or controls for prompt variation, leaving the quantitative support for the central claims on performance reduction and metadata sensitivity difficult to evaluate.

Authors: The full manuscript details results over seven modalities and reports sample sizes per modality along with embedding analyses. The abstract omits these supporting elements. We will revise the abstract and results to include per-modality sample sizes, variance estimates, statistical significance tests on the reported mean changes, and explicit statement that a single fixed prompt template was used across all conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on fixed benchmark

The paper conducts zero-shot VLM evaluations on the MediMeta-C dataset, reporting observed score deltas (e.g., pixelation mean -20.58%) and correlations under fixed corruptions and textual attributes. No equations, parameter fitting, predictions, or derivation chains appear; claims rest on direct measurement rather than any reduction to self-defined quantities or self-citation load-bearing premises. The representativeness concern raised by the skeptic is a validity issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard machine learning evaluation practices and the assumption that the chosen dataset and perturbations capture relevant real-world variation; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: Zero-shot prompting is an appropriate method to evaluate inherent VLM reliability for medical image quality assessment.
All evaluations are performed zero-shot without task-specific fine-tuning or examples.
domain assumption: The MediMeta-C dataset and selected corruptions/textual attributes are representative of clinical conditions.
The benchmark design and conclusions depend on this representativeness.

pith-pipeline@v0.9.1-grok · 5851 in / 1426 out tokens · 29936 ms · 2026-07-03T15:41:31.416807+00:00 · methodology
