Population-Scale Segmentation of Penile Tissue in DIXON MRI using Deep Learning for Quantitative Phenotyping in Male Reproductive Health
Pith reviewed 2026-07-03 03:52 UTC · model grok-4.3
The pith
Deep learning model achieves observer-level accuracy in segmenting the whole penis from DIXON MRI scans and quantifies tissue in 34,412 UK Biobank participants.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A 3D nnU-Net trained on expert-annotated DIXON MRI data segments the whole penis with a Dice coefficient of 0.92 and a Hausdorff distance of 3.58 on an independent test set of 24 subjects, matching inter-observer performance. Deployed on 34,412 UK Biobank participants, it yields total penile tissue volumes, and repeat scans of 2,282 men show inter-session reproducibility of r = 0.87.
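For readers unfamiliar with the two headline metrics, here is a minimal sketch of how Dice and Hausdorff distance can be computed on toy binary masks. It is an illustration only: the function names, the toy masks, and the `spacing` parameter are assumptions, and the abstract does not state which Hausdorff variant (full-volume, surface-based, or percentile) or which unit the authors used.

```python
import math

def dice(a, b):
    """Dice coefficient between two sets of voxel coordinates."""
    if not a and not b:
        return 1.0  # convention: two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))

def hausdorff(a, b, spacing=(1.0, 1.0, 1.0)):
    """Brute-force symmetric Hausdorff distance between two non-empty voxel sets.

    `spacing` (hypothetical) scales voxel indices to physical units.
    """
    if not a or not b:
        raise ValueError("both masks must be non-empty")
    pa = [tuple(c * s for c, s in zip(p, spacing)) for p in a]
    pb = [tuple(c * s for c, s in zip(p, spacing)) for p in b]

    def directed(src, dst):
        # Largest distance from any point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(pa, pb), directed(pb, pa))

# Toy masks: a 2x2x1 block and the same block shifted by one voxel along x.
reference = {(x, y, 0) for x in range(2) for y in range(2)}
prediction = {(x + 1, y, 0) for x in range(2) for y in range(2)}

print(dice(reference, prediction))       # 0.5
print(hausdorff(reference, prediction))  # 1.0
```

Dice rewards overlap, while Hausdorff distance reports the single worst boundary disagreement, so the two can diverge when a segmentation is mostly right but has one distant outlier.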
What carries the argument
The 3D nnU-Net architecture optimized for multi-channel DIXON MRI segmentation, trained on 13,050 annotated slices from 145 subjects.
If this is right
- Automated volumetry becomes feasible at population scale for male reproductive health studies.
- Internal penile components can now be quantified alongside external measurements.
- High reproducibility supports longitudinal tracking of anatomical changes.
- The open model weights enable replication and extension in urological imaging research.
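The volumetry step implied by the first point reduces to counting foreground voxels and multiplying by the physical voxel size. The sketch below shows that arithmetic on a toy mask. The helper names and the 2 x 2 x 5 mm spacing are hypothetical and are not the UK Biobank acquisition parameters.

```python
def count_foreground(mask):
    """Count voxels labelled 1 in a nested-list 3D binary mask."""
    return sum(v for plane in mask for row in plane for v in row)

def volume_ml(n_voxels, spacing_mm):
    """Convert a voxel count to millilitres given voxel spacing in mm."""
    sx, sy, sz = spacing_mm
    voxel_mm3 = sx * sy * sz
    return n_voxels * voxel_mm3 / 1000.0  # 1 mL = 1000 mm^3

# Toy 2x2x2 segmentation with four foreground voxels.
toy_mask = [
    [[0, 1], [1, 1]],
    [[0, 0], [1, 0]],
]
n_voxels = count_foreground(toy_mask)
toy_volume = volume_ml(n_voxels, (2.0, 2.0, 5.0))  # hypothetical spacing
print(n_voxels, toy_volume)
```

Because volume scales linearly with the voxel count, any systematic over- or under-segmentation carries straight through to the reported phenotype.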
Where Pith is reading between the lines
- Combining these volumes with genetic or clinical data could reveal new associations with reproductive disorders.
- Similar segmentation approaches might apply to other soft-tissue organs in large MRI cohorts.
- Clinical translation could standardize assessment of conditions like micropenis or erectile dysfunction.
Load-bearing premise
The small training and test sets from the UK Biobank are representative and free of annotation or demographic biases that would affect performance when scaled to the full population.
What would settle it
A significant drop in Dice score or increase in Hausdorff distance when the model is tested on a new set of MRI scans from a different demographic group or scanner type.
Original abstract
Penile measurement is clinically relevant across male reproductive and urogenital health, including conditions such as micropenis, congenital and endocrine disorders, and sexual or urinary dysfunction. However, quantitative assessment of penile size has relied mainly on external length or circumference measurements, which are difficult to standardize, sensitive to measurement conditions, and unable to capture the internal portion of the penis. MRI enables volumetric assessment of the whole penis in vivo, but automated segmentation has not previously been established at population scale. Automated whole-organ volumetry would enable high-throughput phenotyping for multi-omics and clinical studies of male reproductive disease. Here, we present a deep learning framework for whole-penis segmentation in multi-channel DIXON MRI. Using a newly curated expert-annotated training dataset ($n = 145$ subjects; $13,050$ annotated slices) and a double-annotated independent test benchmark ($n = 24$ subjects; $2,160$ double-annotated slices), we optimized a 3D nnU-Net architecture. The model achieved a 5-fold cross-validation Dice score of $0.90$ and performed at observer-level accuracy on the independent test set (Dice: $0.92$; Hausdorff distance: $3.58$). We deployed the model in $34,412$ UK Biobank participants, enabling automated quantification of total penile tissue, including both external and internal components. Longitudinal evaluation in 2,282 men demonstrated high inter-session reproducibility ($r = 0.87$). This framework establishes a reproducible and population-scalable method for MRI-based assessment of penile anatomy and provides an open technical resource for future studies in urological imaging and male reproductive health. The trained model weights will be publicly released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a 3D nnU-Net framework for automated whole-penis segmentation in multi-channel DIXON MRI. It uses a curated training set of 145 subjects (13,050 slices) with 5-fold CV Dice of 0.90, reports observer-level performance on a double-annotated independent test set of 24 subjects (Dice 0.92, Hausdorff distance 3.58), and deploys the model to quantify penile tissue volume in 34,412 UK Biobank participants, with longitudinal reproducibility r=0.87 in 2,282 men. The central claim is that this establishes a reproducible, population-scalable method for MRI-based penile volumetry and phenotyping in male reproductive health, with public release of model weights.
Significance. If the reported generalization holds, the work supplies the first automated, high-throughput pipeline for whole-organ penile volumetry at population scale, addressing a gap where prior assessments relied on non-standardized external measurements. Credit is due for the double-annotated test benchmark, longitudinal reproducibility evaluation, and commitment to releasing trained weights. The result would directly enable multi-omics and clinical studies if the performance metrics translate without systematic bias across the full cohort.
major comments (1)
- [Abstract and Results] The headline claim of reliable deployment to 34,412 participants rests on performance metrics from an independent test set of only 24 subjects. No subgroup performance breakdowns (by age, BMI, ethnicity, or anatomical variants), failure-case analysis, or explicit checks for distribution shift between the 145+24 subjects and the full UK Biobank cohort are described. With this sample size, even double annotation cannot guarantee coverage of the demographic and anatomical range needed to support population-scale quantitative phenotyping without undetected systematic error in volumetry.
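One way to make the sample-size concern concrete is a percentile bootstrap on per-subject Dice scores, which shows how wide the uncertainty on a mean can be with only 24 subjects. The sketch below uses invented per-subject scores chosen to average roughly 0.92; they are not the paper's data, and `bootstrap_ci` is a hypothetical helper, not part of the authors' pipeline.

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]

# Hypothetical per-subject Dice scores for a 24-subject test set.
dice_scores = [
    0.95, 0.93, 0.94, 0.91, 0.92, 0.90, 0.96, 0.89,
    0.93, 0.92, 0.94, 0.88, 0.91, 0.95, 0.93, 0.90,
    0.92, 0.94, 0.87, 0.93, 0.91, 0.92, 0.95, 0.90,
]
mean_dice = statistics.fmean(dice_scores)
lo, hi = bootstrap_ci(dice_scores)
print(f"mean Dice {mean_dice:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

A narrow interval on the mean would still say nothing about rare anatomical variants or subgroups absent from the 24 subjects, which is the referee's underlying point.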
minor comments (2)
- [Methods] The abstract states that a 3D nnU-Net was 'optimized' but provides no summary of hyperparameter search, data exclusion criteria, or preprocessing steps; these details are needed for reproducibility even if present in the full text.
- [Data] The manuscript should clarify whether the 145 training and 24 test subjects were drawn from the same UK Biobank imaging protocol and demographic pool as the 34,412 deployment set.
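A simple first check on whether the annotated subjects resemble the deployment cohort is the standardized mean difference (SMD) of covariates such as age or BMI. The sketch below is an assumption about how such a check might look, not something the manuscript reports; the ages are invented, and the 0.1 threshold mentioned in the comment is a common rule of thumb rather than a standard the authors adopted.

```python
import math
import statistics

def standardized_mean_difference(sample_a, sample_b):
    """Difference in means divided by the pooled standard deviation."""
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    pooled_sd = math.sqrt((var_a + var_b) / 2.0)
    return (mean_a - mean_b) / pooled_sd

# Hypothetical ages (years) for annotated subjects and the deployment cohort.
annotated_age = [50, 55, 60, 65, 70]
cohort_age = [52, 57, 62, 67, 72]

smd = standardized_mean_difference(annotated_age, cohort_age)
# An absolute SMD above about 0.1 is often read as a notable imbalance.
print(round(smd, 3))
```

Repeating the calculation per covariate gives a compact table of where the annotated subset and the full cohort differ most.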
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comment regarding test-set size, subgroup analysis, and generalizability below, and propose targeted revisions.
-
Referee: [Abstract and Results] The headline claim of reliable deployment to 34,412 participants rests on performance metrics from an independent test set of only 24 subjects. No subgroup performance breakdowns (by age, BMI, ethnicity, or anatomical variants), failure-case analysis, or explicit checks for distribution shift between the 145+24 subjects and the full UK Biobank cohort are described. With this sample size, even double annotation cannot guarantee coverage of the demographic and anatomical range needed to support population-scale quantitative phenotyping without undetected systematic error in volumetry.
Authors: We agree that the independent test set (n=24, double-annotated) is modest in size and that the manuscript does not include subgroup breakdowns, failure-case analysis, or formal distribution-shift tests against the full UK Biobank cohort. The test-set size was deliberately limited to enable exhaustive double annotation, yielding observer-level performance (Dice 0.92). Supporting evidence for deployment comes from the longitudinal reproducibility analysis performed directly on 2,282 repeat UK Biobank scans (r=0.87), which provides an internal consistency check across the target population. We acknowledge that this does not fully substitute for demographic stratification or shift detection. In revision we will add a dedicated limitations paragraph in the Discussion that explicitly states the modest test-set size, the lack of subgroup and failure-case reporting, and the impossibility of quantifying systematic volumetric bias without ground-truth labels on the full cohort. We cannot conduct new subgroup analyses or failure-case reviews without additional expert annotations, which are outside the scope of the current study.
Revision: partial
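The rebuttal's concession that reproducibility cannot expose systematic bias can be illustrated numerically: Pearson's r is unchanged when both sessions are scaled by the same factor. In the sketch below the session volumes and the 10% under-segmentation factor are invented for illustration, and `pearson_r` is a plain textbook implementation rather than the authors' analysis code.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical volumes (mL) for five men scanned in two sessions.
session_1 = [40.0, 45.0, 50.0, 55.0, 60.0]
session_2 = [41.0, 44.0, 52.0, 54.0, 61.0]
r_measured = pearson_r(session_1, session_2)

# Apply the same hypothetical 10% under-segmentation to both sessions.
biased_1 = [0.9 * v for v in session_1]
biased_2 = [0.9 * v for v in session_2]
r_biased = pearson_r(biased_1, biased_2)

print(round(r_measured, 4), round(r_biased, 4))
```

The two correlations agree to floating-point precision even though every biased volume is 10% too small, so a high r supports consistency but not accuracy.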
Circularity Check
No significant circularity; empirical ML evaluation on independent held-out data
The paper presents a standard supervised segmentation task using nnU-Net trained on 145 expert-annotated subjects and evaluated on a separate double-annotated test set of 24 subjects. Reported metrics (5-fold CV Dice 0.90; test Dice 0.92, HD 3.58) are direct empirical measurements on held-out data, not derived quantities that reduce to the training inputs by construction. Deployment to 34,412 UK Biobank scans is a forward application of the trained model with no intervening mathematical derivation, uniqueness theorem, or self-citation chain. No equations, fitted parameters renamed as predictions, or ansatzes appear in the described pipeline. The work is therefore self-contained as an empirical ML study.
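The held-out argument depends on splitting by subject rather than by slice, so that slices from one person never appear in both training and validation. The sketch below shows one way a subject-level 5-fold partition could be built. The subject identifiers and the helper are hypothetical, and this is not nnU-Net's internal splitting code. The reported counts are consistent with 90 annotated slices per subject (13,050 / 145 = 90), assuming slices are evenly distributed.

```python
import random

def subject_level_folds(subject_ids, n_folds=5, seed=0):
    """Partition unique subject IDs into n_folds disjoint folds."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    # Striding through the shuffled list yields near-equal fold sizes.
    return [ids[i::n_folds] for i in range(n_folds)]

# Hypothetical identifiers for the 145 training subjects.
subjects = [f"subj_{i:03d}" for i in range(145)]
folds = subject_level_folds(subjects)

for k, validation in enumerate(folds):
    held_out = set(validation)
    training = [s for s in subjects if s not in held_out]
    print(f"fold {k}: {len(training)} train subjects, {len(validation)} validation")
```

With 145 subjects each fold holds out 29, and every slice belonging to a held-out subject stays out of training for that fold.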
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: expert manual annotations constitute accurate ground truth.