Population-Scale Segmentation of Penile Tissue in DIXON MRI using Deep Learning for Quantitative Phenotyping in Male Reproductive Health

Alexander Siegfried Busch; Benjamin Risse; Gunnar Paul Kordes; Jan Ernsting; Lynn Ogoniak; Nils Johannaber; Tim Hahn; Wolfgang Roll

arXiv: 2607.02127 · v1 · pith:UJ5STJKDnew · submitted 2026-07-02 · eess.IV · cs.CV · cs.LG

Jan Ernsting, Gunnar Paul Kordes, Nils Johannaber, Lynn Ogoniak, Wolfgang Roll, Tim Hahn, Alexander Siegfried Busch, Benjamin Risse

Pith reviewed 2026-07-03 03:52 UTC · model grok-4.3

classification eess.IV · cs.CV · cs.LG

keywords deep learning · MRI segmentation · penile tissue · UK Biobank · DIXON MRI · quantitative phenotyping · nnU-Net · male reproductive health

The pith

Deep learning model achieves observer-level accuracy in segmenting the whole penis from DIXON MRI scans and quantifies tissue in 34,412 UK Biobank participants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops and validates a deep learning model that automatically segments the entire penis, including its internal portion, in multi-channel DIXON MRI images. It uses a 3D nnU-Net trained on 145 annotated subjects and tested on 24 double-annotated cases, reaching a Dice score of 0.92, on par with human observers. The model is then applied to 34,412 UK Biobank participants to measure penile tissue volume at scale. This enables reproducible phenotyping for studies of male reproductive health, which was previously limited by manual methods. A longitudinal evaluation in 2,282 men shows high inter-session reproducibility (r = 0.87).

Core claim

A 3D nnU-Net trained on expert-annotated DIXON MRI data segments the full penile tissue volume with a Dice coefficient of 0.92 and a Hausdorff distance of 3.58 (unit not given in the abstract) on an independent test set of 24 subjects, matching inter-observer performance. Deployed on the UK Biobank, it yields total penile tissue volumes for 34,412 participants, with an inter-session reproducibility of r = 0.87 in a longitudinal evaluation of 2,282 men.
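The two metrics in the core claim can be illustrated on toy data. The following is a minimal sketch, not the authors' evaluation code: masks are represented as sets of voxel coordinates, the Hausdorff distance is taken over all mask voxels rather than surface voxels (a simplification), and the `spacing` parameter and the two cube-shaped masks are assumptions made for illustration.

```python
from itertools import product
from math import dist

Voxel = tuple[int, int, int]


def dice(a: set[Voxel], b: set[Voxel]) -> float:
    """Dice coefficient 2|A∩B| / (|A| + |B|); 1.0 when both masks are empty."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def hausdorff(a: set[Voxel], b: set[Voxel], spacing=(1.0, 1.0, 1.0)) -> float:
    """Symmetric Hausdorff distance between two non-empty voxel sets.

    `spacing` scales voxel indices to physical units (assumed, per axis).
    Brute force, so only suitable for small toy masks.
    """
    if not a or not b:
        raise ValueError("Hausdorff distance is undefined for an empty mask")
    pa = [tuple(c * s for c, s in zip(v, spacing)) for v in a]
    pb = [tuple(c * s for c, s in zip(v, spacing)) for v in b]

    def directed(p, q):
        # Largest distance from a point in p to its nearest neighbour in q.
        return max(min(dist(x, y) for y in q) for x in p)

    return max(directed(pa, pb), directed(pb, pa))


# Toy masks (hypothetical): two 4x4x4 cubes, shifted by one voxel along x.
reference = set(product(range(0, 4), range(0, 4), range(0, 4)))
predicted = set(product(range(1, 5), range(0, 4), range(0, 4)))

print(dice(reference, predicted))       # 48 shared voxels out of 64 + 64
print(hausdorff(reference, predicted))  # one-voxel shift, unit spacing
```

Because the Hausdorff distance is a worst-case measure, a single stray voxel can dominate it, which is one reason it is usually reported alongside an overlap metric such as Dice.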

What carries the argument

The 3D nnU-Net architecture optimized for multi-channel DIXON MRI segmentation, trained on 13,050 annotated slices from 145 subjects.

If this is right

Automated volumetry becomes feasible at population scale for male reproductive health studies.
Internal penile components can now be quantified alongside external measurements.
High reproducibility supports longitudinal tracking of anatomical changes.
The model weights, once publicly released as promised, would enable replication and extension in urological imaging research.
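Volumetry from a segmentation mask reduces to counting foreground voxels and multiplying by the physical voxel volume. The sketch below shows that arithmetic only; the function name, the voxel count, and the voxel spacing are hypothetical and are not taken from the paper or the UK Biobank protocol.

```python
def mask_volume_ml(voxel_count: int, spacing_mm: tuple[float, float, float]) -> float:
    """Volume of a binary mask in millilitres.

    voxel_count: number of foreground voxels in the predicted mask.
    spacing_mm:  voxel edge lengths in millimetres along each axis (assumed).
    """
    dx, dy, dz = spacing_mm
    voxel_volume_mm3 = dx * dy * dz
    return voxel_count * voxel_volume_mm3 / 1000.0  # 1 mL = 1000 mm^3


# Hypothetical example: 5,000 foreground voxels at 2.0 x 2.0 x 3.0 mm spacing.
example_volume = mask_volume_ml(5000, (2.0, 2.0, 3.0))
print(f"{example_volume:.1f} mL")
```

Since the voxel volume is a fixed multiplier, any systematic over- or under-segmentation translates directly into a proportional volume error.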

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Combining these volumes with genetic or clinical data could reveal new associations with reproductive disorders.
Similar segmentation approaches might apply to other soft-tissue organs in large MRI cohorts.
Clinical translation could standardize assessment of conditions like micropenis or erectile dysfunction.

Load-bearing premise

The small training and test sets from the UK Biobank are representative and free of annotation or demographic biases that would affect performance when scaled to the full population.

What would settle it

A significant drop in Dice score or increase in Hausdorff distance when the model is tested on a new set of MRI scans from a different demographic group or scanner type.

Figures

Figures reproduced from arXiv: 2607.02127 by Alexander Siegfried Busch, Benjamin Risse, Gunnar Paul Kordes, Jan Ernsting, Lynn Ogoniak, Nils Johannaber, Tim Hahn, Wolfgang Roll.

**Figure 2.** Example annotations of penile tissue in UK Biobank

**Figure 3.** Examples of nnU-Net predictions. Axial views highlighting the internal (1) and external penile compartments (2), along with ...

Original abstract

Penile measurement is clinically relevant across male reproductive and urogenital health, including conditions such as micropenis, congenital and endocrine disorders, and sexual or urinary dysfunction. However, quantitative assessment of penile size has relied mainly on external length or circumference measurements, which are difficult to standardize, sensitive to measurement conditions, and unable to capture the internal portion of the penis. MRI enables volumetric assessment of the whole penis in vivo, but automated segmentation has not previously been established at population scale. Automated whole-organ volumetry would enable high-throughput phenotyping for multi-omics and clinical studies of male reproductive disease. Here, we present a deep learning framework for whole-penis segmentation in multi-channel DIXON MRI. Using a newly curated expert-annotated training dataset ($n = 145$ subjects; $13,050$ annotated slices) and a double-annotated independent test benchmark ($n = 24$ subjects; $2,160$ double-annotated slices), we optimized a 3D nnU-Net architecture. The model achieved a 5-fold cross-validation Dice score of $0.90$ and performed at observer-level accuracy on the independent test set (Dice: $0.92$; Hausdorff distance: $3.58$). We deployed the model in $34,412$ UK Biobank participants, enabling automated quantification of total penile tissue, including both external and internal components. Longitudinal evaluation in 2,282 men demonstrated high inter-session reproducibility ($r = 0.87$). This framework establishes a reproducible and population-scalable method for MRI-based assessment of penile anatomy and provides an open technical resource for future studies in urological imaging and male reproductive health. The trained model weights will be publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a practical DL pipeline for whole-penis segmentation on DIXON MRI that scales to 34k UK Biobank cases, but the 24-subject test set is too small to support the generalization claim.

The core result is a 3D nnU-Net that segments penile tissue in multi-channel DIXON MRI. It reaches Dice 0.92 and Hausdorff 3.58 on a double-annotated test set of 24 subjects, then produces volumes for 34,412 UK Biobank participants with longitudinal reproducibility r=0.87. The model weights are promised to be released.

What is actually new is the jump to population scale. Earlier work stayed at small manual cohorts or external measurements; this is the first automated whole-organ volumetry run on tens of thousands of scans in this modality.

The execution looks solid on the numbers reported. They used expert annotation on 13k slices for training, ran 5-fold CV at Dice 0.90, and hit observer-level performance on the held-out set. Releasing the weights is a concrete plus for anyone who wants to replicate or extend it.

The soft spot is the test set size. Twenty-four subjects, even double-annotated, cannot cover the range of age, BMI, ethnicity, or anatomical variation in 34k UK Biobank men. No subgroup breakdowns, no failure-case review, and no external validation set are described. That leaves open the possibility of systematic under- or over-estimation once the model leaves the narrow test distribution. Training details and exclusion criteria are also missing from the abstract, which makes it hard to judge selection bias.
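One way to make the test-set-size concern concrete is to look at how wide a confidence interval on the mean Dice is with only 24 subjects. The sketch below uses a percentile bootstrap; the per-subject scores are synthetic (drawn around 0.92 purely for illustration) and do not come from the paper, and `bootstrap_mean_ci` is a hypothetical helper.

```python
import random
from statistics import mean


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample subjects with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# SYNTHETIC per-subject Dice scores for 24 subjects (not the paper's data).
_rng = random.Random(42)
dice_scores = [min(0.99, max(0.50, _rng.gauss(0.92, 0.03))) for _ in range(24)]

ci_low, ci_high = bootstrap_mean_ci(dice_scores)
print(f"mean Dice {mean(dice_scores):.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

The interval only describes uncertainty about the mean within the sampled distribution; it says nothing about subgroups that the 24 subjects do not cover.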

This paper is for groups doing large-scale male reproductive phenotyping or anyone needing a ready-made MRI segmentation tool for UK Biobank-style data. A reader already working in urological imaging or multi-omics studies would get immediate value from the volumes and the released model.

It deserves peer review. The scale is real and the application is straightforward, so referees can focus on whether the validation is sufficient rather than whether the idea has merit.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a 3D nnU-Net framework for automated whole-penis segmentation in multi-channel DIXON MRI. It uses a curated training set of 145 subjects (13,050 slices) with 5-fold CV Dice of 0.90, reports observer-level performance on a double-annotated independent test set of 24 subjects (Dice 0.92, Hausdorff distance 3.58), and deploys the model to quantify penile tissue volume in 34,412 UK Biobank participants, with longitudinal reproducibility r=0.87 in 2,282 men. The central claim is that this establishes a reproducible, population-scalable method for MRI-based penile volumetry and phenotyping in male reproductive health, with public release of model weights.

Significance. If the reported generalization holds, the work supplies the first automated, high-throughput pipeline for whole-organ penile volumetry at population scale, addressing a gap where prior assessments relied on non-standardized external measurements. Credit is due for the double-annotated test benchmark, longitudinal reproducibility evaluation, and commitment to releasing trained weights. The result would directly enable multi-omics and clinical studies if the performance metrics translate without systematic bias across the full cohort.

major comments (1)

[Abstract and Results] The headline claim of reliable deployment to 34,412 participants rests on performance metrics from an independent test set of only 24 subjects. No subgroup performance breakdowns (by age, BMI, ethnicity, or anatomical variants), failure-case analysis, or explicit checks for distribution shift between the 145+24 subjects and the full UK Biobank cohort are described. With this sample size, even double annotation cannot guarantee coverage of the demographic and anatomical range needed to support population-scale quantitative phenotyping without undetected systematic error in volumetry.
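The subgroup breakdown requested here could take a form like the following sketch, which groups per-subject Dice scores by a covariate and reports the count, mean, and worst case per group. The BMI categories, the scores, and the `dice_by_subgroup` helper are all hypothetical; the paper's abstract reports no such data.

```python
from collections import defaultdict
from statistics import mean


def dice_by_subgroup(records):
    """Summarise (subgroup, dice) pairs as {subgroup: (n, mean, minimum)}."""
    groups = defaultdict(list)
    for subgroup, score in records:
        groups[subgroup].append(score)
    return {g: (len(s), mean(s), min(s)) for g, s in sorted(groups.items())}


# HYPOTHETICAL test-set records: (BMI category, per-subject Dice).
records = [
    ("bmi<25", 0.93), ("bmi<25", 0.91),
    ("bmi25-30", 0.92),
    ("bmi>=30", 0.88), ("bmi>=30", 0.90),
]

for group, (n, avg, worst) in dice_by_subgroup(records).items():
    print(f"{group:>9}: n={n}  mean={avg:.3f}  min={worst:.3f}")
```

With 24 subjects split across several categories, some groups would contain only a handful of cases, which is the referee's point about coverage.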

minor comments (2)

[Methods] The abstract states that a 3D nnU-Net was 'optimized' but provides no summary of hyperparameter search, data exclusion criteria, or preprocessing steps; these details are needed for reproducibility even if present in the full text.
[Data] The manuscript should clarify whether the 145 training and 24 test subjects were drawn from the same UK Biobank imaging protocol and demographic pool as the 34,412 deployment set.
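A simple check of whether the annotated subjects resemble the deployment cohort on a covariate is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distribution functions. The sketch below is one possible check, not something the paper describes; the age values are hypothetical and `ks_statistic` is an illustrative helper.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|.

    Both empirical CDFs are step functions that change only at observed
    values, so it suffices to evaluate the gap at every data point.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        f_a = bisect_right(a, x) / len(a)
        f_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(f_a - f_b))
    return gap


# HYPOTHETICAL ages: annotated subset versus a sample of the deployment cohort.
annotated_ages = [52, 55, 58, 61, 63, 66, 70]
deployment_ages = [48, 51, 54, 57, 60, 62, 65, 68, 71, 74]

print(f"KS statistic for age: {ks_statistic(annotated_ages, deployment_ages):.2f}")
```

A value near 0 suggests similar distributions and a value near 1 suggests almost no overlap; a significance test would additionally need the sample sizes.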

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comment regarding test-set size, subgroup analysis, and generalizability below, and propose targeted revisions.

Referee: [Abstract and Results] The headline claim of reliable deployment to 34,412 participants rests on performance metrics from an independent test set of only 24 subjects. No subgroup performance breakdowns (by age, BMI, ethnicity, or anatomical variants), failure-case analysis, or explicit checks for distribution shift between the 145+24 subjects and the full UK Biobank cohort are described. With this sample size, even double annotation cannot guarantee coverage of the demographic and anatomical range needed to support population-scale quantitative phenotyping without undetected systematic error in volumetry.

Authors: We agree that the independent test set (n=24, double-annotated) is modest in size and that the manuscript does not include subgroup breakdowns, failure-case analysis, or formal distribution-shift tests against the full UK Biobank cohort. The test-set size was deliberately limited to enable exhaustive double annotation, yielding observer-level performance (Dice 0.92). Supporting evidence for deployment comes from the longitudinal reproducibility analysis performed directly on 2,282 repeat UK Biobank scans (r=0.87), which provides an internal consistency check across the target population. We acknowledge that this does not fully substitute for demographic stratification or shift detection. In revision we will add a dedicated limitations paragraph in the Discussion that explicitly states the modest test-set size, the lack of subgroup and failure-case reporting, and the impossibility of quantifying systematic volumetric bias without ground-truth labels on the full cohort. We cannot conduct new subgroup analyses or failure-case reviews without additional expert annotations, which are outside the scope of the current study.

revision: partial
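The limits of a correlation-based reproducibility check can be shown with a small example: Pearson's r is unchanged by a constant offset, so a high r between sessions does not by itself rule out a systematic shift. The sketch below uses hypothetical volumes and illustrative helper names; it is not the authors' analysis.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between paired measurements."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_bias(x, y):
    """Mean paired difference (y - x), the bias term of a Bland-Altman analysis."""
    return mean(b - a for a, b in zip(x, y))


# HYPOTHETICAL volumes in mL: session 2 is session 1 plus a constant 5 mL.
session_1 = [40.0, 55.0, 62.0, 48.0, 70.0]
session_2 = [v + 5.0 for v in session_1]

print(f"r = {pearson_r(session_1, session_2):.3f}")
print(f"bias = {mean_bias(session_1, session_2):.1f} mL")
```

Here the correlation is perfect although every second measurement is 5 mL larger. Reporting the mean paired difference alongside r would expose that kind of shift, while agreement with ground truth still requires annotated labels.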

Circularity Check

0 steps flagged

No significant circularity; empirical ML evaluation on independent held-out data

The paper presents a standard supervised segmentation task using nnU-Net trained on 145 expert-annotated subjects and evaluated on a separate double-annotated test set of 24 subjects. Reported metrics (5-fold CV Dice 0.90; test Dice 0.92, HD 3.58) are direct empirical measurements on held-out data, not derived quantities that reduce to the training inputs by construction. Deployment to 34,412 UK Biobank scans is a forward application of the trained model with no intervening mathematical derivation, uniqueness theorem, or self-citation chain. No equations, fitted parameters renamed as predictions, or ansatzes appear in the described pipeline. The work is therefore self-contained as an empirical ML study.
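The held-out property this check relies on depends on splitting at the subject level, so that no subject's slices appear in both the training and the validation part of a fold. The sketch below shows such a split for 145 subjects and 5 folds; it is an independent illustration with hypothetical subject IDs, not nnU-Net's own splitting code. As a side note, 13,050 slices over 145 subjects works out to an average of 90 slices per subject.

```python
import random


def subject_folds(subject_ids, n_folds=5, seed=0):
    """Shuffle subject IDs and deal them round-robin into `n_folds` folds."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


# HYPOTHETICAL subject identifiers for the 145 annotated training subjects.
subjects = [f"subj_{i:03d}" for i in range(145)]
folds = subject_folds(subjects)

for k, validation in enumerate(folds):
    training = [s for j, fold in enumerate(folds) if j != k for s in fold]
    # No subject may appear on both sides of a fold.
    assert not set(training) & set(validation)
    print(f"fold {k}: {len(training)} train / {len(validation)} validation subjects")

print(13050 / 145)  # average annotated slices per subject
```

Splitting by slice instead of by subject would place neighbouring slices of the same person on both sides and inflate the cross-validation score.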

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; standard deep-learning assumptions apply but cannot be audited in detail.

axioms (1)

Domain assumption: Expert manual annotations constitute accurate ground truth.
The model is trained and evaluated against these labels.

pith-pipeline@v0.9.1-grok · 5883 in / 1175 out tokens · 24461 ms · 2026-07-03T03:52:21.456410+00:00 · methodology

