Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words
Pith reviewed 2026-07-03 14:48 UTC · model grok-4.3
The pith
Contextualized embeddings predict spoken durations of Mandarin words above chance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Contextualized embeddings are predictive for duration, above chance level, not only at the type level, but also at the level of individual tokens, as indicated by the results obtained with the type-wise and token-wise permutation baselines. The predicted durations are sufficiently precise to back-transform predicted f0 contours in [0,1] normalized time to contours on the ms time scale. The resulting predicted contours approximate empirical contours and also outperform a permutation baseline.
What carries the argument
Contextualized embeddings (CEs) from language models, applied in regression to predict duration and rescale normalized f0 contours to actual milliseconds.
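To make the regression step concrete, here is a minimal sketch of mapping embedding vectors to durations. It assumes a closed-form ridge regression and uses synthetic stand-in data; the function names, the penalty `alpha`, the embedding dimension, and the train/test split are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept (illustrative)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                      # center features so the intercept is not penalized
    yc = y - y_mean
    d = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def predict(X, w, b):
    """Predict durations from embeddings with fitted weights and intercept."""
    return X @ w + b

# Synthetic stand-in for embeddings and durations in ms (NOT the paper's data).
rng = np.random.default_rng(0)
n_tokens, dim = 200, 16
X = rng.standard_normal((n_tokens, dim))
true_w = rng.standard_normal(dim)
y = 150.0 + 20.0 * (X @ true_w) + rng.normal(0.0, 5.0, n_tokens)

# Fit on the first 150 tokens, evaluate on the remaining 50.
w, b = fit_ridge(X[:150], y[:150], alpha=1.0)
pred = predict(X[150:], w, b)
r = np.corrcoef(pred, y[150:])[0, 1]     # Pearson correlation on held-out tokens
```

With real contextualized embeddings the held-out correlation would be far lower than on this synthetic example, which is why comparison against a baseline matters more than the raw value.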
If this is right
- CEs predict duration above chance at both type and individual token levels.
- Predicted durations enable back-transformation of normalized f0 contours to ms-scale contours that approximate empirical data.
- The approach applies to 7470 tokens of monosyllabic CV words from spontaneous speech.
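The back-transformation in the second point amounts to rescaling the time axis: a point at normalized time t in [0,1] lands at t times the predicted duration in ms. A minimal sketch, assuming the contour is sampled at a fixed set of normalized time points; the function name and the toy values are hypothetical.

```python
def to_ms_scale(norm_times, f0_values, duration_ms):
    """Rescale a contour from [0,1] normalized time to milliseconds.

    Returns a list of (time_ms, f0) pairs. The f0 values are unchanged;
    only the time axis is stretched by the (predicted) duration.
    """
    if duration_ms <= 0:
        raise ValueError("duration_ms must be positive")
    if len(norm_times) != len(f0_values):
        raise ValueError("norm_times and f0_values must have equal length")
    return [(t * duration_ms, f0) for t, f0 in zip(norm_times, f0_values)]

# Toy contour (made-up values, not from the paper).
norm_times = [0.0, 0.25, 0.5, 0.75, 1.0]
f0_hz = [210.0, 205.0, 198.0, 190.0, 185.0]
contour_ms = to_ms_scale(norm_times, f0_hz, duration_ms=160.0)
```

Because the time axis scales linearly with duration, any error in the predicted duration stretches or compresses the whole contour, which is why duration precision matters for the ms-scale comparison.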
Where Pith is reading between the lines
- The same embedding-based method might be tested for predicting other prosodic features such as intensity.
- Extending the approach to polysyllabic words or non-tonal languages would check whether duration prediction generalizes beyond the current monosyllabic Mandarin case.
- Combining duration prediction with embedding-derived pitch could support more accurate computational models of conversational speech timing.
Load-bearing premise
The type-wise and token-wise permutation baselines adequately establish that performance exceeds chance without bias from data selection, embedding dimensionality, or the specific spontaneous-speech corpus used.
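Two shuffling schemes of the kind this premise depends on can be sketched as follows. One permutes durations across all tokens; the other permutes them only among tokens of the same word type, which keeps type-level information and destroys token-level information. How these map onto the paper's "type-wise" and "token-wise" baselines is an assumption here, since this page does not define them; the function names and toy data are hypothetical.

```python
import random
from collections import defaultdict

def permute_globally(durations, seed=0):
    """Shuffle durations across all tokens, breaking every embedding-duration link."""
    rng = random.Random(seed)
    shuffled = list(durations)
    rng.shuffle(shuffled)
    return shuffled

def permute_within_type(types, durations, seed=0):
    """Shuffle durations only among tokens that share a word type."""
    rng = random.Random(seed)
    idx_by_type = defaultdict(list)
    for i, t in enumerate(types):
        idx_by_type[t].append(i)
    out = list(durations)
    for indices in idx_by_type.values():
        values = [durations[i] for i in indices]
        rng.shuffle(values)
        for i, v in zip(indices, values):
            out[i] = v
    return out

# Toy tokens (romanized labels and durations are made up).
types = ["de", "de", "de", "shi", "shi", "ta"]
durations = [80.0, 95.0, 110.0, 150.0, 170.0, 200.0]
global_perm = permute_globally(durations)
within_perm = permute_within_type(types, durations)
```

A model trained on within-type shuffled targets can still learn each type's typical duration, so beating that baseline is the stricter test of token-level information.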
What would settle it
A replication on new spontaneous Mandarin speech tokens in which embedding-based duration predictions fail to beat the token-wise permutation baseline would falsify the central claim.
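One way such a replication could compare embedding-based predictions with a baseline is a paired t-test over per-fold correlations. The sketch below assumes ten folds and a paired design; the per-fold values are invented for illustration and are not results from the paper, and the paper's own test may differ.

```python
import math

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two equal-length samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance of differences
    if var == 0:
        raise ValueError("differences have zero variance")
    return mean / math.sqrt(var / n), n - 1

# Hypothetical per-fold correlations (illustrative only).
empirical = [0.54, 0.52, 0.55, 0.53, 0.54, 0.52, 0.55, 0.53, 0.54, 0.53]
baseline = [0.50, 0.49, 0.52, 0.50, 0.51, 0.50, 0.51, 0.50, 0.51, 0.50]
t_stat, df = paired_t(empirical, baseline)
```

A t statistic near zero, or one with the wrong sign, on new data would be the kind of outcome that counts against the central claim.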
Original abstract
Time-normalized f0 contours of Mandarin words in conversational speech have been shown to be predictable in part from their contextualized embeddings (CEs). The present study investigates whether CEs also predict spoken word duration for 7470 tokens of Mandarin monosyllabic CV words extracted from a Mandarin corpus of spontaneous speech. We show that CEs indeed are predictive for duration, above chance level, not only at the type level, but also at the level of individual tokens, as indicated by the results obtained with the type-wise and token-wise permutation baselines. We also show that the predicted durations are sufficiently precise to back-transform predicted f0 contours in [0,1] normalized time to contours on the ms time scale. The resulting predicted contours approximate empirical contours and also outperform a permutation baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that contextualized embeddings (CEs) predict spoken word duration for 7470 tokens of Mandarin monosyllabic CV words extracted from spontaneous speech. It reports above-chance prediction at both type and token levels, supported by type-wise and token-wise permutation baselines, and shows that predicted durations are precise enough to back-transform [0,1]-normalized f0 contours to the ms time scale, yielding approximations to empirical contours that outperform a permutation baseline.
Significance. If the central results hold under rigorous controls, the work would extend prior findings on CE-based f0 contour prediction to duration, indicating that CEs encode multiple prosodic dimensions in conversational Mandarin. The back-transformation result would be a notable strength, providing a direct link between duration prediction and improved contour accuracy on the physical time scale.
Major comments (2)
- Abstract: the claim that CEs are predictive 'above chance level' at the token level rests entirely on the type-wise and token-wise permutation baselines, yet no details are given on baseline construction (e.g., whether token-wise shuffling preserves type multiplicity, whether cross-validation respects token-type structure, or whether a matched-dimensionality control such as random vectors of equal dimension is included). Without these specifics, it remains possible that any sufficiently rich vector representation would produce the reported signal, undermining the attribution to CEs specifically.
- Abstract: no information is supplied on model architecture, exact evaluation metrics, data splits, or the spontaneous-speech corpus extraction procedure. These omissions are load-bearing because the soundness of the above-chance and back-transformation claims cannot be assessed without them.
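The matched-dimensionality control raised in the first comment can be sketched as random vectors with the same number of tokens and dimensions as the embeddings, run through the identical regression and evaluation pipeline. The function name and the Gaussian choice are assumptions for illustration; the page does not say how the authors would construct such a control.

```python
import random

def random_control_vectors(n_tokens, dim, seed=0):
    """Gaussian random vectors matching the embedding matrix in shape only."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]

# Tiny example; real use would match the actual token count and embedding size.
control = random_control_vectors(n_tokens=5, dim=4)
```

If the random vectors reached a similar held-out correlation to the embeddings, the signal would be attributable to vector capacity rather than to the content of the embeddings.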
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.
Point-by-point responses
- Referee: Abstract: the claim that CEs are predictive 'above chance level' at the token level rests entirely on the type-wise and token-wise permutation baselines, yet no details are given on baseline construction (e.g., whether token-wise shuffling preserves type multiplicity, whether cross-validation respects token-type structure, or whether a matched-dimensionality control such as random vectors of equal dimension is included). Without these specifics, it remains possible that any sufficiently rich vector representation would produce the reported signal, undermining the attribution to CEs specifically.
Authors: We agree that the abstract lacks sufficient detail on baseline construction, which is a valid concern for assessing the specificity of the results to contextualized embeddings. The Methods section describes the type-wise and token-wise permutation baselines, but we will expand this description in the revision to explicitly state that token-wise shuffling preserves type multiplicity by permuting durations within the full set while maintaining type frequencies, that cross-validation is structured to keep all tokens of a given type within the same fold, and that a matched-dimensionality random vector control is included. We will also add a concise summary of these controls to the abstract. revision: yes
- Referee: Abstract: no information is supplied on model architecture, exact evaluation metrics, data splits, or the spontaneous-speech corpus extraction procedure. These omissions are load-bearing because the soundness of the above-chance and back-transformation claims cannot be assessed without them.
Authors: The referee is correct that the abstract omits these methodological details. While the full manuscript describes the linear regression model, Pearson correlation and MSE metrics, token-level cross-validation splits, and extraction of 7470 monosyllabic CV tokens from the spontaneous speech corpus in the Methods section, we acknowledge that the abstract should be more self-contained. In the revision we will add a brief overview of the model, metrics, splits, and corpus procedure to the abstract to allow readers to evaluate the claims without immediately consulting the Methods. revision: yes
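The responses mention folds that keep all tokens of a type together, and Pearson correlation and MSE as metrics. A minimal sketch of both, assuming a greedy size-balancing assignment of types to folds; the helper names and the toy labels are hypothetical and not taken from the manuscript.

```python
import math
from collections import defaultdict

def grouped_folds(types, k):
    """Give each token a fold index so that all tokens of a type share one fold."""
    counts = defaultdict(int)
    for t in types:
        counts[t] += 1
    fold_sizes = [0] * k
    fold_of_type = {}
    # Greedy balancing: place the most frequent types first, each into the smallest fold.
    for t, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        f = fold_sizes.index(min(fold_sizes))
        fold_of_type[t] = f
        fold_sizes[f] += c
    return [fold_of_type[t] for t in types]

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mse(x, y):
    """Mean squared error between predictions and observations."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

# Toy type labels (made up).
types = ["de", "shi", "de", "ta", "shi", "de", "wo"]
folds = grouped_folds(types, k=2)
```

Grouping by type means a held-out fold contains only unseen word types, so it tests generalization across types; a token-level split would instead let the model see other tokens of the same type during training.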
Circularity Check
No significant circularity detected
Full rationale
The paper reports an empirical prediction task in which pre-trained contextualized embeddings are used as input features to regress spoken word duration (and back-transform f0 contours) for 7470 tokens. Performance is evaluated against type-wise and token-wise permutation baselines that shuffle the target variable independently of the embeddings. No equations, self-citations, or ansatzes are quoted that would reduce the reported predictivity to a fitted parameter or prior result by construction; the baselines constitute an external statistical control rather than an internal redefinition. The derivation chain is therefore self-contained against the stated benchmarks.
Reference graph
Works this paper leans on
- [1] Introduction. Prosody concerns those phonetic properties that are not covered by words’ vowels and consonants, such as spoken word duration, f0 contour and prominence [1]. The realization of prosodic properties is governed by a wide range of factors, such as the prosodic properties of neighbouring words [2], internal and external sandhi processes [3, 4],...
- [2] Method. 2.1. Data. The corpus used in the current study is the Taiwan Mandarin spontaneous speech corpus [17], which provides word-level transcriptions using traditional Chinese characters. We followed the transcriptions in the corpus, and distinguished between word types on the basis of the characters with which the words are transcribed. In Mandarin, ...
- [3] maps, as shown by [12]. In the present study, we evaluated prediction quality against permutation baselines that obliterate the relation between contextualized embeddings and duration or f0 contour. In this way, we can ascertain whether the predictions derived from the empirical embeddings are more precise than those in which the relation between form a...
- [4] Results. 3.1. Training data. Under 10-fold cross-validation, the mean correlation for vowel duration in training was 0.535, the mean correlation for the global permutation baseline was 0.339 and mean correlation for the type-wise permutation baseline was 0.506, lower than the empirical mean correlation (t(9) = −18.473, p < 0.0001). With respect to word duratio...
- [5] Discussion. We have shown that contextualized embeddings predict both spoken word duration and time-normalized f0 contours with above-chance accuracies at the type level. For spoken word duration, we report the novel finding that prediction accuracy is above chance also at the token level. Furthermore, combining predicted shape and duration leads to pred...
- [6] J. Cole, “Prosody in context: A review,” Language, Cognition and Neuroscience, vol. 30, no. 1-2, pp. 1–31, 2015.
- [7] X.-n. S. Shen, The prosody of Mandarin Chinese. Univ of California Press, 1990, vol. 118.
- [8] C.-L. Shih, The Prosodic Domain of Tone Sandhi in Chinese (Phrasal Phonology, Tonal Typology, Mandarin, Syntax-Phonology Interface). University of California, San Diego, 1986.
- [9] O. Niebuhr, C. M. Lill, and J. Neuschulz, “At the segment-prosody divide: The interplay of intonation, sibilant pitch and sibilant assimilation,” in Proceedings of the 17th ICPhS, Hong Kong, China, 2011, pp. 1478–1481.
- [10] S. Gahl, “Time and thyme are not homophones: The effect of lemma frequency on word durations in spontaneous speech,” Language, vol. 84, no. 3, pp. 474–496, 2008.
- [11] A. Lohmann, “Cut (n) and cut (v) are not homophones: Lemma frequency affects the duration of noun–verb conversion pairs,” Journal of Linguistics, vol. 54, no. 4, pp. 753–777, 2018.
- [12] C.-Y. Tseng and Y.-L. Lee, “Speech rate and prosody units: Evidence of interaction from Mandarin Chinese,” in Proceedings of Speech Prosody 2004, 2004, pp. 251–254.
- [13] I. Plag, J. Homann, and G. Kunter, “Homophony and morphology: The acoustics of word-final s in English,” Journal of Linguistics, pp. 1–36, 2015.
- [14] R. W. Frick, “Communicating emotion: The role of prosodic features,” Psychological Bulletin, vol. 97, no. 3, p. 412, 1985.
- [15] S. Gahl and R. H. Baayen, “Time and thyme again: Connecting English spoken word duration to models of the mental lexicon,” Language, 2024, accepted.
- [16] M. Heitmeier, Y.-Y. Chuang, and R. H. Baayen, The Discriminative Lexicon: Theory and implementation in the Julia package JudiLing. Cambridge: Cambridge University Press, 2026.
- [17] Y.-Y. Chuang, M. J. Bell, Y.-H. Tseng, and R. H. Baayen, “Word-specific tonal realizations in Mandarin,” Language, 2026, in press; arXiv preprint arXiv:2405.07006.
- [18] Y. Lu, Y.-Y. Chuang, and R. H. Baayen, “The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus-based survey and theory-driven computational modeling,” Corpus Linguistics and Linguistic Theory, 2026.
- [19] Y.-Y. Chuang, R. H. Baayen, and M. J. Bell, “Do words sing their own tunes? Word-specific pitch realizations in Mandarin and English,” in Proceedings of ICPhS 2023, 2023.
- [20] S. N. Wood, Generalized additive models: an introduction with R. Chapman and Hall/CRC, 2017.
- [21] A. T. Ho, “The acoustic variation of Mandarin tones,” Phonetica, vol. 33, no. 5, pp. 353–367, 1976.
- [22] J. Fon, “A preliminary construction of Taiwan Southern Min spontaneous speech corpus,” Tech. Rep. NSC-92-2411-H-003-050, National Science Council, Taiwan, 2004.
- [23] S. Duanmu, The phonology of standard Chinese. OUP Oxford, 2007.
- [24] T. Yi, “Lectures on Chinese Phonetics [國音學講義],” 1920.
- [25] F. Wu and M. Kenstowicz, “Duration reflexes of syllable structure in Mandarin,” Lingua, vol. 164, pp. 87–99, 2015.
- [26] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using Kaldi,” in Interspeech, vol. 2017, 2017, pp. 498–502.
- [27] P. Boersma and D. Weenink, “2022. Praat: Doing phonetics by computer [Computer program]. Version 6.0.43,” 1992.
- [28] X. Jin, M. Ernestus, and R. H. Baayen, “A new kid on the block: Distributional semantics predicts the word-specific tone signatures of monosyllabic words in conversational Taiwan Mandarin speech,” under revision for Journal of Phonetics, 2025, arXiv preprint arXiv:2503.23163.
- [29] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
- [30] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 2003.
- [31] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
- [32] C. R. Rao, “The utilization of multiple measurements in problems of biological classification,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 10, no. 2, pp. 159–203, 1948.
- [33] J. F. Burrows, “Computers and the study of literature,” in Computers and Written Texts, C. S. Butler, Ed. Oxford: Blackwell, 1992, pp. 167–204.
- [34] Q. Wenfeng and W. Yanyi, jiebaR: Chinese Text Segmentation, 2019, R package version 0.11. [Online]. Available: https://CRAN.R-project.org/package=jiebaR
- [35]
- [36] S. Hawkins, “Roles and representations of systematic fine phonetic detail in speech understanding,” Journal of Phonetics, vol. 31, pp. 373–405, 2003.
- [37] R. F. Port and A. P. Leary, “Against formal phonology,” Language, vol. 81, pp. 927–964, 2005.
- [38] A. G. de Varda and M. Marelli, “Cracking arbitrariness: A data-driven study of auditory iconicity in spoken English,” Psychonomic Bulletin & Review, pp. 1–18, 2025.