Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words

Mirjam Ernestus; R.Harald Baayen; Xiaoyun Jin

arxiv: 2607.02002 · v1 · pith:K2EW54JWnew · submitted 2026-07-02 · 💻 cs.CL

Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words

Xiaoyun Jin , Mirjam Ernestus , R.Harald Baayen This is my paper

Pith reviewed 2026-07-03 14:48 UTC · model grok-4.3

classification 💻 cs.CL

keywords contextualized embeddingsword durationpitch contoursMandarin speechspontaneous speechf0 contourstoken-level prediction

0 comments

The pith

Contextualized embeddings predict spoken durations of Mandarin words above chance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether contextualized embeddings can forecast the duration of individual Mandarin monosyllabic words spoken in conversation. It reports that these embeddings outperform permutation baselines for duration prediction at both the word type and token levels. The duration predictions are accurate enough to convert time-normalized pitch contours into real-time millisecond contours that match observed speech patterns. A reader might care because this links static language representations to dynamic aspects of speech production like timing and intonation.

Core claim

Contextualized embeddings are predictive for duration, above chance level, not only at the type level, but also at the level of individual tokens, as indicated by the results obtained with the type-wise and token-wise permutation baselines. The predicted durations are sufficiently precise to back-transform predicted f0 contours in [0,1] normalized time to contours on the ms time scale. The resulting predicted contours approximate empirical contours and also outperform a permutation baseline.

What carries the argument

Contextualized embeddings (CEs) from language models, applied in regression to predict duration and rescale normalized f0 contours to actual milliseconds.

If this is right

CEs predict duration above chance at both type and individual token levels.
Predicted durations enable back-transformation of normalized f0 contours to ms-scale contours that approximate empirical data.
The approach applies to 7470 tokens of monosyllabic CV words from spontaneous speech.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same embedding-based method might be tested for predicting other prosodic features such as intensity.
Extending the approach to polysyllabic words or non-tonal languages would check whether duration prediction generalizes beyond the current monosyllabic Mandarin case.
Combining duration prediction with embedding-derived pitch could support more accurate computational models of conversational speech timing.

Load-bearing premise

The type-wise and token-wise permutation baselines adequately establish that performance exceeds chance without bias from data selection, embedding dimensionality, or the specific spontaneous-speech corpus used.

What would settle it

A replication on new spontaneous Mandarin speech tokens in which embedding-based duration predictions fail to beat the token-wise permutation baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2607.02002 by Mirjam Ernestus, R.Harald Baayen, Xiaoyun Jin.

**Figure 1.** Figure 1: Examples of three Mandarin words that have the largest (它), medium (讀) and shortest (大) distance between observed (red) and predicted (blue) f0 contours. types) is used, which excludes word types with extremely short duration or octave jumps in pitch measurements. For a given type, we calculated the centroid of its contextualized embeddings, and then used the appropriate mapping (see equation 2) to obtain… view at source ↗

**Figure 2.** Figure 2: Examples of vowels’ pitch contours in real time, predicted for the centroids of 不 and 發. Right panels: both duration and shape estimates are for 不 and 發. Left panels: duration from the homophone 部 (upper) and phonological neighbour 殺, shape from 不 and 發. Center panels: duration from 不 and 發, but shape from 部 and 殺. 4. Discussion We have shown that contextualized embeddings predict both spoken word dura… view at source ↗

read the original abstract

Time-normalized f0 contours of Mandarin words in conversational speech have been shown to be predictable in part from their contextualized embeddings (CEs). The present study investigates whether CEs also predict spoken word duration for 7470 tokens of Mandarin monosyllabic CV words extracted from a Mandarin corpus of spontaneous speech. We show that CEs indeed are predictive for duration, above chance level, not only at the type level, but also at the level of individual tokens, as indicated by the results obtained with the type-wise and token-wise permutation baselines. We also show that the predicted durations are sufficiently precise to back-transform predicted f0 contours in [0,1] normalized time to contours on the ms time scale. The resulting predicted contours approximate empirical contours and also outperform a permutation baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CEs predict duration above permutation baselines in Mandarin monosyllables and rescale f0 contours, but baseline construction details are missing.

read the letter

The core result is that contextualized embeddings predict spoken durations for 7470 Mandarin monosyllabic tokens from spontaneous speech, above chance at both type and token levels, and that those duration predictions can stretch normalized f0 contours onto a real time scale so they approximate the observed ones.

The work extends the authors' earlier embedding-based f0 modeling by adding duration as a target and by testing at the token level rather than just averaging over types. The use of separate type-wise and token-wise permutation baselines is a reasonable way to check that the signal is not just type-level leakage, and the back-transformation step shows a practical downstream use.

The main uncertainty is whether the permutation baselines actually isolate the contribution of the embeddings. The abstract gives no information on whether the token-wise shuffle preserves the multiplicity of tokens per type, whether cross-validation respects the type-token structure, or whether a matched-dimensionality control (random vectors) was run. Without those checks it remains possible that any sufficiently rich vector representation would produce similar numbers. The scope is also narrow—only CV monosyllables from one corpus—so the result is best read as a targeted demonstration rather than a broad claim about embeddings and prosody.

This is the sort of incremental but cleanly executed study that matters to people working on embedding-driven prosody models or tonal-language speech synthesis. It is coherent on its own terms and deserves a full referee process even if the claims stay modest.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that contextualized embeddings (CEs) predict spoken word duration for 7470 tokens of Mandarin monosyllabic CV words extracted from spontaneous speech. It reports above-chance prediction at both type and token levels, supported by type-wise and token-wise permutation baselines, and shows that predicted durations are precise enough to back-transform [0,1]-normalized f0 contours to the ms time scale, yielding approximations to empirical contours that outperform a permutation baseline.

Significance. If the central results hold under rigorous controls, the work would extend prior findings on CE-based f0 contour prediction to duration, indicating that CEs encode multiple prosodic dimensions in conversational Mandarin. The back-transformation result would be a notable strength, providing a direct link between duration prediction and improved contour accuracy on the physical time scale.

major comments (2)

Abstract: the claim that CEs are predictive 'above chance level' at the token level rests entirely on the type-wise and token-wise permutation baselines, yet no details are given on baseline construction (e.g., whether token-wise shuffling preserves type multiplicity, whether cross-validation respects token-type structure, or whether a matched-dimensionality control such as random vectors of equal dimension is included). Without these specifics, it remains possible that any sufficiently rich vector representation would produce the reported signal, undermining the attribution to CEs specifically.
Abstract: no information is supplied on model architecture, exact evaluation metrics, data splits, or the spontaneous-speech corpus extraction procedure. These omissions are load-bearing because the soundness of the above-chance and back-transformation claims cannot be assessed without them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

read point-by-point responses

Referee: Abstract: the claim that CEs are predictive 'above chance level' at the token level rests entirely on the type-wise and token-wise permutation baselines, yet no details are given on baseline construction (e.g., whether token-wise shuffling preserves type multiplicity, whether cross-validation respects token-type structure, or whether a matched-dimensionality control such as random vectors of equal dimension is included). Without these specifics, it remains possible that any sufficiently rich vector representation would produce the reported signal, undermining the attribution to CEs specifically.

Authors: We agree that the abstract lacks sufficient detail on baseline construction, which is a valid concern for assessing the specificity of the results to contextualized embeddings. The Methods section describes the type-wise and token-wise permutation baselines, but we will expand this description in the revision to explicitly state that token-wise shuffling preserves type multiplicity by permuting durations within the full set while maintaining type frequencies, that cross-validation is structured to keep all tokens of a given type within the same fold, and that a matched-dimensionality random vector control is included. We will also add a concise summary of these controls to the abstract. revision: yes
Referee: Abstract: no information is supplied on model architecture, exact evaluation metrics, data splits, or the spontaneous-speech corpus extraction procedure. These omissions are load-bearing because the soundness of the above-chance and back-transformation claims cannot be assessed without them.

Authors: The referee is correct that the abstract omits these methodological details. While the full manuscript describes the linear regression model, Pearson correlation and MSE metrics, token-level cross-validation splits, and extraction of 7470 monosyllabic CV tokens from the spontaneous speech corpus in the Methods section, we acknowledge that the abstract should be more self-contained. In the revision we will add a brief overview of the model, metrics, splits, and corpus procedure to the abstract to allow readers to evaluate the claims without immediately consulting the Methods. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper reports an empirical prediction task in which pre-trained contextualized embeddings are used as input features to regress spoken word duration (and back-transform f0 contours) for 7470 tokens. Performance is evaluated against type-wise and token-wise permutation baselines that shuffle the target variable independently of the embeddings. No equations, self-citations, or ansatzes are quoted that would reduce the reported predictivity to a fitted parameter or prior result by construction; the baselines constitute an external statistical control rather than an internal redefinition. The derivation chain is therefore self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information on free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5667 in / 1280 out tokens · 41109 ms · 2026-07-03T14:48:54.134012+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 2 internal anchors

[1]

Introduction Prosody concerns those phonetic properties that are not cov- ered by words’ vowels and consonants, such as spoken word duration, f0 contour and prominence [1]. The realization of prosodic properties is governed by a wide range of factors, such as the prosodic properties of neighbouring words [2], internal and external sandhi processes [3, 4],...

work page
[2]

Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words

Method 2.1. Data The corpus used in the current study is the Taiwan Mandarin spontaneous speech corpus [17], which provides word-level transcriptions using traditional Chinese characters. We fol- lowed the transcriptions in the corpus, and distinguished be- tween word types on the basis of the characters with which the words are transcribed. In Mandarin, ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

In the present study, we evaluated prediction quality against permutation baselines that obliterate the relation between contextualized embeddings and duration or f0 contour

maps, as shown by [12]. In the present study, we evaluated prediction quality against permutation baselines that obliterate the relation between contextualized embeddings and duration or f0 contour. In this way, we can ascertain whether the predic- tions derived from the empirical embeddings are more precise than those in which the relation between form a...

work page
[4]

Results 3.1. Training data Under 10-fold cross-validation, the mean correlation for vowel duration in training was0.535, the mean correlation for the global permutation baseline was 0.339 and mean correlation for the type-wise permutation baseline was 0.506, lower than the empirical mean correlation (t(9) =−18.473, p <0.0001). With respect to word duratio...

work page
[5]

For spoken word duration, we report the novel finding that prediction accuracy is above chance also at the token level

Discussion We have shown that contextualized embeddings predict both spoken word duration and time-normalized f0 contours with above-chance accuracies at the type level. For spoken word duration, we report the novel finding that prediction accuracy is above chance also at the token level. Furthermore, combin- ing predicted shape and duration leads to pred...

work page
[6]

Prosody in context: A review,

J. Cole, “Prosody in context: A review,”Language, Cognition and Neuroscience, vol. 30, no. 1-2, pp. 1–31, 2015

work page 2015
[7]

X.-n. S. Shen,The prosody of Mandarin Chinese. Univ of Cali- fornia Press, 1990, vol. 118

work page 1990
[8]

Shih,The Prosodic Domain of Tone Sandhi in Chi- nese (Phrasal Phonology, Tonal Typology, Mandarin, Syntax- Phonology Interface)

C.-L. Shih,The Prosodic Domain of Tone Sandhi in Chi- nese (Phrasal Phonology, Tonal Typology, Mandarin, Syntax- Phonology Interface). University of California, San Diego, 1986

work page 1986
[9]

At the segment-prosody divide: The interplay of intonation, sibilant pitch and sibilant as- similation,

O. Niebuhr, C. M. Lill, and J. Neuschulz, “At the segment-prosody divide: The interplay of intonation, sibilant pitch and sibilant as- similation,” inProceedings of the 17th ICPhS, Hong Kong, China, 2011, pp. 1478–1481

work page 2011
[10]

Time and thyme are not homophones: The effect of lemma frequency on word durations in spontaneous speech,

S. Gahl, “Time and thyme are not homophones: The effect of lemma frequency on word durations in spontaneous speech,”Lan- guage, vol. 84, no. 3, pp. 474–496, 2008

work page 2008
[11]

Cut (n) and cut (v) are not homophones: Lemma frequency affects the duration of noun–verb conversion pairs,

A. Lohmann, “Cut (n) and cut (v) are not homophones: Lemma frequency affects the duration of noun–verb conversion pairs,” Journal of Linguistics, vol. 54, no. 4, pp. 753–777, 2018

work page 2018
[12]

Speech rate and prosody units: Ev- idence of interaction from mandarin chinese,

C.-Y . Tseng and Y .-L. Lee, “Speech rate and prosody units: Ev- idence of interaction from mandarin chinese,” inProceedings of Speech Prosody 2004, 2004, pp. 251–254

work page 2004
[13]

Homophony and morphology: The acoustics of word-final s in English,

I. Plag, J. Homann, and G. Kunter, “Homophony and morphology: The acoustics of word-final s in English,”Journal of Linguistics, pp. 1–36, 2015

work page 2015
[14]

Communicating emotion: The role of prosodic fea- tures

R. W. Frick, “Communicating emotion: The role of prosodic fea- tures.”Psychological bulletin, vol. 97, no. 3, p. 412, 1985

work page 1985
[15]

Time and thyme again: Connecting English spoken word duration to models of the mental lexicon,

S. Gahl and R. H. Baayen, “Time and thyme again: Connecting English spoken word duration to models of the mental lexicon,” Language, 2024, page accepted

work page 2024
[16]

Heitmeier, Y .-Y

M. Heitmeier, Y .-Y . Chuang, and R. H. Baayen,The Discrimi- native Lexicon: Theory and implementation in the Julia package JudiLing. Cambridge: Cambridge University Press, 2026

work page 2026
[17]

Word-specific tonal realizations in Mandarin

Y .-Y . Chuang, M. J. Bell, Y .-H. Tseng, and R. H. Baayen, “Word- specific tonal realizations in Mandarin,”Language, 2026, in press; arXiv preprint arXiv:2405.07006

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus-based survey and theory-driven computational modeling,

Y . Lu, Y .-Y . Chuang, and R. H. Baayen, “The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus-based survey and theory-driven computational modeling,”Corpus Linguistics and Linguistic Theory, 2026

work page 2026
[19]

Do words sing their own tunes? word-specific pitch realizations in Mandarin and En- glish,

Y .-Y . Chuang, R. H. Baayen, and M. J. Bell, “Do words sing their own tunes? word-specific pitch realizations in Mandarin and En- glish,” inProceedings of ICPhS 2023, 2023

work page 2023
[20]

S. N. Wood,Generalized additive models: an introduction with R. chapman and hall/CRC, 2017

work page 2017
[21]

The acoustic variation of Mandarin tones,

A. T. Ho, “The acoustic variation of Mandarin tones,”Phonetica, vol. 33, no. 5, pp. 353–367, 1976

work page 1976
[22]

A preliminary construction of Taiwan Southern Min spon- taneous speech corpus,

J. Fon, “A preliminary construction of Taiwan Southern Min spon- taneous speech corpus,” Tech. Rep. NSC-92-2411-H-003-050, National Science Council, Taiwan, Tech. Rep., 2004

work page 2004
[23]

Duanmu,The phonology of standard Chinese

S. Duanmu,The phonology of standard Chinese. OUP Oxford, 2007

work page 2007
[24]

Lectures on Chinese Phonetics [國音學講義],

T. Yi, “Lectures on Chinese Phonetics [國音學講義],” 1920

work page 1920
[25]

Duration reflexes of syllable structure in mandarin,

F. Wu and M. Kenstowicz, “Duration reflexes of syllable structure in mandarin,”Lingua, vol. 164, pp. 87–99, 2015

work page 2015
[26]

Montreal forced aligner: Trainable text-speech align- ment using kaldi

M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Son- deregger, “Montreal forced aligner: Trainable text-speech align- ment using kaldi.” inInterspeech, vol. 2017, 2017, pp. 498–502

work page 2017
[27]

2022. Praat: Doing phonetics by computer [Computer program]. Version 6.0. 43,

P. Boersma and D. Weenink, “2022. Praat: Doing phonetics by computer [Computer program]. Version 6.0. 43,” 1992

work page 2022
[28]

A new kid on the block: Distributional semantics predicts the word-specific tone signa- tures of monosyllabic words in conversational Taiwan Mandarin speech

X. Jin, M. Ernestus, and R. H. Baayen, “A new kid on the block: Distributional semantics predicts the word-specific tone signa- tures of monosyllabic words in conversational Taiwan Mandarin speech.”under revision for Journal of Phonetics, 2025, arXiv preprint arXiv:2503.23163

work page arXiv 2025
[29]

Visualizing data using t-sne,

L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579– 2605, 2008

work page 2008
[30]

Dynamic programming algorithm op- timization for spoken word recognition,

H. Sakoe and S. Chiba, “Dynamic programming algorithm op- timization for spoken word recognition,”IEEE transactions on acoustics, speech, and signal processing, vol. 26, no. 1, pp. 43– 49, 2003

work page 2003
[31]

The use of multiple measurements in taxonomic problems,

R. A. Fisher, “The use of multiple measurements in taxonomic problems,”Annals of eugenics, vol. 7, no. 2, pp. 179–188, 1936

work page 1936
[32]

The utilization of multiple measurements in problems of biological classification,

C. R. Rao, “The utilization of multiple measurements in problems of biological classification,”Journal of the Royal Statistical Soci- ety. Series B (Methodological), vol. 10, no. 2, pp. 159–203, 1948

work page 1948
[33]

Computers and the study of literature,

J. F. Burrows, “Computers and the study of literature,” inCom- puters and Written Texts, C. S. Butler, Ed. Oxford: Blackwell, 1992, pp. 167–204

work page 1992
[34]

Wenfeng and W

Q. Wenfeng and W. Yanyi,jiebaR: Chinese Text Segmentation, 2019, R package version 0.11. [Online]. Available: https: //CRAN.R-project.org/package=jiebaR

work page 2019
[35]

Drager, k

“Drager, k.”Journal of Phonetics, vol. 39, no. 4, pp. 694–707, 2011

work page 2011
[36]

Roles and representations of systematic fine pho- netic detail in speech understanding,

S. Hawkins, “Roles and representations of systematic fine pho- netic detail in speech understanding,”Journal of Phonetics, vol. 31, pp. 373–405, 2003

work page 2003
[37]

Against formal phonology,

R. F. Port and A. P. Leary, “Against formal phonology,”Language, vol. 81, pp. 927–964, 2005

work page 2005
[38]

Cracking arbitrariness: A data- driven study of auditory iconicity in spoken English,

A. G. de Varda and M. Marelli, “Cracking arbitrariness: A data- driven study of auditory iconicity in spoken English,”Psycho- nomic Bulletin & Review, pp. 1–18, 2025

work page 2025

[1] [1]

Introduction Prosody concerns those phonetic properties that are not cov- ered by words’ vowels and consonants, such as spoken word duration, f0 contour and prominence [1]. The realization of prosodic properties is governed by a wide range of factors, such as the prosodic properties of neighbouring words [2], internal and external sandhi processes [3, 4],...

work page

[2] [2]

Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words

Method 2.1. Data The corpus used in the current study is the Taiwan Mandarin spontaneous speech corpus [17], which provides word-level transcriptions using traditional Chinese characters. We fol- lowed the transcriptions in the corpus, and distinguished be- tween word types on the basis of the characters with which the words are transcribed. In Mandarin, ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

In the present study, we evaluated prediction quality against permutation baselines that obliterate the relation between contextualized embeddings and duration or f0 contour

maps, as shown by [12]. In the present study, we evaluated prediction quality against permutation baselines that obliterate the relation between contextualized embeddings and duration or f0 contour. In this way, we can ascertain whether the predic- tions derived from the empirical embeddings are more precise than those in which the relation between form a...

work page

[4] [4]

Results 3.1. Training data Under 10-fold cross-validation, the mean correlation for vowel duration in training was0.535, the mean correlation for the global permutation baseline was 0.339 and mean correlation for the type-wise permutation baseline was 0.506, lower than the empirical mean correlation (t(9) =−18.473, p <0.0001). With respect to word duratio...

work page

[5] [5]

For spoken word duration, we report the novel finding that prediction accuracy is above chance also at the token level

Discussion We have shown that contextualized embeddings predict both spoken word duration and time-normalized f0 contours with above-chance accuracies at the type level. For spoken word duration, we report the novel finding that prediction accuracy is above chance also at the token level. Furthermore, combin- ing predicted shape and duration leads to pred...

work page

[6] [6]

Prosody in context: A review,

J. Cole, “Prosody in context: A review,”Language, Cognition and Neuroscience, vol. 30, no. 1-2, pp. 1–31, 2015

work page 2015

[7] [7]

X.-n. S. Shen,The prosody of Mandarin Chinese. Univ of Cali- fornia Press, 1990, vol. 118

work page 1990

[8] [8]

Shih,The Prosodic Domain of Tone Sandhi in Chi- nese (Phrasal Phonology, Tonal Typology, Mandarin, Syntax- Phonology Interface)

C.-L. Shih,The Prosodic Domain of Tone Sandhi in Chi- nese (Phrasal Phonology, Tonal Typology, Mandarin, Syntax- Phonology Interface). University of California, San Diego, 1986

work page 1986

[9] [9]

At the segment-prosody divide: The interplay of intonation, sibilant pitch and sibilant as- similation,

O. Niebuhr, C. M. Lill, and J. Neuschulz, “At the segment-prosody divide: The interplay of intonation, sibilant pitch and sibilant as- similation,” inProceedings of the 17th ICPhS, Hong Kong, China, 2011, pp. 1478–1481

work page 2011

[10] [10]

Time and thyme are not homophones: The effect of lemma frequency on word durations in spontaneous speech,

S. Gahl, “Time and thyme are not homophones: The effect of lemma frequency on word durations in spontaneous speech,”Lan- guage, vol. 84, no. 3, pp. 474–496, 2008

work page 2008

[11] [11]

Cut (n) and cut (v) are not homophones: Lemma frequency affects the duration of noun–verb conversion pairs,

A. Lohmann, “Cut (n) and cut (v) are not homophones: Lemma frequency affects the duration of noun–verb conversion pairs,” Journal of Linguistics, vol. 54, no. 4, pp. 753–777, 2018

work page 2018

[12] [12]

Speech rate and prosody units: Ev- idence of interaction from mandarin chinese,

C.-Y . Tseng and Y .-L. Lee, “Speech rate and prosody units: Ev- idence of interaction from mandarin chinese,” inProceedings of Speech Prosody 2004, 2004, pp. 251–254

work page 2004

[13] [13]

Homophony and morphology: The acoustics of word-final s in English,

I. Plag, J. Homann, and G. Kunter, “Homophony and morphology: The acoustics of word-final s in English,”Journal of Linguistics, pp. 1–36, 2015

work page 2015

[14] [14]

Communicating emotion: The role of prosodic fea- tures

R. W. Frick, “Communicating emotion: The role of prosodic fea- tures.”Psychological bulletin, vol. 97, no. 3, p. 412, 1985

work page 1985

[15] [15]

Time and thyme again: Connecting English spoken word duration to models of the mental lexicon,

S. Gahl and R. H. Baayen, “Time and thyme again: Connecting English spoken word duration to models of the mental lexicon,” Language, 2024, page accepted

work page 2024

[16] [16]

Heitmeier, Y .-Y

M. Heitmeier, Y .-Y . Chuang, and R. H. Baayen,The Discrimi- native Lexicon: Theory and implementation in the Julia package JudiLing. Cambridge: Cambridge University Press, 2026

work page 2026

[17] [17]

Word-specific tonal realizations in Mandarin

Y .-Y . Chuang, M. J. Bell, Y .-H. Tseng, and R. H. Baayen, “Word- specific tonal realizations in Mandarin,”Language, 2026, in press; arXiv preprint arXiv:2405.07006

work page internal anchor Pith review Pith/arXiv arXiv 2026

[18] [18]

The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus-based survey and theory-driven computational modeling,

Y . Lu, Y .-Y . Chuang, and R. H. Baayen, “The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus-based survey and theory-driven computational modeling,”Corpus Linguistics and Linguistic Theory, 2026

work page 2026

[19] [19]

Do words sing their own tunes? word-specific pitch realizations in Mandarin and En- glish,

Y .-Y . Chuang, R. H. Baayen, and M. J. Bell, “Do words sing their own tunes? word-specific pitch realizations in Mandarin and En- glish,” inProceedings of ICPhS 2023, 2023

work page 2023

[20] [20]

S. N. Wood,Generalized additive models: an introduction with R. chapman and hall/CRC, 2017

work page 2017

[21] [21]

The acoustic variation of Mandarin tones,

A. T. Ho, “The acoustic variation of Mandarin tones,”Phonetica, vol. 33, no. 5, pp. 353–367, 1976

work page 1976

[22] [22]

A preliminary construction of Taiwan Southern Min spon- taneous speech corpus,

J. Fon, “A preliminary construction of Taiwan Southern Min spon- taneous speech corpus,” Tech. Rep. NSC-92-2411-H-003-050, National Science Council, Taiwan, Tech. Rep., 2004

work page 2004

[23] [23]

Duanmu,The phonology of standard Chinese

S. Duanmu,The phonology of standard Chinese. OUP Oxford, 2007

work page 2007

[24] [24]

Lectures on Chinese Phonetics [國音學講義],

T. Yi, “Lectures on Chinese Phonetics [國音學講義],” 1920

work page 1920

[25] [25]

Duration reflexes of syllable structure in mandarin,

F. Wu and M. Kenstowicz, “Duration reflexes of syllable structure in mandarin,”Lingua, vol. 164, pp. 87–99, 2015

work page 2015

[26] [26]

Montreal forced aligner: Trainable text-speech align- ment using kaldi

M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Son- deregger, “Montreal forced aligner: Trainable text-speech align- ment using kaldi.” inInterspeech, vol. 2017, 2017, pp. 498–502

work page 2017

[27] [27]

2022. Praat: Doing phonetics by computer [Computer program]. Version 6.0. 43,

P. Boersma and D. Weenink, “2022. Praat: Doing phonetics by computer [Computer program]. Version 6.0. 43,” 1992

work page 2022

[28] [28]

A new kid on the block: Distributional semantics predicts the word-specific tone signa- tures of monosyllabic words in conversational Taiwan Mandarin speech

X. Jin, M. Ernestus, and R. H. Baayen, “A new kid on the block: Distributional semantics predicts the word-specific tone signa- tures of monosyllabic words in conversational Taiwan Mandarin speech.”under revision for Journal of Phonetics, 2025, arXiv preprint arXiv:2503.23163

work page arXiv 2025

[29] [29]

Visualizing data using t-sne,

L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579– 2605, 2008

work page 2008

[30] [30]

Dynamic programming algorithm op- timization for spoken word recognition,

H. Sakoe and S. Chiba, “Dynamic programming algorithm op- timization for spoken word recognition,”IEEE transactions on acoustics, speech, and signal processing, vol. 26, no. 1, pp. 43– 49, 2003

work page 2003

[31] [31]

The use of multiple measurements in taxonomic problems,

R. A. Fisher, “The use of multiple measurements in taxonomic problems,”Annals of eugenics, vol. 7, no. 2, pp. 179–188, 1936

work page 1936

[32] [32]

The utilization of multiple measurements in problems of biological classification,

C. R. Rao, “The utilization of multiple measurements in problems of biological classification,”Journal of the Royal Statistical Soci- ety. Series B (Methodological), vol. 10, no. 2, pp. 159–203, 1948

work page 1948

[33] [33]

Computers and the study of literature,

J. F. Burrows, “Computers and the study of literature,” inCom- puters and Written Texts, C. S. Butler, Ed. Oxford: Blackwell, 1992, pp. 167–204

work page 1992

[34] [34]

Wenfeng and W

Q. Wenfeng and W. Yanyi,jiebaR: Chinese Text Segmentation, 2019, R package version 0.11. [Online]. Available: https: //CRAN.R-project.org/package=jiebaR

work page 2019

[35] [35]

Drager, k

“Drager, k.”Journal of Phonetics, vol. 39, no. 4, pp. 694–707, 2011

work page 2011

[36] [36]

Roles and representations of systematic fine pho- netic detail in speech understanding,

S. Hawkins, “Roles and representations of systematic fine pho- netic detail in speech understanding,”Journal of Phonetics, vol. 31, pp. 373–405, 2003

work page 2003

[37] [37]

Against formal phonology,

R. F. Port and A. P. Leary, “Against formal phonology,”Language, vol. 81, pp. 927–964, 2005

work page 2005

[38] [38]

Cracking arbitrariness: A data- driven study of auditory iconicity in spoken English,

A. G. de Varda and M. Marelli, “Cracking arbitrariness: A data- driven study of auditory iconicity in spoken English,”Psycho- nomic Bulletin & Review, pp. 1–18, 2025

work page 2025