Towards a Phonology-Informed Evaluation of Multilingual TTS
Pith reviewed 2026-07-03 14:55 UTC · model grok-4.3
The pith
A speech classifier trained on human recordings can audit TTS output for failures to preserve Assamese vowel harmony patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A classifier trained on human Assamese speech transfers to synthesized speech with only minimal loss in accuracy, so it can be used to audit whether TTS output preserves underlying ATR specifications. The audit shows that [+ATR] mid vowels are realized as [-ATR] in one-third of tokens, a bias absent from human speech. At the word level, predicted ATR labels classify harmony patterns more accurately than transcription labels do.
What carries the argument
A classifier-based framework that audits TTS output against language-specific phonological patterns by comparing the labels predicted for synthesized speech with those predicted for human speech.
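As a rough illustration of the train-on-human, test-on-TTS setup, here is a minimal sketch. It assumes a nearest-centroid classifier over two formant values (F1, F2 in Hz). The classifier type, the feature choice, the function names, and every number in the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean

def fit_centroids(samples):
    """samples: list of ((f1, f2), label). Returns {label: (mean_f1, mean_f2)}."""
    by_label = {}
    for feats, label in samples:
        by_label.setdefault(label, []).append(feats)
    return {
        label: tuple(mean(dim) for dim in zip(*feats_list))
        for label, feats_list in by_label.items()
    }

def predict(centroids, feats):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(centroids, samples):
    """Share of samples whose predicted label equals the given label."""
    hits = sum(predict(centroids, feats) == label for feats, label in samples)
    return hits / len(samples)

# Hypothetical toy data: (F1, F2) in Hz with the intended ATR label.
human = [((400, 2100), "+ATR"), ((420, 2050), "+ATR"),
         ((560, 1850), "-ATR"), ((580, 1800), "-ATR")]
tts = [((410, 2080), "+ATR"), ((540, 1900), "+ATR"),
       ((570, 1820), "-ATR"), ((590, 1790), "-ATR")]

centroids = fit_centroids(human)      # train on human speech only
human_acc = accuracy(centroids, human)
tts_acc = accuracy(centroids, tts)    # evaluate on synthesized speech
print(human_acc, tts_acc)
```

In this toy setup the second TTS token carries an intended [+ATR] label but sits closer to the [-ATR] centroid, which is the kind of mismatch the audit is meant to surface.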
If this is right
- TTS systems can pass standard naturalness tests yet still systematically neutralize phonological contrasts required for grammatical distinctions.
- Word-level harmony classification using the classifier's labels recovers the underlying phonological grammar more reliably than orthographic transcriptions from the same TTS output.
- The same transfer approach should apply to any phonological contrast that has measurable acoustic cues, provided labeled human recordings of that contrast are available to train the classifier.
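The word-level comparison described in the list above can be sketched as follows. The sketch assumes a deliberately simplified criterion in which a word counts as harmonic when all of its vowels share one ATR value. The paper's actual harmony definition is not given here, and the function names and example labels are illustrative assumptions.

```python
def is_harmonic(vowel_labels):
    """True when every vowel in the word has the same ATR label.

    This is a simplifying assumption, not the paper's definition of harmony.
    """
    return len(set(vowel_labels)) <= 1

def disagreement_rate(intended_words, predicted_words):
    """Share of words whose harmony status differs between the two label sources."""
    pairs = list(zip(intended_words, predicted_words))
    differing = sum(is_harmonic(i) != is_harmonic(p) for i, p in pairs)
    return differing / len(pairs)

# Hypothetical per-word vowel labels: intended (from the transcription)
# versus predicted (from a classifier applied to the TTS audio).
intended = [["+", "+"], ["-", "-"], ["+", "+"], ["-", "+"]]
predicted = [["+", "+"], ["-", "-"], ["-", "+"], ["-", "+"]]

rate = disagreement_rate(intended, predicted)
print(rate)
```

A nonzero rate in such a comparison means that some words which should be harmonic on paper are not classified as harmonic in the audio.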
Where Pith is reading between the lines
- If the bias is confirmed across other TTS architectures, evaluation pipelines could add an automatic phonological-faithfulness check before deployment in language-learning or documentation tools.
- The method supplies a way to quantify how much of a language's contrastive system survives synthesis, which could guide targeted data collection for under-resourced languages.
- Extending the audit to sentence-level or discourse-level phonological rules would test whether the same classifier transfer holds beyond isolated vowels.
Load-bearing premise
Any difference between the classifier's output on TTS speech and the intended phonological specification counts as a production error by the TTS system rather than an acoustic modeling difference.
What would settle it
If retraining or fine-tuning the classifier on a small set of TTS examples removes the reported one-third bias in mid-vowel ATR realization while leaving human-speech accuracy unchanged, the audit would no longer indicate a TTS-specific phonological failure.
Original abstract
Neural TTS systems can sound natural across languages, but naturalness does not guarantee the preservation of sound contrasts that distinguish words from their grammatical forms. Standard metrics like MOS do not test for this. We propose a classifier-based framework that audits TTS output against language-specific phonological patterns using human speech as a benchmark. Testing Assamese advanced tongue root (ATR) vowel harmony with Meta's MMS TTS, we show that a classifier trained on human speech transfers to synthesized speech with minimal loss. The faithfulness audit reveals that [+ATR] mid vowels are realized as [-ATR] in 1/3 of tokens despite an underlying [+ATR] specification, a bias absent in human speech. At the word level, predicted ATR labels classify harmony more accurately than transcription labels, indicating a gap between intended and produced phonology. The framework offers task-specific diagnostics and generalizes to other phonological contrasts with measurable acoustic cues.
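The audit step can be illustrated with a short tally of intended versus predicted labels. The function name and the toy label pairs below are hypothetical. The counts were chosen so that one in three [+ATR] tokens is classified as [-ATR], matching the proportion stated in the abstract, and they are not the paper's data.

```python
from collections import Counter

def mismatch_rates(pairs):
    """pairs: iterable of (intended, predicted) ATR labels, each "+" or "-".

    Returns (underproduction, overgeneration):
    the share of intended "+" tokens predicted as "-", and
    the share of intended "-" tokens predicted as "+".
    """
    counts = Counter(pairs)
    plus_total = counts[("+", "+")] + counts[("+", "-")]
    minus_total = counts[("-", "-")] + counts[("-", "+")]
    under = counts[("+", "-")] / plus_total if plus_total else 0.0
    over = counts[("-", "+")] / minus_total if minus_total else 0.0
    return under, over

# Hypothetical tokens: 6 intended [+ATR] (2 misrealized), 6 intended [-ATR].
pairs = [("+", "+")] * 4 + [("+", "-")] * 2 + [("-", "-")] * 6
under, over = mismatch_rates(pairs)
print(under, over)
```

Reporting the two rates separately is what makes a directional bias visible; a single overall accuracy figure would hide it.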
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a classifier-based framework for auditing phonological faithfulness in multilingual TTS systems, using Assamese ATR vowel harmony and Meta's MMS TTS as the test case. It reports that a classifier trained on human speech transfers to TTS output with minimal loss, that [+ATR] mid vowels are realized as [-ATR] in 1/3 of tokens (a bias absent in human speech), and that predicted ATR labels classify word-level harmony more accurately than transcription labels.
Significance. If the transfer validity and empirical findings are substantiated with appropriate controls and metrics, the framework would offer a targeted diagnostic for phonological contrast preservation in TTS that complements standard metrics like MOS, potentially useful for evaluating systems on languages with vowel harmony or other measurable contrasts.
major comments (3)
- [Abstract] The assertion that the classifier 'transfers to synthesized speech with minimal loss' is unsupported by any reported metrics (validation accuracy, data sizes, feature sets, or statistical tests), which is load-bearing for both the 1/3 bias claim and the word-level harmony result.
- [Abstract] The 1/3 realization rate for [+ATR] mid vowels as [-ATR] lacks controls for acoustic domain shift (e.g., altered formants or spectral properties in TTS), so it is unclear whether mismatches reflect TTS phonological bias or classifier sensitivity to non-phonological differences between human training data and TTS waveforms.
- [Abstract] The word-level finding that predicted ATR labels outperform transcription labels in classifying harmony is derived directly from the same classifier outputs; without evidence that the classifier recovers underlying specifications accurately on TTS data, this result cannot reliably indicate a gap between intended and produced phonology.
minor comments (1)
- [Abstract] Adding a short description of the acoustic features or classifier architecture would improve clarity and allow readers to assess potential confounds.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major comment below and will revise the manuscript to improve transparency in the abstract and main text.
- Referee: [Abstract] The assertion that the classifier 'transfers to synthesized speech with minimal loss' is unsupported by any reported metrics (validation accuracy, data sizes, feature sets, or statistical tests), which is load-bearing for both the 1/3 bias claim and the word-level harmony result.
  Authors: The full manuscript reports these supporting details in the methods and results sections. To address the concern that the abstract claim stands alone without numbers, we will revise the abstract to include the key metrics: validation accuracy on human speech, transfer accuracy on TTS, dataset sizes, feature sets (MFCCs), and statistical tests. This makes the abstract self-contained while preserving the original findings.
  Revision: yes
- Referee: [Abstract] The 1/3 realization rate for [+ATR] mid vowels as [-ATR] lacks controls for acoustic domain shift (e.g., altered formants or spectral properties in TTS), so it is unclear whether mismatches reflect TTS phonological bias or classifier sensitivity to non-phonological differences between human training data and TTS waveforms.
  Authors: This is a fair point about potential domain shift effects. The current evidence relies on the classifier's transfer performance and the absence of the bias in human speech data. We will add a discussion of acoustic comparisons (e.g., formant distributions) between domains and note this as a limitation. Full isolation of phonological vs. acoustic factors would require further targeted experiments.
  Revision: partial
- Referee: [Abstract] The word-level finding that predicted ATR labels outperform transcription labels in classifying harmony is derived directly from the same classifier outputs; without evidence that the classifier recovers underlying specifications accurately on TTS data, this result cannot reliably indicate a gap between intended and produced phonology.
  Authors: The transfer accuracy with minimal loss provides the evidence that the classifier remains reliable on TTS data. We interpret the improved harmony classification using predicted labels (vs. transcriptions) as indicating a mismatch between intended and realized phonology. We will expand the discussion to explicitly link the transfer result to this interpretation but do not believe additional validation experiments are required for the current claim.
  Revision: no
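One way to approach the acoustic comparison mentioned in the second response is a standardized mean difference between human and TTS formant values for the same vowel category. The sketch below uses Cohen's d with a pooled standard deviation. The choice of statistic, the function name, and the F1 values are assumptions made for illustration, not the authors' analysis.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference between samples a and b (pooled SD)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Hypothetical F1 values (Hz) for one vowel category in each domain.
human_f1 = [400.0, 410.0, 420.0, 430.0]
tts_f1 = [440.0, 450.0, 460.0, 470.0]

d = cohens_d(tts_f1, human_f1)
print(round(d, 2))
```

A large value for a vowel category whose labels the classifier gets right would point toward a general acoustic offset between domains, while a shift confined to the misclassified category would be more consistent with a category-specific bias. Either reading would need the targeted experiments the authors mention.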
Circularity Check
No circularity: empirical benchmarking with external human-speech reference
The paper describes a classifier trained on human speech and applied to TTS waveforms for phonological auditing. No equations, derivations, fitted parameters, or self-citations are presented as load-bearing steps. The central results (ATR mismatch rates, word-level harmony accuracy) are obtained by direct comparison to an independent human-speech benchmark rather than by construction from the TTS data itself or from prior self-authored results. This is standard empirical evaluation and contains no self-definitional, fitted-input, or self-citation reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: a classifier trained on human speech captures phonological patterns that transfer to TTS output with minimal loss.
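The accuracy figures quoted in the Results excerpt under the reference graph (Table 2) allow a direct check of how much accuracy changes under transfer. The sketch below only subtracts the reported in-domain accuracy from the reported cross-domain accuracy. The dictionary layout, key spellings, and function name are illustrative.

```python
# Accuracies (%) as reported in Table 2 of the paper excerpt.
# LR = logistic regression row, RF = random forest row, as abbreviated there.
acc = {
    "LR": {"H->H": 81.7, "H->TTS": 83.0, "TTS->TTS": 86.5, "TTS->H": 79.8},
    "RF": {"H->H": 90.5, "H->TTS": 74.7, "TTS->TTS": 87.5, "TTS->H": 80.8},
}

def transfer_deltas(row):
    """Cross-domain minus in-domain accuracy, in percentage points."""
    return {
        "human_to_tts": round(row["H->TTS"] - row["H->H"], 1),
        "tts_to_human": round(row["TTS->H"] - row["TTS->TTS"], 1),
    }

deltas = {model: transfer_deltas(row) for model, row in acc.items()}
for model, delta in deltas.items():
    print(model, delta)
```

On these reported values, the human-to-TTS change is +1.3 points for LR and -15.8 points for RF, and the TTS-to-human change is -6.7 points for both models.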
Reference graph
Works this paper leans on
- [1] Towards a Phonology-Informed Evaluation of Multilingual TTS. arXiv, 2026.
  Background. Neural architectures and the massive scaling of training data have rapidly improved multilingual text-to-speech (TTS) systems. Standard evaluation metrics based on Mean Opinion Scores (MOS) focus on perceived naturalness and whether words are recoverable by automatic recognition. However, sounding natural does not guarantee that a system repr...
- [2] Materials & Methods. 2.1. Data and model setup.
  Human benchmark. We created the human benchmark corpus by recording the speech of 14 adult native Assamese speakers from the upper Assam region (8 females, 6 males). Each participant read out target words (X) embedded in a carrier frame (‘moi X buli kolu’, corresponding to English ‘I say X’). We manually sliced t...
- [3] Results. 3.1. Task 1: vowel-level ATR classification.
  Table 2 reports the accuracy (Acc) and macro-F1 for both models across the four directions.
  Table 2: Cross-domain ATR classification with Lobanov-normalized acoustic features.
  Model  Metric    H→H    H→TTS  TTS→TTS  TTS→H
  LR     Acc       81.7%  83%    86.5%    79.8%
  LR     macro-F1  0.81   0.81   0.84     0.77
  RF     Acc       90.5%  74.7%  87.5%    80.8%
  RF     m...
- [4] Conclusion.
  Our results show that MMS TTS underproduces the acoustic correlates of [+ATR] in mid vowels, with a 7:1 underproduction-to-overgeneration ratio absent from human speech. This bias persists at the word level: the drop in accuracy and macro-F1 from A+B gold relative to A+B pred on TTS transfer (Table 4) reflects that the phonological categori...
- [5] Use of Generative AI Disclosure.
  Generative AI was used to assist with code auto-completion, minor text editing, and grammar polishing. However, all scientific content, code implementations, results, and analyses were proposed, verified, and finalized by the authors.