arxiv: 2607.01965 · v1 · pith:YBCWA6LNnew · submitted 2026-07-02 · 💻 cs.CL · cs.ET · cs.LG

Towards a Phonology-Informed Evaluation of Multilingual TTS

Pith reviewed 2026-07-03 14:55 UTC · model grok-4.3

classification 💻 cs.CL · cs.ET · cs.LG
keywords multilingual TTS evaluation · phonological faithfulness · ATR vowel harmony · Assamese · classifier-based audit · speech synthesis diagnostics

The pith

A speech classifier trained on human recordings can audit TTS output for failures to preserve Assamese vowel harmony patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a classifier-based audit that checks whether neural TTS systems maintain language-specific phonological contrasts, using human speech as the reference. Applied to Assamese advanced tongue root vowel harmony in Meta's MMS TTS, the method finds that mid vowels specified as [+ATR] are produced as [-ATR] in roughly one-third of cases, a pattern not seen in natural speech. At the word level, labels from the classifier better recover the expected harmony rules than the system's own transcriptions do. This shows a measurable gap between the phonology the model intends and what it actually produces. The approach supplies a diagnostic that is specific to phonological faithfulness rather than overall naturalness scores.

Core claim

A classifier trained on human Assamese speech transfers to synthesized speech with only minimal loss in accuracy and can therefore be used to audit whether TTS output preserves underlying ATR specifications; the audit shows that [+ATR] mid vowels are realized as [-ATR] in one-third of tokens, a bias absent from human speech, while predicted ATR labels classify harmony patterns more accurately than the model's transcriptions at the word level.
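
The directional bias in this claim can be made concrete with a small counting routine. The following is a minimal sketch, not the authors' code: the function name, the "+" / "-" label encoding, and the toy labels are all assumptions made for illustration. It counts tokens specified as [+ATR] but classified as [-ATR] (underproduction) separately from the reverse (overgeneration).

```python
from collections import Counter

def atr_error_directionality(intended, predicted):
    """Count ATR mismatches by direction.

    intended / predicted: equal-length sequences of "+" or "-" labels,
    one per vowel token (a hypothetical encoding, not the paper's).
    """
    if len(intended) != len(predicted):
        raise ValueError("label sequences must have the same length")
    counts = Counter(zip(intended, predicted))
    under = counts[("+", "-")]  # specified [+ATR], classified [-ATR]
    over = counts[("-", "+")]   # specified [-ATR], classified [+ATR]
    n_plus = under + counts[("+", "+")]
    return {
        "underproduction": under,
        "overgeneration": over,
        # Share of [+ATR]-specified tokens that were classified [-ATR].
        "underproduction_rate": under / n_plus if n_plus else 0.0,
    }

# Toy labels, made up for illustration only.
intended = ["+", "+", "+", "-", "-", "-"]
predicted = ["-", "+", "+", "-", "-", "+"]
result = atr_error_directionality(intended, predicted)
```

Running the same routine on human and TTS tokens would let the two underproduction rates be compared side by side, which is the comparison the claim rests on.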

What carries the argument

Classifier-based framework that audits TTS output against language-specific phonological patterns by comparing predicted labels on synthesized speech to those on human speech.
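
A minimal sketch of this audit loop, under stated assumptions: a nearest-centroid classifier stands in for the paper's models (the Table 2 excerpt in the reference graph lists LR and RF), the function names are hypothetical, and the two-dimensional feature values are made-up toy numbers rather than data from the paper. The classifier is fit on human tokens only, then applied to TTS tokens whose intended labels are known.

```python
import math

def fit_centroids(features, labels):
    """Mean feature vector per label (a nearest-centroid classifier)."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Label of the centroid closest to x in Euclidean distance."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))

def audit(centroids, features, intended):
    """Fraction of tokens whose predicted label differs from the intended one."""
    predicted = [predict(centroids, x) for x in features]
    mismatches = sum(p != t for p, t in zip(predicted, intended))
    return mismatches / len(intended)

# Toy, made-up normalized features for human speech (training domain).
human_x = [(-1.0, 1.0), (-0.8, 0.9), (1.0, -1.0), (0.9, -0.8)]
human_y = ["+", "+", "-", "-"]
centroids = fit_centroids(human_x, human_y)

# Toy, made-up TTS tokens, all specified as [+ATR] in the input text.
tts_x = [(-0.9, 1.0), (0.8, -0.7), (-1.1, 0.8)]
tts_intended = ["+", "+", "+"]

human_mismatch = audit(centroids, human_x, human_y)
tts_mismatch = audit(centroids, tts_x, tts_intended)
```

A gap between `tts_mismatch` and `human_mismatch` is the signal the framework reads as a faithfulness failure. Whether that reading is warranted depends on the load-bearing premise discussed further down.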

If this is right

  • TTS systems can pass standard naturalness tests yet still systematically neutralize phonological contrasts required for grammatical distinctions.
  • Word-level harmony classification using the classifier's labels recovers the underlying phonological grammar more reliably than orthographic transcriptions from the same TTS output.
  • The same transfer approach should extend to other phonological contrasts with measurable acoustic cues, provided human reference speech is available to train the classifier.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the bias is confirmed across other TTS architectures, evaluation pipelines could add an automatic phonological-faithfulness check before deployment in language-learning or documentation tools.
  • The method supplies a way to quantify how much of a language's contrastive system survives synthesis, which could guide targeted data collection for under-resourced languages.
  • Extending the audit to sentence-level or discourse-level phonological rules would test whether the same classifier transfer holds beyond isolated vowels.

Load-bearing premise

Any difference between the classifier's output on TTS speech and the intended phonological specification counts as a production error by the TTS system rather than an acoustic modeling difference.

What would settle it

If retraining or fine-tuning the classifier on a small set of TTS examples removes the reported one-third bias in mid-vowel ATR realization while leaving human-speech accuracy unchanged, the audit would no longer indicate a TTS-specific phonological failure.

Figures

Figures reproduced from arXiv: 2607.01965 by Neeraj Kumar Sharma, Shakuntala Mahanta, Sneha Ray Barman.

Figure 1: Faithfulness audit error directionality. (a) Mismatch concentrates in mid [+ATR] vowels /e/ and /o/ in TTS, whereas human mismatch is highest for /u/, /U/, and /E/. (b) Human errors are roughly symmetric; TTS errors show a 7:1 underproduction-to-overgeneration ratio.
Abstract

Neural TTS systems can sound natural across languages, but naturalness does not guarantee the preservation of sound contrasts that distinguish words from their grammatical forms. Standard metrics like MOS do not test for this. We propose a classifier-based framework that audits TTS output against language-specific phonological patterns using human speech as a benchmark. Testing Assamese advanced tongue root (ATR) vowel harmony with Meta's MMS TTS, we show that a classifier trained on human speech transfers to synthesized speech with minimal loss. The faithfulness audit reveals that [+ATR] mid vowels are realized as [-ATR] in 1/3 tokens despite an underlying [+ATR] specification, a bias absent in human speech. At the word level, predicted ATR labels classify harmony more accurately than transcription labels, indicating a gap between intended and produced phonology. The framework offers task-specific diagnostics and generalizes to other phonological contrasts with measurable acoustic cues.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes a classifier-based framework for auditing phonological faithfulness in multilingual TTS systems, using Assamese ATR vowel harmony and Meta's MMS TTS as the test case. It reports that a classifier trained on human speech transfers to TTS output with minimal loss, that [+ATR] mid vowels are realized as [-ATR] in 1/3 of tokens (a bias absent in human speech), and that predicted ATR labels classify word-level harmony more accurately than transcription labels.

Significance. If the transfer validity and empirical findings are substantiated with appropriate controls and metrics, the framework would offer a targeted diagnostic for phonological contrast preservation in TTS that complements standard metrics like MOS, potentially useful for evaluating systems on languages with vowel harmony or other measurable contrasts.

major comments (3)
  1. [Abstract] The assertion that the classifier 'transfers to synthesized speech with minimal loss' is unsupported by any reported metrics (validation accuracy, data sizes, feature sets, or statistical tests), which is load-bearing for both the 1/3 bias claim and the word-level harmony result.
  2. [Abstract] The 1/3 realization rate for [+ATR] mid vowels as [-ATR] lacks controls for acoustic domain shift (e.g., altered formants or spectral properties in TTS), so it is unclear whether mismatches reflect TTS phonological bias or classifier sensitivity to non-phonological differences between human training data and TTS waveforms.
  3. [Abstract] The word-level finding that predicted ATR labels outperform transcription labels in classifying harmony is derived directly from the same classifier outputs; without evidence that the classifier recovers underlying specifications accurately on TTS data, this result cannot reliably indicate a gap between intended and produced phonology.
minor comments (1)
  1. [Abstract] Adding a short description of the acoustic features or classifier architecture would improve clarity and allow readers to assess potential confounds.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and will revise the manuscript to improve transparency in the abstract and main text.

  1. Referee: [Abstract] The assertion that the classifier 'transfers to synthesized speech with minimal loss' is unsupported by any reported metrics (validation accuracy, data sizes, feature sets, or statistical tests), which is load-bearing for both the 1/3 bias claim and the word-level harmony result.

    Authors: The full manuscript reports these supporting details in the methods and results sections. To address the concern that the abstract claim stands alone without numbers, we will revise the abstract to include the key metrics: validation accuracy on human speech, transfer accuracy on TTS, dataset sizes, feature sets (MFCCs), and statistical tests. This makes the abstract self-contained while preserving the original findings.

    revision: yes

  2. Referee: [Abstract] The 1/3 realization rate for [+ATR] mid vowels as [-ATR] lacks controls for acoustic domain shift (e.g., altered formants or spectral properties in TTS), so it is unclear whether mismatches reflect TTS phonological bias or classifier sensitivity to non-phonological differences between human training data and TTS waveforms.

    Authors: This is a fair point about potential domain shift effects. The current evidence relies on the classifier's transfer performance and the absence of the bias in human speech data. We will add a discussion of acoustic comparisons (e.g., formant distributions) between domains and note this as a limitation. Full isolation of phonological vs. acoustic factors would require further targeted experiments.

    revision: partial

  3. Referee: [Abstract] The word-level finding that predicted ATR labels outperform transcription labels in classifying harmony is derived directly from the same classifier outputs; without evidence that the classifier recovers underlying specifications accurately on TTS data, this result cannot reliably indicate a gap between intended and produced phonology.

    Authors: The transfer accuracy with minimal loss provides the evidence that the classifier remains reliable on TTS data. We interpret the improved harmony classification using predicted labels (vs. transcriptions) as indicating a mismatch between intended and realized phonology. We will expand the discussion to explicitly link the transfer result to this interpretation but do not believe additional validation experiments are required for the current claim.

    revision: no
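
The second response mentions comparing formant distributions between domains. One simple way such a comparison could be quantified is a standardized mean difference (Cohen's d) on a single acoustic measure for the same vowel category in human and TTS tokens. This is a hedged sketch, not the authors' analysis: the function name and the toy values are assumptions, and a real check would also need to look at variance and distribution shape.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference between two samples, using the pooled SD.

    Each sample needs at least two values, and the pooled variance
    must be non-zero.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Toy, made-up values of one acoustic measure for one vowel category.
human_f1 = [1.0, 2.0, 3.0]
tts_f1 = [2.0, 3.0, 4.0]
d = cohens_d(human_f1, tts_f1)
```

A large |d| on tokens that both domains are supposed to realize identically would suggest a domain shift that the classifier might be reacting to, independent of any phonological bias.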

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with external human-speech reference

The paper describes a classifier trained on human speech and applied to TTS waveforms for phonological auditing. No equations, derivations, fitted parameters, or self-citations are presented as load-bearing steps. The central results (ATR mismatch rates, word-level harmony accuracy) are obtained by direct comparison to an independent human-speech benchmark rather than by construction from the TTS data itself or from prior self-authored results. This is standard empirical evaluation and contains no self-definitional, fitted-input, or self-citation reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into parameters or assumptions; the core premise is classifier transferability.

axioms (1)
  • domain assumption: A classifier trained on human speech captures phonological patterns that transfer to TTS output with minimal loss.
    Stated directly in the abstract as the basis for the audit.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

  1. [1]

    Towards a Phonology-Informed Evaluation of Multilingual TTS

    Background Neural architectures and the massive scaling of training data have rapidly improved multilingual text-to-speech (TTS) systems. Standard evaluation metrics based on Mean Opinion Scores (MOS) focus on perceived naturalness and whether words are recoverable by automatic recognition. However, sounding natural does not guarantee that a system repr...

  2. [2]

    Data and model setup Human benchmark. We created the human benchmark corpus by recording the speech of 14 adult native Assamese speakers from the upper Assam region (8 females, 6 males)

    Materials & Methods 2.1. Data and model setup Human benchmark. We created the human benchmark corpus by recording the speech of 14 adult native Assamese speakers from the upper Assam region (8 females, 6 males). Each participant read out target words (X) embedded in a carrier frame (‘moi X buli kolu’, corresponding to English ‘I say X’). We manually sliced t...

  3. [3]

    Task 1: vowel-level ATR classification Table 2 reports the accuracy (Acc) and macro-F1 for both models across the four directions

    Results 3.1. Task 1: vowel-level ATR classification Table 2 reports the accuracy (Acc) and macro-F1 for both models across the four directions.

    Table 2: Cross-domain ATR classification with Lobanov-normalized acoustic features.

    Model  Metric    H→H    H→TTS  TTS→TTS  TTS→H
    LR     Acc       81.7%  83%    86.5%    79.8%
    LR     macro-F1  0.81   0.81   0.84     0.77
    RF     Acc       90.5%  74.7%  87.5%    80.8%
    RF     m...

  4. [4]

    Conclusion Our results show that MMS TTS underproduces the acoustic correlates of [+ATR] in mid vowels, with a 7:1 underproduction-to-overgeneration ratio absent from human speech. This bias persists at the word-level: the drop in accuracy and macro-F1 from A+B gold relative to A+B pred on TTS transfer (Table 4) reflects that the phonological categori...

  5. [5]

    However, all scientific content, code implementations, results, and analyses were proposed, verified, and finalized by the authors

    Use of Generative AI Disclosure Generative AI was used to assist with code auto-completion, minor text editing, and grammar polishing. However, all scientific content, code implementations, results, and analyses were proposed, verified, and finalized by the authors

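
The Table 2 excerpt in the reference graph above mentions Lobanov-normalized acoustic features. Lobanov normalization is a per-speaker z-score, commonly applied to vowel formants so that tokens from different speakers become comparable. The sketch below illustrates that general technique only: the excerpt does not say which features the paper normalizes, so the function name, the `(speaker, formants)` input layout, and the toy Hz values are all assumptions. The sample standard deviation is used here; some implementations use the population form instead.

```python
from collections import defaultdict
from statistics import mean, stdev

def lobanov_normalize(tokens):
    """Z-score each formant within each speaker.

    tokens: list of (speaker_id, (f1, f2, ...)) tuples.
    Each speaker needs at least two tokens and non-constant values
    per formant, otherwise the standard deviation is undefined or zero.
    """
    by_speaker = defaultdict(list)
    for speaker, formants in tokens:
        by_speaker[speaker].append(formants)

    # Per-speaker (mean, sd) for every formant column.
    stats = {}
    for speaker, rows in by_speaker.items():
        columns = list(zip(*rows))
        stats[speaker] = [(mean(c), stdev(c)) for c in columns]

    return [
        (speaker, tuple((f - m) / s for f, (m, s) in zip(formants, stats[speaker])))
        for speaker, formants in tokens
    ]

# Toy, made-up formant values in Hz for two hypothetical speakers.
tokens = [
    ("A", (300.0, 2200.0)),
    ("A", (500.0, 1800.0)),
    ("B", (350.0, 2500.0)),
    ("B", (650.0, 1900.0)),
]
normalized = lobanov_normalize(tokens)
```

In this toy case the two speakers have different raw ranges but identical normalized values, which is the property that makes a single classifier usable across speakers.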