Towards a Phonology-Informed Evaluation of Multilingual TTS

Neeraj Kumar Sharma; Shakuntala Mahanta; Sneha Ray Barman

arXiv: 2607.01965 · v1 · pith:YBCWA6LNnew · submitted 2026-07-02 · cs.CL · cs.ET · cs.LG

Pith reviewed 2026-07-03 14:55 UTC · model grok-4.3

keywords: multilingual TTS evaluation · phonological faithfulness · ATR vowel harmony · Assamese · classifier-based audit · speech synthesis diagnostics

The pith

A speech classifier trained on human recordings can audit TTS output for failures to preserve Assamese vowel harmony patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a classifier-based audit that checks whether neural TTS systems maintain language-specific phonological contrasts, using human speech as the reference. Applied to Assamese advanced tongue root vowel harmony in Meta's MMS TTS, the method finds that mid vowels specified as [+ATR] are produced as [-ATR] in roughly one-third of cases, a pattern not seen in natural speech. At the word level, labels from the classifier better recover the expected harmony rules than the system's own transcriptions do. This shows a measurable gap between the phonology the model intends and what it actually produces. The approach supplies a diagnostic that is specific to phonological faithfulness rather than overall naturalness scores.

Core claim

A classifier trained on human Assamese speech transfers to synthesized speech with only minimal loss in accuracy and can therefore be used to audit whether TTS output preserves underlying ATR specifications; the audit shows that [+ATR] mid vowels are realized as [-ATR] in one-third of tokens, a bias absent from human speech, while predicted ATR labels classify harmony patterns more accurately than the model's transcriptions at the word level.

What carries the argument

Classifier-based framework that audits TTS output against language-specific phonological patterns by comparing predicted labels on synthesized speech to those on human speech.
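
The directional comparison at the heart of this audit can be illustrated with a minimal Python sketch. It assumes each vowel token already carries an intended (underlying) ATR label and a label predicted by a classifier trained on human speech. The function name `audit_atr`, the label strings, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter

PLUS, MINUS = "+ATR", "-ATR"  # hypothetical label strings


def audit_atr(intended, predicted):
    """Count directional mismatches between intended and predicted ATR labels.

    Underproduction: intended [+ATR] but classified as [-ATR].
    Overgeneration: intended [-ATR] but classified as [+ATR].
    """
    if len(intended) != len(predicted):
        raise ValueError("label sequences must have equal length")
    pairs = Counter(zip(intended, predicted))
    under = pairs[(PLUS, MINUS)]
    over = pairs[(MINUS, PLUS)]
    n_plus = sum(1 for label in intended if label == PLUS)
    n_minus = sum(1 for label in intended if label == MINUS)
    return {
        "underproduction": under,
        "overgeneration": over,
        # Rates are relative to the number of tokens with that intended label.
        "underproduction_rate": under / n_plus if n_plus else None,
        "overgeneration_rate": over / n_minus if n_minus else None,
    }


# Toy data, invented for illustration: six intended [+ATR] and four [-ATR] tokens.
intended = [PLUS] * 6 + [MINUS] * 4
predicted = [MINUS, MINUS, PLUS, PLUS, PLUS, PLUS, MINUS, MINUS, MINUS, PLUS]
report = audit_atr(intended, predicted)
```

Running the same function once on human recordings and once on TTS output, then comparing the two reports, mirrors the human-versus-TTS contrast the audit relies on.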

If this is right

TTS systems can pass standard naturalness tests yet still systematically neutralize phonological contrasts required for grammatical distinctions.
Word-level harmony classification using the classifier's labels recovers the underlying phonological grammar more reliably than orthographic transcriptions from the same TTS output.
The same transfer approach should extend to other phonological contrasts with measurable acoustic cues, provided human reference recordings of the contrast are available to train the classifier.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the bias is confirmed across other TTS architectures, evaluation pipelines could add an automatic phonological-faithfulness check before deployment in language-learning or documentation tools.
The method supplies a way to quantify how much of a language's contrastive system survives synthesis, which could guide targeted data collection for under-resourced languages.
Extending the audit to sentence-level or discourse-level phonological rules would test whether the same classifier transfer holds beyond isolated vowels.

Load-bearing premise

Any difference between the classifier's output on TTS speech and the intended phonological specification counts as a production error by the TTS system rather than an acoustic modeling difference.

What would settle it

If retraining or fine-tuning the classifier on a small set of TTS examples removes the reported one-third bias in mid-vowel ATR realization while leaving human-speech accuracy unchanged, the audit would no longer indicate a TTS-specific phonological failure.

Figures

Figures reproduced from arXiv: 2607.01965 by Neeraj Kumar Sharma, Shakuntala Mahanta, Sneha Ray Barman.

**Figure 1.** Faithfulness audit error directionality. (a) Mismatch concentrates in mid [+ATR] vowels /e/ and /o/ in TTS, whereas human mismatch is highest for /u/, /U/, and /E/. (b) Human errors are roughly symmetric; TTS errors show a 7:1 underproduction-to-overgeneration ratio.

Original abstract

Neural TTS systems can sound natural across languages, but naturalness does not guarantee the preservation of sound contrasts that distinguish words from their grammatical forms. Standard metrics like MOS do not test for this. We propose a classifier-based framework that audits TTS output against language-specific phonological patterns using human speech as a benchmark. Testing Assamese advanced tongue root (ATR) vowel harmony with Meta's MMS TTS, we show that a classifier trained on human speech transfers to synthesized speech with minimal loss. The faithfulness audit reveals that [+ATR] mid vowels are realized as [-ATR] in 1/3 tokens despite an underlying [+ATR] specification, a bias absent in human speech. At the word level, predicted ATR labels classify harmony more accurately than transcription labels, indicating a gap between intended and produced phonology. The framework offers task-specific diagnostics and generalizes to other phonological contrasts with measurable acoustic cues.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper introduces a classifier audit for TTS phonological fidelity in Assamese ATR harmony, but the transfer claim and the 1/3 bias rest on details the abstract does not show.

The main thing to know is that the authors test a classifier trained on human speech to flag when TTS output fails to keep phonological contrasts like advanced tongue root harmony. They report that Meta's MMS system realizes [+ATR] mid vowels as [-ATR] in roughly one third of cases, unlike human speech, and that the classifier labels work better than transcriptions for word-level harmony detection.

What is actually new is the transfer setup itself as a diagnostic tool. Standard metrics like MOS focus on naturalness and do not check whether the system produces the right sound distinctions for grammar or word identity. The framework is straightforward: train on human data, apply to TTS waveforms, and measure mismatches against the underlying specification. That addresses a real gap for multilingual TTS where languages have measurable acoustic cues for contrasts.

The soft spots sit in the empirical support. The abstract states that the classifier transfers with minimal loss and produces the 1/3 bias finding, yet it gives no validation accuracy, no feature list, no sample sizes, and no tests for domain shift between human and synthetic acoustics. If the classifier boundary is sensitive to TTS-specific changes in vowel space or spectral properties, the reported bias could be an artifact rather than a production error. The concern carries over to the word-level result, which depends directly on those same classifier outputs.

This is for people building or evaluating TTS systems in languages with harmony or similar contrasts. A reader who wants targeted diagnostics beyond naturalness scores could get value from the idea, provided the full paper supplies the missing classifier metrics and controls. It deserves a serious referee because the core proposal is concrete and falsifiable even if the current evidence is light on the transfer step.

Referee Report

3 major / 1 minor

Summary. The paper proposes a classifier-based framework for auditing phonological faithfulness in multilingual TTS systems, using Assamese ATR vowel harmony and Meta's MMS TTS as the test case. It reports that a classifier trained on human speech transfers to TTS output with minimal loss, that [+ATR] mid vowels are realized as [-ATR] in 1/3 of tokens (a bias absent in human speech), and that predicted ATR labels classify word-level harmony more accurately than transcription labels.

Significance. If the transfer validity and empirical findings are substantiated with appropriate controls and metrics, the framework would offer a targeted diagnostic for phonological contrast preservation in TTS that complements standard metrics like MOS, potentially useful for evaluating systems on languages with vowel harmony or other measurable contrasts.

major comments (3)

[Abstract] The assertion that the classifier 'transfers to synthesized speech with minimal loss' is unsupported by any reported metrics (validation accuracy, data sizes, feature sets, or statistical tests), which is load-bearing for both the 1/3 bias claim and the word-level harmony result.
[Abstract] The 1/3 realization rate for [+ATR] mid vowels as [-ATR] lacks controls for acoustic domain shift (e.g., altered formants or spectral properties in TTS), so it is unclear whether mismatches reflect TTS phonological bias or classifier sensitivity to non-phonological differences between human training data and TTS waveforms.
[Abstract] The word-level finding that predicted ATR labels outperform transcription labels in classifying harmony is derived directly from the same classifier outputs; without evidence that the classifier recovers underlying specifications accurately on TTS data, this result cannot reliably indicate a gap between intended and produced phonology.
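
One way to picture the kind of statistical test these comments ask for is a two-proportion z-test comparing the mismatch rate in TTS tokens with the rate in human tokens. The sketch below is a generic textbook test, not the authors' analysis; the function name `two_proportion_z` and the counts are hypothetical and invented for illustration.

```python
import math


def two_proportion_z(k1, n1, k2, n2):
    """Two-sided two-proportion z-test using the pooled standard error.

    k1/n1 and k2/n2 are mismatch counts over token totals in two conditions.
    Returns (z, p_value).
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, invented for illustration only (not from the paper).
z_stat, p_val = two_proportion_z(33, 100, 5, 100)
```

A test like this only says whether two mismatch rates differ; it does not separate a phonological bias from classifier sensitivity to domain shift, which is the concern raised in the second comment.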

minor comments (1)

[Abstract] Adding a short description of the acoustic features or classifier architecture would improve clarity and allow readers to assess potential confounds.
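
As an illustration of the kind of feature description requested here: the Table 2 caption excerpted in the reference graph mentions Lobanov-normalized acoustic features. Lobanov normalization z-scores each formant within a speaker. The sketch below assumes F1 and F2 values in Hz and uses the population standard deviation; the function name, the data layout, and the sample values are hypothetical, and the excerpt does not say which formants or which standard-deviation convention the paper uses.

```python
from collections import defaultdict
from statistics import mean, pstdev


def lobanov_normalize(tokens, formants=("f1", "f2")):
    """Z-score each formant within each speaker (Lobanov normalization).

    tokens: list of dicts with a 'speaker' key and one key per formant (Hz).
    Returns new dicts with extra '<formant>_z' keys; inputs are not modified.
    """
    by_speaker = defaultdict(list)
    for tok in tokens:
        by_speaker[tok["speaker"]].append(tok)

    # Per-speaker mean and population standard deviation for each formant.
    stats = {}
    for speaker, toks in by_speaker.items():
        stats[speaker] = {}
        for f in formants:
            values = [t[f] for t in toks]
            stats[speaker][f] = (mean(values), pstdev(values))

    normalized = []
    for tok in tokens:
        out = dict(tok)
        for f in formants:
            mu, sd = stats[tok["speaker"]][f]
            # Guard against a speaker with no variation in this formant.
            out[f + "_z"] = (tok[f] - mu) / sd if sd > 0 else 0.0
        normalized.append(out)
    return normalized


# Toy values, invented for illustration only.
sample = [
    {"speaker": "s1", "f1": 300.0, "f2": 2000.0},
    {"speaker": "s1", "f1": 500.0, "f2": 1000.0},
]
normalized_sample = lobanov_normalize(sample)
```

Because the statistics are computed per speaker, treating the TTS voice as its own "speaker" would remove overall vowel-space offsets between domains, though it would not by itself rule out the domain-shift concern raised in the major comments.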

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and will revise the manuscript to improve transparency in the abstract and main text.

Referee: [Abstract] The assertion that the classifier 'transfers to synthesized speech with minimal loss' is unsupported by any reported metrics (validation accuracy, data sizes, feature sets, or statistical tests), which is load-bearing for both the 1/3 bias claim and the word-level harmony result.

Authors: The full manuscript reports these supporting details in the methods and results sections. To address the concern that the abstract claim stands alone without numbers, we will revise the abstract to include the key metrics: validation accuracy on human speech, transfer accuracy on TTS, dataset sizes, feature sets (Lobanov-normalized acoustic features), and statistical tests. This makes the abstract self-contained while preserving the original findings.
revision: yes

Referee: [Abstract] The 1/3 realization rate for [+ATR] mid vowels as [-ATR] lacks controls for acoustic domain shift (e.g., altered formants or spectral properties in TTS), so it is unclear whether mismatches reflect TTS phonological bias or classifier sensitivity to non-phonological differences between human training data and TTS waveforms.

Authors: This is a fair point about potential domain shift effects. The current evidence relies on the classifier's transfer performance and the absence of the bias in human speech data. We will add a discussion of acoustic comparisons (e.g., formant distributions) between domains and note this as a limitation. Full isolation of phonological vs. acoustic factors would require further targeted experiments.
revision: partial

Referee: [Abstract] The word-level finding that predicted ATR labels outperform transcription labels in classifying harmony is derived directly from the same classifier outputs; without evidence that the classifier recovers underlying specifications accurately on TTS data, this result cannot reliably indicate a gap between intended and produced phonology.

Authors: The transfer accuracy with minimal loss provides the evidence that the classifier remains reliable on TTS data. We interpret the improved harmony classification using predicted labels (vs. transcriptions) as indicating a mismatch between intended and realized phonology. We will expand the discussion to explicitly link the transfer result to this interpretation but do not believe additional validation experiments are required for the current claim.
revision: no

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with external human-speech reference

The paper describes a classifier trained on human speech and applied to TTS waveforms for phonological auditing. No equations, derivations, fitted parameters, or self-citations are presented as load-bearing steps. The central results (ATR mismatch rates, word-level harmony accuracy) are obtained by direct comparison to an independent human-speech benchmark rather than by construction from the TTS data itself or from prior self-authored results. This is standard empirical evaluation and contains no self-definitional, fitted-input, or self-citation reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into parameters or assumptions; the core premise is classifier transferability.

axioms (1)

domain assumption: A classifier trained on human speech captures phonological patterns that transfer to TTS output with minimal loss.
Stated directly in the abstract as the basis for the audit.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

[1] Towards a Phonology-Informed Evaluation of Multilingual TTS, arXiv, 2026 (internal anchor). Background: Neural architectures and the massive scaling of training data have rapidly improved multilingual text-to-speech (TTS) systems. Standard evaluation metrics based on Mean Opinion Scores (MOS) focus on perceived naturalness and whether words are recoverable by automatic recognition. However, sounding natural does not guarantee that a system repr...
[2] Materials & Methods, 2.1. Data and model setup. Human benchmark. We created the human benchmark corpus by recording the speech of 14 adult native Assamese speakers from the upper Assam region (8 females, 6 males). Each participant read out target words (X) embedded in a carrier frame (‘moi X buli kolu’, corresponding to English ‘I say X’). We manually sliced t...
[3] Results, 3.1. Task 1: vowel-level ATR classification. Table 2 reports the accuracy (Acc) and macro-F1 for both models across the four directions.
Table 2: Cross-domain ATR classification with Lobanov-normalized acoustic features.
Model  Metric    H→H    H→TTS  TTS→TTS  TTS→H
LR     Acc       81.7%  83%    86.5%    79.8%
       macro-F1  0.81   0.81   0.84     0.77
RF     Acc       90.5%  74.7%  87.5%    80.8%
       m...
[4] Conclusion. Our results show that MMS TTS underproduces the acoustic correlates of [+ATR] in mid vowels, with a 7:1 underproduction-to-overgeneration ratio absent from human speech. This bias persists at the word-level: the drop in accuracy and macro-F1 from A+B gold relative to A+B pred on TTS transfer (Table 4) reflects that the phonological categori...
[5] Use of Generative AI Disclosure. Generative AI was used to assist with code auto-completion, minor text editing, and grammar polishing. However, all scientific content, code implementations, results, and analyses were proposed, verified, and finalized by the authors.
[6] Z. Malisz, G. E. Henter, C. V. Botinhao, O. Watts, J. Beskow, and J. Gustafson, “Modern speech synthesis for phonetic sciences: A discussion and an evaluation,” in Proc. Intl. Congress of Phonetic Sciences (ICPhS), 2019, pp. 487–491.
[7] ITU-R, “BS.1534-1: Method for the subjective assessment of intermediate quality level of audio systems,” Recommendation ITU-R BS.1534-1 (MUSHRA), 2001.
[8] ITU-T, “P.800: Methods for subjective determination of transmission quality,” Recommendation ITU-T P.800, 1996.
[9] ——, “P.910: Subjective video quality assessment methods for multimedia applications,” Recommendation ITU-T P.910, https://www.itu.int/rec/T-REC-P.910/en, 2021.
[10] S. King and V. Karaiskos, “The blizzard challenge 2016,” in Proc. Blizzard Challenge Workshop, 2016, pp. 1–16.
[11] P. Wagner, J. Beskow, S. Betz, J. Edlund, J. Gustafson, G. Eje Henter, S. Le Maguer, Z. Malisz, É. Székely, C. Tånnander, and J. Voße, “Speech synthesis evaluation – state-of-the-art assessment and suggestion for a novel research program,” in Proc. Speech Synthesis Workshop (SSW), 2019, pp. 105–110.
[12] S. Le Maguer, S. King, and N. Harte, “The limits of the mean opinion score for speech synthesis evaluation,” Computer Speech and Language, vol. 84, 2024.
[13] A. Kirkland, S. Mehta, H. Lameris, G. E. Henter, E. Szekely, and J. Gustafson, “Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation,” in Proc. Speech Synthesis Workshop (SSW), 2023.
[14] ITU-T, “P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Recommendation ITU-T P.862, 2001.
[15] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125–2136, 2011.
[16] C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, and Y. Tsao, “MOSNet: Deep learning-based objective assessment for voice conversion,” in Proc. Interspeech Conf., 2019. [Online]. Available: https://www.isca-archive.org/interspeech_2019/lo19_interspeech.html
[17] G. Mittag and S. Möller, “NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” in Proc. Interspeech Conf., 2021. [Online]. Available: https://www.isca-archive.org/interspeech_2021/mittag21_interspeech.html
[18] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS challenge 2022,” in Proc. Interspeech Conf., 2022, pp. 4521–4525.
[19] E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization ability of MOS prediction networks,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8442–8446.
[20] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno, and Y. Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Proc. Neural Information Processing Conf., 2018.
[21] G. K. Kumar, S. Praveen, P. Kumar, M. M. Khapra, and K. Nandakumar, “Towards building text-to-speech systems for the next billion users,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[22] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proc. Language Resources and Evaluation Conf. (LREC). European Language Resources Association, 2020, pp. 4218–
[23] [Online]. Available: https://aclanthology.org/2020.lrec-1.520/
[24] A. S. House, C. E. Williams, M. H. L. Hecker, and K. D. Kryter, “Articulation testing methods: Consonantal differentiation with a closed-response set,” J. Acoust. Soc. Am., vol. 37, no. 1, pp. 158–166, 1965.
[25] W. D. Voiers, “Evaluating processed speech using the diagnostic rhyme test,” Speech Technology, pp. 30–39, 1983.
[26] P. W. Nye and J. H. Gaitenby, “Consonant intelligibility in synthetic speech and in a natural speech control (Modified Rhyme Test results),” in Haskins Laboratories Status Report on Speech Research, vol. SR-33, 1973, pp. 77–91.
[27] X. Li, S. Dalmia, J. Li, M. Lee, P. Littell, J. Yao, A. Anastasopoulos, D. R. Mortensen, G. Neubig, A. W. Black et al., “Universal phone recognition with a multilingual allophone system,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8249–8253.
[28] Y. Belinkov and J. Glass, “Analysis methods in neural language processing: A survey,” Trans. Assoc. Comput. Linguistics (TACL), vol. 7, pp. 49–72, 2019.
[29] L. Deng, “Integrated-multilingual speech recognition using universal phonological features,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 1997, pp. 1007–1010.
[30] P. Cormac English, J. D. Kelleher, and J. Carson-Berndsen, “Domain-informed probing of wav2vec 2.0 embeddings for phonetic features,” in Proc. SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Seattle, Washington: Association for Computational Linguistics, Jul. 2022, pp. 83–91. [Online]. Available: https://aclanthol...
[31] S. Mahanta, “Directionality and locality in vowel harmony: With special reference to vowel harmony in Assamese,” Ph.D. dissertation, Netherlands Graduate School of Linguistics, 2008.
[32] S. Hess, “Assimilatory effects in a vowel harmony system: an acoustic analysis of advanced tongue root in Akan,” Journal of Phonetics, vol. 20, no. 4, pp. 475–492, 1992.
[33] P. Olejarczuk, M. A. Otero, and M. M. Baese-Berk, “Acoustic correlates of anticipatory and progressive [ATR] harmony processes in Ethiopian Komo,” Journal of Phonetics, vol. 74, pp. 18–41, 2019.
[34] V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi et al., “Scaling speech technology to 1,000+ languages,” J. Mach. Learn. Res. (JMLR), vol. 25, no. 97, pp. 1–52, 2024.
[35] P. Boersma and D. Weenink, “Praat: Doing phonetics by computer,” Version 6.4.60, https://www.praat.org/, 2026.
[36] Y. Xu and H. Gao, “FormantPro as a tool for speech analysis and segmentation,” Revista de Estudos da Linguagem, vol. 26, no. 4, pp. 1435–1454, 2018.

[1] [1]

Towards a Phonology-Informed Evaluation of Multilingual TTS

Background Neural architectures and the massive scaling of training data have rapidly improved multilingual text-to-speech (TTS) sys- tems. Standard evaluation metrics based on Mean Opinion Scores (MOS) focus on perceived naturalness and whether words are recoverable by automatic recognition. However, sounding natural does not guarantee that a system repr...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Data and model setup Human benchmark.We created the human benchmark corpus by recording the speech of14adult native Assamese speakers from the upper Assam region (8females,6males)

Materials & Methods 2.1. Data and model setup Human benchmark.We created the human benchmark corpus by recording the speech of14adult native Assamese speakers from the upper Assam region (8females,6males). Each par- ticipant read out target words (X) embedded in a carrier frame (‘moi X buli kolu’, corresponding to English‘I say X’). We man- ually sliced t...

work page

[3] [3]

Task 1: vowel-level ATR classification Table 2 reports the accuracy (Acc) and macro-F1 for both mod- els across the four directions

Results 3.1. Task 1: vowel-level ATR classification Table 2 reports the accuracy (Acc) and macro-F1 for both mod- els across the four directions. Table 2:Cross-domain ATR classification with Lobanov- normalized acoustic features. Model Metric H→H H→TTS TTS→TTS TTS→H LR Acc 81.7% 83% 86.5% 79.8% macro-F1 0.81 0.81 0.84 0.77 RF Acc 90.5% 74.7% 87.5% 80.8% m...

work page

[4] [4]

Conclusion Our results show that MMS TTS underproduces the acous- tic correlates of [+ATR] in mid vowels, with a 7:1 underproduction-to-overgeneration ratio absent from human speech. This bias persists at the word-level: the drop in ac- curacy and macro-F1 from A+B gold relative to A+B pred on TTS transfer (Table 4) reflects that the phonological categori...

work page

[5] [5]

However, all scien- tific content, code implementations, results, and analyses were proposed, verified, and finalized by the authors

Use of Generative AI Disclosure Generative AI was used to assist with code auto-completion, minor text editing, and grammar polishing. However, all scien- tific content, code implementations, results, and analyses were proposed, verified, and finalized by the authors

work page

[6] [6]

Modern speech synthesis for phonetic sciences: A discussion and an evaluation,

Z. Malisz, G. E. Henter, C. V . Botinhao, O. Watts, J. Beskow, and J. Gustafson, “Modern speech synthesis for phonetic sciences: A discussion and an evaluation,” inProc. Intl. Congress of Phonetic Sciences (ICPhS), 2019, pp. 487–491

work page 2019

[7] [7]

BS.1534-1: Method for the subjective assessment of in- termediate quality level of audio systems,

ITU-R, “BS.1534-1: Method for the subjective assessment of in- termediate quality level of audio systems,” Recommendation ITU- R BS.1534-1 (MUSHRA), 2001

work page 2001

[8] [8]

P.800: Methods for subjective determination of transmis- sion quality,

ITU-T, “P.800: Methods for subjective determination of transmis- sion quality,” Recommendation ITU-T P.800, 1996

work page 1996

[9] [9]

P.910: Subjective video quality assessment methods for multimedia applications,

——, “P.910: Subjective video quality assessment methods for multimedia applications,” Recommendation ITU-T P.910, https: //www.itu.int/rec/T-REC-P.910/en, 2021

work page 2021

[10] [10]

The blizzard challenge 2016,

S. King and V . Karaiskos, “The blizzard challenge 2016,” inProc. Blizzard Challenge Workshop, 2016, pp. 1–16

work page 2016

[11] [11]

Speech synthesis evaluation – state-of-the-art as- sessment and suggestion for a novel research program,

P. Wagner, J. Beskow, S. Betz, J. Edlund, J. Gustafson, G. Eje Henter, S. Le Maguer, Z. Malisz,´E. Sz´ekely, C. T˚annander, and J. V oße, “Speech synthesis evaluation – state-of-the-art as- sessment and suggestion for a novel research program,” inProc. Speech Synthesis Workshop (SSW), 2019, pp. 105–110

work page 2019

[12] [12]

The limits of the mean opinion score for speech synthesis evaluation,

S. Le Maguer, S. King, and N. Harte, “The limits of the mean opinion score for speech synthesis evaluation,”Computer Speech and Language, vol. 84, 2024

work page 2024

[13] [13]

Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation,

A. Kirkland, S. Mehta, H. Lameris, G. E. Henter, E. Szekely, and J. Gustafson, “Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation,” inProc. Speech Synthesis Workshop (SSW), 2023

work page 2023

[14] [14]

P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,

ITU-T, “P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Recom- mendation ITU-T P.862, 2001

work page 2001

[15] [15]

An al- gorithm for intelligibility prediction of time–frequency weighted noisy speech,

C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An al- gorithm for intelligibility prediction of time–frequency weighted noisy speech,”IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125–2136, 2011

work page 2011

[16] [16]

MOSNet: Deep learning-based objective assessment for voice conversion,

C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, and Y . Tsao, “MOSNet: Deep learning-based objective assessment for voice conversion,” inProc. Interspeech Conf., 2019. [Online]. Available: https://www.isca-archive.org/interspeech 2019/lo19 interspeech.html

work page 2019

[17] [17]

NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,

G. Mittag and S. M ¨oller, “NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” inProc. Interspeech Conf., 2021. [Online]. Available: https://www.isca-archive.org/interspeech 2021/mittag21 interspeech.html

work page 2021

[18] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS challenge 2022,” in Proc. Interspeech Conf., 2022, pp. 4521–4525

[19] E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization ability of MOS prediction networks,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8442–8446

[20] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno, and Y. Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Proc. Neural Information Processing Conf., 2018

[21] G. K. Kumar, S. Praveen, P. Kumar, M. M. Khapra, and K. Nandakumar, “Towards building text-to-speech systems for the next billion users,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

[22] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proc. Language Resources and Evaluation Conf. (LREC). European Language Resources Association, 2020, pp. 4218–

[23] [Online]. Available: https://aclanthology.org/2020.lrec-1.520/

[24] A. S. House, C. E. Williams, M. H. L. Hecker, and K. D. Kryter, “Articulation testing methods: Consonantal differentiation with a closed-response set,” J. Acoust. Soc. Am., vol. 37, no. 1, pp. 158–166, 1965

[25] W. D. Voiers, “Evaluating processed speech using the diagnostic rhyme test,” Speech Technology, pp. 30–39, 1983

[26] P. W. Nye and J. H. Gaitenby, “Consonant intelligibility in synthetic speech and in a natural speech control (Modified Rhyme Test results),” in Haskins Laboratories Status Report on Speech Research, vol. SR-33, 1973, pp. 77–91

[27] X. Li, S. Dalmia, J. Li, M. Lee, P. Littell, J. Yao, A. Anastasopoulos, D. R. Mortensen, G. Neubig, A. W. Black et al., “Universal phone recognition with a multilingual allophone system,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8249–8253

[28] Y. Belinkov and J. Glass, “Analysis methods in neural language processing: A survey,” Trans. Assoc. Comput. Linguistics (TACL), vol. 7, pp. 49–72, 2019

[29] L. Deng, “Integrated-multilingual speech recognition using universal phonological features,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 1997, pp. 1007–1010

[30] P. Cormac English, J. D. Kelleher, and J. Carson-Berndsen, “Domain-informed probing of wav2vec 2.0 embeddings for phonetic features,” in Proc. SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Seattle, Washington: Association for Computational Linguistics, Jul. 2022, pp. 83–91. [Online]. Available: https://aclanthol...

[31] S. Mahanta, “Directionality and locality in vowel harmony: With special reference to vowel harmony in Assamese,” Ph.D. dissertation, Netherlands Graduate School of Linguistics, 2008

[32] S. Hess, “Assimilatory effects in a vowel harmony system: an acoustic analysis of advanced tongue root in Akan,” Journal of Phonetics, vol. 20, no. 4, pp. 475–492, 1992

[33] P. Olejarczuk, M. A. Otero, and M. M. Baese-Berk, “Acoustic correlates of anticipatory and progressive [ATR] harmony processes in Ethiopian Komo,” Journal of Phonetics, vol. 74, pp. 18–41, 2019

[34] V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi et al., “Scaling speech technology to 1,000+ languages,” J. Mach. Learn. Res. (JMLR), vol. 25, no. 97, pp. 1–52, 2024

[35] P. Boersma and D. Weenink, “Praat: Doing phonetics by computer,” Version 6.4.60, https://www.praat.org/, 2026

[36] Y. Xu and H. Gao, “FormantPro as a tool for speech analysis and segmentation,” Revista de Estudos da Linguagem, vol. 26, no. 4, pp. 1435–1454, 2018