
arxiv: 2607.01502 · v1 · pith:EBSSK75Gnew · submitted 2026-07-01 · 💻 cs.CL

From Monolingual to Multilingual: Evaluating Mamba for ASR in South African Languages

Pith reviewed 2026-07-03 20:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords automatic speech recognition · Mamba · Conformer · multilingual ASR · South African languages · low-resource languages · language embeddings

The pith

Mamba matches Conformer accuracy for South African language speech recognition while training faster and using fewer resources, and multilingual training yields additional gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests Mamba, a state space model, against Conformer baselines for automatic speech recognition on seven South African languages, each with 50 hours of training data. Mamba reaches similar word error rates but requires less computation and finishes training more quickly. Pooling data across languages for multilingual training improves accuracy over separate monolingual models for both architectures. Adding language and language-family embeddings as biases to the acoustic features leaves in-domain performance unchanged yet strengthens results on data from other corpora, and these embeddings prove useful in lower-resource splits of 5 or 10 hours per language.

Core claim

In monolingual experiments, Mamba achieves recognition accuracy comparable to a Conformer model of similar scale on seven South African languages while consuming fewer computational resources and training faster. Multilingual training by pooling all languages consistently improves performance over monolingual training. Adding explicit language information does not improve in-domain performance but does improve cross-corpus robustness. In low-resource ablations with 5-hour and 10-hour per-language data, language embeddings provide gains, and removing or altering them hurts performance. Analysis shows the embeddings act as task-specific control vectors rather than encoders of typological linguistic similarity.

What carries the argument

Mamba state space model for ASR, with multilingual data pooling and optional language or language-family embeddings added as biases to downsampled acoustic representations.
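
As a rough illustration of the mechanism named above, the sketch below adds one language vector and, optionally, one language-family vector to every downsampled frame. The function name, the labels `lang_a` and `family_x`, the 3-dimensional vectors, and the choice to sum the two embeddings before adding them are all assumptions made for this example; the summary does not state the paper's embedding size or exact combination rule.

```python
from typing import Dict, List, Optional

Frame = List[float]


def add_language_bias(
    frames: List[Frame],
    lang_emb: Frame,
    family_emb: Optional[Frame] = None,
) -> List[Frame]:
    """Add the same language (and optional family) vector to every frame."""
    bias = list(lang_emb)
    if family_emb is not None:
        if len(family_emb) != len(bias):
            raise ValueError("language and family embeddings differ in size")
        # Assumption: the two embeddings are simply summed into one bias.
        bias = [l + f for l, f in zip(bias, family_emb)]

    biased = []
    for frame in frames:
        if len(frame) != len(bias):
            raise ValueError("frame and embedding dimensions differ")
        biased.append([x + b for x, b in zip(frame, bias)])
    return biased


# Hypothetical embedding tables; real ones would be learned during training.
LANG_EMB: Dict[str, Frame] = {"lang_a": [0.1, 0.0, -0.1]}
FAMILY_EMB: Dict[str, Frame] = {"family_x": [0.05, 0.05, 0.05]}

# Two toy "downsampled acoustic frames" of dimension 3.
frames = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
biased = add_language_bias(frames, LANG_EMB["lang_a"], FAMILY_EMB["family_x"])
```

Because the bias is constant over time, it shifts every frame of an utterance by the same offset. That is consistent with reading the embeddings as control vectors rather than as content features.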

Load-bearing premise

The 50-hour per-language data splits are representative of real usage and the Mamba and Conformer models were trained under sufficiently comparable conditions for the accuracy comparison to be meaningful.

What would settle it

A controlled re-run on the same data splits in which Mamba's word error rate on held-out test sets exceeds the Conformer's by more than a few percent after matching parameter count, optimizer, and preprocessing exactly.
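
The check described above depends on how word error rate is computed and compared. The following is a minimal sketch, assuming whitespace tokenization, corpus-level WER, and an absolute margin of 2 WER points as a stand-in for "a few percent"; the margin and the helper names are illustrative assumptions, not values from the paper.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def corpus_wer(refs: List[str], hyps: List[str]) -> float:
    """Corpus-level WER in percent: total errors over total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    if words == 0:
        raise ValueError("reference transcripts contain no words")
    return 100.0 * errors / words


def exceeds_margin(wer_mamba: float, wer_conformer: float, margin: float = 2.0) -> bool:
    """True if Mamba is worse than Conformer by more than `margin` WER points.

    The default margin is an assumed placeholder, not a threshold from the paper.
    """
    return wer_mamba - wer_conformer > margin
```

For example, `corpus_wer(["a b c d"], ["a x c"])` counts one substitution and one deletion against four reference words.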

Figures

Figures reproduced from arXiv: 2607.01502 by Badr M. Abdullah, Dietrich Klakow, Jesujoba O. Alabi, Julian Herreilers.

Figure 1: Difference in WER (%) relative to baseline for Conformer and …
Figure 2: Cosine Similarity of the embeddings from 50-hour per language MLE.

Original abstract

Recent advances in automatic speech recognition (ASR) have explored different sequence models, including Conformer-based models and newer state space models such as Mamba. Although prior work has evaluated these architectures in multiple languages, their effectiveness in African languages remains underexplored. In this work, we evaluate Mamba for ASR on seven South African languages. In monolingual experiments, each model is trained on 50 hours of speech per language, and we compare Mamba to a Conformer baseline of similar parameter scale. Mamba achieves similar recognition accuracy to Conformer while using fewer computational resources and training faster. We further evaluate generalization in this setting and find that both models struggle to generalize to speech that is much longer than what they were trained on. We then study multilingual ASR using Mamba models, where the baseline is pooling all languages together. On top of this, we tested three extensions: training with language-family information by adding both language and language-family embeddings as biases to the downsampled acoustic representations, and multitask learning with a CTC ASR objective and a language identification (LID) head. We find that multilingual training consistently improves performance over monolingual training. However, adding explicit language information does not improve in-domain performance but does improve cross-corpus robustness. We conducted ablation studies in low-resource multilingual settings using 5-hour and 10-hour per-language training data, where we observed gains from using language embeddings and further demonstrated that removing or altering them hurt model performance. Lastly, we analysed these embeddings and find that they do not capture linguistic similarity in a typological sense, but instead act as task-specific control vectors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper evaluates Mamba state-space models versus Conformer baselines for ASR on seven South African languages. In monolingual experiments on 50-hour per-language splits, Mamba is reported to match Conformer accuracy while using fewer resources and training faster; both architectures struggle to generalize to much longer utterances. Multilingual Mamba training (pooling languages) improves over monolingual baselines; adding language and language-family embeddings or an LID head does not help in-domain performance but improves cross-corpus robustness. In 5-hour and 10-hour low-resource ablations, language embeddings yield gains, and analysis indicates the embeddings function as task-specific control vectors rather than encoding typological similarity.

Significance. If the architecture comparison is shown to rest on matched training conditions, the work supplies empirical evidence on an underexplored domain (ASR for South African languages) and practical guidance on efficient multilingual modeling with state-space models. The low-resource ablations and embedding analysis add concrete, falsifiable observations about when explicit language information helps or harms.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (monolingual experiments): The central claim that 'Mamba achieves similar recognition accuracy to Conformer' rests on a baseline described only as 'of similar parameter scale' trained on the same 50-hour splits. The manuscript supplies no confirmation that optimizer, learning-rate schedule, batch size, data augmentation, or early-stopping criteria were identical; without this, the accuracy parity cannot be attributed to architecture rather than unequal optimization effort. This is load-bearing for all subsequent multilingual and ablation claims.
  2. [Abstract, experimental sections] Abstract and experimental sections: No error bars, standard deviations across seeds, or statistical significance tests are reported for any WER or accuracy figures. The absence of these quantities makes it impossible to determine whether reported similarities, improvements, or ablation effects exceed experimental noise.
  3. [Abstract, §3, §4] Abstract and §3/§4: Exact dataset statistics (total hours after preprocessing, speaker counts, train/dev/test splits per language) and a complete hyper-parameter table are missing. These omissions prevent independent verification of the 50-hour, 10-hour, and 5-hour regimes that underpin the monolingual-to-multilingual and low-resource claims.
minor comments (1)
  1. [Abstract] The abstract states that embeddings 'do not capture linguistic similarity in a typological sense' but the precise typological features or distance metric used for this conclusion are not named in the provided summary; a short clarification would aid readers.
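
The Figure 2 caption refers to cosine similarity between the learned embeddings. As a sketch of how such a pairwise similarity matrix could be produced, the snippet below uses made-up 2-dimensional vectors and hypothetical language labels; it does not reproduce the paper's embeddings or its typological comparison.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_matrix(embs: Dict[str, List[float]]) -> Tuple[List[str], List[List[float]]]:
    """Return sorted labels and the matrix of pairwise cosine similarities."""
    names = sorted(embs)
    matrix = [[cosine(embs[a], embs[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical embeddings, for illustration only.
toy_embeddings = {
    "lang_a": [1.0, 0.0],
    "lang_b": [0.0, 1.0],
    "lang_c": [2.0, 0.0],
}
names, matrix = similarity_matrix(toy_embeddings)
```

Comparing such a matrix against a matrix of typological distances between the same languages is one way the "typological sense" claim could be made explicit, provided the distance source is named.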

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and indicate planned revisions to improve the manuscript.

  1. Referee: [Abstract, §4] Abstract and §4 (monolingual experiments): The central claim that 'Mamba achieves similar recognition accuracy to Conformer' rests on a baseline described only as 'of similar parameter scale' trained on the same 50-hour splits. The manuscript supplies no confirmation that optimizer, learning-rate schedule, batch size, data augmentation, or early-stopping criteria were identical; without this, the accuracy parity cannot be attributed to architecture rather than unequal optimization effort. This is load-bearing for all subsequent multilingual and ablation claims.

    Authors: We agree that matched training conditions are essential to attribute performance to architecture. The manuscript does not explicitly confirm identical optimizer, learning-rate schedule, batch size, data augmentation, or early-stopping criteria. In the revision we will add a comprehensive description and table of all training procedures for both models, noting any architecture-specific adjustments while documenting efforts to ensure comparability. revision: yes

  2. Referee: [Abstract, experimental sections] Abstract and experimental sections: No error bars, standard deviations across seeds, or statistical significance tests are reported for any WER or accuracy figures. The absence of these quantities makes it impossible to determine whether reported similarities, improvements, or ablation effects exceed experimental noise.

    Authors: We acknowledge that the lack of error bars, standard deviations, and statistical tests weakens the interpretability of the results. In the revised manuscript we will rerun the primary experiments across multiple random seeds, report means with standard deviations, and include statistical significance tests for key comparisons. revision: yes

  3. Referee: [Abstract, §3, §4] Abstract and §3/§4: Exact dataset statistics (total hours after preprocessing, speaker counts, train/dev/test splits per language) and a complete hyper-parameter table are missing. These omissions prevent independent verification of the 50-hour, 10-hour, and 5-hour regimes that underpin the monolingual-to-multilingual and low-resource claims.

    Authors: We agree that precise dataset statistics and a full hyperparameter table are required for reproducibility. The revised manuscript will include tables reporting total hours after preprocessing, speaker counts, exact train/dev/test splits per language, and all hyperparameters used across the monolingual, multilingual, and low-resource experiments. revision: yes
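
The seed-level reporting promised in response 2 could take roughly the following form: a mean and sample standard deviation per system, plus an exact paired sign-flip permutation test on per-seed WER differences. The WER values below are invented placeholders, not results from the paper, and the choice of test is an assumption; the rebuttal does not say which significance test the authors intend to use.

```python
import itertools
import statistics
from typing import List, Tuple


def summarize(wers: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of WER across seeds."""
    return statistics.mean(wers), statistics.stdev(wers)


def sign_flip_p_value(a: List[float], b: List[float]) -> float:
    """Two-sided exact paired permutation test on per-seed differences.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so all 2**n sign assignments are enumerated.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Placeholder WERs (%) for five seeds per system; purely illustrative.
mamba_wer = [20.0, 21.0, 19.0, 20.0, 20.0]
conformer_wer = [22.0, 23.0, 21.0, 22.0, 22.0]

mamba_mean, mamba_std = summarize(mamba_wer)
p_value = sign_flip_p_value(mamba_wer, conformer_wer)
```

With only five seeds, the smallest two-sided p-value this test can return is 2/32 = 0.0625, so a handful of seeds cannot reach the usual 0.05 threshold with this test. More seeds or an utterance-level test would be needed for that.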

Circularity Check

0 steps flagged

No circularity: purely empirical model comparisons on fixed data splits

The paper reports direct experimental results from training Mamba and Conformer ASR models on 50-hour (and smaller) per-language splits, measuring WER and comparing resource usage. No equations, derivations, or fitted parameters are presented whose outputs are then renamed as predictions. No self-citation chains or uniqueness theorems are invoked to justify architectural choices. The central claims rest on observable training outcomes rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard supervised learning assumptions for ASR (CTC loss validity, i.i.d. training/test splits, fair hyper-parameter search) without additional free parameters or new entities introduced in the abstract.

axioms (2)
  • domain assumption: The 50-hour (and 5/10-hour) per-language data splits are sufficient and representative for training and evaluating ASR models.
    Invoked implicitly by reporting results on these splits without further justification.
  • domain assumption: Conformer and Mamba models can be compared fairly when matched on parameter scale.
    Stated in the monolingual experimental design.

pith-pipeline@v0.9.1-grok · 5840 in / 1361 out tokens · 25496 ms · 2026-07-03T20:52:25.682120+00:00 · methodology

