From Monolingual to Multilingual: Evaluating Mamba for ASR in South African Languages
Pith reviewed 2026-07-03 20:52 UTC · model grok-4.3
The pith
Mamba matches Conformer accuracy for South African language speech recognition while training faster and using fewer resources, and multilingual training yields additional gains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In monolingual experiments Mamba achieves recognition accuracy comparable to a Conformer model of similar scale on seven South African languages while consuming fewer computational resources and training faster. Multilingual training by pooling all languages consistently improves performance over monolingual training. Adding explicit language information does not improve in-domain performance but does improve cross-corpus robustness. In low-resource ablations with 5-hour and 10-hour per-language data, language embeddings provide gains whose removal or alteration hurts performance, and analysis shows the embeddings act as task-specific control vectors rather than encoders of typological lingu
What carries the argument
Mamba state space model for ASR, with multilingual data pooling and optional language or language-family embeddings added as biases to downsampled acoustic representations.
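The phrase "added as biases to downsampled acoustic representations" can be pictured with a minimal sketch. It assumes the embeddings share the frame dimension and are summed into every frame of an utterance. The function name, the toy vectors, and the choice of plain summation are illustrative assumptions, not details confirmed by the paper.

```python
from typing import List, Optional, Sequence


def add_embedding_bias(
    frames: Sequence[Sequence[float]],
    language_embedding: Sequence[float],
    family_embedding: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Add one per-utterance bias vector to every downsampled frame.

    Assumption: the bias is the language embedding, plus the
    language-family embedding when one is supplied.
    """
    bias = list(language_embedding)
    if family_embedding is not None:
        if len(family_embedding) != len(bias):
            raise ValueError("embedding dimensions must match")
        bias = [b + f for b, f in zip(bias, family_embedding)]

    biased = []
    for frame in frames:
        if len(frame) != len(bias):
            raise ValueError("frame and embedding dimensions must match")
        # The same bias is applied at every time step of the utterance.
        biased.append([x + b for x, b in zip(frame, bias)])
    return biased


# Toy utterance: 3 downsampled frames of dimension 2 (hypothetical values).
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
lang_vec = [0.5, -0.5]    # hypothetical learned language embedding
family_vec = [1.0, 1.0]   # hypothetical learned language-family embedding
biased = add_embedding_bias(frames, lang_vec, family_vec)
```

Because the bias is constant over time, it cannot carry frame-level acoustic information. It can only shift the representation as a whole, which is consistent with the reading of the embeddings as control vectors.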
Load-bearing premise
The 50-hour per-language data splits are representative of real usage and the Mamba and Conformer models were trained under sufficiently comparable conditions for the accuracy comparison to be meaningful.
What would settle it
A controlled re-run on the same data splits in which Mamba's word error rate on held-out test sets exceeds the Conformer's by more than a few percent after matching parameter count, optimizer, and preprocessing exactly.
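The test above depends on how word error rate is computed and on what "a few percent" means. The sketch below uses the standard definition of WER (word-level edit distance divided by the number of reference words) and leaves the margin as an explicit parameter. Treating the margin as an absolute difference is an assumption, and the helper names are hypothetical.

```python
from typing import Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (edit distance in words, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance with a rolling row: prev[j] is the cost of
    # turning the reference prefix seen so far into the first j hyp words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors over total reference words."""
    errors = 0
    total = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        total += n
    if total == 0:
        raise ValueError("no reference words")
    return errors / total


def mamba_clearly_worse(wer_mamba: float, wer_conformer: float, margin: float) -> bool:
    """True if Mamba's WER exceeds the Conformer's by more than `margin`.

    `margin` is an absolute WER difference (assumption); the source does
    not say whether "a few percent" is absolute or relative.
    """
    return wer_mamba - wer_conformer > margin
```

For example, `mamba_clearly_worse(0.25, 0.20, margin=0.03)` returns `True`, while the same pair with `margin=0.10` returns `False`, so the verdict depends on the margin chosen before the re-run.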
Original abstract
Recent advances in automatic speech recognition (ASR) have explored different sequence models, including Conformer-based models and newer state space models such as Mamba. Although prior work has evaluated these architectures in multiple languages, their effectiveness in African languages remains underexplored. In this work, we evaluate Mamba for ASR on seven South African languages. In monolingual experiments, each model is trained on 50 hours of speech per language, and we compare Mamba to a Conformer baseline of similar parameter scale. Mamba achieves similar recognition accuracy to Conformer while using fewer computational resources and training faster. We further evaluate generalization in this setting and find that both models struggle to generalize to speech that is much longer than what they were trained on. We then study multilingual ASR using Mamba models, where the baseline is pooling all languages together. On top of this, we tested three extensions: training with language-family information by adding both language and language-family embeddings as biases to the downsampled acoustic representations, and multitask learning with a CTC ASR objective and a language identification (LID) head. We find that multilingual training consistently improves performance over monolingual training. However, adding explicit language information does not improve in-domain performance but does improve cross-corpus robustness. We conducted ablation studies in low-resource multilingual settings using 5-hour and 10-hour per-language training data, where we observed gains from using language embeddings and further demonstrated that removing or altering them hurt model performance. Lastly, we analysed these embeddings and find that they do not capture linguistic similarity in a typological sense, but instead act as task-specific control vectors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates Mamba state-space models versus Conformer baselines for ASR on seven South African languages. In monolingual experiments on 50-hour per-language splits, Mamba is reported to match Conformer accuracy while using fewer resources and training faster; both architectures struggle to generalize to much longer utterances. Multilingual Mamba training (pooling languages) improves over monolingual baselines; adding language and language-family embeddings or an LID head does not help in-domain performance but improves cross-corpus robustness. In 5-hour and 10-hour low-resource ablations, language embeddings yield gains, and analysis indicates the embeddings function as task-specific control vectors rather than encoding typological similarity.
Significance. If the architecture comparison is shown to rest on matched training conditions, the work supplies empirical evidence on an underexplored domain (ASR for South African languages) and practical guidance on efficient multilingual modeling with state-space models. The low-resource ablations and embedding analysis add concrete, falsifiable observations about when explicit language information helps or harms.
major comments (3)
- [Abstract, §4] Abstract and §4 (monolingual experiments): The central claim that 'Mamba achieves similar recognition accuracy to Conformer' rests on a baseline described only as 'of similar parameter scale' trained on the same 50-hour splits. The manuscript supplies no confirmation that optimizer, learning-rate schedule, batch size, data augmentation, or early-stopping criteria were identical; without this, the accuracy parity cannot be attributed to architecture rather than unequal optimization effort. This is load-bearing for all subsequent multilingual and ablation claims.
- [Abstract, experimental sections] Abstract and experimental sections: No error bars, standard deviations across seeds, or statistical significance tests are reported for any WER or accuracy figures. The absence of these quantities makes it impossible to determine whether reported similarities, improvements, or ablation effects exceed experimental noise.
- [Abstract, §3, §4] Abstract and §3/§4: Exact dataset statistics (total hours after preprocessing, speaker counts, train/dev/test splits per language) and a complete hyper-parameter table are missing. These omissions prevent independent verification of the 50-hour, 10-hour, and 5-hour regimes that underpin the monolingual-to-multilingual and low-resource claims.
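The second major comment asks for spread across seeds and for significance tests. One way to produce both is sketched below: a mean and sample standard deviation over per-seed WERs, and a paired bootstrap confidence interval on the WER difference between two systems scored on the same utterances. The paired bootstrap is a common choice for ASR comparisons, but nothing in the source says the authors will use it. All names, defaults, and inputs are hypothetical.

```python
import random
import statistics
from typing import Sequence, Tuple


def summarize_seeds(wers: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of WER over seeds (needs >= 2)."""
    return statistics.mean(wers), statistics.stdev(wers)


def paired_bootstrap_ci(
    errors_a: Sequence[int],
    errors_b: Sequence[int],
    ref_lengths: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Approximate (1 - alpha) interval for WER(A) - WER(B).

    Inputs are per-utterance word-error counts for systems A and B and the
    per-utterance reference lengths, aligned by utterance.
    """
    n = len(ref_lengths)
    if n == 0 or len(errors_a) != n or len(errors_b) != n:
        raise ValueError("inputs must be non-empty and aligned")
    if any(length <= 0 for length in ref_lengths):
        raise ValueError("reference lengths must be positive")

    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping A and B paired.
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_lengths[i] for i in idx)
        err_a = sum(errors_a[i] for i in idx)
        err_b = sum(errors_b[i] for i in idx)
        diffs.append((err_a - err_b) / words)

    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return lower, upper
```

If the interval excludes zero, the difference between the two systems is unlikely to be resampling noise on that test set. An interval that straddles zero would support a claim of "similar accuracy" only to the extent that the interval is also narrow.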
minor comments (1)
- [Abstract] The abstract states that embeddings 'do not capture linguistic similarity in a typological sense' but the precise typological features or distance metric used for this conclusion are not named in the provided summary; a short clarification would aid readers.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and indicate planned revisions to improve the manuscript.
- Referee: [Abstract, §4] Abstract and §4 (monolingual experiments): The central claim that 'Mamba achieves similar recognition accuracy to Conformer' rests on a baseline described only as 'of similar parameter scale' trained on the same 50-hour splits. The manuscript supplies no confirmation that optimizer, learning-rate schedule, batch size, data augmentation, or early-stopping criteria were identical; without this, the accuracy parity cannot be attributed to architecture rather than unequal optimization effort. This is load-bearing for all subsequent multilingual and ablation claims.
Authors: We agree that matched training conditions are essential to attribute performance to architecture. The manuscript does not explicitly confirm identical optimizer, learning-rate schedule, batch size, data augmentation, or early-stopping criteria. In the revision we will add a comprehensive description and table of all training procedures for both models, noting any architecture-specific adjustments while documenting efforts to ensure comparability.
Revision: yes
- Referee: [Abstract, experimental sections] Abstract and experimental sections: No error bars, standard deviations across seeds, or statistical significance tests are reported for any WER or accuracy figures. The absence of these quantities makes it impossible to determine whether reported similarities, improvements, or ablation effects exceed experimental noise.
Authors: We acknowledge that the lack of error bars, standard deviations, and statistical tests weakens the interpretability of the results. In the revised manuscript we will rerun the primary experiments across multiple random seeds, report means with standard deviations, and include statistical significance tests for key comparisons.
Revision: yes
- Referee: [Abstract, §3, §4] Abstract and §3/§4: Exact dataset statistics (total hours after preprocessing, speaker counts, train/dev/test splits per language) and a complete hyper-parameter table are missing. These omissions prevent independent verification of the 50-hour, 10-hour, and 5-hour regimes that underpin the monolingual-to-multilingual and low-resource claims.
Authors: We agree that precise dataset statistics and a full hyperparameter table are required for reproducibility. The revised manuscript will include tables reporting total hours after preprocessing, speaker counts, exact train/dev/test splits per language, and all hyperparameters used across the monolingual, multilingual, and low-resource experiments.
Revision: yes
Circularity Check
No circularity: purely empirical model comparisons on fixed data splits
The paper reports direct experimental results from training Mamba and Conformer ASR models on 50-hour (and smaller) per-language splits, measuring WER and comparing resource usage. No equations, derivations, or fitted parameters are presented whose outputs are then renamed as predictions. No self-citation chains or uniqueness theorems are invoked to justify architectural choices. The central claims rest on observable training outcomes rather than any reduction to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The 50-hour (and 5/10-hour) per-language data splits are sufficient and representative for training and evaluating ASR models.
- domain assumption: Conformer and Mamba models can be compared fairly when matched on parameter scale.