WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

Binbin Zhang; Hang Lv; Pengcheng Guo; Qijie Shao; Chao Yang; Lei Xie; Xin Xu; Hui Bu; Xiaoyu Chen; Chenchen Zeng; Di Wu; Zhendong Peng

arxiv: 2110.03370 · v5 · pith:WECHKIVZnew · submitted 2021-10-07 · 💻 cs.SD · cs.CL


keywords: speech, test, hours, recognition, wenetspeech, candidates, corpus, data


In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which cover a variety of speaking styles, scenarios, domains, topics, and noise conditions. An optical character recognition (OCR) based method is introduced to generate audio/text segmentation candidates for the YouTube data from the corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. We then propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labelled high-quality test sets along with WenetSpeech for evaluation: Dev, for cross-validation during training; Test_Net, collected from the Internet for a matched test; and Test_Meeting, recorded from real meetings for a more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are provided as benchmarks. To the best of our knowledge, WenetSpeech is currently the largest open-sourced Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.
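As a rough illustration of the corpus composition and the validate-and-filter step described in the abstract, the sketch below sums the stated subset sizes and sorts candidate audio/text pairs into buckets by a score. The `Candidate` class, the `confidence` field, the `partition` function, and both threshold values are hypothetical and chosen only for illustration; the abstract does not specify how the label error detector scores or filters candidates.

```python
from dataclasses import dataclass

# Approximate subset sizes in hours, as stated in the abstract
# (the "+" and "about" qualifiers are dropped for the arithmetic).
SUBSET_HOURS = {
    "high_quality_labeled": 10000,
    "weakly_labeled": 2400,
    "unlabeled": 10000,
}


@dataclass
class Candidate:
    """One audio/text pair candidate (hypothetical structure)."""
    audio_id: str
    text: str
    confidence: float  # assumed score from some label error detector


def partition(candidates, strong=0.95, weak=0.6):
    """Split candidates into strong, weak, and rejected buckets.

    The thresholds are illustrative defaults, not values from the abstract.
    """
    buckets = {"strong": [], "weak": [], "rejected": []}
    for cand in candidates:
        if cand.confidence >= strong:
            buckets["strong"].append(cand)
        elif cand.confidence >= weak:
            buckets["weak"].append(cand)
        else:
            buckets["rejected"].append(cand)
    return buckets


total_hours = sum(SUBSET_HOURS.values())  # 22400, matching the stated total

demo = [
    Candidate("utt1", "你好", 0.99),
    Candidate("utt2", "今天天气", 0.70),
    Candidate("utt3", "噪声", 0.10),
]
demo_buckets = partition(demo)
```

Under these assumptions, candidates above the upper threshold would feed a high-quality subset, those between the two thresholds a weakly labeled subset, and the rest would be discarded.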


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Step-Audio 2 Technical Report
cs.CL · 2025-07 · unverdicted · novelty 6.0
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...

Non-Intrusive Automatic Speech Recognition Refinement: A Survey
eess.AS · 2025-08 · accept · novelty 4.0
A survey that classifies non-intrusive ASR refinement methods into five categories, reviews domain adaptation and evaluation datasets, proposes standardized metrics, and identifies future research directions.

Data-Efficient On-Policy Distillation for Automatic Speech Recognition
cs.AI · 2026-05 · unverdicted · novelty 3.0
On-policy distillation from a Qwen-ASR teacher improves a 0.6B Ark-ASR model over supervised fine-tuning and a same-scale baseline on four of five ASR benchmarks using 100k hours of speech.