Step-Audio 2 Technical Report

Bingxin Li; Bin Wang; Binxing Jiao; Bo Li; Boyong Wu; Brian Li; Buyun Ma; Changhe Song; Changxin Miao; Changyi Wan

arXiv: 2507.16632 · v3 · pith:TGUESTQNnew · submitted 2025-07-22 · 💻 cs.CL · cs.SD · eess.AS

Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu

Haoyang Zhang Jingbei Li Mingrui Chen Peng Liu Wang You Xiangyu Tony Zhang Xingyuan Li Xuerui Yang Yayue Deng Yechang Huang Yuxin Li Yuxin Zhang Zhao You Brian Li Changyi Wan Hanpeng Hu Jiangjie Zhen Siyu Chen Song Yuan Xuelin Zhang Yimin Jiang Yu Zhou Yuxiang Yang Bingxin Li Buyun Ma Changhe Song Dongqing Pang Guoqiang Hu Haiyang Sun Kang An Na Wang Shuli Gao Wei Ji Wen Li Wen Sun Xuan Wen Yong Ren Yuankai Ma Yufan Lu Bin Wang Bo Li Changxin Miao Che Liu Chen Xu Dapeng Shi Dingyuan Hu Donghang Wu Enle Liu Guanzhe Huang Gulin Yan Han Zhang Hao Nie Haonan Jia Hongyu Zhou Jianjian Sun Jiaoren Wu Jie Wu Jie Yang Jin Yang Junzhe Lin Kaixiang Li Lei Yang Liying Shi Li Zhou Longlong Gu Ming Li Mingliang Li Mingxiao Li Nan Wu Qi Han Qinyuan Tan Shaoliang Pang Shengjie Fan Siqi Liu Tiancheng Cao Wanying Lu Wenqing He Wuxun Xie Xu Zhao Xueqi Li Yanbo Yu Yang Yang Yi Liu Yifan Lu Yilei Wang Yuanhao Ding Yuanwei Liang Yuanwei Lu Yuchu Luo Yuhe Yin Yumeng Zhan Yuxiang Zhang Zidong Yang Zixin Zhang Binxing Jiao Daxin Jiang Heung-Yeung Shum Jiansheng Chen Jing Li Xiangyu Zhang Yibo Zhu

Pith reviewed 2026-05-16 05:55 UTC · model grok-4.3

keywords: multi-modal LLM · audio understanding · speech conversation · discrete audio tokens · reinforcement learning · retrieval-augmented generation · end-to-end model

The pith

Step-Audio 2 integrates latent audio encoding and discrete token generation to deliver state-of-the-art audio understanding and expressive end-to-end speech conversation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Step-Audio 2 as an end-to-end multi-modal large language model built specifically for industry-level audio tasks and natural speech interaction. It combines a latent audio encoder with reasoning-centric reinforcement learning to strengthen automatic speech recognition and general audio comprehension. Discrete audio tokens are generated directly inside the language model to capture speaking styles and emotions in responses. Retrieval-augmented generation plus tool calls for web search and audio search reduce hallucinations while allowing timbre changes. Trained on millions of hours of speech data, the model reports superior results on standard benchmarks against both open-source and commercial alternatives.
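The tool-calling step described above can be pictured with a minimal sketch. This is not the paper's implementation: the tool names `web_search` and `audio_search`, the `run_turn` helper, and the `<tool:...>` context format are all hypothetical stand-ins, and the tools return placeholder strings rather than real search results.

```python
from typing import Callable, Dict, List

Tool = Callable[[str], str]


def web_search(query: str) -> str:
    # Hypothetical stand-in: a real system would query a search backend.
    return f"[web results for: {query}]"


def audio_search(query: str) -> str:
    # Hypothetical stand-in: a real system would retrieve a voice sample
    # whose timbre the model could then imitate.
    return f"[voice sample matching: {query}]"


# Registry mapping tool names (assumed) to callables.
TOOLS: Dict[str, Tool] = {
    "web_search": web_search,
    "audio_search": audio_search,
}


def run_turn(context: List[str], tool_name: str, query: str) -> List[str]:
    """Run one requested tool and append its result to the context.

    The returned list is a new context that a model could condition on
    when producing its final answer; the input list is left unchanged.
    """
    tool = TOOLS.get(tool_name)
    if tool is None:
        raise KeyError(f"unknown tool: {tool_name}")
    result = tool(query)
    return context + [f"<tool:{tool_name}> {result}"]
```

The design point is only that retrieved evidence enters the context before the answer is generated, which is the mechanism the paper credits for reducing hallucination.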

Core claim

Step-Audio 2 demonstrates that an integrated architecture using latent audio encoding, reasoning-centric reinforcement learning, discrete audio token generation within language modeling, and retrieval-augmented generation produces stronger automatic speech recognition, audio understanding, and responsive conversational output than prior separate-component systems.

What carries the argument

Discrete audio token generation embedded in the language modeling process, which enables direct responsiveness to paralinguistic cues such as emotion and style.
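One common way to embed discrete audio tokens in language modeling is to give text and audio a single shared id space, with audio codes offset past the text vocabulary. The sketch below illustrates that general idea only; the class name `UnifiedVocab` and the sizes used are assumptions, and the paper's actual tokenizer layout may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class UnifiedVocab:
    """Shared id space: text ids first, then audio codebook ids (assumed layout)."""

    text_size: int   # number of text tokens (hypothetical value)
    audio_size: int  # size of the audio codebook (hypothetical value)

    @property
    def total(self) -> int:
        return self.text_size + self.audio_size

    def encode_audio(self, code: int) -> int:
        """Map a raw audio code to its id in the shared vocabulary."""
        if not 0 <= code < self.audio_size:
            raise ValueError(f"audio code out of range: {code}")
        return self.text_size + code

    def is_audio(self, token_id: int) -> bool:
        return self.text_size <= token_id < self.total

    def split(self, ids: List[int]) -> Tuple[List[int], List[int]]:
        """Separate an interleaved sequence into text ids and raw audio codes."""
        text = [i for i in ids if 0 <= i < self.text_size]
        audio = [i - self.text_size for i in ids if self.is_audio(i)]
        return text, audio
```

With one output layer over `total` entries, a single next-token predictor can emit either modality, and a downstream decoder only needs the audio codes returned by `split`.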

If this is right

Direct modeling of paralinguistic information reduces the need for separate emotion or style modules in conversational agents
Tool calling for web search and audio retrieval measurably lowers hallucination rates in spoken responses
End-to-end discrete token output supports lower-latency turn-taking in multi-turn dialogue
Scaling to millions of hours of training data yields consistent gains across diverse conversational domains

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same token-generation approach could be applied to video or sensor streams to create unified multi-modal conversational systems
External tool integration may allow future models to maintain up-to-date knowledge without full retraining
If the RL component generalizes well, similar reasoning-centric training could improve robustness in low-resource languages or noisy environments

Load-bearing premise

The combination of latent encoding, RL reasoning, discrete tokens, and RAG produces robust performance on real-world conversational audio beyond the evaluated benchmarks.

What would settle it

A new test set of long-form conversational audio with varied emotions and accents where Step-Audio 2 shows no accuracy or naturalness advantage over strong baseline models.

Original abstract

This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Step-Audio 2 combines latent audio encoding, reasoning RL, and discrete token generation for end-to-end speech with tool use, but the SOTA claims rest on unreported benchmark details.

Step-Audio 2 is a technical report on an end-to-end multimodal model that handles both audio understanding and speech generation in one system. The core setup uses a latent audio encoder, trains with reasoning-centric reinforcement learning, and adds discrete audio token generation so the model can respond to speaking style and emotion. It layers on retrieval-augmented generation plus tool calls for web search and audio search to cut down on hallucinations and switch timbres. The training runs on millions of hours of speech and audio data, which is the scale that usually matters for conversational robustness.

Referee Report

2 major / 2 minor

Summary. The paper introduces Step-Audio 2, an end-to-end multi-modal LLM for audio understanding and conversational speech. It combines a latent audio encoder, reasoning-centric reinforcement learning, discrete audio token generation to capture paralinguistic cues, and RAG with external tool calling (web search, audio search) to reduce hallucinations. Trained on millions of hours of speech and audio data, the model claims state-of-the-art results on various audio understanding and conversational benchmarks relative to open-source and commercial baselines.

Significance. If the performance claims are substantiated with complete benchmark tables, baselines, and ablations, the work would advance practical end-to-end audio LLMs by showing how RL-driven reasoning and RAG can be integrated with discrete token modeling for expressive, low-hallucination conversation. The industry-oriented framing and emphasis on real-world tool use are strengths.

major comments (2)

[Evaluation] The SOTA claim is presented without named benchmarks (e.g., no LibriSpeech, CommonVoice, or conversational test sets), exact baseline versions, metric definitions, data splits, error bars, or ablation results on the RL or RAG components. Because this claim is load-bearing for the paper's contribution, the omission leaves the central empirical result unverifiable.
[§3 (Architecture) and §4 (Training)] The claim that discrete token generation 'significantly enhances responsiveness to paralinguistic information' is stated without quantitative comparison to a continuous-token or non-RL baseline, leaving the contribution of this design choice unsupported by evidence.
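To make the request for metric definitions and error bars concrete, here is a minimal sketch of corpus-level word error rate (WER) with a percentile bootstrap interval. It is not the paper's evaluation code: the function names are hypothetical, tokenization is plain whitespace splitting, and no text normalization is applied.

```python
import random
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


def bootstrap_ci(pairs: List[Tuple[str, str]], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for corpus WER, resampling utterances."""
    rng = random.Random(seed)
    stats = sorted(
        corpus_wer([rng.choice(pairs) for _ in pairs])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

Reporting the interval alongside the point estimate shows whether a gap between two systems on the same test set exceeds resampling noise.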

minor comments (2)

[Abstract] Phrases such as 'promising performance' and 'state-of-the-art' are used without accompanying metrics or qualifiers.
[Conclusion] The GitHub link is provided but no details on released code, checkpoints, or evaluation scripts are given in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and verifiability of the results.

Referee: [Evaluation] The SOTA claim is presented without named benchmarks (e.g., no LibriSpeech, CommonVoice, or conversational test sets), exact baseline versions, metric definitions, data splits, error bars, or ablation results on the RL or RAG components. Because this claim is load-bearing for the paper's contribution, the omission leaves the central empirical result unverifiable.

Authors: We acknowledge that the current evaluation section provides only high-level SOTA claims without the requested specifics. In the revised manuscript we will expand this section to explicitly name the benchmarks (including LibriSpeech, CommonVoice, and conversational test sets), list exact baseline versions, define all metrics, specify data splits, report error bars where available, and include ablation results isolating the RL and RAG components.

revision: yes

Referee: [§3 (Architecture) and §4 (Training)] The claim that discrete token generation 'significantly enhances responsiveness to paralinguistic information' is stated without quantitative comparison to a continuous-token or non-RL baseline, leaving the contribution of this design choice unsupported by evidence.

Authors: We agree that the claim would be stronger with direct quantitative support. The revised version will add comparisons to continuous-token and non-RL baselines, reporting relevant metrics that demonstrate the contribution of discrete token generation to paralinguistic responsiveness.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper is an empirical technical report on model architecture, training data, and benchmark results with no mathematical derivations, equations, or self-referential definitions present. Performance claims reference external benchmarks and datasets rather than quantities defined or fitted inside the paper. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The central claims reduce to standard empirical evaluation and are therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract describes an empirically trained neural network without listing explicit axioms, free parameters, or newly invented theoretical entities; performance claims rest on large-scale data training and benchmark results.

pith-pipeline@v0.9.0 · 5915 in / 1048 out tokens · 29325 ms · 2026-05-16T05:55:34.925173+00:00 · methodology

Forward citations

Cited by 51 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
cs.SD 2026-04 unverdicted novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...
Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
cs.CL 2026-07 unverdicted novelty 7.0

SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.
RedVox: Safety and Fairness Gaps in Speech Models Across Languages
cs.CL 2026-06 unverdicted novelty 7.0

RedVox benchmark shows speech model safety and fairness vulnerabilities persist under non-adversarial conditions, worsen in non-English languages, and increase with spoken inputs.
AOR-Bench: Do Large Audio Language Models Over-Refuse Pseudo-Harmful Queries?
cs.SD 2026-06 unverdicted novelty 7.0

Introduces the first benchmark for over-refusal in large audio language models using 3,000 pseudo-harmful audio samples and evaluates 12 models across six families, finding widespread over-refusal.
Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models
cs.SD 2026-06 unverdicted novelty 7.0

Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.
Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering
cs.CL 2026-06 unverdicted novelty 7.0

FD-SLMs exhibit state inertia during abrupt interruptions that a training-free perception-vector steering intervention mitigates, lifting correctness from 28% to 45% and IWOR from 40% to 72% on the Zero-Buffer Benchmark.
SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing
eess.AS 2026-06 unverdicted novelty 7.0

SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show ...
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
cs.CL 2026-05 unverdicted novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and de...
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
eess.AS 2026-05 unverdicted novelty 7.0

DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-...
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 7.0

Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
cs.CL 2026-05 unverdicted novelty 7.0

Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
Liberating LLM Capabilities in Full-Duplex Speech Models
cs.CL 2026-05 unverdicted novelty 7.0

LWS is a text-first paradigm for full-duplex speech LLMs that treats visible writing as a primary output channel alongside audio input and spoken response, implemented via token schema and synthetic per-second annotations.
SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
cs.CL 2026-04 unverdicted novelty 7.0

SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
eess.AS 2026-04 unverdicted novelty 7.0

HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
cs.SD 2026-04 unverdicted novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
eess.AS 2026-04 unverdicted novelty 7.0

Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
TiCo: Time-Controllable Spoken Dialogue Model
cs.CL 2026-03 unverdicted novelty 7.0

TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus
cs.CL 2026-01 unverdicted novelty 7.0

MCGA is a new 119-hour multi-task audio corpus for classical Chinese literary genres that shows current MLLMs face substantial challenges on its test set.
Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
cs.CL 2025-12 accept novelty 7.0

Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.
Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation
eess.AS 2026-06 unverdicted novelty 6.0

PRIME-Speech adds low-latency speech output to frozen S2T LLMs by synchronizing a causal post-decoder with intermediate hidden states and using mixed conditioning plus turn-level KV-cache packing, preserving original ...
MSU-Bench: Towards Speaker-Centric Understanding in Conversational Multi-Speaker Scenarios
eess.AS 2026-06 unverdicted novelty 6.0

MSU-Bench is a new two-tier benchmark covering speaker grounding to dialogue reasoning in multi-speaker conversations, with Gemini-assisted annotation and human verification.
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
cs.CL 2026-06 unverdicted novelty 6.0

A multi-axis RL alignment technique improves pause handling, turn-taking, backchanneling, and interruption response in full-duplex spoken dialogue models by optimizing axis-specific rewards derived from human audio segments.
RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark
cs.SD 2026-06 unverdicted novelty 6.0

Introduces RAIL, a CHC-grounded benchmark with five core auditory capabilities to assess LALMs beyond task-centric metrics, showing uneven model performance.
Audio Interaction Model
cs.SD 2026-06 unverdicted novelty 6.0

Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.
LaSR: Context-Aware Speech Recognition via Latent Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

LaSR improves context-aware terminology recognition in speech LLMs by aligning latent CoT supervision on acoustic regions and introducing latent reasoning periods, shown on a new academic corpus to outperform standard...
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
eess.AS 2026-05 unverdicted novelty 6.0

A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 6.0

Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
EchoDistill:Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs
cs.CL 2026-05 unverdicted novelty 6.0

EchoDistill applies noisy-to-clean self-distillation with GRPO to boost Audio LLM robustness, reporting 4.18% average GSR gains under strong noise.
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
cs.SD 2026-05 unverdicted novelty 6.0

VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
eess.AS 2026-04 unverdicted novelty 6.0

VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
cs.SD 2026-04 unverdicted novelty 6.0

Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
eess.AS 2026-04 unverdicted novelty 6.0

A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
eess.AS 2026-04 unverdicted novelty 6.0

A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models
cs.CL 2025-10 unverdicted novelty 6.0

MPS proposes a dual-brain architecture separating formulation reasoning from articulation to achieve real-time CoT in SLMs with accuracy comparable to full pre-computation but much lower latency.
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
cs.CL 2025-09 unverdicted novelty 6.0

StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models
cs.SD 2026-06 unverdicted novelty 5.0

ALM2Vec learns unified audio embeddings from large audio-language models for text-audio retrieval, instruction-aware retrieval, and other tasks across domains.
Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization
cs.CL 2026-05 unverdicted novelty 5.0

MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve bench...
StepAudio 2.5 Technical Report
eess.AS 2026-05 unverdicted novelty 5.0

StepAudio 2.5 is a unified audio-language foundation model that reaches state-of-the-art results on ASR, TTS, and realtime interaction by using task-tailored RLHF on a shared backbone.
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
eess.AS 2026-05 unverdicted novelty 5.0

DuplexSLA is a dual-stream three-channel full-duplex model that synchronizes continuous user audio, discrete assistant audio, and rate-limited action text for native turn-taking and in-conversation tool calling.
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
cs.SD 2026-05 unverdicted novelty 5.0

A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
cs.SD 2026-04 unverdicted novelty 5.0

A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
eess.AS 2026-04 unverdicted novelty 5.0

Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving
cs.CL 2026-07 unverdicted novelty 4.0

JSTIP interleaves speech and text sequences during pretraining on 38k hours of ASR data to improve entity accuracy over ASR-only and simple joint-training baselines while matching performance from domain text.
Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models
eess.AS 2026-06 unverdicted novelty 4.0

CogAudio-LLM introduces LIME-440K dataset, EIPS chain-of-thought reasoning, and DR-SAPO optimization to address semantic dominance and improve affective responses in audio language models.
Audio-Mind: An Auditable Agentic Framework for Audio Understanding
eess.AS 2026-05 unverdicted novelty 4.0

Audio-Mind introduces a conditional, auditable agentic framework for audio understanding that preserves frontend judgment and acquires bounded external evidence only when needed, reporting 80.4% on MMAR and 82.8% on M...
Step-Audio-R1.5 Technical Report
eess.AS 2026-04 unverdicted novelty 4.0

Step-Audio-R1.5 applies RLHF to audio reasoning models to escape the verifiable reward trap of RLVR, preserving analytical ability while restoring prosodic naturalness and immersion in long dialogues.
Step-Audio-R1.5 Technical Report
eess.AS 2026-04 unverdicted novelty 4.0

Step-Audio-R1.5 applies RLHF to audio reasoning models to maintain analytical performance while improving prosodic naturalness and immersion in extended spoken interactions.
NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
eess.AS 2026-04 unverdicted novelty 4.0

NIM4-ASR delivers SOTA ASR performance on public benchmarks using a 2.3B-parameter LLM with multi-stage training, real-time streaming, and million-scale hotword customization via RAG.
OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization
cs.CV 2026-02 unverdicted novelty 4.0

OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
A Survey of Audio Reasoning in Multimodal Foundation Models
eess.AS 2026-05 unverdicted novelty 2.0

A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.
A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
eess.AS 2026-05 unverdicted novelty 2.0

A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 48 Pith papers · 25 internal anchors

[1]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Philip Anastassiou et al. “Seed-tts: A family of high-quality versatile speech generation models”. In: arXiv preprint arXiv:2406.02430 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

PaLM 2 Technical Report

Rohan Anil et al. PaLM 2 Technical Report. 2023. arXiv: 2305.10403 [cs.CL]. URL: https://arxiv.org/ abs/2305.10403

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

Alexei Baevski et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. 2020. arXiv: 2006.11477 [cs.CL]. URL: https://arxiv.org/abs/2006.11477

work page arXiv 2020
[4]

Qwen Technical Report

Jinze Bai et al. Qwen Technical Report. 2023. arXiv: 2309.16609 [cs.CL]. URL: https://arxiv.org/abs/ 2309.16609

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Seed-ASR: Understanding diverse speech and contexts with LLM-based speech recognition,

Ye Bai et al. “Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition”. In:arXiv preprint arXiv:2407.04675 (2024)

work page arXiv 2024
[6]

Better speech synthesis through scaling, 2023

James Betker. Better speech synthesis through scaling . 2023. arXiv: 2305 . 07243 [cs.SD]. URL: https : //arxiv.org/abs/2305.07243

work page arXiv 2023
[7]

Audiolm: a language modeling approach to audio generation

Zalán Borsos et al. “Audiolm: a language modeling approach to audio generation”. In: IEEE/ACM transactions on audio, speech, and language processing 31 (2023), pp. 2523–2533

work page 2023
[8]

GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio

Guoguo Chen et al. “GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio”. In: Interspeech 2021. ISCA, Aug. 2021. DOI: 10.21437/interspeech.2021- 1965 . URL: http: //dx.doi.org/10.21437/Interspeech.2021-1965

work page doi:10.21437/interspeech.2021- 2021
[9]

Minmo: A multimodal large language model for seamless voice interaction.arXiv preprint arXiv:2501.06282, 2025a

Qian Chen et al. “Minmo: A multimodal large language model for seamless voice interaction”. In:arXiv preprint arXiv:2501.06282 (2025)

work page arXiv 2025
[10]

Sanyuan Chen et al.BEATs: Audio Pre-Training with Acoustic Tokenizers. 2022. arXiv:2212.09058 [eess.AS]. URL: https://arxiv.org/abs/2212.09058

work page arXiv 2022
[11]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

Sanyuan Chen et al. “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing”. In: IEEE Journal of Selected Topics in Signal Processing16.6 (Oct. 2022), pp. 1505–1518. ISSN : 1941-0484. DOI: 10.1109/jstsp.2022.3188113. URL: http://dx.doi.org/10.1109/JSTSP.2022.3188113

work page doi:10.1109/jstsp.2022.3188113 2022
[12]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu et al. “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models”. In: arXiv preprint arXiv:2311.07919 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Qwen2-Audio Technical Report

Yunfei Chu et al. “Qwen2-audio technical report”. In: arXiv preprint arXiv:2407.10759 (2024)

[14]

Simple and Controllable Music Generation

Jade Copet et al. Simple and Controllable Music Generation. 2024. arXiv: 2306.05284 [cs.SD]. URL: https://arxiv.org/abs/2306.05284

[15]

High Fidelity Neural Audio Compression

Alexandre Défossez et al. “High fidelity neural audio compression”. In: arXiv preprint arXiv:2210.13438 (2022)

[16]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez et al. “Moshi: a speech-text foundation model for real-time dialogue”. In: arXiv preprint arXiv:2410.00037 (2024)

[17]

Pengi: An audio language model for audio tasks

Soham Deshmukh et al. “Pengi: An audio language model for audio tasks”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 18090–18108

[18]

Kimi-Audio Technical Report

Ding Ding et al. “Kimi-audio technical report”. In: arXiv preprint arXiv:2504.18425 (2025)

[19]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du et al. “Cosyvoice 2: Scalable streaming speech synthesis with large language models”. In: arXiv preprint arXiv:2412.10117 (2024)

[20]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du et al. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens. 2024. arXiv: 2407.05407 [cs.SD]. URL: https://arxiv.org/abs/2407.05407

[21]

Llama-omni: Seamless speech interaction with large language models

Qingkai Fang et al. “Llama-omni: Seamless speech interaction with large language models”. In: arXiv preprint arXiv:2409.06666 (2024)

[22]

LUCY: Linguistic Understanding and Control Yielding Early Stage of Her

Heting Gao et al. LUCY: Linguistic Understanding and Control Yielding Early Stage of Her. 2025. arXiv: 2501.16327 [cs.CL]. URL: https://arxiv.org/abs/2501.16327

[23]

Audio Set: An ontology and human-labeled dataset for audio events

Jort F. Gemmeke et al. “Audio Set: An ontology and human-labeled dataset for audio events”. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2017, pp. 776–780. DOI: 10.1109/ICASSP.2017.7952261.

[24]

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

Sreyan Ghosh et al. Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities. 2025. arXiv: 2503.03983 [cs.SD]. URL: https://arxiv.org/abs/2503.03983

[25]

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Arushi Goel et al. Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models. 2025. arXiv: 2507.08128 [cs.SD]. URL: https://arxiv.org/abs/2507.08128

[26]

Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition

Yuan Gong, Jin Yu, and James Glass. “Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition”. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2022, pp. 151–155. DOI: 10.1109/ICASSP43922.2022.9746828

[27]

Joint audio and speech understanding

Yuan Gong et al. “Joint audio and speech understanding”. In: 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE. 2023, pp. 1–8

[28]

Listen, think, and understand

Yuan Gong et al. “Listen, think, and understand”. In: arXiv preprint arXiv:2305.10790 (2023)

[29]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The Llama 3 Herd of Models. 2024. arXiv: 2407.21783 [cs.AI]. URL: https://arxiv.org/abs/2407.21783

[30]

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Wei-Ning Hsu et al. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. 2021. arXiv: 2106.07447 [cs.CL]. URL: https://arxiv.org/abs/2106.07447

[31]

Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

Ailin Huang et al. “Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model”. In: arXiv preprint arXiv:2506.08967 (2025)

[32]

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Ailin Huang et al. “Step-audio: Unified understanding and generation in intelligent speech interaction”. In: arXiv preprint arXiv:2502.11946 (2025)

[33]

Audiogpt: Understanding and generating speech, music, sound, and talking head

Rongjie Huang et al. “Audiogpt: Understanding and generating speech, music, sound, and talking head”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. 21. 2024, pp. 23802–23804

[34]

GPT-4o System Card

Aaron Hurst et al. “Gpt-4o system card”. In: arXiv preprint arXiv:2410.21276 (2024)

[35]

CochlScene: Acquisition of acoustic scene data using crowdsourcing

Il-Young Jeong and Jeongsoo Park. “CochlScene: Acquisition of acoustic scene data using crowdsourcing”. In: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). 2022, pp. 17–21. DOI: 10.23919/APSIPAASC55919.2022.9979822

[36]

Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling

Shengpeng Ji et al. “Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling”. In: arXiv preprint arXiv:2408.16532 (2024)

[37]

CVSS Corpus and Massively Multilingual Speech-to-Speech Translation

Ye Jia et al. CVSS Corpus and Massively Multilingual Speech-to-Speech Translation. 2022. arXiv: 2201.03713 [cs.CL]. URL: https://arxiv.org/abs/2201.03713

[38]

Direct speech-to-speech translation with a sequence-to-sequence model

Ye Jia et al. “Direct speech-to-speech translation with a sequence-to-sequence model”. In: arXiv preprint arXiv:1904.06037 (2019)

[39]

Translatotron 2: High-quality direct speech-to-speech translation with voice preservation

Ye Jia et al. “Translatotron 2: High-quality direct speech-to-speech translation with voice preservation”. In: International conference on machine learning. PMLR. 2022, pp. 10120–10134

[40]

Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

Eugene Kharitonov et al. Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision. 2023. arXiv: 2302.03540 [cs.SD]. URL: https://arxiv.org/abs/2302.03540

[41]

Audiocaps: Generating captions for audios in the wild

Chris Dongjoo Kim et al. “Audiocaps: Generating captions for audios in the wild”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, pp. 119–132

[42]

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. 2020. arXiv: 2010.05646 [cs.SD]. URL: https://arxiv.org/abs/2010.05646

[43]

Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities

Zhifeng Kong et al. Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities. 2024. arXiv: 2402.01831 [cs.SD]. URL: https://arxiv.org/abs/2402.01831

[44]

Transvip: Speech to speech translation system with voice and isochrony preservation

Chenyang Le et al. “Transvip: Speech to speech translation system with voice and isochrony preservation”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 89682–89705

[45]

Textless speech-to-speech translation on real data

Ann Lee et al. “Textless speech-to-speech translation on real data”. In: arXiv preprint arXiv:2112.08352 (2021)

[46]

BigVGAN: A Universal Neural Vocoder with Large-Scale Training

Sang-gil Lee et al. BigVGAN: A Universal Neural Vocoder with Large-Scale Training. 2023. arXiv: 2206.04658 [cs.SD]. URL: https://arxiv.org/abs/2206.04658

[47]

Advancing large language models to capture varied speaking styles and respond properly in spoken conversations

Guan-Ting Lin, Cheng-Han Chiang, and Hung-yi Lee. “Advancing large language models to capture varied speaking styles and respond properly in spoken conversations”. In: arXiv preprint arXiv:2402.12786 (2024)

[48]

Paralinguistics-enhanced large language modeling of spoken dialogue

Guan-Ting Lin et al. “Paralinguistics-enhanced large language modeling of spoken dialogue”. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2024, pp. 10316–10320

[49]

Spirit LM: Interleaved Spoken and Written Language Model

Tu Anh Nguyen et al. Spirit LM: Interleaved Spoken and Written Language Model. 2024. arXiv: 2402.05755 [cs.CL]. URL: https://arxiv.org/abs/2402.05755

[50]

GPT-4 Technical Report

OpenAI. GPT-4 Technical Report. https://openai.com/research/gpt-4. Accessed: 2025-07-11. 2023.

[51]

Introducing ChatGPT

OpenAI. Introducing ChatGPT. Accessed: 2025-07-11. 2022. URL: https://openai.com/blog/chatgpt

[52]

Deep voice 3: 2000-speaker neural text-to-speech

Wei Ping et al. “Deep voice 3: 2000-speaker neural text-to-speech”. In: Proc. ICLR. Vol. 79. 2018, pp. 1094–1099

[53]

Robust speech recognition via large-scale weak supervision

Alec Radford et al. “Robust speech recognition via large-scale weak supervision”. In: International conference on machine learning. PMLR. 2023, pp. 28492–28518

[54]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov et al. “Direct preference optimization: Your language model is secretly a reward model”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 53728–53741

[55]

Fastspeech 2: Fast and high-quality end-to-end text to speech

Yi Ren et al. “Fastspeech 2: Fast and high-quality end-to-end text to speech”. In: arXiv preprint arXiv:2006.04558 (2020)

[56]

Omni-r1: Do you really need audio to fine-tune your audio llm?

Andrew Rouditchenko et al. “Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?” In: arXiv preprint arXiv:2505.09439 (2025)

[57]

AudioPaLM: A Large Language Model That Can Speak and Listen

Paul K Rubenstein et al. “Audiopalm: A large language model that can speak and listen”. In: arXiv preprint arXiv:2306.12925 (2023)

[58]

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

S Sakshi et al. “Mmau: A massive multi-task audio understanding and reasoning benchmark”. In: arXiv preprint arXiv:2410.19168 (2024)

[59]

Natural tts synthesis by conditioning wavenet on mel spectrogram predictions

Jonathan Shen et al. “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions”. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE. 2018, pp. 4779–4783

[60]

Snac: Multi-scale neural audio codec

Hubert Siuzdak, Florian Grötschla, and Luca A Lanzendörfer. “Snac: Multi-scale neural audio codec”. In: arXiv preprint arXiv:2410.14411 (2024)

[61]

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang et al. “Salmonn: Towards generic hearing abilities for large language models”. In: arXiv preprint arXiv:2310.13289 (2023)

[62]

Verbmobil: foundations of speech-to-speech translation

Wolfgang Wahlster. Verbmobil: foundations of speech-to-speech translation. Springer Science & Business Media, 2013

[63]

CoVoST 2 and Massively Multilingual Speech-to-Text Translation

Changhan Wang, Anne Wu, and Juan Pino. CoVoST 2 and Massively Multilingual Speech-to-Text Translation. arXiv: 2007.10310 [cs.CL]. URL: https://arxiv.org/abs/2007.10310

[65]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang et al. “Neural codec language models are zero-shot text to speech synthesizers”. In: arXiv preprint arXiv:2301.02111 (2023)

[66]

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Xinsheng Wang et al. “Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens”. In: arXiv preprint arXiv:2503.01710 (2025)

[67]

Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm

Xiong Wang et al. “Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm”. In: arXiv preprint arXiv:2411.00774 (2024)

[68]

Maskgct: Zero-shot text-to-speech with masked generative codec transformer

Yuancheng Wang et al. “Maskgct: Zero-shot text-to-speech with masked generative codec transformer”. In: arXiv preprint arXiv:2409.00750 (2024)

[69]

Tacotron: Towards End-to-End Speech Synthesis

Yuxuan Wang et al. “Tacotron: Towards end-to-end speech synthesis”. In: arXiv preprint arXiv:1703.10135 (2017)

[70]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei et al. “Finetuned language models are zero-shot learners”. In: arXiv preprint arXiv:2109.01652 (2021)

[71]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu et al. “Google’s neural machine translation system: Bridging the gap between human and machine translation”. In: arXiv preprint arXiv:1609.08144 (2016)

[72]

Mini-omni: Language models can hear, talk while thinking in streaming

Zhifei Xie and Changqiao Wu. “Mini-omni: Language models can hear, talk while thinking in streaming”. In: arXiv preprint arXiv:2408.16725 (2024)

[73]

Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities

Zhifei Xie and Changqiao Wu. “Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities”. In: arXiv preprint arXiv:2410.11190 (2024)

[74]

BigCodec: Pushing the limits of low-bitrate neural speech codec

Detai Xin et al. “Bigcodec: Pushing the limits of low-bitrate neural speech codec”. In: arXiv preprint arXiv:2409.05377 (2024)

[75]

Qwen2.5-Omni Technical Report

Jin Xu et al. Qwen2.5-Omni Technical Report. 2025. arXiv: 2503.20215 [cs.CL]. URL: https://arxiv.org/abs/2503.20215

[76]

URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models

Ruiqi Yan et al. URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models. 2025. arXiv: 2502.17810 [cs.CL]. URL: https://arxiv.org/abs/2502.17810

[77]

Soundstream: An end-to-end neural audio codec

Neil Zeghidour et al. “Soundstream: An end-to-end neural audio codec”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021), pp. 495–507

[78]

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Aohan Zeng et al. “Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot”. In: arXiv preprint arXiv:2412.02612 (2024)

[79]

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

Binbin Zhang et al. WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition. 2022. arXiv: 2110.03370 [cs.SD]. URL: https://arxiv.org/abs/2110.03370


Showing first 80 references.

[6] [6]

Better speech synthesis through scaling, 2023

James Betker. Better speech synthesis through scaling . 2023. arXiv: 2305 . 07243 [cs.SD]. URL: https : //arxiv.org/abs/2305.07243

work page arXiv 2023

[7] [7]

Audiolm: a language modeling approach to audio generation

Zalán Borsos et al. “Audiolm: a language modeling approach to audio generation”. In: IEEE/ACM transactions on audio, speech, and language processing 31 (2023), pp. 2523–2533

work page 2023

[8] [8]

GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio

Guoguo Chen et al. “GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio”. In: Interspeech 2021. ISCA, Aug. 2021. DOI: 10.21437/interspeech.2021- 1965 . URL: http: //dx.doi.org/10.21437/Interspeech.2021-1965

work page doi:10.21437/interspeech.2021- 2021

[9] [9]

Minmo: A multimodal large language model for seamless voice interaction.arXiv preprint arXiv:2501.06282, 2025a

Qian Chen et al. “Minmo: A multimodal large language model for seamless voice interaction”. In:arXiv preprint arXiv:2501.06282 (2025)

work page arXiv 2025

[10] [10]

Sanyuan Chen et al.BEATs: Audio Pre-Training with Acoustic Tokenizers. 2022. arXiv:2212.09058 [eess.AS]. URL: https://arxiv.org/abs/2212.09058

work page arXiv 2022

[11] [11]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

Sanyuan Chen et al. “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing”. In: IEEE Journal of Selected Topics in Signal Processing16.6 (Oct. 2022), pp. 1505–1518. ISSN : 1941-0484. DOI: 10.1109/jstsp.2022.3188113. URL: http://dx.doi.org/10.1109/JSTSP.2022.3188113

work page doi:10.1109/jstsp.2022.3188113 2022

[12] [12]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu et al. “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models”. In: arXiv preprint arXiv:2311.07919 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

Qwen2-Audio Technical Report

Yunfei Chu et al. “Qwen2-audio technical report”. In: arXiv preprint arXiv:2407.10759 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi

Jade Copet et al. Simple and Controllable Music Generation. 2024. arXiv: 2306.05284 [cs.SD]. URL: https: //arxiv.org/abs/2306.05284

work page arXiv 2024

[15] [15]

High Fidelity Neural Audio Compression

Alexandre Défossez et al. “High fidelity neural audio compression”. In:arXiv preprint arXiv:2210.13438 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[16] [16]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez et al. “Moshi: a speech-text foundation model for real-time dialogue”. In:arXiv preprint arXiv:2410.00037 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

Pengi: An audio language model for audio tasks

Soham Deshmukh et al. “Pengi: An audio language model for audio tasks”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 18090–18108

work page 2023

[18] [18]

Kimi-Audio Technical Report

Ding Ding et al. “Kimi-audio technical report”. In: arXiv preprint arXiv:2504.18425 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du et al. “Cosyvoice 2: Scalable streaming speech synthesis with large language models”. In: arXiv preprint arXiv:2412.10117 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du et al. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens. 2024. arXiv: 2407.05407 [cs.SD]. URL: https://arxiv.org/abs/2407.05407

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

Llama- omni: Seamless speech interaction with large language models,

Qingkai Fang et al. “Llama-omni: Seamless speech interaction with large language models”. In: arXiv preprint arXiv:2409.06666 (2024)

work page arXiv 2024

[22] [22]

arXiv preprint arXiv:2501.16327 , year=

Heting Gao et al. LUCY: Linguistic Understanding and Control Yielding Early Stage of Her . 2025. arXiv: 2501.16327 [cs.CL]. URL: https://arxiv.org/abs/2501.16327

work page arXiv 2025

[23] [23]

Gemmeke, Daniel P

Jort F. Gemmeke et al. “Audio Set: An ontology and human-labeled dataset for audio events”. In:2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . 2017, pp. 776–780. DOI: 10.1109/ICASSP.2017.7952261. 14 Step-Audio 2 Technical Report

work page doi:10.1109/icassp.2017.7952261 2017

[24] [24]

Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,

Sreyan Ghosh et al. Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities. 2025. arXiv: 2503.03983 [cs.SD]. URL: https://arxiv.org/abs/2503.03983

work page arXiv 2025

[25] [25]

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Arushi Goel et al. Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models. 2025. arXiv: 2507.08128 [cs.SD]. URL: https://arxiv.org/abs/2507.08128

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

ADD 2022: The first audio deep synthesis detection challenge,

Yuan Gong, Jin Yu, and James Glass. “V ocalsound: A Dataset for Improving Human V ocal Sounds Recognition”. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2022, pp. 151–155. DOI: 10.1109/ICASSP43922.2022.9746828

work page doi:10.1109/icassp43922.2022.9746828 2022

[27] [27]

Joint audio and speech understanding

Yuan Gong et al. “Joint audio and speech understanding”. In:2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE. 2023, pp. 1–8

work page 2023

[28] [28]

arXiv preprint arXiv:2305.10790 , year=

Yuan Gong et al. “Listen, think, and understand”. In: arXiv preprint arXiv:2305.10790 (2023)

work page arXiv 2023

[29] [29]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The Llama 3 Herd of Models . 2024. arXiv: 2407 . 21783 [cs.AI]. URL: https : //arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

HuBERT: Self-Supervised Speech Representa- tion Learning by Masked Prediction of Hidden Units,

Wei-Ning Hsu et al. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. 2021. arXiv: 2106.07447 [cs.CL]. URL: https://arxiv.org/abs/2106.07447

work page arXiv 2021

[31] [31]

Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

Ailin Huang et al. “Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model”. In:arXiv preprint arXiv:2506.08967 (2025)

work page arXiv 2025

[32] [32]

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Ailin Huang et al. “Step-audio: Unified understanding and generation in intelligent speech interaction”. In: arXiv preprint arXiv:2502.11946 (2025)

work page internal anchor Pith review arXiv 2025

[33] [33]

Audiogpt: Understanding and generating speech, music, sound, and talking head

Rongjie Huang et al. “Audiogpt: Understanding and generating speech, music, sound, and talking head”. In: Proceedings of the AAAI Conference on Artificial Intelligence. V ol. 38. 21. 2024, pp. 23802–23804

work page 2024

[34] [34]

GPT-4o System Card

Aaron Hurst et al. “Gpt-4o system card”. In: arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

CochlScene: Acquisition of acoustic scene data using crowdsourcing

Il-Young Jeong and Jeongsoo Park. “CochlScene: Acquisition of acoustic scene data using crowdsourcing”. In: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). 2022, pp. 17–21. DOI: 10.23919/APSIPAASC55919.2022.9979822

work page doi:10.23919/apsipaasc55919.2022.9979822 2022

[36] [36]

Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling

Shengpeng Ji et al. “Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling”. In: arXiv preprint arXiv:2408.16532 (2024)

work page arXiv 2024

[37] [37]

CVSS corpus and massively multilingual speech-to-speech translation,

Ye Jia et al. CVSS Corpus and Massively Multilingual Speech-to-Speech Translation. 2022. arXiv: 2201.03713 [cs.CL]. URL: https://arxiv.org/abs/2201.03713

work page arXiv 2022

[38] [38]

Direct speech-to-speech translation with a sequence-to-sequence model

Ye Jia et al. “Direct speech-to-speech translation with a sequence-to-sequence model”. In: arXiv preprint arXiv:1904.06037 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1904

[39] [39]

Translatotron 2: High-quality direct speech-to-speech translation with voice preservation

Ye Jia et al. “Translatotron 2: High-quality direct speech-to-speech translation with voice preservation”. In: International conference on machine learning. PMLR. 2022, pp. 10120–10134

work page 2022

[40] [40]

Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,

Eugene Kharitonov et al. Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision. 2023. arXiv: 2302.03540 [cs.SD]. URL: https://arxiv.org/abs/2302.03540

work page arXiv 2023

[41] [41]

Audiocaps: Generating captions for audios in the wild

Chris Dongjoo Kim et al. “Audiocaps: Generating captions for audios in the wild”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, pp. 119–132

work page 2019

[42] [42]

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. 2020. arXiv: 2010.05646 [cs.SD]. URL: https://arxiv.org/abs/2010. 05646

work page arXiv 2020

[43] [43]

Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities

Zhifeng Kong et al. Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities. 2024. arXiv: 2402.01831 [cs.SD]. URL: https://arxiv.org/abs/2402.01831

work page arXiv 2024

[44] [44]

Transvip: Speech to speech translation system with voice and isochrony preservation

Chenyang Le et al. “Transvip: Speech to speech translation system with voice and isochrony preservation”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 89682–89705

work page 2024

[45] [45]

Textless speech-to-speech translation on real data

Ann Lee et al. “Textless speech-to-speech translation on real data”. In:arXiv preprint arXiv:2112.08352 (2021)

work page arXiv 2021

[46] [46]

Bigvgan: A universal neural vocoder with large-scale training,

Sang-gil Lee et al. BigVGAN: A Universal Neural Vocoder with Large-Scale Training. 2023. arXiv: 2206.04658 [cs.SD]. URL: https://arxiv.org/abs/2206.04658

work page arXiv 2023

[47] [47]

Advancing large language models to capture varied speaking styles and respond properly in spoken conversations

Guan-Ting Lin, Cheng-Han Chiang, and Hung-yi Lee. “Advancing large language models to capture varied speaking styles and respond properly in spoken conversations”. In: arXiv preprint arXiv:2402.12786 (2024)

work page arXiv 2024

[48] [48]

Paralinguistics-enhanced large language modeling of spoken dialogue

Guan-Ting Lin et al. “Paralinguistics-enhanced large language modeling of spoken dialogue”. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2024, pp. 10316–10320

work page 2024

[49] [49]

Spirit LM: Interleaved spoken and written language model,

Tu Anh Nguyen et al. Spirit LM: Interleaved Spoken and Written Language Model. 2024. arXiv: 2402.05755 [cs.CL]. URL: https://arxiv.org/abs/2402.05755

work page arXiv 2024

[50] [50]

GPT-4 Technical Report

OpenAI. GPT-4 Technical Report. https://openai.com/research/gpt-4. Accessed: 2025-07-11. 2023. 15 Step-Audio 2 Technical Report

work page 2025

[51] [51]

Introducing ChatGPT

OpenAI. Introducing ChatGPT. Accessed: 2025-07-11. 2022. URL: https://openai.com/blog/chatgpt

work page 2025

[52] [52]

Deep voice 3: 2000-speaker neural text-to-speech

Wei Ping et al. “Deep voice 3: 2000-speaker neural text-to-speech”. In:proc. ICLR. V ol. 79. 2018, pp. 1094–1099

work page 2000

[53] [53]

Robust speech recognition via large-scale weak supervision

Alec Radford et al. “Robust speech recognition via large-scale weak supervision”. In:International conference on machine learning. PMLR. 2023, pp. 28492–28518

work page 2023

[54] [54]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov et al. “Direct preference optimization: Your language model is secretly a reward model”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 53728–53741

[55]

Fastspeech 2: Fast and high-quality end-to-end text to speech

Yi Ren et al. “Fastspeech 2: Fast and high-quality end-to-end text to speech”. In: arXiv preprint arXiv:2006.04558 (2020)

[56]

Omni-r1: Do you really need audio to fine-tune your audio llm?

Andrew Rouditchenko et al. “Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?” In: arXiv preprint arXiv:2505.09439 (2025)

[57]

AudioPaLM: A Large Language Model That Can Speak and Listen

Paul K Rubenstein et al. “Audiopalm: A large language model that can speak and listen”. In: arXiv preprint arXiv:2306.12925 (2023)

[58]

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

S Sakshi et al. “Mmau: A massive multi-task audio understanding and reasoning benchmark”. In: arXiv preprint arXiv:2410.19168 (2024)

[59]

Natural tts synthesis by conditioning wavenet on mel spectrogram predictions

Jonathan Shen et al. “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions”. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE. 2018, pp. 4779–4783

[60]

Snac: Multi-scale neural audio codec

Hubert Siuzdak, Florian Grötschla, and Luca A Lanzendörfer. “Snac: Multi-scale neural audio codec”. In: arXiv preprint arXiv:2410.14411 (2024)

[61]

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang et al. “Salmonn: Towards generic hearing abilities for large language models”. In: arXiv preprint arXiv:2310.13289 (2023)

[62]

Verbmobil: Foundations of Speech-to-Speech Translation

Wolfgang Wahlster. Verbmobil: foundations of speech-to-speech translation. Springer Science & Business Media, 2013

[63]

Changhan Wang, Anne Wu, and Juan Pino. CoVoST 2 and Massively Multilingual Speech-to-Text Translation

[64]

CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus

arXiv: 2007.10310 [cs.CL]. URL: https://arxiv.org/abs/2007.10310

[65]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang et al. “Neural codec language models are zero-shot text to speech synthesizers”. In: arXiv preprint arXiv:2301.02111 (2023)

[66]

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Xinsheng Wang et al. “Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens”. In: arXiv preprint arXiv:2503.01710 (2025)

[67]

Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm

Xiong Wang et al. “Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm”. In: arXiv preprint arXiv:2411.00774 (2024)

[68]

Maskgct: Zero-shot text-to-speech with masked generative codec transformer

Yuancheng Wang et al. “Maskgct: Zero-shot text-to-speech with masked generative codec transformer”. In: arXiv preprint arXiv:2409.00750 (2024)

[69]

Tacotron: Towards End-to-End Speech Synthesis

Yuxuan Wang et al. “Tacotron: Towards end-to-end speech synthesis”. In: arXiv preprint arXiv:1703.10135 (2017)

[70]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei et al. “Finetuned language models are zero-shot learners”. In: arXiv preprint arXiv:2109.01652 (2021)

[71]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu et al. “Google’s neural machine translation system: Bridging the gap between human and machine translation”. In: arXiv preprint arXiv:1609.08144 (2016)

[72]

Mini-omni: Language models can hear, talk while thinking in streaming

Zhifei Xie and Changqiao Wu. “Mini-omni: Language models can hear, talk while thinking in streaming”. In: arXiv preprint arXiv:2408.16725 (2024)

[73]

Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities

Zhifei Xie and Changqiao Wu. “Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities”. In: arXiv preprint arXiv:2410.11190 (2024)

[74]

BigCodec: Pushing the limits of low-bitrate neural speech codec

Detai Xin et al. “Bigcodec: Pushing the limits of low-bitrate neural speech codec”. In: arXiv preprint arXiv:2409.05377 (2024)

[75]

Qwen2.5-Omni Technical Report

Jin Xu et al. Qwen2.5-Omni Technical Report. 2025. arXiv: 2503.20215 [cs.CL]. URL: https://arxiv.org/abs/2503.20215

[76]

URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models

Ruiqi Yan et al. URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models. 2025. arXiv: 2502.17810 [cs.CL]. URL: https://arxiv.org/abs/2502.17810

[77]

Soundstream: An end-to-end neural audio codec

Neil Zeghidour et al. “Soundstream: An end-to-end neural audio codec”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021), pp. 495–507

[78]

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Aohan Zeng et al. “Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot”. In: arXiv preprint arXiv:2412.02612 (2024)

[79]

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

Binbin Zhang et al. WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

[80]

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

arXiv: 2110.03370 [cs.SD]. URL: https://arxiv.org/abs/2110.03370
