Step-Audio 2 Technical Report
Pith reviewed 2026-05-16 05:55 UTC · model grok-4.3
The pith
Step-Audio 2 integrates latent audio encoding and discrete token generation to deliver state-of-the-art audio understanding and expressive end-to-end speech conversation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Step-Audio 2 demonstrates that an integrated architecture using latent audio encoding, reasoning-centric reinforcement learning, discrete audio token generation within language modeling, and retrieval-augmented generation produces stronger automatic speech recognition, audio understanding, and responsive conversational output than prior separate-component systems.
What carries the argument
Discrete audio token generation embedded in the language modeling process, which enables direct responsiveness to paralinguistic cues such as emotion and style.
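One way to picture "discrete audio tokens inside language modeling" is a single shared vocabulary in which audio codebook indices sit after the text ids, so one next-token head can emit either modality. The sketch below assumes exactly that layout plus a fixed text/audio interleaving ratio; the vocabulary sizes, the ratio, and all function names are hypothetical illustrations, not details taken from the report.

```python
# Hypothetical sizes; the report's actual vocabulary layout is not stated here.
TEXT_VOCAB_SIZE = 1000
AUDIO_CODEBOOK_SIZE = 256


def audio_to_unified(code: int) -> int:
    """Map an audio codebook index to its id in the shared vocabulary."""
    if not 0 <= code < AUDIO_CODEBOOK_SIZE:
        raise ValueError(f"audio code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def interleave(text_ids, audio_codes, text_chunk=1, audio_chunk=2):
    """Alternate chunks of text ids and (offset) audio codes in one sequence."""
    if text_chunk < 1 or audio_chunk < 1:
        raise ValueError("chunk sizes must be positive")
    out, t, a = [], 0, 0
    while t < len(text_ids) or a < len(audio_codes):
        out.extend(text_ids[t:t + text_chunk])
        t += text_chunk
        out.extend(audio_to_unified(c) for c in audio_codes[a:a + audio_chunk])
        a += audio_chunk
    return out


def split_unified(ids):
    """Separate a mixed sequence back into text ids and raw audio codes."""
    text, audio = [], []
    for i in ids:
        if i < TEXT_VOCAB_SIZE:
            text.append(i)
        else:
            audio.append(i - TEXT_VOCAB_SIZE)
    return text, audio
```

Under these assumptions, the audio codes recovered by `split_unified` would be handed to a separate decoder to produce a waveform, while the text ids remain available as a transcript of the response.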
If this is right
- Direct modeling of paralinguistic information reduces the need for separate emotion or style modules in conversational agents
- Tool calling for web search and audio retrieval measurably lowers hallucination rates in spoken responses
- End-to-end discrete token output supports lower-latency turn-taking in multi-turn dialogue
- Scaling to millions of hours of training data yields consistent gains across diverse conversational domains
Where Pith is reading between the lines
- The same token-generation approach could be applied to video or sensor streams to create unified multi-modal conversational systems
- External tool integration may allow future models to maintain up-to-date knowledge without full retraining
- If the RL component generalizes well, similar reasoning-centric training could improve robustness in low-resource languages or noisy environments
Load-bearing premise
The combination of latent encoding, RL reasoning, discrete tokens, and RAG produces robust performance on real-world conversational audio beyond the evaluated benchmarks.
What would settle it
A new test set of long-form conversational audio with varied emotions and accents where Step-Audio 2 shows no accuracy or naturalness advantage over strong baseline models.
Original abstract
This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Step-Audio 2, an end-to-end multi-modal LLM for audio understanding and conversational speech. It combines a latent audio encoder, reasoning-centric reinforcement learning, discrete audio token generation to capture paralinguistic cues, and RAG with external tool calling (web search, audio search) to reduce hallucinations. Trained on millions of hours of speech and audio data, the model claims state-of-the-art results on various audio understanding and conversational benchmarks relative to open-source and commercial baselines.
Significance. If the performance claims are substantiated with complete benchmark tables, baselines, and ablations, the work would advance practical end-to-end audio LLMs by showing how RL-driven reasoning and RAG can be integrated with discrete token modeling for expressive, low-hallucination conversation. The industry-oriented framing and emphasis on real-world tool use are strengths.
Major comments (2)
- [Evaluation] The SOTA claim is presented without named benchmarks (e.g., no LibriSpeech, CommonVoice, or conversational test sets), exact baseline versions, metric definitions, data splits, error bars, or ablation results on the RL or RAG components. This absence leaves the central empirical result, on which the paper's contribution rests, unverifiable.
- [§3 (Architecture) and §4 (Training)] The claim that discrete token generation 'significantly enhances responsiveness to paralinguistic information' is stated without quantitative comparison to a continuous-token or non-RL baseline, leaving the contribution of this design choice unsupported by evidence.
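For the ASR portion, the metric definitions and error bars requested in the first major comment could take a form like the sketch below. It assumes word error rate (WER) computed from word-level Levenshtein distance and a percentile bootstrap over utterances; neither choice is confirmed by the paper, and the function names are illustrative.

```python
import random


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = cur
    return prev[-1]


def corpus_wer(pairs):
    """Total word errors divided by total reference words over (ref, hyp) strings."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("no reference words")
    return errors / words


def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for corpus WER, resampling utterances."""
    rng = random.Random(seed)
    stats = sorted(
        corpus_wer(rng.choices(pairs, k=len(pairs))) for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Reporting the interval alongside the point estimate for each named benchmark and data split would let readers judge whether gaps between systems exceed resampling noise.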
Minor comments (2)
- [Abstract] Phrases such as 'promising performance' and 'state-of-the-art' are used without accompanying metrics or qualifiers.
- [Conclusion] The GitHub link is provided but no details on released code, checkpoints, or evaluation scripts are given in the text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and verifiability of the results.
Point-by-point responses
- Referee: [Evaluation] The SOTA claim is presented without named benchmarks (e.g., no LibriSpeech, CommonVoice, or conversational test sets), exact baseline versions, metric definitions, data splits, error bars, or ablation results on the RL or RAG components. This absence leaves the central empirical result, on which the paper's contribution rests, unverifiable.
  Authors: We acknowledge that the current evaluation section provides only high-level SOTA claims without the requested specifics. In the revised manuscript we will expand this section to explicitly name the benchmarks (including LibriSpeech, CommonVoice, and conversational test sets), list exact baseline versions, define all metrics, specify data splits, report error bars where available, and include ablation results isolating the RL and RAG components.
  Revision: yes
- Referee: [§3 (Architecture) and §4 (Training)] The claim that discrete token generation 'significantly enhances responsiveness to paralinguistic information' is stated without quantitative comparison to a continuous-token or non-RL baseline, leaving the contribution of this design choice unsupported by evidence.
  Authors: We agree that the claim would be stronger with direct quantitative support. The revised version will add comparisons to continuous-token and non-RL baselines, reporting relevant metrics that demonstrate the contribution of discrete token generation to paralinguistic responsiveness.
  Revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper is an empirical technical report on model architecture, training data, and benchmark results with no mathematical derivations, equations, or self-referential definitions present. Performance claims reference external benchmarks and datasets rather than quantities defined or fitted inside the paper. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The central claims reduce to standard empirical evaluation and are therefore self-contained.
Forward citations
Cited by 51 Pith papers
- HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
  HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...
- Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
  SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.
- RedVox: Safety and Fairness Gaps in Speech Models Across Languages
  RedVox benchmark shows speech model safety and fairness vulnerabilities persist under non-adversarial conditions, worsen in non-English languages, and increase with spoken inputs.
- AOR-Bench: Do Large Audio Language Models Over-Refuse Pseudo-Harmful Queries?
  Introduces the first benchmark for over-refusal in large audio language models using 3,000 pseudo-harmful audio samples and evaluates 12 models across six families, finding widespread over-refusal.
- Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models
  Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.
- Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering
  FD-SLMs exhibit state inertia during abrupt interruptions that a training-free perception-vector steering intervention mitigates, lifting correctness from 28% to 45% and IWOR from 40% to 72% on the Zero-Buffer Benchmark.
- SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing
  SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show ...
- PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
  PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and de...
- DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
  DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-...
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
- How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
  Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
- Liberating LLM Capabilities in Full-Duplex Speech Models
  LWS is a text-first paradigm for full-duplex speech LLMs that treats visible writing as a primary output channel alongside audio input and spoken response, implemented via token schema and synthetic per-second annotations.
- SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
  SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
- HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
  HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
- CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
  CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
- Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
  Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
- TiCo: Time-Controllable Spoken Dialogue Model
  TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
- MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus
  MCGA is a new 119-hour multi-task audio corpus for classical Chinese literary genres that shows current MLLMs face substantial challenges on its test set.
- Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
  Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.
- Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation
  PRIME-Speech adds low-latency speech output to frozen S2T LLMs by synchronizing a causal post-decoder with intermediate hidden states and using mixed conditioning plus turn-level KV-cache packing, preserving original ...
- MSU-Bench: Towards Speaker-Centric Understanding in Conversational Multi-Speaker Scenarios
  MSU-Bench is a new two-tier benchmark covering speaker grounding to dialogue reasoning in multi-speaker conversations, with Gemini-assisted annotation and human verification.
- Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
  A multi-axis RL alignment technique improves pause handling, turn-taking, backchanneling, and interruption response in full-duplex spoken dialogue models by optimizing axis-specific rewards derived from human audio segments.
- RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark
  Introduces RAIL, a CHC-grounded benchmark with five core auditory capabilities to assess LALMs beyond task-centric metrics, showing uneven model performance.
- Audio Interaction Model
  Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.
- LaSR: Context-Aware Speech Recognition via Latent Reasoning
  LaSR improves context-aware terminology recognition in speech LLMs by aligning latent CoT supervision on acoustic regions and introducing latent reasoning periods, shown on a new academic corpus to outperform standard...
- Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
  A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
- EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs
  EchoDistill applies noisy-to-clean self-distillation with GRPO to boost Audio LLM robustness, reporting 4.18% average GSR gains under strong noise.
- VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
  VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
- VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
  VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
- Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
  Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
- Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
  A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
- Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
  A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
- Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models
  MPS proposes a dual-brain architecture separating formulation reasoning from articulation to achieve real-time CoT in SLMs with accuracy comparable to full pre-computation but much lower latency.
- StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
  StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
- ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models
  ALM2Vec learns unified audio embeddings from large audio-language models for text-audio retrieval, instruction-aware retrieval, and other tasks across domains.
- Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization
  MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve bench...
- StepAudio 2.5 Technical Report
  StepAudio 2.5 is a unified audio-language foundation model that reaches state-of-the-art results on ASR, TTS, and realtime interaction by using task-tailored RLHF on a shared backbone.
- DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
  DuplexSLA is a dual-stream three-channel full-duplex model that synchronizes continuous user audio, discrete assistant audio, and rate-limited action text for native turn-taking and in-conversation tool calling.
- A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
  A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
- Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
  A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.
- Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
  Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
- Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving
  JSTIP interleaves speech and text sequences during pretraining on 38k hours of ASR data to improve entity accuracy over ASR-only and simple joint-training baselines while matching performance from domain text.
- Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models
  CogAudio-LLM introduces LIME-440K dataset, EIPS chain-of-thought reasoning, and DR-SAPO optimization to address semantic dominance and improve affective responses in audio language models.
- Audio-Mind: An Auditable Agentic Framework for Audio Understanding
  Audio-Mind introduces a conditional, auditable agentic framework for audio understanding that preserves frontend judgment and acquires bounded external evidence only when needed, reporting 80.4% on MMAR and 82.8% on M...
- Step-Audio-R1.5 Technical Report
  Step-Audio-R1.5 applies RLHF to audio reasoning models to escape the verifiable reward trap of RLVR, preserving analytical ability while restoring prosodic naturalness and immersion in long dialogues.
- Step-Audio-R1.5 Technical Report
  Step-Audio-R1.5 applies RLHF to audio reasoning models to maintain analytical performance while improving prosodic naturalness and immersion in extended spoken interactions.
- NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
  NIM4-ASR delivers SOTA ASR performance on public benchmarks using a 2.3B-parameter LLM with multi-stage training, real-time streaming, and million-scale hotword customization via RAG.
- OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization
  OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
- A Survey of Audio Reasoning in Multimodal Foundation Models
  A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.
- A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
  A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.
Reference graph
Works this paper leans on
-
[1]
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Philip Anastassiou et al. “Seed-tts: A family of high-quality versatile speech generation models”. In: arXiv preprint arXiv:2406.02430 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Rohan Anil et al. PaLM 2 Technical Report. 2023. arXiv: 2305.10403 [cs.CL]. URL: https://arxiv.org/ abs/2305.10403
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
wav2vec 2.0: A framework for self-supervised learning of speech representations,
Alexei Baevski et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. 2020. arXiv: 2006.11477 [cs.CL]. URL: https://arxiv.org/abs/2006.11477
-
[4]
Jinze Bai et al. Qwen Technical Report. 2023. arXiv: 2309.16609 [cs.CL]. URL: https://arxiv.org/abs/ 2309.16609
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Seed-ASR: Understanding diverse speech and contexts with LLM-based speech recognition,
Ye Bai et al. “Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition”. In:arXiv preprint arXiv:2407.04675 (2024)
-
[6]
Better speech synthesis through scaling, 2023
James Betker. Better speech synthesis through scaling . 2023. arXiv: 2305 . 07243 [cs.SD]. URL: https : //arxiv.org/abs/2305.07243
-
[7]
Audiolm: a language modeling approach to audio generation
Zalán Borsos et al. “Audiolm: a language modeling approach to audio generation”. In: IEEE/ACM transactions on audio, speech, and language processing 31 (2023), pp. 2523–2533
work page 2023
-
[8]
GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio
Guoguo Chen et al. “GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio”. In: Interspeech 2021. ISCA, Aug. 2021. DOI: 10.21437/interspeech.2021- 1965 . URL: http: //dx.doi.org/10.21437/Interspeech.2021-1965
-
[9]
Qian Chen et al. “Minmo: A multimodal large language model for seamless voice interaction”. In:arXiv preprint arXiv:2501.06282 (2025)
- [10]
-
[11]
Wavlm: Large-scale self-supervised pre-training for full stack speech processing,
Sanyuan Chen et al. “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing”. In: IEEE Journal of Selected Topics in Signal Processing16.6 (Oct. 2022), pp. 1505–1518. ISSN : 1941-0484. DOI: 10.1109/jstsp.2022.3188113. URL: http://dx.doi.org/10.1109/JSTSP.2022.3188113
-
[12]
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Yunfei Chu et al. “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models”. In: arXiv preprint arXiv:2311.07919 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
Yunfei Chu et al. “Qwen2-audio technical report”. In: arXiv preprint arXiv:2407.10759 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi
Jade Copet et al. Simple and Controllable Music Generation. 2024. arXiv: 2306.05284 [cs.SD]. URL: https: //arxiv.org/abs/2306.05284
-
[15]
High Fidelity Neural Audio Compression
Alexandre Défossez et al. “High fidelity neural audio compression”. In:arXiv preprint arXiv:2210.13438 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[16]
Moshi: a speech-text foundation model for real-time dialogue
Alexandre Défossez et al. “Moshi: a speech-text foundation model for real-time dialogue”. In:arXiv preprint arXiv:2410.00037 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Pengi: An audio language model for audio tasks
Soham Deshmukh et al. “Pengi: An audio language model for audio tasks”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 18090–18108
work page 2023
-
[18]
Ding Ding et al. “Kimi-audio technical report”. In: arXiv preprint arXiv:2504.18425 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
Zhihao Du et al. “Cosyvoice 2: Scalable streaming speech synthesis with large language models”. In: arXiv preprint arXiv:2412.10117 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Zhihao Du et al. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens. 2024. arXiv: 2407.05407 [cs.SD]. URL: https://arxiv.org/abs/2407.05407
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Llama- omni: Seamless speech interaction with large language models,
Qingkai Fang et al. “Llama-omni: Seamless speech interaction with large language models”. In: arXiv preprint arXiv:2409.06666 (2024)
-
[22]
arXiv preprint arXiv:2501.16327 , year=
Heting Gao et al. LUCY: Linguistic Understanding and Control Yielding Early Stage of Her . 2025. arXiv: 2501.16327 [cs.CL]. URL: https://arxiv.org/abs/2501.16327
-
[23]
Jort F. Gemmeke et al. “Audio Set: An ontology and human-labeled dataset for audio events”. In:2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . 2017, pp. 776–780. DOI: 10.1109/ICASSP.2017.7952261. 14 Step-Audio 2 Technical Report
-
[24]
Sreyan Ghosh et al. Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities. 2025. arXiv: 2503.03983 [cs.SD]. URL: https://arxiv.org/abs/2503.03983
-
[25]
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Arushi Goel et al. Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models. 2025. arXiv: 2507.08128 [cs.SD]. URL: https://arxiv.org/abs/2507.08128
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
ADD 2022: The first audio deep synthesis detection challenge,
Yuan Gong, Jin Yu, and James Glass. “V ocalsound: A Dataset for Improving Human V ocal Sounds Recognition”. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2022, pp. 151–155. DOI: 10.1109/ICASSP43922.2022.9746828
-
[27]
Joint audio and speech understanding
Yuan Gong et al. “Joint audio and speech understanding”. In:2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE. 2023, pp. 1–8
work page 2023
-
[28]
arXiv preprint arXiv:2305.10790 , year=
Yuan Gong et al. “Listen, think, and understand”. In: arXiv preprint arXiv:2305.10790 (2023)
-
[29]
Aaron Grattafiori et al. The Llama 3 Herd of Models . 2024. arXiv: 2407 . 21783 [cs.AI]. URL: https : //arxiv.org/abs/2407.21783
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
HuBERT: Self-Supervised Speech Representa- tion Learning by Masked Prediction of Hidden Units,
Wei-Ning Hsu et al. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. 2021. arXiv: 2106.07447 [cs.CL]. URL: https://arxiv.org/abs/2106.07447
-
[31]
Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model
Ailin Huang et al. “Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model”. In:arXiv preprint arXiv:2506.08967 (2025)
-
[32]
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Ailin Huang et al. “Step-audio: Unified understanding and generation in intelligent speech interaction”. In: arXiv preprint arXiv:2502.11946 (2025)
work page internal anchor Pith review arXiv 2025
-
[33]
Audiogpt: Understanding and generating speech, music, sound, and talking head
Rongjie Huang et al. “Audiogpt: Understanding and generating speech, music, sound, and talking head”. In: Proceedings of the AAAI Conference on Artificial Intelligence. V ol. 38. 21. 2024, pp. 23802–23804
work page 2024
-
[34]
Aaron Hurst et al. “Gpt-4o system card”. In: arXiv preprint arXiv:2410.21276 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
CochlScene: Acquisition of acoustic scene data using crowdsourcing
Il-Young Jeong and Jeongsoo Park. “CochlScene: Acquisition of acoustic scene data using crowdsourcing”. In: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). 2022, pp. 17–21. DOI: 10.23919/APSIPAASC55919.2022.9979822
-
[36]
Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling
Shengpeng Ji et al. “Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling”. In: arXiv preprint arXiv:2408.16532 (2024)
-
[37]
CVSS corpus and massively multilingual speech-to-speech translation,
Ye Jia et al. CVSS Corpus and Massively Multilingual Speech-to-Speech Translation. 2022. arXiv: 2201.03713 [cs.CL]. URL: https://arxiv.org/abs/2201.03713
-
[38]
Direct speech-to-speech translation with a sequence-to-sequence model
Ye Jia et al. “Direct speech-to-speech translation with a sequence-to-sequence model”. In: arXiv preprint arXiv:1904.06037 (2019)
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[39]
Translatotron 2: High-quality direct speech-to-speech translation with voice preservation
Ye Jia et al. “Translatotron 2: High-quality direct speech-to-speech translation with voice preservation”. In: International conference on machine learning. PMLR. 2022, pp. 10120–10134
work page 2022
-
[40]
Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,
Eugene Kharitonov et al. Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision. 2023. arXiv: 2302.03540 [cs.SD]. URL: https://arxiv.org/abs/2302.03540
-
[41]
Audiocaps: Generating captions for audios in the wild
Chris Dongjoo Kim et al. “Audiocaps: Generating captions for audios in the wild”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, pp. 119–132
-
[42]
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. 2020. arXiv: 2010.05646 [cs.SD]. URL: https://arxiv.org/abs/2010.05646
-
[43]
Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities
Zhifeng Kong et al. Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities. 2024. arXiv: 2402.01831 [cs.SD]. URL: https://arxiv.org/abs/2402.01831
-
[44]
Transvip: Speech to speech translation system with voice and isochrony preservation
Chenyang Le et al. “Transvip: Speech to speech translation system with voice and isochrony preservation”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 89682–89705
-
[45]
Textless speech-to-speech translation on real data
Ann Lee et al. “Textless speech-to-speech translation on real data”. In: arXiv preprint arXiv:2112.08352 (2021)
-
[46]
Bigvgan: A universal neural vocoder with large-scale training
Sang-gil Lee et al. BigVGAN: A Universal Neural Vocoder with Large-Scale Training. 2023. arXiv: 2206.04658 [cs.SD]. URL: https://arxiv.org/abs/2206.04658
-
[47]
Guan-Ting Lin, Cheng-Han Chiang, and Hung-yi Lee. “Advancing large language models to capture varied speaking styles and respond properly in spoken conversations”. In: arXiv preprint arXiv:2402.12786 (2024)
-
[48]
Paralinguistics-enhanced large language modeling of spoken dialogue
Guan-Ting Lin et al. “Paralinguistics-enhanced large language modeling of spoken dialogue”. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2024, pp. 10316–10320
-
[49]
Spirit LM: Interleaved spoken and written language model
Tu Anh Nguyen et al. Spirit LM: Interleaved Spoken and Written Language Model. 2024. arXiv: 2402.05755 [cs.CL]. URL: https://arxiv.org/abs/2402.05755
-
[50]
OpenAI. GPT-4 Technical Report. https://openai.com/research/gpt-4. Accessed: 2025-07-11. 2023
-
[51]
OpenAI. Introducing ChatGPT. Accessed: 2025-07-11. 2022. URL: https://openai.com/blog/chatgpt
-
[52]
Deep voice 3: 2000-speaker neural text-to-speech
Wei Ping et al. “Deep voice 3: 2000-speaker neural text-to-speech”. In: Proc. ICLR. Vol. 79. 2018, pp. 1094–1099
-
[53]
Robust speech recognition via large-scale weak supervision
Alec Radford et al. “Robust speech recognition via large-scale weak supervision”. In: International conference on machine learning. PMLR. 2023, pp. 28492–28518
-
[54]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov et al. “Direct preference optimization: Your language model is secretly a reward model”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 53728–53741
-
[55]
Fastspeech 2: Fast and high-quality end-to-end text to speech
Yi Ren et al. “Fastspeech 2: Fast and high-quality end-to-end text to speech”. In: arXiv preprint arXiv:2006.04558 (2020)
-
[56]
Omni-r1: Do you really need audio to fine-tune your audio llm?
Andrew Rouditchenko et al. “Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?” In: arXiv preprint arXiv:2505.09439 (2025)
-
[57]
AudioPaLM: A Large Language Model That Can Speak and Listen
Paul K Rubenstein et al. “Audiopalm: A large language model that can speak and listen”. In: arXiv preprint arXiv:2306.12925 (2023)
-
[58]
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
S Sakshi et al. “Mmau: A massive multi-task audio understanding and reasoning benchmark”. In: arXiv preprint arXiv:2410.19168 (2024)
-
[59]
Natural tts synthesis by conditioning wavenet on mel spectrogram predictions
Jonathan Shen et al. “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions”. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE. 2018, pp. 4779–4783
-
[60]
Snac: Multi-scale neural audio codec
Hubert Siuzdak, Florian Grötschla, and Luca A Lanzendörfer. “Snac: Multi-scale neural audio codec”. In: arXiv preprint arXiv:2410.14411 (2024)
-
[61]
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Changli Tang et al. “Salmonn: Towards generic hearing abilities for large language models”. In: arXiv preprint arXiv:2310.13289 (2023)
-
[62]
Verbmobil: foundations of speech-to-speech translation
Wolfgang Wahlster. Verbmobil: foundations of speech-to-speech translation. Springer Science & Business Media, 2013
-
[63]
Changhan Wang, Anne Wu, and Juan Pino. CoVoST 2 and Massively Multilingual Speech-to-Text Translation. arXiv: 2007.10310 [cs.CL]. URL: https://arxiv.org/abs/2007.10310
-
[65]
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Chengyi Wang et al. “Neural codec language models are zero-shot text to speech synthesizers”. In: arXiv preprint arXiv:2301.02111 (2023)
-
[66]
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Xinsheng Wang et al. “Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens”. In: arXiv preprint arXiv:2503.01710 (2025)
-
[67]
Xiong Wang et al. “Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm”. In: arXiv preprint arXiv:2411.00774 (2024)
-
[68]
Maskgct: Zero-shot text-to-speech with masked generative codec transformer
Yuancheng Wang et al. “Maskgct: Zero-shot text-to-speech with masked generative codec transformer”. In: arXiv preprint arXiv:2409.00750 (2024)
-
[69]
Tacotron: Towards End-to-End Speech Synthesis
Yuxuan Wang et al. “Tacotron: Towards end-to-end speech synthesis”. In: arXiv preprint arXiv:1703.10135 (2017)
-
[70]
Finetuned Language Models Are Zero-Shot Learners
Jason Wei et al. “Finetuned language models are zero-shot learners”. In: arXiv preprint arXiv:2109.01652 (2021)
-
[71]
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu et al. “Google’s neural machine translation system: Bridging the gap between human and machine translation”. In: arXiv preprint arXiv:1609.08144 (2016)
-
[72]
Mini-omni: Language models can hear, talk while thinking in streaming
Zhifei Xie and Changqiao Wu. “Mini-omni: Language models can hear, talk while thinking in streaming”. In: arXiv preprint arXiv:2408.16725 (2024)
-
[73]
Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities
Zhifei Xie and Changqiao Wu. “Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities”. In: arXiv preprint arXiv:2410.11190 (2024)
-
[74]
BigCodec: Pushing the limits of low-bitrate neural speech codec
Detai Xin et al. “Bigcodec: Pushing the limits of low-bitrate neural speech codec”. In: arXiv preprint arXiv:2409.05377 (2024)
-
[75]
Jin Xu et al. Qwen2.5-Omni Technical Report. 2025. arXiv: 2503.20215 [cs.CL]. URL: https://arxiv.org/abs/2503.20215
-
[76]
URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models
Ruiqi Yan et al. URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models. 2025. arXiv: 2502.17810 [cs.CL]. URL: https://arxiv.org/abs/2502.17810
-
[77]
Soundstream: An end-to-end neural audio codec
Neil Zeghidour et al. “Soundstream: An end-to-end neural audio codec”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021), pp. 495–507
-
[78]
GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot
Aohan Zeng et al. “Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot”. In: arXiv preprint arXiv:2412.02612 (2024)
-
[79]
WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
Binbin Zhang et al. WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition. arXiv: 2110.03370 [cs.SD]. URL: https://arxiv.org/abs/2110.03370