
arxiv: 2406.02430 · v1 · pith:UZSHTDQ6new · submitted 2024-06-04 · 📡 eess.AS · cs.SD

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Pith reviewed 2026-05-15 12:22 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords text-to-speech · speech synthesis · autoregressive models · diffusion models · speaker similarity · emotion control · speech editing · foundation models

The pith

Seed-TTS generates speech that matches human recordings in speaker similarity and naturalness according to objective metrics and listener tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Seed-TTS as a family of large-scale autoregressive text-to-speech models that produce speech virtually indistinguishable from human speech. These models match ground truth performance in speaker similarity and naturalness through both objective measures and subjective evaluations, while supporting in-context learning for new speakers and fine control over attributes such as emotion. A non-autoregressive variant called Seed-TTS_DiT uses a fully diffusion-based architecture for end-to-end generation without relying on pre-estimated phoneme durations. The authors add self-distillation for speech factorization and reinforcement learning to improve robustness, similarity, and controllability, positioning the system as a versatile foundation model for expressive speech generation across diverse conditions.

Core claim

Seed-TTS achieves performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations, serving as a foundation model for speech generation with superior controllability over speech attributes such as emotion and the ability to produce highly expressive and diverse speech for speakers in the wild.

What carries the argument

Large-scale autoregressive text-to-speech model enhanced by self-distillation for speech factorization and reinforcement learning for robustness, paired with a fully diffusion-based non-autoregressive architecture that performs end-to-end speech generation without pre-estimated durations.

If this is right

  • Fine-tuning produces even higher subjective scores in naturalness and speaker similarity.
  • The models support effective in-context learning for speakers outside the training set.
  • Seed-TTS_DiT enables speech editing through its end-to-end diffusion process.
  • Reinforcement learning improves robustness and controllability over emotional expression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use could replace recorded human voices in media and virtual agents if performance holds outside controlled test conditions.
  • The factorization and reinforcement learning steps might transfer to other audio generation tasks such as music or sound effects.
  • Real-time deployment could support dynamic, personalized voice output in interactive systems without per-speaker retraining.

Load-bearing premise

The subjective listener evaluations and chosen objective metrics reliably indicate real-world indistinguishability from human speech, and the models generalize to unseen speakers and conditions without overfitting.

What would settle it

A blind listening test with many participants across varied real-world conditions and unseen speakers where listeners cannot distinguish Seed-TTS outputs from actual human recordings at rates above chance.
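
A minimal sketch of how such a test could be scored, assuming a two-alternative forced-choice design where a listener who is purely guessing is right half the time. The function name `binomial_p_value` and the trial counts are hypothetical and do not come from the paper. The exact one-sided binomial test below asks how likely the observed number of correct identifications would be under guessing.

```python
from math import comb


def binomial_p_value(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= correct) under pure guessing."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must lie between 0 and trials")
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical outcome: listeners pick the human recording in 60 of 100 trials.
p = binomial_p_value(60, 100)
print(f"p-value against chance: {p:.4f}")
```

A small p-value would indicate that listeners can tell the two apart. A large p-value only means the test failed to detect a difference; it does not by itself establish indistinguishability, which is why the number of trials and listeners matters.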

read the original abstract

We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named $\text{Seed-TTS}_\text{DiT}$, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, $\text{Seed-TTS}_\text{DiT}$ does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at https://bytedancespeech.github.io/seedtts_tech_report.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Seed-TTS, a family of large-scale autoregressive TTS models (with a diffusion-based NAR variant Seed-TTS_DiT) that generate speech claimed to be virtually indistinguishable from human speech. It reports matching ground-truth performance in speaker similarity and naturalness via objective and subjective evaluations, strong in-context learning, controllability over attributes like emotion, and further gains from fine-tuning, self-distillation for factorization, and RL for robustness.

Significance. If the central performance claims hold under rigorous scrutiny, the work would constitute a meaningful contribution to speech generation by providing versatile foundation models with high fidelity, expressiveness for in-the-wild speakers, and end-to-end NAR processing without pre-estimated durations. The combination of AR and DiT architectures plus RL enhancements offers practical value, though the current lack of evaluation transparency limits immediate assessment of its standing relative to prior TTS systems.

major comments (3)
  1. [Abstract] Abstract: The claim that Seed-TTS 'matches ground truth human speech in both objective and subjective evaluations' and produces output that is 'virtually indistinguishable' is load-bearing for the central contribution, yet the manuscript provides no details on the subjective protocol (forced-choice discrimination such as ABX vs. scalar MOS ratings, number of listeners and utterances, presentation of ground-truth references, or strict held-out test speakers/conditions). Scalar ratings alone can approach ceiling values without proving indistinguishability.
  2. [Section 4] Section 4 (Experiments) and abstract: No information is given on training data scale, exact objective metrics (e.g., specific speaker similarity measures or their computation), chosen baselines, or statistical significance (error bars, p-values). This absence prevents evaluation of whether reported gains are robust or could be explained by data scale or overfitting.
  3. [Section 5] Section 5 (fine-tuning and RL): The post-hoc fine-tuning and RL improvements are presented as achieving 'even higher subjective scores,' but without reporting the base vs. fine-tuned comparison tables, training schedules, or controls for data leakage, it is unclear whether these gains reflect genuine robustness enhancements or simply additional adaptation to the evaluation distribution.
minor comments (2)
  1. [Abstract] The manuscript would benefit from explicit cross-references in the text to specific demo audio examples that illustrate the controllability and editing claims.
  2. [Section 3] Notation for the DiT variant (Seed-TTS_DiT) is introduced without a dedicated equation or diagram clarifying how the diffusion process replaces autoregressive token prediction while remaining end-to-end.
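
The error bars requested in major comment 2 could be produced in several ways. One common option is a percentile bootstrap over per-utterance scores, sketched below. The function name `bootstrap_ci` and the score lists are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import random
from statistics import mean


def bootstrap_ci(system, reference, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(system) - mean(reference).

    Both arguments are lists of per-utterance scores. Each list is
    resampled with replacement, independently of the other.
    """
    if not system or not reference:
        raise ValueError("both score lists must be non-empty")
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        s = rng.choices(system, k=len(system))
        r = rng.choices(reference, k=len(reference))
        diffs.append(mean(s) - mean(r))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical naturalness ratings for synthesized and human utterances.
mos_system = [4.5, 4.0, 5.0, 4.5, 4.0, 4.5, 5.0, 3.5]
mos_human = [4.5, 4.5, 5.0, 4.0, 4.0, 4.5, 4.5, 4.0]
low, high = bootstrap_ci(mos_system, mos_human)
print(f"95% CI for the mean difference: [{low:.2f}, {high:.2f}]")
```

An interval that comfortably contains zero is consistent with parity between the two conditions, while a wide interval signals that the sample is too small to support a claim either way.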

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We will revise the paper to address the concerns regarding evaluation transparency and provide more details on the experimental setup. Below we respond to each major comment.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that Seed-TTS 'matches ground truth human speech in both objective and subjective evaluations' and produces output that is 'virtually indistinguishable' is load-bearing for the central contribution, yet the manuscript provides no details on the subjective protocol (forced-choice discrimination such as ABX vs. scalar MOS ratings, number of listeners and utterances, presentation of ground-truth references, or strict held-out test speakers/conditions). Scalar ratings alone can approach ceiling values without proving indistinguishability.

    Authors: We agree that additional details on the subjective evaluation protocol are essential to support the claims. In the revised version, we will expand the abstract and add a dedicated subsection in Section 4 describing the listening test methodology. This will include: the use of ABX or MOS ratings, number of participants (e.g., 20+ native speakers), number of utterances per condition, how ground-truth references were presented, and confirmation that evaluations used held-out speakers and in-the-wild conditions not seen during training. We believe this will demonstrate the indistinguishability more rigorously. revision: yes

  2. Referee: [Section 4] Section 4 (Experiments) and abstract: No information is given on training data scale, exact objective metrics (e.g., specific speaker similarity measures or their computation), chosen baselines, or statistical significance (error bars, p-values). This absence prevents evaluation of whether reported gains are robust or could be explained by data scale or overfitting.

    Authors: We acknowledge this gap in the current draft. The revised manuscript will include: (1) details on the training data scale, such as the total hours of speech data used (noting it is on the order of tens of thousands of hours from diverse sources); (2) exact definitions and computation methods for objective metrics, e.g., speaker similarity via cosine similarity between embeddings from a pre-trained speaker verification model like ECAPA-TDNN; (3) a full list of baselines compared against, including recent TTS systems; and (4) statistical analysis with error bars from multiple runs or bootstrap methods and p-values for key comparisons. This will allow readers to assess the robustness of the results. revision: yes

  3. Referee: [Section 5] Section 5 (fine-tuning and RL): The post-hoc fine-tuning and RL improvements are presented as achieving 'even higher subjective scores,' but without reporting the base vs. fine-tuned comparison tables, training schedules, or controls for data leakage, it is unclear whether these gains reflect genuine robustness enhancements or simply additional adaptation to the evaluation distribution.

    Authors: We will revise Section 5 to include direct comparison tables between the base Seed-TTS model and the fine-tuned/RL versions on the same evaluation sets. We will detail the fine-tuning schedules, hyperparameters, and the RL reward design. To address data leakage concerns, we will clarify that all fine-tuning and RL stages used disjoint data splits from the evaluation sets, with no overlap in speakers or utterances. This will show that the improvements stem from the proposed self-distillation and RL techniques rather than overfitting to the test distribution. revision: yes
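
The disjoint-split claim in the third response is straightforward to verify mechanically. The sketch below assumes each item is a `(speaker_id, utterance_id)` pair; the function name `split_overlap` and the IDs are hypothetical and only illustrate the kind of check a reader might expect to see reported.

```python
def split_overlap(train_items, eval_items):
    """Return (shared_speakers, shared_utterances) between two splits.

    Each item is a (speaker_id, utterance_id) pair. Two empty sets mean
    the splits are disjoint at both the speaker and the utterance level.
    """
    train_speakers = {speaker for speaker, _ in train_items}
    eval_speakers = {speaker for speaker, _ in eval_items}
    train_utterances = {utterance for _, utterance in train_items}
    eval_utterances = {utterance for _, utterance in eval_items}
    return train_speakers & eval_speakers, train_utterances & eval_utterances


# Hypothetical manifests: speaker "spk2" appears in both splits.
train = [("spk1", "utt001"), ("spk2", "utt002")]
evaluation = [("spk2", "utt103"), ("spk3", "utt104")]
speakers, utterances = split_overlap(train, evaluation)
print("shared speakers:", sorted(speakers))
print("shared utterances:", sorted(utterances))
```

A check like this only catches exact ID matches. Near-duplicate audio or the same speaker under a different ID would need a content-based comparison.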

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces empirical TTS models (autoregressive Seed-TTS and diffusion-based Seed-TTS_DiT) trained on large-scale data, with proposed techniques like self-distillation for factorization and RL for robustness. All load-bearing claims rest on external objective metrics and subjective listener evaluations compared against ground-truth human speech, not on internal derivations that reduce to fitted inputs by construction. No equations, uniqueness theorems, or self-citations are invoked to force results; performance matching is demonstrated via held-out test comparisons rather than tautological renaming or self-referential fitting. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions in neural speech modeling plus many unspecified training choices; no new entities are postulated.

free parameters (2)
  • model scale and hyperparameters
    Large autoregressive and diffusion architectures require numerous tuned parameters whose values are not reported in the abstract.
  • fine-tuning schedule
    Performance gains after fine-tuning depend on unspecified data selection and optimization choices.
axioms (2)
  • domain assumption: Neural networks can accurately model the distribution of natural speech waveforms
    Invoked implicitly by the autoregressive and diffusion architectures.
  • domain assumption: Subjective human ratings and standard objective metrics (e.g., similarity scores) are valid proxies for perceptual quality
    Underpins all reported evaluation claims.
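
For concreteness, a similarity score of the kind the second axiom refers to is often a cosine similarity between two speaker embeddings. The sketch below assumes the embeddings have already been extracted by some speaker verification model. The vectors are made-up toy values, and the function name `cosine_similarity` is illustrative rather than taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional embeddings; real speaker embeddings are much larger.
reference_embedding = [0.2, 0.7, -0.1, 0.4]
synthesized_embedding = [0.25, 0.65, -0.05, 0.35]
score = cosine_similarity(reference_embedding, synthesized_embedding)
print(f"speaker similarity: {score:.3f}")
```

The score depends entirely on the embedding model, so two systems are only comparable when the same extractor and the same reference audio are used, which is part of why the axiom is worth stating.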

pith-pipeline@v0.9.0 · 5723 in / 1266 out tokens · 28066 ms · 2026-05-15T12:22:23.552594+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

    eess.AS 2026-06 unverdicted novelty 8.0

    WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

  2. FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

    cs.SD 2026-06 unverdicted novelty 7.0

    FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates l...

  3. Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis

    cs.SD 2026-06 unverdicted novelty 7.0

    Sarashina2.2-TTS achieves SOTA kanji reading accuracy via data scaling and Joyo-kanji-targeted synthesis, introduces the Joyo Kanji Yomi Benchmark and Kana-CER metric, and shows stable cross-lingual performance.

  4. ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

    cs.SD 2026-06 unverdicted novelty 7.0

    ParaPairAudioBench is a new pairwise benchmark showing LALM judges lag human paralinguistic judgments by 32 percentage points with poor tie calibration across style, rate, emphasis, age, and gender.

  5. AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation

    eess.AS 2026-06 unverdicted novelty 7.0

    AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.

  6. Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

    cs.CL 2026-06 unverdicted novelty 7.0

    Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.

  7. M*: A Modular, Extensible, Serving System for Multimodal Models

    cs.LG 2026-06 unverdicted novelty 7.0

    M* introduces the Walk Graph abstraction to serve arbitrary compositions of multimodal model components and reports latency and throughput gains over vLLM-Omni and other baselines on text-to-image, text-to-speech, and...

  8. PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

    cs.CL 2026-05 unverdicted novelty 7.0

    PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and de...

  9. Native Audio-Visual Alignment for Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    NAVA proposes native audio-visual alignment via Align-then-Fuse MMDiT and Timbre-in-Context Conditioning for joint audio-video generation with improved synchronization and timbre control.

  10. Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues

    cs.SD 2026-05 unverdicted novelty 7.0

    ToxiAlert-Bench dataset and dual-head neural network detect toxic speech by distinguishing textual versus paralinguistic sources, reporting 21.1% Macro-F1 and 13% accuracy gains over baselines.

  11. Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

    eess.AS 2026-05 unverdicted novelty 7.0

    GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

  12. MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

    eess.AS 2026-04 unverdicted novelty 7.0

    MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

  13. From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

    cs.AI 2026-04 unverdicted novelty 7.0

    ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.

  14. X-VC: Zero-shot Streaming Voice Conversion in Codec Space

    eess.AS 2026-04 unverdicted novelty 7.0

    X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.

  15. CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

    cs.SD 2026-04 unverdicted novelty 7.0

    CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

  16. Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    cs.SD 2025-07 unverdicted novelty 7.0

    Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

  17. HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

    eess.AS 2026-06 unverdicted novelty 6.0

    HPRO uses a differentiable HD-Emo codec to extract separate content and style tokens and progressively aligns frame-, word-, and sentence-level rewards to improve emotional expressiveness in TTS while preserving intel...

  18. ProsoCodec: Prosody-Oriented Speech Codec for Voice Conversion

    eess.AS 2026-06 unverdicted novelty 6.0

    ProsoCodec models prosody as a conditional residual in a speech codec via text and speaker prefix conditioning, yielding improved prosody preservation and less timbre leakage in voice conversion experiments.

  19. Bagpiper-Edit: Zero-Shot Open-Ended Audio Editing via Rich-Caption

    cs.SD 2026-06 unverdicted novelty 6.0

    Bagpiper-Edit performs zero-shot open-ended audio editing by translating natural-language instructions into edited rich captions that guide generation anchored to the original audio.

  20. Transcript-Free Flow-Matching Text-to-Speech via Speech Feature Conditioning

    eess.AS 2026-06 unverdicted novelty 6.0

    RTFree-F5 replaces reference transcripts with mapped self-supervised speech representations in F5-TTS, cutting WER on dysarthric speech from 24.6% to 10.4% without any transcript at inference.

  21. TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

    cs.SD 2026-06 unverdicted novelty 6.0

    TLDR groups codec tokens into patches for patch-level autoregressive modeling in pretrained TTS systems, yielding 1.8x speedup and 75% KV-cache reduction at patch size 4.

  22. dots.tts Technical Report

    cs.SD 2026-06 unverdicted novelty 6.0

    dots.tts reports SOTA benchmark results on Seed-TTS-Eval and other tests via continuous latent-space autoregressive modeling with three listed innovations and code release.

  23. HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

    cs.SD 2026-06 unverdicted novelty 6.0

    HybridCodec unifies SSL distillation and dual-stream design in a neural audio codec for improved semantic specialization, competitive reconstruction, and faster inference.

  24. GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech

    cs.SD 2026-06 unverdicted novelty 6.0

    GLASS enables composable acoustic style control in zero-shot TTS by training independent GRPO-optimized LoRA adapters on style rewards that can be linearly combined.

  25. CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

    cs.SD 2026-06 unverdicted novelty 6.0

    CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and ...

  26. Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation

    cs.SD 2026-06 unverdicted novelty 6.0

    Foley-Omni extends isolated audio synthesis to joint generation of full video soundtracks across speech, effects, and music, with a new V2ST-Bench for evaluation showing competitive single-task results and gains in mi...

  27. UniVocal: Unified Speech-Singing Code-Switching Synthesis

    cs.SD 2026-06 unverdicted novelty 6.0

    UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.

  28. LaSR: Context-Aware Speech Recognition via Latent Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    LaSR improves context-aware terminology recognition in speech LLMs by aligning latent CoT supervision on acoustic regions and introducing latent reasoning periods, shown on a new academic corpus to outperform standard...

  29. SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

    eess.AS 2026-05 unverdicted novelty 6.0

    SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.

  30. RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

    cs.SD 2026-05 unverdicted novelty 6.0

    RobustSpeechFlow improves TTS alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations, lowering WER from 1.44 to 1.38 on Seed-TTS-eval and CER on ZERO500.

  31. Taming Audio VAEs via Target-KL Regularization

    cs.SD 2026-05 unverdicted novelty 6.0

    The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.

  32. SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis

    eess.AS 2026-05 unverdicted novelty 6.0

    SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.

  33. MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    MMControl adds multi-modal controls for identity, timbre, pose, and layout to unified audio-video diffusion models via dual-stream injection and adjustable guidance scaling.

  34. Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

    eess.AS 2026-04 unverdicted novelty 6.0

    Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

  35. Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...

  36. OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...

  37. Qwen3-TTS Technical Report

    cs.SD 2026-01 unverdicted novelty 6.0

    Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5...

  38. Two-Dimensional Quantization for Geometry-Aware Audio Coding

    cs.SD 2025-12 unverdicted novelty 6.0

    Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.

  39. Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

    cs.MM 2025-09 unverdicted novelty 6.0

    A single generative model uses twin DiT backbones with blockwise cross-attention and scaled-RoPE timing exchange to synthesize synchronized audio-video directly.

  40. StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

    cs.CL 2025-09 unverdicted novelty 6.0

    StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

  41. Qwen3-Omni Technical Report

    cs.CL 2025-09 unverdicted novelty 6.0

    Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...

  42. Step-Audio 2 Technical Report

    cs.CL 2025-07 unverdicted novelty 6.0

    Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...

  43. ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

    eess.AS 2025-07 conditional novelty 6.0

    ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and repor...

  44. CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

    cs.SD 2025-05 unverdicted novelty 6.0

    CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and i...

  45. Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

    cs.CL 2025-02 unverdicted novelty 6.0

    Step-Audio introduces a 130B-parameter unified speech-text model with open-sourced components for understanding, generation, affordable voice cloning, and dynamic control, claiming SOTA human evaluation results on a n...

  46. A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models

    cs.SD 2026-07 unverdicted novelty 5.0

    SLM modules provide a clean low-dimensional emotion subspace with strong speaker-emotion disentanglement while CFM modules show entanglement and poor generalization for activation steering in hybrid TTS.

  47. Energy-Efficient Multimodal Inference Serving with Tri-serve

    cs.DC 2026-06 unverdicted novelty 5.0

    Tri-serve is a software DVFS controller that jointly mitigates inter-stage dependency stalls, arithmetic-intensity effects on frequency, and thermal throttling to deliver 22% better energy efficiency in multimodal inf...

  48. VoiceTTA: Enhancing Zero-Shot Text-to-Speech via Reinforcement Learning-Based Test-Time Adaptation

    cs.SD 2026-06 unverdicted novelty 5.0

    VoiceTTA applies group relative preference optimization with rewards on F0/energy variation, speaker similarity, and WER to adapt zero-shot TTS models at inference for uncommon styles.

  49. Joint Residual Reweighting for Classifier Free Guidance in Flow-Matching Zero-Shot TTS

    eess.AS 2026-06 unverdicted novelty 5.0

    Introduces joint residual reweighting that decomposes CFG guidance into text, speaker, and joint residuals and reweights the joint term independently to improve speaker similarity while preserving text correctness in ...

  50. Joint Residual Reweighting for Classifier Free Guidance in Flow-Matching Zero-Shot TTS

    eess.AS 2026-06 unverdicted novelty 5.0

    Introduces joint residual reweighting that disentangles speaker and joint residuals in CFG to improve speaker fidelity while preserving text accuracy in zero-shot TTS.

  51. FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech

    eess.AS 2026-06 unverdicted novelty 5.0

    FlowTTS-GRPO applies online RL with weighted multi-objective rewards to flow-matching TTS models via ODE-to-SDE conversion, reporting gains in speaker similarity and perceptual quality on CosyVoice 3.0 and F5-TTS.

  52. Imitation Learning for Elder-Facing Speech Synthesis

    cs.SD 2026-06 unverdicted novelty 5.0

    An imitation learning approach with two-stage on-policy reward learning enhances TTS for elderly listeners and outperforms standard GRPO and supervised baselines.

  53. Zero-VC: Zero-Lookahead Streaming Voice Conversion via Speaker Anonymization

    cs.SD 2026-06 unverdicted novelty 5.0

    Zero-VC applies speaker anonymization as a perturbation to achieve strictly causal zero-lookahead streaming voice conversion by balancing timbre leakage against prosodic utility.

  54. Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

    eess.AS 2026-06 unverdicted novelty 5.0

    MOS models match humans on acoustic degradation but are insensitive to prosodic errors and show a double dissociation on speaker characteristics like mean F0 bias and insensitivity to rate and F0 variability.

  55. End-to-End Training for Discrete Token LLM based TTS System

    cs.SD 2026-06 unverdicted novelty 5.0

    An end-to-end optimization framework jointly trains the speech tokenizer, LLM, FM model, and reward model for discrete-token TTS, reporting new SOTA WER of 0.78% and 1.56% on Seed-TTS-Eval with 0.6B LLM and 0.5B FM.

  56. FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

    eess.AS 2026-06 unverdicted novelty 5.0

    FlashTTS delivers a streaming TTS system using multi-track input processing and X-pred mean flow matching to reach 325 ms latency in two function evaluations while retaining zero-shot voice cloning.

  57. VoxCPM2 Technical Report

    cs.SD 2026-06 unverdicted novelty 5.0

    VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.

  58. UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

    eess.AS 2026-05 unverdicted novelty 5.0

    UNISON introduces a unified latent diffusion framework with layer-wise LLM fusion and channel-mask task encoding for multiple speech and sound generation and editing tasks.

  59. Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech

    eess.AS 2026-05 unverdicted novelty 5.0

    Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.

  60. JaiTTS: A Thai Voice Cloning Model

    cs.CL 2026-04 unverdicted novelty 5.0

    JaiTTS-v1.0 achieves a character error rate of 1.94% on short Thai speech tasks, surpassing human ground truth of 1.98%, matches humans on long tasks, and wins 283 of 400 human pairwise comparisons against commercial models.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 73 Pith papers · 13 internal anchors

  1. [1]

    Streaming voice conversion via intermediate bottleneck features and non-streaming teacher guidance

    Yuanzhe Chen, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang, Qiao Tian, Yuping Wang, and Yuxuan Wang. Streaming voice conversion via intermediate bottleneck features and non-streaming teacher guidance. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE,

  2. [2]

    StreamVoice: Streamable context-aware language modeling for real-time zero-shot voice conversion

    Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Zhuo Chen, Lei Xie, Yuping Wang, and Yuxuan Wang. StreamVoice: Streamable context-aware language modeling for real-time zero-shot voice conversion. arXiv preprint arXiv:2401.11053, 2024a. Zhichao Wang, Yuanzhe Chen, Lei Xie, Qiao Tian, and Yuping Wang. LM-VC: Zero-shot voice conversion via speech generation base...

  3. [3]

    BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100k hours of data

    Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, et al. BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100k hours of data. arXiv preprint arXiv:2402.08093,

  4. [4]

    Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias

    Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, et al. Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias. arXiv preprint arXiv:2306.03509,

  5. [5]

    Deep Reinforcement Learning: An Overview

    Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274,

  6. [6]

    Speak foreign languages with your own voice: Cross-lingual neural codec language modeling

    Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei. Neural codec language models are zero-shot text to speech synthesizers, 2023b. Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et...

  7. [7]

    arXiv preprint arXiv:2212.14518

    Yuan Gao, Nobuyuki Morioka, Yu Zhang, and Nanxin Chen. E3 TTS: Easy end-to-end diffusion-based text to speech. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE, 2023a. Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan, Yang Cui, Ke Wang, Lei He, Sheng Zhao, et al. ResGrad: Residual denoising di...

  8. [8]

    LLaMA: Open and Efficient Foundation Language Models

    In INTERSPEECH, pages 1606–1610, 2022a. Yuanhao Yi, Lei He, Shifeng Pan, Xi Wang, and Yujia Xiao. ProsodySpeech: Towards advanced prosody model for neural text-to-speech. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7582–7586. IEEE, 2022b. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier...

  9. [9]

    Better speech synthesis through scaling, 2023

    James Betker. Better speech synthesis through scaling. arXiv preprint arXiv:2305.07243,

  10. [10]

    BigVGAN: A universal neural vocoder with large-scale training

    Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. BigVGAN: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658,

  11. [11]

    Glow-WaveGAN: Learning speech representations from gan-based variational auto-encoder for high fidelity flow-based speech synthesis

    Jian Cong, Shan Yang, Lei Xie, and Dan Su. Glow-WaveGAN: Learning speech representations from gan-based variational auto-encoder for high fidelity flow-based speech synthesis. arXiv preprint arXiv:2106.10831,

  12. [12]

    Basis-MelGAN: Efficient neural vocoder based on audio decomposition

    Zhengxi Liu and Yanmin Qian. Basis-MelGAN: Efficient neural vocoder based on audio decomposition. arXiv preprint arXiv:2106.13419,

  13. [13]

    Common Voice: A massively-multilingual speech corpus

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670,

  14. [14]

    DiDiSpeech: A large scale Mandarin speech corpus

    Tingwei Guo, Cheng Wen, Dongwei Jiang, Ne Luo, Ruixiong Zhang, Shuaijiang Zhao, Wubo Li, Cheng Gong, Wei Zou, Kun Han, et al. DiDiSpeech: A large scale Mandarin speech corpus. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6968–6972. IEEE,

  15. [15]

    In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 15467–15471

    Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, et al. FunASR: A fundamental end-to-end speech recognition toolkit. arXiv preprint arXiv:2305.11013, 2023b. Zhengyang Chen, Sanyuan Chen, Yu Wu, Yao Qian, Chengyi Wang, Shujie Liu, Yanmin Qian, and Michael Zeng. Large-scale self-super...

  16. [16]

    FastSpeech 2: Fast and high-quality end-to-end text to speech

    Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558,

  17. [17]

    Controllable and lossless non-autoregressive end-to-end text-to-speech

    Zhengxi Liu, Qiao Tian, Chenxu Hu, Xudong Liu, Menglin Wu, Yuping Wang, Hang Zhao, and Yuxuan Wang. Controllable and lossless non-autoregressive end-to-end text-to-speech. arXiv preprint arXiv:2207.06088,

  18. [18]

    LibriSpeech: an ASR corpus based on public domain audio books

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: an ASR corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE,

  19. [19]

    WeNet 2.0: More productive end-to-end speech recognition toolkit

    Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fuping Pan, and Jianwei Niu. WeNet 2.0: More productive end-to-end speech recognition toolkit. arXiv preprint arXiv:2203.15455,

  20. [20]

    Developing far-field speaker system via teacher-student learning

    Jinyu Li, Rui Zhao, Zhuo Chen, Changliang Liu, Xiong Xiao, Guoli Ye, and Yifan Gong. Developing far-field speaker system via teacher-student learning. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5699–5703. IEEE,

  21. [21]

    VoxCeleb: A large-scale speaker identification dataset

    Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: A large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612,

  22. [22]

    Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling

    Yuan-Hao Yi, Yang Ai, Zhen-Hua Ling, and Li-Rong Dai. Singing voice synthesis using deep autoregressive neural networks for acoustic modeling. arXiv preprint arXiv:1906.08977,

  23. [23]

    LiteSing: Towards fast, lightweight and expressive singing voice synthesis

    Xiaobin Zhuang, Tao Jiang, Szu-Yu Chou, Bin Wu, Peng Hu, and Simon Lui. LiteSing: Towards fast, lightweight and expressive singing voice synthesis. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7078–7082. IEEE,

  24. [24]

    Prosody-aware SpeechT5 for expressive neural TTS

    Yan Deng, Long Zhou, Yuanhao Yi, Shujie Liu, and Lei He. Prosody-aware SpeechT5 for expressive neural TTS. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE,

  25. [25]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245,

  26. [26]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378,

  27. [27]

    AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978,

  28. [28]

    Consistency Models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469,

  29. [29]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206,

  30. [30]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691,

  31. [31]

    A White Paper on Neural Network Quantization

    Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295,

  32. [32]

    decoupleq: Towards 2-bit post-training uniform quantization via decoupling parameters into integer and floating points

    Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, and Shouda Liu. decoupleq: Towards 2-bit post-training uniform quantization via decoupling parameters into integer and floating points. arXiv preprint arXiv:2404.12759,

  33. [33]

    VoiceShop: A unified speech-to-speech framework for identity-preserving zero-shot voice editing

    Philip Anastassiou, Zhenyu Tang, Kainan Peng, Dongya Jia, Jiaxin Li, Ming Tu, Yuping Wang, Yuxuan Wang, and Mingbo Ma. VoiceShop: A unified speech-to-speech framework for identity-preserving zero-shot voice editing. arXiv preprint arXiv:2404.06674,

  34. [34]

    HierSpeech++: Bridging the gap between semantic and acoustic representation of speech by hierarchical variational inference for zero-shot speech synthesis

    Sang-Hoon Lee, Ha-Yeong Choi, Seung-Bin Kim, and Seong-Whan Lee. HierSpeech++: Bridging the gap between semantic and acoustic representation of speech by hierarchical variational inference for zero-shot speech synthesis. arXiv preprint arXiv:2311.12454,

  35. [35]

    Zero-shot accent conversion using pseudo siamese disentanglement network

    Dongya Jia, Qiao Tian, Kainan Peng, Jiaxin Li, Yuanzhe Chen, Mingbo Ma, Yuping Wang, and Yuxuan Wang. Zero-shot accent conversion using pseudo siamese disentanglement network. arXiv preprint arXiv:2212.05751,

  36. [36]

    Diffusion-based voice conversion with fast maximum likelihood sampling scheme

    Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov, and Jiansheng Wei. Diffusion-based voice conversion with fast maximum likelihood sampling scheme. arXiv preprint arXiv:2109.13821,

  37. [37]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  38. [38]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. arXiv preprint arXiv:2311.12908,

  39. [39]

    MusicRL: Aligning music generation to human preferences

    Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej Kastelic, Zalán Borsos, Brian McWilliams, Victor Ungureanu, Olivier Bachem, Olivier Pietquin, et al. MusicRL: Aligning music generation to human preferences. arXiv preprint arXiv:2402.04229,

  40. [40]

    SpeechAlign: Aligning speech generation to human preferences

    Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. SpeechAlign: Aligning speech generation to human preferences. arXiv preprint arXiv:2404.05600,

  41. [41]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting REINFORCE style optimization for learning from human feedback in LLMs. arXiv preprint arXiv:2402.14740,

  42. [42]

    Minimum word error rate training for attention-based sequence-to-sequence models

    Rohit Prabhavalkar, Tara N Sainath, Yonghui Wu, Patrick Nguyen, Zhifeng Chen, Chung-Cheng Chiu, and Anjuli Kannan. Minimum word error rate training for attention-based sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4839–4843. IEEE,

  43. [43]

    Transforming and combining rewards for aligning large language models

    Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D’Amour, Sanmi Koyejo, and Victor Veitch. Transforming and combining rewards for aligning large language models. arXiv preprint arXiv:2402.00742, 2024b. Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint ...

  44. [44]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion - tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737,

  45. [45]

    SpeechX: Neural codec language model as a versatile speech transformer

    Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, and Takuya Yoshioka. SpeechX: Neural codec language model as a versatile speech transformer. arXiv preprint arXiv:2308.06873, 2023c. OpenAI. Navigating the challenges and opportunities of synthetic voices. https://openai.com/index/nav...