Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Pith reviewed 2026-05-15 12:22 UTC · model grok-4.3
The pith
Seed-TTS generates speech that matches human recordings in speaker similarity and naturalness according to objective metrics and listener tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Seed-TTS achieves performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations, serving as a foundation model for speech generation with superior controllability over speech attributes such as emotion and the ability to produce highly expressive and diverse speech for speakers in the wild.
What carries the argument
Large-scale autoregressive text-to-speech model enhanced by self-distillation for speech factorization and reinforcement learning for robustness, paired with a fully diffusion-based non-autoregressive architecture that performs end-to-end speech generation without pre-estimated durations.
If this is right
- Fine-tuning produces even higher subjective scores in naturalness and speaker similarity.
- The models support effective in-context learning for speakers outside the training set.
- Seed-TTS_DiT enables speech editing through its end-to-end diffusion process.
- Reinforcement learning improves robustness and controllability over emotional expression.
Where Pith is reading between the lines
- Widespread use could replace recorded human voices in media and virtual agents if performance holds outside controlled test conditions.
- The factorization and reinforcement learning steps might transfer to other audio generation tasks such as music or sound effects.
- Real-time deployment could support dynamic, personalized voice output in interactive systems without per-speaker retraining.
Load-bearing premise
That subjective listener evaluations and the chosen objective metrics reliably indicate real-world indistinguishability from human speech, and that the models generalize to unseen speakers and conditions without overfitting.
What would settle it
A blind listening test with many participants across varied real-world conditions and unseen speakers where listeners cannot distinguish Seed-TTS outputs from actual human recordings at rates above chance.
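One way to make "rates above chance" concrete is an exact one-sided binomial test on forced-choice trials. The sketch below is illustrative only: the trial counts, the 0.5 chance level (a two-alternative forced choice), and the function name are assumptions, not details taken from the paper.

```python
from math import comb


def binomial_p_above_chance(correct: int, trials: int, chance: float = 0.5) -> float:
    """Exact one-sided p-value: P(X >= correct) for X ~ Binomial(trials, chance)."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must lie between 0 and trials")
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical counts, for illustration only.
p_near_chance = binomial_p_above_chance(correct=52, trials=100)
p_detectable = binomial_p_above_chance(correct=70, trials=100)
print(f"52/100 correct: p = {p_near_chance:.3f}")
print(f"70/100 correct: p = {p_detectable:.6f}")
```

A non-significant result on its own does not establish indistinguishability; the test also needs enough trials to detect a small listener advantage if one exists.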
Original abstract
We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named $\text{Seed-TTS}_\text{DiT}$, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, $\text{Seed-TTS}_\text{DiT}$ does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at \url{https://bytedancespeech.github.io/seedtts_tech_report}.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Seed-TTS, a family of large-scale autoregressive TTS models (with a diffusion-based NAR variant Seed-TTS_DiT) that generate speech claimed to be virtually indistinguishable from human speech. It reports matching ground-truth performance in speaker similarity and naturalness via objective and subjective evaluations, strong in-context learning, controllability over attributes like emotion, and further gains from fine-tuning, self-distillation for factorization, and RL for robustness.
Significance. If the central performance claims hold under rigorous scrutiny, the work would constitute a meaningful contribution to speech generation by providing versatile foundation models with high fidelity, expressiveness for in-the-wild speakers, and end-to-end NAR processing without pre-estimated durations. The combination of AR and DiT architectures plus RL enhancements offers practical value, though the current lack of evaluation transparency limits immediate assessment of its standing relative to prior TTS systems.
major comments (3)
- [Abstract] Abstract: The claim that Seed-TTS 'matches ground truth human speech in both objective and subjective evaluations' and produces output that is 'virtually indistinguishable' is load-bearing for the central contribution, yet the manuscript provides no details on the subjective protocol (forced-choice discrimination vs. scalar MOS/ABX ratings, number of listeners and utterances, presentation of ground-truth references, or strict held-out test speakers/conditions). Scalar ratings alone can approach ceiling values without proving indistinguishability.
- [Section 4] Section 4 (Experiments) and abstract: No information is given on training data scale, exact objective metrics (e.g., specific speaker similarity measures or their computation), chosen baselines, or statistical significance (error bars, p-values). This absence prevents evaluation of whether reported gains are robust or could be explained by data scale or overfitting.
- [Section 5] Section 5 (fine-tuning and RL): The post-hoc fine-tuning and RL improvements are presented as achieving 'even higher subjective scores,' but without reporting the base vs. fine-tuned comparison tables, training schedules, or controls for data leakage, it is unclear whether these gains reflect genuine robustness enhancements or simply additional adaptation to the evaluation distribution.
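The error bars requested in the second comment could be produced in several ways. As a minimal sketch, assuming per-utterance scores are available for both systems on the same test items, a paired percentile bootstrap gives a confidence interval on the mean difference. The scores, the function name, and the resampling settings below are hypothetical and not taken from the paper.

```python
import random
from statistics import mean


def bootstrap_mean_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b) over paired items."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Resample the paired differences with replacement and sort the means.
    estimates = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi


# Hypothetical per-utterance scores, not data from the paper.
synth = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.4, 4.1]
human = [4.0, 4.0, 4.2, 4.1, 4.1, 3.9, 4.3, 4.2]
point, low, high = bootstrap_mean_diff_ci(synth, human)
print(f"mean difference = {point:+.3f}, 95% CI = [{low:+.3f}, {high:+.3f}]")
```

Resampling the per-item differences rather than the two score lists independently keeps the pairing between a synthesized utterance and its human reference.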
minor comments (2)
- [Abstract] The manuscript would benefit from explicit cross-references in the text to specific demo audio examples that illustrate the controllability and editing claims.
- [Section 3] Notation for the DiT variant (Seed-TTS_DiT) is introduced without a dedicated equation or diagram clarifying how the diffusion process replaces autoregressive token prediction while remaining end-to-end.
Simulated Author's Rebuttal
We appreciate the referee's detailed feedback on our manuscript. We will revise the paper to address the concerns regarding evaluation transparency and provide more details on the experimental setup. Below we respond to each major comment.
-
Referee: [Abstract] Abstract: The claim that Seed-TTS 'matches ground truth human speech in both objective and subjective evaluations' and produces output that is 'virtually indistinguishable' is load-bearing for the central contribution, yet the manuscript provides no details on the subjective protocol (forced-choice discrimination vs. scalar MOS/ABX ratings, number of listeners and utterances, presentation of ground-truth references, or strict held-out test speakers/conditions). Scalar ratings alone can approach ceiling values without proving indistinguishability.
Authors: We agree that additional details on the subjective evaluation protocol are essential to support the claims. In the revised version, we will expand the abstract and add a dedicated subsection in Section 4 describing the listening test methodology. This will include: the use of ABX or MOS ratings, number of participants (e.g., 20+ native speakers), number of utterances per condition, how ground-truth references were presented, and confirmation that evaluations used held-out speakers and in-the-wild conditions not seen during training. We believe this will demonstrate the indistinguishability more rigorously.
revision: yes
-
Referee: [Section 4] Section 4 (Experiments) and abstract: No information is given on training data scale, exact objective metrics (e.g., specific speaker similarity measures or their computation), chosen baselines, or statistical significance (error bars, p-values). This absence prevents evaluation of whether reported gains are robust or could be explained by data scale or overfitting.
Authors: We acknowledge this gap in the current draft. The revised manuscript will include: (1) details on the training data scale, such as the total hours of speech data used (noting it is on the order of tens of thousands of hours from diverse sources); (2) exact definitions and computation methods for objective metrics, e.g., speaker similarity via cosine similarity on embeddings from a pre-trained speaker verification model like ECAPA-TDNN; (3) a full list of baselines compared against, including recent TTS systems; and (4) statistical analysis with error bars from multiple runs or bootstrap methods and p-values for key comparisons. This will allow readers to assess the robustness of the results.
revision: yes
-
Referee: [Section 5] Section 5 (fine-tuning and RL): The post-hoc fine-tuning and RL improvements are presented as achieving 'even higher subjective scores,' but without reporting the base vs. fine-tuned comparison tables, training schedules, or controls for data leakage, it is unclear whether these gains reflect genuine robustness enhancements or simply additional adaptation to the evaluation distribution.
Authors: We will revise Section 5 to include direct comparison tables between the base Seed-TTS model and the fine-tuned/RL versions on the same evaluation sets. We will detail the fine-tuning schedules, hyperparameters, and the RL reward design. To address data leakage concerns, we will clarify that all fine-tuning and RL stages used disjoint data splits from the evaluation sets, with no overlap in speakers or utterances. This will show that the improvements stem from the proposed self-distillation and RL techniques rather than overfitting to the test distribution.
revision: yes
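Two of the checks named in these responses are simple enough to sketch: a cosine-based speaker similarity score on embeddings, and a test that training and evaluation speakers do not overlap. The code below is a minimal illustration under stated assumptions. The toy vectors stand in for embeddings that a real pipeline would take from a pretrained speaker-verification model, and the function names and speaker IDs are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def overlapping_speakers(train_speakers, eval_speakers):
    """Return speaker IDs present in both splits (an empty set means disjoint)."""
    return set(train_speakers) & set(eval_speakers)


# Toy 3-dimensional "embeddings"; extracting real ones is outside this sketch.
reference_emb = [0.2, 0.9, 0.1]
generated_emb = [0.25, 0.85, 0.05]
print(f"speaker similarity = {cosine_similarity(reference_emb, generated_emb):.3f}")

# Hypothetical speaker IDs.
leaked = overlapping_speakers(["spk01", "spk02", "spk03"], ["spk03", "spk10"])
print(f"speakers in both splits: {sorted(leaked)}")
```

A non-empty overlap set would indicate that the evaluation is not strictly held-out at the speaker level.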
Circularity Check
No significant circularity in derivation chain
The paper introduces empirical TTS models (autoregressive Seed-TTS and diffusion-based Seed-TTS_DiT) trained on large-scale data, with proposed techniques like self-distillation for factorization and RL for robustness. All load-bearing claims rest on external objective metrics and subjective listener evaluations compared against ground-truth human speech, not on internal derivations that reduce to fitted inputs by construction. No equations, uniqueness theorems, or self-citations are invoked to force results; performance matching is demonstrated via held-out test comparisons rather than tautological renaming or self-referential fitting. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- model scale and hyperparameters
- fine-tuning schedule
axioms (2)
- domain assumption: Neural networks can accurately model the distribution of natural speech waveforms
- domain assumption: Subjective human ratings and standard objective metrics (e.g., similarity scores) are valid proxies for perceptual quality
Forward citations
Cited by 60 Pith papers
-
WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling
WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
-
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates l...
-
Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis
Sarashina2.2-TTS achieves SOTA kanji reading accuracy via data scaling and Joyo-kanji-targeted synthesis, introduces the Joyo Kanji Yomi Benchmark and Kana-CER metric, and shows stable cross-lingual performance.
-
ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge
ParaPairAudioBench is a new pairwise benchmark showing LALM judges lag human paralinguistic judgments by 32 percentage points with poor tie calibration across style, rate, emphasis, age, and gender.
-
AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation
AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.
-
Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis
Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.
-
M*: A Modular, Extensible, Serving System for Multimodal Models
M* introduces the Walk Graph abstraction to serve arbitrary compositions of multimodal model components and reports latency and throughput gains over vLLM-Omni and other baselines on text-to-image, text-to-speech, and...
-
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and de...
-
Native Audio-Visual Alignment for Generation
NAVA proposes native audio-visual alignment via Align-then-Fuse MMDiT and Timbre-in-Context Conditioning for joint audio-video generation with improved synchronization and timbre control.
-
Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues
ToxiAlert-Bench dataset and dual-head neural network detect toxic speech by distinguishing textual versus paralinguistic sources, reporting 21.1% Macro-F1 and 13% accuracy gains over baselines.
-
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
-
MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.
-
From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
-
X-VC: Zero-shot Streaming Voice Conversion in Codec Space
X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.
-
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
-
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
-
HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech
HPRO uses a differentiable HD-Emo codec to extract separate content and style tokens and progressively aligns frame-, word-, and sentence-level rewards to improve emotional expressiveness in TTS while preserving intel...
-
ProsoCodec: Prosody-Oriented Speech Codec for Voice Conversion
ProsoCodec models prosody as a conditional residual in a speech codec via text and speaker prefix conditioning, yielding improved prosody preservation and less timbre leakage in voice conversion experiments.
-
Bagpiper-Edit: Zero-Shot Open-Ended Audio Editing via Rich-Caption
Bagpiper-Edit performs zero-shot open-ended audio editing by translating natural-language instructions into edited rich captions that guide generation anchored to the original audio.
-
Transcript-Free Flow-Matching Text-to-Speech via Speech Feature Conditioning
RTFree-F5 replaces reference transcripts with mapped self-supervised speech representations in F5-TTS, cutting WER on dysarthric speech from 24.6% to 10.4% without any transcript at inference.
-
TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech
TLDR groups codec tokens into patches for patch-level autoregressive modeling in pretrained TTS systems, yielding 1.8x speedup and 75% KV-cache reduction at patch size 4.
-
dots.tts Technical Report
dots.tts reports SOTA benchmark results on Seed-TTS-Eval and other tests via continuous latent-space autoregressive modeling with three listed innovations and code release.
-
HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec
HybridCodec unifies SSL distillation and dual-stream design in a neural audio codec for improved semantic specialization, competitive reconstruction, and faster inference.
-
GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech
GLASS enables composable acoustic style control in zero-shot TTS by training independent GRPO-optimized LoRA adapters on style rewards that can be linearly combined.
-
CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding
CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and ...
-
Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation
Foley-Omni extends isolated audio synthesis to joint generation of full video soundtracks across speech, effects, and music, with a new V2ST-Bench for evaluation showing competitive single-task results and gains in mi...
-
UniVocal: Unified Speech-Singing Code-Switching Synthesis
UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.
-
LaSR: Context-Aware Speech Recognition via Latent Reasoning
LaSR improves context-aware terminology recognition in speech LLMs by aligning latent CoT supervision on acoustic regions and introducing latent reasoning periods, shown on a new academic corpus to outperform standard...
-
SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue
SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.
-
RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching
RobustSpeechFlow improves TTS alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations, lowering WER from 1.44 to 1.38 on Seed-TTS-eval and CER on ZERO500.
-
Taming Audio VAEs via Target-KL Regularization
The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.
-
SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis
SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
-
MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation
MMControl adds multi-modal controls for identity, timbre, pose, and layout to unified audio-video diffusion models via dual-stream injection and adjustable guidance scaling.
-
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
-
Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...
-
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...
-
Qwen3-TTS Technical Report
Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5...
-
Two-Dimensional Quantization for Geometry-Aware Audio Coding
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
-
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
A single generative model uses twin DiT backbones with blockwise cross-attention and scaled-RoPE timing exchange to synthesize synchronized audio-video directly.
-
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
-
Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
-
Step-Audio 2 Technical Report
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...
-
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and repor...
-
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and i...
-
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Step-Audio introduces a 130B-parameter unified speech-text model with open-sourced components for understanding, generation, affordable voice cloning, and dynamic control, claiming SOTA human evaluation results on a n...
-
A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models
SLM modules provide a clean low-dimensional emotion subspace with strong speaker-emotion disentanglement while CFM modules show entanglement and poor generalization for activation steering in hybrid TTS.
-
Energy-Efficient Multimodal Inference Serving with Tri-serve
Tri-serve is a software DVFS controller that jointly mitigates inter-stage dependency stalls, arithmetic-intensity effects on frequency, and thermal throttling to deliver 22% better energy efficiency in multimodal inf...
-
VoiceTTA: Enhancing Zero-Shot Text-to-Speech via Reinforcement Learning-Based Test-Time Adaptation
VoiceTTA applies group relative preference optimization with rewards on F0/energy variation, speaker similarity, and WER to adapt zero-shot TTS models at inference for uncommon styles.
-
Joint Residual Reweighting for Classifier Free Guidance in Flow-Matching Zero-Shot TTS
Introduces joint residual reweighting that decomposes CFG guidance into text, speaker, and joint residuals and reweights the joint term independently to improve speaker similarity while preserving text correctness in ...
-
Joint Residual Reweighting for Classifier Free Guidance in Flow-Matching Zero-Shot TTS
Introduces joint residual reweighting that disentangles speaker and joint residuals in CFG to improve speaker fidelity while preserving text accuracy in zero-shot TTS.
-
FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech
FlowTTS-GRPO applies online RL with weighted multi-objective rewards to flow-matching TTS models via ODE-to-SDE conversion, reporting gains in speaker similarity and perceptual quality on CosyVoice 3.0 and F5-TTS.
-
Imitation Learning for Elder-Facing Speech Synthesis
An imitation learning approach with two-stage on-policy reward learning enhances TTS for elderly listeners and outperforms standard GRPO and supervised baselines.
-
Zero-VC: Zero-Lookahead Streaming Voice Conversion via Speaker Anonymization
Zero-VC applies speaker anonymization as a perturbation to achieve strictly causal zero-lookahead streaming voice conversion by balancing timbre leakage against prosodic utility.
-
Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations
MOS models match humans on acoustic degradation but are insensitive to prosodic errors and show a double dissociation on speaker characteristics like mean F0 bias and insensitivity to rate and F0 variability.
-
End-to-End Training for Discrete Token LLM based TTS System
An end-to-end optimization framework jointly trains the speech tokenizer, LLM, FM model, and reward model for discrete-token TTS, reporting new SOTA WER of 0.78% and 1.56% on Seed-TTS-Eval with 0.6B LLM and 0.5B FM.
-
FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation
FlashTTS delivers a streaming TTS system using multi-track input processing and X-pred mean flow matching to reach 325 ms latency in two function evaluations while retaining zero-shot voice cloning.
-
VoxCPM2 Technical Report
VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.
-
UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion
UNISON introduces a unified latent diffusion framework with layer-wise LLM fusion and channel-mask task encoding for multiple speech and sound generation and editing tasks.
-
Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.
-
JaiTTS: A Thai Voice Cloning Model
JaiTTS-v1.0 achieves a character error rate of 1.94% on short Thai speech tasks, surpassing human ground truth of 1.98%, matches humans on long tasks, and wins 283 of 400 human pairwise comparisons against commercial models.
Reference graph
Works this paper leans on
-
[1]
Streaming voice conversion via intermediate bottleneck features and non-streaming teacher guidance
Yuanzhe Chen, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang, Qiao Tian, Yuping Wang, and Yuxuan Wang. Streaming voice conversion via intermediate bottleneck features and non-streaming teacher guidance. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE,
work page 2023
-
[2]
StreamV oice: Streamable context-aware language modeling for real-time zero-shot voice conversion
Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Zhuo Chen, Lei Xie, Yuping Wang, and Yuxuan Wang. StreamV oice: Streamable context-aware language modeling for real-time zero-shot voice conversion. arXiv preprint arXiv:2401.11053, 2024a. Zhichao Wang, Yuanzhe Chen, Lei Xie, Qiao Tian, and Yuping Wang. LM-VC: Zero-shot voice conversion via speech generation base...
-
[3]
arXiv preprint arXiv:2402.08093 , year=
Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, et al. BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100k hours of data. arXiv preprint arXiv:2402.08093,
-
[4]
Mega-tts: Zero-shot text-to-speech at scale with intrinsic inductive bias,
Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, et al. Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias. arXiv preprint arXiv:2306.03509,
-
[5]
Deep Reinforcement Learning: An Overview
Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei. Neural codec language models are zero-shot text to speech synthesizers, 2023b. Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et...
-
[7]
arXiv preprint arXiv:2212.14518 , year=
Yuan Gao, Nobuyuki Morioka, Yu Zhang, and Nanxin Chen. E3 TTS: Easy end-to-end diffusion- based text to speech. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE, 2023a. Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan, Yang Cui, Ke Wang, Lei He, Sheng Zhao, et al. ResGrad: Residual denoising di...
-
[8]
LLaMA: Open and Efficient Foundation Language Models
In INTERSPEECH, pages 1606–1610, 2022a. Yuanhao Yi, Lei He, Shifeng Pan, Xi Wang, and Yujia Xiao. ProsodySpeech: Towards advanced prosody model for neural text-to-speech. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7582–7586. IEEE, 2022b. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[9]
Better speech synthesis through scaling, 2023
James Betker. Better speech synthesis through scaling. arXiv preprint arXiv:2305.07243,
-
[10]
Bigvgan: A universal neural vocoder with large-scale training,
Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. BigVGAN: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658,
-
[11]
Jian Cong, Shan Yang, Lei Xie, and Dan Su. Glow-WaveGAN: Learning speech representations from gan-based variational auto-encoder for high fidelity flow-based speech synthesis. arXiv preprint arXiv:2106.10831,
-
[12]
Basis-MelGAN: Efficient neural vocoder based on audio decomposition
Zhengxi Liu and Yanmin Qian. Basis-MelGAN: Efficient neural vocoder based on audio decomposition. arXiv preprint arXiv:2106.13419,
-
[13]
Common voice: A massively-multilingual speech corpus,
Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670,
-
[14]
DiDiSpeech: A large scale mandarin speech corpus
Tingwei Guo, Cheng Wen, Dongwei Jiang, Ne Luo, Ruixiong Zhang, Shuaijiang Zhao, Wubo Li, Cheng Gong, Wei Zou, Kun Han, et al. DiDiSpeech: A large scale mandarin speech corpus. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6968–6972. IEEE, 2021.
-
[15]
Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, et al. FunASR: A fundamental end-to-end speech recognition toolkit. arXiv preprint arXiv:2305.11013, 2023b. Zhengyang Chen, Sanyuan Chen, Yu Wu, Yao Qian, Chengyi Wang, Shujie Liu, Yanmin Qian, and Michael Zeng. Large-scale self-super...
-
[16]
Fastspeech 2: Fast and high-quality end-to-end text to speech,
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558,
-
[17]
Controllable and lossless non-autoregressive end-to-end text-to-speech
Zhengxi Liu, Qiao Tian, Chenxu Hu, Xudong Liu, Menglin Wu, Yuping Wang, Hang Zhao, and Yuxuan Wang. Controllable and lossless non-autoregressive end-to-end text-to-speech. arXiv preprint arXiv:2207.06088,
-
[18]
LibriSpeech: an ASR corpus based on public domain audio books
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
-
[19]
WeNet 2.0: More productive end-to-end speech recognition toolkit
Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fuping Pan, and Jianwei Niu. WeNet 2.0: More productive end-to-end speech recognition toolkit. arXiv preprint arXiv:2203.15455,
-
[20]
Developing far-field speaker system via teacher-student learning
Jinyu Li, Rui Zhao, Zhuo Chen, Changliang Liu, Xiong Xiao, Guoli Ye, and Yifan Gong. Developing far-field speaker system via teacher-student learning. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5699–5703. IEEE, 2018.
-
[21]
VoxCeleb: A large-scale speaker identification dataset
Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: A large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612,
-
[22]
Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling
Yuan-Hao Yi, Yang Ai, Zhen-Hua Ling, and Li-Rong Dai. Singing voice synthesis using deep autoregressive neural networks for acoustic modeling. arXiv preprint arXiv:1906.08977,
-
[23]
LiteSing: Towards fast, lightweight and expressive singing voice synthesis
Xiaobin Zhuang, Tao Jiang, Szu-Yu Chou, Bin Wu, Peng Hu, and Simon Lui. LiteSing: Towards fast, lightweight and expressive singing voice synthesis. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7078–7082. IEEE, 2021.
-
[24]
Prosody-aware SpeechT5 for expressive neural TTS
Yan Deng, Long Zhou, Yuanhao Yi, Shujie Liu, and Lei He. Prosody-aware SpeechT5 for expressive neural TTS. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
-
[25]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245,
-
[26]
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378,
-
[27]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978,
-
[28]
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469,
-
[29]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206,
-
[30]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691,
-
[31]
A White Paper on Neural Network Quantization
Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295,
-
[32]
Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, and Shouda Liu. decoupleq: Towards 2-bit post-training uniform quantization via decoupling parameters into integer and floating points. arXiv preprint arXiv:2404.12759,
-
[33]
Philip Anastassiou, Zhenyu Tang, Kainan Peng, Dongya Jia, Jiaxin Li, Ming Tu, Yuping Wang, Yuxuan Wang, and Mingbo Ma. VoiceShop: A unified speech-to-speech framework for identity-preserving zero-shot voice editing. arXiv preprint arXiv:2404.06674,
-
[34]
Sang-Hoon Lee, Ha-Yeong Choi, Seung-Bin Kim, and Seong-Whan Lee. HierSpeech++: Bridging the gap between semantic and acoustic representation of speech by hierarchical variational inference for zero-shot speech synthesis. arXiv preprint arXiv:2311.12454,
-
[35]
Zero-shot accent conversion using pseudo siamese disentanglement network
Dongya Jia, Qiao Tian, Kainan Peng, Jiaxin Li, Yuanzhe Chen, Mingbo Ma, Yuping Wang, and Yuxuan Wang. Zero-shot accent conversion using pseudo siamese disentanglement network. arXiv preprint arXiv:2212.05751,
-
[36]
Diffusion-based voice conversion with fast maximum likelihood sampling scheme
Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov, and Jiansheng Wei. Diffusion-based voice conversion with fast maximum likelihood sampling scheme. arXiv preprint arXiv:2109.13821,
-
[37]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
-
[38]
Diffusion model alignment using direct preference optimization
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. arXiv preprint arXiv:2311.12908,
-
[39]
MusicRL: Aligning music generation to human preferences
Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej Kastelic, Zalán Borsos, Brian McWilliams, Victor Ungureanu, Olivier Bachem, Olivier Pietquin, et al. MusicRL: Aligning music generation to human preferences. arXiv preprint arXiv:2402.04229,
-
[40]
SpeechAlign: Aligning speech generation to human preferences
Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. SpeechAlign: Aligning speech generation to human preferences. arXiv preprint arXiv:2404.05600,
-
[41]
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting REINFORCE style optimization for learning from human feedback in LLMs. arXiv preprint arXiv:2402.14740,
-
[42]
Minimum word error rate training for attention-based sequence-to-sequence models
Rohit Prabhavalkar, Tara N Sainath, Yonghui Wu, Patrick Nguyen, Zhifeng Chen, Chung-Cheng Chiu, and Anjuli Kannan. Minimum word error rate training for attention-based sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4839–4843. IEEE, 2018.
-
[43]
Transforming and combining rewards for aligning large language models
Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D’Amour, Sanmi Koyejo, and Victor Veitch. Transforming and combining rewards for aligning large language models. arXiv preprint arXiv:2402.00742, 2024b. Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint ...
-
[44]
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion - tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737,
-
[45]
SpeechX: Neural codec language model as a versatile speech transformer
Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, and Takuya Yoshioka. SpeechX: Neural codec language model as a versatile speech transformer. arXiv preprint arXiv:2308.06873, 2023c. OpenAI. Navigating the challenges and opportunities of synthetic voices. https://openai.com/index/nav...