A scalable pipeline for enabling non-verbal speech generation and understanding

Runchuan Ye, Yixuan Zhou, Renjie Yu, Zijian Lin, Kehan Li, Xiang Li, Xin Liu, Guoyang Zeng, Zhiyong Wu · 2025 · arXiv 2508.05385

4 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background: 1 · dataset: 1

citation-polarity summary

background: 1 · use dataset: 1

representative citing papers

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

cs.SD · 2026-04-17 · unverdicted · novelty 7.0

NVBench provides a standardized bilingual benchmark and evaluation protocol for assessing non-verbal vocalization generation, placement, and salience in text-to-speech systems.

Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model

eess.AS · 2026-05-12 · unverdicted · novelty 6.0

A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

cs.CL · 2026-04-01 · unverdicted · novelty 6.0

OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on multilingual benchmarks using 581k hours of open data.

Beyond Words: Towards Effective Modeling of Non-Verbal Vocalizations in ASR

eess.AS · 2026-07-02 · unverdicted · novelty 4.0

Three data-centric strategies are studied to improve rare non-verbal vocalization recognition in ASR while preserving lexical accuracy.

citing papers explorer

Showing 4 of 4 citing papers.

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations
cs.SD · 2026-04-17 · unverdicted · none · ref 12

Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
eess.AS · 2026-05-12 · unverdicted · none · ref 13

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
cs.CL · 2026-04-01 · unverdicted · none · ref 45

Beyond Words: Towards Effective Modeling of Non-Verbal Vocalizations in ASR
eess.AS · 2026-07-02 · unverdicted · none · ref 14
