cs.SD

Sound

Covers all aspects of computing with sound, and sound as an information channel. Includes models of sound, analysis and synthesis, audio user interfaces, sonification of data, computer music, and sound signal processing. Includes ACM Subject Class H.5.5, and intersects with H.1.2, H.5.1, H.5.2, I.2.7, I.5.4, I.6.3, J.5, K.4.2.

eess.AS 2026-05-14 Recognition

Benchmark standardizes early Parkinson's speech detection

by Terry Yi Zhong, Cristian Tejedor-Garcia +4 more

A Benchmark for Early-stage Parkinson's Disease Detection from Speech

Speaker-independent splits on accessible datasets enable fair, replicable comparisons across tasks and training settings.

Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we propose the first benchmark for speech-based EarlyPD detection, with a speaker-independent split designed for fair and replicable cross-method evaluation on researcher-accessible datasets. The benchmark covers three common speech tasks and evaluates methods under different training-resource settings. We also present multi-dimensional evaluation breakdowns by dataset, aggregation level, gender, and disease stage to support fine-grained comparisons and clinical adoption. Our results provide a replicable reference and actionable insights, encouraging the adoption of this publicly available benchmark to advance robust and clinically meaningful EarlyPD detection from speech.
cs.SD 2026-05-14

No voice agent tops both accuracy and experience scores

by Tara Bogavelli, Gabrielle Gauthier Melançon +11 more

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Tests across 12 systems show trade-offs, large reliability gaps, and drops from accents or noise in simulated conversations.

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to all major agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k--pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean $\Delta$ up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
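The abstract does not spell out how pass@k and pass^k are computed. A minimal sketch, assuming the commonly used combinatorial estimators (pass@k: at least one of k sampled trials succeeds; pass^k: all k sampled trials succeed), with `n` trials per scenario of which `c` succeeded. The function names are illustrative and not taken from EVA-Bench.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials, drawn without replacement
    from n recorded trials with c successes, is a success."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("need 0 < k <= n and 0 <= c <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k trials, drawn without replacement from n recorded
    trials with c successes, are successes (assumed reading of pass^k)."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("need 0 < k <= n and 0 <= c <= n")
    return comb(c, k) / comb(n, k)


def peak_reliable_gap(n: int, c: int, k: int) -> float:
    """Difference between peak (pass@k) and reliable (pass^k) capability."""
    return pass_at_k(n, c, k) - pass_hat_k(n, c, k)
```

Under these assumptions a system that succeeds on half of its trials can still show a large gap, which is the kind of peak-versus-reliable divergence the abstract reports.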
cs.SD 2026-05-06

PHALAR model lifts stem retrieval accuracy by up to 70 percent with half the parameters

by Davide Marincione, Michele Mancusi +5 more

PHALAR: Phasors for Learned Musical Audio Representations

Contrastive framework adds pitch and phase equivariance via spectral pooling and complex head, trains seven times faster, and matches human

Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to $\approx 70\%$ over the state-of-the-art while requiring $<50\%$ of the parameters and a 7$\times$ training speedup. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes new retrieval state-of-the-art across MoisesDB, Slakh, and ChocoChorales, correlating significantly higher with human coherence judgment than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.
cs.CL 2026-07-03

Narration acoustics predict audiobook appeal beyond title

by Shahar Elisha, Mariano Beguerisse-Díaz +1 more

Audio-Based Understanding of Audiobook Narration Appeal

Vocal features extracted from recordings remain tied to view-rate and engagement after title controls are applied.

Narration is central to the audiobook listening experience, shaping how listeners engage with and understand the content. This work explores how narration qualities shape an audiobook's appeal, noting that their effects can vary by genre, title, and audience. We extract vocal and acoustic features (e.g., tone, pace, loudness) from LibriVox using pre-trained audio models and analyse their relationship with consumption data (specifically, view-rate) and their interplay with genre and title. Despite limited consumption data, we find that acoustic information alone has a robust association with appeal, even after accounting for title effects. We further validate these findings using more nuanced proprietary engagement metrics. To our knowledge, this is the first systematic computational study linking narration qualities, genre, title, and audiobook consumption, highlighting the potential of data-driven insights to improve audiobook personalisation and narrator casting.
cs.SD 2026-07-03

Prompts let AI locate one sound among many

by Ziyang Jiang, Yu Chen +6 more

SelectTSL: Prompt-Guided Selective Target Sound Localization in Complex Scenarios

SelectTSL steers attention to user-specified targets and estimates their direction plus count in overlapping scenes.

Humans can selectively attend to a target sound and estimate its direction in complex scenarios, whereas such selective localization remains challenging for current deep learning-based systems. Sound source localization (SSL) has achieved remarkable success with deep learning, yet most methods localize all active sources without selectivity. Conversely, target sound extraction (TSE) extracts sources using multimodal prompts but typically fails to preserve the multichannel spatial information required for accurate localization. To bridge this gap, we formulate the task of prompt-guided selective target sound localization and propose SelectTSL, an end-to-end architecture that localizes only the user-specified target in multi-source acoustic scenes. Specifically, we design a target-aware selective localization strategy that employs a Prompt-Guided Selective Attention Module (PGSA) to generate prompt-informed embeddings. These embeddings guide an inter-channel phase difference (IPD) enhancer to refine raw phase cues, fusing with target magnitudes to jointly estimate direction of arrival (DoA) and target-source cardinality, i.e., the number of target sound sources. This coupled design effectively focuses on the user-specified target spatial cues for selective localization and also handles time-varying numbers of target sources. Extensive experiments on both synthetic data and real-world recordings demonstrate that our proposed method consistently outperforms other baselines and exhibits robust generalization to real acoustic environments.
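For readers unfamiliar with the raw cue that the IPD enhancer refines, here is a minimal sketch of a plain inter-channel phase difference between two microphones for one STFT frame. It illustrates the standard IPD definition only, not SelectTSL's enhancer; the function names and the cos/sin feature encoding are assumptions.

```python
import cmath
import math


def ipd(frame_ref, frame_other):
    """Per-bin phase difference (radians, wrapped to [-pi, pi]) between
    two channels' complex STFT bins for the same time frame."""
    if len(frame_ref) != len(frame_other):
        raise ValueError("channels must have the same number of bins")
    # The phase of a * conj(b) equals phase(a) - phase(b), already wrapped.
    return [cmath.phase(a * b.conjugate()) for a, b in zip(frame_ref, frame_other)]


def ipd_features(frame_ref, frame_other):
    """Encode each IPD as (cos, sin) so the wrap at +/-pi is continuous."""
    return [(math.cos(p), math.sin(p)) for p in ipd(frame_ref, frame_other)]
```

The magnitudes of the two bins cancel out of the result, which is why IPD is usually paired with separate magnitude features, as the abstract describes.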
cs.SD 2026-07-03

Phase spectrograms from one mic array estimate speaker head orientation

by Balint Turi, Archontis Politis +2 more

Speaker head orientation estimation with a single microphone array using phase spectrogram features

Simulated voice directivity training followed by real-data fine-tuning reaches 11.3 degree mean error after personalization.

Estimating a speaker's head orientation from audio can provide valuable information in smart environments, meetings, and driver monitoring. We propose a novel approach that leverages the phase component of the short-time Fourier transform from a single microphone array as input to a deep neural network combining convolutional, recurrent, and self-attention layers. Unlike prior methods that use physics-informed handcrafted features or raw waveform inputs, our approach enables robust learning from simulated and real data. Trained on a large-scale dataset generated with voice directivity patterns and fine-tuned on real recordings, our model achieves state-of-the-art accuracy, outperforming baselines under both clean and noisy conditions. Personalization experiments further demonstrate significant gains, reaching a mean angular error of 11.3 degrees when adapting to individual users and environments.
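The reported metric is a mean angular error in degrees. A minimal sketch of how such an error can be computed, assuming orientation is a single azimuth angle in degrees and that errors wrap around at 360; the paper's exact definition may differ.

```python
def angular_error_deg(pred: float, true: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0..180)."""
    diff = abs(pred - true) % 360.0
    return min(diff, 360.0 - diff)


def mean_angular_error_deg(preds, trues) -> float:
    """Average wrapped angular error over paired predictions and targets."""
    if len(preds) != len(trues) or not preds:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(angular_error_deg(p, t) for p, t in zip(preds, trues)) / len(preds)
```

Wrapping matters here: a prediction of 350 degrees against a target of 10 degrees is 20 degrees off, not 340.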
cs.SD 2026-07-03

Multi-branch audio system reaches 80.84% hierarchical F1

by Beile Ning, Jiayi Yu +6 more

A Multi-Branch Hierarchy-Aware Framework for Heterogeneous Audio Classification

CLAP features, separate acoustic branches, and KNN post-processing lift scores on heterogeneous sound taxonomy task.

This technical report describes our system for Task 1 of the DCASE 2026 Challenge, which aims to classify heterogeneous audio recordings according to the Broad Sound Taxonomy (BST). The task requires both accurate second-level prediction and consistency with the top-level taxonomy. Our system is built on CLAP-based audio-text representations and is improved along three strategies: expanding the training set with a filtered subset of BSD35k, enhancing acoustic modeling with feature-specific branches, and refining predictions using hierarchy-aware classifiers and KNN-based post-processing. Among the acoustic features considered, the log-STFT branch provides the strongest single-model performance. With KNN-based post-processing, our best single system achieves a hierarchical F1 score (Hier. F1) of 80.84% on the BSD10k-v1.2 set under the same evaluation protocol as the baseline. We further construct ensemble systems by combining models with complementary acoustic features and classification heads, achieving Hier. F1 scores of 81.25% and 81.18%, respectively.
eess.AS 2026-07-03

Single neural audio codec model handles multiple token rates

by Tomohiko Nakamura, Wataru Nakata +2 more

Neural Audio Codec with Adjustable Token Temporal Resolution Using Sampling-Frequency-Independent Convolutional Layers

Shared parameters create resolution-specific kernels by scaling size and stride to each token interval

Discrete tokens obtained from neural audio codecs (NACs) have been used as compact representations in audio generation and understanding models. In such token-based systems, token temporal resolution (TTR), defined as the time interval between adjacent token frames, is important because it controls the trade-off between representing rapid acoustic events and reducing token-sequence length. However, most NACs are trained at a single TTR and require separate training for each TTR. This paper proposes a mechanism that enables a single NAC to operate at multiple TTRs using sampling-frequency-independent convolutional layers. The mechanism regards TTR as the sampling period of the token sequence and generates TTR-dependent convolutional kernels from a shared parameter set, while adjusting the kernel size and stride for each TTR. We incorporate the mechanism into Descript Audio Codec, leaving the quantizer unchanged. Experiments on environmental sound reconstruction show that the proposed model outperforms a single-model baseline that switches TTR-specific layers for each TTR.
cs.LG 2026-07-03

Decomposer converts MIDI into readable music programs

by Yewon Kim, Apurva Gandhi +3 more

Decomposer: Learning to Decompile Symbolic Music to Programs

Two-stage training on synthetic pairs plus dual-reward RL produces more faithful and editable code than LLMs or heuristics.

Musical performance involves executing a set of high-level musical instructions, yet recovering those instructions from the performance is a challenging inverse problem. We present Decomposer, a post-training framework for symbolic music decompilation: the task of recovering executable, editable music programs from symbolic music. We instantiate the task as MIDI-to-Strudel decompilation, where the model takes symbolic MIDI as input and produces a program in Strudel, a music programming language, that reconstructs the input when executed. The task poses two challenges: Strudel is a low-resource language with little naturally paired MIDI-code data, and optimizing faithful reconstruction of MIDI alone can collapse to unreadable note-by-note transliteration. We address these challenges in two stages. First, we construct Strudel-Synth, a synthetic corpus of paired Strudel programs and rendered MIDI, and use it for supervised fine-tuning. Second, we refine the model with reinforcement learning on unpaired MIDI, optimizing rewards for both MIDI reconstruction faithfulness and code readability. Our evaluation across synthetic and real-world MIDI benchmarks shows that Decomposer achieves substantially higher MIDI reconstruction faithfulness than closed-source LLMs while producing more readable and diverse code than the heuristic converter.
cs.SD 2026-07-03

Hearing aid enhancement reaches 8 ms latency with fewer operations

by Z. Benslimane, P. Chouteau +5 more

RT-Tango: Real-Time Distributed Binaural Speech Enhancement for Low-Power Hearing Aid Devices

Distributed two-stage system applies perceptual compression and recurrent estimation to keep quality competitive on constrained hardware.

Real-time binaural speech enhancement is constrained by latency, computational cost, and inter-device communication, yet existing efficient solutions predominantly address single-channel settings. In this paper, we introduce RT-Tango, a real-time distributed binaural speech enhancement framework designed for streaming on resource-constrained platforms and specifically for hearing aids. RT-Tango relies on a two-stage distributed architecture combining perceptually motivated ERB feature compression, lightweight grouped recurrent mask estimation, and temporal sparsification to reduce computational cost. Stringent latency constraints are addressed by decoupling spectral resolution from algorithmic delay using an asymmetric STFT, together with causal recurrent inference and online estimation of spatial statistics. Experimental results show that RT-Tango achieves competitive speech enhancement while significantly reducing multiply-accumulate (MAC) operations and operating at ultra-low latencies as low as 8 ms.
cs.AI 2026-07-03

Reinforcement learning creates clean-label speech backdoors

by Yueming Huang, Wenhan Yao +3 more

DRL-CLBA: A Clean Label Backdoor Attack for Speech Classification via DDPG Reinforcement Learning

DDPG shifts target audio to hidden steganographic anchors in latent space without label changes, resisting fine-tuning and pruning.

Deep learning models for speech classification are vulnerable to backdoor attacks, where malicious triggers cause misclassification at inference time. While sample-specific attacks can bypass many defenses, they often rely on label poisoning, making them detectable by manual data inspection. In this paper, we propose DRL-CLBA, a novel clean-label backdoor attack for speech classification that leverages Deep Deterministic Policy Gradient (DDPG) reinforcement learning. We also use deep audio steganography to embed sample-specific triggers into source audio, creating feature-space anchors. The proposed reinforcement learning framework optimizes target samples toward trigger-bearing anchor points in the model's deep latent space, enabling label-migration-free poisoning of target samples. Experimental results across three datasets and four different DNNs demonstrate that DRL-CLBA achieves a high attack success rate, effectively bypassing some backdoor defenses. The attack demonstrates strong resistance against fine-tuning, pruning, and spectral signature defenses, exposing critical vulnerabilities in speech-controlled systems.
cs.CR 2026-07-03

Meta-learning adds multiple backdoors to speech models via timbre leak

by Yueming Huang, Wenhan Yao +3 more

Pmeta-TLA: Backdoor Attacks for Speech Classification Models via Meta-Learning with Timbre Leakage Attack

New trigger spreads frame-level timbre info to create natural poisoned samples that bypass detectors.

Speech classification methods have recently been widely adopted in smart devices. Recent studies indicate that backdoor attacks pose a substantial security risk to these models, underscoring the need to investigate further potential attack techniques in order to expose and prevent such risks. This work discusses the vulnerability of current speech triggers to detection by deep neural network defenders and introduces the Timbre Leakage Attack (TLA). The proposed trigger disseminates timbre information at the frame level within deep self-supervised features, producing poisoned samples that sound natural to human listeners. Furthermore, we introduce Pmeta-TLA, a training mechanism for embedding multiple backdoors at once. This method proposes a multi-backdoor injection training strategy using meta-learning and Projected Conflicting Gradients (PCGrad) and introduces TLA as a multi-target attack tool within it. We performed tests on data-poisoning backdoor attacks in keyword spotting tasks using several deep neural network models. Experimental results indicate that the proposed strategy attains superior attack efficacy, enhanced stealthiness and robustness, and a reduced attack cost relative to baseline methods.
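The PCGrad step named in the abstract can be sketched in isolation. This is a minimal illustration of gradient surgery on plain Python lists, assuming one gradient vector per objective; it is not the Pmeta-TLA training loop, and it visits the other gradients in a fixed order where the original PCGrad formulation samples that order at random.

```python
def _dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pcgrad(grads):
    """Combine per-objective gradients: whenever g_i conflicts with g_j
    (negative inner product), remove from g_i its component along g_j,
    then sum the adjusted gradients."""
    projected = []
    for i, g in enumerate(grads):
        g = list(g)
        for j, other in enumerate(grads):
            if i == j:
                continue
            d = _dot(g, other)
            if d < 0:  # conflict, so 'other' is non-zero and division is safe
                scale = d / _dot(other, other)
                g = [a - scale * b for a, b in zip(g, other)]
        projected.append(g)
    return [sum(column) for column in zip(*projected)]
```

Gradients that do not conflict pass through unchanged, so the update reduces to an ordinary sum when the objectives agree.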
cs.SD 2026-07-03

Text embedding clusters raise objective scores in music generation

by Shunsuke Yoshida, Yu-Hua Chen +1 more

UT-AISTimprt submission for ICME 2026 Grand Challenge on Academic Text-to-Music Generation

Grouping similar text embeddings in batches outperforms audio clusters, with moderate granularity best for metrics and finer clusters best f

This work investigates the effect of batch sampling strategies during training for text-to-audio music generation under low-data and small-scale model settings. This paper describes our approach and findings for the ICME 2026 Grand Challenge on Academic Text-to-Music Generation. Training data are clustered using either text embeddings or audio embeddings, and samples with similar characteristics are grouped within the same mini-batch to mitigate gradient interference. The effects of modality and cluster granularity on clustering are analyzed. Results show that clustering based on text embeddings achieves better performance on objective evaluation metrics than clustering based on audio embeddings. In addition, different cluster granularity leads to different behaviors across evaluation criteria: a moderate number of clusters performs best on objective metrics, while a larger number of clusters tends to exhibit music with more coherent structure in listening tests.
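The batching idea described above can be sketched as follows, assuming cluster assignments have already been computed from text or audio embeddings. The function name, the handling of short leftover batches, and the shuffling policy are illustrative assumptions, not details from the submission.

```python
import random
from collections import defaultdict


def cluster_batches(cluster_ids, batch_size, seed=0):
    """Group sample indices so every mini-batch holds samples from a single
    cluster; cluster_ids[i] is the cluster label of sample i."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(index)

    batches = []
    for members in by_cluster.values():
        rng.shuffle(members)  # vary batch composition within a cluster
        for start in range(0, len(members), batch_size):
            batches.append(members[start:start + batch_size])  # last may be short
    rng.shuffle(batches)  # interleave clusters across training steps
    return batches
```

In this sketch, more clusters means smaller and more homogeneous groups, which corresponds to the cluster-granularity setting the abstract varies.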
cs.SD 2026-07-03

Explicit guidance lifts MoE accuracy on overlapping speech

by Yujie Guo, Jiaming Zhou +3 more

H-SAGE: Holistic Speaker-Aware Guided Experts for MoE-based Multi-Talker ASR

A global encoder plus overlap-aware loss helps experts route better in high-overlap conditions on LibriSpeechMix.

Multi-talker Automatic Speech Recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech, particularly under complex high-overlap conditions. While recent Mixture-of-Experts (MoE) approaches have shown promise, they typically rely on frame-independent routing that leads to temporal myopia, and depend solely on the downstream ASR objective, which results in implicit and ungrounded representation learning. To address these limitations, we propose Holistic Speaker-Aware Guided Experts (H-SAGE) for MoE-based MTASR. Specifically, we introduce a Speaker-Aware Global Encoder to capture long-term dependencies, supervised by an auxiliary Overlap-Aware Loss that explicitly guides the model to discern acoustic states. Furthermore, we design a Holistic Gating Mechanism to arbitrate expert selection by jointly evaluating global context and local details. Experiments on LibriSpeechMix demonstrate that H-SAGE achieves consistent improvements over strong baselines, particularly in complex scenarios, validating that explicit acoustic guidance effectively enhances expert collaboration. Our code can be found at https://github.com/NKU-HLT/H-SAGE.
cs.SD 2026-07-02

Uncertainty score flags unreliable room embeddings from one utterance

by Yang Xiang, Philipp Götz +4 more

Quantifying the Uncertainty of Blindly Estimated Room Embeddings Using a Dispersion-Calibrated Score

Calibrated on dispersion from artificial corruptions, it tracks representation quality without downstream labels or multiple recordings.

Room embeddings derived from reverberant speech are often unreliable: speech content and recording degradation can alter the representation even when speaker, room, and source-receiver geometry remain unchanged, degrading downstream task performance. We propose a framework that learns room embeddings robust to speech-content variation and a representation-level uncertainty score from reverberant speech without downstream-task supervision. The embedding is anchored to a structured room impulse response (RIR) latent space and trained using a multi-view data structure with Kullback-Leibler (KL)-based alignment; a multi-positive contrastive term further refines robustness. A lightweight uncertainty head is calibrated using the dispersion of corruption-induced embeddings and optimized with a rank-based objective. Across waveform- and spectrogram-level corruptions, the score is consistent with representation dispersion and enables effective selective prediction while requiring only a single utterance at inference.
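Selective prediction with an uncertainty score can be illustrated with a small risk-at-coverage helper. This is a generic sketch, assuming one uncertainty score and one downstream error value per utterance; it does not reproduce the paper's uncertainty head or evaluation protocol.

```python
def risk_at_coverage(uncertainty, errors, coverage):
    """Mean error over the most confident fraction of samples.

    uncertainty: per-sample scores (lower means more trusted)
    errors:      per-sample downstream errors
    coverage:    fraction of samples to keep, in (0, 1]
    """
    if len(uncertainty) != len(errors) or not errors:
        raise ValueError("need two non-empty sequences of equal length")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    n_keep = max(1, int(coverage * len(errors)))
    order = sorted(range(len(errors)), key=lambda i: uncertainty[i])
    kept = order[:n_keep]  # abstain on everything after the cut
    return sum(errors[i] for i in kept) / n_keep
```

If the score tracks representation quality, the risk should fall as coverage is lowered, since the rejected samples are the ones with the largest errors.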
cs.SD 2026-07-02

NPUsper cuts Whisper latency 4.84x on mobile NPUs

by Sihyeon Lee, Hojeong Lee +4 more

NPUsper: Eliminating Redundant Computation for Real-Time Whisper on Mobile NPUs

Hallucination detection lets short audio chunks replace padded inputs and chunked decoding trims cache work while accuracy holds.

We present NPUsper, a live transcription system that makes Whisper efficient on mobile NPUs by eliminating redundant computation. To avoid the heavy padding used by prior streaming systems, NPUsper detects hallucinated tokens online from temporal patterns in decoder cross-attention, allowing each inference round to process short audio inputs with minimal carryover. For efficient mobile-NPU execution, we propose controlled unrolling, which executes autoregressive decoding as K-step chunk graphs, removing unnecessary KV-cache computation and reducing graph-dispatch overhead. NPUsper achieves up to 4.84x lower per-word latency, up to 33.2x lower time-to-first-token (TTFT), and up to 88.64% lower average power consumption compared with baselines, while maintaining comparable transcription accuracy. The code is available at https://github.com/npusper/NPUsper.
cs.SD 2026-07-02

SLM yields cleaner emotion subspaces than CFM for TTS steering

by Siyi Wang, James Bailey +1 more

A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models

Geometric measurements show low-dimensional disentangled representations enable better single-site control while joint steering trades inten

While prior work has explored emotion control in hybrid text-to-speech systems, the geometric properties of these modules, and their implications for steerability, remain poorly understood. We present the first comparative study of speech language model (SLM) and conditional flow-matching (CFM) modules as activation steering sites for mixed-emotion speech synthesis. We first characterize emotion representations using linear probing and local intrinsic dimensionality (LID), and then evaluate single-site and joint steering for mixed-emotion synthesis. Our results show that SLM offers a clean, low-dimensional emotion-specific subspace with strong speaker--emotion disentanglement, while CFM exhibits poor cross-speaker generalization due to speaker--emotion entanglement. Joint steering increases emotion intensity but degrades proportional control and speech quality on in-distribution data. These findings provide practical guidance for multi-site activation steering in hybrid TTS systems and highlight the importance of representation geometry in controllable speech generation.
eess.AS 2026-07-02

CNNs turn 4-mic covariance into 32-mic acoustic images

by Marianthi Adamopoulou, Parthasaarathy Sudarsanam +6 more

CNN Models for Microphone Array Covariance Matrix Upsampling and Acoustic Imaging

Models trained on real recordings achieve lower error than random guessing and produce sound maps nearly identical to those from a full 32-c

Acoustic imaging visualization is a core methodology in acoustics, enabling spatial analysis of sound sources and acoustic scenes. However, limited sensor availability in practical systems motivates approaches that enhance spatial resolution without increasing the hardware complexity. In this paper, we focus on virtually upsampling a tetrahedral 4-microphone array to a spherical 32-microphone array by estimating the covariance matrices of the channels with deep learning techniques. Five neural network architectures are investigated for covariance upsampling for acoustic imaging using the real-world STARSS23 dataset. These models are developed to estimate a 32-microphone, time-frequency covariance matrix from a 4-microphone input covariance representation. The proposed architectures are based on 2D convolutional layers to capture the underlying spatial-spectral structure of covariance matrices, and are further enhanced with frequency dynamic convolution to model their frequency-dependent properties. The proposed architectures are evaluated in terms of root mean square error (RMSE) and using delay-and-sum beamforming acoustic imaging. Quantitative results show that all models outperform a random-guess baseline, which yields an RMSE of 0.548, with the best-performing architecture achieving an RMSE of 0.432. We qualitatively analyze the performance of the proposed models through beamforming heatmap visualizations derived from the 4-channel input covariance, the 32-channel ground truth, and the predicted 32-channel covariance matrices. These results demonstrate that covariance upsampling significantly enhances the effective performance of the 4-channel microphone array, producing sound maps that closely resemble those obtained with the 32-channel array.
cs.SD 2026-07-02

Pretrained embeddings beat scratch models on jazz standard recognition

by Çağrı Eser

Evaluating Pretrained Music Embeddings for Cross-Performance Jazz Standard Recognition

They generalize better than from-scratch spectrogram models yet track performer identity, partially fixed by a contrastive projection.

Recognizing jazz standards from audio is a challenging form of tune-level music retrieval: different performances of the same standard may vary in tempo, key, arrangement, instrumentation, improvisational content, and even whether the head melody is present. We study this problem using a curated subset of the Jazz Trio Database designed for cross-performance standard recognition. We compare a from-scratch trained Harmonic CNN baseline against frozen pretrained music representations from recent music understanding foundation models, using both supervised probing and nearest-neighbor retrieval. Our results suggest that from-scratch spectrogram models overfit strongly to training performances, while pretrained embeddings provide better top-$k$ results but are sensitive to performer identity, which can be partially reduced with a lightweight contrastive projection. Our findings motivate jazz standard recognition as a useful stress test for music representation models and as a step toward retrieval-based standard identification. Project page: https://github.com/cagries/tipofmyear.
cs.CV 2026-07-02

Benchmark splits audio-visual sync into separate timing and meaning scores

by Tianhong Zhou, Mingyang Han +9 more

AV-SyncBench: Decoupled Benchmarking of Temporal and Semantic Audio-Visual Synchronization

AV-SyncBench lets researchers measure offset accuracy apart from content matching on 38k verified samples.

Audio-visual feature extraction is a fundamental component of multimodal understanding and generation tasks. However, existing evaluation protocols for feature extraction models exhibit dimensional bias, typically focusing on either semantic matching or temporal offset detection. Moreover, their data construction remains coupled, preventing independent assessment of temporal and semantic consistency. We propose AV-SyncBench, the first benchmark to fully separate temporal and semantic evaluation for audio-visual synchronization. Built from in-the-wild videos, it spans Voice, Music, and Sound across 10 scenarios and 5 challenge tasks. Data are automatically filtered and manually verified to ensure on-screen sound sources. The benchmark contains 3,269 videos and 38,390 samples, and we evaluate five representative models to quantify feature quality for alignment and downstream tasks. The code and dataset are available at: https://fgt7t6g.github.io/AV-SyncBench.
cs.CL 2026-07-02

Tool merges Python and web for speech feature comparison

by Stephen McIntosh, Daisuke Saito +1 more

Speech Playground: An Interactive Tool for Speech Analysis and Comparison

Supports continuous, discrete and variable-length representations plus TextGrid alignment for research and CAPT tasks.

This paper presents Speech Playground, an interactive speech visualization and comparison tool. While existing tools such as Praat are excellent, it can be cumbersome to integrate them with modern deep learning representations and use them for comparison. Speech Playground addresses this by combining a Python backend with a web-based frontend for interactive exploration of multiple feature types, including continuous, discrete, and variable-length representations. It includes TextGrid and forced alignment support together with configurable distance and alignment settings for visual and auditory comparison. Speech Playground is intended for use in speech research, representation validation, and computer-aided pronunciation training (CAPT)-oriented experimentation.
cs.SD 2026-07-02

Guidance framework speeds flow matching speech synthesis nearly 3x

by Zuda Yu, Qianhui Xu +6 more

Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis

Heterogeneous augmentation and trajectory rectification remove CFG overhead and raise speaker similarity.

Flow Matching (FM) has emerged as a powerful paradigm for speech generation but remains constrained by high inference latency and timbre leakage. To address these bottlenecks, we propose a unified guidance framework that enhances generation efficiency and robustness through two complementary strategies. On the data front, we introduce Data-guidance via heterogeneous augmentation, encouraging the model to disentangle linguistic content from acoustic residue. In parallel, we propose an enhanced Model-guidance mechanism that synergizes trajectory rectification with a novel intrinsic guidance objective. This approach distills conditional knowledge into the network weights and straightens the inference trajectory, thereby eliminating Classifier-Free Guidance (CFG) overhead. Experiments demonstrate that our framework accelerates inference by nearly three times while effectively improving speaker similarity compared to state-of-the-art baselines.
cs.SD 2026-07-02

Text prompts steer evolving soundscapes through a categorical schema

by Prabal Gupta (Rama Labs, Kitchener +1 more

A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models

Performers adjust parameters directly while audio continues without interruption, using any of three backends.

We present a real-time musical interface that converts natural-language scene descriptions into evolving procedural soundscapes. A performer types a prompt such as "warm jazz cafe at midnight" and steers it through direct parameter adjustments - stepping brightness down, switching a rhythm style - each producing a predictable, audible shift without re-prompting. Where GPU-bound text-to-audio systems synthesize monolithic waveforms, our instrument generates human-readable configurations over a categorical schema, enabling fine-grained performer control; most valid combinations are designed to sound musically coherent. Three interchangeable backends - embedding retrieval for sub-second CPU-only use, hosted LLMs via API, and a fine-tuned 270M local model - all emit the same schema. A live generator architecture continuously emits audio while resolving new instructions in the background, crossfading seamlessly when ready; even when an LLM takes 5-12 seconds to respond, the audience hears uninterrupted sound - reframing text-to-music as an ongoing performable stream rather than a one-shot generation. We evaluate text-audio semantic alignment using LAION-CLAP on held-out prompts as a technical proxy, finding that retrieval-based configuration outperforms random valid configurations on this metric, while noting that LAION-CLAP also informed retrieval-map construction. We report performance observations, informal listener feedback, and release materials for the SDK, dataset artifacts, model, and audiovisual performance interface.
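The crossfade between the currently playing stream and a newly resolved configuration can be illustrated with a minimal sketch. It assumes a plain linear fade over two equal-length sample buffers; the abstract does not say which fade curve the instrument uses, and the function name `linear_crossfade` is hypothetical.

```python
def linear_crossfade(outgoing, incoming):
    """Blend two equal-length sample buffers, fading from outgoing to incoming.

    Assumption: a linear fade. The actual instrument may use another curve.
    """
    if len(outgoing) != len(incoming):
        raise ValueError("buffers must have the same length")
    n = len(outgoing)
    if n == 1:
        return [incoming[0]]
    mixed = []
    for i in range(n):
        t = i / (n - 1)  # 0.0 at the first sample, 1.0 at the last
        mixed.append((1.0 - t) * outgoing[i] + t * incoming[i])
    return mixed


# The old stream keeps playing until the new one is ready, then the two
# buffers are blended so the listener hears no gap.
old_block = [1.0, 1.0, 1.0]
new_block = [0.0, 0.0, 0.0]
blended = linear_crossfade(old_block, new_block)
```

An equal-power curve (sine and cosine weights) is a common alternative when the two signals are uncorrelated, since a linear fade can dip in perceived loudness at the midpoint.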
cs.SD 2026-07-01

Hidden-state selector lifts audio decoding accuracy 4.3 percent

by Aaron Isidore Grace, Zhouyuan Huo +1 more

Adaptive Perturbation Selection for Contrastive Audio Decoding

A lightweight network trained on base-model states picks the best perturbation per example and task without retraining the language model.

Large audio-language models (LALMs) frequently hallucinate by overriding acoustic evidence with language priors. While contrastive decoding (CD) offers training-free mitigation, existing methods rely on blunt perturbations like masking or noise, leaving structured audio transformations unexplored. We explore this design space by evaluating a diverse library of targeted audio perturbations and adaptively selecting the optimal negative branch for each task and example. First, we improve upon earlier prompt engineering by showing that a simple binary yes/no constraint reduces the model's tendency to falsely confirm absent audio features. Second, evaluating our library across temporal, spectral, frequency, and amplitude domains reveals that optimal transformations are highly task-dependent; for instance, reversing the audio array disrupts temporal coherence, raising accuracy on the temporal order task from 74.7% to 81.4%. Finally, we train a lightweight perturbation selector on model hidden states to dynamically route negative branches, yielding an additional +4.3% gain on the existence task.
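As a rough illustration of the contrastive-decoding step, the sketch below contrasts logits from the original audio with logits from a perturbed "negative" branch. It assumes the widely used form (1 + alpha) * clean - alpha * perturbed; the abstract does not state the paper's exact formula, and `alpha` and the toy yes/no logits are hypothetical.

```python
def contrastive_logits(clean, perturbed, alpha=1.0):
    """Amplify what the clean audio supports relative to the perturbed branch.

    Assumed form: (1 + alpha) * clean - alpha * perturbed.
    """
    return [(1.0 + alpha) * c - alpha * p for c, p in zip(clean, perturbed)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy two-way answer space: index 0 = "yes", index 1 = "no".
clean = [2.0, 1.5]      # logits with the real audio
perturbed = [2.0, 0.5]  # logits with a perturbed input (e.g. reversed audio)

contrasted = contrastive_logits(clean, perturbed, alpha=1.0)
answer_before = argmax(clean)
answer_after = argmax(contrasted)
```

In this toy case the "yes" logit is unchanged by the perturbation, which suggests it comes from the language prior rather than the audio, so the contrast shifts the decision toward "no".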
cs.SD 2026-07-01

Merged Roman numeral datasets create 1,621-piece corpus

by Johannes Hentschel, Emmanouil Karystinaios +2 more

Dilemmadata: On the Interoperability of Heterogeneous Roman Numeral Datasets

84 overlapping pieces allow note-for-note comparison of two analytical traditions on identical music

In recent years, there has been growing effort to annotate and collect large-scale corpora of Roman numeral analyses in support of data-driven studies in tonal harmony. We introduce dilemmadata, the first resource to reconcile two major collections, the AugmentedNet Dataset (AN) and the Distant Listening Corpus (DLC), making them interoperable through a shared note-wise TSV schema. The reconciliation confronts four families of dilemmata: annotation-standard (the two encode the same musical fact differently in terms of vocabulary size, syntax, conventions for chord extensions, inventory of special chord functions), representational (what counts as a row, and which information survives the conversion), toolchain (incompatible Python ecosystems built around music21 vs. ms3+dimcat), and curatorial (which pieces to include, exclude, or retain twice). We resolve each by deliberately transforming, augmenting, and omitting information, formalising the mismatches, preserving musical semantics, and flagging transformations that may subtly affect annotation fidelity. Consistency checks and qualitative inspections offer a preliminary assessment of post-conversion validity and a basis for critiquing the theoretical assumptions embedded in each original standard. After removing duplicates and merging the two collections, the resulting dilemmadata (1,621 pieces and approx. 2.8M note-wise annotations) is the largest homogeneous Roman-numeral corpus currently available, albeit far from perfect. Crucially, we retain 84 pieces common to both corpora under each of their original analyses, yielding a shared reference set in which two equally legitimate analytical traditions can be compared note-for-note over identical musical material. Released on Zenodo, dilemmadata supports interoperability, comparative harmonization modeling, and future refinement of Roman-numeral encoding standards.
cs.SD 2026-07-01

Entropy regularization cuts base-to-novel gap in audio prompt learning

by Asif Hanif, Mohammad Yaqub

ZEBRA: Zero-Shot Entropy-Regularized Prompt Learning for Base-to-Novel Generalization in Audio-Language Models

ZEBRA fuses zero-shot and prompt logits to raise novel-class accuracy while preserving base performance across audio datasets.

Audio-Language Models (ALMs) achieve strong zero-shot performance by aligning audio with textual class descriptions. Although prompt learning improves accuracy on base classes through few-shot supervised adaptation, we observe a critical trade-off: it often degrades performance on novel classes, sometimes falling below zero-shot accuracy. This exposes a base-to-novel generalization gap in prompt learning for ALMs. To address this issue, we propose ZEBRA (Zero-shot Entropy-Regularized Prompt Learning for Base-to-Novel Generalization), a plug-and-play framework that fuses zero-shot logits with prompt-learning logits, and employs self-entropy regularization to reduce overfitting to base classes. Experiments across multiple audio classification datasets show that ZEBRA consistently improves novel-class performance while maintaining strong base accuracy, significantly reducing the base-to-novel gap compared to standard prompt learning. The code is available at: https://github.com/asif-hanif/zebra.
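The two ingredients named in the abstract, logit fusion and a self-entropy term, can be sketched as follows. This is not the authors' implementation: the convex combination with weight `lam` is an assumed fusion rule, and how the entropy term enters the training loss (its sign and weight) is not specified in the abstract, so the sketch only computes it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(zero_shot, prompt, lam=0.5):
    """Assumed fusion rule: convex combination of the two logit vectors."""
    return [lam * z + (1.0 - lam) * p for z, p in zip(zero_shot, prompt)]


def self_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Toy three-class example with hypothetical logits.
zero_shot_logits = [1.0, 0.5, 0.2]  # from the frozen model with fixed prompts
prompt_logits = [4.0, 0.1, 0.0]     # from learned prompts, sharper on a base class

fused = fuse_logits(zero_shot_logits, prompt_logits, lam=0.5)
entropy_prompt_only = self_entropy(softmax(prompt_logits))
entropy_fused = self_entropy(softmax(fused))
```

In the toy example the fused distribution is less peaked than the prompt-only one, which is the intuition behind tempering overconfident base-class predictions.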
eess.AS 2026-07-01

High-fidelity room simulations cut speech errors by 38 percent

by Georg Götz, Alessia Milo +4 more

Improving multichannel speech enhancement through accurate room-acoustic simulations

Wave-based and hybrid acoustic data for training outperforms purely geometrical simulations on real measured recordings.

Room-acoustic simulations are widely used to augment training data for deep-learning-based speech enhancement. While most pipelines rely on simplified geometrical acoustics, wave-based approaches offer greater physical accuracy. In this work, we examine how simulation fidelity affects multichannel speech enhancement performance. To this end, we train SpatialNet on datasets augmented with different room-acoustic simulation methods and evaluate the resulting models on measured data. We compare lower-fidelity datasets based on geometrical acoustics with a high-fidelity dataset using advanced acoustic modelling and a hybrid combination of wave-based and geometrical acoustics simulations. Training on the high-fidelity dataset yields up to a 38% relative reduction in median word error rate compared to the lower-fidelity alternatives. These results show that augmentation with high-fidelity room-acoustic simulations directly translates into improved multichannel speech enhancement performance.
eess.AS 2026-07-01

Multilingual SSL models predict articulatory movements with r up to 0.68

by Ailín Pollio San Pedro, Tomi Kinnunen +2 more

How Bilingual Are SSL Speech Models? Cross-Lingual Probing of Articulatory Encoding with Finnish and Russian EMA

Bilingual Finnish-Russian EMA data shows intermediate layers capture tongue and lip positions across languages using only minutes of trainin

SSL speech models capture rich phonetic, prosodic, and acoustic patterns from raw audio, yet how they encode articulatory information across diverse languages remains unclear. Using EMA data from bilingual Finnish-Russian speakers, we evaluate cross-lingual correlations between SSL latent representations and articulatory movements. Models achieve strong prediction performance (Pearson r up to 0.68) even with approximately 5 minutes of training data, with multilingual models outperforming monolingual ones. Intermediate layers encode articulatory features most effectively, and tongue movements are more predictable than lip movements. We also assess the impact of task type (read versus spontaneous speech) and language proficiency, finding higher accuracy for structured tasks and strong generalization across proficiency levels. These results enhance the interpretability of SSL models and show their potential for speech-technology applications.
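For readers unfamiliar with the reported metric, the Pearson correlation between predicted and measured articulator trajectories can be computed as in the sketch below. The trajectories are made-up numbers for illustration only, not EMA data, and the helper name `pearson_r` is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical sensor positions (measured) and probe outputs (predicted).
measured = [0.1, 0.4, 0.35, 0.8, 0.6]
predicted = [0.2, 0.3, 0.5, 0.7, 0.65]
r = pearson_r(measured, predicted)
```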
cs.CL 2026-07-01

Adapted model halves errors in Bambara child reading ASR

by Yacouba Diarra, Nouhoum Souleymane Coulibaly +3 more

Building an ASR Solution for Training and Assessing Children's Reading

Training on 55 hours of data from 60 children enables practical assessment of literacy skills.

Automatic speech recognition for children's reading remains underdeveloped for most African languages, including Bambara, despite its potential value for reproducible literacy assessment. We present an open-source system for assessing children's reading in Bambara, developed through an end-to-end process linking field data collection, benchmark construction, model adaptation, a reading application, and classroom validation. A mobile collection and assessment app was used to collect 55 hours of raw reading speech from 60 children, from which we construct a public benchmark for Bambara child-reading assessment. Fine-tuning experiments compare Soloni, a Bambara-adapted Fast-Conformer ASR framework with TDT and CTC decoders, with QuartzNet, a compact convolutional ASR architecture. The best Soloni model reduces WER from 0.42 to 0.22 and CER from 0.15 to 0.08, substantially outperforming QuartzNet on the isolated benchmark. The experiments further show that repeated readings of the same texts provide architecture-dependent benefits: they substantially improve QuartzNet but add only marginal gains for Soloni, while SpecAugment regulates training without exceeding the best unaugmented configuration. Disaggregated analysis identifies children under 10 as the main source of residual errors, motivating targeted collection from younger readers. Ten classroom trials supported continued use of the application.
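The WER figures above follow the standard definition: word-level edit distance divided by the number of reference words. A minimal sketch, using placeholder tokens rather than Bambara text, with the relative reduction implied by the reported 0.42 to 0.22 change:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before, after):
    """Fraction of the original error removed."""
    return (before - after) / before


# Placeholder tokens: one substitution and one deletion over four words.
example_wer = word_error_rate("a b c d", "a x c")

# Relative WER reduction for the figures reported in the abstract.
wer_drop = relative_reduction(0.42, 0.22)
```

The same function applied to character sequences instead of words gives the CER.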
eess.AS 2026-07-01

Probes show acoustics leak into speech embeddings in codecs

by Philipp Grundhuber, Emanuël A. P. Habets

Beyond Cross-Reconstruction: Probing-Based Disentanglement Evaluation for Acoustic Teleportation Codecs

Speaker identity stays mostly partitioned but room parameters emerge unsupervised in acoustic embeddings and leak elsewhere.

Some neural audio codecs disentangle speech into latent subspaces encoding content, speaker identity, and acoustics, enabling acoustic teleportation and voice conversion. Existing evaluations rely on cross-reconstruction quality, which cannot reliably detect leakage across partitions. We extend a probing-based framework to assess disentanglement by regressing room-acoustic parameters (reverberation time, clarity, and direct-to-reverberant ratio) and classifying speaker identity, using the gap between intended and unintended partitions as the disentanglement measure. Applied to an acoustic teleportation codec, we find speaker identity is largely confined to its partition, while acoustics leak into the speech embeddings due to the training objective. Acoustic embeddings blindly estimate room parameters within 0.02 s of supervised baselines, indicating physically meaningful structure emerges without explicit supervision.
cs.SD 2026-07-01

Binary QA accuracy misses instrument grounding flaws

by Yujun Lee, Joonhyeok Shin +2 more

Beyond Binary Instrument QA: Probing Instrument Grounding in Music Audio-Language Models

Models display position bias, confusable errors and temporal inconsistencies on extended tests

Recent music audio-language models achieve high accuracy on instrument question-answering benchmarks, but it remains unclear whether this reflects robust audio grounding or benchmark-specific shortcuts. In this paper, we introduce an OpenMIC-derived diagnostic benchmark sequence for instrument grounding in music audio-language models, extending binary instrument-presence QA to genre-prior-reduced examples, confusable instrument discrimination, longer audio context, and temporal localization. Across these settings, high binary QA accuracy often fails to predict model behavior: models can exhibit option-position bias, confusable-instrument errors, and temporal response bias. These results suggest that instrument grounding should be evaluated with multi-axis diagnostic benchmarks rather than a single aggregate accuracy.
cs.SD 2026-07-01

One-step audio model distilled from captions alone

by Binh Mai, Tran Quoc Bao Le +2 more

SwiftAudio: Data-Efficient Caption-Only Distillation for One-Step Text-to-Audio Diffusion-based Generation

SwiftAudio trains a fast text-to-audio generator on 45K captions without audio pairs and tops other one-step methods.

Diffusion-based text-to-audio (TTA) models achieve impressive synthesis quality but suffer from high inference latency due to iterative multi-step denoising. Existing one-step approaches alleviate this issue but still rely on paired text-audio data during distillation. To address these limitations, we propose SwiftAudio, a one-step TTA framework that performs audio-free distillation from a pretrained diffusion teacher using only text captions. Specifically, we adapt Variational Score Distillation (VSD) to the audio domain and introduce a temporal smoothness regularization objective to encourage coherent latent audio representations. This design enables the student model to inherit the teacher's generative prior without requiring paired audio supervision and allows effective training with only approximately 45K captions. Experiments on AudioCaps and Clotho demonstrate that SwiftAudio achieves state-of-the-art performance among strict one-step methods and substantially narrows the gap to multi-step diffusion systems. Project page: https://swiftaudio.org/
cs.SD 2026-07-01

FlexiSLM adds dynamic frame rate control to spoken language models

by Jiaqi Li, Chaoren Wang +10 more

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

It beats fixed-rate 7B models on quality and halves inference time at lower rates.

Spoken language models (SLMs) extend LLMs to speech input and output. Existing SLMs represent speech at fixed frame rates (e.g., 25 or 12.5 Hz), ignoring the time-varying information density of speech and offering no flexibility to trade off quality for speed at inference time. Recent audio tokenizer research has proposed dynamic frame rate speech coding, which exploits this non-uniformity and enables two new capabilities: very low average frame rates and frame rate controllability. However, this technique has not yet been applied to SLMs. We introduce Flexible Spoken Language Model (FlexiSLM), the first SLM that supports dynamic and controllable frame rates on both speech input and output. Using dynamic frame rate representations, FlexiSLM outperforms fixed-frame-rate 7B models including Qwen2.5-Omni and Kimi-Audio at its high-quality operating points. We further verify that FlexiSLM can be accurately steered down to 4.0 Hz; at 6.25 Hz, it roughly halves inference time relative to 12.5 Hz while retaining strong speech-to-speech quality. Audio samples are available at https://flexislm.github.io .
cs.SD 2026-07-01

One model edits speaker, emotion and content in speech

by Chuanbo Zhu, Wuyou Zhou +5 more

UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling

Discrete phonetic tokens let users change small sound units or whole words while controlling voice and mood in the same system.

Speech editing aims to modify specific portions of an utterance while preserving the remaining speech. Existing approaches primarily focus on word-level content modification and typically treat content, speaker, and emotion editing as separate tasks, limiting both editing granularity and flexibility. We propose UniSAE, a unified speech attribute editing framework which supports composable speaker, emotion and content editing from sub-phoneme to word level within a single architecture. UniSAE introduces a Discrete Phonetic PosteriorGram (DPPG) representation that factorizes speech content into discrete tokens encoding phoneme identity, pronunciation variants, and duration, enabling direct phoneme- and sub-phoneme-level editing. For higher-level modifications, an autoregressive content transformer predicts edited DPPG sequences for word-level content editing. The edited sequences are rendered into speech by a diffusion-based acoustic decoder, conditioned on disentangled speaker and emotion representations. Experimental results demonstrate that the proposed unified framework supports precise speaker and emotion control, content editing at multiple granularities, and joint modification of all three attributes within a single framework.
cs.SD 2026-07-01

UTMOS scores stay high even when audio quality drops under attack

by Wen-Chin Huang, Tomoki Toda

Attacking UTMOS: Probing the Robustness of a Speech Quality Assessment Model

Optimization in waveform, mel, and EnCodec spaces decouples the model's output from what listeners actually hear.

UTMOS has become one of the most commonly used deep neural network-based speech quality assessment (SQA) metrics in speech processing research. In this paper, we attack UTMOS to probe its robustness. Starting from high-quality speech samples, we optimize the input in two directions: a score-preserving attack, which degrades perceived quality while maintaining the predicted score, and a quality-preserving attack, which lowers the predicted score while maintaining perceived quality. We consider three input spaces: raw waveform, mel spectrogram with a HiFi-GAN vocoder, and the latent space of EnCodec, a neural audio codec. Experimental results show that score-preserving attacks are effective against UTMOS. Although perfect quality-preserving attacks are more difficult, optimization in the EnCodec latent space provides the best chance of success. These results reveal failure modes of UTMOS and highlight the importance of robustness analysis for DNN-based SQA metrics.
cs.CL 2026-07-01

Matched references calibrate prosody flags to 10% rate in dialogue AI

by Ashish Hallur, Thomas Thebaud +3 more

Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems

Conditioning on speaker traits and interaction state yields expected flag rates on human data and interpretable deviations unlike pooled ave

Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because $F_0$, speaking rate, articulation rate, and pausing shift with model-predicted speaker traits and interaction state, pooled human statistics can be poorly calibrated for evaluating a particular output. Using 4000+ hours of dyadic English conversation from the Seamless Interaction dataset, we construct matched reference regimes for $F_0$ mean, $F_0$ expressivity, speech rate, articulation rate, pause ratio, and mean pause duration. We then define a percentile-based evaluation protocol: extract the same metrics from an S2S output waveform, compare them to the closest matched human reference stratum, and report percentile deviations or 5th-95th percentile out-of-regime flags. On held-out human rows, pooled references over-flag state-conditioned $F_0$ expressivity and rhythm, while matched references return flag rates closer to the nominal 10% and make deviation direction interpretable. These outputs serve as behavioral plausibility checks that complement, rather than replace, perceptual and user-centered evaluation.
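The percentile-based flagging step can be sketched as below. This is an illustration under stated assumptions, not the paper's code: `reference` stands in for the metric values of the closest matched human stratum, the percentile uses simple linear interpolation, and the function names are hypothetical.

```python
import math


def percentile(sorted_values, q):
    """q-th percentile (0-100) of pre-sorted values, linear interpolation."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def out_of_regime(value, reference, low_q=5, high_q=95):
    """Flag a metric value outside the reference's 5th-95th percentile band."""
    ordered = sorted(reference)
    low = percentile(ordered, low_q)
    high = percentile(ordered, high_q)
    return value < low or value > high


# Synthetic reference values standing in for one matched stratum,
# e.g. speech rate for a given speaker-trait and interaction-state bin.
reference = [float(v) for v in range(101)]

# Flag rate when the reference is scored against itself.
flags = [out_of_regime(v, reference) for v in reference]
flag_rate = sum(flags) / len(flags)
```

A 5th-95th percentile band leaves roughly 10% of in-distribution values outside it, which is why a well-matched reference should produce a flag rate near the nominal 10% on held-out human data.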
eess.AS 2026-06-30

Frozen backbone keeps S2T skills while adding direct S2S output

by Yuxuan Hu, Heng Lu +9 more

Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation

PRIME-Speech trains only a post-decoder on hidden states to generate spoken responses without degrading original text reasoning.

Strong speech-to-text (S2T) LLMs already provide robust speech perception and text reasoning, but adding speech-to-speech (S2S) output is challenging: fine-tuning the backbone can degrade the original S2T performance, while attaching a downstream talker reintroduces a serial text-to-speech bottleneck. We present PRIME-Speech, a frozen-backbone S2S conversion framework that trains only speech-generation modules. PRIME-Speech synchronizes a causal audio post-decoder with intermediate hidden states of the frozen backbone, so codec tokens are generated from the model's evolving reasoning trajectory rather than from completed text chunks. The post-decoder uses mixed hidden-state, text, and audio-history conditioning, and a training-time packing strategy with turn-level audio KV-cache and position reset stabilizes multi-turn spoken interaction without additional multi-turn S2S training data. Multi-token prediction further reduces the effective codec prediction rate and improves first-audio latency without modifying the reasoning path. Across speech translation, spoken QA, speech understanding, and multi-turn dialogue, PRIME-Speech preserves the S2T behavior of the frozen backbone while producing accurate, low-WER spoken responses.
cs.CV 2026-06-30

Cache method speeds audio portrait videos up to 4x

by Juncheng Ma, Yuxuan Du +9 more

SyncCache: Exploiting Asymmetric Dynamics for Fast Audio-Driven Portrait Animation

It reuses stable background residuals across blocks while refreshing only audio-driven human regions to keep exact lip sync.

Diffusion Transformers (DiTs) have significantly advanced audio-driven portrait animation, but their high computational cost leads to substantial inference latency. Although training-free diffusion caching accelerates inference significantly, existing methods are primarily developed for text-conditioned generation and overlook the spatial and modality imbalances inherent in audio-driven portrait animation. In this paper, we propose SyncCache, a training-free caching acceleration method tailored for DiT-based portrait animation that explicitly exploits asymmetric dynamics. Specifically, high-frequency dynamics driven by audio conditions and concentrated in human regions are more challenging and critical to cache and reuse than the low-frequency visual background in portrait animation. First, we introduce Spatially-Asymmetric Probing to prioritize error sensitivity in dynamic human regions. Second, through Modality-Decoupled Caching, we bypass heavy DiT blocks by reusing stable inter-block residuals, while continuously recomputing lightweight audio blocks to preserve precise lip synchronization. Furthermore, we introduce a cache ratio to control cache capacity and formulate memory-adaptive cache selection as an offline dynamic programming problem without online overhead. Extensive experiments demonstrate that SyncCache achieves superior speed-quality trade-offs, delivering up to 4.12x acceleration on HunyuanVideo-Avatar and 3.75x on Wan-S2V with near-lossless visual fidelity and precise audio alignment.
cs.CV 2026-06-30

One tokenizer maps audio-video pairs to shared 1D tokens

by Kien T. Pham, I Chieh Chen +2 more

AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

Shared encoder and codebook enable joint reconstruction plus audio-to-video and video-to-audio tasks without separate branches.

Audio-video generation has recently gained unprecedented research attention, aiming to synthesize high-quality sounding video content with fine-grained synchronization and semantic alignment between the auditory and visual components. Prior methods predominantly adopt a dual-branch design with separate tokenization and generation modules per modality, neglecting the representation gap while necessitating intensive computational resources for proper training. Inspired by recent advancements in one-dimensional visual tokenization, we present AVTok, a novel unified tokenizer designed for holistic audio-video generation. AVTok features a dual-stream transformer-based architecture with a shared encoder-decoder and modality-specific learnable queries to efficiently and effectively encode an audio-video pair into a compact one-dimensional latent representation with a unified codebook. To cope with the heterogeneous information imbalance that hinders AVTok from exploiting aligned audio-visual information, we devise a hierarchical training strategy to progressively realize reconstruction capabilities for each modality. Extensive experiments demonstrate that AVTok excels both in audio-video reconstruction and when integrated into downstream pipelines for audio-to-video, video-to-audio, and class-conditional joint audio-video generation. AVTok paves the way for the challenge of joint audio-video tokenization and provides a potential direction to build unified large multimodal models for audio-video generation.
cs.SD 2026-06-30

Four probed layers beat full speech model on deepfakes

by Marjan Beheshti, Majid Rostami +1 more

Probing-Guided Layer Selection from Self-Supervised Speech Models for Generalizable Audio Deepfake Detection

Independent probes rank transformer layers by cross-domain power, then fuse only the strongest ones for lower error with far fewer parameter

Audio deepfake detection systems often fail to generalize across domains because they rely on features tied to specific attacks or recording conditions. Self-supervised speech models offer rich multi-layer representations, yet existing approaches either use a single layer or fuse all layers indiscriminately, and only reveal layer importance after training. We propose a model-agnostic, two-stage methodology that identifies informative depth zones before any task-specific model is trained. In the first stage, lightweight XGBoost probes evaluate each transformer layer's cross-domain discriminative power, producing a layer ranking. In the second stage, a compact neural classifier fuses only the selected layers through per-layer attention pooling and a shared bottleneck projection, while the backbone remains frozen. Applied across three backbones, the probing reveals two key findings. First, informative layers cluster in depth zones rather than at uniquely optimal positions: within-zone substitutions fall within multi-seed noise, while zone violations degrade performance by up to 5x. Second, the probing produces backbone-specific selections rather than a fixed layer recipe. On XLS-R-300M, four probing-selected layers with 1.34M trainable parameters achieve 4.94 +/- 0.32% equal error rate on In-The-Wild and 5.07% cross-domain average over four shared datasets, a 28% relative improvement over the best prior frozen-backbone result (Xiao and Vu, 2025) using all 25 layers with identical training data.
cs.SD 2026-06-30

Hierarchical model plans then refines song tracks for coherence

by Shun Lei, Huaicheng Zhang +9 more

LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training

Staged SFT and DPO training separates musicality from controllability and acoustics to beat open-source baselines.

Full-length song generation must preserve coherence and musicality, render detailed vocal and accompaniment acoustics, and follow lyrics and prompts. Existing language model-based systems face a structural trade-off: mixed-token modeling preserves vocal-instrument coordination but obscures track-specific details, whereas dual-track prediction improves acoustics but requires longer sequences and weakens global planning. We present LeVo 2, a hybrid LLM-Diffusion framework for controllable full-length song generation. LeVo 2 formulates this trade-off as hierarchical modeling: LeLM first predicts mixed tokens for semantic planning, then predicts vocal and accompaniment tokens in parallel for track-specific refinement, while a diffusion-based Music Codec reconstructs full-length waveforms. A central contribution of this extended version is an aesthetics-guided training schedule for alignment. During pre-training, an automated music aesthetic evaluation framework assigns musicality-tier conditions to large-scale data, providing musicality priors before preference alignment. Progressive post-training applies SFT, large-scale offline DPO, and closed-loop semi-online DPO to separately improve generation quality, controllability, and musicality. Modular extension then trains the Track-Specific LM for acoustic refinement while preserving the aligned semantic planner. This schedule separates musicality learning, controllability alignment, and acoustic refinement, mitigating optimization conflict and the limitations of static offline preference pairs. Expert listening tests and objective evaluations show that LeVo 2 outperforms open-source baselines across six subjective dimensions, and approaches leading commercial systems on several listening metrics. Ablations validate the effects of the training strategy, aesthetics guidance, scaling, and hierarchical architecture.
eess.AS 2026-06-30

Flow model edits sung lyrics while preserving melody and length

by Yoonjeong Park, Jaekwon Im +1 more

MeloDISinger: Melody-Aware & Duration-Preserving Singing Voice Editing with Audio Infilling

MeloDISinger predicts duration ratios via phonetic-melodic cross-attention to keep timing and tune intact during text changes.

Text-based singing voice editing (SVE) aims to revise sung lyrics while preserving the original melody, total duration, and non-edited regions. In this paper, we propose MeloDISinger, a flow-matching-based SVE model for melody-aware and duration-preserving editing. Its core module, MeloDRP, predicts fixed-budget duration ratios, enabling explicit span-wise duration control. For melody-aware duration allocation, MeloDRP fuses phonetic cues with pseudo-MIDI melodic context through cross-attention, while temporal-overlap supervision encourages soft phoneme–note correspondences. We further use a flow-matching mel decoder for audio infilling to synthesize edited regions while preserving surrounding context. In addition, we introduce a duration-aware edited-lyric generation pipeline using WhisperX and an LLM to construct feasible evaluation scenarios. Experiments demonstrate state-of-the-art performance in both objective and subjective evaluations.
cs.SD 2026-06-30

Saliency maps define reusable masks for sparse SER attacks

by Qiyang Sun, Yi Chang +2 more

SIGMA: Saliency-Guided Sparse Mask Attacks for Speech Emotion Recognition

One XAI-derived mask scopes magnitude-bounded updates and maintains competitive success rates across models while improving explanation cons

Speech conveys rich emotional information. As Speech Emotion Recognition (SER) is usually deployed in privacy-sensitive and reliability-critical environments, adversarial attacks on SER have attracted increasing attention. Existing sparse attacks control the number of perturbed elements, yet they often lack explainability guidance and explicit measures of explanation consistency. A unified treatment of sparsity and magnitude constraints is also uncommon. In addition, transferability across attack families and target models remains limited. Hence, we propose a SalIency-Guided sparse Mask Attack (SIGMA). On self-supervised speech features, we use post-hoc explainable artificial intelligence (XAI) techniques to produce saliency maps and identify the scope of the mask, and then restrict magnitude-bounded updates to this mask. The mask is computed once and can be reused across models and different sparsity attacks to amortise cost. We evaluate on the IEMOCAP and TESS datasets. Under matched budgets and across multiple sparse-attack settings, SIGMA maintains competitive attack success rates, navigating a conscious trade-off between attack efficacy and explanation consistency. SIGMA therefore provides an efficient and interpretable framework for analysing the vulnerability and explanation behaviour of SER models under structured perturbations.
cs.SD 2026-06-30

Timbre predictor scores match humans at r=0.66 and FAD rankings

by Théo Chasle Cauchy, Modan Tailleur +3 more

Predicting Timbre Traits for Interpretable Assessment of Musical Sound Synthesizers

The model analyzes individual sounds on 20 traits to show which synthesizer outputs and dimensions need work.

Measuring neural audio synthesizers' performance is now routinely conducted using distribution-based metrics such as the Fréchet Audio Distance (FAD). Although this metric can be correlated with human perception, it offers limited interpretability beyond ranking different approaches. In this paper, we introduce a deep neural timbre trait predictor composed of a pretrained audio neural embedding (CLAP), and a shallow learnable component. The latter is trained using the RWC musical instrument database and human judgments of 20 timbre descriptions (e.g., woody, percussive, rumbling) for 31 instruments. The resulting model shows strong correlation with average human ratings (r = 0.66, p < 0.001). We then demonstrate the benefit of this predictor for evaluating the performance of TokenSynth, a neural sound synthesizer. First, the Mean Absolute Error (MAE) computed over the set of generated sounds under different conditioning modalities of the model provides the same ranking as a FAD computed with the RWC database as a reference, suggesting that the proposed predictors are able to provide equivalent information on a distributional basis. Second, because the model is able to qualitatively analyze isolated sounds, we can determine which generated sounds could be improved and identify specific timbral dimensions that need adjustment.
cs.CL 2026-06-30

Joint prediction and reconstruction improves speech generation and speaker tasks

by Karl El Hajal, Mathew Magimai.-Doss

OLIVE: View-Augmented Latent Prediction with Waveform Reconstruction for Speech SSL

OLIVE keeps recognition performance competitive by using waveform reconstruction to retain signal details alongside masked prediction for in

We propose Online Latent prediction with Invariant Views and rEconstruction (OLIVE), a self-supervised speech representation learning framework that jointly optimizes analysis and synthesis objectives. OLIVE combines view-augmented masked latent prediction with waveform reconstruction under a unified objective. Reconstruction constrains early encoder features to retain signal-level information, while masked latent prediction shapes later contextual representations toward invariance for robust downstream performance. We show that these objectives enable representations that support a broad range of tasks. In particular, OLIVE improves results on generation and speaker tasks, maintains competitive performance on recognition and semantic tasks, and improves waveform reconstruction.
cs.SD 2026-06-30

Two-step scheme lifts audio transfer scores at fixed inference cost

by Ludovic K. Tuncay (IRIT-SAMoVA), Etienne Labbé (IRIT-SAMoVA) +1 more

BEST-RQ-2: Contextualize-Then-Predict, a Two-Step Approach for Self-Supervised Audio Representations

Decomposing masked prediction into context and prediction stages improves overall benchmark transfer without extra runtime compute.

Self-supervised learning enables audio representations that transfer across domains and tasks. We present BEST-RQ-2, an evolution of BEST-RQ that retains frozen random-projection-based discrete targets while introducing a two-step contextualize-then-predict pretraining scheme. A ViT context encoder processes only the unmasked spectrogram regions, and a lightweight predictor infers targets for the masked regions; the predictor is discarded after pretraining. Replacing the original Conformer encoder with a ViT shifts performance across domains, slightly reducing speech performance while improving music and environmental sounds, with comparable average scores. The main improvement comes from decomposing masked prediction into separate contextualization and prediction stages. On the X-ARES and XARES-LLM benchmarks, BEST-RQ-2 consistently outperforms one-stage baselines in overall transfer while keeping inference compute unchanged. Code and model checkpoints are publicly available.
cs.CV 2026-06-30

Cultural embeddings raise gesture quality without speaker identity

by Ariel Gjaci, Antonio Sgorbissa +1 more

SICAGE: Speaker-Independent Culture-Aware Gesture Generation using TED4C-L Dataset

Domain-generalization losses isolate culture from individual style, improving realism and consistency on a new four-group TED dataset.

Recent co-speech gesture generation methods often overlook cultural differences, limiting their effectiveness in human-agent interaction. Moreover, culture-conditioned models are rarely evaluated under speaker-disjoint splits, so apparent "cultural" behavior may be confounded with speaker-specific gesturing style. We introduce SICAGE, a modular framework for culture-aware co-speech gesture generation that conditions motion synthesis models on speaker-independent cultural representations. SICAGE learns these representations from audio and text by treating each speaker as a separate domain while imposing invariance across speakers. This encourages representations to remain culture-discriminative while reducing dependence on speaker identity. The resulting cultural embeddings condition a multimodal generator to produce culturally appropriate gestures. We instantiate this idea with two domain generalization approaches: adversarial learning and Fishr regularization. We further introduce ALaDiT, a real-time diffusion-based gesture generator designed to efficiently incorporate the learned cultural embeddings. To validate our method, we built TED4C-L, a 106-hour multimodal dataset of 764 TED speakers from four cultural groups. Experiments show that SICAGE improves motion realism, diversity, beat synchronization, semantic relevance, and cultural consistency.
cs.SD 2026-06-30

Child-adapted SSL models improve voice anonymization

by Pranav Tushar, Xiao Xiao Miao +1 more

Child-Centric Voice Anonymization in Single and Multi-Speaker Speech via Domain-Adapted SSL Models

Experiments on MyST data show better speech quality and privacy for kids in solo and mixed-speaker recordings.

Voice anonymization aims to protect speaker identity while preserving linguistic content and speech usability. However, most anonymization systems are developed on adult speech, leading to degraded performance when applied to child speech. This paper investigates child-centric anonymization by adapting a self-supervised learning (SSL) based anonymization pipeline to the child speech domain. The system is adapted using child speech from the MyST corpus and evaluated under both single-speaker and two-speaker mixture conditions. Experimental results show that child-domain adaptation improves intelligibility and perceptual quality while maintaining strong privacy protection. Extending the approach to multi-speaker further demonstrates that combining target speaker extraction with child-adapted anonymization provides privacy protection while preserving conversational structure. These findings highlight the importance of child-specific adaptation for practical speech anonymization systems.
eess.AS 2026-06-29

VIB layers cut noise degradation in LLM-based AVSR

by Piyush Arora, Navlika Singh +3 more

VIB-AVSR: Variational Information Bottleneck for Noise-Robust LLM-Based Audio-Visual Speech Recognition

Targeted insertion inside the backbone stabilizes outputs across SNR levels and noise types with no extra data or redesign.

Audio-Visual Speech Recognition takes two input modalities, acoustic and visual streams, where visual information from lip movements aids recognition when audio is noisy. Recently, LLM-based AVSR models have emerged as a promising paradigm by connecting pre-trained audio-visual encoders to an LLM, achieving strong results in clean conditions. However, these models are predominantly optimized for clean acoustic conditions, with limited attention to making the LLM backbone robust to noise. No explicit mechanism is employed to produce stable representations under corrupted audio, leading to performance degradation in noisy environments. To address this, we propose VIB-AVSR, which integrates Variational Information Bottleneck layers at targeted positions within the LLM backbone to regularize representations. VIB-AVSR reduces degradation under noisy conditions across multiple SNR levels and noise types, without requiring architectural modifications or additional training data.
cs.SD 2026-06-29

Session-grouped splits drop drone detection from 0.796 to 0.745

by David Shulman

EchoHawk: A Reproducible Acoustic Pipeline for Drone Detection, Classification, and Direction-Finding, with a Cautionary Study of Session-Level Data Leakage

Public acoustic datasets leak session information when clips are split randomly, inflating reported counter-drone performance.

Passive acoustic sensing is an attractive modality for counter-unmanned aerial system (counter-UAS) defence: it is covert, low-cost, and effective against drones with small radar cross-sections or minimal radio emissions. We present EchoHawk, an open and fully reproducible reference pipeline that detects a drone from its rotor harmonics, estimates its blade-passing frequency, and localises it with a microphone array via classical wideband beamforming (delay-and-sum, MVDR, MUSIC) and time-delay processing (GCC-PHAT, SRP-PHAT), followed by temporal tracking. We evaluate the system on a physically transparent synthetic benchmark that pits drones against hard low-frequency harmonic confusers, such as ground vehicles, and on real recorded audio. Our central methodological contribution is a documented case of session-level data leakage in a widely used public dataset: because its recordings are pre-segmented into short clips, naive clip-level splits place adjacent slices of the same continuous recording in both training and test sets, inflating reported performance. Enforcing recording-session-grouped cross-validation reduces, for example, a random-forest baseline's detection probability at a 1% false-alarm rate from 0.796 to 0.745, yielding honest numbers. All code, figures, and a synthetic data generator are released so that every result runs without any download.
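
The leakage described above comes from splitting at the clip level. A minimal sketch of a session-grouped split follows, using only the Python standard library. The function name `session_grouped_folds`, the session labels, and the round-robin fold assignment are illustrative assumptions, not the EchoHawk implementation.

```python
import random
from collections import defaultdict


def session_grouped_folds(session_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no session appears on both sides.

    session_ids[i] is the recording session that clip i was cut from.
    Sessions, not clips, are shuffled and dealt round-robin into folds.
    """
    by_session = defaultdict(list)
    for idx, sid in enumerate(session_ids):
        by_session[sid].append(idx)

    sessions = sorted(by_session)
    random.Random(seed).shuffle(sessions)

    for k in range(n_folds):
        test_sessions = set(sessions[k::n_folds])
        test_idx = [i for s in sessions if s in test_sessions for i in by_session[s]]
        train_idx = [i for s in sessions if s not in test_sessions for i in by_session[s]]
        yield train_idx, test_idx


# Hypothetical example: 14 clips cut from 4 continuous recordings.
clip_sessions = ["rec_a"] * 4 + ["rec_b"] * 3 + ["rec_c"] * 5 + ["rec_d"] * 2
folds = list(session_grouped_folds(clip_sessions, n_folds=2))
```

A naive clip-level shuffle would instead permute the 14 clip indices directly, which is what lets adjacent slices of one recording land in both training and test sets.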
cs.SD 2026-06-29

Time-frequency MoE raises SDR 3.8 dB at 4.1 GMACs/s for speech separation

by Qinzhe Hu, Chenda Li +4 more

TF-MoE: Time-Frequency Mixture-of-Experts for Efficient Speech Separation

Alternating expert modules increase capacity on a mel-band Conformer without raising inference cost beyond prior compact models.

Recent advances in speech separation (SS) have led to compact front-end models with small parameter sizes, yet their high computational cost remains a major barrier for deployment on edge devices. To address this, we propose TF-MoE, a sparse Mixture-of-Experts (MoE) framework that enhances model capacity with almost no increase in inference cost. Our method introduces dynamic expert specialization in time and frequency dimensions through alternating time-wise and frequency-wise MoE modules, each dynamically selecting experts per frame or mel band. Built upon a mel-band-splitting Conformer backbone, TF-MoE achieves strong performance on SS tasks under low-compute settings. Experimental results demonstrate that TF-MoE consistently improves separation performance under computation cost constraints, outperforming BSRNN by +3.8 dB SDR on Libri2Mix with comparable 4.1 GMACs/s inference cost. This positions TF-MoE as a promising candidate for edge-device deployment.
cs.SD 2026-06-29

Everyday audio edits flip deepfake detector results

by Nicolas M. Müller, Aditya Tirumala Bukkapatnam +1 more

Proteus: Automated Adversarial Robustness Testing for Audio Deepfake Detectors

A search framework finds chains of codec, noise and compression that fool detectors but keep speech clear and speakers recognizable.

We present Proteus, a framework developed at Resemble AI for automated robustness testing of our audio deepfake detection system. Given a detector, Proteus systematically searches over sequences of everyday audio transformations (codec transcoding, additive noise, reverberation, dynamic-range compression, and VoIP simulation) to find combinations that fool the detector while preserving speech quality. We propose two complementary search strategies: (1) a breadth-first search that exhaustively maps augmentation effectiveness across the parameter space, and (2) a Q-learning agent designed to efficiently discover deeper attack chains by exploiting structural patterns in the BFS data. We report findings from continuous deployment of Proteus against our production detector, showing that specific augmentation chains can reliably flip detection verdicts while preserving speech intelligibility and speaker identity. We discuss how these findings are used to harden the detector through targeted retraining.
cs.SD 2026-06-29

DOA spatial priors extract speakers without diarization

by Yichi Wang, Junzhe Chen +2 more

Position-Aware Target Speaker Extraction for Long-Form Multi-Party Conversations: A Diarization-Free Framework for ASR

Position-aware extraction generates attributed streams for simple VAD, yielding ASR gains over CSS in meetings.

In long-form multi-party conversations, highly imbalanced speaker activity and frequent overlap make it difficult to identify "who spoke when and what". Sliding-window continuous speech separation (CSS) mitigates sparse supervision, but often suffers from cross-window speaker inconsistency and residual crosstalk, which in practice requires diarization for reliable speaker attribution. Motivated by the stability of speakers' directions of arrival (DOAs) in meetings, we propose PATSE, a multi-channel Position-Aware Target Speaker Extraction front-end that uses DOA as a spatial prior to directly extract the speech of each target speaker. PATSE combines a DOA-guided spatial encoder and conditioner to generate speaker-attributed streams, from which speaker activity can be inferred via simple post-processing (e.g., VAD) without explicit diarization. Experiments on both replayed and real conversations show consistent ASR gains over CSS and diarization-based pipelines.
eess.AS 2026-06-29

Dynamic masking beats fixed-rate speech codecs at matched bitrate

by Hoyeol Sohn, Juhan Nam

DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection

A binary keep-mask and learned embedding let the codec drop redundant frames while counting every overhead bit and still raising quality and

Variable frame rate (VFR) coding has recently emerged in neural speech codecs, allocating fewer frames to redundant regions and more frames to rapidly changing speech. VFR must transmit side information about retained time steps, but prior gains are either not rigorously addressed or often minor once these overhead bits are included in total bitrate. We present Dynamic Token Masking (DTM)-Codec, a neural speech codec that demonstrates clear gains over fixed-frame-rate baselines under a strict matched-total-bitrate protocol. DTM keeps selected encoder tokens, fills masked positions with a learned <MASK> embedding, and transmits a binary keep-mask for position-aware decoding. We further introduce Path Length Equalization (PLE), a linear-time boundary selector for VFR coding that yields well-spread adaptive segments with negligible overhead. Across operating points, DTM-Codec broadly improves reconstruction quality and intelligibility over fixed-frame-rate baselines.
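
To see why the keep-mask overhead matters under a matched-total-bitrate protocol, the arithmetic can be sketched as below. All numbers and function names are hypothetical, and the sketch assumes the mask costs one uncompressed bit per encoder frame; the abstract does not state how DTM-Codec actually codes its mask.

```python
import math


def fixed_rate_bps(frame_rate_hz, token_bits):
    """Bitrate of a fixed-frame-rate codec: every frame sends one token."""
    return frame_rate_hz * token_bits


def vfr_bps(frame_rate_hz, keep_ratio, token_bits, mask_bits_per_frame=1.0):
    """Total VFR bitrate: tokens for kept frames plus keep-mask side information.

    Assumption: the mask is sent for every encoder frame, kept or dropped.
    """
    return frame_rate_hz * (keep_ratio * token_bits + mask_bits_per_frame)


# Hypothetical operating point: 50 Hz encoder, one 1024-entry codebook.
token_bits = math.log2(1024)          # bits per token
fixed = fixed_rate_bps(50, token_bits)
vfr = vfr_bps(50, keep_ratio=0.5, token_bits=token_bits)

# Keep ratio at which VFR and fixed-rate totals are equal under this assumption.
break_even_keep_ratio = 1.0 - 1.0 / token_bits
```

Under these assumptions the mask alone costs as much as one tenth of the token payload, so a VFR scheme only saves bits once it drops more than that fraction of frames.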
cs.LG 2026-06-29

Router weights audio and face inputs for 99% polyglot ID accuracy

by Chuxiao Zuo, Yao Zhu +4 more

AMR: Adaptive Modality Routing for Multimodal Polyglot Speaker Identification

Adaptive Modality Routing uses per-sample adapters and KL-supervised training to maintain performance when modalities drop or languages chan

Multimodal speaker identification systems face two key challenges in real-world deployment: missing modalities and language mismatch between training and testing conditions. In practical scenarios, background multi-speaker conversations, ambient noise, and overlapping speech further degrade identification accuracy. To address these challenges, we propose a multimodal polyglot speaker identification system for the POLY-SIM 2026 Grand Challenge. The system is fundamentally built upon Adaptive Modality Routing (AMR), a modality fusion module that dynamically assesses per-sample input quality and integrates modality information. Specifically, AMR employs two modality adapters to process the embeddings extracted from a linguistically robust audio encoder (W2V-BERT 2.0) and a large-scale pretrained face encoder (IResNet-18), producing modality-adapted embeddings. Based on these adapted embeddings, a trainable router estimates dynamic modality weights, which are subsequently applied to aggregate the modality-specific logits for the final prediction. To optimize this routing mechanism, we adopt a modality-aware training strategy that constructs four types of sample pairs to simulate diverse input conditions, with KL divergence serving as explicit supervision for weight assignment. Experimental results on the POLY-SIM 2026 evaluation set show that the proposed system achieves identification accuracy of 99.93% (English multimodal, P3), 100.00% (Urdu multimodal, P5), 97.50% (English audio-only, P4), and 98.83% (Urdu audio-only, P6). The average accuracy across all four protocols is 99.07%, surpassing the Fusion and Orthogonal Projection (FOP) baseline by 32.73%.
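
The weighted aggregation of modality-specific logits can be illustrated with a small sketch. It assumes the router emits one raw score per modality and that scores are normalized with a softmax; both choices, along with the names `softmax` and `fuse_logits`, are illustrative and not confirmed by the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(router_scores, audio_logits, face_logits):
    """Combine per-modality speaker logits with per-sample router weights.

    router_scores: [audio_score, face_score], hypothetical router outputs.
    """
    w_audio, w_face = softmax(router_scores)
    return [w_audio * a + w_face * f for a, f in zip(audio_logits, face_logits)]


# Equal router scores reduce to a plain average of the two logit vectors.
balanced = fuse_logits([0.0, 0.0], [1.0, 3.0], [3.0, 1.0])

# A strongly negative face score (e.g. an occluded face) pushes the
# decision toward the audio branch.
audio_heavy = fuse_logits([0.0, -20.0], [1.0, 3.0], [3.0, 1.0])
```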
physics.med-ph 2026-06-29

Single-DOF model sustains phonation oscillation with added forces

by Sardar Nafis Bin Ali, Maryam Naghibolhosseini +1 more

An Optimal Contact-Mechanically Consistent and Flow-Separation Adapted Modeling of Vocal Fold Dynamics

Resistance and closure terms let a damped mass-spring system match subject glottal waveforms to under 3 percent error without vocal-tract co

Single mass-spring-damper models of vocal folds have been effective in simulating vocal fold vibrations without added complexity. However, single-degree-of-freedom models cannot sustain oscillation in the presence of structural damping unless source-tract interaction is considered. Moreover, existing lumped models struggle to accurately simulate vocal fold closure during phonation. This study aims to develop a reliable and simplified single-degree-of-freedom model of phonation that can simulate sustained oscillation in a damped system without incorporating a vocal tract model. Additionally, the proposed model maintains vocal fold closure in a manner consistent with the physics of phonation, addressing a longstanding challenge in existing lumped models. High-speed videoendoscopy (HSV) data from four normophonic subjects producing sustained vowel /i/ were used to extract glottal area waveforms (GAWs) via deep learning-based image segmentation for particle swarm optimization of the model parameters. An additional resistance force was incorporated to compensate for flow separation and generate the force imbalance required for sustained oscillation. An external structural force was also added during closure to sustain the closed phase. The 4th-order Runge-Kutta method was used to solve the governing equations with enhanced numerical stability and accuracy. The model parameters were optimized for individual subjects, resulting in normalized errors below 3% between experimental and simulated GAWs. The proposed model accurately reproduced subject-specific vocal fold vibrations and vocal fold closure in agreement with experimental data. Overall, the proposed model provides a computationally efficient framework for simulating sustained phonation without requiring complex source-tract coupling while capturing the key biomechanical and aerodynamic mechanisms of phonation.
cs.SD 2026-06-29

Alignment technique lifts cross-domain ship detection by 42.6%

by Quoc Thinh Vo, David K. Han

Underwater Source Detection and Classification for Signal-based Surveillance: Audio Dataset Curation and Cross-Domain Evaluation

A new underwater dataset plus margin-enhanced loss and feature alignment yield stronger robustness when models move between acoustic domains

Machine learning for underwater acoustics is constrained by the scarcity of publicly available labeled datasets. In contrast to air-acoustic domains, where large benchmarks enable rapid model development, underwater datasets are typically small and limited in acoustic diversity, restricting robust model training and cross-domain generalization. To help address this gap, we introduce a curated underwater audio dataset derived from an open-source maritime sound archive. The dataset contains over one thousand labeled audio segments across eight biologically and mechanically relevant acoustic classes, providing an additional resource for training models in data-limited underwater environments. Additionally, we establish a lightweight Convolutional Neural Network (CNN) baseline and propose a margin-enhanced loss with feature alignment to mitigate class confusion arising from data imbalance, acoustic similarity, and cross-domain mismatch. While the baseline achieves 96.35% in-domain accuracy, evaluation on ShipsEar reveals substantial domain shift; the proposed feature alignment improves zero-shot ship detection by 42.60%, demonstrating stronger robustness under distribution mismatch. We further release a transparent curation pipeline and reproducible benchmark to support future research on imbalance mitigation, domain adaptation, and data-efficient underwater acoustic classification.
cs.SD 2026-06-29

Clustering DINO features drops speech poisoning success to 0.25%

by Thomas Thebaud, Sonal Joshi +5 more

Clustering Unsupervised Representations as Defense against Poisoning Attacks on Speech Commands Classification System

Majority label inside each cluster removes most relabeled trigger samples before training.

Poisoning attacks entail attackers intentionally tampering with training data. In this paper, we consider a dirty-label poisoning attack scenario on a speech commands classification system. The threat model assumes that certain utterances from one of the classes (source class) are poisoned by superimposing a trigger on them, and their labels are changed to another class selected by the attacker (target class). We propose a filtering defense against such an attack. First, we use DIstillation with NO labels (DINO) to learn unsupervised representations for all the training examples. Next, we use K-means and LDA to cluster these representations. Finally, we keep the utterances with the most repeated label in their cluster for training and discard the rest. For a 10% poisoned source class, we demonstrate a drop in attack success rate from 99.75% to 0.25%. We test our defense against a variety of threat models, including different target and source classes, as well as trigger variations.
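
The final filtering step, keeping only utterances whose label matches the majority label of their cluster, can be sketched as follows. The cluster assignments are taken as given (the DINO, K-means and LDA stages are not reproduced), and the function name and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def filter_by_cluster_majority(cluster_ids, labels):
    """Return indices of samples whose label equals their cluster's majority label.

    Ties are broken by whichever label was seen first in the cluster,
    which is an arbitrary choice made for this sketch.
    """
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        counts[cluster][label] += 1

    majority = {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}

    return [
        i
        for i, (cluster, label) in enumerate(zip(cluster_ids, labels))
        if label == majority[cluster]
    ]


# Toy data: sample 3 is a "yes" utterance relabeled "no" by the attacker.
# Its representation still clusters with the genuine "yes" utterances.
cluster_ids = [0, 0, 0, 0, 1, 1, 1]
labels = ["yes", "yes", "yes", "no", "no", "no", "no"]
kept = filter_by_cluster_majority(cluster_ids, labels)
```

The relabeled sample is dropped because its label disagrees with the majority of its cluster, while the genuine "no" utterances in the other cluster are kept.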
cs.SD 2026-06-29

wav2VOT matches existing tools for VOT estimation with wav2vec2

by James Tanner, Morgan Sonderegger +3 more

wav2VOT: Automatic estimation of voice onset time, closure duration, and burst realisation with wav2vec2

The system reaches comparable accuracy on unseen data and improves further after fine-tuning on target speech.

While automatic tools for speech annotation are now commonplace within phonetic research pipelines, many tasks require substantial manual correction or training sets to perform accurately. Simultaneously, large speech models such as wav2vec2 have been shown to perform well at speech classification tasks, raising the question of how these models may be applied to phonetic annotation tasks. We introduce wav2VOT: a tool for the automatic estimation of voice onset time, closure duration, and burst realisation using wav2vec2. We demonstrate that wav2VOT performs comparably with current approaches on unseen datasets, and can produce highly accurate estimates with fine-tuning. Analysis of wav2VOT predictions demonstrates high fidelity across stop voicing and place of articulation. These results demonstrate that large speech models are capable of producing accurate annotations, and further motivate exploration of large speech models as tools in phonetic research pipelines.
cs.SD 2026-06-29

Audio embeddings from language models enable instruction-based search

by Fengjie Lu, Chenang Jiang +3 more

ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models

ALM2Vec pulls capabilities from large audio-language models to create one embedding space for many retrieval tasks and natural language cont

Recent advances in language–audio retrieval have been largely driven by contrastive dual-encoder architectures that align audio and text in a shared embedding space. While effective, existing retrieval embeddings are primarily optimized for audio–caption matching, limiting their ability to support diverse retrieval objectives and controllable retrieval behaviors. We present ALM2Vec, a universal audio embedding framework derived from pretrained large audio–language models (LALMs). By transferring the audio understanding, instruction-following, and reasoning capabilities acquired through large-scale multimodal training, ALM2Vec learns a unified embedding space for retrieval across audio domains and task types. Beyond conventional text–audio retrieval, ALM2Vec incorporates natural-language instructions into the embedding process, enabling instruction-aware retrieval for scenarios such as audio question answering and aspect-conditioned retrieval. Experimental results show that ALM2Vec achieves competitive performance on standard audio and speech retrieval benchmarks while exhibiting promising compositional and controllable retrieval capabilities, highlighting its potential as a unified audio embedding model for retrieval across domains, tasks, and user intents.
eess.AS 2026-06-29

Codec splits emotion from content to fix TTS reward conflicts

by Sihang Nie, Xiaofen Xing +6 more

HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

HPRO extracts separate style tokens and aligns rewards at frame-to-sentence scales so emotional expressiveness rises while intelligibility h

Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting emotional expressiveness. While preference-driven optimization offers a promising alternative, existing approaches suffer from two structural mismatches: information conflict, where content and emotion in a shared latent space produce conflicting gradients, leading to reward hacking and semantic degradation; and scale gap, where sparse sentence-level rewards struggle to guide dense frame-level generation. To overcome these challenges, we propose HPRO, a hierarchical progressive reward optimization framework. Within HPRO, we introduce the HD-Emo codec as a novel differentiable reward model to resolve the information conflict. It extracts speech into distinct content and style preference tokens, structurally isolating emotional optimization from semantic content. Building upon this structured preference space, HPRO bridges the scale gap by progressively aligning frame-, word- and sentence-level objectives. Experiments demonstrate that HPRO significantly enhances emotional expressiveness, while effectively preserving linguistic intelligibility. The code and audio samples are publicly available at https://xxh333.github.io/hpro-demo/.
cs.SD 2026-06-29

Voice clustering links repeated speakers across 121 anonymized calls

by Muhammad Shakeel Akram, Amal Htait +3 more

DG^VoiC: Speaker Clustering for Fraud Investigation under Real Call-Centre Conditions

96 percent AMI on human-verified reference set shows cross-profile linkage is feasible for fraud checks

Insurance fraud remains costly and operationally difficult, particularly in call-centre workflows where many customer interactions begin at first notice of loss (FNOL). While recent fraud detection methods mainly rely on structured data, text, or images, repeated speaker identity across calls remains underused as an investigative signal. This paper presents DG^VoiC, a voice clustering framework for customer verification and cross-profile speaker linking on anonymised real call-centre audio. The approach combines sensitive information-aligned anonymisation, speech-focused preprocessing, sliding-window speaker embedding extraction, and cosine-similarity-based clustering to identify repeated speakers under real telephony conditions. The method was evaluated on 121 recordings, with a curated reference subset of 56 samples in 22 human-agreed speaker clusters used for validation. The best configuration achieved 96% AMI, 95% ARI, 98% completeness, 100% homogeneity, and 99% V-measure. These results show that speaker clustering can provide a strong additional signal for fraud investigation by helping analysts verify speaker consistency and surface repeated voices across customers.
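As a rough illustration of cosine-similarity-based clustering of speaker embeddings, the sketch below links any two recordings whose embeddings exceed a similarity threshold and reads clusters off the resulting connected components. The single-link rule, the `threshold` value, and the toy two-dimensional vectors are assumptions made for illustration; the abstract does not state which linkage or threshold DG^VoiC uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_by_cosine(embeddings, threshold=0.75):
    """Single-link clustering: join items whose similarity >= threshold.

    `threshold` is a hypothetical value, not one taken from the paper.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    # Relabel component roots as consecutive cluster ids.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(embeddings))]


# Toy embeddings: two recordings per (hypothetical) speaker.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95]]
labels = cluster_by_cosine(toy)
print(labels)
```

The reported scores are also internally consistent: V-measure is the harmonic mean of homogeneity and completeness, and 2 × 1.00 × 0.98 / (1.00 + 0.98) ≈ 0.99.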
cs.SD 2026-06-29

Match files extended to encode repeated and improvised note links

by Suhit Chiruthapudi, Adam Štefunko +4 more

A Flexible Encoding Model for Non-Unique Note Alignments

Virtual pointer notes allow multiple performance-to-score connections while old parsers continue to work unchanged.

Symbolic music alignment links notes in a symbolic performance to their counterparts in a score. While existing alignment encoding formats provide unique correspondences between these notes, various musical practices and forms, such as practice repetitions in rehearsal and improvised realizations in basso continuo, require a more flexible approach to encoding their alignments. In this paper, we propose a minimal, backward-compatible extension to the Match file format to support such non-unique and semantically complex alignments. We introduce two virtual pointer notes (virtual score notes and virtual performance notes), which allow multiple links between performance and score notes to be encoded. In addition, we expand the Match file's 'section' line to include semantically meaningful annotations of performance regions beyond score-indicated musical repetitions. We further demonstrate the utility of these extensions through two representative use cases in piano rehearsal and basso continuo.
cs.SD 2026-06-29

Grammar parses audio events into activity hierarchies without extra labels

by Peng Zhang, Qingyu Luo +2 more

Grammar-Guided Hierarchical Parsing for Long-form Audio Activity Recognition

Order-consistent trees from event posteriors yield sub-activities and classifications via grammar constraints

Long-form audio exhibits an inherent hierarchy: fine-grained events form sub-activities, which in turn constitute higher-level activities. Prior work often models these levels separately, leading to cross-level inconsistencies and requiring supervision at multiple levels. We formulate the problem as hierarchical parsing from event-level evidence: given detected event segments with class posteriors, we infer an order-consistent Act-Sub-Event parse tree. We propose Hierarchical Activity Grammar, encoding hierarchical composition and temporal-order constraints, and perform grammar-guided decoding that combines event evidence with a grammar prior. This yields a temporally grounded parse tree from which sub-activity segmentation and activity classification are derived, without requiring sub-activity or activity labels for training. Experiments on the long-form MultiAct audio dataset demonstrate improved temporal-order consistency (Edit score) and produce interpretable hierarchies.
cs.SD 2026-06-29

LoRA-tuned LLM reaches 90.14 F1 for dementia from four speech views

by Jonghyeon Park, Olivier Jiyoun Jung +1 more

LoRA-Tuned Large Language Models for Dementia Detection via Multi-View Speech-Derived Features

One prompt combines transcripts, topics, fluency and phonology so a single adapted model handles the task without fusion stages.

Early detection of dementia enables timely intervention, and spontaneous speech, which reflects cognitive impairment, offers a non-invasive screening modality. Conventional approaches often focus on a single representational dimension, such as acoustic descriptors, pause modeling, automatic speech recognition (ASR) transcripts, or multimodal fusion, which limits integrative reasoning across heterogeneous cognitive symptoms. We propose a low-rank adaptation (LoRA)-tuned large language model (LLM) that performs structured multi-view reasoning over four complementary speech-derived signals: ASR transcripts with pause markers, discourse-level topic cues, temporal fluency statistics, and phonological sequences. These cues are encoded within a unified prompt, enabling a single LLM to learn a coherent decision function without modality-specific encoders or late-stage fusion. On ADReSSo, our best model achieves an F1-score of 90.14%, and ablation confirms the complementary contribution of each view.
cs.SD 2026-06-29

Pretrained audio taggers transfer to sound localization via FOA descriptors

by Stefano Giacomelli, Stefano Damiano +3 more

From General-Purpose Audio Tagging to Spatially Grounded Sound Event Localization and Detection

Multi-stage search shows magnitude-phase-intensity vectors and early spatial encoding let semantic priors aid event detection and direction

This report investigates the extension of pretrained General-Purpose Audio Tagging (GP-AT) models toward spatially grounded Sound Event Localization and Detection (SELD). The proposed AT2SELD framework couples a pretrained AT backbone with compact First-Order Ambisonics (FOA) spatial processing, track-wise SED and Cartesian DOA estimation, permutation-aware supervision, and calibration. It characterizes how semantic audio priors support localization-aware scene analysis under data, computation, and deployment constraints. The framework is developed through informed multi-stage Neural Architecture Search (NAS). Stage 1 shows that spectral FOA descriptors, based on magnitude, phase, and Intensity Vectors (IVs), provide the most reliable interface for semantic-to-spatial transfer. Stage 2 identifies early residual spatial encoding as the main capacity-sensitive component, while late track-wise abstraction and recurrent smoothing act mainly as refinement stages. Stage 3 shows that late cross-stitch coupling improves semantic-spatial interaction, whereas early fusion is costlier and less effective. Diagnostic evaluation analyzes the selected architecture under class balancing, focal loss, activity-conditioned DOA supervision, threshold calibration, and transfer across STARSS23, TAU2019, TAU-NIGENS2020, and TAU-NIGENS2021. Focal loss improves the activity point, active-only DOA supervision mitigates inactive-target dominance, and validation-selected thresholds recover calibration without replacing spatial learning. Cross-dataset and oracle-activity analyses indicate strong fixed-source localization on TAU2019, transferable representations from TAU-NIGENS2021, and meaningful but uncertain behavior on STARSS23. Overall, GP-AT priors appear promising for SELD design when embedded in spatial-aware architectures and optimized through integrated calibration and deployment-oriented strategies.
cs.CL 2026-06-29

Multilingual training improves emphasis model transfer across languages

by Megan Wei, Deepali Aneja +4 more

Do Speech Emphasis Models Generalize across Languages and Emotions?

New corpus of 10,000 utterances shows robust cross-emotion performance and holds at smaller data scales.

Prosodic emphasis varies across languages, emotions, and speaking styles, yet existing emphasis detection models are largely trained and evaluated on monolingual neutral read speech. We introduce MMEE (Multilingual Multi-Emotion Emphasis), a corpus of 10,000 professionally recorded expressive utterances (14.13 hours) across 7 languages and 34 emotion/style categories, with three-level perceptual labels (10 annotations per sample). We benchmark two state-of-the-art architectures under monolingual, cross-lingual, multilingual, cross-emotion, cross-dataset, and data-scale settings. Monolingual models show limited zero-shot transfer, degrading across typologically distant languages, while multilingual training substantially improves robustness. Models transfer robustly between high- and low-arousal emotions; bidirectional transfer between synthetic and perceptual benchmarks suggests shared prosodic structure; and performance stays robust even at smaller training scales.
cs.SD 2026-06-29

Acoustic-aware simulation raises voice AI word error rates by up to 94.5%

by Andrew C. Cullen, Neil G. Marchant +5 more

Room for Error: Large-Scale Simulation of Over-the-Air Acoustic Attacks

Large-scale testing of over-the-air attacks shows physical acoustics sharply increase word error rates in speech recognition models.

While voice control is rapidly becoming a ubiquitous vector of human-AI communication, the risks facing these systems remain poorly understood. This is, in part, a product of the difficulties in scaling strictly digital adversarial workflows to the physical world. These scale barriers have led the community to abstract away key acoustic factors relating to detectability and the influence of geometry on acoustics. These methodological and metrological shortcomings undermine our understanding of risk. We illuminate these issues through real-world testing, conceptual discussions, and a novel, high-throughput reality simulation framework. Across more than 8 million adversarial evaluations, we demonstrate that acoustic awareness yields relative Word Error Rate increases of up to 94.5% under Whisper and wav2vec. We employ this framework to formalize and operationalize a Dual-Form Signal-to-Noise Ratio that decouples source stealth from victim attack efficacy, resolving a crucial limitation in current works. This lays the groundwork for repeatable, verifiable research that embraces, rather than abstracts, the acoustic environment.
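To make the "relative Word Error Rate increase" figure concrete, here is a minimal sketch of word-level WER (edit distance over reference length) and of a relative increase between two conditions. The example sentences and the choice of baseline are invented for illustration; the abstract does not specify which two conditions the paper compares.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def relative_increase(baseline, value):
    """Relative change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline


reference = "turn on the kitchen lights"
wer_a = wer(reference, "turn on the kitchen light")   # hypothetical condition A
wer_b = wer(reference, "turn off the chicken light")  # hypothetical condition B
print(wer_a, wer_b, relative_increase(wer_a, wer_b))
```

A relative increase is measured against the baseline WER, so it should not be read as an absolute number of percentage points.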
cs.LG 2026-06-29

Certification pipeline cuts speech recognition errors by up to 55%

by Andrew C. Cullen, Neil G. Marchant +3 more

What Was That Again? Certified Robustness for Automatic Speech Recognition

Dual audit certifies correct tokens and excludes attacks without oracle knowledge of the true words.

Automatic Speech Recognition systems are notoriously sensitive to both adversarial and benign perturbations. While this has been repeatedly demonstrated using reference datasets, detecting such behaviors in deployed systems is incredibly challenging, due to the absence of oracle knowledge of the true transcription. We demonstrate that employing a certification-inspired mechanism can significantly decrease WER, increase recall, and decrease the Spearman correlation between confidence and WER. We achieve this through a dual-gate diagnostic pipeline: a Two-Sided Atomic Audit that accumulates statistical wealth to certify both token existence and adversarial exclusion, and a Rank-Based Tournament that selects the winning sequence. Our evaluations across four diverse architectures demonstrate up to a 55% relative reduction in Word Error Rate, while also providing granular word- and sentence-level certifications to enhance acoustic security.
cs.SD 2026-06-26

WavLM reaches 78.2% accuracy on vocal effort classification

by Zahra Omidi, John H. L. Hansen

Advancing Speaker-Based Vocal Effort Classification with WavLM and Data Augmentation in Naturalistic Non-Calibrated Speech Recordings

Augmentations and Gaussian soft labels cut boundary errors across whisper to shout in naturalistic recordings.

Variations in vocal effort (e.g., whisper, soft, neutral, loud, shout) alter speech production and acoustics, reducing intelligibility and limiting the robustness of any subsequent speech technology. Classification is challenging since effort lies on a continuum, adjacent categories are easily confused, and labeled data remain scarce. Prior SSL approaches with wav2vec2, HuBERT, and AST improve performance on the AVID corpus but still suffer from boundary errors. In this study, we introduce WavLM for the first time in vocal effort classification and benchmark it against wav2vec2 and HuBERT. To address data scarcity, we conduct a systematic study of augmentation strategies, covering RIR convolution, additive noise, time masking, speed perturbation, band-limiting, MixUp, and CutMix. Augmentation consistently improves WavLM, with gains ranging from +0.6% to +1.8% absolute. We also propose Gaussian-neighbor soft labels, which further reduce near-boundary confusions by modeling the vocal effort continuum. Our best system, WavLM-BASE with gradual unfreezing, augmentation, and Gaussian-neighbor soft labels, achieves 78.2% mean accuracy, establishing a new state of the art on AVID.
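One plausible reading of "Gaussian-neighbor soft labels" is a target distribution that peaks at the annotated effort level and leaks some probability mass to adjacent levels according to a Gaussian over class indices. The sketch below follows that reading; the `sigma` value, the class ordering, and the exact functional form are assumptions, since the abstract gives none of them.

```python
import math

# Ordered effort levels named in the abstract; the indexing is an assumption.
EFFORT_LEVELS = ["whisper", "soft", "neutral", "loud", "shout"]


def gaussian_soft_label(true_index, num_classes=len(EFFORT_LEVELS), sigma=0.7):
    """Soft target centred on `true_index`, decaying with ordinal distance.

    `sigma` (hypothetical) controls how much mass goes to neighbouring classes.
    """
    weights = [math.exp(-((k - true_index) ** 2) / (2 * sigma ** 2))
               for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


target = gaussian_soft_label(EFFORT_LEVELS.index("neutral"))
for name, prob in zip(EFFORT_LEVELS, target):
    print(f"{name:8s} {prob:.3f}")
```

Training against such a target with cross-entropy penalizes a "soft" versus "neutral" confusion less than a "whisper" versus "shout" confusion, which matches the abstract's description of effort as a continuum.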
cs.SD 2026-06-26

Vote distributions cut divergence in speech emotion recognition

by Zahra Omidi, John H.L. Hansen

Learning from Annotation Uncertainty: Entropy-Aware Curriculum for Speech Emotion Recognition

Training on annotator vote spreads rather than single consensus labels reduces JSD and KLD to human disagreement patterns.

Speech emotion recognition (SER) often relies on hard consensus labels that collapse annotator disagreement. We study distribution-based supervision for 9-class SER on MSP-Podcast 2.0 using a WavLM-Base multitask model for categorical emotion and dimensional VAD. Hard-label training is compared with targets from primary and merged primary-secondary annotator vote distributions. Distributional objectives improve alignment with human vote distributions, reducing JSD/KLD relative to hard-label training. Analysis shows that hard supervision partly benefits from assigning ambiguous utterances to the residual Other class, whereas distributional supervision redistributes uncertainty across emotion categories. Entropy-stratified evaluation shows that high-ambiguity utterances remain challenging, but distribution-based supervision better captures perceptual uncertainty. These findings support moving beyond hard labels toward targets that reflect listener disagreement.
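The JSD/KLD comparison can be illustrated with a small sketch that turns annotator votes into a distribution and measures how far a prediction is from it. The three-class toy votes and the two predictions are invented for illustration (the paper uses nine classes), and the `eps` floor is an assumption added to keep the logarithm finite.

```python
import math


def vote_distribution(votes, num_classes):
    """Normalize a list of class-index votes into a probability distribution."""
    counts = [0] * num_classes
    for v in votes:
        counts[v] += 1
    return [c / len(votes) for c in counts]


def kld(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) in nats; q is floored at eps."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)


human = vote_distribution([0, 0, 1, 2], num_classes=3)  # [0.5, 0.25, 0.25]
hard_pred = [1.0, 0.0, 0.0]   # prediction collapsed onto the majority class
soft_pred = [0.5, 0.3, 0.2]   # prediction that keeps some disagreement
print(jsd(human, hard_pred), jsd(human, soft_pred))
```

In this toy case the prediction that preserves annotator disagreement is closer to the vote distribution under both divergences.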
cs.SD 2026-06-26

Learned predictor skips frames in audio autoencoders

by Dimitrios Bralios, Paris Smaragdis +1 more

Elastic Time: Dynamic Frame Rate Bottlenecks for Neural Audio Coding

Elastic Time turns fixed-rate models dynamic, enabling post-training rate control and shorter latent sequences.

Neural audio autoencoders have become a core component of compression, feature extraction, and generation. However, while existing systems support variable bitrate, the vast majority of models still operate at a fixed latent frame-rate, allocating equal temporal budget to regions with very different information density, which can result in unnecessarily long sequences. We introduce Elastic Time, a dynamic frame-rate bottleneck that converts fixed-frame-rate autoencoders to dynamic ones. Our method learns a lightweight latent predictor used to decide which frames can be skipped and later reconstructed, enabling efficient greedy boundary selection at inference. Experiments show our method enables deployment-time rate control while improving efficiency-quality tradeoffs relative to baselines. Overall, we provide a flexible mechanism for adjusting temporal resolution in audio autoencoders, potentially facilitating more efficient downstream modeling for generation and long-context tasks.
eess.AS 2026-06-26

Tool logs exact effort to label speaker turns in audio

by Fumiaki Yamaguchi

voxmap-studio: An open-source speaker diarization annotation tool with built-in cost instrumentation

Automatic initialization and uncertainty highlights lower cost in test on nine files by turning creation into correction.

Labeling speaker diarization data is costly, yet annotation tools rarely measure that cost. We present voxmap-studio, an open-source, React-based diarization annotation tool integrated with the pyannote-based diarization ecosystem. Its canvas is initialized by a fast stride-accelerated diarization engine so that the annotator corrects a hypothesis rather than drawing every speaker turn by hand, and the tool records annotation cost - typed edit-operation counts and time - as a first-class output, enabling quantitative comparison of how much different forms of assistance actually help. Export is gated on per-segment human confirmation and guarded by injected "phantom" attention checks, which prevent unverified automatic output from being released as ground truth. In a preliminary study on nine AMI audio files, unassisted manual annotation was the costliest and least accurate, and automatic initialization shifted the work from creating turns to correcting them; highlighting uncertain segments gave the lowest cost in our small sample. The tool and its instrumentation are open source.
cs.SD 2026-06-26

Audio tokenization scales while keeping explicit pairwise alignment

by Adhiraj Banerjee, Vipul Arora

wav2tok 2.0: Scalable Audio Tokenization Maintaining Explicit Pairwise Token Alignment for Efficient Audio Retrieval

wav2tok 2.0 stages contrastive learning before CTC and DTW losses to raise spoken term detection accuracy without losing efficiency.

Learning discrete speech representations that preserve similarity across variable-length utterances is central to query-by-example spoken term detection (QbE-STD). While wav2tok introduced CTC-based sequence alignment to enforce token consistency, its tightly coupled clustering and alignment training recipe limits scalability. We propose wav2tok 2.0, a scalable alignment-aware speech tokenizer built on the BEST-STD backbone. wav2tok 2.0 employs staged training, first learning discriminative, speaker-invariant representations via contrastive learning and vector quantization, and then enforcing pairwise token consistency using a CTC alignment loss and a novel DTW-aligned framewise prediction objective with adaptive weighting. Experiments show that wav2tok 2.0 consistently outperforms BEST-STD and general-purpose tokenizers on QbE-STD while remaining efficient and scalable.
cs.SD 2026-06-26

Dual-encoder with gated attention scores 0.836 on audio challenge

by Mingda Lin, Lei Ding +7 more

WQ-Fusion: Dynamic Gated Attention for Cross-Domain Audio Representation

Dynamic routing of features from Whisper and Qwen improves results across acoustic domains.

While pre-trained models excel in specialized tasks, learning universal representations across diverse acoustic domains remains challenging. To address this, we propose WQ-Fusion, a robust dual-encoder framework for cross-domain audio representation learning. Overcoming the limitations of static concatenation, WQ-Fusion integrates Whisper and Qwen via an Adaptive Feature Modulation module and a novel element-wise gated attention mechanism. This design enables dynamic feature selection, allowing the model to selectively emphasize relevant acoustic and semantic dimensions. Extensive experiments on the Interspeech 2026 Audio Encoder Capability Challenge (Track A) benchmark demonstrate that by effectively routing heterogeneous information, WQ-Fusion achieves a superior overall score of 0.836, significantly outperforming the strongest single-encoder baseline.
cs.SD 2026-06-26

Test-time RL adapts zero-shot TTS to uncommon speech styles

by Tianxin Xie, Chenxing Li +2 more

VoiceTTA: Enhancing Zero-Shot Text-to-Speech via Reinforcement Learning-Based Test-Time Adaptation

Optimizing prefixes with F0, energy, similarity and WER rewards improves imitation on rare prompts without retraining.

Recently, zero-shot text-to-speech (TTS) has enabled high-fidelity and expressive speech synthesis, but it often fails to imitate unseen speaking styles from uncommon scenarios (e.g., crosstalk, dialects). Moreover, fine-tuning pretrained models requires large, high-quality datasets, limiting rapid personalization. We propose VoiceTTA, a reinforcement learning-based test-time adaptation (TTA) method that improves voice imitation of pretrained zero-shot TTS models. VoiceTTA introduces two style rewards based on coefficient-of-variation differences of F0 and energy, combined with speaker similarity and intelligibility (WER from a pretrained Whisper model), and optimizes learnable prefixes via group relative policy optimization (GRPO) in a flow matching-based model at inference time. Extensive experiments demonstrate substantial improvements on uncommon speech prompts, outperforming state-of-the-art baselines. Audio samples are available at https://voicetta.pages.dev/
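A style reward "based on coefficient-of-variation differences" could look roughly like the sketch below: compute the coefficient of variation (standard deviation divided by mean) of a contour such as F0 or energy for the prompt and for the generated speech, and reward small gaps. The negated absolute difference, the use of population standard deviation, and the toy contours are all assumptions; the abstract does not give the exact formula or how unvoiced frames are handled.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (mean must be non-zero)."""
    return statistics.pstdev(values) / statistics.fmean(values)


def style_reward(generated, prompt):
    """Hypothetical reward: 0 when variability matches, more negative otherwise."""
    return -abs(coefficient_of_variation(generated) - coefficient_of_variation(prompt))


# Toy F0 contours in Hz over voiced frames (invented values).
prompt_f0 = [100.0, 200.0, 100.0, 200.0]      # highly variable prompt
flat_f0 = [150.0, 150.0, 150.0, 150.0]        # monotone synthesis
lively_f0 = [110.0, 190.0, 110.0, 190.0]      # synthesis closer to the prompt

print(style_reward(flat_f0, prompt_f0), style_reward(lively_f0, prompt_f0))
```

Because the coefficient of variation is scale-free, a reward of this shape compares how much pitch or energy moves relative to its average rather than comparing absolute pitch, which is one reason it suits style imitation across different voices.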
cs.CL 2026-06-25

One tiny model handles multiple speech tasks via similarity

by Sourav Ghosh, Yash Bhatia +4 more

AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification

AnySimLite recasts classification as text similarity and matches large models with under 1/250th the size in few-shot settings.

To minimize privacy concerns and inference latency on edge devices like smartphones, lightweight on-device models remain important for end-user applications. Many of these applications involve natural language classification, but deploying multiple specialized models creates a memory footprint challenge. We investigate: Can a single lightweight architecture solve multiple Speech-Adjacent (SA) classification tasks through reduction to a nuanced text similarity formulation? We propose AnySimLite, a lightweight similarity encoder that combines word-level and character-level channels. Together with a dataset transformation strategy, we evaluate AnySimLite across multiple SA classification tasks and show that it consistently achieves state-of-the-art (SOTA) or SOTA-competitive performance in few-shot settings while maintaining a low memory footprint. Even in the worst case, the performance drop remains below 7% while using less than 1/250th of the model size of the SOTA qLLaMA_LoRA-7B baseline.
cs.SD 2026-06-25

Framework aligns lyric blocks with pitch and rhythm for singing scores

by Neelam Saini, Sourav Ghosh

Listening Like a Judge: A Music-Aware Framework for Automatic Singing Performance Evaluation

Multi-signal matching and targeted transcription fine-tuning produce scores that match human experts on lyrical and musical accuracy.

Automatic singing quality assessment (SQA) requires evaluating lyrical correctness and musical fidelity while handling expressive variations. However, existing systems largely rely on either acoustic cues or lyric transcriptions exclusively, limiting holistic performance evaluation. Furthermore, their integration is non-trivial due to challenges in robust singing transcription amid melisma, vibrato, and tempo elasticity. To this end, we propose MusicJudge, a modality-guided framework for automated SQA that performs block-aligned multimodal analysis by coupling lyric correctness with pitch-rhythm fidelity. It detects semantically meaningful lyric blocks using multi-signal matching that integrates semantic embeddings, lexical similarity, and phonetic alignment. To improve singing audio transcription, we introduce Modality-Guided LoRA for ASR fine-tuning. Experiments across datasets demonstrate strong agreement with human expert judgments and validate the generalizability of MusicJudge.
cs.CL 2026-06-25

SpeechEQ shows voice models hit text shortcut and safety trap

by Liang-Yuan Wu, Zih-Ching Chen +3 more

SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models

2265-dialogue tests find end-to-end systems still lose context and default to text over spoken emotion cues.

As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. However, existing evaluations of machine emotional intelligence assess reasoning exclusively through isolated text or passive acoustic perception, overlooking the complex cross-modal reasoning required for active, multi-turn dialogue. We introduce SpeechEQ, a comprehensive framework designed to evaluate the sociolinguistic reasoning of Speech-Language Models (SLMs). The framework includes a validated dataset of 2,265 dialogues across 15 Emotional Quotient (EQ) subscales grounded in EQ-i 2.0 theory, along with a multi-turn evaluation protocol measured by our proposed Spoken EQ (SEQ) score inspired by human EQ assessments. Experiments show limitations in how both existing Speech Emotion Recognition and end-to-end Speech-Language Models understand and apply paralinguistic cues through speech. While end-to-end architectures outperform cascaded systems, SpeechEQ reveals that current multimodal models remain bottlenecked by a text-reliant "modality shortcut," an alignment-induced "safety trap," and "contextual amnesia," highlighting the barriers to truly emotionally aware AI. Our benchmark can be accessed at https://huggingface.co/datasets/SpeechEQ/SpeechEQ and demo page at https://binomial14.github.io/speecheq-demo/
cs.SD 2026-06-25

Dataset releases 10,000 Foley clips with two-level labels

by Sunshiyu Wang, Alexander Lerch

FoleySet: A Multi-Level Human-Annotated Foley Sound Dataset

Standardized resource targets classification, retrieval, and generation of action-linked sound effects

In audiovisual post-production, Foley refers to synchronous sound effects associated with human actions, such as footsteps, cloth rustle, and prop handling, that are recreated to match the on-screen movements and interactions of characters. These sounds are often recorded by professional Foley artists using physical props. This resource-intensive workflow has motivated data-driven research on Foley, including tasks such as classification, retrieval, and generation; however, high-quality annotated Foley datasets for training remain scarce. To address this gap, we present FoleySet, a publicly available Foley dataset of 10,000 audio clips annotated with a two-level Foley taxonomy. This dataset provides a standardized, Creative Commons-licensed resource for data-driven Foley classification, retrieval, and generation.