Benchmark standardizes early Parkinson's speech detection
A Benchmark for Early-stage Parkinson's Disease Detection from Speech
Speaker-independent splits on accessible datasets enable fair, replicable comparisons across tasks and training settings.
Sound
Covers all aspects of computing with sound, and sound as an information channel. Includes models of sound, analysis and synthesis, audio user interfaces, sonification of data, computer music, and sound signal processing. Includes ACM Subject Class H.5.5, and intersects with H.1.2, H.5.1, H.5.2, I.2.7, I.5.4, I.6.3, J.5, K.4.2.
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
Tests across 12 systems show trade-offs, large reliability gaps, and drops from accents or noise in simulated conversations.
PHALAR: Phasors for Learned Musical Audio Representations
Contrastive framework adds pitch and phase equivariance via spectral pooling and complex head, trains seven times faster, and matches human
Audio-Based Understanding of Audiobook Narration Appeal
Vocal features extracted from recordings remain tied to view-rate and engagement after title controls are applied.
SelectTSL: Prompt-Guided Selective Target Sound Localization in Complex Scenarios
SelectTSL steers attention to user-specified targets and estimates their direction plus count in overlapping scenes.
Speaker head orientation estimation with a single microphone array using phase spectrogram features
Simulated voice directivity training followed by real-data fine-tuning reaches 11.3 degree mean error after personalization.
A Multi-Branch Hierarchy-Aware Framework for Heterogeneous Audio Classification
CLAP features, separate acoustic branches, and KNN post-processing lift scores on heterogeneous sound taxonomy task.
Shared parameters create resolution-specific kernels by scaling size and stride to each token interval.
Decomposer: Learning to Decompile Symbolic Music to Programs
Two-stage training on synthetic pairs plus dual-reward RL produces more faithful and editable code than LLMs or heuristics.
RT-Tango: Real-Time Distributed Binaural Speech Enhancement for Low-Power Hearing Aid Devices
Distributed two-stage system applies perceptual compression and recurrent estimation to keep quality competitive on constrained hardware.
DRL-CLBA: A Clean Label Backdoor Attack for Speech Classification via DDPG Reinforcement Learning
DDPG shifts target audio to hidden steganographic anchors in latent space without label changes, resisting fine-tuning and pruning.
New trigger spreads frame-level timbre info to create natural poisoned samples that bypass detectors.
UT-AISTimprt submission for ICME 2026 Grand Challenge on Academic Text-to-Music Generation
Grouping similar text embeddings in batches outperforms audio clusters, with moderate granularity best for metrics and finer clusters best f
H-SAGE: Holistic Speaker-Aware Guided Experts for MoE-based Multi-Talker ASR
A global encoder plus overlap-aware loss helps experts route better in high-overlap conditions on LibriSpeechMix.
Quantifying the Uncertainty of Blindly Estimated Room Embeddings Using a Dispersion-Calibrated Score
Calibrated on dispersion from artificial corruptions, it tracks representation quality without downstream labels or multiple recordings.
NPUsper: Eliminating Redundant Computation for Real-Time Whisper on Mobile NPUs
Hallucination detection lets short audio chunks replace padded inputs and chunked decoding trims cache work while accuracy holds.
A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models
Geometric measurements show low-dimensional disentangled representations enable better single-site control while joint steering trades inten
CNN Models for Microphone Array Covariance Matrix Upsampling and Acoustic Imaging
Models trained on real recordings achieve lower error than random guessing and produce sound maps nearly identical to those from a full 32-c
Evaluating Pretrained Music Embeddings for Cross-Performance Jazz Standard Recognition
They generalize better than from-scratch spectrogram models yet track performer identity, partially fixed by a contrastive projection.
AV-SyncBench: Decoupled Benchmarking of Temporal and Semantic Audio-Visual Synchronization
AV-SyncBench lets researchers measure offset accuracy apart from content matching on 38k verified samples.
Speech Playground: An Interactive Tool for Speech Analysis and Comparison
Supports continuous, discrete and variable-length representations plus TextGrid alignment for research and CAPT tasks.
Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis
Heterogeneous augmentation and trajectory rectification remove CFG overhead and raise speaker similarity.
A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models
Performers adjust parameters directly while audio continues without interruption, using any of three backends.
Adaptive Perturbation Selection for Contrastive Audio Decoding
A lightweight network trained on base-model states picks the best perturbation per example and task without retraining the language model.
Dilemmadata: On the Interoperability of Heterogeneous Roman Numeral Datasets
84 overlapping pieces allow note-for-note comparison of two analytical traditions on identical music.
ZEBRA fuses zero-shot and prompt logits to raise novel-class accuracy while preserving base performance across audio datasets.
Improving multichannel speech enhancement through accurate room-acoustic simulations
Training on wave-based and hybrid acoustic simulation data outperforms training on purely geometrical simulations when evaluated on real measured recordings.
Bilingual Finnish-Russian EMA data shows intermediate layers capture tongue and lip positions across languages using only minutes of trainin
Building an ASR Solution for Training and Assessing Children's Reading
Training on 55 hours of data from 60 children enables practical assessment of literacy skills.
Speaker identity stays mostly partitioned but room parameters emerge unsupervised in acoustic embeddings and leak elsewhere.
Beyond Binary Instrument QA: Probing Instrument Grounding in Music Audio-Language Models
Models display position bias, confusable errors, and temporal inconsistencies on extended tests.
SwiftAudio trains a fast text-to-audio generator on 45K captions without audio pairs and tops other one-step methods.
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model
It beats fixed-rate 7B models on quality and halves inference time at lower rates.
Discrete phonetic tokens let users change small sound units or whole words while controlling voice and mood in the same system.
Attacking UTMOS: Probing the Robustness of a Speech Quality Assessment Model
Optimization in waveform, mel, and EnCodec spaces decouples the model's output from what listeners actually hear.
Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems
Conditioning on speaker traits and interaction state yields expected flag rates on human data and interpretable deviations unlike pooled ave
Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation
PRIME-Speech trains only a post-decoder on hidden states to generate spoken responses without degrading original text reasoning.
SyncCache: Exploiting Asymmetric Dynamics for Fast Audio-Driven Portrait Animation
It reuses stable background residuals across blocks while refreshing only audio-driven human regions to keep exact lip sync.
AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation
Shared encoder and codebook enable joint reconstruction plus audio-to-video and video-to-audio tasks without separate branches.
Independent probes rank transformer layers by cross-domain power, then fuse only the strongest ones for lower error with far fewer parameter
Staged SFT and DPO training separates musicality from controllability and acoustics to beat open-source baselines.
MeloDISinger: Melody-Aware & Duration-Preserving Singing Voice Editing with Audio Infilling
MeloDISinger predicts duration ratios via phonetic-melodic cross-attention to keep timing and tune intact during text changes.
SIGMA: Saliency-Guided Sparse Mask Attacks for Speech Emotion Recognition
One XAI-derived mask scopes magnitude-bounded updates and maintains competitive success rates across models while improving explanation cons
Predicting Timbre Traits for Interpretable Assessment of Musical Sound Synthesizers
The model analyzes individual sounds on 20 traits to show which synthesizer outputs and dimensions need work.
OLIVE: View-Augmented Latent Prediction with Waveform Reconstruction for Speech SSL
OLIVE keeps recognition performance competitive by using waveform reconstruction to retain signal details alongside masked prediction for in
BEST-RQ-2: Contextualize-Then-Predict, a Two-Step Approach for Self-Supervised Audio Representations
Decomposing masked prediction into context and prediction stages improves overall benchmark transfer without extra runtime compute.
SICAGE: Speaker-Independent Culture-Aware Gesture Generation using TED4C-L Dataset
Domain-generalization losses isolate culture from individual style, improving realism and consistency on a new four-group TED dataset.
Child-Centric Voice Anonymization in Single and Multi-Speaker Speech via Domain-Adapted SSL Models
Experiments on MyST data show better speech quality and privacy for children in single-speaker and multi-speaker recordings.
Targeted insertion inside the backbone stabilizes outputs across SNR levels and noise types with no extra data or redesign.
Public acoustic datasets leak session information when clips are split randomly, inflating reported counter-drone performance.
TF-MoE: Time-Frequency Mixture-of-Experts for Efficient Speech Separation
Alternating expert modules increase capacity on a mel-band Conformer without raising inference cost beyond prior compact models.
Proteus: Automated Adversarial Robustness Testing for Audio Deepfake Detectors
A search framework finds chains of codec, noise and compression that fool detectors but keep speech clear and speakers recognizable.
Position-aware extraction generates attributed streams for simple VAD, yielding ASR gains over CSS in meetings.
DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection
A binary keep-mask and learned embedding let the codec drop redundant frames while counting every overhead bit and still raising quality and
AMR: Adaptive Modality Routing for Multimodal Polyglot Speaker Identification
Adaptive Modality Routing uses per-sample adapters and KL-supervised training to maintain performance when modalities drop or languages chan
Resistance and closure terms let a damped mass-spring system match subject glottal waveforms to under 3 percent error without vocal-tract co
A new underwater dataset plus margin-enhanced loss and feature alignment yield stronger robustness when models move between acoustic domains
Majority label inside each cluster removes most relabeled trigger samples before training.
The system reaches comparable accuracy on unseen data and improves further after fine-tuning on target speech.
ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models
ALM2Vec pulls capabilities from large audio-language models to create one embedding space for many retrieval tasks and natural language cont
HPRO extracts separate style tokens and aligns rewards at frame-to-sentence scales so emotional expressiveness rises while intelligibility h
DG^VoiC: Speaker Clustering for Fraud Investigation under Real Call-Centre Conditions
A 96 percent AMI on a human-verified reference set shows that cross-profile linkage is feasible for fraud checks.
A Flexible Encoding Model for Non-Unique Note Alignments
Virtual pointer notes allow multiple performance-to-score connections while old parsers continue to work unchanged.
Grammar-Guided Hierarchical Parsing for Long-form Audio Activity Recognition
Order-consistent trees built from event posteriors yield sub-activities and classifications via grammar constraints.
LoRA-Tuned Large Language Models for Dementia Detection via Multi-View Speech-Derived Features
One prompt combines transcripts, topics, fluency and phonology so a single adapted model handles the task without fusion stages.
From General-Purpose Audio Tagging to Spatially Grounded Sound Event Localization and Detection
Multi-stage search shows magnitude-phase-intensity vectors and early spatial encoding let semantic priors aid event detection and direction
Do Speech Emphasis Models Generalize across Languages and Emotions?
New corpus of 10,000 utterances shows robust cross-emotion performance and holds at smaller data scales.
Room for Error: Large-Scale Simulation of Over-the-Air Acoustic Attacks
Large-scale testing of over-the-air attacks shows physical acoustics sharply increase word error rates in speech recognition models.
What Was That Again? Certified Robustness for Automatic Speech Recognition
Dual audit certifies correct tokens and excludes attacks without oracle knowledge of the true words.
Augmentations and Gaussian soft labels cut boundary errors across whisper to shout in naturalistic recordings.
Learning from Annotation Uncertainty: Entropy-Aware Curriculum for Speech Emotion Recognition
Training on annotator vote spreads rather than single consensus labels reduces JSD and KLD to human disagreement patterns.
Elastic Time: Dynamic Frame Rate Bottlenecks for Neural Audio Coding
Elastic Time turns fixed-rate models dynamic, enabling post-training rate control and shorter latent sequences.
voxmap-studio: An open-source speaker diarization annotation tool with built-in cost instrumentation
Automatic initialization and uncertainty highlights lower annotation cost in a test on nine files by turning creation into correction.
wav2tok 2.0 stages contrastive learning before CTC and DTW losses to raise spoken term detection accuracy without losing efficiency.
WQ-Fusion: Dynamic Gated Attention for Cross-Domain Audio Representation
Dynamic routing of features from Whisper and Qwen improves results across acoustic domains.
VoiceTTA: Enhancing Zero-Shot Text-to-Speech via Reinforcement Learning-Based Test-Time Adaptation
Optimizing prefixes with F0, energy, similarity and WER rewards improves imitation on rare prompts without retraining.
AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification
AnySimLite recasts classification as text similarity and matches large models with under 1/250th the size in few-shot settings.
Listening Like a Judge: A Music-Aware Framework for Automatic Singing Performance Evaluation
Multi-signal matching and targeted transcription fine-tuning produce scores that match human experts on lyrical and musical accuracy.
SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models
Tests on 2265 dialogues find that end-to-end systems still lose context and default to text over spoken emotion cues.
FoleySet: A Multi-Level Human-Annotated Foley Sound Dataset
Standardized resource targets classification, retrieval, and generation of action-linked sound effects.