hub

Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition

· 2024 · arXiv 2407.04675

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 method 1

citation-polarity summary

background 2 use method 1

representative citing papers

TRADE: Transducer-Augmented Decoder for Speech LLM

cs.CL · 2026-06-07 · unverdicted · novelty 6.0

TRADE augments multimodal Speech LLMs with a transducer branch for streaming ASR, reporting 6.71% WER offline and 8.40% streaming on the Open ASR Leaderboard from one checkpoint.

Audio Interaction Model

cs.SD · 2026-06-03 · unverdicted · novelty 6.0

Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.

JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR

cs.CL · 2026-05-16 · unverdicted · novelty 6.0

JSPG jointly combines semantic, pinyin, and glyph retrieval with an extended Smith-Waterman algorithm to dynamically filter keyword dictionaries and improve accuracy in Chinese contextual ASR.

VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

cs.SD · 2026-05-06 · unverdicted · novelty 6.0

VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.

Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

eess.AS · 2026-04-09 · unverdicted · novelty 6.0

A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.

Step-Audio 2 Technical Report

cs.CL · 2025-07-22 · unverdicted · novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.

StepAudio 2.5 Technical Report

eess.AS · 2026-05-22 · unverdicted · novelty 5.0

StepAudio 2.5 is a unified audio-language foundation model that reaches state-of-the-art results on ASR, TTS, and realtime interaction by using task-tailored RLHF on a shared backbone.

Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

cs.CL · 2026-04-10 · unverdicted · novelty 5.0

The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.

From Synthesis to Clinical Assistance: A Strategy-Aware Agent Framework for Autism Intervention based on Real Clinical Dataset

cs.LG · 2026-04-09 · unverdicted · novelty 5.0

ASDAgent generates synthetic ABA-strategy dialogues that match human therapist distributions (KL 0.083) and achieves 80% expert consistency, while its outputs improve small language models for therapeutic tasks.

Qwen2.5-Omni Technical Report

cs.CL · 2025-03-26 · conditional · novelty 5.0

Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.

Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving

cs.CL · 2026-07-02 · unverdicted · novelty 4.0

JSTIP interleaves speech and text sequences during pretraining on 38k hours of ASR data to improve entity accuracy over ASR-only and simple joint-training baselines while matching performance from domain text.

LLMs and Speech: Integration vs. Combination

eess.AS · 2026-03-16 · unverdicted · novelty 4.0

Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

eess.AS · 2026-04-20

citing papers explorer

Showing 13 of 13 citing papers.

TRADE: Transducer-Augmented Decoder for Speech LLM cs.CL · 2026-06-07 · unverdicted · none · ref 2
TRADE augments multimodal Speech LLMs with a transducer branch for streaming ASR, reporting 6.71% WER offline and 8.40% streaming on the Open ASR Leaderboard from one checkpoint.
Audio Interaction Model cs.SD · 2026-06-03 · unverdicted · none · ref 3
Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.
JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR cs.CL · 2026-05-16 · unverdicted · none · ref 2
JSPG jointly combines semantic, pinyin, and glyph retrieval with an extended Smith-Waterman algorithm to dynamically filter keyword dictionaries and improve accuracy in Chinese contextual ASR.
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models cs.SD · 2026-05-06 · unverdicted · none · ref 2
VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs eess.AS · 2026-04-09 · unverdicted · none · ref 2
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
Step-Audio 2 Technical Report cs.CL · 2025-07-22 · unverdicted · none · ref 5
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
StepAudio 2.5 Technical Report eess.AS · 2026-05-22 · unverdicted · none · ref 7
StepAudio 2.5 is a unified audio-language foundation model that reaches state-of-the-art results on ASR, TTS, and realtime interaction by using task-tailored RLHF on a shared backbone.
Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition cs.CL · 2026-04-10 · unverdicted · none · ref 19
The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.
From Synthesis to Clinical Assistance: A Strategy-Aware Agent Framework for Autism Intervention based on Real Clinical Dataset cs.LG · 2026-04-09 · unverdicted · none · ref 1
ASDAgent generates synthetic ABA-strategy dialogues that match human therapist distributions (KL 0.083) and achieves 80% expert consistency, while its outputs improve small language models for therapeutic tasks.
Qwen2.5-Omni Technical Report cs.CL · 2025-03-26 · conditional · none · ref 4
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.
Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving cs.CL · 2026-07-02 · unverdicted · none · ref 16
JSTIP interleaves speech and text sequences during pretraining on 38k hours of ASR data to improve entity accuracy over ASR-only and simple joint-training baselines while matching performance from domain text.
LLMs and Speech: Integration vs. Combination eess.AS · 2026-03-16 · unverdicted · none · ref 17
Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.
NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR eess.AS · 2026-04-20 · unreviewed · ref 2

Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer