Query-key normalization for transformers

Henry, Alex; Dachapally, Prudhvi Raj; Pawar, Shubham Shantaram; Chen, Yuxuan · 2020 · DOI 10.18653/v1/2020.findings-emnlp.379

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding

cs.LG · 2026-07-02 · unverdicted · novelty 7.0

Set diffusion factorizes likelihood over arbitrary token sets and uses a set-causal diffusion architecture to support KV caching and any-order decoding, yielding improved speed-quality tradeoffs versus prior diffusion LMs.

LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

cs.LG · 2026-06-11 · unverdicted · novelty 6.0

LoRA-Muon applies Muon's spectral steepest descent to low-rank factors with split weight decay, acting as a transferable proxy for full-rank Muon and Shampoo optimizers.

The Transformer as a Polar State Estimator

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

The standard Transformer block arises as a first-order approximation to a polar state estimator on the hypersphere, with a Polar Transformer retaining higher-order terms.

BloombergGPT: A Large Language Model for Finance

cs.LG · 2023-03-30 · conditional · novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

ZONOS2 Technical Report

cs.SD · 2026-06-23 · unverdicted · novelty 4.0 · 2 refs

ZONOS2 8B is a scaled MoE TTS model with 900M active parameters trained on 6M hours of data that reports competitive SOTA results on naturalness, speaker similarity, WER, and a new ZTTS1-Eval benchmark while releasing weights and code.

citing papers explorer

Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding cs.LG · 2026-07-02 · unverdicted · none · ref 97
LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold cs.LG · 2026-06-11 · unverdicted · none · ref 22
The Transformer as a Polar State Estimator cs.LG · 2026-05-10 · unverdicted · none · ref 213
BloombergGPT: A Large Language Model for Finance cs.LG · 2023-03-30 · conditional · none · ref 43
ZONOS2 Technical Report cs.SD · 2026-06-23 · unverdicted · none · ref 194 · 2 links
