Query-key normalization for transformers
5 Pith papers cite this work.
citing papers explorer
- Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding
  Set diffusion factorizes likelihood over arbitrary token sets and uses a set-causal diffusion architecture to support KV caching and any-order decoding, yielding improved speed-quality tradeoffs versus prior diffusion LMs.
- LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold
  LoRA-Muon applies Muon's spectral steepest descent to low-rank factors with split weight decay, acting as a transferable proxy for full-rank Muon and Shampoo optimizers.
- The Transformer as a Polar State Estimator
  The standard Transformer block arises as a first-order approximation to a polar state estimator on the hypersphere, with a Polar Transformer retaining higher-order terms.
- BloombergGPT: A Large Language Model for Finance
  BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
- ZONOS2 Technical Report
  ZONOS2 8B is a scaled MoE TTS model with 900M active parameters trained on 6M hours of data that reports competitive SOTA results on naturalness, speaker similarity, WER, and a new ZTTS1-Eval benchmark while releasing weights and code.