Query-key normalization for transformers

5 Pith papers cite this work. Polarity classification is still indexing.

fields: cs.LG (4) · cs.SD (1)

years: 2026 (4) · 2023 (1)

representative citing papers

The Transformer as a Polar State Estimator

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

The standard Transformer block arises as a first-order approximation to a polar state estimator on the hypersphere, with a Polar Transformer retaining higher-order terms.

BloombergGPT: A Large Language Model for Finance

cs.LG · 2023-03-30 · conditional · novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

ZONOS2 Technical Report

cs.SD · 2026-06-23 · unverdicted · novelty 4.0 · 2 refs

ZONOS2 8B is a scaled MoE TTS model with 900M active parameters trained on 6M hours of data that reports competitive SOTA results on naturalness, speaker similarity, WER, and a new ZTTS1-Eval benchmark while releasing weights and code.

citing papers explorer

  • Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding
    cs.LG · 2026-07-02 · unverdicted · none · ref 97

    Set diffusion factorizes likelihood over arbitrary token sets and uses a set-causal diffusion architecture to support KV caching and any-order decoding, yielding improved speed-quality tradeoffs versus prior diffusion LMs.

  • LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold
    cs.LG · 2026-06-11 · unverdicted · none · ref 22

    LoRA-Muon applies Muon's spectral steepest descent to low-rank factors with split weight decay, acting as a transferable proxy for full-rank Muon and Shampoo optimizers.

  • The Transformer as a Polar State Estimator
    cs.LG · 2026-05-10 · unverdicted · none · ref 213

    The standard Transformer block arises as a first-order approximation to a polar state estimator on the hypersphere, with a Polar Transformer retaining higher-order terms.

  • BloombergGPT: A Large Language Model for Finance
    cs.LG · 2023-03-30 · conditional · none · ref 43

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  • ZONOS2 Technical Report
    cs.SD · 2026-06-23 · unverdicted · none · ref 194 · 2 links

    ZONOS2 8B is a scaled MoE TTS model with 900M active parameters trained on 6M hours of data that reports competitive SOTA results on naturalness, speaker similarity, WER, and a new ZTTS1-Eval benchmark while releasing weights and code.