
GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Asa Cooper Stickland, Betty Li Hou, David Rein, Jackson Petty, Julien Dirani, Richard Yuanzhe Pang · 2023 · cs.AI · arXiv 2311.12022

Baseline reference. 55% of classified citations from Pith papers use this work as a benchmark or comparison.

232 Pith papers cite it.


abstract

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.


citation-role summary

dataset: 18 · background: 14 · method: 1

citation-polarity summary

use dataset: 18 · background: 10 · unclear: 4 · use method: 1




representative citing papers

Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models

cs.AI · 2026-06-09 · conditional · novelty 8.0

Memory augmentation in LLMs amplifies sycophancy up to 25x compared to in-context baselines due to lossy memory extraction, with two lightweight mitigations that reduce the effect while preserving recall.

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

cs.LG · 2026-06-01 · unverdicted · novelty 8.0

Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

cs.CL · 2026-05-08 · conditional · novelty 8.0 · 2 refs

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

cs.AI · 2024-07-14 · accept · novelty 8.0

LAB-Bench provides over 2,400 multiple-choice questions to measure LLM performance on real biology research tasks like literature recall, figure reading, database access, and sequence manipulation, with initial results compared against human expert biologists.

Meta-Benchmarks for Financial-Services LLM Evaluation

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.

Atomistic Language Models Understand and Generate Materials

cs.LG · 2026-06-19 · unverdicted · novelty 7.0

ALMs unify pretrained atomistic encoder, LLM, and denoising diffusion via continuous projectors and staged training to reach SOTA on text-conditioned crystal prediction and de novo generation.

Heterogeneous LLM Debate Under Adversarial Peers: Honest Gains, Replacement Costs, and Resilience

cs.CR · 2026-06-18 · unverdicted · novelty 7.0

Honest heterogeneous peers in LLM debates lower harmful revision rates (e.g., 89% to 35%), while adversarial peers raise them (to 90%), and provide defense even against same-family adversaries.

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.

GraphPO: Graph-based Policy Optimization for Reasoning Models

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

GraphPO represents reasoning rollouts as a DAG to merge semantically equivalent paths, share suffixes, and assign separate efficiency and correctness advantages for lower variance and better performance than chain or tree baselines.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

ART optimizes visual pixel inputs to frozen MLLMs to achieve LoRA-competitive accuracy on math and structured tool-use benchmarks without modifying computational graphs.

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

Optical reasoning encodes rationales in images rather than text, matching or exceeding text-based performance on math, science, and multimodal benchmarks while cutting tokens by 28.57% on language tasks and 16% on multimodal tasks.

Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling

cs.LG · 2026-06-07 · unverdicted · novelty 7.0

EGPS localizes MCMC moves to high-entropy decision points using forward-pass entropy, yielding up to 12.6× wall-clock speedup and best-or-tied accuracy on MATH500, HumanEval, and GPQA for Qwen2.5-Math-7B.

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

cs.CL · 2026-06-06 · unverdicted · novelty 7.0

SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.

Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Three problem-level trajectory features derived from the distributional signature of failed LLM rollouts enable failure clustering at 84.3% accuracy and a training-free routing rule that improves rescue by 12.2% on hard cases.

Knowledge Index of Noah's Ark

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

Introduces KINA benchmark with 899 items over 261 disciplines, formal (1-1/e) coverage guarantee and bonus-on-bar tournament theorem, plus evaluations of 42 models with top score 53.17%.

BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

BigFinanceBench is a workflow-grounded benchmark of 928 financial research tasks with point-weighted rubrics, where the best of ten tested agents scores 58.8% on derivation quality.

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.

ResMerge: Residual-based Spectral Merging of Large Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

ResMerge improves merging of RL expert LLMs via a stable residual consensus backbone plus gated head correction, outperforming task-vector and spectral baselines in capability preservation.

citing papers explorer

Showing 50 of 66 citing papers after filters.

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete cs.LG · 2026-06-01 · unverdicted · none · ref 56 · internal anchor
Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.
Atomistic Language Models Understand and Generate Materials cs.LG · 2026-06-19 · unverdicted · none · ref 42 · internal anchor
ALMs unify pretrained atomistic encoder, LLM, and denoising diffusion via continuous projectors and staged training to reach SOTA on text-conditioned crystal prediction and de novo generation.
Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates cs.LG · 2026-06-17 · unverdicted · none · ref 22 · internal anchor
MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.
Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training cs.LG · 2026-06-10 · unverdicted · none · ref 26 · internal anchor
ART optimizes visual pixel inputs to frozen MLLMs to achieve LoRA-competitive accuracy on math and structured tool-use benchmarks without modifying computational graphs.
Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling cs.LG · 2026-06-07 · unverdicted · none · ref 23 · internal anchor
EGPS localizes MCMC moves to high-entropy decision points using forward-pass entropy, yielding up to 12.6× wall-clock speedup and best-or-tied accuracy on MATH500, HumanEval, and GPQA for Qwen2.5-Math-7B.
Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them) cs.LG · 2026-06-03 · unverdicted · none · ref 16 · internal anchor
Three problem-level trajectory features derived from the distributional signature of failed LLM rollouts enable failure clustering at 84.3% accuracy and a training-free routing rule that improves rescue by 12.2% on hard cases.
Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification cs.LG · 2026-06-02 · unverdicted · none · ref 27 · internal anchor
TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.
ATLAS: Agentic Test-time Learning-to-Allocate Scaling cs.LG · 2026-06-01 · unverdicted · none · ref 44 · internal anchor
ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research cs.LG · 2026-05-28 · conditional · none · ref 1 · 2 links · internal anchor
ResearchClawBench supplies 40 grounded tasks and expert rubrics to measure autonomous research agents, with the strongest systems scoring only 21.5 and 20.7 on average.
RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data cs.LG · 2026-05-26 · unverdicted · none · ref 33 · internal anchor
ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.
Not only where, But when: Temporal Scheduling for RLVR cs.LG · 2026-05-25 · unverdicted · none · ref 18 · internal anchor
Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.
MasFACT: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer cs.LG · 2026-05-17 · unverdicted · none · ref 31 · internal anchor
MasFACT transfers historical topology priors across tasks via Fused Gromov-Wasserstein optimal transport and PAC-Bayes conservative adaptation to reduce topology forgetting in continual multi-agent settings.
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-05-08 · unverdicted · none · ref 26 · internal anchor
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms cs.LG · 2026-05-08 · unverdicted · none · ref 12 · internal anchor
Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
KL for a KL: On-Policy Distillation with Control Variate Baseline cs.LG · 2026-05-08 · unverdicted · none · ref 33 · internal anchor
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
Rubric-based On-policy Distillation cs.LG · 2026-05-08 · unverdicted · none · ref 11 · internal anchor
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 42 · internal anchor
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control cs.LG · 2026-05-08 · unverdicted · none · ref 22 · internal anchor
Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
Jailbroken Frontier Models Retain Their Capabilities cs.LG · 2026-04-30 · unverdicted · none · ref 12 · internal anchor
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning cs.LG · 2026-04-22 · unverdicted · none · ref 59 · internal anchor
IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.
Super Apriel: One Checkpoint, Many Speeds cs.LG · 2026-04-21 · unverdicted · none · ref 46 · internal anchor
A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.
Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration cs.LG · 2026-04-13 · unverdicted · none · ref 23 · internal anchor
NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling cs.LG · 2025-07-02 · unverdicted · none · ref 31 · internal anchor
Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
MAS-PromptBench: When Does Prompt Optimization Improve Multi-Agent LLM Systems? cs.LG · 2026-06-22 · unverdicted · none · ref 55 · internal anchor
A new benchmark study finds that prompt optimization can deliver significant gains in multi-agent LLM systems but its effectiveness varies strongly with task, workflow, communication protocol, and team size.
DRIFT: Refining Instruction Data via On-Policy Data Attribution cs.LG · 2026-06-16 · unverdicted · none · ref 16 · internal anchor
DRIFT applies on-policy influence functions with signed weighting and debiasing to attribute and refine SFT data, raising performance on 7B instruction and reasoning models over prior curation methods.
Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression cs.LG · 2026-06-16 · unverdicted · none · ref 16 · internal anchor
A structural pruning framework for MoE models that solves channel-score coverage maximization via attribution approximation, preserving accuracy at 50% or 25% pruning plus 4-bit quantization on DeepSeek and Qwen models.
ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling cs.LG · 2026-06-11 · unverdicted · none · ref 33 · internal anchor
ReSET mitigates accuracy degradation in NVFP4-quantized reasoning models via step-aware entropy-based temperature scaling and provides a small-M CUDA kernel for up to 2.5x kernel speedup and 2x end-to-end speedup.
DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics cs.LG · 2026-06-10 · unverdicted · none · ref 25 · internal anchor
DynamicPTQ uses new metrics of residual-stream dynamics to apply 8-bit activation precision only to quantization-sensitive layers in W4A4KV4 LLM inference, improving perplexity and QA performance over static smoothing baselines.
TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning cs.LG · 2026-06-09 · unverdicted · none · ref 14 · internal anchor
TRACE is a rollout budget allocation framework that models ReAct turns as tree nodes and uses a predictor to allocate samples to informative prefixes, yielding a 2.8-point accuracy gain on Multi-Hop QA at equal cost.
RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference cs.LG · 2026-06-07 · unverdicted · none · ref 11 · internal anchor
RKSC delivers 3.008x mean speedup over baseline and 1.66x over vLLM prefix caching for multi-branch LLM reasoning via similarity-based KV sharing and confidence-gated early exit, with 0.37% error rate.
TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models cs.LG · 2026-06-05 · unverdicted · none · ref 20 · internal anchor
TALAN inserts a trainable latent memory path that remixes sequence information into small orthogonal perturbations, delivering 1.41-1.85 point average gains over matched LoRA and DoRA on four Qwen backbones and STEM/code benchmarks while adding under 1% parameters.
Enhancing LLM Metacognition via Cognitive Pairwise Training cs.LG · 2026-05-30 · unverdicted · none · ref 63 · internal anchor
CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.
DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization cs.LG · 2026-05-29 · unverdicted · none · ref 19 · internal anchor
DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.
The Routing Plateau: Understanding and Breaking the Accuracy Limits of LLM Routers cs.LG · 2026-05-27 · unverdicted · none · ref 49 · internal anchor
LLM routers across 21 methods on 5 benchmarks converge to similar accuracy below oracle due to learning global performance trends rather than fine-grained query signals.
MobileMoE: Scaling On-Device Mixture of Experts cs.LG · 2026-05-26 · unverdicted · none · ref 49 · internal anchor
MobileMoE introduces on-device MoE LLMs that match dense models with 2-4x fewer FLOPs and provide efficient smartphone inference.
MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training cs.LG · 2026-05-26 · unverdicted · none · ref 38 · internal anchor
MONA integrates Nesterov acceleration into Muon's orthogonalization framework, reporting better convergence than Muon and AdamW on MoE models up to 68B parameters trained on 1T tokens and SOTA fine-tuning results.
Post-Trained MoE Can Skip Half Experts via Self-Distillation cs.LG · 2026-05-18 · unverdicted · none · ref 23 · 2 links · internal anchor
ZEDA turns post-trained static MoE models into dynamic ones via zero-output expert injection and two-stage self-distillation, cutting over 50% expert FLOPs on Qwen3-30B-A3B and GLM-4.7-Flash with small accuracy drops across 11 benchmarks.
Dynamic Model Merging Made Slim cs.LG · 2026-05-17 · unverdicted · none · ref 69 · internal anchor
DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.
GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero cs.LG · 2026-05-14 · unverdicted · none · ref 43 · internal anchor
GRLO shows RLHF from scratch on 5K open-ended prompts raises average performance from 24.1 to 63.1 across domains on Qwen3-4B-Base using 46x less data and 68x less compute than in-domain RLVR while remaining competitive with heavily post-trained models.
Training on Documents About Monitoring Leads to CoT Obfuscation cs.LG · 2026-05-14 · unverdicted · none · ref 2 · internal anchor
Synthetic document finetuning on CoT monitor descriptions causes models to obfuscate reasoning traces, raising undetected misbehavior rates and correlating with controllability (r=0.800).
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation cs.LG · 2026-05-14 · conditional · none · ref 29 · internal anchor
On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.
Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training cs.LG · 2026-05-12 · unverdicted · none · ref 6 · internal anchor
Hybrid-LoRA selectively full fine-tunes modules with high sensitivity to low-rank adaptation using a novel score and applies LoRA elsewhere, matching full fine-tuning at 10% budget and outperforming PEFT baselines by up to 5.65%.
DataMaster: Data-Centric Autonomous AI Research cs.LG · 2026-05-11 · unverdicted · none · ref 30 · 2 links · internal anchor
DataMaster deploys an AI agent to autonomously engineer data via tree search over external sources, shared candidate pools, and memory of past outcomes, yielding 32% higher medal rates on MLE-Bench Lite and a small GPQA gain over the base instruct model.
Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness cs.LG · 2026-05-10 · unverdicted · none · ref 24 · internal anchor
VPG-EA applies variational posterior guidance and efficiency-aware distillation to compress LLM reasoning chains while preserving performance.
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation cs.LG · 2026-05-09 · unverdicted · none · ref 40 · internal anchor
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
Rotation-Preserving Supervised Fine-Tuning cs.LG · 2026-05-08 · unverdicted · none · ref 32 · internal anchor
RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning cs.LG · 2026-04-30 · unverdicted · none · ref 38 · internal anchor
SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched baselines.
Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning cs.LG · 2026-04-30 · unverdicted · none · ref 20 · internal anchor
Latent-GRPO stabilizes reinforcement learning in latent space, delivering 7.86 Pass@1 gains on low-difficulty tasks over latent baselines and 4.27 points over explicit GRPO on high-difficulty tasks with 3-4x shorter reasoning chains.
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts cs.LG · 2026-04-20 · unverdicted · none · ref 34 · internal anchor
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
Asynchronous Reasoning: Training-Free Interactive Thinking LLMs cs.LG · 2025-12-11 · unverdicted · none · ref 14 · internal anchor
Using properties of positional embeddings, reasoning LLMs can be made to think, listen, and generate outputs asynchronously without any additional training, cutting time to first token to under 5 seconds.
