Memory augmentation in LLMs amplifies sycophancy up to 25x compared to in-context baselines due to lossy memory extraction, with two lightweight mitigations that reduce the effect while preserving recall.
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Baseline reference. 55% of citing Pith papers use this work as a benchmark or comparison.
abstract
We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.
representative citing papers
Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
LAB-Bench provides over 2,400 multiple-choice questions to measure LLM performance on real biology research tasks like literature recall, figure reading, database access, and sequence manipulation, with initial results compared against human expert biologists.
A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.
ALMs unify pretrained atomistic encoder, LLM, and denoising diffusion via continuous projectors and staged training to reach SOTA on text-conditioned crystal prediction and de novo generation.
Honest heterogeneous peers in LLM debates lower harmful revision rates (e.g., 89% to 35%), while adversarial peers raise them (to 90%), and provide defense even against same-family adversaries.
ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.
MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.
GraphPO represents reasoning rollouts as a DAG to merge semantically equivalent paths, share suffixes, and assign separate efficiency and correctness advantages for lower variance and better performance than chain or tree baselines.
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.
ART optimizes visual pixel inputs to frozen MLLMs to achieve LoRA-competitive accuracy on math and structured tool-use benchmarks without modifying computational graphs.
Optical reasoning encodes rationales in images rather than text, matching or exceeding text-based performance on math, science, and multimodal benchmarks while cutting tokens by 28.57% on language tasks and 16% on multimodal tasks.
EGPS localizes MCMC moves to high-entropy decision points using forward-pass entropy, yielding up to 12.6× wall-clock speedup and best-or-tied accuracy on MATH500, HumanEval, and GPQA for Qwen2.5-Math-7B.
SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.
Three problem-level trajectory features derived from the distributional signature of failed LLM rollouts enable failure clustering at 84.3% accuracy and a training-free routing rule that improves rescue by 12.2% on hard cases.
Introduces KINA benchmark with 899 items over 261 disciplines, formal (1-1/e) coverage guarantee and bonus-on-bar tournament theorem, plus evaluations of 42 models with top score 53.17%.
BigFinanceBench is a workflow-grounded benchmark of 928 financial research tasks with point-weighted rubrics, where the best of ten tested agents scores 58.8% on derivation quality.
TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.
ResMerge improves merging of RL expert LLMs via a stable residual consensus backbone plus gated head correction, outperforming task-vector and spectral baselines in capability preservation.
citing papers explorer
-
Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete
Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.
-
Atomistic Language Models Understand and Generate Materials
ALMs unify pretrained atomistic encoder, LLM, and denoising diffusion via continuous projectors and staged training to reach SOTA on text-conditioned crystal prediction and de novo generation.
-
Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates
MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.
-
Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training
ART optimizes visual pixel inputs to frozen MLLMs to achieve LoRA-competitive accuracy on math and structured tool-use benchmarks without modifying computational graphs.
-
Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling
EGPS localizes MCMC moves to high-entropy decision points using forward-pass entropy, yielding up to 12.6× wall-clock speedup and best-or-tied accuracy on MATH500, HumanEval, and GPQA for Qwen2.5-Math-7B.
-
Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)
Three problem-level trajectory features derived from the distributional signature of failed LLM rollouts enable failure clustering at 84.3% accuracy and a training-free routing rule that improves rescue by 12.2% on hard cases.
-
Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification
TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.
-
ATLAS: Agentic Test-time Learning-to-Allocate Scaling
ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.
-
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research
ResearchClawBench supplies 40 grounded tasks and expert rubrics to measure autonomous research agents, with the strongest systems scoring only 21.5 and 20.7 on average.
-
RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data
ATLAS traces RLVR data to 20 atomic sources and finds that most datasets are variants; DAPO++ curated with SCA improves RLVR performance, while Q predicts training effectiveness.
-
Not only where, But when: Temporal Scheduling for RLVR
Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.
-
MasFACT: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer
MasFACT transfers historical topology priors across tasks via Fused Gromov-Wasserstein optimal transport and PAC-Bayes conservative adaptation to reduce topology forgetting in continual multi-agent settings.
-
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
-
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms
Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
-
Rubric-based On-policy Distillation
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.
-
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
-
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
-
Jailbroken Frontier Models Retain Their Capabilities
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
-
IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning
IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.
-
Super Apriel: One Checkpoint, Many Speeds
A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.
-
Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration
NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.
-
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling
Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
-
MAS-PromptBench: When Does Prompt Optimization Improve Multi-Agent LLM Systems?
A new benchmark study finds that prompt optimization can deliver significant gains in multi-agent LLM systems but its effectiveness varies strongly with task, workflow, communication protocol, and team size.
-
DRIFT: Refining Instruction Data via On-Policy Data Attribution
DRIFT applies on-policy influence functions with signed weighting and debiasing to attribute and refine SFT data, raising performance on 7B instruction and reasoning models over prior curation methods.
-
Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression
A structural pruning framework for MoE models that solves channel-score coverage maximization via attribution approximation, preserving accuracy at 50% or 25% pruning plus 4-bit quantization on DeepSeek and Qwen models.
-
ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling
ReSET mitigates accuracy degradation in NVFP4-quantized reasoning models via step-aware entropy-based temperature scaling and provides a small-M CUDA kernel for up to 2.5x kernel speedup and 2x end-to-end speedup.
-
DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics
DynamicPTQ uses new metrics of residual-stream dynamics to apply 8-bit activation precision only to quantization-sensitive layers in W4A4KV4 LLM inference, improving perplexity and QA performance over static smoothing baselines.
-
TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
TRACE is a rollout budget allocation framework that models ReAct turns as tree nodes and uses a predictor to allocate samples to informative prefixes, yielding a 2.8-point accuracy gain on Multi-Hop QA at equal cost.
-
RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference
RKSC delivers 3.008x mean speedup over baseline and 1.66x over vLLM prefix caching for multi-branch LLM reasoning via similarity-based KV sharing and confidence-gated early exit, with 0.37% error rate.
-
TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models
TALAN inserts a trainable latent memory path that remixes sequence information into small orthogonal perturbations, delivering 1.41-1.85 point average gains over matched LoRA and DoRA on four Qwen backbones and STEM/code benchmarks while adding under 1% parameters.
-
Enhancing LLM Metacognition via Cognitive Pairwise Training
CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.
-
DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization
DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.
-
The Routing Plateau: Understanding and Breaking the Accuracy Limits of LLM Routers
LLM routers across 21 methods on 5 benchmarks converge to similar accuracy below oracle due to learning global performance trends rather than fine-grained query signals.
-
MobileMoE: Scaling On-Device Mixture of Experts
MobileMoE introduces on-device MoE LLMs that match dense models with 2-4x fewer FLOPs and provide efficient smartphone inference.
-
MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training
MONA integrates Nesterov acceleration into Muon's orthogonalization framework, reporting better convergence than Muon and AdamW on MoE models up to 68B parameters trained on 1T tokens and SOTA fine-tuning results.
-
Post-Trained MoE Can Skip Half Experts via Self-Distillation
ZEDA turns post-trained static MoE models into dynamic ones via zero-output expert injection and two-stage self-distillation, cutting over 50% expert FLOPs on Qwen3-30B-A3B and GLM-4.7-Flash with small accuracy drops across 11 benchmarks.
-
Dynamic Model Merging Made Slim
DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.
-
GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero
GRLO shows RLHF from scratch on 5K open-ended prompts raises average performance from 24.1 to 63.1 across domains on Qwen3-4B-Base using 46x less data and 68x less compute than in-domain RLVR while remaining competitive with heavily post-trained models.
-
Training on Documents About Monitoring Leads to CoT Obfuscation
Synthetic document finetuning on CoT monitor descriptions causes models to obfuscate reasoning traces, raising undetected misbehavior rates and correlating with controllability (r=0.800).
-
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.
-
Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training
Hybrid-LoRA selectively full fine-tunes modules with high sensitivity to low-rank adaptation using a novel score and applies LoRA elsewhere, matching full fine-tuning at 10% budget and outperforming PEFT baselines by up to 5.65%.
-
DataMaster: Data-Centric Autonomous AI Research
DataMaster deploys an AI agent to autonomously engineer data via tree search over external sources, shared candidate pools, and memory of past outcomes, yielding 32% higher medal rates on MLE-Bench Lite and a small GPQA gain over the base instruct model.
-
Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness
VPG-EA applies variational posterior guidance and efficiency-aware distillation to compress LLM reasoning chains while preserving performance.
-
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
-
Rotation-Preserving Supervised Fine-Tuning
RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
-
State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched baselines.
-
Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning
Latent-GRPO stabilizes reinforcement learning in latent space, delivering 7.86 Pass@1 gains on low-difficulty tasks over latent baselines and 4.27 points over explicit GRPO on high-difficulty tasks with 3-4x shorter reasoning chains.
-
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
-
Asynchronous Reasoning: Training-Free Interactive Thinking LLMs
Using properties of positional embeddings, reasoning LLMs can be made to think, listen, and generate outputs asynchronously without any additional training, cutting time to first token to under 5 seconds.