Memory augmentation in LLMs amplifies sycophancy up to 25x compared to in-context baselines due to lossy memory extraction, with two lightweight mitigations that reduce the effect while preserving recall.
super hub Baseline reference
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Baseline reference. 55% of citing Pith papers use this work as a benchmark or comparison.
abstract
We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-
authors
co-cited works
representative citing papers
Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
LAB-Bench provides over 2,400 multiple-choice questions to measure LLM performance on real biology research tasks like literature recall, figure reading, database access, and sequence manipulation, with initial results compared against human expert biologists.
A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.
ALMs unify pretrained atomistic encoder, LLM, and denoising diffusion via continuous projectors and staged training to reach SOTA on text-conditioned crystal prediction and de novo generation.
Honest heterogeneous peers in LLM debates lower harmful revision rates (e.g., 89% to 35%), while adversarial peers raise them (to 90%), and provide defense even against same-family adversaries.
ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.
MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.
GraphPO represents reasoning rollouts as a DAG to merge semantically equivalent paths, share suffixes, and assign separate efficiency and correctness advantages for lower variance and better performance than chain or tree baselines.
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.
ART optimizes visual pixel inputs to frozen MLLMs to achieve LoRA-competitive accuracy on math and structured tool-use benchmarks without modifying computational graphs.
Optical reasoning encodes rationales in images rather than text, matching or exceeding text-based performance on math, science, and multimodal benchmarks while cutting tokens by 28.57% on language tasks and 16% on multimodal tasks.
EGPS localizes MCMC moves to high-entropy decision points using forward-pass entropy, yielding up to 12.6× wall-clock speedup and best-or-tied accuracy on MATH500, HumanEval, and GPQA for Qwen2.5-Math-7B.
SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.
Three problem-level trajectory features derived from the distributional signature of failed LLM rollouts enable failure clustering at 84.3% accuracy and a training-free routing rule that improves rescue by 12.2% on hard cases.
Introduces KINA benchmark with 899 items over 261 disciplines, formal (1-1/e) coverage guarantee and bonus-on-bar tournament theorem, plus evaluations of 42 models with top score 53.17%.
BigFinanceBench is a workflow-grounded benchmark of 928 financial research tasks with point-weighted rubrics, where the best of ten tested agents scores 58.8% on derivation quality.
TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.
ResMerge improves merging of RL expert LLMs via a stable residual consensus backbone plus gated head correction, outperforming task-vector and spectral baselines in capability preservation.
citing papers explorer
-
Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models
Memory augmentation in LLMs amplifies sycophancy up to 25x compared to in-context baselines due to lossy memory extraction, with two lightweight mitigations that reduce the effect while preserving recall.
-
Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete
Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.
-
Unsteady Metrics and Benchmarking Cultures of AI Model Builders
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.
-
Evaluating Large Language Models in Scientific Discovery
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
LAB-Bench: Measuring Capabilities of Language Models for Biology Research
LAB-Bench provides over 2,400 multiple-choice questions to measure LLM performance on real biology research tasks like literature recall, figure reading, database access, and sequence manipulation, with initial results compared against human expert biologists.
-
Meta-Benchmarks for Financial-Services LLM Evaluation
A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.
-
Atomistic Language Models Understand and Generate Materials
ALMs unify pretrained atomistic encoder, LLM, and denoising diffusion via continuous projectors and staged training to reach SOTA on text-conditioned crystal prediction and de novo generation.
-
Heterogeneous LLM Debate Under Adversarial Peers: Honest Gains, Replacement Costs, and Resilience
Honest heterogeneous peers in LLM debates lower harmful revision rates (e.g., 89% to 35%), while adversarial peers raise them (to 90%), and provide defense even against same-family adversaries.
-
Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning
ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.
-
Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates
MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.
-
GraphPO: Graph-based Policy Optimization for Reasoning Models
GraphPO represents reasoning rollouts as a DAG to merge semantically equivalent paths, share suffixes, and assign separate efficiency and correctness advantages for lower variance and better performance than chain or tree baselines.
-
Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
-
Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs
ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.
-
Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training
ART optimizes visual pixel inputs to frozen MLLMs to achieve LoRA-competitive accuracy on math and structured tool-use benchmarks without modifying computational graphs.
-
Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text
Optical reasoning encodes rationales in images rather than text, matching or exceeding text-based performance on math, science, and multimodal benchmarks while cutting tokens by 28.57% on language tasks and 16% on multimodal tasks.
-
Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling
EGPS localizes MCMC moves to high-entropy decision points using forward-pass entropy, yielding up to 12.6× wall-clock speedup and best-or-tied accuracy on MATH500, HumanEval, and GPQA for Qwen2.5-Math-7B.
-
SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models
SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.
-
Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)
Three problem-level trajectory features derived from the distributional signature of failed LLM rollouts enable failure clustering at 84.3% accuracy and a training-free routing rule that improves rescue by 12.2% on hard cases.
-
Knowledge Index of Noah's Ark
Introduces KINA benchmark with 899 items over 261 disciplines, formal (1-1/e) coverage guarantee and bonus-on-bar tournament theorem, plus evaluations of 42 models with top score 53.17%.
-
BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents
BigFinanceBench is a workflow-grounded benchmark of 928 financial research tasks with point-weighted rubrics, where the best of ten tested agents scores 58.8% on derivation quality.
-
Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification
TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.
-
ResMerge: Residual-based Spectral Merging of Large Language Models
ResMerge improves merging of RL expert LLMs via a stable residual consensus backbone plus gated head correction, outperforming task-vector and spectral baselines in capability preservation.
-
Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery
2-bit quantized reasoning models exhibit process failures like loops and delayed commitment that degrade end-to-end performance, but FP16 planning and loop rescue recover accuracy on MATH-500 from 17.2% to 74.2% for Qwen3-8B while retaining speed gains.
-
ATLAS: Agentic Test-time Learning-to-Allocate Scaling
ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.
-
From Talking Words to Sharing Thoughts: Scalable Multi-LLM Aggregation via Structured Message Passing
A bipartite factor graph with message-passing protocol and asymmetric damping aggregates multi-LLM predictions, cutting token use by 97% and API calls by 6X while outperforming baselines on MMLU, MMLU-Pro, GPQA, and MedMCQA.
-
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research
ResearchClawBench supplies 40 grounded tasks and expert rubrics to measure autonomous research agents, with the strongest systems scoring only 21.5 and 20.7 on average.
-
RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data
ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.
-
JobBench: Aligning Agent Work With Human Will
JobBench is a new benchmark with 130 occupational tasks where the best of 36 tested AI models achieves only 45.9% success.
-
DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking
DiscoverPhysics is a new benchmark with 22 on-demand N-body simulated worlds where LLM agents design experiments to infer non-standard physics, evaluated via held-out trajectory MSE and LLM-judged explanation quality.
-
Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models
MCPO applies contrastive learning to GRPO-style RL by treating cross-domain correct rollouts as positives and incorrect ones as negatives to improve multi-domain reasoning performance in LRMs.
-
Not only where, But when: Temporal Scheduling for RLVR
Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.
-
Large Language Model Selection with Limited Annotations
SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.
-
\textsc{MasFACT}: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer
MasFACT transfers historical topology priors across tasks via Fused Gromov-Wasserstein optimal transport and PAC-Bayes conservative adaptation to reduce topology forgetting in continual multi-agent settings.
-
SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning
SliceGraph maps process isomers in multi-run CoT reasoning, finding that 85.5% of 954 problem-model cells show correct trajectories splitting into multiple process families with 76.6% of run pairs cross-family on average.
-
Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces
Language models produce overcomplete reasoning traces where on average 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.
-
Inducing Artificial Uncertainty in Language Models
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
-
Query-Conditioned Test-Time Self-Training for Large Language Models
QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
-
Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
-
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
-
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preserving OOD performance.
-
Active Testing of Large Language Models via Approximate Neyman Allocation
Proposes surrogate semantic entropy stratification followed by approximate Neyman allocation for active testing of LLMs on generative benchmarks, reporting up to 28% MSE reduction and 22.9% average budget savings versus uniform sampling.
-
LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs
TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.
-
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
-
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms
Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
-
Rubric-based On-policy Distillation
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.
-
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.