L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.
hub Canonical reference
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Canonical reference. 73% of citing Pith papers cite this work as background.
abstract
In this paper, we present an innovative process-oriented math process reward model called \textbf{Math-Shepherd}, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) \textit{Verification}: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) \textit{Reinforcement Learning}: Math-Shepherd is employed to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates exceptional performance. For instance, the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9\%$\to$84.1\% on GSM8K and 28.6\%$\to$33.0\% on MATH). The accuracy can be further enhanced to 89.1\% and 43.5\% on GSM8K and MATH with the verification of Math-Shepherd, respectively. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.
Operadic consistency is a new per-question signal that correlates strongly with accuracy (r 0.86-0.94) across four multi-hop QA datasets and improves selective prediction over CoT-SC baselines.
PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.
Aggregating many LLM-synthesized weak verifiers via weak learning from sparse labels yields stronger verifiers that improve F1 by up to 7X over direct LLM judges on 3D room and 2D poster tasks and boost generation quality by 66.2%.
ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.
POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
OSPO trains optimal order dispatch policies for homogeneous AV fleets using only one-step group rewards, outperforming GRPO on a real ride-hailing dataset.
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Theoria rewrites solutions into auditable typed state transitions with justifications, certifying 105 of 185 HLE problems at 91.4% precision and outperforming holistic judges on adversarial poisoned proofs by catching hidden premises.
RASFT is an adaptive SFT method that strengthens or relaxes expert imitation per problem based on on-policy rollout solvability and adds clipped reference-policy ratio to limit drift, reporting better results than standard SFT and RL on math and code benchmarks.
AdaMEM proposes hybrid long-term and short-term memory for test-time adaptation in language agents, reporting relative gains of up to 13% on ALFWorld and 11% on WebShop over static baselines.
Cross-Model Entropy supplies a continuous label-free reward for RL post-training by averaging a generator's response log-likelihood under an independent verifier model, yielding win-rate gains on instruction following.
ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficiency than GRPO on ALFWorld and WebShop.
SCA applies the Information Bottleneck principle via NIBS and GIBS methods to identify erroneous steps in black-box LLM reasoning and boosts self-correction success by up to 13.5%.
PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.
SRaR attributes rubric items to specific steps via an LLM judge, normalizes per-step scores across rollouts, and combines them with outcome rewards via a decoupled advantage estimator, yielding 3.57-point accuracy gains on Qwen3-8B across math benchmarks.
CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.
citing papers explorer
-
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.