hub Canonical reference

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Peiyi Wang, Lei Li, Zhihong Shao, R.X. Xu, Damai Dai, Yifei Li · 2023 · cs.AI · arXiv 2312.08935

Canonical reference. 73% of citing Pith papers cite this work as background.

51 Pith papers citing it

Background 73% of classified citations

open full Pith review browse 51 citing papers arXiv PDF

abstract

In this paper, we present an innovative process-oriented math process reward model called \textbf{Math-Shepherd}, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) \textit{Verification}: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) \textit{Reinforcement Learning}: Math-Shepherd is employed to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates exceptional performance. For instance, the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9\%$\to$84.1\% on GSM8K and 28.6\%$\to$33.0\% on MATH). The accuracy can be further enhanced to 89.1\% and 43.5\% on GSM8K and MATH with the verification of Math-Shepherd, respectively. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 method 2 baseline 1 dataset 1

citation-polarity summary

background 8 use method 2 baseline 1

representative citing papers

Transformers Provably Learn to Internalize Chain-of-Thought

cs.LG · 2026-05-27 · unverdicted · novelty 8.0

L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.

MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

Operadic consistency is a new per-question signal that correlates strongly with accuracy (r 0.86-0.94) across four multi-hop QA datasets and improves selective prediction over CoT-SC baselines.

The Hidden Bias of Process Reward Models:PRISM for Rewarding the Right Reasoning

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.

Aggregating LLM-Based Weak Verifiers for Spatial Layout Generation

cs.GR · 2026-06-03 · unverdicted · novelty 7.0

Aggregating many LLM-synthesized weak verifiers via weak learning from sparse labels yields stronger verifiers that improve F1 by up to 7X over direct LLM judges on 3D room and 2D poster tasks and boost generation quality by 66.2%.

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.

POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.

Fine-Tuning Small Reasoning Models for Quantum Field Theory

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

One Step is Enough: Multi-Agent Reinforcement Learning based on One-Step Policy Optimization for Order Dispatch on Ride-Sharing Platforms

cs.AI · 2025-07-21 · conditional · novelty 7.0

OSPO trains optimal order dispatch policies for homogeneous AV fleets using only one-step group rewards, outperforming GRPO on a real ride-hailing dataset.

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

cs.CL · 2024-10-10 · conditional · novelty 7.0

Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

cs.AI · 2026-07-01 · unverdicted · novelty 6.0

Theoria rewrites solutions into auditable typed state transitions with justifications, certifying 105 of 185 HLE problems at 91.4% precision and outperforming holistic judges on adversarial poisoned proofs by catching hidden premises.

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

RASFT is an adaptive SFT method that strengthens or relaxes expert imitation per problem based on on-policy rollout solvability and adds clipped reference-policy ratio to limit drift, reporting better results than standard SFT and RL on math and code benchmarks.

AdaMEM: Test-Time Adaptive Memory for Language Agents

cs.AI · 2026-06-04 · unverdicted · novelty 6.0

AdaMEM proposes hybrid long-term and short-term memory for test-time adaptation in language agents, reporting relative gains of up to 13% on ALFWorld and 11% on WebShop over static baselines.

Label-Free Reinforcement Learning via Cross-Model Entropy

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

Cross-Model Entropy supplies a continuous label-free reward for RL post-training by averaging a generator's response log-likelihood under an independent verifier model, yielding win-rate gains on instruction following.

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficiency than GRPO on ALFWorld and WebShop.

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

SCA applies the Information Bottleneck principle via NIBS and GIBS methods to identify erroneous steps in black-box LLM reasoning and boosts self-correction success by up to 13.5%.

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.

Step-wise Rubric Rewards for LLM Reasoning

cs.LG · 2026-05-17 · conditional · novelty 6.0

SRaR attributes rubric items to specific steps via an LLM judge, normalizes per-step scores across rollouts, and combines them with outcome rewards via a decoupled advantage estimator, yielding 3.57-point accuracy gains on Qwen3-8B across math benchmarks.

CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.

Process Supervision of Confidence Margin for Calibrated LLM Reasoning

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

SeLaR: Selective Latent Reasoning in Large Language Models

cs.CL · 2026-04-09 · unverdicted · novelty 6.0

SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

citing papers explorer

Showing 2 of 2 citing papers after filters.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey cs.AI · 2025-09-02 · accept · none · ref 185 · internal anchor
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence cs.AI · 2025-07-28 · accept · none · ref 232 · internal anchor
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer