Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Canonical reference. 73% of citing Pith papers cite this work as background.

51 Pith papers citing it
Background 73% of classified citations
abstract

In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) Reinforcement Learning: Math-Shepherd is employed to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates exceptional performance. For instance, the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9% → 84.1% on GSM8K and 28.6% → 33.0% on MATH). The accuracy can be further enhanced to 89.1% and 43.5% on GSM8K and MATH with the verification of Math-Shepherd, respectively. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.
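The verification scenario in the abstract (reranking several candidate solutions by step-level reward) can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `toy_score_steps` and `TOY_SCORES` stand in for a real process reward model, and taking the minimum step score as the solution score is an assumption made for illustration, not something the abstract specifies.

```python
from typing import Callable, List, Sequence

Solution = List[str]  # one string per reasoning step


def solution_score(step_scores: Sequence[float]) -> float:
    """Collapse per-step rewards into one score.

    Assumption: the weakest step decides the score (minimum).
    Other aggregations (product, mean, last step) are equally possible.
    """
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    return min(step_scores)


def rerank(
    candidates: Sequence[Solution],
    score_steps: Callable[[Solution], Sequence[float]],
) -> List[Solution]:
    """Return candidates sorted from highest to lowest aggregated score."""
    return sorted(
        candidates,
        key=lambda sol: solution_score(score_steps(sol)),
        reverse=True,
    )


# Hypothetical stand-in for a trained process reward model:
# a lookup table mapping each step to a made-up reward in [0, 1].
TOY_SCORES = {
    "2 + 3 = 5": 0.9,
    "5 * 4 = 20": 0.8,
    "2 + 3 = 6": 0.1,
    "6 * 4 = 24": 0.7,
}


def toy_score_steps(solution: Solution) -> List[float]:
    return [TOY_SCORES[step] for step in solution]


candidates = [
    ["2 + 3 = 6", "6 * 4 = 24"],  # contains an early arithmetic slip
    ["2 + 3 = 5", "5 * 4 = 20"],
]
best = rerank(candidates, toy_score_steps)[0]
print(best)
```

With a minimum-based aggregation, a single low-scoring step is enough to push a candidate down the ranking, which is the intuition behind scoring steps rather than only final answers.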

citation-role summary

background: 7 · method: 2 · baseline: 1 · dataset: 1

representative citing papers

Aggregating LLM-Based Weak Verifiers for Spatial Layout Generation

cs.GR · 2026-06-03 · unverdicted · novelty 7.0

Aggregating many LLM-synthesized weak verifiers via weak learning from sparse labels yields stronger verifiers that improve F1 by up to 7x over direct LLM judges on 3D room and 2D poster tasks and boost generation quality by 66.2%.

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

RASFT is an adaptive SFT method that strengthens or relaxes expert imitation per problem based on on-policy rollout solvability and adds a clipped reference-policy ratio to limit drift, reporting better results than standard SFT and RL on math and code benchmarks.

AdaMEM: Test-Time Adaptive Memory for Language Agents

cs.AI · 2026-06-04 · unverdicted · novelty 6.0

AdaMEM proposes hybrid long-term and short-term memory for test-time adaptation in language agents, reporting relative gains of up to 13% on ALFWorld and 11% on WebShop over static baselines.

Label-Free Reinforcement Learning via Cross-Model Entropy

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

Cross-Model Entropy supplies a continuous label-free reward for RL post-training by averaging a generator's response log-likelihood under an independent verifier model, yielding win-rate gains on instruction following.

Step-wise Rubric Rewards for LLM Reasoning

cs.LG · 2026-05-17 · conditional · novelty 6.0

SRaR attributes rubric items to specific steps via an LLM judge, normalizes per-step scores across rollouts, and combines them with outcome rewards via a decoupled advantage estimator, yielding 3.57-point accuracy gains on Qwen3-8B across math benchmarks.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
