AgentProcessBench: Diagnosing step-level process quality in tool-using agents

Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen, et al. · 2026 · cs.AI · arXiv 2603.14465

7 Pith papers cite this work. Polarity classification is still indexing.

abstract

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking

cs.AI · 2026-03-28 · unverdicted · novelty 7.0

WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.

SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use

cs.AI · 2026-07-02 · unverdicted · novelty 6.0

SkillCoach introduces self-evolving rubrics derived from rollouts to evaluate and supervise four process dimensions of agentic skill-use separately from outcome success.

SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

cs.AI · 2026-06-10 · unverdicted · novelty 6.0

Empirical study finds Progressive Disclosure raises distinct resources touched (1.18 to 3.85) and uptake events (1.33 to 3.92) per trajectory, adds 17 passing trials out of 410 (+4.1%), with gains task-dependent.

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

OpenClawBench annotates 31,264 agent trajectories to show that roughly 9% of task-successful executions contain measurable process anomalies, and a fine-tuned detector reaches F1 0.729 on held-out data.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

cs.CL · 2026-05-18 · unverdicted · novelty 5.0

SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.

Interactive Evaluation Requires a Design Science

cs.AI · 2026-05-18 · unverdicted · novelty 5.0

Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.

citing papers explorer

WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking
cs.AI · 2026-03-28 · unverdicted · none · ref 4 · internal anchor

SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use
cs.AI · 2026-07-02 · unverdicted · none · ref 8 · internal anchor

SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior
cs.AI · 2026-06-10 · unverdicted · none · ref 2 · internal anchor

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories
cs.AI · 2026-05-28 · unverdicted · none · ref 3 · internal anchor

SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL · 2026-05-08 · unverdicted · none · ref 7 · internal anchor

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
cs.CL · 2026-05-18 · unverdicted · none · ref 11 · internal anchor

Interactive Evaluation Requires a Design Science
cs.AI · 2026-05-18 · unverdicted · none · ref 13 · internal anchor
