WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.
Agentprocessbench: Diagnosing step-level process quality in tool-using agents
7 Pith papers cite this work. Polarity classification is still indexing.
abstract
While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
citation-role summary
citation-polarity summary
years
2026 7verdicts
UNVERDICTED 7roles
background 2polarities
background 2representative citing papers
SkillCoach introduces self-evolving rubrics derived from rollouts to evaluate and supervise four process dimensions of agentic skill-use separately from outcome success.
Empirical study finds Progressive Disclosure raises distinct resources touched (1.18 to 3.85) and uptake events (1.33 to 3.92) per trajectory, adds 17 passing trials out of 410 (+4.1%), with gains task-dependent.
OpenClawBench annotates 31,264 agent trajectories to show that roughly 9% of task-successful executions contain measurable process anomalies, and a fine-tuned detector reaches F1 0.729 on held-out data.
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.
citing papers explorer
-
SOD: Step-wise On-policy Distillation for Small Language Model Agents
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
-
SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.