
arxiv: 2605.11946 · v2 · pith:G5ZA4AWUnew · submitted 2026-05-12 · 💻 cs.AI

Counterfactual Trace Auditing of LLM Agent Skills

Pith reviewed 2026-06-30 22:30 UTC · model grok-4.3

classification 💻 cs.AI
keywords Counterfactual Trace Auditing · LLM agent skills · Skill Influence Patterns · agent trace analysis · behavioral evaluation · software engineering agents · pass rate gap

The pith

Counterfactual Trace Auditing pairs with-skill and without-skill traces to expose 522 behavioral changes that pass rates overlook.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Counterfactual Trace Auditing as a way to measure how attached skills alter LLM agent behavior on the same tasks. It pairs traces, divides them into goal-directed phases, aligns those phases, and labels Skill Influence Patterns that record specific effects. Applied to 49 software engineering tasks, the method records 522 such patterns even though the aggregate pass rate rises by only 0.3 percentage points. The audit also isolates recurring changes such as template copying and excess planning that current benchmarks miss.

Core claim

Counterfactual Trace Auditing pairs each with-skill agent trace with a without-skill counterpart on the same task, segments both traces into goal-directed phases, aligns the phases, and emits structured Skill Influence Pattern annotations. These annotations describe the behavioral effect of a skill rather than only its task outcome. On SWE-Skills-Bench with Claude across 49 tasks, CTA identifies 522 SIP instances while pass rate changes by only +0.3 percentage points on average. The audit separates effects including literal template copying, off-task artifact creation, excess planning, and task recovery, and shows that high-baseline tasks contain most skill effects, moderate-baseline tasks s

What carries the argument

The Counterfactual Trace Auditing (CTA) framework, which produces Skill Influence Pattern (SIP) annotations by pairing traces, segmenting them into goal-directed phases, and aligning those phases to isolate skill-driven behavioral changes.
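
The pairing and alignment step can be illustrated with a minimal sketch. It assumes each trace has already been reduced to a sequence of labelled phases; the `Phase` type, the phase labels, and the use of `difflib.SequenceMatcher` as the aligner are all illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Phase:
    label: str      # hypothetical phase label, e.g. "plan", "edit", "test"
    n_events: int   # number of trace events grouped into this phase


def align_phases(without_skill, with_skill):
    """Align two phase sequences by label and return divergence records.

    Phases that line up are skipped; everything else is reported as an
    insert, delete, or replace relative to the without-skill trace.
    """
    a = [p.label for p in without_skill]
    b = [p.label for p in with_skill]
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    records = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        records.append({
            "kind": tag,
            "without_skill": a[i1:i2],
            "with_skill": b[j1:j2],
        })
    return records


# Toy traces (made up for illustration, not data from the paper).
baseline = [Phase("explore", 4), Phase("edit", 6), Phase("test", 3)]
skilled = [Phase("plan", 9), Phase("explore", 4), Phase("edit", 6),
           Phase("test", 3), Phase("document", 2)]
records = align_phases(baseline, skilled)
```

In this toy pair the with-skill trace gains a leading "plan" phase and a trailing "document" phase. Those two records are the kind of localized difference that a later step could label, for example as excess planning or off-task artifact creation.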

If this is right

  • High-baseline tasks contain most observed skill effects even though their pass rates are already saturated.
  • Moderate-baseline tasks show the largest recoverable performance gains, often accompanied by substantially higher token cost.
  • Surface anchoring SIPs appear most often on ceiling tasks while edge-case prompting SIPs dominate on mid-range and floor tasks.
  • Skills produce recurring behavioral effects such as literal template copying, off-task artifact creation, excess planning, and task recovery that pass rates do not register.
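
The bucket-level claims above amount to grouping tasks by without-skill pass rate and counting SIP types per group. A rough sketch follows; the bucket thresholds, the record layout, and the SIP label strings are assumptions made for illustration, since this page does not state the paper's exact cut-offs.

```python
from collections import Counter, defaultdict


def bucket(baseline_pass_rate, floor=0.2, ceiling=0.9):
    """Map a without-skill pass rate to a bucket (thresholds are hypothetical)."""
    if baseline_pass_rate >= ceiling:
        return "ceiling"
    if baseline_pass_rate <= floor:
        return "floor"
    return "mid-range"


def dominant_sip_by_bucket(tasks):
    """Return the most frequent SIP type in each baseline bucket."""
    counts = defaultdict(Counter)
    for task in tasks:
        counts[bucket(task["baseline_pass_rate"])].update(task["sips"])
    return {name: c.most_common(1)[0][0] for name, c in counts.items() if c}


# Toy records (made up for illustration, not results from the paper).
tasks = [
    {"baseline_pass_rate": 1.0,
     "sips": ["surface_anchoring", "surface_anchoring", "edge_case_prompting"]},
    {"baseline_pass_rate": 0.5,
     "sips": ["edge_case_prompting", "edge_case_prompting"]},
    {"baseline_pass_rate": 0.0,
     "sips": ["edge_case_prompting"]},
]
dominant = dominant_sip_by_bucket(tasks)
```

Reporting the full per-bucket counters rather than only the top label would make it easier to see how close the second-ranked SIP type is.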

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • CTA-style pairing could be applied to non-software domains such as web agents to detect analogous hidden behavioral shifts.
  • Benchmarks may need to report both outcome metrics and SIP counts to give a fuller picture of skill value.
  • The method opens a route to skill design that targets specific SIP reductions rather than outcome improvement alone.

Load-bearing premise

The phase segmentation and alignment process accurately isolates skill-driven changes without introducing systematic bias from the segmentation rules or the choice of goal-directed phase boundaries.

What would settle it

A manual inspection of a random sample of paired traces that finds the emitted SIP annotations do not match observable differences between the with-skill and without-skill versions or that many clear skill-induced changes receive no SIP label.
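
One way to run such a check is to draw a seeded random sample of trace pairs, have a human label the SIPs in each, and score the emitted annotations against those labels. The sketch below assumes labels are stored as sets keyed by a pair identifier; the function names and label strings are hypothetical.

```python
import random


def sample_pairs(pair_ids, k, seed=0):
    """Draw k trace-pair ids without replacement, reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(pair_ids), k))


def precision_recall(emitted, manual):
    """Score emitted SIP labels against manual labels.

    Both arguments map pair id -> set of SIP labels. Only pairs present
    in `manual` (the inspected sample) are scored.
    """
    tp = fp = fn = 0
    for pair_id, manual_labels in manual.items():
        emitted_labels = emitted.get(pair_id, set())
        tp += len(emitted_labels & manual_labels)   # labels both agree on
        fp += len(emitted_labels - manual_labels)   # emitted, not observed
        fn += len(manual_labels - emitted_labels)   # observed, not emitted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (made up for illustration, not data from the paper).
emitted = {"t1": {"template_copying", "excess_planning"}, "t2": {"task_recovery"}}
manual = {"t1": {"template_copying"}, "t2": {"task_recovery", "off_task_artifact"}}
precision, recall = precision_recall(emitted, manual)
chosen = sample_pairs(["t1", "t2", "t3", "t4"], k=2)
```

Low precision would correspond to the first failure case named above (annotations that do not match observable differences), and low recall to the second (clear skill-induced changes that receive no label).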

Figures

Figures reproduced from arXiv: 2605.11946 by Jinbo Liu, Li Li, Ryan A. Rossi, Xiaolin Zhou, Xiyang Hu.

Figure 1: Counterfactual Trace Auditing (CTA). For each task, CTA compares a paired set of agent trajectories generated with and without an attached skill. The pipeline parses raw logs into typed events, segments each trace into goal-directed phases using a deterministic finite state machine, aligns the two traces at the phase and intent levels, and extracts divergence records that localize behavioral differences. E…
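
The segmentation stage in the caption can be pictured as a very small deterministic state machine: the state is the current phase, and each typed event either keeps that phase or switches to another. The event types, phase names, and transition table below are hypothetical stand-ins, not the paper's rules.

```python
# Hypothetical mapping from typed event to the phase it belongs to.
EVENT_TO_PHASE = {
    "read": "explore",
    "search": "explore",
    "think": "plan",
    "write": "edit",
    "run_tests": "verify",
}


def segment(events):
    """Group a list of typed events into consecutive phases.

    The current phase is the machine's state. An event whose phase equals
    the current state extends the phase; any other event starts a new one.
    Unknown event types fall into an "other" phase.
    """
    phases = []
    for event in events:
        phase = EVENT_TO_PHASE.get(event, "other")
        if phases and phases[-1][0] == phase:
            phases[-1][1].append(event)
        else:
            phases.append((phase, [event]))
    return phases


# Toy event stream (made up for illustration).
events = ["read", "search", "think", "write", "write",
          "run_tests", "write", "run_tests"]
phases = segment(events)
phase_labels = [name for name, _ in phases]
```

Because the mapping is fixed, the same trace always yields the same phases, which is what makes the downstream counts reproducible. It also means every count inherits whatever bias the transition table carries.
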
Original abstract

Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deployed benchmarks report only pass rate before and after a skill is attached, treating the skill as a black-box change to agent behavior. We introduce Counterfactual Trace Auditing (CTA), a framework for measuring how a skill changes agent behavior. CTA pairs each with-skill agent trace with a without-skill counterpart on the same task, segments both traces into goal-directed phases, aligns the phases, and emits structured Skill Influence Pattern (SIP) annotations. These annotations describe the behavioral effect of a skill rather than only its task outcome. We instantiate CTA on SWE-Skills-Bench with Claude across 49 software engineering tasks. The resulting audit reveals a clear evaluation gap. Pass rate changes by only +0.3 percentage points on average, suggesting little aggregate effect. Yet CTA identifies 522 SIP instances across the same paired traces, showing that the skills substantially reshape agent behavior even when pass rate is nearly unchanged. The audit also separates several recurring effects that pass rate cannot detect, including literal template copying, off-task artifact creation, excess planning, and task recovery. Three findings emerge. First, high-baseline tasks contain most of the observed skill effects, although their pass rate is already saturated and therefore cannot reflect those effects. Second, tasks with moderate baseline performance show the most recoverable gain, but often at substantially higher token cost. Third, the dominant SIP type can be identified by baseline bucket: surface anchoring is most common on ceiling tasks and edge-case prompting is most common on mid-range and floor tasks. These regularities turn informal failure-mode observations into reproducible behavioral measurements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Counterfactual Trace Auditing (CTA), which pairs with-skill and without-skill agent traces on the same task, segments both into goal-directed phases, aligns the phases, and annotates Skill Influence Patterns (SIPs) to characterize behavioral changes induced by skills. On 49 SWE-Skills-Bench tasks with Claude, CTA reports 522 SIP instances while pass rate changes by only +0.3 pp on average; it further decomposes recurring effects (literal template copying, excess planning, task recovery) and shows that SIP type and prevalence vary systematically by baseline performance bucket.

Significance. If the measurement pipeline is reliable, CTA supplies a structured, reproducible alternative to outcome-only evaluation for agent skills, converting informal failure-mode observations into countable behavioral patterns. The paired-trace design and separation of surface vs. deeper effects are strengths that could generalize beyond software engineering tasks.

major comments (2)
  1. [§3] §3 (phase segmentation and alignment): the 522 SIP count and all downstream claims about effect types rest on the segmentation rules and alignment procedure, yet the manuscript reports neither inter-annotator agreement, ablation on boundary definitions, nor human validation of the resulting SIP labels. This is load-bearing because the separation of 'literal template copying' from 'excess planning' is defined by those rules.
  2. [§4] §4 (SIP distribution by baseline bucket): the finding that surface anchoring dominates ceiling tasks while edge-case prompting dominates mid-range tasks is presented as an empirical regularity, but without validation that the phase boundaries are insensitive to verbosity or template artifacts, the bucket-specific patterns could be artifacts of the segmentation heuristic rather than skill-driven differences.
minor comments (2)
  1. [Abstract, §4] The abstract and §4 refer to 'three findings' but do not explicitly map each finding to the SIP taxonomy or to a numbered table/figure; adding such cross-references would improve traceability.
  2. [§2] Notation for SIP subtypes (e.g., 'surface anchoring', 'edge-case prompting') is introduced without a consolidated definition table; a small glossary or table in §2 would reduce ambiguity when reading the results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for validation of the phase segmentation and alignment procedures in Counterfactual Trace Auditing. These are indeed central to the reliability of the SIP counts and distributional findings. We address each major comment below and commit to targeted revisions that add the requested analyses while preserving the core empirical results.

Point-by-point responses
  1. Referee: [§3] §3 (phase segmentation and alignment): the 522 SIP count and all downstream claims about effect types rest on the segmentation rules and alignment procedure, yet the manuscript reports neither inter-annotator agreement, ablation on boundary definitions, nor human validation of the resulting SIP labels. This is load-bearing because the separation of 'literal template copying' from 'excess planning' is defined by those rules.

    Authors: We agree that the segmentation and alignment rules are load-bearing for the 522 SIP instances and the separation of effect types such as literal template copying versus excess planning. The rules are deterministic, based on explicit criteria for goal-directed phases (planning, execution, verification) derived from trace actions, which reduces but does not eliminate the need for validation. In the revised manuscript we will add to §3: (i) an ablation varying boundary definitions (e.g., minimum phase length and action-type thresholds) and reporting resulting changes in SIP counts and type distributions; (ii) human validation on a stratified sample of 50 paired traces, with two independent annotators labeling SIPs and reporting agreement (Cohen's kappa) against the automated outputs. These additions directly address the concern without altering the reported aggregate findings. revision: yes

  2. Referee: [§4] §4 (SIP distribution by baseline bucket): the finding that surface anchoring dominates ceiling tasks while edge-case prompting dominates mid-range tasks is presented as an empirical regularity, but without validation that the phase boundaries are insensitive to verbosity or template artifacts, the bucket-specific patterns could be artifacts of the segmentation heuristic rather than skill-driven differences.

    Authors: The bucket-specific patterns (surface anchoring on ceiling tasks, edge-case prompting on mid-range tasks) rely on consistent phase boundaries. The ablation study described in the response to §3 will explicitly test sensitivity to verbosity and template artifacts by (a) length-normalizing traces before segmentation and (b) masking recurring template phrases prior to phase detection, then re-computing the SIP distributions by baseline bucket. We will report in the revised §4 whether the dominance patterns remain stable under these perturbations. If they do, this supports interpreting the regularities as skill-driven rather than heuristic artifacts; any sensitivity will be disclosed. revision: yes
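
The agreement statistic proposed in the first response, Cohen's kappa, corrects raw agreement for the agreement expected by chance. A self-contained sketch is given below, assuming each rater assigns exactly one SIP label per item; the label strings are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected from each rater's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy annotations (made up for illustration, not data from the paper).
automated = ["copy", "copy", "plan", "plan"]
annotator = ["copy", "copy", "plan", "copy"]
kappa = cohens_kappa(automated, annotator)
```

Here raw agreement is 0.75 and chance agreement is 0.5, so kappa comes out at 0.5. The same function can compare the two human annotators with each other as well as each annotator with the automated output.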

Circularity Check

0 steps flagged

No significant circularity; CTA is a direct measurement procedure on paired traces

Full rationale

The paper defines CTA as a sequence of operations (pairing traces, segmenting into goal-directed phases, aligning, and annotating SIPs) and applies it to produce the 522 SIP count and behavioral effect classifications. These outputs are generated by executing the defined procedure on the input traces rather than by fitting parameters to a subset and relabeling the fit as a prediction, or by reducing via self-citation to an unverified premise. No equations, uniqueness theorems, or ansatzes are invoked that loop back to the target quantities. The framework is presented as an auditing tool whose results are the direct product of its application rules.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on the assumption that traces can be reliably segmented into goal-directed phases and that alignment between paired traces isolates skill effects. No free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption: Agent traces can be segmented into goal-directed phases whose boundaries are identifiable by human or automated inspection.
    The method description states that traces are segmented into goal-directed phases before alignment.
  • domain assumption: Alignment of with-skill and without-skill phases on the same task isolates the behavioral contribution of the skill.
    The core of CTA is the pairing and phase alignment step.
invented entities (1)
  • Skill Influence Pattern (SIP) · no independent evidence
    purpose: Structured annotation describing a specific behavioral effect of attaching a skill.
    New label type introduced to capture effects beyond task outcome.

pith-pipeline@v0.9.1-grok · 5834 in / 1359 out tokens · 18192 ms · 2026-06-30T22:30:27.863805+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use

    cs.AI · 2026-07 · unverdicted · novelty 6.0

    SkillCoach introduces self-evolving rubrics derived from rollouts to evaluate and supervise four process dimensions of agentic skill-use separately from outcome success.

  2. Harnessing Agent Skills: Architectural Patterns and a Reference Architecture for Skill-Mediated LLM Agents

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    Catalogs ten patterns and synthesizes a four-layer reference architecture for skill harnessing in LLM agents, evaluated via cross-instantiation on eight systems.
