
arxiv: 2607.02436 · v1 · pith:6RSK5U7Rnew · submitted 2026-07-02 · 💻 cs.SE · cs.AI

Reasoning effort, not tool access, buys first-try reliability in agentic code generation: an observational study

Pith reviewed 2026-07-03 08:29 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords agentic code generation · reasoning effort · first-try reliability · observational study · AI coding agents · testing tools · prompt design · software engineering

The pith

Raising reasoning effort lifts first-try perfect agentic code runs from 28 to 89 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports results from an observational study of ninety independent agent runs that each generated the same real-time retrospective board from one fixed specification. Each run was scored on a fourteen-criterion functional rubric plus a visual quality review. The study varied model generations, agent harnesses, reasoning effort levels, a testing tool, and design-oriented prompts. Higher reasoning effort produced the largest gains in first-try functional correctness and sharply reduced corrective prompts, while the testing tool increased cost with no reliability benefit and design prompts mainly improved visuals. The central lesson drawn is that most first-run failures trace to weak reasoning rather than detectable interface flaws.

Core claim

Capability tier dominated, with frontier models near ceiling scores and a low-cost local model at 24 to 37 points. Container deployment failed first-try in 44 percent of runs. Raising reasoning effort from High to xHigh increased first-try perfect runs from 28 percent to 89 percent and cut corrective prompts about fivefold for 9 to 29 percent more cost. The testing tool raised cost by 42 to 68 percent without raising functional score or reliability even on interface-visible criteria. A design-oriented prompt raised visual quality from 3.0 to 4.5 on a five-point scale without lifting function, and a one-paragraph paraphrase reproduced the lift.
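The cost and reliability figures above can be combined with a little arithmetic. The following is a minimal sketch, not a calculation from the paper: it assumes the High-effort session cost is normalized to 1.0 and that the pooled first-try rates apply uniformly, and the function name `cost_per_first_try_perfect` is hypothetical. It also ignores the cost of corrective prompts.

```python
def cost_per_first_try_perfect(relative_cost: float, perfect_rate: float) -> float:
    """Expected session spend per first-try-perfect run (cost / success rate)."""
    if not 0 < perfect_rate <= 1:
        raise ValueError("perfect_rate must be in (0, 1]")
    return relative_cost / perfect_rate


# Reported figures: 28% -> 89% first-try perfect, for 9% to 29% more cost.
# Normalizing the High-effort cost to 1.0 is an assumption of this sketch.
high = cost_per_first_try_perfect(1.00, 0.28)
xhigh_low = cost_per_first_try_perfect(1.09, 0.89)
xhigh_high = cost_per_first_try_perfect(1.29, 0.89)

print(f"High:  {high:.2f}")                               # 3.57
print(f"xHigh: {xhigh_low:.2f} to {xhigh_high:.2f}")      # 1.22 to 1.45
```

Under these assumptions, the spend per first-try-perfect run falls by roughly 2.5 to 2.9 times at xHigh, even though each individual session costs more.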

What carries the argument

The set of ninety independent runs that held the target application and specification fixed while varying reasoning effort and tool access, scored on a fixed 14-criterion rubric.

Load-bearing premise

Observed differences across the runs can be attributed to the manipulated variables of reasoning effort and tool access rather than uncontrolled differences in agent harness behavior or model internals.

What would settle it

Repeating the ninety runs while randomizing harness behavior or model sampling parameters but keeping reasoning effort and tool access fixed, then checking whether the 28-to-89 percent gap in perfect runs disappears.
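One way to check whether a gap in perfect-run rates "disappears" in such a replication is a one-sided Fisher exact test on the per-arm counts. The sketch below is illustrative only: the page does not report per-arm counts, so the values 5 of 18 and 16 of 18 are hypothetical placeholders chosen solely because they round to 28 and 89 percent. The function name `fisher_one_sided` is also an assumption. A small p-value says the gap is unlikely to be chance, but it says nothing about confounding.

```python
from math import comb


def fisher_one_sided(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """P(arm B shows >= succ_b successes) under no true difference,
    with both margins fixed (one-sided Fisher exact test)."""
    total = n_a + n_b
    total_succ = succ_a + succ_b
    upper = min(n_b, total_succ)
    tail = sum(
        comb(total_succ, k) * comb(total - total_succ, n_b - k)
        for k in range(succ_b, upper + 1)
    )
    return tail / comb(total, n_b)


# HYPOTHETICAL counts: 5/18 ~ 28% (High) versus 16/18 ~ 89% (xHigh).
p_gap = fisher_one_sided(succ_a=5, n_a=18, succ_b=16, n_b=18)
# A replication in which the gap vanished might look like 9/18 versus 9/18.
p_no_gap = fisher_one_sided(succ_a=9, n_a=18, succ_b=9, n_b=18)

print(f"gap present: p = {p_gap:.6f}")
print(f"gap absent:  p = {p_no_gap:.3f}")
```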

Figures

Figures reproduced from arXiv: 2607.02436 by Achint Mehta.

Figure 2: Effect of adding the testing tool on Opus 4.7. Left: functional score is unchanged. Right: session cost is higher with the tool. Boxes show the median and interquartile range, whiskers show the spread, and open circles are outliers. The session token records show what each premium actually buys (…
Figure 3: Reasoning effort on Opus 4.7, High versus xHigh. Left: the share of runs perfect on the first try, by condition; raising effort lifts the base and Playwright cells to the ceiling and partially recovers the full design prompt cell, but leaves the abridged prompt cell unchanged. The dashed line separates the sweep conditions from the abridged prompt ablation, which is excluded from the pooled statistics in t…
Figure 5 makes the contrast concrete. Each row is one Opus configuration and each column is a condition: the base prompt, the base prompt plus the testing tool, the base prompt plus the full design prompt, and, from the ablation, the base prompt plus the abridged design prompt. The first two columns, which lack a design prompt, show clean but plain default interfaces across all three configurations; adding the test…
Original abstract

Agentic coding assistants are increasingly given extra capabilities, such as browser based testing tools and design oriented system prompts, on the assumption that more capability yields better software. This study tested that assumption directly. Ninety independent agent runs built the same application, a real time retrospective board, from one detailed specification, each scored on a fixed 14 criterion functional rubric (42 point maximum) and a visual quality review. The runs spanned several model generations, two agent harnesses, two reasoning effort levels, a testing tool, and two design oriented prompts. Capability tier dominated: frontier models clustered near the ceiling while a low cost local model fell to 24 to 37 points. A criterion level analysis revealed what run totals conceal. Container deployment was the dominant defect, failing first try in 44 percent of runs, with its failure rate shifting sharply across model generations while mean totals moved less than a point. The testing tool raised cost by 42 to 68 percent without improving functional score or reliability, even on interface visible criteria. Raising reasoning effort from High to xHigh lifted first try perfect runs from 28 percent to 89 percent and cut corrective prompts about five fold, for 9 to 29 percent more cost. A design oriented prompt raised visual quality, 4.5 versus 3.0 on a 5 point scale, without lifting function, and a one paragraph paraphrase of its directive reproduced the entire lift. The practical lesson is to match the fix to the failure: most first run failures came from weak reasoning, which a stronger model or more effort prevents, not from visible flaws a checking tool would catch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports results from an observational study of 90 independent agent runs, each building the same real-time retrospective board application from one detailed specification. Runs vary across model generations, two agent harnesses, two reasoning-effort levels (High vs. xHigh), presence/absence of a testing tool, and two design-oriented prompts. Outcomes are scored on a 14-criterion functional rubric (max 42 points) plus visual quality. The central claims are that model capability dominates scores, the testing tool increases cost 42-68% with no reliability or score benefit, raising reasoning effort from High to xHigh increases first-try perfect runs from 28% to 89% and reduces corrective prompts fivefold at 9-29% extra cost, and a design prompt improves visual quality (4.5 vs 3.0) without affecting function.

Significance. If the attribution of the 28%-to-89% lift and fivefold prompt reduction to reasoning effort (rather than model/harness imbalance) can be substantiated, the work supplies concrete, actionable empirical guidance for agentic code-generation configurations. The criterion-level breakdown (e.g., container-deployment failures) and the observation that a one-paragraph prompt paraphrase reproduces the visual-quality gain are useful for practitioners. The study also supplies a reproducible rubric and 90-run dataset that future controlled experiments can build upon.

major comments (2)
  1. [Abstract] Abstract: The headline result that 'Raising reasoning effort from High to xHigh lifted first try perfect runs from 28 percent to 89 percent' is presented as a causal effect of the effort variable. However, the 90 runs span 'several model generations' and 'two agent harnesses' with no mention of randomization, blocking, or pre-specified stratification on those factors. Without evidence that the xHigh subset was not disproportionately allocated stronger models or harnesses, the observed difference cannot be confidently attributed to reasoning effort rather than confounding.
  2. [Abstract] Abstract: The parallel claim that 'the testing tool raised cost by 42 to 68 percent without improving functional score or reliability' is likewise observational. The manuscript provides no table or section showing the marginal allocation of tool access independent of model/harness/effort combinations, so the same confounding risk applies to the 'no benefit' conclusion.
minor comments (2)
  1. [Abstract] Abstract: 'real time retrospective board' should be hyphenated as 'real-time'.
  2. [Abstract] Abstract: The phrase 'cut corrective prompts about five fold' should be 'fivefold' (one word) for standard usage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the risk of over-interpreting observational associations as causal effects. We agree that the abstract phrasing requires revision to better reflect the study's observational design and to supply readers with the allocation details needed to evaluate confounding. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline result that 'Raising reasoning effort from High to xHigh lifted first try perfect runs from 28 percent to 89 percent' is presented as a causal effect of the effort variable. However, the 90 runs span 'several model generations' and 'two agent harnesses' with no mention of randomization, blocking, or pre-specified stratification on those factors. Without evidence that the xHigh subset was not disproportionately allocated stronger models or harnesses, the observed difference cannot be confidently attributed to reasoning effort rather than confounding.

    Authors: We accept the point. The manuscript is explicitly framed as an observational study (see title and methods), and the abstract already notes that runs 'spanned several model generations, two agent harnesses' etc. Nevertheless, the verb 'lifted' can be read as implying a controlled causal contrast. We will revise the abstract to replace 'lifted' with 'was associated with' (or equivalent) and will add a short table or paragraph in the results section that cross-tabulates reasoning-effort level against model generation and harness. This will let readers directly inspect the marginal distributions and judge the plausibility of confounding. The same revision will be applied to the parallel claim about corrective-prompt reduction. revision: yes

  2. Referee: [Abstract] Abstract: The parallel claim that 'the testing tool raised cost by 42 to 68 percent without improving functional score or reliability' is likewise observational. The manuscript provides no table or section showing the marginal allocation of tool access independent of model/harness/effort combinations, so the same confounding risk applies to the 'no benefit' conclusion.

    Authors: We agree. The testing-tool comparison is also observational and subject to the same allocation concerns. We will revise the abstract wording from 'raised cost … without improving' to a more neutral phrasing such as 'was associated with 42–68 % higher cost and showed no reliable difference in functional score or first-try reliability.' In addition, we will include the requested marginal-allocation table (or expanded cross-tabulation) covering tool access alongside model, harness, and effort. This change will be made in the next manuscript version. revision: yes
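The cross-tabulation promised in both responses could be produced along the following lines. This is a sketch under stated assumptions, not the authors' analysis: the run records, field names (`model`, `harness`, `effort`, `tool`), and values are hypothetical placeholders rather than rows from the paper's dataset, and `cross_tab` is an illustrative helper.

```python
from collections import Counter

# HYPOTHETICAL run records; placeholders, not data from the study.
runs = [
    {"model": "model-A", "harness": "harness-1", "effort": "High", "tool": False},
    {"model": "model-A", "harness": "harness-1", "effort": "xHigh", "tool": False},
    {"model": "model-A", "harness": "harness-2", "effort": "xHigh", "tool": True},
    {"model": "model-B", "harness": "harness-1", "effort": "High", "tool": True},
    {"model": "model-B", "harness": "harness-2", "effort": "High", "tool": False},
]


def cross_tab(records, row_keys, col_key):
    """Count runs per (stratum, level), where a stratum is a tuple of row_keys values."""
    counts = Counter()
    for record in records:
        stratum = tuple(record[key] for key in row_keys)
        counts[(stratum, record[col_key])] += 1
    return counts


effort_by_stratum = cross_tab(runs, ("model", "harness"), "effort")
tool_by_stratum = cross_tab(runs, ("model", "harness", "effort"), "tool")

for (stratum, level), n in sorted(effort_by_stratum.items()):
    print(stratum, level, n)
```

An empty cell, such as a model and harness combination with no xHigh runs, is the kind of allocation imbalance the referee asks readers to be able to see.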

Circularity Check

0 steps flagged

No circularity: direct empirical counts from independent runs

Full rationale

The paper is an observational study reporting raw counts and percentages from 90 independent agent runs scored against a fixed external 14-criterion rubric. No equations, fitted parameters, predictions derived from inputs, or self-citations appear in the provided text. All claims (e.g., 28% to 89% lift in first-try perfect runs) are direct tallies of observed outcomes, not reductions by construction. The derivation chain is self-contained as empirical reporting with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the 14-criterion rubric as a proxy for functional quality and on the assumption that the single chosen application and 90 runs adequately represent broader agent behavior.

axioms (1)
  • Domain assumption: the fixed 14-criterion functional rubric accurately captures application correctness across all runs.
    All functional scoring and reliability claims are derived from this rubric.


