Reasoning effort, not tool access, buys first-try reliability in agentic code generation: an observational study

Achint Mehta

arxiv: 2607.02436 · v1 · pith:6RSK5U7Rnew · submitted 2026-07-02 · 💻 cs.SE · cs.AI

Pith reviewed 2026-07-03 08:29 UTC · model grok-4.3

keywords: agentic code generation · reasoning effort · first-try reliability · observational study · AI coding agents · testing tools · prompt design · software engineering

The pith

Raising reasoning effort lifts first-try perfect agentic code runs from 28 to 89 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports results from an observational study of ninety independent agent runs that each generated the same real-time retrospective board from one fixed specification. Each run was scored on a fourteen-criterion functional rubric plus a visual quality review. The study varied model generations, agent harnesses, reasoning effort levels, a testing tool, and design-oriented prompts. Higher reasoning effort produced the largest gains in first-try functional correctness and sharply reduced corrective prompts, while the testing tool increased cost with no reliability benefit and design prompts mainly improved visuals. The central lesson drawn is that most first-run failures trace to weak reasoning rather than detectable interface flaws.

Core claim

Capability tier dominated: frontier models scored near the ceiling of the 42-point rubric, while a low-cost local model scored 24 to 37 points. Container deployment failed on the first try in 44 percent of runs. Raising reasoning effort from High to xHigh increased first-try perfect runs from 28 percent to 89 percent and cut corrective prompts about fivefold, at 9 to 29 percent more cost. The testing tool raised cost by 42 to 68 percent without raising functional score or reliability, even on interface-visible criteria. A design-oriented prompt raised visual quality from 3.0 to 4.5 on a five-point scale without lifting function, and a one-paragraph paraphrase of its directive reproduced the lift.
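To make the cost trade-off concrete, the reported percentages can be combined into a rough "spend per first-try perfect run" figure. This is a back-of-envelope sketch, not a metric from the paper: the baseline session cost is normalized to 1.0 (an assumption), the function name is hypothetical, and the cost of corrective prompts on imperfect runs is ignored.

```python
# Illustrative arithmetic only. BASE_COST = 1.0 is an assumed normalization;
# the rates and cost premiums are the figures quoted in the text above.
BASE_COST = 1.0
HIGH_RATE = 0.28               # share of first-try perfect runs at High effort
XHIGH_RATE = 0.89              # share of first-try perfect runs at xHigh effort
COST_PREMIUMS = (0.09, 0.29)   # reported range of extra cost for xHigh


def cost_per_perfect_run(session_cost: float, perfect_rate: float) -> float:
    """Expected spend per first-try perfect run (hypothetical metric)."""
    if not 0 < perfect_rate <= 1:
        raise ValueError("perfect_rate must be in (0, 1]")
    return session_cost / perfect_rate


high = cost_per_perfect_run(BASE_COST, HIGH_RATE)
xhigh_low, xhigh_high = (
    cost_per_perfect_run(BASE_COST * (1 + premium), XHIGH_RATE)
    for premium in COST_PREMIUMS
)

print(f"High effort:  {high:.2f} baseline sessions per perfect run")
print(f"xHigh effort: {xhigh_low:.2f} to {xhigh_high:.2f} baseline sessions per perfect run")
```

Under these assumptions, the higher per-session price of xHigh is more than offset by the higher share of runs that need no correction.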

What carries the argument

The set of ninety independent runs that held the target application and specification fixed while varying reasoning effort and tool access, scored on a fixed 14-criterion rubric.

Load-bearing premise

Observed differences across the runs can be attributed to the manipulated variables of reasoning effort and tool access rather than uncontrolled differences in agent harness behavior or model internals.

What would settle it

Repeating the ninety runs while randomizing harness behavior or model sampling parameters but keeping reasoning effort and tool access fixed, then checking whether the 28-to-89 percent gap in perfect runs disappears.

Figures

Figures reproduced from arXiv: 2607.02436 by Achint Mehta.

**Figure 2.** Effect of adding the testing tool on Opus 4.7. Left: functional score is unchanged. Right: session cost is higher with the tool. Boxes show the median and interquartile range, whiskers show the spread, and open circles are outliers. The session token records show what each premium actually buys (…

**Figure 3.** Reasoning effort on Opus 4.7, High versus xHigh. Left: the share of runs perfect on the first try, by condition; raising effort lifts the base and Playwright cells to the ceiling and partially recovers the full design prompt cell, but leaves the abridged prompt cell unchanged. The dashed line separates the sweep conditions from the abridged prompt ablation, which is excluded from the pooled statistics in t…

**Figure 5.** Figure 5 makes the contrast concrete. Each row is one Opus configuration and each column is a condition: the base prompt, the base prompt plus the testing tool, the base prompt plus the full design prompt, and, from the ablation, the base prompt plus the abridged design prompt. The first two columns, which lack a design prompt, show clean but plain default interfaces across all three configurations; adding the test…

Original abstract

Agentic coding assistants are increasingly given extra capabilities, such as browser based testing tools and design oriented system prompts, on the assumption that more capability yields better software. This study tested that assumption directly. Ninety independent agent runs built the same application, a real time retrospective board, from one detailed specification, each scored on a fixed 14 criterion functional rubric (42 point maximum) and a visual quality review. The runs spanned several model generations, two agent harnesses, two reasoning effort levels, a testing tool, and two design oriented prompts. Capability tier dominated: frontier models clustered near the ceiling while a low cost local model fell to 24 to 37 points. A criterion level analysis revealed what run totals conceal. Container deployment was the dominant defect, failing first try in 44 percent of runs, with its failure rate shifting sharply across model generations while mean totals moved less than a point. The testing tool raised cost by 42 to 68 percent without improving functional score or reliability, even on interface visible criteria. Raising reasoning effort from High to xHigh lifted first try perfect runs from 28 percent to 89 percent and cut corrective prompts about five fold, for 9 to 29 percent more cost. A design oriented prompt raised visual quality, 4.5 versus 3.0 on a 5 point scale, without lifting function, and a one paragraph paraphrase of its directive reproduced the entire lift. The practical lesson is to match the fix to the failure: most first run failures came from weak reasoning, which a stronger model or more effort prevents, not from visible flaws a checking tool would catch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Higher reasoning effort lifted first-try perfect runs from 28% to 89% in the 90 runs, but the observational mix of models and harnesses leaves the cause unclear.

The main thing to take from this paper is the size of the reported lift when reasoning effort went from High to xHigh: first-try perfect runs rose from 28 percent to 89 percent and corrective prompts dropped about fivefold, at 9 to 29 percent extra cost. The testing tool, by contrast, raised cost by 42 to 68 percent with no improvement in score or reliability.

The study supplies direct counts from 90 runs on one real application, scored against a 14-criterion rubric plus visual review. The criterion-level view is the stronger part: container deployment failed first try in 44 percent of runs and varied sharply by model generation while totals barely moved. The design prompt improved visuals without touching function, and a short paraphrase reproduced the gain. These are concrete, falsifiable observations.

The soft spot is the design. Runs crossed several model generations and two harnesses in addition to the effort and tool variables. The abstract gives no sign of randomization, blocking, or balanced allocation, so the effort effect could partly reflect stronger models or harnesses landing in the xHigh group. The summary reports no error bars or significance tests, and every result comes from a single application.
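As an illustration of the kind of error bar the letter asks for, a Wilson score interval can be placed around each first-try-perfect proportion. This is a minimal sketch: the per-condition counts below are hypothetical, chosen only because they round to the reported 28 and 89 percent, and the abstract does not state the actual denominators.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# HYPOTHETICAL counts, not taken from the paper: 5/18 and 16/18 are used only
# because they round to the reported 28% and 89%.
high_ci = wilson_interval(5, 18)
xhigh_ci = wilson_interval(16, 18)

print(f"High:  {high_ci[0]:.3f} to {high_ci[1]:.3f}")
print(f"xHigh: {xhigh_ci[0]:.3f} to {xhigh_ci[1]:.3f}")
```

An interval of this kind describes sampling uncertainty only; it says nothing about whether the two groups were comparable in model or harness.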

This is for practitioners tuning agentic coding setups and for researchers who want head-to-head numbers on effort versus tools. A reader who needs empirical trade-off data will find usable figures even with the limits. It deserves peer review because the raw comparisons are uncommon and the rubric scoring is transparent; revisions would mainly tighten the causal language and add basic statistics.

Referee Report

2 major / 2 minor

Summary. The paper reports results from an observational study of 90 independent agent runs, each building the same real-time retrospective board application from one detailed specification. Runs vary across model generations, two agent harnesses, two reasoning-effort levels (High vs. xHigh), presence/absence of a testing tool, and two design-oriented prompts. Outcomes are scored on a 14-criterion functional rubric (max 42 points) plus visual quality. The central claims are that model capability dominates scores, the testing tool increases cost 42-68% with no reliability or score benefit, raising reasoning effort from High to xHigh increases first-try perfect runs from 28% to 89% and reduces corrective prompts fivefold at 9-29% extra cost, and a design prompt improves visual quality (4.5 vs 3.0) without affecting function.

Significance. If the attribution of the 28%-to-89% lift and fivefold prompt reduction to reasoning effort (rather than model/harness imbalance) can be substantiated, the work supplies concrete, actionable empirical guidance for agentic code-generation configurations. The criterion-level breakdown (e.g., container-deployment failures) and the observation that a one-paragraph prompt paraphrase reproduces the visual-quality gain are useful for practitioners. The study also supplies a reproducible rubric and 90-run dataset that future controlled experiments can build upon.

major comments (2)

[Abstract] The headline result that 'Raising reasoning effort from High to xHigh lifted first try perfect runs from 28 percent to 89 percent' is presented as a causal effect of the effort variable. However, the 90 runs span 'several model generations' and 'two agent harnesses' with no mention of randomization, blocking, or pre-specified stratification on those factors. Without evidence that the xHigh subset was not disproportionately allocated stronger models or harnesses, the observed difference cannot be confidently attributed to reasoning effort rather than confounding.
[Abstract] The parallel claim that 'the testing tool raised cost by 42 to 68 percent without improving functional score or reliability' is likewise observational. The manuscript provides no table or section showing the marginal allocation of tool access independent of model/harness/effort combinations, so the same confounding risk applies to the 'no benefit' conclusion.
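One way to probe the confounding concern raised in both comments is to compare effort levels only within a fixed model and harness, rather than pooling across them. The sketch below assumes a hypothetical per-run record (`Run`) and uses placeholder rows; none of the values come from the paper's dataset.

```python
from collections import defaultdict
from typing import NamedTuple


class Run(NamedTuple):
    """Hypothetical per-run record; field names are assumptions."""
    model: str
    harness: str
    effort: str
    first_try_perfect: bool


def stratified_rates(runs: list[Run]) -> dict:
    """Return {(model, harness): {effort: (perfect_count, total_count)}}."""
    table = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for run in runs:
        cell = table[(run.model, run.harness)][run.effort]
        cell[0] += int(run.first_try_perfect)
        cell[1] += 1
    return {
        stratum: {effort: tuple(cell) for effort, cell in by_effort.items()}
        for stratum, by_effort in table.items()
    }


# Placeholder rows for illustration only.
runs = [
    Run("model-a", "harness-1", "High", False),
    Run("model-a", "harness-1", "High", True),
    Run("model-a", "harness-1", "xHigh", True),
    Run("model-a", "harness-1", "xHigh", True),
    Run("model-b", "harness-2", "High", False),
    Run("model-b", "harness-2", "xHigh", True),
]

rates = stratified_rates(runs)
for stratum, by_effort in rates.items():
    print(stratum, by_effort)
```

If the High-to-xHigh gap persists inside each stratum, model or harness imbalance becomes a less plausible explanation for the pooled difference.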

minor comments (2)

[Abstract] 'real time retrospective board' should be hyphenated as 'real-time'.
[Abstract] The phrase 'cut corrective prompts about five fold' should be 'fivefold' (one word) for standard usage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the risk of over-interpreting observational associations as causal effects. We agree that the abstract phrasing requires revision to better reflect the study's observational design and to supply readers with the allocation details needed to evaluate confounding. We address each major comment below.

Referee: [Abstract] The headline result that 'Raising reasoning effort from High to xHigh lifted first try perfect runs from 28 percent to 89 percent' is presented as a causal effect of the effort variable. However, the 90 runs span 'several model generations' and 'two agent harnesses' with no mention of randomization, blocking, or pre-specified stratification on those factors. Without evidence that the xHigh subset was not disproportionately allocated stronger models or harnesses, the observed difference cannot be confidently attributed to reasoning effort rather than confounding.

Authors: We accept the point. The manuscript is explicitly framed as an observational study (see title and methods), and the abstract already notes that runs 'spanned several model generations, two agent harnesses' etc. Nevertheless, the verb 'lifted' can be read as implying a controlled causal contrast. We will revise the abstract to replace 'lifted' with 'was associated with' (or equivalent) and will add a short table or paragraph in the results section that cross-tabulates reasoning-effort level against model generation and harness. This will let readers directly inspect the marginal distributions and judge the plausibility of confounding. The same revision will be applied to the parallel claim about corrective-prompt reduction.

revision: yes

Referee: [Abstract] The parallel claim that 'the testing tool raised cost by 42 to 68 percent without improving functional score or reliability' is likewise observational. The manuscript provides no table or section showing the marginal allocation of tool access independent of model/harness/effort combinations, so the same confounding risk applies to the 'no benefit' conclusion.

Authors: We agree. The testing-tool comparison is also observational and subject to the same allocation concerns. We will revise the abstract wording from 'raised cost … without improving' to a more neutral phrasing such as 'was associated with 42–68% higher cost and showed no reliable difference in functional score or first-try reliability.' In addition, we will include the requested marginal-allocation table (or expanded cross-tabulation) covering tool access alongside model, harness, and effort. This change will be made in the next manuscript version.

revision: yes
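The cross-tabulations promised in both responses can be built from per-run metadata with a few lines of standard-library code. The sketch below is one possible shape, not the authors' tooling: the `crosstab` helper, the field names, and the rows are all hypothetical placeholders.

```python
from collections import Counter


def crosstab(rows: list[dict], row_key: str, col_key: str) -> dict:
    """Count runs for every (row_key value, col_key value) pair, filling zeros."""
    counts = Counter((row[row_key], row[col_key]) for row in rows)
    row_labels = sorted({r for r, _ in counts})
    col_labels = sorted({c for _, c in counts})
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}


# Placeholder run metadata for illustration only.
rows = [
    {"effort": "High", "model": "model-a", "harness": "harness-1", "tool": False},
    {"effort": "High", "model": "model-b", "harness": "harness-1", "tool": True},
    {"effort": "xHigh", "model": "model-a", "harness": "harness-2", "tool": False},
    {"effort": "xHigh", "model": "model-a", "harness": "harness-1", "tool": True},
]

effort_by_model = crosstab(rows, "effort", "model")
tool_by_harness = crosstab(rows, "tool", "harness")

print(effort_by_model)
print(tool_by_harness)
```

Empty or lopsided cells in such a table are the direct signal a reader needs: they show where an effort or tool comparison rests on different models or harnesses.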

Circularity Check

0 steps flagged

No circularity: direct empirical counts from independent runs

The paper is an observational study reporting raw counts and percentages from 90 independent agent runs scored against a fixed external 14-criterion rubric. No equations, fitted parameters, predictions derived from inputs, or self-citations appear in the provided text. All claims (e.g., 28% to 89% lift in first-try perfect runs) are direct tallies of observed outcomes, not reductions by construction. The derivation chain is self-contained as empirical reporting with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the 14-criterion rubric as a proxy for functional quality and on the assumption that the single chosen application and 90 runs adequately represent broader agent behavior.

axioms (1)

Domain assumption: The fixed 14-criterion functional rubric accurately captures application correctness across all runs.
All functional scoring and reliability claims are derived from this rubric.
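Because every reliability claim passes through the rubric, it helps to see how a run total and a "first-try perfect" flag could be derived from it. The sketch below rests on two labeled assumptions: that the 14 criteria are weighted equally (42 / 14 = 3 points each), and that "first-try perfect" means a full score with zero corrective prompts. The source states neither explicitly.

```python
MAX_TOTAL = 42
N_CRITERIA = 14
# ASSUMPTION: equal weighting, so each criterion is worth 42 // 14 = 3 points.
POINTS_PER_CRITERION = MAX_TOTAL // N_CRITERIA


def total_score(criterion_scores: list[int]) -> int:
    """Sum per-criterion scores after checking rubric shape and range."""
    if len(criterion_scores) != N_CRITERIA:
        raise ValueError(f"expected {N_CRITERIA} criterion scores")
    if any(not 0 <= s <= POINTS_PER_CRITERION for s in criterion_scores):
        raise ValueError(f"each score must be in 0..{POINTS_PER_CRITERION}")
    return sum(criterion_scores)


def is_first_try_perfect(criterion_scores: list[int], corrective_prompts: int) -> bool:
    """ASSUMED definition: full marks and no corrective prompts were needed."""
    return total_score(criterion_scores) == MAX_TOTAL and corrective_prompts == 0


print(is_first_try_perfect([3] * 14, corrective_prompts=0))
print(is_first_try_perfect([3] * 13 + [2], corrective_prompts=0))
```

Any change to the rubric weights or to the definition of a perfect run would shift the headline percentages, which is why the ledger treats rubric validity as the one axiom.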

Reference graph

Works this paper leans on

5 extracted references · 2 canonical work pages

[1] von Elm E, Altman DG, Egger M, Pocock SJ, Gotzsche PC, Vandenbroucke JP. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. PLoS Med. 2007;4(10):e296. doi: 10.1371/journal.pmed.0040296

[2] Anthropic. Extended thinking. Claude Platform documentation. [cited 2026 Jun 29]. Available from: https://platform.claude.com/docs/en/build-with-claude/extended-thinking

[3] Google. Gemini 3 developer guide (Interactions API). Gemini API documentation. [cited 2026 Jun 29]. Available from: https://ai.google.dev/gemini-api/docs/interactions/gemini-3

[4] OWASP Foundation. OWASP Top 10 for Large Language Model Applications: LLM01 Prompt Injection. 2025 [cited 2026 Jun 27]. Available from: https://genai.owasp.org/llmrisk/llm01-prompt-injection/

[5] Mehta A. Realtime Retrospective Board: AI Model Benchmark Dataset and Evaluation Artifacts. Version 2.3.0. Zenodo; 2026 [cited 2026 Jul 2]. doi: 10.5281/zenodo.21134406. Appendix Table S1. Per-run data. Functional score (out of 42), session cost in US dollars, base-10 logarithm of cost, and visual quality rating (1 to 5) for all 90 runs. Run folder Model E...
