Reasoning effort, not tool access, buys first-try reliability in agentic code generation: an observational study
Pith reviewed 2026-07-03 08:29 UTC · model grok-4.3
The pith
Raising reasoning effort lifts first-try perfect agentic code runs from 28 to 89 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Capability tier dominated, with frontier models near ceiling scores and a low-cost local model at 24 to 37 points. Container deployment failed first-try in 44 percent of runs. Raising reasoning effort from High to xHigh increased first-try perfect runs from 28 percent to 89 percent and cut corrective prompts about fivefold for 9 to 29 percent more cost. The testing tool raised cost by 42 to 68 percent without raising functional score or reliability even on interface-visible criteria. A design-oriented prompt raised visual quality from 3.0 to 4.5 on a five-point scale without lifting function, and a one-paragraph paraphrase reproduced the lift.
What carries the argument
The set of ninety independent runs that held the target application and specification fixed while varying reasoning effort and tool access, scored on a fixed 14-criterion rubric.
Load-bearing premise
Observed differences across the runs can be attributed to the manipulated variables of reasoning effort and tool access rather than uncontrolled differences in agent harness behavior or model internals.
What would settle it
Repeating the ninety runs while randomizing harness behavior or model sampling parameters but keeping reasoning effort and tool access fixed, then checking whether the 28-to-89 percent gap in perfect runs disappears.
Original abstract
Agentic coding assistants are increasingly given extra capabilities, such as browser based testing tools and design oriented system prompts, on the assumption that more capability yields better software. This study tested that assumption directly. Ninety independent agent runs built the same application, a real time retrospective board, from one detailed specification, each scored on a fixed 14 criterion functional rubric (42 point maximum) and a visual quality review. The runs spanned several model generations, two agent harnesses, two reasoning effort levels, a testing tool, and two design oriented prompts. Capability tier dominated: frontier models clustered near the ceiling while a low cost local model fell to 24 to 37 points. A criterion level analysis revealed what run totals conceal. Container deployment was the dominant defect, failing first try in 44 percent of runs, with its failure rate shifting sharply across model generations while mean totals moved less than a point. The testing tool raised cost by 42 to 68 percent without improving functional score or reliability, even on interface visible criteria. Raising reasoning effort from High to xHigh lifted first try perfect runs from 28 percent to 89 percent and cut corrective prompts about five fold, for 9 to 29 percent more cost. A design oriented prompt raised visual quality, 4.5 versus 3.0 on a 5 point scale, without lifting function, and a one paragraph paraphrase of its directive reproduced the entire lift. The practical lesson is to match the fix to the failure: most first run failures came from weak reasoning, which a stronger model or more effort prevents, not from visible flaws a checking tool would catch.
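The headline metrics in the abstract can be tallied from per-run records along the following lines. This is a minimal sketch, not the authors' analysis code: the field names and the sample rows are hypothetical, and it assumes that a "first-try perfect" run means the full 42 points with no corrective prompts, a definition the abstract does not spell out.

```python
from dataclasses import dataclass

MAX_SCORE = 42  # 14-criterion rubric, 42-point maximum (stated in the abstract)


@dataclass
class Run:
    effort: str              # "High" or "xHigh"
    score: int               # functional score out of 42
    corrective_prompts: int  # follow-up prompts needed after the first attempt


def is_first_try_perfect(run: Run) -> bool:
    # Assumed definition: full marks with zero corrective prompts.
    return run.score == MAX_SCORE and run.corrective_prompts == 0


def summarize(runs: list[Run]) -> dict[str, tuple[float, float]]:
    """Map effort level -> (first-try perfect rate, mean corrective prompts)."""
    out = {}
    for effort in sorted({r.effort for r in runs}):
        group = [r for r in runs if r.effort == effort]
        rate = sum(is_first_try_perfect(r) for r in group) / len(group)
        mean_prompts = sum(r.corrective_prompts for r in group) / len(group)
        out[effort] = (rate, mean_prompts)
    return out


# Hypothetical rows for illustration only; these are not data from the study.
sample = [
    Run("High", 42, 0), Run("High", 40, 2), Run("High", 42, 1), Run("High", 39, 3),
    Run("xHigh", 42, 0), Run("xHigh", 42, 0), Run("xHigh", 41, 1), Run("xHigh", 42, 0),
]

for effort, (rate, prompts) in summarize(sample).items():
    print(f"{effort}: first-try perfect {rate:.0%}, mean corrective prompts {prompts:.2f}")
```

Keeping the definition in a single predicate makes it easy to swap in whatever definition the paper actually uses without touching the tally.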
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports results from an observational study of 90 independent agent runs, each building the same real-time retrospective board application from one detailed specification. Runs vary across model generations, two agent harnesses, two reasoning-effort levels (High vs. xHigh), presence or absence of a testing tool, and two design-oriented prompts. Outcomes are scored on a 14-criterion functional rubric (max 42 points) plus a visual-quality review. The central claims are that model capability dominates scores; that the testing tool increases cost by 42-68% with no reliability or score benefit; that raising reasoning effort from High to xHigh increases first-try perfect runs from 28% to 89% and reduces corrective prompts about fivefold at 9-29% extra cost; and that a design prompt improves visual quality (4.5 vs. 3.0 on a 5-point scale) without affecting function.
Significance. If the attribution of the 28%-to-89% lift and fivefold prompt reduction to reasoning effort (rather than model/harness imbalance) can be substantiated, the work supplies concrete, actionable empirical guidance for agentic code-generation configurations. The criterion-level breakdown (e.g., container-deployment failures) and the observation that a one-paragraph prompt paraphrase reproduces the visual-quality gain are useful for practitioners. The study also supplies a reproducible rubric and 90-run dataset that future controlled experiments can build upon.
major comments (2)
- [Abstract] The headline result that 'Raising reasoning effort from High to xHigh lifted first try perfect runs from 28 percent to 89 percent' is presented as a causal effect of the effort variable. However, the 90 runs span 'several model generations' and 'two agent harnesses' with no mention of randomization, blocking, or pre-specified stratification on those factors. Without evidence that the xHigh subset was not disproportionately allocated stronger models or harnesses, the observed difference cannot be confidently attributed to reasoning effort rather than confounding.
- [Abstract] The parallel claim that 'the testing tool raised cost by 42 to 68 percent without improving functional score or reliability' is likewise observational. The manuscript provides no table or section showing the marginal allocation of tool access independent of model/harness/effort combinations, so the same confounding risk applies to the 'no benefit' conclusion.
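One way to probe the confounding concern without new runs is to compare High and xHigh only within strata that share the same model and harness, so that a lopsided allocation cannot by itself produce the gap. The sketch below is an illustration of that idea in Python, not a method taken from the paper; the record fields (`model`, `harness`, `effort`, `perfect`) and the example rows are hypothetical placeholders.

```python
from collections import defaultdict


def stratified_gap(runs):
    """Return {stratum: xHigh rate - High rate} for strata that contain both effort levels."""
    # counts[stratum][effort] = [perfect_runs, total_runs]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in runs:
        cell = counts[(r["model"], r["harness"])][r["effort"]]
        cell[0] += int(r["perfect"])
        cell[1] += 1

    gaps = {}
    for stratum, by_effort in counts.items():
        # Strata with only one effort level carry no within-stratum contrast.
        if "High" in by_effort and "xHigh" in by_effort:
            hi_perfect, hi_total = by_effort["High"]
            xh_perfect, xh_total = by_effort["xHigh"]
            gaps[stratum] = xh_perfect / xh_total - hi_perfect / hi_total
    return gaps


# Hypothetical rows for illustration only; these are not data from the study.
runs = [
    {"model": "A", "harness": "h1", "effort": "High", "perfect": False},
    {"model": "A", "harness": "h1", "effort": "High", "perfect": True},
    {"model": "A", "harness": "h1", "effort": "xHigh", "perfect": True},
    {"model": "A", "harness": "h1", "effort": "xHigh", "perfect": True},
    {"model": "B", "harness": "h2", "effort": "xHigh", "perfect": True},
]

print(stratified_gap(runs))
```

If the gap persists inside most strata, allocation imbalance is a less plausible explanation; if it appears only in the pooled totals, the referee's concern stands. With few runs per stratum the per-stratum differences will be noisy, so they are best read as a sanity check rather than an estimate.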
minor comments (2)
- [Abstract] 'real time retrospective board' should be hyphenated as 'real-time'.
- [Abstract] The phrase 'cut corrective prompts about five fold' should use 'fivefold' (one word), the standard form.
Simulated Author's Rebuttal
We thank the referee for highlighting the risk of over-interpreting observational associations as causal effects. We agree that the abstract phrasing requires revision to better reflect the study's observational design and to supply readers with the allocation details needed to evaluate confounding. We address each major comment below.
Point-by-point responses
- Referee: [Abstract] The headline result that 'Raising reasoning effort from High to xHigh lifted first try perfect runs from 28 percent to 89 percent' is presented as a causal effect of the effort variable. However, the 90 runs span 'several model generations' and 'two agent harnesses' with no mention of randomization, blocking, or pre-specified stratification on those factors. Without evidence that the xHigh subset was not disproportionately allocated stronger models or harnesses, the observed difference cannot be confidently attributed to reasoning effort rather than confounding.
Authors: We accept the point. The manuscript is explicitly framed as an observational study (see title and methods), and the abstract already notes that runs 'spanned several model generations, two agent harnesses' etc. Nevertheless, the verb 'lifted' can be read as implying a controlled causal contrast. We will revise the abstract to replace 'lifted' with 'was associated with' (or equivalent) and will add a short table or paragraph in the results section that cross-tabulates reasoning-effort level against model generation and harness. This will let readers directly inspect the marginal distributions and judge the plausibility of confounding. The same revision will be applied to the parallel claim about corrective-prompt reduction.
Revision: yes
- Referee: [Abstract] The parallel claim that 'the testing tool raised cost by 42 to 68 percent without improving functional score or reliability' is likewise observational. The manuscript provides no table or section showing the marginal allocation of tool access independent of model/harness/effort combinations, so the same confounding risk applies to the 'no benefit' conclusion.
Authors: We agree. The testing-tool comparison is also observational and subject to the same allocation concerns. We will revise the abstract wording from 'raised cost … without improving' to a more neutral phrasing such as 'was associated with 42–68% higher cost and showed no reliable difference in functional score or first-try reliability.' In addition, we will include the requested marginal-allocation table (or expanded cross-tabulation) covering tool access alongside model, harness, and effort. This change will be made in the next manuscript version.
Revision: yes
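The cross-tabulation the authors promise in both responses could be produced along these lines. This is a hedged sketch using only the Python standard library; the keys (`effort`, `model`, `tool`) and the rows are hypothetical stand-ins for the per-run dataset, not values from it.

```python
from collections import Counter


def crosstab(runs, row_key, col_key):
    """Count runs per (row, col) cell and return a nested dict, with empty cells as 0."""
    cells = Counter((r[row_key], r[col_key]) for r in runs)
    rows = sorted({row for row, _ in cells})
    cols = sorted({col for _, col in cells})
    return {row: {col: cells.get((row, col), 0) for col in cols} for row in rows}


# Hypothetical rows for illustration only; these are not data from the study.
runs = [
    {"effort": "High", "model": "gen-1", "tool": False},
    {"effort": "High", "model": "gen-2", "tool": True},
    {"effort": "xHigh", "model": "gen-2", "tool": False},
    {"effort": "xHigh", "model": "gen-2", "tool": True},
]

table = crosstab(runs, "effort", "model")
for effort, row in table.items():
    print(effort, row)
```

Filling empty cells with explicit zeros is deliberate: a zero in, say, the xHigh row for an older model generation is exactly the allocation imbalance the referee asks readers to be able to see. The same function covers the testing-tool question by calling `crosstab(runs, "tool", "model")`.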
Circularity Check
No circularity: direct empirical counts from independent runs
The paper is an observational study reporting raw counts and percentages from 90 independent agent runs scored against a fixed external 14-criterion rubric. No equations, fitted parameters, predictions derived from inputs, or self-citations appear in the provided text. All claims (e.g., 28% to 89% lift in first-try perfect runs) are direct tallies of observed outcomes, not reductions by construction. The derivation chain is self-contained as empirical reporting with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The fixed 14-criterion functional rubric accurately captures application correctness across all runs.
Reference graph
Works this paper leans on
- [1] von Elm E, Altman DG, Egger M, Pocock SJ, Gotzsche PC, Vandenbroucke JP. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. PLoS Med. 2007;4(10):e296. doi: 10.1371/journal.pmed.0040296
- [2] Anthropic. Extended thinking. Claude Platform documentation. 2026 [cited 2026 Jun 29]. Available from: https://platform.claude.com/docs/en/build-with-claude/extended-thinking
- [3] Google. Gemini 3 developer guide (Interactions API). Gemini API documentation. 2026 [cited 2026 Jun 29]. Available from: https://ai.google.dev/gemini-api/docs/interactions/gemini-3
- [4] OWASP Foundation. OWASP Top 10 for Large Language Model Applications: LLM01 Prompt Injection. 2025 [cited 2026 Jun 27]. Available from: https://genai.owasp.org/llmrisk/llm01-prompt-injection/
- [5] Mehta A. Realtime Retrospective Board: AI Model Benchmark Dataset and Evaluation Artifacts. Version 2.3.0. Zenodo; 2026 [cited 2026 Jul 2]. doi: 10.5281/zenodo.21134406
Appendix Table S1. Per-run data. Functional score (out of 42), session cost in US dollars, base-10 logarithm of cost, and visual quality rating (1 to 5) for all 90 runs. Run folder Model E...