AutoRestTest at the SBFT 2026 Tool Competition

Alessandro Orso; Myeongsoo Kim; Saurabh Sinha; Tyler Stennett

arxiv: 2607.01063 · v1 · pith:PFXDAXU2new · submitted 2026-07-01 · 💻 cs.SE

Pith reviewed 2026-07-02 08:15 UTC · model grok-4.3

classification 💻 cs.SE

keywords REST API testing · black-box testing · dependency graph · reinforcement learning · large language models · fault detection · API input exploration

The pith

AutoRestTest combines a Semantic Property Dependency Graph with multi-agent reinforcement learning and large language models to explore REST API input spaces and ranks first in fault detection, efficiency, and effectiveness on 11 APIs with

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AutoRestTest as a black-box testing approach for REST APIs that must handle large input spaces and complex inter-operation dependencies. It builds a Semantic Property Dependency Graph to model relationships, then applies multi-agent reinforcement learning and large language models to guide input generation and exploration. Evaluated under a one-hour budget on 11 APIs containing 317 operations, the tool produced an average of 67.09 unique server errors and 17.27 successfully processed operations per API while placing first in all three competition categories. A sympathetic reader would care because improved automated discovery of server errors can reduce the manual effort required to make web services more reliable.

Core claim

AutoRestTest's integration of a Semantic Property Dependency Graph, multi-agent reinforcement learning, and large language models enables more effective exploration of API input spaces, resulting in the highest fault detection, overall efficiency, and overall effectiveness scores when tested on 11 APIs averaging 29 operations each under a fixed one-hour budget.
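The headline figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted on this page. The totals are back-computed from rounded per-API averages, so they are approximations, and reading "successfully processed operations" as a count of distinct operations is an assumption rather than something the abstract states.

```python
# Figures quoted in the abstract.
APIS = 11
OPERATIONS = 317
AVG_UNIQUE_SERVER_ERRORS = 67.09   # per API
AVG_PROCESSED_OPERATIONS = 17.27   # per API

# "Approximately 29 per API" is 317 / 11 rounded to the nearest integer.
ops_per_api = OPERATIONS / APIS

# Implied totals across the benchmark (approximate, from rounded averages).
total_errors = AVG_UNIQUE_SERVER_ERRORS * APIS
total_processed = AVG_PROCESSED_OPERATIONS * APIS

# Assumption: processed operations are distinct operations out of the 317.
processed_share = total_processed / OPERATIONS

print(f"operations per API: {ops_per_api:.2f}")
print(f"implied total unique server errors: {total_errors:.0f}")
print(f"implied total processed operations: {total_processed:.0f}")
print(f"implied share of operations processed: {processed_share:.1%}")
```

Under that assumption, roughly six in ten operations were processed successfully within the budget, which gives some context for the per-API averages.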

What carries the argument

The Semantic Property Dependency Graph, which captures inter-operation dependencies to direct the multi-agent reinforcement learning and large language model components in selecting and generating test inputs.
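The text reviewed here does not describe how the graph is built, so the following is only a minimal sketch of the general idea: link an operation's request parameter to another operation's response property when their names are similar. The operation table, the `similarity` helper, and the 0.8 threshold are all hypothetical, and plain string similarity from `difflib` stands in for whatever semantic similarity measure the tool actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical, hand-written operation descriptions. A real tool would
# derive these from the API's specification.
OPERATIONS = {
    "createUser": {"params": ["name", "email"], "responses": ["userId", "email"]},
    "getUser": {"params": ["userId"], "responses": ["name", "email"]},
    "createOrder": {"params": ["user_id", "itemId"], "responses": ["orderId"]},
}


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for a semantic measure."""
    def normalize(s: str) -> str:
        return s.replace("_", "").lower()
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def build_dependency_graph(operations, threshold=0.8):
    """Return edges (consumer, producer, parameter, property, score).

    An edge means the consumer's parameter might be filled with a value
    taken from the producer's response property.
    """
    edges = []
    for consumer, c_spec in operations.items():
        for producer, p_spec in operations.items():
            if consumer == producer:
                continue
            for param in c_spec["params"]:
                for prop in p_spec["responses"]:
                    score = similarity(param, prop)
                    if score >= threshold:
                        edges.append((consumer, producer, param, prop, round(score, 2)))
    return edges


for edge in build_dependency_graph(OPERATIONS):
    print(edge)
```

In this toy input, `getUser` and `createOrder` both end up depending on the `userId` returned by `createUser`, which is the kind of ordering information a tester needs before it can send a valid request to either operation.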

If this is right

The approach scales to APIs with roughly 29 operations each while maintaining high rates of unique error discovery within time limits.
Intelligent modeling of operation dependencies reduces wasted tests on invalid input combinations.
Integration of reinforcement learning with language models allows dynamic adaptation during the testing process.
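To make the idea of adaptation during testing concrete, here is a minimal sketch of one tabular, epsilon-greedy learner that picks among candidate inputs and updates its estimates from HTTP status codes. It is not the paper's multi-agent design: the single agent, the reward values in `reward_from_status`, the `FAKE_SERVER` table, and every hyperparameter are assumptions made for illustration.

```python
import random


class EpsilonGreedyAgent:
    """Single tabular agent: one value estimate per candidate action."""

    def __init__(self, actions, epsilon=0.1, alpha=0.5, seed=0):
        self.q = {action: 0.0 for action in actions}
        self.epsilon = epsilon  # exploration probability
        self.alpha = alpha      # learning rate
        self.rng = random.Random(seed)

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, action, reward):
        # Move the estimate a fraction alpha toward the observed reward.
        self.q[action] += self.alpha * (reward - self.q[action])


def reward_from_status(status):
    """Hypothetical reward shaping, not taken from the paper."""
    if status >= 500:
        return 1.0   # server error: the outcome a fault-finding tester seeks
    if 200 <= status < 300:
        return 0.5   # request was processed successfully
    return -0.5      # client error: the input was rejected


# Stand-in for a real API: maps each candidate input to a fixed status code.
FAKE_SERVER = {"valid": 200, "empty": 400, "overflow": 500}


def run(agent, steps=200):
    """Let the agent send `steps` simulated requests and learn from them."""
    for _ in range(steps):
        action = agent.choose()
        agent.update(action, reward_from_status(FAKE_SERVER[action]))
    return agent.q


print(run(EpsilonGreedyAgent(list(FAKE_SERVER))))
```

The exploration term matters in this sketch: with `epsilon=0.0` and all estimates starting at zero, the agent would keep choosing the first input that earned a positive reward and never try the one that triggers the server error.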

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same dependency-graph guidance could be adapted to test other kinds of service interfaces that share similar input and ordering constraints.
Shorter testing budgets might still yield useful results if the graph construction step is made faster.
Results on server-error counts suggest the technique could complement static analysis tools that focus on code-level faults.

Load-bearing premise

The 11 selected APIs and one-hour testing budget provide a representative measure of performance across real-world REST services.

What would settle it

Applying the same tool and competing approaches to a new collection of APIs with substantially different operation counts, dependency structures, or error profiles and checking whether the ranking across the three metrics holds.

Figures

Figures reproduced from arXiv: 2607.01063 by Alessandro Orso, Myeongsoo Kim, Saurabh Sinha, Tyler Stennett.

**Figure 1.** Overview of AutoRestTest [2].

Original abstract

Large input spaces and complex inter-operation dependencies make black-box REST API testing challenging. AutoRestTest combines a Semantic Property Dependency Graph, multi-agent reinforcement learning, and large language models to intelligently explore large API input spaces. In the SBFT 2026 REST League, AutoRestTest ranked first in all three evaluation categories -- fault detection, overall efficiency, and overall effectiveness -- on 11 APIs (317 operations, approximately 29 per API), averaging 67.09 unique server errors and 17.27 successfully processed operations per API under a one-hour testing budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AutoRestTest's first-place result in the SBFT 2026 REST competition is the core claim, but it stands or falls on whether those 11 APIs and the one-hour budget actually test the right things.

The paper's main contribution is a tool entry that combines a semantic property dependency graph with multi-agent reinforcement learning and LLMs, then reports first place across fault detection, efficiency, and effectiveness on the competition's 11 APIs. It gives concrete numbers: roughly 67 unique server errors and 17 successful operations per API under a one-hour limit. That is a clear empirical outcome from a standardized setting.

What works is the direct reporting of competition performance. Tool papers in this area often struggle to show head-to-head results, and the SBFT league supplies a fixed set of targets and metrics. The approach itself sounds like a reasonable way to handle operation dependencies that pure random or grammar-based methods miss.

The soft spot is the narrow test bed. The abstract gives no breakdown of how the 11 APIs were selected, what domains they cover, or how their dependency structures compare to other public REST corpora. A one-hour budget and three aggregate metrics can reward tools tuned to that exact setup without proving broader advantage. No error bars, no ablation on the individual components, and no external validation appear in the provided text.

This is useful for people who follow REST API testing competitions or need a current baseline for black-box tools. It is less useful for anyone trying to judge whether the method generalizes beyond the 2026 league. The work shows clear thinking about the problem and honest engagement with the competition format, so it deserves referee time even if the authors will need to add more on API selection and component contributions.

Referee Report

2 major / 0 minor

Summary. The paper presents AutoRestTest, a black-box REST API testing tool that integrates a Semantic Property Dependency Graph, multi-agent reinforcement learning, and large language models to handle large input spaces and inter-operation dependencies. It reports that the tool ranked first across all three evaluation categories (fault detection, overall efficiency, and overall effectiveness) in the SBFT 2026 REST League on a benchmark of 11 APIs (317 operations), achieving averages of 67.09 unique server errors and 17.27 successfully processed operations per API under a one-hour testing budget.

Significance. If the competition results hold, the first-place ranking provides a concrete empirical benchmark for hybrid semantic-RL-LLM approaches in REST API testing. The work contributes by demonstrating performance on a standardized set of 11 APIs with 317 operations, which can serve as a reference point for other tools in the SBFT competition series.

major comments (2)

The manuscript reports the competition ranking but provides no implementation details, configuration parameters, or pseudocode for the Semantic Property Dependency Graph construction, the multi-agent RL policy, or the LLM prompting strategy. This absence makes it impossible to assess whether the reported averages (67.09 errors and 17.27 operations per API) are reproducible or sensitive to implementation choices.
No statistical analysis, variance measures, or error bars are supplied for the per-API averages or the overall ranking. Given that the central claim is an empirical ranking on 11 APIs, the lack of any measure of result stability or sensitivity to the one-hour budget undermines confidence in the reported first-place outcome.
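As an illustration of the kind of stability measure the second comment asks for, the sketch below computes a percentile bootstrap confidence interval for a mean over 11 per-API counts. The counts are invented placeholders, not competition results, since no per-API data appears in the text reviewed here; the `bootstrap_ci` helper and its defaults are likewise assumptions.

```python
import random
import statistics

# PLACEHOLDER values, invented for illustration only. They are NOT the
# per-API results from the competition, which this page does not report.
PLACEHOLDER_ERRORS_PER_API = [5, 12, 20, 33, 41, 58, 64, 79, 90, 120, 150]


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


mean = statistics.fmean(PLACEHOLDER_ERRORS_PER_API)
low, high = bootstrap_ci(PLACEHOLDER_ERRORS_PER_API)
print(f"mean = {mean:.2f}, 95% bootstrap CI = [{low:.2f}, {high:.2f}]")
```

An interval like this describes spread across APIs within a single run. It does not capture run-to-run variance, which would require repeated executions of the tool under the same budget.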

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive suggestions. We address each major comment below. The manuscript is a short tool-competition paper, which imposes space limits, but we agree that additional details and analysis will strengthen it.

Referee: The manuscript reports the competition ranking but provides no implementation details, configuration parameters, or pseudocode for the Semantic Property Dependency Graph construction, the multi-agent RL policy, or the LLM prompting strategy. This absence makes it impossible to assess whether the reported averages (67.09 errors and 17.27 operations per API) are reproducible or sensitive to implementation choices.

Authors: We agree that the current version lacks these details due to the page constraints typical of competition papers. In the revised manuscript we will add a dedicated subsection containing pseudocode for Semantic Property Dependency Graph construction, the key hyperparameters and architecture of the multi-agent RL policy, and the exact LLM prompting templates and strategies used. This will allow readers to assess reproducibility.

Revision: yes

Referee: No statistical analysis, variance measures, or error bars are supplied for the per-API averages or the overall ranking. Given that the central claim is an empirical ranking on 11 APIs, the lack of any measure of result stability or sensitivity to the one-hour budget undermines confidence in the reported first-place outcome.

Authors: The reported figures are the official single-run results from the SBFT 2026 competition under a fixed one-hour budget; the competition protocol does not supply per-run variance data. We will nevertheless add a paragraph discussing result stability across the 11 APIs and the standardized budget, and we will include any available per-API spread if the competition organizers release additional data. If variance cannot be computed, we will explicitly state this limitation.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical competition ranking only

The paper is a competition report whose sole load-bearing claim is an empirical ranking (first place across three metrics on 11 APIs under a one-hour budget). No equations, parameter fits, uniqueness theorems, ansatzes, or derivations are present; the result is a direct measurement from the SBFT 2026 league and cannot reduce to its own inputs by construction. Self-citations, if any, are irrelevant because no theoretical premise depends on them.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the central claim rests on the assumption that the competition benchmark is valid.

axioms (1)

Domain assumption: the SBFT 2026 REST League evaluation setup (11 APIs, one-hour budget, three metrics) is a fair and representative test of black-box REST API testing tools.
The paper's claim of superiority is grounded entirely in competition outcomes.

Reference graph

Works this paper leans on

10 extracted references · 2 canonical work pages · 1 internal anchor

[1] EvoMaster: Evolutionary multi-context automated system test generation. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), 2018.

[2] (Title and authors missing from the extracted entry.) September 2025.

[3] AutoRestTest: A tool for automated REST API testing using LLMs and MARL. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), 2025.

[4] GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[5] Pasqua, Michele; Corradini, Davide; Mari, Sofia; Ceccato, Mariano. (Title and venue missing from the extracted entry.) 2026.

[6] Sutton, Richard S.; Barto, Andrew G. (Title missing from the extracted entry.) 2018.

[7] Gemini 2.5 Flash-Lite: Model Card. September 2025.

[8] Instruction-Following Evaluation for Large Language Models. arXiv preprint arXiv:2311.07911.

[9] A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs. arXiv preprint arXiv:2411.07098.

[10] A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025.
