AutoRestTest at the SBFT 2026 Tool Competition
Pith reviewed 2026-07-02 08:15 UTC · model grok-4.3
The pith
AutoRestTest combines a Semantic Property Dependency Graph with multi-agent reinforcement learning and large language models to explore REST API input spaces, and it ranks first in fault detection, efficiency, and effectiveness on 11 APIs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AutoRestTest's integration of a Semantic Property Dependency Graph, multi-agent reinforcement learning, and large language models enables more effective exploration of API input spaces, resulting in the highest fault detection, overall efficiency, and overall effectiveness scores when tested on 11 APIs averaging 29 operations each under a fixed one-hour budget.
What carries the argument
The Semantic Property Dependency Graph, which captures inter-operation dependencies to direct the multi-agent reinforcement learning and large language model components in selecting and generating test inputs.
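The paper as summarized here gives no construction details for the graph, so the following is only a minimal sketch of the general idea: link a parameter of one operation to a response field of another when their names look alike. Everything in it is an assumption made for illustration, including the toy `OPERATIONS` table, the `build_dependency_graph` helper, the 0.5 threshold, and the use of Jaccard token overlap as a stand-in for whatever semantic similarity the tool actually computes.

```python
import re
from itertools import permutations


def tokens(name: str) -> set[str]:
    """Split a camelCase or snake_case identifier into lowercase word tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return {t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t}


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-overlap similarity; a simple stand-in for a semantic measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_dependency_graph(operations: dict, threshold: float = 0.5) -> list[tuple]:
    """Return edges (consumer, parameter, producer, field, score).

    An edge suggests that `parameter` of `consumer` might be filled with the
    value of `field` returned by `producer`.
    """
    edges = []
    for consumer, producer in permutations(operations, 2):
        for param in operations[consumer]["params"]:
            for field in operations[producer]["responses"]:
                score = jaccard(tokens(param), tokens(field))
                if score >= threshold:
                    edges.append((consumer, param, producer, field, score))
    return edges


# Hypothetical toy API, not one of the 11 benchmark services.
OPERATIONS = {
    "createUser": {"params": ["userName"], "responses": ["userId"]},
    "getUser": {"params": ["userId"], "responses": ["userName", "email"]},
    "deleteUser": {"params": ["user_id"], "responses": []},
}

edges = build_dependency_graph(OPERATIONS)
for edge in edges:
    print(edge)
```

In this toy case `getUser.userId` and `deleteUser.user_id` are both linked to the `userId` field returned by `createUser`, which is the kind of ordering hint a test generator could use to call the producer first.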
If this is right
- The approach scales to APIs with roughly 29 operations each while maintaining high rates of unique error discovery within time limits.
- Intelligent modeling of operation dependencies reduces wasted tests on invalid input combinations.
- Integration of reinforcement learning with language models allows dynamic adaptation during the testing process.
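To make "dynamic adaptation" concrete, here is a deliberately small sketch of a value-based agent that learns which operation to call next from the HTTP status codes it observes. It is a single epsilon-greedy agent with a stateless value update, far simpler than the multi-agent setup the paper names, and it is not taken from the tool. The `OperationAgent` class, the `reward_for` mapping, and all numeric settings are assumptions chosen for illustration.

```python
import random


def reward_for(status: int) -> float:
    """Hypothetical reward shaping: favour server errors, then successes."""
    if 500 <= status < 600:
        return 1.0   # server error: a potential fault
    if 200 <= status < 300:
        return 0.5   # request was processed successfully
    return -0.1      # client error or other: likely an invalid request


class OperationAgent:
    """Epsilon-greedy selection over operations with incremental value updates."""

    def __init__(self, operations, alpha=0.1, epsilon=0.1, seed=0):
        self.q = {op: 0.0 for op in operations}
        self.alpha = alpha        # learning rate
        self.epsilon = epsilon    # exploration probability
        self.rng = random.Random(seed)

    def select(self) -> str:
        """Explore with probability epsilon, otherwise pick the best-valued operation."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, op: str, reward: float) -> None:
        """Move the estimate for `op` a fraction alpha toward the observed reward."""
        self.q[op] += self.alpha * (reward - self.q[op])


agent = OperationAgent(["createUser", "getUser"], alpha=0.5, epsilon=0.0)
agent.update("getUser", reward_for(500))
print(agent.select(), agent.q)
```

With exploration switched off (`epsilon=0.0`), a single observed server error on `getUser` is enough to make the agent prefer it on the next step; a real tester would keep some exploration so that other operations are still exercised.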
Where Pith is reading between the lines
- The same dependency-graph guidance could be adapted to test other kinds of service interfaces that share similar input and ordering constraints.
- Shorter testing budgets might still yield useful results if the graph construction step is made faster.
- Results on server-error counts suggest the technique could complement static analysis tools that focus on code-level faults.
Load-bearing premise
The 11 selected APIs and one-hour testing budget provide a representative measure of performance across real-world REST services.
What would settle it
Applying the same tool and competing approaches to a new collection of APIs with substantially different operation counts, dependency structures, or error profiles and checking whether the ranking across the three metrics holds.
Original abstract
Large input spaces and complex inter-operation dependencies make black-box REST API testing challenging. AutoRestTest combines a Semantic Property Dependency Graph, multi-agent reinforcement learning, and large language models to intelligently explore large API input spaces. In the SBFT 2026 REST League, AutoRestTest ranked first in all three evaluation categories -- fault detection, overall efficiency, and overall effectiveness -- on 11 APIs (317 operations, approximately 29 per API), averaging 67.09 unique server errors and 17.27 successfully processed operations per API under a one-hour testing budget.
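The figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the implied totals (738 server errors and 190 processed operations across all APIs) are inferred by multiplying the averages by 11 and are not stated in the source.

```python
NUM_APIS = 11
TOTAL_OPERATIONS = 317
AVG_ERRORS_PER_API = 67.09
AVG_PROCESSED_PER_API = 17.27

# 317 operations over 11 APIs is about 28.82, i.e. "approximately 29 per API".
ops_per_api = TOTAL_OPERATIONS / NUM_APIS

# Inferred totals: the nearest whole counts consistent with the reported averages.
implied_total_errors = round(AVG_ERRORS_PER_API * NUM_APIS)
implied_total_processed = round(AVG_PROCESSED_PER_API * NUM_APIS)

# Aggregate share of operations processed successfully (inferred, not reported).
processed_share = implied_total_processed / TOTAL_OPERATIONS

print(f"operations per API:       {ops_per_api:.2f}")
print(f"implied total errors:     {implied_total_errors}")
print(f"implied total processed:  {implied_total_processed}")
print(f"aggregate processed share: {processed_share:.1%}")
```

Dividing the inferred totals back by 11 reproduces the reported 67.09 and 17.27, so the averages are internally consistent. The aggregate share of roughly 60% is a ratio of totals, which need not equal the mean of per-API ratios.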
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents AutoRestTest, a black-box REST API testing tool that integrates a Semantic Property Dependency Graph, multi-agent reinforcement learning, and large language models to handle large input spaces and inter-operation dependencies. It reports that the tool ranked first across all three evaluation categories (fault detection, overall efficiency, and overall effectiveness) in the SBFT 2026 REST League on a benchmark of 11 APIs (317 operations), achieving averages of 67.09 unique server errors and 17.27 successfully processed operations per API under a one-hour testing budget.
Significance. If the competition results hold, the first-place ranking provides a concrete empirical benchmark for hybrid semantic-RL-LLM approaches in REST API testing. The work contributes by demonstrating performance on a standardized set of 11 APIs with 317 operations, which can serve as a reference point for other tools in the SBFT competition series.
major comments (2)
- The manuscript reports the competition ranking but provides no implementation details, configuration parameters, or pseudocode for the Semantic Property Dependency Graph construction, the multi-agent RL policy, or the LLM prompting strategy. This absence makes it impossible to assess whether the reported averages (67.09 errors and 17.27 operations per API) are reproducible or sensitive to implementation choices.
- No statistical analysis, variance measures, or error bars are supplied for the per-API averages or the overall ranking. Given that the central claim is an empirical ranking on 11 APIs, the lack of any measure of result stability or sensitivity to the one-hour budget undermines confidence in the reported first-place outcome.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive suggestions. We address each major comment below. The manuscript is a short tool-competition paper, which imposes space limits, but we agree that additional details and analysis will strengthen it.
- Referee: The manuscript reports the competition ranking but provides no implementation details, configuration parameters, or pseudocode for the Semantic Property Dependency Graph construction, the multi-agent RL policy, or the LLM prompting strategy. This absence makes it impossible to assess whether the reported averages (67.09 errors and 17.27 operations per API) are reproducible or sensitive to implementation choices.
  Authors: We agree that the current version lacks these details due to the page constraints typical of competition papers. In the revised manuscript we will add a dedicated subsection containing pseudocode for Semantic Property Dependency Graph construction, the key hyperparameters and architecture of the multi-agent RL policy, and the exact LLM prompting templates and strategies used. This will allow readers to assess reproducibility.
  Revision: yes
- Referee: No statistical analysis, variance measures, or error bars are supplied for the per-API averages or the overall ranking. Given that the central claim is an empirical ranking on 11 APIs, the lack of any measure of result stability or sensitivity to the one-hour budget undermines confidence in the reported first-place outcome.
  Authors: The reported figures are the official single-run results from the SBFT 2026 competition under a fixed one-hour budget; the competition protocol does not supply per-run variance data. We will nevertheless add a paragraph discussing result stability across the 11 APIs and the standardized budget, and we will include any available per-API spread if the competition organizers release additional data. If variance cannot be computed, we will explicitly state this limitation.
  Revision: partial
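If per-API counts were released, the spread the referee asks for could be summarized along the following lines. This is a sketch under stated assumptions: the `summarize` helper is hypothetical, the percentile bootstrap is one common choice rather than anything the competition prescribes, and the sample values are placeholders, not competition data.

```python
import random
import statistics


def summarize(values, n_boot=2000, level=0.95, seed=0):
    """Return the mean, sample standard deviation, and a percentile-bootstrap
    confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = min(n_boot - 1, int((1 + level) / 2 * n_boot) - 1)
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "ci": (boot_means[lo_idx], boot_means[hi_idx]),
    }


# PLACEHOLDER values for illustration only; not results from any tool or API.
placeholder_counts = [10, 20, 30, 40, 50]
summary = summarize(placeholder_counts)
print(summary)
```

With only 11 APIs and a single run each, an interval of this kind describes variation across APIs, not run-to-run variance; the latter would need repeated runs, which the authors note the competition protocol does not provide.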
Circularity Check
No circularity: empirical competition ranking only
The paper is a competition report whose sole load-bearing claim is an empirical ranking (first place across three metrics on 11 APIs under a one-hour budget). No equations, parameter fits, uniqueness theorems, ansatzes, or derivations are present; the result is a direct measurement from the SBFT 2026 league and cannot reduce to its own inputs by construction. Self-citations, if any, are irrelevant because no theoretical premise depends on them.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The SBFT 2026 REST League evaluation setup (11 APIs, one-hour budget, three metrics) is a fair and representative test of black-box REST API testing tools.