A Model-based Testing Technique for Amazon Lex Task-based Chatbots
Pith reviewed 2026-07-02 08:09 UTC · model grok-4.3
The pith
LexTester builds a Dialog Graph of Amazon Lex chatbot conversations to generate test suites that outperform Botium in coverage and fault detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LexTester explores the conversational space of the chatbot under test to generate a Dialog Graph of all possible interactions, from which an executable test suite is generated according to different coverage strategies. When compared to Botium on five Amazon Lex chatbots, LexTester produced more tests of nearly double complexity, achieved 83-95 percent coverage of conversational elements, and improved fault detection effectiveness by up to four times at comparable time costs.
What carries the argument
The Dialog Graph, which records every reachable state and transition in the chatbot's conversational space and serves as the source for coverage-driven test generation.
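As a rough illustration of this idea (not LexTester's actual implementation, which the abstract does not describe), a Dialog Graph can be sketched as a directed graph whose nodes are conversational states and whose edges are labeled with user utterances. The state names, the utterances, and the helpers `shortest_prefix` and `edge_coverage_paths` are all hypothetical. The sketch derives one test path per transition, so every reachable edge is exercised at least once; the resulting suite is deliberately simple and not minimal.

```python
from collections import deque

# Hypothetical Dialog Graph: state -> list of (user utterance, next state).
DIALOG_GRAPH = {
    "start": [("book a hotel", "ask_city"), ("order flowers", "ask_flower")],
    "ask_city": [("Rome", "ask_nights")],
    "ask_nights": [("2", "confirm")],
    "ask_flower": [("roses", "confirm")],
    "confirm": [("yes", "done"), ("no", "start")],
    "done": [],
}


def shortest_prefix(graph, root, target):
    """Breadth-first search for the shortest edge sequence from root to target."""
    queue = deque([(root, [])])
    seen = {root}
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path
        for utterance, nxt in graph[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(state, utterance, nxt)]))
    return None  # target is unreachable from root


def edge_coverage_paths(graph, root="start"):
    """Return test paths that together traverse every reachable edge."""
    tests = []
    covered = set()
    for state, edges in graph.items():
        for utterance, nxt in edges:
            edge = (state, utterance, nxt)
            if edge in covered:
                continue
            prefix = shortest_prefix(graph, root, state)
            if prefix is None:
                continue  # skip edges that leave unreachable states
            path = prefix + [edge]
            covered.update(path)
            tests.append(path)
    return tests


tests = edge_coverage_paths(DIALOG_GRAPH)
all_edges = {(s, u, n) for s, es in DIALOG_GRAPH.items() for u, n in es}
covered = {edge for path in tests for edge in path}
```

A different coverage strategy (for example, covering each state once instead of each edge) would reuse the same graph and change only the selection loop, which is where the trade-off between suite size and thoroughness would show up.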
If this is right
- Test suites generated from the Dialog Graph cover more conversational elements than suites produced by manual or simpler automated approaches.
- Higher fault detection rates become available without increasing testing time beyond current tool levels.
- Different coverage strategies on the same graph allow testers to trade off suite size against thoroughness.
- The same exploration process can be reapplied after chatbot updates to refresh the test suite automatically.
Where Pith is reading between the lines
- The Dialog Graph could serve as input for analyses beyond testing, such as identifying unreachable intents or measuring conversation complexity.
- Similar graph-construction techniques might apply to chatbots built on other platforms if their interaction models can be queried in the same way.
- Integrating the exploration step into continuous integration would let teams receive updated test suites after every chatbot change.
Load-bearing premise
Automatically exploring the chatbot produces a complete and accurate Dialog Graph that includes every relevant interaction without missing states or transitions.
What would settle it
A manual enumeration of all conversation paths in one of the five evaluated chatbots that finds a path absent from the Dialog Graph produced by LexTester.
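The check described above can be sketched in a few lines, assuming the explored graph is available as a mapping from states to labeled transitions. The graph contents, the manually enumerated path, and the helper `missing_transitions` are hypothetical and serve only to show the shape of the comparison.

```python
# Hypothetical explored graph: state -> set of (user utterance, next state).
EXPLORED = {
    "start": {("book a hotel", "ask_city")},
    "ask_city": {("Rome", "ask_nights")},
    "ask_nights": {("2", "confirm")},
    "confirm": {("yes", "done")},
}


def missing_transitions(graph, path):
    """Return the steps of a manually enumerated path that the graph lacks."""
    return [
        (state, utterance, nxt)
        for state, utterance, nxt in path
        if (utterance, nxt) not in graph.get(state, set())
    ]


# Hypothetical path written down by a human tester.
MANUAL_PATH = [
    ("start", "book a hotel", "ask_city"),
    ("ask_city", "Rome", "ask_nights"),
    ("ask_nights", "cancel", "start"),  # step absent from the explored graph
]

gaps = missing_transitions(EXPLORED, MANUAL_PATH)
```

Any non-empty result would be a counterexample to the completeness premise for that chatbot.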
Original abstract
Task-based chatbots are nowadays widely adopted software systems, usually integrated into real-world applications and communication channels, designed to assist users in completing tasks through conversational interfaces. Like any other software, even chatbots are prone to bugs. Despite their increasing pervasiveness in everyday activities, existing techniques for assessing their quality still exhibit several limitations, such as the simplicity of generated test scenarios and oracle weaknesses. In this paper, we present LexTester, an automated model-based testing technique for Amazon Lex chatbots. The technique explores the conversational space of the chatbot under test to generate a Dialog Graph of all possible interactions, from which an executable test suite is generated according to different coverage strategies. LexTester was evaluated against the state-of-the-practice chatbot testing tool Botium on five Amazon Lex chatbots, consistently outperforming it in all subjects, generating more tests with nearly double complexity, achieving overall 83-95% coverage of conversational elements, and improving fault detection effectiveness by up to four times at comparable time costs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents LexTester, a model-based testing technique for Amazon Lex task-based chatbots. It explores the conversational space to construct a Dialog Graph of all possible interactions and generates executable test suites according to coverage strategies. On five Amazon Lex chatbots, LexTester is reported to outperform Botium by producing more tests of nearly double complexity, achieving 83-95% coverage of conversational elements, and improving fault detection by up to 4x at comparable time costs.
Significance. If the Dialog Graph is verifiably complete and the evaluation details are provided, the technique could offer a practical advance over existing chatbot testing tools by addressing limitations in scenario simplicity and oracle strength. The quantitative comparison on multiple subjects and metrics (coverage, complexity, fault detection) would be a useful contribution to model-based testing for conversational systems.
major comments (2)
- [Abstract] Abstract (paragraph 2): The central evaluation claims (83-95% coverage, up to 4x fault detection improvement, consistent outperformance of Botium) depend on the exploration step producing a complete and accurate Dialog Graph. No ground-truth comparison, manual inspection, soundness argument, or discussion of potential missed states/transitions (due to context, untriggered intents, or exploration limits) is provided. This is load-bearing for all reported results.
- [Abstract] Abstract: No details are supplied on fault seeding method, oracle definition, subject selection criteria for the five Amazon Lex chatbots, or statistical significance of the results. Without these, the reliability of the performance claims cannot be assessed.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] Abstract (paragraph 2): The central evaluation claims (83-95% coverage, up to 4x fault detection improvement, consistent outperformance of Botium) depend on the exploration step producing a complete and accurate Dialog Graph. No ground-truth comparison, manual inspection, soundness argument, or discussion of potential missed states/transitions (due to context, untriggered intents, or exploration limits) is provided. This is load-bearing for all reported results.
  Authors: The Dialog Graph is constructed via systematic API-driven exploration that enumerates all reachable states from the chatbot model by exercising every intent and slot combination. We acknowledge that the current abstract and manuscript lack an explicit soundness argument or discussion of potential missed transitions (e.g., due to context or exploration bounds). In revision we will add a dedicated paragraph in the methodology section describing the exploration algorithm and its limitations, and we will update the abstract to qualify the coverage claims as applying to the explored space.
  Revision: yes
- Referee: [Abstract] Abstract: No details are supplied on fault seeding method, oracle definition, subject selection criteria for the five Amazon Lex chatbots, or statistical significance of the results. Without these, the reliability of the performance claims cannot be assessed.
  Authors: The full manuscript's evaluation section defines fault seeding via five mutation operators on intents/slots, the oracle as response and state-transition matching, subject selection as five representative task-based Lex chatbots, and reports averages over repeated runs. These details are not summarized in the abstract. We will revise the abstract to include concise descriptions of the fault-seeding method, oracle, and subject criteria, and we will add a short discussion noting the absence of formal statistical significance tests given the small number of subjects.
  Revision: partial
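To make the phrase "response and state-transition matching" concrete, here is a minimal sketch of what such an oracle could look like. It is an assumption about the general shape, not the authors' implementation: `Step`, `FakeBot`, and `run_test` are hypothetical names, and `FakeBot` is a table-driven stand-in rather than a client for the Amazon Lex API. The mutant changes one response, loosely mimicking a seeded fault.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    utterance: str
    expected_response: str
    expected_state: str


class FakeBot:
    """Hypothetical stand-in for a chatbot client; not the Amazon Lex API."""

    def __init__(self, table, state="start"):
        self.table = table  # (state, utterance) -> (response, next state)
        self.state = state

    def send(self, utterance):
        response, self.state = self.table[(self.state, utterance)]
        return response


def run_test(bot, steps):
    """Return the index of the first failing step, or None if all steps pass."""
    for index, step in enumerate(steps):
        response = bot.send(step.utterance)
        if response != step.expected_response or bot.state != step.expected_state:
            return index
    return None


TABLE = {
    ("start", "book a hotel"): ("Which city?", "ask_city"),
    ("ask_city", "Rome"): ("How many nights?", "ask_nights"),
}
# Seeded fault: the second response is altered, the transition is unchanged.
MUTANT = dict(TABLE)
MUTANT[("ask_city", "Rome")] = ("How many days?", "ask_nights")

STEPS = [
    Step("book a hotel", "Which city?", "ask_city"),
    Step("Rome", "How many nights?", "ask_nights"),
]

original_result = run_test(FakeBot(TABLE), STEPS)
mutant_result = run_test(FakeBot(MUTANT), STEPS)
```

Under this reading, a mutant counts as detected when some test in the suite reports a failing step on it while passing on the original chatbot.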
Circularity Check
No significant circularity; evaluation is external comparison
The paper describes an exploration-based technique to build a Dialog Graph followed by test generation and an empirical evaluation against the independent external tool Botium on five Amazon Lex chatbots. No equations, fitted parameters, self-citations used as load-bearing premises, or reductions of results to inputs by construction are present. The reported outperformance metrics rest on direct observable comparisons rather than internal definitions or self-referential steps, satisfying the condition for a self-contained result against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The conversational space of an Amazon Lex chatbot can be systematically explored to produce a complete Dialog Graph of all possible interactions.