arxiv: 2607.01094 · v1 · pith:MAAORFT5new · submitted 2026-07-01 · 💻 cs.SE

A Model-based Testing Technique for Amazon Lex Task-based Chatbots

Pith reviewed 2026-07-02 08:09 UTC · model grok-4.3

classification 💻 cs.SE
keywords: model-based testing, chatbot testing, Amazon Lex, Dialog Graph, test generation, fault detection, conversational interfaces, task-based chatbots

The pith

LexTester builds a Dialog Graph of Amazon Lex chatbot conversations to generate test suites that outperform Botium in coverage and fault detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LexTester as an automated model-based testing approach that explores the space of possible user-bot exchanges in Amazon Lex task-based chatbots. It constructs a Dialog Graph representing those interactions and derives executable tests from it according to a chosen coverage strategy. Evaluated on five chatbots, the generated suites contain more tests than Botium's, the tests are nearly twice as complex, they reach 83-95 percent coverage of conversational elements, and they improve fault detection effectiveness by up to four times at comparable time costs. A reader would care because chatbots now handle real tasks in everyday apps, so stronger automated testing directly reduces the risk of broken conversations reaching users.

Core claim

LexTester explores the conversational space of the chatbot under test to generate a Dialog Graph of all possible interactions, from which an executable test suite is generated according to different coverage strategies. When compared to Botium on five Amazon Lex chatbots, LexTester produced more tests of nearly double complexity, achieved 83-95 percent coverage of conversational elements, and improved fault detection effectiveness by up to four times at comparable time costs.

What carries the argument

The Dialog Graph, which records every reachable state and transition in the chatbot's conversational space and serves as the source for coverage-driven test generation.
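As a rough illustration of what "recording every reachable state and transition" can mean, the sketch below builds a small graph by breadth-first exploration. The `BOT` dictionary, its block names, and `build_dialog_graph` are hypothetical; only the Code Hook outcomes (success, failure, timeout) and the FlowerType slot are taken from the Figure 5 excerpt. This is not LexTester's algorithm, which the page does not describe in detail.

```python
from collections import deque

# Hypothetical, heavily simplified bot description: each block lists its
# outgoing transitions as (label, target). Real Amazon Lex exports are richer.
BOT = {
    "start": [("OrderFlowers", "ask_type")],
    "ask_type": [("FlowerType=roses", "ask_date")],
    "ask_date": [("PickupDate=tomorrow", "code_hook")],
    "code_hook": [("success", "confirm"), ("failure", "end"), ("timeout", "end")],
    "confirm": [("yes", "end"), ("no", "ask_type")],
    "end": [],
}


def build_dialog_graph(bot, root="start"):
    """Breadth-first exploration: return reachable nodes and labelled edges."""
    nodes, edges = {root}, []
    queue = deque([root])
    while queue:
        state = queue.popleft()
        for label, target in bot.get(state, []):
            edges.append((state, label, target))
            if target not in nodes:  # enqueue each state only once
                nodes.add(target)
                queue.append(target)
    return nodes, edges


nodes, edges = build_dialog_graph(BOT)
print(len(nodes), "states,", len(edges), "transitions")
```

Because each state is enqueued once, the loop from `confirm` back to `ask_type` is recorded as an edge without causing the exploration to run forever.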

If this is right

  • Test suites generated from the Dialog Graph cover more conversational elements than suites that are manually crafted or produced by simpler automated approaches.
  • Higher fault detection rates become available without increasing testing time beyond current tool levels.
  • Different coverage strategies on the same graph allow testers to trade off suite size against thoroughness.
  • The same exploration process can be reapplied after chatbot updates to refresh the test suite automatically.
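The trade-off between suite size and thoroughness can be made concrete with a small sketch. It assumes two generic criteria, node coverage and edge coverage, and a greedy path selection; the page does not say which strategies LexTester actually implements, and `GRAPH`, `paths`, and `greedy_suite` are hypothetical names.

```python
# Hypothetical dialog graph: node -> list of successor nodes.
GRAPH = {
    "start": ["ask_type"],
    "ask_type": ["ask_date"],
    "ask_date": ["code_hook"],
    "code_hook": ["confirm", "fail_msg", "timeout_msg"],
    "confirm": ["end", "ask_type"],
    "fail_msg": ["end"],
    "timeout_msg": ["end"],
    "end": [],
}


def paths(graph, node="start", goal="end", max_len=10):
    """Yield paths from node to goal that contain at most max_len nodes."""
    if node == goal:
        yield [node]
        return
    if max_len <= 1:  # length budget exhausted before reaching the goal
        return
    for nxt in graph[node]:
        for rest in paths(graph, nxt, goal, max_len - 1):
            yield [node] + rest


def greedy_suite(candidates, items_of):
    """Greedy set cover: keep adding the path that covers the most new items."""
    covered, suite = set(), []
    remaining = list(candidates)
    while True:
        best = max(remaining, key=lambda p: len(items_of(p) - covered), default=None)
        if best is None or not (items_of(best) - covered):
            return suite
        suite.append(best)
        covered |= items_of(best)
        remaining.remove(best)


all_paths = list(paths(GRAPH, max_len=10))
node_suite = greedy_suite(all_paths, lambda p: set(p))
edge_suite = greedy_suite(all_paths, lambda p: set(zip(p, p[1:])))
print(len(all_paths), len(node_suite), len(edge_suite))
```

On this toy graph, edge coverage needs one more test than node coverage, which is the kind of size-versus-thoroughness choice the bullet describes.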

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The Dialog Graph could serve as input for analyses beyond testing, such as identifying unreachable intents or measuring conversation complexity.
  • Similar graph-construction techniques might apply to chatbots built on other platforms if their interaction models can be queried in the same way.
  • Integrating the exploration step into continuous integration would let teams receive updated test suites after every chatbot change.
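The first extension, spotting unreachable intents, amounts to a reachability check over the graph. The sketch below is one way it might look; the graph, the intent names, and the `reachable` helper are invented for illustration and do not come from the paper.

```python
def reachable(graph, root):
    """Return every node reachable from root (iterative depth-first search)."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


# Hypothetical graph: TrackOrder is declared but no transition leads to it.
GRAPH = {
    "welcome": ["OrderFlowers", "CancelOrder"],
    "OrderFlowers": ["end"],
    "CancelOrder": ["end"],
    "TrackOrder": ["end"],
    "end": [],
}
DECLARED_INTENTS = {"OrderFlowers", "CancelOrder", "TrackOrder"}

unreachable = DECLARED_INTENTS - reachable(GRAPH, "welcome")
print(sorted(unreachable))
```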

Load-bearing premise

Automatically exploring the chatbot produces a complete and accurate Dialog Graph that includes every relevant interaction without missing states or transitions.

What would settle it

A manual enumeration of all conversation paths in one of the five evaluated chatbots that finds a path absent from the Dialog Graph produced by LexTester.
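Such a check is mechanical once the graph is available as a set of edges. A minimal sketch, assuming edges are stored as (source, target) pairs and using a hypothetical helper `first_missing_step`:

```python
def first_missing_step(edges, path):
    """Return the first (source, target) step of path absent from edges, else None."""
    for step in zip(path, path[1:]):
        if step not in edges:
            return step
    return None


# Hypothetical edge set standing in for a generated Dialog Graph.
GRAPH_EDGES = {
    ("start", "ask_type"),
    ("ask_type", "ask_date"),
    ("ask_date", "confirm"),
    ("confirm", "end"),
}

# A manually enumerated path in which the user rejects the confirmation.
manual_path = ["start", "ask_type", "ask_date", "confirm", "ask_type"]
print(first_missing_step(GRAPH_EDGES, manual_path))
```

A non-`None` result would be a counterexample to the completeness premise; `None` for every enumerated path would support it for that chatbot only.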

Figures

Figures reproduced from arXiv: 2607.01094 by Alessandro Vasina, Diego Clerissi, Leonardo Mariani.

Figure 1
Figure 1: Task-based chatbot architecture in Amazon Lex. Section 2.2, Amazon Lex Chatbots: Amazon Lex is the chatbot design platform developed by Amazon (current version V2), integrated within Amazon Web Services (AWS), natively supporting connection to, among the others, Amazon cloud services, advanced Natural Language Understanding (NLU) technologies, and lambda functions (i.e., serverless computing services that enable…
Figure 2
Figure 2: A portion of a conversation modeled in the Visual Conversation Builder. Text extracted with the figure:
    #me I want to order some roses
    #bot When?
    #me tomorrow
    #bot Ok, I will place your order. Else?
    #me That’s all, thanks
    #bot INTENT DONE
Figure 3
Figure 3: An exemplified Botium test case. In Botium, a test case is a conversational scenario, composed of text-based steps between the user (annotated with #me) and the chatbot (annotated with #bot), derived from the files of the chatbot implementation that encode the possible requests and responses
Figure 4
Figure 4: LexTester pipeline. … generation and the data used to generate user input. The tool parses the content of the chatbot conversational data, then (i) generates a Dialog Graph, (ii) generates abstract test cases from the graph in the form of conversational paths according to the chosen coverage strategy, and (iii) generates concrete executable test cases in a Botium-like format from abstract test cases, instan…
Figure 5
Figure 5: A Dialog Graph generated by LexTester. … Code Hook block invoking a lambda function. The block exposes three possible outcomes (success, failure, and timeout), whose associated values are undefined since they are not statically defined in the chatbot configuration but depend on the lambda implementation. In the success branch, multiple Get slot value blocks elicit the required slots (e.g., FlowerType), follo…
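The Figure 4 excerpt mentions turning abstract test cases into concrete tests in a Botium-like format. The sketch below shows one way that last step might look, rendering a list of turns as a script with #me and #bot markers. The function `to_convo`, the test name, and the exact layout (marker on its own line, blank line between steps) are assumptions, not LexTester's output format; the turns are the ones from the conversation excerpt above.

```python
def to_convo(name, turns):
    """Render (speaker, text) turns as a Botium-like conversation script."""
    lines = [name, ""]
    for speaker, text in turns:
        if speaker not in ("me", "bot"):
            raise ValueError(f"unknown speaker: {speaker!r}")
        # Assumed layout: marker line, then the utterance, then a blank line.
        lines += [f"#{speaker}", text, ""]
    return "\n".join(lines).rstrip("\n") + "\n"


# Turns copied from the conversation excerpt shown with the figures.
TURNS = [
    ("me", "I want to order some roses"),
    ("bot", "When?"),
    ("me", "tomorrow"),
    ("bot", "Ok, I will place your order. Else?"),
    ("me", "That's all, thanks"),
    ("bot", "INTENT DONE"),
]

convo = to_convo("order_roses", TURNS)
print(convo)
```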
Original abstract

Task-based chatbots are nowadays widely adopted software systems, usually integrated into real-world applications and communication channels, designed to assist users in completing tasks through conversational interfaces. Like any other software, even chatbots are prone to bugs. Despite their increasing pervasiveness in everyday activities, existing techniques for assessing their quality still exhibit several limitations, such as the simplicity of generated test scenarios and oracle weaknesses. In this paper, we present LexTester, an automated model-based testing technique for Amazon Lex chatbots. The technique explores the conversational space of the chatbot under test to generate a Dialog Graph of all possible interactions, from which an executable test suite is generated according to different coverage strategies. LexTester was evaluated against the state-of-the-practice chatbot testing tool Botium on five Amazon Lex chatbots, consistently outperforming it in all subjects, generating more tests with nearly double complexity, achieving overall 83-95% coverage of conversational elements, and improving fault detection effectiveness by up to four times at comparable time costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents LexTester, a model-based testing technique for Amazon Lex task-based chatbots. It explores the conversational space to construct a Dialog Graph of all possible interactions and generates executable test suites according to coverage strategies. On five Amazon Lex chatbots, LexTester is reported to outperform Botium by producing more tests of nearly double complexity, achieving 83-95% coverage of conversational elements, and improving fault detection by up to 4x at comparable time costs.

Significance. If the Dialog Graph is verifiably complete and the evaluation details are provided, the technique could offer a practical advance over existing chatbot testing tools by addressing limitations in scenario simplicity and oracle strength. The quantitative comparison on multiple subjects and metrics (coverage, complexity, fault detection) would be a useful contribution to model-based testing for conversational systems.

major comments (2)
  1. [Abstract] Abstract (paragraph 2): The central evaluation claims (83-95% coverage, up to 4x fault detection improvement, consistent outperformance of Botium) depend on the exploration step producing a complete and accurate Dialog Graph. No ground-truth comparison, manual inspection, soundness argument, or discussion of potential missed states/transitions (due to context, untriggered intents, or exploration limits) is provided. This is load-bearing for all reported results.
  2. [Abstract] Abstract: No details are supplied on fault seeding method, oracle definition, subject selection criteria for the five Amazon Lex chatbots, or statistical significance of the results. Without these, the reliability of the performance claims cannot be assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly.

  1. Referee: [Abstract] Abstract (paragraph 2): The central evaluation claims (83-95% coverage, up to 4x fault detection improvement, consistent outperformance of Botium) depend on the exploration step producing a complete and accurate Dialog Graph. No ground-truth comparison, manual inspection, soundness argument, or discussion of potential missed states/transitions (due to context, untriggered intents, or exploration limits) is provided. This is load-bearing for all reported results.

    Authors: The Dialog Graph is constructed via systematic API-driven exploration that enumerates all reachable states from the chatbot model by exercising every intent and slot combination. We acknowledge that the current abstract and manuscript lack an explicit soundness argument or discussion of potential missed transitions (e.g., due to context or exploration bounds). In revision we will add a dedicated paragraph in the methodology section describing the exploration algorithm and its limitations, and we will update the abstract to qualify the coverage claims as applying to the explored space. revision: yes

  2. Referee: [Abstract] Abstract: No details are supplied on fault seeding method, oracle definition, subject selection criteria for the five Amazon Lex chatbots, or statistical significance of the results. Without these, the reliability of the performance claims cannot be assessed.

    Authors: The full manuscript's evaluation section defines fault seeding via five mutation operators on intents/slots, the oracle as response and state-transition matching, subject selection as five representative task-based Lex chatbots, and reports averages over repeated runs. These details are not summarized in the abstract. We will revise the abstract to include concise descriptions of the fault-seeding method, oracle, and subject criteria, and we will add a short discussion noting the absence of formal statistical significance tests given the small number of subjects. revision: partial
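To make the terms in the second response concrete, here is a toy sketch of an oracle that matches both the response and the reached state, together with one seeded fault. Everything in it is hypothetical: the bot model, the `mutate_skip_slot` operator, and the fallback message are illustrations and not the five operators or the oracle the rebuttal refers to.

```python
import copy

# Hypothetical bot model: state -> {user_input: (bot_response, next_state)}.
BOT = {
    "start": {"I want to order some roses": ("When?", "ask_date")},
    "ask_date": {"tomorrow": ("Ok, I will place your order.", "done")},
    "done": {},
}

# Test case: (user_input, expected_response, expected_next_state) per turn.
TEST = [
    ("I want to order some roses", "When?", "ask_date"),
    ("tomorrow", "Ok, I will place your order.", "done"),
]


def run_test(bot, test, state="start"):
    """Oracle: each turn must match the expected response and the reached state."""
    for user_input, want_response, want_state in test:
        # Unknown input: assumed fallback reply, and the bot stays where it is.
        response, next_state = bot[state].get(
            user_input, ("Sorry, can you repeat?", state)
        )
        if (response, next_state) != (want_response, want_state):
            return False
        state = next_state
    return True


def mutate_skip_slot(bot):
    """Hypothetical mutation operator: the first turn jumps past the date slot."""
    mutant = copy.deepcopy(bot)
    mutant["start"]["I want to order some roses"] = (
        "Ok, I will place your order.",
        "done",
    )
    return mutant


print(run_test(BOT, TEST), run_test(mutate_skip_slot(BOT), TEST))
```

The test passes on the original model and fails on the mutant, which is what "detecting" a seeded fault means in this setting.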

Circularity Check

0 steps flagged

No significant circularity; the evaluation is an external comparison

The paper describes an exploration-based technique to build a Dialog Graph followed by test generation and an empirical evaluation against the independent external tool Botium on five Amazon Lex chatbots. No equations, fitted parameters, self-citations used as load-bearing premises, or reductions of results to inputs by construction are present. The reported outperformance metrics rest on direct observable comparisons rather than internal definitions or self-referential steps, satisfying the condition for a self-contained result against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that exhaustive exploration of the conversational space yields a faithful model of the chatbot under test.

axioms (1)
  • domain assumption: The conversational space of an Amazon Lex chatbot can be systematically explored to produce a complete Dialog Graph of all possible interactions.
    Invoked in the description of how LexTester generates the graph (abstract, paragraph 2).

pith-pipeline@v0.9.1-grok · 5702 in / 1172 out tokens · 17549 ms · 2026-07-02T08:09:53.815152+00:00 · methodology
