arxiv: 2607.01980 · v1 · pith:X7B3LEK6new · submitted 2026-07-02 · 💻 cs.SE

Epic-Organized vs. Requirement-Aligned Gherkin: An Empirical Evaluation of LLM-Based Acceptance Criteria Generation

Pith reviewed 2026-07-03 08:56 UTC · model grok-4.3

classification 💻 cs.SE
keywords Gherkin · BDD · LLM · acceptance criteria · requirements engineering · epic-organized generation · semantic coverage · PURE dataset

The pith

Epic-organized LLM generation of Gherkin produces higher expert-rated quality than requirement-aligned generation while preserving semantic coverage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares two LLM strategies for generating Gherkin acceptance criteria in behavior-driven development. It tests whether organizing generation around epics rather than aligning directly to individual requirements yields better results. On four documents from the PURE dataset, the epic-organized pipeline matched the baseline on semantic coverage but received higher ratings from experts on correctness, executability, and completeness. The work addresses the manual bottleneck in authoring Gherkin scenarios for requirements engineering.

Core claim

The epic-organized approach, implemented as the Timeless pipeline, produced Gherkin scenarios that experts rated higher on correctness (4.61 vs 4.14), executability (4.61 vs 4.07), and completeness (4.31 vs 3.50) compared to a requirement-aligned zero-shot baseline, while achieving comparable semantic requirement coverage (94.3% vs 92.9%).

What carries the argument

The Timeless epic-organized LLM pipeline, which structures generation around epics rather than individual requirements, evaluated through structural validity checks, TF-IDF and embedding-based semantic coverage, and blind expert assessment.
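
As a rough illustration of the lexical half of that evaluation, the sketch below computes a Requirement Coverage Rate from TF-IDF cosine similarity. It assumes a requirement counts as covered when its best-matching scenario reaches a similarity threshold; the tokenizer, the smoothed IDF formula, the default threshold of 0.2, and all function names are assumptions made for this example, not details taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF keeps weights positive even for terms in every document.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def coverage_rate(requirements, scenarios, threshold=0.2):
    """Fraction of requirements whose best-matching scenario meets the threshold.

    The threshold is a hypothetical value chosen for illustration only.
    """
    if not requirements:
        return 0.0
    vecs = tfidf_vectors(requirements + scenarios)
    req_vecs, scen_vecs = vecs[:len(requirements)], vecs[len(requirements):]
    covered = sum(
        1 for r in req_vecs
        if max((cosine(r, s) for s in scen_vecs), default=0.0) >= threshold
    )
    return covered / len(requirements)
```

Because this score depends only on shared tokens, a scenario that paraphrases a requirement in different words scores low. An embedding-based variant would swap `tfidf_vectors` for a sentence encoder and keep the same thresholding logic.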

If this is right

  • JSON-constrained LLM pipelines can achieve full structural validity in generated Gherkin scenarios.
  • TF-IDF lexical metrics may underestimate coverage when scenarios paraphrase requirements at a higher level of abstraction.
  • Epic-organized generation can improve perceived quality metrics without loss of semantic requirement coverage.
  • The comparison is limited to four documents and requires broader replication before generalization.
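
The first point above depends on what "structural validity" means, which this summary does not define. A minimal sketch of one possible check follows, assuming a scenario is valid when it opens with a `Scenario:` header and contains Given, When, and Then steps in that order, with `And`/`But` continuing the previous step. These rules and the function name are assumptions for illustration; tags, comments, data tables, and Scenario Outlines are deliberately not handled.

```python
import re

# A step line starts with a Gherkin keyword followed by some text.
STEP = re.compile(r"^(Given|When|Then|And|But)\s+\S")
ORDER = {"Given": 0, "When": 1, "Then": 2}


def is_structurally_valid(scenario_text):
    """Check one Gherkin scenario against an assumed minimal rule set."""
    lines = [ln.strip() for ln in scenario_text.splitlines() if ln.strip()]
    if not lines or not re.match(r"^Scenario:\s*\S", lines[0]):
        return False
    seen = set()   # primary keywords encountered so far
    stage = -1     # furthest stage reached (Given=0, When=1, Then=2)
    for line in lines[1:]:
        match = STEP.match(line)
        if not match:
            return False          # non-step line inside the scenario
        keyword = match.group(1)
        if keyword in ("And", "But"):
            if stage < 0:
                return False      # And/But cannot be the first step
            continue
        if ORDER[keyword] < stage:
            return False          # e.g. a Given after a Then
        stage = ORDER[keyword]
        seen.add(keyword)
    return seen == {"Given", "When", "Then"}
```

A structural validity rate would then be the share of generated scenarios for which this function returns `True`.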

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams using BDD could reduce manual Gherkin authoring effort by adopting epic-organized LLM pipelines in their requirements tools.
  • Applying the approach to industry requirements outside the PURE dataset would test whether quality gains hold in practice.
  • Combining epic organization with additional prompting methods might produce further gains in the rated quality dimensions.

Load-bearing premise

The four requirements documents from the PURE dataset and assessments by four expert raters provide a sufficient and representative basis for generalizing that epic-organized generation improves perceived Gherkin quality.

What would settle it

A replication using ten or more independent expert raters across additional requirements documents that finds no significant preference for the epic-organized outputs on correctness, executability, or completeness.
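
One way such a replication could test for a "significant preference" is an exact sign-flip permutation test on paired ratings, sketched below. The choice of test, the pairing of one rating per rater (or item) for each pipeline, and the function name are assumptions for this example; the paper reports no hypothesis tests.

```python
from itertools import product


def paired_permutation_p_value(a, b):
    """Two-sided exact sign-flip permutation test on paired differences.

    a and b hold paired scores (for example one mean rating per rater for
    each pipeline). Enumerates all 2**n sign assignments, so it is only
    practical for small n (roughly n <= 20).
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total
```

With only four paired observations, the smallest two-sided p-value this test can return is 2/16 = 0.125, so it cannot reach the conventional 0.05 level however large the differences are. With ten pairs the floor drops to 2/1024, which is one reason a larger rater pool matters.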

Figures

Figures reproduced from arXiv: 2607.01980 by François Christophe, Jussi Rasku, Mateen Abbasi, Pekka Abrahamsson, Shahbaz Siddeeq, Tommi Mikkonen, Zheying Zhang.

Figure 1
Figure 1: Timeless system architecture. The Manager Service coordinates the pipeline; SSE delivers real-time state updates to the React frontend. Excerpt from Section 3.2 (The Timeless Platform): "Timeless [15] is a requirements engineering platform that transcribes meeting speech and extracts requirements in real time"
Figure 2
Figure 2: The Timeless requirements-to-Gherkin pipeline. Each stage runs as an asynchronous LLM call; the SSE stream updates the User Interface (UI) as each stage completes. Excerpt from the surrounding text: "… with Given-When-Then scenarios for each epic); Step 4 (Gherkin generation) runs after epics stabilize and is the focus of this evaluation." Excerpt from Section 4.1 (Research Questions): "We evaluate the approach through four Research Questions. The zero-sh…"
Figure 3
Figure 3: Results overview. (a) Quality profile: six headline metrics normalized to [0, 1] (higher = better for all axes). Timeless (solid, filled) and the zero-shot baseline (dashed, open markers) are compared across structural validity, lexical and semantic coverage, and the three expert evaluation dimensions. (b) Lexical vs semantic RCR per document; filled circles = Timeless, open squares = baseline. Points abov…
Original abstract

Automated authoring of Gherkin Behavior-Driven Development (BDD) acceptance criteria remains a manual bottleneck in requirements engineering. This study investigates whether epic-organized LLM-generated Gherkin produces higher quality and coverage than requirement-aligned generation. We compare our Timeless (an epic-organized LLM pipeline) approach against a naive large language model (LLM) baseline on four requirements documents (107 requirements) from the PURE dataset. Evaluation covers structural metrics, automated requirement coverage via TF-IDF and dense embeddings, and blind expert assessment by four researchers. In our evaluation, the JSON-constrained pipeline produced structurally valid scenarios across all generated outputs, while the zero-shot baseline achieved 99% structural validity. Semantic coverage was comparable to the baseline, with Timeless achieving 94.3% semantic Requirement Coverage Rate compared with 92.9% for the baseline. TF-IDF produced lower coverage scores for the epic-organized output, suggesting that lexical metrics may miss coverage when scenarios paraphrase requirements at a higher level of abstraction. Expert raters prefer the epic-organized strategy on Correctness (4.61 vs 4.14), Executability (4.61 vs 4.07), and Completeness (4.31 vs 3.50). Overall, the results suggest that epic-organized generation can improve perceived Gherkin quality while maintaining comparable semantic coverage, although broader replication is needed before generalizing this finding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that an epic-organized LLM pipeline (Timeless) for generating Gherkin BDD acceptance criteria yields higher perceived quality than a requirement-aligned zero-shot baseline, as measured by expert ratings on Correctness (4.61 vs 4.14), Executability (4.61 vs 4.07), and Completeness (4.31 vs 3.50), while achieving comparable semantic coverage (94.3% vs 92.9% via embeddings) on four PURE dataset documents containing 107 requirements. Structural validity is near-perfect for both approaches, and the authors conclude that epic organization can improve quality with a call for broader replication.

Significance. If the central claim holds after addressing evaluation limitations, the work offers a practical contribution to automated requirements engineering by showing that organizing LLM generation around epics rather than individual requirements can enhance perceived Gherkin attributes without sacrificing coverage. The combination of structural metrics, dual semantic coverage methods (TF-IDF and embeddings), and blind expert assessment is a methodological strength, as is the explicit acknowledgment that replication is needed.

major comments (2)
  1. [Expert Assessment / Results] Expert Assessment (results section): The mean expert preference scores are reported from only four raters on four documents with no inter-rater agreement statistic (e.g., Fleiss' kappa or ICC), no per-rater or per-document score distributions, no standard deviations or confidence intervals, and no hypothesis tests. This directly undermines the load-bearing claim that epic-organized generation improves perceived quality, as the differences could reflect sampling variability or rater-specific biases rather than a general effect.
  2. [Evaluation Setup] Evaluation Setup (experimental design): The study relies on only four requirements documents (107 requirements total) from a single source (PURE dataset). While the paper correctly flags the need for replication, the absence of any power analysis, diversity justification, or sensitivity checks means the observed quality differences cannot yet support the generalization that epic-organized generation improves Gherkin quality.
minor comments (2)
  1. [Methods] The abstract states the JSON-constrained pipeline achieved 100% structural validity while the baseline reached 99%; the precise definition and automated checks used to determine 'structural validity' of Gherkin scenarios should be detailed in the methods.
  2. [Semantic Coverage Analysis] The discrepancy between TF-IDF coverage (lower for Timeless) and embedding-based coverage (comparable) is noted as evidence that lexical metrics miss higher-abstraction paraphrasing, but no quantitative comparison of the two coverage methods or example scenarios illustrating the difference is provided.
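
The inter-rater agreement statistic requested in major comment 1 could be computed along these lines. This is a sketch of the standard Fleiss' kappa formula, assuming the ratings are tabulated as per-item counts of raters choosing each category; the function name and input layout are choices made for this example.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[item][category] of rater counts.

    Every item must be rated by the same number of raters (at least two).
    Returns NaN when all ratings fall in a single category, where kappa
    is undefined.
    """
    if not counts:
        raise ValueError("counts must contain at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Observed agreement: proportion of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)
```

Fleiss' kappa treats categories as nominal, so on a 1 to 5 rating scale it counts a 4-versus-5 disagreement the same as a 1-versus-5 disagreement. For ordinal ratings the ICC that the referee also mentions may be the more informative choice.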

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and will revise the manuscript to improve statistical reporting and limitation discussions while preserving the original empirical observations.

  1. Referee: [Expert Assessment / Results] Expert Assessment (results section): The mean expert preference scores are reported from only four raters on four documents with no inter-rater agreement statistic (e.g., Fleiss' kappa or ICC), no per-rater or per-document score distributions, no standard deviations or confidence intervals, and no hypothesis tests. This directly undermines the load-bearing claim that epic-organized generation improves perceived quality, as the differences could reflect sampling variability or rater-specific biases rather than a general effect.

    Authors: We agree that the small rater and document counts warrant more transparent reporting. In revision we will add standard deviations and 95% confidence intervals to the mean scores, include per-document score tables, and compute/report Fleiss' kappa from the existing ratings. We will also explicitly note the absence of hypothesis tests and the risk of rater bias. The consistent directional preference across metrics still supports the observed pattern, but we accept that stronger statistical framing is needed and will adjust the results and discussion sections accordingly. revision: partial

  2. Referee: [Evaluation Setup] Evaluation Setup (experimental design): The study relies on only four requirements documents (107 requirements total) from a single source (PURE dataset). While the paper correctly flags the need for replication, the absence of any power analysis, diversity justification, or sensitivity checks means the observed quality differences cannot yet support the generalization that epic-organized generation improves Gherkin quality.

    Authors: We concur that the single-source sample of four documents limits generalizability, which is why the manuscript already calls for replication. We will add a post-hoc power note, a short justification for PURE document selection (public availability and domain variety), and a sensitivity discussion (e.g., noting that leave-one-document-out patterns remain directionally consistent). No new data collection is proposed; the revision will strengthen the limitations paragraph without altering the core claim for the studied cases. revision: yes
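
The leave-one-document-out sensitivity check mentioned in the second response could look like the sketch below. It assumes ratings are grouped by document for each pipeline; the data layout and function names are hypothetical, and no values from the paper are used.

```python
from statistics import mean


def leave_one_out_differences(scores_a, scores_b):
    """Mean difference (A minus B) recomputed with each document held out.

    scores_a and scores_b map a document id to a list of ratings and must
    cover the same documents (at least two).
    """
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both pipelines must be rated on the same documents")
    if len(scores_a) < 2:
        raise ValueError("need at least two documents to hold one out")
    results = {}
    for held_out in scores_a:
        rest_a = [r for doc, rs in scores_a.items() if doc != held_out for r in rs]
        rest_b = [r for doc, rs in scores_b.items() if doc != held_out for r in rs]
        results[held_out] = mean(rest_a) - mean(rest_b)
    return results


def directionally_consistent(differences):
    """True when every held-out difference has the same strict sign."""
    values = list(differences.values())
    return all(v > 0 for v in values) or all(v < 0 for v in values)
```

Directional consistency only shows that no single document drives the sign of the effect. It says nothing about effect size or significance, so it complements rather than replaces the confidence intervals promised in the first response.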

Circularity Check

0 steps flagged

No circularity: empirical evaluation with direct measurements

This is a purely empirical comparison paper. It generates Gherkin scenarios via two LLM pipelines (Timeless epic-organized vs. zero-shot baseline), then measures structural validity, TF-IDF/dense-embedding coverage, and blind expert ratings on four PURE documents. No equations, fitted parameters, predictions derived from inputs, or self-citations are used to support the central claims. All reported differences (e.g., expert means 4.61 vs 4.14) are direct observations from the evaluation protocol. The paper itself notes the need for replication, confirming the results are not presented as self-contained derivations. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper's claims depend on the reliability of human evaluation and the representativeness of the chosen dataset rather than on new mathematical constructs or fitted parameters.

axioms (2)
  • Domain assumption: Expert assessments by four researchers provide a valid measure of Gherkin scenario quality.
    The preference for epic-organized generation is based on these ratings for Correctness, Executability, and Completeness.
  • Domain assumption: The subset of four requirements documents from the PURE dataset is sufficient to draw conclusions about the two generation strategies.
    The evaluation and claims are based on results from these documents.

pith-pipeline@v0.9.1-grok · 5816 in / 1449 out tokens · 41173 ms · 2026-07-03T08:56:41.622166+00:00 · methodology

Reference graph

Works this paper leans on

20 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1] Arora, C., Grundy, J., Abdelrazek, M.: Advancing requirements engineering through generative ai: Assessing the role of llms. In: Generative AI for Effective Software Development, pp. 129–148. Springer (2024)
  2. [2] Binamungu, L.P., Embury, S.M., Konstantinou, N.: Maintaining behaviour driven development specifications: Challenges and opportunities. In: 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). pp. 175–184. IEEE (2018)
  3. [3] Binamungu, L.P., Maro, S.: Behaviour driven development: A systematic mapping study. Journal of Systems and Software 203, 111749 (2023)
  4. [4] Cucumber Open Source Project: Cucumber – bdd testing and automation. https://cucumber.io (2023)
  5. [5] Ferrari, A., Spagnolo, G.O., Gnesi, S.: PURE: A dataset of public requirements documents. In: Proceedings of the 25th International Requirements Engineering Conference (RE). pp. 502–503. IEEE (2017). https://doi.org/10.1109/RE.2017.29, dataset available at: https://zenodo.org/record/1414117
  6. [6] Ferreira, M., Viegas, L., Faria, J.P., Lima, B.: Acceptance test generation with large language models: An industrial case study. In: 2025 IEEE/ACM International Conference on Automation of Software Test (AST). pp. 1–11. IEEE (2025)
  7. [7] Hassani, S., Sabetzadeh, M., Amyot, D.: From law to gherkin: A human-centred quasi-experiment on the quality of llm-generated behavioural specifications from food-safety regulations. Information and Software Technology p. 108122 (2026)
  8. [8] Kamsties, E., Hörmann, K., Schlich, M.: Requirements engineering in small and medium enterprises: State-of-the-practice, problems, solutions, and technology transfer. Requirements Engineering 3(2), 84–90 (1998)
  9. [9] Karpurapu, S., Myneni, S., Nettur, U., Gajja, L.S., Burke, D., Stiehm, T., Payne, J.: Comprehensive evaluation and insights into the use of large language models in the automation of behavior-driven development acceptance test formulation. IEEE Access (2024)
  10. [10] Lucassen, G., Dalpiaz, F., van der Werf, J.M.E., Brinkkemper, S.: Improving agile requirements: the quality user story framework and tool. Requirements engineering 21(3), 383–403 (2016)
  11. [11] North, D.: Introducing bdd. Better Software Magazine. [Online]. (2006), available at: https://dannorth.net/blog/introducing-bdd
  12. [12] Nuseibeh, B., Easterbrook, S.: Requirements engineering: A roadmap. In: Proceedings of the Conference on the Future of Software Engineering (ICSE 2000). pp. 35–46. ACM (2000)
  13. [13] OpenAI: Gpt-4 technical report. Tech. rep., OpenAI (2023), arXiv:2303.08774
  14. [14] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022)
  15. [15] Rasheed, Z., Sami, M.A., Rasku, J., Kemell, K.K., Zhang, Z., Harjamaki, J., Siddeeq, S., Lahti, S., Herda, T., Nurminen, M., et al.: Timeless: A vision for the next generation of software development. arXiv preprint arXiv:2411.08507 (2024)
  16. [16] Rathnayake, A., Shahin, M., Abaei, G.: Behaviour driven development scenario generation with large language models. arXiv preprint arXiv:2603.04729 (2026). https://doi.org/10.48550/arXiv.2603.04729
  17. [17] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 3982–3992. Association for Computational Linguistics (2019)
  18. [18] Ronanki, K., Cabrero-Daniel, B., Berger, C.: ChatGPT as a tool for user story quality evaluation: Trustworthy out of the box? In: Agile Processes in Software Engineering and Extreme Programming – Workshops (XP 2022 Workshops). Lecture Notes in Business Information Processing, vol. 489, pp. 173–181. Springer (2024). https://doi.org/10.1007/978-3-031-48550-...
  19. [19] Smart, J.F., Molak, J.: BDD in Action: Behavior-Driven Development for the Whole Software Lifecycle. Simon and Schuster (2023)
  20. [20] Solís, C., Wang, X.: A study of the characteristics of behaviour driven development. In: Proceedings of the 37th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA). pp. 383–387. IEEE (2011)