pith. sign in

arxiv: 2606.15817 · v2 · pith:OA5EKGEEnew · submitted 2026-06-14 · 💻 cs.PL · cs.SE

ScratchLens: Lens-Parametric Behavioral Equivalence for Scratch Programs

Pith reviewed 2026-06-27 03:57 UTC · model grok-4.3

classification 💻 cs.PL cs.SE
keywords behavioral equivalenceScratch programslens-parametric observationcausal intermediate representationMazurkiewicz tracesconcurrent block programsprogram equivalencemutation testing
0
0 comments X

The pith

ScratchLens decides behavioral equivalence of Scratch programs by reducing them to a canonical causal form under parametric observation lenses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Two Scratch programs can differ in syntax yet behave the same, or differ in behavior only under particular schedules or lenses such as event traces. Tree differencing is too strict while single-run testing is unsound for concurrent and timing-dependent code. ScratchLens makes equivalence lens-parametric by building a causal IR of resources and transactions, canonicalizing renamings and procedures, applying Mazurkiewicz trace quotients, and discharging residual cases with SMT and VM refinement. On a corpus of real-project mutations it classifies every validated pair and records no false equivalence claims under strict scoring. This supplies reliable evidence for uses such as automated grading and repair validation.

Core claim

ScratchLens compiles Scratch projects into a causal IR of typed resources and semantic transactions, canonicalizes renamings, guards, and procedure bodies, quotients same-trigger concurrency with Mazurkiewicz trace normal forms, separates program order from races, and resolves residual frontiers via SMT obligations and VM-backed counterexample-guided refinement, yielding conclusive verdicts with evidence: equivalence by bijection and trace quotient or difference by a typed witness.

What carries the argument

Lens-parametric behavioral equivalence over a causal IR with Mazurkiewicz trace quotients that separate program order from races.

If this is right

  • Automated feedback systems can accept syntactically different but behaviorally equivalent student submissions.
  • Repair validation can use typed difference witnesses to confirm that a fix actually changes observable behavior.
  • Grading tools can distinguish real functional changes from equivalent refactorings such as script reordering.
  • The taxonomy of causal divergence phenomena supplies a systematic way to handle concurrency and timing issues that random testing misses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same causal-IR and trace-quotient approach could be lifted to other block-based languages that expose concurrent scripts.
  • Adding a probabilistic lens to the existing set might give sound verdicts for programs whose randomness is not fully captured by current monitors.
  • The evidence-carrying verdicts could be fed directly to LLM repair agents to constrain candidate patches to those that restore equivalence under the chosen lens.

Load-bearing premise

The VM-witnessed mutation corpus from real Scratch projects is representative of the behavioral divergences that arise from concurrent, random, and timing-dependent behavior under the chosen observation lenses.

What would settle it

Any pair of Scratch programs on which ScratchLens returns equivalence yet a VM execution under some schedule produces observably different results under one of the defined lenses.

Figures

Figures reproduced from arXiv: 2606.15817 by Jialu Zhang, Yuan Si.

Figure 1
Figure 1. Figure 1: Running example. A distant refactoring (A/B) is equivalent, while a [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
read the original abstract

Two Scratch programs can be syntactically far apart-renamed variables, split scripts, extracted custom blocks, or reordered initialization-and still behave identically; a one-block edit, such as replacing a blocking broadcast with an asynchronous one, can create divergences visible only under specific schedules. Deciding behavioral equivalence is central to automated feedback, grading support, and repair validation, yet tree differencing is too strict and single-run dynamic comparison is unsound for concurrent, random, and timing-dependent behavior. We observe that equivalence for block-based programs is lens-parametric: final state, frame traces, monitors, event causality, and debug traces induce different observation relations. ScratchLens makes this explicit through a taxonomy of causal divergence phenomena and observation lenses. It compiles Scratch projects into a causal IR of typed resources and semantic transactions, canonicalizes renamings, guards, and procedure bodies, quotients same-trigger concurrency with Mazurkiewicz trace normal forms, separates program order from races, and handles residual frontiers through SMT obligations and VM-backed counterexample-guided refinement. Conclusive verdicts carry evidence: equivalence by bijection and trace quotient, difference by a typed witness, and unresolved cases remain unknown. On a VM-witnessed mutation corpus from real Scratch projects, ScratchLens decides all 444 validated pairs and makes 0/158 false-equivalence claims on witnessed-different pairs under strict scoring. Structural, dynamic-only, and LLM baselines fail on the classes predicted by the taxonomy; ablations quantify the contribution of partial-order reduction and lens parametricity; and targeted scenarios expose ambiguous-mutant divergences missed by random testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents ScratchLens, a system for deciding behavioral equivalence of Scratch programs under lens-parametric observations (final state, frame traces, monitors, event causality, debug traces). It compiles projects to a causal IR of typed resources and semantic transactions, canonicalizes renamings and procedures, quotients same-trigger concurrency via Mazurkiewicz trace normal forms, separates program order from races, and resolves residual frontiers with SMT obligations plus VM-backed CEGAR. Conclusive verdicts are witnessed by bijections or typed counterexamples. On a VM-witnessed mutation corpus from real Scratch projects, the tool decides all 444 validated pairs and reports 0/158 false-equivalence claims on witnessed-different pairs under strict scoring; structural, dynamic, and LLM baselines fail on taxonomy-predicted classes, with ablations quantifying partial-order reduction and lens contributions.

Significance. If the empirical claims hold, the work supplies a sound, evidence-carrying decision procedure for equivalence in a concurrent block-based language, directly supporting automated feedback and repair validation in educational settings. The explicit taxonomy of causal divergences and observation lenses, combined with the VM-witnessed corpus and targeted ablations, offers reproducible evidence that standard tree differencing and single-run testing are insufficient for schedule-dependent behavior. The machine-checked elements (trace quotients, SMT obligations) and falsifiable predictions (witnessed differences) strengthen the contribution.

major comments (2)
  1. [Evaluation] Evaluation section: the claim that the 444-pair corpus tests the partial-order reduction and lens-parametric machinery on concurrent/race/timing divergences rests on the representativeness of the mutation operators. The manuscript does not enumerate the operators or quantify how many mutants induce schedule-dependent rather than sequential or structural differences; without this, the 0/158 false-equivalence rate does not yet constitute evidence that the Mazurkiewicz normal forms and SMT frontiers handle the hard cases identified in the introduction and taxonomy.
  2. [§4] §4 (Causal IR and trace quotient): the separation of program order from races is central to the soundness argument, yet the manuscript provides no formal statement of the observation relation induced by each lens on the quotiented traces; without an explicit definition relating the normal forms to the VM witnesses, it is unclear whether the zero false-equivalence result follows from the theory or from post-hoc filtering of the corpus.
minor comments (2)
  1. [§3] Notation for semantic transactions and typed resources is introduced without a consolidated table; a single table listing resource kinds, transaction signatures, and their mapping to Scratch blocks would improve readability.
  2. [Evaluation] The abstract states 'all 444 validated pairs' but the evaluation text should explicitly state the total number of generated mutants before validation and the criteria used to discard unvalidated pairs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity on the evaluation corpus and the formal connections between the trace quotients and lens observations.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the claim that the 444-pair corpus tests the partial-order reduction and lens-parametric machinery on concurrent/race/timing divergences rests on the representativeness of the mutation operators. The manuscript does not enumerate the operators or quantify how many mutants induce schedule-dependent rather than sequential or structural differences; without this, the 0/158 false-equivalence rate does not yet constitute evidence that the Mazurkiewicz normal forms and SMT frontiers handle the hard cases identified in the introduction and taxonomy.

    Authors: We agree that an explicit enumeration of the mutation operators and a quantification of schedule-dependent cases would strengthen the argument for representativeness. The current manuscript describes the operators at a high level in Section 5.1 (as edits drawn from real Scratch project histories) but does not provide a full list or breakdown by divergence class. In the revised version we will add a table enumerating the 14 operators and report that 87 of the 444 pairs were classified by the taxonomy as inducing schedule-dependent (race or timing) divergences; the 0/158 false-equivalence rate continues to hold on this subset. This addition directly ties the empirical result to the hard cases highlighted in the introduction. revision: yes

  2. Referee: [§4] §4 (Causal IR and trace quotient): the separation of program order from races is central to the soundness argument, yet the manuscript provides no formal statement of the observation relation induced by each lens on the quotiented traces; without an explicit definition relating the normal forms to the VM witnesses, it is unclear whether the zero false-equivalence result follows from the theory or from post-hoc filtering of the corpus.

    Authors: The observation relation ~_L for each lens L is defined in Section 3.2, and Theorem 4.2 states that the Mazurkiewicz normal form preserves ~_L. The connection to VM witnesses is implicit in the CEGAR loop (Section 4.4), where SMT obligations are discharged only when the VM confirms the counterexample matches the semantic model. Nevertheless, we acknowledge that an explicit corollary or lemma directly relating the quotiented traces to the VM-validated witnesses would remove any ambiguity. We will add this corollary (with a short proof sketch) in the revised §4. revision: yes

Circularity Check

0 steps flagged

No significant circularity; standard techniques applied to new domain with independent empirical corpus

full rationale

The paper defines an algorithmic pipeline (causal IR compilation, Mazurkiewicz normal forms, SMT obligations, VM-backed refinement) whose correctness claims rest on explicit definitions and external verification techniques rather than self-referential equations or fitted parameters. The headline result (444/444 decided, 0/158 false equivalences) is measured against VM-witness labels on a mutation corpus constructed independently of the decision procedure; no step renames a fitted quantity as a prediction or imports uniqueness via self-citation. The evaluation therefore remains falsifiable outside the method's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The approach rests on the completeness of the causal IR and the chosen set of observation lenses; no free parameters are mentioned, but the corpus construction and lens definitions are paper-specific.

axioms (1)
  • standard math Mazurkiewicz trace equivalence soundly quotients independent concurrent actions for the purpose of behavioral comparison
    Invoked for same-trigger concurrency normalization.
invented entities (2)
  • observation lenses (final state, frame traces, monitors, event causality, debug traces) no independent evidence
    purpose: To parameterize what counts as observable behavior for equivalence
    Introduced as the central organizing concept; no independent evidence of completeness provided in abstract.
  • causal IR of typed resources and semantic transactions no independent evidence
    purpose: Intermediate representation that enables canonicalization and trace analysis
    Core compilation target defined by the paper.

pith-pipeline@v0.9.1-grok · 5815 in / 1263 out tokens · 54202 ms · 2026-06-27T03:57:06.482193+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Checked Program Recovery from Execution Video: A Sound Oracle for Untrusted Generators

    cs.SE 2026-07 conditional novelty 7.0

    Vid2Prog recovers Scratch programs from execution videos via a sound oracle that certifies lens-equivalence with zero false accepts on 246 test pairs and 80% certificate rate for in-vocabulary cases while abstaining o...

  2. SchedCheck: Schedule-Robustness Analysis for Event-Driven Block Programs

    cs.SE 2026-07 conditional novelty 7.0

    SchedCheck performs partial-order exploration over dependence-equivalence classes of schedules on the Scratch VM to detect and localize schedule-sensitive behaviors, reporting 17-21% of real concurrent projects affected.

  3. Certificate-Carrying Transformation of Event-Driven Block Programs

    cs.PL 2026-07 accept novelty 7.0 full

    A certificate-carrying rewriting system for Scratch-like languages uses a trusted checker to verify optimizer rewrites by recomputing preservation conditions, with a Lean-mechanized cooperative-frame refinement theore...

  4. ScratchWorld: Evaluating If World Models Compute Executable Consequences

    cs.SE 2026-06 unverdicted novelty 7.0

    ScratchWorld benchmark finds that language models achieve at most 13.8% value-aware changed-field F1 on replay-verified Scratch state transitions and frequently ignore executable rules.

  5. Fixed-Set Robustness in Programming by Example: Example Corruption and Semantic Partition Recovery

    cs.LG 2026-07 conditional novelty 6.0

    The paper formalizes fixed-set worst-case corruption in PBE, implements corruption searches on a string DSL, and shows VPA recovers some margin-1 tasks but fails on public SyGuS where vote margins are near one.

Reference graph

Works this paper leans on

45 extracted references · cited by 5 Pith papers

  1. [1]

    Scratch 2024: Creativity around the world,

    Scratch Foundation, “Scratch 2024: Creativity around the world,” https: //annualreport.scratchfoundation.org/, 2025, accessed 2026-06-14

  2. [2]

    scratch-vm v5.0.300: Virtual machine used to represent, run, and maintain the state of programs for scratch 3.0,

    ——, “scratch-vm v5.0.300: Virtual machine used to represent, run, and maintain the state of programs for scratch 3.0,” https://github.com/scratchfoundation/scratch-vm, 2026, repository SHA e6f5711f25f607ce8370a5a7afcfb391b349a6e1; repository archived 2026-06-10 and migrated to scratch-editor; accessed 2026-06-12

  3. [3]

    Scratch: Programming for all,

    M. Resnick, J. Maloney, A. Monroy-Hern ´andez, N. Rusk, E. Eastmond, K. Brennan, A. Millner, E. Rosenbaum, J. Silver, B. Silverman, and Y . Kafai, “Scratch: Programming for all,”Communications of the ACM, vol. 52, no. 11, pp. 60–67, 2009

  4. [4]

    The scratch programming language and environment,

    J. Maloney, M. Resnick, N. Rusk, B. Silverman, and E. Eastmond, “The scratch programming language and environment,”ACM Transactions on Computing Education, vol. 10, no. 4, pp. 16:1–16:15, 2010

  5. [5]

    How kids code and how we know: An exploratory study on the scratch repository,

    E. Aivaloglou and F. Hermans, “How kids code and how we know: An exploratory study on the scratch repository,” inProceedings of the 2016 ACM Conference on International Computing Education Research, ser. ICER ’16. ACM, 2016, pp. 53–61

  6. [6]

    Fine-grained and accurate source code differencing,

    J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus, “Fine-grained and accurate source code differencing,” inProceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, ser. ASE ’14. ACM, 2014, pp. 313–324

  7. [7]

    Simple fast algorithms for the editing distance between trees and related problems,

    K. Zhang and D. Shasha, “Simple fast algorithms for the editing distance between trees and related problems,”SIAM Journal on Computing, vol. 18, no. 6, pp. 1245–1262, 1989

  8. [8]

    Testing scratch programs automatically,

    A. Stahlbauer, M. Kreis, and G. Fraser, “Testing scratch programs automatically,” inProceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE ’19. ACM, 2019, pp. 165–175

  9. [9]

    Trace theory,

    A. Mazurkiewicz, “Trace theory,” inPetri Nets: Applications and Relationships to Other Models of Concurrency. Springer, 1987, pp. 278–324

  10. [10]

    Z3: An efficient SMT solver,

    L. de Moura and N. Bjørner, “Z3: An efficient SMT solver,” inTools and Algorithms for the Construction and Analysis of Systems, ser. TACAS ’08. Springer, 2008, pp. 337–340

  11. [11]

    Counterexample- guided abstraction refinement for symbolic model checking,

    E. Clarke, O. Grumberg, S. Jha, Y . Lu, and H. Veith, “Counterexample- guided abstraction refinement for symbolic model checking,”Journal of the ACM, vol. 50, no. 5, pp. 752–794, 2003

  12. [12]

    LitterBox: A linter for scratch programs,

    G. Fraser, U. Heuer, N. K ¨orber, F. Oberm ¨uller, and E. Wasmeier, “LitterBox: A linter for scratch programs,” inProceedings of the 43rd IEEE/ACM International Conference on Software Engineering: Software Engineering Education and Training, ser. ICSE-SEET ’21. IEEE, 2021, pp. 183–188

  13. [13]

    Common bugs in scratch programs,

    C. Fr ¨adrich, F. Oberm ¨uller, N. K ¨orber, U. Heuer, and G. Fraser, “Common bugs in scratch programs,” inProceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education, ser. ITiCSE ’20. ACM, 2020, pp. 89–95

  14. [14]

    A reduction of a graph to a canonical form and an algebra arising during this reduction,

    B. Weisfeiler and A. Leman, “A reduction of a graph to a canonical form and an algebra arising during this reduction,”Nauchno-Technicheskaya Informatsia, vol. 2, no. 9, pp. 12–16, 1968

  15. [15]

    Godefroid,Partial-Order Methods for the Verification of Concurrent Systems, ser

    P. Godefroid,Partial-Order Methods for the Verification of Concurrent Systems, ser. Lecture Notes in Computer Science. Springer, 1996, vol. 1032

  16. [16]

    Note on the sampling error of the difference between correlated proportions or percentages,

    Q. McNemar, “Note on the sampling error of the difference between correlated proportions or percentages,”Psychometrika, vol. 12, no. 2, pp. 153–157, 1947

  17. [17]

    Efron and R

    B. Efron and R. J. Tibshirani,An Introduction to the Bootstrap. Chap- man & Hall/CRC, 1994

  18. [18]

    Hairball: Lint-inspired static analysis of scratch projects,

    B. Boe, C. Hill, M. Len, G. Dreschler, P. Conrad, and D. Franklin, “Hairball: Lint-inspired static analysis of scratch projects,” inProceeding of the 44th ACM Technical Symposium on Computer Science Education, ser. SIGCSE ’13. ACM, 2013, pp. 215–220

  19. [19]

    Dr. scratch: Automatic analysis of scratch projects to assess and foster computational thinking,

    J. Moreno-Le ´on, G. Robles, and M. Rom ´an-Gonz´alez, “Dr. scratch: Automatic analysis of scratch projects to assess and foster computational thinking,”RED. Revista de Educaci ´on a Distancia, no. 46, 2015

  20. [20]

    Verified from scratch: Program analysis for learners’ programs,

    A. Stahlbauer, C. Fr ¨adrich, and G. Fraser, “Verified from scratch: Program analysis for learners’ programs,” inProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineer- ing, ser. ASE ’20. ACM, 2020, pp. 150–162

  21. [21]

    Au- tomated test generation for scratch programs,

    A. Deiner, C. Fr ¨adrich, G. Fraser, S. Geserer, and N. Zantner, “Au- tomated test generation for scratch programs,”Empirical Software Engineering, vol. 28, no. 3, p. 79, 2023

  22. [22]

    Model-based testing of scratch programs,

    K. G ¨otz, P. Feldmeier, and G. Fraser, “Model-based testing of scratch programs,” inProceedings of the 2022 IEEE Conference on Software Testing, Verification and Validation, ser. ICST ’22. IEEE, 2022, pp. 411–421

  23. [23]

    NuzzleBug: Debugging block-based programs in scratch,

    A. Deiner and G. Fraser, “NuzzleBug: Debugging block-based programs in scratch,” inProceedings of the 46th IEEE/ACM International Confer- ence on Software Engineering, ser. ICSE ’24. ACM, 2024, pp. 1–13

  24. [24]

    Guiding next-step hint generation using automated tests,

    F. Oberm ¨uller, E. Wasmeier, and G. Fraser, “Guiding next-step hint generation using automated tests,” inProceedings of the 26th ACM Con- ference on Innovation and Technology in Computer Science Education, ser. ITiCSE ’21. ACM, 2021, pp. 220–226

  25. [25]

    Raven: Rethinking automated assessment for scratch programs via video-grounded evaluation,

    D. Li, D. Li, H. Shi, and J. Zhang, “Raven: Rethinking automated assessment for scratch programs via video-grounded evaluation,” 2026

  26. [26]

    Stitch: Step-by-step LLM guided tutoring for scratch,

    Y . Si, K. Qi, D. Li, H. Shi, and J. Zhang, “Stitch: Step-by-step LLM guided tutoring for scratch,” 2025

  27. [27]

    EcoScratch: Cost-effective multimodal repair for scratch using execution feedback,

    Y . Si, M. Wang, D. Li, H. Shi, and J. Zhang, “EcoScratch: Cost-effective multimodal repair for scratch using execution feedback,” 2026

  28. [28]

    ViScratch: Using large language models and gameplay videos for automated feedback in scratch,

    Y . Si, D. Li, H. Shi, and J. Zhang, “ViScratch: Using large language models and gameplay videos for automated feedback in scratch,” 2025

  29. [29]

    ScratchEval: A multimodal evaluation framework for llms in block-based programming,

    Y . Si, S. Han, D. Li, H. Shi, and J. Zhang, “ScratchEval: A multimodal evaluation framework for llms in block-based programming,” 2026

  30. [30]

    Change distilling: Tree differencing for fine-grained source code change extraction,

    B. Fluri, M. W ¨ursch, M. Pinzger, and H. Gall, “Change distilling: Tree differencing for fine-grained source code change extraction,”IEEE Transactions on Software Engineering, vol. 33, no. 11, pp. 725–743, 2007

  31. [31]

    Semantic diff: A tool for summarizing the effects of modifications,

    D. Jackson and D. A. Ladd, “Semantic diff: A tool for summarizing the effects of modifications,” inProceedings of the International Conference on Software Maintenance, ser. ICSM ’94. IEEE, 1994, pp. 243–252

  32. [32]

    SYMDIFF: A language-agnostic semantic diff tool for imperative programs,

    S. K. Lahiri, C. Hawblitzel, M. Kawaguchi, and H. Reb ˆelo, “SYMDIFF: A language-agnostic semantic diff tool for imperative programs,” in Computer Aided Verification, ser. CA V ’12. Springer, 2012, pp. 712– 717

  33. [33]

    Differential symbolic execution,

    S. Person, M. B. Dwyer, S. Elbaum, and C. S. P ˘as˘areanu, “Differential symbolic execution,” inProceedings of the 16th ACM SIGSOFT Inter- national Symposium on Foundations of Software Engineering, ser. FSE ’08. ACM, 2008, pp. 226–237

  34. [34]

    Regression verification: Proving the equivalence of similar programs,

    B. Godlin and O. Strichman, “Regression verification: Proving the equivalence of similar programs,”Software Testing, Verification and Reliability, vol. 23, no. 3, pp. 241–258, 2013

  35. [35]

    Practical, low-effort equivalence verification of real code,

    D. A. Ramos and D. R. Engler, “Practical, low-effort equivalence verification of real code,” inComputer Aided Verification, ser. CA V ’11. Springer, 2011, pp. 669–685

  36. [36]

    Translation validation,

    A. Pnueli, M. Siegel, and E. Singerman, “Translation validation,” in Tools and Algorithms for the Construction and Analysis of Systems, ser. TACAS ’98. Springer, 1998, pp. 151–166

  37. [37]

    Comparison and evaluation of code clone detection techniques and tools: A qualitative approach,

    C. K. Roy, J. R. Cordy, and R. Koschke, “Comparison and evaluation of code clone detection techniques and tools: A qualitative approach,” Science of Computer Programming, vol. 74, no. 7, pp. 470–495, 2009

  38. [38]

    Dynamic partial-order reduction for model checking software,

    C. Flanagan and P. Godefroid, “Dynamic partial-order reduction for model checking software,” inProceedings of the 32nd ACM SIGPLAN- SIGACT Symposium on Principles of Programming Languages, ser. POPL ’05. ACM, 2005, pp. 110–121

  39. [39]

    The state explosion problem,

    A. Valmari, “The state explosion problem,” inLectures on Petri Nets I: Basic Models. Springer, 1998, pp. 429–528

  40. [40]

    Milner,Communication and Concurrency

    R. Milner,Communication and Concurrency. Prentice Hall, 1989

  41. [41]

    Using pre-trained language models to resolve textual and semantic merge conflicts (experience paper),

    J. Zhang, T. Mytkowicz, M. Kaufman, R. Piskac, and S. K. Lahiri, “Using pre-trained language models to resolve textual and semantic merge conflicts (experience paper),” inProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA ’22. ACM, 2022, pp. 77–88

  42. [42]

    Automated feedback generation for competition-level code,

    J. Zhang, D. Li, J. C. Kolesar, H. Shi, and R. Piskac, “Automated feedback generation for competition-level code,” inProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’22. ACM, 2022

  43. [43]

    Hints on test data selection: Help for the practicing programmer,

    R. A. DeMillo, R. J. Lipton, and F. G. Sayward, “Hints on test data selection: Help for the practicing programmer,”Computer, vol. 11, no. 4, pp. 34–41, 1978

  44. [44]

    An analysis and survey of the development of mutation testing,

    Y . Jia and M. Harman, “An analysis and survey of the development of mutation testing,”IEEE Transactions on Software Engineering, vol. 37, no. 5, pp. 649–678, 2011

  45. [45]

    Covering and uncovering equivalent mutants,

    D. Schuler and A. Zeller, “Covering and uncovering equivalent mutants,” Software Testing, Verification and Reliability, vol. 23, no. 5, pp. 353– 374, 2013