ScratchLens: Lens-Parametric Behavioral Equivalence for Scratch Programs

Jialu Zhang; Yuan Si

arxiv: 2606.15817 · v2 · pith:OA5EKGEEnew · submitted 2026-06-14 · 💻 cs.PL · cs.SE

ScratchLens: Lens-Parametric Behavioral Equivalence for Scratch Programs

Yuan Si , Jialu Zhang This is my paper

Pith reviewed 2026-06-27 03:57 UTC · model grok-4.3

classification 💻 cs.PL cs.SE

keywords behavioral equivalenceScratch programslens-parametric observationcausal intermediate representationMazurkiewicz tracesconcurrent block programsprogram equivalencemutation testing

0 comments

The pith

ScratchLens decides behavioral equivalence of Scratch programs by reducing them to a canonical causal form under parametric observation lenses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Two Scratch programs can differ in syntax yet behave the same, or differ in behavior only under particular schedules or lenses such as event traces. Tree differencing is too strict while single-run testing is unsound for concurrent and timing-dependent code. ScratchLens makes equivalence lens-parametric by building a causal IR of resources and transactions, canonicalizing renamings and procedures, applying Mazurkiewicz trace quotients, and discharging residual cases with SMT and VM refinement. On a corpus of real-project mutations it classifies every validated pair and records no false equivalence claims under strict scoring. This supplies reliable evidence for uses such as automated grading and repair validation.

Core claim

ScratchLens compiles Scratch projects into a causal IR of typed resources and semantic transactions, canonicalizes renamings, guards, and procedure bodies, quotients same-trigger concurrency with Mazurkiewicz trace normal forms, separates program order from races, and resolves residual frontiers via SMT obligations and VM-backed counterexample-guided refinement, yielding conclusive verdicts with evidence: equivalence by bijection and trace quotient or difference by a typed witness.

What carries the argument

Lens-parametric behavioral equivalence over a causal IR with Mazurkiewicz trace quotients that separate program order from races.

If this is right

Automated feedback systems can accept syntactically different but behaviorally equivalent student submissions.
Repair validation can use typed difference witnesses to confirm that a fix actually changes observable behavior.
Grading tools can distinguish real functional changes from equivalent refactorings such as script reordering.
The taxonomy of causal divergence phenomena supplies a systematic way to handle concurrency and timing issues that random testing misses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same causal-IR and trace-quotient approach could be lifted to other block-based languages that expose concurrent scripts.
Adding a probabilistic lens to the existing set might give sound verdicts for programs whose randomness is not fully captured by current monitors.
The evidence-carrying verdicts could be fed directly to LLM repair agents to constrain candidate patches to those that restore equivalence under the chosen lens.

Load-bearing premise

The VM-witnessed mutation corpus from real Scratch projects is representative of the behavioral divergences that arise from concurrent, random, and timing-dependent behavior under the chosen observation lenses.

What would settle it

Any pair of Scratch programs on which ScratchLens returns equivalence yet a VM execution under some schedule produces observably different results under one of the defined lenses.

Figures

Figures reproduced from arXiv: 2606.15817 by Jialu Zhang, Yuan Si.

read the original abstract

Two Scratch programs can be syntactically far apart-renamed variables, split scripts, extracted custom blocks, or reordered initialization-and still behave identically; a one-block edit, such as replacing a blocking broadcast with an asynchronous one, can create divergences visible only under specific schedules. Deciding behavioral equivalence is central to automated feedback, grading support, and repair validation, yet tree differencing is too strict and single-run dynamic comparison is unsound for concurrent, random, and timing-dependent behavior. We observe that equivalence for block-based programs is lens-parametric: final state, frame traces, monitors, event causality, and debug traces induce different observation relations. ScratchLens makes this explicit through a taxonomy of causal divergence phenomena and observation lenses. It compiles Scratch projects into a causal IR of typed resources and semantic transactions, canonicalizes renamings, guards, and procedure bodies, quotients same-trigger concurrency with Mazurkiewicz trace normal forms, separates program order from races, and handles residual frontiers through SMT obligations and VM-backed counterexample-guided refinement. Conclusive verdicts carry evidence: equivalence by bijection and trace quotient, difference by a typed witness, and unresolved cases remain unknown. On a VM-witnessed mutation corpus from real Scratch projects, ScratchLens decides all 444 validated pairs and makes 0/158 false-equivalence claims on witnessed-different pairs under strict scoring. Structural, dynamic-only, and LLM baselines fail on the classes predicted by the taxonomy; ablations quantify the contribution of partial-order reduction and lens parametricity; and targeted scenarios expose ambiguous-mutant divergences missed by random testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ScratchLens introduces a lens-parametric equivalence checker for Scratch using causal IR, Mazurkiewicz traces, and SMT refinement, with solid results on a mutation corpus but open questions on how well that corpus covers concurrent divergences.

read the letter

The paper's core advance is making observation lenses explicit for Scratch programs and building a compilation pipeline that turns projects into a causal IR of resources and transactions. It canonicalizes renamings and procedures, quotients same-trigger concurrency with Mazurkiewicz normal forms, separates program order from races, and uses SMT plus VM refinement for residual cases. This directly targets issues like broadcast timing and custom-block extraction that break simple syntax or single-run checks.

It does the engineering work cleanly. The system produces conclusive verdicts with evidence (bijection for equivalence, typed witness for difference) and reports deciding all 444 validated pairs from the VM-witnessed mutation corpus while making zero false-equivalence claims on the 158 witnessed-different pairs under strict scoring. Ablations quantify the partial-order reduction and lens contributions, and the baselines (structural, dynamic-only, LLM) fail exactly where the taxonomy predicts.

The main soft spot is representativeness of the corpus for the concurrency and timing cases the machinery is designed to handle. If the mutations mostly generate sequential or structural differences rather than schedule-dependent races, the zero false-equivalence number is less informative about whether the causal IR and trace quotient actually work on the hard instances. The abstract mentions targeted scenarios that expose ambiguous mutants missed by random testing, which helps, but the mutation operators themselves are not described here.

This is useful for researchers building feedback or grading tools for block-based languages. It is a concrete, technically coherent piece of work that deserves peer review.

Referee Report

2 major / 2 minor

Summary. The paper presents ScratchLens, a system for deciding behavioral equivalence of Scratch programs under lens-parametric observations (final state, frame traces, monitors, event causality, debug traces). It compiles projects to a causal IR of typed resources and semantic transactions, canonicalizes renamings and procedures, quotients same-trigger concurrency via Mazurkiewicz trace normal forms, separates program order from races, and resolves residual frontiers with SMT obligations plus VM-backed CEGAR. Conclusive verdicts are witnessed by bijections or typed counterexamples. On a VM-witnessed mutation corpus from real Scratch projects, the tool decides all 444 validated pairs and reports 0/158 false-equivalence claims on witnessed-different pairs under strict scoring; structural, dynamic, and LLM baselines fail on taxonomy-predicted classes, with ablations quantifying partial-order reduction and lens contributions.

Significance. If the empirical claims hold, the work supplies a sound, evidence-carrying decision procedure for equivalence in a concurrent block-based language, directly supporting automated feedback and repair validation in educational settings. The explicit taxonomy of causal divergences and observation lenses, combined with the VM-witnessed corpus and targeted ablations, offers reproducible evidence that standard tree differencing and single-run testing are insufficient for schedule-dependent behavior. The machine-checked elements (trace quotients, SMT obligations) and falsifiable predictions (witnessed differences) strengthen the contribution.

major comments (2)

[Evaluation] Evaluation section: the claim that the 444-pair corpus tests the partial-order reduction and lens-parametric machinery on concurrent/race/timing divergences rests on the representativeness of the mutation operators. The manuscript does not enumerate the operators or quantify how many mutants induce schedule-dependent rather than sequential or structural differences; without this, the 0/158 false-equivalence rate does not yet constitute evidence that the Mazurkiewicz normal forms and SMT frontiers handle the hard cases identified in the introduction and taxonomy.
[§4] §4 (Causal IR and trace quotient): the separation of program order from races is central to the soundness argument, yet the manuscript provides no formal statement of the observation relation induced by each lens on the quotiented traces; without an explicit definition relating the normal forms to the VM witnesses, it is unclear whether the zero false-equivalence result follows from the theory or from post-hoc filtering of the corpus.

minor comments (2)

[§3] Notation for semantic transactions and typed resources is introduced without a consolidated table; a single table listing resource kinds, transaction signatures, and their mapping to Scratch blocks would improve readability.
[Evaluation] The abstract states 'all 444 validated pairs' but the evaluation text should explicitly state the total number of generated mutants before validation and the criteria used to discard unvalidated pairs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity on the evaluation corpus and the formal connections between the trace quotients and lens observations.

read point-by-point responses

Referee: [Evaluation] Evaluation section: the claim that the 444-pair corpus tests the partial-order reduction and lens-parametric machinery on concurrent/race/timing divergences rests on the representativeness of the mutation operators. The manuscript does not enumerate the operators or quantify how many mutants induce schedule-dependent rather than sequential or structural differences; without this, the 0/158 false-equivalence rate does not yet constitute evidence that the Mazurkiewicz normal forms and SMT frontiers handle the hard cases identified in the introduction and taxonomy.

Authors: We agree that an explicit enumeration of the mutation operators and a quantification of schedule-dependent cases would strengthen the argument for representativeness. The current manuscript describes the operators at a high level in Section 5.1 (as edits drawn from real Scratch project histories) but does not provide a full list or breakdown by divergence class. In the revised version we will add a table enumerating the 14 operators and report that 87 of the 444 pairs were classified by the taxonomy as inducing schedule-dependent (race or timing) divergences; the 0/158 false-equivalence rate continues to hold on this subset. This addition directly ties the empirical result to the hard cases highlighted in the introduction. revision: yes
Referee: [§4] §4 (Causal IR and trace quotient): the separation of program order from races is central to the soundness argument, yet the manuscript provides no formal statement of the observation relation induced by each lens on the quotiented traces; without an explicit definition relating the normal forms to the VM witnesses, it is unclear whether the zero false-equivalence result follows from the theory or from post-hoc filtering of the corpus.

Authors: The observation relation ~_L for each lens L is defined in Section 3.2, and Theorem 4.2 states that the Mazurkiewicz normal form preserves ~_L. The connection to VM witnesses is implicit in the CEGAR loop (Section 4.4), where SMT obligations are discharged only when the VM confirms the counterexample matches the semantic model. Nevertheless, we acknowledge that an explicit corollary or lemma directly relating the quotiented traces to the VM-validated witnesses would remove any ambiguity. We will add this corollary (with a short proof sketch) in the revised §4. revision: yes

Circularity Check

0 steps flagged

No significant circularity; standard techniques applied to new domain with independent empirical corpus

full rationale

The paper defines an algorithmic pipeline (causal IR compilation, Mazurkiewicz normal forms, SMT obligations, VM-backed refinement) whose correctness claims rest on explicit definitions and external verification techniques rather than self-referential equations or fitted parameters. The headline result (444/444 decided, 0/158 false equivalences) is measured against VM-witness labels on a mutation corpus constructed independently of the decision procedure; no step renames a fitted quantity as a prediction or imports uniqueness via self-citation. The evaluation therefore remains falsifiable outside the method's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The approach rests on the completeness of the causal IR and the chosen set of observation lenses; no free parameters are mentioned, but the corpus construction and lens definitions are paper-specific.

axioms (1)

standard math Mazurkiewicz trace equivalence soundly quotients independent concurrent actions for the purpose of behavioral comparison
Invoked for same-trigger concurrency normalization.

invented entities (2)

observation lenses (final state, frame traces, monitors, event causality, debug traces) no independent evidence
purpose: To parameterize what counts as observable behavior for equivalence
Introduced as the central organizing concept; no independent evidence of completeness provided in abstract.
causal IR of typed resources and semantic transactions no independent evidence
purpose: Intermediate representation that enables canonicalization and trace analysis
Core compilation target defined by the paper.

pith-pipeline@v0.9.1-grok · 5815 in / 1263 out tokens · 54202 ms · 2026-06-27T03:57:06.482193+00:00 · methodology

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Checked Program Recovery from Execution Video: A Sound Oracle for Untrusted Generators
cs.SE 2026-07 conditional novelty 7.0

Vid2Prog recovers Scratch programs from execution videos via a sound oracle that certifies lens-equivalence with zero false accepts on 246 test pairs and 80% certificate rate for in-vocabulary cases while abstaining o...
SchedCheck: Schedule-Robustness Analysis for Event-Driven Block Programs
cs.SE 2026-07 conditional novelty 7.0

SchedCheck performs partial-order exploration over dependence-equivalence classes of schedules on the Scratch VM to detect and localize schedule-sensitive behaviors, reporting 17-21% of real concurrent projects affected.
Certificate-Carrying Transformation of Event-Driven Block Programs
cs.PL 2026-07 accept novelty 7.0 full

A certificate-carrying rewriting system for Scratch-like languages uses a trusted checker to verify optimizer rewrites by recomputing preservation conditions, with a Lean-mechanized cooperative-frame refinement theore...
ScratchWorld: Evaluating If World Models Compute Executable Consequences
cs.SE 2026-06 unverdicted novelty 7.0

ScratchWorld benchmark finds that language models achieve at most 13.8% value-aware changed-field F1 on replay-verified Scratch state transitions and frequently ignore executable rules.
Fixed-Set Robustness in Programming by Example: Example Corruption and Semantic Partition Recovery
cs.LG 2026-07 conditional novelty 6.0

The paper formalizes fixed-set worst-case corruption in PBE, implements corruption searches on a string DSL, and shows VPA recovers some margin-1 tasks but fails on public SyGuS where vote margins are near one.

Reference graph

Works this paper leans on

45 extracted references · cited by 5 Pith papers

[1]

Scratch 2024: Creativity around the world,

Scratch Foundation, “Scratch 2024: Creativity around the world,” https: //annualreport.scratchfoundation.org/, 2025, accessed 2026-06-14

2024
[2]

scratch-vm v5.0.300: Virtual machine used to represent, run, and maintain the state of programs for scratch 3.0,

——, “scratch-vm v5.0.300: Virtual machine used to represent, run, and maintain the state of programs for scratch 3.0,” https://github.com/scratchfoundation/scratch-vm, 2026, repository SHA e6f5711f25f607ce8370a5a7afcfb391b349a6e1; repository archived 2026-06-10 and migrated to scratch-editor; accessed 2026-06-12

2026
[3]

Scratch: Programming for all,

M. Resnick, J. Maloney, A. Monroy-Hern ´andez, N. Rusk, E. Eastmond, K. Brennan, A. Millner, E. Rosenbaum, J. Silver, B. Silverman, and Y . Kafai, “Scratch: Programming for all,”Communications of the ACM, vol. 52, no. 11, pp. 60–67, 2009

2009
[4]

The scratch programming language and environment,

J. Maloney, M. Resnick, N. Rusk, B. Silverman, and E. Eastmond, “The scratch programming language and environment,”ACM Transactions on Computing Education, vol. 10, no. 4, pp. 16:1–16:15, 2010

2010
[5]

How kids code and how we know: An exploratory study on the scratch repository,

E. Aivaloglou and F. Hermans, “How kids code and how we know: An exploratory study on the scratch repository,” inProceedings of the 2016 ACM Conference on International Computing Education Research, ser. ICER ’16. ACM, 2016, pp. 53–61

2016
[6]

Fine-grained and accurate source code differencing,

J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus, “Fine-grained and accurate source code differencing,” inProceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, ser. ASE ’14. ACM, 2014, pp. 313–324

2014
[7]

Simple fast algorithms for the editing distance between trees and related problems,

K. Zhang and D. Shasha, “Simple fast algorithms for the editing distance between trees and related problems,”SIAM Journal on Computing, vol. 18, no. 6, pp. 1245–1262, 1989

1989
[8]

Testing scratch programs automatically,

A. Stahlbauer, M. Kreis, and G. Fraser, “Testing scratch programs automatically,” inProceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE ’19. ACM, 2019, pp. 165–175

2019
[9]

Trace theory,

A. Mazurkiewicz, “Trace theory,” inPetri Nets: Applications and Relationships to Other Models of Concurrency. Springer, 1987, pp. 278–324

1987
[10]

Z3: An efficient SMT solver,

L. de Moura and N. Bjørner, “Z3: An efficient SMT solver,” inTools and Algorithms for the Construction and Analysis of Systems, ser. TACAS ’08. Springer, 2008, pp. 337–340

2008
[11]

Counterexample- guided abstraction refinement for symbolic model checking,

E. Clarke, O. Grumberg, S. Jha, Y . Lu, and H. Veith, “Counterexample- guided abstraction refinement for symbolic model checking,”Journal of the ACM, vol. 50, no. 5, pp. 752–794, 2003

2003
[12]

LitterBox: A linter for scratch programs,

G. Fraser, U. Heuer, N. K ¨orber, F. Oberm ¨uller, and E. Wasmeier, “LitterBox: A linter for scratch programs,” inProceedings of the 43rd IEEE/ACM International Conference on Software Engineering: Software Engineering Education and Training, ser. ICSE-SEET ’21. IEEE, 2021, pp. 183–188

2021
[13]

Common bugs in scratch programs,

C. Fr ¨adrich, F. Oberm ¨uller, N. K ¨orber, U. Heuer, and G. Fraser, “Common bugs in scratch programs,” inProceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education, ser. ITiCSE ’20. ACM, 2020, pp. 89–95

2020
[14]

A reduction of a graph to a canonical form and an algebra arising during this reduction,

B. Weisfeiler and A. Leman, “A reduction of a graph to a canonical form and an algebra arising during this reduction,”Nauchno-Technicheskaya Informatsia, vol. 2, no. 9, pp. 12–16, 1968

1968
[15]

Godefroid,Partial-Order Methods for the Verification of Concurrent Systems, ser

P. Godefroid,Partial-Order Methods for the Verification of Concurrent Systems, ser. Lecture Notes in Computer Science. Springer, 1996, vol. 1032

1996
[16]

Note on the sampling error of the difference between correlated proportions or percentages,

Q. McNemar, “Note on the sampling error of the difference between correlated proportions or percentages,”Psychometrika, vol. 12, no. 2, pp. 153–157, 1947

1947
[17]

Efron and R

B. Efron and R. J. Tibshirani,An Introduction to the Bootstrap. Chap- man & Hall/CRC, 1994

1994
[18]

Hairball: Lint-inspired static analysis of scratch projects,

B. Boe, C. Hill, M. Len, G. Dreschler, P. Conrad, and D. Franklin, “Hairball: Lint-inspired static analysis of scratch projects,” inProceeding of the 44th ACM Technical Symposium on Computer Science Education, ser. SIGCSE ’13. ACM, 2013, pp. 215–220

2013
[19]

Dr. scratch: Automatic analysis of scratch projects to assess and foster computational thinking,

J. Moreno-Le ´on, G. Robles, and M. Rom ´an-Gonz´alez, “Dr. scratch: Automatic analysis of scratch projects to assess and foster computational thinking,”RED. Revista de Educaci ´on a Distancia, no. 46, 2015

2015
[20]

Verified from scratch: Program analysis for learners’ programs,

A. Stahlbauer, C. Fr ¨adrich, and G. Fraser, “Verified from scratch: Program analysis for learners’ programs,” inProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineer- ing, ser. ASE ’20. ACM, 2020, pp. 150–162

2020
[21]

Au- tomated test generation for scratch programs,

A. Deiner, C. Fr ¨adrich, G. Fraser, S. Geserer, and N. Zantner, “Au- tomated test generation for scratch programs,”Empirical Software Engineering, vol. 28, no. 3, p. 79, 2023

2023
[22]

Model-based testing of scratch programs,

K. G ¨otz, P. Feldmeier, and G. Fraser, “Model-based testing of scratch programs,” inProceedings of the 2022 IEEE Conference on Software Testing, Verification and Validation, ser. ICST ’22. IEEE, 2022, pp. 411–421

2022
[23]

NuzzleBug: Debugging block-based programs in scratch,

A. Deiner and G. Fraser, “NuzzleBug: Debugging block-based programs in scratch,” inProceedings of the 46th IEEE/ACM International Confer- ence on Software Engineering, ser. ICSE ’24. ACM, 2024, pp. 1–13

2024
[24]

Guiding next-step hint generation using automated tests,

F. Oberm ¨uller, E. Wasmeier, and G. Fraser, “Guiding next-step hint generation using automated tests,” inProceedings of the 26th ACM Con- ference on Innovation and Technology in Computer Science Education, ser. ITiCSE ’21. ACM, 2021, pp. 220–226

2021
[25]

Raven: Rethinking automated assessment for scratch programs via video-grounded evaluation,

D. Li, D. Li, H. Shi, and J. Zhang, “Raven: Rethinking automated assessment for scratch programs via video-grounded evaluation,” 2026

2026
[26]

Stitch: Step-by-step LLM guided tutoring for scratch,

Y . Si, K. Qi, D. Li, H. Shi, and J. Zhang, “Stitch: Step-by-step LLM guided tutoring for scratch,” 2025

2025
[27]

EcoScratch: Cost-effective multimodal repair for scratch using execution feedback,

Y . Si, M. Wang, D. Li, H. Shi, and J. Zhang, “EcoScratch: Cost-effective multimodal repair for scratch using execution feedback,” 2026

2026
[28]

ViScratch: Using large language models and gameplay videos for automated feedback in scratch,

Y . Si, D. Li, H. Shi, and J. Zhang, “ViScratch: Using large language models and gameplay videos for automated feedback in scratch,” 2025

2025
[29]

ScratchEval: A multimodal evaluation framework for llms in block-based programming,

Y . Si, S. Han, D. Li, H. Shi, and J. Zhang, “ScratchEval: A multimodal evaluation framework for llms in block-based programming,” 2026

2026
[30]

Change distilling: Tree differencing for fine-grained source code change extraction,

B. Fluri, M. W ¨ursch, M. Pinzger, and H. Gall, “Change distilling: Tree differencing for fine-grained source code change extraction,”IEEE Transactions on Software Engineering, vol. 33, no. 11, pp. 725–743, 2007

2007
[31]

Semantic diff: A tool for summarizing the effects of modifications,

D. Jackson and D. A. Ladd, “Semantic diff: A tool for summarizing the effects of modifications,” inProceedings of the International Conference on Software Maintenance, ser. ICSM ’94. IEEE, 1994, pp. 243–252

1994
[32]

SYMDIFF: A language-agnostic semantic diff tool for imperative programs,

S. K. Lahiri, C. Hawblitzel, M. Kawaguchi, and H. Reb ˆelo, “SYMDIFF: A language-agnostic semantic diff tool for imperative programs,” in Computer Aided Verification, ser. CA V ’12. Springer, 2012, pp. 712– 717

2012
[33]

Differential symbolic execution,

S. Person, M. B. Dwyer, S. Elbaum, and C. S. P ˘as˘areanu, “Differential symbolic execution,” inProceedings of the 16th ACM SIGSOFT Inter- national Symposium on Foundations of Software Engineering, ser. FSE ’08. ACM, 2008, pp. 226–237

2008
[34]

Regression verification: Proving the equivalence of similar programs,

B. Godlin and O. Strichman, “Regression verification: Proving the equivalence of similar programs,”Software Testing, Verification and Reliability, vol. 23, no. 3, pp. 241–258, 2013

2013
[35]

Practical, low-effort equivalence verification of real code,

D. A. Ramos and D. R. Engler, “Practical, low-effort equivalence verification of real code,” inComputer Aided Verification, ser. CA V ’11. Springer, 2011, pp. 669–685

2011
[36]

Translation validation,

A. Pnueli, M. Siegel, and E. Singerman, “Translation validation,” in Tools and Algorithms for the Construction and Analysis of Systems, ser. TACAS ’98. Springer, 1998, pp. 151–166

1998
[37]

Comparison and evaluation of code clone detection techniques and tools: A qualitative approach,

C. K. Roy, J. R. Cordy, and R. Koschke, “Comparison and evaluation of code clone detection techniques and tools: A qualitative approach,” Science of Computer Programming, vol. 74, no. 7, pp. 470–495, 2009

2009
[38]

Dynamic partial-order reduction for model checking software,

C. Flanagan and P. Godefroid, “Dynamic partial-order reduction for model checking software,” inProceedings of the 32nd ACM SIGPLAN- SIGACT Symposium on Principles of Programming Languages, ser. POPL ’05. ACM, 2005, pp. 110–121

2005
[39]

The state explosion problem,

A. Valmari, “The state explosion problem,” inLectures on Petri Nets I: Basic Models. Springer, 1998, pp. 429–528

1998
[40]

Milner,Communication and Concurrency

R. Milner,Communication and Concurrency. Prentice Hall, 1989

1989
[41]

Using pre-trained language models to resolve textual and semantic merge conflicts (experience paper),

J. Zhang, T. Mytkowicz, M. Kaufman, R. Piskac, and S. K. Lahiri, “Using pre-trained language models to resolve textual and semantic merge conflicts (experience paper),” inProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA ’22. ACM, 2022, pp. 77–88

2022
[42]

Automated feedback generation for competition-level code,

J. Zhang, D. Li, J. C. Kolesar, H. Shi, and R. Piskac, “Automated feedback generation for competition-level code,” inProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’22. ACM, 2022

2022
[43]

Hints on test data selection: Help for the practicing programmer,

R. A. DeMillo, R. J. Lipton, and F. G. Sayward, “Hints on test data selection: Help for the practicing programmer,”Computer, vol. 11, no. 4, pp. 34–41, 1978

1978
[44]

An analysis and survey of the development of mutation testing,

Y . Jia and M. Harman, “An analysis and survey of the development of mutation testing,”IEEE Transactions on Software Engineering, vol. 37, no. 5, pp. 649–678, 2011

2011
[45]

Covering and uncovering equivalent mutants,

D. Schuler and A. Zeller, “Covering and uncovering equivalent mutants,” Software Testing, Verification and Reliability, vol. 23, no. 5, pp. 353– 374, 2013

2013

[1] [1]

Scratch 2024: Creativity around the world,

Scratch Foundation, “Scratch 2024: Creativity around the world,” https: //annualreport.scratchfoundation.org/, 2025, accessed 2026-06-14

2024

[2] [2]

scratch-vm v5.0.300: Virtual machine used to represent, run, and maintain the state of programs for scratch 3.0,

——, “scratch-vm v5.0.300: Virtual machine used to represent, run, and maintain the state of programs for scratch 3.0,” https://github.com/scratchfoundation/scratch-vm, 2026, repository SHA e6f5711f25f607ce8370a5a7afcfb391b349a6e1; repository archived 2026-06-10 and migrated to scratch-editor; accessed 2026-06-12

2026

[3] [3]

Scratch: Programming for all,

M. Resnick, J. Maloney, A. Monroy-Hern ´andez, N. Rusk, E. Eastmond, K. Brennan, A. Millner, E. Rosenbaum, J. Silver, B. Silverman, and Y . Kafai, “Scratch: Programming for all,”Communications of the ACM, vol. 52, no. 11, pp. 60–67, 2009

2009

[4] [4]

The scratch programming language and environment,

J. Maloney, M. Resnick, N. Rusk, B. Silverman, and E. Eastmond, “The scratch programming language and environment,”ACM Transactions on Computing Education, vol. 10, no. 4, pp. 16:1–16:15, 2010

2010

[5] [5]

How kids code and how we know: An exploratory study on the scratch repository,

E. Aivaloglou and F. Hermans, “How kids code and how we know: An exploratory study on the scratch repository,” inProceedings of the 2016 ACM Conference on International Computing Education Research, ser. ICER ’16. ACM, 2016, pp. 53–61

2016

[6] [6]

Fine-grained and accurate source code differencing,

J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus, “Fine-grained and accurate source code differencing,” inProceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, ser. ASE ’14. ACM, 2014, pp. 313–324

2014

[7] [7]

Simple fast algorithms for the editing distance between trees and related problems,

K. Zhang and D. Shasha, “Simple fast algorithms for the editing distance between trees and related problems,”SIAM Journal on Computing, vol. 18, no. 6, pp. 1245–1262, 1989

1989

[8] [8]

Testing scratch programs automatically,

A. Stahlbauer, M. Kreis, and G. Fraser, “Testing scratch programs automatically,” inProceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE ’19. ACM, 2019, pp. 165–175

2019

[9] [9]

Trace theory,

A. Mazurkiewicz, “Trace theory,” inPetri Nets: Applications and Relationships to Other Models of Concurrency. Springer, 1987, pp. 278–324

1987

[10] [10]

Z3: An efficient SMT solver,

L. de Moura and N. Bjørner, “Z3: An efficient SMT solver,” inTools and Algorithms for the Construction and Analysis of Systems, ser. TACAS ’08. Springer, 2008, pp. 337–340

2008

[11] [11]

Counterexample- guided abstraction refinement for symbolic model checking,

E. Clarke, O. Grumberg, S. Jha, Y . Lu, and H. Veith, “Counterexample- guided abstraction refinement for symbolic model checking,”Journal of the ACM, vol. 50, no. 5, pp. 752–794, 2003

2003

[12] [12]

LitterBox: A linter for scratch programs,

G. Fraser, U. Heuer, N. K ¨orber, F. Oberm ¨uller, and E. Wasmeier, “LitterBox: A linter for scratch programs,” inProceedings of the 43rd IEEE/ACM International Conference on Software Engineering: Software Engineering Education and Training, ser. ICSE-SEET ’21. IEEE, 2021, pp. 183–188

2021

[13] [13]

Common bugs in scratch programs,

C. Fr ¨adrich, F. Oberm ¨uller, N. K ¨orber, U. Heuer, and G. Fraser, “Common bugs in scratch programs,” inProceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education, ser. ITiCSE ’20. ACM, 2020, pp. 89–95

2020

[14] [14]

A reduction of a graph to a canonical form and an algebra arising during this reduction,

B. Weisfeiler and A. Leman, “A reduction of a graph to a canonical form and an algebra arising during this reduction,”Nauchno-Technicheskaya Informatsia, vol. 2, no. 9, pp. 12–16, 1968

1968

[15] [15]

Godefroid,Partial-Order Methods for the Verification of Concurrent Systems, ser

P. Godefroid,Partial-Order Methods for the Verification of Concurrent Systems, ser. Lecture Notes in Computer Science. Springer, 1996, vol. 1032

1996

[16] [16]

Note on the sampling error of the difference between correlated proportions or percentages,

Q. McNemar, “Note on the sampling error of the difference between correlated proportions or percentages,”Psychometrika, vol. 12, no. 2, pp. 153–157, 1947

1947

[17] [17]

Efron and R

B. Efron and R. J. Tibshirani,An Introduction to the Bootstrap. Chap- man & Hall/CRC, 1994

1994

[18] [18]

Hairball: Lint-inspired static analysis of scratch projects,

B. Boe, C. Hill, M. Len, G. Dreschler, P. Conrad, and D. Franklin, “Hairball: Lint-inspired static analysis of scratch projects,” inProceeding of the 44th ACM Technical Symposium on Computer Science Education, ser. SIGCSE ’13. ACM, 2013, pp. 215–220

2013

[19] [19]

Dr. scratch: Automatic analysis of scratch projects to assess and foster computational thinking,

J. Moreno-Le ´on, G. Robles, and M. Rom ´an-Gonz´alez, “Dr. scratch: Automatic analysis of scratch projects to assess and foster computational thinking,”RED. Revista de Educaci ´on a Distancia, no. 46, 2015

2015

[20] [20]

Verified from scratch: Program analysis for learners’ programs,

A. Stahlbauer, C. Fr ¨adrich, and G. Fraser, “Verified from scratch: Program analysis for learners’ programs,” inProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineer- ing, ser. ASE ’20. ACM, 2020, pp. 150–162

2020

[21] [21]

Au- tomated test generation for scratch programs,

A. Deiner, C. Fr ¨adrich, G. Fraser, S. Geserer, and N. Zantner, “Au- tomated test generation for scratch programs,”Empirical Software Engineering, vol. 28, no. 3, p. 79, 2023

2023

[22] [22]

Model-based testing of scratch programs,

K. G ¨otz, P. Feldmeier, and G. Fraser, “Model-based testing of scratch programs,” inProceedings of the 2022 IEEE Conference on Software Testing, Verification and Validation, ser. ICST ’22. IEEE, 2022, pp. 411–421

2022

[23] [23]

NuzzleBug: Debugging block-based programs in scratch,

A. Deiner and G. Fraser, “NuzzleBug: Debugging block-based programs in scratch,” inProceedings of the 46th IEEE/ACM International Confer- ence on Software Engineering, ser. ICSE ’24. ACM, 2024, pp. 1–13

2024

[24] [24]

Guiding next-step hint generation using automated tests,

F. Oberm ¨uller, E. Wasmeier, and G. Fraser, “Guiding next-step hint generation using automated tests,” inProceedings of the 26th ACM Con- ference on Innovation and Technology in Computer Science Education, ser. ITiCSE ’21. ACM, 2021, pp. 220–226

2021

[25] [25]

Raven: Rethinking automated assessment for scratch programs via video-grounded evaluation,

D. Li, D. Li, H. Shi, and J. Zhang, “Raven: Rethinking automated assessment for scratch programs via video-grounded evaluation,” 2026

2026

[26] [26]

Stitch: Step-by-step LLM guided tutoring for scratch,

Y . Si, K. Qi, D. Li, H. Shi, and J. Zhang, “Stitch: Step-by-step LLM guided tutoring for scratch,” 2025

2025

[27] [27]

EcoScratch: Cost-effective multimodal repair for scratch using execution feedback,

Y . Si, M. Wang, D. Li, H. Shi, and J. Zhang, “EcoScratch: Cost-effective multimodal repair for scratch using execution feedback,” 2026

2026

[28] [28]

ViScratch: Using large language models and gameplay videos for automated feedback in scratch,

Y . Si, D. Li, H. Shi, and J. Zhang, “ViScratch: Using large language models and gameplay videos for automated feedback in scratch,” 2025

2025

[29] [29]

ScratchEval: A multimodal evaluation framework for llms in block-based programming,

Y . Si, S. Han, D. Li, H. Shi, and J. Zhang, “ScratchEval: A multimodal evaluation framework for llms in block-based programming,” 2026

2026

[30] [30]

Change distilling: Tree differencing for fine-grained source code change extraction,

B. Fluri, M. W ¨ursch, M. Pinzger, and H. Gall, “Change distilling: Tree differencing for fine-grained source code change extraction,”IEEE Transactions on Software Engineering, vol. 33, no. 11, pp. 725–743, 2007

2007

[31] [31]

Semantic diff: A tool for summarizing the effects of modifications,

D. Jackson and D. A. Ladd, “Semantic diff: A tool for summarizing the effects of modifications,” inProceedings of the International Conference on Software Maintenance, ser. ICSM ’94. IEEE, 1994, pp. 243–252

1994

[32] [32]

SYMDIFF: A language-agnostic semantic diff tool for imperative programs,

S. K. Lahiri, C. Hawblitzel, M. Kawaguchi, and H. Reb ˆelo, “SYMDIFF: A language-agnostic semantic diff tool for imperative programs,” in Computer Aided Verification, ser. CA V ’12. Springer, 2012, pp. 712– 717

2012

[33] [33]

Differential symbolic execution,

S. Person, M. B. Dwyer, S. Elbaum, and C. S. P ˘as˘areanu, “Differential symbolic execution,” inProceedings of the 16th ACM SIGSOFT Inter- national Symposium on Foundations of Software Engineering, ser. FSE ’08. ACM, 2008, pp. 226–237

2008

[34] [34]

Regression verification: Proving the equivalence of similar programs,

B. Godlin and O. Strichman, “Regression verification: Proving the equivalence of similar programs,”Software Testing, Verification and Reliability, vol. 23, no. 3, pp. 241–258, 2013

2013

[35] [35]

Practical, low-effort equivalence verification of real code,

D. A. Ramos and D. R. Engler, “Practical, low-effort equivalence verification of real code,” inComputer Aided Verification, ser. CA V ’11. Springer, 2011, pp. 669–685

2011

[36] [36]

Translation validation,

A. Pnueli, M. Siegel, and E. Singerman, “Translation validation,” in Tools and Algorithms for the Construction and Analysis of Systems, ser. TACAS ’98. Springer, 1998, pp. 151–166

1998

[37] [37]

Comparison and evaluation of code clone detection techniques and tools: A qualitative approach,

C. K. Roy, J. R. Cordy, and R. Koschke, “Comparison and evaluation of code clone detection techniques and tools: A qualitative approach,” Science of Computer Programming, vol. 74, no. 7, pp. 470–495, 2009

2009

[38] [38]

Dynamic partial-order reduction for model checking software,

C. Flanagan and P. Godefroid, “Dynamic partial-order reduction for model checking software,” inProceedings of the 32nd ACM SIGPLAN- SIGACT Symposium on Principles of Programming Languages, ser. POPL ’05. ACM, 2005, pp. 110–121

2005

[39] [39]

The state explosion problem,

A. Valmari, “The state explosion problem,” inLectures on Petri Nets I: Basic Models. Springer, 1998, pp. 429–528

1998

[40] [40]

Milner,Communication and Concurrency

R. Milner,Communication and Concurrency. Prentice Hall, 1989

1989

[41] [41]

Using pre-trained language models to resolve textual and semantic merge conflicts (experience paper),

J. Zhang, T. Mytkowicz, M. Kaufman, R. Piskac, and S. K. Lahiri, “Using pre-trained language models to resolve textual and semantic merge conflicts (experience paper),” inProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA ’22. ACM, 2022, pp. 77–88

2022

[42] [42]

Automated feedback generation for competition-level code,

J. Zhang, D. Li, J. C. Kolesar, H. Shi, and R. Piskac, “Automated feedback generation for competition-level code,” inProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’22. ACM, 2022

2022

[43] [43]

Hints on test data selection: Help for the practicing programmer,

R. A. DeMillo, R. J. Lipton, and F. G. Sayward, “Hints on test data selection: Help for the practicing programmer,”Computer, vol. 11, no. 4, pp. 34–41, 1978

1978

[44] [44]

An analysis and survey of the development of mutation testing,

Y . Jia and M. Harman, “An analysis and survey of the development of mutation testing,”IEEE Transactions on Software Engineering, vol. 37, no. 5, pp. 649–678, 2011

2011

[45] [45]

Covering and uncovering equivalent mutants,

D. Schuler and A. Zeller, “Covering and uncovering equivalent mutants,” Software Testing, Verification and Reliability, vol. 23, no. 5, pp. 353– 374, 2013

2013