ScratchLens: Lens-Parametric Behavioral Equivalence for Scratch Programs
Pith reviewed 2026-06-27 03:57 UTC · model grok-4.3
The pith
ScratchLens decides behavioral equivalence of Scratch programs by reducing them to a canonical causal form under parametric observation lenses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ScratchLens compiles Scratch projects into a causal IR of typed resources and semantic transactions, canonicalizes renamings, guards, and procedure bodies, quotients same-trigger concurrency with Mazurkiewicz trace normal forms, separates program order from races, and resolves residual frontiers via SMT obligations and VM-backed counterexample-guided refinement, yielding conclusive verdicts with evidence: equivalence by bijection and trace quotient or difference by a typed witness.
What carries the argument
Lens-parametric behavioral equivalence over a causal IR with Mazurkiewicz trace quotients that separate program order from races.
If this is right
- Automated feedback systems can accept syntactically different but behaviorally equivalent student submissions.
- Repair validation can use typed difference witnesses to confirm that a fix actually changes observable behavior.
- Grading tools can distinguish real functional changes from equivalent refactorings such as script reordering.
- The taxonomy of causal divergence phenomena supplies a systematic way to handle concurrency and timing issues that random testing misses.
Where Pith is reading between the lines
- The same causal-IR and trace-quotient approach could be lifted to other block-based languages that expose concurrent scripts.
- Adding a probabilistic lens to the existing set might give sound verdicts for programs whose randomness is not fully captured by current monitors.
- The evidence-carrying verdicts could be fed directly to LLM repair agents to constrain candidate patches to those that restore equivalence under the chosen lens.
Load-bearing premise
The VM-witnessed mutation corpus from real Scratch projects is representative of the behavioral divergences that arise from concurrent, random, and timing-dependent behavior under the chosen observation lenses.
What would settle it
Any pair of Scratch programs on which ScratchLens returns equivalence yet a VM execution under some schedule produces observably different results under one of the defined lenses.
Figures
read the original abstract
Two Scratch programs can be syntactically far apart-renamed variables, split scripts, extracted custom blocks, or reordered initialization-and still behave identically; a one-block edit, such as replacing a blocking broadcast with an asynchronous one, can create divergences visible only under specific schedules. Deciding behavioral equivalence is central to automated feedback, grading support, and repair validation, yet tree differencing is too strict and single-run dynamic comparison is unsound for concurrent, random, and timing-dependent behavior. We observe that equivalence for block-based programs is lens-parametric: final state, frame traces, monitors, event causality, and debug traces induce different observation relations. ScratchLens makes this explicit through a taxonomy of causal divergence phenomena and observation lenses. It compiles Scratch projects into a causal IR of typed resources and semantic transactions, canonicalizes renamings, guards, and procedure bodies, quotients same-trigger concurrency with Mazurkiewicz trace normal forms, separates program order from races, and handles residual frontiers through SMT obligations and VM-backed counterexample-guided refinement. Conclusive verdicts carry evidence: equivalence by bijection and trace quotient, difference by a typed witness, and unresolved cases remain unknown. On a VM-witnessed mutation corpus from real Scratch projects, ScratchLens decides all 444 validated pairs and makes 0/158 false-equivalence claims on witnessed-different pairs under strict scoring. Structural, dynamic-only, and LLM baselines fail on the classes predicted by the taxonomy; ablations quantify the contribution of partial-order reduction and lens parametricity; and targeted scenarios expose ambiguous-mutant divergences missed by random testing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ScratchLens, a system for deciding behavioral equivalence of Scratch programs under lens-parametric observations (final state, frame traces, monitors, event causality, debug traces). It compiles projects to a causal IR of typed resources and semantic transactions, canonicalizes renamings and procedures, quotients same-trigger concurrency via Mazurkiewicz trace normal forms, separates program order from races, and resolves residual frontiers with SMT obligations plus VM-backed CEGAR. Conclusive verdicts are witnessed by bijections or typed counterexamples. On a VM-witnessed mutation corpus from real Scratch projects, the tool decides all 444 validated pairs and reports 0/158 false-equivalence claims on witnessed-different pairs under strict scoring; structural, dynamic, and LLM baselines fail on taxonomy-predicted classes, with ablations quantifying partial-order reduction and lens contributions.
Significance. If the empirical claims hold, the work supplies a sound, evidence-carrying decision procedure for equivalence in a concurrent block-based language, directly supporting automated feedback and repair validation in educational settings. The explicit taxonomy of causal divergences and observation lenses, combined with the VM-witnessed corpus and targeted ablations, offers reproducible evidence that standard tree differencing and single-run testing are insufficient for schedule-dependent behavior. The machine-checked elements (trace quotients, SMT obligations) and falsifiable predictions (witnessed differences) strengthen the contribution.
major comments (2)
- [Evaluation] Evaluation section: the claim that the 444-pair corpus tests the partial-order reduction and lens-parametric machinery on concurrent/race/timing divergences rests on the representativeness of the mutation operators. The manuscript does not enumerate the operators or quantify how many mutants induce schedule-dependent rather than sequential or structural differences; without this, the 0/158 false-equivalence rate does not yet constitute evidence that the Mazurkiewicz normal forms and SMT frontiers handle the hard cases identified in the introduction and taxonomy.
- [§4] §4 (Causal IR and trace quotient): the separation of program order from races is central to the soundness argument, yet the manuscript provides no formal statement of the observation relation induced by each lens on the quotiented traces; without an explicit definition relating the normal forms to the VM witnesses, it is unclear whether the zero false-equivalence result follows from the theory or from post-hoc filtering of the corpus.
minor comments (2)
- [§3] Notation for semantic transactions and typed resources is introduced without a consolidated table; a single table listing resource kinds, transaction signatures, and their mapping to Scratch blocks would improve readability.
- [Evaluation] The abstract states 'all 444 validated pairs' but the evaluation text should explicitly state the total number of generated mutants before validation and the criteria used to discard unvalidated pairs.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity on the evaluation corpus and the formal connections between the trace quotients and lens observations.
read point-by-point responses
-
Referee: [Evaluation] Evaluation section: the claim that the 444-pair corpus tests the partial-order reduction and lens-parametric machinery on concurrent/race/timing divergences rests on the representativeness of the mutation operators. The manuscript does not enumerate the operators or quantify how many mutants induce schedule-dependent rather than sequential or structural differences; without this, the 0/158 false-equivalence rate does not yet constitute evidence that the Mazurkiewicz normal forms and SMT frontiers handle the hard cases identified in the introduction and taxonomy.
Authors: We agree that an explicit enumeration of the mutation operators and a quantification of schedule-dependent cases would strengthen the argument for representativeness. The current manuscript describes the operators at a high level in Section 5.1 (as edits drawn from real Scratch project histories) but does not provide a full list or breakdown by divergence class. In the revised version we will add a table enumerating the 14 operators and report that 87 of the 444 pairs were classified by the taxonomy as inducing schedule-dependent (race or timing) divergences; the 0/158 false-equivalence rate continues to hold on this subset. This addition directly ties the empirical result to the hard cases highlighted in the introduction. revision: yes
-
Referee: [§4] §4 (Causal IR and trace quotient): the separation of program order from races is central to the soundness argument, yet the manuscript provides no formal statement of the observation relation induced by each lens on the quotiented traces; without an explicit definition relating the normal forms to the VM witnesses, it is unclear whether the zero false-equivalence result follows from the theory or from post-hoc filtering of the corpus.
Authors: The observation relation ~_L for each lens L is defined in Section 3.2, and Theorem 4.2 states that the Mazurkiewicz normal form preserves ~_L. The connection to VM witnesses is implicit in the CEGAR loop (Section 4.4), where SMT obligations are discharged only when the VM confirms the counterexample matches the semantic model. Nevertheless, we acknowledge that an explicit corollary or lemma directly relating the quotiented traces to the VM-validated witnesses would remove any ambiguity. We will add this corollary (with a short proof sketch) in the revised §4. revision: yes
Circularity Check
No significant circularity; standard techniques applied to new domain with independent empirical corpus
full rationale
The paper defines an algorithmic pipeline (causal IR compilation, Mazurkiewicz normal forms, SMT obligations, VM-backed refinement) whose correctness claims rest on explicit definitions and external verification techniques rather than self-referential equations or fitted parameters. The headline result (444/444 decided, 0/158 false equivalences) is measured against VM-witness labels on a mutation corpus constructed independently of the decision procedure; no step renames a fitted quantity as a prediction or imports uniqueness via self-citation. The evaluation therefore remains falsifiable outside the method's own definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Mazurkiewicz trace equivalence soundly quotients independent concurrent actions for the purpose of behavioral comparison
invented entities (2)
-
observation lenses (final state, frame traces, monitors, event causality, debug traces)
no independent evidence
-
causal IR of typed resources and semantic transactions
no independent evidence
Forward citations
Cited by 5 Pith papers
-
Checked Program Recovery from Execution Video: A Sound Oracle for Untrusted Generators
Vid2Prog recovers Scratch programs from execution videos via a sound oracle that certifies lens-equivalence with zero false accepts on 246 test pairs and 80% certificate rate for in-vocabulary cases while abstaining o...
-
SchedCheck: Schedule-Robustness Analysis for Event-Driven Block Programs
SchedCheck performs partial-order exploration over dependence-equivalence classes of schedules on the Scratch VM to detect and localize schedule-sensitive behaviors, reporting 17-21% of real concurrent projects affected.
-
Certificate-Carrying Transformation of Event-Driven Block Programs
A certificate-carrying rewriting system for Scratch-like languages uses a trusted checker to verify optimizer rewrites by recomputing preservation conditions, with a Lean-mechanized cooperative-frame refinement theore...
-
ScratchWorld: Evaluating If World Models Compute Executable Consequences
ScratchWorld benchmark finds that language models achieve at most 13.8% value-aware changed-field F1 on replay-verified Scratch state transitions and frequently ignore executable rules.
-
Fixed-Set Robustness in Programming by Example: Example Corruption and Semantic Partition Recovery
The paper formalizes fixed-set worst-case corruption in PBE, implements corruption searches on a string DSL, and shows VPA recovers some margin-1 tasks but fails on public SyGuS where vote margins are near one.
Reference graph
Works this paper leans on
-
[1]
Scratch 2024: Creativity around the world,
Scratch Foundation, “Scratch 2024: Creativity around the world,” https: //annualreport.scratchfoundation.org/, 2025, accessed 2026-06-14
2024
-
[2]
scratch-vm v5.0.300: Virtual machine used to represent, run, and maintain the state of programs for scratch 3.0,
——, “scratch-vm v5.0.300: Virtual machine used to represent, run, and maintain the state of programs for scratch 3.0,” https://github.com/scratchfoundation/scratch-vm, 2026, repository SHA e6f5711f25f607ce8370a5a7afcfb391b349a6e1; repository archived 2026-06-10 and migrated to scratch-editor; accessed 2026-06-12
2026
-
[3]
Scratch: Programming for all,
M. Resnick, J. Maloney, A. Monroy-Hern ´andez, N. Rusk, E. Eastmond, K. Brennan, A. Millner, E. Rosenbaum, J. Silver, B. Silverman, and Y . Kafai, “Scratch: Programming for all,”Communications of the ACM, vol. 52, no. 11, pp. 60–67, 2009
2009
-
[4]
The scratch programming language and environment,
J. Maloney, M. Resnick, N. Rusk, B. Silverman, and E. Eastmond, “The scratch programming language and environment,”ACM Transactions on Computing Education, vol. 10, no. 4, pp. 16:1–16:15, 2010
2010
-
[5]
How kids code and how we know: An exploratory study on the scratch repository,
E. Aivaloglou and F. Hermans, “How kids code and how we know: An exploratory study on the scratch repository,” inProceedings of the 2016 ACM Conference on International Computing Education Research, ser. ICER ’16. ACM, 2016, pp. 53–61
2016
-
[6]
Fine-grained and accurate source code differencing,
J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus, “Fine-grained and accurate source code differencing,” inProceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, ser. ASE ’14. ACM, 2014, pp. 313–324
2014
-
[7]
Simple fast algorithms for the editing distance between trees and related problems,
K. Zhang and D. Shasha, “Simple fast algorithms for the editing distance between trees and related problems,”SIAM Journal on Computing, vol. 18, no. 6, pp. 1245–1262, 1989
1989
-
[8]
Testing scratch programs automatically,
A. Stahlbauer, M. Kreis, and G. Fraser, “Testing scratch programs automatically,” inProceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE ’19. ACM, 2019, pp. 165–175
2019
-
[9]
Trace theory,
A. Mazurkiewicz, “Trace theory,” inPetri Nets: Applications and Relationships to Other Models of Concurrency. Springer, 1987, pp. 278–324
1987
-
[10]
Z3: An efficient SMT solver,
L. de Moura and N. Bjørner, “Z3: An efficient SMT solver,” inTools and Algorithms for the Construction and Analysis of Systems, ser. TACAS ’08. Springer, 2008, pp. 337–340
2008
-
[11]
Counterexample- guided abstraction refinement for symbolic model checking,
E. Clarke, O. Grumberg, S. Jha, Y . Lu, and H. Veith, “Counterexample- guided abstraction refinement for symbolic model checking,”Journal of the ACM, vol. 50, no. 5, pp. 752–794, 2003
2003
-
[12]
LitterBox: A linter for scratch programs,
G. Fraser, U. Heuer, N. K ¨orber, F. Oberm ¨uller, and E. Wasmeier, “LitterBox: A linter for scratch programs,” inProceedings of the 43rd IEEE/ACM International Conference on Software Engineering: Software Engineering Education and Training, ser. ICSE-SEET ’21. IEEE, 2021, pp. 183–188
2021
-
[13]
Common bugs in scratch programs,
C. Fr ¨adrich, F. Oberm ¨uller, N. K ¨orber, U. Heuer, and G. Fraser, “Common bugs in scratch programs,” inProceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education, ser. ITiCSE ’20. ACM, 2020, pp. 89–95
2020
-
[14]
A reduction of a graph to a canonical form and an algebra arising during this reduction,
B. Weisfeiler and A. Leman, “A reduction of a graph to a canonical form and an algebra arising during this reduction,”Nauchno-Technicheskaya Informatsia, vol. 2, no. 9, pp. 12–16, 1968
1968
-
[15]
Godefroid,Partial-Order Methods for the Verification of Concurrent Systems, ser
P. Godefroid,Partial-Order Methods for the Verification of Concurrent Systems, ser. Lecture Notes in Computer Science. Springer, 1996, vol. 1032
1996
-
[16]
Note on the sampling error of the difference between correlated proportions or percentages,
Q. McNemar, “Note on the sampling error of the difference between correlated proportions or percentages,”Psychometrika, vol. 12, no. 2, pp. 153–157, 1947
1947
-
[17]
Efron and R
B. Efron and R. J. Tibshirani,An Introduction to the Bootstrap. Chap- man & Hall/CRC, 1994
1994
-
[18]
Hairball: Lint-inspired static analysis of scratch projects,
B. Boe, C. Hill, M. Len, G. Dreschler, P. Conrad, and D. Franklin, “Hairball: Lint-inspired static analysis of scratch projects,” inProceeding of the 44th ACM Technical Symposium on Computer Science Education, ser. SIGCSE ’13. ACM, 2013, pp. 215–220
2013
-
[19]
Dr. scratch: Automatic analysis of scratch projects to assess and foster computational thinking,
J. Moreno-Le ´on, G. Robles, and M. Rom ´an-Gonz´alez, “Dr. scratch: Automatic analysis of scratch projects to assess and foster computational thinking,”RED. Revista de Educaci ´on a Distancia, no. 46, 2015
2015
-
[20]
Verified from scratch: Program analysis for learners’ programs,
A. Stahlbauer, C. Fr ¨adrich, and G. Fraser, “Verified from scratch: Program analysis for learners’ programs,” inProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineer- ing, ser. ASE ’20. ACM, 2020, pp. 150–162
2020
-
[21]
Au- tomated test generation for scratch programs,
A. Deiner, C. Fr ¨adrich, G. Fraser, S. Geserer, and N. Zantner, “Au- tomated test generation for scratch programs,”Empirical Software Engineering, vol. 28, no. 3, p. 79, 2023
2023
-
[22]
Model-based testing of scratch programs,
K. G ¨otz, P. Feldmeier, and G. Fraser, “Model-based testing of scratch programs,” inProceedings of the 2022 IEEE Conference on Software Testing, Verification and Validation, ser. ICST ’22. IEEE, 2022, pp. 411–421
2022
-
[23]
NuzzleBug: Debugging block-based programs in scratch,
A. Deiner and G. Fraser, “NuzzleBug: Debugging block-based programs in scratch,” inProceedings of the 46th IEEE/ACM International Confer- ence on Software Engineering, ser. ICSE ’24. ACM, 2024, pp. 1–13
2024
-
[24]
Guiding next-step hint generation using automated tests,
F. Oberm ¨uller, E. Wasmeier, and G. Fraser, “Guiding next-step hint generation using automated tests,” inProceedings of the 26th ACM Con- ference on Innovation and Technology in Computer Science Education, ser. ITiCSE ’21. ACM, 2021, pp. 220–226
2021
-
[25]
Raven: Rethinking automated assessment for scratch programs via video-grounded evaluation,
D. Li, D. Li, H. Shi, and J. Zhang, “Raven: Rethinking automated assessment for scratch programs via video-grounded evaluation,” 2026
2026
-
[26]
Stitch: Step-by-step LLM guided tutoring for scratch,
Y . Si, K. Qi, D. Li, H. Shi, and J. Zhang, “Stitch: Step-by-step LLM guided tutoring for scratch,” 2025
2025
-
[27]
EcoScratch: Cost-effective multimodal repair for scratch using execution feedback,
Y . Si, M. Wang, D. Li, H. Shi, and J. Zhang, “EcoScratch: Cost-effective multimodal repair for scratch using execution feedback,” 2026
2026
-
[28]
ViScratch: Using large language models and gameplay videos for automated feedback in scratch,
Y . Si, D. Li, H. Shi, and J. Zhang, “ViScratch: Using large language models and gameplay videos for automated feedback in scratch,” 2025
2025
-
[29]
ScratchEval: A multimodal evaluation framework for llms in block-based programming,
Y . Si, S. Han, D. Li, H. Shi, and J. Zhang, “ScratchEval: A multimodal evaluation framework for llms in block-based programming,” 2026
2026
-
[30]
Change distilling: Tree differencing for fine-grained source code change extraction,
B. Fluri, M. W ¨ursch, M. Pinzger, and H. Gall, “Change distilling: Tree differencing for fine-grained source code change extraction,”IEEE Transactions on Software Engineering, vol. 33, no. 11, pp. 725–743, 2007
2007
-
[31]
Semantic diff: A tool for summarizing the effects of modifications,
D. Jackson and D. A. Ladd, “Semantic diff: A tool for summarizing the effects of modifications,” inProceedings of the International Conference on Software Maintenance, ser. ICSM ’94. IEEE, 1994, pp. 243–252
1994
-
[32]
SYMDIFF: A language-agnostic semantic diff tool for imperative programs,
S. K. Lahiri, C. Hawblitzel, M. Kawaguchi, and H. Reb ˆelo, “SYMDIFF: A language-agnostic semantic diff tool for imperative programs,” in Computer Aided Verification, ser. CA V ’12. Springer, 2012, pp. 712– 717
2012
-
[33]
Differential symbolic execution,
S. Person, M. B. Dwyer, S. Elbaum, and C. S. P ˘as˘areanu, “Differential symbolic execution,” inProceedings of the 16th ACM SIGSOFT Inter- national Symposium on Foundations of Software Engineering, ser. FSE ’08. ACM, 2008, pp. 226–237
2008
-
[34]
Regression verification: Proving the equivalence of similar programs,
B. Godlin and O. Strichman, “Regression verification: Proving the equivalence of similar programs,”Software Testing, Verification and Reliability, vol. 23, no. 3, pp. 241–258, 2013
2013
-
[35]
Practical, low-effort equivalence verification of real code,
D. A. Ramos and D. R. Engler, “Practical, low-effort equivalence verification of real code,” inComputer Aided Verification, ser. CA V ’11. Springer, 2011, pp. 669–685
2011
-
[36]
Translation validation,
A. Pnueli, M. Siegel, and E. Singerman, “Translation validation,” in Tools and Algorithms for the Construction and Analysis of Systems, ser. TACAS ’98. Springer, 1998, pp. 151–166
1998
-
[37]
Comparison and evaluation of code clone detection techniques and tools: A qualitative approach,
C. K. Roy, J. R. Cordy, and R. Koschke, “Comparison and evaluation of code clone detection techniques and tools: A qualitative approach,” Science of Computer Programming, vol. 74, no. 7, pp. 470–495, 2009
2009
-
[38]
Dynamic partial-order reduction for model checking software,
C. Flanagan and P. Godefroid, “Dynamic partial-order reduction for model checking software,” inProceedings of the 32nd ACM SIGPLAN- SIGACT Symposium on Principles of Programming Languages, ser. POPL ’05. ACM, 2005, pp. 110–121
2005
-
[39]
The state explosion problem,
A. Valmari, “The state explosion problem,” inLectures on Petri Nets I: Basic Models. Springer, 1998, pp. 429–528
1998
-
[40]
Milner,Communication and Concurrency
R. Milner,Communication and Concurrency. Prentice Hall, 1989
1989
-
[41]
Using pre-trained language models to resolve textual and semantic merge conflicts (experience paper),
J. Zhang, T. Mytkowicz, M. Kaufman, R. Piskac, and S. K. Lahiri, “Using pre-trained language models to resolve textual and semantic merge conflicts (experience paper),” inProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA ’22. ACM, 2022, pp. 77–88
2022
-
[42]
Automated feedback generation for competition-level code,
J. Zhang, D. Li, J. C. Kolesar, H. Shi, and R. Piskac, “Automated feedback generation for competition-level code,” inProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’22. ACM, 2022
2022
-
[43]
Hints on test data selection: Help for the practicing programmer,
R. A. DeMillo, R. J. Lipton, and F. G. Sayward, “Hints on test data selection: Help for the practicing programmer,”Computer, vol. 11, no. 4, pp. 34–41, 1978
1978
-
[44]
An analysis and survey of the development of mutation testing,
Y . Jia and M. Harman, “An analysis and survey of the development of mutation testing,”IEEE Transactions on Software Engineering, vol. 37, no. 5, pp. 649–678, 2011
2011
-
[45]
Covering and uncovering equivalent mutants,
D. Schuler and A. Zeller, “Covering and uncovering equivalent mutants,” Software Testing, Verification and Reliability, vol. 23, no. 5, pp. 353– 374, 2013
2013
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.