ScratchWorld: Evaluating If World Models Compute Executable Consequences
Pith reviewed 2026-07-01 04:25 UTC · model grok-4.3
The pith
Prompted models reach at most 13.8% on value-aware changed-field F1 in ScratchWorld executable tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ScratchWorld demonstrates that prompted language and reasoning models reach at most 13.8% value-aware changed-field F1 when predicting state changes from partial observations. The benchmark supplies replay-verified targets from a pinned Scratch VM so that persistent fields do not inflate accuracy. Models frequently react to interventions or actions yet fail to follow the specific executable rule that determines the new field value.
What carries the argument
The value-aware changed-field F1 metric, which awards credit only when a model correctly identifies both the changed field and its executed value from replay-verified Scratch transitions.
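As a rough illustration of what such a metric could look like, here is a minimal sketch. The abstract does not give the formal definition, so everything below is an assumption: states are flat dictionaries, a "changed field" is any field whose value differs from the input state, and the names `changed_pairs` and `value_aware_changed_field_f1` are hypothetical rather than taken from the paper's release.

```python
from typing import Any, Dict, Set, Tuple

State = Dict[str, Any]
_MISSING = object()  # sentinel for fields absent from the input state


def changed_pairs(before: State, after: State) -> Set[Tuple[str, str]]:
    """Return (field, value) pairs for fields whose value differs from `before`.

    Simplification (assumption): fields deleted in `after` are ignored.
    """
    # repr() keeps unhashable values such as lists usable inside a set.
    return {(k, repr(v)) for k, v in after.items() if before.get(k, _MISSING) != v}


def value_aware_changed_field_f1(before: State, predicted: State, target: State) -> float:
    """F1 over (field, value) pairs restricted to fields that changed."""
    pred = changed_pairs(before, predicted)
    gold = changed_pairs(before, target)
    if not pred and not gold:
        return 1.0  # assumption: no change predicted and none occurred
    tp = len(pred & gold)  # right field AND right executed value
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical fields, not benchmark data).
before = {"score": 3, "x": 10, "visible": True}
target = {"score": 4, "x": 10, "visible": True}
right_value = {"score": 4, "x": 10, "visible": True}
wrong_value = {"score": 5, "x": 10, "visible": True}  # reacts, but wrong value

print(value_aware_changed_field_f1(before, right_value, target))
print(value_aware_changed_field_f1(before, wrong_value, target))
```

Under these assumptions, a prediction that touches the right field but assigns the wrong value earns no credit, which is the behavior the summary attributes to the metric.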
If this is right
- Full-state overlap metrics can be gamed by copying the input state, producing 98% apparent accuracy while scoring 0% on changed fields.
- Performance remains low across next-state prediction, long-horizon tracking, causal event attribution, and counterfactual prediction.
- Real projects exhibit a larger gap between overlap accuracy and changed-field accuracy than synthetic cases do.
- Models show sensitivity to interventions yet often ignore the rule that fixes the resulting value.
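The copy confound in the first bullet can be reproduced on a toy state. This is a sketch under stated assumptions: the state has 50 hypothetical variables of which exactly one changes (numbers chosen only so the overlap comes out at 98%, not drawn from the benchmark), and both scoring functions are simplified stand-ins rather than the paper's implementation.

```python
from typing import Any, Dict

State = Dict[str, Any]


def full_state_accuracy(predicted: State, target: State) -> float:
    """Fraction of target fields whose predicted value matches exactly."""
    return sum(predicted.get(k) == v for k, v in target.items()) / len(target)


def changed_field_f1(before: State, predicted: State, target: State) -> float:
    """F1 over (field, value) pairs, counting only fields that differ from `before`."""
    pred = {(k, repr(v)) for k, v in predicted.items() if before.get(k) != v}
    gold = {(k, repr(v)) for k, v in target.items() if before.get(k) != v}
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy sparse-change world: 50 fields, one of which changes after the step.
before = {f"var_{i}": 0 for i in range(50)}
target = dict(before, var_7=1)
copied = dict(before)  # the "copy the input state" baseline

print(full_state_accuracy(copied, target))       # overlap score for the copy
print(changed_field_f1(before, copied, target))  # changed-field score for the copy
```

In this toy setting the copy baseline matches 49 of 50 fields yet earns nothing on the changed field, which is the pattern the bullet describes.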
Where Pith is reading between the lines
- Architectures that explicitly simulate execution rules may be needed before world models reliably predict consequences.
- Benchmarks in other executable domains could test whether the observed failure generalizes beyond Scratch projects.
- Training signals that emphasize changed fields rather than full-state overlap might improve consequence tracking.
Load-bearing premise
The chosen Scratch projects and pinned VM produce transitions that serve as accurate ground truth for the executable consequences world models should compute.
What would settle it
A model that scores substantially above 13.8% value-aware changed-field F1 on the 659 examples, on a metric where the same-instance copy baseline still scores near zero, would indicate that models can compute executable consequences.
Original abstract
World-model evaluations often score a predicted future by overlap with a target state or observation. In sparse-change worlds, this can turn copied persistent state into apparent accuracy. We introduce ScratchWorld, an offline diagnostic benchmark that treats Scratch projects as executable worlds and uses a pinned Scratch VM to produce replay-verified transitions, hidden variables, causal traces, and counterfactual outcomes. ScratchWorld evaluates next-state prediction, long-horizon tracking, causal event attribution, and counterfactual prediction; each replay-verified target can be presented under raw-program, structured-state, natural-language, or rendered input modalities, and our experiments use the structured-state condition. Its primary state metric is value-aware changed-field $F_1$, which gives credit only for the changed field and its executed value. In a 659-example release, seven prompted language/reasoning models reach at most 13.8% value-aware changed-field $F_1$ in a state-only partial-observation stress test. A same-instance copy diagnostic makes the overlap confound concrete: copying the input state reaches 98.0% implied full-state field accuracy and 0.0% changed-field $F_1$, with the largest inflation on real projects. Auxiliary diagnostics separate hidden-state rollout drift, intervention sensitivity, causal attribution, and perturbation robustness. Across these settings, models often react to actions or interventions without following the executable rule that determines the changed value.
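The copy result in the abstract follows from simple counting. As a sketch, assume a state with $N$ fields of which $k \ge 1$ change, and assume full-state field accuracy is the fraction of matching fields (the abstract does not state the formula; $N$, $k$, TP, FP, and FN are notation introduced here). A copied state matches every persistent field and predicts no changes, so it has no true or false positives and $k$ false negatives:

```latex
\[
\mathrm{Acc}_{\text{copy}} = \frac{N - k}{N} = 1 - \frac{k}{N},
\qquad
F_1^{\text{changed}}(\text{copy})
  = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
  = \frac{0}{0 + 0 + k} = 0 .
\]
```

If the reported 98.0% were pooled over all fields in this way, it would correspond to roughly $k/N \approx 0.02$, that is, about one field in fifty changing per transition; the actual aggregation used in the paper may differ.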
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ScratchWorld, an offline diagnostic benchmark that models Scratch projects as executable worlds. It uses a pinned Scratch VM to generate replay-verified transitions, hidden variables, causal traces, and counterfactuals. The benchmark supports multiple input modalities and tasks including next-state prediction, long-horizon tracking, causal attribution, and counterfactual prediction. Its primary metric is value-aware changed-field F1, which credits only correctly predicted changes. On a 659-example release under a state-only partial-observation stress test, seven prompted language/reasoning models achieve at most 13.8% on this metric. Auxiliary diagnostics (copy baseline, drift, intervention sensitivity, perturbation robustness) show that models frequently react to actions without following the executable rules that determine changed values, while a same-instance copy diagnostic reaches 98.0% full-state accuracy but 0.0% changed-field F1.
Significance. If the central empirical result holds, the work is significant because it supplies a reproducible, falsifiable benchmark with replay-verified ground truth that isolates executable rule-following from persistent-state overlap confounds. Explicit strengths include the pinned VM for verifiable transitions, the value-aware metric that penalizes copying, and the suite of auxiliary diagnostics that separate distinct failure modes. The 659-example release and copy diagnostic make the overlap problem concrete and testable.
minor comments (3)
- The abstract states the 13.8% ceiling and the metric but does not include the exact definition of value-aware changed-field F1 or the selection criteria for the 659 examples; the full manuscript should place the formal definition and protocol summary in §3 or §4 so that the primary result can be reproduced from the text alone.
- The figure or table presenting the per-model scores on the 659-example set should report the value-aware changed-field F1 and the standard full-state F1 side by side, so that the inflation effect is visible at a glance.
- The manuscript should state the exact version or commit hash of the pinned Scratch VM and confirm that all 659 transitions were replay-verified in the released artifact.
Simulated Author's Rebuttal
We thank the referee for the detailed and positive summary of our manuscript, the recognition of its strengths (pinned VM, value-aware metric, auxiliary diagnostics, and copy baseline), and the recommendation for minor revision. No major comments were listed in the report.
Circularity Check
No significant circularity identified
The paper defines an external benchmark using a pinned Scratch VM for replay-verified ground truth and an independently specified value-aware changed-field F1 metric that explicitly excludes persistent-state overlap. No equations, predictions, or central claims reduce by construction to fitted parameters, self-citations, or renamed inputs; model scores are direct measurements against this reference. The setup is self-contained against the described VM and diagnostics.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Scratch projects executed in a pinned VM provide representative and verifiable executable consequences for testing world models.
Forward citations
Cited by 4 Pith papers
- Checked Program Recovery from Execution Video: A Sound Oracle for Untrusted Generators
  Vid2Prog recovers Scratch programs from execution videos via a sound oracle that certifies lens-equivalence with zero false accepts on 246 test pairs and 80% certificate rate for in-vocabulary cases while abstaining o...
- SchedCheck: Schedule-Robustness Analysis for Event-Driven Block Programs
  SchedCheck performs partial-order exploration over dependence-equivalence classes of schedules on the Scratch VM to detect and localize schedule-sensitive behaviors, reporting 17-21% of real concurrent projects affected.
- Certificate-Carrying Transformation of Event-Driven Block Programs
  A certificate-carrying rewriting system for Scratch-like languages uses a trusted checker to verify optimizer rewrites by recomputing preservation conditions, with a Lean-mechanized cooperative-frame refinement theore...
- Fixed-Set Robustness in Programming by Example: Example Corruption and Semantic Partition Recovery
  The paper formalizes fixed-set worst-case corruption in PBE, implements corruption searches on a string DSL, and shows VPA recovers some margin-1 tasks but fails on public SyGuS where vote margins are near one.