arxiv: 2606.31689 · v1 · pith:2L3JUUFUnew · submitted 2026-06-30 · 💻 cs.SE

ScratchWorld: Evaluating If World Models Compute Executable Consequences

Pith reviewed 2026-07-01 04:25 UTC · model grok-4.3

classification 💻 cs.SE
keywords: world models, executable consequences, state prediction, benchmark, Scratch, language models, causal attribution, counterfactual prediction

The pith

Prompted models reach at most 13.8% on value-aware changed-field F1 in ScratchWorld executable tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ScratchWorld as a benchmark that scores world models on whether they predict the actual executed changes in state rather than matching persistent observations. It runs Scratch projects through a pinned virtual machine to generate replay-verified next states, hidden variables, and counterfactual outcomes under multiple input formats. Seven prompted language and reasoning models are tested in a state-only partial-observation setting on 659 examples and score no higher than 13.8% on the primary metric. A copy diagnostic shows that repeating the input state produces 98% full-state accuracy but 0% on the changed-field measure, exposing the overlap problem. Auxiliary tests indicate that models often respond to actions or interventions without applying the executable rule that sets the changed value.

Core claim

ScratchWorld demonstrates that prompted language and reasoning models reach at most 13.8% value-aware changed-field F1 when predicting state changes from partial observations. The benchmark supplies replay-verified targets from a pinned Scratch VM so that persistent fields do not inflate accuracy. Models frequently react to interventions or actions yet fail to follow the specific executable rule that determines the new field value.
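The gap between reacting to an intervention and following the executable rule can be made concrete with two separate rates. The sketch below is illustrative only: the `CounterfactualCase` record, both function names, and the toy values are assumptions made for this example, and the paper's own diagnostics may be defined differently.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class CounterfactualCase:
    """Hypothetical record for one factual/counterfactual pair on a single field."""
    factual_prediction: Any         # model output without the intervention
    counterfactual_prediction: Any  # model output with the intervention applied
    executed_counterfactual: Any    # value actually produced by execution


def sensitivity(cases: List[CounterfactualCase]) -> float:
    """Share of cases where the intervention changed the model's prediction."""
    reacted = sum(c.counterfactual_prediction != c.factual_prediction for c in cases)
    return reacted / len(cases)


def rule_following(cases: List[CounterfactualCase]) -> float:
    """Share of cases where the prediction matches the executed value."""
    correct = sum(c.counterfactual_prediction == c.executed_counterfactual for c in cases)
    return correct / len(cases)


# Toy values, not data from the paper.
cases = [
    CounterfactualCase(3, 5, 4),  # reacted, wrong value
    CounterfactualCase(3, 4, 4),  # reacted, executed value
    CounterfactualCase(3, 3, 4),  # did not react
    CounterfactualCase(3, 6, 4),  # reacted, wrong value
]
print(sensitivity(cases))     # 0.75
print(rule_following(cases))  # 0.25
```

A model with high sensitivity and low rule following matches the failure pattern described here: its output moves when the world is perturbed, but not to the value that execution produces.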

What carries the argument

The value-aware changed-field F1 metric, which awards credit only when a model correctly identifies both the changed field and its executed value from replay-verified Scratch transitions.
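A minimal sketch of how such a metric could be computed follows. This page does not reproduce the paper's formal definition, so everything below is an assumption: states are encoded as flat dictionaries, a field counts as changed when its value differs from the input state, and credit is given only for exact (field, value) matches. The names `changed_pairs` and `changed_field_f1` are hypothetical.

```python
from typing import Any, Dict, Set, Tuple

State = Dict[str, Any]  # hypothetical flat encoding: field name -> value


def changed_pairs(before: State, after: State) -> Set[Tuple[str, str]]:
    """Return (field, new value) pairs for fields that differ from `before`."""
    missing = object()
    # repr() keeps the pairs hashable even when values are lists or dicts.
    return {
        (field, repr(value))
        for field, value in after.items()
        if before.get(field, missing) != value
    }


def changed_field_f1(before: State, target: State, predicted: State) -> float:
    """Value-aware F1 over changed fields (assumed definition, see lead-in)."""
    gold = changed_pairs(before, target)
    pred = changed_pairs(before, predicted)
    if not gold and not pred:
        return 1.0  # assumption: no change expected and none predicted
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


before = {"score": 3, "lives": 2, "x": 10}
target = {"score": 4, "lives": 2, "x": 10}       # executed next state
right_value = {"score": 4, "lives": 2, "x": 10}
wrong_value = {"score": 5, "lives": 2, "x": 10}  # right field, wrong value
copied = dict(before)                            # input state repeated

print(changed_field_f1(before, target, right_value))  # 1.0
print(changed_field_f1(before, target, wrong_value))  # 0.0
print(changed_field_f1(before, target, copied))       # 0.0
```

Under these assumptions, naming the right field with the wrong value earns nothing, which is the property that distinguishes a value-aware score from a changed-field detector. Fields deleted between states are ignored in this sketch.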

If this is right

  • Full-state overlap metrics can be gamed by copying the input state, producing 98% apparent accuracy while scoring 0% on changed fields.
  • Performance remains low across next-state prediction, long-horizon tracking, causal event attribution, and counterfactual prediction.
  • Real projects exhibit a larger gap between overlap accuracy and changed-field accuracy than synthetic cases.
  • Models show sensitivity to interventions yet often ignore the rule that fixes the resulting value.
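The copy confound in the first bullet can be reproduced on a toy sparse-change state. The sketch below uses contrived numbers (20 fields, one change) rather than benchmark data, and the helper names `field_accuracy` and `changed_value_recall` are hypothetical.

```python
from typing import Any, Dict

State = Dict[str, Any]


def field_accuracy(target: State, predicted: State) -> float:
    """Fraction of target fields whose predicted value matches exactly."""
    missing = object()
    hits = sum(1 for f, v in target.items() if predicted.get(f, missing) == v)
    return hits / len(target)


def changed_value_recall(before: State, target: State, predicted: State) -> float:
    """Fraction of truly changed fields predicted with the executed value."""
    missing = object()
    changed = [f for f, v in target.items() if before.get(f, missing) != v]
    if not changed:
        return 1.0  # assumption: nothing to recover counts as full recall
    hits = sum(1 for f in changed if predicted.get(f, missing) == target[f])
    return hits / len(changed)


# Toy sparse-change world: 20 fields, exactly one of which changes.
before = {f"var{i}": 0 for i in range(20)}
target = dict(before, var7=1)
copy_prediction = dict(before)  # the copy baseline repeats the input state

print(field_accuracy(target, copy_prediction))                # 0.95
print(changed_value_recall(before, target, copy_prediction))  # 0.0
```

Because the copy baseline never predicts a change, its recall on changed fields is zero, so any F1 computed over changed fields is zero as well, however high the overlap score climbs as the state grows.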

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures that explicitly simulate execution rules may be needed before world models reliably predict consequences.
  • Benchmarks in other executable domains could test whether the observed failure generalizes beyond Scratch projects.
  • Training signals that emphasize changed fields rather than full-state overlap might improve consequence tracking.

Load-bearing premise

The chosen Scratch projects and pinned VM produce transitions that serve as accurate ground truth for the executable consequences world models should compute.
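One way to read "replay-verified" is that every stored transition must reproduce when re-executed under the pinned runtime. The sketch below illustrates that idea with a toy deterministic `step` function standing in for the VM. The function, the version string, and the record layout are all hypothetical and are not taken from the paper or from any Scratch VM API.

```python
from typing import Any, Dict, List, Tuple

State = Dict[str, Any]
Transition = Tuple[State, str, State]  # (state before, action, recorded state after)

PINNED_VM_VERSION = "toy-vm-1.0"  # placeholder, not the paper's VM version


def step(state: State, action: str) -> State:
    """Toy deterministic world standing in for one step of a pinned VM."""
    nxt = dict(state)
    if action == "key_space":
        nxt["score"] = state["score"] + 1
    elif action == "reset":
        nxt["score"] = 0
    return nxt


def replay_verified(transitions: List[Transition], vm_version: str) -> bool:
    """True when every recorded transition reproduces under the pinned version."""
    if vm_version != PINNED_VM_VERSION:
        return False  # a different runtime could yield different ground truth
    return all(step(before, action) == after for before, action, after in transitions)


recorded = [
    ({"score": 0, "lives": 3}, "key_space", {"score": 1, "lives": 3}),
    ({"score": 1, "lives": 3}, "reset", {"score": 0, "lives": 3}),
]
corrupted = recorded + [
    ({"score": 0, "lives": 3}, "key_space", {"score": 2, "lives": 3}),
]
print(replay_verified(recorded, "toy-vm-1.0"))   # True
print(replay_verified(corrupted, "toy-vm-1.0"))  # False
```

A check of this kind confirms that targets are reproducible under the pinned runtime. It does not establish that the chosen projects are representative, which is the part of the premise that remains an assumption.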

What would settle it

A model that scores substantially above 13.8% value-aware changed-field F1 on the 659 examples, a metric on which the same-instance copy baseline scores 0.0%, would indicate that models can compute executable consequences rather than copy persistent state.

Figures

Figures reproduced from arXiv: 2606.31689 by Jialu Zhang, Yufeng Lin.

Figure 1: ScratchWorld construction pipeline. Scratch programs define executable worlds. Pinned VM replay produces structured traces, task targets, and evidence views; model predictions are scored by changed fields, changed values, and copy-sensitive overlap diagnostics.
Original abstract

World-model evaluations often score a predicted future by overlap with a target state or observation. In sparse-change worlds, this can turn copied persistent state into apparent accuracy. We introduce ScratchWorld, an offline diagnostic benchmark that treats Scratch projects as executable worlds and uses a pinned Scratch VM to produce replay-verified transitions, hidden variables, causal traces, and counterfactual outcomes. ScratchWorld evaluates next-state prediction, long-horizon tracking, causal event attribution, and counterfactual prediction; each replay-verified target can be presented under raw-program, structured-state, natural-language, or rendered input modalities, and our experiments use the structured-state condition. Its primary state metric is value-aware changed-field $F_1$, which gives credit only for the changed field and its executed value. In a 659-example release, seven prompted language/reasoning models reach at most 13.8% value-aware changed-field $F_1$ in a state-only partial-observation stress test. A same-instance copy diagnostic makes the overlap confound concrete: copying the input state reaches 98.0% implied full-state field accuracy and 0.0% changed-field $F_1$, with the largest inflation on real projects. Auxiliary diagnostics separate hidden-state rollout drift, intervention sensitivity, causal attribution, and perturbation robustness. Across these settings, models often react to actions or interventions without following the executable rule that determines the changed value.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces ScratchWorld, an offline diagnostic benchmark that models Scratch projects as executable worlds. It uses a pinned Scratch VM to generate replay-verified transitions, hidden variables, causal traces, and counterfactuals. The benchmark supports multiple input modalities and tasks including next-state prediction, long-horizon tracking, causal attribution, and counterfactual prediction. Its primary metric is value-aware changed-field F1, which credits only correctly predicted changes. On a 659-example release under a state-only partial-observation stress test, seven prompted language/reasoning models achieve at most 13.8% on this metric. Auxiliary diagnostics (copy baseline, drift, intervention sensitivity, perturbation robustness) show that models frequently react to actions without following the executable rules that determine changed values, while a same-instance copy diagnostic reaches 98.0% full-state accuracy but 0.0% changed-field F1.

Significance. If the central empirical result holds, the work is significant because it supplies a reproducible, falsifiable benchmark with replay-verified ground truth that isolates executable rule-following from persistent-state overlap confounds. Explicit strengths include the pinned VM for verifiable transitions, the value-aware metric that penalizes copying, and the suite of auxiliary diagnostics that separate distinct failure modes. The 659-example release and copy diagnostic make the overlap problem concrete and testable.

minor comments (3)
  1. The abstract states the 13.8% ceiling and the metric but does not include the exact definition of value-aware changed-field F1 or the selection criteria for the 659 examples; the full manuscript should place the formal definition and protocol summary in §3 or §4 so that the primary result can be reproduced from the text alone.
  2. The figure or table that presents per-model scores on the 659-example set should report the value-aware changed-field F1 and the standard full-state F1 side by side, so that the inflation effect is visible at a glance.
  3. The manuscript should state the exact version or commit hash of the pinned Scratch VM and confirm that all 659 transitions were replay-verified in the released artifact.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed and positive summary of our manuscript, the recognition of its strengths (pinned VM, value-aware metric, auxiliary diagnostics, and copy baseline), and the recommendation for minor revision. No major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper defines an external benchmark using a pinned Scratch VM for replay-verified ground truth and an independently specified value-aware changed-field F1 metric that explicitly excludes persistent-state overlap. No equations, predictions, or central claims reduce by construction to fitted parameters, self-citations, or renamed inputs; model scores are direct measurements against this reference. The setup is self-contained against the described VM and diagnostics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that Scratch projects and their VM executions form a valid proxy for general world-model testing; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Scratch projects executed in a pinned VM provide representative and verifiable executable consequences for testing world models
    Invoked to justify generalization from the 659-example set to broader claims about model behavior.

pith-pipeline@v0.9.1-grok · 5777 in / 1220 out tokens · 41582 ms · 2026-07-01T04:25:48.890384+00:00 · methodology


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Checked Program Recovery from Execution Video: A Sound Oracle for Untrusted Generators

    cs.SE 2026-07 conditional novelty 7.0

    Vid2Prog recovers Scratch programs from execution videos via a sound oracle that certifies lens-equivalence with zero false accepts on 246 test pairs and 80% certificate rate for in-vocabulary cases while abstaining o...

  2. SchedCheck: Schedule-Robustness Analysis for Event-Driven Block Programs

    cs.SE 2026-07 conditional novelty 7.0

    SchedCheck performs partial-order exploration over dependence-equivalence classes of schedules on the Scratch VM to detect and localize schedule-sensitive behaviors, reporting 17-21% of real concurrent projects affected.

  3. Certificate-Carrying Transformation of Event-Driven Block Programs

    cs.PL 2026-07 accept novelty 7.0 full

    A certificate-carrying rewriting system for Scratch-like languages uses a trusted checker to verify optimizer rewrites by recomputing preservation conditions, with a Lean-mechanized cooperative-frame refinement theore...

  4. Fixed-Set Robustness in Programming by Example: Example Corruption and Semantic Partition Recovery

    cs.LG 2026-07 conditional novelty 6.0

    The paper formalizes fixed-set worst-case corruption in PBE, implements corruption searches on a string DSL, and shows VPA recovers some margin-1 tasks but fails on public SyGuS where vote margins are near one.
