Predicting Closed-Loop Performance of Latent World Models: Offline Checkpoint Selection for MPC and Model-Based RL Under Non-Markovian Rewards in LunarLander
Pith reviewed 2026-07-03 17:41 UTC · model grok-4.3
The pith
A composite score based on reward observability selects world model checkpoints that support superior model-based control in LunarLander.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Reward Observability Fraction, which measures the reward predictor's dependence on the observable subspace, is the strongest single predictor of downstream CEM-MPC return. When combined with three structural regularizers into the Composite Reward Observability Fraction, it enables reliable offline checkpoint selection for an RSSM world model trained on LunarLander v3. The selected model supports a model-based A2C policy that outperforms a model-free A2C baseline by approximately 24.5 return points using about 65 times fewer real-environment interactions and also drives a strong zero-shot CEM-MPC policy.
What carries the argument
The Composite Reward Observability Fraction (CROF), a single-number offline checkpoint selection score that aggregates the Reward Observability Fraction with three structural regularizers.
If this is right
- The CROF-selected world model supports strong zero-shot CEM-MPC performance in LunarLander.
- A model-based A2C policy trained from the CROF model achieves higher returns than model-free A2C with 65 times fewer environment interactions.
- Validation loss and multi-step prediction error are not reliable indicators of closed-loop quality for these world models.
- Structural diagnostics from optimal control theory can be used to rank checkpoints without running policies or MPC.
Where Pith is reading between the lines
- If the approach generalizes, it could reduce the data and compute needed to deploy model-based methods in other control tasks.
- The emphasis on reward observability may point to a broader principle for designing latent representations in planning algorithms.
- Testing CROF on different reward structures could show whether the metric captures a fundamental property of useful world models.
Load-bearing premise
The Reward Observability Fraction and the three structural regularizers remain predictive of closed-loop performance when the environment, reward shaping, or world-model architecture changes from the LunarLander v3 RSSM setup used to tune them.
What would settle it
A direct test: compute CROF scores for checkpoints of world models trained on a different environment with shaped rewards, then check whether the highest-CROF checkpoint also achieves the best closed-loop return.
Original abstract
We study how to predict the downstream closed-loop performance of a learned latent world model from validation-time diagnostics alone. Choosing the right checkpoint from a world-model training run is difficult: validation loss and multi-step prediction RMSE keep improving long after closed-loop performance has collapsed. We present a suite of structural validation-time diagnostics drawn from optimal-control theory and apply them to Gymnasium's LunarLander v3, which features shaped rewards. We train an RSSM [5, 4] world model on it and treat per-checkpoint CEM-MPC return as the oracle for closed-loop quality. By evaluating 40 metrics against this oracle, we find that the strongest single predictor is the Reward Observability Fraction (ROF), which measures the reward predictor's dependence on the observable subspace. We combine ROF with three structural regularizers into a single-number offline checkpoint selection score, the Composite Reward Observability Fraction (CROF). The CROF-selected world model trains a model-based A2C policy that beats a fairly evaluated model-free A2C baseline by ~24.5 return points while using ~65x fewer real-environment interactions, and the same world model also drives a strong zero-shot CEM-MPC policy. Code and data: https://github.com/nsmoly/LunarLander_RSSM.
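The screening step described in the abstract (scoring each validation-time metric against the per-checkpoint CEM-MPC return) can be sketched as a rank-correlation pass. This is a minimal illustration, not the paper's procedure: the choice of Spearman correlation, the function names, the metric names, and all numbers below are assumptions made up for the example.

```python
import numpy as np

def rankdata(x):
    """Return 1-based ranks, averaging the ranks of tied values."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        mask = x == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    ra, rb = rankdata(a), rankdata(b)
    ra = ra - ra.mean()
    rb = rb - rb.mean()
    denom = np.sqrt((ra ** 2).sum() * (rb ** 2).sum())
    return float((ra * rb).sum() / denom) if denom > 0 else 0.0

def screen_metrics(metrics, oracle_return):
    """Rank metrics by |Spearman rho| against the oracle, strongest first.

    metrics: dict mapping metric name -> array of per-checkpoint values.
    oracle_return: array of per-checkpoint closed-loop returns.
    """
    scores = {name: spearman(vals, oracle_return) for name, vals in metrics.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Toy data (hypothetical): six checkpoints, two candidate metrics.
oracle = np.array([-150.0, -20.0, 110.0, 180.0, 60.0, -90.0])
metrics = {
    # Keeps decreasing even after the return has dropped again.
    "val_loss": np.array([0.90, 0.70, 0.55, 0.45, 0.40, 0.35]),
    # Tracks the return ordering exactly in this toy example.
    "toy_structural_score": np.array([0.10, 0.35, 0.70, 0.90, 0.60, 0.20]),
}
ranking = screen_metrics(metrics, oracle)
for name, rho in ranking:
    print(f"{name}: rho = {rho:+.3f}")
```

In this toy setup the validation loss improves monotonically while the return peaks and then collapses, so its rank correlation with the oracle is weak, which mirrors the failure mode the abstract describes.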
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard validation losses fail to predict closed-loop performance of RSSM world models on LunarLander v3 (shaped rewards). It introduces the Reward Observability Fraction (ROF) — the reward predictor's dependence on the observable subspace — as the strongest single predictor among 40 metrics when correlated against a CEM-MPC oracle. Combining ROF with three structural regularizers yields the Composite ROF (CROF) score for offline checkpoint selection. The CROF-selected model yields a model-based A2C policy that outperforms a model-free A2C baseline by ~24.5 return while using ~65× fewer environment steps and also supports strong zero-shot CEM-MPC control. Code and data are released.
Significance. If the empirical result holds, the work supplies a concrete, reproducible protocol for selecting world-model checkpoints without expensive closed-loop rollouts. The explicit comparison to a fairly evaluated model-free baseline, the zero-shot MPC result, and the public repository are strengths. The approach is scoped to one environment and architecture, so its broader utility depends on whether ROF and the regularizers remain predictive under changes in reward shaping or dynamics.
major comments (2)
- [Abstract, §3] Abstract and methods: ROF is computed directly from the reward predictor evaluated on the same validation trajectories used during world-model training. This creates a moderate dependence between the selection score and quantities already fitted in training; the manuscript should report whether an independent held-out set or cross-validation was used when computing the oracle correlations and the final 24.5-point gap.
- [§4, Table 2] Evaluation protocol: The claim that CROF is the strongest predictor rests on screening 40 metrics against the CEM-MPC oracle on the identical checkpoint set. The paper should state whether the metric suite (including ROF) was fixed before seeing the correlations or whether the strongest metric was identified post-hoc; otherwise the reported superiority of CROF risks being an in-sample selection effect.
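The in-sample selection concern raised in the second major comment can be illustrated numerically: when many candidate metrics are screened against one oracle on a small checkpoint set, the best of them can look strongly correlated by chance alone. The sketch below simulates pure-noise metrics. The checkpoint count of 30 is a hypothetical value (the source does not state the number of checkpoints), and Pearson correlation is used only for simplicity.

```python
import numpy as np

def max_abs_corr_under_null(n_checkpoints, n_metrics, n_trials=2000, seed=0):
    """Simulate the largest |Pearson r| among n_metrics pure-noise metrics.

    Each trial draws an independent random oracle and independent random
    metrics, so any correlation found is a selection artifact.
    """
    rng = np.random.default_rng(seed)
    best = np.empty(n_trials)
    for t in range(n_trials):
        oracle = rng.standard_normal(n_checkpoints)
        metrics = rng.standard_normal((n_metrics, n_checkpoints))
        o = oracle - oracle.mean()
        m = metrics - metrics.mean(axis=1, keepdims=True)
        corr = (m @ o) / (np.linalg.norm(m, axis=1) * np.linalg.norm(o))
        best[t] = np.abs(corr).max()
    return best

# Hypothetical sizes: 40 screened metrics, 30 checkpoints.
null_best = max_abs_corr_under_null(n_checkpoints=30, n_metrics=40)
single = max_abs_corr_under_null(n_checkpoints=30, n_metrics=1)
print("mean |r| of one pre-specified noise metric:", round(float(single.mean()), 3))
print("mean best |r| among 40 noise metrics:    ", round(float(null_best.mean()), 3))
print("95th percentile of best-of-40 |r|:       ", round(float(np.quantile(null_best, 0.95)), 3))
```

Comparing an observed top correlation with the upper quantiles of this best-of-many null distribution is one simple way to judge whether a screened winner exceeds what selection alone would produce.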
minor comments (2)
- [§3.2] Notation: Define the precise mathematical expression for ROF (e.g., the fraction of reward variance explained by the observable latent dimensions) in the main text rather than only in the appendix.
- [Figure 3] Figure clarity: The caption for the correlation plot should explicitly state the number of checkpoints and whether error bars reflect multiple random seeds or bootstrap resampling.
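The first minor comment suggests one possible form for ROF: the fraction of reward variance explained by the observable latent dimensions. A minimal sketch of that reading is a linear R² of the predicted reward regressed on a chosen subset of latent coordinates. This is an assumption for illustration only, not the paper's definition; the function name, the choice of which dimensions count as observable, and the synthetic data are all hypothetical.

```python
import numpy as np

def variance_fraction_explained(latents, reward_pred, observable_dims):
    """R^2 of a linear fit of predicted reward on the selected latent dims.

    latents: array of shape (n_samples, latent_dim).
    reward_pred: array of shape (n_samples,), reward-head outputs.
    observable_dims: indices of latent dims assumed to be observable.
    """
    z = np.asarray(latents, dtype=float)[:, observable_dims]
    r = np.asarray(reward_pred, dtype=float)
    x = np.column_stack([z, np.ones(len(r))])  # add an intercept column
    coef, *_ = np.linalg.lstsq(x, r, rcond=None)
    resid = r - x @ coef
    total = ((r - r.mean()) ** 2).sum()
    if total == 0:
        return 0.0
    return float(1.0 - (resid ** 2).sum() / total)

# Synthetic latents: dims 0 and 1 are treated as observable (an assumption).
rng = np.random.default_rng(0)
z = rng.standard_normal((500, 6))
r_obs = 2.0 * z[:, 0] - 1.0 * z[:, 1]   # reward depends only on observable dims
r_mixed = r_obs + 3.0 * z[:, 4]          # reward also leans on an unobserved dim

frac_obs = variance_fraction_explained(z, r_obs, [0, 1])
frac_mixed = variance_fraction_explained(z, r_mixed, [0, 1])
print("reward on observable dims only:", round(frac_obs, 3))
print("reward mixing in a hidden dim: ", round(frac_mixed, 3))
```

A score near 1 means the reward head reads almost entirely from the assumed observable coordinates, while a lower score signals dependence on directions the observations do not pin down.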
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We address the two major comments below regarding evaluation protocol and potential selection effects. Both points can be clarified with revisions to the manuscript.
- Referee: [Abstract, §3] Abstract and methods: ROF is computed directly from the reward predictor evaluated on the same validation trajectories used during world-model training. This creates a moderate dependence between the selection score and quantities already fitted in training; the manuscript should report whether an independent held-out set or cross-validation was used when computing the oracle correlations and the final 24.5-point gap.
  Authors: We acknowledge the dependence: ROF is evaluated on the validation trajectories from world-model training. However, the CEM-MPC oracle returns are generated via closed-loop rollouts in the environment and are independent of the validation loss. The 24.5-point gap arises from training and evaluating a model-based A2C policy in the real environment using the CROF-selected checkpoint. No separate held-out set or cross-validation was employed for the reported correlations; the validation split is the one used throughout training. We will revise §3 and the evaluation section to explicitly state this protocol and the separation between diagnostics and the closed-loop oracle.
  Revision: yes
- Referee: [§4, Table 2] Evaluation protocol: The claim that CROF is the strongest predictor rests on screening 40 metrics against the CEM-MPC oracle on the identical checkpoint set. The paper should state whether the metric suite (including ROF) was fixed before seeing the correlations or whether the strongest metric was identified post-hoc; otherwise the reported superiority of CROF risks being an in-sample selection effect.
  Authors: The 40 metrics were drawn from optimal-control theory and RSSM structural properties (detailed in §3) and fixed prior to any correlation analysis. ROF was included based on the reward-observability hypothesis before screening. The screening ranked the pre-specified suite; CROF combines ROF with three regularizers also motivated a priori. We will add an explicit statement in §4 confirming the metric families were defined before computing correlations to address the in-sample selection concern.
  Revision: yes
Circularity Check
No significant circularity identified
The paper reports an empirical correlation study: 40 validation metrics (including ROF computed from the already-trained reward predictor) are evaluated against a CEM-MPC oracle on the same set of checkpoints, the strongest correlates are combined into CROF, and the selected checkpoint is then used for downstream policy training and zero-shot MPC. No equation or definition in the provided material shows a prediction that reduces to its inputs by construction, no self-citation chain is load-bearing for the central claim, and the result is scoped to concrete LunarLander experiments with supplied code and data. The derivation is therefore self-contained as a data-driven selection procedure rather than a tautological reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: CEM-MPC return computed on the same environment is a faithful proxy for downstream closed-loop performance of any policy that uses the world model.
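For readers unfamiliar with the oracle named in this axiom, the cross-entropy method (CEM) used for MPC can be sketched as follows: sample action sequences, roll them out through a model, keep the highest-return elites, refit the sampling distribution, and execute the first planned action. This is a generic sketch on a toy one-dimensional point mass with continuous actions, not the paper's LunarLander planner; the dynamics, reward, and every hyperparameter below are assumptions.

```python
import numpy as np

def cem_plan(step_fn, reward_fn, state, horizon=12, act_dim=1,
             pop=200, n_elite=20, iters=5, seed=0):
    """Return the first action of a CEM-optimised open-loop action sequence."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        # Sample candidate action sequences and clip to the action bounds.
        noise = rng.standard_normal((pop, horizon, act_dim))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        # Roll every candidate out through the model in parallel.
        states = np.repeat(state[None], pop, axis=0)
        returns = np.zeros(pop)
        for t in range(horizon):
            states = step_fn(states, actions[:, t])
            returns += reward_fn(states, actions[:, t])
        # Refit the sampling distribution to the top-return candidates.
        elite = actions[np.argsort(returns)[-n_elite:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean[0]

def toy_step(states, actions):
    """Toy point mass (hypothetical): action is an acceleration, dt = 0.1."""
    vel = states[:, 1] + 0.1 * actions[:, 0]
    pos = states[:, 0] + 0.1 * vel
    return np.stack([pos, vel], axis=1)

def toy_reward(states, actions):
    """Penalise distance from the origin plus a small control cost."""
    return -(states[:, 0] ** 2) - 0.01 * actions[:, 0] ** 2

start = np.array([1.0, 0.0])  # position 1, at rest
first_action = cem_plan(toy_step, toy_reward, start)
print("first planned action:", first_action)
```

In an MPC loop this planner would be called again at every environment step from the newly observed state, with a learned world model standing in for `toy_step` and `toy_reward`.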
invented entities (1)
- Reward Observability Fraction (ROF): no independent evidence
Reference graph
Works this paper leans on
- [1] Rishabh Agarwal, Marlos C. Machado, Pablo Samuel Castro, and Marc G. Bellemare. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. In International Conference on Learning Representations (ICLR), 2021.
- [2] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.
- [3]
- [4] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.
- [5] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning (ICML), pages 2555–2565. PMLR, 2019.
- [6] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
- [7] Rudolf E. Kalman. Mathematical description of linear dynamical systems. Journal of the Society for Industrial and Applied Mathematics, Series A: Control, 1(2):152–192, 1963.
- [8] Nathan Lambert, Brandon Amos, Omry Yadan, and Roberto Calandra. Objective mismatch in model-based reinforcement learning. In Proceedings of the 2nd Conference on Learning for Dynamics and Control (L4DC), pages 761–770. PMLR, 2020.
- [9] Bruce C. Moore. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Transactions on Automatic Control, 26(1):17–32, 1981.
- [10] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), pages 278–287, 1999.
- [11] Reuven Y. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1(2):127–190, 1999.
- [12] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
- [13] David Sussillo and Omri Barak. Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Computation, 25(3):626–649, 2013.
- [14] Marvin Zhang, Sharad Vikram, Laura Smith, Pieter Abbeel, Matthew J. Johnson, and Sergey Levine. SOLAR: Deep structured representations for model-based reinforcement learning. In International Conference on Machine Learning (ICML), pages 7444–7453. PMLR, 2019.