Predicting Closed-Loop Performance of Latent World Models: Offline Checkpoint Selection for MPC and Model-Based RL Under Non-Markovian Rewards in LunarLander

Nikolai Smolyanskiy

arxiv: 2607.01736 · v1 · pith:FLY3QCTInew · submitted 2026-07-02 · 💻 cs.LG · cs.AI · cs.SY · eess.SY

Pith reviewed 2026-07-03 17:41 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.SY · eess.SY

keywords latent world models · checkpoint selection · model-based RL · MPC · reward observability · LunarLander · offline diagnostics · RSSM

The pith

A composite score based on reward observability selects world model checkpoints that support superior model-based control in LunarLander.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that usual validation losses keep improving even after a latent world model stops being useful for control in environments with shaped rewards. It identifies the Reward Observability Fraction as the best predictor of closed-loop performance among many candidates and combines it with structural regularizers to form the CROF score. This score allows choosing good checkpoints without any closed-loop testing or extra environment interactions. The chosen model then trains a model-based policy that beats the model-free baseline while using far less data and also works well for model predictive control.

Core claim

The Reward Observability Fraction, which measures the reward predictor's dependence on the observable subspace, is the strongest single predictor of downstream CEM-MPC return. When combined with three structural regularizers into the Composite Reward Observability Fraction, it enables reliable offline checkpoint selection for an RSSM world model trained on LunarLander v3. The selected model supports a model-based A2C policy that outperforms a model-free A2C baseline by approximately 24.5 return points using about 65 times fewer real-environment interactions and also drives a strong zero-shot CEM-MPC policy.
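To make "the reward predictor's dependence on the observable subspace" concrete, here is a minimal sketch of one hypothetical reading, not the paper's definition: the share of reward-prediction variance that survives when only a designated set of latent dimensions is kept and every other dimension is replaced by its mean. The linear reward head, the index set `observable_dims`, and the function names are all assumptions made for illustration.

```python
import statistics

def reward_head(z, weights):
    """Hypothetical linear reward predictor r = w . z (an assumption)."""
    return sum(w * x for w, x in zip(weights, z))

def reward_observability_fraction(latents, weights, observable_dims):
    """Illustrative ROF-like ratio, clamped to [0, 1].

    Computes 1 - Var(r_full - r_obs) / Var(r_full), where r_obs is the reward
    predicted after mean-ablating every latent dimension that is not listed
    in observable_dims.
    """
    dim = len(weights)
    means = [statistics.fmean([z[i] for z in latents]) for i in range(dim)]
    keep = set(observable_dims)
    full, residual = [], []
    for z in latents:
        ablated = [z[i] if i in keep else means[i] for i in range(dim)]
        r_full = reward_head(z, weights)
        r_obs = reward_head(ablated, weights)
        full.append(r_full)
        residual.append(r_full - r_obs)
    var_full = statistics.pvariance(full)
    if var_full == 0.0:
        return 0.0  # a constant reward head carries no usable signal
    frac = 1.0 - statistics.pvariance(residual) / var_full
    return max(0.0, min(1.0, frac))
```

Under this reading, a reward head that reads only the kept dimensions scores 1, and one that reads only the ablated dimensions scores 0. The paper's actual ROF may be defined quite differently.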

What carries the argument

The Composite Reward Observability Fraction (CROF), a single-number offline checkpoint selection score that aggregates the Reward Observability Fraction with three structural regularizers.

If this is right

The CROF-selected world model supports strong zero-shot CEM-MPC performance in LunarLander.
A model-based A2C policy trained from the CROF model achieves higher returns than model-free A2C with 65 times fewer environment interactions.
Validation loss and multi-step prediction error are not reliable indicators of closed-loop quality for these world models.
Structural diagnostics from optimal control theory can be used to rank checkpoints without running policies or MPC.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the approach generalizes, it could reduce the data and compute needed to deploy model-based methods in other control tasks.
The emphasis on reward observability may point to a broader principle for designing latent representations in planning algorithms.
Testing CROF on different reward structures could show whether the metric captures a fundamental property of useful world models.

Load-bearing premise

The Reward Observability Fraction and the three structural regularizers remain predictive of closed-loop performance when the environment, reward shaping, or world-model architecture changes from the LunarLander v3 RSSM setup used to tune them.

What would settle it

Computing CROF scores for checkpoints from world models trained on a different environment with shaped rewards and then measuring whether the highest-CROF model indeed gives the best closed-loop returns would test the claim.
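The test described above can be sketched as two numbers per environment: a rank correlation between the offline score and the closed-loop return, and the regret of the top-scored checkpoint. The helper names are hypothetical and the values in the usage note are made-up placeholders, not results from the paper.

```python
def ranks(values):
    """1-based ranks; assumes no tied values for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation via the no-ties formula (needs n >= 2)."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def selection_regret(scores, returns):
    """Best achievable return minus the return of the top-scored checkpoint."""
    chosen = max(range(len(scores)), key=lambda i: scores[i])
    return max(returns) - returns[chosen]

# Placeholder values for four checkpoints (illustrative only).
offline_scores = [0.2, 0.5, 0.9, 0.7]
closed_loop_returns = [-120.0, 40.0, 210.0, 180.0]
rho = spearman(offline_scores, closed_loop_returns)
regret = selection_regret(offline_scores, closed_loop_returns)
```

A regret near zero on a new environment would support the claim even if the rank correlation is imperfect, since checkpoint selection only needs the top of the ranking to be right.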

Figures

Figures reproduced from arXiv: 2607.01736 by Nikolai Smolyanskiy.

**Figure 2.** CEM-MPC performance across 500 epochs of world-model training (20 test episodes per …

**Figure 3.** ROF analysis. Top-left: smoothed MPC return (blue) vs. inverted …

**Figure 4.** Reacher contrast. Smoothed MPC return (blue) and inverted …

**Figure 5.** CROF composite score over training. Left: smoothed MPC return (blue) vs. smoothed …

**Figure 6.** Per-metric correlations with smoothed (MA-7) MPC mean return for all 40 metrics in …

**Figure 7.** Metrics vs. MPC dashboard. Standard metrics (validation losses, open-loop RMSEs, …

Original abstract

We study how to predict the downstream closed-loop performance of a learned latent world model from validation-time diagnostics alone. Choosing the right checkpoint from a world-model training run is difficult: validation loss and multi-step prediction RMSE keep improving long after closed-loop performance has collapsed. We present a suite of structural validation-time diagnostics drawn from optimal-control theory and apply them to Gymnasium's LunarLander v3, which features shaped rewards. We train an RSSM [5, 4] world model on it and treat per-checkpoint CEM-MPC return as the oracle for closed-loop quality. By evaluating 40 metrics against this oracle, we find that the strongest single predictor is the Reward Observability Fraction (ROF), which measures the reward predictor's dependence on the observable subspace. We combine ROF with three structural regularizers into a single-number offline checkpoint selection score, the Composite Reward Observability Fraction (CROF). The CROF-selected world model trains a model-based A2C policy that beats a fairly evaluated model-free A2C baseline by ~24.5 return points while using ~65x fewer real-environment interactions, and the same world model also drives a strong zero-shot CEM-MPC policy. Code and data: https://github.com/nsmoly/LunarLander_RSSM.
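The screening step in the abstract, scoring each metric against the oracle, can be sketched as a correlation ranking. The window of 7 follows the "MA-7" label in the Figure 6 caption; the centered window with shrinking edges, the use of Pearson correlation, and the function names are assumptions, since this page does not state how the paper computes them.

```python
import math

def moving_average(values, window=7):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def rank_metrics(metrics, oracle_returns, window=7):
    """Sort (name, r) pairs by |r| against the smoothed oracle, strongest first.

    metrics: dict mapping a metric name to its per-checkpoint series.
    oracle_returns: per-checkpoint closed-loop return, same length.
    """
    smoothed = moving_average(oracle_returns, window)
    table = {name: pearson(series, smoothed) for name, series in metrics.items()}
    return sorted(table.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Ranking by absolute value keeps metrics that are strongly anti-correlated with return, which matters when a figure plots an inverted metric against the oracle.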

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper introduces ROF and CROF as offline diagnostics that pick better LunarLander world-model checkpoints than standard losses, with released code and a downstream gain of roughly 24.5 return points.

The main point is that they define a Reward Observability Fraction that tracks how much the reward head relies on observable state, then fold it with three regularizers into CROF. This score picks RSSM checkpoints whose CEM-MPC oracle returns are higher, and the chosen model later produces a model-based A2C policy that beats a model-free A2C baseline by roughly 24.5 points while using about 65 times fewer real steps; the same checkpoint also runs strong zero-shot CEM-MPC.

They evaluate 40 metrics against the oracle on the same LunarLander v3 runs, so the comparison is direct. The repo supplies code and data, which lets anyone reproduce the checkpoint selection and the policy training.

The control-theoretic regularizers give the metric a bit more structure than pure prediction error, and the result addresses a practical bottleneck where validation loss stops being informative.

The obvious limit is that everything was developed and validated inside one environment with shaped rewards and one RSSM architecture. The weights and the three regularizers were tuned on these checkpoints, so it is not yet clear whether ROF or CROF will stay predictive on other tasks or models. There is also moderate circularity because ROF is computed from the reward predictor that was already fit during world-model training.

This is worth a reading group for anyone who trains latent world models and has to decide which checkpoint to keep. It has enough concrete evidence and released artifacts to deserve peer review, though the author would need to demonstrate the metric on at least one more domain before the claim can be treated as general.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard validation losses fail to predict closed-loop performance of RSSM world models on LunarLander v3 (shaped rewards). It introduces the Reward Observability Fraction (ROF), which measures the reward predictor's dependence on the observable subspace, as the strongest single predictor among 40 metrics when correlated against a CEM-MPC oracle. Combining ROF with three structural regularizers yields the Composite ROF (CROF) score for offline checkpoint selection. The CROF-selected model yields a model-based A2C policy that outperforms a model-free A2C baseline by ~24.5 return points while using ~65× fewer environment steps, and it also supports strong zero-shot CEM-MPC control. Code and data are released.

Significance. If the empirical result holds, the work supplies a concrete, reproducible protocol for selecting world-model checkpoints without expensive closed-loop rollouts. The explicit comparison to a fairly evaluated model-free baseline, the zero-shot MPC result, and the public repository are strengths. The approach is scoped to one environment and architecture, so its broader utility depends on whether ROF and the regularizers remain predictive under changes in reward shaping or dynamics.

major comments (2)

[Abstract, §3] Abstract and methods: ROF is computed directly from the reward predictor evaluated on the same validation trajectories used during world-model training. This creates a moderate dependence between the selection score and quantities already fitted in training; the manuscript should report whether an independent held-out set or cross-validation was used when computing the oracle correlations and the final 24.5-point gap.
[§4, Table 2] Evaluation protocol: The claim that CROF is the strongest predictor rests on screening 40 metrics against the CEM-MPC oracle on the identical checkpoint set. The paper should state whether the metric suite (including ROF) was fixed before seeing the correlations or whether the strongest metric was identified post-hoc; otherwise the reported superiority of CROF risks being an in-sample selection effect.
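The in-sample selection concern in the second comment can be illustrated with a small simulation: screen 40 pure-noise metrics against a pure-noise "oracle" and record the largest absolute correlation found. Everything here is synthetic. The metric count of 40 comes from the paper, while the checkpoint count of 50, the Gaussian noise, and the function names are arbitrary assumptions, so the sketch says nothing about the paper's actual data.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def max_abs_corr_under_null(n_metrics=40, n_checkpoints=50, trials=200, seed=0):
    """For each trial, return the best |r| among metrics unrelated to the oracle."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(trials):
        oracle = [rng.gauss(0.0, 1.0) for _ in range(n_checkpoints)]
        best = 0.0
        for _ in range(n_metrics):
            metric = [rng.gauss(0.0, 1.0) for _ in range(n_checkpoints)]
            best = max(best, abs(pearson(metric, oracle)))
        maxima.append(best)
    return maxima

maxima = max_abs_corr_under_null()
typical_best = sum(maxima) / len(maxima)  # compare against the winner's |r|
```

If the winning metric's observed correlation sits far above `typical_best` from a null matched to the real checkpoint count and smoothing, post-hoc selection alone is an unlikely explanation. Smoothed or autocorrelated series would need a null that preserves that structure.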

minor comments (2)

[§3.2] Notation: Define the precise mathematical expression for ROF (e.g., the fraction of reward variance explained by the observable latent dimensions) in the main text rather than only in the appendix.
[Figure 3] Figure clarity: The caption for the correlation plot should explicitly state the number of checkpoints and whether error bars reflect multiple random seeds or bootstrap resampling.
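The second minor comment asks whether error bars come from multiple seeds or from bootstrap resampling. For reference, a percentile bootstrap over checkpoints could look like the sketch below. This is the generic technique, not a description of what the paper did, and the function names are hypothetical. A correlation function can be passed as `statistic`; the example statistic here is a simple mean so the block stays short.

```python
import random

def bootstrap_ci(pairs, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(pairs).

    pairs: list of (metric_value, oracle_return) tuples, one per checkpoint.
    statistic: callable mapping a list of pairs to a float.
    """
    rng = random.Random(seed)
    n = len(pairs)
    draws = []
    for _ in range(n_boot):
        # Resample checkpoints with replacement, keeping each pair together.
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        draws.append(statistic(resample))
    draws.sort()
    lo = draws[int((alpha / 2) * n_boot)]
    hi = draws[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

def mean_return(pairs):
    """Example statistic: mean of the second element of each pair."""
    return sum(y for _, y in pairs) / len(pairs)
```

Resampling individual checkpoints treats them as independent. Neighbouring checkpoints from one training run are usually correlated, so a block bootstrap or resampling over seeds would give more honest intervals.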

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address the two major comments below regarding evaluation protocol and potential selection effects. Both points can be clarified with revisions to the manuscript.

Referee: [Abstract, §3] Abstract and methods: ROF is computed directly from the reward predictor evaluated on the same validation trajectories used during world-model training. This creates a moderate dependence between the selection score and quantities already fitted in training; the manuscript should report whether an independent held-out set or cross-validation was used when computing the oracle correlations and the final 24.5-point gap.

Authors: We acknowledge the dependence: ROF is evaluated on the validation trajectories from world-model training. However, the CEM-MPC oracle returns are generated via closed-loop rollouts in the environment and are independent of the validation loss. The 24.5-point gap arises from training and evaluating a model-based A2C policy in the real environment using the CROF-selected checkpoint. No separate held-out set or cross-validation was employed for the reported correlations; the validation split is the one used throughout training. We will revise §3 and the evaluation section to explicitly state this protocol and the separation between diagnostics and the closed-loop oracle. revision: yes
Referee: [§4, Table 2] Evaluation protocol: The claim that CROF is the strongest predictor rests on screening 40 metrics against the CEM-MPC oracle on the identical checkpoint set. The paper should state whether the metric suite (including ROF) was fixed before seeing the correlations or whether the strongest metric was identified post-hoc; otherwise the reported superiority of CROF risks being an in-sample selection effect.

Authors: The 40 metrics were drawn from optimal-control theory and RSSM structural properties (detailed in §3) and fixed prior to any correlation analysis. ROF was included based on the reward-observability hypothesis before screening. The screening ranked the pre-specified suite; CROF combines ROF with three regularizers also motivated a priori. We will add an explicit statement in §4 confirming the metric families were defined before computing correlations to address the in-sample selection concern. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper reports an empirical correlation study: 40 validation metrics (including ROF computed from the already-trained reward predictor) are evaluated against a CEM-MPC oracle on the same set of checkpoints, the strongest correlates are combined into CROF, and the selected checkpoint is then used for downstream policy training and zero-shot MPC. No equation or definition in the provided material shows a prediction that reduces to its inputs by construction, no self-citation chain is load-bearing for the central claim, and the result is scoped to concrete LunarLander experiments with supplied code and data. The derivation is therefore self-contained as a data-driven selection procedure rather than a tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that CEM-MPC return on LunarLander constitutes a reliable oracle for closed-loop quality and that the 40 evaluated metrics were chosen without post-hoc bias toward the reported winner.

axioms (1)

Domain assumption: CEM-MPC return computed on the same environment is a faithful proxy for downstream closed-loop performance of any policy that uses the world model.
The paper treats per-checkpoint CEM-MPC return as the ground-truth oracle against which all 40 validation metrics are scored.
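For readers unfamiliar with the oracle, the cross-entropy method (CEM) behind CEM-MPC can be sketched on a toy problem. The one-dimensional dynamics `x' = x + a`, the quadratic reward, and every hyperparameter below are assumptions chosen for brevity. In the paper the rollouts come from the learned RSSM and its reward head, and the planner settings are not given on this page.

```python
import random

def rollout_return(x, actions, target):
    """Toy stand-in for a world model: x' = x + a, reward = -(x' - target)^2."""
    total = 0.0
    for a in actions:
        x = x + a
        total -= (x - target) ** 2
    return total

def cem_plan(x0, target, horizon=5, pop=64, elites=8, iters=5, seed=0):
    """Return the mean action sequence after a few CEM refinement rounds."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        candidates = []
        for _ in range(pop):
            # Sample an action sequence and clip each action to [-1, 1].
            plan = [max(-1.0, min(1.0, rng.gauss(m, s))) for m, s in zip(mean, std)]
            candidates.append((rollout_return(x0, plan, target), plan))
        candidates.sort(key=lambda c: c[0], reverse=True)
        top = [plan for _, plan in candidates[:elites]]
        # Refit a diagonal Gaussian to the elite plans.
        for t in range(horizon):
            col = [p[t] for p in top]
            mean[t] = sum(col) / elites
            var = sum((v - mean[t]) ** 2 for v in col) / elites
            std[t] = max(var ** 0.5, 1e-3)
    return mean

plan = cem_plan(x0=0.0, target=5.0)
first_action = plan[0]  # MPC executes only this action, then replans
```

Because the planner optimizes against the model's own predicted rewards, its closed-loop return reflects model errors that open-loop prediction losses can miss, which is the property the axiom relies on.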

invented entities (1)

Reward Observability Fraction (ROF): no independent evidence
purpose: Quantifies dependence of the learned reward predictor on the observable subspace of the latent state.
New scalar diagnostic introduced to capture reward observability; no independent falsifiable prediction outside the LunarLander experiments is provided.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

[1] Rishabh Agarwal, Marlos C. Machado, Pablo Samuel Castro, and Marc G. Bellemare. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. In International Conference on Learning Representations (ICLR), 2021.

[2] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.

[3] Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G. Bellemare. DeepMDP: Learning continuous latent space models for representation learning. In International Conference on Machine Learning (ICML), pages 2170–2179. PMLR, 2019.

[4] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.

[5] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning (ICML), pages 2555–2565. PMLR, 2019.

[6] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.

[7] Rudolf E. Kalman. Mathematical description of linear dynamical systems. Journal of the Society for Industrial and Applied Mathematics, Series A: Control, 1(2):152–192, 1963.

[8] Nathan Lambert, Brandon Amos, Omry Yadan, and Roberto Calandra. Objective mismatch in model-based reinforcement learning. In Proceedings of the 2nd Conference on Learning for Dynamics and Control (L4DC), pages 761–770. PMLR, 2020.

[9] Bruce C. Moore. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Transactions on Automatic Control, 26(1):17–32, 1981.

[10] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), pages 278–287, 1999.

[11] Reuven Y. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1(2):127–190, 1999.

[12] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[13] David Sussillo and Omri Barak. Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Computation, 25(3):626–649, 2013.

[14] Marvin Zhang, Sharad Vikram, Laura Smith, Pieter Abbeel, Matthew J. Johnson, and Sergey Levine. SOLAR: Deep structured representations for model-based reinforcement learning. In International Conference on Machine Learning (ICML), pages 7444–7453. PMLR, 2019.
