
arxiv: 2607.01736 · v1 · pith:FLY3QCTInew · submitted 2026-07-02 · 💻 cs.LG · cs.AI · cs.SY · eess.SY

Predicting Closed-Loop Performance of Latent World Models: Offline Checkpoint Selection for MPC and Model-Based RL Under Non-Markovian Rewards in LunarLander

Pith reviewed 2026-07-03 17:41 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.SY · eess.SY
keywords latent world models · checkpoint selection · model-based RL · MPC · reward observability · LunarLander · offline diagnostics · RSSM

The pith

A composite score based on reward observability selects world model checkpoints that support superior model-based control in LunarLander.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard validation losses keep improving even after a latent world model stops being useful for control in an environment with shaped rewards. It identifies the Reward Observability Fraction (ROF) as the strongest single predictor of closed-loop performance among 40 candidate metrics and combines it with three structural regularizers to form the Composite Reward Observability Fraction (CROF) score. This score allows choosing good checkpoints without closed-loop testing or extra environment interactions. The chosen model then trains a model-based A2C policy that beats the model-free A2C baseline by about 24.5 return points while using about 65 times fewer real-environment interactions, and it also drives a strong zero-shot CEM-MPC policy.

Core claim

The Reward Observability Fraction, which measures the reward predictor's dependence on the observable subspace, is the strongest single predictor of downstream CEM-MPC return. When combined with three structural regularizers into the Composite Reward Observability Fraction, it enables reliable offline checkpoint selection for an RSSM world model trained on LunarLander v3. The selected model supports a model-based A2C policy that outperforms a model-free A2C baseline by approximately 24.5 return points using about 65 times fewer real-environment interactions and also drives a strong zero-shot CEM-MPC policy.

What carries the argument

The Composite Reward Observability Fraction (CROF), a single-number offline checkpoint selection score that aggregates the Reward Observability Fraction with three structural regularizers.

If this is right

  • The CROF-selected world model supports strong zero-shot CEM-MPC performance in LunarLander.
  • A model-based A2C policy trained from the CROF-selected model achieves higher returns than model-free A2C (by about 24.5 return points) with about 65 times fewer real-environment interactions.
  • Validation loss and multi-step prediction error are not reliable indicators of closed-loop quality for these world models.
  • Structural diagnostics from optimal control theory can be used to rank checkpoints without running policies or MPC.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the approach generalizes, it could reduce the data and compute needed to deploy model-based methods in other control tasks.
  • The emphasis on reward observability may point to a broader principle for designing latent representations in planning algorithms.
  • Testing CROF on different reward structures could show whether the metric captures a fundamental property of useful world models.

Load-bearing premise

The Reward Observability Fraction and the three structural regularizers remain predictive of closed-loop performance when the environment, reward shaping, or world-model architecture changes from the LunarLander v3 RSSM setup used to tune them.

What would settle it

Computing CROF scores for checkpoints from world models trained on a different environment with shaped rewards and then measuring whether the highest-CROF model indeed gives the best closed-loop returns would test the claim.
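
The test described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes a higher offline score means a better checkpoint (the sign convention of CROF is not stated here, so negate the score if lower is better), and the function names and all numbers are hypothetical.

```python
def select_checkpoint(offline_scores):
    """Return the checkpoint id with the highest offline score."""
    return max(offline_scores, key=offline_scores.get)


def selection_regret(offline_scores, oracle_returns):
    """Oracle return lost by trusting the offline score instead of the oracle."""
    chosen = select_checkpoint(offline_scores)
    best = max(oracle_returns.values())
    return chosen, best - oracle_returns[chosen]


# Made-up numbers for illustration only; these are not results from the paper.
offline = {"epoch_100": 0.41, "epoch_200": 0.77, "epoch_300": 0.52}
oracle = {"epoch_100": 120.0, "epoch_200": 210.0, "epoch_300": 190.0}
chosen, regret = selection_regret(offline, oracle)
```

A regret near zero on a new environment would support the claim; a large regret would indicate that the score does not transfer.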

Figures

Figures reproduced from arXiv: 2607.01736 by Nikolai Smolyanskiy.

Figure 1: LunarLander-v3: four discrete actions (NOP, left/main/right thruster), 8-D observation, ...
Figure 2: CEM-MPC performance across 500 epochs of world-model training (20 test episodes per ...
Figure 3: ROF analysis. Top-left: smoothed MPC return (blue) vs. inverted ...
Figure 4: Reacher contrast. Smoothed MPC return (blue) and inverted ...
Figure 5: CROF composite score over training. Left: smoothed MPC return (blue) vs. smoothed ...
Figure 6: Per-metric correlations with smoothed (MA-7) MPC mean return for all 40 metrics in ...
Figure 7: Metrics vs. MPC dashboard. Standard metrics (validation losses, open-loop RMSEs, ...
Original abstract

We study how to predict the downstream closed-loop performance of a learned latent world model from validation-time diagnostics alone. Choosing the right checkpoint from a world-model training run is difficult: validation loss and multi-step prediction RMSE keep improving long after closed-loop performance has collapsed. We present a suite of structural validation-time diagnostics drawn from optimal-control theory and apply them to Gymnasium's LunarLander v3, which features shaped rewards. We train an RSSM [5, 4] world model on it and treat per checkpoint CEM-MPC return as the oracle for closed-loop quality. By evaluating 40 metrics against this oracle, we find that the strongest single predictor is the Reward Observability Fraction (ROF), which measures the reward predictor's dependence on the observable subspace. We combine ROF with three structural regularizers into a single-number offline checkpoint selection score, the Composite Reward Observability Fraction (CROF). The CROF-selected world model trains a model-based A2C policy that beats a fairly evaluated model-free A2C baseline by ~24.5 return points while using ~65x fewer real-environment interactions, and the same world model also drives a strong zero-shot CEM-MPC policy. Code and data: https://github.com/nsmoly/LunarLander_RSSM.
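
The metric-screening step in the abstract (scoring each validation-time metric against the per-checkpoint MPC return) could look roughly like the sketch below. It is an assumption-laden illustration: Spearman rank correlation and a trailing 7-point moving average are chosen here for concreteness (the Figure 6 caption mentions MA-7 smoothing, but the exact window alignment and correlation coefficient used in the paper are not stated on this page), and all function names are hypothetical.

```python
from statistics import mean


def moving_average(values, window=7):
    """Trailing moving average; early points average over what is available."""
    return [mean(values[max(0, i - window + 1):i + 1]) for i in range(len(values))]


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    s = sorted(values)
    return [(s.index(x) + 1 + len(s) - s[::-1].index(x)) / 2 for x in values]


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no rank information
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def screen_metrics(metrics, mpc_returns, window=7):
    """Sort metric names by |Spearman| against the smoothed MPC return."""
    target = moving_average(mpc_returns, window)
    scores = {name: spearman(series, target) for name, series in metrics.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Here `metrics` maps a metric name to its per-checkpoint values and `mpc_returns` holds the oracle return for the same checkpoints in the same order. Sorting by absolute value keeps metrics that anti-correlate with return (for example, losses) in contention.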

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard validation losses fail to predict closed-loop performance of RSSM world models on LunarLander v3 (shaped rewards). It introduces the Reward Observability Fraction (ROF) — the reward predictor's dependence on the observable subspace — as the strongest single predictor among 40 metrics when correlated against a CEM-MPC oracle. Combining ROF with three structural regularizers yields the Composite ROF (CROF) score for offline checkpoint selection. The CROF-selected model yields a model-based A2C policy that outperforms a model-free A2C baseline by ~24.5 return while using ~65× fewer environment steps and also supports strong zero-shot CEM-MPC control. Code and data are released.

Significance. If the empirical result holds, the work supplies a concrete, reproducible protocol for selecting world-model checkpoints without expensive closed-loop rollouts. The explicit comparison to a fairly evaluated model-free baseline, the zero-shot MPC result, and the public repository are strengths. The approach is scoped to one environment and architecture, so its broader utility depends on whether ROF and the regularizers remain predictive under changes in reward shaping or dynamics.

major comments (2)
  1. [Abstract, §3] Abstract and methods: ROF is computed directly from the reward predictor evaluated on the same validation trajectories used during world-model training. This creates a moderate dependence between the selection score and quantities already fitted in training; the manuscript should report whether an independent held-out set or cross-validation was used when computing the oracle correlations and the final 24.5-point gap.
  2. [§4, Table 2] Evaluation protocol: The claim that CROF is the strongest predictor rests on screening 40 metrics against the CEM-MPC oracle on the identical checkpoint set. The paper should state whether the metric suite (including ROF) was fixed before seeing the correlations or whether the strongest metric was identified post-hoc; otherwise the reported superiority of CROF risks being an in-sample selection effect.
minor comments (2)
  1. [§3.2] Notation: Define the precise mathematical expression for ROF (e.g., the fraction of reward variance explained by the observable latent dimensions) in the main text rather than only in the appendix.
  2. [Figure 3] Figure clarity: The caption for the correlation plot should explicitly state the number of checkpoints and whether error bars reflect multiple random seeds or bootstrap resampling.
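
One way to probe the in-sample selection concern raised in major comment 2 is a max-statistic permutation test: shuffle the oracle returns, recompute the best absolute correlation over the whole metric suite, and see how often chance alone matches the observed winner. The sketch below is a hypothetical illustration, not an analysis from the paper; it uses Spearman correlation as an assumption, and plain shuffling ignores the autocorrelation between consecutive checkpoints of one training run, so a block permutation would likely be more appropriate in practice.

```python
import random
from statistics import mean


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    s = sorted(values)
    return [(s.index(x) + 1 + len(s) - s[::-1].index(x)) / 2 for x in values]


def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def max_abs_corr(metrics, target):
    """Best absolute correlation achieved by any metric in the suite."""
    return max(abs(spearman(series, target)) for series in metrics.values())


def max_stat_pvalue(metrics, target, n_perm=500, seed=0):
    """Share of shuffles whose best metric matches or beats the observed best."""
    rng = random.Random(seed)
    observed = max_abs_corr(metrics, target)
    shuffled = list(target)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_abs_corr(metrics, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero
```

Because the maximum is taken over all metrics inside each shuffle, the resulting p-value accounts for having screened the full suite rather than a single pre-chosen metric.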

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address the two major comments below regarding evaluation protocol and potential selection effects. Both points can be clarified with revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and methods: ROF is computed directly from the reward predictor evaluated on the same validation trajectories used during world-model training. This creates a moderate dependence between the selection score and quantities already fitted in training; the manuscript should report whether an independent held-out set or cross-validation was used when computing the oracle correlations and the final 24.5-point gap.

    Authors: We acknowledge the dependence: ROF is evaluated on the validation trajectories from world-model training. However, the CEM-MPC oracle returns are generated via closed-loop rollouts in the environment and are independent of the validation loss. The 24.5-point gap arises from training and evaluating a model-based A2C policy in the real environment using the CROF-selected checkpoint. No separate held-out set or cross-validation was employed for the reported correlations; the validation split is the one used throughout training. We will revise §3 and the evaluation section to explicitly state this protocol and the separation between diagnostics and the closed-loop oracle.
    Revision: yes

  2. Referee: [§4, Table 2] Evaluation protocol: The claim that CROF is the strongest predictor rests on screening 40 metrics against the CEM-MPC oracle on the identical checkpoint set. The paper should state whether the metric suite (including ROF) was fixed before seeing the correlations or whether the strongest metric was identified post-hoc; otherwise the reported superiority of CROF risks being an in-sample selection effect.

    Authors: The 40 metrics were drawn from optimal-control theory and RSSM structural properties (detailed in §3) and fixed prior to any correlation analysis. ROF was included based on the reward-observability hypothesis before screening. The screening ranked the pre-specified suite; CROF combines ROF with three regularizers also motivated a priori. We will add an explicit statement in §4 confirming the metric families were defined before computing correlations to address the in-sample selection concern.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper reports an empirical correlation study: 40 validation metrics (including ROF computed from the already-trained reward predictor) are evaluated against a CEM-MPC oracle on the same set of checkpoints, the strongest correlates are combined into CROF, and the selected checkpoint is then used for downstream policy training and zero-shot MPC. No equation or definition in the provided material shows a prediction that reduces to its inputs by construction, no self-citation chain is load-bearing for the central claim, and the result is scoped to concrete LunarLander experiments with supplied code and data. The derivation is therefore self-contained as a data-driven selection procedure rather than a tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that CEM-MPC return on LunarLander constitutes a reliable oracle for closed-loop quality and that the 40 evaluated metrics were chosen without post-hoc bias toward the reported winner.
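
For readers unfamiliar with the oracle, the sketch below shows the general shape of a cross-entropy-method (CEM) planner over discrete action sequences, of the kind used in receding-horizon MPC. It is a generic illustration under stated assumptions, not the paper's planner: the per-step categorical parameterization, all hyperparameters, and the toy `toy_step` model (standing in for a learned world model's one-step prediction) are hypothetical.

```python
import random


def cem_plan(step, state, n_actions, horizon=5, n_samples=64, n_elite=8,
             n_iters=4, smoothing=0.1, seed=0):
    """Return the first action of a CEM-optimized discrete action sequence.

    step(state, action) -> (next_state, reward) is the model being planned in.
    """
    rng = random.Random(seed)
    actions = list(range(n_actions))
    # One categorical distribution per time step, initially uniform.
    probs = [[1.0 / n_actions] * n_actions for _ in range(horizon)]
    for _ in range(n_iters):
        scored = []
        for _ in range(n_samples):
            seq = [rng.choices(actions, weights=probs[t])[0] for t in range(horizon)]
            s, total = state, 0.0
            for a in seq:
                s, r = step(s, a)
                total += r
            scored.append((total, seq))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        elites = [seq for _, seq in scored[:n_elite]]
        # Refit each step's distribution to elite frequencies, with smoothing
        # so that no action's probability collapses to exactly zero.
        for t in range(horizon):
            for a in actions:
                freq = sum(1 for seq in elites if seq[t] == a) / len(elites)
                probs[t][a] = (1 - smoothing) * freq + smoothing / n_actions
    return max(actions, key=lambda a: probs[0][a])


def toy_step(s, a):
    """Hypothetical 1-D model: actions 0/1/2 move left/stay/right; goal is s == 3."""
    s2 = s + (a - 1)
    return s2, -abs(s2 - 3)


first_action = cem_plan(toy_step, state=0, n_actions=3)
```

In an MPC loop, only `first_action` is executed, and the planner is rerun from the next state. The oracle's faithfulness then depends on how well planning inside the model transfers to the real environment, which is the assumption the ledger records.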

axioms (1)
  • domain assumption: CEM-MPC return computed on the same environment is a faithful proxy for downstream closed-loop performance of any policy that uses the world model.
    The paper treats per-checkpoint CEM-MPC return as the ground-truth oracle against which all 40 validation metrics are scored.
invented entities (1)
  • Reward Observability Fraction (ROF) · no independent evidence
    purpose: Quantifies dependence of the learned reward predictor on the observable subspace of the latent state.
    New scalar diagnostic introduced to capture reward observability; no independent falsifiable prediction outside the LunarLander experiments is provided.
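
To make the idea of a reward-observability fraction concrete, here is one possible reading, following the referee's example of "the fraction of reward variance explained by the observable latent dimensions". This is a hypothetical sketch and not the paper's definition of ROF: it assumes a linear reward head, uncorrelated latent dimensions (so per-dimension variances add), and a given set of indices deemed observable. How the paper identifies the observable subspace is not described on this page.

```python
from statistics import pvariance


def reward_observability_fraction(weights, latents, observable):
    """Share of linear-reward variance carried by latent dims flagged observable.

    weights:    reward-head weight per latent dimension (hypothetical linear head)
    latents:    list of latent vectors collected on validation data
    observable: set of dimension indices assumed to be observable
    """
    dims = range(len(weights))
    var = [pvariance([z[i] for z in latents]) for i in dims]
    # Under the uncorrelated-latents assumption, dimension i contributes
    # w_i^2 * Var(z_i) to the variance of the predicted reward.
    contrib = [weights[i] ** 2 * var[i] for i in dims]
    total = sum(contrib)
    if total == 0:
        return 0.0
    return sum(contrib[i] for i in observable) / total
```

Under this reading, a value near 1 means the reward predictor leans almost entirely on observable dimensions, while a value near 0 means it relies on latent directions outside the observable set.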

pith-pipeline@v0.9.1-grok · 5781 in / 1418 out tokens · 21240 ms · 2026-07-03T17:41:46.799471+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

  [1] Rishabh Agarwal, Marlos C. Machado, Pablo Samuel Castro, and Marc G. Bellemare. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. In International Conference on Learning Representations (ICLR), 2021.

  [2] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.

  [3] Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G. Bellemare. DeepMDP: Learning continuous latent space models for representation learning. In International Conference on Machine Learning (ICML), pages 2170–2179. PMLR, 2019.

  [4] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.

  [5] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning (ICML), pages 2555–2565. PMLR, 2019.

  [6] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.

  [7] Rudolf E. Kalman. Mathematical description of linear dynamical systems. Journal of the Society for Industrial and Applied Mathematics, Series A: Control, 1(2):152–192, 1963.

  [8] Nathan Lambert, Brandon Amos, Omry Yadan, and Roberto Calandra. Objective mismatch in model-based reinforcement learning. In Proceedings of the 2nd Conference on Learning for Dynamics and Control (L4DC), pages 761–770. PMLR, 2020.

  [9] Bruce C. Moore. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Transactions on Automatic Control, 26(1):17–32, 1981.

  [10] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), pages 278–287, 1999.

  [11] Reuven Y. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1(2):127–190, 1999.

  [12] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

  [13] David Sussillo and Omri Barak. Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Computation, 25(3):626–649, 2013.

  [14] Marvin Zhang, Sharad Vikram, Laura Smith, Pieter Abbeel, Matthew J. Johnson, and Sergey Levine. SOLAR: Deep structured representations for model-based reinforcement learning. In International Conference on Machine Learning (ICML), pages 7444–7453. PMLR, 2019.