ScratchWorld benchmark finds that language models achieve at most 13.8% value-aware changed-field F1 on replay-verified Scratch state transitions and frequently ignore executable rules.
hub
World-in-world: World models in a closed-loop world
18 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 18roles
background 4polarities
background 4representative citing papers
SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates memory consistency in video models under disappear-and-reappear in dynamic environments.
WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.
EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.
Target-Bench shows the best off-the-shelf video world model scores only 0.341 on semantic target-approaching and directional consistency, with fine-tuning on a small robot dataset yielding measurable gains.
A framework combining LLM policy interpretation with a physically conserved graph-latent world model and uncertainty-separated learning achieves 33% higher rationale consistency and 82.3% operability on a 10-node semiconductor benchmark under perturbations.
COMAP co-evolves textual world models and agent policies for LLMs through on-policy self-distillation, yielding up to 16.75% relative gains on embodied planning, web navigation, and tool-use tasks.
Sensor2Sensor uses 4D Gaussian Splatting to create synthetic training pairs and a diffusion model to convert monocular dashcam videos into high-fidelity multi-modal AV sensor data.
DALI-R boosts 3D imitation learning success rates by 6.8% on average from suboptimal trajectories via latent imagination and reranking, with under 0.7x inference cost.
Temporal video pretraining induces stronger action-relevant structure in video world model latents than pixel reconstruction, as shown by inverse-dynamics probing across encoder families.
OptiWorld inserts a classical optimal-control layer that extracts a world state, plans an optimal trajectory on a geometric manifold under physical constraints, and renders the video conditioned on that trajectory.
TC-WM converts foundation-model visual embeddings into parsimonious task-sufficient world model latents via linear projection, contrastive physical-state alignment, and embedding reconstruction, with a theoretical identification guarantee.
Robo-Cortex proposes a self-evolving embodied navigation agent using dual-grain cognitive memory and autonomous knowledge induction from trajectories, reporting SPL gains on IGNav, AR, AEQA and preliminary real-robot tests.
WorldArena 2.0 extends embodied world model benchmarks to visuotactile perception, interactive policy training, and diverse real and simulated robotic platforms under a unified protocol.
Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive distillation on a 5B model.
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.
citing papers explorer
-
ScratchWorld: Evaluating If World Models Compute Executable Consequences
ScratchWorld benchmark finds that language models achieve at most 13.8% value-aware changed-field F1 on replay-verified Scratch state transitions and frequently ignore executable rules.
-
Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation
SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
-
MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates memory consistency in video models under disappear-and-reappear in dynamic environments.
-
WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.
-
EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.
-
Target-Bench: Can Video World Models Achieve Mapless Path Planning with Semantic Targets?
Target-Bench shows the best off-the-shelf video world model scores only 0.341 on semantic target-approaching and directional consistency, with fine-tuning on a small robot dataset yielding measurable gains.
-
ReflectiChain: Epistemic Grounding in LLM-Driven World Models for Supply Chain Resilience
A framework combining LLM policy interpretation with a physically conserved graph-latent world model and uncertainty-separated learning achieves 33% higher rationale consistency and 82.3% operability on a 10-node semiconductor benchmark under perturbations.
-
COMAP: Co-Evolving World Models and Agent Policies for LLM Agents
COMAP co-evolves textual world models and agent policies for LLMs through on-policy self-distillation, yielding up to 16.75% relative gains on embodied planning, web navigation, and tool-use tasks.
-
Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
Sensor2Sensor uses 4D Gaussian Splatting to create synthetic training pairs and a diffusion model to convert monocular dashcam videos into high-fidelity multi-modal AV sensor data.
-
Data-Asymmetric Latent Imagination and Reranking for 3D Robotic Imitation Learning
DALI-R boosts 3D imitation learning success rates by 6.8% on average from suboptimal trajectories via latent imagination and reranking, with under 0.7x inference cost.
-
What Makes Video World Model Latents Action-Relevant: Prediction over Reconstruction
Temporal video pretraining induces stronger action-relevant structure in video world model latents than pixel reconstruction, as shown by inverse-dynamics probing across encoder families.
-
OptiWorld: Optimal Control for Video World Generation under Physical Constraints
OptiWorld inserts a classical optimal-control layer that extracts a world state, plans an optimal trajectory on a geometric manifold under physical constraints, and renders the video conditioned on that trajectory.
-
Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations
TC-WM converts foundation-model visual embeddings into parsimonious task-sufficient world model latents via linear projection, contrastive physical-state alignment, and embedding reconstruction, with a theoretical identification guarantee.
-
Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction
Robo-Cortex proposes a self-evolving embodied navigation agent using dual-grain cognitive memory and autonomous knowledge induction from trajectories, reporting SPL gains on IGNav, AR, AEQA and preliminary real-robot tests.
-
WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform
WorldArena 2.0 extends embodied world model benchmarks to visuotactile perception, interactive policy training, and diverse real and simulated robotic platforms under a unified protocol.
-
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models
Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
-
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive distillation on a 5B model.
-
World Model for Robot Learning: A Comprehensive Survey
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.