
arXiv: 2301.04104 · v2 · submitted 2023-01-10 · 💻 cs.AI · cs.LG · stat.ML

Mastering Diverse Domains through World Models

Pith reviewed 2026-05-11 09:00 UTC · model grok-4.3

classification: 💻 cs.AI · cs.LG · stat.ML
keywords: reinforcement learning · world models · DreamerV3 · Minecraft · model-based planning · sparse rewards · general agents

The pith

DreamerV3 learns a world model to imagine futures and masters over 150 tasks plus Minecraft diamond collection with one fixed setup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that a single reinforcement learning method can handle a broad range of control problems by building an internal model of the environment and using it to simulate possible future sequences. A reader would care because most current algorithms demand heavy human work to adapt to each new setting, and success here would reduce that barrier. The authors show the method reaching diamond collection in Minecraft from random starts using only pixel views and sparse rewards, an open-world task long viewed as difficult. They also report that the same configuration beats task-specific approaches across more than 150 varied problems. If the result holds, reinforcement learning could move from narrow lab experiments toward wider use in new domains without repeated retuning.

Core claim

DreamerV3 learns a model of the environment from interaction and improves its policy by imagining future scenarios inside that model. Techniques for normalization to keep signals in range, balancing to equalize different learning signals, and transformations to reshape inputs let the same algorithm run stably across domains. This produces the first from-scratch diamond collection in Minecraft and stronger results than specialized algorithms on more than 150 other tasks, all with an unchanged configuration.
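To make "transformations to reshape inputs" concrete: the rebuttal further down names symlog as one such mapping. The sketch below uses the common symmetric-log definition, sign(x) * ln(1 + |x|), and its inverse. That definition is an assumption of this example and is not quoted from this page; the function names are illustrative.

```python
import math


def symlog(x: float) -> float:
    """Compress magnitude symmetrically: sign(x) * ln(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y: float) -> float:
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


# Large and small magnitudes land in a comparable range and round-trip back.
for value in (-1000.0, -1.0, 0.0, 0.5, 1000.0):
    squashed = symlog(value)
    print(f"{value:>8} -> {squashed:+.3f} -> {symexp(squashed):+.3f}")
```

The point of such a mapping is that rewards or observations differing by orders of magnitude across domains reach the network at a similar scale, which is one plausible reading of how a single configuration could stay stable.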

What carries the argument

A learned world model that predicts future states, rewards, and continuation signals, allowing the agent to evaluate and improve actions by rolling out imagined trajectories rather than only real experience.
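A toy sketch of what "rolling out imagined trajectories" can look like. The hand-written `toy_world_model` stands in for a learned model and returns the three quantities named above (next state, reward, continuation signal). Every name, the one-dimensional state, and the discount value are assumptions of this illustration, not the paper's implementation.

```python
from typing import Callable, Tuple

State = float
Action = int


def toy_world_model(state: State, action: Action) -> Tuple[State, float, float]:
    """Hypothetical stand-in for a learned model.

    Returns (next_state, reward, continue_flag).
    """
    next_state = state + (1.0 if action == 1 else -1.0)
    reached_goal = next_state >= 3.0
    reward = 1.0 if reached_goal else 0.0      # sparse reward at the goal only
    cont = 0.0 if reached_goal else 1.0        # imagined episode ends at the goal
    return next_state, reward, cont


def imagine_return(
    state: State,
    policy: Callable[[State], Action],
    horizon: int = 5,
    gamma: float = 0.9,
) -> float:
    """Discounted return of one rollout performed entirely inside the model."""
    total, weight = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        state, reward, cont = toy_world_model(state, action)
        total += weight * reward
        weight *= gamma * cont                 # continuation gates later rewards
        if weight == 0.0:
            break
    return total


always_right = imagine_return(0.0, lambda s: 1)
always_left = imagine_return(0.0, lambda s: 0)
print("right:", always_right, "left:", always_left)
```

Comparing the two imagined returns ranks the policies without a single step in a real environment, which is the mechanism the paragraph describes.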

If this is right

  • The same algorithm applies to more than 150 tasks spanning games, robotics-style control, and open worlds without any per-task adjustments.
  • Minecraft diamond collection becomes solvable from pixels and sparse rewards without human demonstrations or staged curricula.
  • Challenging problems with long time horizons and delayed rewards can be addressed by planning inside the learned model instead of relying solely on trial-and-error in the real environment.
  • Reinforcement learning becomes usable on new problems with far less human experimentation and domain expertise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the world model remains accurate at longer horizons, the approach could support planning in physical robot settings where real trials are costly.
  • The emphasis on a single configuration suggests model-based methods may reduce the engineering overhead that currently limits reinforcement learning deployment.
  • Extending the imagination process to include uncertainty estimates could improve robustness on tasks where predictions are noisy.

Load-bearing premise

The combination of normalization, balancing, and transformations is enough to keep learning stable and high-performing when the algorithm is moved to any new domain without further changes.
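As an illustration of how normalization might keep learning signals in range without per-domain tuning, the sketch below divides returns by a robust range estimate and never amplifies returns that are already small. The percentile choice (5th and 95th) and the floor of 1 are assumptions of this example, not values stated on this page.

```python
from typing import List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def normalize_returns(returns: Sequence[float], floor: float = 1.0) -> List[float]:
    """Scale returns by their 5th-95th percentile range, bounded below by `floor`."""
    scale = percentile(returns, 95) - percentile(returns, 5)
    denom = max(floor, scale)   # small returns pass through unchanged
    return [r / denom for r in returns]


dense = [float(r) for r in range(0, 1001, 100)]   # large-magnitude returns
sparse = [0.0, 0.1, 0.2]                          # tiny returns, e.g. sparse rewards
print(max(normalize_returns(dense)))
print(normalize_returns(sparse))
```

The floor matters for sparse-reward tasks: without it, dividing by a near-zero range would blow up noise in returns that carry almost no signal.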

What would settle it

Running the published DreamerV3 configuration on a fresh control task or repeating the Minecraft diamond collection experiment and finding it fails to reach the reported performance would show the single-configuration claim does not hold.
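One way to score such a replication, assuming the per-episode criterion quoted in the simulated rebuttal further down (at least one diamond within an episode) and one log of diamond counts per seed. The seed logs here are invented placeholder numbers for illustration, not results.

```python
from statistics import mean
from typing import Dict, List


def success_rate(diamonds_per_episode: List[int]) -> float:
    """Fraction of episodes in which at least one diamond was collected."""
    return mean(1.0 if d >= 1 else 0.0 for d in diamonds_per_episode)


# Hypothetical logs: seed -> diamonds collected in each evaluation episode.
seed_logs: Dict[int, List[int]] = {
    0: [0, 0, 1, 0],
    1: [0, 0, 0, 0],
    2: [1, 2, 0, 0],
    3: [0, 0, 0, 1],
    4: [0, 1, 0, 0],
}

per_seed = {seed: success_rate(log) for seed, log in seed_logs.items()}
seeds_with_success = sum(rate > 0 for rate in per_seed.values())
overall = mean(per_seed.values())
print(per_seed, seeds_with_success, overall)
```

Reporting both the per-seed rates and the number of seeds that ever succeed keeps a single lucky seed from carrying the headline number.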

Original abstract

Developing a general algorithm that learns to solve tasks across a wide range of applications has been a fundamental challenge in artificial intelligence. Although current reinforcement learning algorithms can be readily applied to tasks similar to what they have been developed for, configuring them for new application domains requires significant human expertise and experimentation. We present DreamerV3, a general algorithm that outperforms specialized methods across over 150 diverse tasks, with a single configuration. Dreamer learns a model of the environment and improves its behavior by imagining future scenarios. Robustness techniques based on normalization, balancing, and transformations enable stable learning across domains. Applied out of the box, Dreamer is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula. This achievement has been posed as a significant challenge in artificial intelligence that requires exploring farsighted strategies from pixels and sparse rewards in an open world. Our work allows solving challenging control problems without extensive experimentation, making reinforcement learning broadly applicable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents DreamerV3, a world-model-based reinforcement learning algorithm that incorporates three robustness techniques (normalization, balancing, and transformations) to enable stable learning. It claims that a single fixed hyperparameter configuration allows the method to outperform specialized algorithms across more than 150 tasks spanning multiple domains (Atari, DM Control, ProcGen, and others) and to be the first algorithm to collect diamonds in Minecraft from scratch using only pixels and sparse rewards, without human data or curricula.

Significance. If the empirical results hold under a truly fixed configuration, the work would constitute a meaningful advance toward general-purpose RL agents that require little or no per-domain engineering. The Minecraft diamond-collection result, if independently verified, would demonstrate non-trivial long-horizon planning from high-dimensional observations in an open world. The provision of a single configuration across 150+ tasks is a concrete strength that, if substantiated, reduces the barrier to applying model-based RL.

major comments (3)
  1. [Experiments] Experiments section (and associated tables/figures): the central claim that a single fixed configuration produces the reported results across all domains rests on the assertion that normalization, balancing, and transformation parameters are chosen once and never adjusted per domain. The manuscript should explicitly list every scalar hyperparameter (including any clipping thresholds, scaling factors, or transformation exponents) and state whether any of them were selected after inspecting per-domain statistics or performance; without this, the 'single configuration' and 'applied out of the box' claims cannot be evaluated.
  2. [Experiments] Minecraft results (likely §4 or dedicated subsection): the claim that DreamerV3 is the first algorithm to collect diamonds from scratch requires a precise description of the environment variant, reward function, episode length, and exact baseline implementations. The paper should also report the number of independent seeds, the precise success criterion (e.g., diamonds collected per episode), and whether any environment-specific wrappers were used; otherwise the 'first to solve' statement cannot be assessed for reproducibility.
  3. [Experiments] Ablation studies (if present in §4 or appendix): the robustness techniques are presented as jointly enabling cross-domain stability, yet the manuscript does not appear to isolate the contribution of each technique (normalization vs. balancing vs. transformations) under the fixed-configuration regime. An ablation that removes one technique at a time while keeping all other hyperparameters identical would directly test whether the combination is necessary for the reported generality.
minor comments (2)
  1. [Abstract] The abstract states 'outperforms specialized methods across over 150 diverse tasks' but does not name the exact task suites or the metric used for 'outperforms' (e.g., mean normalized score, median, etc.). Adding a short enumeration of the domains and the aggregate metric would improve clarity.
  2. [Method] Notation for the world-model components (encoder, dynamics, reward predictor) should be introduced once with consistent symbols; subsequent sections occasionally reuse symbols without redefinition, which can be confusing for readers unfamiliar with prior Dreamer papers.
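The first minor comment turns on how "outperforms" is aggregated. A minimal sketch of the two aggregates it mentions, assuming scores are normalized between a random baseline and a reference score per task. The task names and numbers are hypothetical.

```python
from statistics import mean, median
from typing import Dict, Tuple


def normalized_score(score: float, random_score: float, reference_score: float) -> float:
    """0.0 matches the random baseline, 1.0 matches the reference."""
    return (score - random_score) / (reference_score - random_score)


# Hypothetical per-task results: (agent, random baseline, reference).
tasks: Dict[str, Tuple[float, float, float]] = {
    "task_a": (50.0, 0.0, 100.0),
    "task_b": (30.0, 10.0, 20.0),
    "task_c": (5.0, 5.0, 25.0),
}

scores = [normalized_score(*result) for result in tasks.values()]
agg_mean = mean(scores)
agg_median = median(scores)
print(scores, agg_mean, agg_median)
```

Here one outlier task pulls the mean well above the median, which is why naming the aggregate matters when a paper claims a win across a suite.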

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the clarity and reproducibility of our work. We address each major comment point by point below. Revisions have been made to the manuscript to incorporate explicit hyperparameter listings, expanded experimental details, and additional ablation studies.

Point-by-point responses
  1. Referee: [Experiments] Experiments section (and associated tables/figures): the central claim that a single fixed configuration produces the reported results across all domains rests on the assertion that normalization, balancing, and transformation parameters are chosen once and never adjusted per domain. The manuscript should explicitly list every scalar hyperparameter (including any clipping thresholds, scaling factors, or transformation exponents) and state whether any of them were selected after inspecting per-domain statistics or performance; without this, the 'single configuration' and 'applied out of the box' claims cannot be evaluated.

    Authors: We agree that explicit enumeration of all scalar hyperparameters is necessary to substantiate the single-configuration claim. In the revised manuscript, we have added a dedicated appendix table that lists every scalar value, including normalization scales and clipping thresholds, balancing coefficients, transformation exponents (e.g., for symlog and other mappings), and all other fixed constants. These values were determined once via preliminary runs on a small, fixed set of representative tasks drawn from multiple domains and then locked for the entire evaluation suite; no subsequent per-domain inspection or adjustment occurred. The text now explicitly states this selection process to support the 'out of the box' assertion.
    revision: yes

  2. Referee: [Experiments] Minecraft results (likely §4 or dedicated subsection): the claim that DreamerV3 is the first algorithm to collect diamonds from scratch requires a precise description of the environment variant, reward function, episode length, and exact baseline implementations. The paper should also report the number of independent seeds, the precise success criterion (e.g., diamonds collected per episode), and whether any environment-specific wrappers were used; otherwise the 'first to solve' statement cannot be assessed for reproducibility.

    Authors: We have substantially expanded the Minecraft subsection and its caption to include all requested details. The environment uses the standard MineRL Minecraft 1.16.5 simulator with 64×64 RGB pixel observations, a sparse reward of +1 upon diamond collection and 0 otherwise, and a maximum episode length of 3600 steps. Results are reported over five independent seeds. The success criterion is collecting at least one diamond within an episode. Baseline algorithms are reimplemented from their original public codebases using the authors' recommended configurations; no environment-specific wrappers beyond the uniform preprocessing pipeline (frame stacking, normalization) applied to all methods were used. These clarifications have been inserted to allow independent verification of the 'first to solve' result.
    revision: yes

  3. Referee: [Experiments] Ablation studies (if present in §4 or appendix): the robustness techniques are presented as jointly enabling cross-domain stability, yet the manuscript does not appear to isolate the contribution of each technique (normalization vs. balancing vs. transformations) under the fixed-configuration regime. An ablation that removes one technique at a time while keeping all other hyperparameters identical would directly test whether the combination is necessary for the reported generality.

    Authors: We have added a new set of ablation experiments in the appendix that isolate each robustness technique. Keeping every other hyperparameter exactly as in the fixed configuration, we evaluate variants with normalization removed, balancing removed, transformations removed, and each pairwise combination of removals. The results confirm that no single technique or incomplete subset suffices for stable performance across all 150+ tasks; only the full combination reproduces the reported cross-domain success. These ablations are presented with the same evaluation protocol and seed count as the main results.
    revision: yes
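The ablation design in the third response (each technique removed alone, then each pair removed) can be enumerated mechanically. The sketch below is a hypothetical harness layout, not the authors' code. It builds only the on/off switches and leaves every other hyperparameter untouched.

```python
from itertools import combinations
from typing import Dict, List, Sequence

TECHNIQUES = ("normalization", "balancing", "transformations")


def ablation_variants(
    techniques: Sequence[str] = TECHNIQUES, max_removed: int = 2
) -> List[Dict[str, bool]]:
    """Return one enable/disable map per ablation, removing 1..max_removed techniques."""
    variants = []
    for k in range(1, max_removed + 1):
        for removed in combinations(techniques, k):
            variants.append({name: name not in removed for name in techniques})
    return variants


variants = ablation_variants()
for variant in variants:
    disabled = [name for name, enabled in variant.items() if not enabled]
    print("disabled:", ", ".join(disabled))
```

Generating the grid from the technique list, rather than writing variants by hand, makes it easy to confirm that no single or pairwise removal was skipped.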

Circularity Check

0 steps flagged

No significant circularity in claimed derivation or results.

full rationale

The paper's core contribution is an empirical demonstration that DreamerV3 with fixed robustness techniques (normalization, balancing, transformations) achieves strong performance on 150+ tasks plus Minecraft diamonds using one configuration. No mathematical derivation chain is presented that reduces predictions or first-principles results to fitted parameters or self-referential definitions by construction. Results are measured on held-out environments and tasks; the algorithm description does not contain equations where outputs are forced by inputs. Prior Dreamer papers by overlapping authors are cited for the base world-model approach, but the new robustness components and single-config generality claim rest on independent experimental evidence rather than load-bearing self-citation or ansatz smuggling.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard model-based RL assumptions that a learned dynamics model is accurate enough for useful planning, plus the empirical claim that the listed robustness techniques transfer across domains. No new physical entities or forces are introduced.

free parameters (1)
  • single fixed hyperparameter configuration
    The paper asserts one set of values works across all 150+ tasks; these values are chosen once rather than per domain.
axioms (1)
  • domain assumption: A learned world model can support effective long-horizon planning even when trained from pixels and sparse rewards.
    Invoked to justify imagining futures instead of only real-environment interaction.



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Textual Belief States for World Models: Identifiable Representation Learning Under Strict Mediation

    cs.LG 2026-06 unverdicted novelty 8.0

    Introduces textual belief states and factorized GRPO to enforce strict latent state mediation in text-based world models, yielding preserved prediction accuracy with large gains in representation quality and rollout p...

  2. Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

    cs.CL 2023-09 unverdicted novelty 8.0

    Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

  3. SUNTA: Hierarchical Video Prediction with Surprise-based Chunking

    cs.AI 2026-07 unverdicted novelty 7.0

    SUNTA uses surprise-driven chunk boundaries and decoupled training in hierarchical state-space models to sustain accurate video predictions over 250 timesteps where baselines fail after 10.

  4. ScratchWorld: Evaluating If World Models Compute Executable Consequences

    cs.SE 2026-06 unverdicted novelty 7.0

    ScratchWorld benchmark finds that language models achieve at most 13.8% value-aware changed-field F1 on replay-verified Scratch state transitions and frequently ignore executable rules.

  5. The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks

    cs.RO 2026-06 unverdicted novelty 7.0

    TISED framework reveals paradoxical effects where inference optimizations can lengthen task completion time on static tasks or raise success rates on dynamic tasks in embodied AI.

  6. Event-Conditioned Diagnostics of Kinematic, Contact, and Object-Permanence Fields in Passive Object-State World Models

    cs.RO 2026-06 unverdicted novelty 7.0

    Introduces controlled diagnostics showing world model latent states reweight kinematic, contact, and object-permanence field readouts by event type, with evidence that suppressing field-aligned directions degrades eve...

  7. Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents

    cs.AI 2026-06 unverdicted novelty 7.0

    GILP trains a parameterized backbone for valid actions and state predictions, then uses a consistency gate with LLM drafts to reduce hallucinated-state rate from 0.176 to 0.035 on GPT-4o-mini while raising success fro...

  8. MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

    cs.CV 2026-06 unverdicted novelty 7.0

    MemoBench is a new diagnostic benchmark with 360 synthetic and real clips plus VQA evaluation that tests memory consistency in video models under the disappear-and-reappear paradigm in dynamically changing environments.

  9. MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

    cs.CV 2026-06 unverdicted novelty 7.0

    MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates memory consistency in video models under disappear-and-reappear in dynamic environments.

  10. MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

    cs.CV 2026-06 unverdicted novelty 7.0

    MemoBench curates 360 ground-truth clips and an evaluation suite to diagnose memory consistency failures in video models when objects change state while out of view.

  11. World Action Models Enable Continual Imitation Learning with Recurrent Generative Replays

    cs.RO 2026-06 unverdicted novelty 7.0

    REGEN uses recurrent generative replays from World Action Models to cut catastrophic forgetting by up to 50% in continual imitation learning compared to sequential fine-tuning.

  12. Equilibrium World Models

    econ.GN 2026-06 unverdicted novelty 7.0

    Equilibrium World Models are a deep-learning solver that enforces exact equilibrium conditions on broad model-generated state distributions to globally solve dynamic stochastic models featuring rare disasters, binding...

  13. Stealthy World Model Manipulation via Data Poisoning

    cs.LG 2026-06 unverdicted novelty 7.0

    SWAAP is the first two-stage poisoning framework that identifies a harmful target world model via bilevel optimization and realizes it through stealth-constrained gradient matching on a limited fraction of fine-tuning...

  14. PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

    cs.AI 2026-06 unverdicted novelty 7.0

    PreAct compiles successful agent executions into verifiable state-machine programs for 8.5-13x faster replay on repeated tasks, with an independent evaluator check before storing each program.

  15. M*: A Modular, Extensible, Serving System for Multimodal Models

    cs.LG 2026-06 unverdicted novelty 7.0

    M* introduces the Walk Graph abstraction to serve arbitrary compositions of multimodal model components and reports latency and throughput gains over vLLM-Omni and other baselines on text-to-image, text-to-speech, and...

  16. Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

    cs.AI 2026-06 conditional novelty 7.0

    MCPS adapts a trajectory generator from autonomous driving to simulate counterfactual 3D pass outcomes in football and produces distribution-aware execution-surplus scores from value model rollouts.

  17. MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

    cs.CV 2026-05 unverdicted novelty 7.0

    MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

  18. SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

    cs.CV 2026-05 unverdicted novelty 7.0

    SVI-Bench provides 35K hours of sports video with 9 tasks across four cognitive levels, revealing models drop from ~74% on action QA to 5% on agentic evidence integration.

  19. MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

    cs.AI 2026-05 unverdicted novelty 7.0

    MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus,...

  20. What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

    cs.CV 2026-05 unverdicted novelty 7.0

    What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source systems.

  21. TPS-Drive: Task-Guided Representation Purification for VLM-based Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    TPS-Drive uses an agent-centric tokenizer supervised by a frozen 3D detection head to purify VLM spatial representations, enabling better scene forecasting and lower collision rates on nuScenes and NAVSIM benchmarks.

  22. UWM-JEPA: Predictive World Models That Imagine in Belief Space

    cs.LG 2026-05 unverdicted novelty 7.0

    UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.

  23. World Models as Group Actions

    cs.CV 2026-05 unverdicted novelty 7.0

    Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.

  24. SliceWorld: A Predictive and Controllable World-State Model for CT Report Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    SliceWorld introduces a world-state model for CT report generation that uses predictive and factor-aware objectives on axial slice sequences.

  25. WMAttack: Automated Attack Search for Adversarial Evaluation of World-Model Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    WMAttack automates finite-budget attack search for world-model agents via SCAS and RGAR, reporting higher normalized reward drops than baselines on Atari and DMC tasks.

  26. Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation

    cs.LG 2026-05 unverdicted novelty 7.0

    Partial fusion interpolates between neural network ensembles and weight aggregation by only fusing the most similar neurons identified via partial optimal transport, enabling flexible cost-performance tradeoffs.

  27. EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

    cs.RO 2026-05 conditional novelty 7.0

    EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines...

  28. AffectVerse: Emotional World Models for Multimodal Affective Computing

    cs.CV 2026-05 unverdicted novelty 7.0

    AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and...

  29. Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

    cs.CV 2026-05 unverdicted novelty 7.0

    Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency sup...

  30. Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

    cs.AI 2026-05 unverdicted novelty 7.0

    Alice uses preservation conflicts from failed candidate updates to create class-stratified hypotheses and guide exploration, improving executable world-model learning under prior misalignment.

  31. Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Event-graph substrates represent states as RDF triple logs, prove a duality reducing explanatory and counterfactual queries to causal-ancestor traversal, and outperform symbolic and parametric baselines on CLEVRER and...

  32. WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

    cs.RO 2026-05 unverdicted novelty 7.0

    WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage trainin...

  33. WorldParticle: Unified World Simulation of Lagrangian Particle Dynamics via Transformer

    cs.GR 2026-05 unverdicted novelty 7.0

    A transformer with prediction-correction and hierarchical super-token merging unifies simulation of six physical dynamics categories on Lagrangian particles and generalizes to unseen conditions.

  34. Coding Agent Is Good As World Simulator

    cs.AI 2026-05 unverdicted novelty 7.0

    A multi-agent framework generates and refines executable physics simulation code from prompts to create world models that enforce physical constraints, claiming superior accuracy and fidelity over video-based alternatives.

  35. JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...

  36. The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

    cs.AI 2026-05 unverdicted novelty 7.0

    KnotBench benchmark shows state-of-the-art VLMs perform near random on diagrammatic knot reasoning tasks and lack ability to simulate structural moves.

  37. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  38. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  39. The Predictive-Causal Gap: An Impossibility Theorem and Large-Scale Neural Evidence

    cs.LG 2026-05 unverdicted novelty 7.0

    Predictive representation learning structurally favors encoding slower or less noisy environment modes over causal system modes, as shown by an impossibility theorem for linear-Gaussian dynamics and large-scale neural...

  40. Counterfactual identifiability beyond global monotonicity: non-monotone triangular structural causal models

    cs.LG 2026-05 unverdicted novelty 7.0

    Non-monotone triangular SCMs with mechanism-wise invertibility and context-independent inverse transport are equivalent to exogenous isomorphism and achieve complete counterfactual identifiability, with supporting exp...

  41. Latent State Design for World Models under Sufficiency Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  42. Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.

  43. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE employs agent-centric mixture-of-experts to decouple task-relevant features from dynamic visual perturbations in RL, recovering 95.3% of clean performance on the new VDCS benchmark.

  44. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.

  45. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  46. 3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS

    cs.RO 2026-04 unverdicted novelty 7.0

    3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.

  47. GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.

  48. EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

    cs.CV 2026-04 unverdicted novelty 7.0

    EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.

  49. Advantage-Guided Diffusion for Model-Based Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.

  50. Beyond Static Forecasting: Unleashing the Power of World Models for Mobile Traffic Extrapolation

    cs.NI 2026-04 unverdicted novelty 7.0

    MobiWM is a multimodal world model for mobile networks that learns state-action dynamics to enable unlimited-horizon counterfactual traffic simulations and optimization.

  51. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  52. PlayWorld: Learning Robot World Models from Autonomous Play

    cs.RO 2026-03 unverdicted novelty 7.0

    PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy p...

  53. PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning

    cs.RO 2026-02 unverdicted novelty 7.0

    PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.

  54. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 7.0

    ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

  55. BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning

    cs.LG 2025-06 conditional novelty 7.0

    BiTrajDiff augments offline RL datasets by running independent forward and backward diffusion processes from intermediate states, yielding higher performance than prior one-directional data-augmentation baselines on D4RL.

  56. Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

    cs.RO 2023-10 conditional novelty 7.0

    SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.

  57. Learning Interactive Real-World Simulators

    cs.AI 2023-10 conditional novelty 7.0

    UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

  58. Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

    cs.AI 2023-10 unverdicted novelty 7.0

    LATS integrates Monte Carlo Tree Search with language models using in-context learning, value functions, and self-reflection to achieve 92.7% pass@1 on HumanEval and competitive web navigation performance.

  59. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  60. Repair the Amplifier, Not the Symptom: Stable World-Model Correction for Agent Rollouts

    cs.AI 2026-07 unverdicted novelty 6.0

    WM-SAR identifies and repairs causal subgraphs that amplify errors in agent planning graphs, outperforming symptom-scanning LLM correctors under token constraints.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 229 Pith papers · 14 internal anchors

  1. [1]

    Mastering the game of go with deep neural networks and tree search

    David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016

  2. [2]

    OpenAI Five

    OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018

  3. [3]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  4. [4]

    Coderl: Mastering code generation through pretrained models and deep reinforcement learning

    Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328, 2022

  5. [5]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  6. [6]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

  7. [7]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015

  8. [8]

    Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019

  9. [9]

    Reinforcement Learning with Unsupervised Auxiliary Tasks

    Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016

  10. [10]

    Unsupervised state representation learning in atari

    Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc-Alexandre Côté, and R Devon Hjelm. Unsupervised state representation learning in atari. Advances in neural information processing systems, 32, 2019

  11. [11]

    Reinforcement learning with neural radiance fields

    Danny Driess, Ingmar Schubert, Pete Florence, Yunzhu Li, and Marc Toussaint. Reinforcement learning with neural radiance fields. arXiv preprint arXiv:2206.01634, 2022

  12. [12]

    Mastering the game of go without human knowledge

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017

  13. [13]

    What matters in on-policy reinforcement learning? a large-scale empirical study

    Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters in on-policy reinforcement learning? a large-scale empirical study. arXiv preprint arXiv:2006.05990, 2020

  14. [14]

    Dyna, an integrated architecture for learning, planning, and reacting

    Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991

  15. [15]

    Deep visual foresight for planning robot motion

    Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017

  16. [16]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

  17. [17]

    Model-based reinforcement learning for atari

    Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019

  18. [18]

    The minerl competition on sample efficient reinforcement learning using human priors

    William H Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noboru Kuno, Stephanie Milani, Sharada Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin, et al. The minerl competition on sample efficient reinforcement learning using human priors. arXiv e-prints, pages arXiv–1904, 2019

  19. [19]

    Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft

    Ingmar Kanitscheider, Joost Huizinga, David Farhi, William Hebgen Guss, Brandon Houghton, Raul Sampedro, Peter Zhokhov, Bowen Baker, Adrien Ecoffet, Jie Tang, et al. Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft. arXiv preprint arXiv:2106.14876, 2021

  20. [20]

    Video pretraining (vpt): Learning to act by watching unlabeled online videos

    Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. arXiv preprint arXiv:2206.11795, 2022

  21. [21]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019

  22. [22]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

  23. [23]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  24. [24]

    Learning Latent Dynamics for Planning from Pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018

  25. [25]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

  26. [26]

    Improved variational inference with inverse autoregressive flow

    Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. Advances in neural information processing systems, 29, 2016

  27. [27]

    Very deep vaes generalize autoregressive models and can outperform them on images

    Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. arXiv preprint arXiv:2011.10650, 2020

  28. [28]

    A distributional perspective on reinforcement learning

    Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458. PMLR, 2017

  29. [29]

    Reinforcement learning: An introduction

    Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018

  30. [30]

    Function optimization using connectionist reinforcement learning algorithms

    Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991

  31. [31]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992

  32. [32]

    Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018

  33. [33]

    Maximum a Posteriori Policy Optimisation

    Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018

  34. [34]

    A bi-symmetric log transformation for wide-range data

    J Beau W Webber. A bi-symmetric log transformation for wide-range data. Measurement Science and Technology, 24(2):027001, 2012

  35. [35]

    Recurrent experience replay in distributed reinforcement learning

    Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International conference on learning representations, 2018

  36. [36]

    Multi-task deep reinforcement learning with popart

    Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with popart. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3796–3803, 2019

  37. [37]

    Phasic policy gradient

    Karl W Cobbe, Jacob Hilton, Oleg Klimov, and John Schulman. Phasic policy gradient. In International Conference on Machine Learning, pages 2020–2027. PMLR, 2021

  38. [38]

    The arcade learning environment: An evaluation platform for general agents

    Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013

  39. [40]

    Rainbow: Combining improvements in deep reinforcement learning

    Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018

  40. [41]

    Implicit quantile networks for distributional reinforcement learning

    Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In International conference on machine learning, pages 1096–1105. PMLR, 2018

  41. [42]

    Leveraging procedural generation to benchmark reinforcement learning

    Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In International conference on machine learning, pages 2048–2056. PMLR, 2020

  42. [43]

    DeepMind Lab

    Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. Deepmind lab. arXiv preprint arXiv:1612.03801, 2016

  43. [44]

    Mastering atari games with limited data

    Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, and Yang Gao. Mastering atari games with limited data. Advances in Neural Information Processing Systems, 34:25476–25488, 2021

  44. [45]

    Transformers are sample-efficient world models

    Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample efficient world models. arXiv preprint arXiv:2209.00588, 2022

  45. [46]

    DeepMind Control Suite

    Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018

  46. [47]

    Mastering visual continuous control: Improved data-augmented reinforcement learning

    Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021

  47. [48]

    Behaviour suite for reinforcement learning

    Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, et al. Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568, 2019

  48. [49]

    Investigating the practicality of existing reinforcement learning algorithms: A performance comparison

    Olivia Dizon-Paradis, Stephen Wormald, Daniel Capecci, Avanti Bhandarkar, and Damon Woodard. Investigating the practicality of existing reinforcement learning algorithms: A performance comparison. Authorea Preprints, 2023

  49. [50]

    Benchmarking the spectrum of agent capabilities

    Danijar Hafner. Benchmarking the spectrum of agent capabilities. arXiv preprint arXiv:2109.06780, 2021

  50. [51]

    Improving sample efficiency in model-free reinforcement learning from images

    Denis Yarats, Amy Zhang, Ilya Kostrikov, Brandon Amos, Joelle Pineau, and Rob Fergus. Improving sample efficiency in model-free reinforcement learning from images. arXiv preprint arXiv:1910.01741, 2019

  51. [52]

    A Generalist Agent

    Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022

  52. [53]

    The malmo platform for artificial intelligence experimentation

    Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. In IJCAI, pages 4246–4247. Citeseer, 2016

  53. [54]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  54. [55]

    The 37 implementation details of proximal policy optimization

    Shengyi Huang, Rousslan Fernand Julien Dossa, Antonin Raffin, Anssi Kanervisto, and Weixun Wang. The 37 implementation details of proximal policy optimization. The ICLR Blog Track 2023, 2022

  55. [56]

    Acme: A research framework for distributed reinforcement learning

    Matt Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Feryal Behbahani, Tamara Norman, Abbas Abdolmaleki, Albin Cassirer, Fan Yang, Kate Baumli, et al. Acme: A research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979, 2020

  56. [57]

    Off-policy actor-critic with shared experience replay

    Simon Schmitt, Matteo Hessel, and Karen Simonyan. Off-policy actor-critic with shared experience replay. In International Conference on Machine Learning, pages 8545–8554. PMLR, 2020

  57. [58]

    Prioritized Experience Replay

    Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015

  58. [59]

    High-performance large-scale image recognition without normalization

    Andy Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization. In International Conference on Machine Learning, pages 1059–1071. PMLR, 2021

  59. [60]

    Laprop: Separating momentum and adaptivity in adam

    Liu Ziyin, Zhikang T Wang, and Masahito Ueda. Laprop: Separating momentum and adaptivity in adam. arXiv preprint arXiv:2002.04839, 2020

  60. [61]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  61. [62]

    The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning

    Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Bellemare, and Remi Munos. The reactor: A fast and sample-efficient actor-critic agent for reinforcement learning. arXiv preprint arXiv:1704.04651, 2017

  62. [63]

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014

  63. [64]

    Rethinking Full Connectivity in Recurrent Neural Networks

    Matthijs Van Keirsbilck, Alexander Keller, and Xiaodong Yang. Rethinking full connectivity in recurrent neural networks. arXiv preprint arXiv:1905.12340, 2019

  64. [65]

    Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents

    Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018

  65. [66]

    IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

    Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018