arxiv: 2005.01643 · v3 · pith:T65YNEXUnew · submitted 2020-05-04 · 💻 cs.LG · cs.AI · stat.ML

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Pith reviewed 2026-05-11 11:27 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords offline reinforcement learning · deep reinforcement learning · policy optimization · static datasets · decision making · reinforcement learning challenges · open problems in RL

The pith

Offline reinforcement learning can extract maximum-utility policies from fixed datasets without new data collection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Offline reinforcement learning trains decision policies solely from previously gathered data, avoiding any further interaction with the environment during learning. This setup promises to convert large existing datasets into effective automated decision systems across domains such as healthcare, education, and robotics. Current algorithms face limitations that prevent full extraction of high-utility policies, especially when using deep neural networks. The paper supplies conceptual tools to understand these issues, reviews solutions explored in recent studies, covers example applications, and outlines remaining open problems.

Core claim

Offline reinforcement learning algorithms hold promise for turning large datasets into powerful decision-making engines by extracting policies with the maximum possible utility out of available data. Effective methods would automate decision-making domains from healthcare to robotics. Limitations in current algorithms, particularly with modern deep reinforcement learning, make this extraction difficult. The work describes challenges, potential mitigating solutions from recent research, applications, and perspectives on open problems.

What carries the argument

Offline reinforcement learning, the paradigm that optimizes policies using only a static dataset of past experiences without any further online interaction or data gathering.
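
To make the paradigm concrete, here is a minimal sketch of learning from a static dataset, assuming a tiny tabular problem. It is not an algorithm taken from the paper: the function names (offline_q_iteration, greedy_policy), the three-state chain, and the logged transitions are all hypothetical, chosen only to show that every update reads from recorded data and never queries an environment.

```python
from collections import defaultdict


def offline_q_iteration(dataset, n_actions, gamma=0.9, n_sweeps=50):
    """Tabular batch Q-iteration over a fixed list of logged transitions.

    dataset: iterable of (state, action, reward, next_state, done) tuples.
    No environment is queried; every backup uses only the logged data.
    """
    # Group logged samples by (state, action) so each backup averages its targets.
    grouped = defaultdict(list)
    for s, a, r, s_next, done in dataset:
        grouped[(s, a)].append((r, s_next, done))

    # Unseen (state, action) pairs keep the initial value 0.0.
    q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(n_sweeps):
        new_q = defaultdict(lambda: [0.0] * n_actions)
        for (s, a), samples in grouped.items():
            targets = [
                r + (0.0 if done else gamma * max(q[s_next]))
                for r, s_next, done in samples
            ]
            new_q[s][a] = sum(targets) / len(targets)
        q = new_q
    return q


def greedy_policy(q, states):
    """Pick the highest-valued action in each listed state."""
    return {s: max(range(len(q[s])), key=lambda a: q[s][a]) for s in states}


# Hypothetical logged data from a 3-state chain: action 1 moves right,
# action 0 moves back to state 0, and reaching state 2 ends the episode.
logged = [
    (0, 1, 0.0, 1, False),
    (1, 1, 1.0, 2, True),
    (0, 0, 0.0, 0, False),
    (1, 0, 0.0, 0, False),
]

q = offline_q_iteration(logged, n_actions=2)
policy = greedy_policy(q, states=[0, 1])
print(policy)
```

In this toy case the dataset covers every state-action pair that matters, so the greedy policy moves right in both states. With a deep network in place of the table and a dataset that leaves some actions unobserved, the max over next-state actions can pick values that no logged data supports, which is one way to read the limitations the summary attributes to current algorithms.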

If this is right

  • Large static datasets from real-world logs can train agents for healthcare or robotics decisions without risky new interactions.
  • Policy optimization can proceed purely from recorded trajectories, separating data collection from learning.
  • Automation of decision domains becomes feasible once limitations are addressed through the reviewed techniques.
  • Research can focus on open problems to improve utility extraction from fixed data sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could enable safer deployment of learned policies in settings where online exploration carries high cost or danger.
  • It opens connections to large-scale supervised learning on logged decision data from production systems.
  • Open problems identified may direct attention toward handling distribution shifts between dataset and deployment conditions.

Load-bearing premise

That the limitations of current offline algorithms can be overcome by the solutions explored in recent work, enabling effective extraction of maximum-utility policies from available data.

What would settle it

A controlled benchmark where applying all described mitigation techniques still yields offline policies whose performance falls short of online reinforcement learning baselines on standard control tasks.

Original abstract

In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection. Offline reinforcement learning algorithms hold tremendous promise for making it possible to turn large datasets into powerful decision making engines. Effective offline reinforcement learning methods would be able to extract policies with the maximum possible utility out of the available data, thereby allowing automation of a wide range of decision-making domains, from healthcare and education to robotics. However, the limitations of current algorithms make this difficult. We will aim to provide the reader with an understanding of these challenges, particularly in the context of modern deep reinforcement learning methods, and describe some potential solutions that have been explored in recent work to mitigate these challenges, along with recent applications, and a discussion of perspectives on open problems in the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper is a tutorial and review on offline reinforcement learning algorithms that use previously collected data without additional online data collection. It aims to equip readers with conceptual tools to start research in this area, emphasizing the promise of turning large datasets into powerful decision-making engines for domains like healthcare, education, and robotics. The manuscript discusses limitations of current algorithms, particularly in deep RL, potential solutions from recent work, applications, and perspectives on open problems.

Significance. This review could be significant for the field by providing a consolidated overview and highlighting open problems, potentially guiding future research in data-driven RL. As a tutorial from active researchers, it offers reliable conceptual framing of the core promise and difficulties of offline RL.

minor comments (1)
  1. [Abstract] The statement that the paper will 'describe some potential solutions that have been explored in recent work' is vague on scope and examples; a brief enumeration of the main approaches covered would improve reader orientation without altering the tutorial structure.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our tutorial on offline reinforcement learning, the assessment of its potential significance for the field, and the recommendation for minor revision. We are pleased that the manuscript is viewed as providing reliable conceptual framing and highlighting open problems to guide future research.

Circularity Check

0 steps flagged

No significant circularity: review paper with no derivations or self-referential claims

This is a tutorial and review paper that catalogs existing offline RL methods, challenges, and open problems from the literature without presenting any new derivations, equations, fitted parameters, or predictions. No load-bearing steps reduce to self-citations or definitions by construction; all claims are descriptive summaries of prior work. The manuscript is self-contained as a survey and does not introduce novel results that could exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a review and tutorial, the paper introduces no new free parameters, axioms, or invented entities; it summarizes prior offline RL research.

pith-pipeline@v0.9.0 · 5452 in / 946 out tokens · 42328 ms · 2026-05-11T11:27:57.485356+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  2. Decision Transformer: Reinforcement Learning via Sequence Modeling

    cs.LG 2021-06 accept novelty 8.0

    Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

  3. D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    cs.LG 2020-04 accept novelty 8.0

    D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.

  4. Offline Multi-agent Continual Cooperation via Skill Partition and Reuse

    cs.AI 2026-06 unverdicted novelty 7.0

    COMAD discovers and reuses coordination skills from mixed offline MARL data via auto-encoders and density-based estimation to achieve continual learning with better transfer.

  5. Robust $Q$-learning for mean-field control under Wasserstein uncertainty in common noise

    math.OC 2026-06 unverdicted novelty 7.0

    Robust Q-learning algorithm with convergence and finite-time bounds for mean-field control under Wasserstein uncertainty in common noise.

  6. Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

    stat.ML 2026-06 unverdicted novelty 7.0

    Identifies full-data conditional mean rewards under MNAR missingness via shadow variables and a bridge function, then builds a consistent FQE-style OPE estimator for missingness-aware policies.

  7. When Offline Selectors Cannot Beat the Best Single Model: A Diagnostic Study on edX Dropout Prediction

    cs.LG 2026-06 conditional novelty 7.0

    A three-stage diagnostic on edX data shows offline selectors (BC, DQN, CQL) fail to reach oracle performance due to local representational ambiguity rather than learner mismatch or label shift.

  8. Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

    cs.RO 2026-05 unverdicted novelty 7.0

    CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.

  9. Fast Convergence of Policy Regret in Learning Stochastic Optimal Control

    math.OC 2026-05 unverdicted novelty 7.0

    In stochastic optimal control, policy regret converges at rate n to the power of minus min of p over 2(p-q) and (m+1) over 2m given an n to the minus one-half accurate Q-star estimator, when the regularity exponent q ...

  10. Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Proposes latent analogies and analogy transduction to enable compositional generalization to unseen goal-context pairs in offline GCRL, outperforming trajectory-stitching baselines on manipulation tasks.

  11. Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasonin...

  12. Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    CPQL adapts the multi-step Peng's Q(λ) operator for conservative offline value estimation, achieving performance guarantees and empirical gains over single-step baselines on D4RL while supporting offline-to-online fin...

  13. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  14. Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    TCE bridges domain gaps in offline RL by selectively using source data or generating target-aligned transitions via a dual score-based model, outperforming baselines in experiments.

  15. Aligning Flow Map Policies with Optimal Q-Guidance

    cs.LG 2026-05 unverdicted novelty 7.0

    Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

  16. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

  17. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

  18. Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift

    cs.LG 2026-05 unverdicted novelty 7.0

    Anchor-TS corrects bias from distribution shift in offline-to-online bandits by taking the median of an online posterior sample, a hybrid posterior sample, and the online sample mean.

  19. Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift

    cs.LG 2026-05 unverdicted novelty 7.0

    Anchor-TS defines arm indices as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean to correct distribution-shift bias and safely accelerate online learning with offline data.

  20. Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

    cs.LG 2026-05 unverdicted novelty 7.0

    The paper establishes the first $\tilde{O}(\epsilon^{-1})$ upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general functi...

  21. Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift

    cs.LG 2026-05 unverdicted novelty 7.0

    SeqRejectron constructs a stopping rule with a small set of validator policies to achieve horizon-free sample complexity for selective imitation learning under arbitrary dynamics shifts.

  22. Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift

    cs.LG 2026-05 unverdicted novelty 7.0

    SeqRejectron builds a stopping rule from a small set of validator policies to achieve horizon-free sample-complexity guarantees for selective imitation learning under arbitrary train-test dynamics shifts.

  23. Quantile-Coupled Flow Matching for Distributional Reinforcement Learning

    cs.LG 2026-05 conditional novelty 7.0

    FlowIQN is a quantile-coupled CFM critic that yields the first explicit Wasserstein-aligned approximate projection for distributional RL, with improved return-distribution accuracy and competitive offline RL performance.

  24. Zero-shot Imitation Learning by Latent Topology Mapping

    cs.LG 2026-05 unverdicted novelty 7.0

    ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.

  25. Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

    cs.AI 2026-05 unverdicted novelty 7.0

    LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.

  26. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  27. Path-Coupled Bellman Flows for Distributional Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.

  28. Dynamic Treatment on Networks

    stat.ML 2026-05 unverdicted novelty 7.0

    Q-Ising integrates Bayesian dynamic Ising modeling with offline RL to enable adaptive network treatment policies that outperform static centrality benchmarks under spillovers.

  29. Operator-Guided Invariance Learning for Continuous Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    VPSD-RL discovers exact and approximate value-preserving Lie-group operators in continuous RL to stabilize learning via transition augmentation and consistency regularization.

  30. SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

    cs.LG 2026-05 unverdicted novelty 7.0

    SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.

  31. Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...

  32. Adaptive Estimation and Optimal Control in Offline Contextual MDPs without Stationarity

    stat.ML 2026-05 unverdicted novelty 7.0

    A T-estimation-based procedure for adaptive density estimation and optimal control in offline contextual MDPs without stationarity, providing oracle risk bounds under two loss functions and finite-sample cost guarantees.

  33. Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.

  34. CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

    cs.AI 2026-05 unverdicted novelty 7.0

    CoFlow achieves state-of-the-art coordination quality in offline MARL using only 1-3 denoising steps by natively coupling velocity fields across agents via coordinated attention and gating.

  35. CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

    cs.AI 2026-05 unverdicted novelty 7.0

    CoFlow achieves state-of-the-art coordination in offline MARL using single-pass joint velocity fields with Coordinated Velocity Attention and Adaptive Coordination Gating.

  36. CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    CODA augments offline multi-agent RL with on-policy diffusion trajectories that evolve with the joint policy to enable coordination.

  37. CASP: Support-Aware Offline Policy Selection for Two-Stage Recommender Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    CASP selects lower-burden two-stage recommender policies by combining doubly robust estimation with a penalty for weak data support and provides theoretical guarantees for conservative selection.

  38. SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.

  39. Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation

    stat.ML 2026-04 unverdicted novelty 7.0

    High-order generator regression from multi-step trajectories yields a second-order accurate estimator for finite-horizon continuous-time policy evaluation that outperforms the Bellman baseline in calibration studies a...

  40. Locality, Not Spectral Mixing, Governs Direct Propagation in Distributed Offline Dynamic Programming

    cs.DC 2026-04 unverdicted novelty 7.0

    Locality sets the fundamental round lower bound L_ε = floor(log(1/2ε)/log(1/γ)) for ε-accuracy on large-diameter graphs; direct propagation achieves it while gossip averaging pays extra 1/gap(W) factors.

  41. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  42. Adaptive Control in Autonomous Driving via Real-Time Recurrent RL

    cs.RO 2026-02 unverdicted novelty 7.0

    Combines offline behavioral cloning with online Real-Time Recurrent RL fine-tuning on LrcSSM models to adapt autonomous driving policies to distribution shifts, validated in simulation and on a real 1:10-scale robot w...

  43. Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism

    cs.LG 2025-12 conditional novelty 7.0

    NEUBAY uses Bayesian posteriors over world models with long-horizon planning to match or exceed conservative offline RL methods without explicit conservatism.

  44. Mixed-Density Diffuser: Efficient Planning with Non-Uniform Temporal Resolution

    cs.AI 2025-10 unverdicted novelty 7.0

    Mixed-Density Diffuser achieves new state-of-the-art results on D4RL benchmarks by allowing non-uniform temporal resolution in diffusion planning.

  45. Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

    cs.LG 2025-07 unverdicted novelty 7.0

    Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.

  46. BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning

    cs.LG 2025-06 conditional novelty 7.0

    BiTrajDiff augments offline RL datasets by running independent forward and backward diffusion processes from intermediate states, yielding higher performance than prior one-directional data-augmentation baselines on D4RL.

  47. Offline Constrained Reinforcement Learning under Partial Data Coverage

    stat.ML 2025-05 unverdicted novelty 7.0

    PDOCRL is an oracle-efficient primal-dual method for offline constrained RL under general function approximation that returns near-optimal policies with $O(\epsilon^{-2})$ samples under partial optimal-policy coverage and a ...

  48. A Tale of Two Cities: Pessimism and Opportunism in Offline Dynamic Pricing

    stat.ML 2024-11 unverdicted novelty 7.0

    Introduces pessimistic and opportunistic policies for offline dynamic pricing under no price coverage via partial identification from demand monotonicity, with finite-sample regret bounds that recover standard rates w...

  49. VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

    cs.RO 2022-09 unverdicted novelty 7.0

    VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.

  50. Generalization in offline RL: The structure is more important than the amount of pessimism

    cs.LG 2026-07 unverdicted novelty 6.0

    In offline RL, the structure of pessimism (set by dataset coverage) matters more for generalization than its amount; a symmetric overly pessimistic value function can outperform a non-symmetric mildly pessimistic one.

  51. Episodic-to-Semantic Consolidation Without Identity Drift

    cs.AI 2026-07 unverdicted novelty 6.0

    A deterministic episodic-to-semantic consolidation function with a structural lemma proving identity invariance, demonstrated in synthetic experiments on an embodied service agent.

  52. HJ-SafeDMP: Hamilton-Jacobi Reachability-Guided Dynamic Movement Primitives for Provably Safe Robot Motion

    cs.RO 2026-06 unverdicted novelty 6.0

    HJ-SafeDMP learns a control barrier value function offline from demonstrations via finite-difference HJ recursion and uses it as a closed-form safety filter on DMP outputs, with conformal prediction for coverage guarantees.

  53. Hallucination in World Models is Predictable and Preventable

    cs.LG 2026-06 unverdicted novelty 6.0

    Hallucination in world models is a data coverage issue predictable by three signals and preventable through targeted training sampling and online data collection.

  54. Horizon Adaptive Offline Policy Learning via Value Stitching

    cs.LG 2026-06 unverdicted novelty 6.0

    VAST learns a horizon-adaptive auxiliary value function and stitching policy to compose variable-length returns for improved offline policy optimization on long-horizon tasks.

  55. Sim2O: Efficient Offline-to-Online MARL via Joint Action Composition

    cs.LG 2026-06 unverdicted novelty 6.0

    Sim2O enables efficient offline-to-online MARL by dynamically blending offline and online action proposals across agents and selecting high-value combinations via a centralized value function without auxiliary objectives.

  56. When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

    stat.ML 2026-06 unverdicted novelty 6.0

    Proposes OPAC for trajectory-level offline RL achieving $\tilde{O}(H^{2}\sqrt{C_{sa}(\pi^*)/n})$ bounds with matching lower bound, plus conditions for tractability in generalized nonlinear outcome settings.

  57. Reversal Q-Learning

    cs.LG 2026-06 unverdicted novelty 6.0

    Reversal Q-Learning (RQL) proposes reversing flows for virtual trajectories and bias-variance reduction in an expanded MDP to train flow policies, reporting best average performance on 50 simulated robotic tasks versu...

  58. Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

    cs.LG 2026-06 unverdicted novelty 6.0

    BFQ enables single-step noise-to-action mapping in offline RL by dividing flow-path displacements into bootstrappable short-range components learned from marginal velocity.

  59. Discovering Interpretable Multi-Parameter Control Policies for Evolutionary Algorithms Using Deep Reinforcement Learning

    cs.LG 2026-06 unverdicted novelty 6.0

    Deep RL with action decomposition and reward shifting learns a symbolic multi-parameter policy for (1+(λ,λ))-GA on OneMax that outperforms baselines across problem sizes.

  60. Counterfactual Transport Flows for Offline Conservative Trajectory Refinement

    cs.LG 2026-06 unverdicted novelty 6.0

    Counterfactual transport flows enable conservative, instance-specific trajectory refinement in offline RL by constructing local preference pairs in latent space from offline data and learning refinement directions con...

Reference graph

Works this paper leans on

284 extracted references · 284 canonical work pages · cited by 157 Pith papers · 28 internal anchors

  1. [1]

    and Friedman, N

    Koller, D. and Friedman, N. , title =. 2009 , isbn =

  2. [2]

    2019 International Conference on Robotics and Automation (ICRA) , pages=

    Closing the sim-to-real loop: Adapting simulation randomization with real world experience , author=. 2019 International Conference on Robotics and Automation (ICRA) , pages=. 2019 , organization=

  3. [3]

    International journal of computer vision , volume=

    Imagenet large scale visual recognition challenge , author=. International journal of computer vision , volume=. 2015 , publisher=

  4. [4]

    Sim-to-Real: Learning Agile Locomotion For Quadruped Robots

    Sim-to-real: Learning agile locomotion for quadruped robots , author=. arXiv preprint arXiv:1804.10332 , year=

  5. [5]

    Sadeghi, Fereshteh and Levine, Sergey , booktitle=

  6. [6]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation , author =

  7. [7]

    2019 , howpublished =

    Data-Driven Deep Reinforcement Learning , author =. 2019 , howpublished =

  8. [8]

    2020 , howpublished =

    Does On-Policy Data Collection Fix Errors in Reinforcement Learning? , author =. 2020 , howpublished =

  9. [9]

    The Journal of Machine Learning Research , volume=

    End-to-end training of deep visuomotor policies , author=. The Journal of Machine Learning Research , volume=. 2016 , publisher=

  10. [10]

    International Conference on Machine Learning , pages=

    Guided policy search , author=. International Conference on Machine Learning , pages=

  11. [11]

    Nature , volume=

    Mastering the game of go without human knowledge , author=. Nature , volume=. 2017 , publisher=

  12. [12]

    Advances in neural information processing systems , pages=

    A natural policy gradient , author=. Advances in neural information processing systems , pages=

  13. [13]

    Playing Atari with Deep Reinforcement Learning

    Playing atari with deep reinforcement learning , author=. arXiv preprint arXiv:1312.5602 , year=

  14. [14]

    Advances in neural information processing systems , pages=

    Policy gradient methods for reinforcement learning with function approximation , author=. Advances in neural information processing systems , pages=

  15. [15]

    Journal of Machine Learning Research , volume=

    Tree-based batch mode reinforcement learning , author=. Journal of Machine Learning Research , volume=

  16. [16]

    Machine learning , volume=

    Reinforcement learning in feedback control , author=. Machine learning , volume=. 2011 , publisher=

  17. [17]

    Neural fitted

    Riedmiller, Martin , booktitle=. Neural fitted. 2005 , organization=

  18. [18]

    Reinforcement learning , pages=

    Batch reinforcement learning , author=. Reinforcement learning , pages=. 2012 , publisher=

  19. [19]

    International Conference on Machine Learning (ICML) , year =

    Bias in Natural Actor-Critic Algorithms , author =. International Conference on Machine Learning (ICML) , year =

  20. [20]

    and Storkey, A

    Toussaint, M. and Storkey, A. , title =. International Conference on Machine Learning (ICML) , year =

  21. [21]

    Attias , title =

    H. Attias , title =. Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics , year =

  22. [22]

    1994 , publisher=

    Tesauro, Gerald , journal=. 1994 , publisher=

  23. [23]

    European Workshop on Reinforcement Learning (EWRL) , year =

    Actor-Critic Reinforcement Learning with Energy-Based Policies , author =. European Workshop on Reinforcement Learning (EWRL) , year =

  24. [24]

    International Conference on Artificial Intelligence and Statistics (AISTATS) , pages=

    A reduction of imitation learning and structured prediction to no-regret online learning , author=. International Conference on Artificial Intelligence and Statistics (AISTATS) , pages=

  25. [25]

    International Conference on Artificial Intelligence and Statistics (AISTATS) , pages=

    Efficient reductions for imitation learning , author=. International Conference on Artificial Intelligence and Statistics (AISTATS) , pages=

  26. [26]

    Approximately optimal approximate reinforcement learning. International Conference on Machine Learning (ICML).

  27. [27]

    Minka, T. P. Uncertainty in Artificial Intelligence (UAI).

  28. [28]

    Maximum a Posteriori Policy Optimisation.

  29. [29]

    Williams, R. J. Machine Learning, 1992.

  30. [30]

    Williams, R. J. and Peng, J.

  31. [31]

    Sutton, R. S. and Barto, A. G. 1998.

  32. [32]

    R. S. Sutton. International Conference on Machine Learning (ICML).

  33. [33]

    O'Donoghue, B. and Munos, R. and Kavukcuoglu, K. and Mnih, V. PGQ: Combining policy gradient and Q-learning.

  34. [34]

    Sallans, B. and Hinton, G. E. Journal of Machine Learning Research.

  35. [35]

    L. P. Kaelbling and M. L. Littman and A. P. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 1996.

  36. [36]

    E. Todorov. General duality between optimal control and estimation.

  37. [37]

    J. A. Bagnell and J. Schneider. International Joint Conference on Artificial Intelligence (IJCAI).

  38. [38]

    Peters, J. and Mülling, K. and Altün, Y. Relative Entropy Policy Search. AAAI Conference on Artificial Intelligence (AAAI), 2010.

  39. [39]

    Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics. Neural Information Processing Systems (NIPS).

  40. [40]

    E. Todorov. Neural Information Processing Systems (NIPS).

  41. [41]

    Dvijotham, K. and Todorov, E. International Conference on Machine Learning (ICML).

  42. [42]

    Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. Neural Information Processing Systems (NeurIPS).

  43. [43]

    Learning physical intuition of block towers by example. arXiv preprint arXiv:1603.01312.

  44. [45]

    Deep visual foresight for planning robot motion. IEEE International Conference on Robotics and Automation (ICRA), 2017.

  45. [46]

    Interaction networks for learning about objects, relations and physics. Advances in Neural Information Processing Systems.

  46. [47]

    Distributionally Robust Neural Networks. International Conference on Learning Representations.

  47. [49]

    Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571.

  48. [50]

    Semi-supervised learning with deep generative models. Advances in Neural Information Processing Systems.

  49. [51]

    What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems.

  50. [52]

    Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. International Conference on Machine Learning.

  51. [53]

    Causality for Machine Learning. arXiv preprint arXiv:1911.10500.

  52. [54]

    Deep learning. Nature, 2015.

  53. [55]

    Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648.

  54. [56]

    Learning Policy Improvements with Path Integrals. International Conference on Artificial Intelligence and Statistics (AISTATS 2010).

  55. [57]

    Trust Region Policy Optimization. International Conference on Machine Learning (ICML).

  56. [58]

    Levine, S.

  57. [59]

    Ziebart, B.

  58. [60]

    H. J. Kappen. Inference and Learning in Dynamic Models.

  59. [61]

    Learning continuous control policies by stochastic value gradients. Advances in Neural Information Processing Systems.

  60. [62]

    Reinforcement learning with deep energy-based policies. Proceedings of the 34th International Conference on Machine Learning, Volume 70, 2017.

  61. [63]

    H. J. Kappen and V. G. Optimal control as a graphical model inference problem. 2012.

  62. [64]

    K. Rawlik and M. Toussaint and S. Vijayakumar. 2013.

  63. [65]

    M. Toussaint. International Conference on Machine Learning (ICML).

  64. [66]

    Hierarchical POMDP Controller Optimization by Likelihood Maximization. Uncertainty in Artificial Intelligence (UAI).

  65. [67]

    R. Kalman. ASME Transactions, Journal of Basic Engineering.

  66. [68]

    Actor-critic algorithms. Advances in Neural Information Processing Systems.

  67. [69]

    Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 1992.

  68. [70]

    Q-learning. Machine Learning, 1992.

  69. [71]

    E. Todorov. Advances in Neural Information Processing Systems (NIPS).

  70. [72]

    Peters, J. and Schaal, S. International Conference on Machine Learning (ICML).

  71. [73]

    G. Neumann. International Conference on Machine Learning (ICML).

  72. [74]

    S. Levine and V. Koltun. Advances in Neural Information Processing Systems (NIPS).

  73. [75]

    Efficient Sample Reuse in EM-Based Policy Search. European Conference on Machine Learning (ECML).

  74. [76]

    Variational message passing. Journal of Machine Learning Research, 2005.

  75. [77]

    Reinforcement Learning with Deep Energy-Based Policies. International Conference on Machine Learning (ICML).

  76. [78]

    Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. 2018.

  77. [79]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning. 2020.

  78. [80]

    Bridging the Gap Between Value and Policy Based Reinforcement Learning. 2017.

  79. [81]

    Levine, S. and Koltun, V. International Conference on Machine Learning (ICML).

  80. [82]

    Ziebart, B. D. and Maas, A. and Bagnell, J. A. and Dey, A. K. International Conference on Artificial Intelligence (AAAI).

Showing first 80 references.