Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Aviral Kumar; George Tucker; Justin Fu; Sergey Levine

arXiv: 2005.01643 · v3 · pith:T65YNEXUnew · submitted 2020-05-04 · 💻 cs.LG · cs.AI · stat.ML

Sergey Levine, Aviral Kumar, George Tucker, Justin Fu

Pith reviewed 2026-05-11 11:27 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords offline reinforcement learning · deep reinforcement learning · policy optimization · static datasets · decision making · reinforcement learning challenges · open problems in RL

The pith

Offline reinforcement learning promises maximum-utility policies from fixed datasets without new data collection, though current algorithms do not yet deliver them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Offline reinforcement learning trains decision policies solely from previously gathered data, avoiding any further interaction with the environment during learning. This setup promises to convert large existing datasets into effective automated decision systems across domains such as healthcare, education, and robotics. Current algorithms face limitations that prevent full extraction of high-utility policies, especially when using deep neural networks. The paper supplies conceptual tools to understand these issues, reviews solutions explored in recent studies, covers example applications, and outlines remaining open problems.

Core claim

Offline reinforcement learning algorithms hold promise for turning large datasets into powerful decision-making engines by extracting policies with the maximum possible utility out of available data. Effective methods would automate decision-making domains from healthcare to robotics. Limitations in current algorithms, particularly with modern deep reinforcement learning, make this extraction difficult. The work describes challenges, potential mitigating solutions from recent research, applications, and perspectives on open problems.

What carries the argument

Offline reinforcement learning: the paradigm that optimizes policies using only a static dataset of past experience, with no further online interaction or data gathering.
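
As a rough illustration of the paradigm, the sketch below runs tabular Q-learning over a fixed list of logged transitions and never queries an environment. It is not taken from the paper: the three-state chain, the `offline_q_learning` and `greedy_policy` helpers, and all hyperparameters are hypothetical choices made for this example.

```python
from collections import defaultdict


def offline_q_learning(dataset, n_actions, gamma=0.9, sweeps=200, lr=0.5):
    """Tabular Q-learning that only replays a fixed list of transitions.

    dataset: list of (state, action, reward, next_state, done) tuples.
    No environment is queried; every update uses logged data only.
    """
    q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(sweeps):
        for s, a, r, s2, done in dataset:
            # Bootstrapped target; terminal transitions use the reward alone.
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += lr * (target - q[s][a])
    return q


def greedy_policy(q, states):
    """Pick the action with the highest learned Q-value in each state."""
    return {s: max(range(len(q[s])), key=lambda a: q[s][a]) for s in states}


# Hypothetical logged data from a 3-state chain (an assumption for this sketch):
# action 1 moves right, action 0 stays; reaching state 2 ends with reward 1.
dataset = [
    (0, 0, 0.0, 0, False),
    (0, 1, 0.0, 1, False),
    (1, 0, 0.0, 1, False),
    (1, 1, 1.0, 2, True),
]

q = offline_q_learning(dataset, n_actions=2)
policy = greedy_policy(q, states=[0, 1])
print(policy)  # {0: 1, 1: 1}
```

Because this toy dataset covers every state-action pair, plain Q-learning recovers the sensible policy. The difficulties the paper surveys arise when the logged data leave some actions uncovered.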

If this is right

Large static datasets from real-world logs can train agents for healthcare or robotics decisions without risky new interactions.
Policy optimization can proceed purely from recorded trajectories, separating data collection from learning.
Automation of decision domains becomes feasible once limitations are addressed through the reviewed techniques.
Research can focus on open problems to improve utility extraction from fixed data sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could enable safer deployment of learned policies in settings where online exploration carries high cost or danger.
It opens connections to large-scale supervised learning on logged decision data from production systems.
Open problems identified may direct attention toward handling distribution shifts between dataset and deployment conditions.

Load-bearing premise

That the limitations of current offline algorithms can be overcome by the solutions explored in recent work, enabling effective extraction of maximum-utility policies from available data.

What would settle it

A controlled benchmark where applying all described mitigation techniques still yields offline policies whose performance falls short of online reinforcement learning baselines on standard control tasks.

Original abstract

In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection. Offline reinforcement learning algorithms hold tremendous promise for making it possible to turn large datasets into powerful decision making engines. Effective offline reinforcement learning methods would be able to extract policies with the maximum possible utility out of the available data, thereby allowing automation of a wide range of decision-making domains, from healthcare and education to robotics. However, the limitations of current algorithms make this difficult. We will aim to provide the reader with an understanding of these challenges, particularly in the context of modern deep reinforcement learning methods, and describe some potential solutions that have been explored in recent work to mitigate these challenges, along with recent applications, and a discussion of perspectives on open problems in the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A solid tutorial that organizes offline RL challenges and open problems but introduces no new methods or results.

Hi, this paper is a tutorial and review that pulls together the main ideas around offline reinforcement learning. The authors explain why the setting is promising for turning static datasets into decision-making systems in domains like healthcare and robotics, while clearly laying out the core difficulties such as distribution shift and overestimation on unseen actions. They review mitigation approaches from recent work and flag open questions, which gives readers a map of the area. Because the writers are active researchers in the subfield, the conceptual framing and literature pointers feel reliable and grounded. The paper stays honest about current limitations rather than claiming the problems are solved. The main soft spot is that it contains no new algorithms, experiments, or formal results, so its contribution is purely organizational and expository. A reader already deep in the literature might find the open-problems section high-level rather than sharply diagnostic, and any review inevitably reflects the authors' selection of which papers and angles to emphasize. This is the sort of piece that helps newcomers or people wanting a quick synthesis before reading primary sources. It deserves a serious referee to verify coverage and clarity, even though it is not an original research contribution. I would send it out for peer review.
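
The overestimation concern mentioned in the letter can be made concrete with a small simulation. The sketch below assumes every action has a true value of zero and that each estimate carries independent Gaussian noise; the function name `max_bias` and all parameters are illustrative assumptions, not anything defined in the paper. Taking the maximum over such noisy estimates gives a value that is positive on average, and in the offline setting no new data arrive to correct it.

```python
import random


def max_bias(n_actions=10, noise_std=1.0, trials=5000, seed=0):
    """Average of max(noisy Q estimates) when every true Q-value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # True value of every action is 0, so the true maximum is 0.
        noisy_q = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(noisy_q)
    return total / trials


bias = max_bias()
print(f"average max of noisy estimates: {bias:.2f} (true max is 0.0)")
```

The gap between the printed average and zero is the upward bias introduced by the max operator alone, before any distribution shift is considered.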

Referee Report

0 major / 1 minor

Summary. The paper is a tutorial and review on offline reinforcement learning algorithms that use previously collected data without additional online data collection. It aims to equip readers with conceptual tools to start research in this area, emphasizing the promise of turning large datasets into powerful decision-making engines for domains like healthcare, education, and robotics. The manuscript discusses limitations of current algorithms, particularly in deep RL, potential solutions from recent work, applications, and perspectives on open problems.

Significance. This review could be significant for the field by providing a consolidated overview and highlighting open problems, potentially guiding future research in data-driven RL. As a tutorial from active researchers, it offers reliable conceptual framing of the core promise and difficulties of offline RL.

minor comments (1)

[Abstract] The statement that the paper will 'describe some potential solutions that have been explored in recent work' is vague about scope and gives no examples; a brief enumeration of the main approaches covered would improve reader orientation without altering the tutorial structure.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our tutorial on offline reinforcement learning, the assessment of its potential significance for the field, and the recommendation for minor revision. We are pleased that the manuscript is viewed as providing reliable conceptual framing and highlighting open problems to guide future research.

Circularity Check

0 steps flagged

No significant circularity: review paper with no derivations or self-referential claims

This is a tutorial and review paper that catalogs existing offline RL methods, challenges, and open problems from the literature without presenting any new derivations, equations, fitted parameters, or predictions. No load-bearing steps reduce to self-citations or definitions by construction; all claims are descriptive summaries of prior work. The manuscript is self-contained as a survey and does not introduce novel results that could exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a review and tutorial, the paper introduces no new free parameters, axioms, or invented entities; it summarizes prior offline RL research.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
Decision Transformer: Reinforcement Learning via Sequence Modeling
cs.LG 2021-06 accept novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
cs.LG 2020-04 accept novelty 8.0

D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.
Offline Multi-agent Continual Cooperation via Skill Partition and Reuse
cs.AI 2026-06 unverdicted novelty 7.0

COMAD discovers and reuses coordination skills from mixed offline MARL data via auto-encoders and density-based estimation to achieve continual learning with better transfer.
Robust $Q$-learning for mean-field control under Wasserstein uncertainty in common noise
math.OC 2026-06 unverdicted novelty 7.0

Robust Q-learning algorithm with convergence and finite-time bounds for mean-field control under Wasserstein uncertainty in common noise.
Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random
stat.ML 2026-06 unverdicted novelty 7.0

Identifies full-data conditional mean rewards under MNAR missingness via shadow variables and a bridge function, then builds a consistent FQE-style OPE estimator for missingness-aware policies.
When Offline Selectors Cannot Beat the Best Single Model: A Diagnostic Study on edX Dropout Prediction
cs.LG 2026-06 conditional novelty 7.0

A three-stage diagnostic on edX data shows offline selectors (BC, DQN, CQL) fail to reach oracle performance due to local representational ambiguity rather than learner mismatch or label shift.
Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance
cs.RO 2026-05 unverdicted novelty 7.0

CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.
Fast Convergence of Policy Regret in Learning Stochastic Optimal Control
math.OC 2026-05 unverdicted novelty 7.0

In stochastic optimal control, policy regret converges at rate n to the power of minus min of p over 2(p-q) and (m+1) over 2m given an n to the minus one-half accurate Q-star estimator, when the regularity exponent q ...
Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

Proposes latent analogies and analogy transduction to enable compositional generalization to unseen goal-context pairs in offline GCRL, outperforming trajectory-stitching baselines on manipulation tasks.
Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation
cs.LG 2026-05 unverdicted novelty 7.0

Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasonin...
Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

CPQL adapts the multi-step Peng's Q(λ) operator for conservative offline value estimation, achieving performance guarantees and empirical gains over single-step baselines on D4RL while supporting offline-to-online fin...
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG 2026-05 unverdicted novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

TCE bridges domain gaps in offline RL by selectively using source data or generating target-aligned transitions via a dual score-based model, outperforming baselines in experiments.
Aligning Flow Map Policies with Optimal Q-Guidance
cs.LG 2026-05 unverdicted novelty 7.0

Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 7.0

Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 7.0

TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift
cs.LG 2026-05 unverdicted novelty 7.0

Anchor-TS corrects bias from distribution shift in offline-to-online bandits by taking the median of an online posterior sample, a hybrid posterior sample, and the online sample mean.
Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift
cs.LG 2026-05 unverdicted novelty 7.0

Anchor-TS defines arm indices as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean to correct distribution-shift bias and safely accelerate online learning with offline data.
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
cs.LG 2026-05 unverdicted novelty 7.0

The paper establishes the first $\tilde{O}(\epsilon^{-1})$ upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general functi...
Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift
cs.LG 2026-05 unverdicted novelty 7.0

SeqRejectron constructs a stopping rule with a small set of validator policies to achieve horizon-free sample complexity for selective imitation learning under arbitrary dynamics shifts.
Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift
cs.LG 2026-05 unverdicted novelty 7.0

SeqRejectron builds a stopping rule from a small set of validator policies to achieve horizon-free sample-complexity guarantees for selective imitation learning under arbitrary train-test dynamics shifts.
Quantile-Coupled Flow Matching for Distributional Reinforcement Learning
cs.LG 2026-05 conditional novelty 7.0

FlowIQN is a quantile-coupled CFM critic that yields the first explicit Wasserstein-aligned approximate projection for distributional RL, with improved return-distribution accuracy and competitive offline RL performance.
Zero-shot Imitation Learning by Latent Topology Mapping
cs.LG 2026-05 unverdicted novelty 7.0

ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
cs.AI 2026-05 unverdicted novelty 7.0

LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
Learning Visual Feature-Based World Models via Residual Latent Action
cs.CV 2026-05 unverdicted novelty 7.0

RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
Path-Coupled Bellman Flows for Distributional Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.
Dynamic Treatment on Networks
stat.ML 2026-05 unverdicted novelty 7.0

Q-Ising integrates Bayesian dynamic Ising modeling with offline RL to enable adaptive network treatment policies that outperform static centrality benchmarks under spillovers.
Operator-Guided Invariance Learning for Continuous Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

VPSD-RL discovers exact and approximate value-preserving Lie-group operators in continuous RL to stabilize learning via transition augmentation and consistency regularization.
SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
cs.LG 2026-05 unverdicted novelty 7.0

SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.
Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...
Adaptive Estimation and Optimal Control in Offline Contextual MDPs without Stationarity
stat.ML 2026-05 unverdicted novelty 7.0

A T-estimation-based procedure for adaptive density estimation and optimal control in offline contextual MDPs without stationarity, providing oracle risk bounds under two loss functions and finite-sample cost guarantees.
Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning
cs.LG 2026-05 unverdicted novelty 7.0

FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.
CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
cs.AI 2026-05 unverdicted novelty 7.0

CoFlow achieves state-of-the-art coordination quality in offline MARL using only 1-3 denoising steps by natively coupling velocity fields across agents via coordinated attention and gating.
CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
cs.AI 2026-05 unverdicted novelty 7.0

CoFlow achieves state-of-the-art coordination in offline MARL using single-pass joint velocity fields with Coordinated Velocity Attention and Adaptive Coordination Gating.
CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

CODA augments offline multi-agent RL with on-policy diffusion trajectories that evolve with the joint policy to enable coordination.
CASP: Support-Aware Offline Policy Selection for Two-Stage Recommender Systems
cs.IR 2026-04 unverdicted novelty 7.0

CASP selects lower-burden two-stage recommender policies by combining doubly robust estimation with a penalty for weak data support and provides theoretical guarantees for conservative selection.
SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation
cs.CV 2026-04 unverdicted novelty 7.0

SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.
Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation
stat.ML 2026-04 unverdicted novelty 7.0

High-order generator regression from multi-step trajectories yields a second-order accurate estimator for finite-horizon continuous-time policy evaluation that outperforms the Bellman baseline in calibration studies a...
Locality, Not Spectral Mixing, Governs Direct Propagation in Distributed Offline Dynamic Programming
cs.DC 2026-04 unverdicted novelty 7.0

Locality sets the fundamental round lower bound L_ε = floor(log(1/2ε)/log(1/γ)) for ε-accuracy on large-diameter graphs; direct propagation achieves it while gossip averaging pays extra 1/gap(W) factors.
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
cs.RO 2026-04 unverdicted novelty 7.0

ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
Adaptive Control in Autonomous Driving via Real-Time Recurrent RL
cs.RO 2026-02 unverdicted novelty 7.0

Combines offline behavioral cloning with online Real-Time Recurrent RL fine-tuning on LrcSSM models to adapt autonomous driving policies to distribution shifts, validated in simulation and on a real 1:10-scale robot w...
Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism
cs.LG 2025-12 conditional novelty 7.0

NEUBAY uses Bayesian posteriors over world models with long-horizon planning to match or exceed conservative offline RL methods without explicit conservatism.
Mixed-Density Diffuser: Efficient Planning with Non-Uniform Temporal Resolution
cs.AI 2025-10 unverdicted novelty 7.0

Mixed-Density Diffuser achieves new state-of-the-art results on D4RL benchmarks by allowing non-uniform temporal resolution in diffusion planning.
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling
cs.LG 2025-07 unverdicted novelty 7.0

Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning
cs.LG 2025-06 conditional novelty 7.0

BiTrajDiff augments offline RL datasets by running independent forward and backward diffusion processes from intermediate states, yielding higher performance than prior one-directional data-augmentation baselines on D4RL.
Offline Constrained Reinforcement Learning under Partial Data Coverage
stat.ML 2025-05 unverdicted novelty 7.0

PDOCRL is an oracle-efficient primal-dual method for offline constrained RL under general function approximation that returns near-optimal policies with O(eps^{-2}) samples under partial optimal-policy coverage and a ...
A Tale of Two Cities: Pessimism and Opportunism in Offline Dynamic Pricing
stat.ML 2024-11 unverdicted novelty 7.0

Introduces pessimistic and opportunistic policies for offline dynamic pricing under no price coverage via partial identification from demand monotonicity, with finite-sample regret bounds that recover standard rates w...
VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
cs.RO 2022-09 unverdicted novelty 7.0

VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.
Generalization in offline RL: The structure is more important than the amount of pessimism
cs.LG 2026-07 unverdicted novelty 6.0

In offline RL, the structure of pessimism (set by dataset coverage) matters more for generalization than its amount; a symmetric overly pessimistic value function can outperform a non-symmetric mildly pessimistic one.
Episodic-to-Semantic Consolidation Without Identity Drift
cs.AI 2026-07 unverdicted novelty 6.0

A deterministic episodic-to-semantic consolidation function with a structural lemma proving identity invariance, demonstrated in synthetic experiments on an embodied service agent.
HJ-SafeDMP: Hamilton-Jacobi Reachability-Guided Dynamic Movement Primitives for Provably Safe Robot Motion
cs.RO 2026-06 unverdicted novelty 6.0

HJ-SafeDMP learns a control barrier value function offline from demonstrations via finite-difference HJ recursion and uses it as a closed-form safety filter on DMP outputs, with conformal prediction for coverage guarantees.
Hallucination in World Models is Predictable and Preventable
cs.LG 2026-06 unverdicted novelty 6.0

Hallucination in world models is a data coverage issue predictable by three signals and preventable through targeted training sampling and online data collection.
Horizon Adaptive Offline Policy Learning via Value Stitching
cs.LG 2026-06 unverdicted novelty 6.0

VAST learns a horizon-adaptive auxiliary value function and stitching policy to compose variable-length returns for improved offline policy optimization on long-horizon tasks.
Sim2O: Efficient Offline-to-Online MARL via Joint Action Composition
cs.LG 2026-06 unverdicted novelty 6.0

Sim2O enables efficient offline-to-online MARL by dynamically blending offline and online action proposals across agents and selecting high-value combinations via a centralized value function without auxiliary objectives.
When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?
stat.ML 2026-06 unverdicted novelty 6.0

Proposes OPAC for trajectory-level offline RL achieving $\tilde{O}(H^{2}\sqrt{C_{sa}(\pi^*)/n})$ bounds with matching lower bound, plus conditions for tractability in generalized nonlinear outcome settings.
Reversal Q-Learning
cs.LG 2026-06 unverdicted novelty 6.0

Reversal Q-Learning (RQL) proposes reversing flows for virtual trajectories and bias-variance reduction in an expanded MDP to train flow policies, reporting best average performance on 50 simulated robotic tasks versu...
Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning
cs.LG 2026-06 unverdicted novelty 6.0

BFQ enables single-step noise-to-action mapping in offline RL by dividing flow-path displacements into bootstrappable short-range components learned from marginal velocity.
Discovering Interpretable Multi-Parameter Control Policies for Evolutionary Algorithms Using Deep Reinforcement Learning
cs.LG 2026-06 unverdicted novelty 6.0

Deep RL with action decomposition and reward shifting learns a symbolic multi-parameter policy for (1+(λ,λ))-GA on OneMax that outperforms baselines across problem sizes.
Counterfactual Transport Flows for Offline Conservative Trajectory Refinement
cs.LG 2026-06 unverdicted novelty 6.0

Counterfactual transport flows enable conservative, instance-specific trajectory refinement in offline RL by constructing local preference pairs in latent space from offline data and learning refinement directions con...

Reference graph

Works this paper leans on

284 extracted references · 284 canonical work pages · cited by 157 Pith papers · 28 internal anchors


work page
[38]

u lling, K. and Alt \

Peters, J. and M \"u lling, K. and Alt \"u n, Y. Relative Entropy Policy Search. AAAI Conference on Artificial Intelligence (AAAI). 2010

work page 2010
[39]

Neural Information Processing Systems (NIPS) , year =

Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics , author =. Neural Information Processing Systems (NIPS) , year =

work page
[40]

Todorov , title =

E. Todorov , title =. Neural Information Processing Systems (NIPS) , year =

work page
[41]

and Todorov, E

Dvijotham, K. and Todorov, E. , title =. International Conference on International Conference on Machine Learning (ICML) , year =

work page
[42]

Neural Information Processing Systems (NeurIPS) , year =

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction , author =. Neural Information Processing Systems (NeurIPS) , year =

work page
[43]

arXiv preprint arXiv:1603.01312 , year=

Learning physical intuition of block towers by example , author=. arXiv preprint arXiv:1603.01312 , year=

work page internal anchor Pith review arXiv
[45]

2017 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Deep visual foresight for planning robot motion , author=. 2017 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2017 , organization=

work page 2017
[46]

Advances in neural information processing systems , pages=

Interaction networks for learning about objects, relations and physics , author=. Advances in neural information processing systems , pages=

work page
[47]

International Conference on Learning Representations , year=

Distributionally Robust Neural Networks , author=. International Conference on Learning Representations , year=

work page
[49]

Duchi , title =

Certifying some distributional robustness with principled adversarial training , author=. arXiv preprint arXiv:1710.10571 , year=

work page arXiv
[50]

Advances in neural information processing systems , pages=

Semi-supervised learning with deep generative models , author=. Advances in neural information processing systems , pages=

work page
[51]

Advances in neural information processing systems , pages=

What uncertainties do we need in bayesian deep learning for computer vision? , author=. Advances in neural information processing systems , pages=

work page
[52]

international conference on machine learning , pages=

Dropout as a bayesian approximation: Representing model uncertainty in deep learning , author=. international conference on machine learning , pages=

work page
[53]

Causality for machine learning

Causality for Machine Learning , author=. arXiv preprint arXiv:1911.10500 , year=

work page arXiv 1911
[54]

nature , volume=

Deep learning , author=. nature , volume=. 2015 , publisher=

work page 2015
[55]

Deep Reinforcement Learning and the Deadly Triad

Deep reinforcement learning and the deadly triad , author=. arXiv preprint arXiv:1812.02648 , year=

work page Pith review arXiv
[56]

International Conference on Artificial Intelligence and Statistics (AISTATS 2010) , year =

Learning Policy Improvements with Path Integrals , author =. International Conference on Artificial Intelligence and Statistics (AISTATS 2010) , year =

work page 2010
[57]

International Conference on Machine Learning (ICML) , year =

Trust Region Policy Optimization , author =. International Conference on Machine Learning (ICML) , year =

work page
[58]

, title =

Levine, S. , title =

work page
[59]

, title =

Ziebart, B. , title =

work page
[60]

H. J. Kappen , title =. Inference and Learning in Dynamic Models , year =

work page
[61]

Advances in Neural Information Processing Systems , pages=

Learning continuous control policies by stochastic value gradients , author=. Advances in Neural Information Processing Systems , pages=

work page
[62]

Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages=

Reinforcement learning with deep energy-based policies , author=. Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages=. 2017 , organization=

work page 2017
[63]

H. J. Kappen and V. G. Optimal control as a graphical model inference problem , journal =. 2012 , pages =

work page 2012
[64]

Rawlik and M

K. Rawlik and M. Toussaint and S. Vijayakumar , title =. 2013 , booktitle =

work page 2013
[65]

Toussaint , title =

M. Toussaint , title =. International Conference on Machine Learning (ICML) , year =

work page
[66]

Uncertainty in Artificial Intelligence (UAI) , volume=

Hierarchical POMDP Controller Optimization by Likelihood Maximization , author=. Uncertainty in Artificial Intelligence (UAI) , volume=

work page
[67]

Kalman , title=

R. Kalman , title=. ASME Transactions journal of basic engineering , volume=

work page
[68]

Advances in neural information processing systems , pages=

Actor-critic algorithms , author=. Advances in neural information processing systems , pages=

work page
[69]

Machine learning , volume=

Self-improving reactive agents based on reinforcement learning, planning and teaching , author=. Machine learning , volume=. 1992 , publisher=

work page 1992
[70]

Machine learning , volume=

Q-learning , author=. Machine learning , volume=. 1992 , publisher=

work page 1992
[71]

Todorov , title =

E. Todorov , title =. Advances in Neural Information Processing Systems (NIPS) , year =

work page
[72]

and Schaal, S

Peters, J. and Schaal, S. , title =. International Conference on Machine Learning (ICML) , year =

work page
[73]

Neumann , title =

G. Neumann , title =. International Conference on Machine Learning (ICML) , year =

work page
[74]

Levine and V

S. Levine and V. Koltun , title =. Advances in Neural Information Processing Systems (NIPS) , year =

work page
[75]

European Conference on Machine Learning (ECML) , year =

Efficient Sample Reuse in EM-Based Policy Search , author =. European Conference on Machine Learning (ECML) , year =

work page
[76]

Journal of Machine Learning Research , volume=

Variational message passing , author=. Journal of Machine Learning Research , volume=. 2005 , pages=

work page 2005
[77]

International Conference on Machine Learning (ICML) , year=

Reinforcement Learning with Deep Energy-Based Policies , author=. International Conference on Machine Learning (ICML) , year=

work page
[78]

2018 , booktitle =

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor , author =. 2018 , booktitle =

work page 2018
[79]

2020 , booktitle =

D4RL: Datasets for Deep Data-Driven Reinforcement Learning , author =. 2020 , booktitle =

work page 2020
[80]

2017 , booktitle =

Bridging the Gap Between Value and Policy Based Reinforcement Learning , author =. 2017 , booktitle =

work page 2017
[81]

and Koltun, V

Levine, S. and Koltun, V. , title =. International Conference on International Conference on Machine Learning (ICML) , year =

work page
[82]

Ziebart, B. D. and Maas, A. and Bagnell, J. A. and Dey, A. K. , title =. International Conference on Artificial Intelligence (AAAI) , year =

work page
