Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning

Alessandro Lazaric; Denis Yarats; Lerrel Pinto; Rob Fergus

arxiv: 2107.09645 · v1 · pith:FQLV5H64new · submitted 2021-07-20 · 💻 cs.AI · cs.LG

keywords: drq-v2, control, continuous, directly, learning, model-free, reinforcement, tasks

We present DrQ-v2, a model-free reinforcement learning (RL) algorithm for visual continuous control. DrQ-v2 builds on DrQ, an off-policy actor-critic approach that uses data augmentation to learn directly from pixels. We introduce several improvements that yield state-of-the-art results on the DeepMind Control Suite. Notably, DrQ-v2 is able to solve complex humanoid locomotion tasks directly from pixel observations, a result previously unattained by model-free RL. DrQ-v2 is conceptually simple, easy to implement, and provides a significantly better computational footprint than prior work, with the majority of tasks taking just 8 hours to train on a single GPU. Finally, we publicly release DrQ-v2's implementation to provide RL practitioners with a strong and computationally efficient baseline.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG 2026-05 unverdicted novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Operator-Guided Invariance Learning for Continuous Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

VPSD-RL discovers exact and approximate value-preserving Lie-group operators in continuous RL to stabilize learning via transition augmentation and consistency regularization.
Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations
cs.RO 2026-04 unverdicted novelty 7.0

ACO-MoE employs agent-centric mixture-of-experts to decouple task-relevant features from dynamic visual perturbations in RL, recovering 95.3% of clean performance on the new VDCS benchmark.
Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations
cs.RO 2026-04 unverdicted novelty 7.0

ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.
Mastering Diverse Domains through World Models
cs.AI 2023-01 unverdicted novelty 7.0

DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
Rank-Then-Act: Reward-Free Control from Frame-Order Progress
cs.LG 2026-07 unverdicted novelty 6.0

RTA trains a VLM as a progress ordinal scorer via GRPO on shuffled expert frames and uses Spearman rank correlation with temporal indices as a bounded RL reward, matching or exceeding prior video reward methods on dis...
RARM: Confidence-Gated Progress Reward Modeling for RL in Manipulation
cs.RO 2026-06 unverdicted novelty 6.0

RARM is a lightweight visual comparator trained once on general videos that supplies dense progress rewards to RL by matching rollout clips to a reference demonstration and gating rewards on match confidence.
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
cs.LG 2026-04 unverdicted novelty 6.0

FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
cs.LG 2026-04 unverdicted novelty 6.0

FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance
cs.AI 2025-09 unverdicted novelty 6.0

TimeRewarder derives step-wise progress rewards from frame-wise temporal distances in passive videos and uses them to guide RL, achieving high success rates on Meta-World tasks with fewer interactions than prior metho...
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
cs.CV 2025-03 unverdicted novelty 6.0

HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.
TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance
cs.AI 2025-09 unverdicted novelty 5.0

TimeRewarder derives progress-based dense rewards from passive videos via frame-wise temporal distance modeling and uses them as proxy rewards to boost RL success on Meta-World tasks.
LLM-Guided Task- and Affordance-Level Exploration in Reinforcement Learning
cs.RO 2025-09 unverdicted novelty 5.0

LLM-TALE steers RL exploration using LLM-generated plans at task and affordance levels with online suboptimality correction, improving sample efficiency and success rates on pick-and-place tasks without human supervision.
Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own
cs.RO 2023-10 unverdicted novelty 5.0

RLFP and the FAC algorithm combine foundation-model priors for policy, value, and rewards to produce sample-efficient robotic RL that reaches 86% real-robot success after one hour and 100% success on 7/8 Meta-world ta...
Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient
cs.RO 2026-05 unverdicted novelty 4.0

SDPG is a new on-policy visual RL algorithm that estimates gradients via stochastic perturbations of rollouts, achieving faster training and lower memory use than baselines on visual MuJoCo tasks while adding new robo...
Balancing Plasticity and Stability with Fast and Slow Successor Features
cs.LG 2026-05 unverdicted novelty 4.0

Synaptic consolidation applied to multi-timescale successor features yields better performance than plasticity-focused methods in RL under gradual environmental drift.
Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms
cs.RO 2026-05 unverdicted novelty 4.0

A survey proposing a hierarchical taxonomy for multimodal tactile fusion datasets and methods across perception, generation, and interaction in embodied intelligence.