Decoupled Weight Decay Regularization
L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate, this is not the case for adaptive gradient algorithms such as Adam. While common implementations of these algorithms employ L$_2$ regularization (often calling it "weight decay", which may be misleading given the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at https://github.com/loshchil/AdamW-and-SGDW
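The distinction drawn in the abstract can be illustrated with a minimal scalar sketch. This is not the authors' code (their implementation is at the repository linked above); the function names `sgd_step` and `adam_step`, the default hyperparameters, and the choice to scale the decoupled decay by the learning rate are assumptions made for illustration only.

```python
import math

def sgd_step(w, grad, lr, l2=0.0, weight_decay=0.0):
    """One plain SGD step on a scalar weight (hypothetical helper).

    l2           adds l2 * w to the gradient (L2 regularization).
    weight_decay shrinks the weight directly by weight_decay * w.
    """
    return w - lr * (grad + l2 * w) - weight_decay * w

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam step on a scalar weight (hypothetical helper).

    decoupled=False: the penalty gradient weight_decay * w is added to the
    loss gradient, so it passes through Adam's adaptive rescaling (L2).
    decoupled=True: the weight is decayed separately from the adaptive step.
    Scaling the decay by lr is an assumption of this sketch.
    """
    w_old = w
    if not decoupled:
        grad = grad + weight_decay * w_old
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w_old - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * weight_decay * w_old
    return w, m, v

# SGD: L2 with coefficient weight_decay / lr matches direct weight decay.
lr, wd = 0.1, 0.01
sgd_l2 = sgd_step(2.0, 0.5, lr, l2=wd / lr)
sgd_wd = sgd_step(2.0, 0.5, lr, weight_decay=wd)

# Adam with a zero loss gradient: only the regularizer moves the weight.
adam_l2, _, _ = adam_step(2.0, 0.0, 0.0, 0.0, t=1, lr=0.01, weight_decay=0.1)
adam_dec, _, _ = adam_step(2.0, 0.0, 0.0, 0.0, t=1, lr=0.01, weight_decay=0.1,
                           decoupled=True)
print(sgd_l2, sgd_wd, adam_l2, adam_dec)
```

In the SGD case the two variants coincide once the L2 coefficient is rescaled by the learning rate. In the Adam case with a zero loss gradient, the L2 variant moves the weight by roughly `lr` on the first step regardless of the penalty's magnitude, because the penalty gradient is normalized by its own second-moment estimate, whereas the decoupled variant shrinks the weight in proportion to `weight_decay * w`.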
Forward citations
Cited by 60 Pith papers
-
DataComp-VLM: Improved Open Datasets for Vision-Language Models
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
-
Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization
Lip Forcing distills a 14B bidirectional video diffusion teacher into autoregressive students that achieve real-time lip synchronization at 31 FPS using two denoising steps without CFG.
-
Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects
Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.
-
Stochastic Non-Smooth Convex Optimization with Unbounded Gradients
Introduces generalized Lipschitz class and shows clipped AdamW outperforms SGD and AdaGrad for stochastic convex optimization under this and related assumptions.
-
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
-
Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation
Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset ...
-
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
-
Online Learning-to-Defer with Varying Experts
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
-
Dissecting Jet-Tagger Through Mechanistic Interpretability
A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.
-
LLM Translation of Compiler Intermediate Representation
IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
-
Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum
Momentum-based async SGD achieves optimal convergence rates for data-dependent delays without biasing updates toward simpler samples.
-
CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models
CADFS supplies a large real-world CAD dataset and FeatureScript representation that, after VLM fine-tuning, produces more accurate and feature-rich designs than prior generative CAD systems.
-
When Do Diffusion Models learn to Generate Multiple Objects?
Diffusion models' multi-object generation is limited primarily by scene complexity and held-out combinations rather than imbalance, with counting difficult in low data and compositional generalization collapsing as mo...
-
Stability and Generalization in Looped Transformers
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...
-
CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations
CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving ...
-
Rotation Equivariant Mamba for Vision Tasks
EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-e...
-
A document is worth a structured record: Principled inductive bias design for document recognition
Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, ...
-
HSG-12M: A Large-Scale Benchmark of Spatial Multigraphs from the Energy Spectra of Non-Hermitian Crystals
Authors release HSG-12M, a dataset of 16.7 million spatial multigraphs generated from non-Hermitian crystal energy spectra via the Poly2Graph pipeline, along with initial GNN benchmarks.
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
ORPO: Monolithic Preference Optimization without Reference Model
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
-
VMamba: Visual State Space Model
VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.
-
Progress measures for grokking via mechanistic interpretability
Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
-
Discovering Latent Knowledge in Language Models Without Supervision
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
-
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
-
Decision Transformer: Reinforcement Learning via Sequence Modeling
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
-
RoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
Seek to Segment: Active Perception for Panoramic Referring Segmentation
Introduces APRS task and PanoSeeker agent using VLM plus EgoSphere memory for active 360° search and segmentation, outperforming baselines on a new benchmark.
-
GaussianFusion: Unified 3D Gaussian Representation for Multi-Modal Fusion Perception
GaussianFusion presents a 3D Gaussian-based framework that unifies multi-modal features in continuous space for 3D object detection and semantic occupancy, reporting gains over BEVFusion and GaussFormer on nuScenes.
-
Joint inference of weak lensing convergence map and cosmology with diffusion models
A transformer-based diffusion model learns the joint distribution of convergence maps and cosmology from log-normal weak lensing simulations and generates calibrated posterior samples matching MCMC results.
-
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates l...
-
Deep Reinforcement Learning for Individual Atomic Control and Cooling
Deep reinforcement learning achieves real-time cooling of single-atom motion with a 388 microsecond time constant using cavity feedback, outperforming a linear differentiator controller.
-
Morphing into Hybrid Attention Models
FlashMorph formulates hybrid layer selection as budget-constrained optimization, trains per-layer gates on synthetic retrieval data with linearization regularization, then discretizes and distills to produce efficient...
-
Factorizable Normalizing Flows for parameter-dependent density morphing
Factorizable Normalizing Flows represent parameter-dependent densities via a reference flow composed with a factorized polynomial transformation, enabling isolated per-parameter learning and linear scaling.
-
Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI
MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation wit...
-
Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation
PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA ...
-
DeVAR: Low-Dose CT Denoising via Visual Autoregressive Modeling
DeVAR is the first application of visual autoregressive modeling to low-dose CT denoising, using next-scale token prediction, a residual refiner, and hybrid discrete-continuous decoding to outperform prior methods on ...
-
Masked Language Flow Models
MLFMs combine masking with continuous flows to scale flow-based language models to reasoning and instruction-following tasks on GSM8K and MT-Bench.
-
Understanding Cross-Rig Generalization in Automotive Perception: a Multi-Rig Benchmark and Rig Variation Metrics
Introduces Plentiful CARLA Camera Rigs benchmark and two calibration-derived metrics (Rig Variance, Rig Contrastive Distance) showing geometric rig differences correlate with cross-rig performance drops in multi-view ...
-
Tessellating The Earth
TTE replaces fixed spherical bases with differentiable Voronoi partitions plus shared semantic tokens to create adaptive geolocation encoders that reach new SOTA on geospatial tasks and iNaturalist species classification.
-
Neural Texture Compression using Hypernetworks
A single hypernetwork generates per-material latent features and MLP decoder parameters for neural texture compression, matching the quality of per-material optimization methods while enabling extensions like super-re...
-
Training for the Model You Return: Improving Optimization for Iterate-Averaged Language Models
PACE is an AdamW wrapper derived from optimal control that improves the limiting error of the returned exponential-moving-average model in both theory and LM experiments.
-
Training for the Model You Return: Improving Optimization for Iterate-Averaged Language Models
PACE is a clipped per-coordinate controller added to AdamW that improves the limiting error of the returned iterate average in both quadratic analysis and LM experiments.
-
PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments
PatternGSL introduces a learnable specification language for sewing patterns that lets vision-language models reconstruct explicit, simulation-ready 3D garments from single images, backed by a new 300K paired dataset.
-
PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments
PatternGSL is a new template-free specification language for complete sewing patterns that enables direct single-image prediction of simulation-ready garments via a vision-language model, supported by a new 300K paire...
-
PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments
PatternGSL defines a compact, learnable specification language for sewing patterns that enables direct image-to-structured-garment prediction via VLM without templates or post-optimization, supported by a 300K dataset.
-
PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments
PatternGSL is a learnable template-free language for garment sewing patterns enabling direct VLM prediction of simulation-ready 3D garments from images, backed by a 300K image-to-specification dataset.
-
MeshFlow: Mesh Generation with Equivariant Flow Matching
MeshFlow applies equivariant optimal-transport flow matching to generate triangle meshes as soups, matching autoregressive quality with an 18x inference speedup.
-
PanoVine: Whole-Body Visuomotor Control for Soft Growing Vine Robot
Introduces the first autonomous whole-body vision control system for soft vine robots via an end-to-end visuomotor policy trained on demonstrations.
-
Full-Body Golf Swing Kinematic Reconstruction From a Smartwatch IMU
WIT-KinNet estimates full-body golf swing kinematics from one wrist IMU with 8.11° mean absolute error on 36 golfers across swing types and clubs, validated against optical motion capture.
-
The Pitfall of Scaling Up: Uncovering and Mitigating Popularity Bias Amplification in Scaling Transformer-based Recommenders
Transformer recommenders amplify popularity bias via spectral collapse when scaled; SPRINT constrains attention column-sums and feed-forward spectral norms to improve fairness and scaling behavior.
-
Towards Understanding the Power and Limits of the Muon Optimizer: A River-Valley Perspective
Muon moves faster along signal river directions early but converges slower or oscillates near optima than GD due to orthogonal updates removing scale information, supporting two-stage optimization.
-
Atomistic Language Models Understand and Generate Materials
ALMs unify pretrained atomistic encoder, LLM, and denoising diffusion via continuous projectors and staged training to reach SOTA on text-conditioned crystal prediction and de novo generation.
-
MADField: Multi-fidelity Amortized Density Field for Adsorption in Nanoporous Materials
MADField is a multi-fidelity amortized model for predicting density fields to improve accuracy and speed of adsorption calculations in nanoporous materials for high-throughput screening.
-
Context-Aware Autoregressive Diffusion for Gloss-Wise Sign Language Production
GARD is a context-aware autoregressive diffusion model for gloss-wise sign language production using inter-gloss transition guidance and global motion harmonizer, claiming superior linguistic accuracy and motion simil...
-
Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models
PRISM shows video diffusion models inherently encode preference information in noisy latents, achieving SOTA accuracy and enabling noise-robust early-stage sampling with a correlation to generative performance.
-
EFIQA: Explainable Fundus Image Quality Assessment via Anatomical Priors
EFIQA uses unsupervised masked anatomical inpainting to learn normal fundus structures and produces spatial quality maps via a shallow adapter on a frozen foundation model without any quality supervision.
-
Deep-Unfolded Coordination
Deep Coordinator uses deep unfolding to adapt ADMM-DDP penalty parameters at runtime, delivering 6.18-9.44x faster comparable-quality trajectories in car and quadrotor fleet simulations while scaling to 8x larger systems.
-
ParticleTransformer is all you need for reconstructing hadronic tau leptons
ParticleTransformer models achieve per-mille level misidentification rates, F1 scores up to 0.95 for decay modes, and superior kinematic resolution for hadronic tau leptons compared to conventional jet observables on ...
-
ParticleTransformer is all you need for reconstructing hadronic tau leptons
First fully machine-learned hadronic tau reconstruction at FCC-ee using ParticleTransformer achieves high performance on simulated data for identification, decay mode, charge, and kinematics.