Averaging Weights Leads to Wider Optima and Better Generalization

Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson · 2018 · cs.LG · arXiv 1803.05407

Canonical reference. 71% of citing Pith papers cite this work as background.

45 Pith papers citing it

Background 71% of classified citations

abstract

Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
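The averaging step described in the abstract can be illustrated with a minimal sketch. It assumes a toy quadratic loss with Gaussian gradient noise in place of a real network, a constant learning rate, and a snapshot taken at every step after a warm-up; the names `swa_update`, `noisy_grad`, and `run`, and all hyperparameter values, are hypothetical and not taken from the paper's code. Network-specific details are omitted.

```python
import random


def swa_update(avg, weights, n_averaged):
    """Fold one more weight snapshot into an equal-weight running average."""
    return [(a * n_averaged + w) / (n_averaged + 1) for a, w in zip(avg, weights)]


def noisy_grad(w, rng):
    """Gradient of the toy loss 0.5 * ||w||^2 plus Gaussian noise (stand-in for minibatch noise)."""
    return [wi + rng.gauss(0.0, 1.0) for wi in w]


def run(steps=2000, swa_start=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    w = [3.0, -2.0]          # current SGD iterate
    w_swa = [0.0, 0.0]       # running average; overwritten by the first snapshot
    n_averaged = 0
    snapshots = []           # kept only so the running average can be checked
    for step in range(steps):
        grad = noisy_grad(w, rng)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]  # constant learning rate
        if step >= swa_start:
            w_swa = swa_update(w_swa, w, n_averaged)
            n_averaged += 1
            snapshots.append(list(w))
    return w, w_swa, snapshots


w_last, w_swa, snapshots = run()
print("last iterate:", w_last)
print("averaged    :", w_swa)
```

The running update keeps only one extra copy of the weights, which is consistent with the abstract's point that the procedure adds almost no computational overhead. In this toy setting the last iterate keeps fluctuating around the minimum at the origin, while the average stays much closer to it.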

citation-role summary

background: 6 · method: 1

citation-polarity summary

background: 5 · unclear: 1 · use method: 1

representative citing papers

Neural Network Quantization by Learning Low-Loss Subspaces

cs.CV · 2026-06-23 · unverdicted · novelty 7.0

Learning quantization-aware linear paths in weight space yields a midpoint whose direct quantization matches quantization-aware training performance without using straight-through estimators.

Training for the Model You Return: Improving Optimization for Iterate-Averaged Language Models

cs.LG · 2026-06-23 · unverdicted · novelty 7.0 · 2 refs

PACE is a clipped per-coordinate controller added to AdamW that improves the limiting error of the returned iterate average in both quadratic analysis and LM experiments.

Why Muon Outperforms Adam: A Curvature Perspective

cs.LG · 2026-06-03 · conditional · novelty 7.0

Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.

Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

cs.LG · 2026-05-14 · conditional · novelty 7.0

A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.

A foundation model of vision, audition, and language for in-silico neuroscience

q-bio.NC · 2026-05-05 · unverdicted · novelty 7.0

TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.

Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading

cs.CR · 2026-04-19 · unverdicted · novelty 7.0

Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.

Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation

quant-ph · 2026-04-09 · unverdicted · novelty 7.0

Neural decoder for quantum LDPC codes achieves ~10^{-10} logical error at 0.1% physical error with 17x improvement and high throughput, enabling practical fault tolerance at modest code sizes.

Optimizing Visual Generative Models via Distribution-wise Rewards

cs.LG · 2026-07-02 · unverdicted · novelty 6.0

Distribution-wise rewards with subset-replace strategy and post-hoc merging improve FID-50K on SiT (8.30 to 5.77) and EDM2 (3.74 to 3.52) while preserving diversity.

On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

On-policy self-distillation with sampled demonstrations reduces rollout diversity by amplifying existing probability gaps in the base model, unlike ideal RL which preserves ratios among correct outputs.

EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

EMAgnet replaces uniform-magnet regularization in PPO self-play with an EMA of last-iterate policy parameters and reports lower exploitability on most tested zero-sum benchmarks, especially those with dominated strategies.

Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis

eess.SP · 2026-05-16 · unverdicted · novelty 6.0

Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.

AIRA_2: Overcoming Bottlenecks in AI Research Agents

cs.AI · 2026-03-27 · conditional · novelty 6.0

AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.

Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation

cs.CV · 2025-12-11 · conditional · novelty 6.0

SAAD adaptively weights adversarial training samples by their transferability to the teacher, yielding higher AutoAttack robustness than prior distillation methods on CIFAR and Tiny-ImageNet without extra compute.

EmbeddingGemma: Powerful and Lightweight Text Representations

cs.CL · 2025-09-24 · unverdicted · novelty 6.0

A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.

TabICL: A Tabular Foundation Model for In-Context Learning on Large Data

cs.LG · 2025-02-08 · unverdicted · novelty 6.0

TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.

Sharpness-Aware Minimization for Efficiently Improving Generalization

cs.LG · 2020-10-03 · conditional · novelty 6.0

SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.

Spatial Adapter: Structured Spatial Decomposition and Closed-Form Covariance for Frozen Predictors

stat.ML · 2026-05-12 · unverdicted · novelty 6.0

The Spatial Adapter equips frozen predictors with a spatially regularized orthonormal basis for residuals and derives a closed-form low-rank-plus-noise covariance for spatial prediction and kriging.

MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

MELD is a multi-task AI-text detector using auxiliary heads, uncertainty-weighted losses, EMA distillation, and pairwise ranking that reaches 99.9% TPR at 1% FPR on a new held-out benchmark while remaining competitive on the RAID leaderboard.

Perturb and Correct: Post-Hoc Ensembles using Affine Redundancy

cs.LG · 2026-05-02 · unverdicted · novelty 6.0

Perturb-and-Correct generates epistemically diverse predictors from a single pretrained network via hidden-layer perturbations followed by affine least-squares corrections that enforce agreement on calibration data.

Using Graph Neural Networks for hadronic clustering and to reduce beam background in the Belle II electromagnetic calorimeter

hep-ex · 2026-04-22 · unverdicted · novelty 6.0

Graph neural networks can identify and remove unwanted beam background depositions in the Belle II calorimeter to improve hadronic clustering and reduce fake photon clusters.

FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods

cs.CV · 2026-04-22 · conditional · novelty 6.0

The FastAT Benchmark standardizes evaluation of over twenty fast adversarial training methods under unified conditions, showing that well-designed single-step approaches can match or exceed PGD-AT robustness at lower training cost on CIFAR-10, CIFAR-100, and Tiny-ImageNet.

Generalization at the Edge of Stability

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization error in a way not captured by prior trace or norm measures.

Benchmarking Optimizers for MLPs in Tabular Deep Learning

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

Muon optimizer outperforms AdamW across 17 tabular datasets when training MLPs under a shared protocol.

Vision Transformers Need Registers

cs.CV · 2023-09-28 · unverdicted · novelty 6.0

Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

citing papers explorer

Showing 45 of 45 citing papers.

Neural Network Quantization by Learning Low-Loss Subspaces cs.CV · 2026-06-23 · unverdicted · none · ref 20 · internal anchor
Learning quantization-aware linear paths in weight space yields a midpoint whose direct quantization matches quantization-aware training performance without using straight-through estimators.
Training for the Model You Return: Improving Optimization for Iterate-Averaged Language Models cs.LG · 2026-06-23 · unverdicted · none · ref 15 · 2 links · internal anchor
PACE is a clipped per-coordinate controller added to AdamW that improves the limiting error of the returned iterate average in both quadratic analysis and LM experiments.
Why Muon Outperforms Adam: A Curvature Perspective cs.LG · 2026-06-03 · conditional · none · ref 149 · internal anchor
Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.
Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm cs.LG · 2026-05-14 · conditional · none · ref 39 · internal anchor
A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.
A foundation model of vision, audition, and language for in-silico neuroscience q-bio.NC · 2026-05-05 · unverdicted · none · ref 65
TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.
Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading cs.CR · 2026-04-19 · unverdicted · none · ref 188
Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation quant-ph · 2026-04-09 · unverdicted · none · ref 76
Neural decoder for quantum LDPC codes achieves ~10^{-10} logical error at 0.1% physical error with 17x improvement and high throughput, enabling practical fault tolerance at modest code sizes.
Optimizing Visual Generative Models via Distribution-wise Rewards cs.LG · 2026-07-02 · unverdicted · none · ref 15 · internal anchor
Distribution-wise rewards with subset-replace strategy and post-hoc merging improve FID-50K on SiT (8.30 to 5.77) and EDM2 (3.74 to 3.52) while preserving diversity.
On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity cs.LG · 2026-06-24 · unverdicted · none · ref 113 · internal anchor
On-policy self-distillation with sampled demonstrations reduces rollout diversity by amplifying existing probability gaps in the base model, unlike ideal RL which preserves ratios among correct outputs.
EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games cs.LG · 2026-06-22 · unverdicted · none · ref 9 · internal anchor
EMAgnet replaces uniform-magnet regularization in PPO self-play with an EMA of last-iterate policy parameters and reports lower exploitability on most tested zero-sum benchmarks, especially those with dominated strategies.
Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis eess.SP · 2026-05-16 · unverdicted · none · ref 144 · internal anchor
Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
AIRA_2: Overcoming Bottlenecks in AI Research Agents cs.AI · 2026-03-27 · conditional · none · ref 9 · internal anchor
AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.
Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation cs.CV · 2025-12-11 · conditional · none · ref 35 · internal anchor
SAAD adaptively weights adversarial training samples by their transferability to the teacher, yielding higher AutoAttack robustness than prior distillation methods on CIFAR and Tiny-ImageNet without extra compute.
EmbeddingGemma: Powerful and Lightweight Text Representations cs.CL · 2025-09-24 · unverdicted · none · ref 7 · internal anchor
A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.
TabICL: A Tabular Foundation Model for In-Context Learning on Large Data cs.LG · 2025-02-08 · unverdicted · none · ref 245 · internal anchor
TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.
Sharpness-Aware Minimization for Efficiently Improving Generalization cs.LG · 2020-10-03 · conditional · none · ref 21 · internal anchor
SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.
Spatial Adapter: Structured Spatial Decomposition and Closed-Form Covariance for Frozen Predictors stat.ML · 2026-05-12 · unverdicted · none · ref 21
The Spatial Adapter equips frozen predictors with a spatially regularized orthonormal basis for residuals and derives a closed-form low-rank-plus-noise covariance for spatial prediction and kriging.
MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text cs.CL · 2026-05-07 · unverdicted · none · ref 18
MELD is a multi-task AI-text detector using auxiliary heads, uncertainty-weighted losses, EMA distillation, and pairwise ranking that reaches 99.9% TPR at 1% FPR on a new held-out benchmark while remaining competitive on the RAID leaderboard.
Perturb and Correct: Post-Hoc Ensembles using Affine Redundancy cs.LG · 2026-05-02 · unverdicted · none · ref 24
Perturb-and-Correct generates epistemically diverse predictors from a single pretrained network via hidden-layer perturbations followed by affine least-squares corrections that enforce agreement on calibration data.
Using Graph Neural Networks for hadronic clustering and to reduce beam background in the Belle II electromagnetic calorimeter hep-ex · 2026-04-22 · unverdicted · none · ref 9
Graph neural networks can identify and remove unwanted beam background depositions in the Belle II calorimeter to improve hadronic clustering and reduce fake photon clusters.
FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods cs.CV · 2026-04-22 · conditional · none · ref 27
The FastAT Benchmark standardizes evaluation of over twenty fast adversarial training methods under unified conditions, showing that well-designed single-step approaches can match or exceed PGD-AT robustness at lower training cost on CIFAR-10, CIFAR-100, and Tiny-ImageNet.
Generalization at the Edge of Stability cs.LG · 2026-04-21 · unverdicted · none · ref 36
Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization error in a way not captured by prior trace or norm measures.
Benchmarking Optimizers for MLPs in Tabular Deep Learning cs.LG · 2026-04-16 · unverdicted · none · ref 4
Muon optimizer outperforms AdamW across 17 tabular datasets when training MLPs under a shared protocol.
Vision Transformers Need Registers cs.CV · 2023-09-28 · unverdicted · none · ref 166
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
Flatness Preserves Instruction Following in Vision-Language-Action Models cs.RO · 2026-06-22 · unverdicted · none · ref 41 · internal anchor
Sharpness-aware minimization during VLA finetuning preserves instruction following and yields over 60% gains across simulation and real-world tasks.
Data Selection Through Iterative Self-Filtering for Vision-Language Settings cs.CV · 2026-06-22 · unverdicted · none · ref 92 · internal anchor
An iterative bootstrapped self-filtering approach selects balanced clean and diverse subsets from noisy vision-language datasets to train improved CLIP models.
GRAIN: Group Aggregation via Min-Norm Objective cs.LG · 2026-06-22 · unverdicted · none · ref 69 · internal anchor
GRAIN is a gradient aggregation method using min-norm objectives to ensure non-negative inner products with group gradients, yielding tighter uniform stability bounds than SGD under smoothness assumptions.
LLM Evolution as an Industry-Scale Ecosystem: A Lifecycle Perspective on Continual Learning cs.LG · 2026-06-12 · unverdicted · none · ref 50 · internal anchor
The paper reformulates industrial continual learning for LLMs as a closed-loop ecosystem problem, identifies three core challenges, and organizes solutions around five lifecycle design principles.
Anytime Training with Schedule-Free Spectral Optimization cs.LG · 2026-05-21 · unverdicted · none · ref 28 · internal anchor
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
UniAlign: A Model-Agnostic Framework for Robust Network Traffic Classification under Distribution Shifts cs.LG · 2026-05-17 · unverdicted · none · ref 46 · internal anchor
UniAlign improves robustness of deep learning NTC models under distribution shifts via domain alignment fine-tuning and stable ensembling, yielding 2.51% accuracy and 2.71% F1 gains over standard training on three public datasets.
Don't Stop Me Yet: Sampling Loss Minima via Dissipative Riemannian Mechanics cs.LG · 2026-05-14 · unverdicted · none · ref 46 · internal anchor
DiMS is a physics-inspired dynamical sampler guaranteed to exactly sample reparameterization-invariant minimum level sets in neural network loss landscapes.
CPCANet: Deep Unfolding Common Principal Component Analysis for Domain Generalization cs.CV · 2026-05-06 · unverdicted · none · ref 44 · 2 links · internal anchor
CPCANet unrolls the Flury-Gautschi algorithm for Common Principal Component Analysis into differentiable layers to learn a shared invariant subspace across domains, reporting SOTA zero-shot transfer on four DG benchmarks.
Defending against Backdoor Attacks via Module Switching cs.CR · 2025-04-08 · unverdicted · none · ref 19 · internal anchor
Module-switching defense disrupts backdoors more effectively than weight averaging with fewer models and remains robust even when some models share the same backdoors.
Causal Fine-Tuning under Latent Confounded Shift cs.LG · 2024-10-18 · unverdicted · none · ref 25 · internal anchor
Causal Fine-Tuning decomposes BERT representations into causal and spurious parts via SCM inductive bias to improve robustness under latent confounded shifts in text classification.
The Platonic Representation Hypothesis cs.LG · 2024-05-13 · unverdicted · none · ref 101 · internal anchor
Representations learned by large AI models are converging toward a shared statistical model of reality.
Differentially Private Model Merging cs.LG · 2026-04-22 · unverdicted · none · ref 33
Post-processing via random selection or linear combination of differentially private models allows meeting arbitrary target privacy parameters without additional training.
Biologically-Grounded Multi-Encoder Architectures as Developability Oracles for Antibody Design q-bio.BM · 2026-04-10 · unverdicted · none · ref 12
CrossAbSense oracles using frozen PLM encoders plus self- or cross-attention decoders improve prediction accuracy by 12-20% on three of five developability assays for therapeutic IgGs, with architecture choices revealing that aggregation depends on single-chain signals while stability requires heavy
MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications cs.CV · 2026-04-03 · unverdicted · none · ref 32
MOMO merges sensor-specific models from three Mars orbital instruments at matched validation loss stages to form a foundation model that outperforms ImageNet, Earth observation, sensor-specific, and supervised baselines on nine Mars-Bench tasks.
Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini cs.CV · 2026-05-26 · unverdicted · none · ref 26 · internal anchor
A native multimodal embedding model from Gemini achieves reported state-of-the-art results on retrieval benchmarks across modalities via large-scale contrastive learning.
Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities cs.LG · 2024-08-14 · accept · none · ref 89 · internal anchor
The paper introduces a new taxonomy for model merging methods and reviews their applications in LLMs, MLLMs, continual learning, multi-task learning, and other subfields while outlining open challenges.
TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection cs.LG · 2026-05-09 · unverdicted · none · ref 12
TopoGeoScore learns a non-negative linear combination of geometric and topological features from source embeddings via self-supervised invariance to select robust checkpoints for OOD scenarios.
Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning cs.LG · 2026-05-08 · unverdicted · none · ref 30
The paper proposes Trajectory Regularized Merging (TRM) to enable storage-free model merging in continual learning by optimizing in an augmented trajectory subspace with task alignment, prediction consistency, and gradient responsiveness objectives, claiming SOTA results.
Momentum-Anchored Multi-Scale Fusion Model for Long-Tailed Chest X-Ray Classification cs.CV · 2026-05-04 · unverdicted · none · ref 14
A new neural network stabilizes features for rare chest X-ray diseases via momentum anchoring and multi-scale fusion on EfficientNet, achieving 0.8682 AUC on ChestX-ray14.
Phoenix-VL 1.5 Medium Technical Report cs.CL · 2026-05-11 · unverdicted · none · ref 11
Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying competitive on global benchmarks.
LLMs Struggle with Abstract Meaning Comprehension More Than Expected cs.CL · 2026-04-13 · unverdicted · none · ref 12
LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.
