Averaging Weights Leads to Wider Optima and Better Generalization
Canonical reference. 71% of citing Pith papers cite this work as background.
abstract
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
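The averaging step described in the abstract can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation: the noisy quadratic objective, the function names (`swa_running_average`, `sgd_with_swa`, `noisy_quadratic_grad`), and the parameters `swa_start` and `swa_every` are all assumptions made for the example. It runs SGD with a constant learning rate and keeps an equal-weight running mean of weight snapshots taken along the trajectory.

```python
import random


def swa_running_average(avg, new, n_averaged):
    """Fold one more weight snapshot into an equal-weight running mean."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, new)]


def sgd_with_swa(grad_fn, w0, lr=0.1, steps=200, swa_start=100, swa_every=10, seed=0):
    """Constant-learning-rate SGD that also averages periodic snapshots.

    swa_start and swa_every are hypothetical knobs for this sketch: averaging
    begins at step swa_start and takes a snapshot every swa_every steps.
    """
    rng = random.Random(seed)
    w = list(w0)
    w_swa, n_avg = None, 0
    for step in range(1, steps + 1):
        g = grad_fn(w, rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]  # plain SGD update
        if step >= swa_start and (step - swa_start) % swa_every == 0:
            if w_swa is None:
                w_swa, n_avg = list(w), 1  # first snapshot starts the average
            else:
                w_swa = swa_running_average(w_swa, w, n_avg)
                n_avg += 1
    return w, w_swa, n_avg


def noisy_quadratic_grad(w, rng):
    # Gradient of 0.5 * ||w||^2 plus Gaussian noise, a stand-in for minibatch noise.
    return [wi + rng.gauss(0.0, 1.0) for wi in w]


w_last, w_avg, n_snapshots = sgd_with_swa(noisy_quadratic_grad, [5.0, -3.0])
print("last iterate:", w_last)
print("averaged weights:", w_avg, "from", n_snapshots, "snapshots")
```

The running-mean form avoids storing every snapshot, which is in keeping with the abstract's remark that the procedure adds almost no computational overhead.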
citing papers explorer
-
Neural Network Quantization by Learning Low-Loss Subspaces
Learning quantization-aware linear paths in weight space yields a midpoint whose direct quantization matches quantization-aware training performance without using straight-through estimators.
-
Training for the Model You Return: Improving Optimization for Iterate-Averaged Language Models
PACE is a clipped per-coordinate controller added to AdamW that improves the limiting error of the returned iterate average in both quadratic analysis and LM experiments.
-
Why Muon Outperforms Adam: A Curvature Perspective
Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.
-
Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm
A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.
-
A foundation model of vision, audition, and language for in-silico neuroscience
TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.
-
Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
-
Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation
Neural decoder for quantum LDPC codes achieves ~10^{-10} logical error at 0.1% physical error with 17x improvement and high throughput, enabling practical fault tolerance at modest code sizes.
-
Optimizing Visual Generative Models via Distribution-wise Rewards
Distribution-wise rewards with subset-replace strategy and post-hoc merging improve FID-50K on SiT (8.30 to 5.77) and EDM2 (3.74 to 3.52) while preserving diversity.
-
On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity
On-policy self-distillation with sampled demonstrations reduces rollout diversity by amplifying existing probability gaps in the base model, unlike ideal RL which preserves ratios among correct outputs.
-
EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games
EMAgnet replaces uniform-magnet regularization in PPO self-play with an EMA of last-iterate policy parameters and reports lower exploitability on most tested zero-sum benchmarks, especially those with dominated strategies.
-
Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis
Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
-
AIRA_2: Overcoming Bottlenecks in AI Research Agents
AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.
-
Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation
SAAD adaptively weights adversarial training samples by their transferability to the teacher, yielding higher AutoAttack robustness than prior distillation methods on CIFAR and Tiny-ImageNet without extra compute.
-
EmbeddingGemma: Powerful and Lightweight Text Representations
A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.
-
TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.
-
Sharpness-Aware Minimization for Efficiently Improving Generalization
SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.
-
Spatial Adapter: Structured Spatial Decomposition and Closed-Form Covariance for Frozen Predictors
The Spatial Adapter equips frozen predictors with a spatially regularized orthonormal basis for residuals and derives a closed-form low-rank-plus-noise covariance for spatial prediction and kriging.
-
MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text
MELD is a multi-task AI-text detector using auxiliary heads, uncertainty-weighted losses, EMA distillation, and pairwise ranking that reaches 99.9% TPR at 1% FPR on a new held-out benchmark while remaining competitive on the RAID leaderboard.
-
Perturb and Correct: Post-Hoc Ensembles using Affine Redundancy
Perturb-and-Correct generates epistemically diverse predictors from a single pretrained network via hidden-layer perturbations followed by affine least-squares corrections that enforce agreement on calibration data.
-
Using Graph Neural Networks for hadronic clustering and to reduce beam background in the Belle II electromagnetic calorimeter
Graph neural networks can identify and remove unwanted beam background depositions in the Belle II calorimeter to improve hadronic clustering and reduce fake photon clusters.
-
FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods
The FastAT Benchmark standardizes evaluation of over twenty fast adversarial training methods under unified conditions, showing that well-designed single-step approaches can match or exceed PGD-AT robustness at lower training cost on CIFAR-10, CIFAR-100, and Tiny-ImageNet.
-
Generalization at the Edge of Stability
Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization error in a way not captured by prior trace or norm measures.
-
Benchmarking Optimizers for MLPs in Tabular Deep Learning
Muon optimizer outperforms AdamW across 17 tabular datasets when training MLPs under a shared protocol.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
Flatness Preserves Instruction Following in Vision-Language-Action Models
Sharpness-aware minimization during VLA finetuning preserves instruction following and yields over 60% gains across simulation and real-world tasks.
-
Data Selection Through Iterative Self-Filtering for Vision-Language Settings
An iterative bootstrapped self-filtering approach selects balanced clean and diverse subsets from noisy vision-language datasets to train improved CLIP models.
-
GRAIN: Group Aggregation via Min-Norm Objective
GRAIN is a gradient aggregation method using min-norm objectives to ensure non-negative inner products with group gradients, yielding tighter uniform stability bounds than SGD under smoothness assumptions.
-
LLM Evolution as an Industry-Scale Ecosystem: A Lifecycle Perspective on Continual Learning
The paper reformulates industrial continual learning for LLMs as a closed-loop ecosystem problem, identifies three core challenges, and organizes solutions around five lifecycle design principles.
-
Anytime Training with Schedule-Free Spectral Optimization
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
-
UniAlign: A Model-Agnostic Framework for Robust Network Traffic Classification under Distribution Shifts
UniAlign improves robustness of deep learning NTC models under distribution shifts via domain alignment fine-tuning and stable ensembling, yielding 2.51% accuracy and 2.71% F1 gains over standard training on three public datasets.
-
Don't Stop Me Yet: Sampling Loss Minima via Dissipative Riemannian Mechanics
DiMS is a physics-inspired dynamical sampler guaranteed to exactly sample reparameterization-invariant minimum level sets in neural network loss landscapes.
-
CPCANet: Deep Unfolding Common Principal Component Analysis for Domain Generalization
CPCANet unrolls the Flury-Gautschi algorithm for Common Principal Component Analysis into differentiable layers to learn a shared invariant subspace across domains, reporting SOTA zero-shot transfer on four DG benchmarks.
-
Defending against Backdoor Attacks via Module Switching
Module-switching defense disrupts backdoors more effectively than weight averaging with fewer models and remains robust even when some models share the same backdoors.
-
Causal Fine-Tuning under Latent Confounded Shift
Causal Fine-Tuning decomposes BERT representations into causal and spurious parts via SCM inductive bias to improve robustness under latent confounded shifts in text classification.
-
The Platonic Representation Hypothesis
Representations learned by large AI models are converging toward a shared statistical model of reality.
-
Differentially Private Model Merging
Post-processing via random selection or linear combination of differentially private models allows meeting arbitrary target privacy parameters without additional training.
-
Biologically-Grounded Multi-Encoder Architectures as Developability Oracles for Antibody Design
CrossAbSense oracles using frozen PLM encoders plus self- or cross-attention decoders improve prediction accuracy by 12-20% on three of five developability assays for therapeutic IgGs, with architecture choices revealing that aggregation depends on single-chain signals while stability requires heavy
-
MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications
MOMO merges sensor-specific models from three Mars orbital instruments at matched validation loss stages to form a foundation model that outperforms ImageNet, Earth observation, sensor-specific, and supervised baselines on nine Mars-Bench tasks.
-
Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini
A native multimodal embedding model from Gemini achieves reported state-of-the-art results on retrieval benchmarks across modalities via large-scale contrastive learning.
-
Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
The paper introduces a new taxonomy for model merging methods and reviews their applications in LLMs, MLLMs, continual learning, multi-task learning, and other subfields while outlining open challenges.
-
TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection
TopoGeoScore learns a non-negative linear combination of geometric and topological features from source embeddings via self-supervised invariance to select robust checkpoints for OOD scenarios.
-
Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning
The paper proposes Trajectory Regularized Merging (TRM) to enable storage-free model merging in continual learning by optimizing in an augmented trajectory subspace with task alignment, prediction consistency, and gradient responsiveness objectives, claiming SOTA results.
-
Momentum-Anchored Multi-Scale Fusion Model for Long-Tailed Chest X-Ray Classification
A new neural network stabilizes features for rare chest X-ray diseases via momentum anchoring and multi-scale fusion on EfficientNet, achieving 0.8682 AUC on ChestX-ray14.
-
Phoenix-VL 1.5 Medium Technical Report
Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying competitive on global benchmarks.
-
LLMs Struggle with Abstract Meaning Comprehension More Than Expected
LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.