Git Re-Basin: Merging Models modulo Permutation Symmetries
read the original abstract
The success of deep learning is due in large part to our ability to solve certain massive non-convex optimization problems with relative ease. Though non-convex optimization is NP-hard, simple algorithms -- often variants of stochastic gradient descent -- exhibit surprising effectiveness in fitting large neural networks in practice. We argue that neural network loss landscapes often contain (nearly) a single basin after accounting for all possible permutation symmetries of hidden units a la Entezari et al. 2021. We introduce three algorithms to permute the units of one model to bring them into alignment with a reference model in order to merge the two models in weight space. This transformation produces a functionally equivalent set of weights that lie in an approximately convex basin near the reference model. Experimentally, we demonstrate the single basin phenomenon across a variety of model architectures and datasets, including the first (to our knowledge) demonstration of zero-barrier linear mode connectivity between independently trained ResNet models on CIFAR-10. Additionally, we identify intriguing phenomena relating model width and training time to mode connectivity. Finally, we discuss shortcomings of the linear mode connectivity hypothesis, including a counterexample to the single basin theory.
This paper has not been read by Pith yet.
Forward citations
Cited by 22 Pith papers
-
The Statistical Cost of Adaptation in Multi-Source Transfer Learning
Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
-
Editing Models with Task Arithmetic
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
-
WARP: Weight-Space Analysis for Recovering Training Data Portfolios
WARP recovers training domain mixtures from fine-tuned model weights using weight-space interpolation via model merging to generate pseudo-checkpoints and geometric features mapped to proportions.
-
Toward Open Weight Models Without Risks: Separating Public and Private Capabilities in LLMs
Tiered Language Models use a secret key to induce an alternative computation graph over shared weights, enabling private capabilities in the keyed mode while the public mode shows none.
-
Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates
MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.
-
Discovering Physical Directions in Weight Space: Composing Neural PDE Experts
Fine-tuning neural PDE operators to regime endpoints reveals a physical direction in weight space that CCM uses to compose accurate merged models for new or extrapolated regimes from metadata or short prefixes.
-
Flat Channels to Infinity in Neural Loss Landscapes
Neural loss landscapes contain flat channels to infinity along which gradient flow leads pairs of neurons to implement gated linear units.
-
Child-directed speech facilitates production, not comprehension, in BabyLMs
CDS-trained BabyLMs show earlier and more appropriate production in a new frame-completion task while FineWeb-edu models lead on comprehension benchmarks, indicating current tests underestimate CDS benefits.
-
Motion-Compensated Weight Compression
MCWC aligns permutation-symmetric blocks across layers to enable sequential prediction and residual entropy coding, improving rate-accuracy tradeoffs versus quantization and prior codecs on language and vision models.
-
Unlocking the Potential of Continual Model Merging: An ODE Perspective
Introduces ODE-M, an ODE-based merging method for continual model merging that follows low-loss connecting paths to mitigate catastrophic forgetting.
-
Unlocking the Potential of Continual Model Merging: An ODE Perspective
ODE-M traces low-loss connecting paths via time-dependent velocity fields and barrier constraints to improve controllability and reduce forgetting in continual model merging.
-
Unlocking the Potential of Continual Model Merging: An ODE Perspective
ODE-M formulates continual model merging as a barrier-aware ODE trajectory in parameter space, using first-order feedback and a utility-aware schedule to balance retained knowledge and new task performance.
-
PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging
PivotMerge merges heterogeneous multimodal pre-trained models via shared-space decomposition to filter conflicts and layer-wise weights based on alignment contributions, outperforming baselines on multimodal benchmarks.
-
Functional Similarity Metric for Neural Networks: Overcoming Parametric Ambiguity via Activation Region Analysis
A functional similarity metric for ReLU networks uses normalized activation region signatures and MinHash to overcome parametric symmetries like neuron permutation and scaling.
-
Evidence of an Emergent "Self" in Continual Robot Learning
Continual learning robots form a significantly more stable invariant subnetwork than constant-task controls, and preserving it improves adaptation while damaging it hurts performance.
-
Steerable Adversarial Scenario Generation through Test-Time Preference Alignment
SAGE reframes adversarial scenario generation as multi-objective preference alignment, using hierarchical group-based optimization and test-time linear interpolation of two expert policies to enable steerable control ...
-
DanceOPD: On-Policy Generative Field Distillation
DanceOPD routes samples across capability velocity fields in flow-matching models and trains via on-policy student-induced states to compose T2I, local editing, and global editing without mutual interference.
-
Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers
A bidirectional optimization method using parameterized transformations enables near-zero loss barriers for linear mode connectivity in medium-scale language models and small barriers in billion-parameter transformers.
-
HiP-LoRA: Budgeted Spectral Plasticity for Robust Low-Rank Adaptation
HiP-LoRA decomposes LoRA updates into principal and residual spectral channels with a singular-value-weighted stability budget to reduce forgetting and interference during foundation model adaptation.
-
MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications
MOMO merges sensor-specific models from three Mars orbital instruments at matched validation loss stages to form a foundation model that outperforms ImageNet, Earth observation, sensor-specific, and supervised baselin...
-
The Platonic Representation Hypothesis
Representations learned by large AI models are converging toward a shared statistical model of reality.
-
Multi-Level Analyzation of Imbalance to Resolve Non-IID-Ness in Federated Learning
FedBB addresses inter-case, inter-class, and inter-client imbalances in federated learning via Positive Negative Balanced loss and Client Balanced Reweighting, outperforming baselines on X-ray and natural image datase...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.