
arxiv: 2209.04836 · v6 · pith:YYZVEFFLnew · submitted 2022-09-11 · 💻 cs.LG · cs.AI

Git Re-Basin: Merging Models modulo Permutation Symmetries

keywords: model, basin, connectivity, mode, models, single, algorithms, including

The success of deep learning is due in large part to our ability to solve certain massive non-convex optimization problems with relative ease. Though non-convex optimization is NP-hard, simple algorithms -- often variants of stochastic gradient descent -- exhibit surprising effectiveness in fitting large neural networks in practice. We argue that neural network loss landscapes often contain (nearly) a single basin after accounting for all possible permutation symmetries of hidden units a la Entezari et al. 2021. We introduce three algorithms to permute the units of one model to bring them into alignment with a reference model in order to merge the two models in weight space. This transformation produces a functionally equivalent set of weights that lie in an approximately convex basin near the reference model. Experimentally, we demonstrate the single basin phenomenon across a variety of model architectures and datasets, including the first (to our knowledge) demonstration of zero-barrier linear mode connectivity between independently trained ResNet models on CIFAR-10. Additionally, we identify intriguing phenomena relating model width and training time to mode connectivity. Finally, we discuss shortcomings of the linear mode connectivity hypothesis, including a counterexample to the single basin theory.
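The permutation symmetry the abstract relies on can be illustrated with a minimal sketch. It is not one of the paper's three algorithms: it only checks, on a hypothetical two-layer ReLU MLP with made-up sizes and random weights, that reordering the hidden units (rows of the first layer, matching columns of the second) leaves the function unchanged, and that the inverse permutation restores the original weights. The helper names `mlp` and `permute_hidden` are assumptions introduced for this example.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def mlp(params, x):
    """Two-layer MLP: W2 @ relu(W1 @ x + b1) + b2."""
    W1, b1, W2, b2 = params
    h = relu([z + b for z, b in zip(matvec(W1, x), b1)])
    return [z + b for z, b in zip(matvec(W2, h), b2)]


def permute_hidden(params, perm):
    """Reorder hidden units: rows of W1 and b1, and the matching columns of W2."""
    W1, b1, W2, b2 = params
    return (
        [W1[i] for i in perm],
        [b1[i] for i in perm],
        [[row[i] for i in perm] for row in W2],
        list(b2),
    )


# Hypothetical sizes and random weights, chosen only for illustration.
rng = random.Random(0)
d_in, d_hidden, d_out = 3, 5, 2
params = (
    [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)],
    [rng.gauss(0, 1) for _ in range(d_hidden)],
    [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)],
    [rng.gauss(0, 1) for _ in range(d_out)],
)

perm = list(range(d_hidden))
rng.shuffle(perm)
permuted = permute_hidden(params, perm)

# The permuted weights compute the same function (up to float rounding).
x = [rng.gauss(0, 1) for _ in range(d_in)]
max_diff = max(abs(a - b) for a, b in zip(mlp(params, x), mlp(permuted, x)))

# Applying the inverse permutation recovers the original weights exactly.
inverse = sorted(range(d_hidden), key=perm.__getitem__)
restored = permute_hidden(permuted, inverse)
```

Here `max_diff` should be at floating-point rounding level and `restored` should equal `params`. Choosing which permutation aligns two independently trained models is the hard part the abstract's algorithms address; this sketch does not attempt it.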


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Statistical Cost of Adaptation in Multi-Source Transfer Learning

    math.ST 2026-05 unverdicted novelty 8.0

    Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

  2. Editing Models with Task Arithmetic

    cs.LG 2022-12 accept novelty 8.0

    Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

  3. WARP: Weight-Space Analysis for Recovering Training Data Portfolios

    cs.LG 2026-07 unverdicted novelty 7.0

    WARP recovers training domain mixtures from fine-tuned model weights using weight-space interpolation via model merging to generate pseudo-checkpoints and geometric features mapped to proportions.

  4. Toward Open Weight Models Without Risks: Separating Public and Private Capabilities in LLMs

    cs.CR 2026-06 unverdicted novelty 7.0

    Tiered Language Models use a secret key to induce an alternative computation graph over shared weights, enabling private capabilities in the keyed mode while the public mode shows none.

  5. Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

    cs.LG 2026-06 unverdicted novelty 7.0

    MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.

  6. Discovering Physical Directions in Weight Space: Composing Neural PDE Experts

    cs.LG 2026-05 unverdicted novelty 7.0

    Fine-tuning neural PDE operators to regime endpoints reveals a physical direction in weight space that CCM uses to compose accurate merged models for new or extrapolated regimes from metadata or short prefixes.

  7. Flat Channels to Infinity in Neural Loss Landscapes

    cs.LG 2025-06 unverdicted novelty 7.0

    Neural loss landscapes contain flat channels to infinity along which gradient flow leads pairs of neurons to implement gated linear units.

  8. Child-directed speech facilitates production, not comprehension, in BabyLMs

    cs.CL 2026-05 unverdicted novelty 6.0

    CDS-trained BabyLMs show earlier and more appropriate production in a new frame-completion task while FineWeb-edu models lead on comprehension benchmarks, indicating current tests underestimate CDS benefits.

  9. Motion-Compensated Weight Compression

    cs.CV 2026-05 unverdicted novelty 6.0

    MCWC aligns permutation-symmetric blocks across layers to enable sequential prediction and residual entropy coding, improving rate-accuracy tradeoffs versus quantization and prior codecs on language and vision models.

  10. Unlocking the Potential of Continual Model Merging: An ODE Perspective

    cs.LG 2026-05 unverdicted novelty 6.0

    Introduces ODE-M, an ODE-based merging method for continual model merging that follows low-loss connecting paths to mitigate catastrophic forgetting.

  11. Unlocking the Potential of Continual Model Merging: An ODE Perspective

    cs.LG 2026-05 unverdicted novelty 6.0

    ODE-M traces low-loss connecting paths via time-dependent velocity fields and barrier constraints to improve controllability and reduce forgetting in continual model merging.

  12. Unlocking the Potential of Continual Model Merging: An ODE Perspective

    cs.LG 2026-05 unverdicted novelty 6.0

    ODE-M formulates continual model merging as a barrier-aware ODE trajectory in parameter space, using first-order feedback and a utility-aware schedule to balance retained knowledge and new task performance.

  13. PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging

    cs.CV 2026-04 unverdicted novelty 6.0

    PivotMerge merges heterogeneous multimodal pre-trained models via shared-space decomposition to filter conflicts and layer-wise weights based on alignment contributions, outperforming baselines on multimodal benchmarks.

  14. Functional Similarity Metric for Neural Networks: Overcoming Parametric Ambiguity via Activation Region Analysis

    cs.LG 2026-04 unverdicted novelty 6.0

    A functional similarity metric for ReLU networks uses normalized activation region signatures and MinHash to overcome parametric symmetries like neuron permutation and scaling.

  15. Evidence of an Emergent "Self" in Continual Robot Learning

    cs.RO 2026-03 unverdicted novelty 6.0

    Continual learning robots form a significantly more stable invariant subnetwork than constant-task controls, and preserving it improves adaptation while damaging it hurts performance.

  16. Steerable Adversarial Scenario Generation through Test-Time Preference Alignment

    cs.AI 2025-09 unverdicted novelty 6.0

    SAGE reframes adversarial scenario generation as multi-objective preference alignment, using hierarchical group-based optimization and test-time linear interpolation of two expert policies to enable steerable control ...

  17. DanceOPD: On-Policy Generative Field Distillation

    cs.CV 2026-06 unverdicted novelty 5.0

    DanceOPD routes samples across capability velocity fields in flow-matching models and trains via on-policy student-induced states to compose T2I, local editing, and global editing without mutual interference.

  18. Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers

    cs.LG 2026-06 unverdicted novelty 5.0

    A bidirectional optimization method using parameterized transformations enables near-zero loss barriers for linear mode connectivity in medium-scale language models and small barriers in billion-parameter transformers.

  19. HiP-LoRA: Budgeted Spectral Plasticity for Robust Low-Rank Adaptation

    cs.LG 2026-04 unverdicted novelty 5.0

    HiP-LoRA decomposes LoRA updates into principal and residual spectral channels with a singular-value-weighted stability budget to reduce forgetting and interference during foundation model adaptation.

  20. MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications

    cs.CV 2026-04 unverdicted novelty 5.0

    MOMO merges sensor-specific models from three Mars orbital instruments at matched validation loss stages to form a foundation model that outperforms ImageNet, Earth observation, sensor-specific, and supervised baselin...

  21. The Platonic Representation Hypothesis

    cs.LG 2024-05 unverdicted novelty 5.0

    Representations learned by large AI models are converging toward a shared statistical model of reality.

  22. Multi-Level Analyzation of Imbalance to Resolve Non-IID-Ness in Federated Learning

    cs.LG 2026-06 unverdicted novelty 4.0

    FedBB addresses inter-case, inter-class, and inter-client imbalances in federated learning via Positive Negative Balanced loss and Client Balanced Reweighting, outperforming baselines on X-ray and natural image datase...