pith. sign in

arxiv: 2607.01311 · v1 · pith:36L3KIOKnew · submitted 2026-07-01 · 💻 cs.LG · stat.ML

From Approximation to Emergence: A Theory of Deep Learning

Pith reviewed 2026-07-03 21:37 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords deep learning theoryapproximation theoryoptimizationgeneralizationoverparameterizationemergencescaling lawstransformers
0
0 comments X

The pith

Deep learning theory forms a single narrative from classical approximation through optimization to modern emergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The monograph builds a unified account that links early results on function approximation, optimization, and generalization bounds to later work on overparameterization, transformers, scaling laws, in-context learning, interpretability, alignment, and emergence. Each component is presented by the object it governs, the assumptions required for its validity, and the phenomena it cannot yet explain. A reader would therefore obtain a map of the field that shows both its accumulated power and its remaining gaps around how capabilities arise from scale, data, architecture, and training. If the account holds, researchers gain a consistent way to place new results inside an existing progression rather than treating them as disconnected.

Core claim

The paper establishes that deep learning theory is best organized as a continuous progression: classical approximation theory supplies representational power, optimization and generalization analyses explain trainability and performance, and contemporary topics such as overparameterization, robustness, generative modeling, transformers, in-context learning, scaling laws, interpretability, alignment, and emergence address how new mechanisms appear once scale, data volume, and architectural choices cross certain thresholds.

What carries the argument

The coherent research narrative that examines each theory through the object it controls, the assumptions that validate it, and the phenomena it leaves unexplained.

If this is right

  • Approximation results set the representational limits that later scaling phenomena must respect.
  • Overparameterization resolves optimization success in regimes where classical generalization bounds predict failure.
  • Scaling laws and emergence mark the points at which new capabilities appear as functions of model size and data volume.
  • Interpretability and alignment questions become questions about how mechanisms that arise at scale can be inspected and steered.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The narrative could be used to locate which current phenomena still lack any controlling theory.
  • Future work might test whether the same progression applies when new architectures or training methods are introduced.
  • The structure suggests that theoretical effort should concentrate on the transition regimes where emergence is observed.

Load-bearing premise

The broad and diverse literature on deep learning can be organized into one coherent narrative that accurately reflects the assumptions, scope, and limitations of each theory without significant selection bias or omissions.

What would settle it

Identification of a substantial body of results on deep learning that cannot be placed inside the proposed progression without violating the assumptions used in the earlier sections on approximation or optimization.

Figures

Figures reproduced from arXiv: 2607.01311 by Zhilin Zhao.

Figure 2.1
Figure 2.1. Figure 2.1: A useful theory must connect the data-generating process, the model class, the training algorithm, and performance on future data. The monograph will repeatedly return to four lenses: approximation, optimization, gener￾alization, and representation. Approximation asks which functions can be represented, and at what size. Optimization asks whether an algorithm can locate a good representative of the class… view at source ↗
Figure 2.2
Figure 2.2. Figure 2.2: Mathematical theory tries to move from fitting labels to predictions with explicit guarantees. Approximation Which functions can networks represent? Optimization Can algorithms find them? Generalization Why predict on new data? Representation What structure is learnable? deep learning theory [PITH_FULL_IMAGE:figures/full_fig_p015_2_2.png] view at source ↗
Figure 2.5
Figure 2.5. Figure 2.5: Universal approximation is only a starting point. It does not by itself give size, optimization, or prediction guarantees. 2.2 The Learning Setup We now introduce the statistical notation used throughout the chapter. Let µ be an unknown distribution on R d × Y. A training sample is S = {(xi , yi)} n i=1, (xi , yi) ∼ µ, typically assumed independent and identically distributed. The vector xi ∈ R d contain… view at source ↗
Figure 2.6
Figure 2.6. Figure 2.6: The training data are samples from an unknown distribution. Definition 2.1 (Population and Empirical Risk). Given a loss ℓ(y, y b ), the population risk and empirical risk are R(f) = E(x,y)∼µ [PITH_FULL_IMAGE:figures/full_fig_p016_2_6.png] view at source ↗
Figure 2.8
Figure 2.8. Figure 2.8: A nonlinear rule in the original coor￾dinates can become a linear rule after a feature map, for example ϕ(x) = (x1, x2, x2 1 , x2 2 , x1x2) >. The approximation problem may now be stated abstractly. Given a target function g and a network class FL,m of depth L and width m, bound inf f∈FL,m kf − gk in a norm appropriate to the learning task, such as L∞ on a compact set or L 2 (µ) under a data distribution… view at source ↗
Figure 2.9
Figure 2.9. Figure 2.9: A feedforward network composes layers of affine maps and coordinatewise nonlin￾earities. The choice of activation matters. Standard examples include ReLU(t) = max{t, 0}, s(t) = 1 1 + e−t , H(t) = 1{t≥0} , c(t) = cos(t). ReLU activations give continuous piecewise-linear functions and are dominant in modern prac￾tice. Sigmoid and threshold activations appear naturally in classical approximation proofs. Cos… view at source ↗
Figure 2.10
Figure 2.10. Figure 2.10: Three basic activations: ReLU, sigmoid, and threshold. these partitions. This remark is often the most useful way to remember ReLU expressivity. A single ReLU unit ReLU(hw, xi + b) is affine on each side of the hyperplane hw, xi + b = 0. A sum of such units creates a piecewise-affine surface with many linear pieces. A deeper network feeds one piecewise-affine map into another, so the number and arrangem… view at source ↗
Figure 2.11
Figure 2.11. Figure 2.11: A one-dimensional ReLU network is piecewise linear; the dashed lines mark hinge locations. expressivity control structured classes uncontrolled fitting [PITH_FULL_IMAGE:figures/full_fig_p020_2_11.png] view at source ↗
Figure 2.12
Figure 2.12. Figure 2.12: More expressivity is valuable only when accompanied by mathematical control. 2.4 Universal Approximation Definition 2.4 (Universal Approximator). Let K ⊂ R d be compact. A function class F ⊂ C(K) is a universal approximator if, for every g ∈ C(K) and every ε > 0, there exists f ∈ F such that sup x∈K |f(x) − g(x)| ≤ ε. This is a density statement in the uniform norm. It says that the closure of F is all … view at source ↗
Figure 2.13
Figure 2.13. Figure 2.13: A continuous target can first be approximated by simple local pieces. Proposition 2.7 (Lipschitz Step Approximation). Let g : [0, 1] → R be ρ-Lipschitz and let H(t) = 1{t≥0} . For every ε > 0, there is a depth-two threshold network gb(x) = g(0) + mX−1 i=1 [PITH_FULL_IMAGE:figures/full_fig_p022_2_13.png] view at source ↗
Figure 2.14
Figure 2.14. Figure 2.14: The one-dimensional construction uses threshold units to build a step approxima￾tion. Proof. For each cell Ri , choose a point xi ∈ Ri and set the value on the cell to g(xi). If x ∈ Ri , then kx − xik∞ ≤ δ, so |g(x) − g(xi)| ≤ ε. The number of cells is approximately δ −d . Thus the construction is essentially a lookup table. It proves universality, but it also reveals the curse of dimensionality: arbitr… view at source ↗
Figure 2.15
Figure 2.15. Figure 2.15: In several dimensions, local pieces are rectangle-like bumps. d cells δ −d dimension increases [PITH_FULL_IMAGE:figures/full_fig_p023_2_15.png] view at source ↗
Figure 2.17
Figure 2.17. Figure 2.17: The Barron seminorm penalizes Fourier mass far from the origin. measure on K. If B(f) < ∞, then for every k ≥ 1 there exists a depth-two network fk(x) = X k j=1 aj σ(hwj , xi + bj ) such that Z K |f(x) − fk(x)| 2 dµ(x) ≤ C B(f) 2 k , where C depends on normalization choices and the bounded domain, but not on the ambient dimension through a grid factor. In the reference notes this is framed as follows: f… view at source ↗
Figure 2.18
Figure 2.18. Figure 2.18: An infinite-width representation averages ridge features over a distribution. integral rep￾resentation sample k units finite O network (k −1/2) [PITH_FULL_IMAGE:figures/full_fig_p026_2_18.png] view at source ↗
Figure 2.20
Figure 2.20. Figure 2.20: Barron’s proof rewrites the target as an infinite-width network and sparsifies the integral. Example 2.13 (Low and High Frequencies). For a unit direction u, f(x) = cos(2π u>x) has Fourier mass concentrated at ±u, so its Barron quantity scales with kuk. By contrast, g(x) = cos(2πq u>x) has mass at ±qu, and its Barron quantity scales with q kuk. High-frequency oscillation is therefore harder for shallow … view at source ↗
Figure 2.21
Figure 2.21. Figure 2.21: Fourier spikes farther from the origin increase the Barron quantity. 2.6 Connections and Limits Approximation theorems are necessary but not sufficient explanations of deep learning. They tell us that a good network exists; optimization asks whether training finds one; generalization asks whether the trained network predicts beyond the sample. A classical strategy is to control all three using a shared … view at source ↗
Figure 2.22
Figure 2.22. Figure 2.22: Useful theory connects ap￾proximation, optimization, and general￾ization. model size error train test interpolation threshold [PITH_FULL_IMAGE:figures/full_fig_p028_2_22.png] view at source ↗
Figure 2.24
Figure 2.24. Figure 2.24: Depth can encode hierarchical or compositional structure succinctly. There is also a complementary width-versus-depth result. For ReLU networks with un￾bounded depth, narrow architectures can still be universal in L p senses. A representative [PITH_FULL_IMAGE:figures/full_fig_p028_2_24.png] view at source ↗
Figure 3.1
Figure 3.1. Figure 3.1: The conceptual flow of Chapter 2. Barron theory gives a representation and a finite-width approximation mechanism. Depth separation gives a different kind of expressivity result. Optimization asks whether the useful representation can be found from data. A useful mental model is a shallow network whose hidden units are ridge features: fm(x) = Xm j=1 aj σ(hwj , xi + bj ). The Barron theorem does not say t… view at source ↗
Figure 3.2
Figure 3.2. Figure 3.2: The shallow-network mental model. Each hidden unit is a ridge feature, and the theorem explains when a finite number of such features is enough. How to Read This Chapter The first two sections study shallow approximation: an analytic condition on f gives an infinite mixture of simple atoms, and a sparsification lemma turns that mixture into a finite network. The third section studies depth: repeated comp… view at source ↗
Figure 3.3
Figure 3.3. Figure 3.3: A Fourier mode is a sinusoid along the direction ω. By subtracting f(0) and integrating along the scalar variable hω, xi, one obtains ridge-type threshold atoms. 3.2.1 From Fourier Phase to Halfspace Atoms The starting point is to subtract the constant value f(0). From equation (3.1), f(x) − f(0) = Z Rd h cos(2π hω, xi + ϕω) − cos(ϕω) i [PITH_FULL_IMAGE:figures/full_fig_p034_3_3.png] view at source ↗
Figure 3.4
Figure 3.4. Figure 3.4: The representation-to-approximation path. The Fourier condition gives an infinite signed mixture, and sparsification converts that mixture into a finite shallow network. The passage from indicator atoms to ReLU atoms is conceptually simple. In one dimension, (t)+ = Z ∞ 0 1{t ≥ b} db. On a bounded domain this integral only needs thresholds in a bounded interval. Thus a ReLU ridge function is an accumulate… view at source ↗
Figure 3.5
Figure 3.5. Figure 3.5: Geometry of a Barron atom. A threshold ridge atom divides space by a hyperplane; ReLU ridge functions can be viewed as integrated threshold atoms. Representation Versus Approximation The signed integral is an exact infinite-width representation. The finite network obtained later is only an approximation, because it replaces the full measure by finitely many atoms. The size of the measure controls how man… view at source ↗
Figure 3.6
Figure 3.6. Figure 3.6: Gaussian spectrum intuition. Most mass remains near the origin, and the first spectral moment grows like the typical radius of a Gaussian vector. Rates for nonparametric estimation often degrade with dimension through volume effects. Bar￾ron approximation is a different statement: given a target function with a controlled represen￾tation, a shallow network can approximate it at a rate governed by Cf . Th… view at source ↗
Figure 3.7
Figure 3.7. Figure 3.7: Frank-Wolfe sparsification. The target X lies in the convex hull of atoms; each greedy step adds one atom while staying inside the hull. Proof. As in Maurey’s lemma, write X = E[V ] for a random atom V ∈ S. Because the greedy step is at least as good as choosing a random atom, for any λ ∈ [0, 1], kgi − Xk 2 ≤ E kλ(gi−1 − X) + (1 − λ)(V − X)k 2 = λ 2 kgi−1 − Xk 2 + (1 − λ) 2E kV − Xk 2 . The cross term va… view at source ↗
Figure 3.8
Figure 3.8. Figure 3.8: The tent map m. Its graph is a single triangle on [0, 1] and zero outside. The tent map has a constant-size ReLU implementation. Let z(x) = 2 ReLU(x) − 4 ReLU x − 1 2  . Then m(x) = ReLU(z(x)). Indeed, on [0, 1/2] this is 2x; on (1/2, 1] it is 2 − 2x; and outside the interval the final ReLU clips the negative part to zero. The point of writing m as a small ReLU module is that the same module can be reu… view at source ↗
Figure 3.9
Figure 3.9. Figure 3.9: A constant-size ReLU module for the tent map. Stacking this module implements iterates of m. The key phenomenon appears when we compose m with itself. Write m(k) = m| ◦ m {z ◦ · · · ◦ m} k times . Each iteration doubles the number of teeth on [0, 1]. Thus m(k) has 2 k teeth, while a network implementing it is obtained by stacking k copies of the same constant-size module. This is the simplest possible ex… view at source ↗
Figure 3.10
Figure 3.10. Figure 3.10: Composition creates teeth. The first, second, and third iterates have one, two, and four teeth on [0, 1]. 3.4.2 Sawtooth Complexity Definition 3.6 (t-sawtooth function). A univariate function f : R → R is t-sawtooth if its domain can be partitioned into at most t intervals on each of which f is affine. The tent map m is 4-sawtooth when the two outside flat pieces are included. On the interval [0, 1], m(… view at source ↗
Figure 3.11
Figure 3.11. Figure 3.11: A schematic for the sawtooth counting bound. ReLU units create piecewise-affine maps; depth composes these maps. 3.4.3 Telgarsky’s Separation Theorem 3.9 (Telgarsky depth separation). For every sufficiently large L, there exists a function f : [0, 1] → [0, 1] computed by a ReLU network with depth O(L 2 ) and O(L 2 ) units such that every ReLU network g with depth at most L and at most 2 L units satisfie… view at source ↗
Figure 3.12
Figure 3.12. Figure 3.12: Deep implementation by module reuse. The tent-map module is small; iterating it produces an exponentially oscillatory function. The proof picture is the following. The target f = m(L2+2) has roughly 2 L2+2 small triangles. A shallow network g, because it has only limited sawtooth complexity, has long affine pieces at the relevant scale. On such an affine piece, g cannot follow both the rising and fallin… view at source ↗
Figure 3.13
Figure 3.13. Figure 3.13: Missing triangles. A shallow piecewise-affine approximator cannot track every oscillation of an iterated tent map. triangles are missed. This gives Z 1 0 |f(x) − g(x)| dx ≥ 1 2 [PITH_FULL_IMAGE:figures/full_fig_p043_3_13.png] view at source ↗
Figure 3.14
Figure 3.14. Figure 3.14: A schematic nonconvex loss landscape. The slide used a surface plot; the essential point is the same: neural-network objectives are not simple convex bowls. 3.5.1 Convex Functions and Gradient Descent Definition 3.11 (Convex function). A differentiable function F : R p → R is convex if for all x, y ∈ R p and all λ ∈ [0, 1], F(λx + (1 − λ)y) ≤ λF(x) + (1 − λ)F(y). For a convex function, every sublevel se… view at source ↗
Figure 3.15
Figure 3.15. Figure 3.15: Gradient descent geometry in the convex baseline theory. The iterates move through nested sublevel sets toward the minimizer. Definition 3.12 (B-smoothness). A differentiable function F is B-smooth if k∇F(x) − ∇F(y)k ≤ B kx − yk for all x, y. If F is twice differentiable, this is equivalent to ∇2F(x)  BI for all x. Smoothness says that the gradient cannot change too abruptly. Therefore a local linear a… view at source ↗
Figure 3.16
Figure 3.16. Figure 3.16: A concept map for later chapters. Approximation, generalization, and optimization will remain separate but interacting themes. Remark 3.15 (Limitations and open directions). Barron bounds are strongest for functions with small spectral first moment. Telgarsky’s construction is explicit and instructive, but it is a stylized one-dimensional target rather than a model of all practical data. The main open q… view at source ↗
Figure 4.1
Figure 4.1. Figure 4.1: The route from convex optimization to NTK theory. Least squares explains residual flow with a fixed feature matrix; NTK analysis repeats the same algebra with the neural-network Jacobian. the parameters move only a little, so the network remains close to its tangent approximation at initialization. In that lazy-training regime, training can be described through a kernel matrix on the training examples. H… view at source ↗
Figure 4.2
Figure 4.2. Figure 4.2: The smallest positive eigenvalue of the relevant kernel controls the slowest residual direction. A larger spectral gap gives faster decay. 4.1.1 Finite-Sample Setup Fix a training set {(xi , yi)} N i=1 and parameters w ∈ R p . It is convenient to suppress the input points and collect the network predictions into a vector f(w) =    f(x1; w) . . . f(xN ; w)    , y =    y1 . . . yN    . The chap… view at source ↗
Figure 4.3
Figure 4.3. Figure 4.3: Parameter space and prediction space. NTK analysis studies a nonconvex parameter trajectory through the induced ODE on the finite vector of training predictions. 4.2 Least Squares and Gradient Flow Before neural networks, we analyze the model problem min x F(x) = 1 2 kAx − bk 2 2 , A ∈ R N×p , b ∈ R N [PITH_FULL_IMAGE:figures/full_fig_p052_4_3.png] view at source ↗
Figure 4.4
Figure 4.4. Figure 4.4: Least-squares geometry. Under strong convexity the contours are elliptic bowls, and gradient descent moves toward the unique minimizer. 4.2.1 Residual Dynamics Let rt = Axt − b. Gradient descent with stepsize η gives xt+1 = xt − ηA>rt . Multiplying by A converts the parameter update into a residual update: rt+1 = Axt+1 − b = rt − ηAA>rt . Thus the matrix that controls training in residual space is AA>, t… view at source ↗
Figure 4.5
Figure 4.5. Figure 4.5: The least-squares Gram matrix. The smallest eigenvalue determines the slowest residual direction. This warm-up matters because the same algebra reappears with A ←→ J(w(0)). The linearized neural network is a least-squares problem whose feature matrix is the Jacobian at initialization. The Warm-Up in One Line Least squares has residual flow r˙ = −AA>r. Linearized neural networks have the same form with A … view at source ↗
Figure 4.6
Figure 4.6. Figure 4.6: Different optimization trajectories can reach different endpoints, especially in non￾convex overparameterized models. 4.3 Nonconvexity and Overparameterization Beyond convexity, several reassuring implications break. Global minimization is computation￾ally hard in general; the reached solution may depend strongly on initialization, stepsize, and noise; and many different parameters may fit the training s… view at source ↗
Figure 4.7
Figure 4.7. Figure 4.7: A schematic nonconvex objective. Gradient methods may be easy to run but hard to analyze globally. In overparameterized neural networks, p N is common. The interpolation equations f(xi ; w) = yi , i = 1, . . . , N, [PITH_FULL_IMAGE:figures/full_fig_p055_4_7.png] view at source ↗
Figure 4.8
Figure 4.8. Figure 4.8: Interpolation in an overparameterized model. Many parameters can fit the training labels exactly. parameter direction loss sharp flat [PITH_FULL_IMAGE:figures/full_fig_p056_4_8.png] view at source ↗
Figure 4.9
Figure 4.9. Figure 4.9: Sharp and flat minima. The endpoint matters, but so does the path that selected it. This is the setting in which one would like a theorem of the following form: for a fixed training set of size N, a sufficiently wide randomly initialized network reaches zero training error if the initial tangent kernel is well conditioned and remains stable during training. The important caveats are that width bounds may… view at source ↗
Figure 4.10
Figure 4.10. Figure 4.10: The tension between kernel learning and feature learning. NTK theory analyzes a regime where tangent features remain nearly fixed. fine Jt =    ∇wf(x1; w(t))> . . . ∇wf(xN ; w(t))>    ∈ R N×p . The empirical neural tangent kernel is Kt = JtJ > t , Kt(i, j) = h∇wf(xi ; w(t)), ∇wf(xj ; w(t))i. The NTK measures how similarly two training examples respond to an infinitesimal param￾eter update. If Kt i… view at source ↗
Figure 4.11
Figure 4.11. Figure 4.11: Linearization at initialization. Lazy training analyzes the network near w0, replac￾ing it by its tangent model. 4.4.2 Linearization at Initialization Near w0 = w(0), define the affine approximation f0(u) = f(w0) + J0(u − w0), J0 = J(w0). The corresponding linearized flow freezes the Jacobian: u˙(t) = −αJ> 0 [PITH_FULL_IMAGE:figures/full_fig_p058_4_11.png] view at source ↗
Figure 4.12
Figure 4.12. Figure 4.12: The empirical NTK as a Gram matrix. Its entries are inner products of training￾example parameter gradients. At suitable random initialization, very wide networks have two related limits: the random function f(·; w0) converges in distribution to a Gaussian process, and the empirical tangent kernel concentrates around a deterministic kernel: K0(x, x0 ) ≈ K∞(x, x0 ). In the infinite-width lazy limit, train… view at source ↗
Figure 4.13
Figure 4.13. Figure 4.13: Random networks give both Gaussian-process output limits and deterministic tangent-kernel limits under appropriate width and scaling. 4.5 Convergence Mechanisms 4.5.1 Fixed-Kernel Decay For the linearized flow, the kernel K0 is constant: r˙(t) = −α 2K0r(t), r(t) = e −α 2K0t r(0) [PITH_FULL_IMAGE:figures/full_fig_p059_4_13.png] view at source ↗
Figure 4.14
Figure 4.14. Figure 4.14: The bootstrap structure of lazy-training proofs. Assume kernel stability, prove small movement, then use small movement and width to justify kernel stability. 4.6 Random Features, Adaptive Features, and Width The simplest model behind the NTK proof is a random-features model: f(x; a) = 1 √ m Xm r=1 arσ(b > r x), where only the output weights ar are trained and the random hidden weights br are fixed. The… view at source ↗
Figure 4.15
Figure 4.15. Figure 4.15: Random features. The hidden features are sampled once and held fixed, so the optimization problem is linear in the output weights. For a two-layer neural network with trainable hidden features, f(x; w) = 1 √ m Xm r=1 arσ(b > r x), the tangent kernel has two types of contributions: K0(x, x0 ) = Ka(x, x0 ) + Kb(x, x0 ). The Ka part comes from changing output weights, while Kb comes from moving hidden feat… view at source ↗
Figure 4.16
Figure 4.16. Figure 4.16: Width as duplication and averaging. Wider networks provide many nearly inde￾pendent tangent features. where the entries of Wi are independent standard Gaussian variables at initialization. Under such scaling, as hidden widths go to infinity, both outputs and tangent kernels concentrate. The reference notes also point toward the mean-field view, where a different scaling describes feature dynamics throug… view at source ↗
Figure 4.17
Figure 4.17. Figure 4.17: Kernel-flow residual decay. A well-conditioned kernel removes residual components quickly; an ill-conditioned kernel has slow directions. 4.7 What NTK Theory Explains NTK theory explains why sufficiently wide random networks can fit finite training sets, why least-squares residual dynamics reappear inside neural network training, and why the smallest empirical NTK eigenvalue controls the training speed.… view at source ↗
Figure 5.1
Figure 5.1. Figure 5.1: Adaptive optimization connects algorithmic geometry, noisy gradient statistics, and the practical cost of training foundation models. The guiding question is: How do adaptive optimizers change geometry, implicit bias, and the cost of training foundation models? The answer is not a single theorem. Rather, the chapter develops a sequence of models. AdaGrad gives a clean convex and online-learning picture o… view at source ↗
Figure 5.2
Figure 5.2. Figure 5.2: The inserted chapter bridges optimization dynamics and implicit regularization by asking how the optimizer changes the path to interpolation. Definition 5.1 (Preconditioned stochastic descent). Let f(θ) = Eξ∼Dℓ(θ; ξ), gt ≈ ∇f(θt) be a stochastic gradient. A preconditioned stochastic descent method has the form θt+1 = θt − ηtPtgt , Pt  0. Here Pt may be fixed, data-dependent, diagonal, low-rank, or a ric… view at source ↗
Figure 5.3
Figure 5.3. Figure 5.3: Multiplying the gradient by a preconditioner changes the steepest descent direction. The chapter is organized around three axes. metric noise memory state cost modern optimizers live at the intersection [PITH_FULL_IMAGE:figures/full_fig_p068_5_3.png] view at source ↗
Figure 5.4
Figure 5.4. Figure 5.4: Optimizer theory combines geometry, statistics, and systems constraints. 5.2 AdaGrad and Data-Dependent Geometry AdaGrad is the cleanest starting point because it has a simple online-learning interpretation. At round t, an algorithm chooses θt , observes a loss ℓt , and receives a gradient gt = ∇ℓt(θt). The algorithm is compared with a fixed decision u using regret RT (u) = X T t=1 ℓt(θt) − ℓt(u). Small … view at source ↗
Figure 5.5
Figure 5.5. Figure 5.5: Adaptive methods use gradient history to choose later coordinate scales. Definition 5.2 (Diagonal AdaGrad). Initialize s0 = 0. Given a gradient gt ∈ R d , AdaGrad updates st = st−1 + gt gt , θt+1 = θt − η gt √ st + ε , where the square root and division are coordinate-wise. Coordinate j has effective learning rate η eff t,j = η √st,j + ε . Thus frequent or high-magnitude gradients reduce future steps in … view at source ↗
Figure 5.6
Figure 5.6. Figure 5.6: AdaGrad assigns different effective step sizes to different coordinates. Theorem 5.3 (AdaGrad coordinate-form regret bound). For convex G∞-Lipschitz losses on a bounded domain of diameter D, diagonal AdaGrad satisfies a regret bound of the form RT (u) ≲ [PITH_FULL_IMAGE:figures/full_fig_p069_5_6.png] view at source ↗
Figure 5.7
Figure 5.7. Figure 5.7: The AdaGrad norm changes as gradients accumulate. Example 5.5 (Rare but predictive features). Suppose x1 is a frequent noisy feature and x2 is a rare but predictive feature. A global learning rate treats their coordinates symmetrically. AdaGrad can reduce steps along the frequently excited noisy coordinate while preserving larger steps for the rare coordinate [PITH_FULL_IMAGE:figures/full_fig_p070_5_7.png] view at source ↗
Figure 5.8
Figure 5.8. Figure 5.8: Coordinate adaptivity is useful when the data distribution is anisotropic across features. AdaGrad also has limitations. Its classical regret theory is convex or online; deep networks are nonconvex. Diagonal scaling ignores correlations between parameters. The accumulated sum st = P τ≤t g 2 τ never forgets, so late training can become too conservative. Adam replaces this infinite memory by exponential me… view at source ↗
Figure 5.9
Figure 5.9. Figure 5.9: Adam forgets old gradients exponentially, while AdaGrad keeps accumulating squared gradients. Definition 5.6 (Adam). Given parameters β1, β2 ∈ [0, 1), initialize m0 = v0 = 0. Adam [PITH_FULL_IMAGE:figures/full_fig_p071_5_9.png] view at source ↗
Figure 5.10
Figure 5.10. Figure 5.10: Adam’s coordinate-wise effective learning rate is inversely related to the estimated RMS scale. For coordinate j, η eff t,j = η p vbt,j + ε . Flat or low-gradient coordinates can receive larger steps; high-variance coordinates receive smaller steps. This normalization is the source of both Adam’s robustness and its subtlety. It makes the update relatively insensitive to the scale of a coordinate’s gradi… view at source ↗
Figure 5.11
Figure 5.11. Figure 5.11: AMSGrad repairs Adam’s convergence issue by using a monotone second-moment denominator. Definition 5.8 (AMSGrad). AMSGrad modifies Adam by setting vet = max(vet−1, vbt), θt+1 = θt − η mb t √ vet + ε , where the maximum is coordinate-wise. AMSGrad is important theoretically, but AdamW is more common in large-scale language￾model practice. The central difference is the treatment of weight decay. AMSGrad i… view at source ↗
Figure 5.12
Figure 5.12. Figure 5.12: AdamW treats the adaptive loss-gradient path and the weight-decay path as distinct operations. 5.4 Implicit Bias and Generalization In overparameterized learning, many parameter vectors can interpolate the training data. The optimizer selects a path to one of them. Adaptive methods can therefore change not only the training speed but also the solution reached at zero training error. This point connects … view at source ↗
Figure 5.13
Figure 5.13. Figure 5.13: Different optimizers can reach different interpolating solutions from the same initialization. Proposition 5.10 (Marginal value phenomenon, informal). There are learning problems where adaptive gradient methods reach zero training error but converge to a classifier with worse test error than the one found by SGD. The proposition, inspired by Wilson et al., is a warning rather than a universal indictment… view at source ↗
Figure 5.14
Figure 5.14. Figure 5.14: Coordinate adaptivity can help or hurt depending on which coordinates carry stable signal. Scale invariance adds another complication. For a ReLU unit, a σ(w >x) = (ca) σ((w/c) >x), c > 0. Many parameterizations represent the same function. An optimizer that is not invariant to this rescaling can prefer one representation over another even when the represented function is unchanged. This matters because… view at source ↗
Figure 5.15
Figure 5.15. Figure 5.15: Scale-equivalent parameterizations can represent the same ReLU function but induce different optimizer behavior. Definition 5.11 (Gradient flow with a metric). For a time-dependent positive semidefinite matrix P(t), the continuous time analogue of preconditioned descent is ˙θ(t) = −P(t)∇f(θ(t)). The choice of P(t) changes the trajectory even when the objective is the same. In continuous time, the metric… view at source ↗
Figure 5.16
Figure 5.16. Figure 5.16: Changing the metric changes the continuous-time optimization flow. Once the training loss is nearly zero, the optimizer still matters because it determines the route to interpolation. SGD has noise-induced and geometry-induced biases; AdaGrad and Adam introduce coordinate-wise rescaling; AdamW separates shrinkage from the adaptive gradient path. These ideas feed directly into the next chapter on implici… view at source ↗
Figure 5.17
Figure 5.17. Figure 5.17: Transformer parameter groups naturally produce heterogeneous gradient scales. The same adaptivity that helps tuning also increases memory. For N parameters, Adam￾style training stores the parameters, gradients, first moments, and second moments, plus mixed￾precision copies and distributed communication buffers. This memory cost is often the real bottleneck. If each parameter has a first moment and a sec… view at source ↗
Figure 5.18
Figure 5.18. Figure 5.18: Adam-mini-style methods reduce the number of distinct effective learning-rate statistics by sharing rates within structured blocks. Remark 5.12 (Adam-mini). Adam-mini is motivated by the observation that many coor￾dinates in a large language model may not need separate second-moment learning-rate resources. Structured parameter blocks can share some effective rates, reducing memory while preserving Adam… view at source ↗
Figure 5.19
Figure 5.19. Figure 5.19: Adafactor stores row and column statistics instead of a full second-moment tensor for matrix parameters. Another route is to reduce the optimizer state by projecting gradients onto low-dimensional subspaces. Low-rank optimizer-state methods make a different assumption. Instead of sharing statistics by rows and columns, they assume that the most important gradient variation lives in a low￾dimensional sub… view at source ↗
Figure 5.20
Figure 5.20. Figure 5.20: GaLore-style methods store optimizer state in low-rank gradient subspaces. 5.6 Modern Optimizer Frontier Modern optimizer work revisits the same questions under large-model constraints. How much curvature can be exploited cheaply? How much optimizer state is necessary? Can a schedule be absorbed into the optimizer? Can matrix structure beat diagonal adaptivity at scale? These questions are empirical and… view at source ↗
Figure 5.21
Figure 5.21. Figure 5.21: Sophia uses a cheap curvature estimate to rescale and clip updates. Remark 5.14 (Sophia). Sophia, or Second-order Clipped Stochastic Optimization, uses a lightweight diagonal Hessian or curvature estimate. Its goal is not full Newton training; it is to capture enough curvature information to prevent overly aggressive directions in language-model pretraining. Sophia illustrates a recurring compromise in … view at source ↗
Figure 5.22
Figure 5.22. Figure 5.22: Sign-momentum methods preserve coarse direction while discarding some magni￾tude information. Learning-rate schedules are another major part of modern training. A schedule-free opti￾mizer attempts to make schedule behavior internal to the algorithmic state. This does not remove the need for hyperparameter choices; it changes where the choices live. Instead of prescribing a long hand-designed decay curve… view at source ↗
Figure 5.23
Figure 5.23. Figure 5.23: Schedule-free methods aim to reduce dependence on hand-designed decay curves. diagonal Adam matrix preconditioner better layer geometry cost versus curvature [PITH_FULL_IMAGE:figures/full_fig_p080_5_23.png] view at source ↗
Figure 5.24
Figure 5.24. Figure 5.24: Matrix preconditioners exploit richer layer geometry but cost more per step. Long-memory momentum methods combine a fast memory with a slow memory: mfast t = βfmfast t−1 + (1 − βf )gt , mslow t = βsmslow t−1 + (1 − βs)gt . The fast component reacts to local changes, while the slow component stabilizes across long horizons. This design reflects the fact that stochastic gradients contain information at mu… view at source ↗
Figure 5.25
Figure 5.25. Figure 5.25: AdEMAMix-style methods combine short and long momentum memories. Finally, benchmarking optimizers is hard. A new optimizer may win because it received more tuning, because a small-scale proxy did not reflect large-scale behavior, or because wall-clock time, tokens, memory, and final loss point in different directions. For a fair comparison, the unit of cost must be specified. Some methods reduce steps t… view at source ↗
Figure 5.26
Figure 5.26. Figure 5.26: Optimizer comparisons require controlled tuning and scaling protocols. 5.7 Synthesis The conceptual map is now clear. SGD supplies the baseline Euclidean geometry. AdaGrad makes coordinate geometry data-dependent. AdamW adds exponential memory and separates weight decay from adaptive scaling. At foundation-model scale, memory and communication become first-class theoretical constraints, leading to facto… view at source ↗
Figure 5.27
Figure 5.27. Figure 5.27: The optimizer frontier extends the SGD-to-AdaGrad-to-AdamW story under implicit-bias and scale constraints. Remark 5.15 (Key takeaways). First, adaptivity chooses a geometry: AdaGrad and Adam are data-dependent preconditioners, not merely faster SGD variants. Second, AdamW sep￾arates two effects: the adaptive gradient path and the weight-decay path. Third, scale changes the question: for foundation mode… view at source ↗
Figure 6.1
Figure 6.1. Figure 6.1: Interpolation leaves a set of zero-error solutions. The optimizer and its parameteri￾zation select a particular point on this set. The chapter develops a three-step story. A parameterization w = g(θ) changes the coordi￾nates in which gradient descent is run. Those coordinates induce a geometry in the original w-space. Once the loss is driven to zero, that geometry often identifies the interpolant that is… view at source ↗
Figure 6.2
Figure 6.2. Figure 6.2: The chapter’s organizing principle: coordinates induce geometry, and geometry determines the implicit bias of the path. 6.2 The Running Least-Squares Model The cleanest model in which to see implicit regularization is an underdetermined least-squares problem. Let f(x) = 1 2 kAx − bk 2 2 , A ∈ R n×m, n < m. We assume throughout this section that the equations are consistent. Thus the interpolation set S =… view at source ↗
Figure 6.3
Figure 6.3. Figure 6.3: The minimum-norm interpolant is the point where the affine solution set meets the row space. Other feasible points differ by null-space directions. Lemma 6.2 (Minimum-norm bias of gradient flow). Assume Ax = b is consistent. Let x(0) = 0 and x˙(t) = −A >(Ax(t) − b). If x(t) converges to a point in S, then x(t) −→ x † = arg min Ax=b kxk2 [PITH_FULL_IMAGE:figures/full_fig_p085_6_3.png] view at source ↗
Figure 6.4
Figure 6.4. Figure 6.4: Vanishing ridge regularization is an explicit proxy for the implicit ℓ2 bias of direct gradient flow from the origin. 6.3 Reparameterization Changes the Bias Now reparameterize the same least-squares problem by the entrywise square map x = sq(y) = y 2 , bf(y) = 1 2 [PITH_FULL_IMAGE:figures/full_fig_p087_6_4.png] view at source ↗
Figure 6.5
Figure 6.5. Figure 6.5: Different norms select different feasible points. The corners of the ℓ1 ball explain why ℓ1 minimization promotes sparsity [PITH_FULL_IMAGE:figures/full_fig_p088_6_5.png] view at source ↗
Figure 6.6
Figure 6.6. Figure 6.6: A sparse signal has a small number of large coordinates. Basis pursuit uses ℓ1 geometry to find such feasible points. Example 6.7 (A two-dimensional toy problem). Let A = h 1 2i , b = 1, x ≥ 0. The feasible set is the line segment x1 + 2x2 = 1. Direct computation gives arg min Ax=b kxk2 =  1 5 , 2 5  , arg min Ax=b x≥0 kxk1 =  0, 1 2  . The ℓ1 solution is sparser because it lies on a coordinate axis.… view at source ↗
Figure 6.7
Figure 6.7. Figure 6.7: In the toy problem, the ℓ2 and nonnegative ℓ1 selected solutions are different points on the same feasible line segment [PITH_FULL_IMAGE:figures/full_fig_p089_6_7.png] view at source ↗
Figure 6.8
Figure 6.8. Figure 6.8: Gradient descent minimizes a linearized objective plus a quadratic penalty on the step length. Definition 6.8 (Bregman divergence). Let ϕ : R m → R be differentiable and strictly convex on a convex domain. The Bregman divergence generated by ϕ is Dϕ(x, z) = ϕ(x) − ϕ(z) − h∇ϕ(z), x − zi. The quantity Dϕ(x, z) is the error made by the first-order Taylor approximation of ϕ at z when evaluated at x. It is no… view at source ↗
Figure 6.9
Figure 6.9. Figure 6.9: Bregman divergence measures the vertical gap between a convex function and its tangent approximation [PITH_FULL_IMAGE:figures/full_fig_p090_6_9.png] view at source ↗
Figure 6.10
Figure 6.10. Figure 6.10: At the Bregman projection, the gradient of the divergence is normal to the feasible affine set. 6.5.1 Square Flow as Entropy Mirror Flow For the entropy mirror map, ∇2ϕ(x) = D−1 x . The continuous-time mirror flow associated with f is x˙(t) = −(∇2ϕ(x(t)))−1∇f(x(t)). For least squares this becomes x˙(t) = −DxA >(Ax(t) − b). Comparing with the square-parameterized flow x˙(t) = −4DxA >(Ax(t) − b), [PITH_F… view at source ↗
Figure 6.11
Figure 6.11. Figure 6.11: Putting the pieces together: square parameterization induces entropy mirror ge￾ometry, whose small-initialization KL projection converges to ℓ1 selection. 6.6 Connections to Deep Learning Neural networks are highly reparameterized models. Layers can be scaled against each other, features are shared across examples, and many architectures are homogeneous. The lesson of the square map is therefore broader… view at source ↗
Figure 6.12
Figure 6.12. Figure 6.12: Neural networks can interpolate in many parameter configurations. Implicit bias asks which configuration and which predictor the training path selects. 6.6.1 Linear Classification and Maximum Margin A second canonical example concerns separable linear classification. Consider data {(xi , yi)} N i=1 with yi ∈ {±1}. The data are linearly separable if there exists a vector a (and, if desired, an intercept … view at source ↗
Figure 6.13
Figure 6.13. Figure 6.13: The maximum-margin separator classifies the training data correctly and maxi￾mizes the normalized distance to the closest examples. The reference notes also mention the generalization-bound motivation: classical margin bounds for support vector machines suggest that large margin can be statistically meaningful. The subtle question is whether maximizing such a bound is the same as maximizing real test pe… view at source ↗
Figure 7.1
Figure 7.1. Figure 7.1: Generalization compares the error measured on the finite training sample with the error under the distribution that produced the sample. Definition 7.1 (Empirical and population error). Let S = {(xi , yi)} N i=1 be drawn inde￾pendently from a distribution D on R d × {0, 1}. For a binary classifier h : R d → {0, 1}, define errS(h) = 1 N X N i=1 1h(xi) 6= yi , errD(h) = P(x,y)∼D[h(x) 6= y] . The generaliza… view at source ↗
Figure 7.2
Figure 7.2. Figure 7.2: Three views of capacity. Finite-class bounds count hypotheses, VC theory counts labelings on finite samples, and data-dependent bounds measure the hypotheses that matter on the observed distribution. 7.2 Finite Hypothesis Classes The cleanest result is obtained when the hypothesis class is finite. This is not a realistic description of a neural network with real-valued weights, but it captures the main s… view at source ↗
Figure 7.3
Figure 7.3. Figure 7.3: The finite-class proof is Hoeffding for one hypothesis followed by a union bound over the bad events Bh. The theorem is agnostic: it makes no assumption that the labels are generated by a member of H. Under a realizability assumption one can obtain a faster 1/N dependence in the error level [PITH_FULL_IMAGE:figures/full_fig_p100_7_3.png] view at source ↗
Figure 7.4
Figure 7.4. Figure 7.4: Finite-class bounds can be read as an Occam principle: short descriptions that fit many examples are statistically meaningful. Remark 7.5. The factor log |H| is the description cost of selecting one member of the class. This is too crude for real-valued neural networks, but the principle survives throughout the chapter: better descriptions of the learned predictor lead to sharper generalization statement… view at source ↗
Figure 7.5
Figure 7.5. Figure 7.5: To shatter three points, a class must realize all 2 3 possible labelings. The figure shows four representative dichotomies realized by line separators. Example 7.7 (Halfspaces). Let H = n x 7→ sgn(a >x + b) : a ∈ R d , b ∈ R o . The VC dimension of this class is d + 1. In R 2 , three points in general position can be shattered by lines, whereas no four-point set is shattered by all labelings. The proof o… view at source ↗
Figure 7.6
Figure 7.6. Figure 7.6: A single successful separation is not shattering. Shattering is a uniform statement over every assignment of labels to the same points. One must also be careful not to confuse VC dimension with parameter count. The class H = {x 7→ sgn(sin(ax)) : a ∈ R} has only one real parameter, yet it has infinite VC dimension on the real line: by taking a large enough, the sine wave can oscillate rapidly enough to ma… view at source ↗
Figure 7.7
Figure 7.7. Figure 7.7: Parameter count alone does not determine capacity. The geometry of the function family matters. Lemma 7.9 (Sauer–Shelah lemma). If VCdim(H) = k, then for every N, GH(N) ≤ X k i=0 N i ! . In particular, when N > k, GH(N) ≤  eN k k . The significance of the Sauer–Shelah lemma is that finite VC dimension turns an infinite class into a polynomial-size collection of label patterns on any finite sample. This… view at source ↗
Figure 7.8
Figure 7.8. Figure 7.8: The ghost-sample argument replaces population error by an independent sample, and Rademacher signs randomize which point in each pair belongs to which sample. 7.4 VC Bounds for Neural Networks We now return to neural networks. To keep the combinatorics transparent, the chapter considers sign-activation networks. Let the architecture have layer widths d0, d1, . . . , dL, and define f = fL ◦ fL−1 ◦ · · · ◦… view at source ↗
Figure 7.9
Figure 7.9. Figure 7.9: A sign network is a composition of vector-valued layers. The proof counts label patterns neuron by neuron and layer by layer. Lemma 7.12 (Product rule for growth functions). Let H1 ⊆ YX 1 and H2 ⊆ YX 2 . For the product class H = H1 × H2, meaning x 7→ (h1(x), h2(x)), h1 ∈ H1, h2 ∈ H2, we have GH(N) ≤ GH1 (N)GH2 (N). Proof. Fix any N-point set T ⊂ X . The first coordinate of a pattern realized by H on T m… view at source ↗
Figure 7.10
Figure 7.10. Figure 7.10: The classical parameter-counting bound becomes vacuous when the number of parameters substantially exceeds the number of examples. 7.5 PAC Learnability VC dimension is not merely a proof technique for a convenient bound. In the distribution-free realizable setting, it exactly characterizes learnability. Definition 7.14 (PAC learnability). A binary concept class H is PAC learnable if for every distributi… view at source ↗
Figure 7.11
Figure 7.11. Figure 7.11: If a large support is shattered, arbitrary labels on unseen points remain possible after observing only part of the support. Remark 7.16 (Agnostic PAC learning). The same characterization extends to the noisy, agnostic setting: a concept class is agnostically PAC learnable if and only if it has finite VC dimension. In that setting the goal is to return h with errD(h) ≤ infg∈H errD(g) + ϵ, rather than to… view at source ↗
Figure 7.12
Figure 7.12. Figure 7.12: Rademacher complexity labels the sample with random signs and asks for the best correlation achievable by hypotheses in the class. 7.6.1 Random labels and deep networks A tempting hope is that although a deep network has large VC dimension, the combination of natural data and SGD might prevent it from fitting random labels. The random-label experi￾ments of Zhang et al. showed that this hope is false in … view at source ↗
Figure 7.13
Figure 7.13. Figure 7.13: Deep networks can eventually fit random labels, while test accuracy under random labels remains at chance. Expressivity alone is not a generalization explanation. 7.7 Margins, Fat Shattering, and PAC-Bayes The last part of the chapter surveys two ways to describe a learned hypothesis more finely than by raw parameter count. Margin bounds use real-valued function values and norms. PAC￾Bayes bounds descri… view at source ↗
Figure 7.14
Figure 7.14. Figure 7.14: Occam’s principle reappears in several forms: finite-class indices, growth-function label patterns, and PAC-Bayes KL divergence. 7.8 What the Bounds Explain, and What They Do Not Classical generalization theory explains several fundamental facts. It shows why finite or low￾VC classes are learnable, why sample size must grow with capacity, and why distribution-free learning is impossible for classes with… view at source ↗
Figure 7.15
Figure 7.15. Figure 7.15: Small population error is not explained by training fit alone. The useful theory must also describe capacity, data structure, and algorithmic selection. The conceptual endpoint of the chapter is therefore not that VC theory is wrong. On the contrary, finite VC dimension is a tight characterization of distribution-free PAC learnability. The point is that the guarantees one gets from worst-case counting a… view at source ↗
Figure 8.1
Figure 8.1. Figure 8.1: Stability is an algorithmic notion: neighboring samples should lead to nearby out￾puts, or at least to similar losses on test points. Definition 8.1 (Risk, empirical risk, and generalization error). Let z = (x, y) ∼ D, and let S = (z1, . . . , zN ) ∼ DN be an iid training sample. For a parameter w ∈ Ω, define R(w) = Ez∼Dℓ(w, z), RS(w) = 1 N X N i=1 ℓ(w, zi). The generalization error of w on S is ϵgen(w) … view at source ↗
Figure 8.2
Figure 8.2. Figure 8.2: The generalization gap compares the population risk with the empirical risk at the learned output [PITH_FULL_IMAGE:figures/full_fig_p113_8_2.png] view at source ↗
Figure 8.3
Figure 8.3. Figure 8.3: Adding an ℓ2 penalty changes a flat empirical objective into a more curved one. w objective flat empirical loss after adding weight decay curvature improves stability [PITH_FULL_IMAGE:figures/full_fig_p114_8_3.png] view at source ↗
Figure 8.4
Figure 8.4. Figure 8.4: Stability proofs often need curvature. Weight decay is the simplest way to add it. 8.2 Average Stability The stability viewpoint treats a learning algorithm as a map A : Z N → Ω, S 7→ A(S). For randomized algorithms, the map also depends on internal randomness, and expectations below include that randomness when appropriate. Let S = (z1, . . . , zN ) and S 0 = (z 0 1 , . . . , z0 N ) be independent iid s… view at source ↗
Figure 8.5
Figure 8.5. Figure 8.5: The chapter map. Stability is the organizing principle; regularized ERM, SGD, privacy, and double descent provide different views of it. S A wb = A(S) [PITH_FULL_IMAGE:figures/full_fig_p115_8_5.png] view at source ↗
Figure 8.6
Figure 8.6. Figure 8.6: A learning algorithm is a sample-dependent map from datasets to parameters. S : S ′ : S (i) : z1 z2 · · · zi · · · zN z ′ 1 z ′ 2 · · · z ′ i · · · z ′ N z1 z2 · · · z ′ i · · · zN replace only i [PITH_FULL_IMAGE:figures/full_fig_p115_8_6.png] view at source ↗
Figure 8.7
Figure 8.7. Figure 8.7: The hybrid sample S (i) differs from S in exactly one coordinate [PITH_FULL_IMAGE:figures/full_fig_p115_8_7.png] view at source ↗
Figure 8.8
Figure 8.8. Figure 8.8: Uniform stability is stronger than average stability and leads to more direct high￾probability control. We write ∆unif(A) ≤ γ. This is a worst-case condition over the sample, the replaced coordinate, and the test point. The handwritten notes use the equivalent chapter version that tests at the replacement point z 0 i . The displayed definition is the standard stronger form and is the one most often used … view at source ↗
Figure 8.9
Figure 8.9. Figure 8.9: Uniform convergence controls a whole class, while stability controls the algorithm’s selected output under a neighboring-sample perturbation [PITH_FULL_IMAGE:figures/full_fig_p117_8_9.png] view at source ↗
Figure 8.10
Figure 8.10. Figure 8.10: The geometric mechanism of stable ERM. Replacing one summand changes the objective only slightly; curvature prevents the minimizer from moving far. Proof sketch of the ERM stability theorem. By the lemma, the empirical risk RS is α-strongly convex. Since wbS minimizes RS, RS(wbS(i) ) − RS(wbS) ≥ α 2 kwbS(i) − wbSk 2 2 . On the other hand, comparing the empirical objectives on S and S (i) and using optim… view at source ↗
Figure 8.11
Figure 8.11. Figure 8.11: The stable-ERM proof has two steps: curvature gives nearby minimizers, and Lipschitzness turns nearby minimizers into nearby losses. 8.5 Regularization, SGD, and Privacy 8.5.1 Explicit regularization When the original loss is convex but not strongly convex, one can add a quadratic penalty: FS(w) = RS(w) + α 2 kwk 2 2 , wbS(α) ∈ arg min w∈Ω FS(w) [PITH_FULL_IMAGE:figures/full_fig_p119_8_11.png] view at source ↗
Figure 8.12
Figure 8.12. Figure 8.12: Choosing the regularization strength balances stability and bias. 8.5.2 Implicit regularization by SGD Stochastic gradient descent can also be stable without an explicit quadratic penalty. The reason is dynamic: two coupled runs on neighboring samples see the same example most of the time, and can drift only when the replaced example is selected. This is an algorithmic version of regularization. Rather … view at source ↗
Figure 8.13
Figure 8.13. Figure 8.13: The SGD coupling proof runs the two algorithms with the same sampled indices. Most steps are shared; only hits to the replaced example create drift. Definition 8.13 (Expected uniform stability). For a randomized algorithm, one common stability notion is sup S,S(i) ,z [PITH_FULL_IMAGE:figures/full_fig_p121_8_13.png] view at source ↗
Figure 8.14
Figure 8.14. Figure 8.14: Stability and privacy are both single-sample sensitivity notions, but they control different objects. 8.6 Double Descent The classical bias–variance story says that as model capacity increases, bias decreases while variance increases, leading to a U-shaped test-error curve. Deep learning often operates beyond the interpolation threshold, where training error reaches zero. In this regime, test error can … view at source ↗
Figure 8.15
Figure 8.15. Figure 8.15: The modern double-descent curve augments the classical U-shaped picture with an overparameterized branch after interpolation. Proposition 8.16 (Squared-error bias–variance decomposition). Let y = f(x) + ε, E[ε | x] = 0, Var(ε | x) = σ 2 , [PITH_FULL_IMAGE:figures/full_fig_p122_8_15.png] view at source ↗
Figure 8.16
Figure 8.16. Figure 8.16: In overparameterized linear regression, many interpolators may fit the data. Gra￾dient descent from near zero selects the minimum-norm solution. Definition 8.17 (Minimum-norm least-squares estimator). Consider yi = x > i β + εi , xi ∈ R p , i = 1, . . . , n, and let X ∈ R n×p , y ∈ R n . The minimum-norm least-squares estimator is bβ = arg min b  kbk2 : b ∈ arg min u ky − Xuk 2 2  . For p < n, this is… view at source ↗
Figure 8.17
Figure 8.17. Figure 8.17: A schematic reading of the ridgeless-risk formula. The risk spikes near p/n = 1; beyond interpolation it may decrease depending on the signal-to-noise ratio. Let SNR = r 2/σ2 . The null predictor bβ = 0 has risk r 2 , and as γ → ∞ the ridgeless risk approaches this null risk. The reference notes emphasize two observations: regardless of SNR, the highly overparameterized limit converges to the null risk;… view at source ↗
Figure 8.18
Figure 8.18. Figure 8.18: Linear ridgeless regression is a solvable model of a phenomenon that also appears in overparameterized neural networks: interpolation need not imply poor test performance [PITH_FULL_IMAGE:figures/full_fig_p124_8_18.png] view at source ↗
Figure 9.1
Figure 9.1. Figure 9.1: The chapter moves from exact ridgeless-risk formulas to computational lower bounds for nonlinear classes related to neural networks. approximation representation optimization training generalization test error computation efficiency deep learning theory [PITH_FULL_IMAGE:figures/full_fig_p127_9_1.png] view at source ↗
Figure 9.2
Figure 9.2. Figure 9.2: Four lenses on a learning problem. This chapter completes a statistical-risk story and then focuses on efficient learnability [PITH_FULL_IMAGE:figures/full_fig_p127_9_2.png] view at source ↗
Figure 9.3
Figure 9.3. Figure 9.3: A schematic double-descent curve. Interpolation changes the classical bias–variance picture by creating a high-risk threshold and a possible second descent. The first part of the chapter asks where this curve comes from in a solvable model. The second part asks a different question: if a nonlinear class is powerful enough to interpolate, is there an efficient learner that can exploit it? This shift is im… view at source ↗
Figure 9.4
Figure 9.4. Figure 9.4: When p > n, many directions are invisible to the rows of X. The conditional bias is the prediction norm of the nullspace component Πβ. 9.2.1 Isotropic Gaussian Designs The formulas become explicit for isotropic Gaussian features. Suppose xi ∼ N(0, Ip), Σ = Ip, γ = p n , kβk2 = r. Rotational invariance says that X and XU have the same distribution for every orthogonal U. Therefore the row span of X, when … view at source ↗
Figure 9.5
Figure 9.5. Figure 9.5: A schematic Marchenko–Pastur density in an underparameterized regime. The inverse moment controls the variance. eigenvalue density zero mass γ > 1 [PITH_FULL_IMAGE:figures/full_fig_p131_9_5.png] view at source ↗
Figure 9.6
Figure 9.6. Figure 9.6: In the overparameterized regime, Σb has a nullspace. The pseudoinverse ignores the zero eigenvalues but the nonzero spectrum still determines the variance. packages the inverse moments of the sample spectrum. Closed forms for this transform under Marchenko–Pastur asymptotics can be evaluated as z → 0, which yields V ( bβ, β) ≈    σ 2 γ 1 − γ , 0 < γ < 1, σ 2 1 γ − 1 , γ > 1. Theorem 9.3 (High-dime… view at source ↗
Figure 9.7
Figure 9.7. Figure 9.7: Ridgeless double descent arises from two terms: variance from the inverse sample spectrum and bias from the signal in the nullspace. This section uses the language of PAC learning. The emphasis is not on sharp sample￾complexity constants, but on the distinction between statistical possibility and computational feasibility. A class may have a small description and may even be learnable with polynomially m… view at source ↗
Figure 9.8
Figure 9.8. Figure 9.8: A lower bound can depend on statistical resources, computational resources, and restrictions on the learner’s output class. Theorem 9.4 (Blum–Rivest, informal). For a depth-two threshold network with three com￾putational nodes, deciding whether some choice of weights fits all given training examples is NP-hard. This theorem is striking because the architecture is tiny and fixed; only the weights vary. Ho… view at source ↗
Figure 9.9
Figure 9.9. Figure 9.9: A schematic depth-two threshold network of the kind used to motivate proper￾training hardness. For a target class H, the benchmark error is OPTH = inf h∈H errD(h). A learner competes with H if, with high probability over the sample and its internal ran￾domness, it returns bh satisfying errD( bh) ≤ OPTH + ε. The usual PAC requirement also asks that the sample size and runtime be polynomial in the natural … view at source ↗
Figure 9.10
Figure 9.10. Figure 9.10: Improper learning is an escape hatch: output a different kind of predictor while still competing with the original class. representation if that representation supports a simple algorithm. Definition 9.7 (CNF and DNF classes). A 3-CNF formula is a conjunction of clauses with three literals: ϕ(x) = ^m i=1 (Li1 ∨ Li2 ∨ Li3). A k-term DNF formula is a disjunction of k conjunctions of literals: f(x) = _ k i… view at source ↗
Figure 9.11
Figure 9.11. Figure 9.11: The CNF obtained from a three-term DNF is highly structured: every clause chooses one literal from each original term. Theorem 9.10 (Valiant-style improper learning, informal). For constant k, a k-term DNF can be learned in polynomial time by outputting a hypothesis in a larger CNF/DNF-derived class. One convenient way to view the algorithm is through complements. The complement of a k-CNF is a DNF whos… view at source ↗
Figure 9.12
Figure 9.12. Figure 9.12: The deletion algorithm starts from a very large class and removes only candidates contradicted by negative examples. Lemma 9.12 (No false deletions). Suppose a conjunction T is one of the terms in the target DNF being learned in the complement view. The deletion algorithm never removes T. Proof. A target term cannot be true on a negative example; if it were true, the target DNF would label the example p… view at source ↗
Figure 9.13
Figure 9.13. Figure 9.13: Improper learning enlarges the output class. This may make the algorithm easy while increasing the number of samples needed for uniform generalization. 9.5 Neural Networks, Halfspaces, and Cryptographic Hardness The DNF example prevents a common mistake: proper hardness alone is not a representation￾independent lower bound. For neural networks, the analogous question is whether a hard [PITH_FULL_IMAGE:… view at source ↗
Figure 9.14
Figure 9.14. Figure 9.14: Intersections of halfspaces carve out regions by requiring several linear threshold tests to be simultaneously satisfied. Threshold activations are halfspace tests. If hidden units output gj (x) ∈ {−1, +1}, then the AND of m such tests can be written as a single threshold gate: F(x) = sign   Xm j=1 gj (x) − (m − 1)   . Indeed, if all gj (x) = 1, the argument is 1; if at least one hidden unit outputs… view at source ↗
Figure 9.15
Figure 9.15. Figure 9.15: An intersection of halfspaces can be implemented by a depth-two threshold network: hidden nodes compute halfspaces and the output node computes their AND. the output class does not remove the computational barrier under the assumed hardness of the underlying lattice problem. Definition 9.17 (Lattice and shortest vector problem). A lattice generated by basis vectors b1, . . . , bd is Λ = (X d i=1 zibi : … view at source ↗
Figure 9.16
Figure 9.16. Figure 9.16: A lattice is a discrete set closed under integer linear combinations. The crypto￾graphic hardness result uses assumptions related to approximating short lattice vectors. The reduction idea follows a standard cryptographic template. In a public-key encryption system, an attacker can generate labeled examples by choosing a message m, computing a ci￾phertext x = Enc(m), and labeling it by the decryption va… view at source ↗
Figure 9.17
Figure 9.17. Figure 9.17: If the decryption function were efficiently learnable, chosen-message examples would train an attacker to predict decryptions of fresh ciphertexts. Remark 9.18. The hard distribution in this theorem is not a natural image or language distribution. It is produced by encrypted random messages. Thus the theorem does not say that practical neural networks cannot be learned; it says that without assumptions … view at source ↗
Figure 9.18
Figure 9.18. Figure 9.18: A computational–statistical tradeoff. Enlarging the output class can reduce search difficulty while increasing the sample complexity required to control generalization. Proper hardness can disappear under improper learning, as the DNF example shows. Cryp￾tographic hardness can survive improper learning, but it may rely on artificial distributions. Average-case assumptions, product or Gaussian input dist… view at source ↗
Figure 10.1
Figure 10.1. Figure 10.1: The chapter’s roadmap: distributional assumptions expose Fourier structure; noise sensitivity certifies concentration; and parity functions show where statistical-query methods break down. arbitrary distribution uniform hypercube Gaussian space assumptions invariance [PITH_FULL_IMAGE:figures/full_fig_p142_10_1.png] view at source ↗
Figure 10.2
Figure 10.2. Figure 10.2: A structured distribution is not a cosmetic assumption. It changes which statistics are informative and which algorithms become plausible. parity functions hide all of their mass in a single high-degree coefficient, making them almost invisible to algorithms that only ask for coarse averages. This is also why the chapter belongs in a theory of deep learning. Many neural-network procedures can be interpr… view at source ↗
Figure 10.3
Figure 10.3. Figure 10.3: The chapter’s dictionary: boundary geometry, noise stability, Fourier concentration, algorithms, and lower bounds are different views of the same structure. 10.2 Fourier Analysis on the Hypercube We work first on the Boolean hypercube {±1} d = {−1, +1} d with the uniform distribution. A typical example is x = (x1, . . . , xd) ∼ Unif({±1} d ), and the label is y = f(x), where often f : {±1} d → {±1}. For… view at source ↗
Figure 10.4
Figure 10.4. Figure 10.4: The Boolean hypercube. The uniform distribution gives every vertex the same probability, making orthogonality calculations exact. The hypercube has a natural orthonormal basis indexed by subsets of coordinates. Definition 10.1 (Parity characters). For S ⊆ [d], the parity character χS is χS(x) = Y i∈S xi , χ∅ (x) = 1. Its degree is |S|. The characters multiply according to χS(x)χT (x) = χS4T (x), where S… view at source ↗
Figure 10.5
Figure 10.5. Figure 10.5: The Fourier transform on the hypercube rewrites the truth table in an orthonormal coordinate system. j ∈ S4T. Independence and E[xj ] = 0 give Ex[χS(x)χT (x)] = Ex[χS4T (x)] = E[xj ] Y i∈S4T, i6=j E[xi ] = 0. When S = T, the product is identically 1. The coefficient bf(S) should therefore be read as a correlation: it is large precisely when f(x) tends to align with the parity pattern χS(x). This interpr… view at source ↗
Figure 10.6
Figure 10.6. Figure 10.6: A nonempty symmetric difference leaves at least one independent mean-zero sign, which kills the inner product. Theorem 10.4 (Parseval identity). For every f : {±1} d → R, Ex[f(x) 2 ] = X S⊆[d] bf(S) 2 . In particular, if f(x) ∈ {±1}, then P S bf(S) 2 = 1. Proof. Expand both copies of f in the Fourier basis: E[f(x) 2 ] = E " X S bf(S)χS(x) ! X T bf(T)χT (x) !# = X S,T bf(S) bf(T)E[χS(x)χT (x)]. Orthonorm… view at source ↗
Figure 10.7
Figure 10.7. Figure 10.7: Low-degree truncation keeps the Fourier mass below a degree cutoff and discards the high-degree tail. 10.3 Low-Degree Learning Definition 10.6 (Fourier concentration). A function f : {±1} d → R is α(ϵ, d)-concentrated if X |S|≥α(ϵ,d) bf(S) 2 ≤ ϵ. A concept class H is α(ϵ, d)-concentrated if every f ∈ H has this property. Fourier concentration is exactly an L2 approximation statement. If g = f<α, then f … view at source ↗
Figure 10.8
Figure 10.8. Figure 10.8: The low-degree algorithm estimates the relevant correlations and assembles them into a polynomial classifier. More concretely, for every S ⊆ [d] with |S| < α, the learner estimates gb(S) = 1 m Xm i=1 yiχS(xi). It then forms g(x) = X |S|<α gb(S)χS(x), h(x) = sgn(g(x)). There are αX−1 j=0 d j ! ≤ d O(α) coefficients to estimate. Lemma 10.8 (Coefficient accuracy implies prediction accuracy). Suppose the es… view at source ↗
Figure 10.9
Figure 10.9. Figure 10.9: For a new class, the algorithmic part is generic. The mathematical work is to prove a structural concentration theorem. The remainder of the positive part of the chapter explains one powerful way to prove con￾centration: show that the function is stable under random noise. 10.4 Noise Sensitivity and Fourier Tails Definition 10.9 (Random bit noise). For x ∈ {±1} d and 0 ≤ η ≤ 1/2, let Tη(x) be obtained b… view at source ↗
Figure 10.10
Figure 10.10. Figure 10.10: The noise operator flips coordinates independently. A function is noise stable if these local perturbations rarely change the label. Noise sensitivity has an exact Fourier formula. Proposition 10.11 (Spectral formula for noise sensitivity). For Boolean f : {±1} d → {±1} and 0 ≤ η ≤ 1/2, NSη(f) = 1 2 − 1 2 X S⊆[d] (1 − 2η) |S| bf(S) 2 . Proof. Since f(x)f(Tηx) = 1 when the labels agree and −1 when they … view at source ↗
Figure 10.11
Figure 10.11. Figure 10.11: The noise multiplier (1 − 2η) k damps high-degree Fourier components first. Lemma 10.13 (Stability implies low-degree mass). For 0 < η < 1/2 and Boolean f, X |S|≥1/η bf(S) 2 ≤ 2 1 − e−2 NSη(f). Proof. By Parseval and the spectral formula, 2 NSη(f) = X S bf(S) 2  1 − (1 − 2η) |S|  ≥  1 − (1 − 2η) 1/η X |S|≥1/η bf(S) 2 . Finally, (1 − 2η) 1/η ≤ e −2 , so 1 − (1 − 2η) 1/η ≥ 1 − e −2 . This lemma turns… view at source ↗
Figure 10.12
Figure 10.12. Figure 10.12: Noise sensitivity is a bridge from geometry to Fourier concentration and then to an explicit learning algorithm [PITH_FULL_IMAGE:figures/full_fig_p149_10_12.png] view at source ↗
Figure 10.13
Figure 10.13. Figure 10.13: An intersection of halfspaces is a depth-two threshold architecture: hidden units compute linear threshold functions and the output gate computes their conjunction. Theorem 10.16 (Peres noise bound). For every halfspace h and every 0 < η < 1/2, NSη(h) ≤ C √ η for a universal constant C. The geometric intuition is simple. A noisy point changes the label of a halfspace only if the perturbation carries it… view at source ↗
Figure 10.14
Figure 10.14. Figure 10.14: A halfspace changes label under small random noise only when the original point is close enough to the separating boundary. X lies in a band of width O( √η) around zero. The Gaussian mass of that band is also O( √η), matching Peres’s bound. margin X density −2 0 2 unstable band width O( √η) [PITH_FULL_IMAGE:figures/full_fig_p151_10_14.png] view at source ↗
Figure 10.15
Figure 10.15. Figure 10.15: For majority, the normalized margin is approximately Gaussian. A sign change is likely only near the threshold. Noise stability is also compositional. Proposition 10.17 (Union bound for noise sensitivity). If h(x) = g(f1(x), . . . , fk(x)), then NSη(h) ≤ X k i=1 NSη(fi). Proof. If h(x) 6= h(Tηx), then at least one input bit to g must change: there exists i such that fi(x) 6= fi(Tηx). Taking probabiliti… view at source ↗
Figure 10.16
Figure 10.16. Figure 10.16: For Gaussian inputs, the span of the halfspace normals can be identified from distributional changes in labeled examples [PITH_FULL_IMAGE:figures/full_fig_p152_10_16.png] view at source ↗
Figure 10.17
Figure 10.17. Figure 10.17: Samples simulate statistical queries by empirical averaging. The model is powerful enough to include many moment, correlation, and gradient-based procedures. Population gradients in neural-network training, for instance, are expectations: ∇θL(θ) = E(x,y) [∇θℓ(fθ(x), y)] . Mini-batch SGD estimates these expectations. If a useful signal is exponentially small in all such averages, SQ lower bounds suggest… view at source ↗
Figure 10.18
Figure 10.18. Figure 10.18: SQ algorithms see stable aggregate information. They may miss exact combina￾torial structure present in the raw sample. Definition 10.22 (Unknown parity). Choose an unknown set T ⊆ [d]. Draw x ∼ Unif({±1} d ) and label y = χT (x) = Y i∈T xi . The Fourier spectrum of this target is concentrated on exactly one coefficient, namely T. If T is large, the target is high degree and very noise sensitive. This … view at source ↗
Figure 10.19
Figure 10.19. Figure 10.19: Raw parity samples give a linear system over F2. This exact combinatorial information is not available to an SQ learner. Noisy parity is different again. If each label is flipped independently with a constant probabil￾ity below 1/2, the exact linear-system trick becomes non-robust. The Blum–Kalai–Wasserman algorithm learns noisy parity in subexponential time 2 O(d/ log d) , which is far faster than bru… view at source ↗
Figure 11.1
Figure 11.1. Figure 11.1: The chapter moves from SQ and CSQ lower bounds to noisy gradient dynamics, then to one-hidden-layer networks under Gaussian inputs, and finally to positive algorithms based on designed landscapes and filtered PCA. We will use three algorithmic lenses. An SQ algorithm asks for approximate expectations E[q(x, y)]±τ . A CSQ algorithm is more restrictive: it only asks for label correlations E[yϕ(x)]±τ . Noi… view at source ↗
Figure 11.2
Figure 11.2. Figure 11.2: Three views of a learning algorithm: raw sample access, aggregate statistical access, and gradient-based dynamics. The transfer from parity to Gaussian neural networks goes through orthogonality. Parity functions are hard for CSQ algorithms because a query can correlate noticeably with only a small number of possible hidden parities. For Gaussian shallow networks, the construction embeds a parity-like s… view at source ↗
Figure 11.3
Figure 11.3. Figure 11.3: The lower-bound strategy: build many nearly invisible targets by turning parity￾like sign patterns into Gaussian shallow-network functions [PITH_FULL_IMAGE:figures/full_fig_p158_11_3.png] view at source ↗
Figure 11.4
Figure 11.4. Figure 11.4: A CSQ sees the label only through a correlation with a chosen feature of the input. The connection to gradients is direct. Proposition 11.2 (Gradient coordinates are CSQs). Assume the marginal distribution of x is known and that the relevant gradient-coordinate functions are bounded after normalization. Each coordinate of ∇R(θ) can be approximated using a known marginal expectation and one CSQ. Proof. D… view at source ↗
Figure 11.5
Figure 11.5. Figure 11.5: A CSQ algorithm can simulate the distribution of a noisy population GD step if the gradient estimate is accurate enough. The simulation is distributional rather than pointwise: we want the noisy update produced by the CSQ simulation to agree with the noisy GD update with high probability under a suitable coupling. The reason for using a coupling is that the two algorithms may not compute the same determ… view at source ↗
Figure 11.6
Figure 11.6. Figure 11.6: A maximal coupling agrees with probability equal to the overlap of the two densi￾ties. For Gaussians with the same covariance, a small mean shift implies small total variation distance. Proof idea. If every gradient coordinate is estimated to accuracy τ , then the Euclidean error in the mean of the Gaussian update is at most [PITH_FULL_IMAGE:figures/full_fig_p161_11_6.png] view at source ↗
Figure 11.7
Figure 11.7. Figure 11.7: The coupling argument matches an entire noisy GD trajectory with a CSQ￾generated trajectory, step by step. Corollary 11.6 (Parity lower bound transfers). For parity functions under the uniform distri￾bution, any noisy GD method captured by the coupling theorem must either take T = exp(Ω(d)) or s = exp(−Ω(d)), up to polynomial factors in the model size and parameter dimension. The message is not that eve… view at source ↗
Figure 11.8
Figure 11.8. Figure 11.8: Ordinary SGD with sample access and sufficient precision can be much more powerful than an SQ or CSQ abstraction. CSQ noisy GD mini-batch SGD PAC more access to individual-sample information [PITH_FULL_IMAGE:figures/full_fig_p162_11_8.png] view at source ↗
Figure 11.9
Figure 11.9. Figure 11.9: Modeling choices matter. Moving rightward gives the algorithm more direct access to sample-level information. 11.3 CSQ Lower Bounds for Shallow Networks We now move from parity functions to shallow neural networks under Gaussian inputs. Consider one-hidden-layer functions of the form f(x) = Xm i=1 ai σ(w > i x), x ∼ N (0, Id), where wi ∈ R d and ai ∈ R. Gaussian inputs make rotations and sign flips expl… view at source ↗
Figure 11.10
Figure 11.10. Figure 11.10: A one-hidden-layer network under Gaussian inputs. The lower bounds construct many such functions with hidden parity-like sign patterns. Theorem 11.7 (Informal CSQ lower bound under Gaussians). For common activations such as sigmoid or ReLU, any CSQ algorithm learning one-hidden-layer networks under x ∼ N (0, Id) requires either d Ω(log m) queries or τ = d −Ω(log m) . The lower bound follows from a CSQ … view at source ↗
Figure 11.11
Figure 11.11. Figure 11.11: The hidden subset S has size r = dlog2 me. The number of possible subsets is superpolynomial in m when d is large. There are 2 r ≤ 2m hidden units in gS. The output coefficients χ(ω) ∈ {±1} create cancel￾lation, while the Gaussian input distribution supplies sign symmetry. Each summand is a ridge function whose weight vector is supported on the hidden coor￾dinate set S. The learner sees a shallow netwo… view at source ↗
Figure 11.12
Figure 11.12. Figure 11.12: A sign flip on selected coordinates multiplies gS by the parity of the flip pattern. Lemma 11.11 (Pairwise orthogonality). If S 6= T, then Ex∼N (0,Id) [gS(x)gT (x)] = 0. Proof. Let z be uniformly distributed on {±1} d , independent of x. Since a standard Gaussian is invariant under coordinatewise sign flips, x ◦ z has the same distribution as x. Therefore Ex[gS(x)gT (x)] = EzEx[gS(x ◦ z)gT (x ◦ z)] = E… view at source ↗
Figure 11.13
Figure 11.13. Figure 11.13: The parity-weighted construction acts as a filter, canceling lower interactions and preserving a selected high-degree component. Putting the pieces together, the family {gS : |S| = r} has ℓ = [PITH_FULL_IMAGE:figures/full_fig_p165_11_13.png] view at source ↗
Figure 11.14
Figure 11.14. Figure 11.14: A schematic standard landscape: matching some moments can create a local minimum that does not recover the teacher parameters. Landscape design changes the objective so that the relevant low-order moment structure has benign critical points. The design principle is not to make the network more expressive, but to make the objective ask a cleaner question. By filtering the activation to selected Hermite … view at source ↗
Figure 11.15
Figure 11.15. Figure 11.15: A schematic designed landscape. Selecting specific Hermite degrees and using nonnegative weights removes the cancellation responsible for the lower-bound construction. This contrast is a major lesson of the chapter. The CSQ lower-bound family uses aω = χ(ω) ∈ {±1}, creating parity-like cancellation. When ai ≥ 0, large responses point toward the relevant subspace, enabling PCA and designed landscapes. O… view at source ↗
Figure 11.16
Figure 11.16. Figure 11.16: Response-weighted covariance vanishes on directions outside the teacher span V . Proposition 11.17 (Positive trace creates a signal direction). Suppose the filter ψ(y) = 1{|y|≥τ} selects examples for which kΠV xk 2 ≥ 2m. Then Mψ has a positive eigenvalue, and every eigenvector for a positive eigenvalue lies in V . Proof. Since Mψ vanishes on V ⊥, it suffices to show a positive trace on V . Let V also d… view at source ↗
Figure 11.17
Figure 11.17. Figure 11.17: Conditioning or filtering on large response breaks spherical symmetry along the teacher subspace, so PCA can find a direction in V . Stein’s lemma explains why second-order response-weighted covariance detects active direc￾tions. Lemma 11.18 (Stein identity). If X ∼ N (0, 1) and g is sufficiently smooth with suitable decay, then E[g(X)X] = E[g 0 (X)]. Consequently, E[g(X)(X2 − 1)] = E[g 00(X)] [PITH_F… view at source ↗
Figure 12.1
Figure 12.1. Figure 12.1: Chapter 10 is a turning point: the monograph moves from supervised optimization and generalization toward robustness, generation, and control. statistical learning game against nature interactive learning new actors: adversaries, discriminators, and environments [PITH_FULL_IMAGE:figures/full_fig_p173_12_1.png] view at source ↗
Figure 12.2
Figure 12.2. Figure 12.2: The later topics embed networks inside games and feedback systems, not only inside passive supervised learning problems. In all three cases the objective is no longer just a single empirical average over fixed examples. Geometry, dynamics, and model misspecification become part of the mathematical problem. The word “interlude” should not suggest that these topics are peripheral. They mark a change in wh… view at source ↗
Figure 12.3
Figure 12.3. Figure 12.3: An adversarial perturbation stays inside a small norm ball around the original input but crosses a decision boundary. Definition 12.2 (Robust risk). Let ∆ϵ = {δ : kδk ≤ ϵ}. The standard risk and robust risk of a classifier f are R(f) = P[f(X) 6= Y ], Rrob(f) = P[∃δ ∈ ∆ϵ : f(X + δ) 6= Y ] . Always R(f) ≤ Rrob(f). Robust risk depends on the geometry of the data distribution, the norm, and the perturba￾tio… view at source ↗
Figure 12.4
Figure 12.4. Figure 12.4: A point is robust when its perturbation ball does not hit the decision boundary. Geometrically, robust training tries to move boundaries away from the data manifold. Proposition 12.4 (Dual-norm first-order attack). For differentiable loss ℓ(x, y), the linear [PITH_FULL_IMAGE:figures/full_fig_p174_12_4.png] view at source ↗
Figure 12.5
Figure 12.5. Figure 12.5: The ℓ∞ perturbation budget can change many coordinates at once, producing a large aggregate score shift [PITH_FULL_IMAGE:figures/full_fig_p175_12_5.png] view at source ↗
Figure 12.6
Figure 12.6. Figure 12.6: A projected-gradient attack alternates ascent steps on the loss with projection back into the perturbation ball. Definition 12.6 (Robustness certificate). A certificate proves that, for a given point x, f(x + δ) = f(x) for all kδk ≤ ϵ. It is a rigorous lower bound on the local robust radius. Empirical defenses try many attacks and report accuracy against them. Certified defenses use convex relaxations, … view at source ↗
Figure 12.7
Figure 12.7. Figure 12.7: A schematic robustness–accuracy tradeoff. Robust learning may discard predictive but nonrobust features. Example 12.8 (One robust feature and many fragile features). Let Y ∈ {±1}. Suppose X1 = Y with high probability, Xj ∼ N (ηY, 1), j = 2, . . . , d + 1. The weak Gaussian features have small correlation η with Y , but averaging many of them gives high standard accuracy by concentration. An ℓ∞ perturbat… view at source ↗
Figure 12.8
Figure 12.8. Figure 12.8: Robustness includes both test-time evasion attacks and training-time data attacks such as poisoning or backdoors. 12.4 Generative Modeling Generative modeling replaces prediction by distribution learning. Given samples x1, . . . , xn ∼ Pdata, a generator Gθ : R k → R d , z ∼ Pz, x = Gθ(z), induces the pushforward distribution Pθ = (Gθ)#Pz. The goal is to make Pθ close to Pdata. This changes the object o… view at source ↗
Figure 12.9
Figure 12.9. Figure 12.9: An implicit generator maps simple latent noise into synthetic data. The phrase “realistic sample” hides a mathematical choice: which distance or divergence between distributions should be optimized? Common choices include DKL(PdatakPθ), DJS(Pdata, Pθ), W1(Pdata, Pθ), and integral probability metrics sup f∈F |EP f − EQf| . Likelihood rewards density estimation, adversarial distances reward indistinguisha… view at source ↗
Figure 12.10
Figure 12.10. Figure 12.10: Manifold intuition for generative modeling: a low-dimensional latent distribution is mapped into data space and should align with the true data manifold. Definition 12.9 (Maximum likelihood). For a tractable density pθ(x), maximum likelihood solves max θ 1 n Xn i=1 log pθ(xi). At the population level this minimizes DKL(PdatakPθ). The direction of the KL divergence matters. Maximum likelihood averages −… view at source ↗
Figure 12.11
Figure 12.11. Figure 12.11: Implicit models make sampling easy while leaving likelihood evaluation unavail￾able or expensive. likelihood models latent-variable models adversarial models score and diffusion [PITH_FULL_IMAGE:figures/full_fig_p179_12_11.png] view at source ↗
Figure 12.12
Figure 12.12. Figure 12.12: A taxonomy of generative models. Different approaches make density, sampling, inference, or comparison tractable. 12.5 GAN Theory Definition 12.11 (GAN minimax objective). Generative adversarial networks solve min G max D [Ex∼Pdata log D(x) + Ez∼Pz log(1 − D(G(z)))] . The discriminator tries to distinguish real from generated samples, while the generator tries to fool it. This is the original GAN formu… view at source ↗
Figure 12.13
Figure 12.13. Figure 12.13: The GAN game pits a generator against a discriminator [PITH_FULL_IMAGE:figures/full_fig_p179_12_13.png] view at source ↗
Figure 12.14
Figure 12.14. Figure 12.14: When real and generated supports are separated, a discriminator can separate them too well and the generator may receive weak or unstable gradients. Definition 12.14 (Earth-mover distance). The 1-Wasserstein distance is W1(P, Q) = inf π∈Π(P,Q) E(X,Y )∼π [PITH_FULL_IMAGE:figures/full_fig_p180_12_14.png] view at source ↗
Figure 12.15
Figure 12.15. Figure 12.15: Mode collapse is a coverage failure: the generated distribution has high fidelity in a few regions but misses other modes. Generalization in GANs is naturally phrased through a discriminator class F, which defines the integral probability metric dF (P, Q) = sup f∈F |EP f − EQf| . Rich discriminators detect more differences but require more samples. Weak discriminators may miss mode collapse or memoriza… view at source ↗
Figure 12.16
Figure 12.16. Figure 12.16: Reinforcement learning is interactive: the agent’s action affects future states and rewards. Definition 12.17 (Return, value, and Q-value). For a policy π, J(π) = Eπ "X∞ t=0 γ t r(st , at) # . The value and action-value functions are V π (s) = Eπ "X∞ t=0 γ t rt | s0 = s # , Qπ (s, a) = Eπ "X∞ t=0 γ t rt | s0 = s, a0 = a # . The value function is a compressed description of long-term consequence. It rep… view at source ↗
Figure 12.17
Figure 12.17. Figure 12.17: A deep RL pipeline. The network is trained by gradient methods, but the data distribution is generated by the agent’s own policy. Deep RL is difficult both statistically and computationally. Data are dependent, exploration affects what is observed, rewards may be sparse or delayed, bootstrapping creates moving targets, and function approximation can amplify errors. In a single feedback loop it combines… view at source ↗
Figure 12.18
Figure 12.18. Figure 12.18: Robustness, GANs, and RL all replace passive supervised learning by a coupled system with feedback. Several open questions remain central. Can robust training avoid losing useful nonrobust features? Which distributional metrics best predict human-perceived sample quality? When do adversarial games converge rather than cycle? Can deep RL be made sample efficient without brittle exploration heuristics? W… view at source ↗
Figure 13.1
Figure 13.1. Figure 13.1: A small perturbation can move an input across a learned decision boundary. The mathematical issue is not only the size of δ, but the interaction between the metric, the data distribution, and the classifier. training data learner model prediction poisoning evasion [PITH_FULL_IMAGE:figures/full_fig_p186_13_1.png] view at source ↗
Figure 13.2
Figure 13.2. Figure 13.2: Two adversarial threat models. Poisoning attacks corrupt the training sample; evasion attacks perturb an input after the model has been trained. Definition 13.1 (Adversarial example). For a classifier f : R d → [K], an input x with true label y, a norm k·k, and a perturbation radius ϵ > 0, an adversarial example is a point x 0 = x + δ, kδk ≤ ϵ, f(x 0 ) 6= y. In a targeted attack the adversary requires f… view at source ↗
Figure 13.3
Figure 13.3. Figure 13.3: A PGD attack alternates ascent steps on the loss with projection back to the allowed perturbation ball. + + + + − − − − linear robust [PITH_FULL_IMAGE:figures/full_fig_p188_13_3.png] view at source ↗
Figure 13.4
Figure 13.4. Figure 13.4: Robust training can require moving the decision boundary away from every per￾turbation ball, which may create a more complex separator. A certification method proves a lower bound rf (x) ≥ r, without relying on the success or failure of a particular attack algorithm. This distinction matters. An attack is a search procedure; if it fails, the model may be robust, or the search may simply have missed a be… view at source ↗
Figure 13.5
Figure 13.5. Figure 13.5: A common empirical pattern is that adversarial training improves robust accuracy but can reduce clean standard accuracy. u σ(u) ℓ u ReLU convex relaxation [PITH_FULL_IMAGE:figures/full_fig_p189_13_5.png] view at source ↗
Figure 13.6
Figure 13.6. Figure 13.6: For an ambiguous ReLU with preactivation interval [ℓ, u], the convex hull gives a sound outer relaxation. bounds, linear relaxations, and convex outer polytopes all follow this pattern: {f(x + δ) : kδk ≤ ϵ} ⊆ Prelax. If the relaxed set still has positive margins, the original network is certified. The word “outer” is essential. A relaxation used for certification may add fake activations that no true ne… view at source ↗
Figure 13.7
Figure 13.7. Figure 13.7: A typical certificate pipeline propagates input uncertainty through activation bounds, forms a tractable relaxation, and tests output margins. input coordinate x noise scale [PITH_FULL_IMAGE:figures/full_fig_p190_13_7.png] view at source ↗
Figure 13.8
Figure 13.8. Figure 13.8: Randomization averages predictions around x. The scale of the noise controls both stability and information loss. Proof. For kx 0 − xk ≤ ϵ, mj (x 0 ) ≥ mj (x) − Lj [PITH_FULL_IMAGE:figures/full_fig_p190_13_8.png] view at source ↗
Figure 13.9
Figure 13.9. Figure 13.9: Adding noise to an intermediate layer can make the randomized classifier stable. The required noise scale is controlled by the sensitivity of the preceding map. noise scale certified accuracy best balance too much noise too little radius [PITH_FULL_IMAGE:figures/full_fig_p192_13_9.png] view at source ↗
Figure 13.10
Figure 13.10. Figure 13.10: Randomized smoothing has a tradeoff: too little noise gives a weak certificate, while too much noise destroys class information. Definition 13.11 (Smoothed classifier). Given a base classifier h and Gaussian noise η ∼ N (0, σ2 I), the smoothed classifier is g(x) = arg max c∈[K] P [PITH_FULL_IMAGE:figures/full_fig_p192_13_10.png] view at source ↗
Figure 13.11
Figure 13.11. Figure 13.11: Different smoothing distributions certify different perturbation geometries. Gaus￾sian balls are only one case. + + + + + + − − − − − − 1 R label is encoded by radius [PITH_FULL_IMAGE:figures/full_fig_p193_13_11.png] view at source ↗
Figure 13.12
Figure 13.12. Figure 13.12: The adversarial-spheres toy model separates two classes by radius. It isolates geometric effects from image semantics. 13.5 High-Dimensional Limits The chapter next asks whether adversarial examples are an accident of current training methods or a geometric feature of high-dimensional spaces. Let E = {x : the classifier makes a mistake at x} be the error set, and define d(E) = Ex[dist(x, E)]. This quan… view at source ↗
Figure 13.13
Figure 13.13. Figure 13.13: Reference-note intuition for spherical expansion: spherical caps are the hardest sets to expand, so concentration of measure makes most of the sphere close to a large enough error set. data manifold nearby adversarial set [PITH_FULL_IMAGE:figures/full_fig_p194_13_13.png] view at source ↗
Figure 13.14
Figure 13.14. Figure 13.14: The geometry of real data may be closer to a structured manifold than to a uniform high-dimensional sphere. The theorem should be interpreted as a measure-concentration statement, not as a claim about a particular classifier architecture. It says that once the error set has nontrivial measure, its small Euclidean neighborhood can cover most of the sphere. This mechanism is one reason high-dimensional g… view at source ↗
Figure 13.15
Figure 13.15. Figure 13.15: Robust learning mixes geometric, optimization, and computational difficulties. 13.6 Poisoning and Robust Statistics The second half of the chapter changes the location of the attack. Instead of perturbing a test input, the adversary corrupts the training sample. In the simplest abstraction, Sobs = (1 − ϵ)Sclean + ϵSbad, where Sbad is arbitrary. The goal may be to degrade global performance, or to impla… view at source ↗
Figure 13.16
Figure 13.16. Figure 13.16: A small corrupted bump far from the clean distribution can destroy the mean and variance, while median-based estimates remain stable. Theorem 13.18 (Robust Gaussian estimation). Given ϵ-corrupted samples from a high￾dimensional Gaussian N (µ, Σ), there are polynomial-time algorithms that estimate µ and Σ with total variation error dTV N (µ, Σ), N (µ, b Σ) b  ≤ O  ϵ q log(1/ϵ)  under standard sample-… view at source ↗
Figure 13.17
Figure 13.17. Figure 13.17: Spectral signatures: poisoned examples can create an anomalous low-rank direc￾tion in hidden feature space. detectable low-rank deviation. Remark 13.20 (From robust statistics to deep nets). The statistical filtering step is only as good as the representation on which it operates. A backdoor may become visible as a feature-space cluster, but a sufficiently adaptive adversary can try to hide the cluster… view at source ↗
Figure 13.18
Figure 13.18. Figure 13.18: Influence functions identify training points whose upweighting has a large first￾order effect on a test prediction. Threat Mathematical object Typical defense Evasion local input ball adversarial training, certificates Random noise smoothed decision rule probability-gap certification Geometry limit expansion of the error set representation and metric design Poisoning contaminated samples robust statist… view at source ↗
Figure 14.1
Figure 14.1. Figure 14.1: A generator pushes a simple latent distribution PZ through a deep map Gθ, induc￾ing the model law Pθ. This viewpoint is useful because it separates two tasks that are often blurred in practice. First, one must choose a discrepancy that says when two probability laws look close. Second, one must optimize the generator parameters so that the induced law is close to the data law under that discrepancy. Dif… view at source ↗
Figure 14.2
Figure 14.2. Figure 14.2: The indistinguishability viewpoint: a test class F defines what differences between real and generated data are visible. Generative adversarial networks (GANs) instantiate this idea by making the tests neural networks. The discriminator or critic is not merely an auxiliary optimization device; it also defines the statistical resolution at which real and generated distributions are compared. A small disc… view at source ↗
Figure 14.3
Figure 14.3. Figure 14.3: A distribution-matching lens. The same pair of laws may look close or far depending on the chosen divergence or test class. Deep learning enters at two places. The generator represents a complicated low-dimensional image of latent space in the ambient data space. The discriminator represents a learned sta￾tistical test. Alternating training creates a game between these two networks, and the same generat… view at source ↗
Figure 14.4
Figure 14.4. Figure 14.4: A deep generator is a parameterized nonlinear map from latent variables to data. Its range can model a complicated low-dimensional set in R d [PITH_FULL_IMAGE:figures/full_fig_p202_14_4.png] view at source ↗
Figure 14.5
Figure 14.5. Figure 14.5: The adversarial setup in a GAN. The discriminator is trained to distinguish P from Qθ; the generator is trained to make that distinction hard. To understand the population objective, first freeze the generator and optimize over all possible discriminators. This idealization ignores the finite neural architecture but reveals the divergence hidden inside the loss. Lemma 14.3 (Pointwise optimal discriminat… view at source ↗
Figure 14.6
Figure 14.6. Figure 14.6: Two low-dimensional manifolds can be geometrically close and still have disjoint supports. Jensen-Shannon divergence then saturates even though moving one manifold toward the other should be meaningful. The geometric issue is especially visible in high dimension. Data may concentrate near a thin manifold, while the generator range is another thin manifold. If the two supports are disjoint, the optimal d… view at source ↗
Figure 14.7
Figure 14.7. Figure 14.7: Transport intuition for W1. A coupling specifies how to move probability mass from P to Q, and the distance is the minimum expected transport cost. For generative modeling, the most important form of W1 is its dual representation. Theorem 14.8 (Kantorovich-Rubinstein duality). On a Polish metric space with finite first moments, W1(P, Q) = sup kfkLip≤1 [EX∼P f(X) − EY ∼Qf(Y )] . The supremum is over all … view at source ↗
Figure 14.8
Figure 14.8. Figure 14.8: For P = δ0 and Qt = δt , Wasserstein distance gives a linear signal in t, while Jensen-Shannon divergence is saturated for every nonzero t. In practice, enforcing the exact Lipschitz constraint is difficult. The original WGAN used weight clipping. Later variants introduced gradient penalties, encouraging k∇xfψ(x)k2 ≈ 1 on interpolated samples, and spectral normalization, which constrains layer operator … view at source ↗
Figure 14.9
Figure 14.9. Figure 14.9: Lipschitz control limits the critic’s slope. Practical WGANs approximate this constraint by clipping, gradient penalties, or spectral normalization. Remark 14.10 (What the Wasserstein objective does and does not solve). Replacing Jensen￾Shannon divergence by W1 addresses one important geometric pathology: disjoint supports no longer force a saturated objective. It does not make the GAN problem convex, r… view at source ↗
Figure 14.10
Figure 14.10. Figure 14.10: A continuous population and a finite empirical distribution have different sup￾ports. Strong distributional distances may see this support mismatch before they see statistical usefulness. Proposition 14.11 (Strong distances can mislead). Let P = N (0, Id/d), and let Pbm be the empirical distribution from m = poly(d) independent samples. With high probability, DJS(PkPbm) = log 2, W1(P, Pbm) 6≈ 0. The fi… view at source ↗
Figure 14.11
Figure 14.11. Figure 14.11: The covering argument behind neural discriminator generalization. Concentration is proved on a finite net and then extended to the full class by Lipschitzness. The reference notes emphasize the same idea in game-theoretic language. If a generator can approximate point masses, then a mixed generator strategy can represent the empirical data distribution. After discretizing the discriminator class and us… view at source ↗
Figure 14.12
Figure 14.12. Figure 14.12: A generator mixture can be folded into one larger generator by using part of the latent input as a selector. Equilibrium existence is not the same as training convergence. The toy game min x max y xy [PITH_FULL_IMAGE:figures/full_fig_p211_14_12.png] view at source ↗
Figure 14.13
Figure 14.13. Figure 14.13: In the bilinear game minx maxy xy, gradient descent-ascent rotates around the equilibrium. Game-theoretic existence results therefore do not by themselves guarantee stable GAN training. This distinction matters. The finite-sample and equilibrium results give structural reasons why restricted GAN objectives are not hopeless. They do not replace the analysis of the actual training dynamics used in neural… view at source ↗
Figure 14.14
Figure 14.14. Figure 14.14: Generative compressed sensing searches over latent variables, not over all of R d . The recovered signal is constrained to lie in the generator range [PITH_FULL_IMAGE:figures/full_fig_p213_14_14.png] view at source ↗
Figure 14.15
Figure 14.15. Figure 14.15: Inpainting as a measurement problem. The observed pixels define b = Ax; the generator prior supplies plausible missing pixels through G(zb). Generative priors are powerful but not magical. The true signal may not lie near the generator range. The latent optimization problem is nonconvex. Theory often assumes random measurements, while applications may use structured or adversarially missing observation… view at source ↗
Figure 15.1
Figure 15.1. Figure 15.1: Transformer theory connects approximation, optimization, and generalization to a new question: when does a fixed network implement an adaptive learning algorithm in its activations? Definition 15.1 (In-context learning). Let Cn = ((xi , yi))n i=1 be a context of examples and let x⋆ be a query. A fixed predictor Tθ performs in-context learning when it outputs yb⋆ = Tθ(Cn, x⋆) without changing the paramet… view at source ↗
Figure 15.2
Figure 15.2. Figure 15.2: In-context learning turns the prompt into temporary data. The model predicts at x⋆ using information routed from the preceding examples. Transformers are particularly well suited to this view because self-attention compares every token with previous tokens, assigns data-dependent weights, and then aggregates information. In a causal language model, this aggregation is online: token t can use tokens 1, .… view at source ↗
Figure 15.3
Figure 15.3. Figure 15.3: Self-attention supplies content-dependent memory access: a query can read from previous tokens according to learned similarity scores. The guiding question for this chapter is: When does a pretrained transformer implement a learning algorithm inside its for￾ward pass? The answer has three parts. Architecture tells us how attention and residual streams process the prompt. Expressivity tells us which sequ… view at source ↗
Figure 15.4
Figure 15.4. Figure 15.4: Scaled dot-product attention computes query-key scores, normalizes them row￾wise, and uses the resulting attention matrix to average value vectors. Definition 15.2 (Causal mask). For an autoregressive transformer, token t is only allowed to attend to positions 1, . . . , t. Equivalently, the attention weights satisfy Atj = 0 for j > t. This makes the transformer a learned online algorithm over the promp… view at source ↗
Figure 15.5
Figure 15.5. Figure 15.5: The causal mask makes the attention matrix lower triangular: a token may read from the past and present, but not from the future. 15.2.2 Blocks and Heads A transformer block alternates communication across positions with local nonlinear transforma￾tions. In the simplified residual form used in the slides, U ℓ = Hℓ + MHA(Hℓ ), Hℓ+1 = U ℓ + MLP(U ℓ ). The residual stream preserves old information, attenti… view at source ↗
Figure 15.6
Figure 15.6. Figure 15.6: A transformer block combines communication through attention with token-wise computation through an MLP, both wrapped by residual connections. Multi-head attention repeats the attention operation several times with different learned projections. Heads are not independent learners, but they give the model several simultaneous memory reads. One head may retrieve a label, another may compare features, and … view at source ↗
Figure 15.7
Figure 15.7. Figure 15.7: Multi-head attention performs several memory reads in parallel and then recom￾bines the resulting features. 15.3 Expressivity and Attention For a query token i, one attention head computes h + i = X j≤i aijvj , aij = exp(q > i kj/ √ dk) P r≤i exp(q > i kr/ √ dk) . This formula is a learned, data-dependent smoother. The value vectors {vj} are averaged using weights determined by the current representatio… view at source ↗
Figure 15.8
Figure 15.8. Figure 15.8: An induction-head style circuit first locates a previous matching pattern and then routes the token that followed it. Mechanistic studies of in-context learning have found induction heads in trained transformer language models. The diagram is schematic, but it captures the algorithmic flavor: a head can use content matching to retrieve a past continuation and copy it into the prediction path. Example 15… view at source ↗
Figure 15.9
Figure 15.9. Figure 15.9: Universal ICL results view a transformer as an approximator of learning rules: the prompt is mapped directly to a prediction. The expressivity landscape has three levels. Attention can route information by content￾based lookup. Transformers can represent many formal algorithms, though robustness under finite precision is subtle. In-context learning can solve new tasks under a task-distribution assumptio… view at source ↗
Figure 15.10
Figure 15.10. Figure 15.10: A linear-attention view of the first gradient step: the context is aggregated into a sufficient statistic and then combined with the query. Proof idea. Store xi in key and value features and store yi in value features. The query token collects a context statistic such as P i yixi . The output projection combines this statistic with x⋆ to produce yb⋆. Repeating the same pattern with a representation of … view at source ↗
Figure 15.11
Figure 15.11. Figure 15.11: In the algorithmic view, depth corresponds to computation time: successive layers can implement successive optimization steps. Theorem 15.8 (Preconditioned gradient descent, informal). For linear regression prompts, trained transformer architectures can implement updates resembling wt+1 = wt − ηPtX>(Xwt − y)/n, where Pt acts like a learned preconditioner adapted to the task distribution. A good precond… view at source ↗
Figure 15.12
Figure 15.12. Figure 15.12: The Bayesian view: the context updates beliefs about the latent task, and the prediction integrates over that posterior. Example 15.9 (Linear-Gaussian tasks). Assume w ∼ N(0, τ 2 I), yi = x > i w + ξi , ξi ∼ N(0, σ2 ). Then the posterior mean prediction is E[y⋆ | Cn, x⋆] = x > ⋆ (X>X + λI) −1X>y, λ = σ 2 /τ 2 . Thus Bayes prediction coincides with ridge regression. This example also clarifies the role … view at source ↗
Figure 15.13
Figure 15.13. Figure 15.13: Algorithm selection in context. The examples reveal a latent task family, and the model routes the prompt through the corresponding rule. Mixture models clarify both the promise and the limitation. Wider pretraining mixtures can improve task identification; rare task families need enough mass to be learned; and interpolation between families is not automatic. The last point is often missed. Training on… view at source ↗
Figure 15.14
Figure 15.14. Figure 15.14: Pretraining mixtures matter. A model can learn narrow selection among familiar families while remaining weak in low-support or out-of-mixture regions. 15.6 Generalization and Limits There are at least three axes of generalization. A transformer can be fluent at token-level language modeling, yet fail to infer the correct rule from a short prompt. Prompt-level sample efficiency asks how many examples ar… view at source ↗
Figure 15.15
Figure 15.15. Figure 15.15: In-context learning generalizes along several axes. The ability to continue text is not identical to the ability to infer a task rule [PITH_FULL_IMAGE:figures/full_fig_p229_15_15.png] view at source ↗
Figure 15.16
Figure 15.16. Figure 15.16: Distribution shift for ICL. A prompt from an unseen task family may not be handled by the algorithm selected during pretraining. Remark 15.10 (Limits of the algorithmic view). From a short prompt alone, many task rules can agree on all observed examples but disagree on the query. The prior over tasks is therefore doing real statistical work. A learned in-context algorithm can be brittle under out￾of-di… view at source ↗
Figure 16.1
Figure 16.1. Figure 16.1: Diffusion models and flow matching replace direct density estimation by learned transport along a path of intermediate distributions. Definition 16.1 (Denoising generator). A diffusion model chooses a forward corruption process that maps x0 ∼ pdata to noisy variables xt ≈ αtx0 + σtε, ε ∼ N (0, I), and learns a reverse-time rule that removes noise step by step. Training is supervised by corrupted data; s… view at source ↗
Figure 16.2
Figure 16.2. Figure 16.2: The diffusion idea: a known noising process moves data toward a simple prior, and a learned reverse process generates data from noise. Definition 16.2 (Continuous transport and flow matching). Let X0 ∼ pbase be a base sample and X1 ∼ pdata a data sample. Flow matching specifies a continuous path Xt from base to data and learns a time-dependent velocity field dXt dt = vt(Xt). The generator is obtained by… view at source ↗
Figure 16.3
Figure 16.3. Figure 16.3: Flow matching learns the velocity field of a chosen path from the base law to the data law. The chapter has three layers. First, we review the discrete DDPM construction. Second, we pass to continuous-time score SDEs and probability-flow ODEs. Third, we formulate flow matching as direct velocity regression and compare the resulting path design and solver ques￾tions. 16.2 Discrete Diffusion 16.2.1 Forwar… view at source ↗
Figure 16.4
Figure 16.4. Figure 16.4: The DDPM forward process is a prescribed noising chain. At large t, the marginal becomes close to a standard Gaussian [PITH_FULL_IMAGE:figures/full_fig_p235_16_4.png] view at source ↗
Figure 16.5
Figure 16.5. Figure 16.5: The simplified DDPM objective is supervised regression: construct a noisy input and ask the network to predict the noise that was added. 16.2.3 Reverse Sampling Sampling uses learned reverse conditionals pθ(xt−1 | xt) = N (µθ(xt , t), Σt) to transform xT ∼ N (0, I) into a sample near the data distribution. The learned mean points toward higher data likelihood, while the variance term keeps the reverse c… view at source ↗
Figure 16.6
Figure 16.6. Figure 16.6: Reverse sampling follows a learned Markov chain from Gaussian noise back toward the data distribution. This explains why diffusion models avoid explicit density estimation. They do not need pdata(x) itself. They need local denoising information at many noise levels. Large noise levels are close to Gaussian and easier to model; small noise levels are geometry-sensitive but local. Multi-scale training dec… view at source ↗
Figure 16.7
Figure 16.7. Figure 16.7: A score field is local: it points toward regions of larger log-density at each noise level [PITH_FULL_IMAGE:figures/full_fig_p237_16_7.png] view at source ↗
Figure 16.8
Figure 16.8. Figure 16.8: The reverse SDE and the probability-flow ODE differ pathwise but agree at the level of one-time marginal distributions [PITH_FULL_IMAGE:figures/full_fig_p238_16_8.png] view at source ↗
Figure 16.9
Figure 16.9. Figure 16.9: Sampler design balances quality, number of solver steps, stochasticity, and accu￾mulated error. 16.4 Learning Scores 16.4.1 Denoising Score Matching The basic identity behind diffusion training says that Gaussian denoising recovers the score of the corrupted distribution. Lemma 16.6 (Gaussian corruption score). Let Xt = αtX0 + σtε, ε ∼ N (0, I), and let pt be the density of Xt. Then ∇x log pt(x) = E  −… view at source ↗
Figure 16.10
Figure 16.10. Figure 16.10: Score learning is vector-field regression: a corrupted sample is fed to the network and a denoising target supplies the local field. Several sources of error must be controlled to turn score learning into a guarantee for gener￾ated samples. The learned score may differ from the true score, the numerical solver introduces time-discretization error, and the initialization or terminal truncation may leave… view at source ↗
Figure 16.11
Figure 16.11. Figure 16.11: The generated distribution is affected by score estimation error, solver error, and tail or truncation error. Theorem 16.7 (Informal error decomposition). Assume the target density is regular, the score estimator satisfies Z T 0 Ept ksθ(Xt , t) − st(Xt)k 2 dt ≤ ε 2 score, and the sampler has step size h. Many analyses prove bounds of the form dist(pθ,0, pdata) ≤ C [PITH_FULL_IMAGE:figures/full_fig_p24… view at source ↗
Figure 16.12
Figure 16.12. Figure 16.12: Conditional flow matching constructs paths between base and data samples; the conditional velocity is known by design. The flow matching objective is min θ Et,z,x kvθ(Xt , t) − ut(Xt | z, x)k 2 . The notation ut(Xt | z, x) emphasizes that the target is the velocity of the conditional path associated with the sampled pair [PITH_FULL_IMAGE:figures/full_fig_p242_16_12.png] view at source ↗
Figure 16.13
Figure 16.13. Figure 16.13: Conditional flow matching is again supervised vector-field regression, now with velocity labels rather than denoising labels. 16.5.3 Rectification and Couplings The simplest independent coupling between z and x may create curved or crossing marginal trajectories. Rectified flow trains on interpolated pairs, then re-samples pairs from the learned flow and trains again. The goal is to straighten paths so… view at source ↗
Figure 16.14
Figure 16.14. Figure 16.14: Rectified flow aims to replace curved trajectories by straighter paths, reducing the numerical burden during sampling [PITH_FULL_IMAGE:figures/full_fig_p243_16_14.png] view at source ↗
Figure 16.15
Figure 16.15. Figure 16.15: Minibatch optimal-transport couplings try to pair base and data samples so that the conditional paths are easier to learn. 16.6 Unification and Solvers 16.6.1 Two Views of the Same Transport Problem Diffusion and flow matching can be placed side by side: diffusion/score: learn st(x) = ∇ log pt(x), vt(x) = f(x, t) − 1 2 g(t) 2 st(x), where the path is induced by a noising SDE. Flow matching instead choo… view at source ↗
Figure 16.16
Figure 16.16. Figure 16.16: Stochastic interpolants unify deterministic flows and diffusion-like paths by com￾bining interpolation with optional injected noise. 16.6.2 Solver and Neural Error Given a learned field vθ, ODE sampling solves X˙ t = vθ(Xt , t), X0 ∼ pbase. If the true field is Lipschitz with constant L, a typical trajectory error bound has the schematic form sup t [PITH_FULL_IMAGE:figures/full_fig_p245_16_16.png] view at source ↗
Figure 16.17
Figure 16.17. Figure 16.17: Classifier-free guidance modifies the local score or velocity direction to trade diversity for condition fidelity. 16.6.4 When Is Flow Matching Easier? A path is easy when its marginal velocity field is smooth, low-curvature, and aligned with the data geometry. Independent linear interpolation is simple but can create crossing trajectories. Optimal-transport couplings can reduce kinetic energy. Diffusi… view at source ↗
Figure 16.18
Figure 16.18. Figure 16.18: A concept map for diffusion and flow matching: choose a path, learn a field, solve dynamics, and generate samples. 16.7 Limitations, Connections, and Outlook 16.7.1 What Current Theory Explains Current theory explains several important pieces. Denoising losses identify scores or velocities under ideal sampling. Continuous-time samplers have clean PDE interpretations through the Fokker–Planck and contin… view at source ↗
Figure 17.1
Figure 17.1. Figure 17.1: A deployment budget can be spent before deployment, through model size and data, or during deployment, through test-time computation. Definition 17.1 (Training-time and test-time scaling). Training-time scaling chooses N and D under a training compute budget Ctrain ≈ κND, where κ depends on the architecture and implementation. Test-time scaling uses extra com￾putation after training to answer a particul… view at source ↗
Figure 17.2
Figure 17.2. Figure 17.2: The workflow of scaling experiments: measure smaller runs, fit laws, allocate resources, and update the frontier [PITH_FULL_IMAGE:figures/full_fig_p250_17_2.png] view at source ↗
Figure 17.3
Figure 17.3. Figure 17.3: On log-log axes, excess loss under a power law is approximately linear. A change of regime changes the slope. For language models, a common separable approximation is L(N, D) = L∞ + AN −α + BD−β . The first term after the floor represents capacity or approximation error; the second represents data-limited estimation error. In practice this formula is an approximation to a smooth region of an empirical f… view at source ↗
Figure 17.4
Figure 17.4. Figure 17.4: Scaling-law fitting is a closed loop: small runs estimate the law, larger predictions test it, and the residuals guide new experiments. Remark 17.3 (What power laws do not say). A fitted power law summarizes a regime; it does not identify the mechanism by itself. Dataset mixture, tokenizer, optimizer, architecture, and evaluation all matter. Scaling can break near contamination, saturation, distribution… view at source ↗
Figure 17.5
Figure 17.5. Figure 17.5: Under a fixed training compute budget, increasing model size and increasing data are competing uses of the same resource. Proposition 17.4 (Balanced allocation). Assume L(N, D) − L∞ = AN −α + BD−β , ND = C. Then the compute-optimal allocation obeys N⋆ ∝ C β/(α+β) , D⋆ ∝ C α/(α+β) . The constants A and B determine the exact ratio at a given scale. Proof. Substitute D = C/N. The excess loss becomes AN −α … view at source ↗
Figure 17.6
Figure 17.6. Figure 17.6: Compute-optimal training balances marginal returns from model size and data [PITH_FULL_IMAGE:figures/full_fig_p253_17_6.png] view at source ↗
Figure 17.7
Figure 17.7. Figure 17.7: A historical shift: more data per parameter moved the compute-optimal frontier away from undertrained large models. High-quality data may be finite. If D must exceed unique data, training repeats examples. Repetition can still help, but marginal returns often degrade. Synthetic data, filtering, curricu￾lum, and mixture design change the effective token count, so data-constrained scaling is about quality… view at source ↗
Figure 17.8
Figure 17.8. Figure 17.8: Data-constrained scaling distinguishes unique data, repeated tokens, synthetic tokens, and effective data quality [PITH_FULL_IMAGE:figures/full_fig_p254_17_8.png] view at source ↗
Figure 17.9
Figure 17.9. Figure 17.9: A spectral view: learning more modes leaves a smaller tail of unlearned energy, and algebraic tails can produce power-law curves. This picture connects scaling laws to the themes of previous chapters. A useful mental decomposition is L − L∞ ≈ approximation | {z } N + estimation | {z } D + optimization | {z } training recipe . Approximation improves when the hypothesis class expands. Estimation improves … view at source ↗
Figure 17.10
Figure 17.10. Figure 17.10: Effective dimension counts how many eigen-directions are visible at a given scale. Proposition 17.6 (Spectral tail to learning curve). Suppose the remaining target energy after learning m modes satisfies X j>m a 2 j ≤ C0m−r . If the number of reliably learned modes grows as m(n)  n γ , then L(n) − L∞ ≲ n −rγ . Proof. After n samples or units of resource, assume the algorithm reliably learns the first … view at source ↗
Figure 17.11
Figure 17.11. Figure 17.11: Scaling extrapolation is fragile when the experimental regime changes [PITH_FULL_IMAGE:figures/full_fig_p256_17_11.png] view at source ↗
Figure 17.12
Figure 17.12. Figure 17.12: Test-time compute treats inference as a small decision process: generate candi￾dates, check them, and possibly continue. Lemma 17.8 (Oracle pass probability). If each independent sample solves a problem with probability p, then the probability that at least one of K samples is correct is Ppass(K) = 1 − (1 − p) K. Proof. The probability that no sample is correct is (1 − p) K. Taking the complement gives… view at source ↗
Figure 17.13
Figure 17.13. Figure 17.13: Repeated sampling has diminishing returns: gains are largest when the base success probability is neither tiny nor already close to one. 17.5.2 Self-Consistency and Verifiers Self-consistency samples multiple reasoning traces and aggregates their final answers by majority vote or confidence. The method is useful when the model can find diverse paths to the same correct answer and when errors are not pe… view at source ↗
Figure 17.14
Figure 17.14. Figure 17.14: Self-consistency spends compute on diverse traces and aggregates the final an￾swers. Proposition 17.9 (Selection matters). Let Y1, . . . , YK be candidate answers with verifier scores R(Yi). A verifier-guided best-of-K policy outputs bY = arg max 1≤i≤K R(Yi). Improvement from extra samples requires the verifier ranking to correlate with correctness. Outcome reward models score final answers, while proc… view at source ↗
Figure 17.15
Figure 17.15. Figure 17.15: Search over thoughts allocates inference compute to branching, evaluating, and pruning partial solutions. 17.6 Joint Scaling of Training and Inference 17.6.1 An Inference-Aware Objective For a trained model and a test-time policy π, write the deployment loss as Ldeploy = L(N, D, π; Ctrain, Cinfer), Ctotal = Ctrain + Cinfer. Training compute buys a stronger base policy for all future queries. Inference … view at source ↗
Figure 17.16
Figure 17.16. Figure 17.16: Deployment quality depends on several axes: model size, data, number of at￾tempts, and depth of deliberation or search. 17.6.2 Thinking Versus Size If extra inference compute raises solution probability more cheaply than increasing model size, a smaller model with more thinking can dominate. This is most plausible for hard problems with verifiable answers. Ambiguous or open-ended tasks may favor better… view at source ↗
Figure 17.17
Figure 17.17. Figure 17.17: A smaller but search-friendly system can beat a larger base model when inference compute has higher marginal return. The policy must also know when to stop thinking. An adaptive test-time policy spends a low budget on easy queries and a high budget on hard ones. Calibration therefore becomes part of the scaling problem: the model must estimate whether extra computation is worth the latency and cost. Ad… view at source ↗
Figure 17.18
Figure 17.18. Figure 17.18: Adaptive inference routes easy queries to cheap answers and hard queries to more expensive deliberation. 17.6.3 Qualitative Regimes The return to test-time compute depends strongly on base skill. If the base model is too weak, sampling mostly repeats failures. At medium base skill, extra compute can have high leverage because some correct reasoning paths are within reach. At high base skill, returns sa… view at source ↗
Figure 17.19
Figure 17.19. Figure 17.19: A phase diagram for inference scaling: test-time compute is most valuable when the base model is competent but not saturated. fit empirical laws choose N, D test-time policy deployed frontier measure, allocate, deploy, and update [PITH_FULL_IMAGE:figures/full_fig_p261_17_19.png] view at source ↗
Figure 17.20
Figure 17.20. Figure 17.20: Joint scaling closes the loop between empirical measurement, training allocation, inference policy, and deployed performance [PITH_FULL_IMAGE:figures/full_fig_p261_17_20.png] view at source ↗
Figure 18.1
Figure 18.1. Figure 18.1: Mechanistic interpretability refines behavioral descriptions into features and causal circuits. The lower level is not automatically better: it must still explain the observed behavior. The chapter follows the route in figure 18.2. We first review circuits and interventions. We then study superposition, the hypothesis that models pack more features than available coordinates by representing features as … view at source ↗
Figure 18.2
Figure 18.2. Figure 18.2: The chapter moves from observable behavior to hidden states, sparse features, and causal feature graphs. 18.2 Circuits and Interventions 18.2.1 Residual Streams and Features Let x1:T be a token sequence and let rℓ(t) ∈ R d denote the residual stream at layer ℓ and position t. In a simplified pre-normalization transformer block, the next residual stream is written as rℓ+1(t) = rℓ(t) + Attnℓ(rℓ)t + MLPℓ(r… view at source ↗
Figure 18.3
Figure 18.3. Figure 18.3: A circuit claim is causal: the name feature, the attention head, and the copy feature are proposed to mediate a target logit. The figure is schematic; real transformer circuits often use many positions and layers. 18.2.2 Activation Patching Activation patching is the basic experimental tool for turning a circuit hypothesis into a causal test. Run the model on a clean input x where the behavior occurs an… view at source ↗
Figure 18.4
Figure 18.4. Figure 18.4: Activation patching copies an internal state from a clean run into a corrupted run and measures whether the behavior is restored. Patching is most informative when the clean and corrupted inputs differ in a controlled way. For an induction-head experiment, a clean prompt may contain the pattern A B, . . . , A so that the model should predict B. A corrupted prompt changes the earlier occurrence and remov… view at source ↗
Figure 18.5
Figure 18.5. Figure 18.5: An induction-head motif: one part of the circuit matches the repeated token A, and another part copies the following token B into the current position. Remark 18.5 (What interventions prove). Strong evidence for a circuit comes from a pattern of interventions: patching restores the behavior, ablation destroys it, a simpler replacement circuit preserves it, and the effect transfers to held-out examples. … view at source ↗
Figure 18.6
Figure 18.6. Figure 18.6: Polysemanticity occurs when one neuron is associated with several apparently different features. Superposition explains how this can happen when a model stores many sparse features in fewer dimensions. The central question is whether n > d useful features can be represented reliably in a d￾dimensional hidden space. Superposition answers yes, under sparsity and tolerance for some interference. The two qu… view at source ↗
Figure 18.7
Figure 18.7. Figure 18.7: Five feature directions can be packed into a two-dimensional hidden space. The price is nonzero inner products, hence interference when multiple features are active together [PITH_FULL_IMAGE:figures/full_fig_p270_18_7.png] view at source ↗
Figure 18.8
Figure 18.8. Figure 18.8: A schematic phase diagram for feature density and dimension pressure. As features become dense or dimensions become scarce, the model is pushed toward superposition. Superposition can be understood as compression. It lets a model store many rare features, use limited residual dimensions efficiently, and build distributed codes. Its cost is that neurons and directions become entangled: individual neurons… view at source ↗
Figure 18
Figure 18. Figure 18: figure 18.8 [PITH_FULL_IMAGE:figures/full_fig_p271_18.png] view at source ↗
Figure 18.9
Figure 18.9. Figure 18.9: An SAE maps a dense activation to an overcomplete sparse code and then decodes the code back into the original activation space. 18.4.2 Why Sparsity Helps Without sparsity, the decomposition a = Pm j=1 djzj has many equivalent solutions. If a dense code reconstructs well, a rotated dense code may reconstruct equally well, so the coordinates need not correspond to stable concepts. Sparsity breaks much of… view at source ↗
Figure 18.10
Figure 18.10. Figure 18.10: Choosing λ balances reconstruction against sparsity. The best interpretability usually lies in an intermediate regime, not at either extreme. Several training choices matter in practice: which layer and token positions provide acti￾vations, whether the residual stream or an MLP activation is used, the expansion factor m/d, whether encoder and decoder weights are tied, the sparse penalty schedule, activ… view at source ↗
Figure 18.11
Figure 18.11. Figure 18.11: Feature circuits replace dense activation vectors by sparse variables and then study the causal interactions among those variables. For a scalar logit or score y, local linearization gives a first-order attribution formula: ∆y ≈ X j ∂y ∂zj ∆zj . (18.7) A feature j is a positive contributor on an example if zj is active and ∂y/∂zj is large and positive. It is inhibitory if the derivative is negative. At… view at source ↗
Figure 18.12
Figure 18.12. Figure 18.12: A feature attribution graph summarizes local positive and negative influences among active sparse features and a target logit [PITH_FULL_IMAGE:figures/full_fig_p274_18_12.png] view at source ↗
Figure 18.13
Figure 18.13. Figure 18.13: SAE feature-circuit work is iterative: discover features, interpret them, intervene on them, build graphs, and test the explanation on new data. Example 18.9 (A monosemantic feature claim). A feature is plausibly monosemantic when its top activating examples share a coherent concept, it is sparse across unrelated contexts, its activation predicts downstream behavior, and ablation or steering changes th… view at source ↗
Figure 18.14
Figure 18.14. Figure 18.14: Evaluation should not confuse interpretability with faithfulness. Human-legible features still need causal and transfer tests. Representation metrics include reconstruction error, fraction of variance explained, average active features, feature activation frequency, and the dead-feature rate. Mechanistic metrics in￾clude causal ablation effects, steering consistency, transfer across prompts, circuit co… view at source ↗
Figure 18.15
Figure 18.15. Figure 18.15: Common SAE failure modes form a loop: dataset choices bias the dictionary, human labels over-interpret features, weak validation permits strong claims, and those claims influence the next dataset. The main theoretical questions are still open. First, under what distributional and archi￾tectural assumptions is an SAE dictionary identifiable up to permutation and scaling? Second, when does a feature-leve… view at source ↗
Figure 19.1
Figure 19.1. Figure 19.1: Grokking separates fitting the training set from discovering a generalizing internal rule. Definition 19.1 (Grokking phenomenology). A training run exhibits grokking when it reaches near-perfect training accuracy at time ttrain, while near-perfect test accuracy appears only after a much larger time ttest: 1 ttrain ttest, acctrain(ttrain) ≈ 1, acctest(ttest) ≈ 1. The phenomenon is not just a high final t… view at source ↗
Figure 19.2
Figure 19.2. Figure 19.2: The visible metric may remain flat while the model builds reusable structure. The delayed jump marks when that structure becomes usable by the readout. Grokking is a meeting point of themes from earlier chapters. Optimization dynamics deter￾mine whether the model keeps moving after interpolation. Implicit regularization selects among [PITH_FULL_IMAGE:figures/full_fig_p280_19_2.png] view at source ↗
Figure 19.3
Figure 19.3. Figure 19.3: Grokking studies the move from interpolation to structured computation, so it naturally touches optimization, regularization, representation geometry, and emergence. algorithmic tasks grokking curves Fourier mechanisms emergent generalization [PITH_FULL_IMAGE:figures/full_fig_p281_19_3.png] view at source ↗
Figure 19.4
Figure 19.4. Figure 19.4: The chapter first defines the experimental setup, then analyzes the curves, the modular-arithmetic mechanism, and the connection to emergence. 19.2 Setup and Algorithmic Tasks 19.2.1 Supervised Learning View Let (x, y) ∼ D and train a model fθ : X → Y on S = {(xi , yi)} n i=1, Rb S(θ) = 1 n Xn i=1 ℓ(fθ(xi), yi). The population risk is R(θ) = E(x,y)∼Dℓ(fθ(x), y). Interpolation means Rb S(θ) ≈ 0. The grok… view at source ↗
Figure 19.5
Figure 19.5. Figure 19.5: Modular addition on Z7 can be viewed as moving around a clock. This reveals periodic structure that a lookup table hides. Two predictors can have the same training loss but very different test behavior. A mem￾orizing solution stores observed labels with little reusable structure. An algorithmic solution represents the periodic rule and applies one computation to unseen pairs. Example 19.2 (Lookup versus… view at source ↗
Figure 19.6
Figure 19.6. Figure 19.6: Memorization and rule learning are both compatible with interpolation, but only the rule generalizes across the whole modular table. Definition 19.3 (Grokking delay). For a threshold 0 < α < 1, define ttrain = inf{t : acctrain(t) ≥ α}, ttest = inf{t : acctest(t) ≥ α}. The grokking delay is the ratio ttest/ttrain. A large delay means that hidden learning lasts long after the model fits the training set. … view at source ↗
Figure 19.7
Figure 19.7. Figure 19.7: The delay ratio compares the time to interpolate with the time to generalize. 19.3 Curves and Phase Transitions 19.3.1 Canonical Curves The canonical grokking curve has a rapid training-accuracy rise, a long plateau, and a late test-accuracy jump. Time is often plotted on a logarithmic scale because the delay can span orders of magnitude [PITH_FULL_IMAGE:figures/full_fig_p283_19_7.png] view at source ↗
Figure 19.8
Figure 19.8. Figure 19.8: A typical grokking curve: the training metric saturates early, while the test metric improves only after a long delay. It is tempting to call this a phase transition. That language is useful only when one specifies a measurable statistic that behaves like an order parameter and shows sharper change as width, data, or compute grows. In statistical physics, a phase transition is not merely a steep curve; … view at source ↗
Figure 19.9
Figure 19.9. Figure 19.9: An order parameter translates a qualitative regime change into a measurable curve [PITH_FULL_IMAGE:figures/full_fig_p284_19_9.png] view at source ↗
Figure 19.10
Figure 19.10. Figure 19.10: Hidden progress measures can improve for a long time before test accuracy crosses the threshold. Proposition 19.5 (Thresholded metrics). Suppose a latent score s(C) improves smoothly with compute or scale C, but a benchmark reports A(C) = 1{s(C) ≥ τ}. Then A(C) can look discontinuous even when s(C) is smooth. Proof. If s is continuous and crosses τ at C0, then the indicator is zero for s(C) < τ and one… view at source ↗
Figure 19.11
Figure 19.11. Figure 19.11: A thresholded benchmark can turn gradual capability into an abrupt reported score. 19.4 Mechanisms in Modular Arithmetic 19.4.1 Fourier Coordinates on Zp The reason modular addition is analytically attractive is that the cyclic group has a simple Fourier basis. Let ω = exp(2πi/p). The Fourier characters on Zp are χk(x) = ω kx, k, x ∈ Zp. Every function g : Zp → C can be written as g(x) = X k∈Zp gb(k)χk… view at source ↗
Figure 19.12
Figure 19.12. Figure 19.12: A compact circuit computes phases for a and b, combines them, and scores the modular sum. 19.4.2 A Mechanistic Story Mechanistic analyses of modular addition find internal variables that become more sinusoidal and more aligned with the correct Fourier modes before test accuracy jumps. The story is not that the network stores every pair (a, b). Instead, embeddings gradually organize around circular feat… view at source ↗
Figure 19.13
Figure 19.13. Figure 19.13: A mechanistic progress story: embeddings become circular, phase features com￾bine trigonometrically, and logits become readable as modular answers [PITH_FULL_IMAGE:figures/full_fig_p287_19_13.png] view at source ↗
Figure 19.14
Figure 19.14. Figure 19.14: Progress measures can reveal a shift from memorizing features to algorithmic features before the accuracy jump. Proposition 19.8 (Shared harmonic rule). If a model constructs features approximating χk(a) and χk(b) for several k, then a shallow readout can represent modular addition on all pairs using χk(a + b) = χk(a)χk(b). Proof sketch. Characters form a complete basis for functions on Zp. Addition be… view at source ↗
Figure 19.15
Figure 19.15. Figure 19.15: Interpolation alone does not determine which solution is selected. Grokking depends on the late-time movement among low-training-loss solutions. Proposition 19.9 (Simplicity bias). Suppose two interpolating predictors fmem and frule satisfy Rb S(fmem) = Rb S(frule) = 0, Ω(frule) < Ω(fmem). If training continues to decrease Ω on the interpolation manifold, then the selected predictor can move from memor… view at source ↗
Figure 19.16
Figure 19.16. Figure 19.16: Grokking can be interpreted as a slow transition from a lazy fit to a feature￾learning solution [PITH_FULL_IMAGE:figures/full_fig_p289_19_16.png] view at source ↗
Figure 19.17
Figure 19.17. Figure 19.17: Grokking separates the presence of statistical signal from the optimization process that extracts it efficiently. 19.6.2 Emergence and Measurement Large-model benchmarks often report emergent abilities: performance appears near chance for small scales and then jumps at a threshold. Grokking suggests two explanations. The first is a genuine mechanistic transition: a new internal algorithm becomes availa… view at source ↗
Figure 19.18
Figure 19.18. Figure 19.18: Emergent abilities may reflect a new internal mechanism, a thresholded metric, or both. A sharp score alone is not enough to decide. A useful way to summarize experiments is a phase diagram whose axes are regularization strength and data fraction. Delay is most visible when the rule exists, but memorization is initially easier. Too little data and weak regularization favor memorization. Enough data may… view at source ↗
Figure 19.19
Figure 19.19. Figure 19.19: A schematic phase diagram: grokking is most visible near the boundary between memorization-dominated training and rule-learning training. 19.7 Measurement, Limits, and Open Questions The right measurements include both external and internal quantities. External metrics include training and test loss, accuracy at multiple thresholds, calibration, margin, and robustness across random splits. Internal met… view at source ↗
Figure 20.1
Figure 20.1. Figure 20.1: A common alignment pipeline first builds capability through pretraining, then teaches instruction following, and finally optimizes against preference information. The first mathematical question is therefore: How does a preference label become an objective for changing a language model policy? The rest of the chapter answers this question in increasing detail. We first model preference data statisticall… view at source ↗
Figure 20.2
Figure 20.2. Figure 20.2: Preference supervision compares candidate behaviors rather than declaring one response to be the unique target [PITH_FULL_IMAGE:figures/full_fig_p295_20_2.png] view at source ↗
Figure 20.3
Figure 20.3. Figure 20.3: The argument: model preference labels, connect reward optimization to KL control, derive DPO, and then examine the broader method family and its limitations. 20.2 Preference Data and Reward Models Definition 20.2 (Preference dataset). A pairwise preference dataset is a collection P = {(xi , y+ i , y− i )} n i=1, y+ i xi y − i . The prompt xi is sampled from a prompt distribution, y + i is the preferred … view at source ↗
Figure 20.4
Figure 20.4. Figure 20.4: Reward-model training fits a scalar proxy so that preferred responses receive larger scores than rejected responses. Proposition 20.5 (Preference likelihood is invariant to prompt shifts). For any function c : X → R, the rewards r(x, y) and r 0 (x, y) = r(x, y) + c(x) induce exactly the same Bradley–Terry preference probabilities [PITH_FULL_IMAGE:figures/full_fig_p297_20_4.png] view at source ↗
Figure 20.5
Figure 20.5. Figure 20.5: Preference modeling is statistical estimation under noisy and incomplete feedback. The optimized policy can also change the distribution of future comparisons. 20.3 RLHF as KL-Regularized Control In classical RLHF for language models, the reward model is used as an objective for policy optimization. The policy is not optimized freely: it is kept near a reference policy, usually the supervised fine-tuned… view at source ↗
Figure 20.6
Figure 20.6. Figure 20.6: RLHF learns a reward model from preferences and then performs KL-regularized policy optimization. For a prompt x, let πref(· | x) be the reference policy and let r(x, y) be a learned reward [PITH_FULL_IMAGE:figures/full_fig_p298_20_6.png] view at source ↗
Figure 20.7
Figure 20.7. Figure 20.7: A geometric view of RLHF: the policy moves in a reward-improving direction while being penalized for leaving the neighborhood of the reference policy. In large language models the exact maximizer is not computed by enumerating responses. Practical RLHF often uses a policy-gradient method such as PPO. The policy samples responses, receives a reward consisting of the learned reward model score plus a KL p… view at source ↗
Figure 20.8
Figure 20.8. Figure 20.8: RLHF explicitly learns a reward and optimizes it by RL; DPO folds the KL-control solution into the preference likelihood. Proposition 20.9 (Reference-corrected log odds). DPO increases the policy log-odds of the preferred response relative to the reference policy. Its comparison statistic is mθ = log πθ(yw | x) πθ(yl | x) − log πref(yw | x) πref(yl | x) . Thus the loss does not merely ask whether πθ ass… view at source ↗
Figure 20.9
Figure 20.9. Figure 20.9: DPO behaves like a contrastive supervised objective, with update size controlled by the current reference-corrected margin. To see this analytically, write the loss on one example as ℓθ = − log σ(βmθ). Then ∇θℓθ = −β [PITH_FULL_IMAGE:figures/full_fig_p302_20_9.png] view at source ↗
Figure 20.10
Figure 20.10. Figure 20.10: The inverse-temperature β converts log-ratio margins into preference logits. Larger values make the logistic transition sharper [PITH_FULL_IMAGE:figures/full_fig_p302_20_10.png] view at source ↗
Figure 20.11
Figure 20.11. Figure 20.11: Direct preference objectives differ in their data format and in how they score policy deviations from a reference. IPO and margins. The logistic DPO loss can continue increasing preference margins when labels are nearly separable. IPO-style views introduce a finite target scale. A simplified margin loss has the form  log πθ(yw | x) πref(yw | x) − log πθ(yl | x) πref(yl | x) − m 2 , where m > 0 is a d… view at source ↗
Figure 20.12
Figure 20.12. Figure 20.12: Preference objectives balance three pressures: fit the feedback, remain close to the reference behavior, and optimize strongly enough to matter. This family perspective is useful when comparing algorithms. The question is not whether one loss is universally correct. The question is which assumptions are appropriate: pairwise labels or binary labels, a trusted reference or reference-free scoring, a logi… view at source ↗
Figure 20.13
Figure 20.13. Figure 20.13: Reward hacking can be viewed as distribution shift: optimization queries the proxy reward in regions not well constrained by preference data. Reward hacking. Reward hacking occurs when a policy finds responses that score highly under the proxy reward while failing the intended objective. In language models, this may appear as verbosity that pleases a reward model, confident but false answers, superfici… view at source ↗
Figure 20.14
Figure 20.14. Figure 20.14: Goodhart’s law in schematic form: as the proxy becomes the target, true quality may eventually stop improving or decline. Goodhart’s law is not a theorem about every reward model, but it is a useful warning. If a proxy is selected because it correlates with quality on the training distribution, strong optimization can select examples where the correlation breaks. Example 20.12 (Verbosity as a proxy). S… view at source ↗
Figure 20.15
Figure 20.15. Figure 20.15: Preference optimization is safer when the comparison data cover the responses likely to be produced after optimization. Preference ambiguity. Human preferences are not a single deterministic function. They differ across annotators, cultures, expertise levels, and available context. Pairwise labels may be inconsistent, and a majority label may hide serious minority harms. A scalar reward is especially c… view at source ↗
Figure 20.16
Figure 20.16. Figure 20.16: Preference fit is one ingredient of deployment alignment, but it must be combined with robustness, truthfulness, calibration, and other safety requirements. Remark 20.13 (A practical checklist). During training, monitor KL to the reference, chosen and rejected log-likelihoods, held-out preference accuracy, reward-model calibration, and examples with extreme reward. During evaluation, test adversarial p… view at source ↗
Figure 21.1
Figure 21.1. Figure 21.1: OOD generalization asks whether a model trained on observed domains can be trusted on a new deployment domain. The distribution shift creates an extra term beyond ordinary finite-sample generalization. If P is the training distribution and Q is the deployment distribution, then RQ(f) = RP (f) + ∆P→Q(f), where ∆P→Q(f) denotes the price of changing distributions. Classical generalization theory controls t… view at source ↗
Figure 21.2
Figure 21.2. Figure 21.2: Perfect behavior on the training support need not constrain the model’s behavior on deployment support. There are several common names for recurring shift patterns. Under covariate shift, P(X) 6= Q(X) but P(Y | X) is stable. Under label shift, P(Y ) 6= Q(Y ) but P(X | Y ) is stable. Under concept shift, the conditional rule P(Y | X) itself changes. Subpopulation shift changes the mixture weights among g… view at source ↗
Figure 21.3
Figure 21.3. Figure 21.3: OOD deployment combines questions from generalization, robustness, representa￾tion learning, in-context adaptation, and alignment. OOD risk adaptation bounds robust learning detect and calibrate foundation models [PITH_FULL_IMAGE:figures/full_fig_p311_21_3.png] view at source ↗
Figure 21.4
Figure 21.4. Figure 21.4: The chapter moves from formal risk to adaptation bounds, robust training, uncer￾tainty, and foundation-model evaluation. 21.2 Environments, Risks, and Shift Models Definition 21.1 (Environment model and OOD risk). Let e ∈ E index a distribution Pe on X × Y. For a predictor f : X → Y, define Re(f) = E(X,Y )∼Pe [PITH_FULL_IMAGE:figures/full_fig_p311_21_4.png] view at source ↗
Figure 21.5
Figure 21.5. Figure 21.5: A shortcut feature can be predictive in training environments while being unstable across environments. The example exposes the necessary ingredients for OOD learning. The target should not require unconstrained prediction on a region with no source information; some conditional, representation, or mechanism must be shared; the training environments must vary enough to reveal unstable cues; and the lear… view at source ↗
Figure 21.6
Figure 21.6. Figure 21.6: The domain adaptation bound decomposes target error into source error, domain discrepancy, and shared-solution error. The operational interpretation of the discrepancy term leads to domain-adversarial learning. If a domain classifier can easily distinguish source from target in a learned representation Φ(X), then the discrepancy is large and the target bound is loose. A domain-adversarial network theref… view at source ↗
Figure 21.7
Figure 21.7. Figure 21.7: Domain-adversarial representation learning uses a task head and a domain head to encourage discriminative but domain-invariant features. The adaptation bound also points to an impossibility: making marginals look similar cannot create labels where there is no source information [PITH_FULL_IMAGE:figures/full_fig_p314_21_7.png] view at source ↗
Figure 21.8
Figure 21.8. Figure 21.8: Group DRO focuses optimization pressure on the high-loss group rather than allowing average performance to hide a subgroup failure. Definition 21.10 (Invariant Risk Minimization). The ideal IRM objective seeks a repre￾sentation Φ such that the same classifier w is optimal in every training environment: min Φ,w X e∈Etrain Re(w ◦ Φ) s.t. w ∈ arg min w¯ Re( ¯w ◦ Φ) ∀e. The intuition is that stable predicto… view at source ↗
Figure 21.9
Figure 21.9. Figure 21.9: A feature whose relation to the label survives environment changes is a better OOD candidate than a shortcut feature. In practice, the exact IRM constraint is difficult to enforce. A common surrogate fixes a scalar classifier and penalizes environment-wise optimality gradients: min Φ X e∈Etrain Re(Φ) + λ X e∈Etrain k∇wRe(w ◦ Φ)|w=1k 2 . The penalty says that a classifier trained on one environment shoul… view at source ↗
Figure 21.10
Figure 21.10. Figure 21.10: Energy-based detection separates ID and OOD examples by thresholding a scalar score. Predictive uncertainty also changes under dataset shift. A Bayesian or ensemble predictor estimates p(y | x, D) = Z p(y | x, θ) p(θ | D) dθ. Moving away from the training support can increase epistemic uncertainty. Empirically, deep ensembles often improve calibration under moderate shift, but severe semantic shift can… view at source ↗
Figure 21.11
Figure 21.11. Figure 21.11: In-context learning can be viewed as target adaptation using prompt examples and the pretrained prior. For language models, OOD shift is not only a change in input distribution. It can involve new dialects, APIs, laws, tools, adversarial prompt styles, long-context extrapolation, synthetic￾to-real task mismatch, reward-model extrapolation, judge distribution shift, and changes in [PITH_FULL_IMAGE:figu… view at source ↗
Figure 21.12
Figure 21.12. Figure 21.12: OOD robustness for foundation models is an ongoing evaluation and monitoring loop rather than a one-time benchmark score. Recent theory and empirical work asks several related questions. Can a learner test whether a target distribution is safe before trusting a hypothesis? Which representations preserve in￾variant mechanisms rather than surface cues? When do pretrained features help or hurt OOD selecti… view at source ↗
Figure 22.1
Figure 22.1. Figure 22.1: Observed emergence can arise from metric thresholding, finite-size critical behavior, or genuine mechanism formation. This chapter takes a deliberately decompositional view. We do not treat “emergence” as an explanation. We treat it as a hypothesis to be broken into observables, diagnostics, solvable models, training dynamics, and mechanisms. The research problem is to turn a sudden ability into a theor… view at source ↗
Figure 22.2
Figure 22.2. Figure 22.2: The chapter’s structure: define the observable, rule out artifacts, study solvable models, analyze training transitions, and formulate research programs. 22.2 Metrics, Slices, and Apparent Jumps The first diagnostic for any emergence claim is the metric. Many benchmarks convert a continu￾ous latent score into a binary success indicator. This can turn smooth progress into an apparent cliff. Definition 22… view at source ↗
Figure 22.3
Figure 22.3. Figure 22.3: A discontinuous or thresholded metric can create an apparent jump from a smooth latent score. The diagnostic lesson is simple: before claiming emergence, replace exact-match accuracy by a continuous score when possible, condition on task difficulty, vary the prompt distribution, remove in-context demonstrations, and search for internal progress measures that move before the visible jump. An emergence cl… view at source ↗
Figure 22.4
Figure 22.4. Figure 22.4: Aggregating easy and hard slices can hide smoother slice behavior and create an apparent emergence threshold. In-context learning is another confounder. Some abilities attributed to large-model emer￾gence can be decomposed into memory, linguistic knowledge, and task inference from examples in the prompt. A claim about the pretrained model should therefore compare zero-shot, few￾shot, and instruction-onl… view at source ↗
Figure 22.5
Figure 22.5. Figure 22.5: Discrete skill quanta can make one ability appear suddenly while the aggregate loss remains smooth. The multitask sparse parity model makes this idea more concrete. Tasks are basis functions ϕj , and the target is sampled from a heavy-tailed task distribution: f(x) = X K j=1 ajϕj (x), P(j) ∝ j −α . Learning each basis direction has a sigmoidal curve. A power-law task distribution creates smooth aggregat… view at source ↗
Figure 22.6
Figure 22.6. Figure 22.6: A percolation model interprets capability emergence as the appearance of a task￾relevant connected component of learned regularities. For in-context learning, high-dimensional linear attention gives a cleaner limit theory. In linear-regression prompts, take a joint limit d, n, ℓ, k → ∞, ℓ d → α, k d → κ, n d 2 → τ. Low task diversity leads to memorization of pretraining tasks; high diversity leads to ge… view at source ↗
Figure 22.7
Figure 22.7. Figure 22.7: Grokking: training accuracy rises early, while test accuracy rises only after a delayed hidden mechanism becomes usable. Definition 22.9 (Lazy-to-rich transition). During early training, a network may behave [PITH_FULL_IMAGE:figures/full_fig_p327_22_7.png] view at source ↗
Figure 22.8
Figure 22.8. Figure 22.8: A two-minima schematic: as training changes the effective landscape, the dominant state can switch discontinuously. Mechanistic interpretability provides a more constructive route. Rather than asking whether a curve jumps, search for continuous progress variables P1(t), . . . , Pr(t) such that A(t) ≈ Ψ(P1(t), . . . , Pr(t)). For modular addition, circuit components can be tracked in Fourier space before… view at source ↗
Figure 22.9
Figure 22.9. Figure 22.9: An induction head matches a previous token and copies the following token, giving an elementary mechanism for in-context pattern continuation. For Markov-chain data, transformers can learn to estimate transition probabilities from context: pb(xt+1 = j | xt = i) ≈ #(i → j in context) #(i in context) . Observed phases can include uniform prediction, a unigram shortcut, and then a bigram in￾context solutio… view at source ↗
Figure 22.10
Figure 22.10. Figure 22.10: A schematic phase diagram for in-context learning: task diversity, context length, and architecture determine whether memorization or in-context generalization dominates. Several interpretations of in-context learning coexist. In one view, attention implements gradient descent in the forward pass. In another, the prompt defines a posterior over tasks. In a third, the model retrieves or reuses stored ex… view at source ↗
Figure 22.11
Figure 22.11. Figure 22.11: A reference map for emergence theory: empirical claims should be filtered through diagnostics, solvable models, training dynamics, and mechanistic evidence. The open problems are clear. Can we prove finite-size scaling exponents for attention-based models beyond linear attention? Can progress measures be found automatically and validated causally? Can an emergence threshold be forecast from small-scale… view at source ↗
read the original abstract

Deep learning has outgrown any single mathematical explanation. From Approximation to Emergence develops a unified, proof-oriented account of modern deep learning theory, tracing a path from the classical foundations of approximation, optimization, and generalization to the contemporary mechanisms of overparameterization, robustness, generative modeling, transformers, in-context learning, scaling laws, interpretability, alignment, and emergence. Rather than presenting isolated results, the book organizes a broad literature into a coherent research narrative: each theory is examined through the object it controls, the assumptions that make it valid, and the phenomena it leaves unexplained. Written for researchers, graduate students, and mathematically trained practitioners, this monograph offers a rigorous map of deep learning theory as it stands today: powerful, incomplete, and increasingly centered on the question of how learned mechanisms arise from scale, data, architecture, and training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript is a monograph synthesizing deep learning theory. It traces a path from classical foundations in approximation, optimization, and generalization to modern topics including overparameterization, robustness, generative modeling, transformers, in-context learning, scaling laws, interpretability, alignment, and emergence. The central contribution is an organizational narrative that examines each theory by the object it controls, its assumptions, and the phenomena it leaves unexplained, rather than presenting isolated results or new derivations.

Significance. If the synthesis is comprehensive and balanced, the monograph would provide a valuable reference map of the field for researchers and graduate students, highlighting the increasing focus on how mechanisms emerge from scale, data, architecture, and training. The work's strength is its explicit framing of assumptions and limitations across the literature; no new machine-checked proofs or parameter-free derivations are claimed.

minor comments (1)
  1. The abstract states the work is 'proof-oriented,' yet the overall description positions it as a survey of existing results; clarifying in the introduction how proofs from the cited literature are presented or re-derived would help readers assess the rigor.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the monograph's organizational narrative and for recommending minor revision. The report correctly identifies the work as a synthesis that examines theories by their controlled objects, assumptions, and unexplained phenomena, without claiming new proofs or derivations.

Circularity Check

0 steps flagged

No significant circularity: survey of existing results

full rationale

This monograph is explicitly a survey that organizes existing literature into a narrative without introducing new derivations, predictions, fitted parameters, or original theorems. The abstract and structure describe tracing paths from classical results to contemporary mechanisms by examining published theories, their assumptions, and limitations. No load-bearing steps exist that could reduce by construction to inputs, self-citations, or ansatzes, as the work contains no deductive chain or empirical claims of its own. The reader's assessment of score 0.0 is confirmed by the absence of any evaluable original content that might exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a review monograph based on the abstract and introduces no new free parameters, axioms, or invented entities beyond those already present in the cited literature.

pith-pipeline@v0.9.1-grok · 5660 in / 937 out tokens · 27980 ms · 2026-07-03T21:37:43.241254+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 5 internal anchors

  1. [1]

    2016 , publisher =

    Deep Learning , author =. 2016 , publisher =

  2. [2]

    2006 , publisher =

    Pattern Recognition and Machine Learning , author =. 2006 , publisher =

  3. [3]

    2022 , publisher =

    Probabilistic Machine Learning: An Introduction , author =. 2022 , publisher =

  4. [4]

    Mathematics of Control, Signals and Systems , volume =

    Approximation by superpositions of a sigmoidal function , author =. Mathematics of Control, Signals and Systems , volume =. 1989 , doi =

  5. [5]

    Neural Networks , volume =

    Approximation capabilities of multilayer feedforward networks , author =. Neural Networks , volume =. 1991 , doi =

  6. [6]

    IEEE Transactions on Information Theory , volume =

    Universal approximation bounds for superpositions of a sigmoidal function , author =. IEEE Transactions on Information Theory , volume =. 1993 , doi =

  7. [7]

    Proceedings of the 29th Conference on Learning Theory , series =

    The Power of Depth for Feedforward Neural Networks , author =. Proceedings of the 29th Conference on Learning Theory , series =. 2016 , publisher =

  8. [8]

    Proceedings of the 29th Conference on Learning Theory , series =

    Benefits of depth in neural networks , author =. Proceedings of the 29th Conference on Learning Theory , series =. 2016 , publisher =

  9. [9]

    Error bounds for approximations with deep

    Yarotsky, Dmitry , journal =. Error bounds for approximations with deep. 2017 , doi =

  10. [10]

    Advances in Neural Information Processing Systems , volume =

    Neural Tangent Kernel: Convergence and Generalization in Neural Networks , author =. Advances in Neural Information Processing Systems , volume =. 2018 , url =

  11. [11]

    Proceedings of the 36th International Conference on Machine Learning , series =

    Gradient Descent Finds Global Minima of Deep Neural Networks , author =. Proceedings of the 36th International Conference on Machine Learning , series =. 2019 , publisher =

  12. [12]

    Proceedings of the 36th International Conference on Machine Learning , series =

    A Convergence Theory for Deep Learning via Over-Parameterization , author =. Proceedings of the 36th International Conference on Machine Learning , series =. 2019 , publisher =

  13. [13]

    Advances in Neural Information Processing Systems , volume =

    Wide Neural Networks of Any Depth Evolve as Linear Models under Gradient Descent , author =. Advances in Neural Information Processing Systems , volume =. 2019 , url =

  14. [14]

    Advances in Neural Information Processing Systems , volume =

    On Lazy Training in Differentiable Programming , author =. Advances in Neural Information Processing Systems , volume =. 2019 , url =

  15. [15]

    Journal of Machine Learning Research , volume =

    Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , author =. Journal of Machine Learning Research , volume =. 2011 , url =

  16. [16]

    International Conference on Learning Representations , year =

    Adam: A Method for Stochastic Optimization , author =. International Conference on Learning Representations , year =

  17. [17]

    International Conference on Learning Representations , year =

    On the Convergence of Adam and Beyond , author =. International Conference on Learning Representations , year =

  18. [18]

    International Conference on Learning Representations , year =

    Decoupled Weight Decay Regularization , author =. International Conference on Learning Representations , year =

  19. [19]

    Advances in Neural Information Processing Systems , volume =

    The Marginal Value of Adaptive Gradient Methods in Machine Learning , author =. Advances in Neural Information Processing Systems , volume =. 2017 , url =

  20. [20]

    Proceedings of the 35th International Conference on Machine Learning , series =

    Characterizing Implicit Bias in Terms of Optimization Geometry , author =. Proceedings of the 35th International Conference on Machine Learning , series =. 2018 , publisher =

  21. [21]

    Journal of Machine Learning Research , volume =

    The Implicit Bias of Gradient Descent on Separable Data , author =. Journal of Machine Learning Research , volume =. 2018 , url =

  22. [22]

    1998 , publisher =

    Statistical Learning Theory , author =. 1998 , publisher =

  23. [23]

    2014 , publisher =

    Understanding Machine Learning: From Theory to Algorithms , author =. 2014 , publisher =

  24. [24]

    IEEE Transactions on Information Theory , volume =

    Sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network , author =. IEEE Transactions on Information Theory , volume =. 1998 , doi =

  25. [25]

    International Conference on Learning Representations , year =

    Understanding deep learning requires rethinking generalization , author =. International Conference on Learning Representations , year =

  26. [26]

    Journal of Machine Learning Research , volume =

    Stability and Generalization , author =. Journal of Machine Learning Research , volume =. 2002 , url =

  27. [27]

    Proceedings of the 33rd International Conference on Machine Learning , series =

    Train faster, generalize better: Stability of stochastic gradient descent , author =. Proceedings of the 33rd International Conference on Machine Learning , series =. 2016 , publisher =

  28. [28]

    Proceedings of the National Academy of Sciences , volume =

    Reconciling modern machine-learning practice and the classical bias--variance trade-off , author =. Proceedings of the National Academy of Sciences , volume =. 2019 , doi =

  29. [29]

    The Annals of Statistics , volume =

    Surprises in High-Dimensional Ridgeless Least Squares Interpolation , author =. The Annals of Statistics , volume =. 2022 , doi =

  30. [30]

    2014 , publisher =

    Analysis of Boolean Functions , author =. 2014 , publisher =

  31. [31]

    Journal of the ACM , volume =

    Efficient noise-tolerant learning from statistical queries , author =. Journal of the ACM , volume =. 1998 , doi =

  32. [32]

    International Conference on Learning Representations , year =

    Intriguing properties of neural networks , author =. International Conference on Learning Representations , year =

  33. [33]

    International Conference on Learning Representations , year =

    Explaining and Harnessing Adversarial Examples , author =. International Conference on Learning Representations , year =

  34. [34]

    International Conference on Learning Representations , year =

    Towards Deep Learning Models Resistant to Adversarial Attacks , author =. International Conference on Learning Representations , year =

  35. [35]

    Proceedings of the 36th International Conference on Machine Learning , series =

    Certified Adversarial Robustness via Randomized Smoothing , author =. Proceedings of the 36th International Conference on Machine Learning , series =. 2019 , publisher =

  36. [36]

    Proceedings of the 34th International Conference on Machine Learning , series =

    Understanding Black-box Predictions via Influence Functions , author =. Proceedings of the 34th International Conference on Machine Learning , series =. 2017 , publisher =

  37. [37]

    Advances in Neural Information Processing Systems , volume =

    Spectral Signatures in Backdoor Attacks , author =. Advances in Neural Information Processing Systems , volume =. 2018 , url =

  38. [38]

    Advances in Neural Information Processing Systems , volume =

    Generative Adversarial Nets , author =. Advances in Neural Information Processing Systems , volume =. 2014 , url =

  39. [39]

    Proceedings of the 34th International Conference on Machine Learning , series =

    Wasserstein Generative Adversarial Networks , author =. Proceedings of the 34th International Conference on Machine Learning , series =. 2017 , publisher =

  40. [40]

    Improved Training of Wasserstein

    Gulrajani, Ishaan and Ahmed, Faruk and Arjovsky, Martin and Dumoulin, Vincent and Courville, Aaron , booktitle =. Improved Training of Wasserstein. 2017 , url =

  41. [41]

    Proceedings of the 34th International Conference on Machine Learning , series =

    Compressed Sensing using Generative Models , author =. Proceedings of the 34th International Conference on Machine Learning , series =. 2017 , publisher =

  42. [42]

    Nature , volume =

    Human-level control through deep reinforcement learning , author =. Nature , volume =. 2015 , doi =

  43. [43]

    2018 , publisher =

    Reinforcement Learning: An Introduction , author =. 2018 , publisher =

  44. [44]

    Advances in Neural Information Processing Systems , volume =

    Attention Is All You Need , author =. Advances in Neural Information Processing Systems , volume =. 2017 , url =

  45. [45]

    Advances in Neural Information Processing Systems , volume =

    Language Models are Few-Shot Learners , author =. Advances in Neural Information Processing Systems , volume =. 2020 , url =

  46. [46]

    Advances in Neural Information Processing Systems , volume =

    What Can Transformers Learn In-Context? A Case Study of Simple Function Classes , author =. Advances in Neural Information Processing Systems , volume =. 2022 , url =

  47. [47]

    Proceedings of the 40th International Conference on Machine Learning , series =

    Transformers Learn In-Context by Gradient Descent , author =. Proceedings of the 40th International Conference on Machine Learning , series =. 2023 , publisher =

  48. [48]

    Advances in Neural Information Processing Systems , volume =

    Denoising Diffusion Probabilistic Models , author =. Advances in Neural Information Processing Systems , volume =. 2020 , url =

  49. [49]

    International Conference on Learning Representations , year =

    Score-Based Generative Modeling through Stochastic Differential Equations , author =. International Conference on Learning Representations , year =

  50. [50]

    International Conference on Learning Representations , year =

    Flow Matching for Generative Modeling , author =. International Conference on Learning Representations , year =

  51. [51]

    Scaling Laws for Neural Language Models

    Scaling Laws for Neural Language Models , author =. arXiv preprint arXiv:2001.08361 , year =. 2001.08361 , eprinttype =

  52. [52]

    Advances in Neural Information Processing Systems , volume =

    Training Compute-Optimal Large Language Models , author =. Advances in Neural Information Processing Systems , volume =. 2022 , url =

  53. [53]

    2020 , url =

    Zoom In: An Introduction to Circuits , author =. 2020 , url =

  54. [54]

    2022 , url =

    Toy Models of Superposition , author =. 2022 , url =

  55. [55]

    2023 , url =

    Towards Monosemanticity: Decomposing Language Models With Dictionary Learning , author =. 2023 , url =

  56. [56]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets , author =. arXiv preprint arXiv:2201.02177 , year =. 2201.02177 , eprinttype =

  57. [57]

    Progress measures for grokking via mechanistic interpretability

    Progress Measures for Grokking via Mechanistic Interpretability , author =. arXiv preprint arXiv:2301.05217 , year =. 2301.05217 , eprinttype =

  58. [58]

    Emergent Abilities of Large Language Models

    Emergent Abilities of Large Language Models , author =. arXiv preprint arXiv:2206.07682 , year =. 2206.07682 , eprinttype =

  59. [59]

    Advances in Neural Information Processing Systems , volume =

    Are Emergent Abilities of Large Language Models a Mirage? , author =. Advances in Neural Information Processing Systems , volume =. 2023 , url =

  60. [60]

    Advances in Neural Information Processing Systems , volume =

    Deep Reinforcement Learning from Human Preferences , author =. Advances in Neural Information Processing Systems , volume =. 2017 , url =

  61. [61]

    Advances in Neural Information Processing Systems , volume =

    Training Language Models to Follow Instructions with Human Feedback , author =. Advances in Neural Information Processing Systems , volume =. 2022 , url =

  62. [62]

    Advances in Neural Information Processing Systems , volume =

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author =. Advances in Neural Information Processing Systems , volume =. 2023 , url =

  63. [63]

    Machine Learning , volume =

    A theory of learning from different domains , author =. Machine Learning , volume =. 2010 , doi =

  64. [64]

    On the Opportunities and Risks of Foundation Models

    On the Opportunities and Risks of Foundation Models , author =. arXiv preprint arXiv:2108.07258 , year =. 2108.07258 , eprinttype =