A Genealogy of Optimizers
A guided tour through neural-network optimizers, with interactive widgets. Each one solves a specific problem in the previous one, and the trajectory tells you something about where deep learning is going.
Sec. 01 SGD & the stochastic gradient
A neural network is just a giant pile of numbers — the parameters. Training it means finding the values of those numbers that make the model's predictions good. We measure "good" with a single scalar called the loss: lower is better.
So training is a search problem. Imagine the loss as the height of a landscape, and every possible setting of the parameters as a location on that landscape. Training a model is finding the lowest valley while wandering around blindfolded.
Gradient descent: the simplest possible idea
You can't see the landscape, but at any point you can feel which way the ground tilts. That tilt is the gradient, and it points uphill. So you do the obvious thing: step in the opposite direction. That's the entire algorithm:
w ← w − step_size · gradient
The step_size (also called the learning rate) is the only knob you really tune. Too small and you crawl. Too big and you overshoot the valley and fly out the other side.
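To make the two failure modes concrete, here is a minimal sketch on a made-up one-dimensional bowl, loss(w) = (w − 3)². The function name gradient_descent, the bowl itself, and the three step sizes are illustrative assumptions, not anything taken from the demo.

```python
def gradient_descent(grad, w0, step_size, steps):
    """Repeatedly step against the gradient: w <- w - step_size * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - step_size * grad(w)
    return w


def grad(w):
    # Hypothetical bowl: loss(w) = (w - 3)^2, so d(loss)/dw = 2 * (w - 3).
    return 2.0 * (w - 3.0)


too_small = gradient_descent(grad, w0=0.0, step_size=0.001, steps=50)
about_right = gradient_descent(grad, w0=0.0, step_size=0.1, steps=50)
too_big = gradient_descent(grad, w0=0.0, step_size=1.1, steps=50)

print(f"step_size=0.001: w = {too_small:.3f}  (still far from the minimum at 3)")
print(f"step_size=0.1:   w = {about_right:.3f}  (has reached the minimum)")
print(f"step_size=1.1:   w = {too_big:.1f}  (each step overshoots further)")
```

On this bowl every step multiplies the distance to the minimum by (1 − 2 · step_size), which is why 0.001 barely shrinks it and 1.1 flips its sign and grows it.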
The "stochastic" part
Here's the catch. To compute the true gradient, you'd need to evaluate the loss on every training example — millions or billions of them — before you could take a single step. That is wildly impractical.
The trick: don't compute the true gradient. Compute a noisy estimate of it using a small random batch of examples. That estimate is wrong in the details but right on average — and it costs almost nothing.
So instead of one careful step using all the data, you take thousands of slightly drunk steps using slivers of data. You wobble more, but you cover ground much faster. That's stochastic gradient descent.
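"Wrong in the details but right on average" can be checked numerically. The sketch below assumes a hypothetical dataset where example i has its own tiny loss (w − targets[i])²; the names targets, true_grad and batch_grad are made up for illustration.

```python
import random

random.seed(0)

# Hypothetical dataset: example i prefers w = targets[i], with loss (w - targets[i])^2.
targets = [i / 1000 for i in range(1000)]


def example_grad(w, target):
    """Gradient of one example's loss with respect to w."""
    return 2.0 * (w - target)


def true_grad(w):
    """Average gradient over every example: one full pass over the data."""
    return sum(example_grad(w, t) for t in targets) / len(targets)


def batch_grad(w, batch_size):
    """Average gradient over a small random batch: cheap but noisy."""
    batch = random.sample(targets, batch_size)
    return sum(example_grad(w, t) for t in batch) / batch_size


w = 2.0
exact = true_grad(w)
estimates = [batch_grad(w, batch_size=10) for _ in range(2000)]
mean_estimate = sum(estimates) / len(estimates)

print(f"true gradient:          {exact:.4f}")
print(f"one batch estimate:     {estimates[0]:.4f}")
print(f"mean of 2000 estimates: {mean_estimate:.4f}")
```

Any single batch estimate misses the true gradient, but the misses are centered on it, so they wash out over many steps. Each estimate here touches 10 examples instead of 1000.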
Set the learning rate, set how noisy the gradient is (low = like a big batch, high = like a single example), and watch the trajectory unfold. The orange dot is the minimum we're trying to reach.
A concrete example: fitting a line
Forget abstract bowls for a moment. Here's the simplest learning problem that exists. You have 24 data points — each is a pair (x, y). Some unknown rule generated them. You believe the rule is roughly a straight line through the origin, so your model is ŷ = w · x. There is exactly one parameter to learn: the slope w.
For a single point (x, y), your prediction will usually be a little off. We measure how off with the squared error (w·x − y)². Zero if you nailed it, bigger if you didn't. The total loss is the average of those squared errors over all points.
For a single point the gradient with respect to w is 2x · (w·x − y). You don't need to memorize that; just read off the sign. For a point with positive x, if your prediction is too low the gradient is negative, so SGD raises w. If it's too high the gradient is positive, so SGD lowers w. (For negative x both signs flip, and the update still moves the prediction toward y.) Each data point is yelling "you got me wrong by this much, in this direction", and SGD nudges w accordingly. The whole algorithm in three lines:
1. pick a random data point (or a small batch)
2. compute the gradient using just that point
3. w ← w − lr · gradient
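The three steps above can be sketched in a few lines of Python. The 24 data points here are invented (a line of slope 2 with a small deterministic wiggle standing in for noise), and sgd_fit is a hypothetical helper name, so treat this as an illustration of the loop rather than the demo's actual code.

```python
import math
import random

random.seed(0)

# Hypothetical data: 24 points scattered around the line y = 2x.
xs = [i / 8 for i in range(1, 25)]
ys = [2.0 * x + 0.3 * math.sin(5 * i) for i, x in enumerate(xs, start=1)]


def sgd_fit(lr, batch_size, steps, w=0.0):
    """Fit y = w * x by SGD and return the value of w after every step."""
    history = []
    for _ in range(steps):
        # 1. pick a random data point (or a small batch)
        batch = random.sample(range(len(xs)), batch_size)
        # 2. compute the gradient 2x * (w*x - y), averaged over just that batch
        grad = sum(2.0 * xs[i] * (w * xs[i] - ys[i]) for i in batch) / batch_size
        # 3. w <- w - lr * gradient
        w = w - lr * grad
        history.append(w)
    return history


# Closed-form least-squares slope for a line through the origin, for reference.
w_best = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

noisy = sgd_fit(lr=0.02, batch_size=1, steps=500)
smooth = sgd_fit(lr=0.02, batch_size=24, steps=500)

print(f"least-squares slope:     {w_best:.4f}")
print(f"batch size 1, last w:    {noisy[-1]:.4f}")
print(f"all 24 points, last w:   {smooth[-1]:.4f}")
```

With all 24 points the slope settles onto the least-squares answer. With one point per step it lands nearby and keeps moving.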
In the demo below, the top plot shows the data and your current line (y = w·x). The highlighted coral points are the ones being used in this step — the dashed lines show how wrong the line is for each. The bottom plot is the loss curve: a parabola in w, with the orange dot marking the true minimum and the green dot showing where you currently are.
What this earned us, and what it didn't
Try batch size 1 with lr = 0.02 and hit auto-step. The line snaps toward the data, then starts wobbling once it's roughly fit. The green dot bounces around the orange minimum but never quite settles. That's stochastic noise — even when w is right, a single random point still pulls it.
Now switch to all 24 points. The path is dead smooth. Every step uses the true gradient, so there's no noise to cancel. This is vanilla gradient descent. It looks great, but in real training, doing this every step would mean a forward+backward pass through your entire dataset to take one update. With a trillion-token corpus that's a non-starter.
SGD's deal: cheap, scalable, good enough to actually train large models. The noise that looks like a bug here is part of why it works at scale — those random kicks help models escape bad local regions of the loss landscape.
But three problems are now obvious enough that you can name them just from playing with the demo. First, even at the optimum, the noise never goes away — w keeps wobbling. Second, every step is memoryless. If the last twenty gradients all said "go right", the twenty-first step ignores all of that history and only listens to the current point. Third, in real models different parameters have wildly different gradient scales. One global learning rate that works for the whole network is asking a lot.
The next section — momentum — fixes the second of these and partially the first. After that, RMSprop and Adam will tackle the third.
An aside: how the speedrun-ladder charts work
From here on, every section ends with a speedrun ladder — an interactive chart that races each optimizer introduced so far to the same target loss. They are all the same experiment: a small but genuinely deep MLP fitting a random teacher network (so the per-matrix optimizers later in the story have real matrix structure to exploit), trained on minibatches. Each optimizer runs at its own best learning rate from a small sweep, under one shared schedule — 10% linear warmup, a constant middle, then a cosine cooldown over the final 20%.
Press Play to watch them race, or drag the slider to scrub through training; the leaderboard fills in as each curve crosses the target. It's one small toy network, not a benchmark — the ordering tracks the textbook ladder, but the exact margins are illustrative (a browser-scale toy can't reproduce the real thing). The full comparison code is on GitHub: optimizer-ladder-sim.py.
Sec. 02 Momentum
Look back at the very first demo from section 1, with the bowl and the ball. Now imagine stretching that bowl out so it's a long narrow valley instead of a circular dish — steep walls running left-and-right, very gentle slope running along the length.
That's what real loss landscapes look like in deep learning. A few directions are extremely curved (a small change in the parameter changes the loss a lot) and most directions are almost flat. Watch SGD on a landscape like that and you'll see the same disaster every time. The gradient is dominated by the steep directions, so each step is a big sideways lurch across the valley. The slow component along the valley — the direction we actually need to make progress in — barely moves.
The deeper diagnosis: SGD's steps are memoryless. The seventeenth step has no idea that the previous sixteen gradients had a nice consistent component pointing down the valley.
The fix: give the optimizer a velocity
Stop treating the gradient as the direction to move. Treat it as a force pushing on a velocity vector. The optimizer carries the velocity from step to step. Useful directions (where the force keeps pushing the same way) cause the velocity to grow. Useless directions (where the force flips back and forth) cause the velocity contributions to cancel.
v ← β · v + g
w ← w − lr · v
That's it. Two lines. β is a number close to 1 (a typical default is 0.9), controlling how much of the old velocity persists. With β = 0 you get plain SGD back. With β = 0.9 the velocity is a kind of running tally of roughly the last ten gradients' worth of motion.
In the steep-wall directions, the gradient alternates sign each step, so when you accumulate it into v, the contributions cancel and the velocity stays small. In the consistent valley direction, every gradient pushes the same way, so the velocity grows — at steady state with β=0.9, it's about ten times a single gradient, which means the effective step is ten times bigger than SGD's would have been. And under noise, the noise terms in successive gradients are independent and average toward zero in v, while the real signal accumulates.
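Here is a small sketch of that argument on a made-up valley, loss = 0.5 · (0.5·x² + 50·y²), gentle along x and steep along y. The curvatures, learning rate, start point and the helper name run are all assumptions chosen for illustration, and the gradients are noise-free to keep the run deterministic.

```python
def run(beta, lr=0.035, steps=100, start=(-10.0, 1.0)):
    """Momentum on the valley loss 0.5 * (0.5*x^2 + 50*y^2).

    beta = 0 reduces to plain gradient descent.
    """
    curvature = (0.5, 50.0)  # gentle along x, steep along y (assumed values)
    w = list(start)
    v = [0.0, 0.0]
    path = [tuple(w)]
    for _ in range(steps):
        g = [c * wi for c, wi in zip(curvature, w)]    # gradient of the quadratic
        v = [beta * vi + gi for vi, gi in zip(v, g)]   # v <- beta * v + g
        w = [wi - lr * vi for wi, vi in zip(w, v)]     # w <- w - lr * v
        path.append(tuple(w))
    return path


sgd_path = run(beta=0.0)
mom_path = run(beta=0.9)
print(f"distance left along the valley, SGD:      {abs(sgd_path[-1][0]):.4f}")
print(f"distance left along the valley, momentum: {abs(mom_path[-1][0]):.4f}")

# Steady state under a constant gradient of 1: the velocity approaches 1 / (1 - beta).
velocity = 0.0
for _ in range(200):
    velocity = 0.9 * velocity + 1.0
print(f"steady-state velocity for beta=0.9: {velocity:.4f}")
```

After the same number of steps at the same learning rate, the momentum run has covered far more of the gentle x direction, and the constant-gradient loop recovers the factor of ten from the text.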
Below, both optimizers start at the same point, see the same noisy gradients, and use the same learning rate. The only difference is momentum. The contour ellipses are very flat horizontal ovals — that's the elongated valley.
Hit auto-step with the defaults and watch what happens. SGD (coral) does the classic disaster: it slams across the valley walls, bounces back, slams across again, and meanwhile barely creeps to the right. Momentum (teal) has a different shape entirely — it overshoots once or twice early on as the velocity builds up, then settles into a smooth glide along the valley floor.
Drag β to 0 and reset. Both paths overlap exactly. Momentum with β=0 is SGD. Push β to 0.95 and you'll see the cost of memory — too much inertia carries momentum past the minimum, and the teal path orbits around before settling. β=0.9 is the canonical default because it's a sweet spot between "useful inertia" and "loops around forever."
What momentum did and didn't fix
Momentum fixed the second of SGD's three complaints — steps now use information from past gradients. It partially fixed the first — under noise, the velocity buffer cancels independent noise terms. And it gave a real speedup along the slow valley directions that show up everywhere in real loss landscapes.
What it did not touch is the third complaint: there is still one global learning rate, applied uniformly to every parameter. In real models, some parameters have gradients on the order of 100 and others on the order of 0.001 at the same time. No single global lr is right for both. The fix everyone arrived at, more or less independently, in the early 2010s: keep a per-parameter estimate of recent gradient magnitudes, and divide each parameter's step by that estimate.
Sec. 03 Adam: AdaGrad to RMSProp to Adam
Momentum gave the optimizer a memory and a velocity. It made progress along slow valleys faster, oscillations smaller, and noise damped. But the third complaint from the SGD section is still untouched: there is exactly one learning rate, applied to every parameter in the network.
In the toy 2D demos this looks harmless. In a real network it's catastrophic. Different parameters live in wildly different gradient regimes. The bias of a softmax output and a weight matrix two layers back can have gradient magnitudes that differ by 1000×. Pick lr for the loud parameters and the quiet ones never move. Pick lr for the quiet ones and the loud ones blow up. The fix the field converged on is per-parameter learning rates, computed automatically from gradient statistics. The story arrives in three steps.
AdaGrad: the first try (and why it broke)
Keep, for each parameter, a running sum of how big its gradients have been:
G ← G + g²
w ← w − lr · g / sqrt(G + ε)
A parameter that's been seeing huge gradients gets divided by a huge sqrt(G), so its effective step shrinks. A parameter with tiny gradients keeps a near-full lr. This is exactly what we wanted — automatic per-parameter scaling.
The catch: G only ever grows. After a few thousand steps, every parameter's effective learning rate is approaching zero, and the optimizer just stops. Fine for sparse problems; fatal for deep learning where we'll train for millions of steps.[1]
RMSprop: just take an average
Replace the running sum with an exponential moving average:
v ← β₂ · v + (1 − β₂) · g²
w ← w − lr · g / sqrt(v + ε)
Now v is bounded — it tracks the recent squared-gradient magnitude rather than the lifetime sum. Per-parameter scaling without monotonic decay. If gradients are roughly stationary at some long-run value, the steady-state v satisfies "EMA equals itself" and plateaus at the average squared gradient. The effective learning rate is preserved forever.[2]
Compare the two accumulators on the same constant gradient stream:
Drag β₂ down to 0.9. RMSProp's plateau is unchanged (still 1) but it gets there in about ten steps — short memory, fast adaptation, noisier. Push β₂ up to 0.9995 and RMSProp warms up so slowly it barely reaches its plateau by the end of the run; long memory, smooth, but more AdaGrad-ish. At β₂=1 exactly, the EMA stops accumulating entirely.
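The same comparison can be run without the widget. This sketch feeds a constant gradient of 1 to both accumulators; the helper names and the choice of lr = 0.1 are illustrative assumptions.

```python
import math


def adagrad_accumulator(grads):
    """Running sum of squared gradients: G <- G + g^2."""
    G, history = 0.0, []
    for g in grads:
        G = G + g * g
        history.append(G)
    return history


def rmsprop_accumulator(grads, beta2):
    """Exponential moving average of squared gradients."""
    v, history = 0.0, []
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g
        history.append(v)
    return history


grads = [1.0] * 5000
lr, eps = 0.1, 1e-8
G = adagrad_accumulator(grads)
v_slow = rmsprop_accumulator(grads, beta2=0.999)
v_fast = rmsprop_accumulator(grads, beta2=0.9)

print("step  adagrad_lr  rmsprop_lr(0.999)  v(beta2=0.9)")
for step in (10, 100, 1000, 5000):
    i = step - 1
    adagrad_lr = lr / math.sqrt(G[i] + eps)
    rmsprop_lr = lr / math.sqrt(v_slow[i] + eps)
    print(f"{step:5d}  {adagrad_lr:.5f}     {rmsprop_lr:.5f}            {v_fast[i]:.4f}")
```

AdaGrad's effective learning rate keeps shrinking like 1/sqrt(step), while RMSProp's levels off at lr. The early RMSProp values are inflated because v starts at zero and has not caught up yet.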
Adam: RMSProp meets momentum
Adam's contribution is conceptually small but practically enormous.[3] It just stacks the two ideas: keep a momentum term (EMA of gradients) and a per-parameter scale term (EMA of squared gradients), and use both:
m ← β₁ · m + (1 − β₁) · g ← this is momentum
v ← β₂ · v + (1 − β₂) · g² ← this is RMSProp
m̂ = m / (1 − β₁ᵗ) ← bias correction
v̂ = v / (1 − β₂ᵗ)
w ← w − lr · m̂ / (sqrt(v̂) + ε)
Defaults that have proven nearly universal: β₁=0.9, β₂=0.999, ε=10⁻⁸, lr in the 10⁻³ to 10⁻⁴ range.
The bias correction handles a subtle bootstrapping problem: both m and v start at zero, so the first few EMAs are biased toward zero. Dividing by (1 − β^t) compensates — it's small for small t (boosting the early estimate) and approaches 1 as t grows. After a few thousand steps the correction is invisible.
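A scalar version of the update makes the bias correction easy to see. This is a minimal single-parameter sketch; the function name adam_step and the test gradient of 5.0 are hypothetical, and real implementations apply the same arithmetic elementwise to tensors.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t counts steps from 1."""
    m = beta1 * m + (1 - beta1) * g          # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * g * g      # scale: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# First step from zero-initialised buffers, with a gradient of 5.0.
w1, m1, v1 = adam_step(w=1.0, g=5.0, m=0.0, v=0.0, t=1)
first_step = 1.0 - w1
uncorrected = m1 / math.sqrt(v1)   # what m / sqrt(v) would be with no correction

print(f"first step with bias correction: {first_step:.6f}")
print(f"m / sqrt(v) without correction:  {uncorrected:.4f}")

for t in (1, 10, 1000, 10000):
    print(f"t={t:5d}: 1 - beta1^t = {1 - 0.9 ** t:.4f}, 1 - beta2^t = {1 - 0.999 ** t:.4f}")
```

With the correction, the very first step is lr in size. Without it, the raw ratio on step one is (1 − β₁) / sqrt(1 − β₂), a bit over 3, so the first update would be about three times too large.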
Here's the punchline that took the field a few years to fully internalize. Look at the term m / sqrt(v). If gradients are roughly stationary, then v ≈ g² and m ≈ g, so m/sqrt(v) ≈ g/|g| = sign(g). In other words, Adam's update is approximately a momentum-smoothed sign of the gradient, scaled by lr. Every parameter, regardless of its gradient magnitude, moves by approximately lr in some direction each step.
That's why Adam works so well on heterogeneous networks. Loud parameters and quiet parameters both step by ~lr, and you only need to tune one number for the whole network.
At lr=0.05, both work, but Adam glides toward the minimum in a clean, almost-straight line while momentum does its familiar oscillation. Crank lr to 0.08: momentum diverges off the canvas. Adam doesn't care — its update in the steep direction is still ~lr regardless of how big the gradient is, because the sqrt(v) divisor scales with gradient magnitude.
Adam works at default settings on most problems. That single fact — "default Adam works on a new problem" — is why it became the field's reflexive choice for nearly a decade.
Sec. 04 Adam as a signal-to-noise ratio
"Smoothed gradient over square root of squared gradients" is a mouthful. Here's the cleaner reading: Adam's update is computing a per-parameter signal-to-noise ratio — in steady state, on a scale of roughly −1 to +1.
Walk through a single parameter, observing a noisy gradient written as signal + noise at each step. Imagine the signal as the true mean (what the gradient should be) and the noise as a zero-mean random kick with standard deviation σ. After many steps:
- m averages the gradient, so m → signal: the noise cancels in the running mean.
- v averages the squared gradient, so v → signal² + σ²: the noise variance contributes here even though it cancelled in m.
- The update m / sqrt(v) → signal / sqrt(signal² + σ²): the signal-to-noise ratio, squashed into the range −1 to +1.
This works out across many scenarios:
| scenario | signal | σ | m → | sqrt(v) → | update | interpretation |
|---|---|---|---|---|---|---|
| A — clean strong | 0.3 | 0 | 0.3 | 0.3 | 1.00 | full step, clear direction |
| B — moderate noise | 0.3 | 1.0 | 0.3 | 1.04 | 0.29 | direction known, magnitude uncertain |
| C — weak signal | 0.1 | 1.0 | 0.1 | 1.005 | 0.10 | barely above noise, small step |
| D — pure noise | 0 | 1.0 | 0 | 1.0 | 0.00 | no signal, no step |
| E — oscillating | ±1 | 0 | 0 | 1.0 | 0.00 | at minimum in this direction |
| F — loud parameter | 100 | 0 | 100 | 100 | 1.00 | same as A — magnitude normalized |
Look down the "update" column. In this stationary regime it lands in [−1, +1] (only approximately, and only once the EMAs have settled — the caveat below) and it's measuring how confidently the optimizer should commit to this direction. Strong clean signal → 1. Strong noisy signal → ~0.3. Pure noise or oscillation → 0. And critically, magnitude alone (compare A and F) doesn't change the answer — only the signal-to-magnitude ratio matters.
The widget below feeds three different gradient streams to the same Adam update. The line plotted is m̂ / sqrt(v̂) — the update direction, not the gradient itself.
The teal line shoots to 1 and stays there. The coral oscillating signal damps fast to a tight band near zero — the alternating gradient cancels in m but not in sqrt(v), so the update is suppressed. The purple noisy signal settles around 0.29, which is exactly 0.3 / sqrt(0.3² + 1²) = 0.287.
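The noisy case can be simulated directly. The sketch below assumes Gaussian noise and averages the update over the tail of a long run, because any single step still jitters (m only smooths over roughly ten gradients). The name adam_ratio_stream is made up for illustration.

```python
import math
import random

random.seed(0)


def adam_ratio_stream(signal, sigma, steps, beta1=0.9, beta2=0.999):
    """Feed gradients signal + noise to Adam's buffers; return m_hat / sqrt(v_hat) per step."""
    m = v = 0.0
    ratios = []
    for t in range(1, steps + 1):
        g = signal + random.gauss(0.0, sigma)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        ratios.append(m_hat / math.sqrt(v_hat))
    return ratios


noisy = adam_ratio_stream(signal=0.3, sigma=1.0, steps=30000)
tail = noisy[10000:]
average_update = sum(tail) / len(tail)
predicted = 0.3 / math.sqrt(0.3 ** 2 + 1.0 ** 2)

clean = adam_ratio_stream(signal=0.3, sigma=0.0, steps=100)[-1]

print(f"predicted steady-state update: {predicted:.3f}")
print(f"simulated average update:      {average_update:.3f}")
print(f"noise-free update:             {clean:.3f}")
```

The simulated average should land close to the predicted value, and the noise-free stream sits at 1, matching scenarios B and A in the table.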
One caveat: in strict steady state the update is bounded by 1, but during transients (especially early in training, or right after a learning rate change) the EMA window mismatch — m with β₁=0.9 has a short window while v with β₂=0.999 has a much longer one — can produce updates above 1. This is part of why gradient clipping is standard practice for transformer training: it prevents the numerator from temporarily outpacing the denominator and producing oversized parameter updates.
Sec. 05 AdamW: the one-line fix
Adam was published in 2014.[3] AdamW, the version everyone actually uses, came out in 2017.[4] The change is one line of code. The reason it took three years is that nobody — including the people who originally wrote Adam — noticed the bug.
Weight decay's job: at every step, gently pull parameters toward zero. Small parameters → simpler model → better generalization. The "gentle pull" is a multiplier slightly less than 1:
w ← w − lr · g ← regular update
w ← (1 − lr · wd) · w ← weight decay, applied separately
There's a math identity that's true for SGD: shrinking the parameters by a factor of (1 − lr · wd) is equivalent to adding wd · w to the gradient. So for years it was standard practice to "implement weight decay" by just adding wd · w to the gradient before passing it to the optimizer. Cleaner code, same result, as long as the optimizer just multiplies the gradient by lr and subtracts, as plain SGD does. Anything that rescales the gradient per parameter, RMSProp included, breaks it.
The original Adam paper followed this convention. So did every Adam implementation for the next three years.
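The identity for plain SGD is short enough to check numerically. In this sketch the function names are hypothetical, and the shrink factor is written as (1 − lr · wd) and applied to the pre-step weight, which is the form under which the two updates agree exactly.

```python
def sgd_l2(w, g, lr, wd):
    """Fold the decay into the gradient, then take a plain SGD step."""
    return w - lr * (g + wd * w)


def sgd_decoupled(w, g, lr, wd):
    """Shrink the weight by (1 - lr * wd) and take the plain SGD step on g alone."""
    return (1 - lr * wd) * w - lr * g


cases = [(1.0, 0.5), (-3.0, 2.0), (0.01, -100.0)]   # (weight, gradient) pairs
for w, g in cases:
    a = sgd_l2(w, g, lr=0.1, wd=0.01)
    b = sgd_decoupled(w, g, lr=0.1, wd=0.01)
    print(f"w={w:6.2f} g={g:7.2f}  folded={a:.6f}  decoupled={b:.6f}")
```

Both expressions expand to w − lr · g − lr · wd · w, so they match up to floating-point rounding. The match depends on nothing standing between the gradient and the subtraction.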
Why it silently breaks for Adam
Adam doesn't just multiply the gradient by lr and subtract. It divides by sqrt(v) first. So when you fold wd · w into the gradient, the decay term goes through that divisor too. The decay each parameter actually receives is lr · wd · w / sqrt(v), not lr · wd · w. Parameters with large v (loud gradients) get their weight decay divided by a large number — they're decayed less. Parameters with small v get decayed more.
This is exactly backwards from what regularization should do. The parameters fitting the loudest signals — the ones most at risk of overfitting — get the least regularization. The quiet parameters that barely move get aggressively shrunk to zero.
The fix
Loshchilov and Hutter's observation: the identity that lets you fold weight decay into the gradient is only valid when the optimizer is w ← w − lr · g. The moment you preprocess the gradient — with sqrt(v), with sign, with anything — the equivalence breaks. So just don't fold it in. Apply weight decay directly to the weights:
m ← β₁·m + (1−β₁)·g
v ← β₂·v + (1−β₂)·g²
m̂ = m / (1 − β₁ᵗ), v̂ = v / (1 − β₂ᵗ)
w ← w − lr · m̂ / (sqrt(v̂) + ε) ← Adam update, no wd here
w ← w − lr · wd · w ← decoupled weight decay
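That's the entire change. Now every parameter gets the same proportional decay regardless of v, and wd no longer passes through Adam's adaptive divisor. (In this form the decay is still multiplied by lr, so the per-step shrink is lr · wd.)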
This visualization isolates the effect: two parameters A (loud, large v) and B (quiet, small v) both starting at 1.0 with zero gradient. The only thing pulling them down is decay. So everything you see is the decay term in action, no other dynamics.
The teal line is AdamW: both parameters decay at the same rate, because the decay is decoupled from v. They overlap perfectly. The two coral lines show Adam-with-L2: the dark coral (large v) is barely moving — its decay got divided by a big number, so it's effectively unregularized. The light coral (small v) drops fast — its decay got divided by a small number, amplifying it. Drag the v sliders apart and the gap widens dramatically.
That's the bug. The optimizer was supposed to apply the same regularization to every parameter, and Adam-with-L2 silently made the regularization strength inversely proportional to gradient magnitude.
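The decay-only comparison can be reproduced with a few lines of arithmetic. This sketch makes a strong simplification, flagged here and in the code: sqrt(v) is held fixed at an assumed value per parameter, the loss gradient is zero, and ε and the momentum buffer are ignored, so only the decay term acts. The name decay_only and the numbers are illustrative.

```python
def decay_only(w, sqrt_v, steps, lr=1e-3, wd=0.1, decoupled=True):
    """Apply only the weight-decay part of the update for a number of steps.

    Simplification: sqrt_v is a fixed stand-in for Adam's sqrt(v); it is not updated.
    """
    for _ in range(steps):
        if decoupled:
            w = w - lr * wd * w               # AdamW: decay bypasses sqrt(v)
        else:
            w = w - lr * (wd * w) / sqrt_v    # Adam + L2: decay is divided by sqrt(v)
    return w


loud, quiet = 10.0, 0.01   # assumed sqrt(v) for a loud and a quiet parameter

adamw_loud = decay_only(1.0, loud, steps=1000)
adamw_quiet = decay_only(1.0, quiet, steps=1000)
l2_loud = decay_only(1.0, loud, steps=1000, decoupled=False)
l2_quiet = decay_only(1.0, quiet, steps=1000, decoupled=False)

print(f"AdamW     loud: {adamw_loud:.4f}   quiet: {adamw_quiet:.4f}")
print(f"Adam + L2 loud: {l2_loud:.4f}   quiet: {l2_quiet:.6f}")
```

Under decoupled decay both parameters shrink by the same factor. Under the folded-in version the loud parameter barely moves while the quiet one collapses toward zero.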
AdamW is the optimizer that trained most transformers between 2017 and ~2023. It, or a close Adam variant with decoupled weight decay, is the silent default behind GPT-2, GPT-3, BERT, ViT, CLIP, and most of what came in their wake.
Sec. 06 Interlude — SignSGD
Up to this point the story has been "track more statistics, use them more cleverly." SignSGD is the rude question: what if we tracked less and it worked anyway?
We just spent two sections building Adam, which carries two EMAs per parameter and computes m / sqrt(v). Then we noticed that in steady state, that update is approximately sign(g). So Adam is paying for two buffers of optimizer state to approximately compute the sign. A natural reaction: skip the machinery. Compute the sign directly.
w ← w − lr · sign(g)
That's it.[5] No m, no v, no momentum, no per-parameter scale. Every parameter steps by exactly lr in some direction each step. The optimizer holds zero state — your only memory cost is the weights themselves.
Two big practical wins. Optimizer state: AdamW keeps two extra values per parameter — m and v — so 2× the parameter count in optimizer state. For a 70B-parameter model with the usual fp32 moments, that's ~560 GB for m and v alone, on top of the weights. SignSGD costs zero. Distributed communication: in distributed training, you can compress each gradient to one bit per parameter — a 32× reduction in bandwidth.
But SignSGD has two real issues.
No magnitude information. A parameter with gradient 0.0001 and one with gradient 1000 take the same step. If a parameter is already close to its optimum, SignSGD will overshoot by a fixed amount every step and oscillate around the minimum forever.
Noise is catastrophic. Adam's sqrt(v) term automatically scales steps down for noisy parameters. SignSGD doesn't: the sign of a nearly-zero gradient buried in noise is essentially random, and SignSGD will take a full lr-sized step in that random direction every time.
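Both issues show up in a tiny sketch. The first part runs SignSGD on a made-up bowl, loss = 0.5 · w², whose gradient is just w. The second part assumes a true gradient of 0.001 buried in Gaussian noise of standard deviation 1. The helper names are hypothetical.

```python
import random


def sign(x):
    """Return -1, 0 or +1."""
    return (x > 0) - (x < 0)


def signsgd(grad, w, lr, steps):
    """w <- w - lr * sign(grad(w)), recording every iterate."""
    path = [w]
    for _ in range(steps):
        w = w - lr * sign(grad(w))
        path.append(w)
    return path


# Issue 1: no magnitude information, so the iterate cannot settle.
path = signsgd(grad=lambda w: w, w=0.95, lr=0.1, steps=30)
print("last four iterates:", [round(w, 3) for w in path[-4:]])

# Issue 2: the sign of a tiny gradient buried in noise is close to a coin flip.
random.seed(0)
trials = 10000
correct = sum(sign(0.001 + random.gauss(0.0, 1.0)) > 0 for _ in range(trials))
print(f"fraction of steps in the right direction: {correct / trials:.3f}")
```

Once the iterate is within lr of the minimum it hops from one side to the other indefinitely. In the noisy case each of those full-size steps points the right way only about half the time.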
Same anisotropic valley as before. SignSGD with no momentum, AdamW for comparison.
Run with default noise (0.3). SignSGD marches diagonally toward the minimum: each step is exactly lr in both axes regardless of the wildly different gradient magnitudes (~3 in x, ~150 in y). This is the heterogeneity-handling that made Adam great. But once close to the minimum, it can't slow down — watch it oscillate around the target. Crank noise to 1.5 and SignSGD goes haywire while AdamW stays composed.
SignSGD is rarely used as-is in practice — too brittle. But it cleanly demonstrates two things that turn out to matter for what comes next. The directionally normalized step (everything moves by ~lr) is doing most of the work that makes Adam great. The magnitude correction in m / sqrt(v) is a refinement, not the core mechanism. And you can pay almost nothing in optimizer state and still get serviceable training, especially if you add back just a momentum buffer to handle noise.
Sec. 07 Lion: one buffer instead of two
Lion is the optimizer that emerged when Google Brain ran a massive program search over the space of optimizer update rules — they let an evolutionary algorithm hunt for new optimizers and Lion is what won.[6] The result is striking: substantially simpler than AdamW, half the optimizer state, and competitive with AdamW across many large-scale benchmarks — though its wins are task- and tuning-dependent, and several follow-ups found it needs careful sweeps to match AdamW on language models. The name stands for "EvoLved sIgn mOmeNtum."
We have all the parts. SignSGD without momentum was brittle. The fix is: smooth the gradient first, then take the sign. The actual rule has one extra wrinkle — Lion uses two different mixing rates, one for taking the step and one for updating the momentum buffer:
update = sign(β₁ · m + (1 − β₁) · g) ← what you step in this update
w ← w − lr · update − lr · wd · w ← step + decoupled weight decay
m ← β₂ · m + (1 − β₂) · g ← what you remember for next time
Defaults: β₁ = 0.9, β₂ = 0.99. Only one buffer of state. No v. No sqrt. No ε.
Lion takes the step using a blend that gives the fresh gradient 10% of the weight (β₁=0.9), but the buffer it stores for next time gives each new gradient only 1% (β₂=0.99, a window of ≈ 100 steps). The heavier weight on the current gradient gives responsiveness; the slower buffer gives stability. The Lion paper found this asymmetric design materially outperforms using the same β for both, the kind of detail you only find by searching the space at scale.
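The three-line rule translates almost directly into code. This is a scalar sketch with a hypothetical function name, lion_step, and an assumed learning rate. Real implementations do the same thing elementwise on tensors.

```python
def sign(x):
    """Return -1, 0 or +1."""
    return (x > 0) - (x < 0)


def lion_step(w, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update for a single scalar parameter."""
    update = sign(beta1 * m + (1 - beta1) * g)   # direction used for this step
    w = w - lr * update - lr * wd * w            # step + decoupled weight decay
    m = beta2 * m + (1 - beta2) * g              # buffer kept for next time
    return w, m


# Loud and quiet gradients move the parameter by the same amount.
w_loud, m_loud = lion_step(w=1.0, g=1000.0, m=0.0)
w_quiet, m_quiet = lion_step(w=1.0, g=0.001, m=0.0)
print(f"step with g=1000:  {1.0 - w_loud:.6f}")
print(f"step with g=0.001: {1.0 - w_quiet:.6f}")

# The fresh gradient gets 10% of the vote when choosing the direction,
# so it has to be fairly large to overrule a buffer pointing the other way.
w_a, _ = lion_step(w=0.0, g=-5.0, m=1.0)    # 0.9 * 1 + 0.1 * (-5)  > 0
w_b, _ = lion_step(w=0.0, g=-10.0, m=1.0)   # 0.9 * 1 + 0.1 * (-10) < 0
print(f"buffer +1, gradient -5:  step direction {sign(-w_a):+d}")
print(f"buffer +1, gradient -10: step direction {sign(-w_b):+d}")
```

The only state carried between steps is m. The sign is taken on the blend, not on the stored buffer, which is where the two mixing rates come from.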
What Lion gives up and what it gives back
Lion's update is in {−1, 0, +1} per parameter: strict sign, no fractional confidence weighting. Where Adam would have stepped a parameter by 0.1 · lr, Lion steps it by the full lr (or by 0 if the blended gradient is exactly zero). And the per-parameter magnitude normalization from sqrt(v) is replaced by the blunter normalization of the sign function, which gives every parameter the same step size. In practice, neither is fatal. Lion still works.
What's earned: memory halved (only m, no v) and 2-15% throughput improvement per step depending on hardware. The catch: Lion needs a smaller lr (3–10× smaller is typical) and stronger weight decay than AdamW. The hyperparameter sweet spot shifts.
What makes Lion intellectually interesting isn't that it's better than AdamW — sometimes it is, sometimes it isn't. What's interesting is that it's competitive while throwing away half the machinery. That's strong evidence for a claim worth examining carefully in the next section: the most important thing Adam was doing all along was producing per-parameter normalized steps.
Sec. 08 The hidden invariant: per-parameter normalization
Time to pause and name the through-line. Looking back at SGD → momentum → AdamW → SignSGD → Lion, there's a single dimension that explains most of what separates them. Call it per-parameter normalization: how does each parameter's gradient magnitude affect how far it moves?
| optimizer | effective step for parameter i | scales with the magnitude of gᵢ? |
|---|---|---|
| SGD | lr · gᵢ | yes, linearly |
| SGD + momentum | lr · vᵢ, where v is the running velocity built from g | yes, still linearly |
| AdamW | lr · mᵢ / sqrt(vᵢ) ≈ lr · sign(gᵢ) | no, flat |
| Lion | lr · sign(β₁ mᵢ + (1−β₁) gᵢ) | no, flat |
| SignSGD | lr · sign(gᵢ) | no, flat |
This is a useful chance to correct something easy to get wrong. Momentum does not solve the heterogeneous-gradient problem. It accumulates gradients along consistent directions, which speeds you up in slow valleys, but it preserves the per-parameter magnitude. A parameter with 100× larger gradient still gets 100× larger steps under momentum. The thing that did solve heterogeneity was the sqrt(v) divisor in RMSProp / Adam, or the sign() function in Lion / SignSGD. Both produce roughly equal step sizes per parameter regardless of magnitude. Both are doing "spatial normalization across parameters" at each step. Momentum is "temporal smoothing within each parameter."
Below: five parameters with gradient magnitudes spanning 10⁻³ to 10², and the displacement each optimizer takes for each. The y-axis is log scale because otherwise SGD's tallest bar would be 100,000× taller than its smallest.
The SGD bars span five orders of magnitude. Momentum's bars are uniformly ten times taller — same shape, just scaled. AdamW's bars are all the same height: a parameter with gradient 0.001 and one with gradient 100 take the same step. Lion's bars are identical to AdamW's.
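The bar heights follow from the steady-state formulas, so they can be tabulated directly. This sketch assumes five constant gradients (illustrative values spanning 10⁻³ to 10²) and uses the steady-state step for each rule rather than simulating the buffers.

```python
import math


def sign(x):
    """Return -1, 0 or +1."""
    return (x > 0) - (x < 0)


grads = [1e-3, 1e-2, 1e-1, 1e1, 1e2]   # hypothetical per-parameter gradients
lr, beta, eps = 0.01, 0.9, 1e-8

# Steady-state step sizes under a constant gradient.
steps = {
    "SGD":      [lr * g for g in grads],
    "momentum": [lr * g / (1 - beta) for g in grads],               # v -> g / (1 - beta)
    "Adam":     [lr * g / (math.sqrt(g * g) + eps) for g in grads],  # m -> g, v -> g^2
    "sign":     [lr * sign(g) for g in grads],
}

spread = {name: max(s) / min(s) for name, s in steps.items()}
for name, s in steps.items():
    print(f"{name:9s} largest/smallest step = {spread[name]:.3g}")
```

SGD and momentum keep the full 100,000× spread between parameters (momentum only multiplies every step by ten), while the Adam-style and sign-style steps are flat.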
This lens reframes Lion's success. Lion doesn't keep pace with AdamW because sign() is smarter than m/sqrt(v). It keeps pace because both reach the same destination (flat, per-parameter normalized steps), and Lion gets there with one buffer and one operation while AdamW needs two buffers, a square, a square root, a division, and a bias correction. The work AdamW was doing turned out to be approximating something simpler.
It also points at what comes next. Once you see "per-parameter normalization" as the underlying principle, the obvious question is whether per-parameter is even the right granularity. Real network parameters aren't a flat list — they're organized into matrices, which have richer structure that per-parameter operations throw away. The next section, Muon, will exploit exactly this.
Sec. 09 Muon: per-matrix orthogonalization
Here's the assumption that every optimizer until now smuggled in: a neural network's parameters are a flat list. Adam looks at each scalar entry of each weight matrix independently and decides how much to step it. The fact that those scalars are organized into a matrix — with rows that mean something, columns that mean something, and a structure that survives matrix multiplication — is information the optimizer throws away.
Muon's bet is that this structure is exactly the information you need.
The picture: singular values
Any matrix M has a singular value decomposition: M = U Σ Vᵀ. U and V are orthogonal matrices (rotations, possibly combined with a reflection). Σ is a diagonal matrix of singular values, which say how much the matrix stretches space along each principal direction. A matrix with one huge singular value and many tiny ones acts almost like a rank-1 projection: it pours all input variation onto one output direction. A matrix with uniform singular values acts like a scaled rotation that preserves the geometry of the input space.
The gradient of a weight matrix has its own singular value structure. And empirically, gradient matrices in training have very unequal singular values. A few directions dominate; most are nearly zero. This means a vanilla gradient step is mostly making progress in one or two directions and barely moving in others.
Muon's update
Compute the momentum-smoothed gradient as usual. Then, before stepping, flatten all its singular values to 1. Equivalently: replace Σ with the identity. What you get is a "whitened" version of the gradient — same directions as the original (same U and V) but every direction now contributes equally.
u ← β · u + g ← momentum buffer
û = orthogonalize(u) ← replace u's singular values with 1s
W ← W − lr · û
For hidden weight matrices only — embeddings, biases, and norm parameters use AdamW. That's why Muon comes packaged with a side-by-side optimizer in practice: it only knows what to do with matrices.
Replacing the singular values directly via SVD is too slow for every step of training. The Muon paper uses Newton-Schulz iteration: a degree-5 polynomial in the matrix that, applied iteratively (typically 5 times), pushes all singular values toward 1 without ever computing the SVD.[10] Each iteration is just matrix multiplications, which GPUs are exceptionally good at. One honest caveat: Newton-Schulz only approximates the ideal orthogonalization. The standard degree-5 coefficients are tuned for speed, not accuracy: they leave the singular values in a band near 1 rather than exactly at 1. The singular vectors fare better, since each iteration is an odd polynomial in the matrix and so keeps U and V intact in exact arithmetic; any drift there comes from floating-point rounding. In practice that approximation is more than good enough.
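A pure-Python sketch shows the iteration at work on a tiny matrix. The coefficients (3.4445, −4.7750, 2.0315) are the ones used in the widely circulated reference implementation of Muon; treat them, the helper names, and the 3×3 test matrix as assumptions of this example. The test matrix is diagonal on purpose, so its singular values can be read straight off the diagonal without an SVD.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with a degree-5 Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # assumed coefficients, see lead-in
    norm = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / (norm + eps) for x in row] for row in G]   # singular values now <= 1
    for _ in range(steps):
        A = matmul(X, transpose(X))                      # A = X X^T
        AA = matmul(A, A)
        B = [[b * p + c * q for p, q in zip(ra, rb)] for ra, rb in zip(A, AA)]
        BX = matmul(B, X)
        # X <- a*X + (b*A + c*A^2) X, i.e. each singular value s -> a*s + b*s^3 + c*s^5
        X = [[a * x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, BX)]
    return X


# Singular values 1, 0.1 and 0.01: a spectrum skewed by a factor of 100.
G = [[1.0, 0.0, 0.0],
     [0.0, 0.1, 0.0],
     [0.0, 0.0, 0.01]]

X = newton_schulz(G)
diag = [X[i][i] for i in range(3)]
print("singular values before:", [G[i][i] for i in range(3)])
print("singular values after: ", [round(d, 3) for d in diag])
```

After five iterations the three singular values finish within a few percent of one another, inside the band the iteration settles into, even though they started a factor of 100 apart. The loop uses nothing but matrix multiplications.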
Drag skew: the raw gradient's spectrum (gray) goes from uniform to extremely skewed. Drag iters: at 0, the green bars match gray (no orthogonalization). By 3-4 iterations the green spectrum is essentially flat at 1. With heavy skew you can see why standard Newton-Schulz uses 5 iterations as a practical default — it's the smallest count that handles realistic spectra.
Why this works
The vanilla momentum step is dominated by a few large singular directions. Muon's orthogonalized step has equal magnitude in every direction that the gradient's decomposition U Σ Vᵀ defines. Every direction the gradient knows about gets attention. Directions that would have been ignored under vanilla momentum (because they carried small singular values in the gradient) get the same step size as the dominant directions.
Practically, Muon set the current speed records for nanoGPT training[10] and — after work showing it scales[11] — was used at the trillion-parameter scale for Kimi K2.[12] Muon also has favorable learning-rate-transfer behavior: paired with μP-style width scaling,[13] you largely avoid re-tuning the learning rate as the model grows — one of the most labor-intensive parts of training large models. (μP is a separate technique, not something baked into Muon, but the two compose unusually well.)
Muon takes only the gradient matrix structure and gets a substantial speedup from it. But we slipped past something worth pausing on: why is making the step uniform across singular directions better than the across-the-board uniform step that Adam and SignSGD already give us? The next section makes the difference visible.
Sec. 10 The right kind of uniform: directions, not entries
Back in the per-parameter normalization section we found the through-line: the strongest optimizers take a uniform step — every parameter moves by roughly the same amount, regardless of its gradient's magnitude. Adam and SignSGD do this per entry of the weight matrix. Muon does it too, but in a different sense, and the gap between the two senses is the whole point.
"Uniform per entry" is uniform in the basis of matrix coordinates — this row, that column. But that basis is arbitrary. What a weight matrix actually does — the function the layer computes — is set by its singular directions: the orthogonal directions it stretches input space along, and by how much (the singular values). Rotate your coordinate frame and every entry changes while the function doesn't move at all.
So the real question isn't "did every entry move by the same amount?" It's "did the update push equally hard in every direction it acts on?" A step can nudge every entry identically and still be wildly lopsided in direction space — pouring almost all of its effect into one or two singular directions while the rest barely budge. When the gradient is concentrated, it gets worse: sign of a near-rank-one matrix is still near-rank-one, so the per-entry step funnels the entire update into a single direction.
Muon's orthogonalization is exactly the fix: it keeps the gradient's singular directions but flattens its singular values to 1. The update becomes isotropic — it advances every direction the gradient knows about by the same amount. That is the right kind of uniform.
The clearest way to see it is the update's singular-value spectrum — the size of the step it takes in each of the matrix's directions. A balanced update is flat. Below, the same gradient is turned into a step three ways; the bars are each scaled so the tallest is 1, so only the shape matters.
Drag gradient skew up. The raw gradient's spectrum (gray) tilts steeply — it wants to move almost entirely in its top direction. The per-entry step (purple) refuses to flatten with it: at low skew it is merely uneven, and as the gradient concentrates it collapses, dumping the whole step into one direction. Muon (coral) is flat at every setting — balance pinned to 1.00.
That is the deeper reason per-matrix beats per-parameter. Per-entry normalization asks "is each number stepping equally?" — a question whose answer depends on how you happened to lay the matrix out. Muon asks "is each direction stepping equally?" — a question about the layer's real geometry, and the only one that survives a change of basis.
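The contrast between the two kinds of uniform can be checked numerically. The sketch below builds a nearly rank-one gradient, forms a sign step and an exactly orthogonalized step (via SVD, standing in for Newton-Schulz), and measures how much of each step's spectral energy lands in its top direction. The helper name `top_energy_fraction` and the matrix sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 6

# Nearly rank-one gradient: outer product of two vectors plus tiny noise.
# Entries of u and v are kept away from zero so the noise cannot flip signs.
u = rng.choice([-1.0, 1.0], size=m) * rng.uniform(0.5, 1.5, size=m)
v = rng.choice([-1.0, 1.0], size=n) * rng.uniform(0.5, 1.5, size=n)
G = np.outer(u, v) + 1e-3 * rng.standard_normal((m, n))

def top_energy_fraction(step):
    """Share of squared singular values carried by the largest one."""
    s = np.linalg.svd(step, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

# Per-entry step: every entry has magnitude 1, as in SignSGD.
sign_step = np.sign(G)

# Per-direction step: keep singular vectors, set singular values to 1.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
ortho_step = U @ Vt

print(top_energy_fraction(sign_step))   # close to 1: one direction gets everything
print(top_energy_fraction(ortho_step))  # 1/6: energy split evenly over 6 directions
```

Every entry of `sign_step` has the same magnitude, yet the step is effectively rank one; the orthogonalized step spreads the same gradient evenly across all six directions.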
And yet, balanced as it is across directions, even Muon is still working from a single piece of information — the gradient. It never sees how sharply the loss curves. The next section is about that blind spot.
Sec. 11 What first-order optimizers can't see
Every optimizer so far — SGD, momentum, Adam, Lion, even Muon — uses one piece of information at each parameter: the gradient. That tells you the slope. It doesn't tell you the curvature.
Here's the difference. Imagine two parameters, both with the same gradient right now. The first is in a gently curving region — small changes in the parameter barely change the slope. The second is in a sharply curving region — small changes in the parameter rapidly change the slope. The optimal step for these two cases is very different. The gently curving parameter can take a big step safely. The sharply curving one is about to roll past its minimum if you step too far.
Both green and coral curves have the same slope at the green dot (the starting point). The slider is the step size — i.e., how far you move in the direction the gradient suggests. On the gentle curve, a big step lands you closer to the minimum. On the sharp curve, the same big step overshoots and lands you higher up the other side.
This is the third recurring complaint about first-order methods. The optimizer can't see how much each parameter "wants" you to step — only the direction. Adam's sqrt(v) normalizes by recent gradient magnitudes, which is a noisy proxy for curvature. SignSGD throws away even that. None of them are using the actual curvature of the loss landscape.
What "curvature" means precisely
For a single parameter, curvature is just the second derivative: the rate of change of the slope. Call it H (for "Hessian," which is the multi-parameter generalization). Then near a current point w, a small step Δ changes the loss approximately as:
L(w + Δ) ≈ L(w) + g · Δ + ½ · H · Δ²
This is just the second-order Taylor expansion, with g the gradient at w. The optimal Δ, the one that minimizes the quadratic, is the value that makes its derivative zero: g + H · Δ = 0, so Δ = −g / H (assuming H > 0, so the parabola opens upward). That's Newton's method. The −g / H step takes you directly to the bottom of the local parabola in one move.
The geometric picture is satisfying. The gradient g is the current slope. The Hessian H is how fast that slope is changing. The ratio g / H is "how far do I need to walk before the slope reaches zero" — i.e., the distance to the bottom of the local parabolic bowl.
The top plot is the loss L(w) = ½ H w². The bottom plot is its slope, g = H w. The Newton step takes you from where you are to where the slope crosses zero — and on a quadratic that's exactly the minimum, in one step. Try different curvature values: high curvature means the slope drops fast and the Newton step is short. Low curvature means the slope drops slowly and the step is long. The step is automatically calibrated to the curvature.
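A few lines of Python make the calibration explicit. This is a sketch on the same quadratic L(w) = ½ H w²; the function names and the particular values of w, H, and the learning rate are chosen only for illustration.

```python
def gradient_step(w, H, lr):
    """One plain gradient-descent step on L(w) = 0.5 * H * w**2."""
    g = H * w  # slope at w
    return w - lr * g

def newton_step(w, H):
    """One Newton step: delta = -g / H, valid only for positive curvature."""
    if H <= 0:
        raise ValueError("Newton step needs H > 0 to point at a minimum")
    g = H * w
    return w - g / H

w0, lr = 2.0, 0.5
for H in (0.5, 8.0):  # gentle vs sharp curvature
    print(H, gradient_step(w0, H, lr), newton_step(w0, H))
```

With a fixed learning rate the gentle curve creeps from 2.0 to 1.5 while the sharp curve overshoots to −6.0. The Newton step lands on the minimum at 0 in both cases because it divides out the curvature.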
The Hessian in 2D
For multiple parameters, the Hessian generalizes to a matrix of all second partial derivatives:
H[i][j] = ∂²L / (∂wᵢ ∂wⱼ)
For two parameters that's a 2×2 matrix. The diagonal entries are the curvatures along each axis. The off-diagonal entries — these are the interesting ones — say whether the two parameters' curvatures are "tilted" or "rotated" relative to the axes. They're zero only when the level curves are perfectly aligned with the axes.
At H_xx = H_yy = 2, H_xy = 0 you get circular level sets: same curvature in every direction. Crank H_xx to 5 keeping H_yy = 2: the level sets squeeze into an ellipse that is narrow along x (the high-curvature axis) and long along y (the low-curvature axis). That's the anisotropic valley from section 2. Push H_xy to 1: the ellipse rotates. The off-diagonals are the source of all the geometric weirdness real neural networks have. Real Hessians have non-zero off-diagonals because parameters interact through the network: the curvature of one parameter depends on the others' values.
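The same three settings can be read off numerically from the eigendecomposition of the 2×2 Hessian. In this sketch (the function name is hypothetical, and it assumes a positive-definite H), the eigenvalues are the curvatures along the principal axes, the semi-axes of the level set ½ wᵀHw = ½ scale as 1/√eigenvalue, and the tilt is the angle of the high-curvature axis from the x axis.

```python
import numpy as np

def level_set_geometry(Hxx, Hyy, Hxy):
    """Principal curvatures, level-set semi-axes, and tilt of a 2x2 Hessian."""
    H = np.array([[Hxx, Hxy], [Hxy, Hyy]], dtype=float)
    curvatures, _ = np.linalg.eigh(H)        # ascending eigenvalues
    semi_axes = 1.0 / np.sqrt(curvatures)    # assumes both eigenvalues are > 0
    # Angle (degrees) of the larger-curvature axis, measured from the x axis.
    tilt_deg = np.degrees(0.5 * np.arctan2(2.0 * Hxy, Hxx - Hyy))
    return curvatures, semi_axes, tilt_deg

for setting in [(2, 2, 0), (5, 2, 0), (5, 2, 1)]:
    print(setting, level_set_geometry(*setting))
```

The second setting gives unequal semi-axes with zero tilt, and the third keeps the ellipse but rotates it by roughly 17 degrees, which is the effect of the off-diagonal term.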
How you'd actually compute one
The gradient g is what backpropagation gives you for free — it's the same machinery that trains the network. The Hessian needs second derivatives. The classic trick: differentiate the gradient. If you ran backprop to get g, then run backprop again on each entry of g, you get the Hessian row by row. This is called "double backprop."
For a tiny example: L = ½ (w₁² + 4 w₂² + 2 w₁ w₂). First backward pass gives g = [w₁ + w₂, 4 w₂ + w₁]. Now backprop g[0] = w₁ + w₂ with respect to w: that gives the first row of H. Repeat for g[1]: second row.
Two more backward passes give you the full 2×2 Hessian for a 2-parameter model. For an N-parameter model, you'd need N backward passes, one per row. Even N = 10⁶ is impractical: that is a million backward passes to fill in 10¹² entries. For modern LLMs with N ≈ 10⁹–10¹², materializing the full Hessian isn't merely expensive, it's effectively impossible: at N = 10⁹ it has 10¹⁸ entries, about 4 exabytes in 32-bit floats. So full Newton's method is dead on arrival.
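The tiny example can be reproduced without an autodiff library. In the sketch below, a central finite difference of the hand-written gradient stands in for the second backward pass (a substitution, not double backprop itself), and the cost structure is the same: one gradient differentiation per parameter. The function names are illustrative.

```python
import numpy as np

def grad(w):
    """Gradient of L = 0.5 * (w1^2 + 4 w2^2 + 2 w1 w2), written out by hand."""
    w1, w2 = w
    return np.array([w1 + w2, 4.0 * w2 + w1])

def hessian_from_gradient(grad_fn, w, eps=1e-5):
    """Differentiate the gradient once per parameter: N passes for N parameters."""
    n = w.size
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = 1.0
        # Column j is d(grad)/d(w_j); by symmetry of H it equals row j.
        H[:, j] = (grad_fn(w + eps * e) - grad_fn(w - eps * e)) / (2.0 * eps)
    return H

H = hessian_from_gradient(grad, np.array([0.3, -1.2]))
print(H)
```

The loop body runs once per parameter, which is exactly the N-pass cost that rules this out at scale.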
But there's a useful escape hatch. We rarely want the whole Hessian; what we usually want is H · v for some vector v. And that's surprisingly cheap: differentiate g · v (a scalar!) once more, and you get H · v directly. One extra backward pass instead of N. These "Hessian-vector products" are the basis of many practical second-order methods.
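A Hessian-vector product can be sketched the same way. Here a directional finite difference of the gradient stands in for differentiating g · v with autodiff (an assumption made to keep the example dependency-free); either way the cost is a constant number of gradient evaluations, independent of N.

```python
import numpy as np

def grad(w):
    """Gradient of L = 0.5 * (w1^2 + 4 w2^2 + 2 w1 w2), written out by hand."""
    w1, w2 = w
    return np.array([w1 + w2, 4.0 * w2 + w1])

def hessian_vector_product(grad_fn, w, v, eps=1e-5):
    """Approximate H @ v from two gradient evaluations, never forming H."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

w = np.array([0.3, -1.2])
v = np.array([1.0, -2.0])
Hv = hessian_vector_product(grad, w, v)
print(Hv)  # H = [[1, 1], [1, 4]], so H @ v = [-1, -7]
```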
So the second-order story is: don't try to materialize H. Don't try to invert it. Get cheap approximations, ideally factored into pieces that match the matrix structure of your network. Shampoo, the subject of the next section, is the field's main matrix-aware answer to that brief.
Sec. 12 Shampoo: curvature-aware preconditioning
Shampoo predates Muon by several years (the original paper is from 2018)[7] and was actually the first matrix-aware optimizer to gain real attention. The plot twist is that Muon got most of the practical traction, partly because Shampoo is more expensive to run and partly because its theoretical motivation — full Newton-style preconditioning — is a stronger claim than people initially needed.
Shampoo's pitch: approximate the full Hessian using just two matrices per layer, one capturing how rows interact and one capturing how columns interact. For a weight matrix W ∈ ℝ^(m×n) with gradient G, maintain:
L ← β·L + (1−β)·G Gᵀ ← m × m left preconditioner (row curvature)
R ← β·R + (1−β)·Gᵀ G ← n × n right preconditioner (column curvature)
W ← W − lr · L^(−1/4) · G · R^(−1/4)
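The −1/4 exponents come from a Kronecker-factorization argument. Shampoo's curvature proxy is the full-matrix AdaGrad statistic (the accumulated outer product of the flattened gradient), which is approximated by L^(1/2) ⊗ R^(1/2). AdaGrad preconditions with the inverse square root of that statistic, and (L^(1/2) ⊗ R^(1/2))^(−1/2) applied to G works out to multiplying by L^(−1/4) on the left and R^(−1/4) on the right.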
One bookkeeping note that echoes the AdaGrad → RMSProp story from section 3: the original 2018 Shampoo accumulated L and R as un-decayed running sums (L ← L + GGᵀ), exactly the AdaGrad form, with the same eventual-stall problem. Every practical implementation since — Distributed Shampoo, SOAP — uses the exponential moving average shown above, so the preconditioner tracks recent curvature instead of the lifetime sum.[8][9]
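The three update lines translate almost directly into NumPy. This is a minimal sketch, not a production implementation: the inverse roots are recomputed by eigendecomposition on every call, and the helper names and the damping constant `eps` are assumptions added for numerical safety.

```python
import numpy as np

def inverse_root(M, p, eps=1e-8):
    """M^(-1/p) for a symmetric positive semi-definite M, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    vals = np.maximum(vals, 0.0) + eps  # damping (assumed) guards tiny eigenvalues
    return (vecs * vals ** (-1.0 / p)) @ vecs.T

def shampoo_step(W, G, L, R, lr=0.1, beta=0.9):
    """One EMA-Shampoo update for a single weight matrix."""
    L = beta * L + (1.0 - beta) * (G @ G.T)  # m x m row statistics
    R = beta * R + (1.0 - beta) * (G.T @ G)  # n x n column statistics
    W = W - lr * inverse_root(L, 4) @ G @ inverse_root(R, 4)
    return W, L, R

W = np.zeros((3, 3))
G = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]])

# With beta = 0 the statistics come from this gradient alone.
W_next, L, R = shampoo_step(W, G, np.zeros((3, 3)), np.zeros((3, 3)), lr=1.0, beta=0.0)
step_spectrum = np.linalg.svd(W - W_next, compute_uv=False)
print(step_spectrum)
```

In the `beta = 0` case the preconditioned step has all singular values at 1 for a full-rank gradient, which is the same orthogonalized step Muon aims for; the accumulated statistics are what make the two methods differ.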
Three optimizers, same gradient, same matrix, three different updates:
Same starting matrix. AdamW (purple) operates entry-wise: scale each cell by its own statistic, but the matrix structure is invisible to it. Muon (teal) flattens the singular value spectrum — same row/column structure, but reshaped to equal magnitude in every direction. Shampoo (gold) does something subtly different: it uses the empirical row- and column-curvature statistics to rebalance the matrix non-uniformly, dampening rows/columns that have been receiving large updates and boosting ones that haven't.
A caveat on these last two figures: unlike the 2D demos in earlier sections, Figs. 9.1 and 12.1 visualize the transform each optimizer applies to a single gradient matrix — they are not optimization runs. There's no loss being descended and no trajectory; the point is purely to see how each method reshapes a matrix, which is where the matrix-aware methods differ from the per-entry ones.
Practical considerations
Shampoo's chief practical issues are cost (computing the inverse matrix roots is expensive, so they are typically refreshed only every few hundred steps rather than every step), memory (the two preconditioners take additional space proportional to m² + n²), and tuning (more hyperparameters than Muon). Variants like Distributed Shampoo and SOAP (Shampoo + Adam fused) have made it more practical. Recent benchmarks have it roughly tied with Muon at small scale and behind Muon at large scale once optimizer-aware learning rate transfer (μP) is factored in.
The interesting question this raises is whether μP plus Muon plus Shampoo is redundant. If μP already gives you scale-invariant learning rates, and Muon already orthogonalizes per-matrix, what additional curvature signal is Shampoo capturing that the others miss? The answer seems to be: not as much as you'd hope at large scale, but enough at moderate scale to keep Shampoo in the conversation. It's the optimizer that proved matrix-aware preconditioning was a real direction — even if Muon turned out to be the simpler practical winner.
Muon vs Shampoo, head to head
Strip away the machinery and the two are answering different questions about the same gradient matrix. Muon asks "which directions does this gradient span?" and steps equally along all of them: it whitens the update's singular values to 1 and stops there. Shampoo asks "how curved is the loss along the rows and along the columns?" and rescales the update by that estimated curvature (L^(−1/4) on the left, R^(−1/4) on the right). Muon is purely first-order and cheap every step; Shampoo is second-order-flavored and pays for it with eigendecompositions.
At small and moderate scale that extra curvature signal earns its keep — the two are about tied, and Shampoo's preconditioner sometimes edges ahead (it does, slightly, in the ladder below). The gap opens at large scale, for reasons that have little to do with the update rule itself: Muon transfers its learning rate cleanly across model sizes under μP, while Shampoo's matrix roots are expensive enough that you recompute the preconditioner only every few hundred steps — so it runs on stale curvature — and it carries more knobs to tune. SOAP — Shampoo's preconditioner with Adam running inside its eigenbasis — folds the two ideas together and often beats both.[9]
One fairness point, because it bites in the speedrun ladder below: Shampoo is only competitive when it preconditions a momentum-smoothed gradient, the way it is actually run. Feed its curvature factors a raw, noisy minibatch gradient and the L^(−1/4), R^(−1/4) terms amplify the noise (a near-zero curvature estimate becomes a large multiplier), and it looks far worse than it is. The chart uses the momentum variant; Muon's noise-robustness comes from the same place, its own momentum buffer.
Sec. 13 MuonH: pinning the weight norm
Every optimizer so far has been about the update direction. SGD: raw gradient. Momentum: smoothed. Adam: per-element normalized. Muon: per-matrix orthogonalized. Shampoo: curvature-aware. But there's a parallel question we've only touched lightly in the AdamW section: how should we control the magnitude of the weights themselves?
Weight decay was the default answer. MuonH — from recent work on hypersphere optimization and transferable learning-rate scaling — is a more radical one.[15] It is much newer and less battle-tested than the optimizers above, so treat this section as a report on an active research direction rather than settled practice.
Scale invariance reframes the question
Modern transformers wrap every weight matrix in RMSNorm layers — the "RMSNorm sandwich." RMSNorm is scale-invariant: scaling its input by any positive α produces identical output. As a consequence, the weight matrix's magnitude doesn't actually affect what function the layer computes. Only its direction matters.
But weights still grow during training. With the roughly-orthogonal, fixed-magnitude updates Muon produces, successive steps are close to uncorrelated, so the Frobenius norm grows diffusively — roughly as √t (the exact rate is regime-dependent, but the point is that it grows without bound). That inflates the parameter scale relative to step size and shrinks the effective learning signal. AdamW's decoupled weight decay handles this by gently pulling weights toward zero each step, creating an equilibrium norm that weights "hover" near.
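The diffusive growth is easy to see in a toy simulation. The sketch below uses random unit-norm directions as a stand-in for uncorrelated fixed-magnitude updates (an assumption: these are not real Muon updates) and starts from zero weights so that the √t law is clean.

```python
import numpy as np

def simulate_norm_growth(dim=1000, steps=400, step_norm=1.0, seed=0):
    """Accumulate fixed-norm steps in random directions; record the weight norm."""
    rng = np.random.default_rng(seed)
    W = np.zeros(dim)  # flattened weights, starting at zero for simplicity
    norms = []
    for _ in range(steps):
        u = rng.standard_normal(dim)
        W = W + step_norm * u / np.linalg.norm(u)  # every step has the same size
        norms.append(np.linalg.norm(W))
    return np.array(norms)

norms = simulate_norm_growth()
ratio_to_sqrt_t = norms[-1] / np.sqrt(len(norms))  # near 1 if growth is ~ sqrt(t)
relative_step = 1.0 / norms                        # step size relative to weight norm
print(ratio_to_sqrt_t, relative_step[9], relative_step[-1])
```

The relative step shrinks as the norm grows, even though the absolute step never changes. That is the sense in which an unbounded norm erodes the effective learning signal.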
The Hyperball question is: if magnitude doesn't matter functionally, why not just fix the magnitude exactly?
The Hyperball update
Pick a target radius R, typically R = ||W₀||_F, the matrix's initial Frobenius norm. After every optimizer step, project the weight matrix back to that radius:
U_t = −η · R · Normalize(proposed_update) ← normalize update to scale ηR
W_temp = W_t + U_t ← take the step
W_{t+1} = R · Normalize(W_temp) ← retract back to radius R
The weight matrix lives on a hypersphere of radius R for the entire training run. (Despite the name "hyperball," you're constrained to the boundary: a hypersphere.) The update magnitude ||U_t||_F = η · R is fixed, so η takes on a clean geometric meaning: it's the relative step size in units of weight norm. MuonH means "Muon with Hyperball," parallel to AdamW meaning "Adam with decoupled Weight decay."[15]
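The three-line update can be written as a short NumPy function. This is a sketch of the rule as stated above, with `Normalize` taken to mean division by the Frobenius norm; the function names, the small epsilon, and the example matrices are assumptions for illustration.

```python
import numpy as np

def normalize(M, eps=1e-12):
    """Scale M to unit Frobenius norm (eps avoids division by zero)."""
    return M / (np.linalg.norm(M) + eps)

def hyperball_step(W, proposed_update, lr, radius):
    """Normalize the update to size lr * radius, step, then retract to the sphere."""
    U = -lr * radius * normalize(proposed_update)
    W_temp = W + U
    return radius * normalize(W_temp)

W = np.array([[3.0, 0.0], [0.0, 4.0]])          # Frobenius norm is 5
radius = np.linalg.norm(W)
proposed = np.array([[1.0, -2.0], [0.5, 0.0]])  # stand-in for an optimizer's update

W_next = hyperball_step(W, proposed, lr=0.1, radius=radius)
print(np.linalg.norm(W_next))  # stays at the target radius
```

For MuonH, `proposed_update` would be the orthogonalized momentum; any other optimizer's update can be dropped into the same slot.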
The geometric picture
In 2D this is a circle. W_t is a point on the circle. The proposed update U_t is a vector starting at W_t. The sum W_t + U_t generally lands off the circle. The retraction projects it back radially — that's W_{t+1}. The actual step ΔW = W_{t+1} − W_t is a chord of the circle.
At default position (γ=0, η=0.3), the proposed update is tangent to the circle. The retraction does almost nothing — actual ||ΔW||/R ≈ proposed. This is the best case.
Drag γ toward +1: the update points radially outward. The retraction now has to slice off the radial component, and the actual step shrinks dramatically. Drag γ toward −1: same in reverse. The geometric punchline: retraction implicitly cancels any radial component of the update. Only tangential motion survives.
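The cancellation of the radial component can be checked in two dimensions. In this sketch γ is given a concrete meaning that the text does not pin down: the update direction is γ · radial + √(1 − γ²) · tangent, so γ = 0 is purely tangential and γ = ±1 purely radial. That parametrization is an assumption.

```python
import numpy as np

def actual_relative_step(gamma, lr, radius=1.0):
    """Length of the chord W_next - W, in units of the radius."""
    W = np.array([radius, 0.0])
    radial = np.array([1.0, 0.0])   # points away from the origin at W
    tangent = np.array([0.0, 1.0])  # perpendicular to the radius at W
    direction = gamma * radial + np.sqrt(1.0 - gamma**2) * tangent  # unit length
    W_temp = W + lr * radius * direction
    W_next = radius * W_temp / np.linalg.norm(W_temp)  # radial retraction
    return np.linalg.norm(W_next - W) / radius

for gamma in (0.0, 0.5, 1.0, -1.0):
    print(gamma, actual_relative_step(gamma, lr=0.3))
```

At γ = 0 the chord is about 0.29 against a proposed 0.3; at γ = ±1 the retraction undoes the step entirely.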
The spectral squeeze
Muon was already controlling the spectrum of the update by flattening its singular values — orthogonalization can be read as steepest descent under a spectral-norm constraint.[14] With Hyperball, you also pin the weight matrix's Frobenius norm. So MuonH gives you complete control over both update structure and weight magnitude — the cleanest version of the matrix-aware story.
But empirically, MuonH doesn't uniformly beat MuonW (Muon with ordinary weight decay). At lower learning rates MuonH wins; at higher rates MuonW pulls ahead. The mechanism: MuonW lets the weight norm grow over training, which provides an implicit annealing schedule. As ||W||_F grows, the relative step ||ΔW||/||W|| shrinks even at fixed lr. MuonH's fixed norm eliminates this annealing.
There's a deeper failure mode at scale, too. Hyperball's retraction multiplies the matrix by a single scalar c = R / ||W_temp||_F. This shrinks all singular values by the same factor. The dominant singular values were big to start, so proportional shrinkage barely affects them. The trailing singular values, which Muon was patiently growing one step at a time, get repeatedly pulled back down by the rescaling. Over many steps, the spectrum collapses into a few dominant modes. This is the "spectral squeeze." Recent benchmarks report MuonH's spectral entropy and participation ratio collapsing over training while MuonW maintains both.[15]
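Both diagnostics are simple functions of the singular values. The definitions below are common choices and are assumptions here; the cited work may normalize differently. Entropy is computed over the normalized squared singular values, and participation ratio is (Σσ²)² / Σσ⁴.

```python
import numpy as np

def spectral_entropy(W):
    """Shannon entropy of the normalized squared singular values (assumed definition)."""
    s2 = np.linalg.svd(W, compute_uv=False) ** 2
    p = s2 / np.sum(s2)
    p = p[p > 0]  # skip exact zeros so log is defined
    return float(-np.sum(p * np.log(p)))

def participation_ratio(W):
    """Effective number of active directions: (sum s^2)^2 / sum s^4 (assumed definition)."""
    s2 = np.linalg.svd(W, compute_uv=False) ** 2
    return float(np.sum(s2) ** 2 / np.sum(s2**2))

flat = np.eye(4)                                 # four equal singular values
collapsed = np.outer([1.0, 2.0], [3.0, 4.0])     # rank one
print(spectral_entropy(flat), participation_ratio(flat))
print(spectral_entropy(collapsed), participation_ratio(collapsed))
```

A flat spectrum scores log(4) and 4; a rank-one matrix scores 0 and 1. A squeeze would show up as both numbers drifting toward the second case over training.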
Where this leaves us
The frontier as of mid-2026: MuonW is the safer default for large-scale runs, MuonH is competitive at moderate scale but has the spectral squeeze problem to contend with. Active research questions include Spectral Sphere variants that constrain σ₁ rather than ||W||_F, hybrid schemes that handle the output projection separately, and the deeper question of whether the implicit annealing from norm growth in MuonW is a feature or just a happy accident.
The bigger lesson: weight magnitude has emerged as a third axis of optimizer design, orthogonal to "what direction to step" and "how aware of curvature." As networks become more carefully internally normalized, the optimizer has more room to manage scale directly, and MuonH is one promising point in that design space.
Coda The through-line
Thirteen sections, eleven optimizers. The arc, compressed:
SGD took small noisy steps in the gradient's direction. Momentum gave the optimizer a memory and a velocity. Adam added per-parameter learning rates via the EMA of squared gradients, which we then saw was really computing a signal-to-noise ratio bounded in [−1, +1]. AdamW noticed that the obvious way to add weight decay to Adam silently broke it, and fixed it in one line. SignSGD and Lion showed that the per-parameter normalized step was the active ingredient all along — Adam's elaborate machinery was approximating something simpler. Muon jumped from per-parameter to per-matrix, orthogonalizing each weight matrix's update so every singular direction gets attention. Shampoo went one further and tracked actual curvature per-row and per-column. MuonH opened a third axis: control the weight magnitude directly, not just the update.
The arc, abstracted: each optimizer either looked at the previous one and asked what is it failing to use?, or asked what is it doing redundantly that we can throw away? Both directions produced wins.[3][10]
Where this is going: the matrix-aware optimizers (Muon, Shampoo, MuonH) are still relatively new and rapidly evolving. The next axes that look promising — based on what's getting attention now — are second-order without the cost (cheap Hessian-vector products integrated into the update), even better factorizations of the implicit Hessian, and scale-control as a first-class concept. The story isn't finished, and a few of the sections above might look quaint in three years. But the through-line will hold: every optimizer fixes a specific failure of the one before it, and noticing the failure clearly is the hard part.
Refs References & further reading
This essay is a synthesis rather than original research; the lineage it traces is drawn from the papers below. The classical adaptive methods established per-parameter scaling; the sign-based family showed how little of that machinery was load-bearing; and the matrix-aware methods (Shampoo, Muon, and the hypersphere variants) are the active frontier as of mid-2026. Where the text states empirical claims about large-scale behavior (Muon's records, the spectral squeeze of MuonH), those trace to the Shampoo, Muon, and hypersphere references. The entries after those collect the foundational and adjacent work the lineage builds on (stochastic approximation and momentum, the analysis of adaptive methods, normalization and weight geometry, natural-gradient and second-order methods, the spectral-norm and duality view of Muon, and learning-rate schedules and large-batch scaling) for readers who want to follow any thread back to its root. Dates and arXiv identifiers are given so each claim can be checked at the source.
[1] J. Duchi, E. Hazan, Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (AdaGrad). JMLR 12, 2011.
[2] T. Tieleman, G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning, 2012.
[3] D. P. Kingma, J. Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015. arXiv:1412.6980.
[4] I. Loshchilov, F. Hutter. Decoupled Weight Decay Regularization (AdamW). ICLR, 2019. arXiv:1711.05101.
[5] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, A. Anandkumar. signSGD: Compressed Optimisation for Non-Convex Problems. ICML, 2018. arXiv:1802.04434.
[6] X. Chen et al. Symbolic Discovery of Optimization Algorithms (Lion). NeurIPS, 2023. arXiv:2302.06675.
[7] V. Gupta, T. Koren, Y. Singer. Shampoo: Preconditioned Stochastic Tensor Optimization. ICML, 2018. arXiv:1802.09568.
[8] H.-J. M. Shi, T.-H. Lee, S. Iwasaki, J. Gallego-Posada et al. A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer. 2023. arXiv:2309.06497.
[9] N. Vyas, D. Morwani, R. Zhao et al. SOAP: Improving and Stabilizing Shampoo using Adam. 2024. arXiv:2409.11321.
[10] K. Jordan. Muon: An Optimizer for Hidden Layers in Neural Networks. Blog, 2024.
[11] J. Liu et al. (Moonshot AI). Muon is Scalable for LLM Training. 2025. arXiv:2502.16982.
[12] Kimi Team (Moonshot AI). Kimi K2: Open Agentic Intelligence. 2025. arXiv:2507.20534.
[13] G. Yang et al. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (μP). NeurIPS, 2021. arXiv:2203.03466.
[14] L. Chen, J. Li, Q. Liu. Muon Optimizes Under Spectral Norm Constraints. 2025. arXiv:2506.15054.
[15] L. Ren et al. Rethinking Language Model Scaling under Transferable Hypersphere Optimization (HyperP / MuonH). 2026. arXiv:2603.28743.
[16] H. Robbins, S. Monro. A Stochastic Approximation Method. Annals of Mathematical Statistics 22(3), 1951.
[17] B. T. Polyak. Some Methods of Speeding Up the Convergence of Iteration Methods (heavy-ball momentum). USSR Computational Mathematics and Mathematical Physics 4(5), 1964.
[18] Y. Nesterov. A Method for Solving the Convex Programming Problem with Convergence Rate O(1/k²). Soviet Mathematics Doklady 27, 1983.
[19] I. Sutskever, J. Martens, G. Dahl, G. Hinton. On the Importance of Initialization and Momentum in Deep Learning. ICML, 2013.
[20] S. J. Reddi, S. Kale, S. Kumar. On the Convergence of Adam and Beyond (AMSGrad). ICLR, 2018. arXiv:1904.09237.
[21] L. Balles, P. Hennig. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients. ICML, 2018. arXiv:1705.07774.
[22] S. Ioffe, C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML, 2015. arXiv:1502.03167.
[23] J. L. Ba, J. R. Kiros, G. E. Hinton. Layer Normalization. 2016. arXiv:1607.06450.
[24] T. Salimans, D. P. Kingma. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. NeurIPS, 2016. arXiv:1602.07868.
[25] B. Zhang, R. Sennrich. Root Mean Square Layer Normalization (RMSNorm). NeurIPS, 2019. arXiv:1910.07467.
[26] T. van Laarhoven. L2 Regularization versus Batch and Weight Normalization. 2017. arXiv:1706.05350.
[27] S. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation 10(2), 1998.
[28] J. Martens, R. Grosse. Optimizing Neural Networks with Kronecker-factored Approximate Curvature (K-FAC). ICML, 2015. arXiv:1503.05671.
[29] R. Anil, V. Gupta, T. Koren, K. Regan, Y. Singer. Scalable Second Order Optimization for Deep Learning. 2020. arXiv:2002.09018.
[30] J. Bernstein, L. Newhouse. Old Optimizer, New Norm: An Anthology. 2024. arXiv:2409.20325.
[31] J. Bernstein, L. Newhouse. Modular Duality in Deep Learning. 2024. arXiv:2410.21265.
[32] N. J. Higham. Functions of Matrices: Theory and Computation (Newton–Schulz iteration). SIAM, 2008.
[33] I. Loshchilov, F. Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR, 2017. arXiv:1608.03983.
[34] Y. You, I. Gitman, B. Ginsburg. Large Batch Training of Convolutional Networks (LARS). 2017. arXiv:1708.03888.
[35] Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar et al. Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes (LAMB). ICLR, 2020. arXiv:1904.00962.
Cite this essay
@misc{choudhary2026genealogy,
author = {Siddharth Choudhary and Claude (Anthropic)},
title = {A Genealogy of Optimizers},
year = {2026},
month = may,
howpublished = {\url{https://itzsid.github.io/publications/optimizer-ladder.html}},
note = {Essay}
}