The Optimizer Ladder
from SGD to MuonH
A guided tour through neural-network optimizers, with twelve interactive widgets. Each optimizer solves a specific problem in the one before it, and the trajectory tells you something about where deep learning is going.
Ch. 01 SGD & the stochastic gradient
A neural network is just a giant pile of numbers — the parameters. Training it means finding the values of those numbers that make the model's predictions good. We measure "good" with a single scalar called the loss: lower is better.
So training is a search problem. Imagine the loss as the height of a landscape, and every possible setting of the parameters as a location on that landscape. Training a model is finding the lowest valley while wandering around blindfolded.
Gradient descent: the simplest possible idea
You can't see the landscape, but at any point you can feel which way the ground tilts. That tilt is the gradient — it points uphill. So you do the obvious thing: step in the opposite direction. That's the entire algorithm:
w ← w − step_size · gradient
The step_size (also called the learning rate) is the only knob you really tune. Too small and you crawl. Too big and you overshoot the valley and fly out the other side.
The "stochastic" part
Here's the catch. To compute the true gradient, you'd need to evaluate the loss on every training example — millions or billions of them — before you could take a single step. That is wildly impractical.
The trick: don't compute the true gradient. Compute a noisy estimate of it using a small random batch of examples. That estimate is wrong in the details but right on average — and it costs almost nothing.
So instead of one careful step using all the data, you take thousands of slightly drunk steps using slivers of data. You wobble more, but you cover ground much faster. That's stochastic gradient descent.
Set the learning rate, set how noisy the gradient is (low = like a big batch, high = like a single example), and watch the trajectory unfold. The orange dot is the minimum we're trying to reach.
A concrete example: fitting a line
Forget abstract bowls for a moment. Here's the simplest learning problem that exists. You have 24 data points — each is a pair (x, y). Some unknown rule generated them. You believe the rule is roughly a straight line through the origin, so your model is ŷ = w · x. There is exactly one parameter to learn: the slope w.
For a single point (x, y), your prediction will usually be a little off. We measure how off with the squared error (w·x − y)². Zero if you nailed it, bigger if you didn't. The total loss is the average of those squared errors over all points.
For a single point the gradient with respect to w is 2x · (w·x − y). You don't need to memorize that — just read off the sign. If your prediction is too low the gradient is negative, so SGD raises w. If it's too high the gradient is positive, so SGD lowers w. Each data point is yelling "you got me wrong by this much, in this direction", and SGD nudges w accordingly. The whole algorithm in three lines:
1. pick a random data point (or a small batch)
2. compute the gradient using just that point
3. w ← w − lr · gradient
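If it helps to see that loop concretely, here's a minimal sketch in NumPy — the true slope of 1.7, the noise level, and the ranges are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=24)
y = 1.7 * x + rng.normal(scale=0.3, size=24)   # 24 points from a noisy line

w, lr = 0.0, 0.02
for step in range(500):
    i = rng.integers(len(x))                   # 1. pick a random data point
    grad = 2 * x[i] * (w * x[i] - y[i])        # 2. gradient of (w*x - y)^2 at that point
    w -= lr * grad                             # 3. w <- w - lr * gradient
print(w)   # wobbles around the best-fit slope, ~1.7
```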
In the demo below, the top plot shows the data and your current line (y = w·x). The highlighted coral points are the ones being used in this step — the dashed lines show how wrong the line is for each. The bottom plot is the loss curve: a parabola in w, with the orange dot marking the true minimum and the green dot showing where you currently are.
What this earned us, and what it didn't
Try batch size 1 with lr = 0.02 and hit auto-step. The line snaps toward the data, then starts wobbling once it's roughly fit. The green dot bounces around the orange minimum but never quite settles. That's stochastic noise — even when w is right, a single random point still pulls it.
Now switch to all 24 points. The path is dead smooth. Every step uses the true gradient, so there's no noise to cancel. This is vanilla gradient descent. It looks great, but in real training, doing this every step would mean a forward+backward pass through your entire dataset to take one update. With a trillion-token corpus that's a non-starter.
SGD's deal: cheap, scalable, good enough to actually train large models. The noise that looks like a bug here is part of why it works at scale — those random kicks help models escape bad local regions of the loss landscape.
But three problems are now obvious enough that you can name them just from playing with the demo. First, even at the optimum, the noise never goes away — w keeps wobbling. Second, every step is memoryless. If the last twenty gradients all said "go right", the twenty-first step ignores all of that history and only listens to the current point. Third, in real models different parameters have wildly different gradient scales. One global learning rate that works for the whole network is asking a lot.
The next chapter — momentum — fixes the second of these and partially the first. After that, RMSProp and Adam will tackle the third.
Ch. 02 Momentum
Look back at the very first demo from chapter 1, with the bowl and the ball. Now imagine stretching that bowl out so it's a long narrow valley instead of a circular dish — steep walls running left-and-right, very gentle slope running along the length.
That's what real loss landscapes look like in deep learning. A few directions are extremely curved (a small change in the parameter changes the loss a lot) and most directions are almost flat. Watch SGD on a landscape like that and you'll see the same disaster every time. The gradient is dominated by the steep directions, so each step is a big sideways lurch across the valley. The slow component along the valley — the direction we actually need to make progress in — barely moves.
The deeper diagnosis: SGD's steps are memoryless. The seventeenth step has no idea that the previous sixteen gradients had a nice consistent component pointing down the valley.
The fix: give the optimizer a velocity
Stop treating the gradient as the direction to move. Treat it as a force pushing on a velocity vector. The optimizer carries the velocity from step to step. Useful directions (where the force keeps pushing the same way) cause the velocity to grow. Useless directions (where the force flips back and forth) cause the velocity contributions to cancel.
v ← β · v + g
w ← w − lr · v
That's it. Two lines. β is a number close to 1 (a typical default is 0.9), controlling how much of the old velocity persists. With β = 0 you get plain SGD back. With β = 0.9 the velocity is a kind of running tally of roughly the last ten gradients' worth of motion.
In the steep-wall directions, the gradient alternates sign each step, so when you accumulate it into v, the contributions cancel and the velocity stays small. In the consistent valley direction, every gradient pushes the same way, so the velocity grows — at steady state with β=0.9, it's about ten times a single gradient, which means the effective step is ten times bigger than SGD's would have been. And under noise, the noise terms in successive gradients are independent and average toward zero in v, while the real signal accumulates.
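A quick numeric check of that cancellation story — a sketch assuming a toy quadratic valley with curvature 50 across and 1 along (both values arbitrary):

```python
import numpy as np

def grad(w):
    # Quadratic valley: steep in y (curvature 50), gentle in x (curvature 1).
    return np.array([1.0, 50.0]) * w

def run(beta, lr=0.018, steps=100):
    w, v = np.array([-4.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        v = beta * v + grad(w)   # v <- beta*v + g
        w = w - lr * v           # w <- w - lr*v
    return w

print(run(beta=0.0))   # plain SGD: y dies fast, x barely moves (~ -0.65)
print(run(beta=0.9))   # momentum: both coordinates near the minimum at 0
```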
Below, both optimizers start at the same point, see the same noisy gradients, and use the same learning rate. The only difference is momentum. The contour ellipses are very flat horizontal ovals — that's the elongated valley.
Hit auto-step with the defaults and watch what happens. SGD (coral) does the classic disaster: it slams across the valley walls, bounces back, slams across again, and meanwhile barely creeps to the right. Momentum (teal) has a different shape entirely — it overshoots once or twice early on as the velocity builds up, then settles into a smooth glide along the valley floor.
Drag β to 0 and reset. Both paths overlap exactly. Momentum with β=0 is SGD. Push β to 0.95 and you'll see the cost of memory — too much inertia carries momentum past the minimum, and the teal path orbits around before settling. β=0.9 is the canonical default because it's a sweet spot between "useful inertia" and "loops around forever."
What momentum did and didn't fix
Momentum fixed the second of SGD's three complaints — steps now use information from past gradients. It partially fixed the first — under noise, the velocity buffer cancels independent noise terms. And it gave a real speedup along the slow valley directions that show up everywhere in real loss landscapes.
What it did not touch is the third complaint: there is still one global learning rate, applied uniformly to every parameter. In real models, some parameters have gradients on the order of 100 and others on the order of 0.001 at the same time. No single global lr is right for both. The fix everyone arrived at, more or less independently, in the early 2010s: keep a per-parameter estimate of recent gradient magnitudes, and divide each parameter's step by that estimate.
Ch. 03 Adam: from AdaGrad to RMSProp to Adam
Momentum gave the optimizer a memory and a velocity. It made progress along slow valleys faster, oscillations smaller, and noise damped. But the third complaint from the SGD chapter is still untouched: there is exactly one learning rate, applied to every parameter in the network.
In the toy 2D demos this looks harmless. In a real network it's catastrophic. Different parameters live in wildly different gradient regimes. The bias of a softmax output and a weight matrix two layers back can have gradient magnitudes that differ by 1000×. Pick lr for the loud parameters and the quiet ones never move. Pick lr for the quiet ones and the loud ones blow up. The fix the field converged on is per-parameter learning rates, computed automatically from gradient statistics. The story arrives in three steps.
AdaGrad: the first try (and why it broke)
Keep, for each parameter, a running sum of how big its gradients have been:
G ← G + g²
w ← w − lr · g / sqrt(G + ε)
A parameter that's been seeing huge gradients gets divided by a huge sqrt(G), so its effective step shrinks. A parameter with tiny gradients keeps a near-full lr. This is exactly what we wanted — automatic per-parameter scaling.
The catch: G only ever grows. After a few thousand steps, every parameter's effective learning rate is approaching zero, and the optimizer just stops. Fine for sparse problems; fatal for deep learning where we'll train for millions of steps.
RMSProp: just take an average
Replace the running sum with an exponential moving average:
v ← β₂ · v + (1 − β₂) · g²
w ← w − lr · g / sqrt(v + ε)
Now v is bounded — it tracks the recent squared-gradient magnitude rather than the lifetime sum. Per-parameter scaling without monotonic decay. If gradients are roughly stationary at some long-run value, the steady-state v satisfies "EMA equals itself" and plateaus at the average squared gradient. The effective learning rate no longer decays toward zero.
Compare the two accumulators on the same constant gradient stream:
Drag β₂ down to 0.9. RMSProp's plateau is unchanged (still 1) but it gets there in about ten steps — short memory, fast adaptation, noisier. Push β₂ up to 0.9995 and RMSProp warms up so slowly it barely reaches its plateau by the end of the run; long memory, smooth, but more AdaGrad-ish. At β₂=1 exactly, the EMA stops accumulating entirely.
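The same comparison in a few lines, assuming a constant gradient stream of g = 1:

```python
import numpy as np

g, eps, beta2 = 1.0, 1e-8, 0.99
G, v = 0.0, 0.0
for t in range(1, 1001):
    G += g**2                            # AdaGrad: sum grows without bound
    v = beta2 * v + (1 - beta2) * g**2   # RMSProp: EMA plateaus at g^2 = 1
    if t in (10, 100, 1000):
        print(t, g / np.sqrt(G + eps), g / np.sqrt(v + eps))
# AdaGrad's effective step decays like 1/sqrt(t); RMSProp's settles near 1.
```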
Adam: RMSProp meets momentum
Adam's contribution is conceptually small but practically enormous. It just stacks the two ideas: keep a momentum term (EMA of gradients) and a per-parameter scale term (EMA of squared gradients), and use both:
m ← β₁ · m + (1 − β₁) · g ← this is momentum
v ← β₂ · v + (1 − β₂) · g² ← this is RMSProp
m̂ = m / (1 − β₁ᵗ) ← bias correction
v̂ = v / (1 − β₂ᵗ)
w ← w − lr · m̂ / (sqrt(v̂) + ε)
Defaults that have proven nearly universal: β₁=0.9, β₂=0.999, ε=10⁻⁸, lr in the 10⁻³ to 10⁻⁴ range.
The bias correction handles a subtle bootstrapping problem: both m and v start at zero, so the first few EMAs are biased toward zero. Dividing by (1 − βᵗ) compensates — it's small for small t (boosting the early estimate) and approaches 1 as t grows. After a few thousand steps the correction is invisible.
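Transcribed into NumPy, the whole update is a sketch like this (t must be the 1-based step count for the bias correction to work):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g              # EMA of gradients (momentum)
    v = b2 * v + (1 - b2) * g**2           # EMA of squared gradients (RMSProp)
    m_hat = m / (1 - b1**t)                # bias correction: big boost early,
    v_hat = v / (1 - b2**t)                #   approaches 1 as t grows
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```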
Here's the punchline that took the field a few years to fully internalize. Look at the term m / sqrt(v). If gradients are roughly stationary, then v ≈ g² and m ≈ g, so m/sqrt(v) ≈ g/|g| = sign(g). In other words, Adam's update is approximately a momentum-smoothed sign of the gradient, scaled by lr. Every parameter, regardless of its gradient magnitude, moves by approximately lr in some direction each step.
That's why Adam works so well on heterogeneous networks. Loud parameters and quiet parameters both step by ~lr, and you only need to tune one number for the whole network.
At lr=0.05, both work, but Adam glides toward the minimum in a clean, almost-straight line while momentum does its familiar oscillation. Crank lr to 0.08: momentum diverges off the canvas. Adam doesn't care — its update in the steep direction is still ~lr regardless of how big the gradient is, because the sqrt(v) divisor scales with gradient magnitude.
Adam works at default settings on most problems. That single fact — "default Adam works on a new problem" — is why it became the field's reflexive choice for nearly a decade.
Ch. 04 Adam as a signal-to-noise ratio
"Smoothed gradient over square root of squared gradients" is a mouthful. Here's the cleaner reading: Adam's update is computing a per-parameter signal-to-noise ratio, on a scale of −1 to +1.
Walk through a single parameter, observing a noisy gradient written as signal + noise at each step. Imagine the signal as the true mean (what the gradient should be) and the noise as a zero-mean random kick with standard deviation σ. After many steps:
- m averages the gradient, so m → signal — the noise cancels in the running mean.
- v averages the squared gradient, so v → signal² + σ² — the noise variance contributes here even though it cancelled in m.
- The update m / sqrt(v) → signal / sqrt(signal² + σ²) — exactly the signal-to-noise ratio.
This works out across many scenarios:
| scenario | signal | σ | m → | sqrt(v) → | update | interpretation |
|---|---|---|---|---|---|---|
| A — clean strong | 0.3 | 0 | 0.3 | 0.3 | 1.00 | full step, clear direction |
| B — moderate noise | 0.3 | 1.0 | 0.3 | 1.04 | 0.29 | direction known, magnitude uncertain |
| C — weak signal | 0.1 | 1.0 | 0.1 | 1.005 | 0.10 | barely above noise, small step |
| D — pure noise | 0 | 1.0 | 0 | 1.0 | 0.00 | no signal, no step |
| E — oscillating | ±1 | 0 | 0 | 1.0 | 0.00 | at minimum in this direction |
| F — loud parameter | 100 | 0 | 100 | 100 | 1.00 | same as A — magnitude normalized |
Look down the "update" column. It's in [−1, +1] and it's measuring how confidently the optimizer should commit to this direction. Strong clean signal → 1. Strong noisy signal → ~0.3. Pure noise or oscillation → 0. And critically, magnitude alone (compare A and F) doesn't change the answer — only the signal-to-magnitude ratio matters.
The widget below feeds three different gradient streams to the same Adam update. The line plotted is m̂ / sqrt(v̂) — the update direction, not the gradient itself.
The teal line shoots to 1 and stays there. The coral oscillating signal damps fast to a tight band near zero — the alternating gradient cancels in m but not in sqrt(v), so the update is suppressed. The purple noisy signal settles around 0.29, which is exactly 0.3 / sqrt(0.3² + 1²) = 0.287.
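You can verify scenario B numerically — a sketch with signal 0.3 and σ = 1, averaging the update over the run to smooth out the EMA's own fluctuations:

```python
import numpy as np

rng = np.random.default_rng(1)
b1, b2 = 0.9, 0.999
signal, sigma = 0.3, 1.0        # scenario B from the table
m = v = 0.0
updates = []
for t in range(1, 50001):
    g = signal + sigma * rng.normal()
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    updates.append(m / np.sqrt(v + 1e-8))
print(np.mean(updates[5000:]))  # ~0.287 = 0.3 / sqrt(0.3^2 + 1^2)
```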
One caveat: in strict steady state the update is bounded by 1, but during transients (especially early in training, or right after a learning rate change) the EMA window mismatch — m with β₁=0.9 has a short window while v with β₂=0.999 has a much longer one — can produce updates above 1. This is part of why gradient clipping is standard practice for transformer training: it prevents the numerator from temporarily outpacing the denominator and producing oversized parameter updates.
Ch. 05 AdamW: the one-line fix
Adam was published in 2014. AdamW, the version everyone actually uses, came out in 2017. The change is one line of code. The reason it took three years is that nobody — including the people who originally wrote Adam — noticed the bug.
Weight decay's job: at every step, gently pull parameters toward zero. Small parameters → simpler model → better generalization. The "gentle pull" is a multiplier slightly less than 1:
w ← w − lr · g ← regular update
w ← (1 − lr · wd) · w ← weight decay, applied separately
There's a math identity that's true for SGD: shrinking the parameters is equivalent to adding wd · w to the gradient. So for years it was standard practice to "implement weight decay" by just adding wd · w to the gradient before passing it to the optimizer. Cleaner code, same result — for SGD, momentum, RMSProp, anything that just multiplies the gradient by lr and subtracts.
The original Adam paper followed this convention. So did every Adam implementation for the next three years.
Why it silently breaks for Adam
Adam doesn't just multiply the gradient by lr and subtract. It divides by sqrt(v) first. So when you fold wd · w into the gradient, the decay term goes through that divisor too. The decay each parameter actually receives is lr · wd · w / sqrt(v), not lr · wd · w. Parameters with large v (loud gradients) get their weight decay divided by a large number — they're decayed less. Parameters with small v get decayed more.
This is exactly backwards from what regularization should do. The parameters fitting the loudest signals — the ones most at risk of overfitting — get the least regularization. The quiet parameters that barely move get aggressively shrunk to zero.
The fix
Loshchilov and Hutter's observation: the identity that lets you fold weight decay into the gradient is only valid when the optimizer is w ← w − lr · g. The moment you preprocess the gradient — with sqrt(v), with sign, with anything — the equivalence breaks. So just don't fold it in. Apply weight decay directly to the weights:
m ← β₁·m + (1−β₁)·g
v ← β₂·v + (1−β₂)·g²
m̂ = m / (1 − β₁ᵗ), v̂ = v / (1 − β₂ᵗ)
w ← w − lr · m̂ / (sqrt(v̂) + ε) ← Adam update, no wd here
w ← w − lr · wd · w ← decoupled weight decay
That's the entire change. Now every parameter gets the same proportional decay regardless of v, and wd is an independent knob from lr.
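As a sketch, the full AdamW step differs from the Adam step in chapter 3 by exactly the last line:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # Adam update, no wd here
    return w - lr * wd * w, m, v                  # decoupled decay, bypasses sqrt(v)

# The broken variant instead folds decay into the gradient before the m/v
# updates (g = g + wd * w), which routes the decay through the sqrt(v) divisor.
```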
This visualization isolates the effect: two parameters A (loud, large v) and B (quiet, small v) both starting at 1.0 with zero gradient. The only thing pulling them down is decay. So everything you see is the decay term in action, no other dynamics.
The teal line is AdamW: both parameters decay at the same rate, because the decay is decoupled from v. They overlap perfectly. The two coral lines show Adam-with-L2: the dark coral (large v) is barely moving — its decay got divided by a big number, so it's effectively unregularized. The light coral (small v) drops fast — its decay got divided by a small number, amplifying it. Drag the v sliders apart and the gap widens dramatically.
That's the bug. The optimizer was supposed to apply the same regularization to every parameter, and Adam-with-L2 silently made the regularization strength inversely proportional to gradient magnitude.
AdamW is the optimizer that trained essentially every transformer between 2017 and ~2023. It's the silent default behind GPT-2, GPT-3, BERT, T5, ViT, CLIP, and most of what came in their wake.
Ch. 06 Interlude — SignSGD
Up to this point the story has been "track more statistics, use them more cleverly." SignSGD is the rude question: what if we tracked less and it worked anyway?
We just spent two chapters building Adam, which carries two EMAs per parameter and computes m / sqrt(v). Then we noticed that in steady state, that update is approximately sign(g). So Adam is paying for two buffers of optimizer state to approximately compute the sign. A natural reaction: skip the machinery. Compute the sign directly.
w ← w − lr · sign(g)
That's it. No m, no v, no momentum, no per-parameter scale. Every parameter steps by exactly lr in some direction each step. The optimizer holds zero state — your only memory cost is the weights themselves.
Two big practical wins. Optimizer state: AdamW costs 2× the model weights in optimizer state. For a 70B parameter model in mixed precision, that's ~280GB just for the optimizer. SignSGD costs zero. Distributed communication: in distributed training, you can compress each gradient to one bit per parameter — a 32× reduction in bandwidth.
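Both the update and the 1-bit compression are nearly one-liners — a sketch:

```python
import numpy as np

def signsgd_step(w, g, lr=1e-4):
    # Every parameter moves by exactly lr; magnitude information is discarded.
    return w - lr * np.sign(g)

# The distributed-communication win: only sign bits cross the wire.
g = np.random.randn(1_000_000)
packed = np.packbits(g > 0)   # 1e6 float32 gradients (4 MB) -> 125 KB of bits
```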
But SignSGD has two real issues.
No magnitude information. A parameter with gradient 0.0001 and one with gradient 1000 take the same step. If a parameter is already close to its optimum, SignSGD will overshoot by a fixed amount every step and oscillate around the minimum forever.
Noise is catastrophic. Adam's sqrt(v) term automatically scales steps down for noisy parameters. SignSGD doesn't: the sign of a nearly-zero gradient buried in noise is essentially random, and SignSGD will take a full lr-sized step in that random direction every time.
Same anisotropic valley as before. SignSGD with no momentum, AdamW for comparison.
Run with default noise (0.3). SignSGD marches diagonally toward the minimum: each step is exactly lr in both axes regardless of the wildly different gradient magnitudes (~3 in x, ~150 in y). This is the heterogeneity-handling that made Adam great. But once close to the minimum, it can't slow down — watch it oscillate around the target. Crank noise to 1.5 and SignSGD goes haywire while AdamW stays composed.
SignSGD is rarely used as-is in practice — too brittle. But it cleanly demonstrates two things that turn out to matter for what comes next. The directionally normalized step (everything moves by ~lr) is doing most of the work that makes Adam great. The magnitude correction in m / sqrt(v) is a refinement, not the core mechanism. And you can pay almost nothing in optimizer state and still get serviceable training, especially if you add back just a momentum buffer to handle noise.
Ch. 07 Lion: one buffer instead of two
Lion is the optimizer that emerged when Google Brain ran a massive program search over the space of optimizer update rules — they let an evolutionary algorithm hunt for new optimizers and Lion is what won. The result is striking: substantially simpler than AdamW, half the optimizer state, matches or beats AdamW on most large-scale benchmarks. The name stands for "EvoLved sIgn mOmeNtum."
We have all the parts. SignSGD without momentum was brittle. The fix is: smooth the gradient first, then take the sign. The actual rule has one extra wrinkle — Lion uses two different mixing rates, one for taking the step and one for updating the momentum buffer:
update = sign(β₁ · m + (1 − β₁) · g) ← what you step in this update
w ← w − lr · update − lr · wd · w ← step + decoupled weight decay
m ← β₂ · m + (1 − β₂) · g ← what you remember for next time
Defaults: β₁ = 0.9, β₂ = 0.99. Only one buffer of state. No v. No sqrt. No ε.
Lion takes the step using a fast EMA of the gradient (window ≈ 10 steps with β₁=0.9), but stores a slow EMA for next time (window ≈ 100 steps with β₂=0.99). The faster average gives responsiveness; the slower average gives stability. The Lion paper found this asymmetric design materially outperforms using the same β for both — the kind of detail you only find by searching the space exhaustively.
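In code, the asymmetry is just two different mixing lines — a sketch following the update rule above:

```python
import numpy as np

def lion_step(w, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.1):
    update = np.sign(b1 * m + (1 - b1) * g)   # fast mix (window ~10) for the step
    w = w - lr * update - lr * wd * w         # step + decoupled weight decay
    m = b2 * m + (1 - b2) * g                 # slow mix (window ~100) stored
    return w, m
```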
What Lion gives up and what it gives back
Lion's update is in {−1, 0, +1} per parameter — strict sign, no fractional confidence weighting. A parameter Adam would have stepped by 0.1 · lr, Lion steps by either 0 or the full lr. And the per-parameter magnitude normalization from sqrt(v) is replaced by global normalization via the sign function. In practice, neither loss is fatal. Lion still works.
What's earned: memory halved (only m, no v) and 2-15% throughput improvement per step depending on hardware. The catch: Lion needs a smaller lr (3–10× smaller is typical) and stronger weight decay than AdamW. The hyperparameter sweet spot shifts.
What makes Lion intellectually interesting isn't that it's better than AdamW — sometimes it is, sometimes it isn't. What's interesting is that it's competitive while throwing away half the machinery. That's strong evidence for a claim worth examining carefully in the next chapter: the most important thing Adam was doing all along was producing per-parameter normalized steps.
Ch. 08 The hidden invariant: per-parameter normalization
Time to pause and name the through-line. Looking back at SGD → momentum → AdamW → SignSGD → Lion, there's a single dimension that explains most of why each one outperforms the last. Call it per-parameter normalization: how does each parameter's gradient magnitude affect how far it moves?
| optimizer | effective step for parameter i | scales with |gᵢ|? |
|---|---|---|
| SGD | lr · gᵢ | yes — linearly |
| SGD + momentum | lr · vᵢ, where v accumulates past gradients | yes — still linearly |
| AdamW | lr · mᵢ / sqrt(vᵢ) ≈ lr · sign(gᵢ) | no — flat |
| Lion | lr · sign(β · mᵢ + (1−β) · gᵢ) | no — flat |
| SignSGD | lr · sign(gᵢ) | no — flat |
This is a useful chance to correct something easy to get wrong. Momentum does not solve the heterogeneous-gradient problem. It accumulates gradients along consistent directions, which speeds you up in slow valleys, but it preserves the per-parameter magnitude. A parameter with 100× larger gradient still gets 100× larger steps under momentum. The thing that did solve heterogeneity was the sqrt(v) divisor in RMSProp / Adam, or the sign() function in Lion / SignSGD. Both produce roughly equal step sizes per parameter regardless of magnitude. Both are doing "spatial normalization across parameters" at each step. Momentum is "temporal smoothing within each parameter."
Below: five parameters with gradient magnitudes spanning 10⁻³ to 10², and the displacement each optimizer takes for each. The y-axis is log scale because otherwise SGD's tallest bar would be 100,000× taller than its smallest.
The SGD bars span five orders of magnitude. Momentum's bars are uniformly ten times taller — same shape, just scaled. AdamW's bars are all the same height: a parameter with gradient 0.001 and one with gradient 100 take the same step. Lion's bars are identical to AdamW's.
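The widget's punchline fits in four lines — a sketch with the same five magnitudes:

```python
import numpy as np

g = np.array([1e-3, 1e-1, 1.0, 10.0, 100.0])   # five gradient magnitudes
lr = 0.01
print(lr * g)           # SGD: displacements span five orders of magnitude
print(lr * 10 * g)      # momentum at steady state: same shape, ~10x taller
print(lr * np.sign(g))  # Adam (steady state), Lion, SignSGD: flat
```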
This lens reframes Lion's success. Lion isn't beating AdamW because sign() is smarter than m/sqrt(v). It's beating it because both reach the same destination — flat, per-parameter normalized steps — and Lion gets there with one buffer and one operation while AdamW needs two buffers, a square, a square root, a division, and a bias correction. The work AdamW was doing turned out to be approximating something simpler.
It also points at what comes next. Once you see "per-parameter normalization" as the underlying principle, the obvious question is whether per-parameter is even the right granularity. Real network parameters aren't a flat list — they're organized into matrices, which have richer structure that per-parameter operations throw away. Two chapters from now, Muon will exploit exactly this. But first, a detour into what optimizers still aren't seeing.
Ch. 09 What first-order optimizers can't see
Every optimizer so far — SGD, momentum, Adam, Lion, all of them — uses one piece of information at each parameter: the gradient. That tells you the slope. It doesn't tell you the curvature.
Here's the difference. Imagine two parameters, both with the same gradient right now. The first is in a gently curving region — small changes in the parameter barely change the slope. The second is in a sharply curving region — small changes in the parameter rapidly change the slope. The optimal step for these two cases is very different. The gently curving parameter can take a big step safely. The sharply curving one is about to roll past its minimum if you step too far.
Both green and coral curves have the same slope at the green dot (the starting point). The slider is the step size — i.e., how far you move in the direction the gradient suggests. On the gentle curve, a big step lands you closer to the minimum. On the sharp curve, the same big step overshoots and lands you higher up the other side.
This is the fundamental blind spot of first-order methods. The optimizer can't see how much each parameter "wants" you to step — only the direction. Adam's sqrt(v) normalizes by recent gradient magnitudes, which is a noisy proxy for curvature. SignSGD throws away even that. None of them use the actual curvature of the loss landscape.
What "curvature" means precisely
For a single parameter, curvature is just the second derivative — the rate of change of the slope. Call it H (for "Hessian," which is the multi-parameter generalization). Then near a current point w, the loss looks like:
L(w + Δ) ≈ L(w) + g · Δ + ½ H · Δ²
This is just the Taylor expansion. The optimal Δ — the one that minimizes the quadratic — is the value that makes the derivative zero: g + H · Δ = 0, so Δ = −g / H. That's Newton's method. The −g / H step takes you directly to the bottom of the local parabola in one move.
The geometric picture is satisfying. The gradient g is the current slope. The Hessian H is how fast that slope is changing. The ratio g / H is "how far do I need to walk before the slope reaches zero" — i.e., the distance to the bottom of the local parabolic bowl.
The top plot is the loss L(w) = ½ H w². The bottom plot is its slope, g = H w. The Newton step takes you from where you are to where the slope crosses zero — and on a quadratic that's exactly the minimum, in one step. Try different curvature values: high curvature means the slope drops fast and the Newton step is short. Low curvature means the slope drops slowly and the step is long. The step is automatically calibrated to the curvature.
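In code the whole idea is two lines — a sketch with arbitrary toy values:

```python
# Newton's method on a 1-D quadratic L(w) = 0.5 * H * w**2.
H, w = 2.0, 3.0
g = H * w          # current slope
w = w - g / H      # Newton step: lands exactly at the minimum
print(w)           # 0.0 — one move, regardless of the curvature
```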
The Hessian in 2D
For multiple parameters, the Hessian generalizes to a matrix of all pairwise second derivatives:
H, with entries Hᵢⱼ = ∂²L / ∂wᵢ ∂wⱼ
For two parameters that's a 2×2 matrix. The diagonal entries are the curvatures along each axis. The off-diagonal entries — these are the interesting ones — say whether the bowl is "tilted" or "rotated" relative to the axes. They're zero only when the level curves are perfectly aligned with the axes.
At Hxx = Hyy = 2, Hxy = 0 you get circular level sets — same curvature in every direction. Crank Hxx to 5 keeping Hyy = 2: the bowl is now steeper along x, so the level sets squeeze along x into an elongated ellipse — the anisotropic valley from chapter 2, here standing on end. Push Hxy to 1: the ellipse rotates. The off-diagonals are the source of all the geometric weirdness real neural networks have. Real Hessians have non-zero off-diagonals because parameters interact through the network — the curvature with respect to one parameter depends on the others' values.
How you'd actually compute one
The gradient g is what backpropagation gives you for free — it's the same machinery that trains the network. The Hessian needs second derivatives. The classic trick: differentiate the gradient. If you ran backprop to get g, then run backprop again on each entry of g, you get the Hessian row by row. This is called "double backprop."
For a tiny example: L = ½ (w₁² + 4 w₂² + 2 w₁ w₂). First backward pass gives g = [w₁ + w₂, 4 w₂ + w₁]. Now backprop g[0] = w₁ + w₂ with respect to w: that gives the first row of H. Repeat for g[1]: second row.
Two backward passes give you the full 2×2 Hessian for a 2-parameter model. For an N-parameter model, you'd need N backward passes — one per row. Even N = 10⁶ is intractable. For modern LLMs with N ≈ 10⁹–10¹², materializing the full Hessian isn't merely expensive, it's physically impossible — it wouldn't fit in any storage. So full Newton's method is dead on arrival.
But there's a useful escape hatch. We rarely want the whole Hessian — what we usually want is H · v for some vector v. And that's surprisingly cheap: differentiate g · v (a scalar!) once more, and you get H · v directly. One backward pass instead of N. These "Hessian-vector products" are the basis of every practical second-order optimizer ever proposed.
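Here's the tiny worked example as runnable code. Autodiff frameworks compute H · v exactly by differentiating g · v a second time; this sketch substitutes a finite difference of the gradient, which gives the same answer here because g is linear:

```python
import numpy as np

# Tiny example from the text: L = 0.5*(w1^2 + 4*w2^2 + 2*w1*w2).
def grad(w):
    w1, w2 = w
    return np.array([w1 + w2, 4 * w2 + w1])

# Hessian-vector product without materializing H: one extra gradient
# evaluation along v, instead of N backward passes for the full matrix.
def hvp(w, v, eps=1e-6):
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

w, v = np.array([1.0, 2.0]), np.array([1.0, 0.0])
print(hvp(w, v))   # -> [1, 1], the first column of H = [[1, 1], [1, 4]]
```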
So the second-order story is: don't try to materialize H. Don't try to invert it. Get cheap approximations, ideally factored into pieces that match the matrix structure of your network. The next two chapters — Muon and Shampoo — are the two main answers the field has converged on for how to do that.
Ch. 10 Muon: per-matrix orthogonalization
Here's the assumption that every optimizer until now smuggled in: a neural network's parameters are a flat list. Adam looks at each scalar entry of each weight matrix independently and decides how much to step it. The fact that those scalars are organized into a matrix — with rows that mean something, columns that mean something, and a structure that survives matrix multiplication — is information the optimizer throws away.
Muon's bet is that this structure is exactly the information you need.
The picture: singular values
Any matrix M has a singular value decomposition: M = U Σ Vᵀ. U and V are rotation matrices. Σ is a diagonal matrix of singular values — these say how much the matrix stretches space along each principal direction. A matrix with one huge singular value and many tiny ones acts almost like a rank-1 projection: it pours all input variation onto one output direction. A matrix with uniform singular values acts like a rotation that preserves the geometry of the input space.
The gradient of a weight matrix has its own singular value structure. And empirically, gradient matrices in training have very unequal singular values. A few directions dominate; most are nearly zero. This means a vanilla gradient step is mostly making progress in one or two directions and barely moving in others.
Muon's update
Compute the momentum-smoothed gradient as usual. Then, before stepping, flatten all its singular values to 1. Equivalently: replace Σ with the identity. What you get is a "whitened" version of the gradient — same directions as the original (same U and V) but every direction now contributes equally.
u ← β · u + g ← momentum buffer
û = orthogonalize(u) ← replace u's singular values with 1s
W ← W − lr · û
For hidden weight matrices only — embeddings, biases, and norm parameters use AdamW. That's why Muon comes packaged with a side-by-side optimizer in practice: it only knows what to do with matrices.
Replacing the singular values directly via SVD is too slow for every step of training. The Muon paper uses Newton-Schulz iteration: a degree-5 polynomial in the matrix that, applied iteratively (typically 5 times), pushes all singular values toward 1 without ever computing the SVD. Each iteration is just matrix multiplications, which GPUs are exceptionally good at.
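A sketch of that iteration in NumPy. The quintic coefficients below are the tuned values from Keller Jordan's reference implementation; the eps and transpose handling are standard details:

```python
import numpy as np

def newton_schulz5(G, steps=5, eps=1e-7):
    # Push all singular values of G toward 1 without computing an SVD.
    a, b, c = 3.4445, -4.7750, 2.0315   # tuned quintic coefficients
    X = G / (np.linalg.norm(G) + eps)   # scale so singular values are <= 1
    tall = G.shape[0] > G.shape[1]
    if tall:
        X = X.T                         # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # only matmuls: GPU-friendly
    return X.T if tall else X
```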
Drag skew: the raw gradient's spectrum (gray) goes from uniform to extremely skewed. Drag iters: at 0, the green bars match gray (no orthogonalization). By 3-4 iterations the green spectrum is essentially flat at 1. With heavy skew you can see why standard Newton-Schulz uses 5 iterations as a practical default — it's the smallest count that handles realistic spectra.
Why this works
The vanilla momentum step is dominated by a few large singular directions. Muon's orthogonalized step has equal magnitude in every direction its U and V define. Every direction the gradient knows about gets attention. Parameters that would have been ignored under vanilla momentum (because they sat in low-singular-value directions of the gradient) get the same step size as the dominant directions.
Practically, Muon set the current speed records for nanoGPT training and was used at trillion-parameter scale for Kimi K2. Muon also comes with built-in μP-style scaling — as you make the model bigger, you don't need to retune the learning rate, which is one of the most labor-intensive parts of training large models.
Muon takes only the gradient matrix structure and gets a substantial speedup from it. The next chapter — Shampoo — goes one level deeper and asks: what if the optimizer also tracks the curvature structure of each matrix?
Ch. 11 Shampoo: curvature-aware preconditioning
Shampoo predates Muon by several years (the original paper is from 2018) and was actually the first matrix-aware optimizer to gain real attention. The plot twist is that Muon got most of the practical traction, partly because Shampoo is more expensive to run and partly because its theoretical motivation — full Newton-style preconditioning — is a stronger claim than people initially needed.
Shampoo's pitch: approximate the full Hessian using just two matrices per layer, one capturing how rows interact and one capturing how columns interact. For a weight matrix W ∈ ℝᵐˣⁿ with gradient G, maintain:
L ← L + G Gᵀ ← m × m left preconditioner (row curvature)
R ← R + Gᵀ G ← n × n right preconditioner (column curvature)
W ← W − lr · L^(−1/4) · G · R^(−1/4)
The −1/4 exponents come from a Kronecker-factorization argument — the full-matrix preconditioner is approximately L^(1/2) ⊗ R^(1/2), and applying its inverse square root to the gradient works out to multiplying G by L^(−1/4) on the left and R^(−1/4) on the right.
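A dense sketch of one Shampoo step. Real implementations refresh the inverse roots only every few hundred steps (see below); here they're recomputed every call, and the eigendecomposition route keeps everything in plain NumPy:

```python
import numpy as np

def inv_fourth_root(M, eps=1e-6):
    # M is symmetric PSD; compute M^(-1/4) via eigendecomposition.
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag((vals + eps) ** -0.25) @ vecs.T

def shampoo_step(W, G, L, R, lr=0.01):
    L = L + G @ G.T      # m x m row statistics
    R = R + G.T @ G      # n x n column statistics
    W = W - lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W, L, R
```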
Three optimizers, same gradient, same matrix, three different updates:
Same starting matrix. AdamW (purple) operates entry-wise: scale each cell by its own statistic, but the matrix structure is invisible to it. Muon (teal) flattens the singular value spectrum — same row/column structure, but reshaped to equal magnitude in every direction. Shampoo (gold) does something subtly different: it uses the empirical row- and column-curvature statistics to rebalance the matrix non-uniformly, dampening rows/columns that have been receiving large updates and boosting ones that haven't.
Practical considerations
Shampoo's chief practical issues are cost (the inverse matrix roots are expensive to compute, so they're typically refreshed every few hundred steps rather than every step), memory (the two preconditioners take additional space proportional to m² + n²), and tuning (more hyperparameters than Muon). Variants like Distributed Shampoo and SOAP (Shampoo + Adam fused) have made it more practical. Recent benchmarks have it roughly tied with Muon at small scale and behind Muon at large scale once optimizer-aware learning rate transfer (μP) is factored in.
The interesting question this raises is whether μP plus Muon plus Shampoo is redundant. If μP already gives you scale-invariant learning rates, and Muon already orthogonalizes per-matrix, what additional curvature signal is Shampoo capturing that the others miss? The answer seems to be: not as much as you'd hope at large scale, but enough at moderate scale to keep Shampoo in the conversation. It's the optimizer that proved matrix-aware preconditioning was a real direction — even if Muon turned out to be the simpler practical winner.
Ch. 12 MuonH: pinning the weight norm
Every optimizer so far has been about the update direction. SGD: raw gradient. Momentum: smoothed. Adam: per-element normalized. Muon: per-matrix orthogonalized. Shampoo: curvature-aware. But there's a parallel question we've only touched lightly in the AdamW chapter: how should we control the magnitude of the weights themselves?
Weight decay was the default answer. MuonH is a more radical one.
Scale invariance reframes the question
Modern transformers wrap every weight matrix in RMSNorm layers — the "RMSNorm sandwich." RMSNorm is scale-invariant: scaling its input by any positive α produces identical output. As a consequence, the weight matrix's magnitude doesn't actually affect what function the layer computes. Only its direction matters.
But weights still grow during training. Without any weight decay, the Frobenius norm grows roughly as √t, which inflates the parameter scale relative to step size and shrinks the effective learning signal. AdamW's decoupled weight decay handles this by gently pulling weights toward zero each step, creating an equilibrium norm that weights "hover" near.
The Hyperball question is: if magnitude doesn't matter functionally, why not just fix the magnitude exactly?
The Hyperball update
Pick a target radius R — typically R = ||W₀||_F, the matrix's initial Frobenius norm. After every optimizer step, project the weight matrix back to that radius:
U_t = −η · R · Normalize(proposed_update) ← normalize update to scale ηR
W_temp = W_t + U_t ← take the step
W_{t+1} = R · Normalize(W_temp) ← retract back to radius R
The weight matrix lives on a hypersphere of radius R for the entire training run. (Despite the name "hyperball," you're constrained to the boundary — a hypersphere.) The update magnitude ||U_t||_F = η · R is fixed, so η takes on a clean geometric meaning: it's the relative step size in units of weight norm. MuonH means "Muon with Hyperball," parallel to AdamW meaning "Adam with decoupled Weight decay."
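The three lines above, as a NumPy sketch (np.linalg.norm on a matrix is the Frobenius norm):

```python
import numpy as np

def hyperball_step(W, direction, R, eta=0.3):
    # 'direction' is the proposed update, e.g. Muon's orthogonalized momentum.
    U = -eta * R * direction / (np.linalg.norm(direction) + 1e-12)  # ||U||_F = eta*R
    W_temp = W + U                                        # take the step
    return R * W_temp / (np.linalg.norm(W_temp) + 1e-12)  # retract to radius R
```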
The geometric picture
In 2D this is a circle. W_t is a point on the circle. The proposed update U_t is a vector starting at W_t. The sum W_t + U_t generally lands off the circle. The retraction projects it back radially — that's W_{t+1}. The actual step ΔW = W_{t+1} − W_t is a chord of the circle.
At default position (γ=0, η=0.3), the proposed update is tangent to the circle. The retraction does almost nothing — actual ||ΔW||/R ≈ proposed. This is the best case.
Drag γ toward +1: the update points radially outward. The retraction now has to slice off the radial component, and the actual step shrinks dramatically. Drag γ toward −1: same in reverse. The geometric punchline: retraction implicitly cancels any radial component of the update. Only tangential motion survives.
The spectral squeeze
Muon was already controlling the spectrum of the update by flattening its singular values. With Hyperball, you also pin the weight matrix's Frobenius norm. So MuonH gives you complete control over both update structure and weight magnitude — the cleanest version of the matrix-aware story.
But empirically, MuonH doesn't uniformly beat MuonW (Muon with ordinary weight decay). At lower learning rates MuonH wins; at higher rates MuonW pulls ahead. The mechanism: MuonW lets the weight norm grow over training, which provides an implicit annealing schedule — as ||W||_F grows, the relative step ||ΔW||_F / ||W||_F shrinks even at fixed lr. MuonH's fixed norm eliminates this annealing.
There's a deeper failure mode at scale, too. Hyperball's retraction multiplies the matrix by a single scalar c = R / ||W_temp||_F. This shrinks all singular values by the same factor. The dominant singular values were big to start, so proportional shrinkage barely affects them. The trailing singular values — which Muon was patiently growing one step at a time — get repeatedly pulled back down by the rescaling. Over many steps, the spectrum collapses into a few dominant modes. This is the "spectral squeeze." Recent benchmarks show MuonH's spectral entropy and participation ratio collapse over training while MuonW maintains both.
Where this leaves us
The frontier as of mid-2026: MuonW is the safer default for large-scale runs, MuonH is competitive at moderate scale but has the spectral squeeze problem to contend with. Active research questions include Spectral Sphere variants that constrain σ₁ rather than ||W||_F, hybrid schemes that handle the output projection separately, and the deeper question of whether the implicit annealing from norm growth in MuonW is a feature or just a happy accident.
The bigger lesson: weight magnitude has emerged as a third axis of optimizer design, orthogonal to "what direction to step" and "how aware of curvature." As networks become more carefully internally normalized, the optimizer has more room to manage scale directly, and MuonH is one promising point in that design space.
Coda The through-line
Twelve chapters, twelve widgets, ten optimizers. The arc, compressed:
SGD took small noisy steps in the gradient's direction. Momentum gave the optimizer a memory and a velocity. Adam added per-parameter learning rates via the EMA of squared gradients, which we then saw was really computing a signal-to-noise ratio bounded in [−1, +1]. AdamW noticed that the obvious way to add weight decay to Adam silently broke it, and fixed it in one line. SignSGD and Lion showed that the per-parameter normalized step was the active ingredient all along — Adam's elaborate machinery was approximating something simpler. Muon jumped from per-parameter to per-matrix, orthogonalizing each weight matrix's update so every singular direction gets attention. Shampoo went one further and tracked actual curvature per-row and per-column. MuonH opened a third axis: control the weight magnitude directly, not just the update.
The arc, abstracted: each optimizer either looked at the previous one and asked what is it failing to use?, or asked what is it doing redundantly that we can throw away? Both directions produced wins.
Where this is going: the matrix-aware optimizers (Muon, Shampoo, MuonH) are still relatively new and rapidly evolving. The next axes that look promising — based on what's getting attention now — are second-order without the cost (cheap Hessian-vector products integrated into the update), even better factorizations of the implicit Hessian, and scale-control as a first-class concept. The story isn't finished, and a few of the chapters above might look quaint in three years. But the through-line will hold: every optimizer fixes a specific failure of the one before it, and noticing the failure clearly is the hard part.