why i keep thinking about grokking
2026.03.20

a few years ago a paper dropped with a result so strange it looked like a bug. you train a small transformer on modular arithmetic, something like a + b mod 97. within a few hundred steps, the training loss crashes to zero. the model has memorized. test loss, meanwhile, is still sitting flat at chance. any reasonable person stops there, calls it overfit, and goes home.
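to make the setup concrete, here's a minimal sketch of what the dataset looks like. the split fraction and seed are illustrative, not the paper's exact settings:

```python
import itertools
import random

p = 97
# every possible (a, b) pair is a data point; the label is (a + b) mod p.
# the whole "world" of the task is just p * p = 9409 examples.
pairs = list(itertools.product(range(p), repeat=2))
random.seed(0)
random.shuffle(pairs)

split = len(pairs) // 2  # the grokking experiments train on a fraction of all pairs
train = [(a, b, (a + b) % p) for a, b in pairs[:split]]
test = [(a, b, (a + b) % p) for a, b in pairs[split:]]
```

the point of showing this is how small the task is: memorizing a few thousand triples is trivial for a transformer, which is exactly why the flat test loss looked like ordinary overfitting.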
except they didn’t. they kept training, and after something like a hundred thousand more steps of no visible progress, test loss suddenly collapsed to near zero. the model generalized, long after the moment everyone would have called the experiment over. the paper is power et al. 2022, and the authors named the phenomenon grokking.
the reason it sticks with me isn’t the specific result. it’s that every metric we had pointed in the wrong direction. if you’d used early stopping, which is the textbook move, you’d have stopped with a memorizing network and never discovered that the generalizing one was a hundred thousand steps away along the same loss curve, in the same training run, on the same weights.
later, nanda and collaborators cracked open one of these networks and reverse-engineered what was actually happening during the long flat region. for modular addition, the network was quietly building a representation that solves the whole task with discrete fourier transforms. it rotates vectors on a unit circle indexed by residues, and adds the angles. the memorizing solution and the fourier solution coexist in the weights for most of training. weight decay slowly sands down the memorizing circuit, and when it’s gone, the fourier circuit is already sitting there waiting. the phase transition you see on the test loss is just the moment the second circuit becomes load-bearing.
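the rotation trick is easy to demo outside a neural net. this is a toy sketch of the algorithm nanda et al. found, not the trained network itself: the frequency k and the argmax readout are illustrative stand-ins for learned embeddings and unembeddings.

```python
import numpy as np

p = 97
k = 5  # one illustrative frequency; the real network uses a handful of them

def angle(x):
    # map residue x to an angle on the unit circle
    return 2 * np.pi * k * x / p

def mod_add_via_rotation(a, b):
    # represent each input as a point on the unit circle,
    # "add" by composing rotations (angles sum automatically),
    # then read out whichever residue's angle matches best.
    z = np.exp(1j * angle(a)) * np.exp(1j * angle(b))
    residues = np.exp(1j * angle(np.arange(p)))
    return int(np.argmax((z * residues.conj()).real))

assert mod_add_via_rotation(40, 70) == (40 + 70) % p
```

the reason this computes mod-p addition for free: angle(a) + angle(b) wraps around the circle at exactly the period p, so the sum of rotations lands on the point for (a + b) mod p without any explicit modulo. that's the compressed solution the network was building while the loss curves read flat.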
three things i keep coming back to.
the metric lied. test loss was flat. train loss was zero. from the outside, nothing was happening. inside, a whole new algorithm was being constructed. a lot of learning, in models and in people, looks like this: long stretches where the instruments read flat, then a phase transition. if your only instrument is a loss curve you will misread every one of those stretches.
memorization isn’t the opposite of understanding. it’s a substrate. the network had to get the training set right first by brute force. only then was there enough capacity and gradient signal to start building a compressed solution on top. the two modes aren’t rivals so much as one is scaffolding for the other.
this shape shows up in finance too. i worked on ml price prediction last summer. you can fit a model beautifully on historical data, ship it, watch it outperform for months, and then get run over by a regime that wasn’t in your training set. from the outside it looks like the model broke. from the inside, it never generalized. it had memorized a regime and you just hadn’t caught it yet. whatever the domain, the question is the same: has the model built something that compresses the world, or just photographed it?
i don’t think grokking is the final picture of how learning works. but the shape of the result, that the interesting part of training can be invisible to any metric you’d naturally log, has stayed with me longer than most papers i’ve read. it’s the cleanest empirical argument i know for why mechanistic interpretability is worth doing. you can’t align what you can’t see, and you can’t see anything if your only window is the loss.