We’ve long assumed that LLMs are “frozen” after training: if you want them to learn something new, you have to fine-tune. But Dario Amodei recently hinted that Anthropic is experimenting with models that can “learn” inside 100M-token contexts without ever updating weights. That sounded mysterious, until Google’s new paper “Learning without training” seemed to pull back the curtain.
Here’s the core trick:
1. Self-attention extracts a “delta.”
Feed a query x alone and you get activation A(x). Feed x with context C and you get A(C,x). The difference ΔA = A(C,x) – A(x) is a compact summary of what the context “teaches” the model.
2. MLP turns this delta into a rank-1 sticky note.
When ΔA passes through the MLP, the math is exactly equivalent to adding a rank-1 update ΔW = (W·ΔA)·A(x)ᵀ / ||A(x)||² to the weight matrix W. For a 4,096×4,096 MLP (≈17M numbers), this update only needs two 4,096-dim vectors (≈8k numbers)—a 2,000× compression.
3. Temporary adaptation, no trace left.
During inference, running {W + ΔW, x} gives the same output as {W, [C, x]}. The “sticky note” ΔW is discarded after the forward pass: the base model never changes, but for one moment it acts as if it had been fine-tuned for your context. (A toy numeric check of all three steps follows after this list.)
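Here is that check as a minimal NumPy sketch, with toy dimensions and random vectors standing in for real activations (this is an illustration of the identity, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy stand-in for the 4,096-dim hidden size mentioned above

# Random stand-ins (assumptions, not the paper's actual activations):
# W    : the MLP weight matrix
# A_x  : attention output for the query x alone
# A_cx : attention output for the query x with context C prepended
W = rng.standard_normal((d, d))
A_x = rng.standard_normal(d)
A_cx = rng.standard_normal(d)

# Step 1: the "delta" the context adds to the query's activation.
delta_A = A_cx - A_x

# Step 2: fold that delta into a rank-1 weight patch,
# dW = (W @ delta_A) · A_x^T / ||A_x||^2.
dW = np.outer(W @ delta_A, A_x) / np.dot(A_x, A_x)

# Step 3: the patched weights on the bare query reproduce the original
# weights on the context-conditioned activation; dW is then discarded.
out_with_context = W @ A_cx             # {W, [C, x]}
out_with_sticky_note = (W + dW) @ A_x   # {W + dW, x}

print(np.allclose(out_with_context, out_with_sticky_note))  # True
print(f"full matrix: {W.size} numbers, rank-1 patch: {2 * d} numbers")
```

Because dW factorizes into the two vectors W @ delta_A and A_x, those two vectors are all you would ever need to keep, which is where the roughly 2,000× compression in step 2 comes from.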
This looks a lot like LoRA, but with key differences: LoRA injects low-rank adapters that are learned and stored offline, while Google’s “sticky note” is computed at runtime, costs zero extra storage, and vanishes instantly. LoRA is like a notebook you carry around; the sticky note is a Post-it you jot down and toss.
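A rough sketch of the contrast, under the same toy assumptions as above (the rank r is arbitrary, and B and A_lora are random placeholders for adapter weights that would normally come from training):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 8, 2  # toy hidden size and LoRA rank (illustrative only)

W = rng.standard_normal((d, d))

# LoRA: low-rank factors are trained offline, saved to disk, and reused
# unchanged on every request.
B = rng.standard_normal((d, r))
A_lora = rng.standard_normal((r, d))
W_with_lora = W + B @ A_lora       # 2*d*r adapter parameters persist

# Sticky note: the rank-1 patch is derived from this request's activations,
# applied for one forward pass, then thrown away.
A_x, A_cx = rng.standard_normal(d), rng.standard_normal(d)
dW = np.outer(W @ (A_cx - A_x), A_x) / np.dot(A_x, A_x)
W_with_sticky_note = W + dW        # nothing is stored between requests
```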
Why does this matter? If context can imprint such efficient updates, scaling context windows may outpace continual fine-tuning. Combine this with retrieval, external memory, or LoRA-style caches, and you get a spectrum of “learning without training.” The open challenge is how to chain these ephemeral updates into true long-term adaptation. As Dario and Altman push toward $100k/month models with 100M-token windows, the future may belong to those who can stack and orchestrate these tricks, unless open research keeps the playing field level.
Paper definitely worth reading: arxiv.org/abs/2507.16003