Breaking the Amnesia Cycle in Large Sequence Models
- Juan Manuel Ortiz de Zarate

The past decade of machine learning progress has often felt like a competition in verticality. Researchers stack layers the way ancient civilizations piled stones: one atop another, reaching ever skyward in the hope that height equals power. This strategy has delivered stunning results, including systems whose emergent behavior resembles learning. Even so, stacking layers is not a universal key; it is the silhouette of a deeper mechanism. Nested Learning (NL) [1] is a framework that lets us peer past that silhouette. It proposes splitting neural architectures into multiple optimization "loops" that evolve at different frequencies. The essence is not adding more layers, but acknowledging that learning consists of multiple levels of updates interacting across timescales. NL argues that popular architectures are the result of collapsing these levels into a single outer loop, which masks the inner mechanics and creates the illusion of depth without necessarily increasing computational or memory expressivity. This reimagining gives us a richer view of what learning systems are actually doing under the surface.
Sequence modeling architectures like standard Transformers [3] fuse information along token sequences using self-attention, while a feed-forward network handles feature-level fusion and acts as persistent storage. But once training ends, those persistent parameters freeze like amber, incapable of absorbing new information without catastrophic forgetting [6,7,8] if updated aggressively. NL challenges us to view these "outer loops" of training as shallow shadows of a multi-loop balance. In this picture, optimizers like SGD with Momentum or Adam are not mere adjustment routines; they are families of update-driven memories that compress gradient trajectories. NL's contribution is showing explicitly how these memories can be stacked, driven by data surprise, and optimized by inner loops that update at higher frequencies than the outer loops. This perspective has huge practical implications for building more resilient learning systems that don't treat the present as ephemeral and unsavable.

In classical deep learning, given parameters θ and a dataset Dtrain, the aim is to optimize the function L(θ), and optimization happens through a gradient-based update. But if one stops thinking of L as the only loop and views gradients themselves as the outputs of lower-level memory modules, one realizes that the optimizer has its own inner objective. For instance, SGD with Momentum combines a weight update with a meta-memory update: the momentum vector mi accumulates past gradient contributions while exponentially decaying them. In the NL view, mi is a memory parameter updated by an internal loss that measures how compressed embeddings of ∇L(Wi; xi) align with a preconditioner Pi. This Pi could encode surrogate information about curvature or surprise values, making the memory more expressive than vanilla key-less accumulation. A delta-rule-style regression objective on the m updates can memorize a broader spectrum of past gradients. New ways of building optimizers, such as deep momentum gradient descent and non-linear output transformations thereof, suggest that none of the popular named optimizers are truly single-loop adjustments. They are multi-loop gradient memories projected onto shallow update rules. NL doesn't just name this fact; it gives a hierarchy (or at least a partial order) over components based on their update frequency fA, measured in updates per data sample. Faster components feed slower components. This transparent ordering enables architectural decomposition by letting attention, gradient memory, and projection layers live on different "levels" of learning, each with its own gradient pathway.
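To make the two coupled updates concrete, here is a minimal NumPy sketch of SGD with momentum written as an explicit pair of loops: an inner "memory write" that compresses the gradient history, and an outer "memory read" that moves the weights. The function name and hyperparameters are illustrative, not from the paper.

```python
import numpy as np

def sgd_momentum_step(w, m, grad, lr=0.01, beta=0.9):
    """One SGD-with-momentum step, written as two nested updates.

    In the NL reading (an interpretation, not the paper's exact code):
    - the inner, faster update writes the fresh gradient into the
      momentum "memory" m, exponentially decaying older entries;
    - the outer, slower update reads that memory to move the weights.
    """
    m = beta * m + grad   # inner loop: compress the gradient history
    w = w - lr * m        # outer loop: read the memory to update weights
    return w, m

# With a constant gradient g, the memory m converges to g / (1 - beta):
w, m = np.zeros(2), np.zeros(2)
for _ in range(500):
    w, m = sgd_momentum_step(w, m, np.ones(2))
print(m)  # approaches [10. 10.] since 1 / (1 - 0.9) = 10
```

Seen this way, the momentum buffer really is an associative memory with a trivial (key-less) write rule, which is exactly what NL proposes to generalize.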
This leads to a view of architecture as a system, not a stack. Networks like Transformers, RNN successors[4], or Titans [5] can be reinterpreted as degenerate k=1 cases of a more general chain MLP(f1)(x)…MLP(fk)(x) that updates every Ci. Here, Ci is a chunk size–a function of block ℓ’s update frequency. Blocks at higher index ℓ with slower update frequency become compressions of broader context flows. Doing this decomposition carefully reveals insight into how computational depth, abstraction level, and memory capacity truly evolve. Updating a projection layer every data sample is faster than updating a pretraining-anchored CMS block only every Ci, which itself is faster than modulating self-attention parameters only within the context window. The result is a chain where blocks compress context flow at different layers and timescales and, crucially, can be interpreted in the frequency domain. This is brain-wave inspired re-parameterization territory: the hippocampus is high update frequency, the cortex is low update frequency, and reuse of structure is fundamental.
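The chunked scheduling above can be sketched in a few lines, assuming hypothetical blocks whose only state is an update counter: each level has its own chunk size Ci and steps only when the global step counter divides it. This is an illustrative toy, not the paper's CMS implementation.

```python
# Frequency-layered training loop (toy sketch): each block updates only
# every `chunk_size` steps, so lower-index blocks run at a faster clock.

class Block:
    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.updates = 0

    def maybe_update(self, step):
        if step % self.chunk_size == 0:
            self.updates += 1  # stand-in for a real gradient step

blocks = [Block(1), Block(16), Block(256)]  # fast -> slow levels
for step in range(1, 1025):
    for b in blocks:
        b.maybe_update(step)

print([b.updates for b in blocks])  # [1024, 64, 4]
```

The counts make the timescale separation explicit: the fastest block sees every sample, while the slowest consolidates only four times over the same stream.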
One fascinating analogy in the paper is that human anterograde amnesia locks cognition into a looping present, where short-term memory is intact but nothing new sticks beyond it. This echoes the behavior of Transformers after training: the attention mechanism adapts over the inference context window, but the feed-forward (MLP) long-past weights never budge in response to fresh tokens or sequences. So the model is trapped between a sliver of “now” and a fossil record of “before training ended,” with no machinery for incremental persistent update without an external system. NL posits a cure for this amnesia-like stasis through multiple nested loops that update their own context encoder blocks. This paves a possible path toward breaking free from static architectures by adding another trainable dimension beyond standard depth/width.
The Core Definitions
To build NL’s intuition formally, one defines “update frequency” as the number of updates a component performs per data step. If component A updates more frequently than B, we say A ≻ B. If they update at the same frequency but B’s state at time t depends on A’s computation, we also say A ≻ B. Otherwise, A and B sit in the same frequency level. In other words, for components Mθ and Wk, Wq, Wv, etc., one can distinguish nested levels by W^(ℓ). This is not superfluous notation. It captures the heart of the idea:
Each component might have its own loss and gradient path. Learning is a choreography, not a monument. Optimizers memorize gradients. Architectures compress “context flow,” whether parametric or not.
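As a toy illustration of the frequency ordering, here is a sketch that groups components into levels by their update frequency; the component names and frequency values are hypothetical, chosen to mirror the examples above.

```python
# Toy illustration of the A ≻ B ordering: group components into levels
# by update frequency (updates per data step), fastest level first.
# Names and numbers are hypothetical.
freqs = {"attention_kv": 1.0, "projection": 1.0,
         "cms_block": 1 / 16, "pretrained_mlp": 1 / 100_000}

def levels(freqs):
    """Return components grouped by frequency, fastest level first."""
    ordered = sorted(set(freqs.values()), reverse=True)
    return [[c for c, f in freqs.items() if f == lvl] for lvl in ordered]

print(levels(freqs))
# [['attention_kv', 'projection'], ['cms_block'], ['pretrained_mlp']]
```

Components sharing a level (here the attention K-V state and the projections) are only further ordered if one's state depends on the other's computation.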

When the target is sequence modeling, the paper opts for MLP layers with residual connections and different update frequencies W^(ℓ). The point is not the architecture specifics; it’s the paradigm. The entire model becomes a multi-lane highway of gradients where each lane runs at its own clock speed. One lane can be optimized internally while others wait for broader contextual chunk updates. NL advocates for more “levels,” yielding higher-order in-context learning capabilities.
Deeper Computational Depth Is Not Guaranteed by More Layers
Many scale-based claims implicitly assume computational depth grows with layer count. But theoretical results on saturated networks show that some architectures, even deeper ones, collapse into constant-depth threshold circuits. Layers can add parameters without adding depth to computations. NL reframes this as a single optimization loop compressing the inner relations of data and gradient surprises into a shallow mapping (weights at ℓ=1). The illusion of depth is partly computational flattening.
The computational benefits or emergence of in-context learning behavior after scale might look like true adaptation, but deeper parsing reveals that the attention block provides the only adaptable memory window. MLP layers do not accumulate new knowledge post-deployment. Recency is always external unless the architecture itself internalizes memory consolidation with slower loops.
This insight also dispels some illusions about marginal parameter gains. Depth or width alone might not improve capacity for functions that need more complex algorithmic representations. Scalability without nested loops is like owning 10,000 books and no new shelves: they just pile on the floor.
NL vs Traditional Deep Learning
NL splits learning into:
- Outer-loop pretraining (slowest).
- Projection layers updated with surprise signals (fast).
- Attention or test-time K-V updates (ephemeral, within the context window).
- New nesting loops such as CMS blocks (intermediate).
- Error-preconditioning modules Pi feeding delta-rule updates (inner loops).
This contrasts with classical approaches that flatten everything into stacked gradient descent without the ability to differentiate timescales or treat optimizer states as associative memory.

With the collapse, Transformers appear as a 1-loop flattening of context-flow compression systems. Titans add one additional trainable dimension as a test-time memory update block, but this is still a shallow memory system without a continuum of timescales. NL describes the unflattened representation. It is white-box by construction.
Optimizers: The Hidden Librarians
In NL, optimizers become self-training memory modules, compressing gradient information. The paper states that SGD with momentum is the outcome of optimizing an internal dot-product objective (a similarity projection) over compressed gradient associations. Adam, with a small modification, is likewise an optimal associative memory block over gradients (Appendix C.4), without backpropagating through layer-level projections. NL shows that all these optimizers implicitly use flat memory modules.
The cool part: if Pi encodes Hessian or curvature-surprise mismatch, momentum can recreate Hebbian-like update rules or delta-based memorization via linear projection. NL advocates that broader preconditioner choices can enrich memory associations. For instance, Pi can run at an update frequency independent of the inner loops and feed gradients into an MLP stacked on top of it.
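A minimal sketch of the delta-rule idea, assuming a linear associative memory M that learns to map a key to a gradient; the preconditioner Pi is abstracted into the choice of key, so this is an illustration of the mechanism, not the paper's exact formulation.

```python
import numpy as np

def delta_rule_write(M, k, g, eta=0.5):
    """Delta-rule write: M <- M - eta * (M k - g) k^T.

    Unlike momentum's key-less accumulation m <- beta*m + g, the memory
    matrix M is trained to *recall* the gradient g when queried with the
    key k, so the update is driven by recall error (surprise)."""
    err = M @ k - g                 # surprise: how wrong the recall is
    return M - eta * np.outer(err, k)

rng = np.random.default_rng(0)
d = 4
M = np.zeros((d, d))
k = np.array([1.0, 0.0, 0.0, 0.0])  # unit key for clean recall
g = rng.normal(size=d)              # a "gradient" to memorize
for _ in range(20):
    M = delta_rule_write(M, k, g)
print(np.allclose(M @ k, g, atol=1e-4))  # memory recalls g from k
```

With orthogonal keys, such a memory can store several distinct gradients and retrieve each one, which is what makes it more expressive than a single decaying accumulator.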
Newton-Schulz transforms help with non-linear frequent updates from the inner loops. Comparing with Muon, one sees that applying Newton-Schulz to a matrix-valued momentum m adds a polynomial non-linearity on top of the linear gradient-compression module.
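For concreteness, here is the classic cubic Newton-Schulz iteration, which drives a matrix's singular values toward 1 (an orthogonalized update). Muon itself uses a tuned quintic variant, so treat this as an illustrative stand-in rather than Muon's actual code.

```python
import numpy as np

def newton_schulz_orth(G, steps=40):
    """Cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X X^T X.

    Starting from G scaled so its singular values lie in (0, 1), the
    iteration pushes every singular value toward 1, i.e. it returns an
    (approximately) orthogonal matrix with the same singular vectors.
    This is the non-linearity applied on top of the linear momentum
    memory in Muon-style optimizers."""
    X = G / np.linalg.norm(G)  # Frobenius scaling keeps singular values < 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 4))    # stand-in for a gradient/momentum matrix
X = newton_schulz_orth(G)
print(np.allclose(X @ X.T, np.eye(4), atol=1e-5))  # X is near-orthogonal
```

The design point: the iteration uses only matrix multiplications, so this "non-linear read-out" of the gradient memory stays cheap and GPU-friendly.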
Thus, the optimizer isn’t just updating weights. It’s compressing the gradient story into a meta-memory vector, the way a cave painting compresses a whole hunt onto one wall.
Self-Modifying Titans: A Bigger Titan Brain
Titans, a test-time memory extension, add a trainable memory loop to the information inflow. These modules generate projections Kt, Vt, Qt (weight/dot-product projections) whose gradients do not back-propagate through the memory Mt(·); instead, the memory lives in a separate loop. That yields an architectural decomposition where storage is not flattened into a single long-term MLP memory.
Titans solve test-time memorization but do not consolidate into slow weights across chunk intervals Ci. HOPE builds on this, but with new continuum MLP layers for 340M, 760M, 1.3B parameter scales.
Self-modifying Titans learn to alter their own update rules. The NL perspective reveals that projection matrices can be trained exclusively, while the update network itself remains a separate layer that might update at independent timescales, effectively decoupling gradients.
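A hedged sketch of a Titans-style test-time update, assuming a linear memory M trained online on a squared-error surprise loss, with frozen key/value projections that receive no gradients from the memory loop. Shapes, rates, and the single repeated token are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
Wk, _ = np.linalg.qr(rng.normal(size=(d, d)))  # frozen key projection
Wv = rng.normal(size=(d, d)) / np.sqrt(d)      # frozen value projection
M = np.zeros((d, d))                           # fast test-time memory

def memorize(M, x, lr=0.1):
    """Inner-loop step: gradient descent on the surprise loss
    0.5 * ||M k - v||^2, updating only M (projections stay frozen)."""
    k, v = Wk @ x, Wv @ x
    grad = np.outer(M @ k - v, k)  # gradient of the surprise loss w.r.t. M
    return M - lr * grad

x = np.ones(d) / np.sqrt(d)        # one repeated test-time token
for _ in range(200):
    M = memorize(M, x)
k, v = Wk @ x, Wv @ x
print(np.allclose(M @ k, v))       # memory now recalls v from k
```

A self-modifying variant would additionally learn the update rule itself (e.g. the learning rate or the form of `memorize`) at a slower timescale, which is the decoupling the NL perspective makes explicit.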
This is spicy for meta-learning fans. NL suggests self-modifying update flows.
HOPE Architecture
HOPE (a combination of self-modifying Titans and continuum memory MLP blocks optimized with NL gradient flows) performs significantly better than standard Transformers and recent state-space models on both perplexity and common-sense tasks. In the reported 760M/30B-token setup, HOPE achieves 26.05 perplexity on Wiki LMB and 60.12 on BoolQ, outperforming Transformer++ (48.69 avg). At 1.3B/100B tokens, HOPE reaches 57.23 avg, beating models such as RetNet, SHAMPOO, SOAP, Samba, and Titans LMM (56.82 avg). The NL framework thus offers a path to enhancing sequence modeling beyond static (collapsed) flows.
Performance snippet: lower perplexity, stable MLP memory, self-modifying update patterns.
Experiments and Datasets
Due to NeurIPS space constraints, the paper places only part of the results in the main table. Appendix G includes details on additional datasets, optimizers and ablation studies.
Language modeling datasets include WikiText, LMBench, HellaSwag, Winogrande, ARC, and other reasoning benchmarks. Transformer family baselines include Transformer++, RetNet, DeltaNet, TTT, and Samba.
HOPE shows consistent scaling across contextual update schedules.
Brain-Wave Frequencies Inspiration
NL borrows inspiration from neurophysiology. Cortical memory consolidates knowledge slowly, while the hippocampus updates faster. The partial order and multiple update timescales let us treat models like brains. In this picture, feed-forward blocks update per chunk Ci, creating frequency-layered associative updates.
Standard deep learning collapses this entire nested frequency cascade into a single loop, losing internal layering that essentially helps different abstraction levels update at different frequencies.

This nested learning is neuroscientifically plausible and mathematically white-box by design.
Algorithmic Invariance and Parameter-Level Marginal Gains
The paper also touches on invariance. Multilayer parameter loops compress different levels of abstraction. More layers do not guarantee better algorithmic manipulations.
If the model is allowed to re-parametrize its small loops, it can memorize implementations of weight updates and gradient-compression associations across multiple loops. NL suggests that deeper momentum plus non-linear projections can capture richer, more complex gradient histories.
Parameter subsets that represent complex key-value mappings or algorithmic learning processes optimize local surprise signals. But user-level adaptation to new topics or frontier knowledge is not guaranteed by more layers.
Final Thought
Architectural expressivity without incremental, frequency-layered gradient orchestration is like giving a brain a million neurons but only 40 synapses. It behaves smart until you ask for new memories. NL gives a picture of the gradient as memory, the optimizer as a meta-learning memory, and MLP layers as cortical memory blocks rarely updated, once per chunk Ci. This is how a brain might improve gradually without forgetting stable knowledge, or at least that's the theory.
Evolutionary and self-referential learning modules like HOPE might lead the frontier of making models more plastic with academic and engineering synergy.
The universe of neural nets is weirder than a zoo breach. NL encourages loving the weird, loving the truth, and designing architectures that don’t collapse multi-timescale gradient updates into a 2D pancake.
Science is the highest possible eval form. Philosophically, incremental learning modules like HOPE draw from neuroscience consolidation patterns. This reparametrization encourages capturing mismatches via LSS vectors, optimizing memory vectors at inner loops instead of flattening everything.
Removing the flattening mask reveals meta-memory flows that can be ordered and decomposed carefully.
References
[1] Behrouz, A., Razaviyayn, M., Zhong, P., & Mirrokni, V. (2025). Nested Learning: The Illusion of Deep Learning Architectures. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
[2] Abhang, P. A., Gawali, B. W., & Mehrotra, S. C. (2016). Introduction to EEG-and speech-based emotion recognition. Academic Press.
[3] The Architecture That Redefined AI, Transcendent AI
[4] Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020, November). Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning (pp. 5156-5165). PMLR.
[5] Behrouz, A., Zhong, P., & Mirrokni, V. (2024). Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663.
[6] Eyuboglu, S., Ehrlich, R., Arora, S., Guha, N., Zinsley, D., Liu, E., ... & Re, C. (2025). Cartridges: Lightweight and general-purpose long context representations via self-study. arXiv preprint arXiv:2506.06266.
[7] Cheng, T., Wang, Y., He, W., Wang, Q., Cheng, Y., Zhang, Y., ... & Zhang, X. FineMedLM-o1: Enhancing Medical Knowledge Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training. In Second Conference on Language Modeling.
[8] Akyürek, E., Damani, M., Zweiger, A., Qiu, L., Guo, H., Pari, J., ... & Andreas, J. (2024). The surprising effectiveness of test-time training for few-shot learning. arXiv preprint arXiv:2411.07279.