Large language models (LLMs) based on the Transformer architecture often spend computation inefficiently, particularly on repetitive knowledge lookup. While mechanisms like attention and Mixture-of-Experts (MoE) scale computation, they frequently re-process identical local patterns, consuming valuable depth and floating-point operations (FLOPs). DeepSeek AI's latest innovation, Engram, aims to bridge this gap by integrating a conditional memory axis that complements existing MoE structures.
Introducing Engram: A Parametric Memory Solution
At its core, Engram reimagines traditional N-gram embeddings, transforming them into a highly scalable, constant-time (O(1)) lookup memory. The module integrates into the Transformer backbone as a dedicated repository for static information such as common phrases and factual entities. By offloading these routine retrievals, the Transformer's main processing units can concentrate on more demanding tasks, including sophisticated reasoning and managing long-range dependencies within text.
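To make the O(1) lookup concrete, here is a minimal sketch of the general idea: a token N-gram is hashed into a fixed-size embedding table, so retrieval cost does not depend on how many patterns the table effectively covers. The table size, hash function, and embedding dimension below are illustrative choices, not values from the paper.

```python
import numpy as np

# Minimal sketch of an O(1) hashed N-gram lookup.
# Table size, hash, and dimensions are illustrative, not from the paper.
TABLE_SIZE = 99_991                     # a prime number of buckets
EMBED_DIM = 128
table = (np.random.randn(TABLE_SIZE, EMBED_DIM) * 0.02).astype(np.float32)

def ngram_embedding(token_ids: list[int]) -> np.ndarray:
    """Hash an N-gram of token ids into a single row of the embedding table."""
    h = 0
    for t in token_ids:
        h = (h * 131_071 + t) % (1 << 61)   # simple polynomial rolling hash
    return table[h % TABLE_SIZE]

# Example: retrieve the memory slot for the most recent bigram.
context = [1042, 77, 9311]                  # token ids from the tokenizer
bigram_vec = ngram_embedding(context[-2:])
print(bigram_vec.shape)                     # (128,)
```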
The proposed system uses the DeepSeek V3 tokenizer with a 128,000-entry vocabulary and is pre-trained on a 262-billion-token dataset. The underlying architecture is a 30-block Transformer with a hidden size of 2560, incorporating Multi-head Latent Attention (32 heads) and Manifold-Constrained Hyper-Connections for its feed-forward networks, and is optimized with the Muon optimizer.
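For reference, the reported backbone hyperparameters can be collected into a simple configuration sketch. The field names are illustrative; the values restate the figures above.

```python
from dataclasses import dataclass

@dataclass
class BackboneConfig:
    """Reported backbone setup; field names are illustrative, values restate the article."""
    vocab_size: int = 128_000            # DeepSeek V3 tokenizer
    num_layers: int = 30                 # Transformer blocks
    hidden_size: int = 2560
    num_attention_heads: int = 32        # Multi-head Latent Attention
    pretrain_tokens: int = 262_000_000_000
    optimizer: str = "muon"

print(BackboneConfig())
```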
Technical Integration and Architectural Design
Engram integrates as a sparse embedding module built from hashed N-gram tables. Its design includes multi-head hashing into prime-sized buckets, a compact depth-wise convolution over the N-gram context, and a context-aware gating scalar (ranging from 0 to 1) that controls how much of each retrieved embedding is injected into the processing branch.
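A hedged sketch of how such a module might be wired together follows. The class name, bucket primes, kernel size, and tensor shapes are assumptions chosen for illustration (the default head count and dimension echo the Engram-27B figures cited below); only the overall structure, that is, multi-head hashed tables, a depth-wise convolution over the local context, and a sigmoid gate on the injected memory, mirrors the description above.

```python
import torch
import torch.nn as nn

class EngramSketch(nn.Module):
    """Illustrative sketch of a hashed N-gram memory with gated injection.

    Prime bucket counts, kernel size, and shapes are assumptions;
    only the overall structure mirrors the description in the text.
    """
    def __init__(self, num_heads: int = 8, dim: int = 1280,
                 primes=(10007, 10009, 10037, 10039, 10061, 10067, 10069, 10079)):
        super().__init__()
        assert len(primes) == num_heads
        self.head_dim = dim // num_heads
        # One prime-sized bucket table per hash head.
        self.tables = nn.ModuleList([nn.Embedding(p, self.head_dim) for p in primes])
        # Compact depth-wise convolution over the local N-gram context (causal).
        self.dwconv = nn.Conv1d(dim, dim, kernel_size=3, padding=2, groups=dim)
        # Context-aware gate in [0, 1] controlling how much memory is injected.
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, ngram_hashes: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # ngram_hashes: (batch, seq, num_heads), each entry already reduced mod its prime
        # hidden:       (batch, seq, dim) hidden states of the host layer
        heads = [tbl(ngram_hashes[..., i]) for i, tbl in enumerate(self.tables)]
        mem = torch.cat(heads, dim=-1)                                  # (batch, seq, dim)
        mem = self.dwconv(mem.transpose(1, 2))[..., : hidden.size(1)]   # causal trim
        mem = mem.transpose(1, 2)
        g = self.gate(hidden)                                           # (batch, seq, 1)
        return hidden + g * mem                                         # gated injection

# Example forward pass with random hashes (shapes only).
module = EngramSketch()
hashes = torch.randint(0, 10007, (2, 16, 8))
hidden = torch.randn(2, 16, 1280)
print(module(hashes, hidden).shape)   # torch.Size([2, 16, 1280])
```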
In larger models, such as Engram-27B and Engram-40B, the Engram module shares its Transformer backbone with the MoE-27B architecture. While MoE-27B uses 72 routed experts alongside two shared experts, Engram-27B reallocates some of these resources: it reduces the routed experts to 55 and dedicates the freed parameters to a 5.7 billion-parameter Engram memory, keeping the total parameter count at 26.7 billion. The Engram module uses N values of {2, 3}, employs 8 Engram heads, and has a dimension of 1280; it is inserted at layers 2 and 15. The Engram-40B variant expands this memory further to 18.5 billion parameters without altering the number of activated parameters.
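As a back-of-envelope check on this reallocation, the 17 routed experts removed from MoE-27B roughly correspond to the 5.7 billion parameters added as Engram memory, implying on the order of 0.3 billion parameters per routed expert. This is an inference from the figures above, not a number stated in the source.

```python
# Back-of-envelope reallocation check (inferred from the reported figures, not stated in the source).
moe_routed_experts = 72
engram_routed_experts = 55
freed_experts = moe_routed_experts - engram_routed_experts    # 17 routed experts removed
engram_memory_params = 5.7e9                                  # reported Engram memory size
implied_params_per_expert = engram_memory_params / freed_experts
print(f"~{implied_params_per_expert / 1e9:.2f}B parameters per routed expert (implied)")
```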
Optimizing Sparsity Allocation
A crucial design challenge involves strategically distributing the sparse parameter budget between routed experts and conditional memory. This is formally defined as the Sparsity Allocation problem, where the allocation ratio ρ denotes the fraction of inactive parameters assigned to MoE experts. Pure MoE models exhibit a ρ of 1. Decreasing ρ signifies reallocating parameters from experts to Engram slots.
Experiments on mid-scale 5.7 billion and 9.9 billion parameter models reveal a U-shaped validation-loss curve when sweeping ρ, with an interior optimum. Engram models consistently match the performance of pure MoE baselines even when ρ is reduced to approximately 0.25, effectively halving the number of routed experts. The most favorable performance occurs when roughly 20 to 25 percent of the sparse budget is dedicated to Engram, demonstrating a robust balance between conditional computation and conditional memory under fixed sparsity constraints.
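In code, the allocation ratio reduces to a one-line formula. The sketch below follows the definition given above; the function and variable names are illustrative.

```python
def sparsity_allocation_ratio(inactive_expert_params: float, engram_params: float) -> float:
    """rho: fraction of the inactive (sparse) parameter budget assigned to MoE experts.

    rho = 1.0 recovers a pure MoE model; lowering rho moves budget into Engram slots.
    Follows the definition given in the text; names are illustrative.
    """
    sparse_budget = inactive_expert_params + engram_params
    return inactive_expert_params / sparse_budget

# Example: dedicating ~25% of the sparse budget to Engram corresponds to rho = 0.75.
print(sparsity_allocation_ratio(inactive_expert_params=3 * 5.7e9, engram_params=5.7e9))
```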
Impressive Performance Gains Across Benchmarks
Large-scale pre-training comparisons involved four distinct models, all trained on the same 262 billion token curriculum with 3.8 billion activated parameters. These included Dense 4B (4.1B total parameters), MoE 27B, Engram 27B (both 26.7B total parameters), and Engram 40B (39.5B total parameters).
Engram-27B and Engram-40B consistently surpassed the MoE-27B baseline across various metrics. On the Pile test set, Engram-27B achieved a language modeling loss of 1.960, a significant improvement over MoE-27B's 2.091. Engram-40B further reduced this to 1.942. Similar improvements were observed on internal validation sets.
Performance on knowledge and reasoning benchmarks also saw marked gains over MoE-27B. On MMLU, Engram-27B scored 60.4 versus the baseline's 57.4; CMMLU increased from 57.9 to 61.9, and C-Eval from 58.0 to 62.7. Reasoning tasks improved as well: ARC Challenge from 70.1 to 73.8, BBH from 50.9 to 55.9, and DROP F1 from 55.7 to 59.0. Code and math capabilities also benefited, with HumanEval rising from 37.8 to 40.8 and GSM8K from 58.4 to 60.6.
Enhanced Long-Context Comprehension and Mechanistic Insights
After pre-training, the context window was extended to 32,768 tokens using YaRN for 5,000 steps on 30 billion high-quality long-context tokens. Engram-27B matched or exceeded MoE-27B on LongPPL and RULER benchmarks at 32k context, with equal or fewer pre-training FLOPs. Notably, Engram-27B showed significant improvements on Multi-Query Needle-in-a-Haystack (99.6 vs. 73.0) and QA (44.0 vs. 34.5).
Mechanistic analysis, employing tools like LogitLens and Centered Kernel Alignment (CKA), revealed that Engram variants achieved prediction readiness earlier in the model's layers. CKA similarity maps demonstrated that early Engram layers aligned effectively with much deeper layers in the MoE baseline, indicating that Engram efficiently increases the effective depth of the model by handling static reconstructions via memory lookup.
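For readers unfamiliar with CKA, linear CKA between two layers' activation matrices is straightforward to compute. The sketch below uses the standard linear formulation (Kornblith et al., 2019) rather than the paper's exact analysis pipeline, and the layer pairing in the example is purely illustrative.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between activations X (n, d1) and Y (n, d2).

    Standard linear CKA (Kornblith et al., 2019); shown only to illustrate the
    kind of layer-wise similarity comparison described above.
    """
    X = X - X.mean(axis=0, keepdims=True)   # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(hsic / (norm_x * norm_y))

# Example: compare an early layer of one model with a deeper layer of another
# on the same batch of token representations (random data stands in here).
rng = np.random.default_rng(0)
acts_engram_layer2 = rng.standard_normal((512, 2560))
acts_moe_layer10 = rng.standard_normal((512, 2560))
print(linear_cka(acts_engram_layer2, acts_moe_layer10))
```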
Key Findings
- Engram integrates a conditional memory axis into sparse LLMs, enabling O(1) hashed lookup for frequent N-gram patterns and entities.
- Reallocating 20-25 percent of the sparse capacity from MoE experts to Engram memory proves optimal, underscoring the complementary nature of conditional memory and computation.
- Engram-27B and Engram-40B, with identical activated parameters, significantly outperform MoE-27B across language modeling, knowledge, reasoning, coding, and mathematical benchmarks.
- The system demonstrates improved long-context performance, matching or surpassing baselines on perplexity and RULER scores, even with comparable or less computational effort.
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost