Memorization, Generalization, and Reasoning
Not understanding how AI models actually memorize, generalize, and reason is costing us. We’re chasing the wrong problems instead of building real solutions.
How much do models actually memorize?
Meta’s research “How much do language models memorize?” [3] provides crucial insights. They distinguished between unintended memorization (sample-specific storage) and intended memorization (generalization).
Intuition
Neural networks possess a fundamental information storage capacity that governs the trade-off between memorization (storing specific training examples) and generalization (learning compressible patterns). The study found that each parameter in the GPT-family models examined can store approximately 3.6 bits of information [3], creating a hard constraint on what the network can remember versus what it must compress. This capacity limitation forces networks to choose between perfect recall of training data and the discovery of generalizable patterns, a trade-off that explains numerous phenomena in deep learning, from double descent to emergent abilities.
Information Theory Foundations
Neural network memorization is fundamentally grounded in information theory. Understanding memorization requires precise definitions of information, compression, and capacity:
Shannon Entropy quantifies the average information content of a random variable. For a discrete random variable X with possible values x and probability mass function p(x):

H(X) = −Σ_x p(x) log₂ p(x)

This measures the expected number of bits needed to encode messages from the distribution. Higher entropy indicates more randomness and less predictability.
Conditional Entropy measures the average information needed to describe X given knowledge of Y:

H(X | Y) = −Σ_{x,y} p(x, y) log₂ p(x | y)

Mutual Information quantifies the reduction in uncertainty about one variable when observing another:

I(X; Y) = H(X) − H(X | Y) = H(Y) − H(Y | X)

In the context of memorization, I(D; θ̂) measures how much information the model parameters θ̂ contain about the training data D.
Kolmogorov Complexity K(x) defines the shortest possible description length of a string x using a universal Turing machine. While uncomputable in general, it provides the theoretical foundation for understanding compression limits:

K(x) = min { |p| : U(p) = x }

where U is a universal Turing machine and |p| is the length of program p.
Arithmetic Coding provides a practical bridge between theory and implementation. Unlike Huffman coding, which assigns each symbol a whole number of bits, arithmetic coding maps entire sequences to subintervals of [0, 1), achieving compression rates approaching the entropy limit. For a sequence x₁, …, xₙ under a model with probability p, the code length is approximately:

ℓ(x₁…xₙ) ≈ −log₂ p(x₁, …, xₙ) = −Σ_i log₂ p(xᵢ | x₁, …, xᵢ₋₁)
This connection allows us to estimate Kolmogorov complexity using language model probabilities, forming the basis for measuring memorization.
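To make this concrete, here is a minimal sketch of the compression-probability connection: the arithmetic-coding length of a string is estimated as the sum of per-symbol negative log-probabilities. The `char_probs` table is a toy distribution invented for illustration, standing in for the per-token probabilities a language model would supply.

```python
import math
from typing import Dict

def code_length_bits(text: str, char_probs: Dict[str, float]) -> float:
    """Approximate arithmetic-coding length: sum of -log2 p(symbol)."""
    return sum(-math.log2(char_probs[c]) for c in text)

# Toy per-character distribution (illustrative only).
char_probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

print(code_length_bits("aaab", char_probs))  # 5.0 bits: likely symbols compress well
print(code_length_bits("dcdc", char_probs))  # 12.0 bits: unlikely symbols cost more
```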
Formal Definition of Memorization
The memorization capacity of neural networks can be formalized through information-theoretic principles. Total memorization represents the mutual information between the training dataset D and the trained model parameters θ̂:

mem(D, θ̂) = I(D; θ̂) = H(D) − H(D | θ̂)
This total memorization decomposes into two fundamental and distinct components that capture different aspects of learning:
Unintended Memorization (sample-specific storage): This represents information about individual training examples that cannot be inferred from the general data distribution. It measures how much the model has “overfit” to specific samples:

mem_U(D, θ̂, θ) = I(D; θ̂ | θ) = H(D | θ) − H(D | θ, θ̂)

This captures the reduction in uncertainty about specific training examples when given both the true distribution parameters θ and the learned parameters θ̂, compared to knowing only θ.
Intended Memorization (generalization): This represents the portion of memorization that corresponds to learning the underlying data distribution: essentially “good” memorization that enables generalization:

mem_I(D, θ̂, θ) = mem(D, θ̂) − mem_U(D, θ̂, θ) = I(D; θ̂) − I(D; θ̂ | θ)
The key insight is that total memorization splits into:
- Useful patterns that generalize to new data (intended memorization)
- Spurious details specific to training examples (unintended memorization)
where:
- D: Training dataset drawn from the true data distribution
- θ: Parameters of the true underlying data distribution
- θ̂: Learned model parameters after training
- H(·): Shannon entropy measuring uncertainty
- I(· ; ·): Mutual information measuring shared information
This formulation provides theoretical clarity but requires practical estimation methods for real-world application.
Computing Memorization in Practice
The theoretical formulation requires practical estimation methods since we observe single instances of models and datasets. We transition from Shannon entropy to Kolmogorov complexity for computational tractability.
Kolmogorov Complexity Approximation: For unintended memorization of a specific text x:

mem_U(x) ≈ H_K(x | θ) − H_K(x | θ, θ̂)

where H_K(x | θ) represents the Kolmogorov complexity of x given θ: the length of the shortest program that generates x when given access to θ.
Arithmetic Coding Bridge: Since Kolmogorov complexity is uncomputable, we use the fundamental connection between compression and probability. For any computable probability distribution p_θ, arithmetic coding achieves a compression length approximately equal to the negative log-probability:

H_K(x | θ) ≈ −log₂ p_θ(x)

Reference Model Method: To estimate the true data distribution θ, we use a larger reference model θ_ref trained on a superset of the data. This gives us practical estimators:
- H_K(x | θ) ≈ −log₂ p_ref(x): Compression cost using the reference model
- H_K(x | θ, θ̂) ≈ min(−log₂ p_ref(x), −log₂ p_θ̂(x)): Best compression using either model
Final Computable Measure: Combining these approximations yields the practical unintended memorization measure:

mem_U(x) ≈ max(0, log₂ p_θ̂(x) − log₂ p_ref(x))
This formulation captures the key intuition: if the target model assigns much higher probability to a text than the reference model, it has memorized sample-specific information not present in the general distribution.
Interpretation:
- When mem_U(x) = 0 (the target model assigns no higher probability than the reference model): the target model shows no unintended memorization of x
- When mem_U(x) = k > 0: the target model has memorized roughly k bits of sample-specific information about x
- Larger values indicate stronger memorization of training-specific details
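As a worked example (the probabilities below are made-up numbers, not measurements), the measure reduces to a clipped log-probability ratio between the target and reference models:

```python
import math

def unintended_memorization_bits(p_target: float, p_reference: float) -> float:
    """mem_U(x) = max(0, log2 p_target(x) - log2 p_reference(x))."""
    return max(0.0, math.log2(p_target) - math.log2(p_reference))

# Target model finds the text far more likely than the reference: ~8 bits memorized.
print(unintended_memorization_bits(p_target=2**-10, p_reference=2**-18))  # 8.0
# Target model finds it less likely than the reference: no unintended memorization.
print(unintended_memorization_bits(p_target=2**-20, p_reference=2**-18))  # 0.0
```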
Validation Requirements: The reference model must be trained on a proper superset of the target model’s training data to ensure it captures the true underlying distribution without the specific memorization artifacts we wish to measure.
The 3.6 Bits-Per-Parameter Hypothesis
Fundamental Capacity Theorem: Extensive empirical analysis reveals that neural networks in the GPT family exhibit a remarkably consistent memorization capacity governed by a near-universal constant [3]:

α ≈ 3.6 bits per parameter

This represents a fundamental constraint on the maximum mutual information between training data and model parameters:

I(D; θ̂) ≤ C(θ̂) ≈ 3.6 · |θ̂|

where |θ̂| is the number of model parameters.
Physical Interpretation: Each parameter can store approximately 3.6 bits of information about the training data. This is significantly less than the theoretical maximum for floating-point parameters (which could theoretically store unlimited information), indicating that practical training dynamics impose strict limits.
Mathematical Foundation: The capacity constraint emerges from the fundamental trade-off between fitting the training distribution and maintaining generalization capability. When a model approaches its capacity limit C(θ̂) ≈ 3.6 · |θ̂| bits, it faces a critical phase transition between the memorization and compression regimes.
Universality Across Scales: This 3.6 bits/parameter constant has been empirically validated across [3]:
- Model architectures: GPT-2 family with standard transformer architectures
- Parameter counts: From 80K to 1.5B parameters (spanning nearly 4 orders of magnitude)
- Precision formats: Moving from bf16 to fp32 training increases the measured capacity only modestly (from roughly 3.5 to 3.8 bits/param), far less than the doubling of raw storage bits [3]
- Training procedures: Standard gradient descent with various optimizers
- Dataset types: Multiple natural language corpora with different characteristics
Lower Bound Nature: This measurement represents a practical lower bound on theoretical information storage capacity, not an upper limit. Several factors contribute to this gap:
- Optimization Limitations: Gradient descent may converge to local minima rather than globally optimal parameter configurations that maximize information storage
- Continuous Parameter Spaces: Theoretically, real-valued parameters have infinite precision and could store unbounded information
- Training Dynamics: The constraint reflects practical SGD training behavior, not information-theoretic limits
- Regularization Effects: Implicit regularization from finite precision arithmetic and finite training time
Connection to Parameter Efficiency: The 3.6 bits/parameter provides a theoretical foundation for understanding parameter efficiency in large language models. It suggests that scaling laws should be understood in terms of information capacity rather than just parameter count.
Predicting Double Descent and Phase Transitions
The 3.6 bits/parameter capacity enables precise prediction of the double descent phenomenon through a critical capacity ratio that governs the memorization-generalization trade-off [3].
Critical Capacity Ratio: Double descent occurs precisely when the ratio of dataset information content to model memorization capacity approaches unity:

R = H(D) / C(θ̂) ≈ (N · S · log₂ V) / (3.6 · P) ≈ 1

where:
- N: Number of training sequences
- V: Vocabulary size
- S: Sequence length (tokens per sequence)
- P: Number of model parameters
Information Content Estimation: The dataset entropy estimate H(D) ≈ N · S · log₂ V assumes a uniform distribution over token sequences. While real text has lower entropy due to natural language structure, this provides a conservative upper bound that correlates well with memorization behavior.
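The sketch below computes this ratio and classifies the regime. The model and dataset sizes are arbitrary examples, and the ±20% band used to flag the critical regime is an illustrative choice rather than a value from [3].

```python
import math

def capacity_ratio(num_sequences: int, seq_len: int, vocab_size: int,
                   num_params: int, bits_per_param: float = 3.6) -> float:
    """Upper-bound dataset information divided by model capacity."""
    dataset_bits = num_sequences * seq_len * math.log2(vocab_size)
    capacity_bits = bits_per_param * num_params
    return dataset_bits / capacity_bits

def regime(ratio: float, band: float = 0.2) -> str:
    """Classify the training regime around the critical ratio of 1."""
    if ratio < 1 - band:
        return "memorization regime (capacity exceeds data)"
    if ratio > 1 + band:
        return "compression regime (data exceeds capacity)"
    return "critical regime (expect instability)"

r = capacity_ratio(num_sequences=100_000, seq_len=1024, vocab_size=50_257,
                   num_params=124_000_000)
print(round(r, 2), "->", regime(r))  # ~3.58 -> compression regime
```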
Three Distinct Learning Regimes:
1. Memorization Regime (R < 1, more capacity than data):
- Characteristics: Model capacity exceeds dataset information content
- Memorization Behavior: Can memorize the entire dataset with capacity to spare
- Training Dynamics: Zero training loss achievable through perfect memorization
- Generalization: Poor, due to lack of compression pressure
- Mathematical Description: Pure memorization is optimal
2. Critical Regime (R ≈ 1):
- Characteristics: Capacity approximately matches dataset information content
- Memorization Behavior: Forced to choose among training examples to store
- Training Dynamics: High variance, sensitivity to initialization
- Generalization: Worst performance, due to instability at the phase boundary
- Mathematical Description: Competition between memorization and compression
3. Compression Regime (R > 1, more data than capacity):
- Characteristics: Dataset information exceeds model capacity
- Memorization Behavior: Must compress the data to fit within the capacity constraint
- Training Dynamics: Forced compression leads to pattern discovery
- Generalization: Improved through implicit regularization
- Mathematical Description: Compression is necessary
Mathematical Characterization of Test Loss: As a function of the capacity ratio R, the test loss exhibits the characteristic double descent curve, which can be written schematically as

L_test(R) ≈ L_mem(R) + L_crit / (|R − 1| + ε) + L_comp(R)

where:
- L_mem(R): Decreasing loss contribution in the memorization regime (R < 1)
- L_crit / (|R − 1| + ε): Sharp divergence term near R ≈ 1 (with ε a small smoothing constant), producing the peak loss at the critical transition point
- L_comp(R): Decreasing loss contribution in the compression regime (R > 1)
Practical Implications:
- Model Sizing: Target a capacity ratio R well away from 1 (ideally R ≫ 1) to avoid critical-regime instabilities
- Training Stability: Expect higher variance when R ≈ 1
- Scaling Laws: Information capacity, not just parameter count, determines performance
- Architecture Design: Optimize for bits-per-parameter efficiency rather than raw size
Memorization, Privacy, and Security Implications
Understanding memorization has profound implications for privacy and security in language model deployment. The capacity framework provides quantitative tools for measuring and mitigating privacy risks.
Scaling Law for Membership Inference Attacks
Membership inference attacks attempt to determine whether a specific text was included in the training dataset. The success rate of such attacks follows a predictable scaling law: it is well approximated by a sigmoid function σ of the capacity-to-data ratio C(θ̂) / H(D) [3], where:
- σ is the sigmoid function
- C(θ̂) ≈ 3.6 · P is the model’s information capacity
- H(D) represents the total information content of the dataset
Mathematical Interpretation: This curve exhibits a sharp transition around the critical point where H(D) ≈ C(θ̂), indicating a phase transition in attack vulnerability.
Attack Success Regimes:
- High Capacity: When models have excess capacity relative to data, they memorize verbatim, making membership inference easy
- Critical Point: Sharp transition region in which attack vulnerability changes most rapidly
- Low Capacity: When dataset information far exceeds capacity, models must compress rather than store individual examples, reducing attack success
Privacy Threshold and Safe Operating Regions
Models achieve practical privacy protection when the capacity-to-data ratio falls below a critical threshold:

C(θ̂) / H(D) ≈ 3.6 · P / H(D) ≪ 1

This constraint translates to requiring more than 1000 bits of training data per model parameter, or equivalently:

H(D) / P > 1000 bits per parameter (i.e., C(θ̂) / H(D) < 3.6 / 1000 ≈ 0.004)

For typical language models with vocabulary size V and sequence length S, the uniform-entropy estimate H(D) ≈ N · S · log₂ V turns this into a token-count condition: N · S / P > 1000 / log₂ V, which is on the order of 60–100 training tokens per parameter at GPT-style vocabulary sizes.
Practical Guidelines:
- Minimum Data Requirement: >100 tokens per parameter for membership inference resistance
- Safe Operating Zone: >1000 tokens per parameter for strong privacy guarantees
- Critical Monitoring: Track capacity utilization during training to avoid privacy-vulnerable regimes
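A minimal helper that applies these rules of thumb is sketched below; the token and parameter counts in the example calls are arbitrary illustrations, not recommendations.

```python
def privacy_check(num_tokens: int, num_params: int) -> str:
    """Apply the tokens-per-parameter guidelines above to a training setup."""
    ratio = num_tokens / num_params
    if ratio > 1000:
        return f"{ratio:.0f} tokens/param: strong privacy margin"
    if ratio > 100:
        return f"{ratio:.0f} tokens/param: reasonable membership-inference resistance"
    return f"{ratio:.0f} tokens/param: capacity-rich regime, audit memorization closely"

print(privacy_check(num_tokens=2_000_000_000_000, num_params=7_000_000_000))  # ~286 tokens/param
print(privacy_check(num_tokens=300_000_000_000, num_params=7_000_000_000))    # ~43 tokens/param
```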
The Extraction vs Memorization Paradox
A counterintuitive finding emerges when examining extractable memorized content:
Key Observation: When the probability of extracting training examples equals the probability of extracting test examples, i.e.

P(extract x | x ∈ D_train) ≈ P(extract x | x ∈ D_test),
then all successful extraction can be attributed to generalization capabilities rather than memorization.
Implications:
- True Memorization: Only extractable content that appears more frequently from training data indicates genuine memorization
- Generalization Masquerading: Much “extraction” actually demonstrates the model’s ability to generate realistic text from learned patterns
- Privacy Assessment: Simple extraction tests may overestimate privacy risks by conflating generation with memorization
Quantitative Measurement: The memorization-specific extraction rate is the excess extraction success on training data over held-out data:

extraction_mem = P(extract x | x ∈ D_train) − P(extract x | x ∈ D_test)
This provides a more accurate measure of privacy-relevant memorization than raw extraction rates.
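A small sketch of that measurement, assuming extraction outcomes have already been recorded per example (the outcome lists here are hypothetical):

```python
from typing import Sequence

def memorization_specific_extraction_rate(
    extracted_train: Sequence[bool], extracted_test: Sequence[bool]
) -> float:
    """Extraction rate on training members minus the rate on held-out text."""
    rate_train = sum(extracted_train) / len(extracted_train)
    rate_test = sum(extracted_test) / len(extracted_test)
    return rate_train - rate_test

train_hits = [True, True, False, True, False, False, True, False]   # 50% extracted
test_hits = [True, False, False, False, False, False, True, False]  # 25% extracted
print(memorization_specific_extraction_rate(train_hits, test_hits))  # 0.25
```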
Memorization Patterns and Selective Learning
Neural networks don’t memorize randomly - they exhibit systematic preferences for certain types of content based on information-theoretic properties.
TF-IDF and Rarity-Based Selection
Documents with higher TF-IDF (Term Frequency-Inverse Document Frequency) scores are preferentially memorized due to their distinctiveness [3]:

TF-IDF(d) = (1 / |d|) Σ_{w ∈ d} tf(w, d) · log(N / df(w))

where:
- tf(w, d): Term frequency of word w in document d
- df(w): Document frequency of word w across the corpus
- |d|: Length of document d
- N: Total number of documents
Empirical Correlation: A strong positive correlation between TF-IDF scores and memorization likelihood indicates that models preferentially store distinctive, rare content [3].
Information-Theoretic Explanation: High TF-IDF content has lower compressibility because:
- Rare terms have high information content (−log₂ p(w) is large when p(w) is small)
- Unique combinations resist compression through pattern matching
- Low redundancy prevents amortization across multiple examples
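For illustration, here is a minimal sketch of the length-normalized TF-IDF score defined above, applied to a toy three-document corpus (the corpus, tokenization, and log base are arbitrary choices for the example):

```python
import math
from collections import Counter
from typing import List

def avg_tfidf(doc: List[str], corpus: List[List[str]]) -> float:
    """Length-normalized sum of tf(w, d) * log(N / df(w)) over the document."""
    n_docs = len(corpus)
    counts = Counter(doc)
    score = 0.0
    for word, tf in counts.items():
        df = sum(1 for d in corpus if word in d)
        score += tf * math.log(n_docs / df)
    return score / len(doc)

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "quantum chromodynamics regulates hadron binding".split(),  # distinctive content
]
for doc in corpus:
    print(round(avg_tfidf(doc, corpus), 3), " ".join(doc[:3]), "...")
```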
Memorization Priority Function
The probability that a model memorizes document d_i follows a competitive allocation process, schematically of the form:

P(memorize d_i) ∝ exp( α · rarity(d_i) − β · incompressibility(d_i) + γ · frequency(d_i) )

Parameter Interpretation:
- α: Rarity weight; higher values prioritize unique content
- β: Compressibility penalty; resists memorizing incompressible noise
- γ: Frequency boost; common patterns receive priority
Compressibility Estimation: Using the lossless compression ratio as a proxy:

incompressibility(d) ≈ |compress(d)| / |d|

where compress(·) is a standard lossless compressor.
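One simple proxy is a general-purpose compressor; the sketch below uses zlib, and the example strings are arbitrary.

```python
import random
import string
import zlib

def incompressibility(text: str) -> float:
    """Compressed size over raw size: lower values indicate more redundancy."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)

repetitive = "abab" * 200
noise = "".join(random.choices(string.ascii_letters + string.digits, k=800))
print(round(incompressibility(repetitive), 3))  # highly compressible, ratio well below 1
print(round(incompressibility(noise), 3))       # much less compressible, noticeably higher ratio
```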
Frequency-Based Competition: When capacity is limited, documents compete for a fixed budget of memorization bits, and only the highest-priority documents are stored.
Strategic Memorization Behavior
Models exhibit intelligent allocation strategies that maximize information efficiency:
1. Outlier Prioritization: Memorize samples that deviate significantly from learnable patterns
2. Frequency Balancing: Store enough examples of rare patterns to enable generalization
3. Compression Optimization: Prefer content that can be stored efficiently within parameter constraints
Mathematical Framework: The optimal memorization strategy solves the constrained selection problem

maximize Σ_i x_i · v(d_i)   subject to   Σ_i x_i · c(d_i) ≤ C(θ̂),   x_i ∈ {0, 1}

where:
- v(d_i): Information value of memorizing document d_i
- c(d_i): Parameter capacity required to store d_i
- x_i: Indicator of whether d_i is memorized
This resembles a knapsack optimization problem, explaining the observed systematic memorization patterns.
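As an illustration of the knapsack analogy, the sketch below allocates a fixed capacity budget greedily by value-per-bit. The documents, their value estimates, and their cost estimates are all hypothetical placeholders for quantities that would come from rarity and compressibility measurements.

```python
from typing import List, Tuple

def greedy_memorization(docs: List[Tuple[str, float, float]],
                        capacity_bits: float) -> List[str]:
    """Greedy knapsack heuristic: fill the capacity budget by value-per-bit."""
    selected, used = [], 0.0
    # Each doc is (name, value_bits, cost_bits); sort by value density.
    for name, value, cost in sorted(docs, key=lambda d: d[1] / d[2], reverse=True):
        if used + cost <= capacity_bits:
            selected.append(name)
            used += cost
    return selected

docs = [
    ("rare_outlier", 120.0, 80.0),
    ("common_boilerplate", 10.0, 40.0),
    ("unique_id_string", 60.0, 50.0),
    ("noise_blob", 15.0, 90.0),
]
print(greedy_memorization(docs, capacity_bits=150.0))  # ['rare_outlier', 'unique_id_string']
```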
Connections to Broader Deep Learning Phenomena
The memorization framework provides unified explanations for several mysterious phenomena in deep learning, revealing common underlying mechanisms.
Grokking: Sudden Generalization Through Capacity Reallocation
Grokking - the phenomenon where models suddenly transition from memorization to generalization after extended training - can be understood as a dynamic capacity reallocation process.
Phase 1: Memorization Dominance (t ≪ t_c):
- Capacity Allocation: mem_U ≈ C(θ̂), mem_I ≈ 0 (nearly all capacity stores specific examples)
- Learning Strategy: Direct memorization of input-output pairs
- Loss Behavior: Low training loss, high test loss (overfitting)
- Parameter Usage: Most parameters encode specific training examples
Phase 2: Critical Transition (t ≈ t_c):
- Discovery Event: Model discovers compressible algorithmic structure
- Capacity Reallocation: Rapid shift from memorization to pattern encoding
- Mathematical Signature: mem_U falling, mem_I rising
- Loss Dynamics: Sharp improvement in test performance
Phase 3: Generalization Dominance (t ≫ t_c):
- Stable Allocation: mem_I ≫ mem_U, efficient pattern representation
- Algorithmic Behavior: Model executes learned algorithms rather than table lookup
- Performance: Both training and test loss remain low
Mathematical Model of Grokking: The fraction of capacity allocated to generalizable structure can be modeled schematically as a logistic transition in training time t:

mem_I(t) / C(θ̂) ≈ σ((t − t_c) / τ)

where τ controls the transition sharpness and t_c is the critical time point.
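The toy sketch below evaluates that schematic logistic model at a few training steps; the transition point t_c and sharpness τ are arbitrary illustrative values, not fitted to any real run.

```python
import math

def generalization_fraction(t: float, t_c: float, tau: float) -> float:
    """Schematic logistic model of the memorization-to-generalization shift."""
    return 1.0 / (1.0 + math.exp(-(t - t_c) / tau))

for step in (2_000, 9_000, 10_000, 11_000, 20_000):
    print(step, round(generalization_fraction(step, t_c=10_000, tau=500), 3))
# Before t_c the fraction stays near 0 (memorization); after t_c it saturates near 1.
```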
Lottery Ticket Hypothesis: Optimal Capacity Utilization
The lottery ticket hypothesis states that dense networks contain sparse subnetworks that achieve comparable performance when trained in isolation. This connects directly to capacity efficiency.
Winning Ticket Characterization: A winning lottery ticket is a sparse subnetwork θ_sub ⊂ θ̂ that achieves optimal capacity allocation: despite a much smaller capacity C(θ_sub) ≈ 3.6 · |θ_sub|, essentially all of it is devoted to generalizable patterns rather than sample-specific memorization, so performance matches the dense network.
Capacity Efficiency Interpretation:
- Bad tickets: Waste capacity on irrelevant memorization
- Good tickets: Efficiently allocate capacity to generalizable patterns
- Winning tickets: Optimal balance of capacity utilization and performance
Connection to Memorization: Pruning removes parameters that store unimportant memorized content, leaving those that encode useful patterns.
Emergent Abilities: Capacity Threshold Effects
Emergent abilities appear suddenly when models exceed critical size thresholds. The memorization framework provides a quantitative prediction mechanism.
Emergence Threshold: New capabilities emerge when model capacity exceeds the information requirements of the task:

C(θ̂) ≈ 3.6 · P > H_task + ΔC_discovery

where:
- H_task: Information content required for task competence
- ΔC_discovery: Additional capacity needed for pattern discovery
Task Complexity Estimation: For reasoning tasks requiring k sequential steps with branching factor b, a rough estimate is H_task ≈ k · log₂ b.
Phase Transition Dynamics: Emergence occurs as capacity crosses this threshold:
- Insufficient Capacity (C ≪ H_task): Model can only memorize specific examples
- Critical Capacity (C ≈ H_task): Model discovers algorithmic patterns that generalize
- Abundant Capacity (C ≫ H_task): Reliable execution of learned algorithms
Examples:
- Arithmetic: Emerges when capacity exceeds the information needed to encode calculation procedures
- Chain-of-thought: Appears when models can allocate capacity to intermediate reasoning steps
- In-context learning: Develops when models can store pattern matching algorithms
Scaling Laws: Information-Theoretic Foundations
Traditional scaling laws focus on parameter count, but the memorization framework suggests information capacity as the fundamental quantity.
Revised Scaling Law: Performance should be predicted from effective information capacity rather than raw parameter count:

C_eff = α_arch · P

where α_arch is an architecture-dependent bits-per-parameter coefficient (≈ 3.6 for the GPT-style models measured in [3]).
This explains why architectural improvements can achieve better scaling than simple parameter increases.
Practical Implementation
Capacity-Aware Model Design
```python
import numpy as np
from typing import Dict


def compute_model_size(
    dataset_bits: float,
    vocab_size: int,
    safety_factor: float = 1.5,
    tokens_per_param: int = 1000,
) -> Dict[str, float]:
    """Compute a recommended model size for a given dataset.

    Args:
        dataset_bits: Total information content of the dataset (bits).
        vocab_size: Tokenizer vocabulary size, used to convert bits to tokens.
        safety_factor: Multiplicative safety margin.
        tokens_per_param: Target tokens-per-parameter ratio for good generalization.

    Returns:
        Dictionary with model specifications.
    """
    bits_per_param = 3.6  # empirical capacity estimate [3]

    # Memorization-based bound: parameters needed to store the whole dataset
    min_params_memorization = dataset_bits / bits_per_param

    # Generalization-based sizing: parameter count at which the dataset
    # provides exactly `tokens_per_param` tokens per parameter
    dataset_tokens = dataset_bits / np.log2(vocab_size)
    min_params_generalization = dataset_tokens / tokens_per_param

    # Take the larger of the two bounds and apply the safety factor
    recommended_params = max(min_params_memorization, min_params_generalization) * safety_factor

    return {
        "parameters": int(recommended_params),
        "capacity_bits": recommended_params * bits_per_param,
        "memorization_ratio": dataset_bits / (recommended_params * bits_per_param),
    }
```
Measuring Unintended Memorization
```python
import math

import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def sequence_nll_bits(model: nn.Module, input_ids: torch.Tensor) -> torch.Tensor:
    """Total negative log-likelihood of each sequence in the batch, in bits.

    Assumes the model maps token ids of shape (batch, seq_len) to logits of
    shape (batch, seq_len, vocab_size).
    """
    with torch.no_grad():
        logits = model(input_ids)
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_nll.sum(dim=-1) / math.log(2)  # convert nats to bits


def measure_unintended_memorization(
    model: nn.Module, reference_model: nn.Module, dataset: DataLoader
) -> float:
    """Quantify unintended memorization using the reference-model method.

    Args:
        model: Target model being evaluated.
        reference_model: Larger model trained on a superset of the data.
        dataset: Evaluation dataset yielding batches of token ids.

    Returns:
        Unintended memorization in bits.
    """
    mem_U_total = 0.0
    for input_ids in dataset:
        # Compression cost in bits under each model (arithmetic-coding bound)
        nll_target = sequence_nll_bits(model, input_ids)
        nll_reference = sequence_nll_bits(reference_model, input_ids)

        # Kolmogorov complexity approximations
        H_K_given_reference = nll_reference
        H_K_given_both = torch.minimum(nll_target, nll_reference)

        # Unintended memorization for this batch, clamped at zero
        mem_U = torch.clamp(H_K_given_reference - H_K_given_both, min=0)
        mem_U_total += mem_U.sum().item()

    return mem_U_total
```
Theoretical Foundations and Future Directions
Statistical Learning Theory Connections
The memorization framework bridges empirical observations with established theoretical foundations in statistical learning theory.
PAC-Bayes Bounds: The generalization gap can be bounded using the memorization capacity; using C(θ̂) · ln 2 in place of the KL term in a standard PAC-Bayes bound gives a bound of the form

generalization gap ≤ √( (C(θ̂) · ln 2 + ln(n / δ)) / (2(n − 1)) )

where n is the training set size and δ is the confidence parameter.
Rademacher Complexity: Memorization capacity bounds the effective size of the hypothesis class (at most 2^{C(θ̂)} distinguishable hypotheses), which via Massart’s lemma gives a Rademacher complexity of roughly √( 2 · C(θ̂) · ln 2 / n ).
Minimum Description Length (MDL): The optimal model balances fit quality with description length:

θ* = argmin_θ [ L(D | θ) + L(θ) ]

where L(D | θ) is the code length of the data under the model and L(θ) is the code length of the model itself.
Multi-Modal and Cross-Domain Extensions
The capacity framework extends beyond language models to other domains and modalities.
Multi-Modal Capacity Sharing: For models processing multiple modalities (vision, language, audio), total capacity decomposes across modality-specific and shared parameters:

C_total ≈ Σ_m α_m · P_m + α_shared · P_shared

where:
- α_m: Modality-specific capacity coefficient (may differ from 3.6)
- P_m: Parameters dedicated to modality m
- α_shared: Capacity coefficient for shared representations (applied to the shared parameters P_shared)
Cross-Domain Transfer: When fine-tuning across domains, the same accounting applies: capacity already spent on source-domain information must be partially reallocated to the target domain.
Continual Learning Dynamics: Capacity allocation evolves over sequential tasks; schematically, the capacity available for task k is

C_available(k) ≈ C(θ̂) − Σ_{j<k} (1 − λ) · H_j

where λ represents the forgetting rate and H_j is the information content of task j.
Quantum and Neuromorphic Extensions
Quantum Neural Networks: The theoretical capacity of quantum parameters exploiting superposition could scale with dim(ℋ), the dimension of the underlying Hilbert space, potentially providing exponential capacity advantages.
Neuromorphic Computing: Spike-based neural networks may exhibit different capacity constraints.
Open Research Questions and Future Work
1. Architecture Universality:
- Does the ≈3.6 bits/parameter figure hold for transformer architectures beyond the GPT-2 family?
- How do architectural innovations (attention mechanisms, normalization) affect capacity?
- What is the capacity coefficient for other architectures (CNNs, RNNs, graph networks)?
2. Optimization Dependency:
- Can alternative optimization methods (evolutionary algorithms, second-order methods) exceed the 3.6 bits/parameter bound?
- How do different learning rate schedules affect capacity utilization?
- What role does the optimization landscape geometry play?
3. Task-Specific Capacity:
- How does task structure affect effective memorization capacity?
- Can we predict task-specific capacity requirements?
- How do compositional tasks scale with capacity?
4. Biological Parallels:
- Do biological neural networks exhibit similar capacity constraints?
- How does synaptic plasticity relate to artificial parameter updates?
- Can neuroscience inform better capacity utilization strategies?
5. Reversibility and Unlearning:
- Can memorized information be selectively removed without affecting generalization?
- How can we design “forgettable” training procedures?
- What are the fundamental limits of machine unlearning?
6. Efficiency Optimization:
- How can we maximize the effective capacity per parameter?
- What architectural modifications improve capacity efficiency?
- Can we predict optimal model sizes for given datasets?
Practical Implementation Framework
Capacity-Aware Training Pipeline:
- Pre-training Analysis: Estimate dataset information content
- Model Sizing: Use capacity formula to determine optimal architecture
- Training Monitoring: Track capacity utilization throughout training
- Post-training Audit: Measure unintended memorization for privacy assessment
- Deployment Optimization: Prune parameters storing irrelevant memorized content
Capacity Optimization Strategies:
- Dynamic Capacity Allocation: Adjust parameter allocation during training based on task requirements
- Hierarchical Memorization: Structure models to separate general patterns from specific memorization
- Federated Capacity: Distribute memorization across multiple model instances to enhance privacy
How does reasoning come into play?
Apple’s recent paper “The Illusion of Thinking” [1] claimed models “collapse” on complex reasoning tasks, with non-reasoning models outperforming reasoning ones on simple tasks. However, “The Illusion of the Illusion of Thinking” [2] revealed critical experimental design flaws:
Prompt Design
Apple’s prompts asked models to enumerate every Tower of Hanoi move, so outputs ran into token limits; models explicitly stated “The pattern continues, but to avoid making this too long, I’ll stop here.” [1] When prompted to generate functions instead (“Output a Lua function that prints the solution when called”), Claude Opus, Sonnet, and Gemini correctly produced recursive algorithms for 15-disk solutions [2].
Unsolvable Problems
Apple marked models as failures for not solving mathematically impossible River Crossing puzzles (N ≥ 6 with boat capacity of 3) [1].
Both sides acknowledge that current models have real output limitations, but experimental design can make those limits appear more fundamental than they are.
Here’s an example of successfully solving the Tower of Hanoi algorithmically using AI:
Tower of Hanoi Recursive Algorithm:
```python
from typing import List, Tuple, Optional


def solve_tower_of_hanoi(
    n: int,
    source: str = "A",
    destination: str = "C",
    auxiliary: str = "B",
    moves: Optional[List[Tuple[int, str, str]]] = None,
) -> Optional[List[Tuple[int, str, str]]]:
    """Solve the Tower of Hanoi problem recursively.

    Parameters
    ----------
    n : int
        Number of disks to move.
    source : str, optional
        Name of the source peg (default is 'A').
    destination : str, optional
        Name of the destination peg (default is 'C').
    auxiliary : str, optional
        Name of the auxiliary peg (default is 'B').
    moves : list, optional
        List to store the sequence of moves (default is None).

    Returns
    -------
    list
        A list of tuples representing the moves, where each tuple is
        (disk_number, source_peg, destination_peg).

    Notes
    -----
    The algorithm follows the recursive pattern:
    1. Move n-1 disks from source to auxiliary (using destination as temporary)
    2. Move the largest disk from source to destination
    3. Move n-1 disks from auxiliary to destination (using source as temporary)

    Examples
    --------
    >>> solve_tower_of_hanoi(3)
    [(1, 'A', 'C'), (2, 'A', 'B'), (1, 'C', 'B'), (3, 'A', 'C'), (1, 'B', 'A'), (2, 'B', 'C'), (1, 'A', 'C')]
    """
    # Initialize the moves list on the first call
    if moves is None:
        moves = []

    # Base case: no disks to move
    if n == 0:
        return moves

    # Step 1: Move n-1 disks from source to auxiliary (using destination as temporary)
    solve_tower_of_hanoi(n - 1, source, auxiliary, destination, moves)

    # Step 2: Move the largest disk from source to destination
    moves.append((n, source, destination))

    # Step 3: Move n-1 disks from auxiliary to destination (using source as temporary)
    solve_tower_of_hanoi(n - 1, auxiliary, destination, source, moves)

    return moves
```
Understanding the relationship
Understanding the relationship between memorization, generalization and reasoning is critical for:
1. Evaluation Design
Tests that expect memorized responses miss models that have progressed to generalization and reasoning. When evaluations reward exhaustive approaches (like move lists), they penalize models that have learned to think abstractly and generate practical engineering solutions.
2. Capability Unlocking
Reasoning “failures” often stem from poor evaluation design. Tweaking task setup or prompts can reveal genuine capabilities valuable for workflow automation, coding, and information extraction. Additionally, understanding when models move from memorization to generalization can help us in areas of efficiency and intelligence.
3. Tool Integration
LLMs should be paired with tools to augment reasoning. In our experiments, Claude Opus solved Tower of Hanoi through recursive algorithms, not exhaustive move enumeration.
4. Privacy and Security Implications
Understanding memorization carries significant implications for privacy and security in language model deployment. The capacity framework offers quantitative tools for assessing and reducing privacy risks.
Conclusion
Models are generalizing effectively, and knowing at what point memorization gives way to generalization helps us iterate on intelligence. With proper prompts, tools, and evaluation frameworks, models demonstrate stronger reasoning than flawed tests suggest. While not yet human-level reasoners, the future isn’t as bleak as some evaluations indicate. The “collapse” often reflects our testing limitations, not fundamental model constraints.
Understanding these distinctions is crucial for determining where innovation is truly needed versus where better implementation can unlock existing capabilities.
References
[1] Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple. Retrieved from https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
[2] Opus, C., & Lawsen, A. (2025). The Illusion of the Illusion of Thinking: A Comment on Shojaee et al. (2025). arXiv preprint arXiv:2506.09250. Retrieved from https://arxiv.org/pdf/2506.09250
[3] Morris, J. X., Sitawarin, C., Guo, C., Kokhlikyan, N., Suh, G. E., Rush, A. M., Chaudhuri, K., & Mahloujifar, S. (2025). How much do language models memorize? arXiv preprint arXiv:2505.24832. Retrieved from https://arxiv.org/pdf/2505.24832