How Bigger Models Get Better
- Juan Manuel Ortiz de Zarate
- Apr 30
- 10 min read
Language modeling is a central task in AI, providing a proxy for many forms of reasoning and knowledge representation. In recent years, neural networks, particularly Transformer architectures, have achieved impressive results in generating human-like text. However, how large these models should be, how much data they require, and how computationally expensive their training needs to be have remained only partially understood.
Kaplan et al. address these questions by identifying precise empirical "scaling laws" that govern how a model's predictive performance, measured by cross-entropy loss, depends on model size, dataset size, and computation. Remarkably, they find that performance improves as a power-law function of these variables, revealing a smooth, predictable progression.
Key Findings: Power Laws Everywhere
The core contribution of the paper is the identification of predictable power-law relationships governing language model performance. The authors find that test loss, a measure of how well a model predicts the next token in a sequence, decreases smoothly and consistently as a function of three principal factors:
Model Size (N): The loss scales with the number of non-embedding parameters as

L(N) = (N_c / N)^{α_N}

where α_N is approximately 0.076. This means that doubling the model size leads to a measurable, though diminishing, reduction in loss.
Dataset Size (D): Loss also scales with the amount of training data (in tokens) as

L(D) = (D_c / D)^{α_D}

where α_D is approximately 0.095. Importantly, this relationship shows that larger datasets continue to yield better results, but at a sublinear rate.
Training Compute (C): When training is conducted with an optimal balance of model size and dataset size, the compute-efficient loss follows

L(C_min) = (C_c^min / C_min)^{α_C}

with α_C ≈ 0.050. This reveals that compute-efficient training strategies do not require full convergence to be effective.
These power laws are observed across a vast scale—seven orders of magnitude in compute, six in model size, and two in data—demonstrating that the relationships are not coincidental but deeply embedded in the behavior of neural networks.
| Scaling Factor | Loss Formula | Exponent (α) | Interpretation |
| --- | --- | --- | --- |
| Model Size (N) | L(N) = (N_c / N)^{α_N} | α_N ≈ 0.076 | Loss improves with larger models, with diminishing returns at scale |
| Dataset Size (D) | L(D) = (D_c / D)^{α_D} | α_D ≈ 0.095 | Larger datasets improve performance sublinearly |
| Compute Budget (C_min) | L(C_min) = (C_c^min / C_min)^{α_C} | α_C ≈ 0.050 | Optimal compute-efficient training favors large models and early stopping |
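To make the three laws concrete, here is a minimal sketch that turns them into loss predictions. The exponents are the ones quoted above; the scale constants (N_c ≈ 8.8×10^13 non-embedding parameters, D_c ≈ 5.4×10^13 tokens, C_c^min ≈ 3.1×10^8 PF-days) are the approximate fitted values reported by Kaplan et al. [1], specific to their dataset and tokenizer, so treat the absolute numbers as illustrative.

```python
# Minimal sketch of the three independent scaling laws from Kaplan et al. [1].
# The scale constants are approximate fitted values from the paper and are
# specific to its dataset/tokenizer; only the relative trends transfer cleanly.

ALPHA_N, N_C = 0.076, 8.8e13   # N_c in non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # D_c in tokens
ALPHA_C, C_C = 0.050, 3.1e8    # C_c^min in PF-days

def loss_from_model_size(n_params: float) -> float:
    """L(N) = (N_c / N)^alpha_N, with data and compute unconstrained."""
    return (N_C / n_params) ** ALPHA_N

def loss_from_data_size(n_tokens: float) -> float:
    """L(D) = (D_c / D)^alpha_D, with model size and compute unconstrained."""
    return (D_C / n_tokens) ** ALPHA_D

def loss_from_compute(pf_days: float) -> float:
    """L(C_min) = (C_c^min / C_min)^alpha_C on the compute-efficient frontier."""
    return (C_C / pf_days) ** ALPHA_C

# Doubling the model multiplies the size-limited loss by 2**-0.076 ≈ 0.95,
# i.e. roughly a 5% reduction: measurable, but with diminishing returns.
print(loss_from_model_size(1e9), loss_from_model_size(2e9))
```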
Another remarkable finding is the universality of training and overfitting behavior:
Overfitting occurs predictably when model size grows too large for a fixed dataset, or vice versa. The penalty depends on the ratio N^{0.74}/D, suggesting that to avoid overfitting, dataset size only needs to increase sublinearly with model size.
Training curves follow power-law trajectories, allowing extrapolation from early results to predict eventual performance (see the sketch after this list). This holds regardless of model size.
Sample efficiency improves with scale: Larger models not only perform better but also learn faster and require fewer data points to reach the same level of performance.
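The extrapolation claim is easy to picture with a small fit. Below is a hedged sketch: the functional form L(S) = L_inf + (S_c / S)^{α_S} mirrors the offset power laws used for learning curves in the paper, but the measurements and fitted constants here are invented for illustration, not taken from the paper.

```python
# Hedged sketch: fit an offset power law to early training measurements and
# extrapolate. The loss values below are synthetic, generated for illustration;
# they are not measurements from the paper.
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(steps, l_inf, s_c, alpha_s):
    """Offset power law L(S) = L_inf + (S_c / S)**alpha_S."""
    return l_inf + (s_c / steps) ** alpha_s

steps = np.array([1e3, 2e3, 5e3, 1e4, 2e4, 5e4])        # early checkpoints
loss  = np.array([3.73, 3.50, 3.26, 3.12, 3.00, 2.88])  # smoothed test loss

params, _ = curve_fit(learning_curve, steps, loss, p0=[2.0, 1e3, 0.5])
l_inf, s_c, alpha_s = params

# Predict where the run will end up long before it gets there.
print(f"fitted floor ≈ {l_inf:.2f}, predicted loss at 1e6 steps ≈ "
      f"{learning_curve(1e6, *params):.2f}")
```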
A striking operational insight is that compute-efficient training favors large models trained for fewer steps, rather than small models trained to full convergence. This counters previous assumptions and underscores the importance of stopping early when compute is constrained.
Finally, the authors show that generalization performance (how well a model trained on one data distribution performs on another) correlates strongly with training performance, offset by a nearly constant penalty. This implies that improvements in training loss lead directly to better performance in downstream and transfer tasks.

Together, these findings form the empirical backbone of modern large-scale AI model training. They provide not only a diagnostic lens for understanding existing models but also a prescriptive guide for building more capable ones in the future.
Architecture Details: Width vs. Depth vs. Size
A common belief in neural network design is that architecture—such as how deep (number of layers) or wide (dimension of each layer) a network is—plays a central role in determining performance. Kaplan et al. challenge this assumption with rigorous experiments showing that, when the total number of parameters is held constant, variations in depth, width, attention heads, and feedforward dimensions have only a minor impact on final loss.
In their empirical tests, they varied one architectural hyperparameter at a time (for example, increasing depth while proportionally decreasing width to hold the parameter count fixed) and found that models with drastically different shapes reached test losses within a few percent of one another.

This indicates that parameter count is the key driver of performance, not how those parameters are distributed. For example, a shallow model with very wide layers can perform nearly as well as a deep, narrow one with the same number of parameters. One practical implication is that architecture search should prioritize scalability and ease of parallelization over fine-tuning a model's shape.
Moreover, they find that embedding parameters, which scale with vocabulary size, can dominate total parameter counts in small models. Excluding these from model size measurements yields much cleaner scaling trends and supports the idea that embeddings can be compressed substantially without sacrificing performance.
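As a rough illustration of why excluding embeddings matters, here is a small helper (a sketch, not the paper's code) that uses the common approximation N ≈ 12 · n_layer · d_model^2 for the non-embedding parameters of a GPT-style Transformer and counts the embedding parameters separately; biases and layer norms are ignored, so the counts are approximate.

```python
# Sketch of the parameter bookkeeping behind the "non-embedding" convention.
# Uses the approximation N ≈ 12 * n_layer * d_model**2 (attention projections
# plus a feed-forward block with d_ff = 4 * d_model).

def transformer_params(n_layer: int, d_model: int, vocab_size: int, n_ctx: int):
    """Return (non_embedding, embedding) parameter counts for a GPT-style model."""
    attention   = 4 * d_model * d_model         # Q, K, V and output projections
    feedforward = 2 * d_model * (4 * d_model)   # two linear maps, d_ff = 4 * d_model
    non_embedding = n_layer * (attention + feedforward)
    embedding = vocab_size * d_model + n_ctx * d_model  # token + position embeddings
    return non_embedding, embedding

# In a GPT-2-small-sized model the embeddings are a large fraction of the total;
# in a GPT-3-sized model they are a rounding error.
print(transformer_params(n_layer=12, d_model=768,   vocab_size=50257, n_ctx=1024))
print(transformer_params(n_layer=96, d_model=12288, vocab_size=50257, n_ctx=2048))
```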
The authors hypothesize that this phenomenon may arise because deeper models behave like ensembles of shallower sub-networks, a notion supported by findings in residual networks (ResNets) in vision tasks. These results strongly support simplifying the design philosophy for language models: rather than focusing on complex architectures, engineers should aim to scale total size effectively and focus on efficient compute usage.
Training Efficiency: Sample Use, Overfitting, and Batch Size
Kaplan et al. also provide key insights into how models should be trained to achieve optimal performance under compute constraints. First, they show that larger models are significantly more sample efficient. They require fewer optimization steps and less data to reach a given performance level compared to smaller models.
This leads to a critical operational principle: when constrained by compute, it is better to train large models for fewer steps than to train small models to convergence. This strategy improves compute efficiency and accelerates progress.

Second, they show how overfitting arises when either the model or the dataset grows disproportionately. To avoid this, the dataset size should scale sub-linearly with the model size, roughly as D ∝ N^{0.74}. This means that while bigger models need more data, the growth in data requirements is not linear, which provides substantial savings in dataset construction.
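A quick way to see how much this saves is the sketch below, which assumes only the D ∝ N^{0.74} relation above; since the proportionality constant is not pinned down here, it can only answer relative questions.

```python
# Sketch of the sub-linear data requirement D ∝ N**0.74. Without the
# proportionality constant, the helper can only answer relative questions:
# "if the model grows k times, how much must the dataset grow?"

def required_data_growth(model_growth: float, exponent: float = 0.74) -> float:
    return model_growth ** exponent

print(required_data_growth(2))    # ≈ 1.67x the tokens for a 2x larger model
print(required_data_growth(10))   # ≈ 5.5x the tokens for a 10x larger model
```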
Third, the authors revisit the concept of critical batch size, originally introduced in earlier works. They find that training efficiency improves up to this critical batch size, beyond which returns diminish. Crucially, the critical batch size scales with training loss, not model size. Thus, the batch size should be dynamically adjusted during training based on the model’s current performance level.
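A minimal sketch of that loss-dependent schedule follows. The functional form B_crit(L) ≈ B* / L^{1/α_B} and the constants (B* ≈ 2×10^8 tokens, α_B ≈ 0.21) are the approximate fits reported in the paper; treat them as dataset-specific rather than universal.

```python
# Sketch of a loss-dependent batch-size schedule. The critical batch size is
# modeled as B_crit(L) ≈ B* / L**(1/alpha_B); B* and alpha_B below are the
# approximate values fitted in the paper and will differ on other datasets.

B_STAR  = 2e8    # tokens
ALPHA_B = 0.21

def critical_batch_size(current_loss: float) -> float:
    """Approximate critical batch size, in tokens, at the current loss level."""
    return B_STAR / current_loss ** (1.0 / ALPHA_B)

# As the loss falls, the critical batch size grows, so the batch size
# should be ramped up over the course of training rather than fixed.
for loss in (6.0, 4.0, 3.0, 2.5):
    print(f"loss={loss}: ~{critical_batch_size(loss):.2e} tokens per batch")
```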
Together, these findings offer a concrete recipe for efficient training: scale up model size, increase dataset size modestly to prevent overfitting, and adjust batch size according to training dynamics. This triad provides a foundation for compute-conscious, performance-maximizing model development.
Optimal Allocation of Compute Resources
One of the most consequential insights of the paper is how to optimally allocate a fixed compute budget between model size, training duration, and dataset size. The authors provide a roadmap that transforms training strategy from an art into a science.
Using their scaling laws, they demonstrate that the majority of the compute should go toward increasing model size, rather than extending the training duration or drastically enlarging the dataset. This is somewhat counterintuitive: instead of training small models to convergence, it is more efficient to train very large models for a relatively small number of steps.
They formalize this insight using power-law relationships among the variables:
Model size should grow as N ∝ C_min^{0.73}
Batch size as B ∝ C_min^{0.24}
Steps as S ∝ C_min^{0.03}
These relationships indicate that even for large increases in compute, only minimal increases in training time (steps) are needed. Most of the gain comes from scaling up the model, with only moderate increases in batch size and data.
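These three exponents translate directly into a budgeting rule of thumb. The sketch below, assuming only the fitted exponents 0.73, 0.24, and 0.03 quoted above, reports how each quantity should grow when the compute budget is multiplied by some factor.

```python
# Sketch of the compute-allocation rule N ∝ C_min**0.73, B ∝ C_min**0.24,
# S ∝ C_min**0.03: given a multiplier on the compute budget, report how much
# each training quantity should grow.

def allocate(compute_growth: float) -> dict:
    return {
        "model_size_x":  compute_growth ** 0.73,
        "batch_size_x":  compute_growth ** 0.24,
        "train_steps_x": compute_growth ** 0.03,
    }

# With 100x more compute: ~29x larger model, ~3x larger batches,
# and essentially the same number of serial steps (~1.15x).
print(allocate(100))
```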

Furthermore, they identify the intersection point where compute-efficient training begins to saturate performance given the data. This suggests that training beyond this point offers diminishing returns, as either the data has been exhausted or the model’s capacity has outpaced the available information.
This leads to a compelling recommendation: for fixed compute budgets, AI practitioners should opt for large models trained briefly with just enough data to avoid overfitting. Training to full convergence or with unnecessarily large datasets can actually reduce efficiency.
This principle underlies the training of models like GPT-3 and Gopher, where the focus shifted toward large-scale models trained on moderately-sized datasets rather than maximizing epochs [3].
Ultimately, this section of the paper provides a practical toolkit for resource allocation, with direct implications for hardware scaling, parallelism strategies, and research investment.
Limits of Scaling Laws: Theoretical Boundaries
Although the observed power laws hold across vast ranges of parameters, the authors caution that they must eventually break down. Natural language has nonzero entropy, implying a hard limit on how far prediction accuracy can improve.
They hypothesize that at model sizes around one trillion parameters and compute budgets of thousands of petaflop-days, scaling laws would start to plateau due to fundamental limits of data and model capacity.
Kaplan et al. also compared Transformers with LSTMs and Universal Transformers. Their experiments show that while LSTMs perform competitively on short contexts, Transformers increasingly outperform them as the context grows longer. This highlights the advantage of attention mechanisms for language tasks that require long-range dependencies.
The implications of this work have already rippled throughout AI research and practice. Several major trends directly stem from the insights presented:
Massive models: GPT-3 [2], Gopher [5], and PaLM [6] embrace the "bigger is better" ethos validated by this study, while Chinchilla [3] later refined the balance between model size and training data.
Data curation and expansion: While scaling models, ensuring sufficient and high-quality datasets remains critical.
Efficient training frameworks: Hardware and software optimizations focus increasingly on supporting ultra-large models with massive batch sizes and dynamic resource allocation.
Moreover, the recognition that larger models are more sample efficient has shifted thinking about how to design future AI systems that balance environmental costs, compute budgets, and model performance.
Caveats and Open Questions
While the scaling laws uncovered by Kaplan et al. are remarkably consistent and predictive across a wide range of scenarios, they are not without limitations. Understanding where these laws may break down or require refinement is crucial for responsible and effective advancement in AI research.
1. Limits of Extrapolation
One key caveat is that the scaling laws are based on empirical data observed within a finite range of model sizes, dataset sizes, and compute budgets. Although trends extend smoothly across many orders of magnitude, extrapolating indefinitely is risky. For example, the paper hypothesizes a point beyond which increasing model size or compute yields diminishing or no returns due to inherent entropy in natural language. At such scales, additional factors—such as training instability, data redundancy, or architectural bottlenecks—may dominate performance trends.
2. Data Quality vs. Quantity
The scaling laws focus on dataset size (measured in tokens), but they do not fully account for the impact of data quality, diversity, or redundancy. Increasing the volume of low-quality or repetitive data may fail to yield the gains predicted by the model. Open questions remain about how to best characterize or quantify “effective data” and whether curation and filtering methods could shift or extend the scaling curves.
3. Generalization and Transfer
Although the paper finds a consistent offset when evaluating performance on out-of-distribution text, it stops short of deeply analyzing transfer learning or cross-domain generalization. Will the same scaling laws apply to models fine-tuned for classification, reasoning, or multimodal tasks? It's unclear how these findings transfer beyond autoregressive language modeling.
4. Adversarial Robustness and Safety
As model size grows, capabilities improve—but so do risks. Larger models may inadvertently learn biases, produce toxic content, or become vulnerable to adversarial prompting. The scaling laws do not address how robustness, fairness, or safety scale with size, nor whether these issues compound or dilute in larger models. Research into how to regularize or constrain these systems without breaking the scaling efficiency is urgently needed.
5. Interpretability and Controllability
The increasing complexity of large models often reduces their interpretability. It's unclear whether scaling leads to emergent interpretability features (e.g., modularity or disentanglement) or whether it exacerbates black-box behavior. Likewise, as models grow, fine control over their behavior through prompting or training may become harder. Do current fine-tuning or RLHF methods scale as well as the core training loss does?
6. Cross-domain Generalization
The scaling laws were derived for language modeling. Do they generalize to other modalities such as vision, audio, or robotics? Preliminary evidence in image generation (e.g., scaling in diffusion models) suggests similar trends, but these relationships may differ in slope, inflection point, or even form depending on the structure of the input domain.
7. Environmental and Economic Costs
Finally, a growing ethical concern: even if larger models are more sample-efficient, they often require disproportionately more energy and infrastructure to train. The scaling laws say nothing about cost-efficiency or sustainability. A model trained at optimal compute may still be environmentally unsustainable or economically inaccessible, raising critical questions about democratization, regulation, and responsible deployment.
Conclusion
"Scaling Laws for Neural Language Models" provides a simple yet powerful framework: bigger models trained carefully with modest data increases deliver better performance — predictably.
This empirical blueprint has guided the most successful developments in AI over the past few years and will likely continue to do so. Understanding these scaling behaviors allows researchers and engineers to plan more effectively, avoid costly inefficiencies, and push the boundaries of what AI can achieve.
At its core, this work suggests a tantalizing possibility: that progress in AI, once thought to be governed by unpredictable breakthroughs, might instead be navigated smoothly and methodically — if only we scale wisely.
References
[1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., ... & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
[2] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.
[3] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., ... & Sifre, L. (2022). Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
[4] Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., ... & McCandlish, S. (2020). Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701.
[5] Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., ... & Irving, G. (2021). Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.
[6] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., ... & Fiedel, N. (2023). Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240), 1-113.