Scaling Laws for Neural Language Models

Kaplan et al., 2020

Summary

  • Study empirical scaling laws for language model performance
  • Loss scales as a power law with model size, dataset size, and training compute
  • Architectural details (e.g. network width and depth) have minimal effects
  • Larger models are significantly more sample-efficient
  • “…optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.”
  • Links: [ website ] [ pdf ]

Background

  • While deep learning has been empirically successful, model performance depends on many factors, such as model architecture, model size, training compute, and training dataset size
  • The high ceiling and low floor for performance on language tasks allow for the investigation of performance trends across several orders of magnitude
  • Power-law scalings with model and dataset size have been shown in density estimation and random forests

Methods

  • Train (mostly) decoder-only Transformer models on WebText2, a web scrape of outbound links from Reddit (20.3M documents, $1.62 \times 10^{10}$ words)
    • Fixed $2.5 \times 10^5$ steps with a batch size of 512 sequences of 1024 tokens
  • Parameterize Transformer architecture with hyperparameters:
    • $n_{layer}$: number of layers
    • $d_{model}$: dimension of residual stream
    • $d_{ff}$: dimension of intermediate feed-forward layer
    • $d_{attn}$: dimension of attention output
    • $n_{heads}$: number of attention heads per layer
  • Model size $N \approx 2d_{model}n_{layer}(2d_{attn}+d_{ff}) = 12n_{layer}d^2_{model}$ (non-embedding parameters, using the standard choices $d_{attn} = d_{model}$ and $d_{ff} = 4d_{model}$)
  • Dataset size $D$ in tokens
  • Compute $C \approx 6NBS$, where $N$ is model size, $B$ is batch size in tokens, and $S$ is the number of training steps (a numeric sketch of $N$ and $C$ follows this list)
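
A minimal numeric sketch of the two estimates above, plugging in a hypothetical GPT-2-small-like configuration ($n_{layer} = 12$, $d_{model} = 768$) together with the batch size and step count listed above; the function names and the PF-day conversion are mine, not the paper's:

```python
# Sketch of the parameter-count and compute estimates above (not the paper's code).
def non_embedding_params(n_layer: int, d_model: int) -> int:
    """N ~= 2 * d_model * n_layer * (2 * d_attn + d_ff) = 12 * n_layer * d_model^2,
    using the standard choices d_attn = d_model and d_ff = 4 * d_model."""
    d_attn, d_ff = d_model, 4 * d_model
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)

def training_compute(n_params: int, batch_tokens: int, n_steps: int) -> int:
    """C ~= 6 * N * B * S (in FLOPs), with the batch size B measured in tokens."""
    return 6 * n_params * batch_tokens * n_steps

# Hypothetical GPT-2-small-like configuration, trained with the setup above.
N = non_embedding_params(n_layer=12, d_model=768)            # ~8.5e7 parameters
C = training_compute(N, batch_tokens=512 * 1024, n_steps=250_000)
print(f"N ~ {N:.2e} params, C ~ {C:.2e} FLOPs (~{C / (1e15 * 24 * 3600):.2f} PF-days)")
```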

Results

  • Power laws:
    • Performance depends strongly on scale ($N$, $D$, $C$) and only weakly on model shape
    • When not bottlenecked by the other two factors, the power-law relationships span more than six orders of magnitude
  • Overfitting:
    • Performance enters a regime of diminishing returns when either $N$ or $D$ is held fixed while the other increases
    • The overfitting penalty depends on the ratio $N^{0.74}/D$: an 8x increase in model size needs only ~5x more data to avoid a penalty (a numeric sketch follows this list)
  • Efficiency:
    • Large models require fewer optimization steps and data points to reach the same performance compared to smaller models
    • With a fixed compute budget $C$, the best performance is obtained by training very large models and stopping well before convergence
      • Data requirements grow slowly with compute, $D \sim C^{0.27}$ (illustrated in a sketch below)
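
A rough numeric illustration of the diminishing returns and the overfitting penalty above. The exponents $\alpha_N \approx 0.076$ and $\alpha_D \approx 0.095$ are the approximate single-variable power-law fits reported in the paper; the arithmetic around them is my sketch:

```python
# Worked illustration of the power-law scalings (my sketch, not the paper's code).
ALPHA_N = 0.076   # approx. fit reported in the paper: L(N) ∝ N^(-ALPHA_N)
ALPHA_D = 0.095   # approx. fit reported in the paper: L(D) ∝ D^(-ALPHA_D)

# Diminishing returns: doubling either scale factor shaves only a few percent off the loss.
print(f"loss ratio from 2x params: {2 ** -ALPHA_N:.3f}")   # ~0.95
print(f"loss ratio from 2x data:   {2 ** -ALPHA_D:.3f}")   # ~0.94

# Overfitting penalty ~ N^0.74 / D: keeping it constant requires D ∝ N^0.74,
# so an 8x larger model needs only about 5x more data.
print(f"data needed to match 8x params: {8 ** 0.74:.1f}x")  # ~4.7x
```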
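
A small sketch of how slowly data requirements grow under compute-efficient training, using the $D \sim C^{0.27}$ relation above; the helper name is mine:

```python
# Growth in training tokens as the compute budget grows, per D ~ C^0.27.
def data_factor(compute_factor: float, exponent: float = 0.27) -> float:
    """Multiplicative increase in data for a given multiplicative increase in compute."""
    return compute_factor ** exponent

for c in (10, 100, 1000):
    print(f"{c:>4}x compute -> ~{data_factor(c):.1f}x data")
# 10x -> ~1.9x, 100x -> ~3.5x, 1000x -> ~6.5x
```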

Conclusion

  • Since the scalings are power laws, there are diminishing returns as scale increases
  • The sample-efficiency of large models is surprising and may suggest that “big models” is more important than “big data” moving forward
  • Using networks that grow as they train might be useful for remaining compute-efficient in settings where data grows over time, e.g. lifelong/continual learning
  • Would architecture differences have a greater effect if they weren’t just limited to depth/width scaling of a specific class of architectures (e.g. Transformers)?
Elias Z. Wang
AI Researcher | PhD Candidate