pith. sign in

One Epoch Is All You Need

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it
abstract

In unsupervised learning, collecting more data is not always a costly process unlike the training. For example, it is not hard to enlarge the 40GB WebText used for training GPT-2 by modifying its sampling methodology considering how many webpages there are in the Internet. On the other hand, given that training on this dataset already costs tens of thousands of dollars, training on a larger dataset naively is not cost-wise feasible. In this paper, we suggest to train on a larger dataset for only one epoch unlike the current practice, in which the unsupervised models are trained for from tens to hundreds of epochs. Furthermore, we suggest to adjust the model size and the number of iterations to be performed appropriately. We show that the performance of Transformer language model becomes dramatically improved in this way, especially if the original number of epochs is greater. For example, by replacing the training for 10 epochs with the one epoch training, this translates to 1.9-3.3x speedup in wall-clock time in our settings and more if the original number of epochs is greater. Under one epoch training, no overfitting occurs, and regularization method does nothing but slows down the training. Also, the curve of test loss over iterations follows power-law extensively. We compare the wall-clock time of the training of models with different parameter budget under one epoch training, and we show that size/iteration adjustment based on our proposed heuristics leads to 1-2.7x speedup in our cases. With the two methods combined, we achieve 3.3-5.1x speedup. Finally, we speculate various implications of one epoch training and size/iteration adjustment. In particular, based on our analysis we believe that we can reduce the cost to train the state-of-the-art models as BERT and GPT-2 dramatically, maybe even by the factor of 10.

citation-role summary

background 2

citation-polarity summary

roles

background 2

polarities

background 2

representative citing papers

Scaling Laws for Autoregressive Generative Modeling

cs.LG · 2020-10-28 · accept · novelty 7.0

Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.

Internal Data Repetition Destroys Language Models

cs.LG · 2026-06-23 · unverdicted · novelty 6.0

Repetition of training data produces a systematic eval loss peak at intermediate repeat counts whose location scales with model size, quantifiable as large compute-equivalent loss even at modest repetition fractions.

Scaling Data-Constrained Language Models

cs.CL · 2023-05-25 · conditional · novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

citing papers explorer

Showing 8 of 8 citing papers.

  • LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs cs.CV · 2021-11-03 · unverdicted · none · ref 4

    LAION-400M is a publicly released open dataset of 400 million CLIP-filtered image-text pairs with embeddings and kNN indices for efficient search.

  • Scaling Laws for Autoregressive Generative Modeling cs.LG · 2020-10-28 · accept · none · ref 10

    Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.

  • Internal Data Repetition Destroys Language Models cs.LG · 2026-06-23 · unverdicted · none · ref 37 · internal anchor

    Repetition of training data produces a systematic eval loss peak at intermediate repeat counts whose location scales with model size, quantifiable as large compute-equivalent loss even at modest repetition fractions.

  • Scaling Data-Constrained Language Models cs.CL · 2023-05-25 · conditional · none · ref 55 · internal anchor

    Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

  • GPT-NeoX-20B: An Open-Source Autoregressive Language Model cs.CL · 2022-04-14 · accept · none · ref 48 · internal anchor

    GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.

  • Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 91 · internal anchor

    Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

  • Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 181

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  • A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 123

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.