Extrapolating performance in language modeling benchmarks
Executive summary
- We investigate trends in large language model performance across five orders of magnitude of parameter scaling, covering 16 recent model architectures at 35 different model sizes.
- We present data on performance in BIG-Bench and MMLU, covering a range of model sizes and architectures.
- We examine trends in performance, showing a fairly smooth relationship between overall performance and scale, consistent with an S-curve currently yielding its steepest improvements.
- We outline an approach for predicting benchmark performance based on scaling-derived estimates of reducible loss.
- We back-test predictability of aggregate benchmark performance using this approach, showing that performance is moderately predictable from model scaling.
- We show that individual benchmark tasks are less predictable, but remain more predictable than chance or a simple per-task average baseline.
- We conclude that compute-based extrapolations are a promising way to forecast AI capabilities.
Background
Scaling laws allow prediction of a model’s loss from model and dataset sizes. However, scaling laws do not directly predict a model’s performance on downstream tasks, as assessed through benchmarks. To bridge this gap, we build on pre-existing methods and fit benchmark performance against scaling-derived estimates of loss. We then use back-testing to evaluate how well these fits predict benchmark performance.
First, we use scaling laws to map models to loss according to model size, N, and dataset size, D. Loss can equivalently be expressed in terms of scaled compute - the compute required to achieve this loss under optimal scaling of N and D. This allows every model in our dataset to be associated with its scaled compute, as shown in Table 1.
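To make this step concrete, the sketch below maps (N, D) to loss with a Chinchilla-style scaling law and inverts the compute-optimal frontier to express that loss as scaled compute. This is a minimal sketch under assumptions: the functional form and the coefficients (the published Hoffmann et al. 2022 estimates) are used purely for illustration and may differ from the scaling law fitted in the report.

```python
import numpy as np

# Chinchilla-style scaling law: loss as a function of parameters N and tokens D.
# Coefficients are Hoffmann et al. (2022) Approach 3 estimates, used only as an example.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Predicted loss for a model with N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def scaled_compute(target_loss):
    """Compute budget (FLOP, C ~ 6*N*D) whose compute-optimal loss equals target_loss,
    found by interpolating along the compute-optimal frontier."""
    C = np.logspace(18, 28, 2001)                    # candidate compute budgets
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    N_opt = G * (C / 6) ** (beta / (alpha + beta))   # compute-optimal parameter count
    D_opt = (C / 6) ** (alpha / (alpha + beta)) / G  # compute-optimal token count
    frontier = loss(N_opt, D_opt)                    # decreases monotonically with compute
    return 10 ** np.interp(target_loss, frontier[::-1], np.log10(C)[::-1])

# Example: a 70B-parameter model trained on 1.4T tokens.
pred_loss = loss(70e9, 1.4e12)
print(f"loss = {pred_loss:.3f}, scaled compute = {scaled_compute(pred_loss):.3e} FLOP")
```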

Table 1: Models included in our datasets. Many architectures, such as PaLM, include results from models at several sizes.
Subsequently, we fit curves relating benchmark performance to loss. This leads to fits like those in Figure 1, predicting performance from loss (expressed as scaled compute). We favor simple forms with few parameters throughout this work, as we typically have small datasets on the order of tens of datapoints per task. To evaluate predictability, we perform back-tests: we hold out the rightmost points of the loss-performance curves (the highest-compute models) when fitting, and then assess error in the predictions for those held-out points. We investigate fits for aggregate benchmark performance and for individual benchmark tasks.
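A minimal sketch of the fitting and back-testing step, assuming a three-parameter sigmoid in log scaled compute (ceiling L, slope k, midpoint x0); the compute values and scores below are hypothetical placeholders, not data from our benchmark series:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_C, L, k, x0):
    """Performance as a sigmoid in log10(scaled compute); L is the performance ceiling."""
    return L / (1 + np.exp(-k * (log_C - x0)))

# Hypothetical aggregate benchmark scores at various scaled-compute values (FLOP).
compute = np.array([1e20, 3e20, 1e21, 3e21, 1e22, 3e22, 1e23, 3e23, 1e24])
score   = np.array([0.05, 0.07, 0.11, 0.16, 0.24, 0.33, 0.45, 0.55, 0.66])

# Back-test: fit on all but the last `holdout` (highest-compute) points, predict the rest.
holdout = 3
x, y = np.log10(compute), score
popt, _ = curve_fit(sigmoid, x[:-holdout], y[:-holdout],
                    p0=[1.0, 1.0, 23.0], maxfev=10_000)
pred = sigmoid(x[-holdout:], *popt)
mae = np.mean(np.abs(pred - y[-holdout:]))
print(f"held-out mean absolute error: {mae:.3f}")
```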

Figure 1: Aggregate benchmark performance is fairly predictable from scaling-estimated loss, i.e., scaling of model size and dataset size. Loss, on the x-axis, is expressed as scaled compute. Sigmoid fits are shown in black for both plots, with bold lines showing data used for fitting and dashed lines showing extrapolation. GPT-4’s training details were not publicly reported, so we show lower and higher estimates of its scaled compute.
Results
Aggregate benchmark performance is fairly predictable from scaling, as shown in Figure 1. Using a sigmoid fit to predict across an order of magnitude (OOM) of scaling, mean absolute error is 5%, whereas, for example, scaling from 1e23 FLOP to 1e24 FLOP is associated with a performance improvement of more than 20% in BIG-Bench. Prediction requires some pre-existing progress: steep increases in performance make it difficult to predict far ahead using only data from low-performing models. Error gradually increases as one extrapolates further ahead, as shown in Figure 2. If current trends persist, our extrapolation suggests BIG-Bench performance could exceed human level (80%) at around 1e26 FLOP of scaled compute, with a 90% chance of reaching this level by 1e27 FLOP.

Figure 2: Absolute error versus how far ahead performance is extrapolated, for different fits. Errors are evaluated over the entire series of held-out points; bars show 90% confidence intervals.
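The back-testing loop behind a plot like Figure 2 can be sketched as follows, reusing the hypothetical series and sigmoid form from the previous snippet: for each cutoff, fit on the earlier points and bin the absolute error of each held-out prediction by how many OOMs it lies beyond the last fitted point.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_C, L, k, x0):
    return L / (1 + np.exp(-k * (log_C - x0)))

# Same illustrative (hypothetical) series as in the previous sketch.
log_C = np.log10([1e20, 3e20, 1e21, 3e21, 1e22, 3e22, 1e23, 3e23, 1e24])
score = np.array([0.05, 0.07, 0.11, 0.16, 0.24, 0.33, 0.45, 0.55, 0.66])

# For each cutoff, fit on the points up to the cutoff and record the absolute
# error at each held-out point, keyed by extrapolation distance in OOMs.
errors = {}
for cut in range(5, len(score)):
    popt, _ = curve_fit(sigmoid, log_C[:cut], score[:cut],
                        p0=[1.0, 1.0, 23.0], maxfev=10_000)
    for j in range(cut, len(score)):
        dist = log_C[j] - log_C[cut - 1]   # OOMs beyond the last fitted point
        errors.setdefault(round(dist, 2), []).append(
            abs(sigmoid(log_C[j], *popt) - score[j]))

for dist, errs in sorted(errors.items()):
    print(f"{dist:.2f} OOMs ahead: MAE = {np.mean(errs):.3f}")
```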
Individual tasks are highly variable in their scaling, and the sharp emergence of capabilities can make their performance difficult to predict. Figure 3 shows qualitative examples, ranging from well-predicted tasks to those where scaling clearly deviates from a sigmoid. Nevertheless, performance on individual benchmark tasks is significantly more predictable than chance or a simple baseline, as shown in Figure 2. The distribution of errors across tasks is fat-tailed: over half of tasks can be predicted five points ahead with less than 10% error, but some tasks have substantially higher error, particularly those using exact string match as their preferred metric (discussed in more detail in the full report). Figure 3 also illustrates why previous analyses that examined only a small number of models found performance to be unpredictable from scaling, whereas a longer data series shows significant (but imperfect) predictability.

Figure 3: Per-task fit quality varies widely. Example data and back-tested fits for three BIG-Bench tasks, showing performance versus loss (expressed as scaled compute). Blue points are held out for back-testing.
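The per-task comparison against a simple baseline can be sketched in the same way. The two task series below are hypothetical (one smooth, one sharply emergent), and the baseline simply predicts the average of the observed points, roughly matching the per-task average baseline mentioned above.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_C, L, k, x0):
    return L / (1 + np.exp(-k * (log_C - x0)))

# Hypothetical per-task series: one smooth task and one "emergent" task whose
# score stays near chance before jumping (illustrative numbers only).
log_C = np.linspace(20, 24, 9)
tasks = {
    "smooth_task":   np.array([0.06, 0.09, 0.14, 0.21, 0.30, 0.41, 0.52, 0.62, 0.70]),
    "emergent_task": np.array([0.02, 0.02, 0.03, 0.02, 0.03, 0.05, 0.18, 0.45, 0.71]),
}

holdout = 3
for name, y in tasks.items():
    x_fit, y_fit = log_C[:-holdout], y[:-holdout]
    popt, _ = curve_fit(sigmoid, x_fit, y_fit, p0=[1.0, 1.0, 23.0], maxfev=10_000)
    fit_err = np.mean(np.abs(sigmoid(log_C[-holdout:], *popt) - y[-holdout:]))
    base_err = np.mean(np.abs(y_fit.mean() - y[-holdout:]))   # per-task average baseline
    print(f"{name}: sigmoid MAE = {fit_err:.3f}, average-baseline MAE = {base_err:.3f}")
```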
In conclusion, our results show that language benchmarks are fairly predictable from scaling, although prediction requires some pre-existing progress: steep increases in performance make it difficult to predict far ahead using only data from poorly-performing models. Aggregate benchmarks are much more predictable than individual tasks, a pattern seen in both BIG-Bench and MMLU. This supports the idea that overall model capabilities improve predictably with scale, and lends weight to a scaling-focused view of AI development. We hope that methods of this sort may eventually provide useful forecasts for guiding research and policy.