# Estimating Training Compute of Deep Learning Models

We describe two approaches for estimating the training compute of Deep Learning systems, by counting operations and looking at GPU time.

ML models trained with more compute have better performance and more advanced capabilities (see e.g. Kaplan et al., 2020 or Hoffmann et al., 2022). Because of this, estimating and reporting compute usage is crucial for accurate comparisons between ML models.

Compute usage is commonly measured as the number of floating point operations (FLOP) required to train the final version of the system. To estimate this we can resort to two strategies: a) using information about the architecture and amount of training data, or b) using information about the hardware used and training time.

Below we provide two calculators that illustrate these methods.

Do you see a mistake or do you want to submit missing information about hardware specs? Fill this form and we will look into it.

## Introduction

In this article we will explain (with examples) how to estimate the amount of compute used to train an AI system. We will explain two procedures, one based on the architecture of the network and number of training batches processed; and another based on the hardware setup and amount of training time.

This is largely based on AI and Compute, where the authors use these two methods to estimate the training compute of several milestone AI systems. We explain the methods in more detail.

Our final goal is to produce an estimate in terms of the number of floating point operations (FLOP) used to train the system. Other units exist - we discuss two other popular units in the table below.

## Method 1: Counting operations in the model

The first method is quite straightforward, and can be summarized as:

$\text{training_compute} = (\text{ops_per_forward_pass} + \text{ops_per_backward_pass}) * \text{n_passes}$

Where ops_per_forward_pass is the number of operations in a forward pass, ops_per_backward_pass is the number of operations in a backward pass, and n_passes is the number of full passes made during training (a full pass comprises one forward and one backward pass).

n_passes is just equal to the product of the number of epochs and the number of examples:

$\text{n_passes} = \text{n_epochs} * \text{n_examples}$

If the number of examples is not directly stated, it can sometimes be computed as the number of batches per epoch times the size of each batch: n_examples = n_batches * batch_size.

The ratio of ops_per_backward_pass to ops_per_forward_pass is relatively stable, so if we define fp_to_bp_ratio = ops_per_backward_pass / ops_per_forward_pass we end up with the formula:

$\text{training_compute} = \text{ops_per_forward_pass} * (1 + \text{fp_to_bp_ratio}) * \text{n_passes}$

We estimate the value of fp_to_bp_ratio as 2:1 (see box below). The final formula is then:

$\text{training_compute} = \text{ops_per_forward_pass} * 3 * \text{n_epochs} * \text{n_examples}$

This formula is a heuristic. Depending on the exact architecture and training details it can be off[2]. We have found it to be a reasonable approximation in practice[3].
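The formulas above can be sketched as a short Python helper (a sketch; the function and argument names are ours, and the default fp_to_bp_ratio of 2 is the estimate discussed above):

```python
def training_compute(ops_per_forward_pass, n_epochs, n_examples,
                     fp_to_bp_ratio=2):
    """Estimate training FLOP from per-pass operations and dataset size.

    training_compute = ops_per_forward_pass * (1 + fp_to_bp_ratio)
                       * n_epochs * n_examples
    """
    n_passes = n_epochs * n_examples
    return ops_per_forward_pass * (1 + fp_to_bp_ratio) * n_passes

# Toy example: 1e9 FLOP per forward pass, 10 epochs over 1M examples
# -> 1e9 * 3 * 10 * 1e6 = 3e16 FLOP
print(training_compute(1e9, n_epochs=10, n_examples=1e6))  # -> 3e+16
```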

### Forward pass compute and parameter counts of common layers

The remaining ingredient is the number of operations per forward pass. Sometimes the authors are kind enough to provide this information in their papers. More often they do not, and we need to infer it from the architecture.

To help you do this, we have put together a table of common neural network layers, estimating their number of parameters and the number of floating point operations needed for a forward pass of the layer.

Note that in many layers the number of FLOP in a forward pass is approximately equal to twice the number of parameters. This gives a reasonable approximation of the number of operations when we already know the number of parameters and there is no parameter sharing. Conversely, it gives us a way to estimate the number of parameters from the number of operations in a forward pass.

There are however many exceptions to this rule. For example, CNNs have fewer parameters because of parameter sharing, and word embedding layers perform essentially no floating point operations, since they are lookups.

A more precise heuristic is that the amount of operations in the forward pass is roughly twice the number of connections in the model. This is also satisfied by CNNs.
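As a sketch (with toy layer sizes of our choosing), the two heuristics can be compared directly in Python:

```python
# FLOP ~= 2 * connections holds for both layer types below; FLOP ~= 2 * params
# fails for the conv layer because its weights are shared across positions.

def dense_layer(n_in, n_out):
    params = (n_in + 1) * n_out          # weights + biases
    connections = n_in * n_out
    flop = 2 * connections               # one multiply-add per connection
    return params, connections, flop

def conv_layer(h_out, w_out, k, c_in, c_out):
    params = (k * k * c_in + 1) * c_out  # shared filters + biases
    connections = h_out * w_out * k * k * c_in * c_out
    flop = 2 * connections               # one multiply-add per connection
    return params, connections, flop

p, c, f = dense_layer(1024, 1024)
print(f / p)   # ~2 FLOP per parameter
p, c, f = conv_layer(16, 16, 3, 3, 8)
print(f / p)   # >> 2: parameter sharing breaks the 2*params rule
print(f / c)   # exactly 2: the 2*connections rule still holds
```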

In addition, there are some software tools that can be used to automatically compute the number of parameters and the number of FLOP for the forward pass in an architecture. See appendix A for discussion on using these profilers.

### Example: CNN-LSTM-FCN model

For example, suppose that we have a CNN-LSTM-FCN architecture such that:

• The input is a sequence of images of shape [400x400x5].
• The average length of each input sequence is 20 images.
• The CNN has 16 filters of shape 5x5x5 and is applied with stride 2 and padding 2.
• The LSTM is a many-to-one layer with 256 output units and bias vectors.
• The fully connected layer has 10 output units.
• The training process takes 10 epochs, where each epoch consists of 100 batches of size 128 sequences each.

The CNN and the LSTM make up the recurrent part of the network (they are applied once per image in the sequence), while the FC layer is the non-recurrent part.

The CNN takes a 400*400*5 input and produces an output with width and height H' = W' = ⌊(W − K + 2P)/S⌋ + 1 = ⌊(400 − 5 + 2*2)/2⌋ + 1 = 200 and 16 channels. At 2 FLOP per connection, the forward pass of the CNN takes about 2*H'*W'*K^2*C*D = 2*200^2*5^2*5*16 = 1.6e8 FLOP.

Before feeding it into the LSTM, the output of the CNN is rearranged into a 200*200*16 = 640,000-unit input. The LSTM then performs about 4*2*(N+M)*M = 4*2*(640000+256)*256 ≈ 4*2*640000*256 ≈ 1.311e9 FLOP per token of the sequence. Finally, the FC layer takes about 2*N*M = 2*256*10 = 5120 FLOP.

The non-recurrent FC layer is negligible compared to the recurrent part, so we can approximate the total number of operations as

training_compute ≈ ops_per_forward_pass_recurrent * 3 * n_epochs * n_batches * batch_size * avg_tokens_per_sequence ≈ (1.6e8 + 1.311e9) FLOP * 3 * 10 * 100 * 128 * 20 ≈ 1.1e16 FLOP.
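The LSTM and FC per-token figures can be checked with a couple of lines of Python (a sketch; the formulas follow the layer table above):

```python
def lstm_flop_per_token(n_in, n_hidden):
    # 4 gates, each a dense map over the concatenated [input, hidden] vector,
    # at 2 FLOP (multiply + add) per weight
    return 4 * 2 * (n_in + n_hidden) * n_hidden

def dense_flop(n_in, n_out):
    return 2 * n_in * n_out

print(lstm_flop_per_token(640_000, 256))  # 1311244288, i.e. ~1.31e9
print(dense_flop(256, 10))                # 5120
```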

When the architecture is too complex or we lack details of some of the layers we may want to use a method based on estimating the amount of operations from the amount of training time and the hardware used for training. We will cover that in the next section.

### Example: Transformer

Let’s take a look at the Transformer architecture in Attention is all you need.

• The input is a sequence of tokens, with an average length of 20 and a vocabulary size of 30,000.
• Each token is embedded and represented as a vector of size 1024.
• There are six encoder and decoder layers.
• Each encoder-decoder pair has a total of 3 multi-headed attention (MHA) sublayers, and 2 fully connected (FCN) sublayers.
• At the end there is a final linear layer and a softmax.

Each MHA sublayer has the following parameters:

• Number of attention heads H=16
• Input size W=64
• Key size D=64
• Final output size M=1024

So the FLOP per token for a single MHA sublayer are 2*H*(W*(2*D+N)+L*(D+N)+N*M) = 2*16*(64*(2*64+64)+20*(64+64)+64*1024) ≈ 2.6e6, where L=20 is the average sequence length and N=64. Each FCN sublayer has an input size of 1024, an output size of 1024, and a single hidden layer with 4096 units, so the FLOP per token for each FCN sublayer are 2*2*N*M = 4*1024*4096 ≈ 1.7e7.

Without taking into account the Add & Norm sublayers (they are negligible), the whole encoder-decoder stack has a total per-token FLOP of 6 * (3 * 2.6e6 + 2 * 1.7e7) ~= 2.5e8. The FLOP per token of the final linear layer (matrix multiplication) are 2 * 1024 * 3e4 = 6.1e7. The final softmax is negligible too, so a single forward pass of a full sequence takes 2.5e8 + 6.1e7 = 3.1e8 FLOP per token.

The paper says they use batches of 25,000 tokens and run the training for 300,000 steps. So the total training FLOP would be 2.5e4 * 3e5 * 3 * 3.1e8 ≈ 6.97e18 FLOP.
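The whole worked example can be reproduced in a few lines (a sketch; variable names are ours, and the per-token formulas are the ones in the text):

```python
# Per-token FLOP estimate for the Transformer example.
H, W, D, M = 16, 64, 64, 1024   # heads, input size, key size, output size
L, N = 20, 64                   # average sequence length, N as in the text
VOCAB, HIDDEN = 30_000, 4096

mha = 2 * H * (W * (2 * D + N) + L * (D + N) + N * M)   # one MHA sublayer
fcn = 2 * 2 * M * HIDDEN                                # one FCN sublayer
stack = 6 * (3 * mha + 2 * fcn)     # 6 encoder-decoder pairs
final = 2 * M * VOCAB               # final linear layer
per_token = stack + final

tokens_per_batch, steps = 25_000, 300_000
training_flop = per_token * 3 * tokens_per_batch * steps
print(f"{per_token:.2e} FLOP per token, {training_flop:.2e} training FLOP")
```

The exact result, ≈6.95e18 FLOP, matches the figure above up to rounding (the text rounds the per-token cost to 3.1e8).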

## Method 2: GPU time

Instead of a detailed understanding of the forward pass compute, we can also use the reported training time and GPU model performance to estimate the training compute.

GPU-days describe the accumulated number of days a single GPU has been used for the training. If the training lasted 5 days and a total of 4 GPUs were used, that equals 20 GPU-days.

This metric has the obvious downside that it does not account for the computing hardware used: 20 GPU-days today correspond to many more FLOP than 20 GPU-days ten years ago[4].

In this section we will see how to correct for this. The final estimate will be:

$\text{training time} \times \text{\# of cores} \times \text{peak FLOP/s} \times \text{utilization rate}$

### Estimating the number of FLOP from the GPU time

Step 1: Extract information from the paper

Extract the following information from the paper/reference:

1. Number of GPU-days
2. The computing system/GPU used
• When in doubt, one could use the most common system among publications from the same year, or the geometric mean of the FLOP/s of computing systems from that year.
3. The number representations used during the training run: FP32, FP16, BF16, INT8, etc.
• When in doubt, we recommend defaulting to FP32, as this is the default option in most frameworks. However, in recent years FP16 has become more prominent.

Step 2: Reading the hardware specifications

Learn about the computational performance of the GPU/system by extracting it from the specifications. Search for the system on Google and access a specifications sheet from the manufacturer.

Those datasheets usually list the peak performance (more on this in the utilization information box below) for a given number representation. Most GPUs come in several variants with different memory bandwidths and system interfaces. In general, we recommend using the NVLink version rather than the PCIe version, assuming that the model was trained on a server cluster.

Here is an example for an NVIDIA A100:

If you cannot find the used hardware or the specifications of the mentioned hardware, we suggest referring to our sheet (HARDWARE_DATA) with estimates on the average computing capability in a given year. You can also find a chart with peak performance per year in the box below.

Step 3: Perform the estimate

Using the extracted information from the paper and the specifications of the hardware, we can now calculate an estimate of the training compute.

$\text{training time} \times \text{\# of cores} \times \text{peak FLOP/s} \times \text{utilization rate}$

As the datasheet lists peak performance, we need to correct this by assuming a non-perfect utilization rate (< 100%). We suggest using a utilization rate of 0.3 for large language models, and a utilization rate of 0.4 for other networks.
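The estimate can be sketched as a small helper (the function name and the toy numbers are ours):

```python
def gpu_time_training_flop(gpu_days, peak_flop_per_s, utilization=0.3):
    """Estimate training FLOP from accumulated GPU time.

    training_flop = training time * # of GPUs * peak FLOP/s * utilization,
    where gpu_days already folds together training time and # of GPUs.
    """
    seconds = gpu_days * 86_400          # seconds per GPU-day
    return seconds * peak_flop_per_s * utilization

# Toy example: 20 GPU-days on hardware with 1e14 peak FLOP/s, 40% utilization
print(gpu_time_training_flop(20, 1e14, utilization=0.4))
```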

### Example: Image GPT

As an example, we will show how to estimate the training compute of Image GPT.

In the blogpost, we find the following:

“[..]iGPT-L was trained for roughly 2500 V100-days […]“

Unfortunately, we cannot find any information about the number representation in the blog post or paper. However, given the size and publication date of this model, and that its author is a corporation, we assume FP16 performance.

As a next step, we need to learn about the specifications of the NVIDIA V100. We can find the datasheet easily by googling for it: NVIDIA V100 Specifications.

In the datasheet, we find three different versions of the GPU. The NVLink system interface is usually used in datacenters, and we recommend defaulting to this option, assuming the model was trained by a company. However, the choice of version is not that crucial, as the performance differences are small compared to the error in our estimates.

We find 125 TFLOPS (TFLOP/s) for the tensor core (FP16) performance. However, making full use of the tensor cores requires not only FP16 inputs but also various properties of the network architecture. See here for more on this. We should take this into account when choosing the utilization rate.

By default, we would assume a utilization of 30% to 40%. However, given the special requirements for achieving full tensor core performance, we pick 30%.

$30\% \times 125e12 \, \frac{FLOP}{s} \times 2500 \, \text{days} \times 86400 \, \frac{s}{day} \approx 8.1e21 \, \text{FLOP}$
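The same arithmetic in Python, using the values quoted above:

```python
# Reproducing the Image GPT estimate (values from the text).
utilization = 0.30
peak_flop_per_s = 125e12      # NVIDIA V100, FP16 tensor core peak
gpu_days = 2500               # "roughly 2500 V100-days"
seconds_per_day = 86_400

training_flop = utilization * peak_flop_per_s * gpu_days * seconds_per_day
print(f"{training_flop:.1e}")  # 8.1e+21
```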

For more examples on estimating the compute used from the GPU time, see the Appendix of the blogpost AI and Compute.

## Conclusion

In this article we have explained two methods for estimating the training compute in FLOP of neural network models - one based on counting the number of operations in a forward pass and another one based on GPU time.

Generally we recommend defaulting to the first method when possible, as it is more precise; GPU utilization in particular is hard to estimate. Ideally, one would perform the estimate both ways and compare them as a sanity check[5].

Over the course of this article we have produced some novel insights:

• an estimate of the ratio of operations between the forward and backward pass in a neural network,
• an analysis of how recurrence affects the estimation of compute,
• a table with parameter counts and forward-pass FLOP for some common NN layers,
• a method to estimate the training FLOP from the commonly shared metric of GPU-days, and
• a best guess on the average computing capability of a single GPU in a given year.

We hope this article will help readers starting their journey into scaling laws, and help provide a reference to standardize estimations of training compute.

### Acknowledgements

Thank you to Sylvain Viguier, Sasha Luccioni, Matthew Carrigan, Danny Hernandez, Girish Sastry and Stella Rose Biderman for their help answering our questions on estimating compute and GPU utilization rate.

Jean-Stanislas Denain helped us amass data about GPUs.

Thank you to Gwern Branwen and Jojo Lee for comments on the report.

## Appendices

### Appendix A: Profiler

In addition to the two discussed methods, one could use a profiler. A profiler is an analysis tool for measuring metrics of interest. While measuring the number of FLOP executed during training is theoretically and technically possible, none of the profilers we tried (NVIDIA Nsight, PyTorch: main package and autograd) could fulfill our requirements[6]; many of them focus on profiling inference[7].

Consequently, the two methods above are the methods of choice, as (i) none of the existing profilers fulfills the required criteria and (ii) we need methods that can estimate the compute of already published models from the available data.

Profilers that only measure the forward pass, e.g. fvcore, ptflops or pthflops, work and do their job. Our problems only arose when we tried to measure anything beyond the forward pass.

### Appendix B: Comparing the estimates of different methods

To check whether both methods provide estimates that are consistent with one another, we compute both estimates for a few models for which this is feasible. The results (See figure below) confirm that the estimates are generally very similar (they differ by no more than a factor of 1.7).

Details of these estimates may be found here.

### Appendix C: Pre-training, fine-tuning and trial runs

It is common to pre-train a large model on a large dataset and then fine-tune it on a smaller dataset. Similarly, it is common for researchers to manually train and tweak multiple versions of a system before settling on the final architecture.

We recommend counting the pre-training compute as part of the total training compute.

However we do not recommend counting the tweak runs.

While these trial runs are important, for reproducibility purposes it is the pre-training and fine-tuning of the final architecture that matter most. And, pragmatically speaking, information on the compute used to train previous versions while searching for the right architecture is seldom reported.

### Appendix D: Recurrent models

The formula is more complex for recurrent models, depending on the type of recurrency.

### Appendix E: Dropout

In method 1, we determined the training compute by counting the FLOP per parameter for the forward and backward passes, in order to determine the fp_to_bp_ratio. In practice, however, this value is likely to vary due to regularization techniques. In this section we specifically consider the effect of dropout. Dropout sets individual neurons to 0 (“dropping them out”), each neuron being retained with probability p, effectively yielding a thinned network with fewer active units.

Clearly this can cause the number of FLOP in a forward to decrease quite significantly, depending on the value of p, but how much exactly? In a standard neural network, the forward pass is an affine transformation and compute is dominated by the dot product if the number of neurons per layer is sufficiently large:

$z_i^{(l+1)} = w_i^{(l+1)} \cdot y^{(l)} + b_i^{l+1},$

$y_i^{(l+1)} = f\left(z_i^{(l+1)}\right),$

where the symbols have their usual meanings (see for instance Srivastava et al). With dropout, the neuron value is instead a random sample drawn from a Bernoulli distribution:

$r_j^{(l)} \sim \text{Bernoulli}(p),$

$\hat{y}^{(l)} = r^{(l)} \ast y^{(l)},$

$z_i^{(l+1)} = w_i^{(l+1)} \cdot \hat{y}^{(l)} + b_i^{(l+1)},$

$y_i^{(l+1)} = f\left(z_i^{(l+1)}\right),$

where $\ast$ denotes the element-wise (Hadamard) product. If we assume that the number of neurons is small relative to the number of parameters, then we can ignore the contributions of the first two steps (random sampling and the element-wise product), again yielding roughly 2 FLOP per parameter.

What happens in the backward pass? The architecture is still a thinned network, and the previous consideration still holds: there are about 5 FLOP per parameter in backpropagation (under the same assumptions as before). Thus with dropout we should expect roughly the same value of 2.5 for the fp_to_bp_ratio, perhaps adjusted slightly upward compared to the standard neural network.

Note that in addition to the fp_to_bp_ratio staying the same under dropout, ops_per_forward_pass doesn’t change much either. This is because dropout is typically implemented as in the equations above, by multiplying a neuron by 0 if it is to be dropped out (see for instance the TensorFlow implementation of dropout). Thus dropout doesn’t reduce the number of operations (as one might expect if the neurons were truly removed from the network); in fact it increases it slightly, though probably not significantly. This suggests that the compute implications of dropout are minimal.
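A minimal NumPy sketch of this implementation detail (layer sizes are ours): the dropout mask is applied by element-wise multiplication, so the subsequent dense matmul still performs its full 2*N*M FLOP regardless of how many neurons were zeroed.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_forward_with_dropout(x, w, b, keep_prob=0.8):
    # Dropout as typically implemented: multiply by a Bernoulli mask.
    # The zeroed entries still participate in the matmul below, so the
    # FLOP count of the forward pass is essentially unchanged.
    mask = rng.binomial(1, keep_prob, size=x.shape)
    x_thinned = x * mask        # Hadamard product (extra FLOP, not fewer)
    return w @ x_thinned + b    # full dense layer: ~2 * N * M FLOP

x = rng.standard_normal(640)
w = rng.standard_normal((256, 640))
b = np.zeros(256)
y = dense_forward_with_dropout(x, w, b)
print(y.shape)  # (256,)
```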

The inference compute is also largely unchanged; at most it is slightly increased. At test time this is generally implemented by multiplying the weights of a neuron's connections by the probability p with which the neuron was retained. This corresponds to the same standard neural network, with a number of additional one-off calculations at test time, less than or equal to the number of parameters.

In short, it seems that other methods of regularization like using an L1 norm are likely to have a larger impact on the training compute.

1. On their website, NVIDIA states “The peak single-precision floating-point performance of a CUDA device is defined as the number of CUDA Cores times the graphics clock frequency multiplied by two. The factor of two stems from the ability to execute two operations at once using fused multiply-add (FFMA) instructions”. We interpret this statement to mean that NVIDIA used the FMA=2FLOP assumption.

2. In appendix D we discuss how the formula changes when considering recurrent models.

3. In appendix E we discuss the effect of dropout in the training compute. We find that in a popular implementation of dropout it does not affect the amount of operations in the forward nor backward pass.

4. You might want to compare this to “travel-days” as a measure. Ultimately you are interested in the distance covered, so you need to adjust for the means of transportation: walking, a horse, a car, etc. Computing hardware in particular has seen tremendous improvements in computational power over the years, so this adjustment matters.

5. In appendix B we show a comparison of the estimates resulting from both methods. We conclude that they are reasonably similar, and tend to be within a factor of 2 of each other.

6. Marius goes into more details in the post “How to measure FLOP/s for Neural Networks empirically?”. It is easier to profile the forward pass but as soon as you add the backward pass, most profilers give wrong estimates.

7. There is a strong interest in using profilers to optimize inference, as inference makes up the majority of total compute and therefore of the costs (70% to 90%; Patterson et al. 2021).

If you want to contribute to our research, consider filling our expression of interest form.