Executive summary

Figure 1: Peak computational performance of common ML accelerators at a given precision. New number formats have emerged since 2016. Trendlines are shown for number formats with eight or more accelerators: FP32 and FP16 (FP = floating point, tensor-* = processed by a tensor core, TF = NVIDIA tensor float, INT = integer).

We study GPU computational performance across different number representations, along with memory capacity, memory bandwidth, and interconnect bandwidth, using a dataset of 47 ML accelerators (GPUs and other AI chips) commonly used in ML experiments from 2010 to 2023, plus 1,948 additional GPUs from 2006 to 2021. Our main findings are:

  1. Lower-precision number formats like 16-bit floating point (FP16) and 8-bit integer (INT8), combined with specialized tensor core units, can provide order-of-magnitude performance improvements for machine learning workloads compared to the traditionally used 32-bit floating point (FP32). For example, we estimate, based on limited data, that tensor-FP16 provides a roughly 10x speedup over FP32.
  2. Given that the overall performance of large hardware clusters for state-of-the-art ML model training and inference depends on factors beyond just computational performance, we investigate memory capacity, memory bandwidth and interconnects, and find that:
    1. Memory capacity is doubling every ~4 years and memory bandwidth every ~4.1 years. Both have increased at a slower rate than computational performance, which doubles every ~2.3 years. This is a common finding, often described as the memory wall.
    2. The latest ML hardware often comes with proprietary chip-to-chip interconnect protocols (NVIDIA's NVLink or Google TPU's ICI) that offer higher communication bandwidth between chips than PCI Express (PCIe). For example, NVLink in the H100 supports 7x the bandwidth of PCIe 5.0.
  3. Key hardware performance metrics and their improvement rates found in the analysis include: computational performance [FLOP/s] doubling every 2.3 years for both ML and general GPUs; computational price-performance [FLOP per $] doubling every 2.1 years for ML GPUs and 2.5 years for general GPUs; and energy efficiency [FLOP/s per Watt] doubling every 3.0 years for ML GPUs and 2.7 years for general GPUs.
| Specification and unit | Growth rate | Datapoint of highest performance | N |
|---|---|---|---|
| Computational performance, FLOP/s (FP32) | 2x every 2.3 [2.1; 2.6] years; 10x every 7.7 [6.9; 8.6] years; 0.13 [0.15; 0.12] OOMs per year | ~90 TFLOP/s (9e13 FLOP/s), NVIDIA L40 | 45 |
| Computational performance, FLOP/s (tensor-FP32) | NA1 | ~495 TFLOP/s (4.95e14 FLOP/s), NVIDIA H100 SXM | 7 |
| Computational performance, FLOP/s (tensor-FP16) | NA | ~990 TFLOP/s (9.9e14 FLOP/s), NVIDIA H100 SXM | 8 |
| Computational performance, OP/s (INT8) | NA | ~1980 TOP/s (1.98e15 OP/s), NVIDIA H100 SXM | 10 |
| Computational price-performance, FLOP per $ (FP32) | 2x every 2.1 [1.6; 2.91] years; 10x every 7 [5; 9] years; 0.14 [0.18; 0.10] OOMs per year | ~4.2 exaFLOP per $ (4.2e18 FLOP per $), AMD Radeon RX 7900 XTX | 33 |
| Computational energy efficiency, FLOP/s per Watt (FP32) | 2x every 3.0 [2.7; 3.3] years; 10x every 10 [9; 11] years; 0.10 [0.11; 0.09] OOMs per year | ~302 GFLOP/s per W (3e11 FLOP/s per W), NVIDIA L40 | 43 |
| Memory capacity, DRAM capacity (Byte) | 2x every 4 [3; 6] years; 10x every 13 [10; 19] years; 0.08 [0.10; 0.05] OOMs per year | ~128 GB (1.28e11 B), AMD Radeon Instinct MI250X | 47 |
| Memory bandwidth, DRAM bandwidth (Byte/s) | 2x every 4 [3; 5] years; 10x every 14 [11; 17] years; 0.07 [0.09; 0.06] OOMs per year | ~3.3 TB/s (3.3e12 B/s), NVIDIA H100 SXM | 47 |
| Interconnect bandwidth, chip-to-chip communication bandwidth (Byte/s) | NA | ~900 GB/s (9e11 B/s), NVIDIA H100 | 45 |

Table 1: Key performance trends. All estimates are computed only for ML hardware. Numbers in brackets refer to the [5; 95]-th percentile estimates from bootstrapping with 1,000 samples. OOM refers to order of magnitude, and N refers to the number of observations in our dataset. Note that computational performance figures refer to dense matrix multiplication.
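
As a quick check on how the growth-rate columns in Table 1 relate to one another, the following minimal Python sketch converts a doubling time into OOMs per year and into the time needed for a 10x increase. The function names are ours, for illustration only; small differences from the table arise because the table's values come from bootstrapped fits.

```python
import math

def doubling_time_to_ooms_per_year(doubling_time_years: float) -> float:
    """A 2x increase is log10(2) ~= 0.301 orders of magnitude (OOMs)."""
    return math.log10(2) / doubling_time_years

def doubling_time_to_10x_years(doubling_time_years: float) -> float:
    """Years needed for a 10x increase at a fixed doubling time."""
    return doubling_time_years / math.log10(2)

# FP32 computational performance: 2x every ~2.3 years (Table 1).
print(round(doubling_time_to_ooms_per_year(2.3), 2))  # ~0.13 OOMs per year
print(round(doubling_time_to_10x_years(2.3), 1))      # ~7.6 years per 10x
```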

Introduction

Advances in machine learning over the last decade have in large part been the result of scaling up the amount of computational resources (compute) used for training (Sevilla et al., 2022), and advancements in hardware performance have played a modest role in this progress. Increased investments in ML R&D (Cottier, 2023) have led to scaled-up hardware infrastructure, as the field has moved from a small number of chips to massive supercomputers.

This article provides an overview of trends in computational performance across a variety of number precisions and specialized components, such as tensor cores. Furthermore, we analyze additional performance factors such as memory capacity, memory bandwidth, and interconnect bandwidth. Overall, we aim to provide a holistic picture of the ML hardware specifications and components that jointly determine practical hardware performance, especially in the era of large ML models.

Throughout this work, we compare hardware using peak performance values for each metric, which we source from hardware producers' specification sheets.2 In practice, only a fraction of the specified peak computational performance is utilized; the achieved fraction depends on the workload and on the limits imposed by other specifications, such as memory capacity and bandwidth. For example, according to Leland et al., 2016, utilization for common supercomputing workloads might be between 5% and 20%, while in ML training it might range between 20% and 70%, depending on the size of the model, how it is parallelized, and other factors (Sevilla et al., 2022). Nevertheless, peak performance serves as a useful upper bound and a standard basis for comparison across different hardware accelerators and generations.
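
To make the gap between peak and utilized performance concrete, here is a minimal sketch; the peak value and utilization fractions below are illustrative round numbers in the ranges cited above, not values from our dataset.

```python
def effective_flops(peak_flops: float, utilization: float) -> float:
    """Utilized throughput given a peak spec and a utilization fraction."""
    return peak_flops * utilization

# Hypothetical accelerator with ~1e15 FLOP/s peak (tensor-FP16 class),
# running ML training at 20-70% utilization (Sevilla et al., 2022).
peak = 1e15
for util in (0.2, 0.5, 0.7):
    print(f"{util:.0%} utilization -> {effective_flops(peak, util):.1e} FLOP/s")
```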

Terminology

  • Number representation: We differentiate number representation along three dimensions:
    • Bit-length/precision: describes the number of bits used to store a number. This typically ranges from 4 to 64 bits.
    • Number format: refers to a specific layout of bits, e.g., integer or floating point. The number format typically includes the bit-length, as in FP32; however, we treat the bit layout and the bit-length separately in this piece.3
    • Computation unit: indicates whether a dedicated matrix multiplication unit is used. In this piece, we only differentiate between tensor and non-tensor (one possible way to encode these three dimensions in code is sketched after this list).
  • Hardware accelerator: refers to a chip that accelerates ML workloads, e.g., a GPU or TPU. We use the terms chip and hardware accelerator interchangeably as general terms, and GPU or TPU when referring to the specific type of accelerator.
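
To make the three dimensions of a number representation concrete, here is one possible way to encode them in code. The class and field names are our own illustrative choices, not part of any standard or of our dataset schema.

```python
from dataclasses import dataclass
from enum import Enum

class NumberFormat(Enum):
    FLOATING_POINT = "FP"
    INTEGER = "INT"

@dataclass(frozen=True)
class NumberRepresentation:
    """The three dimensions we use to describe a number representation."""
    bit_length: int               # precision, typically 4 to 64 bits
    number_format: NumberFormat   # bit layout: floating point vs. integer
    tensor_unit: bool             # processed by a dedicated matrix-multiply unit?

# Labels like "tensor-FP16" combine all three dimensions:
fp32 = NumberRepresentation(32, NumberFormat.FLOATING_POINT, tensor_unit=False)
tensor_fp16 = NumberRepresentation(16, NumberFormat.FLOATING_POINT, tensor_unit=True)
tensor_int8 = NumberRepresentation(8, NumberFormat.INTEGER, tensor_unit=True)
```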

Dataset

We compiled hardware specifications from two key datasets. The first includes 1,948 GPUs released between 2006 and 2021, based on Sun et al., 2019; we refer to it as the general GPU dataset, since it consists primarily of general-purpose GPUs not commonly used in ML training. The second dataset includes only 47 ML hardware accelerators released since 2010, such as NVIDIA GPUs and Google TPUs, which were commonly used in notable ML experiments (as defined in Sevilla et al., 2022). We curated the latter dataset ourselves and refer to it as the ML hardware dataset or, in short, the ML dataset. This dataset is publicly available in our datasheet.

In the following sections, we present the trends for different number representations, memory capabilities, computational price-performance, and energy efficiency. We briefly explain each metric's relevance for ML development and deployment, present our findings, and briefly discuss their implications.

Number representations

The numeric representation used for calculations strongly influences computational performance. More specifically, the number of bits per value determines arithmetic density (operations per chip area per second).4 In recent years, hardware manufacturers have introduced specialized lower-precision number formats for ML applications. While FP64 has been common in high-performance computing,5 FP32 performance has been the focus of most consumer applications for the last 15 or so years.

Number formats with less precision have become more prevalent in recent years, since low precision is sufficient for both developing and deploying ML models (Dettmers et al., 2022; Suyog Gupta et al., 2015; Courbariaux et al., 2014). According to Rodriguez, 2020, FP32 remains the most widely adopted number format for both ML training and inference today, with industry increasingly transitioning to lower-precision formats like FP16 and Google's bfloat16 (BF16) for certain training and inference tasks, as well as the integer format INT8 for select inference workloads.6 Other well-known emerging number formats are the 4-bit integer format INT4 and the NVIDIA-developed 19-bit floating-point format TF32.7
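
The sketch below (using PyTorch's finfo/iinfo utilities purely for illustration; any framework exposing format metadata would do) shows the range-versus-precision tradeoff behind these formats, e.g., why BF16 retains FP32-like dynamic range while FP16 does not, and what INT8 gives up in exchange for density.

```python
import torch

# Dynamic range and relative precision of common ML floating-point formats.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    fi = torch.finfo(dtype)
    print(f"{str(dtype):>15}: {fi.bits} bits, max ~ {fi.max:.3g}, eps ~ {fi.eps:.3g}")

# INT8 trades range and precision for density: only 256 representable values.
ii = torch.iinfo(torch.int8)
print(f"     torch.int8: {ii.bits} bits, range [{ii.min}, {ii.max}]")
```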

Computational performance for FP32 and FP16

Historically, the computational performance trend for FP32 precision has been regular over nearly two decades, with a doubling time of 2.3 years, closely in line with the rate associated with Moore's Law. Over the last few years, especially since 2016, we have seen the emergence of hardware with dedicated support for FP16 precision, which increases absolute computational performance thanks to the reduced bit length.

Figure 2: Peak performance of general and ML GPUs for FP32 and FP16 precision over the last two decades. The top figure shows that ML GPUs have a higher median performance than general GPUs, but a similar rate of growth. The bottom figure shows that, starting in 2014, some hardware accelerators began to report FP16 performance.

The computational performance of general and ML hardware for FP32 over the last decade shows nearly identical growth rates, but the two differ in their levels. The accelerators in our ML hardware dataset are consistently among the best available hardware. We think this is partly because ML practitioners select the most powerful hardware available, and partly the result of the recent introduction of high-end data center GPUs specifically targeting the ML market, e.g., NVIDIA's V100, A100, and H100, or Google's TPUs.
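
For readers who want to reproduce this kind of estimate, a doubling time can be obtained from a log-linear fit of peak FLOP/s against release year. The sketch below uses four approximate, publicly listed FP32 spec values purely for illustration, not our full dataset, so the resulting figure only roughly matches the ~2.3-year doubling time reported above.

```python
import numpy as np

# Illustrative points: release year and approximate peak FP32 FLOP/s
# (NVIDIA K20X, P100, A100, L40, from public spec sheets).
years = np.array([2012, 2016, 2020, 2022])
flops = np.array([3.9e12, 10.6e12, 19.5e12, 90e12])

# Fit log10(FLOP/s) = slope * year + intercept; the slope is growth in OOMs/year.
slope, _ = np.polyfit(years, np.log10(flops), 1)
doubling_time = np.log10(2) / slope
print(f"{slope:.2f} OOMs/year, doubling every {doubling_time:.1f} years")
```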

Computational performance gains through hardware support for less precise number formats

The performance gains from reduced numeric precision are enabled by multiple architectural improvements in modern ML chips, not just the reduced bit-width alone. The smaller data types allow more FLOPs per chip area and reduce the memory footprint. But other advancements also contribute significantly: a) the introduction of new instructions specialized for matrix multiplication,8 b) hardware data compression, and c) the elimination of excess data buffers for matrix multiplication hardware, as in the NVIDIA A100 (Choquette et al., 2021),9 all of which lower data and instruction memory requirements, leading to more operations per chip area. These advancements are further complemented by faster memory access features in the H100 (Choquette, 2023).

Figure 3: Box plot showing performance of ML accelerators across different precision number formats as a ratio of their own FP32 performance, illustrating the improvement relative to FP32. We find that the newer number representations tensor-FP32 (TF32), tensor-FP16, and tensor-INT8 increase average computational performance by ~5x, ~8x, and ~13x, respectively, relative to each chip's own FP32 performance. Not all GPUs have dedicated support for lower-precision formats; to screen these out, we removed from this figure all GPU models whose computational performance on lower-precision formats was not higher than on higher-precision formats.

GPU performance for ML workloads has substantially improved in recent years thanks to lower numeric precision. On average, using lower-precision number formats like tensor-FP32 (TF32), tensor-FP16, tensor-INT8, and tensor-INT4 provides around ~5x, ~8x, ~13x, and ~18x higher computational performance, respectively, compared to using FP32 on the same GPU. To put this in context, given that peak FP32 performance has historically doubled every ~2.3 years, these lower-precision speedups are equivalent to 3 to 9 years of performance improvements. However, maximum speedups can exceed these averages: NVIDIA's H100 achieves ~7x, ~15x, and ~30x speedups with TF32, FP16, and INT8 compared to FP32, so for the H100, lower precisions provide even larger performance gains over FP32 than for typical GPUs. Although lower precision substantially improves computational performance, training often uses higher precision due to tradeoffs with model accuracy.10 Note also that the H100's speedups from TF32, FP16, and INT8 are not solely due to the smaller number formats being more efficient; the H100 is likely more heavily optimized for operations on these formats, which contributes to the speedups.
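
In practice, opting into these formats is largely a software choice. The following PyTorch sketch shows one common way to route FP32 matrix multiplications through tensor cores as TF32 and to run them in tensor-FP16 via mixed precision. It is illustrative only, assumes a CUDA build of PyTorch and an NVIDIA GPU with tensor cores (Ampere or newer), and is not the methodology behind our figures.

```python
import torch

# Requires an NVIDIA GPU with tensor cores and a CUDA-enabled PyTorch build.
assert torch.cuda.is_available()

# Allow FP32 matmuls to execute as tensor-FP32 (TF32) on tensor cores.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

# TF32 path: FP32 inputs, tensor-core matmul with a reduced-precision mantissa.
c_tf32 = a @ b

# Tensor-FP16 path via mixed precision: inputs are cast to FP16 inside the region.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    c_fp16 = a @ b
```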

Memory capacity and bandwidth

Typical processor cores carry out their calculations by reading data, processing it, and writing the result back to memory. Thus, memory acts as storage for data between processing cycles. Hardware tends to use a memory hierarchy: from a register file storing hundreds of kB of fast-access data near the computation units, to random-access memory (RAM) capable of housing dozens of GB of slower-access data.11 Data is regularly pulled from the larger slow-access RAM through intermediate cache memories into the register file, and written back if needed. Accelerator datasheets mostly provide the size of the largest RAM available on the accelerator card.12 We refer to the number of these RAM bits as memory capacity. Data is transferred to the largest RAM in chunks, and each transfer takes some processing cycles, depending on the memory technology used. We refer to the peak number of bits that can be transferred to the largest RAM per second (i.e., the peak bit rate) as memory bandwidth.13

The system that contains the hardware accelerator typically has a main memory that stores the application and data, which are then transferred to the accelerator for processing. To ensure that model weights and training data are available on the hardware accelerator at any given time during training or inference, larger memory capacities are required. If the data does not fit into the accelerator's memory, the system needs to fall back on CPU memory or, even worse, an even higher level of the memory hierarchy (e.g., hard disk), which significantly impacts latency and bandwidth. In practice, the model data is distributed across the memory of multiple hardware accelerators to avoid this performance penalty.
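
A rough, illustrative calculation shows why such sharding is often unavoidable. The 70-billion-parameter model and the 80 GB capacity below are hypothetical round numbers, and the estimate covers weights only, ignoring activations, gradients, and optimizer state, which add substantially more.

```python
def model_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed just for the weights (excludes activations, optimizer state)."""
    return n_params * bytes_per_param / 1e9

# Hypothetical 70B-parameter model vs. a single accelerator with 80 GB of memory.
for bytes_per_param, label in [(4, "FP32"), (2, "FP16/BF16"), (1, "INT8")]:
    gb = model_memory_gb(70e9, bytes_per_param)
    fits = "fits" if gb <= 80 else "must be sharded across accelerators"
    print(f"{label:>9}: {gb:6.0f} GB -> {fits}")
```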

Progress in the processing capability of hardware requires a correspondingly larger memory bandwidth. Without enough data input, the peak computational performance cannot be reached and memory bandwidth becomes a bottleneck,14 which is referred to as the bandwidth wall (Rogers et al., 2009) or, more generally, the memory wall.
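
A common way to reason about this is a roofline-style balance point: dividing peak compute by memory bandwidth gives the arithmetic intensity (FLOPs per byte moved) below which a workload is bandwidth-bound rather than compute-bound. A minimal sketch using the H100 SXM spec-sheet numbers from Table 1:

```python
# Roofline-style balance point for an H100 SXM (Table 1 spec-sheet values):
# ~9.9e14 FLOP/s at tensor-FP16 and ~3.3e12 B/s memory bandwidth.
peak_flops = 9.9e14      # FLOP/s
mem_bandwidth = 3.3e12   # B/s

balance = peak_flops / mem_bandwidth
print(f"Workloads below ~{balance:.0f} FLOP per byte moved are memory-bandwidth-bound")
```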

As shown in Figure 4 below, the growth rate of memory capacity and bandwidth trails the pace of computational performance improvements. Specifically, memory capacity doubles every 3.04 years for general GPUs and 4 years for ML accelerators, while memory bandwidth doubles every 3.64 and 4 years respectively. In contrast, computational performance doubles every 2.3 years based on our earlier analysis.