Producing datasets and analysing trends in Machine Learning
Despite surging interest in Machine Learning, little work has systematically curated and studied data on what these systems are like: how much compute they used, what datasets they were trained on, and what their architectures are.
This work will help us build a big-picture understanding of what has happened in the field in recent decades, and ultimately help us understand where it might go next.
This line of research involves:
- Developing standards for collecting and representing data on Machine Learning systems
- Building datasets and making these publicly available for other researchers to use
- Creating measurement tools to estimate or extract features of ML systems, such as the compute used during training
- Analysing and explaining trends in the data, including discontinuities and plausible contributing factors
- Analysing the implications of existing trends continuing, for example by producing extrapolations and projections
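As a rough illustration of the kind of measurement tool mentioned above, the sketch below estimates training compute in two ways: by counting operations and by looking at GPU time. The function names, the backward-to-forward ratio of 2, and the utilization parameter are assumptions made for this example, not values taken from this page, and all inputs in the usage lines are hypothetical.

```python
def compute_from_operations(forward_flop_per_example, num_examples, num_epochs,
                            backward_to_forward_ratio=2.0):
    """Estimate training FLOP by counting operations.

    Assumption: a backward pass costs about `backward_to_forward_ratio`
    times the forward pass, so one training step on one example costs
    (1 + ratio) forward passes.
    """
    flop_per_example = forward_flop_per_example * (1.0 + backward_to_forward_ratio)
    return flop_per_example * num_examples * num_epochs


def compute_from_gpu_time(training_seconds, num_gpus, peak_flop_per_second,
                          utilization):
    """Estimate training FLOP from hardware time.

    Assumption: `utilization` is the fraction of peak throughput actually
    achieved (between 0 and 1); it usually has to be guessed.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return training_seconds * num_gpus * peak_flop_per_second * utilization


# Hypothetical inputs, chosen only to show the call shape.
ops_estimate = compute_from_operations(
    forward_flop_per_example=2e9, num_examples=1_000_000, num_epochs=10)
time_estimate = compute_from_gpu_time(
    training_seconds=86_400, num_gpus=8, peak_flop_per_second=1e14,
    utilization=0.3)
print(f"operation counting: {ops_estimate:.3e} FLOP")
print(f"GPU time:           {time_estimate:.3e} FLOP")
```

When both kinds of input are available for the same system, comparing the two estimates is a simple sanity check: a large gap suggests that one of the assumed quantities (often utilization) is off.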
Prior work
Algorithmic progress in computer vision
Paper
Dec. 12, 2022
Ege Erdil, Tamay Besiroglu
We use a dataset of over a hundred computer vision models from the last decade to investigate how better algorithms and architectures have enabled researchers to use compute and data...
Compute Trends Across Three Eras of Machine Learning
Paper
Feb. 11, 2022
Jaime Sevilla, Tamay Besiroglu, Anson Ho, Lennart Heim, Marius Hobbhahn, and Pablo Villalobos
Compute, data, and algorithmic advances are the three fundamental factors that guide the progress of modern Machine Learning (ML). In this paper we study trends in the most readily quantified...
Estimating Training Compute of Deep Learning Models
Report
Jan. 20, 2022
Jaime Sevilla, Lennart Heim, Marius Hobbhahn, Tamay Besiroglu, Anson Ho and Pablo Villalobos
We describe two approaches for estimating the training compute of Deep Learning systems, by counting operations and looking at GPU time.
Parameters, Compute and Data Trends in Machine Learning
Database
Jaime Sevilla et al.
Public dataset