Epoch Database Guide

Guide for contribution, review and usage

Introduction

This document is a user guide for the Epoch Database, a collection of historically significant or cutting-edge machine learning systems, used for research about trends in the history and future of artificial intelligence. This document covers the high-level database structure, its data fields and definitions, processes for submission of new entries, processes for review of existing entries, and processes for feedback and feature requests.

The data in this database is used in many of Epoch’s research projects, such as the Machine Learning Trends Dashboard and research on algorithmic progress. It’s also featured in many external publications, such as The brief history of artificial intelligence by Our World in Data and The race of the AI labs heats up from The Economist.

Database structure and format

The database is stored in Airtable as a collection of linked tables. The main table is publicly accessible here, and is also featured on our website as an interactive visualization or table. You can download it as a csv containing the subset of notable systems (recommended) or all systems including those which do not meet the notability criteria.

For a quick example of how to load the data and work with it in your scripts, see this Google Colab demo notebook.
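If you prefer to work locally rather than in Colab, the following is a minimal sketch of loading the downloaded CSV with pandas. The filename is a placeholder for whichever CSV you downloaded from the website, and the column names follow the field definitions listed below; check them against your file’s actual headers.

    import pandas as pd

    # Placeholder filename: use the CSV of notable systems downloaded from the website.
    df = pd.read_csv("notable_ml_systems.csv")

    # Parse publication dates and peek at a few fields (names follow the column table below).
    df["Publication date"] = pd.to_datetime(df["Publication date"])
    print(df[["System", "Organization", "Publication date", "Training compute"]].head())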

For other research needs, you can also use Airtable views, which are like SQL queries and allow you to join and subset data across multiple tables. To request that a view be set up, email data@epochai.org.

Columns and values (main table)

Column Mandatory Type Definition
System Y String A unique string naming the system, e.g. “ResNet-101”. This should be the best-known name used for a given model.
Domain N Multiple Choice A foreign key field categorizing the system’s domain of machine learning.
This field links to the ML Domains table, and domains are selected from the options in that table.
Task N Multiple Choice A foreign key field categorizing the system’s task.
The field links to the ML Tasks table, and tasks are selected from the options in that table.
Organization N Multiple Choice Multiple-select field listing the organization(s) that created the system. In the spreadsheet, this field is formatted as a string, with multiple organizations separated by commas.
Organization categorization N Multiple Choice Categorization of the organization(s). The distinction is documented in Academia and Industry. Systems are categorized as “Industry” if their authors are affiliated with private sector organizations, “Academia” if the authors are affiliated with universities or academic institutions, or “Industry - Academia Collaboration” when at least 30% of the authors are from each.
Possible values: Industry, Research Collective, Academia, Industry - Academia Collaboration (Industry leaning), Industry - Academia Collaboration (Academia leaning), Non-profit
Authors N String Comma-separated list of authors
Publication date Y Date The publication, announcement, or release date of the model, in YYYY-MM-DD format.
If the year and month are known but the day is unknown, the day is filled in as YYYY-MM-15.
If the year is known but the month and day are unknown, the month and day are filled in as YYYY-07-01.
Reference N String The literature reference for the system, such as a published journal or conference paper.
Link N String Link to primary source (preferably) or other best-choice documentation of a system.
Citations N Numeric Number of citations as of last update. Values collected from Semantic Scholar where available, otherwise Google Scholar.
Notability criteria N Multiple Choice The criterion or criteria met by the system which qualify it for notability and therefore inclusion in the dataset. A system may meet zero, one, or multiple criteria.
Parameters N Numeric Number of free parameters in the ML system.
Training compute N Numeric Quantity of training compute, in FLOP. This is the total training compute for a given system, i.e. pretrain + finetune. It should be filled in here when directly reported, or calculated via GPU-hours or backpropagation gradient updates.
Training dataset N String If a standard dataset (even a subset), it should be named here with comma-separated values. Use “other” to capture non-standard datasets.
Example: “ImageNet, The Pile, other”.
Training dataset size N Numeric Number of datapoints in the training dataset.
See Measuring model size, compute, and training data below.
Training time (hours) N Numeric Training time of the model, if reported. This refers to wall-clock time, not number of GPU-hours.
Inference compute N Numeric Compute required for inference on a datapoint. This is measured using the same settings that were used during model training, e.g. if there was a certain image resolution or context length used in training, record the amount of compute used for an equivalent inference.
Training compute cost N Numeric The training compute cost, using the value reported by the authors if any, or otherwise an estimate. Values are converted to 2020 US dollars.
Finetune compute N Numeric Compute used to fine-tune the model, if applicable.
Base model N Multiple Choice Which base model the model was fine-tuned from, if applicable.
Hardware N Multiple Choice Multiple select of the hardware (processors) used in training.
Hardware quantity N Numeric Indicates the quantity of the hardware used in training.
Utilization N Numeric Number between 0.00 and 1.00 indicating the hardware utilization ratio, i.e. achieved FLOP/s divided by theoretical peak FLOP/s.
Confidence N Multiple Choice Describes our epistemic status or confidence in the recorded values. See Meta information.
Foundation model N Numeric Checkbox (values 0 for false and 1 for true) indicating whether a model is a foundation model, based on the Stanford Center for Research on Foundation Models ecosystem data.
Last modified Y Datetime Shows the last time the data in a row was modified. This is automatically updated by Airtable.

Additional tables

The following auxiliary tables are also included within the database, and their data can be used along with the data in the main table.

  • ML domains
  • ML tasks
  • Hardware prices
  • Chip dataset
  • Benchmark data
  • Organizations

Nomenclature

Authors are named in the way that they report their names in their publications, if applicable. Examples:

Organizations may have multiple different names, but ideally it should be possible to find every system created by people working at an organization by selecting one option in this field. Therefore, organizations are periodically reviewed in Airtable and standardized to the most common name for them. Examples:

  • “Google Inc” and “Google Research” have been changed to “Google”
  • “University of California, Berkeley” and “Berkeley” have been changed to “UC Berkeley”

Datasets should also be standardized to their most common name. Examples:

  • “MS COCO” and “Microsoft COCO” to “COCO”
  • “CC” and “C4” to “Common Crawl”

Data methodology

Selection, inclusion and notability

To be included in the model database, an ML system must satisfy all inclusion criteria:

  • reliable documentation of its existence and relevance to machine learning;
  • entries for all mandatory fields;
  • the system must include a learning component; it cannot be a non-learned algorithm.

The model database is particularly focused on notable machine learning systems. Notability is satisfied by any of the following: (i) SOTA improvement on a recognized benchmark; (ii) highly cited (over 1000 citations); (iii) historical relevance; (iv) significant use. More detailed guidelines are provided here. Notability is indicated in a dedicated field, allowing users to limit their analysis to notable systems.

Systems were initially selected from various sources including “literature reviews, Papers With Code, historical accounts, previous datasets, most cited publications of top conferences, and suggestions from individuals”1, and the database remains a non-exhaustive list of notable systems. New systems are often added when they appear in news coverage or in the bibliographies of other machine learning work. Ideally, we would like to include every system that has ever met the notability criteria. As of fall 2023, an upcoming data collection initiative aims to add 1000 more systems to the dataset.

Measuring model size, compute, and training data

We have additional documents about the approaches to measuring parameter counts, dataset size, and training compute. Here are the main guidelines for adding a system’s details:

  • Parameter counts might be reported by the authors, or can be calculated if the architecture is defined in the paper (see the code sketch after the table below).
    • Look for a description of the architecture, i.e. type, number, and configuration of the layers, then calculate the parameters in each layer and sum them.
    • For a fully-connected layer, the parameter count is the number of inputs to the layer times the number of neurons in the layer, plus one bias term per neuron.
    • For a convolutional layer, the parameter count can be determined from the input size, output size, kernel size, stride, and padding.2

    • For a transformer, the parameter count can be determined from the number of blocks, the model dimensionality, and the internal dimensionality of the feed-forward layers.3
  • Compute, measured in FLOP, may be directly reported, but otherwise we calculate or estimate it from reported quantities.4 Both estimation methods are illustrated in the code sketch after the table below.

    • Method 1: counting the number of arithmetic operations.
      • For dense models: compute = 2 × # of connections × 3 × # of training examples × # of epochs
    • Method 2: hardware details and usage.
      • compute = training time × # of GPUs/TPUs × peak FLOP/s × utilization rate
  • Dataset size is measured in number of datapoints used as training examples, as outlined in the table below. (We record the number of distinct datapoints, not multiplied by the number of epochs.) Our dataset document explains a heuristic for converting a number of tokens to the corresponding number of words.
Task Way of measuring dataset size
Classification problem # training examples
Image classification # images
Image captioning # captions
Predictive language model # words
Translation # words in input language
Text classification # training examples
Speech recognition # words
Reinforcement learning # timesteps
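
To make the parameter-count rules above concrete, here is a minimal Python sketch of the formulas for fully-connected, convolutional, and transformer layers. The example configuration is illustrative and is not taken from any particular database entry, and the transformer formula ignores embeddings, layer norms, and biases.

    # A minimal sketch of the parameter-count rules above (biases included for the
    # dense and convolutional layers; embeddings, layer norms, and biases ignored
    # for the transformer).

    def dense_params(n_inputs, n_neurons):
        # Fully-connected layer: inputs x neurons weights, plus one bias per neuron.
        return n_inputs * n_neurons + n_neurons

    def conv2d_params(in_channels, out_channels, kernel_size):
        # Convolutional layer: one kernel_size x kernel_size filter per
        # (input channel, output channel) pair, plus one bias per output channel.
        return in_channels * out_channels * kernel_size ** 2 + out_channels

    def transformer_params(num_blocks, d_model, d_ff):
        # Per block: roughly 4 * d_model^2 for the attention projections and
        # 2 * d_model * d_ff for the feed-forward sublayer.
        return num_blocks * (4 * d_model ** 2 + 2 * d_model * d_ff)

    # Illustrative configuration (GPT-2 XL-like): roughly 1.5e9 parameters
    print(transformer_params(num_blocks=48, d_model=1600, d_ff=6400))

Similarly, the two training-compute methods can be written as short functions. The peak FLOP/s and utilization figures in the usage example are illustrative assumptions, not values from the database.

    # A minimal sketch of the two training-compute estimation methods above.

    def compute_from_ops(n_connections, n_examples, n_epochs):
        # Method 1 (dense models): 2 FLOP per connection per forward pass,
        # times 3 to account for the backward pass.
        return 2 * n_connections * 3 * n_examples * n_epochs

    def compute_from_hardware(training_hours, n_chips, peak_flop_per_s, utilization):
        # Method 2: wall-clock training time x number of chips x peak FLOP/s x utilization.
        return training_hours * 3600 * n_chips * peak_flop_per_s * utilization

    # Example: 100 hours on 64 chips at 3.12e14 peak FLOP/s with 35% utilization -> ~2.5e21 FLOP
    print(f"{compute_from_hardware(100, 64, 3.12e14, 0.35):.2e}")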

Submission of new entries

How to add new entries

New systems can be added by Epoch staff, contractors, or outside volunteers. Epoch staff and contractors can add submissions directly to the database using the internal form. Systems added by people outside Epoch go through the public Airtable form and are reviewed by the database manager before being included in the database. For a new entry to pass review and be added to the database, all mandatory fields must be completed. These fields can be completed by the reviewer at the time of review - for example, a new submission can be partially completed by one person and then finished by the reviewer. Optional fields can be intentionally left blank.

Recording calculations or reasoning

  • Document calculations and reasoning in the corresponding notes for a given value.
  • Record where you find the values used in calculations.
  • For example, if calculating training compute using Method 1 above, your note might look like the following:
    • Calculate based on arithmetic ops: 3 * 2 * N * D. N=50010003 is taken from Fig 3 (page 9). D=1E6 is taken from Table 2 (page 10).
  • If a calculation goes beyond simple arithmetic and/or has more than ~5 terms, link to an explanatory Colab notebook in the note column.

Meta information

Meta information allows us to better handle data entry and management, as well as providing useful information about uncertainty.

The Confidence column is a single-select field taking any of the following values:

  • “Confident” for when all the inputs used in an estimate were reported by the authors or other relevant sources, and we did not have to make significant assumptions to estimate system details.
  • “Likely” for when we have had to make assumptions about input(s) such as dataset size, model size, training duration, etc, but most key information was available and estimates are easily compared to similar systems.
  • “Speculative” for when we made an estimate using multiple assumptions which could be wrong.
  • “Unverified” for new entries submitted but not yet reviewed by the database manager.
  • “Wrong” for entries containing mistakes that have been noticed.

The Exclude column is a single-select field taking any of the following values:

  • 1, meaning the system should be excluded from the dataset when performing statistical analysis. This is typically used when a family of models is introduced together, excluding the smaller models so that the architecture is not over-represented in statistical results.
  • 0, meaning the system should not be excluded.
  • null, meaning the system should not be excluded.

Entries with status Unverified or Wrong, or with an exclusion value of 1, are not shown on the “Notable ML Systems” view.
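
For reference, a minimal pandas sketch of applying the same filter to a downloaded CSV (loaded as df, as in the example near the top of this guide) might look like the following. The Confidence and Exclude column names follow this guide; verify them against your file’s headers.

    # Keep only entries that would appear in the "Notable ML Systems" view:
    # drop Unverified/Wrong confidence and anything flagged for exclusion.
    notable = df[
        ~df["Confidence"].isin(["Unverified", "Wrong"])
        & (df["Exclude"].fillna(0) != 1)
    ]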

The Last Modified column is a datetime field automatically updated by Airtable whenever an entry is created or modified.

Suggestions for new entries to add

If you have information about an ML system that is not in the database, you can suggest an entry via the Airtable submission form. If you know of an ML system but you don’t have any details about it, you can add it to the list of suggestions for new PCD entries and the database manager will research it. Epoch has a bounty program and offers payment for identifying systems that are missing from our database.

Data review

Review process

The purpose of review is to check that a proposed entry is sufficiently accurate and complete to add to the database, and that it respects the database schema. Review allows us to take suggestions from outside sources, such as contractors or automated scraping.

Reviewers are not expected to fully replicate all of the work they review, but they should check that values don’t seem blatantly incorrect, and it’s advisable to thoroughly spot check entries from a new contributor.

Reviewer checklist

  1. Are all mandatory fields completed?
    • If not, complete them.
  2. Check that values seem reasonable.
    • Is a model much larger or smaller than peers?
    • Does it use much more or less compute than others in the same year?
  3. Check the confidence field.
    • For more confident epistemic statuses, you should be able to verify submitted values more precisely.
    • Check that calculations respect guidelines.

Raising and resolving issues

Issues can be raised as comments within Airtable. When the issue is resolved, the comment thread should be closed.

Incorrect data

Raising: Raise a comment in the corresponding cell. Set out your reasoning, and ideally provide an updated estimate.

Resolving: Check the existing value and the raised comment, determine which value is correct, and update the cell in Airtable. If a corrected entry is not provided, ideally provide one yourself. If you cannot provide one, update the epistemic status to Wrong and notify the data team5.

When / how to close the comment: If the entry has been corrected, post a reply and close the comment. If the entry cannot be corrected, ensure the epistemic status has been updated to Wrong. Post a reply explaining this, and close the comment.

Adding extra data to an existing entry

Raising: Raise a comment in the corresponding cell within the All ML Systems table. Ideally provide a summary of how you got the data (directly from the paper, an estimation method, and so on) and your confidence in it.

Resolving: Ideally, incorporate the new data, and update the confidence correspondingly. If you think the new data is incorrect, reply to the comment and explain your reasoning, but do not resolve the comment yet.

If the proposer does not respond within a week, or if the issue is discussed and the new data is found to be incorrect, the edit can be rejected. Once the proposed change is accepted or rejected, close the comment thread.

High-level feedback

Public feedback can be directed to the database manager, currently Robi Rahman, at robi@epochai.org, or to the data group at data@epochai.org, or to Epoch’s management and operations team at info@epochai.org.

Maintenance

Checking new suggestions

Suggested systems / performance information should be periodically checked by the database manager, on a cadence of days to weeks, and added or rejected according to the review process.

Checking issues

Issues, which are raised in comments, should be periodically checked by the database manager, triaged and resolved following the established process.

Spot checks

Spot checks reassure us of the database’s validity and accuracy, as well as helping us identify common problems. Periodically, the database manager should arrange for people to perform spot-checks, with resulting findings and updates recorded by the reviewers.

Recomputing citations

Citation counts were originally retrieved from Google Scholar, but it no longer permits programmatic queries. Citation counts are now retrieved from Semantic Scholar using a script in AWS EC2 scheduled to run on the first day of each month at 14:00 UTC.
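
As a rough illustration (not the production EC2 script), a single citation-count refresh from the Semantic Scholar Graph API could look like the sketch below; the arXiv identifier is just an example.

    import requests

    def get_citation_count(paper_id):
        # Fetch the current citation count from the Semantic Scholar Graph API,
        # e.g. paper_id = "ARXIV:2202.05924" (Compute Trends Across Three Eras of ML).
        url = f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}"
        response = requests.get(url, params={"fields": "citationCount"}, timeout=30)
        response.raise_for_status()
        return response.json()["citationCount"]

    print(get_citation_count("ARXIV:2202.05924"))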

Backups

The table All ML Systems is downloaded as a csv and saved to Epoch’s Google Drive by an automatic backup script which is scheduled to run on the first day of each month at 14:30 UTC.
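
The backup amounts to paging through the Airtable REST API and writing the records out as a CSV. Below is a minimal sketch of that step, not the production script: the base ID is a placeholder, the API key is read from an environment variable, and the upload to Google Drive is omitted.

    import csv
    import os
    import requests

    def fetch_all_records(base_id, table_name):
        # Page through the Airtable list-records endpoint and collect every record.
        url = f"https://api.airtable.com/v0/{base_id}/{table_name}"
        headers = {"Authorization": f"Bearer {os.environ['AIRTABLE_API_KEY']}"}
        records, offset = [], None
        while True:
            params = {"offset": offset} if offset else {}
            page = requests.get(url, headers=headers, params=params, timeout=30).json()
            records.extend(page["records"])
            offset = page.get("offset")
            if not offset:
                return records

    records = fetch_all_records("appXXXXXXXXXXXXXX", "All ML Systems")  # placeholder base ID
    fieldnames = sorted({key for r in records for key in r["fields"]})
    with open("all_ml_systems_backup.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for r in records:
            writer.writerow(r["fields"])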

Changelog

Documentation edit history

  • 2024-02-26 Added a column labeling foundation models, based on the Stanford CRFM Ecosystem Graph.
  • 2024-01-22 The Google Sheet has been deprecated and replaced with an Airtable view embedded on our website. The documentation was edited to direct users to the website instead of the spreadsheet.
  • 2024-01-22 Meta information section was updated to include the Exclude field.
  • 2023-12-01 A new column was added for fine-tuning compute, and records can now link to other models to indicate if they were fine-tuned from a base model (November 2023). Standardization of names of organizations, companies, and universities was completed (November 2023). The Notability criteria column was reviewed to ensure that we haven’t missed any systems that should be marked as notable (October 2023).
  • 2023-10-09 Added nomenclature description and guidelines for organizations and datasets.
  • 2023-10-04 Renamed “Epistemic status” to “Confidence”, and “Inclusion criteria” to “Notability criteria”. Updated the documentation to reflect the new names.
  • 2023-07-28 Updated to reflect addition of “Wrong” epistemic status
  • 2023-07-20 Added documentation on estimating model parameter counts

Changes upcoming and in progress

  • The “Epochs” column is being filled in with the number of training epochs, for systems where this was reported. (November 2023)

Appendix

Bibliography and Credits

The data have been collected by Epoch’s employees and collaborators, including Jaime Sevilla, Pablo Villalobos, Juan Felipe Cerón, Matthew Burtell, Lennart Heim, Amogh B. Nanjajjar, Tilman Rauker, Nuño Sempere, Max Rauker, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, Jean-Stanislas Denain, Owen Dudney, David Atkinson, Ben Cottier, David Owen, Robi Rahman, Carl Guo, Josh You, and Bartosz Podkanowicz.

This documentation was written by David Owen and Robi Rahman.

We would like to thank the authors of several sources where we have found one or more ML systems to include in the database: Stanford CRFM’s foundation model ecosystem graph, AI Tracker, Stella Biderman’s directory of LLMs, Terry Um’s repo of deep learning papers, Alan Thompson’s models table, the OpenCompass Chinese LM leaderboard, the Akronomikon by LightOn AI, Papers With Code, and Hugging Face. We would also like to thank the authors of AI and compute and Compute and Energy Consumption Trends in Deep Learning Inference.

Citing this work

Cite this work as:

Epoch AI, ‘Parameter, Compute and Data Trends in Machine Learning’. Published online at epochai.org. Retrieved from: ‘https://epochai.org/data/epochdb/visualization’ [online resource]

BibTeX citation:

@misc{epoch2023pcdtrends,
  title = "Parameter, Compute and Data Trends in Machine Learning",
  author = {Epoch AI},
  year = 2022,
  url = {https://epochai.org/data/epochdb/visualization},
  note = "Accessed: 2024-01-22"
}

Notes

  1. Compute Trends Across Three Eras of Machine Learning, Sevilla et al. 2022 

  2. Convolutional Neural Networks cheatsheet by Shervine Amidi, Stanford CS 230 

  3. Transformer parameter count script by Ben Cottier 

  4. Compute is reported as a number of floating-point operations. We currently do not track the arithmetic precision format of the operations. 

  5. Epoch employees should raise the issue in the #databases channel in Slack. External users can email us