Guidelines for question writing

General guidelines

Questions should have a clear answer that can be automatically verified. Ideally this answer is an integer or a rational number, but we can accommodate more general answer types so long as the model can write a Python script that computes the answer and the result can be checked for correctness using test cases. Consequently, questions of the type “Prove that…” should be avoided.

Example of a question fitting the guideline: Find a value of x such that 5^x = 101 modulo 10^9 + 7.
Example of a question not fitting the guideline: Prove that 2 is not a primitive root modulo 10^9 + 7.
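To illustrate what "automatically verified" can look like for the fitting example above, here is a minimal sketch of a checker. The function name and its default arguments are hypothetical and not part of any official grading pipeline; the sketch only assumes that any nonnegative integer x satisfying the congruence is accepted.

```python
MODULUS = 10**9 + 7


def is_valid_exponent(x, base=5, target=101, modulus=MODULUS):
    """Return True if base**x is congruent to target modulo modulus.

    Hypothetical checker: it accepts any nonnegative integer x that
    satisfies the congruence, and rejects everything else.
    """
    # Reject non-integers (and booleans, which are a subclass of int).
    if not isinstance(x, int) or isinstance(x, bool) or x < 0:
        return False
    # Three-argument pow performs fast modular exponentiation.
    return pow(base, x, modulus) == target % modulus


# Sanity checks on cases small enough to verify by hand.
assert is_valid_exponent(3, target=125)      # 5**3 = 125
assert not is_valid_exponent(2, target=125)  # 5**2 = 25
```

Checking a proposed x takes a single modular exponentiation, while finding one requires solving a discrete logarithm problem. A "Prove that…" question has no comparable mechanical check.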

Guessing the right answer to the question with a probability greater than 1% should be as difficult as solving the question. As a result, avoid true/false questions and questions where most of the difficulty lies in proving that a conjectured answer is correct.

Example of a question fitting the guideline: Find a value of x such that 5^x = 101 modulo 10^9 + 7.
Example of a question not fitting the guideline: Prove that 2 is not a primitive root modulo 10^9 + 7.

The questions should span a wide range of difficulty. The easy questions should be accessible to smart high school students, while the hard questions should be solvable only by mathematicians specializing in the relevant domain.

We encourage question authors to experiment with existing large language models to get a sense for their capabilities when it comes to solving math problems. However, problem-writers should avoid designing problems that adversarially exploit weaknesses of current LLMs.

Operations

We will ask you to sign a light NDA before we purchase your question, to ensure the questions remain confidential.

For information security reasons, much of the communication related to problems will be through the platform Signal.

If written questions are stored on online data storage services (such as Google Drive), they must be encrypted. One convenient way to do this is to place the document containing the questions inside a password-protected zip file or tarball. The passwords should only be shared over secure channels such as Signal, mentioned above.

Note that we reserve the right to decline purchasing a question even if it fits the criteria, due to high submission volume or other processing constraints such as limitations in contracting in certain countries.

Format

A minimum viable submission is a clear question statement along with a detailed solution formatted as a .tex file and copy-pasted into our submission form. For information security reasons, please avoid using Overleaf for .tex document editing.

If solving the question requires writing a program, the programming language used should be Python, and a working version of this program should also be copy-pasted into the submission form. Likewise, avoid using Google Colab and similar cloud-based services for editing, storing or sharing your solution code with us.

Metadata

Each question should come with some metadata provided by the question author. The difficulty ratings we ask for can be rough estimates or best guesses based on the question author’s judgment, but they should assume that the human expert has access to a Python interpreter or a similar programming environment.

Background knowledge rating

A difficulty rating ranging from 1 to 5 quantifying how much background knowledge is required to solve the problem.

1: High school level
2: Early undergraduate level
3: Late undergraduate level
4: Graduate level
5: Research level

Creativity rating

This rating estimates the time an average human expert with the requisite background knowledge would take to find the key ideas for solving the problem.

Estimate the number of hours (T) required to find these key ideas.
Provide the estimate as a decimal number (e.g., 0.5 for 30 minutes, 10.5 for 10 hours and 30 minutes).
There is no upper limit; use your best judgment for very challenging problems.

Precision rating

This rating measures the amount of attention to detail and precise reasoning required to solve the problem after the key ideas have been identified.

Estimate the number of hours (T) a human expert needs to compute the correct answer after finding the key ideas.
Provide the estimate as a decimal number (e.g., 0.25 for 15 minutes, 10.5 for 10 hours and 30 minutes).
Consider factors such as detailed calculations, precise programming, etc.
There is no upper limit; use your best judgment for problems requiring extensive computation or implementation.
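For both time-based ratings, the decimal-hour convention can be sketched with a small helper. The function name is hypothetical and only illustrates the arithmetic behind the examples given above.

```python
def to_decimal_hours(hours=0, minutes=0):
    """Convert a duration in hours and minutes to decimal hours."""
    if hours < 0 or minutes < 0:
        raise ValueError("durations must be nonnegative")
    return hours + minutes / 60


print(to_decimal_hours(minutes=30))   # 0.5
print(to_decimal_hours(10, 30))       # 10.5
print(to_decimal_hours(minutes=15))   # 0.25
```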

Subjects

A list of broad subjects that the question fits into: “analytic number theory”, “representation theory”, “differential geometry”, etc.  Note that a question can fit into more than one subject. It’s not necessary to exhaustively enumerate all subjects that could be relevant; just one or a few subjects that the author thinks are most relevant is sufficient.

Techniques

A list of techniques, theorems, results, etc. that can be used to solve the problem. The list doesn’t need to be exhaustive and only needs to contain the “most prominent” items that occur to the question author. Example: “generating functions”, “double counting”, “Vieta’s theorem”, etc.

Is programming required?

This is either “yes” or “no”, depending on whether a human expert would require access to a programming environment in order to find the answer to the question.

Note that even if the answer to this is “no”, the difficulty ratings above should be based on what a human with access to such a programming environment would be able to do. This is because we can’t guarantee that closed-source models don’t perform programming that’s hidden from the user before answering a question.
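As a rough illustration, the metadata described in this section could be collected in a record like the one below. This is only a sketch: the class name, field names, and validation rules are hypothetical and do not reflect the actual layout of the submission form.

```python
from dataclasses import dataclass


@dataclass
class QuestionMetadata:
    """Hypothetical container for the metadata fields described above."""

    background_knowledge: int   # 1 (high school) to 5 (research level)
    creativity_hours: float     # hours to find the key ideas
    precision_hours: float      # hours to compute the answer afterwards
    subjects: list[str]         # e.g. ["analytic number theory"]
    techniques: list[str]       # e.g. ["generating functions"]
    programming_required: bool  # "yes" -> True, "no" -> False

    def __post_init__(self):
        # The background knowledge rating is an integer from 1 to 5.
        if self.background_knowledge not in range(1, 6):
            raise ValueError("background_knowledge must be between 1 and 5")
        # Time estimates have no upper limit but cannot be negative.
        if self.creativity_hours < 0 or self.precision_hours < 0:
            raise ValueError("hour estimates must be nonnegative")


example = QuestionMetadata(
    background_knowledge=3,
    creativity_hours=0.5,
    precision_hours=10.5,
    subjects=["analytic number theory"],
    techniques=["generating functions", "double counting"],
    programming_required=True,
)
```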

Submit a problem

Submit your problems through our designated submission form.
Include a clear problem statement and a detailed solution.
Format your problem statement and solution as a .tex file and add it to the submission form.
If the problem requires programming, add a working Python solution to the submission form.
Avoid using cloud-based services like Overleaf or Google Colab for editing or storing your submissions.

For problem-related questions or comments, join our discussion channel here.

We look forward to receiving your challenging and original mathematics problems to help advance the rigorous assessment of AI capabilities in mathematical reasoning!
