
Benchmark

A standardised test suite used to measure and compare AI model capabilities across tasks.

Full Definition

A benchmark is a curated set of tasks with ground-truth answers, used to measure model performance in a reproducible, comparable way. LLM benchmarks span reasoning (MMLU, BBH), maths (MATH, AIME), coding (HumanEval, SWE-bench), instruction following (IFEval), long context (SCROLLS), and safety (ToxiGen). Benchmark scores drive research direction and marketing claims, but they have known pathologies: benchmark contamination (test data leaking into training data), overfitting to popular benchmarks, and the gap between benchmark performance and real-world utility. Responsible model evaluation uses diverse, contamination-free benchmarks and human preference studies alongside automated scores.
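
As a minimal sketch of the mechanics, the loop below scores a model against a fixed item set using exact-match accuracy. The `BenchmarkItem` structure and the `model_fn` callable are illustrative assumptions, not the API of any particular evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BenchmarkItem:
    prompt: str  # the task shown to the model
    answer: str  # the ground-truth reference answer

def run_benchmark(model_fn: Callable[[str], str], items: List[BenchmarkItem]) -> float:
    """Return exact-match accuracy of model_fn over a fixed item set.

    Because every model sees the same items and the same scoring rule,
    the resulting score is reproducible and comparable across models.
    """
    correct = sum(
        1 for item in items
        if model_fn(item.prompt).strip().lower() == item.answer.strip().lower()
    )
    return correct / len(items)

# Toy usage with a stub "model"; a real run would call an LLM API here.
items = [
    BenchmarkItem("What is 7 * 8?", "56"),
    BenchmarkItem("What is the capital of France?", "Paris"),
]
stub_model = lambda prompt: "56" if "7 * 8" in prompt else "paris"
print(run_benchmark(stub_model, items))  # 1.0
```

Exact match is the simplest scoring rule; real benchmarks often need more tolerant graders (unit tests for code, equivalence checks for maths), which is one reason scores are only comparable under an identical evaluation setup.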

Examples

1. MMLU (Massive Multitask Language Understanding): 14,000+ multiple-choice questions across 57 subjects, ranging from elementary topics to professional-level exams, used to estimate 'world knowledge' (a scoring sketch follows these examples).

2. SWE-bench: real GitHub issues from open-source Python repositories, measuring a model's ability to write code patches that pass the associated test suite.
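
To make the MMLU example concrete, here is a minimal sketch of multiple-choice scoring. The prompt template and the `ask_model` callable are hypothetical stand-ins, not the official MMLU evaluation harness.

```python
def format_mcq(question: str, choices: list[str]) -> str:
    """Render a question in A/B/C/D multiple-choice form."""
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    return f"{question}\n{options}\nAnswer with a single letter."

def mcq_accuracy(ask_model, dataset) -> float:
    """Score multiple-choice accuracy.

    dataset: iterable of (question, choices, gold_letter) triples.
    The reply is reduced to its first letter, so a verbose answer
    like "B. Paris" still scores correctly.
    """
    items = list(dataset)
    correct = sum(
        1 for question, choices, gold in items
        if ask_model(format_mcq(question, choices)).strip().upper()[:1] == gold
    )
    return correct / len(items)

# Toy usage with a stub model that always answers "B".
data = [("Capital of France?", ["Berlin", "Paris", "Rome", "Madrid"], "B")]
print(mcq_accuracy(lambda prompt: "B", data))  # 1.0
```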


Related guides

What is Prompt Engineering?
How to Use Role in AI Prompts
How to Add Context to AI Prompts
Defining the Task in Your AI Prompt

Related Terms

Evaluation

The systematic process of measuring AI model quality, safety, and alignment against defined criteria.

Scaling Laws

Empirical relationships describing how model performance improves predictably with model size, dataset size, and training compute.

Reasoning Model

A model trained to perform extended internal reasoning before producing a response.