
Benchmark

A standardised test suite used to measure and compare AI model capabilities across tasks.

Full Definition

A benchmark is a curated set of tasks with ground-truth answers, used to measure model performance in a reproducible, comparable way. LLM benchmarks span reasoning (MMLU, BBH), maths (MATH, AIME), coding (HumanEval, SWE-bench), instruction following (IFEval), long context (SCROLLS), and safety (ToxiGen). Benchmark scores drive research direction and marketing claims, but they have known pathologies: benchmark contamination (test data leaking into training data), overfitting to popular benchmarks, and the gap between benchmark performance and real-world utility. Responsible model evaluation uses diverse, contamination-free benchmarks and human preference studies alongside automated scores.
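The core mechanic described above, comparing model outputs against ground-truth answers, can be sketched in a few lines. This is a minimal illustration, not a real harness: `model_answer` is a hypothetical stand-in for an actual model call, and the normalisation (strip and lowercase) is a simplifying assumption.

```python
# Minimal sketch of benchmark scoring: exact-match accuracy against
# ground-truth answers. `model_answer` is a hypothetical model callable.
def exact_match_accuracy(items, model_answer):
    """items: list of (question, gold_answer) pairs; returns fraction correct."""
    correct = sum(
        1 for question, gold in items
        if model_answer(question).strip().lower() == gold.strip().lower()
    )
    return correct / len(items)

# Toy usage with a fake "model" that simply looks answers up.
answer_key = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
items = [("2 + 2 = ?", "4"), ("Capital of France?", "paris")]
print(exact_match_accuracy(items, lambda q: answer_key[q]))  # → 1.0
```

Real evaluation harnesses add prompt templating, answer extraction, and per-subject breakdowns, but the scoring loop is essentially this comparison repeated over the test set.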

Examples

1. MMLU (Massive Multitask Language Understanding): 14,000+ multiple-choice questions across 57 subjects from professional exams, used to estimate 'world knowledge'.

2. SWE-bench: real GitHub issues from open-source Python repositories, measuring a model's ability to write code patches that pass the associated test suite.
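Coding benchmarks like the ones above are usually scored with pass@k: the probability that at least one of k sampled solutions passes the tests. The standard unbiased estimator (introduced with HumanEval) can be computed directly; the sketch below assumes you have already run the tests and know how many of n samples passed.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n generated samples of which
    c passed the tests, estimate the probability that at least one of
    k randomly drawn samples passes."""
    if n - c < k:
        # Fewer failures than draws: some draw must include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 passed: pass@1 is simply the pass rate, 0.3.
print(pass_at_k(n=10, c=3, k=1))
```

pass@1 reduces to the raw pass rate, while larger k rewards models that find a correct solution somewhere among many attempts.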


Related Terms

Evaluation

The systematic process of measuring AI model quality, safety, and alignment against defined criteria.


Scaling Laws

Empirical relationships describing how model performance improves predictably with model size, dataset size, and compute.


Reasoning Model

A model trained to perform extended internal reasoning before producing a response.
