Benchmark contamination occurs when evaluation items leak into a model's training data, fine-tuning data, few-shot examples, or publicly circulated prompts, making the reported score less meaningful.
Why it matters
A contaminated benchmark can make a model look generally capable when it has in fact memorized the test items or seen them indirectly.
Mitigation
Prefer fresh, held-out, rotating, or source-audited benchmarks, and report the evaluation date alongside every score.
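For editors who track scores programmatically, the sketch below shows one way to keep the evaluation date and benchmark version attached to a number; the class and all field values are hypothetical, not a required format.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class BenchmarkResult:
    """Hypothetical record for one reported score.

    Carrying the benchmark version and the evaluation date with the
    number is what lets readers judge contamination risk later.
    """
    model: str              # model name and version string
    benchmark: str          # benchmark name
    benchmark_version: str  # release or rotation tag of the test set
    score: float
    evaluated_on: date      # the evaluation date, never omitted

# Illustrative values only.
result = BenchmarkResult(
    model="example-model-v2",
    benchmark="held-out-eval",
    benchmark_version="2024-q3-rotation",
    score=71.4,
    evaluated_on=date(2024, 9, 12),
)
```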
Wiki rule
Do not rank models from stale score tables without citing the source, date, evaluation method, and contamination caveats.
What to write on a benchmark page
| Signal | How to handle it |
|---|---|
| Static public dataset | Explain that the test items may have been seen during training or through prompt circulation. |
| Rotating or private set | Describe the update method only as far as the benchmark source documents it. |
| Leaderboard jump | Record model version, date, task setting, and whether the score is directly comparable. |
| Missing method details | Mark the claim as incomplete instead of filling gaps with inference. |
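For editors who keep structured notes, here is a minimal sketch of how the fields in the table above could be captured on a page; all field names and values are illustrative, not a prescribed wiki schema.

```python
# Hypothetical page-entry structure mirroring the table above.
leaderboard_entry = {
    "model_version": "example-model-v2.1",
    "evaluation_date": "2024-09-12",
    "task_setting": "zero-shot, temperature 0",
    "score": None,                    # leave unset rather than copying a stale number
    "directly_comparable": False,     # True only if the setting matches prior rows
    "dataset_type": "static public",  # or "rotating" / "private"
    "contamination_caveat": (
        "Static public test set; items may have been seen during "
        "training or through prompt circulation."
    ),
    "method_details": "incomplete",   # mark gaps instead of inferring them
}
```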