Benchmark Contamination

Benchmark contamination occurs when evaluation items leak into a model's training data, tuning data, in-context examples, or public prompt trails, so the resulting score reflects exposure to the test rather than genuine capability.
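
As a rough illustration, one common contamination heuristic is word-level n-gram overlap between a test item and a training corpus; the function names, the choice of n, and the toy strings below are assumptions for this sketch, not a method any particular benchmark prescribes.

```python
from typing import Set

def ngrams(text: str, n: int = 8) -> Set[str]:
    """Return the set of word-level n-grams in a text (naive: keeps punctuation)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(test_item: str, training_text: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the training text."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)

# A high ratio suggests the item may have leaked into the training corpus.
print(overlap_ratio(
    "Paris is the capital of France.",
    "Scraped page: Paris is the capital of France, as everyone knows.",
    n=5,
))  # 0.5 with these toy strings
```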

Why it matters

A contaminated benchmark can make a model look generally capable when it has merely memorized the test or seen it indirectly.

Mitigation

Prefer fresh, held-out, rotating, or source-audited benchmarks, and always report the evaluation date alongside the score.
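
A minimal sketch of what reporting the evaluation date can look like in practice; the record shape, field names, and example values are hypothetical, not a schema any leaderboard actually uses.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class BenchmarkResult:
    model: str          # exact model version string
    benchmark: str      # benchmark name plus revision or rotation
    score: float
    evaluated_on: date  # always report when the evaluation ran
    held_out: bool      # True if the test set is private or rotating

# Hypothetical example values.
result = BenchmarkResult(
    model="example-model-v2.1",
    benchmark="ExampleEval (2024-06 rotation)",
    score=71.4,
    evaluated_on=date(2024, 7, 2),
    held_out=True,
)
print(f"{result.model}: {result.score} on {result.benchmark}, "
      f"evaluated {result.evaluated_on}, held-out={result.held_out}")
```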

Wiki rule

Do not rank models from stale score tables without source, date, method, and contamination caveats.
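
One way wiki tooling could enforce this rule mechanically is sketched below; the required field names and the `ranking_allowed` helper are hypothetical illustrations, not part of any existing pipeline.

```python
REQUIRED_FIELDS = ("source", "date", "method", "contamination_caveats")

def ranking_allowed(entry: dict) -> bool:
    """Allow ranking only if a score-table entry carries every required field."""
    return all(entry.get(field) for field in REQUIRED_FIELDS)

# This entry lacks contamination caveats, so it should be marked incomplete
# rather than ranked.
entry = {
    "model": "example-model-v2.1",
    "score": 71.4,
    "source": "https://example.org/eval-report",
    "date": "2024-07-02",
    "method": "5-shot, greedy decoding",
}
print(ranking_allowed(entry))  # False
```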

What to write on a benchmark page

| Signal | How to handle it |
| --- | --- |
| Static public dataset | Explain that the test may be seen during training or prompt circulation. |
| Rotating or private set | Describe the update method only as far as the benchmark source documents it. |
| Leaderboard jump | Record model version, date, task setting, and whether the score is directly comparable. |
| Missing method details | Mark the claim as incomplete instead of filling gaps with inference. |