Benchmark contamination occurs when evaluation items leak into a model's training data, fine-tuning data, few-shot examples, or publicly circulated prompts, making the reported score less meaningful.
Why it matters
A contaminated benchmark can make a model look generally capable when it has in fact memorized the test items or seen them indirectly.
Mitigation
Prefer fresh, held-out, rotating, or source-audited benchmarks, and report the evaluation date alongside every score.
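For editors who track scores programmatically, the sketch below shows one way to keep the evaluation date and benchmark version attached to a number; the class and all field values are hypothetical, not a required format.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class BenchmarkResult:
    """Hypothetical record for one reported score.

    Carrying the benchmark version and the evaluation date with the
    number is what lets readers judge contamination risk later.
    """
    model: str              # model name and version string
    benchmark: str          # benchmark name
    benchmark_version: str  # release or rotation tag of the test set
    score: float
    evaluated_on: date      # the evaluation date, never omitted

# Illustrative values only.
result = BenchmarkResult(
    model="example-model-v2",
    benchmark="held-out-eval",
    benchmark_version="2024-q3-rotation",
    score=71.4,
    evaluated_on=date(2024, 9, 12),
)
```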
Wiki rule
Do not rank models from stale score tables without citing the source, date, evaluation method, and contamination caveats.
What to write on a benchmark page
| Signal | How to handle it |
|---|---|
| Static public dataset | Explain that the test items may have been seen during training or through prompt circulation. |
| Rotating or private set | Describe the update method only as far as the benchmark source documents it. |
| Leaderboard jump | Record model version, date, task setting, and whether the score is directly comparable. |
| Missing method details | Mark the claim as incomplete instead of filling gaps with inference. |
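For editors who keep structured notes, here is a minimal sketch of how the fields in the table above could be captured on a page; all field names and values are illustrative, not a prescribed wiki schema.

```python
# Hypothetical page-entry structure mirroring the table above.
leaderboard_entry = {
    "model_version": "example-model-v2.1",
    "evaluation_date": "2024-09-12",
    "task_setting": "zero-shot, temperature 0",
    "score": None,                    # leave unset rather than copying a stale number
    "directly_comparable": False,     # True only if the setting matches prior rows
    "dataset_type": "static public",  # or "rotating" / "private"
    "contamination_caveat": (
        "Static public test set; items may have been seen during "
        "training or through prompt circulation."
    ),
    "method_details": "incomplete",   # mark gaps instead of inferring them
}
```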