Current status
This hub teaches how to read and document benchmark claims. It does not publish live benchmark data or model rankings, run automated pulls, or track leaderboard freshness.
What a benchmark page must prevent
| Failure | How the page should handle it |
|---|---|
| Stale ranking | Show score date, benchmark version, model version, and whether results are copied or interpreted. |
| False comparability | Explain when two scores come from different prompts, harnesses, tool settings, quantization, or release dates. |
| Contamination silence | State whether the benchmark is static, rotating, held out, live, or unclear from the source. |
| Metric overreach | Explain what the benchmark measures and what a high score does not prove. |
| Missing source trail | Keep the official benchmark source, paper, repository, or leaderboard URL close to the claim. |
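
The false-comparability row can be made concrete with a small check. The sketch below is illustrative only: the `ScoreContext` type, its field names, and the sample values are hypothetical, chosen to mirror the settings listed in the table (prompts, harness, tool settings, quantization, release date). It reports which settings differ between two reported scores, so a page can name the gap instead of implying the numbers are interchangeable.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ScoreContext:
    """Hypothetical record of the conditions under which a score was reported."""
    prompt_set: str
    harness: str
    tool_settings: str
    quantization: str
    release_date: str


def comparability_gaps(a: ScoreContext, b: ScoreContext) -> list[str]:
    """Return the names of settings that differ between two score contexts."""
    return [f.name for f in fields(ScoreContext)
            if getattr(a, f.name) != getattr(b, f.name)]


# Placeholder values for illustration, not real benchmark data.
first = ScoreContext("prompts-v1", "harness-a", "no-tools", "fp16", "2024-01-01")
second = ScoreContext("prompts-v1", "harness-b", "no-tools", "int4", "2024-01-01")
print(comparability_gaps(first, second))  # ['harness', 'quantization']
```

An empty list means only that the recorded settings match. It does not show that the two scores are comparable in every respect.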

| Benchmark page field | Why it matters |
|---|---|
| Benchmark scope | Names the task family, modality, dataset shape, and what the score can and cannot prove. |
| Source and cadence | Shows where results come from, when they were last checked, and how often they change. |
| Contamination risk | Separates static benchmark claims from rotating, held-out, or live evaluation methods. |
| Score normalization | Explains whether scores are directly comparable or only useful inside one leaderboard. |
| Implementation notes | Captures prompts, wrappers, tool use, runtime settings, and failure modes when sourceable. |
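
One way to keep these fields from being skipped is to treat them as a record and flag whatever is empty before a page goes out. This is a minimal sketch under stated assumptions: the `BenchmarkPage` class and `missing_fields` helper are hypothetical names, and the five attributes map one-to-one onto the rows of the table above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BenchmarkPage:
    """Hypothetical container for the five page fields in the table above."""
    benchmark_scope: Optional[str] = None
    source_and_cadence: Optional[str] = None
    contamination_risk: Optional[str] = None
    score_normalization: Optional[str] = None
    implementation_notes: Optional[str] = None


def missing_fields(page: BenchmarkPage) -> list[str]:
    """Return the names of fields that are unset or contain only whitespace."""
    return [f.name for f in fields(page)
            if not (getattr(page, f.name) or "").strip()]


# Placeholder text for illustration only.
draft = BenchmarkPage(
    benchmark_scope="Code generation, text only",
    source_and_cadence="Official leaderboard, checked monthly",
)
print(missing_fields(draft))
```

Because implementation notes are only expected "when sourceable", a page could reasonably record an explicit "not available from source" there instead of leaving the field blank.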