Current status
The Evaluation Hub is a blueprint. It does not yet publish live benchmark data, automated pulls, or freshness guarantees.
| Benchmark page field | Why it matters |
|---|---|
| Benchmark scope | Names the task family, modality, dataset shape, and what the score can and cannot prove. |
| Source and cadence | Shows where results come from, when they were last checked, and how often they change. |
| Contamination risk | Separates static benchmark claims from newer, rotating, or held-out evaluation methods. |
| Score normalization | Explains whether scores are directly comparable or only useful inside one leaderboard. |
| Implementation notes | Captures prompts, wrappers, tool use, runtime settings, and failure modes when sourceable. |
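The fields above can be sketched as a minimal page record. This is a hypothetical schema, not a published Evaluation Hub API; every name below (`BenchmarkPage`, its attributes, `freshness_note`) is an assumption made for illustration.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkPage:
    """Hypothetical record for one Evaluation Hub benchmark page.

    Field names mirror the table above; none are part of a real API.
    """
    scope: str                      # task family, modality, dataset shape
    source_url: str                 # where results come from
    last_checked: str               # ISO date of the last manual check
    update_cadence: str             # e.g. "monthly" or "on release"
    contamination_risk: str         # "static", "rotating", or "held-out"
    scores_comparable: bool         # True only if normalized across leaderboards
    implementation_notes: str = ""  # prompts, wrappers, runtime settings

    def freshness_note(self) -> str:
        """Human-readable freshness line for the page header."""
        return f"Last checked {self.last_checked}; updates {self.update_cadence}."

page = BenchmarkPage(
    scope="code generation, Python, function-level",
    source_url="https://example.com/leaderboard",
    last_checked="2024-05-01",
    update_cadence="monthly",
    contamination_risk="static",
    scores_comparable=False,
)
print(page.freshness_note())  # → Last checked 2024-05-01; updates monthly.
```

Keeping `last_checked` and `update_cadence` as explicit fields makes the "no freshness guarantees" caveat testable later: a page whose check date lags its cadence can be flagged automatically.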