LlmWikis knowledge page

Benchmark Hub

Current status

This hub teaches how to read and document benchmark claims. It does not publish live benchmark data or model rankings, run automated pulls, or track leaderboard freshness.

What a benchmark page must prevent

| Failure | How the page should handle it |
| --- | --- |
| Stale ranking | Show score date, benchmark version, model version, and whether results are copied or interpreted. |
| False comparability | Explain when two scores come from different prompts, harnesses, tool settings, quantization, or release dates. |
| Contamination silence | State whether the benchmark is static, rotating, held out, live, or unclear from the source. |
| Metric overreach | Explain what the benchmark measures and what a high score does not prove. |
| Missing source trail | Keep official benchmark source, paper, repository, or leaderboard URL close to the claim. |
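
The "False comparability" row can be made concrete with a small check. The following is a minimal sketch, not a prescribed schema: the `ScoreRecord` class, its field names, and the `comparability_gaps` helper are all hypothetical, chosen only to mirror the settings the table lists (prompts, harness, tool settings, quantization, release date). It reports which settings differ between two records, so the page can explain those differences before placing the scores side by side.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ScoreRecord:
    """One reported score plus the settings it was produced under (hypothetical schema)."""
    benchmark: str
    benchmark_version: str
    model_version: str
    score: float
    score_date: date
    prompt_set: str
    harness: str
    tool_settings: str
    quantization: str
    release_date: date


# Settings that should match before two scores are compared directly.
# This list is an assumption that follows the table row, not an official standard.
COMPARABILITY_FIELDS = (
    "benchmark_version",
    "prompt_set",
    "harness",
    "tool_settings",
    "quantization",
    "release_date",
)


def comparability_gaps(a: ScoreRecord, b: ScoreRecord) -> list[str]:
    """Return the names of the settings on which the two records differ."""
    return [name for name in COMPARABILITY_FIELDS if getattr(a, name) != getattr(b, name)]


# Illustrative records only; the benchmark, models, and scores are made up.
a = ScoreRecord("example-bench", "v2", "model-a-1.0", 71.5, date(2024, 5, 1),
                "default", "harness-x", "no-tools", "fp16", date(2024, 4, 1))
b = ScoreRecord("example-bench", "v2", "model-b-1.0", 73.0, date(2024, 5, 1),
                "default", "harness-y", "no-tools", "int8", date(2024, 4, 1))

print(comparability_gaps(a, b))  # ['harness', 'quantization']
```

An empty list does not prove the scores are comparable. It only means none of the recorded settings differ, so the contamination and metric caveats from the table still apply.
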

| Benchmark page field | Why it matters |
| --- | --- |
| Benchmark scope | Names the task family, modality, dataset shape, and what the score can and cannot prove. |
| Source and cadence | Shows where results come from, when they were last checked, and how often they change. |
| Contamination risk | Separates static benchmark claims from rotating, held-out, or live evaluation methods. |
| Score normalization | Explains whether scores are directly comparable or only useful inside one leaderboard. |
| Implementation notes | Captures prompts, wrappers, tool use, runtime settings, and failure modes when sourceable. |
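
The field list above could be enforced with a simple checklist before a benchmark page is accepted. This is a sketch under stated assumptions: the dictionary keys, the `check_page` helper, and the idea of storing a page as a plain dict are hypothetical, while the contamination labels are taken from the wording used on this page (static, rotating, held out, live, unclear).

```python
# Hypothetical keys, one per row of the field table.
REQUIRED_FIELDS = (
    "benchmark_scope",
    "source_and_cadence",
    "contamination_risk",
    "score_normalization",
    "implementation_notes",
)

# Labels named on this page for how a benchmark handles contamination.
CONTAMINATION_LABELS = {"static", "rotating", "held out", "live", "unclear"}


def check_page(page: dict) -> list[str]:
    """Return a list of problems; an empty list means every field is filled in."""
    problems = [
        f"missing: {name}"
        for name in REQUIRED_FIELDS
        if not (page.get(name) or "").strip()
    ]
    label = (page.get("contamination_risk") or "").strip().lower()
    if label and label not in CONTAMINATION_LABELS:
        problems.append(f"unknown contamination label: {label}")
    return problems


# Illustrative draft page; the contents are placeholders, not real benchmark data.
draft = {
    "benchmark_scope": "Text-only code generation; does not measure debugging.",
    "source_and_cadence": "Official leaderboard, checked manually.",
    "contamination_risk": "static",
    "implementation_notes": "",
}

for problem in check_page(draft):
    print(problem)
```

The check only confirms that each field is present and that the contamination label is one of the expected values. Whether the text in a field is accurate still needs a human reading against the source.
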