T³ Atlas · benchmark library

Aggregated lm-eval-harness results across the T³ architectural lineage and canonical baselines. Each row shows the best checkpoint per (model, task); hover any model name for the lineage / methodology context note.
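
A minimal sketch of that "best checkpoint per (model, task)" rule, assuming a flat per-evaluation export with model, task, checkpoint, and score columns (the results.csv path and column names are illustrative, not the Atlas schema):

```python
import pandas as pd

# Illustrative export: one row per (model, task, checkpoint) lm-eval-harness run.
results = pd.read_csv("results.csv")

# Keep only the best-scoring checkpoint for each (model, task) pair.
best_rows = results.loc[results.groupby(["model", "task"])["score"].idxmax()]
best_rows = best_rows.reset_index(drop=True)
```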

Epistemic tiers — what the badges mean

verified · ckpt re-evaluated 2026-05-01+, training log parsed, no known bugs, attribution independently confirmed
probable · provenance sound, no known bugs, no independent re-eval performed
caveated · provenance sound BUT a known bug or limitation affects interpretation; cite with caveat
suspect · provenance unclear or attribution mismatch; do not cite without re-verification
archival · known broken or pre-release; lineage continuity only, not for benchmark interpretation
Hover any badge to see the row's specific caveat tags.
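
The tiers amount to a small citation policy; a sketch under the assumption that each row carries a tier label and a list of caveat tags (class and constant names here are illustrative, not part of the Atlas):

```python
from enum import Enum

class Tier(Enum):
    """Epistemic tiers from the badge legend above."""
    VERIFIED = "verified"   # re-evaluated 2026-05-01+, logs parsed, attribution confirmed
    PROBABLE = "probable"   # provenance sound, no independent re-eval
    CAVEATED = "caveated"   # known bug or limitation; cite only with the caveat stated
    SUSPECT = "suspect"     # provenance unclear or attribution mismatch; re-verify first
    ARCHIVAL = "archival"   # known broken or pre-release; lineage continuity only

# Tiers that can be cited as-is; caveated rows must repeat their caveat tags,
# suspect and archival rows should not be cited for benchmark numbers at all.
CITE_AS_IS = {Tier.VERIFIED, Tier.PROBABLE}
```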

Parameter efficiency — one panel per task

log-x: parameter count · y: accuracy · larger dots = more recent T³ versions. Panels where a smaller-scale T³ matches or beats a larger-scale baseline are the headline result.
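
A sketch of how one such panel could be drawn with matplotlib, assuming per-row params, score, and version_rank fields (version_rank being a hypothetical integer that increases with T³ version recency):

```python
import matplotlib.pyplot as plt

def plot_task_panel(df, task):
    """One panel: log-x parameter count, y accuracy, dot size by T³ version recency."""
    sub = df[df["task"] == task]
    sizes = 20 + 15 * sub["version_rank"]        # larger dots = more recent T³ versions
    plt.scatter(sub["params"], sub["score"], s=sizes)
    plt.xscale("log")
    plt.xlabel("parameters")
    plt.ylabel("accuracy")
    plt.title(task)
    plt.show()
```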

Compute frontier — score vs params × tokens

log-x: total training compute (params × tokens, proxy for FLOPs) · y: accuracy. Dashed line + shaded region = Pareto frontier across all models in the library. Points on the frontier (white outline) define state-of-the-art per task at that compute level. Each panel reports the cleanest same-data apples-to-apples comparison: a T³ row paired with the vanilla baseline trained on the identical data mix. We avoid quoting compute-equivalence ratios against cross-corpus baselines (e.g. Qwen, SmolLM), which are trained on different data at very different scales and would produce misleading absolute ratios. Models without confirmed token counts are omitted.
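
The frontier itself is the usual dominance filter over (compute, score) points; a minimal sketch, taking compute as params × tokens exactly as above:

```python
def pareto_frontier(points):
    """points: iterable of (compute, score), with compute = params * tokens.
    Returns points not dominated by any cheaper-or-equal point with a higher score."""
    frontier, best_score = [], float("-inf")
    # Sort by ascending compute; break compute ties in favour of the higher score.
    for compute, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:                   # strictly improves on every cheaper point
            frontier.append((compute, score))
            best_score = score
    return frontier
```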

Headline table

Best score per model × task. Click any column header to sort. Hover any model name for full lineage notes.
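
Starting from the same illustrative flat export as above, the headline table is a max-aggregated pivot (column names again illustrative):

```python
import pandas as pd

results = pd.read_csv("results.csv")   # hypothetical flat export, one row per evaluation

# Rows = models, columns = tasks, cells = best score for that pair.
table = results.pivot_table(index="model", columns="task", values="score", aggfunc="max")

# Sorting on any task column mirrors the click-to-sort behaviour of the interactive table.
print(table.sort_values(by=table.columns[0], ascending=False))
```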

Known gaps + re-run targets