Aggregated lm-eval-harness results across the T³ architectural lineage and canonical baselines. Each row shows the best checkpoint per (model, task); hover any model name for its lineage and methodology notes.
log-x: parameter count · y: accuracy · larger dots = more recent T³ versions. Points where T³ at smaller scale matches or beats baselines at larger scale are the headline result.
log-x: total training compute (params × tokens, a proxy for FLOPs) · y: accuracy. Dashed line + shaded region = Pareto frontier across all models in the library; points on the frontier (white outline) mark the state of the art for each task at that compute level. Each panel reports the cleanest apples-to-apples comparison: a T³ row paired with the vanilla baseline trained on the identical data mix. We avoid quoting compute-equivalence ratios against cross-corpus baselines (e.g. Qwen, SmolLM), which are trained on different data at very different scales, so absolute ratios against them would be misleading. Models without confirmed token counts are omitted.
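The frontier described above is the set of non-dominated points in the (compute, accuracy) plane: a model is on the frontier if no other model achieves higher accuracy at equal or lower compute. A minimal sketch of that selection, using hypothetical model names and illustrative compute/accuracy values (not figures from the actual results table):

```python
# Illustrative (name, compute, accuracy) records; all values are made up.
models = [
    ("T3-small",       1.2e20, 0.41),
    ("T3-base",        4.8e20, 0.55),
    ("baseline-base",  4.8e20, 0.49),
    ("baseline-large", 1.9e21, 0.56),
    ("T3-large",       1.9e21, 0.63),
]

def pareto_frontier(points):
    """Return points on the (min-compute, max-accuracy) Pareto frontier.

    Sort by compute ascending (accuracy descending as a tiebreak), then keep
    a point only if it strictly improves on the best accuracy seen so far.
    """
    frontier = []
    best_acc = float("-inf")
    for name, compute, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if acc > best_acc:
            frontier.append((name, compute, acc))
            best_acc = acc
    return frontier

for name, compute, acc in pareto_frontier(models):
    print(f"{name}: compute={compute:.1e}, acc={acc:.2f}")
```

With the sample data above, the baseline rows are dominated by a T³ model at the same compute, so only the T³ points survive on the frontier.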
Best score per model × task. Click any column header to sort. Hover any model name for full lineage notes.