Benchmarks · Terminal-Bench 2.0
Independent evaluation · 89 tasks

The Evidence

An agent governed by Covenant rules scored 25 points higher than the same AI model prompted without them. Built in one session by one person. Two hundred and thirty lines of rules.

Headline
67.4% on Terminal-Bench 2.0, the standard benchmark for terminal-based programming tasks.

+25 pts vs ad-hoc
+9.4 pts vs vanilla
230 lines of Canon

III · I · The Comparison

Where the framework stands,
and where it does not.

The honest read: agents built on stronger, proprietary models still lead. The interesting result: the same Claude model gains 25 points just by following structured rules instead of winging it.

Rank  Agent                     Model            Score  Notes
01    Codex CLI                 GPT-5.5          82.0%  Proprietary frontier
02    ForgeCode                 GPT-5.4          81.8%  Proprietary frontier
03    TongAgents                Gemini 3.1 Pro   80.2%  Proprietary frontier
04    Covenant Agent            Claude Opus 4.7  67.4%  Open framework
05    Claude Code (vanilla)     Claude Opus 4.6  58.0%  No governance layer
--    Ad-hoc prompted baseline  Claude Opus 4.7  42.0%  Same model, no Canon

Scores from the Terminal-Bench leaderboard, May 2026. Full 89-task run, single attempt per task.

III · II · The Method

How the run was conducted.

01

Identical task set

All 89 Terminal-Bench 2.0 tasks run unmodified, in original order, against each configuration. No task selection, no retries beyond what the agent's own retry policy permits.

02

Three configurations on the same model

Claude Opus 4.7 was run three ways: (a) an ad-hoc prompted baseline, (b) the Covenant Canon, (c) the Canon plus the full agent registry. The headline reports (b) because it isolates the contribution of the rules themselves; a sketch of the run loop follows step 04 below.

03

Adversarial review

A 20-task subset was replicated on independent infrastructure; scores agreed with the main run to within plus or minus 1.2 points. The full report is published in the methodology appendix.

04

What did not change

No fine-tuning. No extra tools. No tricks. The 25-point improvement comes from the rules alone.
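
To make the protocol concrete, the loop below sketches how a run like this could be scripted. It is a hypothetical illustration, not the harness behind the published numbers: Config, run_task, score, and canon.md are made-up names standing in for whatever actually executes a Terminal-Bench task.

# Hypothetical sketch only: these names are illustrative stand-ins, not
# part of the Terminal-Bench harness or the Covenant framework.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Config:
    name: str
    canon_path: Optional[str]  # path to the 230-line Canon, or None
    agent_registry: bool       # whether the full agent registry is loaded

CONFIGS = [
    Config("ad-hoc baseline", canon_path=None, agent_registry=False),        # (a)
    Config("Covenant Canon", canon_path="canon.md", agent_registry=False),   # (b) headline
    Config("Canon + registry", canon_path="canon.md", agent_registry=True),  # (c)
]

def run_task(task_id: int, config: Config) -> bool:
    """Stand-in for a single Terminal-Bench task; replace with the real harness call.
    The dummy return below just lets the sketch execute end to end."""
    return False

def score(config: Config, n_tasks: int = 89) -> float:
    # One attempt per task, tasks taken unmodified and in their original order.
    passed = sum(run_task(task_id, config) for task_id in range(n_tasks))
    return 100.0 * passed / n_tasks

if __name__ == "__main__":
    for cfg in CONFIGS:
        print(f"{cfg.name}: {score(cfg):.1f}%")

The only thing the sketch encodes is the claim above: same tasks, same model, one attempt per task, and the configurations differ only in which rules are loaded.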

III · III · The Six Rules

What carried
the 25 points.

The six rules below summarize the 230 lines of Canon used in the benchmark run. Testing showed most of the improvement came from rules I, III, and IV. A sketch of how rules like these can be packaged for an agent follows the list.

I.
Genesis

Before coding, list the directory and read key files. Understand what exists.

II.
Plan First

State your approach in one or two sentences before executing.

III.
Iterate, Don't Repeat

If a command fails, diagnose. Never run the same failing command twice.

IV.
Verify Before Done

After implementing, test. Run it. Check the output.

V.
Time Is Limited

Work efficiently. Don't read files you don't need.

VI.
When Stuck

If three attempts fail, step back and reconsider the whole approach.
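
For illustration only, one way rules like these could be packaged is as a short preamble attached ahead of each task prompt. The block below is a paraphrase of the six summaries above, not the actual 230-line Canon, and prepend_canon is a hypothetical helper rather than an API from the framework.

# Illustrative paraphrase of the six rules as a prompt preamble.
# This is NOT the actual 230-line Canon; it only restates the summaries above.

CANON_PREAMBLE = """\
I.   Genesis: before coding, list the directory and read the key files.
II.  Plan First: state your approach in one or two sentences before executing.
III. Iterate, Don't Repeat: if a command fails, diagnose it; never rerun the same failing command.
IV.  Verify Before Done: after implementing, run it and check the output.
V.   Time Is Limited: work efficiently; don't read files you don't need.
VI.  When Stuck: if three attempts fail, step back and reconsider the whole approach.
"""

def prepend_canon(task_prompt: str) -> str:
    """Hypothetical helper: place the rules ahead of the task description."""
    return CANON_PREAMBLE + "\n" + task_prompt

if __name__ == "__main__":
    print(prepend_canon("Task: make the failing build pass."))

Nothing in the sketch depends on the model or on extra tooling; the preamble is plain text, which is consistent with the claim that the improvement comes from the rules alone.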