On Terminal-Bench 2.0, the standard benchmark for terminal-based programming tasks, agents with Covenant rules scored 25 points higher than agents without rules on the same AI model. Built in one session by one person. Two hundred and thirty lines of rules.
The honest read: agents built on stronger, proprietary models still lead. The interesting result: the same Claude model gains 25 points just by following structured rules instead of winging it.
| Rank | Agent | Model | Score | Notes |
|---|---|---|---|---|
| 01 | Codex CLI | GPT-5.5 | 82.0% | Proprietary frontier |
| 02 | ForgeCode | GPT-5.4 | 81.8% | Proprietary frontier |
| 03 | TongAgents | Gemini 3.1 Pro | 80.2% | Proprietary frontier |
| 04 | Covenant Agent | Claude Opus 4.7 | 67.4% | Open framework |
| 05 | Claude Code (vanilla) | Claude Opus 4.6 | 58.0% | No governance layer |
| -- | Ad-hoc prompted baseline | Claude Opus 4.7 | 42.0% | Same model, no Canon |
Scores from Terminal-Bench leaderboard, May 2026. Full 89-task run, no retry, single attempt.
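The headline gap can be checked directly from the two Claude Opus 4.7 rows in the table. The snippet below is only a worked calculation; the dictionary keys and variable names are illustrative, and the relative-gain figure is derived here rather than reported in the source.

```python
# Scores copied from the table above (percent of tasks passed).
scores = {
    "covenant_agent_opus_4_7": 67.4,
    "adhoc_baseline_opus_4_7": 42.0,
}

covenant = scores["covenant_agent_opus_4_7"]
baseline = scores["adhoc_baseline_opus_4_7"]

# Absolute gap in percentage points (the "25 points" in the headline).
delta_points = round(covenant - baseline, 1)

# Relative gain over the baseline, derived for illustration only.
relative_gain = round((covenant - baseline) / baseline * 100, 1)

print(delta_points)   # 25.4
print(relative_gain)  # 60.5
```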
All 89 Terminal-Bench 2.0 tasks run unmodified, in original order, against each configuration. No task selection and no harness-level retries: each task gets a single attempt, within which the agent may retry only as its own retry policy permits.
Three configurations were run on Claude Opus 4.7: (a) an ad-hoc prompted baseline, (b) the Covenant Canon, and (c) the Canon plus the full agent registry. The headline reports (b) because it isolates the contribution of the rules themselves.
A 20-task subset was replicated on independent infrastructure, with scores varying by plus or minus 1.2 points. The full report is published in the methodology appendix.
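As a rough illustration of the protocol described above (every task, in original order, one attempt per task, repeated per configuration), a harness loop might look like the sketch below. `run_benchmark`, `run_task`, the task IDs, and the configuration labels are hypothetical names chosen for this example; this is not the actual Terminal-Bench or Covenant tooling.

```python
from typing import Callable, Dict, List


def run_benchmark(
    tasks: List[str],
    configs: List[str],
    run_task: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Return the pass rate (percent) per configuration.

    Assumption: run_task(config, task) performs one attempt and
    returns True if the task passed.
    """
    results: Dict[str, float] = {}
    for config in configs:
        passed = 0
        for task in tasks:  # original order, no task selection
            if run_task(config, task):  # single attempt, no harness retry
                passed += 1
        results[config] = round(100.0 * passed / len(tasks), 1)
    return results


# Stub runner with made-up outcomes, purely to show the call shape.
def fake_run_task(config: str, task: str) -> bool:
    passing = {"baseline": {"t1"}, "canon": {"t1", "t2", "t3"}}
    return task in passing[config]


demo = run_benchmark(["t1", "t2", "t3", "t4"], ["baseline", "canon"], fake_run_task)
print(demo)  # {'baseline': 25.0, 'canon': 75.0}
```

The stub outcomes are invented and bear no relation to the reported scores.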
No fine-tuning. No extra tools. No tricks. The 25-point improvement comes from the rules alone.
The benchmark run used 230 lines of rules. Testing showed most of the improvement came from rules 1, 3, and 4.
1. Before coding, list the directory and read key files. Understand what exists.
2. State your approach in one or two sentences before executing.
3. If a command fails, diagnose. Never run the same failing command twice.
4. After implementing, test. Run it. Check the output.
5. Work efficiently. Don't read files you don't need.
6. If three attempts fail, step back and reconsider the whole approach.
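Two of these rules, never repeating a failing command and stepping back after three failed attempts, are mechanical enough to sketch in code. The class below is a minimal illustration, not Covenant's implementation: `CommandGuard`, its method names, and the return labels are hypothetical, and it assumes "three attempts" means three failures since the last success.

```python
class CommandGuard:
    """Hypothetical helper that tracks failed commands for an agent loop."""

    def __init__(self, max_failures: int = 3) -> None:
        self.failed_commands = set()
        self.failures = 0
        self.max_failures = max_failures

    def may_run(self, command: str) -> bool:
        # Refuse to re-run a command that has already failed verbatim.
        return command.strip() not in self.failed_commands

    def record(self, command: str, exit_code: int) -> str:
        # Assumption: a success resets the failure count.
        if exit_code == 0:
            self.failures = 0
            return "ok"
        self.failed_commands.add(command.strip())
        self.failures += 1
        if self.failures >= self.max_failures:
            return "reconsider"  # step back and rethink the approach
        return "diagnose"  # inspect the error before trying something new


guard = CommandGuard()
outcomes = [guard.record(cmd, 1) for cmd in ("make test", "make -j1 test", "pytest -x")]
print(outcomes)                    # ['diagnose', 'diagnose', 'reconsider']
print(guard.may_run("make test"))  # False
```

An agent loop would call `may_run` before executing a command and `record` afterwards, treating "reconsider" as the signal to re-plan rather than try a fourth variation.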