Most agent evaluations start in the wrong place. A team watches a model solve a tidy issue, compares a few provider claims, then tries to infer whether the agent will work inside a messy production codebase. That is not evaluation. It is a sales-assisted guess.
The real question is narrower and more useful: which agent can solve the kinds of changes that have already mattered in your repository? That includes local test habits, odd abstractions, hidden coupling, dependency quirks, and the boring edge cases that never show up in a demo.
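One practical way to find those tasks is to mine the repository's own history. The sketch below is a hypothetical heuristic, not a prescribed method: a merged change that touched both source and test files is a promising seed task, because the accompanying tests define "solved" without extra labeling work. All names and commit data here are illustrative.

```python
# Illustrative heuristic: a commit that changed at least one test file
# and at least one non-test file is a candidate benchmark task.
def is_candidate(changed_files):
    """True if a commit touched both test and non-test files."""
    tests = [f for f in changed_files if "test" in f.lower()]
    return bool(tests) and len(tests) < len(changed_files)

# Hypothetical history: (commit sha, files changed) pairs, e.g. from `git log`.
history = [
    ("a1b2c3", ["src/parser.py", "tests/test_parser.py"]),  # code + test
    ("d4e5f6", ["README.md"]),                               # docs only
    ("0f9e8d", ["tests/test_cache.py"]),                     # test only
]

candidates = [sha for sha, files in history if is_candidate(files)]
print(candidates)  # → ['a1b2c3']
```

In a real setup you would replace the hard-coded history with output from `git log --name-only` and add repository-specific filters, but the selection principle stays the same: let past changes with tests define the task set.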
## What changes when the benchmark comes from your code
- You compare agents on the same tasks instead of comparing different demo experiences.
- You get cost per solved task, not just token price or subscription cost.
- You can inspect failed attempts and see whether the failure mode matters in practice.
- You build a reusable baseline for future model, prompt, and wrapper changes.
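The cost-per-solved-task metric from the list above can be sketched in a few lines. This is a minimal illustration with made-up attempt records; the field names (`agent`, `solved`, `cost_usd`) are assumptions, not a real schema. The key detail is that failed attempts still count toward spend.

```python
from collections import defaultdict

# Hypothetical attempt log: one record per agent attempt on a benchmark task.
attempts = [
    {"agent": "agent-a", "solved": True,  "cost_usd": 0.40},
    {"agent": "agent-a", "solved": False, "cost_usd": 0.35},
    {"agent": "agent-b", "solved": True,  "cost_usd": 1.10},
    {"agent": "agent-b", "solved": True,  "cost_usd": 0.90},
]

spend = defaultdict(float)
solved = defaultdict(int)
for a in attempts:
    spend[a["agent"]] += a["cost_usd"]   # every attempt costs money
    solved[a["agent"]] += a["solved"]    # only successes count as solved

for agent in sorted(spend):
    # Total spend over solved tasks: failures inflate the true unit cost.
    cps = spend[agent] / solved[agent] if solved[agent] else float("inf")
    print(f"{agent}: ${cps:.2f} per solved task")
```

Here agent-a looks cheaper per attempt but its failed retry pushes its true unit cost to $0.75, against agent-b's $1.00 — the kind of comparison token price alone cannot surface.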
The result is a cleaner adoption conversation. Engineering can talk about reliability. Finance can talk about unit economics. Leadership can see a repeatable process instead of a collection of opinions.