RepoGauge Analysis

Replace vibes with evidence.

This report turns raw run and evaluation artifacts into a decision surface: which solvers win, which ones are cheap, how their latency and token burn compare, and how much spend you could claw back by switching already-solvable work to cheaper models.

Best Solver: codex-cli-mini
Cheapest Solver: codex-cli-mini
Group By: solver_id
Expensive Threshold: $1.00

Coverage

Resolved Instances

15

out of 15 unique tasks

Efficiency

Total Spend

$9.96

cost telemetry on 93.3% of attempts

Latency

Average Attempt

2.6m

45 attempt rows analyzed

Tokens

Total Tokens

17.8M

in 17.4M input and 464.2K output tokens

Tooling

Tool Calls

975

visible on 93.3% of attempts

Routing

Portfolio Cost Floor

$1.79

best solver gap $0.04

Best Solver

codex-cli-mini leads at 100.0% resolution with $0.12 per resolved issue.

Cheapest Successful Solver

codex-cli-mini is the low-cost anchor at $0.12 per resolved issue.

Cost Opportunity

If you route each already-solved task to the cheapest solver that also solved it, the current best-solver spend could drop by $0.04.
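The routing rule above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the task names, solver names, and dollar figures are hypothetical, and the function name `routed_savings` is an assumption introduced here.

```python
# Hypothetical per-task costs (USD) for attempts that resolved the task.
# Names and figures are illustrative, not taken from the run artifacts.
resolved_costs = {
    "task-a": {"solver-x": 0.20, "solver-y": 0.05},
    "task-b": {"solver-x": 0.10, "solver-y": 0.30},
    "task-c": {"solver-x": 0.15},
}


def routed_savings(resolved_costs, solver):
    """Return (current_spend, floor_spend, savings) over the tasks `solver` resolved."""
    current = 0.0
    floor = 0.0
    for costs in resolved_costs.values():
        if solver not in costs:
            continue  # this solver did not resolve the task, so there is nothing to reroute
        current += costs[solver]
        floor += min(costs.values())  # cheapest solver that also resolved the task
    return current, floor, current - floor


current, floor, savings = routed_savings(resolved_costs, "solver-x")
print(f"current ${current:.2f}, floor ${floor:.2f}, savings ${savings:.2f}")
```

Only tasks the solver already resolved are counted, so the floor never credits a cheaper solver for work it failed to complete.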

Solver Comparison

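Resolution, cost, latency, tokens, and tool usage on one surface.

Solver | Resolution | Resolved | Cost per resolved | Total spend | Latency | Latency (second column, unlabeled) | Tokens | Tokens (second column, unlabeled) | Tool calls | Potential savings | Savings share
codex-cli-mini | 100.0% | 15 | $0.12 | $1.84 | 2.3m | 3.3m | 582.1K | 582.1K | 22.667 | $0.04 | 2.3%
claude-cli-sonnet | 93.3% | 14 | $0.39 | $5.56 | 3.0m | 4.7m | 8.0K | 8.6K | 20.8 | $3.86 | 69.9%
opencode-kimi-k2p6 | 73.3% | 11 | $0.21 | $2.57 | 2.7m | 4.9m | 598.6K | 816.3K | 21.533 | $0.80 | 35.1%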

Solver Frontier

Resolution rate versus cost per resolved issue, with point size weighted by resolved instances.

Chart: Resolution Rate (y-axis, 0.0% to 100.0%) against Cost Per Resolved Issue (x-axis, $0.12 to $0.39). Higher and farther left is better.
  • codex-cli-mini: resolve 100.0%, cost $0.12, latency 2.3m, tokens 582.1K
  • claude-cli-sonnet: resolve 93.3%, cost $0.39, latency 3.0m, tokens 8.0K
  • opencode-kimi-k2p6: resolve 73.3%, cost $0.21, latency 2.7m, tokens 598.6K

Failure Reasons

Share of unresolved rows by dominant harness outcome or failure reason.

  • model_not_found: 3
  • timeout: 2

Cost Opportunities

How much each solver spends today on the tasks it solved, versus the cheapest solver that also solved the same tasks.

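Solver | Resolved tasks | Current spend | Cheapest-solver spend | Savings | Savings share | Savings per task
claude-cli-sonnet | 14 | $5.53 | $1.67 | $3.86 | 69.9% | $0.28
opencode-kimi-k2p6 | 11 | $2.28 | $1.48 | $0.80 | 35.1% | $0.07
codex-cli-mini | 15 | $1.84 | $1.79 | $0.04 | 2.3% | $0.0028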
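The derived figures in these rows appear to follow from the two spend figures and the resolved-task count. The sketch below recomputes them under that assumption; the interpretation of each value and the helper name `derive` are inferred here, not stated by the report.

```python
# Rows copied from this section: (solver, resolved tasks, current spend, cheapest-solver spend).
rows = [
    ("claude-cli-sonnet", 14, 5.53, 1.67),
    ("opencode-kimi-k2p6", 11, 2.28, 1.48),
    ("codex-cli-mini", 15, 1.84, 1.79),
]


def derive(resolved, current, cheapest):
    """Recompute savings, savings share, and savings per resolved task (assumed definitions)."""
    savings = current - cheapest
    return {
        "savings": savings,
        "share": savings / current,
        "per_task": savings / resolved,
    }


derived = {solver: derive(resolved, current, cheapest) for solver, resolved, current, cheapest in rows}
for solver, d in derived.items():
    print(f"{solver}: ${d['savings']:.2f} saved, {d['share']:.1%} of spend, ${d['per_task']:.4f} per task")
```

Because the report prints rounded dollar amounts, values recomputed this way can differ from the printed ones in the last digit.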

Potential Savings By Solver

Avoidable spend if successful tasks were handed to the cheapest successful model instead.

  • claude-cli-sonnet: $3.86
  • opencode-kimi-k2p6: $0.80
  • codex-cli-mini: $0.04

High-Leverage Instance Savings

Concrete examples where solver substitution changes spend the most while preserving success.

Instance | Cheapest solver | Cost | Compared solver | Cost | Savings | Solvers that resolved it
s1liconcow__repogauge-rg-71b332d9f694-reviewed | codex-cli-mini | $0.07 | claude-cli-sonnet | $0.61 | $0.54 | codex-cli-mini, claude-cli-sonnet
s1liconcow__repogauge-rg-d1ea78738817-reviewed | codex-cli-mini | $0.17 | claude-cli-sonnet | $0.65 | $0.49 | codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-89e8e567eb39-reviewed | codex-cli-mini | $0.16 | claude-cli-sonnet | $0.58 | $0.42 | codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-791dd2e150b4-reviewed | codex-cli-mini | $0.13 | claude-cli-sonnet | $0.54 | $0.41 | codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-fa095be4cc9e-reviewed | codex-cli-mini | $0.18 | claude-cli-sonnet | $0.58 | $0.39 | codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-d05949a4fd70-reviewed | codex-cli-mini | $0.20 | claude-cli-sonnet | $0.56 | $0.36 | codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-aac6186e81ce-reviewed | opencode-kimi-k2p6 | $0.11 | claude-cli-sonnet | $0.43 | $0.32 | opencode-kimi-k2p6, codex-cli-mini, claude-cli-sonnet
s1liconcow__repogauge-rg-04cb13bab51d-reviewed | codex-cli-mini | $0.22 | claude-cli-sonnet | $0.54 | $0.31 | codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-7d17196fe3ca-reviewed | codex-cli-mini | $0.13 | claude-cli-sonnet | $0.36 | $0.23 | codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-897a464a8fd5-reviewed | opencode-kimi-k2p6 | $0.06 | claude-cli-sonnet | $0.19 | $0.13 | opencode-kimi-k2p6, codex-cli-mini, claude-cli-sonnet

Requested Summary

The custom grouped rollup that `repogauge analyze` was asked to produce.

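solver_id | Resolution | Resolved | Cost per resolved | Latency | Tokens | Tool calls | (unlabeled)
claude-cli-sonnet | 93.3% | 14 | $0.39 | 3.0m | 8.0K | 20.8 | $0.04
codex-cli-mini | 100.0% | 15 | $0.12 | 2.3m | 582.1K | 22.667 | $0.01
opencode-kimi-k2p6 | 73.3% | 11 | $0.21 | 2.7m | 598.6K | 21.533 | $0.04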

Budget Frontier

At each budget ceiling, this is the best affordable solver. The `best_solver_id` is the answer to the practical question: if this is my budget, what should I run?

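Budget ceiling | best_solver_id | Resolution | Cost per resolved | Latency | Total spend | (unlabeled)
$0.12 | codex-cli-mini | 100.0% | $0.12 | 2.3m | $1.84 | ['codex-cli-mini']
$0.21 | codex-cli-mini | 100.0% | $0.12 | 2.3m | $1.84 | ['opencode-kimi-k2p6']
$0.39 | codex-cli-mini | 100.0% | $0.12 | 2.3m | $1.84 | ['claude-cli-sonnet']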
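One way to read the budget frontier is as a lookup: for each ceiling, keep the solvers whose cost per resolved issue fits under it and take the one with the highest resolution rate. The sketch below assumes that reading. The comparison against cost per resolved issue, the tie-break, and the function name `budget_frontier` are assumptions, not confirmed behavior of `repogauge analyze`.

```python
# Per-solver figures copied from the Solver Comparison section of this report.
solvers = [
    {"solver_id": "codex-cli-mini", "resolution": 1.000, "cost_per_resolved": 0.12},
    {"solver_id": "claude-cli-sonnet", "resolution": 0.933, "cost_per_resolved": 0.39},
    {"solver_id": "opencode-kimi-k2p6", "resolution": 0.733, "cost_per_resolved": 0.21},
]


def budget_frontier(solvers):
    """For each distinct cost ceiling, pick the affordable solver with the best resolution."""
    ceilings = sorted({s["cost_per_resolved"] for s in solvers})
    frontier = []
    for ceiling in ceilings:
        affordable = [s for s in solvers if s["cost_per_resolved"] <= ceiling]
        # Highest resolution wins; ties go to the cheaper solver (an assumed tie-break).
        best = max(affordable, key=lambda s: (s["resolution"], -s["cost_per_resolved"]))
        frontier.append((ceiling, best["solver_id"]))
    return frontier


frontier = budget_frontier(solvers)
for ceiling, solver_id in frontier:
    print(f"${ceiling:.2f}: {solver_id}")
```

With these figures every ceiling selects codex-cli-mini, because the cheapest solver is also the one with the highest resolution rate.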

Pareto Frontier

Rows here are not dominated on resolution, cost, and latency. They are the serious candidates.

Solver | Resolution | Cost per resolved | Latency | Resolved
codex-cli-mini | 100.0% | $0.12 | 2.3m | 15
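A solver is dominated when another solver is at least as good on all three axes and strictly better on at least one. The sketch below applies that standard definition to the figures in this report; the type and function names are introduced here for illustration and are not taken from the tool.

```python
from typing import NamedTuple


class Solver(NamedTuple):
    solver_id: str
    resolution: float         # fraction of tasks resolved, higher is better
    cost_per_resolved: float  # USD, lower is better
    latency_min: float        # average attempt latency in minutes, lower is better


# Figures copied from the Solver Comparison section of this report.
solvers = [
    Solver("codex-cli-mini", 1.000, 0.12, 2.3),
    Solver("claude-cli-sonnet", 0.933, 0.39, 3.0),
    Solver("opencode-kimi-k2p6", 0.733, 0.21, 2.7),
]


def dominates(a: Solver, b: Solver) -> bool:
    """True if `a` is no worse than `b` on every axis and strictly better on one."""
    no_worse = (
        a.resolution >= b.resolution
        and a.cost_per_resolved <= b.cost_per_resolved
        and a.latency_min <= b.latency_min
    )
    strictly_better = (
        a.resolution > b.resolution
        or a.cost_per_resolved < b.cost_per_resolved
        or a.latency_min < b.latency_min
    )
    return no_worse and strictly_better


def pareto_frontier(solvers):
    """Keep the solvers that no other solver dominates."""
    return [s.solver_id for s in solvers if not any(dominates(o, s) for o in solvers)]


print(pareto_frontier(solvers))
```

Here codex-cli-mini beats both other solvers on resolution, cost, and latency at once, which is why it is the only row on the frontier.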

Unresolved Samples

The highest-cost unresolved rows, surfaced to make debugging the next tranche straightforward.

opencode-kimi-k2p6: timeout

s1liconcow__repogauge-rg-d68b8bf7245e-reviewed

Observed behavior - s1liconcow/repogauge - Details: Implement deterministic environment signature and version key - Production changes: repogauge/mining/inspect.py, repogauge/mining/signature.py. Test changes: tests/unit/test_inspect.py, tests/unit/test_signat

Attempt State: timed_out
Latency: 5.5m
Spend: $0.30
Cluster: len=long|signal=error|version=semantic

claude-cli-sonnet: timeout

s1liconcow__repogauge-rg-d68b8bf7245e-reviewed

Observed behavior - s1liconcow/repogauge - Details: Implement deterministic environment signature and version key - Production changes: repogauge/mining/inspect.py, repogauge/mining/signature.py. Test changes: tests/unit/test_inspect.py, tests/unit/test_signat

Attempt State: timed_out
Latency: 5.4m
Spend: $0.03
Cluster: len=long|signal=error|version=semantic

opencode-kimi-k2p6: model_not_found

s1liconcow__repogauge-rg-71b332d9f694-reviewed

Observed behavior - s1liconcow/repogauge - Details: Landing changes for bead oss_repogauge-626 - orchestrate materialization flow - Production changes: repogauge/cli.py, repogauge/export/__init__.py, repogauge/export/materialize.py. Test changes: tests/unit/te

Attempt State: succeeded
Latency: 25.6s
Spend: -
Cluster: len=very_long|signal=error|version=semantic

opencode-kimi-k2p6: model_not_found

s1liconcow__repogauge-rg-ef70ca456338-reviewed

Observed behavior - s1liconcow/repogauge - Details: Fix JUnit classname-to-path splitting for class-based tests Classnames like 'tests.unit.test_foo.TestBar' were being converted to 'tests/unit/test_foo/TestBar.py' instead of the correct pytest node ID 'tests/

Attempt State: succeeded
Latency: 23.1s
Spend: -
Cluster: len=very_long|signal=error|version=semantic

opencode-kimi-k2p6: model_not_found

s1liconcow__repogauge-rg-c980ea1754af-reviewed

Observed behavior - s1liconcow/repogauge - Details: Land oss_repogauge-p46: add file role classifier - Production changes: repogauge/mining/__init__.py, repogauge/mining/file_roles.py. Test changes: tests/unit/test_file_roles.py. - Bead oss_repogauge-p46: Defi

Attempt State: succeeded
Latency: 21.7s
Spend: -
Cluster: len=very_long|signal=error|version=semantic

LLM Judge

43 jobs judged. Average delta -1.266. Better 0.0%, worse 100.0%.

Best Judge Solver

codex-cli-mini leads the advisory code-health comparison.

Judge Errors

0 latest-attempt rows could not be scored.

LLM Judge Solver View

Advisory diff-versus-gold scoring aggregated on the latest attempt per job.

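Solver | Jobs judged | Average delta | Better | Worse | Resolved | Judge errors
codex-cli-mini | 15 | -1.243 | 0.0% | 100.0% | 15 | 0
claude-cli-sonnet | 14 | -1.271 | 0.0% | 100.0% | 14 | 0
opencode-kimi-k2p6 | 14 | -1.286 | 0.0% | 100.0% | 11 | 0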

Judge Dimensions

Average better-or-worse signal per rubric dimension on the latest attempt per job.

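Dimension | Weight | Average delta | Better | Worse
task_fit | 0.3 | -1.14 | 0.0% | 100.0%
correctness_safety | 0.25 | -1.395 | 0.0% | 95.3%
maintainability | 0.2 | -1.093 | 0.0% | 100.0%
test_quality | 0.15 | -1.791 | 0.0% | 100.0%
change_focus | 0.1 | -0.884 | 7.0% | 58.1%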
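The first number in each row appears to be a rubric weight and the second the average delta for that dimension. Under that assumption, which the report does not state explicitly, the weights sum to 1.0 and the weighted sum of the dimension deltas reproduces the overall average delta of -1.266. A short check:

```python
# (weight, average delta) per rubric dimension, copied from the rows above.
# Treating the first number as a weight is an assumption made for this sketch.
rubric = {
    "task_fit": (0.30, -1.140),
    "correctness_safety": (0.25, -1.395),
    "maintainability": (0.20, -1.093),
    "test_quality": (0.15, -1.791),
    "change_focus": (0.10, -0.884),
}

total_weight = sum(weight for weight, _ in rubric.values())
weighted_delta = sum(weight * delta for weight, delta in rubric.values())

print(f"total weight: {total_weight:.2f}")
print(f"weighted delta: {weighted_delta:.3f}")
```

The largest contributions come from correctness_safety and task_fit, with test_quality having the worst unweighted delta.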

Resolved But Worse Than Gold

Successful attempts that still looked worse than the reference patch on code-health grounds.

Solver | Instance | Verdict | Delta | (unlabeled score) | Harness outcome | Attempt state | Judge rationale
claude-cli-sonnet | s1liconcow__repogauge-rg-04cb13bab51d-reviewed | much_worse | -1.8 | 0.92 | resolved | succeeded | Candidate partially carries usage/cost provenance through runtime results, but misses key schema and failure-handling pieces from the gold patch and includes substantial unrelated packaging churn.
codex-cli-mini | s1liconcow__repogauge-rg-04cb13bab51d-reviewed | much_worse | -1.8 | 0.89 | resolved | succeeded | Candidate adds part of the telemetry provenance plumbing, but misses the core propagation through adapter finalization, adds no regression tests, and includes unrelated packaging churn.
claude-cli-sonnet | s1liconcow__repogauge-rg-c980ea1754af-reviewed | much_worse | -1.8 | 0.94 | resolved | succeeded | Candidate adds a path classifier, but it diverges from the required API, omits the shared bulk-classification surface, introduces clear overclassification risks, and includes substantial unrelated generated-file churn with no tests.
claude-cli-sonnet | s1liconcow__repogauge-rg-791dd2e150b4-reviewed | much_worse | -1.5 | 0.95 | resolved | succeeded | Candidate lands the basic adapter bridge, but it is less compatible and less aligned with the validation parser than the gold patch, and it omits the accompanying regression tests.
claude-cli-sonnet | s1liconcow__repogauge-rg-897a464a8fd5-reviewed | much_worse | -1.5 | 0.89 | resolved | succeeded | Candidate partially implements adapter registration metadata but misses several gold behaviors, adds unrelated packaging artifacts, and appears not to include the corresponding unit test coverage.
codex-cli-mini | s1liconcow__repogauge-rg-89e8e567eb39-reviewed | much_worse | -1.5 | 0.89 | resolved | succeeded | Candidate implements the basic environment-plan wiring, but it diverges from the golden behavior in provenance/confidence semantics, omits the regression tests, and introduces unrelated TOML parsing churn with clear regression risk.
claude-cli-sonnet | s1liconcow__repogauge-rg-d1ea78738817-reviewed | much_worse | -1.5 | 0.92 | resolved | succeeded | Candidate implements basic problem-statement synthesis but misses provenance detail, handles fewer source shapes than the gold patch, and includes unrelated packaging churn without adding the regression tests.
codex-cli-mini | s1liconcow__repogauge-rg-5626b1b0e355-reviewed | much_worse | -1.4 | 0.92 | resolved | succeeded | Candidate fixes the `--junitxml` planning variant and handles explicit `<xpass/>`, but misses the gold patch's stricter junit parsing behavior and adds a different runtime-command rewrite in `validate.py` instead of the intended validation hardening.
opencode-kimi-k2p6 | s1liconcow__repogauge-rg-d1ea78738817-reviewed | much_worse | -1.4 | 0.91 | resolved | succeeded | Candidate lands the main field wiring but misses part of the required provenance contract and implements a weaker synthesis policy than the gold patch.
codex-cli-mini | s1liconcow__repogauge-rg-fa095be4cc9e-reviewed | much_worse | -1.4 | 0.88 | resolved | succeeded | Candidate partially implements adapter delegation but diverges from the golden fix by weakening dependency fingerprinting and omitting the regression test coverage.

Unresolved But Promising

Attempts that failed the harness but still looked directionally better than the gold patch on code quality.

No unresolved attempts looked promising against gold.

Best Diff Samples

The strongest candidate diffs according to the advisory judge.

Solver | Instance | Verdict | Delta | (unlabeled score) | Harness outcome | Attempt state | Judge rationale
codex-cli-mini | s1liconcow__repogauge-rg-ef70ca456338-reviewed | worse | -0.75 | 0.94 | resolved | succeeded | Candidate fixes the classname splitting behavior, but it omits the targeted regression test and adds unrelated packaging artifact churn.
claude-cli-sonnet | s1liconcow__repogauge-rg-71b332d9f694-reviewed | worse | -0.9 | 0.93 | resolved | succeeded | Candidate adds an `export` command path, but it is materially incomplete relative to the gold fix: it omits repo-root/input resolution support, does not include the production materialization module change, and adds no regression tests.
opencode-kimi-k2p6 | s1liconcow__repogauge-rg-791dd2e150b4-reviewed | worse | -0.9 | 0.95 | resolved | succeeded | Candidate lands the adapter hook but falls short of the gold bridge by duplicating parser logic, narrowing accepted payload shapes, and omitting the validation-layer refactor and compatibility tests that make the bridge consistent and robust.
codex-cli-mini | s1liconcow__repogauge-rg-c980ea1754af-reviewed | worse | -0.9 | 0.9 | resolved | succeeded | Candidate lands a workable classifier, but it is materially less complete than the gold patch and changes some taxonomy behavior in ways that increase downstream risk.
opencode-kimi-k2p6 | s1liconcow__repogauge-rg-fa095be4cc9e-reviewed | worse | -0.9 | 0.93 | resolved | succeeded | Candidate captures the core delegation refactor in production code but misses the gold patch's broader hint-source handling and regression test coverage.
claude-cli-sonnet | s1liconcow__repogauge-rg-546450916c29-reviewed | worse | -0.95 | 0.87 | resolved | succeeded | Candidate implements the core parser-name dispatch and lazy swebench import, but it is narrower and less robust than the gold patch and appears to omit the regression tests entirely.
opencode-kimi-k2p6 | s1liconcow__repogauge-rg-546450916c29-reviewed | worse | -0.95 | 0.93 | resolved | succeeded | Candidate fixes the immediate bridge path with a minimal hardcoded dispatcher and lazy import, but it is less extensible and less well covered than the gold patch.
opencode-kimi-k2p6 | s1liconcow__repogauge-rg-04cb13bab51d-reviewed | worse | -1 | 0.95 | resolved | succeeded | Candidate carries the schema/plumbing changes but misses key parts of the gold fix, especially Codex CLI failure handling, Anthropic provenance support, and the regression tests, while also adding unrelated packaging churn.
claude-cli-sonnet | s1liconcow__repogauge-rg-7d17196fe3ca-reviewed | worse | -1 | 0.97 | resolved | succeeded | Candidate adds a basic split helper, but it misses the required integration and tests and includes unrelated packaging churn, so it falls well short of the gold fix.
opencode-kimi-k2p6 | s1liconcow__repogauge-rg-897a464a8fd5-reviewed | worse | -1 | 0.87 | resolved | succeeded | Candidate fixes the basic adapter import/registration path but is narrower and noisier than the gold patch, with weaker edge-case handling and no corresponding test coverage in the diff.

Worst Diff Samples

The weakest candidate diffs according to the advisory judge.

Solver | Instance | Verdict | Delta | (unlabeled score) | Harness outcome | Attempt state | Judge rationale
opencode-kimi-k2p6 | s1liconcow__repogauge-rg-71b332d9f694-reviewed | much_worse | -2 | 0.99 | not_resolved | succeeded | Candidate misses the materialization/export implementation entirely and adds unrelated packaging artifacts, so it is substantially worse than the gold patch.
opencode-kimi-k2p6 | s1liconcow__repogauge-rg-c980ea1754af-reviewed | much_worse | -2 | 0.99 | not_resolved | succeeded | Candidate does not implement the file-role classifier or tests and instead adds unrelated packaging artifacts, so it is substantially worse than the gold patch.
opencode-kimi-k2p6 | s1liconcow__repogauge-rg-ef70ca456338-reviewed | much_worse | -2 | 0.99 | not_resolved | succeeded | Candidate does not implement the JUnit classname parsing fix and instead adds unrelated packaging metadata files, so it is substantially worse than the gold patch.
claude-cli-sonnet | s1liconcow__repogauge-rg-04cb13bab51d-reviewed | much_worse | -1.8 | 0.92 | resolved | succeeded | Candidate partially carries usage/cost provenance through runtime results, but misses key schema and failure-handling pieces from the gold patch and includes substantial unrelated packaging churn.
codex-cli-mini | s1liconcow__repogauge-rg-04cb13bab51d-reviewed | much_worse | -1.8 | 0.89 | resolved | succeeded | Candidate adds part of the telemetry provenance plumbing, but misses the core propagation through adapter finalization, adds no regression tests, and includes unrelated packaging churn.
claude-cli-sonnet | s1liconcow__repogauge-rg-c980ea1754af-reviewed | much_worse | -1.8 | 0.94 | resolved | succeeded | Candidate adds a path classifier, but it diverges from the required API, omits the shared bulk-classification surface, introduces clear overclassification risks, and includes substantial unrelated generated-file churn with no tests.
claude-cli-sonnet | s1liconcow__repogauge-rg-791dd2e150b4-reviewed | much_worse | -1.5 | 0.95 | resolved | succeeded | Candidate lands the basic adapter bridge, but it is less compatible and less aligned with the validation parser than the gold patch, and it omits the accompanying regression tests.
claude-cli-sonnet | s1liconcow__repogauge-rg-897a464a8fd5-reviewed | much_worse | -1.5 | 0.89 | resolved | succeeded | Candidate partially implements adapter registration metadata but misses several gold behaviors, adds unrelated packaging artifacts, and appears not to include the corresponding unit test coverage.
codex-cli-mini | s1liconcow__repogauge-rg-89e8e567eb39-reviewed | much_worse | -1.5 | 0.89 | resolved | succeeded | Candidate implements the basic environment-plan wiring, but it diverges from the golden behavior in provenance/confidence semantics, omits the regression tests, and introduces unrelated TOML parsing churn with clear regression risk.
claude-cli-sonnet | s1liconcow__repogauge-rg-d1ea78738817-reviewed | much_worse | -1.5 | 0.92 | resolved | succeeded | Candidate implements basic problem-statement synthesis but misses provenance detail, handles fewer source shapes than the gold patch, and includes unrelated packaging churn without adding the regression tests.