RepoGauge Analysis

Replace vibes with evidence.

This report turns raw run and evaluation artifacts into a decision surface: which solvers win, which ones are cheap, how their latency and token burn compare, and how much spend you could claw back by switching already-solvable work to cheaper models.

Best Solvercodex-cli-mini

Cheapest Solvercodex-cli-mini

Group Bysolver_id

Expensive Threshold$1.00

Coverage

Resolved Instances

out of 15 unique tasks

Efficiency

Total Spend

$9.96

cost telemetry on 93.3% of attempts

Latency

Average Attempt

2.6m

45 attempt rows analyzed

Tokens

Total Tokens

17.8M

in 17.4M input and 464.2K output tokens

Tooling

Tool Calls

975

visible on 93.3% of attempts

Routing

Portfolio Cost Floor

$1.79

best solver gap $0.04

Best Solver

codex-cli-mini leads at 100.0% resolution with $0.12 per resolved issue.

Cheapest Successful Solver

codex-cli-mini is the low-cost anchor at $0.12 per resolved issue.

Cost Opportunity

If you route each already-solved task to the cheapest solver that also solved it, the current best-solver spend could drop by $0.04.

Solver Comparison

Resolution, cost, latency, tokens, and tool usage on one surface. Click any column to sort.


codex-cli-mini	100.0%	15	$0.12	$1.84	2.3m	3.3m	582.1K	582.1K	22.667	$0.04	2.3%
claude-cli-sonnet	93.3%	14	$0.39	$5.56	3.0m	4.7m	8.0K	8.6K	20.8	$3.86	69.9%
opencode-kimi-k2p6	73.3%	11	$0.21	$2.57	2.7m	4.9m	598.6K	816.3K	21.533	$0.80	35.1%

Solver Frontier

Resolution rate versus cost per resolved issue, with point size weighted by resolved instances.

codex-cli-mini
claude-cli-sonnet
opencode-kimi-k2p6

Failure Reasons

Share of unresolved rows by dominant harness outcome or failure reason.

model_not_found3
timeout2

Cost Opportunities

How much each solver spends today on the tasks it solved, versus the cheapest solver that also solved the same tasks.


claude-cli-sonnet	14	$5.53	$1.67	$3.86	69.9%	$0.28
opencode-kimi-k2p6	11	$2.28	$1.48	$0.80	35.1%	$0.07
codex-cli-mini	15	$1.84	$1.79	$0.04	2.3%	$0.0028

Potential Savings By Solver

Avoidable spend if successful tasks were handed to the cheapest successful model instead.

claude-cli-sonnet$3.86
opencode-kimi-k2p6$0.80
codex-cli-mini$0.04

High-Leverage Instance Savings

Concrete examples where solver substitution changes spend the most while preserving success.


s1liconcow__repogauge-rg-71b332d9f694-reviewed	codex-cli-mini	$0.07	claude-cli-sonnet	$0.61	$0.54	codex-cli-mini, claude-cli-sonnet
s1liconcow__repogauge-rg-d1ea78738817-reviewed	codex-cli-mini	$0.17	claude-cli-sonnet	$0.65	$0.49	codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-89e8e567eb39-reviewed	codex-cli-mini	$0.16	claude-cli-sonnet	$0.58	$0.42	codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-791dd2e150b4-reviewed	codex-cli-mini	$0.13	claude-cli-sonnet	$0.54	$0.41	codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-fa095be4cc9e-reviewed	codex-cli-mini	$0.18	claude-cli-sonnet	$0.58	$0.39	codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-d05949a4fd70-reviewed	codex-cli-mini	$0.20	claude-cli-sonnet	$0.56	$0.36	codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-aac6186e81ce-reviewed	opencode-kimi-k2p6	$0.11	claude-cli-sonnet	$0.43	$0.32	opencode-kimi-k2p6, codex-cli-mini, claude-cli-sonnet
s1liconcow__repogauge-rg-04cb13bab51d-reviewed	codex-cli-mini	$0.22	claude-cli-sonnet	$0.54	$0.31	codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-7d17196fe3ca-reviewed	codex-cli-mini	$0.13	claude-cli-sonnet	$0.36	$0.23	codex-cli-mini, opencode-kimi-k2p6, claude-cli-sonnet
s1liconcow__repogauge-rg-897a464a8fd5-reviewed	opencode-kimi-k2p6	$0.06	claude-cli-sonnet	$0.19	$0.13	opencode-kimi-k2p6, codex-cli-mini, claude-cli-sonnet

Requested Summary

The custom grouped rollup that `repogauge analyze` was asked to produce.


claude-cli-sonnet	93.3%	14	$0.39	3.0m	8.0K	20.8	$0.04
codex-cli-mini	100.0%	15	$0.12	2.3m	582.1K	22.667	$0.01
opencode-kimi-k2p6	73.3%	11	$0.21	2.7m	598.6K	21.533	$0.04

Budget Frontier

At each budget ceiling, this is the best affordable solver. The `best_solver_id` is the answer to the practical question: if this is my budget, what should I run?


$0.12	codex-cli-mini	100.0%	$0.12	2.3m	$1.84	['codex-cli-mini']
$0.21	codex-cli-mini	100.0%	$0.12	2.3m	$1.84	['opencode-kimi-k2p6']
$0.39	codex-cli-mini	100.0%	$0.12	2.3m	$1.84	['claude-cli-sonnet']

Pareto Frontier

Rows here are not dominated on resolution, cost, and latency. They are the serious candidates.


codex-cli-mini	100.0%	$0.12	2.3m	15

Unresolved Samples

The highest-cost unresolved rows, surfaced to make debugging the next tranche straightforward.

opencode-kimi-k2p6timeout

s1liconcow__repogauge-rg-d68b8bf7245e-reviewed

Attempt State: timed_out
Latency: 5.5m
Spend: $0.30
Cluster: len=long|signal=error|version=semantic

claude-cli-sonnettimeout

s1liconcow__repogauge-rg-d68b8bf7245e-reviewed

Attempt State: timed_out
Latency: 5.4m
Spend: $0.03
Cluster: len=long|signal=error|version=semantic

opencode-kimi-k2p6model_not_found

s1liconcow__repogauge-rg-71b332d9f694-reviewed

Attempt State: succeeded
Latency: 25.6s
Spend: -
Cluster: len=very_long|signal=error|version=semantic

opencode-kimi-k2p6model_not_found

s1liconcow__repogauge-rg-ef70ca456338-reviewed

Attempt State: succeeded
Latency: 23.1s
Spend: -
Cluster: len=very_long|signal=error|version=semantic

opencode-kimi-k2p6model_not_found

s1liconcow__repogauge-rg-c980ea1754af-reviewed

Attempt State: succeeded
Latency: 21.7s
Spend: -
Cluster: len=very_long|signal=error|version=semantic

LLM Judge

43 jobs judged. Average delta -1.266. Better 0.0%, worse 100.0%.

Best Judge Solver

codex-cli-mini leads the advisory code-health comparison.

Judge Errors

0 latest-attempt rows could not be scored.

LLM Judge Solver View

Advisory diff-versus-gold scoring aggregated on the latest attempt per job.


codex-cli-mini	15	-1.243	0.0%	100.0%	15
claude-cli-sonnet	14	-1.271	0.0%	100.0%	14
opencode-kimi-k2p6	14	-1.286	0.0%	100.0%	11

Judge Dimensions

Average better-or-worse signal per rubric dimension on the latest attempt per job.


task_fit	0.3	-1.14	0.0%	100.0%
correctness_safety	0.25	-1.395	0.0%	95.3%
maintainability	0.2	-1.093	0.0%	100.0%
test_quality	0.15	-1.791	0.0%	100.0%
change_focus	0.1	-0.884	7.0%	58.1%

Resolved But Worse Than Gold

Successful attempts that still looked worse than the reference patch on code-health grounds.


claude-cli-sonnet	s1liconcow__repogauge-rg-04cb13bab51d-reviewed	much_worse	-1.8	0.92	resolved	succeeded	Candidate partially carries usage/cost provenance through runtime results, but misses key schema and failure-handling pieces from the gold patch and includes substantial unrelated packaging churn.
codex-cli-mini	s1liconcow__repogauge-rg-04cb13bab51d-reviewed	much_worse	-1.8	0.89	resolved	succeeded	Candidate adds part of the telemetry provenance plumbing, but misses the core propagation through adapter finalization, adds no regression tests, and includes unrelated packaging churn.
claude-cli-sonnet	s1liconcow__repogauge-rg-c980ea1754af-reviewed	much_worse	-1.8	0.94	resolved	succeeded	Candidate adds a path classifier, but it diverges from the required API, omits the shared bulk-classification surface, introduces clear overclassification risks, and includes substantial unrelated generated-file churn with no tests.
claude-cli-sonnet	s1liconcow__repogauge-rg-791dd2e150b4-reviewed	much_worse	-1.5	0.95	resolved	succeeded	Candidate lands the basic adapter bridge, but it is less compatible and less aligned with the validation parser than the gold patch, and it omits the accompanying regression tests.
claude-cli-sonnet	s1liconcow__repogauge-rg-897a464a8fd5-reviewed	much_worse	-1.5	0.89	resolved	succeeded	Candidate partially implements adapter registration metadata but misses several gold behaviors, adds unrelated packaging artifacts, and appears not to include the corresponding unit test coverage.
codex-cli-mini	s1liconcow__repogauge-rg-89e8e567eb39-reviewed	much_worse	-1.5	0.89	resolved	succeeded	Candidate implements the basic environment-plan wiring, but it diverges from the golden behavior in provenance/confidence semantics, omits the regression tests, and introduces unrelated TOML parsing churn with clear regression risk.
claude-cli-sonnet	s1liconcow__repogauge-rg-d1ea78738817-reviewed	much_worse	-1.5	0.92	resolved	succeeded	Candidate implements basic problem-statement synthesis but misses provenance detail, handles fewer source shapes than the gold patch, and includes unrelated packaging churn without adding the regression tests.
codex-cli-mini	s1liconcow__repogauge-rg-5626b1b0e355-reviewed	much_worse	-1.4	0.92	resolved	succeeded	Candidate fixes the `--junitxml` planning variant and handles explicit `<xpass/>`, but misses the gold patch's stricter junit parsing behavior and adds a different runtime-command rewrite in `validate.py` instead of the intended validation hardening.
opencode-kimi-k2p6	s1liconcow__repogauge-rg-d1ea78738817-reviewed	much_worse	-1.4	0.91	resolved	succeeded	Candidate lands the main field wiring but misses part of the required provenance contract and implements a weaker synthesis policy than the gold patch.
codex-cli-mini	s1liconcow__repogauge-rg-fa095be4cc9e-reviewed	much_worse	-1.4	0.88	resolved	succeeded	Candidate partially implements adapter delegation but diverges from the golden fix by weakening dependency fingerprinting and omitting the regression test coverage.

Unresolved But Promising

Attempts that failed the harness but still looked directionally better than the gold patch on code quality.

No unresolved attempts looked promising against gold.

Best Diff Samples

The strongest candidate diffs according to the advisory judge.


codex-cli-mini	s1liconcow__repogauge-rg-ef70ca456338-reviewed	worse	-0.75	0.94	resolved	succeeded	Candidate fixes the classname splitting behavior, but it omits the targeted regression test and adds unrelated packaging artifact churn.
claude-cli-sonnet	s1liconcow__repogauge-rg-71b332d9f694-reviewed	worse	-0.9	0.93	resolved	succeeded	Candidate adds an `export` command path, but it is materially incomplete relative to the gold fix: it omits repo-root/input resolution support, does not include the production materialization module change, and adds no regression tests.
opencode-kimi-k2p6	s1liconcow__repogauge-rg-791dd2e150b4-reviewed	worse	-0.9	0.95	resolved	succeeded	Candidate lands the adapter hook but falls short of the gold bridge by duplicating parser logic, narrowing accepted payload shapes, and omitting the validation-layer refactor and compatibility tests that make the bridge consistent and robust.
codex-cli-mini	s1liconcow__repogauge-rg-c980ea1754af-reviewed	worse	-0.9	0.9	resolved	succeeded	Candidate lands a workable classifier, but it is materially less complete than the gold patch and changes some taxonomy behavior in ways that increase downstream risk.
opencode-kimi-k2p6	s1liconcow__repogauge-rg-fa095be4cc9e-reviewed	worse	-0.9	0.93	resolved	succeeded	Candidate captures the core delegation refactor in production code but misses the gold patch's broader hint-source handling and regression test coverage.
claude-cli-sonnet	s1liconcow__repogauge-rg-546450916c29-reviewed	worse	-0.95	0.87	resolved	succeeded	Candidate implements the core parser-name dispatch and lazy swebench import, but it is narrower and less robust than the gold patch and appears to omit the regression tests entirely.
opencode-kimi-k2p6	s1liconcow__repogauge-rg-546450916c29-reviewed	worse	-0.95	0.93	resolved	succeeded	Candidate fixes the immediate bridge path with a minimal hardcoded dispatcher and lazy import, but it is less extensible and less well covered than the gold patch.
opencode-kimi-k2p6	s1liconcow__repogauge-rg-04cb13bab51d-reviewed	worse	-1	0.95	resolved	succeeded	Candidate carries the schema/plumbing changes but misses key parts of the gold fix, especially Codex CLI failure handling, Anthropic provenance support, and the regression tests, while also adding unrelated packaging churn.
claude-cli-sonnet	s1liconcow__repogauge-rg-7d17196fe3ca-reviewed	worse	-1	0.97	resolved	succeeded	Candidate adds a basic split helper, but it misses the required integration and tests and includes unrelated packaging churn, so it falls well short of the gold fix.
opencode-kimi-k2p6	s1liconcow__repogauge-rg-897a464a8fd5-reviewed	worse	-1	0.87	resolved	succeeded	Candidate fixes the basic adapter import/registration path but is narrower and noisier than the gold patch, with weaker edge-case handling and no corresponding test coverage in the diff.

Worst Diff Samples

The weakest candidate diffs according to the advisory judge.


opencode-kimi-k2p6	s1liconcow__repogauge-rg-71b332d9f694-reviewed	much_worse	-2	0.99	not_resolved	succeeded	Candidate misses the materialization/export implementation entirely and adds unrelated packaging artifacts, so it is substantially worse than the gold patch.
opencode-kimi-k2p6	s1liconcow__repogauge-rg-c980ea1754af-reviewed	much_worse	-2	0.99	not_resolved	succeeded	Candidate does not implement the file-role classifier or tests and instead adds unrelated packaging artifacts, so it is substantially worse than the gold patch.
opencode-kimi-k2p6	s1liconcow__repogauge-rg-ef70ca456338-reviewed	much_worse	-2	0.99	not_resolved	succeeded	Candidate does not implement the JUnit classname parsing fix and instead adds unrelated packaging metadata files, so it is substantially worse than the gold patch.
claude-cli-sonnet	s1liconcow__repogauge-rg-04cb13bab51d-reviewed	much_worse	-1.8	0.92	resolved	succeeded	Candidate partially carries usage/cost provenance through runtime results, but misses key schema and failure-handling pieces from the gold patch and includes substantial unrelated packaging churn.
codex-cli-mini	s1liconcow__repogauge-rg-04cb13bab51d-reviewed	much_worse	-1.8	0.89	resolved	succeeded	Candidate adds part of the telemetry provenance plumbing, but misses the core propagation through adapter finalization, adds no regression tests, and includes unrelated packaging churn.
claude-cli-sonnet	s1liconcow__repogauge-rg-c980ea1754af-reviewed	much_worse	-1.8	0.94	resolved	succeeded	Candidate adds a path classifier, but it diverges from the required API, omits the shared bulk-classification surface, introduces clear overclassification risks, and includes substantial unrelated generated-file churn with no tests.
claude-cli-sonnet	s1liconcow__repogauge-rg-791dd2e150b4-reviewed	much_worse	-1.5	0.95	resolved	succeeded	Candidate lands the basic adapter bridge, but it is less compatible and less aligned with the validation parser than the gold patch, and it omits the accompanying regression tests.
claude-cli-sonnet	s1liconcow__repogauge-rg-897a464a8fd5-reviewed	much_worse	-1.5	0.89	resolved	succeeded	Candidate partially implements adapter registration metadata but misses several gold behaviors, adds unrelated packaging artifacts, and appears not to include the corresponding unit test coverage.
codex-cli-mini	s1liconcow__repogauge-rg-89e8e567eb39-reviewed	much_worse	-1.5	0.89	resolved	succeeded	Candidate implements the basic environment-plan wiring, but it diverges from the golden behavior in provenance/confidence semantics, omits the regression tests, and introduces unrelated TOML parsing churn with clear regression risk.
claude-cli-sonnet	s1liconcow__repogauge-rg-d1ea78738817-reviewed	much_worse	-1.5	0.92	resolved	succeeded	Candidate implements basic problem-statement synthesis but misses provenance detail, handles fewer source shapes than the gold patch, and includes unrelated packaging churn without adding the regression tests.

Run Metadata

{
  "attempt_rows": 45,
  "expensive_cost_threshold": 1.0,
  "group_by": [
    "solver_id"
  ],
  "input_root": "/home/david/projects/repogauge/out/run/repogauge-cli-compare",
  "instance_result_rows": 43,
  "llm_judge": {
    "model": {
      "cost": {},
      "model_name": "gpt-5.4",
      "prompt_version": "diff_judge/v1",
      "provider": "codex",
      "request_ref": null,
      "response_ref": null,
      "usage": {}
    },
    "rows_path": "/home/david/projects/repogauge/out/analyze/judge/llm_judge.jsonl",
    "top_line": {
      "avg_overall_delta": -1.266279,
      "best_solver_id": "codex-cli-mini",
      "better_share": 0.0,
      "cache_hit_count": 0,
      "error_job_count": 0,
      "judged_attempt_count": 43,
      "judged_job_count": 43,
      "scored_job_count": 43,
      "worse_share": 1.0
    }
  },
  "router_training_rows": 15,
  "run_retry_merge": {
    "merged_retry_attempt_rows": 45,
    "merged_retry_instance_result_rows": 0,
    "merged_retry_run_roots": [
      "/home/david/projects/repogauge/out/run/repogauge-cli-compare",
      "/home/david/projects/repogauge/out/run/repogauge-cli-compare-retry"
    ]
  },
  "run_root": "/home/david/projects/repogauge/out/run/repogauge-cli-compare",
  "task_feature_version": "task-features-v1"
}

Attempt Browser

Inspect each instance, the solvers that attempted it, and the diffs they produced. LLM Judge included

s1liconcow__repogauge-rg-04cb13bab51d-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Landing changes for bead oss_repogauge-owj - track adapter telemetry provenance and fix Codex CLI failure handling - Production changes: repogauge/config.py, repogauge/runner/adapters.py, repogauge/runner/scheduler.py. Test changes: tests/unit/test_adapters.py, tests/unit/test_scheduler.py. - Bead oss_repogauge-owj: Implement concrete solver adapters for Anthropic, OpenAI/Codex, OpenCode, and OpenAI-compatible backends Context: - Implement the first batch of production solver adapters. Acceptance: - At least one adapter each for Anthropic, OpenAI/Codex, OpenCode, and OpenAI-compatible backends works behind the common interface. - Adapters emit consistent attempt records. - Provider-specific quirks are isolated and documented. --- - Reproduction: run tests impacted by this change (308 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 2.6m
Spend: $0.54
Tokens: 14.7K
Tool Calls: 19
Patch Size: 11,997 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..657c44e
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,134 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+Requires-Dist: PyYAML>=6.0.2
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+- [docs/junit_parser_contract.md](docs/junit_parser_contract.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+### In scope for MVP
+
+- CLI-only workflows such as:
+  - `repogauge mine /path/to/repo --out ./out`
+  - `repogauge review ./out/candidates.jsonl`
+  - `repogauge export ./out/reviewed.jsonl --dataset ./out/dataset`
+  - `repogauge eval ./out/dataset/dataset.jsonl --gold`
+- Mining candidate bugfix commits from the default branch or an explicit commit range.
+- Materializing SWE-bench-style instances with:
+  - `instance_id`
+  - `repo`
+  - `base_commit`
+  - `problem_statement`
+  - `version`
+  - `patch`
+  - `test_patch`
+  - `FAIL_TO_PASS`
+  - `PASS_TO_PASS`
+  - optional metadata
+- Generating a repo-specific adapter that patches the official harness runtime maps.
+- Validating gold patches locally and through the official harness.
+- Running experiment matrices across multiple solver adapters and evaluating them through the judge path.
+- Producing per-run cost/quality reports and router-training data.
+
+Current release state is scaffolded and in active development.
+
+## Running repogauge against itself
+
+```bash
+scripts/gauge_self.sh
+```
+
+Options:
+
+| Flag | Default | Description |
+|---|---|---|
+| `--out DIR` | `./out` | Root directory for all artifacts |
+| `--enrich-github` | disabled | Enable optional GitHub issue/PR metadata enrichment |
+| `--max-commits N` | `100` | Commits to scan |
+| `--github-token TOKEN` | `$(GITHUB_TOKEN)` | Token used for GitHub API calls |
+| `--github-enrichment-cache PATH` | `<out>/github_enrichment_cache.json` | Optional local cache for enrichment responses |
+| `--decisions FILE` | _(none)_ | JSONL file of manual accept/reject decisions |
+
+Outputs written under `--out`:
+
+```
+mine/repo_profile.json              repo identity and environment hints
+mine/candidates.jsonl               all scanned commits with heuristic scores
+review/reviewed.jsonl               accept/reject decisions
+review/review.html                  human-readable review report
+export/dataset/dataset.jsonl        SWE-bench-compatible instances
+export/dataset/predictions.gold.jsonl
+```
+
+### Command artifact contract (current scaffold)
+
+For `--out` directory `./out`, the scaffold writes command-specific artifacts:
+
+- `manifest.json`
+  - command metadata and step status for each run invocation
+- `events.jsonl`
+  - machine-readable execution events for the same run
+- `repo_profile.json`, `scan.jsonl`, `candidates.jsonl` for `mine`
+- `reviewed.jsonl`, `review.md`, `review.html` for `review`
+- `materialized.jsonl`, `materialization_rejections.jsonl`, `dataset/dataset.jsonl`,
+  `dataset/predictions.gold.jsonl`, `adapter_<repo>.py`, `specs.json` for `export`
+- `validation.jsonl` for `eval`
+
+This list reflects what the v0.1 scaffold guarantees today; future stages
+extend it to include run-level and analysis artifacts.
+
+### E2E integration test
+
+```bash
+uv run python -m pytest tests/e2e/test_self_gauge.py -v
+```
+
+This runs the full mine → review → export pipeline against this repository and
+validates every artifact at each stage.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..2cd83a4
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,52 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/llm.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/requires.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/dataset.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/export/split_patch.py
+repogauge/mining/__init__.py
+repogauge/mining/enrich.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/mining/signature.py
+repogauge/mining/synthesize.py
+repogauge/parsers/__init__.py
+repogauge/parsers/junit.py
+repogauge/runner/__init__.py
+repogauge/runner/adapters.py
+repogauge/runner/judge.py
+repogauge/runner/matrix.py
+repogauge/runner/normalize_patch.py
+repogauge/runner/planner.py
+repogauge/runner/providers.py
+repogauge/runner/scheduler.py
+repogauge/runner/solvers.py
+repogauge/runner/telemetry.py
+repogauge/runner/workspaces.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/env_detect.py
+repogauge/validation/evidence.py
+repogauge/validation/junit_parser.py
+repogauge/validation/testsel.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/requires.txt b/repogauge.egg-info/requires.txt
new file mode 100644
index 0000000..55c64f4
--- /dev/null
+++ b/repogauge.egg-info/requires.txt
@@ -0,0 +1 @@
+PyYAML>=6.0.2
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/runner/adapters.py b/repogauge/runner/adapters.py
index 815db04..00de00e 100644
--- a/repogauge/runner/adapters.py
+++ b/repogauge/runner/adapters.py
@@ -346,6 +346,8 @@ class _BaseConcreteSolverAdapter(SolverAdapter, ABC):
                     exit_reason="invalid patch: no unified diff found in model output",
                     usage=result.usage,
                     cost=result.cost,
+                    usage_source=result.usage_source,
+                    cost_source=result.cost_source,
                     metadata=metadata,
                 )
             return SolverAdapterResult(
@@ -356,6 +358,8 @@ class _BaseConcreteSolverAdapter(SolverAdapter, ABC):
                 exit_reason=result.exit_reason,
                 usage=_coerce_usage(result.usage),
                 cost=_coerce_cost(result.cost),
+                usage_source=result.usage_source,
+                cost_source=result.cost_source,
                 metadata=metadata,
             )
 
@@ -367,6 +371,8 @@ class _BaseConcreteSolverAdapter(SolverAdapter, ABC):
             exit_reason=result.exit_reason,
             usage=_coerce_usage(result.usage),
             cost=_coerce_cost(result.cost),
+            usage_source=result.usage_source,
+            cost_source=result.cost_source,
             metadata=metadata,
         )
 
@@ -653,6 +659,8 @@ class OpenAIResponsesAdapter(_BaseConcreteSolverAdapter):
             exit_reason="",
             usage=usage,
             cost=cost,
+            usage_source="response.usage",
+            cost_source="response.cost",
         )
 
 
@@ -765,6 +773,8 @@ class OpenAICompatibleAdapter(_BaseConcreteSolverAdapter):
             exit_reason="",
             usage=usage,
             cost=cost,
+            usage_source="response.usage",
+            cost_source="response.cost",
         )
 
 
@@ -880,6 +890,8 @@ class CodexCLIAdapter(_BaseConcreteSolverAdapter):
                 exit_reason="",
                 usage=usage,
                 cost=cost,
+                usage_source="codex_cli.event.usage",
+                cost_source="codex_cli.event.cost",
             )
         if command_result.timed_out:
             return SolverAdapterResult(
diff --git a/repogauge/runner/scheduler.py b/repogauge/runner/scheduler.py
index 8bea883..375ed8d 100644
--- a/repogauge/runner/scheduler.py
+++ b/repogauge/runner/scheduler.py
@@ -68,6 +68,8 @@ class SolverAdapterResult:
     exit_reason: str = ""
     usage: Mapping[str, Any] = field(default_factory=dict)
     cost: Mapping[str, Any] = field(default_factory=dict)
+    usage_source: str = ""
+    cost_source: str = ""
     metadata: Mapping[str, Any] = field(default_factory=dict)
 
 
@@ -210,6 +212,8 @@ def _serialize_attempt_row(
     raw_output: str,
     usage: Mapping[str, Any],
     cost: Mapping[str, Any],
+    usage_source: str,
+    cost_source: str,
     exit_reason: str,
     metadata: Mapping[str, Any],
 ) -> dict[str, Any]:
@@ -228,6 +232,8 @@ def _serialize_attempt_row(
     payload = row.to_dict()
     payload["raw_output"] = raw_output
     payload["attempt_state"] = attempt_state
+    payload["usage_source"] = usage_source
+    payload["cost_source"] = cost_source
     return payload
 
 
@@ -375,6 +381,8 @@ class SolverScheduler:
             raw_output=raw_output,
             usage=result.usage,
             cost=result.cost,
+            usage_source=result.usage_source,
+            cost_source=result.cost_source,
             exit_reason=result.exit_reason,
             metadata=result.metadata,
         )
@@ -468,6 +476,8 @@ class SolverScheduler:
                         exit_reason=result.exit_reason,
                         usage=result.usage,
                         cost=result.cost,
+                        usage_source=result.usage_source,
+                        cost_source=result.cost_source,
                         metadata=metadata,
                     )
 
@@ -482,6 +492,8 @@ class SolverScheduler:
                             exit_reason=f"adapter_finalize_error: {exc}",
                             usage=result.usage,
                             cost=result.cost,
+                            usage_source=result.usage_source,
+                            cost_source=result.cost_source,
                             metadata=result.metadata,
                         )

LLM Judgemuch worse

Overall Δ: -1.80
Confidence: 0.92

Candidate partially carries usage/cost provenance through runtime results, but misses key schema and failure-handling pieces from the gold patch and includes substantial unrelated packaging churn.

Task Fitmuch worse Δ-2

It only implements part of the telemetry provenance work and does not address the Codex CLI failure-handling regression that the task explicitly calls out. The gold patch also updates config-backed attempt rows and failure paths across adapters/scheduler; the candidate does not.

Correctness Safetymuch worse Δ-2

The candidate hard-codes `usage_source`/`cost_source` for OpenAI adapters even when those fields may be absent, unlike the gold patch which derives provenance conditionally from the payload. It also leaves Codex CLI non-timeout failure handling unchanged, so command failures still do not follow the corrected behavior from gold. Missing `AttemptRow` schema fields in `config.py` creates a risk that serialized rows are inconsistent with the contract.

Maintainabilityworse Δ-1

The gold patch centralizes provenance parsing in a helper and threads the new fields through the typed row model cleanly. The candidate duplicates string literals in individual adapters and bypasses the row contract by stuffing fields directly into serialized payloads, which is less cohesive and easier to drift. Unrelated egg-info artifacts also add noise.

Test Qualitymuch worse Δ-2

The gold patch adds focused unit coverage for OpenAI responses, OpenAI-compatible, Codex CLI provenance, and scheduler persistence. The candidate adds no tests, so the regression is not defended.

Change Focusmuch worse Δ-2

It adds multiple generated `repogauge.egg-info` files unrelated to the task. That churn is absent from the gold patch and reduces review clarity.

Harness: resolved
Attempt State: succeeded
Latency: 2.7m
Spend: $0.22
Tokens: 1.3M
Tool Calls: 33
Patch Size: 17,878 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..657c44e
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,134 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+Requires-Dist: PyYAML>=6.0.2
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+- [docs/junit_parser_contract.md](docs/junit_parser_contract.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+### In scope for MVP
+
+- CLI-only workflows such as:
+  - `repogauge mine /path/to/repo --out ./out`
+  - `repogauge review ./out/candidates.jsonl`
+  - `repogauge export ./out/reviewed.jsonl --dataset ./out/dataset`
+  - `repogauge eval ./out/dataset/dataset.jsonl --gold`
+- Mining candidate bugfix commits from the default branch or an explicit commit range.
+- Materializing SWE-bench-style instances with:
+  - `instance_id`
+  - `repo`
+  - `base_commit`
+  - `problem_statement`
+  - `version`
+  - `patch`
+  - `test_patch`
+  - `FAIL_TO_PASS`
+  - `PASS_TO_PASS`
+  - optional metadata
+- Generating a repo-specific adapter that patches the official harness runtime maps.
+- Validating gold patches locally and through the official harness.
+- Running experiment matrices across multiple solver adapters and evaluating them through the judge path.
+- Producing per-run cost/quality reports and router-training data.
+
+Current release state is scaffolded and in active development.
+
+## Running repogauge against itself
+
+```bash
+scripts/gauge_self.sh
+```
+
+Options:
+
+| Flag | Default | Description |
+|---|---|---|
+| `--out DIR` | `./out` | Root directory for all artifacts |
+| `--enrich-github` | disabled | Enable optional GitHub issue/PR metadata enrichment |
+| `--max-commits N` | `100` | Commits to scan |
+| `--github-token TOKEN` | `$(GITHUB_TOKEN)` | Token used for GitHub API calls |
+| `--github-enrichment-cache PATH` | `<out>/github_enrichment_cache.json` | Optional local cache for enrichment responses |
+| `--decisions FILE` | _(none)_ | JSONL file of manual accept/reject decisions |
+
+Outputs written under `--out`:
+
+```
+mine/repo_profile.json              repo identity and environment hints
+mine/candidates.jsonl               all scanned commits with heuristic scores
+review/reviewed.jsonl               accept/reject decisions
+review/review.html                  human-readable review report
+export/dataset/dataset.jsonl        SWE-bench-compatible instances
+export/dataset/predictions.gold.jsonl
+```
+
+### Command artifact contract (current scaffold)
+
+For `--out` directory `./out`, the scaffold writes command-specific artifacts:
+
+- `manifest.json`
+  - command metadata and step status for each run invocation
+- `events.jsonl`
+  - machine-readable execution events for the same run
+- `repo_profile.json`, `scan.jsonl`, `candidates.jsonl` for `mine`
+- `reviewed.jsonl`, `review.md`, `review.html` for `review`
+- `materialized.jsonl`, `materialization_rejections.jsonl`, `dataset/dataset.jsonl`,
+  `dataset/predictions.gold.jsonl`, `adapter_<repo>.py`, `specs.json` for `export`
+- `validation.jsonl` for `eval`
+
+This list reflects what the v0.1 scaffold guarantees today; future stages
+extend it to include run-level and analysis artifacts.
+
+### E2E integration test
+
+```bash
+uv run python -m pytest tests/e2e/test_self_gauge.py -v
+```
+
+This runs the full mine → review → export pipeline against this repository and
+validates every artifact at each stage.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..2cd83a4
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,52 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/llm.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/requires.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/dataset.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/export/split_patch.py
+repogauge/mining/__init__.py
+repogauge/mining/enrich.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/mining/signature.py
+repogauge/mining/synthesize.py
+repogauge/parsers/__init__.py
+repogauge/parsers/junit.py
+repogauge/runner/__init__.py
+repogauge/runner/adapters.py
+repogauge/runner/judge.py
+repogauge/runner/matrix.py
+repogauge/runner/normalize_patch.py
+repogauge/runner/planner.py
+repogauge/runner/providers.py
+repogauge/runner/scheduler.py
+repogauge/runner/solvers.py
+repogauge/runner/telemetry.py
+repogauge/runner/workspaces.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/env_detect.py
+repogauge/validation/evidence.py
+repogauge/validation/junit_parser.py
+repogauge/validation/testsel.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/requires.txt b/repogauge.egg-info/requires.txt
new file mode 100644
index 0000000..55c64f4
--- /dev/null
+++ b/repogauge.egg-info/requires.txt
@@ -0,0 +1 @@
+PyYAML>=6.0.2
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/config.py b/repogauge/config.py
index 81b25cc..df883d9 100644
--- a/repogauge/config.py
+++ b/repogauge/config.py
@@ -160,6 +160,8 @@ class AttemptRow(ContractRecord):
     model_patch: Optional[str] = None
     usage: Dict[str, Any] = field(default_factory=dict)
     cost: Dict[str, Any] = field(default_factory=dict)
+    usage_source: str = ""
+    cost_source: str = ""
     metadata: Dict[str, Any] = field(default_factory=dict)
 
 
diff --git a/repogauge/runner/adapters.py b/repogauge/runner/adapters.py
index 815db04..bbcc92d 100644
--- a/repogauge/runner/adapters.py
+++ b/repogauge/runner/adapters.py
@@ -229,12 +229,31 @@ def _build_prompt(
 
 def _parse_usage_cost(
     payload: Mapping[str, Any] | list[Any] | None,
-) -> tuple[dict[str, Any], dict[str, Any]]:
+) -> tuple[dict[str, Any], dict[str, Any], str, str]:
     if not isinstance(payload, Mapping):
-        return {}, {}
+        return {}, {}, "", ""
     usage: Mapping[str, Any] = _coerce_usage(payload.get("usage"))
     cost: Mapping[str, Any] = _coerce_cost(payload.get("cost"))
-    return dict(usage), dict(cost)
+    usage_source = "response.usage" if isinstance(payload.get("usage"), Mapping) else ""
+    cost_source = "response.cost" if isinstance(payload.get("cost"), Mapping) else ""
+    return dict(usage), dict(cost), usage_source, cost_source
+
+
+def _parse_cli_usage_cost(
+    events: list[dict[str, Any]],
+) -> tuple[dict[str, Any], dict[str, Any], str, str]:
+    usage: dict[str, Any] = {}
+    cost: dict[str, Any] = {}
+    usage_source = ""
+    cost_source = ""
+    for event in events:
+        if "usage" in event and isinstance(event["usage"], Mapping):
+            usage = dict(event["usage"])
+            usage_source = "codex_cli.event.usage"
+        if "cost" in event and isinstance(event["cost"], Mapping):
+            cost = dict(event["cost"])
+            cost_source = "codex_cli.event.cost"
+    return usage, cost, usage_source, cost_source
 
 
 def _post_json(
@@ -544,7 +563,7 @@ class AnthropicAgentSDKAdapter(_BaseConcreteSolverAdapter):
         text = self._coerce_output_text(
             _coerce_mapping(response, field_name="response")
         )
-        usage, cost = _parse_usage_cost(
+        usage, cost, usage_source, cost_source = _parse_usage_cost(
             _coerce_mapping(response, field_name="response")
         )
         return SolverAdapterResult(
@@ -555,6 +574,8 @@ class AnthropicAgentSDKAdapter(_BaseConcreteSolverAdapter):
             exit_reason="",
             usage=usage,
             cost=cost,
+            usage_source=usage_source,
+            cost_source=cost_source,
         )
 
 
@@ -636,7 +657,7 @@ class OpenAIResponsesAdapter(_BaseConcreteSolverAdapter):
             )
 
         response_payload = _coerce_mapping(response, field_name="response")
-        usage, cost = _parse_usage_cost(response_payload)
+        usage, cost, usage_source, cost_source = _parse_usage_cost(response_payload)
         text = response_payload.get("output_text")
         if not isinstance(text, str):
             text = ""
@@ -653,6 +674,8 @@ class OpenAIResponsesAdapter(_BaseConcreteSolverAdapter):
             exit_reason="",
             usage=usage,
             cost=cost,
+            usage_source=usage_source,
+            cost_source=cost_source,
         )
 
 
@@ -740,7 +763,7 @@ class OpenAICompatibleAdapter(_BaseConcreteSolverAdapter):
             )
 
         response_payload = _coerce_mapping(response, field_name="response")
-        usage, cost = _parse_usage_cost(response_payload)
+        usage, cost, usage_source, cost_source = _parse_usage_cost(response_payload)
         text = ""
         choices = response_payload.get("choices")
         if isinstance(choices, list) and choices:
@@ -765,6 +788,8 @@ class OpenAICompatibleAdapter(_BaseConcreteSolverAdapter):
             exit_reason="",
             usage=usage,
             cost=cost,
+            usage_source=usage_source,
+            cost_source=cost_source,
         )
 
 
@@ -852,26 +877,21 @@ class CodexCLIAdapter(_BaseConcreteSolverAdapter):
             input_text=prompt,
             timeout_seconds=self.timeout_seconds,
         )
-        usage = {}
-        cost = {}
         output = command_result.stdout
+        parsed = _parse_json_lines(output)
+        usage, cost, usage_source, cost_source = _parse_cli_usage_cost(parsed)
+        text_parts = []
+        for event in parsed:
+            if "message" in event and isinstance(event["message"], Mapping):
+                text = event["message"].get("content")
+                if isinstance(text, str):
+                    text_parts.append(text)
+            if "text" in event and isinstance(event["text"], str):
+                text_parts.append(event["text"])
+        text = "".join(text_parts).strip()
+        if not text and output.strip():
+            text = output.strip()
         if command_result.success:
-            parsed = _parse_json_lines(output)
-            text_parts = []
-            for event in parsed:
-                if "usage" in event and isinstance(event["usage"], Mapping):
-                    usage = dict(event["usage"])
-                if "cost" in event and isinstance(event["cost"], Mapping):
-                    cost = dict(event["cost"])
-                if "message" in event and isinstance(event["message"], Mapping):
-                    text = event["message"].get("content")
-                    if isinstance(text, str):
-                        text_parts.append(text)
-                if "text" in event and isinstance(event["text"], str):
-                    text_parts.append(event["text"])
-            text = "".join(text_parts).strip()
-            if not text and output.strip():
-                text = output.strip()
             return SolverAdapterResult(
                 attempt_id=request.attempt_id,
                 status=SolverAttemptState.SUCCEEDED,
@@ -880,21 +900,25 @@ class CodexCLIAdapter(_BaseConcreteSolverAdapter):
                 exit_reason="",
                 usage=usage,
                 cost=cost,
+                usage_source=usage_source,
+                cost_source=cost_source,
             )
         if command_result.timed_out:
             return SolverAdapterResult(
                 attempt_id=request.attempt_id,
                 status=SolverAttemptState.TIMED_OUT,
-                model_patch=None,
+                model_patch=text or None,
                 raw_output=output,
                 exit_reason=f"command timeout: {command_result.stderr or 'timed out'}",
-                usage={},
-                cost={},
+                usage=usage,
+                cost=cost,
+                usage_source=usage_source,
+                cost_source=cost_source,
             )
         return SolverAdapterResult(
             attempt_id=request.attempt_id,
             status=SolverAttemptState.FAILED,
-            model_patch=None,
+            model_patch=text or None,
             raw_output=output,
             exit_reason=(
                 command_result.stderr
@@ -903,6 +927,8 @@ class CodexCLIAdapter(_BaseConcreteSolverAdapter):
             ),
             usage=usage,
             cost=cost,
+            usage_source=usage_source,
+            cost_source=cost_source,
         )
 
 
diff --git a/repogauge/runner/scheduler.py b/repogauge/runner/scheduler.py
index 8bea883..17d929a 100644
--- a/repogauge/runner/scheduler.py
+++ b/repogauge/runner/scheduler.py
@@ -68,6 +68,8 @@ class SolverAdapterResult:
     exit_reason: str = ""
     usage: Mapping[str, Any] = field(default_factory=dict)
     cost: Mapping[str, Any] = field(default_factory=dict)
+    usage_source: str = ""
+    cost_source: str = ""
     metadata: Mapping[str, Any] = field(default_factory=dict)
 
 
@@ -210,6 +212,8 @@ def _serialize_attempt_row(
     raw_output: str,
     usage: Mapping[str, Any],
     cost: Mapping[str, Any],
+    usage_source: str,
+    cost_source: str,
     exit_reason: str,
     metadata: Mapping[str, Any],
 ) -> dict[str, Any]:
@@ -223,6 +227,8 @@ def _serialize_attempt_row(
         model_patch=patch,
         usage=dict(usage),
         cost=dict(cost),
+        usage_source=usage_source,
+        cost_source=cost_source,
         metadata=dict(metadata, attempt_state=attempt_state),
     )
     payload = row.to_dict()
@@ -375,6 +381,8 @@ class SolverScheduler:
             raw_output=raw_output,
             usage=result.usage,
             cost=result.cost,
+            usage_source=result.usage_source,
+            cost_source=result.cost_source,
             exit_reason=result.exit_reason,
             metadata=result.metadata,
         )
@@ -440,6 +448,8 @@ class SolverScheduler:
                         status=SolverAttemptState.FAILED,
                         exit_reason=f"adapter_prepare_error: {exc}",
                         raw_output="",
+                        usage_source="",
+                        cost_source="",
                     )
 
                 if result is None:
@@ -451,6 +461,8 @@ class SolverScheduler:
                             status=SolverAttemptState.FAILED,
                             exit_reason=f"adapter_execution_error: {exc}",
                             raw_output="",
+                            usage_source="",
+                            cost_source="",
                         )
 
                     try:
@@ -468,6 +480,8 @@ class SolverScheduler:
                         exit_reason=result.exit_reason,
                         usage=result.usage,
                         cost=result.cost,
+                        usage_source=result.usage_source,
+                        cost_source=result.cost_source,
                         metadata=metadata,
                     )
 
@@ -482,6 +496,8 @@ class SolverScheduler:
                             exit_reason=f"adapter_finalize_error: {exc}",
                             usage=result.usage,
                             cost=result.cost,
+                            usage_source=result.usage_source,
+                            cost_source=result.cost_source,
                             metadata=result.metadata,
                         )

LLM Judgemuch worse

Overall Δ: -1.80
Confidence: 0.89

Candidate adds part of the telemetry provenance plumbing, but misses the core propagation through adapter finalization, adds no regression tests, and includes unrelated packaging churn.

Task Fitmuch worse Δ-2

The main task was to track adapter telemetry provenance end-to-end and fix Codex CLI failure handling. While the candidate adds source fields and some parsing, it does not update the base adapter finalization paths to preserve `usage_source` and `cost_source`, so provenance can be dropped before persistence for normal adapter flows. That is a material miss relative to the gold patch.

Correctness Safetymuch worse Δ-2

Gold explicitly threads provenance through `_BaseConcreteSolverAdapter` result rewriting and keeps failed/timed-out Codex CLI attempts from surfacing partial patches. The candidate omits the base propagation changes and also returns `model_patch=text or None` plus parsed usage/cost on failed or timed-out Codex CLI runs, which increases regression risk by treating partial stdout as meaningful model output.

Maintainabilityworse Δ-1

The helper split in gold keeps generic response parsing and Codex CLI parsing clearly separated while preserving behavior. The candidate changes `_parse_usage_cost`’s contract and partially implements the feature, which makes the code less coherent. The added `repogauge.egg-info` files also add noise that does not belong in this fix.

Test Qualitymuch worse Δ-2

Gold adds focused unit coverage for response-source tagging, Codex CLI telemetry provenance, and scheduler persistence. The candidate includes no tests, so the key regression is left unprotected.

Change Focusmuch worse Δ-2

The production changes are already incomplete, and the diff also introduces unrelated generated packaging artifacts under `repogauge.egg-info/`, which is avoidable churn outside the task scope.

Harness: resolved
Attempt State: succeeded
Latency: 2.2m
Spend: $0.26
Tokens: 1.1M
Tool Calls: 29
Patch Size: 13,072 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..657c44e
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,134 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+Requires-Dist: PyYAML>=6.0.2
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+- [docs/junit_parser_contract.md](docs/junit_parser_contract.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+### In scope for MVP
+
+- CLI-only workflows such as:
+  - `repogauge mine /path/to/repo --out ./out`
+  - `repogauge review ./out/candidates.jsonl`
+  - `repogauge export ./out/reviewed.jsonl --dataset ./out/dataset`
+  - `repogauge eval ./out/dataset/dataset.jsonl --gold`
+- Mining candidate bugfix commits from the default branch or an explicit commit range.
+- Materializing SWE-bench-style instances with:
+  - `instance_id`
+  - `repo`
+  - `base_commit`
+  - `problem_statement`
+  - `version`
+  - `patch`
+  - `test_patch`
+  - `FAIL_TO_PASS`
+  - `PASS_TO_PASS`
+  - optional metadata
+- Generating a repo-specific adapter that patches the official harness runtime maps.
+- Validating gold patches locally and through the official harness.
+- Running experiment matrices across multiple solver adapters and evaluating them through the judge path.
+- Producing per-run cost/quality reports and router-training data.
+
+Current release state is scaffolded and in active development.
+
+## Running repogauge against itself
+
+```bash
+scripts/gauge_self.sh
+```
+
+Options:
+
+| Flag | Default | Description |
+|---|---|---|
+| `--out DIR` | `./out` | Root directory for all artifacts |
+| `--enrich-github` | disabled | Enable optional GitHub issue/PR metadata enrichment |
+| `--max-commits N` | `100` | Commits to scan |
+| `--github-token TOKEN` | `$(GITHUB_TOKEN)` | Token used for GitHub API calls |
+| `--github-enrichment-cache PATH` | `<out>/github_enrichment_cache.json` | Optional local cache for enrichment responses |
+| `--decisions FILE` | _(none)_ | JSONL file of manual accept/reject decisions |
+
+Outputs written under `--out`:
+
+```
+mine/repo_profile.json              repo identity and environment hints
+mine/candidates.jsonl               all scanned commits with heuristic scores
+review/reviewed.jsonl               accept/reject decisions
+review/review.html                  human-readable review report
+export/dataset/dataset.jsonl        SWE-bench-compatible instances
+export/dataset/predictions.gold.jsonl
+```
+
+### Command artifact contract (current scaffold)
+
+For `--out` directory `./out`, the scaffold writes command-specific artifacts:
+
+- `manifest.json`
+  - command metadata and step status for each run invocation
+- `events.jsonl`
+  - machine-readable execution events for the same run
+- `repo_profile.json`, `scan.jsonl`, `candidates.jsonl` for `mine`
+- `reviewed.jsonl`, `review.md`, `review.html` for `review`
+- `materialized.jsonl`, `materialization_rejections.jsonl`, `dataset/dataset.jsonl`,
+  `dataset/predictions.gold.jsonl`, `adapter_<repo>.py`, `specs.json` for `export`
+- `validation.jsonl` for `eval`
+
+This list reflects what the v0.1 scaffold guarantees today; future stages
+extend it to include run-level and analysis artifacts.
+
+### E2E integration test
+
+```bash
+uv run python -m pytest tests/e2e/test_self_gauge.py -v
+```
+
+This runs the full mine → review → export pipeline against this repository and
+validates every artifact at each stage.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..2cd83a4
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,52 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/llm.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/requires.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/dataset.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/export/split_patch.py
+repogauge/mining/__init__.py
+repogauge/mining/enrich.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/mining/signature.py
+repogauge/mining/synthesize.py
+repogauge/parsers/__init__.py
+repogauge/parsers/junit.py
+repogauge/runner/__init__.py
+repogauge/runner/adapters.py
+repogauge/runner/judge.py
+repogauge/runner/matrix.py
+repogauge/runner/normalize_patch.py
+repogauge/runner/planner.py
+repogauge/runner/providers.py
+repogauge/runner/scheduler.py
+repogauge/runner/solvers.py
+repogauge/runner/telemetry.py
+repogauge/runner/workspaces.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/env_detect.py
+repogauge/validation/evidence.py
+repogauge/validation/junit_parser.py
+repogauge/validation/testsel.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/requires.txt b/repogauge.egg-info/requires.txt
new file mode 100644
index 0000000..55c64f4
--- /dev/null
+++ b/repogauge.egg-info/requires.txt
@@ -0,0 +1 @@
+PyYAML>=6.0.2
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/config.py b/repogauge/config.py
index 81b25cc..df883d9 100644
--- a/repogauge/config.py
+++ b/repogauge/config.py
@@ -160,6 +160,8 @@ class AttemptRow(ContractRecord):
     model_patch: Optional[str] = None
     usage: Dict[str, Any] = field(default_factory=dict)
     cost: Dict[str, Any] = field(default_factory=dict)
+    usage_source: str = ""
+    cost_source: str = ""
     metadata: Dict[str, Any] = field(default_factory=dict)
 
 
diff --git a/repogauge/runner/adapters.py b/repogauge/runner/adapters.py
index 815db04..9fad2e8 100644
--- a/repogauge/runner/adapters.py
+++ b/repogauge/runner/adapters.py
@@ -346,6 +346,8 @@ class _BaseConcreteSolverAdapter(SolverAdapter, ABC):
                     exit_reason="invalid patch: no unified diff found in model output",
                     usage=result.usage,
                     cost=result.cost,
+                    usage_source=result.usage_source,
+                    cost_source=result.cost_source,
                     metadata=metadata,
                 )
             return SolverAdapterResult(
@@ -356,6 +358,8 @@ class _BaseConcreteSolverAdapter(SolverAdapter, ABC):
                 exit_reason=result.exit_reason,
                 usage=_coerce_usage(result.usage),
                 cost=_coerce_cost(result.cost),
+                usage_source=result.usage_source,
+                cost_source=result.cost_source,
                 metadata=metadata,
             )
 
@@ -367,6 +371,8 @@ class _BaseConcreteSolverAdapter(SolverAdapter, ABC):
             exit_reason=result.exit_reason,
             usage=_coerce_usage(result.usage),
             cost=_coerce_cost(result.cost),
+            usage_source=result.usage_source,
+            cost_source=result.cost_source,
             metadata=metadata,
         )
 
@@ -653,6 +659,8 @@ class OpenAIResponsesAdapter(_BaseConcreteSolverAdapter):
             exit_reason="",
             usage=usage,
             cost=cost,
+            usage_source="response.usage",
+            cost_source="response.cost",
         )
 
 
@@ -765,6 +773,8 @@ class OpenAICompatibleAdapter(_BaseConcreteSolverAdapter):
             exit_reason="",
             usage=usage,
             cost=cost,
+            usage_source="response.usage",
+            cost_source="response.cost",
         )
 
 
@@ -880,6 +890,8 @@ class CodexCLIAdapter(_BaseConcreteSolverAdapter):
                 exit_reason="",
                 usage=usage,
                 cost=cost,
+                usage_source="codex_cli.event.usage",
+                cost_source="codex_cli.event.cost",
             )
         if command_result.timed_out:
             return SolverAdapterResult(
@@ -890,6 +902,8 @@ class CodexCLIAdapter(_BaseConcreteSolverAdapter):
                 exit_reason=f"command timeout: {command_result.stderr or 'timed out'}",
                 usage={},
                 cost={},
+                usage_source="",
+                cost_source="",
             )
         return SolverAdapterResult(
             attempt_id=request.attempt_id,
@@ -903,6 +917,8 @@ class CodexCLIAdapter(_BaseConcreteSolverAdapter):
             ),
             usage=usage,
             cost=cost,
+            usage_source="",
+            cost_source="",
         )
 
 
diff --git a/repogauge/runner/scheduler.py b/repogauge/runner/scheduler.py
index 8bea883..f4000a4 100644
--- a/repogauge/runner/scheduler.py
+++ b/repogauge/runner/scheduler.py
@@ -68,6 +68,8 @@ class SolverAdapterResult:
     exit_reason: str = ""
     usage: Mapping[str, Any] = field(default_factory=dict)
     cost: Mapping[str, Any] = field(default_factory=dict)
+    usage_source: str = ""
+    cost_source: str = ""
     metadata: Mapping[str, Any] = field(default_factory=dict)
 
 
@@ -210,6 +212,8 @@ def _serialize_attempt_row(
     raw_output: str,
     usage: Mapping[str, Any],
     cost: Mapping[str, Any],
+    usage_source: str,
+    cost_source: str,
     exit_reason: str,
     metadata: Mapping[str, Any],
 ) -> dict[str, Any]:
@@ -223,6 +227,8 @@ def _serialize_attempt_row(
         model_patch=patch,
         usage=dict(usage),
         cost=dict(cost),
+        usage_source=usage_source,
+        cost_source=cost_source,
         metadata=dict(metadata, attempt_state=attempt_state),
     )
     payload = row.to_dict()
@@ -375,6 +381,8 @@ class SolverScheduler:
             raw_output=raw_output,
             usage=result.usage,
             cost=result.cost,
+            usage_source=result.usage_source,
+            cost_source=result.cost_source,
             exit_reason=result.exit_reason,
             metadata=result.metadata,
         )
@@ -468,6 +476,8 @@ class SolverScheduler:
                         exit_reason=result.exit_reason,
                         usage=result.usage,
                         cost=result.cost,
+                        usage_source=result.usage_source,
+                        cost_source=result.cost_source,
                         metadata=metadata,
                     )
 
@@ -482,6 +492,8 @@ class SolverScheduler:
                             exit_reason=f"adapter_finalize_error: {exc}",
                             usage=result.usage,
                             cost=result.cost,
+                            usage_source=result.usage_source,
+                            cost_source=result.cost_source,
                             metadata=result.metadata,
                         )

LLM Judgeworse

Overall Δ: -1.00
Confidence: 0.95

Candidate carries the schema/plumbing changes but misses key parts of the gold fix, especially Codex CLI failure handling, Anthropic provenance support, and the regression tests, while also adding unrelated packaging churn.

Task Fitworse Δ-1

It only partially implements telemetry provenance and does not match the gold fix's behavior changes for Codex CLI failure handling. It also omits the Anthropic adapter provenance update and adds no corresponding regression coverage for the stated task.

Correctness Safetyworse Δ-1

OpenAI adapters now stamp `response.usage`/`response.cost` unconditionally instead of deriving provenance from the actual payload, which can misreport source metadata. More importantly, the Codex CLI control-flow fix from the gold patch is missing, so the candidate does not address the failure-handling regression the task calls out.

Maintainabilityworse Δ-1

The gold patch centralizes provenance extraction with a helper and applies it consistently; the candidate duplicates hardcoded source strings in individual adapters and leaves coverage gaps. The extra checked-in `repogauge.egg-info` artifacts also increase noise and long-term maintenance burden.

Test Qualityworse Δ-1

The gold patch adds focused unit tests for OpenAI responses, OpenAI-compatible, Codex CLI provenance, and scheduler persistence. The candidate adds no tests, so it does not lock in the regression fix or the new provenance behavior.

Change Focusworse Δ-1

The production edits are only loosely scoped to the task, and the addition of multiple `repogauge.egg-info` generated files is unrelated churn absent from the gold patch.

s1liconcow__repogauge-rg-546450916c29-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Close h8h: add parser-name dispatcher - Production changes: repogauge/parsers/__init__.py, repogauge/parsers/junit.py. Test changes: tests/unit/test_harness_parser_bridge.py. - Reproduction: run tests impacted by this change (132 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 1.4m
Spend: $0.13
Tokens: 2.7K
Tool Calls: 8
Patch Size: 2,268 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/parsers/__init__.py b/repogauge/parsers/__init__.py
index d464a54..7c20def 100644
--- a/repogauge/parsers/__init__.py
+++ b/repogauge/parsers/__init__.py
@@ -1,5 +1,5 @@
 """Parser adapters for RepoGauge-to-harness bridge."""
 
-from .junit import parse_repogauge_junit
+from .junit import parse_repogauge_junit, parse_repogauge_test_output
 
-__all__ = ["parse_repogauge_junit"]
+__all__ = ["parse_repogauge_junit", "parse_repogauge_test_output"]
diff --git a/repogauge/parsers/junit.py b/repogauge/parsers/junit.py
index fb6b1b9..5fef780 100644
--- a/repogauge/parsers/junit.py
+++ b/repogauge/parsers/junit.py
@@ -5,8 +5,6 @@ from __future__ import annotations
 from pathlib import Path
 from typing import Any, Dict, Mapping
 
-from swebench.harness.log_parsers.python import parse_log_pytest_v2
-
 from repogauge.validation.junit_parser import (
     parse_junit_xml,
     parse_junit_xml_content,
@@ -35,6 +33,7 @@ def _parse_string_payload(report: str, test_spec: Any | None) -> Dict[str, str]:
         return parse_junit_xml_content(text)
 
     normalized = "\n".join(line.strip() for line in report.splitlines())
+    from swebench.harness.log_parsers.python import parse_log_pytest_v2  # noqa: PLC0415
     return parse_log_pytest_v2(normalized, test_spec)
 
 
@@ -97,3 +96,29 @@ def parse_repogauge_junit(
         f"unsupported report payload for parser: {type(report).__name__}; "
         f"expected file path, XML content, or mapping"
     )
+
+
+_PARSER_REGISTRY: Dict[str, Any] = {
+    "junit": parse_repogauge_junit,
+}
+
+
+def parse_repogauge_test_output(
+    report: object, parser_name: str, test_spec: Any | None = None
+) -> Dict[str, str]:
+    """Dispatch test output parsing by parser name.
+
+    Args:
+        report: The test output payload (path, string, or mapping).
+        parser_name: Name of the registered parser to use.
+        test_spec: Optional harness-provided spec object.
+
+    Returns:
+        A canonicalized mapping of test IDs to outcomes.
+
+    Raises:
+        KeyError: if parser_name is not registered.
+    """
+    if parser_name not in _PARSER_REGISTRY:
+        raise KeyError(f"unknown test parser: {parser_name!r}")
+    return _PARSER_REGISTRY[parser_name](report, test_spec)

LLM Judgeworse

Overall Δ: -0.95
Confidence: 0.87

Candidate implements the core parser-name dispatch and lazy swebench import, but it is narrower and less robust than the gold patch and appears to omit the regression tests entirely.

Task Fitworse Δ-1

It adds `parse_repogauge_test_output` and exposes it, which covers the main dispatcher behavior, but it does not match the fuller gold surface area (`get_parser`, `register_parser`, default `junit` dispatch path). The narrower API means it addresses the task, but not as completely as the reference.

Correctness Safetyworse Δ-1

The lazy import change is correct and reduces eager dependency loading, but the dispatcher is less defensive than gold: parser names are not normalized, there is no registration pathway, and the public API requires an explicit parser name rather than safely defaulting to `junit`. Those choices increase edge-case and integration risk relative to gold.

Maintainabilityworse Δ-1

The implementation is concise, but it relies on a module-level `Dict[str, Any]` registry and direct membership checks instead of typed parser functions and dedicated helpers for normalization and lookup. Gold is clearer about extension points and error handling.

Test Qualitymuch worse Δ-2

Compared with the gold patch, the candidate shows no added regression coverage for parser-name dispatch, unknown parser handling, or verification that the swebench parser is not eagerly imported. That is a substantial drop in testing posture.

Change Focusbetter Δ+1

The candidate keeps the production change tightly scoped to the immediate behavior and avoids the broader parser registration API added by gold. Relative to gold, there is less implementation churn outside the core fix.

Harness: resolved
Attempt State: succeeded
Latency: 1.6m
Spend: $0.05
Tokens: 255.9K
Tool Calls: 14
Patch Size: 2,015 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/parsers/__init__.py b/repogauge/parsers/__init__.py
index d464a54..7c20def 100644
--- a/repogauge/parsers/__init__.py
+++ b/repogauge/parsers/__init__.py
@@ -1,5 +1,5 @@
 """Parser adapters for RepoGauge-to-harness bridge."""
 
-from .junit import parse_repogauge_junit
+from .junit import parse_repogauge_junit, parse_repogauge_test_output
 
-__all__ = ["parse_repogauge_junit"]
+__all__ = ["parse_repogauge_junit", "parse_repogauge_test_output"]
diff --git a/repogauge/parsers/junit.py b/repogauge/parsers/junit.py
index fb6b1b9..7c4b74d 100644
--- a/repogauge/parsers/junit.py
+++ b/repogauge/parsers/junit.py
@@ -5,8 +5,6 @@ from __future__ import annotations
 from pathlib import Path
 from typing import Any, Dict, Mapping
 
-from swebench.harness.log_parsers.python import parse_log_pytest_v2
-
 from repogauge.validation.junit_parser import (
     parse_junit_xml,
     parse_junit_xml_content,
@@ -35,6 +33,8 @@ def _parse_string_payload(report: str, test_spec: Any | None) -> Dict[str, str]:
         return parse_junit_xml_content(text)
 
     normalized = "\n".join(line.strip() for line in report.splitlines())
+    from swebench.harness.log_parsers.python import parse_log_pytest_v2
+
     return parse_log_pytest_v2(normalized, test_spec)
 
 
@@ -97,3 +97,20 @@ def parse_repogauge_junit(
         f"unsupported report payload for parser: {type(report).__name__}; "
         f"expected file path, XML content, or mapping"
     )
+
+
+def parse_repogauge_test_output(
+    report: object,
+    *,
+    parser_name: str = "junit",
+    test_spec: Any | None = None,
+) -> Dict[str, str]:
+    """Dispatch test output parsing by parser name."""
+    parsers = {"junit": parse_repogauge_junit}
+    if not isinstance(parser_name, str):
+        raise KeyError(f"unknown test parser: {parser_name}")
+
+    parser = parsers.get(parser_name.strip().lower())
+    if parser is None:
+        raise KeyError(f"unknown test parser: {parser_name}")
+    return parser(report, test_spec)

LLM Judgeworse

Overall Δ: -1.05
Confidence: 0.95

Candidate fixes the immediate dispatcher regression and lazy-import issue, but it is narrower and less extensible than the gold patch and omits the registry API the gold introduces.

Task Fitworse Δ-1

It adds `parse_repogauge_test_output` and exports it, which addresses the observed regression, but it does not implement the parser registration/lookup surface (`register_parser`, `get_parser`) present in the gold patch, so it matches the task less completely if the intended fix includes named-parser extensibility.

Correctness Safetyworse Δ-1

Lazy-importing `parse_log_pytest_v2` inside `_parse_string_payload` aligns with the gold and avoids eager import side effects. However, the ad hoc local `parsers = {"junit": ...}` dispatcher is less robust than the gold registry-based design, and it does not normalize/validate parser names through a shared helper, leaving behavior less well-factored and harder to extend safely.

Maintainabilityworse Δ-1

The candidate is simpler, but the simplicity comes from hardcoding a one-off dictionary inside `parse_repogauge_test_output`. The gold patch centralizes parser normalization, lookup, and registration, which is cleaner for long-term maintenance and avoids duplicating dispatch logic when more parsers are added.

Test Qualitymuch worse Δ-2

No test changes are included, while the gold adds regression coverage for parser-name dispatch, unknown parser rejection, and the lazy-import behavior. That is a significant gap relative to the reference fix.

Change Focussame Δ+0

The scope stays tightly focused on the parser bridge and avoids unrelated churn, similar to the gold patch.

Harness: resolved
Attempt State: succeeded
Latency: 1.4m
Spend: $0.09
Tokens: 258.9K
Tool Calls: 17
Patch Size: 1,698 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/parsers/__init__.py b/repogauge/parsers/__init__.py
index d464a54..7c20def 100644
--- a/repogauge/parsers/__init__.py
+++ b/repogauge/parsers/__init__.py
@@ -1,5 +1,5 @@
 """Parser adapters for RepoGauge-to-harness bridge."""
 
-from .junit import parse_repogauge_junit
+from .junit import parse_repogauge_junit, parse_repogauge_test_output
 
-__all__ = ["parse_repogauge_junit"]
+__all__ = ["parse_repogauge_junit", "parse_repogauge_test_output"]
diff --git a/repogauge/parsers/junit.py b/repogauge/parsers/junit.py
index fb6b1b9..26b89c4 100644
--- a/repogauge/parsers/junit.py
+++ b/repogauge/parsers/junit.py
@@ -5,8 +5,6 @@ from __future__ import annotations
 from pathlib import Path
 from typing import Any, Dict, Mapping
 
-from swebench.harness.log_parsers.python import parse_log_pytest_v2
-
 from repogauge.validation.junit_parser import (
     parse_junit_xml,
     parse_junit_xml_content,
@@ -27,6 +25,8 @@ def _existing_path_from_text(report: str) -> Path | None:
 
 
 def _parse_string_payload(report: str, test_spec: Any | None) -> Dict[str, str]:
+    from swebench.harness.log_parsers.python import parse_log_pytest_v2
+
     text = report.strip()
     if not text:
         return {}
@@ -97,3 +97,10 @@ def parse_repogauge_junit(
         f"unsupported report payload for parser: {type(report).__name__}; "
         f"expected file path, XML content, or mapping"
     )
+
+
+def parse_repogauge_test_output(report: object, parser_name: str) -> Dict[str, str]:
+    """Parse test output by dispatched parser name."""
+    if parser_name == "junit":
+        return parse_repogauge_junit(report)
+    raise KeyError(f"unknown test parser: {parser_name!r}")

LLM Judgeworse

Overall Δ: -0.95
Confidence: 0.93

Candidate fixes the immediate bridge path with a minimal hardcoded dispatcher and lazy import, but it is less extensible and less well covered than the gold patch.

Task Fitworse Δ-1

It adds `parse_repogauge_test_output` and defers the swebench import, so it addresses the observed regression path, but it does not implement the general parser-name dispatcher/export surface from the gold patch (`get_parser`/`register_parser`) and only special-cases `junit`.

Correctness Safetyworse Δ-1

The candidate drops the `test_spec` argument from the new public dispatcher and always calls `parse_repogauge_junit(report)` without forwarding parser context. It also lacks parser-name normalization/validation behavior present in gold, making the API narrower and slightly more brittle.

Maintainabilityworse Δ-1

The hardcoded `if parser_name == "junit"` branch is simpler short-term but less cohesive and less extensible than the registry-based design in the gold patch. Future parser additions will require editing dispatch logic instead of registration.

Test Qualitymuch worse Δ-2

No candidate tests are shown for the new dispatcher behavior or the lazy-import regression. The gold patch adds targeted coverage for successful dispatch, unknown parser rejection, and avoiding eager swebench import.

Change Focusbetter Δ+1

The candidate stays tightly scoped to the immediate regression with minimal production churn and no extra public registration API. It is more minimal than the gold patch, though that comes with the tradeoff of reduced extensibility.

s1liconcow__repogauge-rg-5626b1b0e355-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Handle pytest junit variants in targeted test planning - Production changes: repogauge/validation/junit_parser.py, repogauge/validation/testsel.py, repogauge/validation/validate.py. Test changes: tests/unit/test_junit_parser.py, tests/unit/test_testsel.py. - Reproduction: run tests impacted by this change (45 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 1.2m
Spend: $0.14
Tokens: 2.0K
Tool Calls: 9
Patch Size: 7,076 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..5a86ef7
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,112 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+### In scope for MVP
+
+- CLI-only workflows such as:
+  - `repogauge mine /path/to/repo --out ./out`
+  - `repogauge review ./out/candidates.jsonl`
+  - `repogauge export ./out/reviewed.jsonl --dataset ./out/dataset`
+  - `repogauge eval ./out/dataset/dataset.jsonl --gold`
+- Mining candidate bugfix commits from the default branch or an explicit commit range.
+- Materializing SWE-bench-style instances with:
+  - `instance_id`
+  - `repo`
+  - `base_commit`
+  - `problem_statement`
+  - `version`
+  - `patch`
+  - `test_patch`
+  - `FAIL_TO_PASS`
+  - `PASS_TO_PASS`
+  - optional metadata
+- Generating a repo-specific adapter that patches the official harness runtime maps.
+- Validating gold patches locally and through the official harness.
+- Running experiment matrices across multiple solver adapters and evaluating them through the judge path.
+- Producing per-run cost/quality reports and router-training data.
+
+Current release state is scaffolded and in active development.
+
+## Running repogauge against itself
+
+```bash
+scripts/gauge_self.sh
+```
+
+Options:
+
+| Flag | Default | Description |
+|---|---|---|
+| `--out DIR` | `./out` | Root directory for all artifacts |
+| `--max-commits N` | `100` | Commits to scan |
+| `--decisions FILE` | _(none)_ | JSONL file of manual accept/reject decisions |
+
+Outputs written under `--out`:
+
+```
+mine/repo_profile.json              repo identity and environment hints
+mine/candidates.jsonl               all scanned commits with heuristic scores
+review/reviewed.jsonl               accept/reject decisions
+review/review.html                  human-readable review report
+export/dataset/dataset.jsonl        SWE-bench-compatible instances
+export/dataset/predictions.gold.jsonl
+```
+
+### E2E integration test
+
+```bash
+uv run python -m pytest tests/e2e/test_self_gauge.py -v
+```
+
+This runs the full mine → review → export pipeline against this repository and
+validates every artifact at each stage.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..a50476b
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,39 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/llm.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/dataset.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/export/split_patch.py
+repogauge/mining/__init__.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/mining/signature.py
+repogauge/mining/synthesize.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/env_detect.py
+repogauge/validation/junit_parser.py
+repogauge/validation/testsel.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/validation/junit_parser.py b/repogauge/validation/junit_parser.py
index 8966db3..acb3552 100644
--- a/repogauge/validation/junit_parser.py
+++ b/repogauge/validation/junit_parser.py
@@ -78,6 +78,8 @@ def _canonical_id(classname: str, name: str) -> str:
 def _outcome_of(testcase: ET.Element) -> str:
     if testcase.find("skipped") is not None:
         return OUTCOME_SKIP
+    if testcase.find("xpass") is not None:
+        return OUTCOME_SKIP
     if testcase.find("error") is not None:
         return OUTCOME_ERROR
     if testcase.find("failure") is not None:
diff --git a/repogauge/validation/testsel.py b/repogauge/validation/testsel.py
index a3d4583..1c78734 100644
--- a/repogauge/validation/testsel.py
+++ b/repogauge/validation/testsel.py
@@ -71,7 +71,7 @@ def _build_pytest_targeted_cmd(test_cmd_base: str) -> str:
         parts.append("--tb=no")
     if not _command_has_flag(parts, "-q"):
         parts.append("-q")
-    if not _command_has_flag(parts, "--junit-xml="):
+    if not _command_has_flag(parts, "--junit-xml=") and not _command_has_flag(parts, "--junitxml="):
         parts.append(f"--junit-xml={_JUNIT_XML_PLACEHOLDER}")
 
     return " ".join(parts)

LLM Judgemuch worse

Overall Δ: -1.25
Confidence: 0.93

Candidate addresses part of the pytest junit variant handling, but it misses the validate-path hardening from the gold patch, omits regression tests, and adds unrelated packaging artifacts.

Task Fitworse Δ-1

It partially fixes the targeted planning issue by preserving `--junitxml=...` and adds `xpass` parsing, but it does not include the gold patch's `validate.py` behavior change and does not cover the full junit variant set handled by the reference.

Correctness Safetyworse Δ-1

The candidate leaves `_run_pytest` silently tolerating missing junit output instead of raising `JUnitParseError`, which weakens failure detection versus gold. It also skips explicit `xfail` handling that the reference included, so edge-case coverage is narrower.

Maintainabilityworse Δ-1

The `--junitxml` support is implemented by duplicating checks at the call site rather than centralizing variant handling as in the gold patch's `_JUNIT_XML_FLAGS` approach. The unrelated `repogauge.egg-info` additions also add noise and maintenance burden.

Test Qualitymuch worse Δ-2

The gold patch adds focused regression tests for both junit parser behavior and `--junitxml` command preservation. The candidate adds no tests at all, so there is no evidence the new behavior is protected.

Change Focusmuch worse Δ-2

The candidate includes substantial unrelated generated `repogauge.egg-info` files, which are outside the task scope. Gold stays tightly focused on the validation logic and its tests.

Harness: resolved
Attempt State: succeeded
Latency: 1.6m
Spend: $0.07
Tokens: 346.7K
Tool Calls: 13
Patch Size: 2,690 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/validation/junit_parser.py b/repogauge/validation/junit_parser.py
index 8966db3..acb3552 100644
--- a/repogauge/validation/junit_parser.py
+++ b/repogauge/validation/junit_parser.py
@@ -78,6 +78,8 @@ def _canonical_id(classname: str, name: str) -> str:
 def _outcome_of(testcase: ET.Element) -> str:
     if testcase.find("skipped") is not None:
         return OUTCOME_SKIP
+    if testcase.find("xpass") is not None:
+        return OUTCOME_SKIP
     if testcase.find("error") is not None:
         return OUTCOME_ERROR
     if testcase.find("failure") is not None:
diff --git a/repogauge/validation/testsel.py b/repogauge/validation/testsel.py
index a3d4583..0e048eb 100644
--- a/repogauge/validation/testsel.py
+++ b/repogauge/validation/testsel.py
@@ -57,6 +57,13 @@ def _command_has_flag(parts: List[str], flag: str) -> bool:
     return flag in parts
 
 
+def _command_has_junit_xml_flag(parts: List[str]) -> bool:
+    return any(
+        part.startswith("--junit-xml=") or part.startswith("--junitxml=")
+        for part in parts
+    )
+
+
 def _build_pytest_targeted_cmd(test_cmd_base: str) -> str:
     candidate = test_cmd_base.strip() or _PYTEST_CMD_PREFIX
     try:
@@ -71,7 +78,7 @@ def _build_pytest_targeted_cmd(test_cmd_base: str) -> str:
         parts.append("--tb=no")
     if not _command_has_flag(parts, "-q"):
         parts.append("-q")
-    if not _command_has_flag(parts, "--junit-xml="):
+    if not _command_has_junit_xml_flag(parts):
         parts.append(f"--junit-xml={_JUNIT_XML_PLACEHOLDER}")
 
     return " ".join(parts)
diff --git a/repogauge/validation/validate.py b/repogauge/validation/validate.py
index bf1350c..6fdabff 100644
--- a/repogauge/validation/validate.py
+++ b/repogauge/validation/validate.py
@@ -94,15 +94,18 @@ def _run_pytest(
     ``test_cmd_base`` is taken from the adapter spec when available.
     """
     env = {**os.environ, "PYTHONPATH": str(worktree)}
-    cmd = (
-        _resolve_test_cmd(test_cmd_base)
-        + [
-            "--tb=no",
-            "-q",
-            f"--junit-xml={junit_xml}",
-        ]
-        + (test_files if test_files else [])
-    )
+    cmd = _resolve_test_cmd(test_cmd_base) + [
+        "--tb=no",
+        "-q",
+    ] + (test_files if test_files else [])
+    junit_flag = f"--junit-xml={junit_xml}"
+    replaced = False
+    for i, part in enumerate(cmd):
+        if part.startswith("--junit-xml=") or part.startswith("--junitxml="):
+            cmd[i] = junit_flag
+            replaced = True
+    if not replaced:
+        cmd.insert(3, junit_flag)
 
     result = run_command(
         cmd, cwd=str(worktree), env=env, timeout_seconds=timeout_seconds

LLM Judgemuch worse

Overall Δ: -1.40
Confidence: 0.92

Candidate fixes the `--junitxml` planning variant and handles explicit `<xpass/>`, but misses the gold patch's stricter junit parsing behavior and adds a different runtime-command rewrite in `validate.py` instead of the intended validation hardening.

Task Fitworse Δ-1

It addresses part of the task by preserving `--junitxml=` and recognizing `xpass`, but it does not implement the gold patch's `validate.py` behavior that fails fast when junit output is missing or unparsable.

Correctness Safetymuch worse Δ-2

The candidate leaves the prior behavior that can silently continue with empty outcomes when junit output is missing or invalid, while the gold patch converts that into a `JUnitParseError`. That is a real regression-risk difference because validation failures can be masked instead of surfaced.

Maintainabilityworse Δ-1

The dedicated `_command_has_junit_xml_flag` helper is serviceable, but the gold patch's change is more cohesive by extending existing flag-detection logic rather than introducing a one-off path. The `validate.py` rewrite also adds extra command-munging logic without addressing the core parser error-handling gap.

Test Qualitymuch worse Δ-2

Relative to gold, the candidate patch is missing the added regression tests for `xpass`/`xfail` parsing and the `--junitxml` targeted-plan variant, so the intended behavior is less well specified and protected.

Change Focusworse Δ-1

It stays near the affected area, but the `validate.py` change shifts toward command rewriting instead of the gold fix's focused validation behavior. That is adjacent rather than directly aligned with the reference fix.

Harness: resolved
Attempt State: succeeded
Latency: 2.1m
Spend: $0.11
Tokens: 299.2K
Tool Calls: 16
Patch Size: 1,720 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/validation/junit_parser.py b/repogauge/validation/junit_parser.py
index 8966db3..8a2df31 100644
--- a/repogauge/validation/junit_parser.py
+++ b/repogauge/validation/junit_parser.py
@@ -82,6 +82,8 @@ def _outcome_of(testcase: ET.Element) -> str:
         return OUTCOME_ERROR
     if testcase.find("failure") is not None:
         return OUTCOME_FAIL
+    if testcase.find("xpass") is not None:
+        return OUTCOME_SKIP
     return OUTCOME_PASS
 
 
diff --git a/repogauge/validation/testsel.py b/repogauge/validation/testsel.py
index a3d4583..1c78734 100644
--- a/repogauge/validation/testsel.py
+++ b/repogauge/validation/testsel.py
@@ -71,7 +71,7 @@ def _build_pytest_targeted_cmd(test_cmd_base: str) -> str:
         parts.append("--tb=no")
     if not _command_has_flag(parts, "-q"):
         parts.append("-q")
-    if not _command_has_flag(parts, "--junit-xml="):
+    if not _command_has_flag(parts, "--junit-xml=") and not _command_has_flag(parts, "--junitxml="):
         parts.append(f"--junit-xml={_JUNIT_XML_PLACEHOLDER}")
 
     return " ".join(parts)
diff --git a/repogauge/validation/validate.py b/repogauge/validation/validate.py
index bf1350c..23ab066 100644
--- a/repogauge/validation/validate.py
+++ b/repogauge/validation/validate.py
@@ -94,8 +94,10 @@ def _run_pytest(
     ``test_cmd_base`` is taken from the adapter spec when available.
     """
     env = {**os.environ, "PYTHONPATH": str(worktree)}
+    resolved = _resolve_test_cmd(test_cmd_base)
+    resolved = [p for p in resolved if not p.startswith(("--junit-xml=", "--junitxml="))]
     cmd = (
-        _resolve_test_cmd(test_cmd_base)
+        resolved
         + [
             "--tb=no",
             "-q",

LLM Judgeworse

Overall Δ: -1.05
Confidence: 0.86

Candidate fixes the main regression around `--junitxml` handling and `xpass`, but misses the gold patch's stronger validation behavior and lacks the accompanying regression tests.

Task Fitworse Δ-1

It addresses the observed pytest JUnit variant issue by recognizing `--junitxml=` and `xpass`, so the core task is substantially covered. However, it does not match the gold patch's full behavior around enforcing expected JUnit output parsing, leaving the solution less complete.

Correctness Safetyworse Δ-1

The candidate keeps existing behavior of silently proceeding when JUnit output is missing, whereas the gold patch raises `JUnitParseError` and parses unconditionally once the file exists. That makes failures in test execution/report generation easier to miss and weakens regression safety.

Maintainabilityworse Δ-1

The `testsel` change works but is less cohesive than the gold patch's centralized `_JUNIT_XML_FLAGS` handling inside `_command_has_flag`. The `validate.py` change also adds flag-stripping logic inline, which is more ad hoc than the gold patch's simpler, clearer control flow.

Test Qualitymuch worse Δ-2

The gold patch adds explicit regression tests for both `xpass`/xfail-style JUnit parsing and `--junitxml` command preservation. The candidate patch, as provided, adds no tests, so its coverage is materially weaker.

Change Focussame Δ+0

The candidate stays within the relevant validation and test-selection codepaths and does not introduce obvious unrelated churn. Its scope is comparably focused to the gold patch.

s1liconcow__repogauge-rg-791dd2e150b4-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Land bead oss_repogauge-3wi parser bridge for harness - Production changes: repogauge/export/adapter.py, repogauge/parsers/__init__.py, repogauge/parsers/junit.py, repogauge/validation/junit_parser.py. Test changes: tests/unit/test_adapter.py, tests/unit/test_harness_parser_bridge.py. - Bead oss_repogauge-3wi: Implement generic JUnit parser bridge into harness grading Context: - Bridge RepoGauge’s JUnit parsing strategy into the parser interface expected by the official harness. Acceptance: - Generated adapters can reference the parser successfully. - Parser output is consistent with validation-time semantics. - Harness-facing compatibility tests pass. --- - Reproduction: run tests impacted by this change (180 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 3.4m
Spend: $0.54
Tokens: 8.3K
Tool Calls: 33
Patch Size: 3,701 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/export/adapter.py b/repogauge/export/adapter.py
index ad23ce9..abc8db4 100644
--- a/repogauge/export/adapter.py
+++ b/repogauge/export/adapter.py
@@ -68,6 +68,8 @@ _ADAPTER_TEMPLATE = '''\
 
 from __future__ import annotations
 
+from repogauge.parsers.junit import parse_repogauge_junit
+
 REPO = {repo_repr}
 VERSION = {version_repr}
 MODULE_NAME = {module_name_repr}
@@ -146,7 +148,7 @@ def _render_adapter(spec: Dict[str, Any]) -> str:
                 }
             }
         ),
-        map_repo_to_parser_repr=repr({spec["repo"]: spec["parser"]}),
+        map_repo_to_parser_repr=f"{{{repr(spec['repo'])}: parse_repogauge_junit}}",
     )
 
 
diff --git a/repogauge/parsers/__init__.py b/repogauge/parsers/__init__.py
new file mode 100644
index 0000000..fba9ad3
--- /dev/null
+++ b/repogauge/parsers/__init__.py
@@ -0,0 +1 @@
+"""RepoGauge parsers package."""
diff --git a/repogauge/parsers/junit.py b/repogauge/parsers/junit.py
new file mode 100644
index 0000000..49a0cfa
--- /dev/null
+++ b/repogauge/parsers/junit.py
@@ -0,0 +1,76 @@
+"""Harness-facing JUnit parser bridge (bead 3wi).
+
+Wraps repogauge.validation.junit_parser to accept the variety of report
+payload formats the harness may supply: Path, str, bytes, or a dict with
+a ``junit_xml`` key pointing to a Path.
+"""
+
+from __future__ import annotations
+
+import xml.etree.ElementTree as ET
+from pathlib import Path
+from typing import Dict, Union
+
+from repogauge.validation.junit_parser import (
+    JUnitParseError,
+    _canonical_id,
+    _outcome_of,
+)
+
+
+def _parse_text(text: str) -> Dict[str, str]:
+    try:
+        root = ET.fromstring(text)
+    except ET.ParseError as exc:
+        raise JUnitParseError(f"malformed JUnit XML: {exc}") from exc
+
+    suites = root.findall(".//testsuite") or ([root] if root.tag == "testsuite" else [])
+    if not suites:
+        raise JUnitParseError("no <testsuite> elements found")
+
+    results: Dict[str, str] = {}
+    for suite in suites:
+        for tc in suite.findall("testcase"):
+            classname = (tc.get("classname") or "").strip()
+            name = (tc.get("name") or "").strip()
+            if not name:
+                continue
+            results[_canonical_id(classname, name)] = _outcome_of(tc)
+
+    return results
+
+
+def parse_repogauge_junit(payload: Union[Path, str, bytes, dict]) -> Dict[str, str]:
+    """Parse JUnit XML and return ``{test_id: outcome}``.
+
+    Accepts:
+    - ``Path``: read the file at that path
+    - ``str``: treat as raw XML text
+    - ``bytes``: decode as UTF-8, then parse
+    - ``dict``: read from ``payload["junit_xml"]`` (a Path or str path)
+
+    Raises:
+        TypeError: for any other payload type
+        JUnitParseError: if the XML is absent, empty, or malformed
+    """
+    if isinstance(payload, Path):
+        if not payload.exists():
+            raise JUnitParseError(f"JUnit XML not found: {payload}")
+        text = payload.read_text(encoding="utf-8").strip()
+        if not text:
+            raise JUnitParseError(f"JUnit XML is empty: {payload}")
+    elif isinstance(payload, bytes):
+        text = payload.decode("utf-8")
+    elif isinstance(payload, str):
+        text = payload
+    elif isinstance(payload, dict):
+        xml_path = Path(payload["junit_xml"])
+        if not xml_path.exists():
+            raise JUnitParseError(f"JUnit XML not found: {xml_path}")
+        text = xml_path.read_text(encoding="utf-8").strip()
+        if not text:
+            raise JUnitParseError(f"JUnit XML is empty: {xml_path}")
+    else:
+        raise TypeError(f"unsupported report payload: {type(payload)!r}")
+
+    return _parse_text(text)

LLM Judgemuch worse

Overall Δ: -1.50
Confidence: 0.95

Candidate lands the basic adapter bridge, but it is less compatible and less aligned with the validation parser than the gold patch, and it omits the accompanying regression tests.

Task Fitworse Δ-1

It wires generated adapters to a callable parser and likely fixes the immediate harness regression, but it does not fully match the task as well as gold because it only handles a narrower set of harness payload shapes and does not expose the parser via the package init as the reference does.

Correctness Safetymuch worse Δ-2

The bridge treats every string as XML content rather than supporting string file paths, only accepts dict payloads with a required `junit_xml` key, and duplicates parsing logic instead of reusing shared validation-time parsing. That creates real divergence risk from validation semantics and misses several edge cases the gold patch explicitly covers.

Maintainabilitymuch worse Δ-2

Instead of extracting shared XML-content parsing into `repogauge.validation.junit_parser`, it reimplements parsing in the bridge and imports private helpers (`_canonical_id`, `_outcome_of`). That increases coupling, duplicates logic, and makes future parser changes easier to miss.

Test Qualitymuch worse Δ-2

The candidate patch includes no visible test updates, while the gold patch adds adapter assertions and dedicated compatibility tests for path, bytes, dict payloads, and malformed XML behavior. Relative to gold, the testing posture is substantially weaker.

Change Focussame Δ+0

The patch stays on-task and does not introduce obvious unrelated churn. Although its implementation choice is weaker than gold, the scope itself remains disciplined.

Harness: resolved
Attempt State: succeeded
Latency: 2.2m
Spend: $0.13
Tokens: 664.6K
Tool Calls: 28
Patch Size: 5,776 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/export/adapter.py b/repogauge/export/adapter.py
index ad23ce9..349abce 100644
--- a/repogauge/export/adapter.py
+++ b/repogauge/export/adapter.py
@@ -68,6 +68,8 @@ _ADAPTER_TEMPLATE = '''\
 
 from __future__ import annotations
 
+from repogauge.parsers.junit import parse_repogauge_junit
+
 REPO = {repo_repr}
 VERSION = {version_repr}
 MODULE_NAME = {module_name_repr}
@@ -82,7 +84,7 @@ DOCKER_SPECS = {docker_specs_repr}
 
 MAP_REPO_TO_EXT = {map_repo_to_ext_repr}
 MAP_REPO_VERSION_TO_SPECS = {map_repo_version_specs_repr}
-MAP_REPO_TO_PARSER = {map_repo_to_parser_repr}
+MAP_REPO_TO_PARSER = {{ {repo_repr}: parse_repogauge_junit }}
 
 
 def get_spec() -> dict:
@@ -146,7 +148,6 @@ def _render_adapter(spec: Dict[str, Any]) -> str:
                 }
             }
         ),
-        map_repo_to_parser_repr=repr({spec["repo"]: spec["parser"]}),
     )
 
 
diff --git a/repogauge/parsers/__init__.py b/repogauge/parsers/__init__.py
new file mode 100644
index 0000000..3f1057c
--- /dev/null
+++ b/repogauge/parsers/__init__.py
@@ -0,0 +1,5 @@
+"""Parser bridge package."""
+
+from repogauge.parsers.junit import parse_repogauge_junit
+
+__all__ = ["parse_repogauge_junit"]
diff --git a/repogauge/parsers/junit.py b/repogauge/parsers/junit.py
new file mode 100644
index 0000000..54ad5d4
--- /dev/null
+++ b/repogauge/parsers/junit.py
@@ -0,0 +1,55 @@
+"""Harness-facing JUnit parser bridge.
+
+This module accepts the payload shapes used by the official harness and
+normalizes them into the validation parser's file-based interface.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any, Dict
+
+from repogauge.validation.junit_parser import (
+    parse_junit_xml,
+    _parse_junit_xml_text,
+)
+
+
+def _payload_to_path_or_text(payload: Any) -> Path | str | None:
+    if isinstance(payload, bytes):
+        return payload.decode("utf-8")
+    if isinstance(payload, Path):
+        return payload
+    if isinstance(payload, str):
+        stripped = payload.lstrip()
+        candidate = Path(payload)
+        if candidate.exists() or (stripped and not stripped.startswith("<")):
+            return candidate
+        return payload
+    if isinstance(payload, dict):
+        for key in ("junit_xml", "xml", "path", "report", "payload"):
+            if key in payload:
+                return _payload_to_path_or_text(payload[key])
+    return None
+
+
+def parse_repogauge_junit(report: Any) -> Dict[str, str]:
+    """Parse a harness JUnit payload into canonical test outcomes.
+
+    Accepted payloads include:
+    - ``pathlib.Path`` instances pointing to a JUnit XML file
+    - filesystem path strings
+    - raw XML strings / bytes
+    - dict payloads containing ``junit_xml`` or common path/content keys
+    """
+    normalized = _payload_to_path_or_text(report)
+    if normalized is None:
+        raise TypeError("unsupported report payload")
+
+    if isinstance(normalized, Path):
+        return parse_junit_xml(normalized)
+    if isinstance(normalized, str) and normalized.lstrip().startswith("<"):
+        return _parse_junit_xml_text(normalized, source="<harness report>")
+    if isinstance(normalized, str):
+        return parse_junit_xml(Path(normalized))
+    return _parse_junit_xml_text(normalized, source="<bytes>")
diff --git a/repogauge/validation/junit_parser.py b/repogauge/validation/junit_parser.py
index 62eb182..2e91874 100644
--- a/repogauge/validation/junit_parser.py
+++ b/repogauge/validation/junit_parser.py
@@ -91,30 +91,21 @@ def _outcome_of(testcase: ET.Element) -> str:
     return OUTCOME_PASS
 
 
-def parse_junit_xml(xml_path: Path) -> Dict[str, str]:
-    """Parse a pytest JUnit XML file and return ``{test_id: outcome}``.
-
-    Outcomes are one of: ``"pass"``, ``"fail"``, ``"error"``, ``"skip"``.
-
-    Raises:
-        JUnitParseError: if the file is absent, empty, or malformed.
-    """
-    if not xml_path.exists():
-        raise JUnitParseError(f"JUnit XML not found: {xml_path}")
-
-    text = xml_path.read_text(encoding="utf-8").strip()
-    if not text:
-        raise JUnitParseError(f"JUnit XML is empty: {xml_path}")
+def _parse_junit_xml_text(text: str, *, source: str) -> Dict[str, str]:
+    """Parse pytest JUnit XML content and return ``{test_id: outcome}``."""
+    payload = text.strip()
+    if not payload:
+        raise JUnitParseError(f"JUnit XML is empty: {source}")
 
     try:
-        root = ET.fromstring(text)
+        root = ET.fromstring(payload)
     except ET.ParseError as exc:
-        raise JUnitParseError(f"malformed JUnit XML at {xml_path}: {exc}") from exc
+        raise JUnitParseError(f"malformed JUnit XML at {source}: {exc}") from exc
 
     # Support both <testsuites><testsuite>… and bare <testsuite>…
     suites = root.findall(".//testsuite") or ([root] if root.tag == "testsuite" else [])
     if not suites:
-        raise JUnitParseError(f"no <testsuite> elements found in {xml_path}")
+        raise JUnitParseError(f"no <testsuite> elements found in {source}")
 
     results: Dict[str, str] = {}
     for suite in suites:
@@ -127,3 +118,18 @@ def parse_junit_xml(xml_path: Path) -> Dict[str, str]:
             results[test_id] = _outcome_of(tc)
 
     return results
+
+
+def parse_junit_xml(xml_path: Path) -> Dict[str, str]:
+    """Parse a pytest JUnit XML file and return ``{test_id: outcome}``.
+
+    Outcomes are one of: ``"pass"``, ``"fail"``, ``"error"``, ``"skip"``.
+
+    Raises:
+        JUnitParseError: if the file is absent, empty, or malformed.
+    """
+    if not xml_path.exists():
+        raise JUnitParseError(f"JUnit XML not found: {xml_path}")
+
+    text = xml_path.read_text(encoding="utf-8")
+    return _parse_junit_xml_text(text, source=str(xml_path))

LLM Judgeworse

Overall Δ: -1.05
Confidence: 0.92

Candidate resolves the main bridge but is weaker than the gold patch on harness payload compatibility and testing coverage.

Task Fitworse Δ-1

It wires generated adapters to a callable parser and adds a bridge module, but the accepted mapping keys are narrower than the gold patch. Gold explicitly supports more harness-style payload shapes such as `junit_xml_path`, `junit_xml_file`, `output`, `log`, `result`, `stdout`, `stderr`, and `raw`, so it better matches the stated harness-compatibility goal.

Correctness Safetyworse Δ-1

The candidate generally works, but its string normalization is less safe: non-XML strings are treated as filesystem paths, which can turn malformed or empty content cases into `not found` path errors rather than content-parsing semantics. It also depends on a private helper `_parse_junit_xml_text` across modules, which is a more brittle interface than the gold patch's public `parse_junit_xml_content`.

Maintainabilityworse Δ-1

The gold patch exposes a clean public parsing API and keeps the bridge layered on top of it. The candidate leaks a private validation helper into the bridge and hardcodes parser mapping directly in the template, which is workable but less cohesive and less extensible.

Test Qualitymuch worse Δ-2

Relative to the gold patch, the candidate patch shows no corresponding adapter assertion update or harness bridge tests. Gold adds direct coverage for callable parser export, path/string/dict/bytes inputs, and malformed XML behavior, which materially reduces regression risk.

Change Focussame Δ+0

The scope stays centered on the parser bridge and related adapter/parser code, with no obvious unrelated churn. The extra internal refactor in the validation parser is still directly tied to the task.

Harness: resolved
Attempt State: succeeded
Latency: 4.9m
Spend: $0.17
Tokens: 533.2K
Tool Calls: 29
Patch Size: 3,551 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/export/adapter.py b/repogauge/export/adapter.py
index ad23ce9..fe8a9e9 100644
--- a/repogauge/export/adapter.py
+++ b/repogauge/export/adapter.py
@@ -68,6 +68,8 @@ _ADAPTER_TEMPLATE = '''\
 
 from __future__ import annotations
 
+from repogauge.parsers.junit import parse_repogauge_junit
+
 REPO = {repo_repr}
 VERSION = {version_repr}
 MODULE_NAME = {module_name_repr}
@@ -146,7 +148,7 @@ def _render_adapter(spec: Dict[str, Any]) -> str:
                 }
             }
         ),
-        map_repo_to_parser_repr=repr({spec["repo"]: spec["parser"]}),
+        map_repo_to_parser_repr="{" + repr(spec["repo"]) + ": parse_repogauge_junit}",
     )
 
 
diff --git a/repogauge/parsers/__init__.py b/repogauge/parsers/__init__.py
new file mode 100644
index 0000000..ea80d59
--- /dev/null
+++ b/repogauge/parsers/__init__.py
@@ -0,0 +1 @@
+"""Parser bridge modules for harness integration."""
diff --git a/repogauge/parsers/junit.py b/repogauge/parsers/junit.py
new file mode 100644
index 0000000..afd4af8
--- /dev/null
+++ b/repogauge/parsers/junit.py
@@ -0,0 +1,73 @@
+"""JUnit XML parser bridge for harness integration (bead 3wi).
+
+Exposes ``parse_repogauge_junit``, a polymorphic entry-point that accepts
+paths, raw XML strings/bytes, or dict payloads and returns a mapping of
+canonical test IDs to outcomes.
+"""
+
+from __future__ import annotations
+
+import xml.etree.ElementTree as ET
+from pathlib import Path
+from typing import Any, Dict
+
+from repogauge.validation.junit_parser import (
+    JUnitParseError,
+    _canonical_id,
+    _outcome_of,
+    parse_junit_xml,
+)
+
+
+def parse_repogauge_junit(payload: Any) -> Dict[str, str]:
+    """Parse a JUnit XML payload and return ``{test_id: outcome}``.
+
+    Supported *payload* types:
+
+    - ``pathlib.Path`` -> read the file at that path.
+    - ``str`` -> parse the raw XML text.
+    - ``bytes`` -> parse the raw XML bytes.
+    - ``dict`` -> must contain the key ``"junit_xml"`` whose value is a
+      ``pathlib.Path``.
+
+    Outcomes are one of: ``"pass"``, ``"fail"``, ``"error"``, ``"skip"``.
+
+    Raises:
+        JUnitParseError: if the XML is absent, empty, or malformed.
+        TypeError: if *payload* is not one of the supported types.
+    """
+    if isinstance(payload, Path):
+        return parse_junit_xml(payload)
+
+    if isinstance(payload, dict):
+        xml_path = payload.get("junit_xml")
+        if isinstance(xml_path, Path):
+            return parse_junit_xml(xml_path)
+        raise TypeError(
+            f"unsupported report payload: {type(payload).__name__}"
+        )
+
+    if isinstance(payload, (str, bytes)):
+        try:
+            root = ET.fromstring(payload)
+        except ET.ParseError as exc:
+            raise JUnitParseError(f"malformed JUnit XML: {exc}") from exc
+
+        suites = root.findall(".//testsuite") or (
+            [root] if root.tag == "testsuite" else []
+        )
+        if not suites:
+            raise JUnitParseError("no <testsuite> elements found")
+
+        results: Dict[str, str] = {}
+        for suite in suites:
+            for tc in suite.findall("testcase"):
+                classname = (tc.get("classname") or "").strip()
+                name = (tc.get("name") or "").strip()
+                if not name:
+                    continue
+                test_id = _canonical_id(classname, name)
+                results[test_id] = _outcome_of(tc)
+        return results
+
+    raise TypeError(f"unsupported report payload: {type(payload).__name__}")

LLM Judgeworse

Overall Δ: -0.90
Confidence: 0.95

Candidate lands the adapter hook but falls short of the gold bridge by duplicating parser logic, narrowing accepted payload shapes, and omitting the validation-layer refactor and compatibility tests that make the bridge consistent and robust.

Task Fitworse Δ-1

It does wire generated adapters to `parse_repogauge_junit`, so the main integration path is covered. But it does not fully match the task intent of a generic harness bridge because it only accepts dict payloads with a `junit_xml` Path and does not support string file paths or the broader payload conventions handled by the gold patch.

Correctness Safetyworse Δ-1

The gold patch centralizes XML-content parsing in `repogauge.validation.junit_parser` so harness parsing stays aligned with validation semantics. The candidate instead reimplements XML parsing in `repogauge/parsers/junit.py` using private helpers, creating semantic drift risk and bypassing the validation parser’s canonical error behavior. It also rejects mapping payloads that contain XML text or alternate common keys, and a string path to an existing XML file would be misparsed as XML content rather than read from disk.

Maintainabilityworse Δ-1

The candidate duplicates parsing logic instead of reusing a shared `parse_junit_xml_content` helper, which increases long-term maintenance cost and makes future parser changes easy to miss. It also imports private validation helpers (`_canonical_id`, `_outcome_of`) across module boundaries, which is a weaker design than the gold patch’s public helper reuse. `repogauge/parsers/__init__.py` is also incomplete relative to the gold export surface.

Test Qualityworse Δ-1

No candidate tests are shown for the new bridge behavior. The gold patch adds direct coverage for adapter generation, path/XML/bytes/dict payload handling, malformed XML propagation, and parser identity in generated adapters.

Change Focussame Δ+0

The candidate stays scoped to the parser bridge work and does not introduce obvious unrelated churn. The main issue is incompleteness and duplication, not scope creep.

s1liconcow__repogauge-rg-7d17196fe3ca-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Landing changes for bead oss_repogauge-90w - production-vs-test patch split - Production changes: repogauge/export/materialize.py, repogauge/export/split_patch.py. Test changes: tests/unit/test_split_patch.py. - Bead oss_repogauge-90w: Implement production-vs-test patch split with exact diff preservation Context: - Split a fixing commit into: - `patch`: production-code diff only - `test_patch`: tests plus required test-support files - Preserve exact unified diff text and apply semantics. Acceptance: - `patch` and `test_patch` are generated as valid unified diffs. - Split logic is shared and tested with gold fixtures. - Failed or ambiguous splits are rejected with reason codes. --- - Reproduction: run tests impacted by this change (335 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 2.8m
Spend: $0.36
Tokens: 10.2K
Tool Calls: 12
Patch Size: 7,474 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..19ec268
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,54 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+Current release state is scaffolded and in active development.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..2ec2a31
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,31 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/mining/__init__.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/export/split_patch.py b/repogauge/export/split_patch.py
new file mode 100644
index 0000000..1c26e56
--- /dev/null
+++ b/repogauge/export/split_patch.py
@@ -0,0 +1,99 @@
+"""Utilities for splitting a unified diff into production and test patches."""
+
+from __future__ import annotations
+
+import re
+from typing import Dict, List, Tuple
+
+from repogauge.mining.file_roles import classify_file
+
+
+DIFF_HEADER_RE = re.compile(r"^diff --git (?:a/)?(.+?) (?:b/)?(.+)$")
+
+
+class PatchSplitError(RuntimeError):
+    """Raised when a patch cannot be cleanly split into prod and test components."""
+
+
+def split_prod_and_test(diff: str) -> Tuple[str, str, Dict[str, List[str]]]:
+    """Split a unified diff into production and test patches.
+
+    test_support files (conftest, fixtures, etc.) follow test files into the test
+    patch when the diff contains at least one actual test file; otherwise they fall
+    through to the prod patch.
+
+    Returns:
+        (prod_patch, test_patch, meta) where meta holds 'prod_files' and 'test_files'.
+
+    Raises:
+        PatchSplitError: if a chunk crosses the prod/test boundary (e.g. a rename).
+    """
+    chunks: list[tuple[str, list[str]]] = []
+    current_lines: list[str] | None = None
+    current_role: str = "prod"
+
+    for line in diff.splitlines(keepends=True):
+        match = DIFF_HEADER_RE.match(line)
+        if match:
+            if current_lines is not None:
+                chunks.append((current_role, current_lines))
+
+            from_path = match.group(1).strip()
+            to_path = match.group(2).strip()
+
+            from_role = classify_file(from_path).role
+            to_role = classify_file(to_path).role
+
+            from_is_test_side = from_role in {"test", "test_support"}
+            to_is_test_side = to_role in {"test", "test_support"}
+            if from_is_test_side != to_is_test_side:
+                raise PatchSplitError(
+                    f"cross-boundary rename: {from_path!r} ({from_role}) -> {to_path!r} ({to_role})"
+                )
+
+            current_role = to_role
+            current_lines = [line]
+        else:
+            if current_lines is not None:
+                current_lines.append(line)
+
+    if current_lines is not None:
+        chunks.append((current_role, current_lines))
+
+    has_test_files = any(role == "test" for role, _ in chunks)
+
+    prod_chunks: list[str] = []
+    test_chunks: list[str] = []
+    prod_files: list[str] = []
+    test_files: list[str] = []
+
+    for role, lines in chunks:
+        chunk_text = "".join(lines)
+        file_path = ""
+        m = DIFF_HEADER_RE.match(lines[0])
+        if m:
+            file_path = m.group(2).strip()
+
+        if role == "test":
+            test_chunks.append(chunk_text)
+            if file_path:
+                test_files.append(file_path)
+        elif role == "test_support":
+            if has_test_files:
+                test_chunks.append(chunk_text)
+                if file_path:
+                    test_files.append(file_path)
+            else:
+                prod_chunks.append(chunk_text)
+                if file_path:
+                    prod_files.append(file_path)
+        else:
+            prod_chunks.append(chunk_text)
+            if file_path:
+                prod_files.append(file_path)
+
+    return (
+        "".join(prod_chunks),
+        "".join(test_chunks),
+        {"prod_files": prod_files, "test_files": test_files},
+    )
diff --git a/uv.lock b/uv.lock
new file mode 100644
index 0000000..4e6f117
--- /dev/null
+++ b/uv.lock
@@ -0,0 +1,8 @@
+version = 1
+revision = 3
+requires-python = ">=3.10"
+
+[[package]]
+name = "repogauge"
+version = "0.0.0"
+source = { editable = "." }

LLM Judgeworse

Overall Δ: -1.00
Confidence: 0.97

Candidate adds a basic split helper, but it misses the required integration and tests and includes unrelated packaging churn, so it falls well short of the gold fix.

Task Fitworse Δ-1

The task required both shared split logic and wiring it into materialization so production/test patches are emitted and ambiguous splits are rejected with reason codes. The candidate only adds a standalone `split_patch.py` and does not update `repogauge/export/materialize.py` or add the regression tests, so it does not address the full task as the gold patch does.

Correctness Safetyworse Δ-1

The candidate's splitter is materially less safe than gold. It classifies boundaries only from the `diff --git` header paths instead of parsing rename metadata, so rename handling is less exact. It also omits the gold patch's explicit metadata (`test_support_files`, `all_touched_files`) and does not integrate `PatchSplitError` into materialization rejection flow, so failed/ambiguous splits would not produce the required downstream behavior.

Maintainabilityworse Δ-1

The new helper is readable, but it is a simplified duplicate of logic that the gold patch implements more robustly. The gold version has clearer decomposition for header parsing, rename extraction, bucket assignment, and touched-file tracking, making future extension safer. Missing integration also leaves the overall change incomplete from a maintenance perspective.

Test Qualityworse Δ-1

The candidate adds no tests at all, while the gold patch adds focused unit coverage for prod/test splitting, test-support routing, and cross-boundary rename rejection. Relative to the gold fix, test posture is substantially weaker.

Change Focusworse Δ-1

The candidate introduces unrelated generated packaging artifacts (`repogauge.egg-info/*`, `uv.lock`) that are not part of the task. The gold patch is tightly scoped to the production files and regression test needed for the feature.

Harness: resolved
Attempt State: succeeded
Latency: 2.2m
Spend: $0.13
Tokens: 613.8K
Tool Calls: 23
Patch Size: 13,836 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..19ec268
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,54 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+Current release state is scaffolded and in active development.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..2ec2a31
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,31 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/mining/__init__.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/export/__init__.py b/repogauge/export/__init__.py
index 46826cb..0260ab5 100644
--- a/repogauge/export/__init__.py
+++ b/repogauge/export/__init__.py
@@ -1,5 +1,6 @@
 """Export package."""
 
 from .materialize import run_materialization
+from .split_patch import PatchSplitError, split_prod_and_test
 
-__all__ = ["run_materialization"]
+__all__ = ["run_materialization", "PatchSplitError", "split_prod_and_test"]
diff --git a/repogauge/export/materialize.py b/repogauge/export/materialize.py
index 43ef04a..8066871 100644
--- a/repogauge/export/materialize.py
+++ b/repogauge/export/materialize.py
@@ -3,18 +3,14 @@
 from __future__ import annotations
 
 import json
-import re
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
 
-from repogauge.mining.file_roles import classify_file
+from repogauge.export.split_patch import PatchSplitError, split_prod_and_test
 from repogauge.utils.git import extract_commit_diff, get_repo_root, list_commit_parents
 
 
-DIFF_HEADER_RE = re.compile(r"^diff --git (?:a/)?(.+?) (?:b/)?(.+)$")
-
-
 class MaterializationError(RuntimeError):
     """Raised when a reviewed candidate cannot be materialized."""
 
@@ -57,51 +53,11 @@ def _read_jsonl(path: Path) -> List[Dict[str, Any]]:
     return rows
 
 
-def _parse_diff_headers(diff: str) -> Tuple[List[str], List[str], Dict[str, List[str]]]:
-    prod_chunks: list[str] = []
-    test_chunks: list[str] = []
-    touched: Dict[str, List[str]] = {"prod": [], "test": [], "unknown": []}
-
-    current_bucket: Optional[str] = None
-
-    for line in diff.splitlines(keepends=True):
-        match = DIFF_HEADER_RE.match(line)
-        if match:
-            # Prefer the b/ path from the diff header when present.
-            file_path = match.group(2).strip()
-            role = classify_file(file_path).role
-            if role == "test":
-                current_bucket = "test"
-            elif role in {"prod", "config_build", "docs", "generated_vendor", "unknown", "test_support"}:
-                current_bucket = "prod"
-            else:
-                current_bucket = "prod"
-
-            bucket_name = current_bucket if current_bucket in {"test", "prod"} else "unknown"
-            if file_path not in touched[bucket_name]:
-                touched[bucket_name].append(file_path)
-
-            if bucket_name == "test":
-                test_chunks.append(line)
-            else:
-                prod_chunks.append(line)
-            continue
-
-        if current_bucket == "test":
-            test_chunks.append(line)
-        elif current_bucket == "prod":
-            prod_chunks.append(line)
-        else:
-            prod_chunks.append(line)
-
-    return prod_chunks, test_chunks, touched
-
-
 def _split_patch(diff: str) -> Tuple[str, str, Dict[str, List[str]]]:
-    prod_chunks, test_chunks, touched = _parse_diff_headers(diff)
-    return "".join(prod_chunks), "".join(test_chunks), {
-        "prod_files": touched["prod"],
-        "test_files": touched["test"],
+    prod_patch, test_patch, split_meta = split_prod_and_test(diff)
+    return prod_patch, test_patch, {
+        "prod_files": list(split_meta.get("prod_files", [])),
+        "test_files": list(split_meta.get("test_files", [])),
     }
 
 
@@ -258,7 +214,22 @@ def _materialize_candidate(
             metadata={"reason": "patch extraction returned no content"},
         )
 
-    prod_patch, test_patch, split_meta = _split_patch(patch)
+    try:
+        prod_patch, test_patch, split_meta = _split_patch(patch)
+    except PatchSplitError as exc:
+        return None, MaterializedItem(
+            candidate_id=candidate_id,
+            repo=repo,
+            commit=commit,
+            base_commit=base_commit,
+            patch=patch,
+            test_patch="",
+            prod_patch="",
+            status="rejected",
+            reason="patch_split_failed",
+            metadata={"reason": exc.message, "split_reason": exc.reason},
+        )
+
     if not prod_patch.strip():
         return None, MaterializedItem(
             candidate_id=candidate_id,
diff --git a/repogauge/export/split_patch.py b/repogauge/export/split_patch.py
new file mode 100644
index 0000000..f7eab98
--- /dev/null
+++ b/repogauge/export/split_patch.py
@@ -0,0 +1,164 @@
+"""Split unified diffs into production and test patches."""
+
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass
+from typing import Any, Dict, List, Optional, Tuple
+
+from repogauge.mining.file_roles import classify_file
+
+
+DIFF_HEADER_RE = re.compile(r"^diff --git (?:a/)?(.+?) (?:b/)?(.+)$")
+RENAME_FROM_RE = re.compile(r"^rename from (.+)$")
+RENAME_TO_RE = re.compile(r"^rename to (.+)$")
+COPY_FROM_RE = re.compile(r"^copy from (.+)$")
+COPY_TO_RE = re.compile(r"^copy to (.+)$")
+
+
+@dataclass
+class PatchSplitError(RuntimeError):
+    """Raised when a unified diff cannot be split safely."""
+
+    reason: str
+    message: str
+
+    def __post_init__(self) -> None:
+        super().__init__(self.message)
+
+
+def _path_role(path: str) -> str:
+    return classify_file(path).role
+
+
+def _is_test_role(role: str) -> bool:
+    return role == "test"
+
+
+def _is_test_support_role(role: str) -> bool:
+    return role == "test_support"
+
+
+def _chunk_bucket(role: str, *, has_test_changes: bool) -> str:
+    if _is_test_role(role):
+        return "test"
+    if _is_test_support_role(role):
+        return "test" if has_test_changes else "prod"
+    return "prod"
+
+
+def _raise_rename_boundary_error(old_path: str, new_path: str) -> None:
+    old_role = _path_role(old_path)
+    new_role = _path_role(new_path)
+    if (_is_test_role(old_role) or _is_test_support_role(old_role)) != (
+        _is_test_role(new_role) or _is_test_support_role(new_role)
+    ):
+        raise PatchSplitError(
+            reason="cross_boundary_rename",
+            message=f"rename across production/test boundary is not supported: {old_path} -> {new_path}",
+        )
+
+
+def _inspect_for_rename_markers(chunk_lines: List[str]) -> None:
+    rename_from: Optional[str] = None
+    rename_to: Optional[str] = None
+    copy_from: Optional[str] = None
+    copy_to: Optional[str] = None
+
+    for line in chunk_lines:
+        if rename_from is None:
+            match = RENAME_FROM_RE.match(line)
+            if match:
+                rename_from = match.group(1).strip()
+                continue
+        if rename_to is None:
+            match = RENAME_TO_RE.match(line)
+            if match:
+                rename_to = match.group(1).strip()
+                continue
+        if copy_from is None:
+            match = COPY_FROM_RE.match(line)
+            if match:
+                copy_from = match.group(1).strip()
+                continue
+        if copy_to is None:
+            match = COPY_TO_RE.match(line)
+            if match:
+                copy_to = match.group(1).strip()
+
+    if rename_from and rename_to:
+        _raise_rename_boundary_error(rename_from, rename_to)
+    if copy_from and copy_to:
+        _raise_rename_boundary_error(copy_from, copy_to)
+
+
+def split_prod_and_test(diff: str) -> Tuple[str, str, Dict[str, Any]]:
+    """Split a unified diff into production and test patches.
+
+    The function preserves the exact diff text for each chunk. Test-support
+    files are routed with test changes when the patch includes at least one
+    explicit test file; otherwise they remain with production changes.
+    """
+
+    chunks: list[tuple[str, List[str], List[str]]] = []
+    current_header: Optional[str] = None
+    current_lines: List[str] = []
+    current_paths: List[str] = []
+
+    def flush() -> None:
+        nonlocal current_header, current_lines, current_paths
+        if current_header is None:
+            return
+        chunks.append((current_header, current_paths[:], current_lines[:]))
+        current_header = None
+        current_lines = []
+        current_paths = []
+
+    for line in diff.splitlines(keepends=True):
+        header_match = DIFF_HEADER_RE.match(line)
+        if header_match:
+            flush()
+            current_header = line
+            current_lines = [line]
+            current_paths = [header_match.group(1).strip(), header_match.group(2).strip()]
+            continue
+
+        if current_header is None:
+            # Preserve any leading diff prologue with the production patch.
+            current_header = ""
+            current_lines = [line]
+            current_paths = []
+        else:
+            current_lines.append(line)
+
+    flush()
+
+    has_test_changes = any(_is_test_role(classify_file(path).role) for _, paths, _ in chunks for path in paths if path)
+
+    prod_chunks: list[str] = []
+    test_chunks: list[str] = []
+    touched: Dict[str, List[str]] = {"prod": [], "test": [], "unknown": []}
+
+    for header, paths, lines in chunks:
+        if header:
+            _inspect_for_rename_markers(lines)
+
+        # Prefer the b/ path when present, but keep both sides for rename checks.
+        file_path = paths[1] if len(paths) > 1 and paths[1] else (paths[0] if paths else "")
+        role = classify_file(file_path).role if file_path else "unknown"
+        bucket = _chunk_bucket(role, has_test_changes=has_test_changes)
+
+        bucket_name = bucket if bucket in {"prod", "test"} else "unknown"
+        if file_path and file_path not in touched[bucket_name]:
+            touched[bucket_name].append(file_path)
+
+        rendered = "".join(lines)
+        if bucket == "test":
+            test_chunks.append(rendered)
+        else:
+            prod_chunks.append(rendered)
+
+    return "".join(prod_chunks), "".join(test_chunks), {
+        "prod_files": touched["prod"],
+        "test_files": touched["test"],
+    }
diff --git a/uv.lock b/uv.lock
new file mode 100644
index 0000000..4e6f117
--- /dev/null
+++ b/uv.lock
@@ -0,0 +1,8 @@
+version = 1
+revision = 3
+requires-python = ">=3.10"
+
+[[package]]
+name = "repogauge"
+version = "0.0.0"
+source = { editable = "." }

LLM Judgemuch worse

Overall Δ: -1.25
Confidence: 0.89

Candidate implements the core split path but is materially weaker than the gold patch due to missing regression tests, less robust diff parsing/metadata handling, and substantial unrelated churn.

Task Fitworse Δ-1

It adds a shared split helper and wires materialization through it, but it does not match the gold patch's full task fit: no included regression test coverage, reduced split metadata, and a more generic rejection reason path instead of the gold's clearer unsupported-rename handling.

Correctness Safetyworse Δ-1

The candidate preserves chunk text, but its `diff --git` parsing is regex-based rather than shell-token aware, so quoted or space-containing paths are less safe than the gold implementation. It also omits the gold patch's explicit helper-file handling (`conftest.py`, `pytest.ini`, `tox.ini`) and returns less detailed split metadata, increasing edge-case risk.

Maintainabilityworse Δ-1

The new module is reasonably structured, but the design is less cohesive than gold: `materialize.py` keeps an unnecessary `_split_patch` wrapper that discards metadata, and the split logic is less explicit about supported cases. The extra export changes in `repogauge/export/__init__.py` also broaden surface area without being required by the task.

Test Qualitymuch worse Δ-2

The gold patch adds focused unit tests covering core split behavior, test-support routing, no-test fallback, and boundary rename rejection. The candidate patch includes no corresponding tests, which is a major gap relative to the reference.

Change Focusmuch worse Δ-2

The candidate includes unrelated packaging artifacts (`repogauge.egg-info/*`, `uv.lock`) and extra API exports, which are outside the requested production-vs-test patch split fix. The gold patch stays tightly scoped to the production code and its regression tests.

Harness: resolved
Attempt State: succeeded
Latency: 3.3m
Spend: $0.22
Tokens: 816.5K
Tool Calls: 34
Patch Size: 14,420 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/.repogauge/review/events.jsonl b/.repogauge/review/events.jsonl
new file mode 100644
index 0000000..9218ad2
--- /dev/null
+++ b/.repogauge/review/events.jsonl
@@ -0,0 +1,4 @@
+{"command": "review", "dry_run": false, "event": "command.start", "inputs_hash": "96b0ac84f9623fe6e910daec8ed38150fd5ea681cc5552d0c3154c7ac2407469", "path": "/tmp/tmp9d3zyy14/candidates.jsonl", "status": "running", "timestamp": "2026-04-21T18:47:27.269084Z"}
+{"command": "review", "event": "command.finish", "status": "succeeded", "timestamp": "2026-04-21T18:47:27.269757Z"}
+{"command": "review", "dry_run": false, "event": "command.start", "inputs_hash": "788ee5b3a5c39fdaeb5da578212a2be2837813df7f98d46342181b23b82f9709", "path": "/tmp/tmp1ajdtrhx/candidates.jsonl", "status": "running", "timestamp": "2026-04-21T18:47:27.271809Z"}
+{"command": "review", "event": "command.finish", "status": "succeeded", "timestamp": "2026-04-21T18:47:27.272427Z"}
diff --git a/.repogauge/review/manifest.json b/.repogauge/review/manifest.json
new file mode 100644
index 0000000..9e7d705
--- /dev/null
+++ b/.repogauge/review/manifest.json
@@ -0,0 +1 @@
+{"artifact_paths": {"events": "/testbed/.repogauge/review/events.jsonl", "manifest": "/testbed/.repogauge/review/manifest.json", "review_html": "/tmp/tmp1ajdtrhx/review.html", "review_markdown": "/tmp/tmp1ajdtrhx/review.md", "reviewed": "/tmp/tmp1ajdtrhx/reviewed.jsonl"}, "command": "review", "ended_at": "2026-04-21T18:47:27.272427Z", "host_info": {"platform": "Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.35", "python_implementation": "CPython"}, "inputs_hash": "788ee5b3a5c39fdaeb5da578212a2be2837813df7f98d46342181b23b82f9709", "metadata": {"path": "/tmp/tmp1ajdtrhx/candidates.jsonl", "reason": "review_complete", "review": {"accepted": 2, "candidates_path": "/tmp/tmp1ajdtrhx/candidates.jsonl", "html_path": "/tmp/tmp1ajdtrhx/review.html", "markdown_path": "/tmp/tmp1ajdtrhx/review.md", "open": 0, "rejected": 0, "reviewed_path": "/tmp/tmp1ajdtrhx/reviewed.jsonl", "total": 2}}, "schema_version": "0.1.0", "started_at": "2026-04-21T18:47:27.271817Z", "status": "succeeded", "step_statuses": {"bootstrap": "running", "execute": "succeeded", "finish": "succeeded", "inspect": "succeeded"}, "steps": {"bootstrap_started_at": "2026-04-21T18:47:27.271809Z", "execute_ended_at": "2026-04-21T18:47:27.272421Z", "finish_ended_at": "2026-04-21T18:47:27.272424Z", "inspect_started_at": "2026-04-21T18:47:27.271809Z"}}
diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..19ec268
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,54 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+Current release state is scaffolded and in active development.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..2ec2a31
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,31 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/mining/__init__.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/export/materialize.py b/repogauge/export/materialize.py
index 43ef04a..6090673 100644
--- a/repogauge/export/materialize.py
+++ b/repogauge/export/materialize.py
@@ -8,13 +8,10 @@ from dataclasses import dataclass
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
 
-from repogauge.mining.file_roles import classify_file
+from repogauge.export.split_patch import PatchSplitError, split_prod_and_test
 from repogauge.utils.git import extract_commit_diff, get_repo_root, list_commit_parents
 
 
-DIFF_HEADER_RE = re.compile(r"^diff --git (?:a/)?(.+?) (?:b/)?(.+)$")
-
-
 class MaterializationError(RuntimeError):
     """Raised when a reviewed candidate cannot be materialized."""
 
@@ -57,54 +54,6 @@ def _read_jsonl(path: Path) -> List[Dict[str, Any]]:
     return rows
 
 
-def _parse_diff_headers(diff: str) -> Tuple[List[str], List[str], Dict[str, List[str]]]:
-    prod_chunks: list[str] = []
-    test_chunks: list[str] = []
-    touched: Dict[str, List[str]] = {"prod": [], "test": [], "unknown": []}
-
-    current_bucket: Optional[str] = None
-
-    for line in diff.splitlines(keepends=True):
-        match = DIFF_HEADER_RE.match(line)
-        if match:
-            # Prefer the b/ path from the diff header when present.
-            file_path = match.group(2).strip()
-            role = classify_file(file_path).role
-            if role == "test":
-                current_bucket = "test"
-            elif role in {"prod", "config_build", "docs", "generated_vendor", "unknown", "test_support"}:
-                current_bucket = "prod"
-            else:
-                current_bucket = "prod"
-
-            bucket_name = current_bucket if current_bucket in {"test", "prod"} else "unknown"
-            if file_path not in touched[bucket_name]:
-                touched[bucket_name].append(file_path)
-
-            if bucket_name == "test":
-                test_chunks.append(line)
-            else:
-                prod_chunks.append(line)
-            continue
-
-        if current_bucket == "test":
-            test_chunks.append(line)
-        elif current_bucket == "prod":
-            prod_chunks.append(line)
-        else:
-            prod_chunks.append(line)
-
-    return prod_chunks, test_chunks, touched
-
-
-def _split_patch(diff: str) -> Tuple[str, str, Dict[str, List[str]]]:
-    prod_chunks, test_chunks, touched = _parse_diff_headers(diff)
-    return "".join(prod_chunks), "".join(test_chunks), {
-        "prod_files": touched["prod"],
-        "test_files": touched["test"],
-    }
-
-
 def _coerce_accepted_state(value: Any) -> bool:
     if hasattr(value, "value") and isinstance(getattr(value, "value"), str):
         value = getattr(value, "value")
@@ -138,7 +87,7 @@ def _resolve_base_commit(repo_root: Path, commit: str, row: Dict[str, Any]) -> s
 
 def _extract_candidate_metadata(row: Dict[str, Any], patch: str, base_commit: str) -> Dict[str, Any]:
     metadata = dict(row.get("metadata", {}))
-    split_prod, split_test, split_meta = _split_patch(patch)
+    split_prod, split_test, split_meta = split_prod_and_test(patch)
     metadata.update(
         {
             "materialization": {
@@ -258,7 +207,22 @@ def _materialize_candidate(
             metadata={"reason": "patch extraction returned no content"},
         )
 
-    prod_patch, test_patch, split_meta = _split_patch(patch)
+    try:
+        prod_patch, test_patch, split_meta = split_prod_and_test(patch)
+    except PatchSplitError as exc:
+        return None, MaterializedItem(
+            candidate_id=candidate_id,
+            repo=repo,
+            commit=commit,
+            base_commit=base_commit,
+            patch=patch,
+            test_patch="",
+            prod_patch="",
+            status="rejected",
+            reason="patch_split_failed",
+            metadata={"reason": str(exc)},
+        )
+
     if not prod_patch.strip():
         return None, MaterializedItem(
             candidate_id=candidate_id,
diff --git a/repogauge/export/split_patch.py b/repogauge/export/split_patch.py
new file mode 100644
index 0000000..1ef9c66
--- /dev/null
+++ b/repogauge/export/split_patch.py
@@ -0,0 +1,113 @@
+"""Unified diff splitting into production and test patches."""
+
+from __future__ import annotations
+
+import re
+from typing import Dict, List, Tuple
+
+from repogauge.mining.file_roles import classify_file
+
+DIFF_HEADER_RE = re.compile(r"^diff --git (?:a/)?(.+?) (?:b/)?(.+)$")
+
+
+class PatchSplitError(RuntimeError):
+    """Raised when a patch cannot be split into production and test patches."""
+
+
+def _parse_chunks(diff: str) -> List[List[str]]:
+    """Split a unified diff into chunks per file."""
+    chunks: List[List[str]] = []
+    current_chunk: List[str] = []
+    for line in diff.splitlines(keepends=True):
+        if DIFF_HEADER_RE.match(line):
+            if current_chunk:
+                chunks.append(current_chunk)
+            current_chunk = [line]
+        else:
+            current_chunk.append(line)
+    if current_chunk:
+        chunks.append(current_chunk)
+    return chunks
+
+
+def _is_rename(chunk: List[str]) -> bool:
+    """Detect if a diff chunk represents a rename."""
+    return any("rename from " in line or "rename to " in line for line in chunk)
+
+
+def _is_test_realm(role: str) -> bool:
+    return role in ("test", "test_support")
+
+
+def split_prod_and_test(diff: str) -> Tuple[str, str, Dict[str, List[str]]]:
+    """Split a unified diff into production and test patches.
+
+    Test-support files are placed in the test patch only when there are
+    actual test file changes in the diff. Otherwise they remain in the
+    production patch.
+
+    Cross-category renames (between production and test realms) are
+    rejected with :class:`PatchSplitError`.
+
+    Returns:
+        Tuple of (prod_patch, test_patch, metadata) where metadata
+        contains ``prod_files`` and ``test_files`` lists.
+    """
+    chunks = _parse_chunks(diff)
+
+    chunk_infos: List[Tuple[List[str], str, str]] = []
+
+    for chunk in chunks:
+        header = chunk[0]
+        match = DIFF_HEADER_RE.match(header)
+        if not match:
+            raise PatchSplitError("Missing diff header in chunk")
+
+        a_path = match.group(1).strip()
+        b_path = match.group(2).strip()
+
+        a_role = classify_file(a_path).role
+        b_role = classify_file(b_path).role
+
+        if _is_rename(chunk):
+            a_is_test = _is_test_realm(a_role)
+            b_is_test = _is_test_realm(b_role)
+            if a_is_test != b_is_test:
+                raise PatchSplitError(
+                    f"Cross-category rename from {a_path} to {b_path}"
+                )
+
+        chunk_infos.append((chunk, b_path, b_role))
+
+    has_test_changes = any(role == "test" for _, _, role in chunk_infos)
+
+    prod_chunks: List[List[str]] = []
+    test_chunks: List[List[str]] = []
+    touched: Dict[str, List[str]] = {"prod": [], "test": []}
+
+    for chunk, path, role in chunk_infos:
+        if role == "test":
+            test_chunks.append(chunk)
+            if path not in touched["test"]:
+                touched["test"].append(path)
+        elif role == "test_support":
+            if has_test_changes:
+                test_chunks.append(chunk)
+                if path not in touched["test"]:
+                    touched["test"].append(path)
+            else:
+                prod_chunks.append(chunk)
+                if path not in touched["prod"]:
+                    touched["prod"].append(path)
+        else:
+            prod_chunks.append(chunk)
+            if path not in touched["prod"]:
+                touched["prod"].append(path)
+
+    prod_patch = "".join("".join(chunk) for chunk in prod_chunks)
+    test_patch = "".join("".join(chunk) for chunk in test_chunks)
+
+    return prod_patch, test_patch, {
+        "prod_files": touched["prod"],
+        "test_files": touched["test"],
+    }
diff --git a/uv.lock b/uv.lock
new file mode 100644
index 0000000..4e6f117
--- /dev/null
+++ b/uv.lock
@@ -0,0 +1,8 @@
+version = 1
+revision = 3
+requires-python = ">=3.10"
+
+[[package]]
+name = "repogauge"
+version = "0.0.0"
+source = { editable = "." }

LLM Judgemuch worse

Overall Δ: -1.25
Confidence: 0.88

Candidate captures the basic extraction refactor but falls short of the gold fix on split robustness, rejection semantics, and test coverage, with substantial unrelated churn.

Task Fitworse Δ-1

It moves split logic into a shared module and wires materialization through it, but it does not match the gold behavior fully: rejection uses a generic `patch_split_failed` reason instead of the expected specific unsupported-rename reason code, and it omits the richer split metadata the gold patch records.

Correctness Safetyworse Δ-1

The candidate splitter handles simple file-chunk separation, but it is less safe than gold for exact diff preservation and edge cases. The gold parser uses shell-style header parsing and explicit rename path extraction, which is safer for quoted paths and rename headers; the candidate relies on a regex and broad rename detection. It also misses the gold special handling for helper files like `conftest.py`/`pytest.ini`/`tox.ini` and does not preserve `test_support_files`/`all_touched_files` metadata.

Maintainabilityworse Δ-1

The refactor is directionally good, but the implementation is thinner and less explicit than gold about boundary cases and output metadata. The gold patch has clearer decomposition around header parsing, rename extraction, bucket assignment, and error typing, making future extension safer.

Test Qualitymuch worse Δ-2

The gold patch adds focused regression tests covering prod/test chunk splitting, test-support inclusion rules, keeping support files in prod when no tests change, and cross-boundary rename rejection. The candidate patch adds no corresponding tests at all.

Change Focusmuch worse Δ-2

The candidate includes substantial unrelated churn: review artifacts, `repogauge.egg-info/*`, and `uv.lock`. The gold patch is tightly scoped to the production files and one targeted test file.

s1liconcow__repogauge-rg-897a464a8fd5-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Landing changes for bead oss_repogauge-q5b - adapter registration maps and tests - Production changes: repogauge/export/adapter.py. Test changes: tests/unit/test_adapter.py. - Bead oss_repogauge-q5b: Generate harness adapter registration code and serialized specs Context: - Generate the repo-specific adapter code that registers new repo/version/environment entries with the official harness at runtime. Acceptance: - Adapter code and specs are generated deterministically. - `repogauge eval` can import the generated adapter. - Unit tests cover map registration and idempotent codegen. --- - Reproduction: run tests impacted by this change (123 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 1.2m
Spend: $0.19
Tokens: 3.0K
Tool Calls: 11
Patch Size: 10,091 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..ea258a9
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,133 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+- [docs/junit_parser_contract.md](docs/junit_parser_contract.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+### In scope for MVP
+
+- CLI-only workflows such as:
+  - `repogauge mine /path/to/repo --out ./out`
+  - `repogauge review ./out/candidates.jsonl`
+  - `repogauge export ./out/reviewed.jsonl --dataset ./out/dataset`
+  - `repogauge eval ./out/dataset/dataset.jsonl --gold`
+- Mining candidate bugfix commits from the default branch or an explicit commit range.
+- Materializing SWE-bench-style instances with:
+  - `instance_id`
+  - `repo`
+  - `base_commit`
+  - `problem_statement`
+  - `version`
+  - `patch`
+  - `test_patch`
+  - `FAIL_TO_PASS`
+  - `PASS_TO_PASS`
+  - optional metadata
+- Generating a repo-specific adapter that patches the official harness runtime maps.
+- Validating gold patches locally and through the official harness.
+- Running experiment matrices across multiple solver adapters and evaluating them through the judge path.
+- Producing per-run cost/quality reports and router-training data.
+
+Current release state is scaffolded and in active development.
+
+## Running repogauge against itself
+
+```bash
+scripts/gauge_self.sh
+```
+
+Options:
+
+| Flag | Default | Description |
+|---|---|---|
+| `--out DIR` | `./out` | Root directory for all artifacts |
+| `--enrich-github` | disabled | Enable optional GitHub issue/PR metadata enrichment |
+| `--max-commits N` | `100` | Commits to scan |
+| `--github-token TOKEN` | `$(GITHUB_TOKEN)` | Token used for GitHub API calls |
+| `--github-enrichment-cache PATH` | `<out>/github_enrichment_cache.json` | Optional local cache for enrichment responses |
+| `--decisions FILE` | _(none)_ | JSONL file of manual accept/reject decisions |
+
+Outputs written under `--out`:
+
+```
+mine/repo_profile.json              repo identity and environment hints
+mine/candidates.jsonl               all scanned commits with heuristic scores
+review/reviewed.jsonl               accept/reject decisions
+review/review.html                  human-readable review report
+export/dataset/dataset.jsonl        SWE-bench-compatible instances
+export/dataset/predictions.gold.jsonl
+```
+
+### Command artifact contract (current scaffold)
+
+For `--out` directory `./out`, the scaffold writes command-specific artifacts:
+
+- `manifest.json`
+  - command metadata and step status for each run invocation
+- `events.jsonl`
+  - machine-readable execution events for the same run
+- `repo_profile.json`, `scan.jsonl`, `candidates.jsonl` for `mine`
+- `reviewed.jsonl`, `review.md`, `review.html` for `review`
+- `materialized.jsonl`, `materialization_rejections.jsonl`, `dataset/dataset.jsonl`,
+  `dataset/predictions.gold.jsonl`, `adapter_<repo>.py`, `specs.json` for `export`
+- `validation.jsonl` for `eval`
+
+This list reflects what the v0.1 scaffold guarantees today; future stages
+extend it to include run-level and analysis artifacts.
+
+### E2E integration test
+
+```bash
+uv run python -m pytest tests/e2e/test_self_gauge.py -v
+```
+
+This runs the full mine → review → export pipeline against this repository and
+validates every artifact at each stage.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..2387ff5
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,41 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/llm.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/dataset.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/export/split_patch.py
+repogauge/mining/__init__.py
+repogauge/mining/enrich.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/mining/signature.py
+repogauge/mining/synthesize.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/env_detect.py
+repogauge/validation/evidence.py
+repogauge/validation/junit_parser.py
+repogauge/validation/testsel.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/export/adapter.py b/repogauge/export/adapter.py
index 553c449..7a7c9e8 100644
--- a/repogauge/export/adapter.py
+++ b/repogauge/export/adapter.py
@@ -29,15 +29,24 @@ def build_adapter_spec(
     repo_name: str, environment_plan: Dict[str, Any]
 ) -> Dict[str, Any]:
     """Build a serialisable adapter spec dict from a repo name and env plan."""
+    module_name = _safe_module_name(repo_name)
+    version = environment_plan.get("version", "0.0.0")
+    python_version = environment_plan.get("python_version", "3.11")
     return {
         "repo": repo_name,
-        "python_version": environment_plan.get("python_version", "3.11"),
+        "version": version,
+        "module_name": module_name,
+        "python_version": python_version,
         "pre_install": environment_plan.get("pre_install", []),
         "install": environment_plan.get("install", ["pip install -e ."]),
         "build": environment_plan.get("build", []),
         "test_cmd_base": environment_plan.get("test_cmd_base", "python -m pytest"),
         "strategy_name": environment_plan.get("strategy_name", ""),
         "parser": "junit",
+        "docker_specs": environment_plan.get(
+            "docker_specs",
+            {"image": f"python:{python_version}-slim"},
+        ),
     }
 
 
@@ -53,6 +62,8 @@ _ADAPTER_TEMPLATE = '''\
 from __future__ import annotations
 
 REPO = {repo_repr}
+VERSION = {version_repr}
+MODULE_NAME = {module_name_repr}
 PYTHON_VERSION = {python_version_repr}
 PRE_INSTALL = {pre_install_repr}
 INSTALL = {install_repr}
@@ -61,11 +72,38 @@ TEST_CMD_BASE = {test_cmd_base_repr}
 PARSER = "junit"
 STRATEGY_NAME = {strategy_name_repr}
 
+MAP_REPO_TO_EXT = {{REPO: "py"}}
+MAP_REPO_TO_PARSER = {{REPO: PARSER}}
+MAP_REPO_VERSION_TO_SPECS = {{
+    REPO: {{
+        VERSION: {{
+            "python_version": PYTHON_VERSION,
+            "install": INSTALL,
+            "build": BUILD,
+            "test_cmd_base": TEST_CMD_BASE,
+            "parser": PARSER,
+            "strategy_name": STRATEGY_NAME,
+        }}
+    }}
+}}
+
+
+def registration_context() -> dict:
+    """Return the registration context dict for this repo."""
+    return {{
+        "repo": REPO,
+        "version": VERSION,
+        "module_name": MODULE_NAME,
+        "parser": PARSER,
+    }}
+
 
 def get_spec() -> dict:
     """Return the environment spec dict for this repo."""
     return {{
         "repo": REPO,
+        "version": VERSION,
+        "module_name": MODULE_NAME,
         "python_version": PYTHON_VERSION,
         "pre_install": PRE_INSTALL,
         "install": INSTALL,
@@ -81,6 +119,8 @@ def _render_adapter(spec: Dict[str, Any]) -> str:
     return _ADAPTER_TEMPLATE.format(
         repo=spec["repo"],
         repo_repr=repr(spec["repo"]),
+        version_repr=repr(spec["version"]),
+        module_name_repr=repr(spec["module_name"]),
         python_version_repr=repr(spec["python_version"]),
         pre_install_repr=repr(spec["pre_install"]),
         install_repr=repr(spec["install"]),

LLM Judgemuch worse

Overall Δ: -1.50
Confidence: 0.89

Candidate partially implements adapter registration metadata but misses several gold behaviors, adds unrelated packaging artifacts, and appears not to include the corresponding unit test coverage.

Task Fitworse Δ-1

It adds version/module fields and basic registration maps, so it moves toward the task, but it does not match the gold patch's full harness-facing shape. The generated adapter lacks the richer registration payload and complete serialized spec structure expected by the reference fix.

Correctness Safetymuch worse Δ-2

The candidate keeps the older `_safe_module_name` behavior, so edge cases the gold patch hardens remain exposed. Its `docker_specs` format differs from gold, `MAP_REPO_VERSION_TO_SPECS` omits fields included by the reference implementation, and `registration_context()` returns a reduced structure instead of the map bundle the harness integration expects.

Maintainabilityworse Δ-1

The production change is understandable, but it is less cohesive than gold because related data is split across partially duplicated structures and the registration contract is under-specified. The unrelated `repogauge.egg-info` additions also add noise and maintenance burden.

Test Qualitymuch worse Δ-2

The gold patch adds targeted tests for required fields, importability, registration maps, and deterministic generation. No corresponding candidate test updates are shown, so the new behavior is materially less protected against regression.

Change Focusmuch worse Δ-2

The candidate introduces multiple generated packaging files unrelated to the bead's adapter-registration task. That extra churn is absent from the gold patch and makes the diff substantially less disciplined.

Harness: resolved
Attempt State: succeeded
Latency: 1.7m
Spend: $0.09
Tokens: 535.4K
Tool Calls: 25
Patch Size: 11,290 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..ea258a9
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,133 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+- [docs/junit_parser_contract.md](docs/junit_parser_contract.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+### In scope for MVP
+
+- CLI-only workflows such as:
+  - `repogauge mine /path/to/repo --out ./out`
+  - `repogauge review ./out/candidates.jsonl`
+  - `repogauge export ./out/reviewed.jsonl --dataset ./out/dataset`
+  - `repogauge eval ./out/dataset/dataset.jsonl --gold`
+- Mining candidate bugfix commits from the default branch or an explicit commit range.
+- Materializing SWE-bench-style instances with:
+  - `instance_id`
+  - `repo`
+  - `base_commit`
+  - `problem_statement`
+  - `version`
+  - `patch`
+  - `test_patch`
+  - `FAIL_TO_PASS`
+  - `PASS_TO_PASS`
+  - optional metadata
+- Generating a repo-specific adapter that patches the official harness runtime maps.
+- Validating gold patches locally and through the official harness.
+- Running experiment matrices across multiple solver adapters and evaluating them through the judge path.
+- Producing per-run cost/quality reports and router-training data.
+
+Current release state is scaffolded and in active development.
+
+## Running repogauge against itself
+
+```bash
+scripts/gauge_self.sh
+```
+
+Options:
+
+| Flag | Default | Description |
+|---|---|---|
+| `--out DIR` | `./out` | Root directory for all artifacts |
+| `--enrich-github` | disabled | Enable optional GitHub issue/PR metadata enrichment |
+| `--max-commits N` | `100` | Commits to scan |
+| `--github-token TOKEN` | `$(GITHUB_TOKEN)` | Token used for GitHub API calls |
+| `--github-enrichment-cache PATH` | `<out>/github_enrichment_cache.json` | Optional local cache for enrichment responses |
+| `--decisions FILE` | _(none)_ | JSONL file of manual accept/reject decisions |
+
+Outputs written under `--out`:
+
+```
+mine/repo_profile.json              repo identity and environment hints
+mine/candidates.jsonl               all scanned commits with heuristic scores
+review/reviewed.jsonl               accept/reject decisions
+review/review.html                  human-readable review report
+export/dataset/dataset.jsonl        SWE-bench-compatible instances
+export/dataset/predictions.gold.jsonl
+```
+
+### Command artifact contract (current scaffold)
+
+For `--out` directory `./out`, the scaffold writes command-specific artifacts:
+
+- `manifest.json`
+  - command metadata and step status for each run invocation
+- `events.jsonl`
+  - machine-readable execution events for the same run
+- `repo_profile.json`, `scan.jsonl`, `candidates.jsonl` for `mine`
+- `reviewed.jsonl`, `review.md`, `review.html` for `review`
+- `materialized.jsonl`, `materialization_rejections.jsonl`, `dataset/dataset.jsonl`,
+  `dataset/predictions.gold.jsonl`, `adapter_<repo>.py`, `specs.json` for `export`
+- `validation.jsonl` for `eval`
+
+This list reflects what the v0.1 scaffold guarantees today; future stages
+extend it to include run-level and analysis artifacts.
+
+### E2E integration test
+
+```bash
+uv run python -m pytest tests/e2e/test_self_gauge.py -v
+```
+
+This runs the full mine → review → export pipeline against this repository and
+validates every artifact at each stage.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..2387ff5
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,41 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/llm.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/dataset.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/export/split_patch.py
+repogauge/mining/__init__.py
+repogauge/mining/enrich.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/mining/signature.py
+repogauge/mining/synthesize.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/env_detect.py
+repogauge/validation/evidence.py
+repogauge/validation/junit_parser.py
+repogauge/validation/testsel.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/export/adapter.py b/repogauge/export/adapter.py
index 553c449..e6553a5 100644
--- a/repogauge/export/adapter.py
+++ b/repogauge/export/adapter.py
@@ -29,15 +29,34 @@ def build_adapter_spec(
     repo_name: str, environment_plan: Dict[str, Any]
 ) -> Dict[str, Any]:
     """Build a serialisable adapter spec dict from a repo name and env plan."""
+    module_name = _safe_module_name(repo_name)
+    version = environment_plan.get("version", "0.0.0")
+    python_version = environment_plan.get("python_version", "3.11")
+    pre_install = environment_plan.get("pre_install", [])
+    install = environment_plan.get("install", ["pip install -e ."])
+    build = environment_plan.get("build", [])
+    test_cmd_base = environment_plan.get("test_cmd_base", "python -m pytest")
+    strategy_name = environment_plan.get("strategy_name", "")
+    docker_specs = {
+        "python_version": python_version,
+        "pre_install": pre_install,
+        "install": install,
+        "build": build,
+        "test_cmd": [test_cmd_base],
+        "strategy_name": strategy_name,
+    }
     return {
         "repo": repo_name,
-        "python_version": environment_plan.get("python_version", "3.11"),
-        "pre_install": environment_plan.get("pre_install", []),
-        "install": environment_plan.get("install", ["pip install -e ."]),
-        "build": environment_plan.get("build", []),
-        "test_cmd_base": environment_plan.get("test_cmd_base", "python -m pytest"),
-        "strategy_name": environment_plan.get("strategy_name", ""),
+        "version": version,
+        "module_name": module_name,
+        "python_version": python_version,
+        "pre_install": pre_install,
+        "install": install,
+        "build": build,
+        "test_cmd_base": test_cmd_base,
+        "strategy_name": strategy_name,
         "parser": "junit",
+        "docker_specs": docker_specs,
     }
 
 
@@ -53,6 +72,8 @@ _ADAPTER_TEMPLATE = '''\
 from __future__ import annotations
 
 REPO = {repo_repr}
+VERSION = {version_repr}
+MODULE_NAME = {module_name_repr}
 PYTHON_VERSION = {python_version_repr}
 PRE_INSTALL = {pre_install_repr}
 INSTALL = {install_repr}
@@ -60,12 +81,41 @@ BUILD = {build_repr}
 TEST_CMD_BASE = {test_cmd_base_repr}
 PARSER = "junit"
 STRATEGY_NAME = {strategy_name_repr}
+DOCKER_SPECS = {docker_specs_repr}
+
+MAP_REPO_TO_EXT = {{}}
+MAP_REPO_VERSION_TO_SPECS = {{}}
+MAP_REPO_TO_PARSER = {{}}
+
+MAP_REPO_TO_EXT[REPO] = "py"
+MAP_REPO_VERSION_TO_SPECS.setdefault(REPO, {{}})[VERSION] = {{
+    "docker_specs": DOCKER_SPECS,
+    "pre_install": PRE_INSTALL,
+    "install": INSTALL,
+    "build": BUILD,
+    "test_cmd": [f"{{TEST_CMD_BASE}} --junitxml=/tmp/rg-junit.xml"],
+    "parser": PARSER,
+}}
+MAP_REPO_TO_PARSER[REPO] = PARSER
+
+
+def registration_context() -> dict:
+    """Return the stable registration payload for this repo/version."""
+    return {{
+        "repo": REPO,
+        "version": VERSION,
+        "module_name": MODULE_NAME,
+        "python_version": PYTHON_VERSION,
+        "parser": PARSER,
+    }}
 
 
 def get_spec() -> dict:
     """Return the environment spec dict for this repo."""
     return {{
         "repo": REPO,
+        "version": VERSION,
+        "module_name": MODULE_NAME,
         "python_version": PYTHON_VERSION,
         "pre_install": PRE_INSTALL,
         "install": INSTALL,
@@ -73,6 +123,7 @@ def get_spec() -> dict:
         "test_cmd_base": TEST_CMD_BASE,
         "parser": PARSER,
         "strategy_name": STRATEGY_NAME,
+        "docker_specs": DOCKER_SPECS,
     }}
 '''
 
@@ -81,12 +132,15 @@ def _render_adapter(spec: Dict[str, Any]) -> str:
     return _ADAPTER_TEMPLATE.format(
         repo=spec["repo"],
         repo_repr=repr(spec["repo"]),
+        version_repr=repr(spec["version"]),
+        module_name_repr=repr(spec["module_name"]),
         python_version_repr=repr(spec["python_version"]),
         pre_install_repr=repr(spec["pre_install"]),
         install_repr=repr(spec["install"]),
         build_repr=repr(spec["build"]),
         test_cmd_base_repr=repr(spec["test_cmd_base"]),
         strategy_name_repr=repr(spec["strategy_name"]),
+        docker_specs_repr=repr(spec["docker_specs"]),
     )

LLM Judgemuch worse

Overall Δ: -1.25
Confidence: 0.86

Candidate implements adapter registration data, but diverges from the gold contract, omits the targeted tests, and adds unrelated packaging artifacts.

Task Fitworse Δ-1

It covers the core idea of adding version/module metadata and registration maps, but it does not match the gold patch's registration payload shape. In particular, `registration_context()` omits the map bundle expected by the gold solution, and the generated map spec uses a different structure (`test_cmd` instead of the gold's `test_cmd_base` plus `strategy_name`).

Correctness Safetyworse Δ-1

The candidate leaves `_safe_module_name()` unchanged, so edge cases the gold fix handles (empty repo name, leading digit, stricter sanitization) remain exposed. It also changes the serialized/runtime contract by embedding a richer `docker_specs` object and different version-spec contents, which increases compatibility risk versus the gold implementation.

Maintainabilityworse Δ-1

The gold patch keeps the generated registration data declarative and directly rendered from the spec. The candidate instead initializes empty dicts and mutates them at import time, while also duplicating overlapping fields across `docker_specs` and top-level spec content. That is less clear and makes the generated adapter contract harder to reason about.

Test Qualitymuch worse Δ-2

The gold patch adds focused unit coverage for the new fields, importability, registration maps, and determinism. No corresponding test updates are present in the candidate diff, so the new behavior is materially less protected.

Change Focusmuch worse Δ-2

The task is confined to `repogauge/export/adapter.py` and `tests/unit/test_adapter.py`, but the candidate also adds multiple `repogauge.egg-info` files unrelated to the regression. That is clear scope creep compared with the gold patch's disciplined change set.

Harness: resolved
Attempt State: succeeded
Latency: 1.5m
Spend: $0.06
Tokens: 158.8K
Tool Calls: 9
Patch Size: 10,248 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..ea258a9
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,133 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+- [docs/junit_parser_contract.md](docs/junit_parser_contract.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+### In scope for MVP
+
+- CLI-only workflows such as:
+  - `repogauge mine /path/to/repo --out ./out`
+  - `repogauge review ./out/candidates.jsonl`
+  - `repogauge export ./out/reviewed.jsonl --dataset ./out/dataset`
+  - `repogauge eval ./out/dataset/dataset.jsonl --gold`
+- Mining candidate bugfix commits from the default branch or an explicit commit range.
+- Materializing SWE-bench-style instances with:
+  - `instance_id`
+  - `repo`
+  - `base_commit`
+  - `problem_statement`
+  - `version`
+  - `patch`
+  - `test_patch`
+  - `FAIL_TO_PASS`
+  - `PASS_TO_PASS`
+  - optional metadata
+- Generating a repo-specific adapter that patches the official harness runtime maps.
+- Validating gold patches locally and through the official harness.
+- Running experiment matrices across multiple solver adapters and evaluating them through the judge path.
+- Producing per-run cost/quality reports and router-training data.
+
+Current release state is scaffolded and in active development.
+
+## Running repogauge against itself
+
+```bash
+scripts/gauge_self.sh
+```
+
+Options:
+
+| Flag | Default | Description |
+|---|---|---|
+| `--out DIR` | `./out` | Root directory for all artifacts |
+| `--enrich-github` | disabled | Enable optional GitHub issue/PR metadata enrichment |
+| `--max-commits N` | `100` | Commits to scan |
+| `--github-token TOKEN` | `$(GITHUB_TOKEN)` | Token used for GitHub API calls |
+| `--github-enrichment-cache PATH` | `<out>/github_enrichment_cache.json` | Optional local cache for enrichment responses |
+| `--decisions FILE` | _(none)_ | JSONL file of manual accept/reject decisions |
+
+Outputs written under `--out`:
+
+```
+mine/repo_profile.json              repo identity and environment hints
+mine/candidates.jsonl               all scanned commits with heuristic scores
+review/reviewed.jsonl               accept/reject decisions
+review/review.html                  human-readable review report
+export/dataset/dataset.jsonl        SWE-bench-compatible instances
+export/dataset/predictions.gold.jsonl
+```
+
+### Command artifact contract (current scaffold)
+
+For `--out` directory `./out`, the scaffold writes command-specific artifacts:
+
+- `manifest.json`
+  - command metadata and step status for each run invocation
+- `events.jsonl`
+  - machine-readable execution events for the same run
+- `repo_profile.json`, `scan.jsonl`, `candidates.jsonl` for `mine`
+- `reviewed.jsonl`, `review.md`, `review.html` for `review`
+- `materialized.jsonl`, `materialization_rejections.jsonl`, `dataset/dataset.jsonl`,
+  `dataset/predictions.gold.jsonl`, `adapter_<repo>.py`, `specs.json` for `export`
+- `validation.jsonl` for `eval`
+
+This list reflects what the v0.1 scaffold guarantees today; future stages
+extend it to include run-level and analysis artifacts.
+
+### E2E integration test
+
+```bash
+uv run python -m pytest tests/e2e/test_self_gauge.py -v
+```
+
+This runs the full mine → review → export pipeline against this repository and
+validates every artifact at each stage.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..2387ff5
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,41 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/llm.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/dataset.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/export/split_patch.py
+repogauge/mining/__init__.py
+repogauge/mining/enrich.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/mining/signature.py
+repogauge/mining/synthesize.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/env_detect.py
+repogauge/validation/evidence.py
+repogauge/validation/junit_parser.py
+repogauge/validation/testsel.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/export/adapter.py b/repogauge/export/adapter.py
index 553c449..648bfb4 100644
--- a/repogauge/export/adapter.py
+++ b/repogauge/export/adapter.py
@@ -21,7 +21,7 @@ from typing import Any, Dict
 
 
 def _safe_module_name(repo: str) -> str:
-    """Turn ``owner/repo`` into a valid Python identifier ``owner__repo``."""
+    """Turn ``owner/repo`` into a valid Python identifier ``owner_repo``."""
     return re.sub(r"[^a-zA-Z0-9]", "_", repo)
 
 
@@ -31,6 +31,7 @@ def build_adapter_spec(
     """Build a serialisable adapter spec dict from a repo name and env plan."""
     return {
         "repo": repo_name,
+        "version": environment_plan.get("version", "0.0.0"),
         "python_version": environment_plan.get("python_version", "3.11"),
         "pre_install": environment_plan.get("pre_install", []),
         "install": environment_plan.get("install", ["pip install -e ."]),
@@ -38,6 +39,8 @@ def build_adapter_spec(
         "test_cmd_base": environment_plan.get("test_cmd_base", "python -m pytest"),
         "strategy_name": environment_plan.get("strategy_name", ""),
         "parser": "junit",
+        "module_name": _safe_module_name(repo_name),
+        "docker_specs": {},
     }
 
 
@@ -53,6 +56,8 @@ _ADAPTER_TEMPLATE = '''\
 from __future__ import annotations
 
 REPO = {repo_repr}
+VERSION = {version_repr}
+MODULE_NAME = {module_name_repr}
 PYTHON_VERSION = {python_version_repr}
 PRE_INSTALL = {pre_install_repr}
 INSTALL = {install_repr}
@@ -60,12 +65,24 @@ BUILD = {build_repr}
 TEST_CMD_BASE = {test_cmd_base_repr}
 PARSER = "junit"
 STRATEGY_NAME = {strategy_name_repr}
+DOCKER_SPECS = {docker_specs_repr}
+
+def registration_context() -> dict:
+    """Return the registration context for this repo."""
+    return {{
+        "repo": REPO,
+        "version": VERSION,
+        "module_name": MODULE_NAME,
+        "parser": PARSER,
+    }}
 
 
 def get_spec() -> dict:
     """Return the environment spec dict for this repo."""
     return {{
         "repo": REPO,
+        "version": VERSION,
+        "module_name": MODULE_NAME,
         "python_version": PYTHON_VERSION,
         "pre_install": PRE_INSTALL,
         "install": INSTALL,
@@ -73,7 +90,13 @@ def get_spec() -> dict:
         "test_cmd_base": TEST_CMD_BASE,
         "parser": PARSER,
         "strategy_name": STRATEGY_NAME,
+        "docker_specs": DOCKER_SPECS,
     }}
+
+
+MAP_REPO_TO_EXT = {{REPO: "py"}}
+MAP_REPO_TO_PARSER = {{REPO: "junit"}}
+MAP_REPO_VERSION_TO_SPECS = {{REPO: {{VERSION: get_spec()}}}}
 '''
 
 
@@ -81,12 +104,15 @@ def _render_adapter(spec: Dict[str, Any]) -> str:
     return _ADAPTER_TEMPLATE.format(
         repo=spec["repo"],
         repo_repr=repr(spec["repo"]),
+        version_repr=repr(spec["version"]),
+        module_name_repr=repr(spec["module_name"]),
         python_version_repr=repr(spec["python_version"]),
         pre_install_repr=repr(spec["pre_install"]),
         install_repr=repr(spec["install"]),
         build_repr=repr(spec["build"]),
         test_cmd_base_repr=repr(spec["test_cmd_base"]),
         strategy_name_repr=repr(spec["strategy_name"]),
+        docker_specs_repr=repr(spec["docker_specs"]),
     )

LLM Judgeworse

Overall Δ: -1.00
Confidence: 0.87

Candidate fixes the basic adapter import/registration path but is narrower and noisier than the gold patch, with weaker edge-case handling and no corresponding test coverage in the diff.

Task Fitworse Δ-1

It adds version/module metadata and registration maps, which addresses the core regression, but it does not match the gold patch's fuller harness registration payload. In particular, `registration_context()` omits the maps structure the gold patch exposes, and `docker_specs` is left empty instead of carrying the Python version.

Correctness Safetyworse Δ-1

The candidate keeps the old `_safe_module_name()` behavior, so empty repo names and leading-digit names still produce invalid or fragile module identifiers; the gold patch hardens those cases. `version` is not normalized to `str`, and `MAP_REPO_VERSION_TO_SPECS` stores `get_spec()` wholesale rather than the more targeted harness-spec structure from the gold patch, increasing compatibility risk.

Maintainabilityworse Δ-1

The main code is readable, but the generated contract is less explicit than the gold version because `registration_context()` and the versioned specs map do not clearly mirror the intended harness map shapes. The extra checked-in `repogauge.egg-info` artifacts also add avoidable repository noise.

Test Qualityworse Δ-1

The gold patch adds unit coverage for version/module fields, importability, registration maps, and stability. The candidate diff shows no corresponding test updates, so the new behavior is much less defended against regression.

Change Focusworse Δ-1

The candidate includes unrelated generated packaging files under `repogauge.egg-info/`, which are outside the task scope. The gold patch stays tightly focused on the adapter generator and its tests.

s1liconcow__repogauge-rg-89e8e567eb39-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Landing changes for bead oss_repogauge-rmy - deterministic env plan - Production changes: repogauge/mining/inspect.py, repogauge/validation/__init__.py, repogauge/validation/env_detect.py. Test changes: tests/unit/test_env_detect.py, tests/unit/test_imports.py, tests/unit/test_inspect.py. - Bead oss_repogauge-rmy: Implement deterministic environment detection heuristics Context: - Convert repo inspection hints into a concrete install and test environment plan for validation. - This is the Level 1 deterministic part of the environment ladder. Acceptance: - Common Python repo patterns produce deterministic `EnvPlan` outputs. - Strategy names and provenance are persisted. - The next dry-run stage can consume the plan without ad hoc inference. --- - Reproduction: run tests impacted by this change (288 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 4.7m
Spend: $0.58
Tokens: 13.5K
Tool Calls: 39
Patch Size: 9,406 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..19ec268
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,54 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+Current release state is scaffolded and in active development.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..8b9cd00
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,36 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/llm.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/dataset.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/export/split_patch.py
+repogauge/mining/__init__.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/mining/signature.py
+repogauge/mining/synthesize.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/mining/inspect.py b/repogauge/mining/inspect.py
index edf6b5d..e8e650f 100644
--- a/repogauge/mining/inspect.py
+++ b/repogauge/mining/inspect.py
@@ -6,6 +6,7 @@ from typing import Any, Dict, List
 import re
 
 from repogauge.mining.signature import REPO_VERSION_UNKNOWN, build_environment_signature
+from repogauge.validation.env_detect import build_environment_plan
 from repogauge.exec import run_command
 from repogauge.utils.git import get_default_branch, get_repo_root
 
@@ -384,5 +385,6 @@ def inspect_repository(path: str | Path) -> Dict[str, Any]:
     }
     profile["environment_signature"] = build_environment_signature(profile)
     profile["version"] = profile["environment_signature"]["version"]
+    profile["environment_plan"] = build_environment_plan(profile).to_dict()
 
     return profile
diff --git a/repogauge/validation/__init__.py b/repogauge/validation/__init__.py
index 90dcd7e..ed009af 100644
--- a/repogauge/validation/__init__.py
+++ b/repogauge/validation/__init__.py
@@ -1 +1,5 @@
 """Validation package."""
+
+from repogauge.validation.env_detect import EnvPlan
+
+__all__ = ["EnvPlan"]
diff --git a/repogauge/validation/env_detect.py b/repogauge/validation/env_detect.py
new file mode 100644
index 0000000..d8a9a51
--- /dev/null
+++ b/repogauge/validation/env_detect.py
@@ -0,0 +1,114 @@
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Any
+
+
+@dataclass
+class EnvPlan:
+    python_version: str
+    pre_install: list[str]
+    install: list[str]
+    build: list[str]
+    test_cmd_base: str
+    strategy_name: str
+    confidence: float
+    provenance: list[str]
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "python_version": self.python_version,
+            "pre_install": self.pre_install,
+            "install": self.install,
+            "build": self.build,
+            "test_cmd_base": self.test_cmd_base,
+            "strategy_name": self.strategy_name,
+            "confidence": self.confidence,
+            "provenance": self.provenance,
+        }
+
+
+def _sort_versions(versions: list[str]) -> list[str]:
+    return sorted(versions, key=lambda v: tuple(int(x) for x in v.split(".")))
+
+
+def build_environment_plan(profile: dict[str, Any]) -> EnvPlan:
+    python_hints = profile.get("python_hints", {})
+    versions = python_hints.get("versions", [])
+    package_managers = python_hints.get("package_managers", [])
+    install_hints = profile.get("install_hints", [])
+    test_commands = profile.get("test_runner_hints", {}).get("commands", [])
+
+    provenance: list[str] = []
+    confidence = 1.0
+
+    if len(versions) > 1:
+        provenance.append("python_version:conflict")
+        provenance.append("python_version:chose-minimum")
+        python_version = _sort_versions(versions)[0]
+        confidence -= 0.1
+    elif versions:
+        python_version = versions[0]
+    else:
+        python_version = "3.11"
+
+    install: list[str] = []
+    install_strategy: str
+
+    if "poetry" in package_managers:
+        install_strategy = "poetry"
+        install.append("poetry install")
+        for hint in install_hints:
+            if hint != "poetry install" and hint.startswith("pip install -e"):
+                install.append("pip install pytest")
+        provenance.append("install_strategy:poetry")
+    elif "setuptools" in package_managers:
+        install_strategy = "setuptools"
+        install = ["pip install -e ."]
+        provenance.append("install_strategy:setuptools")
+    elif "requirements" in package_managers:
+        install_strategy = "requirements"
+        req_hints = sorted(h for h in install_hints if h.startswith("pip install -r"))
+        if req_hints:
+            install.append(req_hints[0])
+        provenance.append("install_strategy:requirements")
+    else:
+        install_strategy = "fallback"
+        install = list(install_hints) if install_hints else ["pip install -e ."]
+        provenance.append("install_strategy:fallback")
+
+    test_strategy: str
+    if "pytest" in test_commands:
+        test_cmd_base = "pytest"
+        test_strategy = "pytest"
+        provenance.append("test_runner:pytest")
+    elif "python -m unittest" in test_commands:
+        test_cmd_base = "python -m unittest"
+        test_strategy = "unittest"
+        provenance.append("test_runner:unittest")
+    elif "tox" in test_commands:
+        test_cmd_base = "tox"
+        test_strategy = "tox"
+        provenance.append("test_runner:tox")
+    elif "nox" in test_commands:
+        test_cmd_base = "nox"
+        test_strategy = "nox"
+        provenance.append("test_runner:nox")
+    else:
+        test_cmd_base = "python -m pytest"
+        test_strategy = "pytest-default"
+        provenance.append("test_runner:pytest-default")
+
+    if install_strategy == "requirements" and test_strategy in ("pytest", "pytest-default"):
+        install.append("pip install pytest")
+
+    return EnvPlan(
+        python_version=python_version,
+        pre_install=[],
+        install=install,
+        build=[],
+        test_cmd_base=test_cmd_base,
+        strategy_name=f"{install_strategy}:{test_strategy}",
+        confidence=confidence,
+        provenance=provenance,
+    )
diff --git a/uv.lock b/uv.lock
new file mode 100644
index 0000000..4e6f117
--- /dev/null
+++ b/uv.lock
@@ -0,0 +1,8 @@
+version = 1
+revision = 3
+requires-python = ">=3.10"
+
+[[package]]
+name = "repogauge"
+version = "0.0.0"
+source = { editable = "." }

LLM Judgemuch worse

Overall Δ: -1.25
Confidence: 0.93

Candidate implements the core environment-plan wiring but is materially weaker than the gold patch on API completeness, heuristic coverage, robustness, and tests, with unrelated packaging churn added.

Task Fitworse Δ-1

It adds `environment_plan` generation and a basic `EnvPlan`, but it does not fully match the intended deterministic env-planning surface. `repogauge.validation.__init__` exports only `EnvPlan` and omits `build_environment_plan`, which the gold patch intentionally exposes. Heuristic coverage is narrower: no explicit handling for `uv` or `pipenv`, no preserved install-file provenance, and fallback behavior differs from the reference.

Correctness Safetyworse Δ-1

The candidate is less defensive than gold. `build_environment_plan` assumes dict/list inputs instead of coercing malformed shapes, `EnvPlan` is mutable rather than frozen, and confidence/provenance logic is simplified enough to lose signal. It also records `test_runner:pytest-default` instead of the gold's `test_runner:default`, changing persisted provenance semantics, and its fallback install path may replay arbitrary install hints rather than choosing a safer deterministic default.

Maintainabilityworse Δ-1

The gold patch is cleaner and more extensible: helper functions isolate normalization, version choice, install strategy selection, and pytest augmentation. The candidate compresses that logic into one function, making behavior harder to reason about and extend. Missing module docstrings and fewer typed/coercion helpers also reduce clarity.

Test Qualitymuch worse Δ-2

The gold patch adds focused unit coverage for poetry, setuptools, requirements, conflicting Python versions, import contracts, and inspect integration. The candidate adds no tests at all, so regressions in strategy naming, provenance, and deterministic selection are not protected.

Change Focusmuch worse Δ-2

The candidate includes unrelated generated artifacts (`repogauge.egg-info/*`, `uv.lock`) that are outside the task. The gold patch stays tightly scoped to the env-detection implementation and its tests.

Harness: resolved
Attempt State: succeeded
Latency: 3.3m
Spend: $0.16
Tokens: 699.9K
Tool Calls: 32
Patch Size: 8,576 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/mining/inspect.py b/repogauge/mining/inspect.py
index edf6b5d..9863962 100644
--- a/repogauge/mining/inspect.py
+++ b/repogauge/mining/inspect.py
@@ -7,13 +7,9 @@ import re
 
 from repogauge.mining.signature import REPO_VERSION_UNKNOWN, build_environment_signature
 from repogauge.exec import run_command
+from repogauge.validation.env_detect import build_environment_plan
 from repogauge.utils.git import get_default_branch, get_repo_root
 
-try:
-    import tomllib
-except Exception:  # pragma: no cover
-    import tomli as tomllib  # type: ignore[import-not-found]
-
 
 def _as_sorted_unique(values: list[str]) -> list[str]:
     return sorted(set(values))
@@ -31,23 +27,26 @@ def _safe_read_text(path: Path) -> str:
 
 
 def _extract_toml_value(path: Path, sections: list[str], key: str) -> str | None:
-    try:
-        payload = tomllib.loads(_safe_read_text(path))
-    except Exception:
+    raw = _safe_read_text(path)
+    if not raw:
         return None
 
-    cursor: Any = payload
-    for section in sections:
-        if not isinstance(cursor, dict) or section not in cursor:
-            return None
-        cursor = cursor.get(section)
-    if not isinstance(cursor, dict):
-        return None
+    target_section = ".".join(sections)
+    section_re = re.compile(r"^\s*\[([^\]]+)\]\s*$")
+    key_re = re.compile(rf"^\s*{re.escape(key)}\s*=\s*([\"'])(.*?)\1\s*(?:#.*)?$")
 
-    value = cursor.get(key)
-    if isinstance(value, str):
-        cleaned = value.strip().strip('"\'')
-        return cleaned if cleaned else None
+    current_section = ""
+    for line in raw.splitlines():
+        section_match = section_re.match(line)
+        if section_match:
+            current_section = section_match.group(1).strip()
+            continue
+        if current_section != target_section:
+            continue
+        key_match = key_re.match(line)
+        if key_match:
+            cleaned = key_match.group(2).strip()
+            return cleaned or None
     return None
 
 
@@ -383,6 +382,7 @@ def inspect_repository(path: str | Path) -> Dict[str, Any]:
         "profile_warnings": warnings,
     }
     profile["environment_signature"] = build_environment_signature(profile)
+    profile["environment_plan"] = build_environment_plan(profile).to_dict()
     profile["version"] = profile["environment_signature"]["version"]
 
     return profile
diff --git a/repogauge/validation/__init__.py b/repogauge/validation/__init__.py
index 90dcd7e..bd92e9d 100644
--- a/repogauge/validation/__init__.py
+++ b/repogauge/validation/__init__.py
@@ -1 +1,5 @@
 """Validation package."""
+
+from .env_detect import EnvPlan, build_environment_plan
+
+__all__ = ["EnvPlan", "build_environment_plan"]
diff --git a/repogauge/validation/env_detect.py b/repogauge/validation/env_detect.py
new file mode 100644
index 0000000..b324c1c
--- /dev/null
+++ b/repogauge/validation/env_detect.py
@@ -0,0 +1,146 @@
+"""Deterministic environment plan synthesis."""
+
+from __future__ import annotations
+
+from dataclasses import asdict, dataclass, field
+from typing import Any, Iterable
+
+
+def _as_sorted_unique(values: Iterable[str]) -> list[str]:
+    return sorted({str(value).strip() for value in values if str(value).strip()})
+
+
+def _version_key(version: str) -> tuple[int, ...]:
+    parts: list[int] = []
+    for chunk in str(version).strip().split("."):
+        if not chunk.isdigit():
+            break
+        parts.append(int(chunk))
+    return tuple(parts) if parts else (0,)
+
+
+def _pick_python_version(versions: list[str], provenance: list[str]) -> str:
+    normalized = _as_sorted_unique(versions)
+    if not normalized:
+        provenance.append("python_version:default-3.11")
+        return "3.11"
+    if len(normalized) > 1:
+        provenance.append("python_version:conflict")
+        chosen = min(normalized, key=_version_key)
+        provenance.append("python_version:chose-minimum")
+        return chosen
+    return normalized[0]
+
+
+def _pick_install_strategy(package_managers: list[str], install_hints: list[str], provenance: list[str]) -> tuple[str, list[str]]:
+    managers = _as_sorted_unique(package_managers)
+    hints = _as_sorted_unique(install_hints)
+
+    if "poetry" in managers:
+        provenance.append("install_strategy:poetry")
+        command = next((hint for hint in hints if hint.startswith("poetry ")), "poetry install")
+        return "poetry", [command]
+    if "uv" in managers:
+        provenance.append("install_strategy:uv")
+        command = next((hint for hint in hints if hint.startswith("uv ")), "uv sync")
+        return "uv", [command]
+    if "pipenv" in managers:
+        provenance.append("install_strategy:pipenv")
+        command = next((hint for hint in hints if hint.startswith("pipenv ")), "pipenv install")
+        return "pipenv", [command]
+    if "setuptools" in managers:
+        provenance.append("install_strategy:setuptools")
+        command = next((hint for hint in hints if hint == "pip install -e ."), "pip install -e .")
+        return "setuptools", [command]
+    if "requirements" in managers:
+        provenance.append("install_strategy:requirements")
+        command = next((hint for hint in hints if hint.startswith("pip install -r ")), "pip install -e .")
+        return "requirements", [command]
+
+    provenance.append("install_strategy:fallback")
+    command = next((hint for hint in hints if hint == "pip install -e ."), "pip install -e .")
+    return "fallback", [command]
+
+
+def _pick_test_runner(test_commands: list[str], provenance: list[str]) -> tuple[str, str]:
+    commands = _as_sorted_unique(test_commands)
+
+    if "pytest" in commands:
+        provenance.append("test_runner:pytest")
+        return "pytest", "pytest"
+    if "python -m unittest" in commands:
+        provenance.append("test_runner:unittest")
+        return "unittest", "python -m unittest"
+    if "python -m pytest" in commands:
+        provenance.append("test_runner:python-m-pytest")
+        return "pytest", "python -m pytest"
+    if "tox" in commands:
+        provenance.append("test_runner:tox")
+        return "tox", "tox"
+    if "nox" in commands:
+        provenance.append("test_runner:nox")
+        return "nox", "nox"
+
+    provenance.append("test_runner:pytest-default")
+    return "pytest-default", "python -m pytest"
+
+
+@dataclass
+class EnvPlan:
+    python_version: str
+    pre_install: list[str] = field(default_factory=list)
+    install: list[str] = field(default_factory=list)
+    build: list[str] = field(default_factory=list)
+    test_cmd_base: str = ""
+    strategy_name: str = ""
+    confidence: float = 0.0
+    provenance: list[str] = field(default_factory=list)
+
+    def to_dict(self) -> dict[str, Any]:
+        return asdict(self)
+
+
+def build_environment_plan(profile: dict[str, Any]) -> EnvPlan:
+    python_hints = profile.get("python_hints", {}) or {}
+    test_runner_hints = profile.get("test_runner_hints", {}) or {}
+    install_hints = list(profile.get("install_hints", []) or [])
+
+    provenance: list[str] = []
+    python_version = _pick_python_version(list(python_hints.get("versions", []) or []), provenance)
+    install_strategy, install_cmds = _pick_install_strategy(
+        list(python_hints.get("package_managers", []) or []),
+        install_hints,
+        provenance,
+    )
+    test_strategy, test_cmd_base = _pick_test_runner(
+        list(test_runner_hints.get("commands", []) or []),
+        provenance,
+    )
+
+    if test_strategy in {"pytest", "pytest-default"} and install_strategy not in {"tox", "nox"}:
+        if all("pytest" not in command for command in install_cmds):
+            install_cmds.append("pip install pytest")
+
+    confidence = 0.85
+    if len(_as_sorted_unique(python_hints.get("versions", []) or [])) <= 1:
+        confidence += 0.05
+    if install_strategy in {"poetry", "setuptools", "requirements"}:
+        confidence += 0.05
+    if test_strategy in {"pytest", "unittest", "tox", "nox"}:
+        confidence += 0.05
+    if any(item.startswith("python_version:conflict") for item in provenance):
+        confidence -= 0.1
+
+    return EnvPlan(
+        python_version=python_version,
+        pre_install=[],
+        install=install_cmds,
+        build=[],
+        test_cmd_base=test_cmd_base,
+        strategy_name=f"{install_strategy}:{test_strategy}",
+        confidence=max(0.0, min(1.0, confidence)),
+        provenance=provenance,
+    )
+
+
+__all__ = ["EnvPlan", "build_environment_plan"]

LLM Judgemuch worse

Overall Δ: -1.50
Confidence: 0.89

Candidate implements the basic environment-plan wiring, but it diverges from the golden behavior in provenance/confidence semantics, omits the regression tests, and introduces unrelated TOML parsing churn with clear regression risk.

Task Fitworse Δ-1

It adds `environment_plan` generation and exports `EnvPlan`, but the plan-building behavior does not match the gold contract closely: provenance strings differ, confidence scoring is materially different, and some fallback logic is less aligned with the deterministic heuristics expected by the task.

Correctness Safetymuch worse Δ-2

The candidate replaces the existing TOML loader with a hand-rolled regex parser in `inspect.py`, which is unrelated to the task and is less robust for valid TOML structures. In `env_detect.py`, it weakens input coercion, misses gold provenance cases such as explicit single-version provenance and `test_runner:default`, and only uses requirements-file install hints when `package_managers` explicitly contains `requirements`, making behavior less safe and less compatible with noisy inspection profiles.

Maintainabilityworse Δ-1

The new env-detection module is reasonably readable, but it is less disciplined than the gold version: mutable non-frozen dataclass, narrower typing, fewer normalization helpers, and more ad hoc scoring/provenance logic. The unrelated manual TOML parsing also increases maintenance burden.

Test Qualitymuch worse Δ-2

The gold patch adds focused unit coverage for environment-plan selection, import/export contract, and repository inspection integration. The candidate patch provides no corresponding tests, so the behavioral differences it introduces are unguarded.

Change Focusmuch worse Δ-2

The task is environment detection heuristics, but the candidate also rewrites TOML extraction in `inspect.py`. That is unrelated churn in a sensitive code path and makes the patch substantially less focused than the gold change.

Harness: resolved
Attempt State: succeeded
Latency: 4.1m
Spend: $0.33
Tokens: 1.2M
Tool Calls: 40
Patch Size: 8,021 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/mining/inspect.py b/repogauge/mining/inspect.py
index edf6b5d..5177d77 100644
--- a/repogauge/mining/inspect.py
+++ b/repogauge/mining/inspect.py
@@ -1,5 +1,6 @@
 from __future__ import annotations
 
+import dataclasses
 from configparser import ConfigParser
 from pathlib import Path
 from typing import Any, Dict, List
@@ -8,6 +9,7 @@ import re
 from repogauge.mining.signature import REPO_VERSION_UNKNOWN, build_environment_signature
 from repogauge.exec import run_command
 from repogauge.utils.git import get_default_branch, get_repo_root
+from repogauge.validation.env_detect import build_environment_plan
 
 try:
     import tomllib
@@ -384,5 +386,6 @@ def inspect_repository(path: str | Path) -> Dict[str, Any]:
     }
     profile["environment_signature"] = build_environment_signature(profile)
     profile["version"] = profile["environment_signature"]["version"]
+    profile["environment_plan"] = dataclasses.asdict(build_environment_plan(profile))
 
     return profile
diff --git a/repogauge/validation/__init__.py b/repogauge/validation/__init__.py
index 90dcd7e..ed009af 100644
--- a/repogauge/validation/__init__.py
+++ b/repogauge/validation/__init__.py
@@ -1 +1,5 @@
 """Validation package."""
+
+from repogauge.validation.env_detect import EnvPlan
+
+__all__ = ["EnvPlan"]
diff --git a/repogauge/validation/env_detect.py b/repogauge/validation/env_detect.py
new file mode 100644
index 0000000..18f4bb6
--- /dev/null
+++ b/repogauge/validation/env_detect.py
@@ -0,0 +1,172 @@
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import Any
+
+
+@dataclass
+class EnvPlan:
+    python_version: str
+    pre_install: list[str]
+    install: list[str]
+    build: list[str]
+    test_cmd_base: str
+    strategy_name: str
+    confidence: float
+    provenance: list[str] = field(default_factory=list)
+
+
+def build_environment_plan(profile: dict[str, Any]) -> EnvPlan:
+    """Convert repo inspection hints into a concrete install and test environment plan."""
+    python_hints = profile.get("python_hints", {}) or {}
+    install_hints = list(profile.get("install_hints", []))
+    test_runner_hints = profile.get("test_runner_hints", {}) or {}
+
+    versions = list(python_hints.get("versions", []))
+    package_managers = list(python_hints.get("package_managers", []))
+    test_commands = list(test_runner_hints.get("commands", []))
+
+    provenance: list[str] = []
+    confidence = 1.0
+
+    # Python version resolution
+    if not versions:
+        python_version = "3.10"
+        provenance.append("python_version:default")
+        confidence -= 0.1
+    elif len(versions) == 1:
+        python_version = versions[0]
+        provenance.append("python_version:explicit")
+    else:
+        def _version_key(v: str) -> tuple[int, ...]:
+            return tuple(int(p) for p in v.split(".") if p.isdigit())
+        python_version = min(versions, key=_version_key)
+        provenance.append("python_version:conflict")
+        provenance.append("python_version:chose-minimum")
+        confidence -= 0.1
+
+    # Package manager priority
+    priority = ["poetry", "pipenv", "uv", "setuptools", "requirements", "pep621", "pyproject"]
+    strategy = "unknown"
+    for pm in priority:
+        if pm in package_managers:
+            strategy = pm
+            break
+    if not package_managers:
+        strategy = "fallback"
+        confidence -= 0.1
+
+    # Determine install commands based on strategy
+    install: list[str] = []
+    pip_based_original = any("pip install" in hint for hint in install_hints)
+
+    if strategy == "poetry":
+        poetry_hints = [h for h in install_hints if "poetry" in h]
+        install.extend(poetry_hints if poetry_hints else ["poetry install"])
+        if _is_pytest_runner(test_commands) and pip_based_original:
+            install.append("pip install pytest")
+        provenance.append("install_strategy:poetry")
+    elif strategy == "setuptools":
+        setuptools_hints = [h for h in install_hints if "pip install -e ." in h]
+        install.extend(setuptools_hints if setuptools_hints else ["pip install -e ."])
+        if _is_pytest_runner(test_commands):
+            install.append("pip install pytest")
+        provenance.append("install_strategy:setuptools")
+    elif strategy == "requirements":
+        req_hints = [h for h in install_hints if "requirements" in h]
+        install.extend(req_hints[:1] if req_hints else ["pip install -r requirements.txt"])
+        if _is_pytest_runner(test_commands) or _is_default_pytest(test_commands):
+            install.append("pip install pytest")
+        provenance.append("install_strategy:requirements")
+    elif strategy == "pep621":
+        pep_hints = [h for h in install_hints if "pip install" in h]
+        install.extend(pep_hints[:1] if pep_hints else ["pip install -e ."])
+        if _is_pytest_runner(test_commands):
+            install.append("pip install pytest")
+        provenance.append("install_strategy:pep621")
+    elif strategy == "pipenv":
+        pipenv_hints = [h for h in install_hints if "pipenv" in h]
+        install.extend(pipenv_hints if pipenv_hints else ["pipenv install"])
+        if _is_pytest_runner(test_commands) and pip_based_original:
+            install.append("pip install pytest")
+        provenance.append("install_strategy:pipenv")
+    elif strategy == "uv":
+        uv_hints = [h for h in install_hints if "uv" in h]
+        install.extend(uv_hints if uv_hints else ["uv sync"])
+        if _is_pytest_runner(test_commands) and pip_based_original:
+            install.append("pip install pytest")
+        provenance.append("install_strategy:uv")
+    else:
+        if install_hints:
+            install.append(install_hints[0])
+        else:
+            install.append("pip install -e .")
+        if _is_pytest_runner(test_commands) or _is_default_pytest(test_commands):
+            install.append("pip install pytest")
+        provenance.append(f"install_strategy:{strategy}")
+
+    # Deduplicate install commands while preserving order
+    seen: set[str] = set()
+    deduped_install: list[str] = []
+    for cmd in install:
+        if cmd not in seen:
+            seen.add(cmd)
+            deduped_install.append(cmd)
+    install = deduped_install
+
+    # Test runner resolution
+    if not test_commands:
+        test_cmd_base = "python -m pytest"
+        runner_label = "pytest-default"
+        confidence -= 0.05
+        provenance.append("test_runner:default")
+    else:
+        test_cmd_base = test_commands[0]
+        runner_label = _normalize_runner_label(test_cmd_base)
+        provenance.append(f"test_runner:{runner_label}")
+
+    strategy_name = f"{strategy}:{runner_label}"
+
+    # Confidence adjustments for non-modern stacks
+    if strategy in ("setuptools", "requirements", "fallback", "unknown"):
+        confidence -= 0.05
+    if runner_label not in ("pytest", "pytest-default"):
+        confidence -= 0.05
+
+    # Multiple package managers or test runners reduce confidence
+    if len(package_managers) > 1:
+        confidence -= 0.05
+    if len(test_commands) > 1:
+        confidence -= 0.05
+
+    return EnvPlan(
+        python_version=python_version,
+        pre_install=[],
+        install=install,
+        build=[],
+        test_cmd_base=test_cmd_base,
+        strategy_name=strategy_name,
+        confidence=round(confidence, 2),
+        provenance=provenance,
+    )
+
+
+def _is_pytest_runner(commands: list[str]) -> bool:
+    return any("pytest" in cmd.lower() for cmd in commands)
+
+
+def _is_default_pytest(commands: list[str]) -> bool:
+    return not commands
+
+
+def _normalize_runner_label(command: str) -> str:
+    lower = command.lower()
+    if "pytest" in lower:
+        return "pytest"
+    if "unittest" in lower:
+        return "unittest"
+    if "tox" in lower:
+        return "tox"
+    if "nox" in lower:
+        return "nox"
+    return "unknown"

LLM Judgemuch worse

Overall Δ: -1.30
Confidence: 0.91

Candidate implements the core environment-plan feature but misses parts of the gold contract, weakens determinism in several heuristics, and omits the regression tests added by the reference patch.

Task Fitworse Δ-1

It adds `environment_plan` generation and an `EnvPlan` model, but it does not match the full gold-facing API because `repogauge.validation.__init__` fails to export `build_environment_plan`. It also diverges from the expected deterministic defaults and selection rules, so it addresses the task only partially relative to the gold patch.

Correctness Safetymuch worse Δ-2

Several behaviors differ materially from the gold implementation: default Python falls back to `3.10` instead of `3.11`; test-runner selection depends on first command order instead of deterministic priority; pytest dependency injection is conditional on pip-like install hints for some strategies, unlike the gold logic; and the function accepts only `dict[str, Any]` rather than coercing arbitrary profile shapes safely. These differences increase regression risk and make the plan less reliably consumable by later deterministic stages.

Maintainabilityworse Δ-1

The candidate is readable, but it reimplements serialization with `dataclasses.asdict(...)` at the call site instead of keeping a `to_dict()` method on `EnvPlan`, omits the explicit package export for `build_environment_plan`, and hard-codes more ad hoc branching than the gold helper decomposition. The gold patch is more cohesive and easier to extend safely.

Test Qualitymuch worse Δ-2

The candidate patch includes no tests, while the gold patch adds targeted unit coverage for plan construction, package export/import behavior, and repository inspection integration. That leaves important regressions around defaults, provenance, and deterministic selection unguarded.

Change Focussame Δ+0

The patch stays focused on the environment-detection feature area and does not introduce obvious unrelated churn. Its scope discipline is comparable to the gold patch even though it is less complete.

s1liconcow__repogauge-rg-aac6186e81ce-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Add deterministic pytest attempt ladder - Production changes: repogauge/validation/validate.py. Test changes: tests/unit/test_validate.py. - Reproduction: run tests impacted by this change (189 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 3.0m
Spend: $0.43
Tokens: 11.7K
Tool Calls: 18
Patch Size: 4,378 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/validation/validate.py b/repogauge/validation/validate.py
index 2c697c0..b58261c 100644
--- a/repogauge/validation/validate.py
+++ b/repogauge/validation/validate.py
@@ -79,6 +79,22 @@ def _resolve_test_cmd(test_cmd_base: str) -> List[str]:
     return parts
 
 
+def _pytest_command_attempts(test_cmd_base: str) -> List[List[str]]:
+    """Return an ordered list of command argv prefixes to try for running pytest.
+
+    The first entry is the command as parsed; the second is a fallback that
+    invokes pytest via the current Python interpreter (``python -m pytest``).
+    The fallback is omitted when the original command already uses a Python
+    interpreter directly.
+    """
+    parts = shlex.split(test_cmd_base) if test_cmd_base.strip() else ["pytest"]
+    attempts: List[List[str]] = [parts]
+    if parts and not re.match(r"^python3?(\.\d+)?$", parts[0]):
+        rest = parts[1:]
+        attempts.append([sys.executable, "-m", "pytest"] + rest)
+    return attempts
+
+
 def _run_pytest(
     worktree: Path,
     *,
@@ -86,35 +102,43 @@ def _run_pytest(
     junit_xml: Path,
     timeout_seconds: int = 120,
     test_cmd_base: str = "python -m pytest",
-) -> Tuple[Dict[str, str], str]:
-    """Run pytest in *worktree*, return (results_dict, raw_output).
+) -> Tuple[Dict[str, str], str, List[Dict[str, Any]]]:
+    """Run pytest in *worktree*, return (results_dict, raw_output, attempts).
+
+    Tries each command returned by ``_pytest_command_attempts`` in order,
+    stopping as soon as a parseable JUnit XML is produced.  ``attempts`` is a
+    list of per-attempt dicts with at least a ``"status"`` key
+    (``"success"`` | ``"parse_error"``).
 
-    ``results_dict`` maps test_id → outcome string, empty if XML missing/malformed.
-    ``raw_output`` is the combined stdout+stderr for log purposes.
     ``test_cmd_base`` is taken from the adapter spec when available.
     """
     env = {**os.environ, "PYTHONPATH": str(worktree)}
-    cmd = (
-        _resolve_test_cmd(test_cmd_base)
-        + [
-            "--tb=no",
-            "-q",
-            f"--junit-xml={junit_xml}",
-        ]
-        + (test_files if test_files else [])
-    )
-
-    result = run_command(
-        cmd, cwd=str(worktree), env=env, timeout_seconds=timeout_seconds
+    base_attempts = _pytest_command_attempts(test_cmd_base)
+    attempts_log: List[Dict[str, Any]] = []
+    raw = ""
+
+    for base_cmd in base_attempts:
+        cmd = (
+            base_cmd
+            + ["--tb=no", "-q", f"--junit-xml={junit_xml}"]
+            + (test_files if test_files else [])
+        )
+        result = run_command(cmd, cwd=str(worktree), env=env, timeout_seconds=timeout_seconds)
+        raw = f"[stdout]\n{result.stdout}\n[stderr]\n{result.stderr}"
+
+        if junit_xml.exists():
+            try:
+                outcomes = parse_junit_xml(junit_xml)
+                attempts_log.append({"status": "success", "cmd": list(cmd)})
+                return outcomes, raw, attempts_log
+            except JUnitParseError:
+                attempts_log.append({"status": "parse_error", "cmd": list(cmd)})
+        else:
+            attempts_log.append({"status": "parse_error", "cmd": list(cmd)})
+
+    raise JUnitParseError(
+        f"missing expected junit output after {len(base_attempts)} attempt(s): {junit_xml}"
     )
-    raw = f"[stdout]\n{result.stdout}\n[stderr]\n{result.stderr}"
-
-    if not junit_xml.exists():
-        raise JUnitParseError(f"missing expected junit output: {junit_xml}")
-
-    outcomes = parse_junit_xml(junit_xml)
-
-    return outcomes, raw
 
 
 def _derive_test_lists(
@@ -196,7 +220,7 @@ def _eval_instance(
             if test_patch.strip():
                 apply_patch_text(wt_b.path, test_patch)
             xml_b = tmp / "junit_b.xml"
-            run_b, log_b = _run_pytest(
+            run_b, log_b, _ = _run_pytest(
                 wt_b.path,
                 test_files=test_inputs,
                 junit_xml=xml_b,
@@ -235,7 +259,7 @@ def _eval_instance(
             if pred_patch.strip():
                 apply_patch_text(wt_c.path, pred_patch)
             xml_c = tmp / "junit_c.xml"
-            run_c, log_c = _run_pytest(
+            run_c, log_c, _ = _run_pytest(
                 wt_c.path,
                 test_files=test_inputs,
                 junit_xml=xml_c,

LLM Judgemuch worse

Overall Δ: -1.30
Confidence: 0.94

Candidate implements a basic retry path for pytest, but it is materially less complete and less safe than the gold fix because it omits attempt propagation/telemetry, uses a broader and sometimes incorrect fallback rewrite, and adds no regression tests.

Task Fitworse Δ-1

It partially addresses the deterministic pytest retry behavior, but it does not carry attempt metadata through `_eval_instance`/`run_eval` the way the gold patch does, so the feature is not implemented as completely as the reference.

Correctness Safetymuch worse Δ-2

The fallback command generation is overly broad: any non-`python*` launcher gets rewritten to `python -m pytest ...`, which can corrupt commands like `uv run pytest` or other wrappers. `_run_pytest` also does not remove stale JUnit XML between attempts, creating a risk of parsing leftover results from an earlier run.

Maintainabilityworse Δ-1

The candidate removes structured failure handling present in the gold patch. It returns only minimal attempt logs, keeps using `JUnitParseError` instead of a purpose-built execution exception, and drops the richer attempt details that make failures easier to diagnose.

Test Qualitymuch worse Δ-2

The gold patch adds focused regression tests for both command selection and retry-on-missing-XML behavior. No corresponding candidate tests are shown, so the change is much less protected against regressions.

Change Focussame Δ+0

The candidate stays within the relevant validation codepath and does not introduce obvious unrelated churn. Its scope is narrower than gold, but that comes from missing relevant behavior rather than better discipline.

Harness: resolved
Attempt State: succeeded
Latency: 2.0m
Spend: $0.12
Tokens: 425.0K
Tool Calls: 17
Patch Size: 5,957 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/validation/validate.py b/repogauge/validation/validate.py
index 2c697c0..15b8c7c 100644
--- a/repogauge/validation/validate.py
+++ b/repogauge/validation/validate.py
@@ -79,6 +79,36 @@ def _resolve_test_cmd(test_cmd_base: str) -> List[str]:
     return parts
 
 
+def _pytest_command_attempts(test_cmd_base: str) -> List[List[str]]:
+    """Return deterministic pytest command attempts.
+
+    The first attempt is the requested command, normalized to the current Python
+    interpreter when it starts with a bare ``python`` token.  The fallback is an
+    interpreter-based ``python -m pytest`` invocation that preserves the pytest
+    arguments from the original command.
+    """
+    parts = _resolve_test_cmd(test_cmd_base)
+    if not parts:
+        parts = [sys.executable, "-m", "pytest"]
+
+    attempts: List[List[str]] = [parts]
+
+    if parts[:3] == [sys.executable, "-m", "pytest"]:
+        return attempts
+
+    if parts and parts[0] == "pytest":
+        fallback = [sys.executable, "-m", "pytest", *parts[1:]]
+    elif len(parts) >= 3 and parts[1:3] == ["-m", "pytest"]:
+        fallback = [sys.executable, "-m", "pytest", *parts[3:]]
+    else:
+        fallback = [sys.executable, "-m", "pytest", *parts[1:]]
+
+    if fallback != attempts[0]:
+        attempts.append(fallback)
+
+    return attempts
+
+
 def _run_pytest(
     worktree: Path,
     *,
@@ -86,35 +116,81 @@ def _run_pytest(
     junit_xml: Path,
     timeout_seconds: int = 120,
     test_cmd_base: str = "python -m pytest",
-) -> Tuple[Dict[str, str], str]:
-    """Run pytest in *worktree*, return (results_dict, raw_output).
+) -> Tuple[Dict[str, str], str, List[Dict[str, Any]]]:
+    """Run pytest in *worktree*, return (results_dict, raw_output, attempts).
 
     ``results_dict`` maps test_id → outcome string, empty if XML missing/malformed.
     ``raw_output`` is the combined stdout+stderr for log purposes.
     ``test_cmd_base`` is taken from the adapter spec when available.
     """
     env = {**os.environ, "PYTHONPATH": str(worktree)}
-    cmd = (
-        _resolve_test_cmd(test_cmd_base)
-        + [
-            "--tb=no",
-            "-q",
-            f"--junit-xml={junit_xml}",
-        ]
-        + (test_files if test_files else [])
-    )
+    attempts: List[Dict[str, Any]] = []
+    raw_chunks: List[str] = []
+    last_error: Optional[Exception] = None
+
+    for idx, base_cmd in enumerate(_pytest_command_attempts(test_cmd_base), start=1):
+        cmd = (
+            base_cmd
+            + [
+                "--tb=no",
+                "-q",
+                f"--junit-xml={junit_xml}",
+            ]
+            + (test_files if test_files else [])
+        )
 
-    result = run_command(
-        cmd, cwd=str(worktree), env=env, timeout_seconds=timeout_seconds
-    )
-    raw = f"[stdout]\n{result.stdout}\n[stderr]\n{result.stderr}"
+        if junit_xml.exists():
+            try:
+                junit_xml.unlink()
+            except Exception:
+                pass
 
-    if not junit_xml.exists():
-        raise JUnitParseError(f"missing expected junit output: {junit_xml}")
+        result = run_command(
+            cmd, cwd=str(worktree), env=env, timeout_seconds=timeout_seconds
+        )
+        raw_chunks.append(
+            f"[attempt {idx}]\n[stdout]\n{result.stdout}\n[stderr]\n{result.stderr}"
+        )
 
-    outcomes = parse_junit_xml(junit_xml)
+        if not junit_xml.exists():
+            error = JUnitParseError(f"missing expected junit output: {junit_xml}")
+            attempts.append(
+                {
+                    "command": cmd,
+                    "returncode": result.returncode,
+                    "status": "parse_error",
+                    "error": str(error),
+                }
+            )
+            last_error = error
+            continue
+
+        try:
+            outcomes = parse_junit_xml(junit_xml)
+        except JUnitParseError as exc:
+            attempts.append(
+                {
+                    "command": cmd,
+                    "returncode": result.returncode,
+                    "status": "parse_error",
+                    "error": str(exc),
+                }
+            )
+            last_error = exc
+            continue
+
+        attempts.append(
+            {
+                "command": cmd,
+                "returncode": result.returncode,
+                "status": "success",
+            }
+        )
+        return outcomes, "\n".join(raw_chunks), attempts
 
-    return outcomes, raw
+    if last_error is None:
+        last_error = JUnitParseError("pytest did not produce parseable junit output")
+    raise last_error
 
 
 def _derive_test_lists(
@@ -191,12 +267,13 @@ def _eval_instance(
         wt_b = None
         run_b: Dict[str, str] = {}
         log_b = ""
+        attempts_b: List[Dict[str, Any]] = []
         try:
             wt_b = create_worktree(repo_root, ref=base_commit)
             if test_patch.strip():
                 apply_patch_text(wt_b.path, test_patch)
             xml_b = tmp / "junit_b.xml"
-            run_b, log_b = _run_pytest(
+            run_b, log_b, attempts_b = _run_pytest(
                 wt_b.path,
                 test_files=test_inputs,
                 junit_xml=xml_b,
@@ -228,6 +305,7 @@ def _eval_instance(
         wt_c = None
         run_c: Dict[str, str] = {}
         log_c = ""
+        attempts_c: List[Dict[str, Any]] = []
         try:
             wt_c = create_worktree(repo_root, ref=base_commit)
             if test_patch.strip():
@@ -235,7 +313,7 @@ def _eval_instance(
             if pred_patch.strip():
                 apply_patch_text(wt_c.path, pred_patch)
             xml_c = tmp / "junit_c.xml"
-            run_c, log_c = _run_pytest(
+            run_c, log_c, attempts_c = _run_pytest(
                 wt_c.path,
                 test_files=test_inputs,
                 junit_xml=xml_c,

LLM Judgemuch worse

Overall Δ: -1.30
Confidence: 0.88

Candidate implements a basic retry ladder for pytest, but it is materially less complete than the gold fix: it omits the regression tests, does not preserve attempt metadata through failure/reporting paths, and introduces a broader fallback rewrite that is less safe for nontrivial commands.

Task Fitworse Δ-1

It addresses the core retry behavior in `_run_pytest`, but it stops short of the full task covered by the gold patch. The gold implementation also threads attempt data through `_eval_instance` and `run_eval` and adds the matching regression tests; the candidate only partially updates the call sites and does not complete the reporting surface.

Correctness Safetymuch worse Δ-2

The gold patch introduces `PytestExecutionError` so failed retries still carry structured attempt history; the candidate raises a plain `JUnitParseError`, so downstream error handling cannot recover attempt metadata. Its fallback command rewriting is also less conservative: for commands not starting with bare `pytest` or exact `python -m pytest`, it rewrites by dropping the first token and prepending `python -m pytest`, which can mis-handle wrappers such as `uv run pytest` or other launcher forms. It also omits the gold patch’s additions that return `run_b_attempts`/`run_c_attempts` in result payloads.

Maintainabilityworse Δ-1

The candidate keeps the change localized, but the command-rewrite logic is more ad hoc and harder to reason about than the gold patch’s narrow, explicit fallback. Not introducing a dedicated exception type also makes failure handling less cohesive because callers have no stable structured contract for retry details.

Test Qualitymuch worse Δ-2

The gold patch adds focused regression tests for both the deterministic command ladder and the missing-junit fallback path. No candidate test coverage is shown, so the behavioral change is left largely unpinned relative to the reference fix.

Change Focussame Δ+0

The candidate stays within the intended validation flow and does not add obvious unrelated churn. Its main issue is incompleteness rather than unnecessary scope expansion.

Harness: resolved
Attempt State: succeeded
Latency: 2.0m
Spend: $0.11
Tokens: 282.5K
Tool Calls: 15
Patch Size: 4,555 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/validation/validate.py b/repogauge/validation/validate.py
index 2c697c0..dc3b6dc 100644
--- a/repogauge/validation/validate.py
+++ b/repogauge/validation/validate.py
@@ -79,6 +79,20 @@ def _resolve_test_cmd(test_cmd_base: str) -> List[str]:
     return parts
 
 
+def _pytest_command_attempts(test_cmd_base: str) -> List[List[str]]:
+    """Return a list of pytest command attempts, falling back to interpreter invocation."""
+    parts = (
+        shlex.split(test_cmd_base)
+        if test_cmd_base.strip()
+        else ["python", "-m", "pytest"]
+    )
+    attempts: List[List[str]] = [parts]
+    if parts and not re.match(r"^python3?(\.\d+)?$", parts[0]):
+        fallback = [sys.executable, "-m", "pytest"] + parts[1:]
+        attempts.append(fallback)
+    return attempts
+
+
 def _run_pytest(
     worktree: Path,
     *,
@@ -86,35 +100,71 @@ def _run_pytest(
     junit_xml: Path,
     timeout_seconds: int = 120,
     test_cmd_base: str = "python -m pytest",
-) -> Tuple[Dict[str, str], str]:
-    """Run pytest in *worktree*, return (results_dict, raw_output).
+) -> Tuple[Dict[str, str], str, List[Dict[str, Any]]]:
+    """Run pytest in *worktree*, return (results_dict, raw_output, attempts).
 
     ``results_dict`` maps test_id → outcome string, empty if XML missing/malformed.
     ``raw_output`` is the combined stdout+stderr for log purposes.
     ``test_cmd_base`` is taken from the adapter spec when available.
     """
     env = {**os.environ, "PYTHONPATH": str(worktree)}
-    cmd = (
-        _resolve_test_cmd(test_cmd_base)
-        + [
-            "--tb=no",
-            "-q",
-            f"--junit-xml={junit_xml}",
-        ]
-        + (test_files if test_files else [])
-    )
+    attempts_log: List[Dict[str, Any]] = []
+    last_exc: Optional[Exception] = None
+
+    for attempt_cmd in _pytest_command_attempts(test_cmd_base):
+        cmd = (
+            attempt_cmd
+            + [
+                "--tb=no",
+                "-q",
+                f"--junit-xml={junit_xml}",
+            ]
+            + (test_files if test_files else [])
+        )
 
-    result = run_command(
-        cmd, cwd=str(worktree), env=env, timeout_seconds=timeout_seconds
-    )
-    raw = f"[stdout]\n{result.stdout}\n[stderr]\n{result.stderr}"
+        result = run_command(
+            cmd, cwd=str(worktree), env=env, timeout_seconds=timeout_seconds
+        )
+        raw = f"[stdout]\n{result.stdout}\n[stderr]\n{result.stderr}"
+
+        if not junit_xml.exists():
+            attempts_log.append(
+                {
+                    "command": cmd,
+                    "returncode": result.returncode,
+                    "status": "parse_error",
+                    "raw": raw,
+                }
+            )
+            continue
 
-    if not junit_xml.exists():
-        raise JUnitParseError(f"missing expected junit output: {junit_xml}")
+        try:
+            outcomes = parse_junit_xml(junit_xml)
+        except JUnitParseError as exc:
+            attempts_log.append(
+                {
+                    "command": cmd,
+                    "returncode": result.returncode,
+                    "status": "parse_error",
+                    "raw": raw,
+                }
+            )
+            last_exc = exc
+            continue
 
-    outcomes = parse_junit_xml(junit_xml)
+        attempts_log.append(
+            {
+                "command": cmd,
+                "returncode": result.returncode,
+                "status": "success",
+                "raw": raw,
+            }
+        )
+        return outcomes, raw, attempts_log
 
-    return outcomes, raw
+    if last_exc is not None:
+        raise last_exc
+    raise JUnitParseError(f"missing expected junit output: {junit_xml}")
 
 
 def _derive_test_lists(
@@ -196,7 +246,7 @@ def _eval_instance(
             if test_patch.strip():
                 apply_patch_text(wt_b.path, test_patch)
             xml_b = tmp / "junit_b.xml"
-            run_b, log_b = _run_pytest(
+            run_b, log_b, _ = _run_pytest(
                 wt_b.path,
                 test_files=test_inputs,
                 junit_xml=xml_b,
@@ -235,7 +285,7 @@ def _eval_instance(
             if pred_patch.strip():
                 apply_patch_text(wt_c.path, pred_patch)
             xml_c = tmp / "junit_c.xml"
-            run_c, log_c = _run_pytest(
+            run_c, log_c, _ = _run_pytest(
                 wt_c.path,
                 test_files=test_inputs,
                 junit_xml=xml_c,

LLM Judgeworse

Overall Δ: -1.20
Confidence: 0.89

Candidate fixes the basic pytest fallback path but omits the gold patch's attempt-reporting/error plumbing and adds no regression tests.

Task Fitworse Δ-1

It addresses the core fallback behavior in `_run_pytest`, but it does not match the full task coverage of the gold patch: no `PytestExecutionError`, no propagation of attempt metadata through `_eval_instance`/`run_eval`, and no accompanying regression test file.

Correctness Safetymuch worse Δ-2

The fallback selection is broader and less safe than gold: any non-`python` command gets rewritten to `sys.executable -m pytest`, which can mis-handle non-pytest launchers. It also does not clear an existing JUnit XML file between attempts, so a later retry can incorrectly parse stale output. On failure it raises the original parse/missing-XML error without preserved attempt context.

Maintainabilityworse Δ-1

The candidate duplicates command parsing logic instead of reusing `_resolve_test_cmd`, and its attempt logging is less structured than gold. Omitting the dedicated exception and downstream attempt fields makes future diagnosis harder.

Test Qualitymuch worse Δ-2

Gold adds targeted regression tests for both the attempt ladder and the missing-JUnit fallback path. The candidate adds no tests, so there is no direct protection for the behavior it changed.

Change Focusbetter Δ+1

The edit is narrowly scoped to the immediate fallback logic and avoids extra churn. Relative to gold, it is more minimal, even though that minimalism leaves out useful observability and error-reporting behavior.

s1liconcow__repogauge-rg-d05949a4fd70-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Landing changes for bead oss_repogauge-7o4 - adapter vars - Production changes: repogauge/export/adapter.py. Test changes: tests/unit/test_adapter.py. - Bead oss_repogauge-7o4: Refactor export/adapter.py to use language template vars Context: - Generalize the harness adapter generator in `repogauge/export/adapter.py` so the generated `adapter_<repo>.py` references the right parser, file extension, install command, and test command for the repo's language. - The generated file remains a Python module (the harness consumes Python), but its template variables come from the language adapter rather than being hardcoded. Acceptance: - For an existing Python repo, the generated `adapter_<repo>.py` file is byte-identical to the pre-refactor output (covered by mlang-022). - `_swebench_spec(spec)` for `language="python"` produces the same dict shape as before. - A future non-Python language can be supported by changing only `harness_template_vars()` in its adapter, with no edits to `export/adapter.py`. - `tests/unit/test_adapter.py` continues to pass after mlang-020. --- - Reproduction: run tests impacted by this change (167 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 4.2m
Spend: $0.56
Tokens: 10.8K
Tool Calls: 35
Patch Size: 4,577 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/export/adapter.py b/repogauge/export/adapter.py
index ad85a11..0f52c7d 100644
--- a/repogauge/export/adapter.py
+++ b/repogauge/export/adapter.py
@@ -14,6 +14,25 @@ import re
 from pathlib import Path
 from typing import Any, Dict
 
+from repogauge.lang import find_adapter
+
+
+_DEFAULT_TEMPLATE_VARS: Dict[str, Any] = {
+    "parser_import_module": "repogauge.parsers.junit",
+    "parser_import_name": "parse_repogauge_junit",
+    "parser_name": "junit",
+    "ext": "py",
+    "install_str_join": " && ",
+}
+
+
+def _template_vars_for(language: str, spec: Dict[str, Any]) -> Dict[str, Any]:
+    try:
+        adapter = find_adapter(language)
+        return {**_DEFAULT_TEMPLATE_VARS, **adapter.harness_template_vars(spec)}
+    except KeyError:
+        return dict(_DEFAULT_TEMPLATE_VARS)
+
 
 # ---------------------------------------------------------------------------
 # Spec serialisation
@@ -48,7 +67,7 @@ def build_adapter_spec(
     test_cmd_base = environment_plan.get("test_cmd_base", "python -m pytest")
     strategy_name = environment_plan.get("strategy_name", "")
 
-    return {
+    partial_spec: Dict[str, Any] = {
         "repo": repo_name,
         "version": str(environment_plan.get("version", "0.0.0")),
         "language": language,
@@ -59,10 +78,18 @@ def build_adapter_spec(
         "build": build,
         "test_cmd_base": test_cmd_base,
         "strategy_name": strategy_name,
-        "parser": "junit",
         "docker_specs": {"python_version": python_version},
         "module_name": _safe_module_name(repo_name),
     }
+    tvars = _template_vars_for(language, partial_spec)
+    return {
+        **partial_spec,
+        "parser": tvars["parser_name"],
+        "ext": tvars["ext"],
+        "install_str_join": tvars["install_str_join"],
+        "parser_import_module": tvars["parser_import_module"],
+        "parser_import_name": tvars["parser_import_name"],
+    }
 
 
 # ---------------------------------------------------------------------------
@@ -77,7 +104,7 @@ _ADAPTER_TEMPLATE = '''\
 
 from __future__ import annotations
 
-from repogauge.parsers.junit import parse_repogauge_junit
+from {parser_import_module} import {parser_import_name}
 
 REPO = {repo_repr}
 VERSION = {version_repr}
@@ -89,13 +116,13 @@ PRE_INSTALL = {pre_install_repr}
 INSTALL = {install_repr}
 BUILD = {build_repr}
 TEST_CMD_BASE = {test_cmd_base_repr}
-PARSER = "junit"
+PARSER = {parser_name_repr}
 STRATEGY_NAME = {strategy_name_repr}
 DOCKER_SPECS = {docker_specs_repr}
 
 MAP_REPO_TO_EXT = {map_repo_to_ext_repr}
 MAP_REPO_VERSION_TO_SPECS = {map_repo_version_specs_repr}
-MAP_REPO_TO_PARSER = {{REPO: parse_repogauge_junit}}
+MAP_REPO_TO_PARSER = {{REPO: {parser_import_name}}}
 
 
 def get_spec() -> dict:
@@ -135,8 +162,9 @@ def registration_context() -> dict:
 def _swebench_spec(spec: Dict[str, Any]) -> Dict[str, Any]:
     """Convert our internal spec to the key names swebench 4.x expects."""
     install_cmds = spec.get("install", [])
+    join_str = spec.get("install_str_join", " && ")
     if isinstance(install_cmds, list):
-        install_str = " && ".join(install_cmds) if install_cmds else "pip install -e ."
+        install_str = join_str.join(install_cmds) if install_cmds else "pip install -e ."
     else:
         install_str = install_cmds or "pip install -e ."
 
@@ -176,7 +204,10 @@ def _render_adapter(spec: Dict[str, Any]) -> str:
         test_cmd_base_repr=repr(spec["test_cmd_base"]),
         strategy_name_repr=repr(spec["strategy_name"]),
         docker_specs_repr=repr(spec["docker_specs"]),
-        map_repo_to_ext_repr=repr({spec["repo"]: "py"}),
+        parser_import_module=spec["parser_import_module"],
+        parser_import_name=spec["parser_import_name"],
+        parser_name_repr=repr(spec["parser"]),
+        map_repo_to_ext_repr=repr({spec["repo"]: spec["ext"]}),
         map_repo_version_specs_repr=repr(
             {
                 spec["repo"]: {
diff --git a/repogauge/lang/python.py b/repogauge/lang/python.py
index c06c0dc..d4962fa 100644
--- a/repogauge/lang/python.py
+++ b/repogauge/lang/python.py
@@ -619,7 +619,8 @@ class PythonAdapter:
 
     def harness_template_vars(self, spec: dict[str, Any]) -> dict[str, Any]:
         return {
-            "parser_import": "repogauge.parsers.junit.parse_repogauge_junit",
+            "parser_import_module": "repogauge.parsers.junit",
+            "parser_import_name": "parse_repogauge_junit",
             "parser_name": "junit",
             "ext": "py",
             "install_str_join": " && ",

LLM Judgeworse

Overall Δ: -1.00
Confidence: 0.93

Candidate captures the main idea but misses the gold patch's compatibility guarantees and regression coverage, with a likely byte-level output mismatch for Python adapters.

Task Fitworse Δ-1

It generalizes template variables, but it does not match the gold patch's acceptance target for Python byte-identical output: `PARSER` is rendered via `repr(spec["parser"])`, which produces single quotes (`'junit'`) instead of the previous double-quoted form. It also narrows the expected adapter contract to split import fields rather than supporting the gold patch's more flexible import forms.

Correctness Safetyworse Δ-1

The candidate lacks the gold patch's normalization and fallback handling for `parser_import`, `parser_import_module`, and `parser_import_name`, so adapters using only a combined import path would not work. It also omits validation that `harness_template_vars()` returns a mapping, increasing failure risk and making non-Python support less robust.

Maintainabilityworse Δ-1

`_DEFAULT_TEMPLATE_VARS` and `_template_vars_for()` are straightforward, but the change introduces an API mismatch by updating `repogauge/lang/python.py` instead of keeping `export/adapter.py` tolerant of existing or alternate adapter shapes. The gold patch is more maintainable because it centralizes compatibility logic in one place.

Test Qualityworse Δ-1

The candidate adds no regression test coverage for language-provided template vars, custom parser imports, extension propagation, or install-command joining. The gold patch adds a focused unit test that directly proves the new extension point works.

Change Focusworse Δ-1

The extra change in `repogauge/lang/python.py` expands scope beyond the core refactor in `export/adapter.py`. Gold keeps compatibility handling localized there, whereas the candidate changes adapter-side API expectations without corresponding tests.

Harness: resolved
Attempt State: succeeded
Latency: 3.5m
Spend: $0.20
Tokens: 1.0M
Tool Calls: 30
Patch Size: 5,492 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/export/adapter.py b/repogauge/export/adapter.py
index ad85a11..0555c1a 100644
--- a/repogauge/export/adapter.py
+++ b/repogauge/export/adapter.py
@@ -14,6 +14,8 @@ import re
 from pathlib import Path
 from typing import Any, Dict
 
+from repogauge.lang import find_adapter
+
 
 # ---------------------------------------------------------------------------
 # Spec serialisation
@@ -30,6 +32,34 @@ def _safe_module_name(repo: str) -> str:
     return sanitized
 
 
+def _normalize_harness_template_vars(
+    template_vars: Dict[str, Any]
+) -> Dict[str, str]:
+    """Normalize adapter template variables to the renderer's expected keys."""
+    parser_import_module = template_vars.get("parser_import_module")
+    parser_import_name = template_vars.get("parser_import_name")
+    parser_import = template_vars.get("parser_import")
+
+    if (not parser_import_module or not parser_import_name) and parser_import:
+        module_name, dot, function_name = str(parser_import).rpartition(".")
+        if dot and module_name and function_name:
+            parser_import_module = parser_import_module or module_name
+            parser_import_name = parser_import_name or function_name
+
+    if not parser_import_module or not parser_import_name:
+        raise KeyError(
+            "harness_template_vars() must provide parser_import_module and parser_import_name"
+        )
+
+    return {
+        "parser_import_module": str(parser_import_module),
+        "parser_import_name": str(parser_import_name),
+        "parser_name": str(template_vars.get("parser_name", "junit") or "junit"),
+        "ext": str(template_vars.get("ext", "py") or "py"),
+        "install_str_join": str(template_vars.get("install_str_join", " && ") or " && "),
+    }
+
+
 def build_adapter_spec(
     repo_name: str, environment_plan: Dict[str, Any]
 ) -> Dict[str, Any]:
@@ -48,7 +78,7 @@ def build_adapter_spec(
     test_cmd_base = environment_plan.get("test_cmd_base", "python -m pytest")
     strategy_name = environment_plan.get("strategy_name", "")
 
-    return {
+    base_spec = {
         "repo": repo_name,
         "version": str(environment_plan.get("version", "0.0.0")),
         "language": language,
@@ -59,11 +89,28 @@ def build_adapter_spec(
         "build": build,
         "test_cmd_base": test_cmd_base,
         "strategy_name": strategy_name,
-        "parser": "junit",
         "docker_specs": {"python_version": python_version},
         "module_name": _safe_module_name(repo_name),
     }
 
+    adapter = find_adapter(language)
+    template_vars = _normalize_harness_template_vars(adapter.harness_template_vars(base_spec))
+    parser_name = str(template_vars.get("parser_name", "junit") or "junit")
+    ext = str(template_vars.get("ext", "py") or "py")
+    install_str_join = str(template_vars.get("install_str_join", " && ") or " && ")
+    parser_import_module = str(template_vars["parser_import_module"])
+    parser_import_name = str(template_vars["parser_import_name"])
+
+    return {
+        **base_spec,
+        "parser": parser_name,
+        "parser_import_module": parser_import_module,
+        "parser_import_name": parser_import_name,
+        "parser_import": f"{parser_import_module}.{parser_import_name}",
+        "ext": ext,
+        "install_str_join": install_str_join,
+    }
+
 
 # ---------------------------------------------------------------------------
 # Code generation
@@ -77,7 +124,7 @@ _ADAPTER_TEMPLATE = '''\
 
 from __future__ import annotations
 
-from repogauge.parsers.junit import parse_repogauge_junit
+from {parser_import_module} import {parser_import_name}
 
 REPO = {repo_repr}
 VERSION = {version_repr}
@@ -89,13 +136,13 @@ PRE_INSTALL = {pre_install_repr}
 INSTALL = {install_repr}
 BUILD = {build_repr}
 TEST_CMD_BASE = {test_cmd_base_repr}
-PARSER = "junit"
+PARSER = {parser_repr}
 STRATEGY_NAME = {strategy_name_repr}
 DOCKER_SPECS = {docker_specs_repr}
 
 MAP_REPO_TO_EXT = {map_repo_to_ext_repr}
 MAP_REPO_VERSION_TO_SPECS = {map_repo_version_specs_repr}
-MAP_REPO_TO_PARSER = {{REPO: parse_repogauge_junit}}
+MAP_REPO_TO_PARSER = {{REPO: {parser_import_name}}}
 
 
 def get_spec() -> dict:
@@ -135,8 +182,9 @@ def registration_context() -> dict:
 def _swebench_spec(spec: Dict[str, Any]) -> Dict[str, Any]:
     """Convert our internal spec to the key names swebench 4.x expects."""
     install_cmds = spec.get("install", [])
+    install_join = str(spec.get("install_str_join", " && ") or " && ")
     if isinstance(install_cmds, list):
-        install_str = " && ".join(install_cmds) if install_cmds else "pip install -e ."
+        install_str = install_join.join(install_cmds) if install_cmds else "pip install -e ."
     else:
         install_str = install_cmds or "pip install -e ."
 
@@ -174,9 +222,12 @@ def _render_adapter(spec: Dict[str, Any]) -> str:
         install_repr=repr(spec["install"]),
         build_repr=repr(spec["build"]),
         test_cmd_base_repr=repr(spec["test_cmd_base"]),
+        parser_repr=repr(spec["parser"]),
         strategy_name_repr=repr(spec["strategy_name"]),
         docker_specs_repr=repr(spec["docker_specs"]),
-        map_repo_to_ext_repr=repr({spec["repo"]: "py"}),
+        parser_import_module=spec["parser_import_module"],
+        parser_import_name=spec["parser_import_name"],
+        map_repo_to_ext_repr=repr({spec["repo"]: spec["ext"]}),
         map_repo_version_specs_repr=repr(
             {
                 spec["repo"]: {

LLM Judgemuch worse

Overall Δ: -1.30
Confidence: 0.88

Candidate implements most of the adapter templating refactor, but it misses the gold patch's backward-compatibility safeguards and omits the regression coverage added by the reference solution.

Task Fitworse Δ-1

It generalizes parser/ext/install handling in the main code path, but it does not match the gold patch's compatibility behavior. In particular, it calls `harness_template_vars()` with a spec that lacks the legacy `parser` field and requires parser import metadata more aggressively than the gold implementation.

Correctness Safetymuch worse Δ-2

The gold patch preserves Python behavior with fallbacks when parser import fields are absent and validates only that template vars are a mapping. The candidate instead raises if `parser_import_module`/`parser_import_name` cannot be derived, and it drops `parser` from the pre-template spec entirely. Those differences increase regression risk for existing adapters and make the refactor less backward-compatible than the gold fix.

Maintainabilityworse Δ-1

The added normalization helper is reasonably structured, but it hardcodes stricter requirements than necessary and duplicates fallback logic that the gold patch keeps closer to render/spec assembly. That makes the behavior less flexible and slightly harder to evolve safely.

Test Qualitymuch worse Δ-2

The candidate includes no test updates, while the gold patch adds a targeted regression test proving that language adapter template vars control parser import, extension, and install-command joining for a non-Python adapter.

Change Focussame Δ+0

The change stays within the intended production file and remains scoped to the adapter templating task, with no obvious unrelated churn.

Harness: resolved
Attempt State: succeeded
Latency: 2.9m
Spend: $0.21
Tokens: 738.2K
Tool Calls: 28
Patch Size: 4,955 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/export/adapter.py b/repogauge/export/adapter.py
index ad85a11..5a08882 100644
--- a/repogauge/export/adapter.py
+++ b/repogauge/export/adapter.py
@@ -48,7 +48,7 @@ def build_adapter_spec(
     test_cmd_base = environment_plan.get("test_cmd_base", "python -m pytest")
     strategy_name = environment_plan.get("strategy_name", "")
 
-    return {
+    spec: Dict[str, Any] = {
         "repo": repo_name,
         "version": str(environment_plan.get("version", "0.0.0")),
         "language": language,
@@ -64,6 +64,20 @@ def build_adapter_spec(
         "module_name": _safe_module_name(repo_name),
     }
 
+    try:
+        from repogauge.lang import find_adapter
+
+        adapter = find_adapter(language)
+        template_vars = adapter.harness_template_vars(spec)
+        spec["parser"] = template_vars.get("parser_name", spec["parser"])
+        spec["ext"] = template_vars.get("ext", "py")
+        spec["install_str_join"] = template_vars.get("install_str_join", " && ")
+    except (ImportError, KeyError):
+        spec["ext"] = "py"
+        spec["install_str_join"] = " && "
+
+    return spec
+
 
 # ---------------------------------------------------------------------------
 # Code generation
@@ -77,7 +91,7 @@ _ADAPTER_TEMPLATE = '''\
 
 from __future__ import annotations
 
-from repogauge.parsers.junit import parse_repogauge_junit
+{parser_import_line}
 
 REPO = {repo_repr}
 VERSION = {version_repr}
@@ -89,13 +103,13 @@ PRE_INSTALL = {pre_install_repr}
 INSTALL = {install_repr}
 BUILD = {build_repr}
 TEST_CMD_BASE = {test_cmd_base_repr}
-PARSER = "junit"
+PARSER = {parser_name_repr}
 STRATEGY_NAME = {strategy_name_repr}
 DOCKER_SPECS = {docker_specs_repr}
 
 MAP_REPO_TO_EXT = {map_repo_to_ext_repr}
 MAP_REPO_VERSION_TO_SPECS = {map_repo_version_specs_repr}
-MAP_REPO_TO_PARSER = {{REPO: parse_repogauge_junit}}
+MAP_REPO_TO_PARSER = {{REPO: {parser_ref}}}
 
 
 def get_spec() -> dict:
@@ -135,8 +149,9 @@ def registration_context() -> dict:
 def _swebench_spec(spec: Dict[str, Any]) -> Dict[str, Any]:
     """Convert our internal spec to the key names swebench 4.x expects."""
     install_cmds = spec.get("install", [])
+    install_str_join = spec.get("install_str_join", " && ")
     if isinstance(install_cmds, list):
-        install_str = " && ".join(install_cmds) if install_cmds else "pip install -e ."
+        install_str = install_str_join.join(install_cmds) if install_cmds else "pip install -e ."
     else:
         install_str = install_cmds or "pip install -e ."
 
@@ -162,6 +177,42 @@ def _swebench_spec(spec: Dict[str, Any]) -> Dict[str, Any]:
 
 
 def _render_adapter(spec: Dict[str, Any]) -> str:
+    language = spec.get("language", "python")
+    try:
+        from repogauge.lang import find_adapter
+
+        adapter = find_adapter(language)
+        template_vars = adapter.harness_template_vars(spec)
+    except (ImportError, KeyError):
+        template_vars = {
+            "parser_import": "repogauge.parsers.junit.parse_repogauge_junit",
+            "parser_name": "junit",
+            "ext": "py",
+            "install_str_join": " && ",
+        }
+
+    # Build parser import line and reference name for the template.
+    if "parser_import" in template_vars:
+        import_path = template_vars["parser_import"]
+        if "." in import_path:
+            module_path, import_name = import_path.rsplit(".", 1)
+            parser_import_line = f"from {module_path} import {import_name}"
+            parser_ref = import_name
+        else:
+            parser_import_line = f"import {import_path}"
+            parser_ref = import_path
+    elif "parser_import_module" in template_vars and "parser_import_name" in template_vars:
+        parser_import_line = (
+            f"from {template_vars['parser_import_module']} import {template_vars['parser_import_name']}"
+        )
+        parser_ref = template_vars["parser_import_name"]
+    else:
+        parser_import_line = "from repogauge.parsers.junit import parse_repogauge_junit"
+        parser_ref = "parse_repogauge_junit"
+
+    parser_name = template_vars.get("parser_name", "junit")
+    ext = template_vars.get("ext", "py")
+
     return _ADAPTER_TEMPLATE.format(
         repo=spec["repo"],
         repo_repr=repr(spec["repo"]),
@@ -174,9 +225,12 @@ def _render_adapter(spec: Dict[str, Any]) -> str:
         install_repr=repr(spec["install"]),
         build_repr=repr(spec["build"]),
         test_cmd_base_repr=repr(spec["test_cmd_base"]),
+        parser_name_repr=json.dumps(parser_name),
         strategy_name_repr=repr(spec["strategy_name"]),
         docker_specs_repr=repr(spec["docker_specs"]),
-        map_repo_to_ext_repr=repr({spec["repo"]: "py"}),
+        parser_import_line=parser_import_line,
+        parser_ref=parser_ref,
+        map_repo_to_ext_repr=repr({spec["repo"]: ext}),
         map_repo_version_specs_repr=repr(
             {
                 spec["repo"]: {

LLM Judgeworse

Overall Δ: -1.05
Confidence: 0.90

Candidate covers the main adapter templating path but is weaker than the gold patch on robustness and regression protection, mainly due to missing tests and looser handling of adapter template vars.

Task Fitworse Δ-1

It generalizes parser, extension, and install join behavior, so it addresses the core task, but it does not fully normalize/store parser import metadata in the spec the way the gold patch does. That makes future language support depend partly on extra logic in `_render_adapter` rather than solely on `harness_template_vars()` outputs.

Correctness Safetyworse Δ-1

The candidate swallows `ImportError` and `KeyError` and silently falls back to Python defaults, which can mask configuration mistakes for non-Python adapters. It also does not validate that `harness_template_vars()` returns a mapping, and it recomputes template vars in `_render_adapter` instead of relying on a normalized spec. Those choices increase regression risk versus the gold patch's stricter, more explicit handling.

Maintainabilityworse Δ-1

The gold patch centralizes adapter-derived values into the spec and keeps `_render_adapter` simpler. The candidate duplicates adapter lookup/template-var handling across `build_adapter_spec` and `_render_adapter`, mixes fallback behavior into both places, and leaves parser import normalization more ad hoc, which is harder to reason about long term.

Test Qualitymuch worse Δ-2

The provided candidate diff includes no test updates, while the gold patch adds a targeted regression test proving that language adapter template vars drive parser import, parser name, extension, and install string behavior for a non-Python adapter. This is a significant gap relative to the gold solution.

Change Focussame Δ+0

The candidate stays within the intended files and problem area, with no obvious unrelated churn. The main issue is completeness and rigor, not scope discipline.

s1liconcow__repogauge-rg-d1ea78738817-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Landing changes for bead oss_repogauge-unb - problem statement synthesis - Production changes: repogauge/export/materialize.py, repogauge/mining/synthesize.py. Test changes: tests/unit/test_materialize.py, tests/unit/test_synthesize.py. - Bead oss_repogauge-unb: Synthesize issue-style problem statements with provenance Context: - Generate the `problem_statement` field for each dataset instance. - Follow the priority order: 1. linked issue title/body 2. PR title/body 3. commit message plus diff summary 4. LLM-generated issue-style restatement Acceptance: - Every exportable instance has a non-empty problem statement. - Provenance for the statement is stored. - The statement remains useful with no model access. --- - Reproduction: run tests impacted by this change (334 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 4.5m
Spend: $0.65
Tokens: 17.9K
Tool Calls: 20
Patch Size: 9,101 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..19ec268
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,54 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+Current release state is scaffolded and in active development.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..a52bf6f
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,34 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/llm.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/export/split_patch.py
+repogauge/mining/__init__.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/mining/signature.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/export/materialize.py b/repogauge/export/materialize.py
index 50d9268..0a19e00 100644
--- a/repogauge/export/materialize.py
+++ b/repogauge/export/materialize.py
@@ -8,6 +8,7 @@ from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
 
 from repogauge.export.split_patch import PatchSplitError, split_prod_and_test
+from repogauge.mining.synthesize import synthesize_problem_statement
 from repogauge.utils.git import extract_commit_diff, get_repo_root, list_commit_parents
 
 
@@ -27,6 +28,7 @@ class MaterializedItem:
     status: str = "ready"
     reason: Optional[str] = None
     metadata: Dict[str, Any] | None = None
+    problem_statement: str = ""
 
     def to_dict(self) -> Dict[str, Any]:
         return {
@@ -37,6 +39,7 @@ class MaterializedItem:
             "status": self.status,
             "reason": self.reason,
             "patch": self.patch,
+            "problem_statement": self.problem_statement,
             "test_patch": self.test_patch,
             "prod_patch": self.prod_patch,
             "metadata": self.metadata or {},
@@ -250,6 +253,8 @@ def _materialize_candidate(
         )
 
     materialized_metadata = _extract_candidate_metadata(row, patch, base_commit)
+    problem_statement, ps_source, _ = synthesize_problem_statement(row, patch=patch)
+    materialized_metadata["problem_statement_source"] = ps_source
     item = MaterializedItem(
         candidate_id=candidate_id,
         repo=repo,
@@ -261,6 +266,7 @@ def _materialize_candidate(
         status="ready",
         reason=None,
         metadata=materialized_metadata,
+        problem_statement=problem_statement,
     )
     return item, None
 
diff --git a/repogauge/mining/synthesize.py b/repogauge/mining/synthesize.py
new file mode 100644
index 0000000..1f702d3
--- /dev/null
+++ b/repogauge/mining/synthesize.py
@@ -0,0 +1,86 @@
+"""Problem statement synthesis for dataset instances."""
+
+from __future__ import annotations
+
+from typing import Any, Dict, List, Optional, Tuple
+
+
+def synthesize_problem_statement(
+    record: Dict[str, Any],
+    patch: str = "",
+) -> Tuple[str, str, Optional[str]]:
+    """Return (statement, source, source_ref) for a candidate record.
+
+    Priority order:
+      1. linked issue title/body
+      2. PR title/body
+      3. commit message plus diff summary
+      4. LLM-generated advisory (only when commit subject is too weak)
+    """
+    metadata = record.get("metadata") or {}
+
+    def _get(key: str) -> Any:
+        val = record.get(key)
+        if val is not None:
+            return val
+        return metadata.get(key)
+
+    # 1. Linked issue
+    issue_title = _get("issue_title")
+    if issue_title:
+        issue_body = _get("issue_body") or ""
+        issue_refs: List[Any] = _get("issue_refs") or []
+        parts = [str(issue_title)]
+        if issue_body:
+            parts.append(str(issue_body))
+        source_ref = str(issue_refs[0]) if issue_refs else None
+        return "\n\n".join(parts), "linked_issue", source_ref
+
+    # 2. Pull request
+    pr_title = _get("pr_title")
+    if pr_title:
+        pr_body = _get("pr_body") or ""
+        parts = [str(pr_title)]
+        if pr_body:
+            parts.append(str(pr_body))
+        return "\n\n".join(parts), "pull_request", None
+
+    # 3. Commit message — weak single-word subjects fall through to LLM advisory
+    source_subject = _get("source_subject")
+    source_body = _get("source_body")
+
+    is_weak = bool(source_subject) and len(str(source_subject).strip().split()) <= 1
+
+    if is_weak:
+        llm_advisory = metadata.get("llm_advisory") or {}
+        llm_problem = llm_advisory.get("problem_statement")
+        if llm_problem:
+            llm_model = metadata.get("llm_model")
+            return str(llm_problem), "llm_advisory", str(llm_model) if llm_model else None
+
+    # 4. Generate commit-based statement
+    file_roles: Dict[str, Any] = metadata.get("file_roles") or {}
+    prod_files: List[str] = file_roles.get("prod") or []
+    test_files: List[str] = file_roles.get("test") or []
+    total_changed_lines = metadata.get("total_changed_lines") or len(patch.splitlines())
+
+    lines = ["Observed behavior"]
+    if source_subject and not is_weak:
+        lines.append(str(source_subject))
+    if source_body:
+        lines.append(str(source_body))
+
+    if prod_files or test_files:
+        prod_str = ", ".join(prod_files)
+        test_str = ", ".join(test_files)
+        if prod_str and test_str:
+            lines.append(f"- Production changes: {prod_str}. Test changes: {test_str}.")
+        elif prod_str:
+            lines.append(f"- Production changes: {prod_str}.")
+        else:
+            lines.append(f"- Test changes: {test_str}.")
+
+    lines.append(f"- Reproduction: run tests impacted by this change ({total_changed_lines} changed lines).")
+    lines.append("- Expected behavior: the update should make the corresponding regression test pass.")
+
+    return "\n".join(lines), "commit", None
diff --git a/uv.lock b/uv.lock
new file mode 100644
index 0000000..4e6f117
--- /dev/null
+++ b/uv.lock
@@ -0,0 +1,8 @@
+version = 1
+revision = 3
+requires-python = ">=3.10"
+
+[[package]]
+name = "repogauge"
+version = "0.0.0"
+source = { editable = "." }

LLM Judgemuch worse

Overall Δ: -1.50
Confidence: 0.92

Candidate implements basic problem-statement synthesis but misses provenance detail, handles fewer source shapes than the gold patch, and includes unrelated packaging churn without adding the regression tests.

Task Fitworse Δ-1

It adds `problem_statement` generation and records a source label, but it does not persist `problem_statement_source_ref`, so it falls short of the stated provenance requirement handled by the gold patch. Its source selection is also narrower because issue and PR fallback only trigger when titles exist.

Correctness Safetymuch worse Δ-2

The synthesis logic is materially less robust than gold: it ignores body-only issue/PR data, does not check alternate field names like `linked_issue_title` or `pull_request_title`, returns no commit provenance ref, and uses a much weaker heuristic for when LLM fallback should override commit text. That increases risk of empty or low-quality statements on real rows the gold patch covers.

Maintainabilityworse Δ-1

The candidate is shorter, but it achieves that by omitting normalization helpers and compatibility paths the gold patch uses to centralize coercion and source lookup. It also adds unrelated generated metadata files, which makes the change harder to reason about and maintain.

Test Qualitymuch worse Δ-2

The gold patch adds focused unit coverage for materialization and synthesis precedence. The candidate patch adds no tests, so the regression is not directly protected and important edge cases are unverified.

Change Focusmuch worse Δ-2

It includes unrelated generated files (`repogauge.egg-info/*`, `uv.lock`) that are outside the task. The gold patch stays tightly scoped to the production code and tests needed for the feature.

Harness: resolved
Attempt State: succeeded
Latency: 2.9m
Spend: $0.17
Tokens: 811.6K
Tool Calls: 32
Patch Size: 12,182 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/export/materialize.py b/repogauge/export/materialize.py
index 50d9268..8c49cb9 100644
--- a/repogauge/export/materialize.py
+++ b/repogauge/export/materialize.py
@@ -8,6 +8,7 @@ from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
 
 from repogauge.export.split_patch import PatchSplitError, split_prod_and_test
+from repogauge.mining.synthesize import synthesize_problem_statement
 from repogauge.utils.git import extract_commit_diff, get_repo_root, list_commit_parents
 
 
@@ -21,9 +22,10 @@ class MaterializedItem:
     repo: str
     commit: str
     base_commit: str
-    patch: str
-    test_patch: str
-    prod_patch: str
+    problem_statement: str = ""
+    patch: str = ""
+    test_patch: str = ""
+    prod_patch: str = ""
     status: str = "ready"
     reason: Optional[str] = None
     metadata: Dict[str, Any] | None = None
@@ -34,6 +36,7 @@ class MaterializedItem:
             "repo": self.repo,
             "commit": self.commit,
             "base_commit": self.base_commit,
+            "problem_statement": self.problem_statement,
             "status": self.status,
             "reason": self.reason,
             "patch": self.patch,
@@ -101,6 +104,41 @@ def _extract_candidate_metadata(row: Dict[str, Any], patch: str, base_commit: st
     return metadata
 
 
+def _synthesize_problem_statement(row: Dict[str, Any], repo_root: Path, patch: str) -> tuple[str, str, str]:
+    metadata = dict(row.get("metadata", {}))
+    enriched: Dict[str, Any] = dict(row)
+    enriched.setdefault("metadata", metadata)
+    if "source_subject" not in enriched:
+        enriched["source_subject"] = metadata.get("source_subject")
+    if "source_body" not in enriched:
+        enriched["source_body"] = metadata.get("source_body")
+    if "issue_title" not in enriched:
+        enriched["issue_title"] = metadata.get("issue_title")
+    if "issue_body" not in enriched:
+        enriched["issue_body"] = metadata.get("issue_body")
+    if "issue_refs" not in enriched:
+        enriched["issue_refs"] = metadata.get("issue_refs", [])
+    if "pr_title" not in enriched:
+        enriched["pr_title"] = metadata.get("pr_title")
+    if "pr_body" not in enriched:
+        enriched["pr_body"] = metadata.get("pr_body")
+
+    if not enriched.get("source_subject") and row.get("commit"):
+        try:
+            from repogauge.utils.git import get_commit_message
+
+            subject, body = get_commit_message(repo_root, str(row["commit"]))
+        except Exception:
+            subject, body = "", ""
+        enriched["source_subject"] = subject
+        enriched["source_body"] = body
+        enriched.setdefault("metadata", {})["source_subject"] = subject
+        enriched.setdefault("metadata", {})["source_body"] = body
+
+    statement, source, source_ref = synthesize_problem_statement(enriched, patch)
+    return statement, source, source_ref
+
+
 def _materialize_candidate(
     repo_root: Path,
     row: Dict[str, Any],
@@ -249,12 +287,20 @@ def _materialize_candidate(
             metadata={"reason": "test split is empty", "split_meta": split_meta},
         )
 
+    problem_statement, problem_statement_source, problem_statement_source_ref = _synthesize_problem_statement(row, repo_root, patch)
     materialized_metadata = _extract_candidate_metadata(row, patch, base_commit)
+    materialized_metadata.update(
+        {
+            "problem_statement_source": problem_statement_source,
+            "problem_statement_source_ref": problem_statement_source_ref,
+        }
+    )
     item = MaterializedItem(
         candidate_id=candidate_id,
         repo=repo,
         commit=commit,
         base_commit=base_commit,
+        problem_statement=problem_statement,
         patch=patch,
         test_patch=test_patch,
         prod_patch=prod_patch,
diff --git a/repogauge/mining/__init__.py b/repogauge/mining/__init__.py
index 231521f..a71ba14 100644
--- a/repogauge/mining/__init__.py
+++ b/repogauge/mining/__init__.py
@@ -3,10 +3,18 @@
 Deterministic repository inspection and related discovery utilities live here.
 """
 
-from .inspect import inspect_repository
 from .file_roles import FileRole, FileRoleClassification, classify_file, classify_files
 from .scan import scan_repository
 from .score import ScoredCommit, score_scan_commit
+from .synthesize import synthesize_problem_statement
+
+try:
+    from .inspect import inspect_repository
+except ModuleNotFoundError as exc:  # pragma: no cover - optional dependency fallback
+    _inspect_import_error = exc
+
+    def inspect_repository(*args, **kwargs):  # type: ignore[no-redef]
+        raise ImportError("inspect_repository is unavailable because an optional dependency could not be imported") from _inspect_import_error
 
 __all__ = [
     "inspect_repository",
@@ -17,4 +25,5 @@ __all__ = [
     "classify_files",
     "ScoredCommit",
     "score_scan_commit",
+    "synthesize_problem_statement",
 ]
diff --git a/repogauge/mining/synthesize.py b/repogauge/mining/synthesize.py
new file mode 100644
index 0000000..a108309
--- /dev/null
+++ b/repogauge/mining/synthesize.py
@@ -0,0 +1,181 @@
+"""Problem statement synthesis helpers."""
+
+from __future__ import annotations
+
+from collections.abc import Mapping
+from typing import Any
+
+
+def _first_string(*values: Any) -> str:
+    for value in values:
+        if isinstance(value, str):
+            candidate = value.strip()
+            if candidate:
+                return candidate
+    return ""
+
+
+def _nested_string(record: Mapping[str, Any], *keys: str) -> str:
+    value: Any = record
+    for key in keys:
+        if not isinstance(value, Mapping):
+            return ""
+        value = value.get(key)
+    return _first_string(value)
+
+
+def _nested_value(record: Mapping[str, Any], *keys: str) -> Any:
+    value: Any = record
+    for key in keys:
+        if not isinstance(value, Mapping):
+            return None
+        value = value.get(key)
+    return value
+
+
+def _issue_reference(record: Mapping[str, Any]) -> str:
+    refs = record.get("issue_refs")
+    if isinstance(refs, list):
+        for ref in refs:
+            if isinstance(ref, str) and ref.strip():
+                return ref.strip()
+            if isinstance(ref, (int, float)):
+                return str(ref)
+    return _first_string(
+        record.get("issue_ref"),
+        record.get("issue_id"),
+        _nested_value(record, "metadata", "issue_ref"),
+        _nested_value(record, "metadata", "issue_id"),
+    )
+
+
+def _file_names_from_patch(patch: str) -> list[str]:
+    names: list[str] = []
+    for line in patch.splitlines():
+        if not line.startswith("diff --git "):
+            continue
+        parts = line.split()
+        if len(parts) >= 4:
+            for candidate in parts[2:4]:
+                if candidate.startswith("a/") or candidate.startswith("b/"):
+                    candidate = candidate[2:]
+                if candidate not in names:
+                    names.append(candidate)
+    return names
+
+
+def _commit_style_statement(subject: str, body: str, patch: str) -> str:
+    file_names = _file_names_from_patch(patch)
+    details: list[str] = []
+    if subject:
+        details.append(subject.rstrip("."))
+    if body:
+        details.append(body.strip())
+    if file_names:
+        details.append(f"Affected files: {', '.join(file_names)}.")
+    details.append("Run tests impacted by this change.")
+    observed = " ".join(details)
+    return f"Observed behavior: {observed}"
+
+
+def _is_weak_commit_text(subject: str, body: str) -> bool:
+    text = f"{subject} {body}".strip().lower()
+    if not text:
+        return True
+    if len(text) < 12:
+        return True
+    weak_subjects = {
+        "fix",
+        "update",
+        "changes",
+        "change",
+        "misc",
+        "wip",
+        "temp",
+        "tweak",
+        "refactor",
+        "patch",
+    }
+    subject_norm = subject.strip().lower()
+    return subject_norm in weak_subjects or subject_norm in {"fix bug", "fixes"}
+
+
+def synthesize_problem_statement(record: Mapping[str, Any], patch: str) -> tuple[str, str, str]:
+    """Return ``(statement, source, source_ref)`` for a candidate record."""
+
+    issue_title = _first_string(
+        record.get("issue_title"),
+        _nested_value(record, "metadata", "issue_title"),
+    )
+    issue_body = _first_string(
+        record.get("issue_body"),
+        _nested_value(record, "metadata", "issue_body"),
+    )
+    if issue_title or issue_body:
+        parts = [part for part in [issue_title, issue_body] if part]
+        statement = ": ".join([parts[0], " ".join(parts[1:])]) if len(parts) > 1 else parts[0]
+        source_ref = _issue_reference(record)
+        return statement.strip(), "linked_issue", source_ref
+
+    pr_title = _first_string(
+        record.get("pr_title"),
+        _nested_value(record, "metadata", "pr_title"),
+    )
+    pr_body = _first_string(
+        record.get("pr_body"),
+        _nested_value(record, "metadata", "pr_body"),
+    )
+    if pr_title or pr_body:
+        parts = [part for part in [pr_title, pr_body] if part]
+        statement = ": ".join([parts[0], " ".join(parts[1:])]) if len(parts) > 1 else parts[0]
+        source_ref = _first_string(
+            record.get("pr_ref"),
+            record.get("pr_url"),
+            record.get("pr_number"),
+            _nested_value(record, "metadata", "pr_ref"),
+            _nested_value(record, "metadata", "pr_url"),
+            _nested_value(record, "metadata", "pr_number"),
+        )
+        return statement.strip(), "pull_request", source_ref
+
+    subject = _first_string(
+        record.get("source_subject"),
+        record.get("commit_subject"),
+        record.get("subject"),
+        _nested_value(record, "metadata", "source_subject"),
+        _nested_value(record, "metadata", "commit_subject"),
+    )
+    body = _first_string(
+        record.get("source_body"),
+        record.get("commit_body"),
+        _nested_value(record, "metadata", "source_body"),
+        _nested_value(record, "metadata", "commit_body"),
+    )
+
+    advisory = _nested_value(record, "metadata", "llm_advisory")
+    advisory_statement = ""
+    advisory_ref = ""
+    if isinstance(advisory, Mapping):
+        advisory_statement = _first_string(advisory.get("problem_statement"))
+    advisory_ref = _first_string(
+        _nested_value(record, "metadata", "llm_model", "model_name"),
+        _nested_value(record, "metadata", "llm_model"),
+        _nested_value(record, "metadata", "llm_advisory", "model_name"),
+    )
+
+    if advisory_statement and _is_weak_commit_text(subject, body):
+        return advisory_statement, "llm_advisory", advisory_ref
+
+    if not subject and not body:
+        subject = _first_string(
+            record.get("commit"),
+            _nested_value(record, "metadata", "source_commit"),
+        )
+
+    statement = _commit_style_statement(subject, body, patch)
+    source_ref = _first_string(
+        record.get("commit"),
+        record.get("source_commit"),
+        _nested_value(record, "metadata", "source_commit"),
+    )
+    return statement.strip(), "commit", source_ref
diff --git a/repogauge/utils/git.py b/repogauge/utils/git.py
index b2b774f..e62c4ed 100644
--- a/repogauge/utils/git.py
+++ b/repogauge/utils/git.py
@@ -81,6 +81,20 @@ def extract_commit_diff(
     return result.stdout
 
 
+def get_commit_message(path: str | Path, commit: str) -> tuple[str, str]:
+    """Return the subject and body for a commit."""
+    root = get_repo_root(path)
+    result = run_command(["git", "-C", str(root), "show", "-s", "--format=%s%n%b", commit])
+    if not result.success:
+        raise GitError(f"failed reading commit message for {commit}: {result.stderr.strip() or result.stdout}")
+    output = result.stdout.splitlines()
+    if not output:
+        return "", ""
+    subject = output[0].strip()
+    body = "\n".join(line.rstrip() for line in output[1:]).strip()
+    return subject, body
+
+
 def apply_patch_text(
     path: str | Path,
     patch: str,

LLM Judgemuch worse

Overall Δ: -1.25
Confidence: 0.91

Candidate implements problem statement synthesis and provenance, but it diverges from the gold behavior, adds unrelated changes, and lacks the regression tests present in the reference patch.

Task Fitworse Δ-1

It wires `problem_statement` into materialization and stores provenance, but the synthesized issue/PR text is much less aligned with the required issue-style output than the gold patch. The candidate also relies on an extra commit-message fetch path instead of keeping the logic purely deterministic from row data plus patch.

Correctness Safetyworse Δ-1

Core precedence is mostly implemented, but the synthesized fallback text is weaker and less structured than gold, and provenance returns empty strings rather than optional refs. The added runtime git lookup and broad exception suppression in `_synthesize_problem_statement` introduce extra failure modes and make behavior less predictable than the reference implementation.

Maintainabilityworse Δ-1

The gold patch keeps synthesis self-contained and straightforward. The candidate adds extra helper indirection in `materialize.py`, a new git utility, and an unrelated `repogauge.mining.__init__` import fallback, increasing surface area and coupling without clear task-driven benefit.

Test Qualitymuch worse Δ-2

The gold patch adds targeted unit coverage for materialization and synthesis precedence/fallback behavior. No candidate test changes are shown, so the regression is effectively unprotected relative to the reference.

Change Focusmuch worse Δ-2

The task was confined to problem-statement synthesis and its materialization/tests. The candidate also changes `repogauge/mining/__init__.py` and `repogauge/utils/git.py`, which is unrelated churn compared with the focused gold patch.

Harness: resolved
Attempt State: succeeded
Latency: 4.9m
Spend: $0.44
Tokens: 1.8M
Tool Calls: 47
Patch Size: 8,273 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/export/materialize.py b/repogauge/export/materialize.py
index 50d9268..b4d1742 100644
--- a/repogauge/export/materialize.py
+++ b/repogauge/export/materialize.py
@@ -8,7 +8,8 @@ from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
 
 from repogauge.export.split_patch import PatchSplitError, split_prod_and_test
-from repogauge.utils.git import extract_commit_diff, get_repo_root, list_commit_parents
+from repogauge.mining.synthesize import synthesize_problem_statement
+from repogauge.utils.git import extract_commit_diff, extract_commit_message, get_repo_root, list_commit_parents
 
 
 class MaterializationError(RuntimeError):
@@ -27,6 +28,7 @@ class MaterializedItem:
     status: str = "ready"
     reason: Optional[str] = None
     metadata: Dict[str, Any] | None = None
+    problem_statement: str = ""
 
     def to_dict(self) -> Dict[str, Any]:
         return {
@@ -39,6 +41,7 @@ class MaterializedItem:
             "patch": self.patch,
             "test_patch": self.test_patch,
             "prod_patch": self.prod_patch,
+            "problem_statement": self.problem_statement,
             "metadata": self.metadata or {},
         }
 
@@ -101,6 +104,24 @@ def _extract_candidate_metadata(row: Dict[str, Any], patch: str, base_commit: st
     return metadata
 
 
+def _build_synthesis_record(row: Dict[str, Any], repo_root: Path, commit: str) -> Dict[str, Any]:
+    record = dict(row)
+    metadata = dict(record.get("metadata", {}))
+    has_subject = bool(
+        metadata.get("source_subject") or metadata.get("commit_subject")
+        or record.get("source_subject") or record.get("commit_subject")
+    )
+    if not has_subject and commit:
+        try:
+            subject, body = extract_commit_message(repo_root, commit)
+            metadata["source_subject"] = subject
+            metadata["source_body"] = body
+        except Exception:
+            pass
+    record["metadata"] = metadata
+    return record
+
+
 def _materialize_candidate(
     repo_root: Path,
     row: Dict[str, Any],
@@ -250,6 +271,9 @@ def _materialize_candidate(
         )
 
     materialized_metadata = _extract_candidate_metadata(row, patch, base_commit)
+    synth_record = _build_synthesis_record(row, repo_root, commit)
+    problem_statement, ps_source, _ = synthesize_problem_statement(synth_record, patch)
+    materialized_metadata["problem_statement_source"] = ps_source
     item = MaterializedItem(
         candidate_id=candidate_id,
         repo=repo,
@@ -261,6 +285,7 @@ def _materialize_candidate(
         status="ready",
         reason=None,
         metadata=materialized_metadata,
+        problem_statement=problem_statement,
     )
     return item, None
 
diff --git a/repogauge/mining/synthesize.py b/repogauge/mining/synthesize.py
new file mode 100644
index 0000000..121af67
--- /dev/null
+++ b/repogauge/mining/synthesize.py
@@ -0,0 +1,143 @@
+"""Deterministic problem-statement synthesis for dataset instances."""
+
+from __future__ import annotations
+
+from typing import Any, Dict, List, Optional, Tuple
+
+
+def _get_value(record: Dict[str, Any], key: str) -> Any:
+    if key in record:
+        return record[key]
+    metadata = record.get("metadata")
+    if isinstance(metadata, dict) and key in metadata:
+        return metadata[key]
+    return None
+
+
+def _get_subject(record: Dict[str, Any]) -> Optional[str]:
+    for key in ("source_subject", "commit_subject"):
+        val = _get_value(record, key)
+        if val is not None and str(val).strip():
+            return str(val).strip()
+    return None
+
+
+def _get_body(record: Dict[str, Any]) -> Optional[str]:
+    for key in ("source_body", "commit_body"):
+        val = _get_value(record, key)
+        if val is not None and str(val).strip():
+            return str(val).strip()
+    return None
+
+
+def _is_weak_commit(text: str) -> bool:
+    cleaned = text.strip().lower()
+    if len(cleaned) <= 8:
+        return True
+    words = cleaned.split()
+    if len(words) <= 1:
+        return True
+    weak_starters = {
+        "fix",
+        "update",
+        "refactor",
+        "bump",
+        "merge",
+        "chore",
+        "docs",
+        "style",
+        "test",
+        "wip",
+        "patch",
+        "init",
+        "add",
+        "remove",
+        "delete",
+        "clean",
+        "revert",
+        "temp",
+        "todo",
+        "hack",
+    }
+    if words[0] in weak_starters and len(words) <= 2:
+        return True
+    return False
+
+
+def synthesize_problem_statement(
+    record: Dict[str, Any],
+    patch: str,
+) -> Tuple[str, str, Optional[str]]:
+    """
+    Generate an issue-style problem statement with provenance.
+
+    Priority order:
+    1. linked issue title/body
+    2. PR title/body
+    3. commit message plus diff summary
+    4. LLM-generated issue-style restatement
+    """
+    # Priority 1: linked issue
+    issue_title = _get_value(record, "issue_title")
+    issue_body = _get_value(record, "issue_body")
+    issue_refs = _get_value(record, "issue_refs")
+
+    if issue_title:
+        parts: List[str] = [str(issue_title).strip()]
+        if issue_body:
+            parts.append(str(issue_body).strip())
+        statement = "\n".join(parts)
+        source_ref = issue_refs[0] if isinstance(issue_refs, list) and issue_refs else None
+        return statement, "linked_issue", source_ref
+
+    # Priority 2: pull request
+    pr_title = _get_value(record, "pr_title")
+    pr_body = _get_value(record, "pr_body")
+    if pr_title:
+        parts = [str(pr_title).strip()]
+        if pr_body:
+            parts.append(str(pr_body).strip())
+        statement = "\n".join(parts)
+        return statement, "pull_request", None
+
+    # Priority 3 & 4: commit or LLM fallback
+    source_subject = _get_subject(record)
+    source_body = _get_body(record)
+    llm_advisory = _get_value(record, "llm_advisory")
+    llm_model = _get_value(record, "llm_model")
+
+    commit_text = source_subject or ""
+    is_weak = _is_weak_commit(commit_text)
+
+    if is_weak and isinstance(llm_advisory, dict):
+        ps = llm_advisory.get("problem_statement")
+        if ps:
+            return str(ps).strip(), "llm_advisory", str(llm_model) if llm_model is not None else None
+
+    # Commit fallback synthesis
+    metadata = record.get("metadata", {}) or {}
+    total_changed_lines = metadata.get("total_changed_lines") if isinstance(metadata, dict) else None
+
+    parts = []
+    if commit_text:
+        parts.append("Observed behavior")
+        parts.append(f"- {commit_text}")
+        if source_body:
+            for line in source_body.splitlines():
+                line = line.strip()
+                if line:
+                    parts.append(f"- {line}")
+
+    if patch and patch.strip():
+        if parts:
+            parts.append("")
+        parts.append("Reproduction")
+        parts.append("- run tests impacted by this change")
+        if total_changed_lines and isinstance(total_changed_lines, int):
+            parts.append(f"  ({total_changed_lines} changed lines)")
+
+    statement = "\n".join(parts)
+    if not statement:
+        statement = "No problem description available."
+
+    return statement, "commit", None
diff --git a/repogauge/utils/git.py b/repogauge/utils/git.py
index b2b774f..8c6d048 100644
--- a/repogauge/utils/git.py
+++ b/repogauge/utils/git.py
@@ -133,6 +133,18 @@ def create_worktree(
     return WorktreeHandle(repo=repo, path=temp_path)
 
 
+def extract_commit_message(path: str | Path, commit: str) -> tuple[str, str]:
+    """Return (subject, body) for a commit."""
+    root = get_repo_root(path)
+    result = run_command(["git", "-C", str(root), "log", "-1", "--format=%s%x00%b", commit])
+    if not result.success:
+        raise GitError(f"failed extracting message for {commit}: {result.stderr.strip() or result.stdout}")
+    parts = result.stdout.split("\x00", 1)
+    subject = parts[0].strip()
+    body = parts[1].strip() if len(parts) > 1 else ""
+    return subject, body
+
+
 def remove_worktree(path: str | Path, worktree_path: str | Path) -> None:
     """Remove a worktree and best-effort clean directory."""
     root = get_repo_root(path)

LLM Judgemuch worse

Overall Δ: -1.40
Confidence: 0.91

Candidate lands the main field wiring but misses part of the required provenance contract and implements a weaker synthesis policy than the gold patch.

Task Fitworse Δ-1

It does populate `problem_statement` and stores a source label, but it does not store the source reference provenance required by the task, and its priority handling is less faithful because issue/PR body-only cases are skipped and LLM fallback can override commit-derived text more aggressively than the spec/gold.

Correctness Safetymuch worse Δ-2

The candidate drops `problem_statement_source_ref`, so provenance is incomplete. It only uses linked issues when `issue_title` exists and PRs when `pr_title` exists, missing valid body-only inputs that the gold handles. Its weak-commit heuristic considers only the subject, so it may incorrectly prefer LLM output even when commit body text is sufficient, weakening deterministic no-model behavior.

Maintainabilityworse Δ-1

The implementation is simpler, but it is also less cohesive and less explicit about fallback behavior. Adding `extract_commit_message` in `utils/git.py` expands scope and coupling to git plumbing, while the synthesis code omits the gold patch’s clearer normalization helpers and richer structured output generation.

Test Qualitymuch worse Δ-2

The gold patch adds targeted unit coverage for materialization and synthesis precedence/fallback behavior. No candidate tests are shown, so the missing provenance field, body-only source cases, and fallback-order regressions are left unguarded.

Change Focusworse Δ-1

The candidate introduces an additional git utility and runtime metadata backfill path that the gold patch does not need. That extra churn is adjacent to the task but broader than necessary, while still failing to capture the full required provenance.

s1liconcow__repogauge-rg-fa095be4cc9e-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Refactor environment signature delegation - Production changes: repogauge/lang/python.py, repogauge/mining/signature.py. Test changes: tests/unit/test_signature.py. - Reproduction: run tests impacted by this change (275 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 3.8m
Spend: $0.58
Tokens: 8.8K
Tool Calls: 32
Patch Size: 919 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/lang/python.py b/repogauge/lang/python.py
index 23c7cac..917e1d0 100644
--- a/repogauge/lang/python.py
+++ b/repogauge/lang/python.py
@@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Any
 
 from repogauge.mining.signature import REPO_VERSION_UNKNOWN
+from repogauge.mining.signature import _read_requirements_signature
 from repogauge.mining.signature import _to_pkg_label
 from repogauge.mining.signature import _to_python_label
 from repogauge.mining.signature import _to_test_label
@@ -650,7 +651,7 @@ class PythonAdapter:
     def dependency_signature_inputs(
         self, repo_root: Path, profile: dict[str, Any]
     ) -> list[str]:
-        return _detect_package_and_install_hints(repo_root)[1]
+        return _read_requirements_signature(repo_root, profile)
 
     def env_overrides(self, worktree: Path) -> dict[str, str]:
         return {"PYTHONPATH": str(worktree)}

LLM Judgeworse

Overall Δ: -1.05
Confidence: 0.95

Candidate fixes the Python dependency-signature regression narrowly, but it does not implement the broader adapter-based delegation refactor or add the regression coverage present in the gold patch.

Task Fitworse Δ-1

It updates `PythonAdapter.dependency_signature_inputs()` to use requirement-content hashing, which addresses the observed failing behavior for Python, but it does not complete the intended environment-signature delegation refactor in `repogauge/mining/signature.py` that the task describes.

Correctness Safetyworse Δ-1

For Python repos this is likely correct, but it leaves signature label generation and dependency-input delegation centralized in `signature.py` instead of moving them behind adapter APIs. That means non-Python or future language-specific profiles still do not benefit from the safer adapter-driven behavior in the gold patch.

Maintainabilityworse Δ-1

The candidate keeps `repogauge/lang/python.py` coupled to private helpers from `repogauge.mining.signature` and only patches one call site. The gold patch improves module boundaries by moving Python-specific labeling and dependency parsing into the adapter and making `signature.py` consume adapter interfaces.

Test Qualitymuch worse Δ-2

No tests were added. The gold patch adds a targeted regression test that verifies adapter labels, adapter dependency inputs, and the final environment signature built from a realistic repo-profile fixture.

Change Focussame Δ+0

The candidate stays tightly scoped to the regression and avoids unrelated churn. It is smaller than the gold patch, but not distractingly broad.

Harness: resolved
Attempt State: succeeded
Latency: 3.2m
Spend: $0.18
Tokens: 679.1K
Tool Calls: 23
Patch Size: 6,084 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/lang/python.py b/repogauge/lang/python.py
index 23c7cac..9f2ede0 100644
--- a/repogauge/lang/python.py
+++ b/repogauge/lang/python.py
@@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Any
 
 from repogauge.mining.signature import REPO_VERSION_UNKNOWN
+from repogauge.mining.signature import _normalize_dependency_lines
 from repogauge.mining.signature import _to_pkg_label
 from repogauge.mining.signature import _to_python_label
 from repogauge.mining.signature import _to_test_label
@@ -640,7 +641,13 @@ class PythonAdapter:
             if version:
                 versions = [str(version)]
         commands = _coerce_list(test_runner_hints.get("commands"))
+        if not commands:
+            commands = _coerce_list(profile.get("test_cmds"))
         managers = _coerce_list(python_hints.get("package_managers"))
+        if not managers:
+            package_manager = profile.get("package_manager")
+            if package_manager:
+                managers = [str(package_manager)]
         return {
             "runtime_label": _to_python_label(versions),
             "test_label": _to_test_label(commands),
@@ -650,7 +657,41 @@ class PythonAdapter:
     def dependency_signature_inputs(
         self, repo_root: Path, profile: dict[str, Any]
     ) -> list[str]:
-        return _detect_package_and_install_hints(repo_root)[1]
+        if not repo_root.exists():
+            package_style = profile.get("package_style")
+            if isinstance(package_style, str) and package_style.strip():
+                return [package_style.strip()]
+            return []
+
+        inputs: list[str] = []
+        for candidate in sorted(
+            (
+                repo_root / "requirements.txt",
+                repo_root / "requirements-dev.txt",
+                repo_root / "dev-requirements.txt",
+                repo_root / "pyproject.toml",
+                repo_root / "setup.cfg",
+                repo_root / "setup.py",
+            )
+        ):
+            if not candidate.exists():
+                continue
+            try:
+                normalized_lines = _normalize_dependency_lines(
+                    candidate.read_text(encoding="utf-8")
+                )
+            except OSError:
+                continue
+            if normalized_lines:
+                inputs.append("\n".join(_sorted_unique(normalized_lines)))
+
+        if inputs:
+            return _sorted_unique(inputs)
+
+        package_style = profile.get("package_style")
+        if isinstance(package_style, str) and package_style.strip():
+            return [package_style.strip()]
+        return []
 
     def env_overrides(self, worktree: Path) -> dict[str, str]:
         return {"PYTHONPATH": str(worktree)}
diff --git a/repogauge/mining/signature.py b/repogauge/mining/signature.py
index 1efe27e..3c363d0 100644
--- a/repogauge/mining/signature.py
+++ b/repogauge/mining/signature.py
@@ -107,7 +107,16 @@ def _read_requirements_signature(repo_root: Path, profile: dict[str, Any]) -> li
 
 
 def build_environment_signature(profile: dict[str, Any]) -> dict[str, Any]:
-    repo_root = Path(profile.get("repo_root", "")).resolve()
+    from repogauge.lang import find_adapter
+
+    repo_root = Path(profile.get("repo_root") or profile.get("source_path") or "").resolve()
+    language = str(profile.get("language", "python")).strip().lower() or "python"
+
+    try:
+        adapter = find_adapter(language)
+    except KeyError:
+        adapter = find_adapter("python")
+
     python_hints = profile.get("python_hints", {}) or {}
     test_runner_hints = profile.get("test_runner_hints", {}) or {}
 
@@ -121,18 +130,37 @@ def build_environment_signature(profile: dict[str, Any]) -> dict[str, Any]:
     repo_name = str(profile.get("repo_name", "")).strip()
     repo_version = str(profile.get("repo_version", "")).strip() or REPO_VERSION_UNKNOWN
 
-    dependency_payload = {
-        "package_managers": package_managers,
-        "install_cmds": install_cmds,
-        "test_commands": test_commands,
-        "package_style": package_style,
-        "requirements": _read_requirements_signature(repo_root, profile),
-    }
-    fingerprint = _dependency_hash(dependency_payload)
+    try:
+        labels = adapter.signature_labels(profile)
+    except Exception:
+        labels = {}
+    if not isinstance(labels, dict):
+        labels = {}
+    runtime_label = str(labels.get("runtime_label", "")).strip() or _to_python_label(
+        python_versions
+    )
+    test_label = str(labels.get("test_label", "")).strip() or _to_test_label(
+        test_commands
+    )
+    package_label = str(labels.get("package_label", "")).strip() or _to_pkg_label(
+        package_managers
+    )
 
-    python_label = _to_python_label(python_versions)
-    test_label = _to_test_label(test_commands)
-    package_label = _to_pkg_label(package_managers)
+    try:
+        dependency_inputs = adapter.dependency_signature_inputs(repo_root, profile)
+    except Exception:
+        dependency_inputs = []
+    if not isinstance(dependency_inputs, list):
+        dependency_inputs = []
+    if not dependency_inputs and package_style != "unknown":
+        dependency_inputs = [package_style]
+
+    dependency_payload = {"inputs": _as_sorted_unique(dependency_inputs)}
+    fingerprint = _dependency_hash(dependency_payload)
+    version = (
+        f"{repo_version}__{runtime_label}__{test_label}__{package_label}__reqhash_"
+        f"{fingerprint}"
+    )
 
     return {
         "repo_name": repo_name,
@@ -143,8 +171,8 @@ def build_environment_signature(profile: dict[str, Any]) -> dict[str, Any]:
         "install_cmds": install_cmds,
         "test_commands": test_commands,
         "dependency_signature": fingerprint,
-        "signature": f"{repo_version}__{python_label}__{test_label}__{package_label}__reqhash_{fingerprint}",
-        "version": f"{repo_version}__{python_label}__{test_label}__{package_label}__reqhash_{fingerprint}",
+        "signature": version,
+        "version": version,
     }

LLM Judgemuch worse

Overall Δ: -1.40
Confidence: 0.88

Candidate partially implements adapter delegation but diverges from the golden fix by weakening dependency fingerprinting and omitting the regression test coverage.

Task Fitworse Δ-1

It moves signature label and dependency input gathering toward the adapter path, which addresses the refactor intent, but it does not match the gold patch's full environment-signature behavior and adds extra fallback behavior outside the stated task.

Correctness Safetymuch worse Δ-2

The candidate changes the dependency hash payload from the gold patch's structured fields (`package_managers`, `install_cmds`, `test_commands`, `package_style`, `requirements`) to only `inputs`, so environment signatures can stop changing when install/test/package-manager hints change. It also silently swallows adapter failures and falls back to python defaults, which can mask real regressions.

Maintainabilityworse Δ-1

Compared with the gold patch, the candidate leaves responsibility split awkwardly across modules: `_normalize_dependency_lines` still lives in `signature.py`, dependency-reading logic is duplicated in `python.py`, and exception-heavy fallback paths make the flow less clear. The gold patch is cleaner about moving language-specific behavior behind the adapter.

Test Qualitymuch worse Δ-2

The gold patch adds a focused regression test that exercises adapter-provided labels and dependency inputs through a realistic repo-profile fixture. The candidate patch includes no corresponding test coverage, so the resolved harness result is not backed by the same regression protection.

Change Focusworse Δ-1

The candidate includes unrelated broadenings such as `source_path` fallback, defaulting unknown languages back to python, and extra profile fallbacks for `test_cmds` and `package_manager`. Those may be useful, but they go beyond the gold fix's tighter refactor scope.

Harness: resolved
Attempt State: succeeded
Latency: 3.8m
Spend: $0.27
Tokens: 864.1K
Tool Calls: 26
Patch Size: 9,818 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/lang/python.py b/repogauge/lang/python.py
index 23c7cac..86abd1c 100644
--- a/repogauge/lang/python.py
+++ b/repogauge/lang/python.py
@@ -11,9 +11,6 @@ from pathlib import Path
 from typing import Any
 
 from repogauge.mining.signature import REPO_VERSION_UNKNOWN
-from repogauge.mining.signature import _to_pkg_label
-from repogauge.mining.signature import _to_python_label
-from repogauge.mining.signature import _to_test_label
 from repogauge.validation.env_detect import EnvPlan
 from repogauge.parsers.junit import parse_repogauge_junit
 
@@ -44,6 +41,91 @@ def _coerce_list(value: Any) -> list[str]:
     return []
 
 
+def _to_test_label(commands: list[str]) -> str:
+    if not commands:
+        return "testunknown"
+    return "+".join(commands)
+
+
+def _to_pkg_label(managers: list[str]) -> str:
+    if not managers:
+        return "pkgunknown"
+    return "+".join(managers)
+
+
+def _to_python_label(versions: list[str]) -> str:
+    if not versions:
+        return "pyunknown"
+    return "_".join(f"py{v.replace('.', '')}" for v in versions)
+
+
+def _normalize_dependency_lines(raw: str) -> list[str]:
+    lines: list[str] = []
+    for raw_line in raw.splitlines():
+        value = raw_line.strip()
+        if not value or value.startswith("#"):
+            continue
+        value = value.split("#", 1)[0].strip()
+        if value:
+            lines.append(value)
+    return lines
+
+
+def _read_requirements_signature(repo_root: Path, profile: dict[str, Any]) -> list[str]:
+    if not repo_root.exists():
+        if isinstance(profile.get("package_style"), str):
+            return _sorted_unique([str(profile.get("package_style"))])
+        return []
+
+    requirements: list[str] = []
+    for candidate in sorted(
+        (
+            repo_root / "requirements.txt",
+            repo_root / "requirements-dev.txt",
+            repo_root / "dev-requirements.txt",
+        )
+    ):
+        if not candidate.exists():
+            continue
+        try:
+            normalized_lines = _normalize_dependency_lines(
+                candidate.read_text(encoding="utf-8")
+            )
+            requirements.append("\n".join(_sorted_unique(normalized_lines)))
+        except OSError:
+            requirements.append("")
+    pyproject = repo_root / "pyproject.toml"
+    if pyproject.exists():
+        try:
+            normalized_lines = _normalize_dependency_lines(
+                pyproject.read_text(encoding="utf-8")
+            )
+            requirements.append("\n".join(_sorted_unique(normalized_lines)))
+        except OSError:
+            requirements.append("")
+    setup_cfg = repo_root / "setup.cfg"
+    if setup_cfg.exists():
+        try:
+            normalized_lines = _normalize_dependency_lines(
+                setup_cfg.read_text(encoding="utf-8")
+            )
+            requirements.append("\n".join(_sorted_unique(normalized_lines)))
+        except OSError:
+            requirements.append("")
+    setup_py = repo_root / "setup.py"
+    if setup_py.exists():
+        try:
+            normalized_lines = _normalize_dependency_lines(
+                setup_py.read_text(encoding="utf-8")
+            )
+            requirements.append("\n".join(_sorted_unique(normalized_lines)))
+        except OSError:
+            requirements.append("")
+    if not requirements and isinstance(profile.get("package_style"), str):
+        requirements.append(profile["package_style"])
+    return _sorted_unique(requirements)
+
+
 def _version_tuple(v: str) -> tuple[int, ...]:
     try:
         return tuple(int(x) for x in v.split("."))
@@ -650,7 +732,7 @@ class PythonAdapter:
     def dependency_signature_inputs(
         self, repo_root: Path, profile: dict[str, Any]
     ) -> list[str]:
-        return _detect_package_and_install_hints(repo_root)[1]
+        return _read_requirements_signature(repo_root, profile)
 
     def env_overrides(self, worktree: Path) -> dict[str, str]:
         return {"PYTHONPATH": str(worktree)}
diff --git a/repogauge/mining/signature.py b/repogauge/mining/signature.py
index 1efe27e..3037b46 100644
--- a/repogauge/mining/signature.py
+++ b/repogauge/mining/signature.py
@@ -15,99 +15,22 @@ def _as_sorted_unique(values: Iterable[str]) -> list[str]:
     return sorted({str(v).strip() for v in values if str(v).strip()})
 
 
-def _to_test_label(commands: list[str]) -> str:
-    if not commands:
-        return "testunknown"
-    return "+".join(commands)
-
-
-def _to_pkg_label(managers: list[str]) -> str:
-    if not managers:
-        return "pkgunknown"
-    return "+".join(managers)
-
-
-def _to_python_label(versions: list[str]) -> str:
-    if not versions:
-        return "pyunknown"
-    return "_".join(f"py{v.replace('.', '')}" for v in versions)
-
-
 def _dependency_hash(parts: dict[str, Any]) -> str:
     payload = json.dumps(parts, sort_keys=True, separators=(",", ":"))
     digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
     return digest[:16]
 
 
-def _normalize_dependency_lines(raw: str) -> list[str]:
-    lines: list[str] = []
-    for raw_line in raw.splitlines():
-        value = raw_line.strip()
-        if not value or value.startswith("#"):
-            continue
-        value = value.split("#", 1)[0].strip()
-        if value:
-            lines.append(value)
-    return lines
-
-
-def _read_requirements_signature(repo_root: Path, profile: dict[str, Any]) -> list[str]:
-    if not repo_root.exists():
-        if isinstance(profile.get("package_style"), str):
-            return _as_sorted_unique([str(profile.get("package_style"))])
-        return []
-
-    requirements: list[str] = []
-    for candidate in sorted(
-        (
-            repo_root / "requirements.txt",
-            repo_root / "requirements-dev.txt",
-            repo_root / "dev-requirements.txt",
-        )
-    ):
-        if not candidate.exists():
-            continue
-        try:
-            normalized_lines = _normalize_dependency_lines(
-                candidate.read_text(encoding="utf-8")
-            )
-            requirements.append("\n".join(_as_sorted_unique(normalized_lines)))
-        except OSError:
-            requirements.append("")
-    pyproject = repo_root / "pyproject.toml"
-    if pyproject.exists():
-        try:
-            normalized_lines = _normalize_dependency_lines(
-                pyproject.read_text(encoding="utf-8")
-            )
-            requirements.append("\n".join(_as_sorted_unique(normalized_lines)))
-        except OSError:
-            requirements.append("")
-    setup_cfg = repo_root / "setup.cfg"
-    if setup_cfg.exists():
-        try:
-            normalized_lines = _normalize_dependency_lines(
-                setup_cfg.read_text(encoding="utf-8")
-            )
-            requirements.append("\n".join(_as_sorted_unique(normalized_lines)))
-        except OSError:
-            requirements.append("")
-    setup_py = repo_root / "setup.py"
-    if setup_py.exists():
-        try:
-            normalized_lines = _normalize_dependency_lines(
-                setup_py.read_text(encoding="utf-8")
-            )
-            requirements.append("\n".join(_as_sorted_unique(normalized_lines)))
-        except OSError:
-            requirements.append("")
-    if not requirements and isinstance(profile.get("package_style"), str):
-        requirements.append(profile["package_style"])
-    return _as_sorted_unique(requirements)
-
-
 def build_environment_signature(profile: dict[str, Any]) -> dict[str, Any]:
     repo_root = Path(profile.get("repo_root", "")).resolve()
+    repo_name = str(profile.get("repo_name", "")).strip()
+    repo_version = str(profile.get("repo_version", "")).strip() or REPO_VERSION_UNKNOWN
+
+    from repogauge.lang import find_adapter
+
+    adapter = find_adapter(profile.get("language", "python"))
+    labels = adapter.signature_labels(profile)
+
     python_hints = profile.get("python_hints", {}) or {}
     test_runner_hints = profile.get("test_runner_hints", {}) or {}
 
@@ -118,22 +41,16 @@ def build_environment_signature(profile: dict[str, Any]) -> dict[str, Any]:
     package_style = (
         str(python_hints.get("package_style", "unknown")).strip() or "unknown"
     )
-    repo_name = str(profile.get("repo_name", "")).strip()
-    repo_version = str(profile.get("repo_version", "")).strip() or REPO_VERSION_UNKNOWN
 
     dependency_payload = {
         "package_managers": package_managers,
         "install_cmds": install_cmds,
         "test_commands": test_commands,
         "package_style": package_style,
-        "requirements": _read_requirements_signature(repo_root, profile),
+        "requirements": adapter.dependency_signature_inputs(repo_root, profile),
     }
     fingerprint = _dependency_hash(dependency_payload)
 
-    python_label = _to_python_label(python_versions)
-    test_label = _to_test_label(test_commands)
-    package_label = _to_pkg_label(package_managers)
-
     return {
         "repo_name": repo_name,
         "repo_version": repo_version,
@@ -143,8 +60,8 @@ def build_environment_signature(profile: dict[str, Any]) -> dict[str, Any]:
         "install_cmds": install_cmds,
         "test_commands": test_commands,
         "dependency_signature": fingerprint,
-        "signature": f"{repo_version}__{python_label}__{test_label}__{package_label}__reqhash_{fingerprint}",
-        "version": f"{repo_version}__{python_label}__{test_label}__{package_label}__reqhash_{fingerprint}",
+        "signature": f"{repo_version}__{labels['runtime_label']}__{labels['test_label']}__{labels['package_label']}__reqhash_{fingerprint}",
+        "version": f"{repo_version}__{labels['runtime_label']}__{labels['test_label']}__{labels['package_label']}__reqhash_{fingerprint}",
     }

LLM Judgeworse

Overall Δ: -0.90
Confidence: 0.93

Candidate captures the core delegation refactor in production code but misses the gold patch's broader hint-source handling and regression test coverage.

Task Fitworse Δ-1

It updates dependency signature reading and delegates label generation through the adapter, which addresses the main bug, but it does not match the gold patch's full environment-signature delegation behavior. In particular, it keeps deriving versions, package managers, install commands, test commands, and package style only from legacy python-specific fields instead of the adapter-compatible language-hint path used by the gold fix.

Correctness Safetyworse Δ-1

The candidate is likely sufficient for the reported regression, but it is less robust than the gold patch. It omits fallback to `language_hints`, does not merge install/test hints from adapter-facing fields, and indexes `labels[...]` directly instead of using safe defaults, so it has more risk if profiles are incomplete or if non-Python adapters rely on the new delegation boundary.

Maintainabilityworse Δ-1

Moving helper functions into `repogauge/lang/python.py` is consistent with the refactor, but the overall implementation is less cohesive than the gold patch. The inline `find_adapter` import inside `build_environment_signature` and continued dependence on python-specific hint extraction leave the module in a partially refactored state rather than cleanly centralizing adapter-owned behavior.

Test Qualityworse Δ-1

The candidate provides no test updates, while the gold patch adds a focused regression test that exercises adapter label delegation and dependency signature inputs from a realistic `RepoProfile`-style fixture. That missing coverage materially reduces confidence in the refactor.

Change Focussame Δ+0

The production diff stays narrowly scoped to the environment-signature refactor and does not introduce unrelated churn. The absence of tests is a coverage gap rather than scope creep, so focus is about the same as the gold patch.

s1liconcow__repogauge-rg-71b332d9f694-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Landing changes for bead oss_repogauge-626 - orchestrate materialization flow - Production changes: repogauge/cli.py, repogauge/export/__init__.py, repogauge/export/materialize.py. Test changes: tests/unit/test_export.py. - Bead oss_repogauge-626: Orchestrate candidate materialization from reviewed commits Context: - Convert accepted reviewed candidates into materialized instance work items. - Each work item should gather all information needed for patch splitting, synthesis, env detection, and validation. Acceptance: - Accepted candidates become materialization work items with explicit base commit and provenance. - Invalid candidates fail with actionable reasons. - Materialization is deterministic for a given input artifact. --- - Reproduction: run tests impacted by this change (300 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 4.6m
Spend: $0.61
Tokens: 12.0K
Tool Calls: 37
Patch Size: 4,173 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/cli.py b/repogauge/cli.py
index 2579a86..3ea54d7 100644
--- a/repogauge/cli.py
+++ b/repogauge/cli.py
@@ -283,6 +283,74 @@ def _run_command(namespace: argparse.Namespace) -> int:
             "include_merges": not namespace.exclude_merges,
         }
 
+    if command == "export":
+        if not namespace.path:
+            manifest.mark_step("inspect", ManifestStepStatus.FAILED, ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z")
+            manifest.mark_step("execute", ManifestStepStatus.SKIPPED)
+            manifest.finish(status="failed", metadata={"reason": "missing_export_input"})
+            manifest.mark_step("finish", ManifestStepStatus.FAILED, ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z")
+            manifest.write(manifest_path)
+            log_event(
+                {
+                    "event": "command.finish",
+                    "command": command,
+                    "status": manifest.status,
+                    "timestamp": manifest.ended_at,
+                    "error": "missing export input path",
+                },
+                events_path,
+            )
+            return 1
+
+        from repogauge.export import run_materialization
+
+        manifest.mark_step("inspect", ManifestStepStatus.RUNNING, started_at=command_timestamp)
+        try:
+            export_summary = run_materialization(
+                reviewed_path=namespace.path,
+                out_root=out_root,
+            )
+            manifest.mark_step("inspect", ManifestStepStatus.SUCCEEDED)
+            manifest.mark_step(
+                "execute",
+                ManifestStepStatus.SUCCEEDED,
+                ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z",
+            )
+            manifest.artifact_paths["materialized"] = export_summary["materialized_path"]
+            manifest.artifact_paths["rejected"] = export_summary["rejected_path"]
+            manifest.metadata["export"] = export_summary
+        except Exception as exc:
+            manifest.mark_step("inspect", ManifestStepStatus.FAILED, ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z")
+            manifest.mark_step("execute", ManifestStepStatus.SKIPPED)
+            manifest.finish(status="failed", metadata={"reason": "export_failed", "error": str(exc)})
+            manifest.mark_step("finish", ManifestStepStatus.FAILED, ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z")
+            manifest.write(manifest_path)
+            log_event(
+                {
+                    "event": "command.finish",
+                    "command": command,
+                    "status": manifest.status,
+                    "timestamp": manifest.ended_at,
+                    "error": str(exc),
+                },
+                events_path,
+            )
+            return 1
+
+        manifest.mark_step("finish", ManifestStepStatus.SUCCEEDED, ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z")
+        manifest.finish(status="succeeded", metadata={"reason": "export_complete", "path": namespace.path})
+        manifest.write(manifest_path)
+        log_event(
+            {
+                "event": "command.finish",
+                "command": command,
+                "status": manifest.status,
+                "timestamp": manifest.ended_at,
+            },
+            events_path,
+        )
+        return 0
+
     # Scaffold implementations are intentionally explicit no-ops for unimplemented commands.
     manifest.mark_step("execute", ManifestStepStatus.SUCCEEDED, started_at=command_timestamp)
     manifest.mark_step("finish", ManifestStepStatus.SUCCEEDED, ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z")
diff --git a/repogauge/export/__init__.py b/repogauge/export/__init__.py
index 63f7161..a588709 100644
--- a/repogauge/export/__init__.py
+++ b/repogauge/export/__init__.py
@@ -1 +1,5 @@
 """Export package."""
+
+from repogauge.export.materialize import run_materialization
+
+__all__ = ["run_materialization"]

LLM Judgeworse

Overall Δ: -0.90
Confidence: 0.93

Candidate adds an `export` command path, but it is materially incomplete relative to the gold fix: it omits repo-root/input resolution support, does not include the production materialization module change, and adds no regression tests.

Task Fitworse Δ-1

It only partially addresses the task. The gold patch wires CLI orchestration to deterministic materialization inputs by resolving the reviewed artifact path and repository root; the candidate only forwards `namespace.path` and `out_root`, so it does not fully implement the required orchestration behavior.

Correctness Safetyworse Δ-1

The candidate misses the `_resolve_repo_root` handling and the explicit `reviewed.jsonl` directory/file resolution present in gold, so valid invocation shapes can fail or behave differently. It also records artifact keys differently (`rejected` instead of `materialization_rejections`) and relies on `run_materialization` defaults that may not provide the required provenance/base-commit context.

Maintainabilityworse Δ-1

The implementation is serviceable but less cohesive than gold. Importing `run_materialization` inside the command branch instead of at module scope is less clear, and omitting the shared repo-root helper leaves the CLI behavior less explicit and less reusable.

Test Qualityworse Δ-1

Gold adds targeted unit coverage for materialization splitting, rejection handling, and CLI artifact writing. The candidate adds no tests, so it provides no regression protection for the new behavior.

Change Focussame Δ+0

The scope stays narrow and relevant to the export flow, similar to gold. It is incomplete, but it does not introduce obvious unrelated churn.

Harness: resolved
Attempt State: succeeded
Latency: 1.3m
Spend: $0.07
Tokens: 365.7K
Tool Calls: 20
Patch Size: 5,442 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/cli.py b/repogauge/cli.py
index 2579a86..94129a6 100644
--- a/repogauge/cli.py
+++ b/repogauge/cli.py
@@ -16,6 +16,7 @@ from datetime import datetime, timezone
 from pathlib import Path
 
 from repogauge.review import run_review
+from repogauge.export import MaterializationError, run_materialization
 
 OUT_DIR_HELP = "Path where artifacts are written (created when needed)."
 CONFIG_HELP = "Configuration file path. Values are merged over project defaults."
@@ -283,6 +284,73 @@ def _run_command(namespace: argparse.Namespace) -> int:
             "include_merges": not namespace.exclude_merges,
         }
 
+    if command == "export":
+        if not namespace.path:
+            manifest.mark_step("inspect", ManifestStepStatus.FAILED, ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z")
+            manifest.mark_step("execute", ManifestStepStatus.SKIPPED)
+            manifest.finish(status="failed", metadata={"reason": "missing_reviewed_input"})
+            manifest.mark_step("finish", ManifestStepStatus.FAILED, ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z")
+            manifest.write(manifest_path)
+            log_event(
+                {
+                    "event": "command.finish",
+                    "command": command,
+                    "status": manifest.status,
+                    "timestamp": manifest.ended_at,
+                    "error": "missing reviewed input path",
+                },
+                events_path,
+            )
+            return 1
+
+        source = Path(namespace.path).resolve()
+        reviewed_path = source / "reviewed.jsonl" if source.is_dir() else source
+        manifest.mark_step("inspect", ManifestStepStatus.RUNNING, started_at=command_timestamp)
+        try:
+            materialization_summary = run_materialization(
+                reviewed_path=reviewed_path,
+                out_root=out_root,
+            )
+            manifest.mark_step("inspect", ManifestStepStatus.SUCCEEDED)
+            manifest.mark_step(
+                "execute",
+                ManifestStepStatus.SUCCEEDED,
+                ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z",
+            )
+            manifest.artifact_paths["materialized"] = materialization_summary["materialized_path"]
+            manifest.artifact_paths["rejected"] = materialization_summary["rejected_path"]
+            manifest.metadata["export"] = materialization_summary
+        except (MaterializationError, ValueError) as exc:
+            manifest.mark_step("inspect", ManifestStepStatus.FAILED, ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z")
+            manifest.mark_step("execute", ManifestStepStatus.SKIPPED)
+            manifest.finish(status="failed", metadata={"reason": "materialization_failed", "error": str(exc)})
+            manifest.mark_step("finish", ManifestStepStatus.FAILED, ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z")
+            manifest.write(manifest_path)
+            log_event(
+                {
+                    "event": "command.finish",
+                    "command": command,
+                    "status": manifest.status,
+                    "timestamp": manifest.ended_at,
+                    "error": str(exc),
+                },
+                events_path,
+            )
+            return 1
+        manifest.mark_step("finish", ManifestStepStatus.SUCCEEDED, ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z")
+        manifest.finish(status="succeeded", metadata={"reason": "materialization_complete", "path": str(reviewed_path)})
+        manifest.write(manifest_path)
+        log_event(
+            {
+                "event": "command.finish",
+                "command": command,
+                "status": manifest.status,
+                "timestamp": manifest.ended_at,
+            },
+            events_path,
+        )
+        return 0
+
     # Scaffold implementations are intentionally explicit no-ops for unimplemented commands.
     manifest.mark_step("execute", ManifestStepStatus.SUCCEEDED, started_at=command_timestamp)
     manifest.mark_step("finish", ManifestStepStatus.SUCCEEDED, ended_at=datetime.now(timezone.utc).replace(tzinfo=None).isoformat() + "Z")
diff --git a/repogauge/export/__init__.py b/repogauge/export/__init__.py
index 63f7161..406e510 100644
--- a/repogauge/export/__init__.py
+++ b/repogauge/export/__init__.py
@@ -1 +1,5 @@
 """Export package."""
+
+from .materialize import MaterializationError, MaterializedItem, run_materialization
+
+__all__ = ["MaterializationError", "MaterializedItem", "run_materialization"]
diff --git a/repogauge/export/materialize.py b/repogauge/export/materialize.py
index 83aebe6..c02ebcf 100644
--- a/repogauge/export/materialize.py
+++ b/repogauge/export/materialize.py
@@ -127,6 +127,9 @@ def _parent_count(row: Dict[str, Any]) -> int:
 
 def _resolve_base_commit(repo_root: Path, commit: str, row: Dict[str, Any]) -> str:
     metadata = row.get("metadata", {})
+    explicit_parent = row.get("parent_commit")
+    if isinstance(explicit_parent, str) and explicit_parent.strip():
+        return explicit_parent.strip()
     candidate_parent = metadata.get("parent_commit")
     if isinstance(candidate_parent, str) and candidate_parent:
         return candidate_parent

LLM Judgemuch worse

Overall Δ: -1.30
Confidence: 0.93

Candidate implements part of the export orchestration, but misses key gold behaviors around repo-root resolution, input validation, and regression coverage.

Task Fitworse Δ-1

It adds an `export` command path and honors explicit `parent_commit`, but it does not match the gold patch's full orchestration behavior. In particular, the CLI path does not resolve the repository root from the reviewed artifact location and does not handle missing reviewed artifacts the way the gold fix does.

Correctness Safetymuch worse Δ-2

The gold CLI passes an explicit `repo_root` into `run_materialization` after resolving it from the input path; the candidate omits that, which risks failure or nondeterministic behavior depending on cwd and artifact location. It also skips the explicit reviewed-file existence check and only catches `MaterializationError`/`ValueError`, so common filesystem failures can escape the intended manifest/logging flow.

Maintainabilityworse Δ-1

The candidate is smaller, but it exposes extra export-package symbols (`MaterializedItem`, `MaterializationError`) that are not needed for the task and diverges from the cleaner public surface in the gold patch. The missing repo-root helper also leaves important CLI behavior implicit rather than centralized and explicit.

Test Qualitymuch worse Δ-2

The gold patch adds focused regression tests for direct materialization behavior, rejection handling, and CLI artifact generation. The candidate diff includes no corresponding tests, so it does not demonstrate the regression is actually covered.

Change Focussame Δ+0

The change stays within the relevant CLI/export files and does not introduce broad unrelated churn. The extra exported symbols are unnecessary, but overall the scope remains reasonably focused on the materialization flow.

Harness: not_resolved
Attempt State: succeeded
Latency: 25.6s
Spend: -
Tokens: 0
Tool Calls: 0
Patch Size: 3,956 chars
Failure Reason: model_not_found

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..19ec268
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,54 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+Current release state is scaffolded and in active development.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..2ec2a31
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,31 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/mining/__init__.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/uv.lock b/uv.lock
new file mode 100644
index 0000000..4e6f117
--- /dev/null
+++ b/uv.lock
@@ -0,0 +1,8 @@
+version = 1
+revision = 3
+requires-python = ">=3.10"
+
+[[package]]
+name = "repogauge"
+version = "0.0.0"
+source = { editable = "." }

LLM Judgemuch worse

Overall Δ: -2.00
Confidence: 0.99

Candidate misses the materialization/export implementation entirely and adds unrelated packaging artifacts, so it is substantially worse than the gold patch.

Task Fitmuch worse Δ-2

The task was to orchestrate export/materialization flow and make the regression test pass. The candidate does not touch `repogauge/cli.py`, `repogauge/export/__init__.py`, `repogauge/export/materialize.py`, or add the required tests, so it does not address the stated behavior.

Correctness Safetymuch worse Δ-2

The harness outcome is `not_resolved`, which matches the diff: none of the required command wiring, repo-root resolution, artifact generation, rejection handling, or deterministic materialization behavior from the gold patch is implemented. Adding generated packaging files also increases noise without improving runtime behavior.

Maintainabilitymuch worse Δ-2

The gold patch adds cohesive production code and exports a clear public entrypoint. The candidate instead adds `repogauge.egg-info/*` and `uv.lock`, which are generated or packaging-oriented artifacts unrelated to the feature and likely undesirable in a review-focused fix.

Test Qualitymuch worse Δ-2

The gold patch adds targeted unit coverage for materialization behavior, missing-commit rejection, and CLI export artifact writing. The candidate adds no tests at all, leaving the regression unaddressed and unverified.

Change Focusmuch worse Δ-2

The candidate is almost entirely unrelated churn: packaging metadata and lockfile additions rather than the requested export/materialization changes. This is less focused than the gold patch and actively distracts from the task.

s1liconcow__repogauge-rg-c980ea1754af-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Land oss_repogauge-p46: add file role classifier - Production changes: repogauge/mining/__init__.py, repogauge/mining/file_roles.py. Test changes: tests/unit/test_file_roles.py. - Bead oss_repogauge-p46: Define file-role taxonomy and path-based classifier Context: - Implement the canonical file-role classifier used by: - candidate scanning - patch splitting - validation test targeting - Roles: - `prod` - `test` - `test_support` - `config_build` - `docs` - `generated_vendor` - `unknown` Acceptance: - File-role classification is centralized and tested. - Patch splitting and scan scoring can import the same logic. - Ambiguous files remain visible for human or LLM review instead of being silently misclassified. --- - Reproduction: run tests impacted by this change (159 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 51.6s
Spend: $0.10
Tokens: 2.2K
Tool Calls: 6
Patch Size: 7,465 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..19ec268
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,54 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+Current release state is scaffolded and in active development.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..b33f9d7
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,26 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/specs.py
+repogauge/mining/__init__.py
+repogauge/mining/inspect.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/mining/__init__.py b/repogauge/mining/__init__.py
index 4d8b95f..e06c185 100644
--- a/repogauge/mining/__init__.py
+++ b/repogauge/mining/__init__.py
@@ -4,5 +4,6 @@ Deterministic repository inspection and related discovery utilities live here.
 """
 
 from .inspect import inspect_repository
+from .file_roles import classify_file
 
-__all__ = ["inspect_repository"]
+__all__ = ["inspect_repository", "classify_file"]
diff --git a/repogauge/mining/file_roles.py b/repogauge/mining/file_roles.py
new file mode 100644
index 0000000..a3abedc
--- /dev/null
+++ b/repogauge/mining/file_roles.py
@@ -0,0 +1,82 @@
+"""Path-based file role classifier for RepoGauge."""
+
+from dataclasses import dataclass
+from pathlib import PurePosixPath
+
+_VENDOR_GENERATED_DIRS = {
+    ".venv", "venv", "env", "node_modules", ".tox", "__pycache__",
+    "dist", "build", ".eggs", "vendor", ".cache",
+}
+
+_CONFIG_BUILD_NAMES = {
+    "setup.py", "setup.cfg", "pyproject.toml", "Makefile", "Dockerfile",
+    "tox.ini", ".flake8", ".pylintrc", ".pre-commit-config.yaml",
+    "requirements.txt", "requirements-dev.txt", "Pipfile", "Pipfile.lock",
+    "package.json", "package-lock.json", "yarn.lock", "uv.lock",
+    "MANIFEST.in", "CODEOWNERS",
+}
+
+_CONFIG_BUILD_DIRS = {
+    ".github", ".circleci", ".travis", ".gitlab", ".azure-pipelines",
+}
+
+_CONFIG_BUILD_EXTENSIONS = {".cfg", ".ini", ".toml", ".yml", ".yaml"}
+
+_DOCS_DIRS = {"docs", "doc", "documentation"}
+
+_DOCS_EXTENSIONS = {".md", ".rst", ".txt", ".adoc", ".asciidoc"}
+
+_RUNTIME_EXTENSIONS = {
+    ".py", ".js", ".ts", ".jsx", ".tsx", ".go", ".rb", ".java",
+    ".kt", ".scala", ".rs", ".c", ".cpp", ".cc", ".h", ".hpp",
+    ".cs", ".swift", ".php", ".sh", ".bash", ".zsh",
+}
+
+
+@dataclass
+class FileRole:
+    role: str
+    reason: str
+
+
+def classify_file(path: str) -> FileRole:
+    """Classify a file path into a role category."""
+    p = PurePosixPath(path)
+    parts = p.parts
+    name = p.name
+    suffix = p.suffix.lower()
+
+    # Check vendor/generated directories first (highest priority for exclusion)
+    for part in parts[:-1]:
+        if part in _VENDOR_GENERATED_DIRS or part.startswith(".venv"):
+            return FileRole("generated_vendor", "vendor or generated build cache directory")
+
+    # Check test files (files named test_*.py or *_test.py)
+    if name.startswith("test_") and suffix == ".py":
+        return FileRole("test", "test filename convention")
+    if name.endswith("_test.py"):
+        return FileRole("test", "test filename convention")
+
+    # Check test support (non-test files under a tests/ directory)
+    if parts and parts[0] in {"tests", "test", "spec"}:
+        return FileRole("test_support", "test-support path under tests")
+
+    # Check docs
+    if parts and parts[0] in _DOCS_DIRS:
+        return FileRole("docs", "documentation file or directory")
+    if suffix in _DOCS_EXTENSIONS:
+        return FileRole("docs", "documentation file or directory")
+
+    # Check config/build files
+    if parts and parts[0] in _CONFIG_BUILD_DIRS:
+        return FileRole("config_build", "CI, package, or tooling configuration file")
+    if name in _CONFIG_BUILD_NAMES:
+        return FileRole("config_build", "CI, package, or tooling configuration file")
+    if suffix in _CONFIG_BUILD_EXTENSIONS:
+        return FileRole("config_build", "CI, package, or tooling configuration file")
+
+    # Check prod (runtime source files)
+    if suffix in _RUNTIME_EXTENSIONS:
+        return FileRole("prod", "runtime source extension")
+
+    return FileRole("unknown", "No explicit role rule matched")
diff --git a/uv.lock b/uv.lock
new file mode 100644
index 0000000..4e6f117
--- /dev/null
+++ b/uv.lock
@@ -0,0 +1,8 @@
+version = 1
+revision = 3
+requires-python = ">=3.10"
+
+[[package]]
+name = "repogauge"
+version = "0.0.0"
+source = { editable = "." }

LLM Judgemuch worse

Overall Δ: -1.80
Confidence: 0.94

Candidate adds a path classifier, but it diverges from the required API, omits the shared bulk-classification surface, introduces clear overclassification risks, and includes substantial unrelated generated-file churn with no tests.

Task Fitmuch worse Δ-2

It only exposes `classify_file` and replaces the intended `FileRole`/`FileRoleClassification` API with a different dataclass shape, omits `classify_files`, and does not fully match the canonical shared classifier surface the task called for.

Correctness Safetymuch worse Δ-2

The rules are materially less safe than gold: all files under `tests/` that are not named like tests become `test_support`, many generic extensions like `.yml`, `.toml`, `.ini`, `.txt` are classified globally as config/docs, and the return object no longer preserves the original path. That increases silent misclassification risk instead of keeping ambiguous files visible as `unknown`.

Maintainabilityworse Δ-1

The module is readable, but it weakens the contract by using untyped free-form role strings, drops the reusable classification record with path metadata, and broadens heuristic sets in ways that make future behavior harder to reason about.

Test Qualitymuch worse Δ-2

The gold patch adds focused regression tests for each role and reason path. The candidate adds no tests at all, despite changing behavior and API.

Change Focusmuch worse Δ-2

It includes unrelated generated artifacts (`repogauge.egg-info/*`, `uv.lock`) that are outside the task, while still failing to land the full intended production/test scope.

Harness: resolved
Attempt State: succeeded
Latency: 2.5m
Spend: $0.08
Tokens: 312.4K
Tool Calls: 17
Patch Size: 5,420 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/mining/__init__.py b/repogauge/mining/__init__.py
index 4d8b95f..5e91c5e 100644
--- a/repogauge/mining/__init__.py
+++ b/repogauge/mining/__init__.py
@@ -3,6 +3,7 @@
 Deterministic repository inspection and related discovery utilities live here.
 """
 
+from .file_roles import FILE_ROLES, FileRoleClassification, classify_file
 from .inspect import inspect_repository
 
-__all__ = ["inspect_repository"]
+__all__ = ["FILE_ROLES", "FileRoleClassification", "classify_file", "inspect_repository"]
diff --git a/repogauge/mining/file_roles.py b/repogauge/mining/file_roles.py
new file mode 100644
index 0000000..6456cc3
--- /dev/null
+++ b/repogauge/mining/file_roles.py
@@ -0,0 +1,175 @@
+"""Canonical file-role classification for RepoGauge."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from pathlib import Path, PurePosixPath
+
+FILE_ROLES = (
+    "prod",
+    "test",
+    "test_support",
+    "config_build",
+    "docs",
+    "generated_vendor",
+    "unknown",
+)
+
+_TEST_DIR_NAMES = {"tests", "test"}
+_DOC_DIR_NAMES = {"docs", "doc"}
+_VENDOR_DIR_NAMES = {
+    ".venv",
+    "venv",
+    ".tox",
+    ".nox",
+    "node_modules",
+    "site-packages",
+    "dist-packages",
+    "__pycache__",
+    "vendor",
+    "third_party",
+    "third-party",
+    "build",
+    "dist",
+    "target",
+}
+_CONFIG_FILENAMES = {
+    "makefile",
+    "dockerfile",
+    "pyproject.toml",
+    "setup.py",
+    "setup.cfg",
+    "tox.ini",
+    "noxfile.py",
+    "pytest.ini",
+    "requirements.txt",
+}
+_CONFIG_PREFIXES = ("requirements",)
+_DOC_FILENAMES = {
+    "readme",
+    "readme.md",
+    "readme.rst",
+    "changelog",
+    "changelog.md",
+    "changelog.rst",
+    "license",
+    "license.md",
+    "license.rst",
+    "contributing",
+    "contributing.md",
+    "contributing.rst",
+}
+
+
+@dataclass(frozen=True)
+class FileRoleClassification:
+    role: str
+    reason: str
+
+
+def _normalise_path(path: str | Path) -> PurePosixPath:
+    text = str(path).replace("\\", "/")
+    return PurePosixPath(text)
+
+
+def _basename(path: PurePosixPath) -> str:
+    return path.name.lower()
+
+
+def _parts_lower(path: PurePosixPath) -> list[str]:
+    return [part.lower() for part in path.parts if part not in ("", ".")]
+
+
+def _has_vendor_marker(parts: list[str]) -> bool:
+    return any(part in _VENDOR_DIR_NAMES for part in parts)
+
+
+def _has_doc_marker(parts: list[str], basename: str) -> bool:
+    return any(part in _DOC_DIR_NAMES for part in parts) or basename in _DOC_FILENAMES
+
+
+def _has_config_marker(parts: list[str], basename: str) -> bool:
+    if basename in _CONFIG_FILENAMES:
+        return True
+    if any(part == ".github" for part in parts) and "workflows" in parts:
+        return True
+    if basename.startswith(_CONFIG_PREFIXES):
+        return True
+    if basename.endswith((".yml", ".yaml", ".toml", ".ini", ".cfg", ".json", ".txt")):
+        return basename in {
+            "pyproject.toml",
+            "setup.cfg",
+            "tox.ini",
+            "noxfile.py",
+            "pytest.ini",
+            "requirements.txt",
+        } or basename.startswith("requirements")
+    return False
+
+
+def _has_test_marker(parts: list[str], basename: str) -> bool:
+    if "tests" in parts or "test" in parts:
+        if basename.startswith("test_") or basename.endswith("_test.py") or basename.endswith(".test.py"):
+            return True
+    return basename.startswith("test_") or basename.endswith("_test.py") or basename.endswith(".test.py")
+
+
+def _has_test_support_marker(parts: list[str], basename: str) -> bool:
+    if not any(part in _TEST_DIR_NAMES for part in parts):
+        return False
+    if _has_test_marker(parts, basename):
+        return False
+    return True
+
+
+def classify_file(path: str | Path) -> FileRoleClassification:
+    """Classify a file path into the RepoGauge file-role taxonomy."""
+
+    normalized = _normalise_path(path)
+    parts = _parts_lower(normalized)
+    basename = _basename(normalized)
+
+    if _has_vendor_marker(parts):
+        return FileRoleClassification(
+            role="generated_vendor",
+            reason="Matched vendor or generated build cache directory",
+        )
+
+    if _has_test_marker(parts, basename):
+        return FileRoleClassification(
+            role="test",
+            reason="Matched test filename convention",
+        )
+
+    if _has_test_support_marker(parts, basename):
+        return FileRoleClassification(
+            role="test_support",
+            reason="Matched test-support path under tests",
+        )
+
+    if _has_config_marker(parts, basename):
+        return FileRoleClassification(
+            role="config_build",
+            reason="Matched CI, package, or tooling configuration file",
+        )
+
+    if _has_doc_marker(parts, basename):
+        return FileRoleClassification(
+            role="docs",
+            reason="Matched documentation file or directory",
+        )
+
+    suffix = normalized.suffix.lower()
+    if suffix == ".py":
+        return FileRoleClassification(
+            role="prod",
+            reason="Matched runtime source extension",
+        )
+
+    return FileRoleClassification(
+        role="unknown",
+        reason="No explicit role rule matched",
+    )
+
+
+__all__ = ["FILE_ROLES", "FileRoleClassification", "classify_file"]

LLM Judgeworse

Overall Δ: -0.90
Confidence: 0.90

Candidate lands a workable classifier, but it is materially less complete than the gold patch and changes some taxonomy behavior in ways that increase downstream risk.

Task Fitworse Δ-1

The candidate adds a central classifier, but it omits key gold-scope API pieces: no `FileRole` type alias, no `classify_files`, and no `path` field on the classification object. Those omissions make it less suitable for the stated shared use cases across candidate scanning, patch splitting, and validation targeting.

Correctness Safetyworse Δ-1

Several gold behaviors are missing or weakened. The candidate only treats `.py` as production code, so `.pyi`, `.ts`, `.tsx`, `.js`, and `.jsx` regress to `unknown`. It also lacks the gold `src` fallback and empty-path handling. Test classification is broader but less precise: many files under `tests/` that gold marks as `test` become `test_support`, which can affect targeting logic.

Maintainabilityworse Δ-1

The helper decomposition is readable, but the public model is less coherent than gold because `FileRoleClassification` drops `path` and `role` is an unconstrained `str` instead of a typed role union. The broader marker sets also encode more implicit policy while still covering fewer canonical cases than the gold implementation.

Test Qualityworse Δ-1

No candidate test patch is present, while the gold patch adds focused regression coverage for each major role bucket and the unknown fallback. That is a substantial gap relative to the reference.

Change Focussame Δ+0

The candidate stays narrowly focused on the file-role classifier and package exports. It does not introduce obvious unrelated churn, even though its exported surface differs from gold.

Harness: not_resolved
Attempt State: succeeded
Latency: 21.7s
Spend: -
Tokens: 0
Tool Calls: 0
Patch Size: 3,817 chars
Failure Reason: model_not_found

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..19ec268
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,54 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+Current release state is scaffolded and in active development.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..b33f9d7
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,26 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/specs.py
+repogauge/mining/__init__.py
+repogauge/mining/inspect.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/uv.lock b/uv.lock
new file mode 100644
index 0000000..4e6f117
--- /dev/null
+++ b/uv.lock
@@ -0,0 +1,8 @@
+version = 1
+revision = 3
+requires-python = ">=3.10"
+
+[[package]]
+name = "repogauge"
+version = "0.0.0"
+source = { editable = "." }

LLM Judgemuch worse

Overall Δ: -2.00
Confidence: 0.99

Candidate does not implement the file-role classifier or tests and instead adds unrelated packaging artifacts, so it is substantially worse than the gold patch.

Task Fitmuch worse Δ-2

The stated task was to add a centralized file-role classifier, export it from `repogauge.mining`, and cover it with tests. The candidate patch does none of that and does not touch the required production or test files.

Correctness Safetymuch worse Δ-2

Because the classifier is not implemented, the regression remains unresolved, matching the harness outcome. The added `egg-info` and lockfile artifacts do not contribute to correctness and introduce unnecessary generated state into the repo.

Maintainabilitymuch worse Δ-2

The gold patch adds a focused reusable module with explicit types and reasons. The candidate adds unrelated generated metadata files, which increase noise and maintenance burden without providing the requested abstraction.

Test Qualitymuch worse Δ-2

The gold patch adds direct unit coverage for each file-role category and unknown-path behavior. The candidate adds no tests at all, leaving the new behavior entirely unverified.

Change Focusmuch worse Δ-2

The change is unfocused: it introduces `repogauge.egg-info/*` and `uv.lock`, which are unrelated to the requested classifier work, while omitting the actual production and test changes needed for the task.

s1liconcow__repogauge-rg-ef70ca456338-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Fix JUnit classname-to-path splitting for class-based tests Classnames like 'tests.unit.test_foo.TestBar' were being converted to 'tests/unit/test_foo/TestBar.py' instead of the correct pytest node ID 'tests/unit/test_foo.py::TestBar::test_method'. Split at the first PascalCase component to separate the module path from the class chain. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> - Production changes: repogauge/validation/junit_parser.py. Test changes: tests/unit/test_junit_parser.py. - Reproduction: run tests impacted by this change (59 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: resolved
Attempt State: succeeded
Latency: 1.1m
Spend: $0.13
Tokens: 2.3K
Tool Calls: 8
Patch Size: 7,830 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..5a86ef7
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,112 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+### In scope for MVP
+
+- CLI-only workflows such as:
+  - `repogauge mine /path/to/repo --out ./out`
+  - `repogauge review ./out/candidates.jsonl`
+  - `repogauge export ./out/reviewed.jsonl --dataset ./out/dataset`
+  - `repogauge eval ./out/dataset/dataset.jsonl --gold`
+- Mining candidate bugfix commits from the default branch or an explicit commit range.
+- Materializing SWE-bench-style instances with:
+  - `instance_id`
+  - `repo`
+  - `base_commit`
+  - `problem_statement`
+  - `version`
+  - `patch`
+  - `test_patch`
+  - `FAIL_TO_PASS`
+  - `PASS_TO_PASS`
+  - optional metadata
+- Generating a repo-specific adapter that patches the official harness runtime maps.
+- Validating gold patches locally and through the official harness.
+- Running experiment matrices across multiple solver adapters and evaluating them through the judge path.
+- Producing per-run cost/quality reports and router-training data.
+
+Current release state is scaffolded and in active development.
+
+## Running repogauge against itself
+
+```bash
+scripts/gauge_self.sh
+```
+
+Options:
+
+| Flag | Default | Description |
+|---|---|---|
+| `--out DIR` | `./out` | Root directory for all artifacts |
+| `--max-commits N` | `100` | Commits to scan |
+| `--decisions FILE` | _(none)_ | JSONL file of manual accept/reject decisions |
+
+Outputs written under `--out`:
+
+```
+mine/repo_profile.json              repo identity and environment hints
+mine/candidates.jsonl               all scanned commits with heuristic scores
+review/reviewed.jsonl               accept/reject decisions
+review/review.html                  human-readable review report
+export/dataset/dataset.jsonl        SWE-bench-compatible instances
+export/dataset/predictions.gold.jsonl
+```
+
+### E2E integration test
+
+```bash
+uv run python -m pytest tests/e2e/test_self_gauge.py -v
+```
+
+This runs the full mine → review → export pipeline against this repository and
+validates every artifact at each stage.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..820a3c1
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,38 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/llm.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/dataset.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/export/split_patch.py
+repogauge/mining/__init__.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/mining/signature.py
+repogauge/mining/synthesize.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/env_detect.py
+repogauge/validation/junit_parser.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/validation/junit_parser.py b/repogauge/validation/junit_parser.py
index bf9fe3d..5e085ae 100644
--- a/repogauge/validation/junit_parser.py
+++ b/repogauge/validation/junit_parser.py
@@ -30,12 +30,22 @@ class JUnitParseError(ValueError):
     """Raised when the JUnit XML cannot be parsed."""
 
 
-def _classname_to_path(classname: str) -> str:
-    """Convert ``tests.unit.test_foo`` → ``tests/unit/test_foo.py``."""
+def _classname_to_path_and_classes(classname: str) -> tuple[str, list[str]]:
+    """Split ``tests.unit.test_foo.TestSuite`` into (``tests/unit/test_foo.py``, [``TestSuite``]).
+
+    The split point is the first dot-separated component that starts with an uppercase
+    letter (PascalCase), which indicates a class name rather than a module path component.
+    """
     parts = classname.split(".")
-    # Heuristic: if any component starts with "test_" or equals "tests",
-    # assume it is a file/directory component and convert dots to slashes.
-    return "/".join(parts) + ".py"
+    split_idx = len(parts)
+    for i, part in enumerate(parts):
+        if part and part[0].isupper():
+            split_idx = i
+            break
+    module_parts = parts[:split_idx]
+    class_parts = parts[split_idx:]
+    path = "/".join(module_parts) + ".py" if module_parts else ""
+    return path, class_parts
 
 
 def _canonical_id(classname: str, name: str) -> str:
@@ -43,10 +53,11 @@ def _canonical_id(classname: str, name: str) -> str:
 
     pytest encodes parametrized cases as ``name[param]``; we preserve that as-is.
     """
-    path_part = _classname_to_path(classname) if classname else ""
-    if path_part:
-        return f"{path_part}::{name}"
-    return name
+    if not classname:
+        return name
+    path_part, class_parts = _classname_to_path_and_classes(classname)
+    segments = ([path_part] if path_part else []) + class_parts + [name]
+    return "::".join(segments)
 
 
 def _outcome_of(testcase: ET.Element) -> str:

LLM Judgeworse

Overall Δ: -1.00
Confidence: 0.95

Candidate fixes the core classname splitting behavior in production, but it omits the regression test added by the gold patch and introduces substantial unrelated packaging churn.

Task Fitworse Δ-1

The production change addresses the reported JUnit classname issue and matches the intended split-at-first-PascalCase approach, so the main bug is fixed. However, unlike the gold patch, it does not add the targeted regression test that demonstrates the expected three-part pytest node ID behavior.

Correctness Safetysame Δ+0

The candidate implementation is semantically equivalent to the gold fix for the core behavior: it separates module path from class components at the first uppercase segment and constructs canonical IDs correctly for module-only and class-based cases. I do not see an obvious regression relative to gold in the production logic shown.

Maintainabilityworse Δ-1

The implementation is reasonably clear, but it is slightly less polished than gold: the helper naming and docstring are narrower, and returning a list of class parts instead of a joined class chain makes the abstraction a bit less cohesive. The added egg-info files also add repository noise that hurts long-term maintainability.

Test Qualitymuch worse Δ-2

The gold patch adds a focused regression test covering both class-based and standalone test IDs. The candidate adds no corresponding test coverage, so the fix is left unguarded against future regressions.

Change Focusmuch worse Δ-2

The candidate includes multiple unrelated generated `repogauge.egg-info/*` files that have nothing to do with the JUnit parser bugfix. The gold patch stays tightly scoped to the production file and its regression test.

Harness: resolved
Attempt State: succeeded
Latency: 1.0m
Spend: $0.04
Tokens: 134.8K
Tool Calls: 6
Patch Size: 8,112 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..5a86ef7
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,112 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+### In scope for MVP
+
+- CLI-only workflows such as:
+  - `repogauge mine /path/to/repo --out ./out`
+  - `repogauge review ./out/candidates.jsonl`
+  - `repogauge export ./out/reviewed.jsonl --dataset ./out/dataset`
+  - `repogauge eval ./out/dataset/dataset.jsonl --gold`
+- Mining candidate bugfix commits from the default branch or an explicit commit range.
+- Materializing SWE-bench-style instances with:
+  - `instance_id`
+  - `repo`
+  - `base_commit`
+  - `problem_statement`
+  - `version`
+  - `patch`
+  - `test_patch`
+  - `FAIL_TO_PASS`
+  - `PASS_TO_PASS`
+  - optional metadata
+- Generating a repo-specific adapter that patches the official harness runtime maps.
+- Validating gold patches locally and through the official harness.
+- Running experiment matrices across multiple solver adapters and evaluating them through the judge path.
+- Producing per-run cost/quality reports and router-training data.
+
+Current release state is scaffolded and in active development.
+
+## Running repogauge against itself
+
+```bash
+scripts/gauge_self.sh
+```
+
+Options:
+
+| Flag | Default | Description |
+|---|---|---|
+| `--out DIR` | `./out` | Root directory for all artifacts |
+| `--max-commits N` | `100` | Commits to scan |
+| `--decisions FILE` | _(none)_ | JSONL file of manual accept/reject decisions |
+
+Outputs written under `--out`:
+
+```
+mine/repo_profile.json              repo identity and environment hints
+mine/candidates.jsonl               all scanned commits with heuristic scores
+review/reviewed.jsonl               accept/reject decisions
+review/review.html                  human-readable review report
+export/dataset/dataset.jsonl        SWE-bench-compatible instances
+export/dataset/predictions.gold.jsonl
+```
+
+### E2E integration test
+
+```bash
+uv run python -m pytest tests/e2e/test_self_gauge.py -v
+```
+
+This runs the full mine → review → export pipeline against this repository and
+validates every artifact at each stage.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..820a3c1
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,38 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/llm.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/dataset.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/export/split_patch.py
+repogauge/mining/__init__.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/mining/signature.py
+repogauge/mining/synthesize.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/env_detect.py
+repogauge/validation/junit_parser.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge
diff --git a/repogauge/validation/junit_parser.py b/repogauge/validation/junit_parser.py
index bf9fe3d..150bef4 100644
--- a/repogauge/validation/junit_parser.py
+++ b/repogauge/validation/junit_parser.py
@@ -30,12 +30,34 @@ class JUnitParseError(ValueError):
     """Raised when the JUnit XML cannot be parsed."""
 
 
+def _split_classname(classname: str) -> tuple[str, str]:
+    """Split a pytest classname into module path and optional class chain.
+
+    Pytest reports class-based tests as ``module.path.ClassName`` in the JUnit
+    ``classname`` field.  The first PascalCase component marks the start of the
+    class chain; everything before it is the filesystem path.
+    """
+    parts = classname.split(".")
+    split_index = None
+    for index, part in enumerate(parts):
+        if part and part[0].isupper():
+            split_index = index
+            break
+
+    if split_index is None:
+        return classname, ""
+
+    module_part = ".".join(parts[:split_index])
+    class_part = ".".join(parts[split_index:])
+    return module_part, class_part
+
+
 def _classname_to_path(classname: str) -> str:
     """Convert ``tests.unit.test_foo`` → ``tests/unit/test_foo.py``."""
-    parts = classname.split(".")
-    # Heuristic: if any component starts with "test_" or equals "tests",
-    # assume it is a file/directory component and convert dots to slashes.
-    return "/".join(parts) + ".py"
+    module_part, _ = _split_classname(classname)
+    if not module_part:
+        return ""
+    return "/".join(module_part.split(".")) + ".py"
 
 
 def _canonical_id(classname: str, name: str) -> str:
@@ -43,9 +65,17 @@ def _canonical_id(classname: str, name: str) -> str:
 
     pytest encodes parametrized cases as ``name[param]``; we preserve that as-is.
     """
-    path_part = _classname_to_path(classname) if classname else ""
+    if not classname:
+        return name
+
+    module_part, class_part = _split_classname(classname)
+    path_part = _classname_to_path(classname)
+    if path_part and class_part:
+        return f"{path_part}::{class_part.replace('.', '::')}::{name}"
     if path_part:
         return f"{path_part}::{name}"
+    if class_part:
+        return f"{class_part.replace('.', '::')}::{name}"
     return name

LLM Judgeworse

Overall Δ: -0.75
Confidence: 0.94

Candidate fixes the classname splitting behavior, but it omits the targeted regression test and adds unrelated packaging artifact churn.

Task Fitworse Δ-1

The production change addresses the reported JUnit classname bug, but unlike the gold patch it does not add the explicit regression coverage for class-based pytest node IDs. That makes the overall task completion slightly weaker than the reference.

Correctness Safetysame Δ+0

The candidate’s split-at-first-uppercase approach produces the intended node IDs for class-based tests and preserves standalone function cases. The implementation is a bit more indirect than gold, but there is no clear semantic regression relative to the reference fix.

Maintainabilityworse Δ-1

The code is serviceable, but the gold patch is cleaner: it returns path and class chain directly, has clearer contract documentation, and avoids the extra module-to-path round trip plus string replacement step. The candidate is also noisier overall because of unrelated generated metadata files.

Test Qualityworse Δ-1

The gold patch adds a focused regression test covering both class-based and standalone cases. The candidate patch includes no corresponding test change, so its testing posture is materially weaker.

Change Focusworse Δ-1

The candidate includes unrelated `repogauge.egg-info/*` generated files, which are outside the stated task and add avoidable churn. The gold patch stays tightly scoped to the parser and its test.

Harness: not_resolved
Attempt State: succeeded
Latency: 23.1s
Spend: -
Tokens: 0
Tool Calls: 0
Patch Size: 5,836 chars
Failure Reason: model_not_found

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge.egg-info/PKG-INFO b/repogauge.egg-info/PKG-INFO
new file mode 100644
index 0000000..5a86ef7
--- /dev/null
+++ b/repogauge.egg-info/PKG-INFO
@@ -0,0 +1,112 @@
+Metadata-Version: 2.4
+Name: repogauge
+Version: 0.0.0
+Summary: Local SWE-bench-style dataset and harness helper for repositories
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+
+# repogauge
+
+## v1 Scope and non-goals
+
+RepoGauge v1 is a **Python-only, local-first CLI** for creating local SWE-bench-style
+evaluation tasks from a repository and evaluating patches with the official SWE-bench harness.
+
+What v1 is:
+
+- Mine and shortlist candidate bugfix commits with deterministic heuristics.
+- Export SWE-bench-compatible `dataset.jsonl` artifacts.
+- Generate a repository-specific harness adapter so official evaluation can run
+  against previously unsupported repos.
+- Validate gold patches and `FAIL_TO_PASS` / `PASS_TO_PASS` outcomes deterministically.
+
+What v1 is not:
+
+- Multi-language generality.
+- Multi-commit PR reconstruction.
+- Synthetic test generation.
+- Hosted service or database-backed workflow orchestration.
+- Remote model calls by default.
+
+See the architecture docs:
+
+- [docs/ADRs/0001-mvp-architecture.md](docs/ADRs/0001-mvp-architecture.md)
+- [DESIGN.md](DESIGN.md)
+
+## CLI surface (scaffold)
+
+- `repogauge mine PATH --out DIR`
+- `repogauge review CANDIDATES --out DIR`
+- `repogauge export REVIEWED --dataset DIR`
+- `repogauge eval DATASET --gold`
+- `repogauge run MATRIX`
+- `repogauge analyze RUN`
+- `repogauge train-router RUN`
+
+Global behavior:
+
+- `--config`: merges config files over built-in defaults.
+- `--out`: sets output directory root.
+- `--resume`: continues from existing outputs where possible.
+- `--dry-run`: validates parameters without writing artifacts.
+- `--llm-mode`: `off`, `local_only`, or `allow_remote`.
+
+### In scope for MVP
+
+- CLI-only workflows such as:
+  - `repogauge mine /path/to/repo --out ./out`
+  - `repogauge review ./out/candidates.jsonl`
+  - `repogauge export ./out/reviewed.jsonl --dataset ./out/dataset`
+  - `repogauge eval ./out/dataset/dataset.jsonl --gold`
+- Mining candidate bugfix commits from the default branch or an explicit commit range.
+- Materializing SWE-bench-style instances with:
+  - `instance_id`
+  - `repo`
+  - `base_commit`
+  - `problem_statement`
+  - `version`
+  - `patch`
+  - `test_patch`
+  - `FAIL_TO_PASS`
+  - `PASS_TO_PASS`
+  - optional metadata
+- Generating a repo-specific adapter that patches the official harness runtime maps.
+- Validating gold patches locally and through the official harness.
+- Running experiment matrices across multiple solver adapters and evaluating them through the judge path.
+- Producing per-run cost/quality reports and router-training data.
+
+Current release state is scaffolded and in active development.
+
+## Running repogauge against itself
+
+```bash
+scripts/gauge_self.sh
+```
+
+Options:
+
+| Flag | Default | Description |
+|---|---|---|
+| `--out DIR` | `./out` | Root directory for all artifacts |
+| `--max-commits N` | `100` | Commits to scan |
+| `--decisions FILE` | _(none)_ | JSONL file of manual accept/reject decisions |
+
+Outputs written under `--out`:
+
+```
+mine/repo_profile.json              repo identity and environment hints
+mine/candidates.jsonl               all scanned commits with heuristic scores
+review/reviewed.jsonl               accept/reject decisions
+review/review.html                  human-readable review report
+export/dataset/dataset.jsonl        SWE-bench-compatible instances
+export/dataset/predictions.gold.jsonl
+```
+
+### E2E integration test
+
+```bash
+uv run python -m pytest tests/e2e/test_self_gauge.py -v
+```
+
+This runs the full mine → review → export pipeline against this repository and
+validates every artifact at each stage.
diff --git a/repogauge.egg-info/SOURCES.txt b/repogauge.egg-info/SOURCES.txt
new file mode 100644
index 0000000..820a3c1
--- /dev/null
+++ b/repogauge.egg-info/SOURCES.txt
@@ -0,0 +1,38 @@
+README.md
+pyproject.toml
+repogauge/__init__.py
+repogauge/artifacts.py
+repogauge/cli.py
+repogauge/config.py
+repogauge/exec.py
+repogauge/llm.py
+repogauge/logging_utils.py
+repogauge/manifest.py
+repogauge/review.py
+repogauge.egg-info/PKG-INFO
+repogauge.egg-info/SOURCES.txt
+repogauge.egg-info/dependency_links.txt
+repogauge.egg-info/entry_points.txt
+repogauge.egg-info/top_level.txt
+repogauge/export/__init__.py
+repogauge/export/adapter.py
+repogauge/export/dataset.py
+repogauge/export/materialize.py
+repogauge/export/specs.py
+repogauge/export/split_patch.py
+repogauge/mining/__init__.py
+repogauge/mining/file_roles.py
+repogauge/mining/inspect.py
+repogauge/mining/scan.py
+repogauge/mining/score.py
+repogauge/mining/signature.py
+repogauge/mining/synthesize.py
+repogauge/runner/__init__.py
+repogauge/runner/judge.py
+repogauge/runner/telemetry.py
+repogauge/utils/__init__.py
+repogauge/utils/git.py
+repogauge/validation/__init__.py
+repogauge/validation/env_detect.py
+repogauge/validation/junit_parser.py
+repogauge/validation/validate.py
\ No newline at end of file
diff --git a/repogauge.egg-info/dependency_links.txt b/repogauge.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/repogauge.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/repogauge.egg-info/entry_points.txt b/repogauge.egg-info/entry_points.txt
new file mode 100644
index 0000000..6d62c54
--- /dev/null
+++ b/repogauge.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+repogauge = repogauge.cli:main
diff --git a/repogauge.egg-info/top_level.txt b/repogauge.egg-info/top_level.txt
new file mode 100644
index 0000000..0fd2992
--- /dev/null
+++ b/repogauge.egg-info/top_level.txt
@@ -0,0 +1 @@
+repogauge

LLM Judgemuch worse

Overall Δ: -2.00
Confidence: 0.99

Candidate does not implement the JUnit classname parsing fix and instead adds unrelated packaging metadata files, so it is substantially worse than the gold patch.

Task Fitmuch worse Δ-2

The stated task was to fix classname-to-path splitting in `repogauge/validation/junit_parser.py` and add the corresponding regression test in `tests/unit/test_junit_parser.py`. The candidate patch changes neither target file and does not address the bug at all.

Correctness Safetymuch worse Δ-2

The harness outcome is `not_resolved`, which matches the diff: the faulty classname handling remains unchanged, so the regression is still present. Adding unrelated egg-info artifacts also introduces unnecessary repository noise without functional benefit.

Maintainabilitymuch worse Δ-2

The gold patch is a small, cohesive fix with clear helper extraction and docstrings. The candidate adds generated packaging metadata unrelated to the bug, increasing clutter and maintenance burden while leaving the actual logic untouched.

Test Qualitymuch worse Δ-2

The gold patch adds a focused regression test for class-based JUnit classnames. The candidate adds no test coverage for the reported failure, so there is no validation of the expected behavior.

Change Focusmuch worse Δ-2

The candidate patch is entirely off-target: it adds multiple `repogauge.egg-info/*` files, which are unrelated to the requested production and test changes. This is the opposite of disciplined, task-focused scope.

s1liconcow__repogauge-rg-d68b8bf7245e-reviewed

s1liconcow/repogauge

Problem Statement

Observed behavior - s1liconcow/repogauge - Details: Implement deterministic environment signature and version key - Production changes: repogauge/mining/inspect.py, repogauge/mining/signature.py. Test changes: tests/unit/test_inspect.py, tests/unit/test_signature.py. - Reproduction: run tests impacted by this change (324 changed lines). - Expected behavior: the update should make the corresponding regression test pass.

Harness: unknown
Attempt State: timed_out
Latency: 5.4m
Spend: $0.03
Tokens: 52
Tool Calls: 25
Patch Size: 0 chars
Failure Reason: timeout
Exit Detail: command timeout: timed out

Diff

This attempt produced no patch. The harness saw an empty diff, so there is nothing to compare.

LLM Judge

No judge analysis was produced for this attempt.

Harness: resolved
Attempt State: succeeded
Latency: 2.5m
Spend: $0.13
Tokens: 579.9K
Tool Calls: 27
Patch Size: 8,790 chars

Diff

Loading Diffs viewer…

Diffs failed to load, showing the raw patch instead.

diff --git a/repogauge/mining/__init__.py b/repogauge/mining/__init__.py
index 231521f..8a2d1ef 100644
--- a/repogauge/mining/__init__.py
+++ b/repogauge/mining/__init__.py
@@ -7,9 +7,12 @@ from .inspect import inspect_repository
 from .file_roles import FileRole, FileRoleClassification, classify_file, classify_files
 from .scan import scan_repository
 from .score import ScoredCommit, score_scan_commit
+from .signature import REPO_VERSION_UNKNOWN, build_environment_signature
 
 __all__ = [
     "inspect_repository",
+    "build_environment_signature",
+    "REPO_VERSION_UNKNOWN",
     "scan_repository",
     "FileRole",
     "FileRoleClassification",
diff --git a/repogauge/mining/inspect.py b/repogauge/mining/inspect.py
index 4b8d0e9..258e1d1 100644
--- a/repogauge/mining/inspect.py
+++ b/repogauge/mining/inspect.py
@@ -6,6 +6,7 @@ import re
 
 from repogauge.exec import run_command
 from repogauge.utils.git import get_default_branch, get_repo_root
+from .signature import build_environment_signature
 
 
 def _as_sorted_unique(values: list[str]) -> list[str]:
@@ -262,7 +263,7 @@ def inspect_repository(path: str | Path) -> Dict[str, Any]:
         )
     )
 
-    return {
+    profile = {
         "repo_name": repo_name,
         "repo_root": str(repo_root_resolved),
         "default_branch": default_branch,
@@ -285,3 +286,9 @@ def inspect_repository(path: str | Path) -> Dict[str, Any]:
         "test_paths": _detect_test_paths(repo_root_resolved),
         "profile_warnings": warnings,
     }
+
+    environment_signature = build_environment_signature(profile)
+    profile["repo_version"] = environment_signature["repo_version"]
+    profile["version"] = environment_signature["version"]
+    profile["environment_signature"] = environment_signature
+    return profile
diff --git a/repogauge/mining/signature.py b/repogauge/mining/signature.py
new file mode 100644
index 0000000..2236fd0
--- /dev/null
+++ b/repogauge/mining/signature.py
@@ -0,0 +1,181 @@
+from __future__ import annotations
+
+import hashlib
+import json
+import re
+from pathlib import Path
+from typing import Any
+
+try:
+    import tomllib
+except ModuleNotFoundError:  # pragma: no cover - Python < 3.11 fallback
+    tomllib = None  # type: ignore[assignment]
+
+
+REPO_VERSION_UNKNOWN = "repover_unknown"
+
+
+def _as_sorted_unique(values: list[str]) -> list[str]:
+    return sorted(set(values))
+
+
+def _safe_read_text(path: Path) -> str:
+    try:
+        return path.read_text(encoding="utf-8")
+    except OSError:
+        return ""
+
+
+def _normalize_python_versions(values: Any) -> list[str]:
+    normalized: list[str] = []
+    for value in values or []:
+        match = re.search(r"\b3\.(\d+)\b", str(value))
+        if match:
+            normalized.append(f"3.{match.group(1)}")
+    return _as_sorted_unique(normalized)
+
+
+def _normalize_tokens(values: Any) -> list[str]:
+    return _as_sorted_unique([str(value).strip() for value in values or [] if str(value).strip()])
+
+
+def _normalize_requirement_line(line: str) -> str:
+    stripped = line.strip()
+    if not stripped or stripped.startswith("#"):
+        return ""
+    if " #" in stripped:
+        stripped = stripped.split(" #", 1)[0].rstrip()
+    stripped = re.sub(r"\s+", " ", stripped)
+    return stripped
+
+
+def _normalize_requirements_text(raw: str) -> list[str]:
+    normalized = [_normalize_requirement_line(line) for line in raw.splitlines()]
+    return _as_sorted_unique([line for line in normalized if line])
+
+
+def _extract_pyproject_dependency_lines(pyproject_path: Path) -> list[str]:
+    if tomllib is None:
+        return _normalize_requirements_text(_safe_read_text(pyproject_path))
+
+    try:
+        data = tomllib.loads(_safe_read_text(pyproject_path))
+    except Exception:
+        return _normalize_requirements_text(_safe_read_text(pyproject_path))
+
+    lines: list[str] = []
+
+    project = data.get("project", {})
+    for dep in project.get("dependencies", []) or []:
+        lines.append(str(dep).strip())
+    for extra_name, deps in (project.get("optional-dependencies", {}) or {}).items():
+        for dep in deps or []:
+            lines.append(f"{extra_name}:{str(dep).strip()}")
+
+    poetry = data.get("tool", {}).get("poetry", {})
+    for dep_name, dep_value in (poetry.get("dependencies", {}) or {}).items():
+        if dep_name.lower() == "python":
+            continue
+        if isinstance(dep_value, str):
+            lines.append(f"{dep_name}{dep_value.strip()}")
+        elif isinstance(dep_value, dict):
+            lines.append(f"{dep_name}:{json.dumps(dep_value, sort_keys=True)}")
+        else:
+            lines.append(f"{dep_name}:{dep_value}")
+    for dep_name, dep_value in (poetry.get("dev-dependencies", {}) or {}).items():
+        if isinstance(dep_value, str):
+            lines.append(f"dev:{dep_name}{dep_value.strip()}")
+        elif isinstance(dep_value, dict):
+            lines.append(f"dev:{dep_name}:{json.dumps(dep_value, sort_keys=True)}")
+        else:
+            lines.append(f"dev:{dep_name}:{dep_value}")
+
+    return _as_sorted_unique([line for line in lines if line])
+
+
+def _detect_repo_version(profile: dict[str, Any]) -> str:
+    repo_version = str(profile.get("repo_version") or "").strip()
+    if repo_version:
+        return repo_version
+
+    repo_root = Path(profile["repo_root"])
+    pyproject_path = repo_root / "pyproject.toml"
+    if pyproject_path.exists():
+        if tomllib is not None:
+            try:
+                data = tomllib.loads(_safe_read_text(pyproject_path))
+            except Exception:
+                data = {}
+            project = data.get("project", {})
+            version = str(project.get("version") or "").strip()
+            if version:
+                return version
+            poetry = data.get("tool", {}).get("poetry", {})
+            version = str(poetry.get("version") or "").strip()
+            if version:
+                return version
+        text = _safe_read_text(pyproject_path)
+        for pattern in (
+            r'(?m)^\s*version\s*=\s*"([^"]+)"\s*$',
+            r"(?m)^\s*version\s*=\s*'([^']+)'\s*$",
+        ):
+            match = re.search(pattern, text)
+            if match:
+                return match.group(1).strip()
+
+    return REPO_VERSION_UNKNOWN
+
+
+def _dependency_inputs(profile: dict[str, Any]) -> list[str]:
+    repo_root = Path(profile["repo_root"])
+    inputs: list[str] = []
+
+    requirements_files = sorted(repo_root.glob("requirements*.txt"))
+    for req_file in requirements_files:
+        inputs.extend(_normalize_requirements_text(_safe_read_text(req_file)))
+
+    pyproject_path = repo_root / "pyproject.toml"
+    if pyproject_path.exists():
+        inputs.extend(_extract_pyproject_dependency_lines(pyproject_path))
+
+    pipfile_path = repo_root / "Pipfile"
+    if pipfile_path.exists():
+        inputs.extend(_normalize_requirements_text(_safe_read_text(pipfile_path)))
+
+    uv_lock_path = repo_root / "uv.lock"
+    if uv_lock_path.exists():
+        inputs.extend(_normalize_requirements_text(_safe_read_text(uv_lock_path)))
+
+    return _as_sorted_unique(inputs)
+
+
+def _build_dependency_signature(inputs: list[str]) -> str:
+    payload = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
+    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
+
+
+def build_environment_signature(profile: dict[str, Any]) -> dict[str, Any]:
+    repo_version = _detect_repo_version(profile)
+    python_hints = profile.get("python_hints", {}) or {}
+    test_runner_hints = profile.get("test_runner_hints", {}) or {}
+    python_versions = _normalize_python_versions(python_hints.get("versions", []))
+    test_commands = _normalize_tokens(test_runner_hints.get("commands", []))
+    package_managers = _normalize_tokens(python_hints.get("package_managers", []))
+    dependency_inputs = _dependency_inputs(profile)
+    dependency_signature = _build_dependency_signature(dependency_inputs)
+
+    py_segment = "_".join(f"py{v.replace('.', '')}" for v in python_versions) if python_versions else "pyunknown"
+    test_segment = "+".join(test_commands) if test_commands else "unknown"
+    package_segment = "+".join(package_managers) if package_managers else "unknown"
+    version = f"{repo_version}__{py_segment}__{test_segment}__{package_segment}__reqhash_{dependency_signature}"
+
+    return {
+        "repo_version": repo_version,
+        "python_versions": python_versions,
+        "test_runner_commands": test_commands,
+        "package_managers": package_managers,
+        "dependency_inputs": dependency_inputs,
+        "dependency_signature": dependency_signature,
+        "version": version,
+        "signature": version,
+    }

LLM Judgeworse

Overall Δ: -1.15
Confidence: 0.89

Candidate mostly implements environment signature generation, but it diverges from the gold fix in important ways: repo version detection is incomplete, the signature schema/fallback labels differ, and it omits the regression tests that anchor the intended behavior.

Task Fitworse Δ-1

It adds environment signature support and populates `repo_version`/`version`, but it does not match the full task coverage of the gold patch. In particular, repo version detection only handles `pyproject.toml` and misses the gold fix’s `setup.cfg` and `setup.py` paths, which were part of the required deterministic version key behavior.

Correctness Safetyworse Δ-1

The candidate introduces behavioral gaps and format drift versus gold. `_detect_repo_version()` ignores `setup.cfg` and `setup.py`, so some repositories will incorrectly stay at `repover_unknown`. It also changes fallback label semantics from `testunknown`/`pkgunknown` to plain `unknown`, which alters signature stability and compatibility relative to the intended format. The adapter-coupled implementation is also riskier than the gold patch’s self-contained helper for this targeted change.

Maintainabilityworse Δ-1

The gold patch keeps version extraction in `inspect.py` and signature construction in a straightforward helper. The candidate makes `build_environment_signature()` more coupled to adapter lookup and profile shape, adds extra fields not needed for the task, and broadens the logic beyond the stated fix. That makes the implementation harder to reason about for a regression-focused change.

Test Qualitymuch worse Δ-2

The gold patch adds focused unit coverage in both `tests/unit/test_inspect.py` and `tests/unit/test_signature.py` for repo-version extraction, deterministic signatures, fallback behavior, and normalization. The candidate patch includes no test changes at all, so it does not demonstrate or lock in the regression fix.

Change Focusworse Δ-1

The candidate stays in the general area of the bug, but it adds unrelated API surface in `repogauge/mining/__init__.py` and broadens the implementation toward adapter-driven generalization instead of the narrower repository-version/signature fix in the gold patch.