Benchmarking
daydream bench scores deep-review findings against
Martian's Code Review Benchmark
on an offline replay subset of 26 PRs.
The pinned PR registry lives in
daydream/benchmark/prs.py:
| Source repo | PR count |
|---|---|
| getsentry/sentry | 6 |
| grafana/grafana | 10 |
| calcom/cal.com | 10 |
| Total | 26 |
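
The per-repo counts in the table above can be checked against the stated total. The sketch below is illustrative only: `PR_COUNTS` and `total_prs` are hypothetical names, and the real registry in daydream/benchmark/prs.py may be structured quite differently (for example, as a list of pinned PR records rather than counts).

```python
# Hypothetical mirror of the table above. The actual registry in
# daydream/benchmark/prs.py is not shown here and may look different.
PR_COUNTS = {
    "getsentry/sentry": 6,
    "grafana/grafana": 10,
    "calcom/cal.com": 10,
}


def total_prs(counts: dict[str, int]) -> int:
    """Sum the per-repo PR counts."""
    return sum(counts.values())


if __name__ == "__main__":
    # Should agree with the "Total" row of the table.
    print(total_prs(PR_COUNTS))
```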
For each PR, the benchmark harness runs daydream --non-interactive --base <base_sha> --trajectory <path> <checkout> against a blobless clone, reads the
merged findings from .daydream/deep/merged-items.json, and injects them as
synthetic review comments into the benchmark corpus. Scoring then runs Martian's step2 (extract),
step2.5 (dedup), and step3 (judge) modules, producing micro-averaged precision
and recall against the golden labels.
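
Micro-averaging pools the match counts across all PRs before dividing, so PRs with more findings weigh more than they would under a per-PR (macro) average. The following is a minimal sketch of that calculation, not Martian's scoring code: `PRCounts` and `micro_average` are hypothetical names, and the numbers in the usage example are made up for illustration rather than taken from any benchmark run.

```python
from dataclasses import dataclass


@dataclass
class PRCounts:
    """Per-PR match counts (hypothetical structure for illustration)."""
    true_positives: int   # findings that match a golden label
    false_positives: int  # findings with no matching golden label
    false_negatives: int  # golden labels with no matching finding


def micro_average(per_pr: list[PRCounts]) -> tuple[float, float]:
    """Return (precision, recall) computed from counts pooled over all PRs."""
    tp = sum(c.true_positives for c in per_pr)
    fp = sum(c.false_positives for c in per_pr)
    fn = sum(c.false_negatives for c in per_pr)
    # Guard against empty denominators (no findings, or no golden labels).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up counts for two PRs, not real benchmark data.
    example = [PRCounts(8, 2, 2), PRCounts(0, 2, 2)]
    precision, recall = micro_average(example)
    print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this made-up example the pooled counts are 8 true positives, 4 false positives, and 4 false negatives, giving a precision and recall of 8/12 each. Averaging the two per-PR precisions instead (0.8 and 0.0) would give 0.4, which is why the choice of averaging matters when comparing runs.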
The judge requires MARTIAN_API_KEY. The default judge model is
anthropic/claude-opus-4.5. Per-model results land under
results/<model-with-slashes-replaced> in the benchmark repo.
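
As a rough sketch of how a per-model results path could be derived from a model name: `results_dir` is a hypothetical helper, and the character that replaces each slash is not specified above, so it is left as a required argument here rather than guessed.

```python
from pathlib import Path


def results_dir(model: str, replacement: str, root: Path = Path("results")) -> Path:
    """Build a results path for a model name.

    `replacement` is whatever the benchmark repo substitutes for "/";
    that character is an assumption supplied by the caller.
    """
    return root / model.replace("/", replacement)


if __name__ == "__main__":
    # "_" is used purely for illustration.
    print(results_dir("anthropic/claude-opus-4.5", "_").as_posix())
```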
Five repos (Sentry, Grafana, Cal.com, Discourse, and Keycloak) are excluded
from the training corpus
(exclusion.txt)
so that the benchmark remains a clean held-out evaluation set.