Benchmarking

daydream bench scores deep-review findings against Martian's Code Review Benchmark on an offline replay subset of 26 PRs.

The pinned PR registry lives in daydream/benchmark/prs.py:

Source repo        PR count
getsentry/sentry   6
grafana/grafana    10
calcom/cal.com     10
Total              26

For each PR, the benchmark harness runs daydream --non-interactive --base <base_sha> --trajectory <path> <checkout> against a blobless clone, reads the merged findings from .daydream/deep/merged-items.json, and injects them as synthetic review comments into the benchmark corpus. Scoring then runs Martian's step2 (extract), step2.5 (dedup), and step3 (judge) modules, producing micro-averaged precision and recall against the golden labels.
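To make "micro-averaged" concrete, here is a minimal sketch of the arithmetic, not the harness's actual code. The PRCounts type, its field names, and the sample numbers are hypothetical; the real per-PR output of Martian's judge step may be shaped differently. The point is only that counts are pooled across all PRs before precision and recall are computed once, rather than averaging per-PR scores.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PRCounts:
    """Per-PR judge outcome counts (hypothetical shape, for illustration only)."""

    true_positives: int   # findings matched to a golden label
    false_positives: int  # findings with no matching golden label
    false_negatives: int  # golden labels that no finding matched


def micro_average(counts: list[PRCounts]) -> tuple[float, float]:
    """Pool counts across all PRs, then compute precision and recall once."""
    tp = sum(c.true_positives for c in counts)
    fp = sum(c.false_positives for c in counts)
    fn = sum(c.false_negatives for c in counts)
    # Guard against empty denominators (no findings, or no golden labels).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up numbers: one PR with few, accurate findings and one noisy PR.
    sample = [PRCounts(2, 0, 0), PRCounts(1, 5, 2)]
    precision, recall = micro_average(sample)
    print(f"precision={precision:.3f} recall={recall:.3f}")
```

Because the counts are pooled, a PR with many findings weighs more heavily in the final score than a PR with few, which is the usual trade-off of micro-averaging against macro-averaging.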

The judge requires MARTIAN_API_KEY. The default judge model is anthropic/claude-opus-4.5. Per-model results land under results/<model-with-slashes-replaced> in the benchmark repo.
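The two requirements above can be sketched as a small pre-flight helper. This is an illustration under stated assumptions, not the benchmark's own code: require_judge_key and results_dir are hypothetical names, and the character that replaces each slash is not specified here, so it is left as a required parameter rather than guessed.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from pathlib import Path

# Default judge model named in the text above.
DEFAULT_JUDGE_MODEL = "anthropic/claude-opus-4.5"


def require_judge_key(env: Mapping[str, str] | None = None) -> str:
    """Return MARTIAN_API_KEY, or fail early if it is missing or empty."""
    env = os.environ if env is None else env
    key = env.get("MARTIAN_API_KEY", "")
    if not key:
        raise RuntimeError("MARTIAN_API_KEY must be set before running the judge")
    return key


def results_dir(model: str, replacement: str, root: Path = Path("results")) -> Path:
    """Build results/<model-with-slashes-replaced>.

    `replacement` is whatever the benchmark repo substitutes for "/";
    that character is an assumption supplied by the caller.
    """
    return root / model.replace("/", replacement)


if __name__ == "__main__":
    # "_" is an assumed replacement character, used only for this demo.
    print(results_dir(DEFAULT_JUDGE_MODEL, replacement="_"))
```

Checking for the key before any PR is replayed avoids spending time on deep-review runs that cannot be judged afterwards.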

Five repos are excluded from the training corpus (exclusion.txt) so that the benchmark remains a clean held-out evaluation set: Sentry, Grafana, Cal.com, Discourse, and Keycloak.
