Corpus and training pipeline

Pipeline overview

A bitemporal pipeline in daydream/training/ converts archived trajectories into fine-tuning datasets. The pipeline has three stages, each exposed as a daydream corpus sub-verb.

Harvest

harvest.py walks the archive index and assembles capture-time signals for each run: verifier verdicts, finding records, grounding rate, and review length. It then derives an outcome label, scores the intrinsic reward, and writes one bitemporal observation per run to the label_observations SQLite table.

Each observation carries: the outcome label, the PR state (if applicable), the reward breakdown JSON, the composite reward scalar, the evidence SHA, the rubric JSON, the reward version, the reviewer logins (for PR rows), and a posterior flag.

Harvest is idempotent. The write layer deduplicates on (evidence_sha, reward_version), so re-running harvest with unchanged evidence is a no-op counted in skipped. A REWARD_VERSION bump changes the dedup key and appends a new generation. Older as_of pins still resolve their original scores.
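The dedup behaviour described above can be sketched with a uniqueness constraint and INSERT OR IGNORE. This is a minimal illustration, not the actual write layer: the reduced schema and the write_observation helper are assumptions, and the real label_observations table carries more columns.

```python
import sqlite3

# Hypothetical, reduced schema; only the dedup key matters for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS label_observations (
    session_id       TEXT NOT NULL,
    evidence_sha     TEXT NOT NULL,
    reward_version   TEXT NOT NULL,
    composite_reward REAL,
    UNIQUE (evidence_sha, reward_version)
)
"""


def write_observation(conn, session_id, evidence_sha, reward_version, composite_reward):
    """Insert one observation. Return True if written, False if skipped as a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO label_observations VALUES (?, ?, ?, ?)",
        (session_id, evidence_sha, reward_version, composite_reward),
    )
    # rowcount is 0 when the UNIQUE constraint caused the row to be ignored.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
```

Under this sketch, a second write with the same evidence SHA and reward version is skipped, while the same evidence under a bumped reward version is appended as a new row.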

GitHub API calls during harvest are rate-limited: per-request spacing defaults to 0.8 seconds (--gh-spacing-sec), and the harvest aborts cleanly when the rate limit is exhausted. A BackfillCache memoizes GitHub API responses across harvest runs so re-harvesting after a reward-version bump does not re-fetch.
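As a rough sketch of request spacing combined with memoization, the class below enforces a minimum gap between uncached calls and returns cached responses without waiting. The SpacedFetcher name, its constructor parameters, and the in-memory dict are assumptions for illustration; the real BackfillCache persists across harvest runs, which this sketch does not attempt.

```python
import time


class SpacedFetcher:
    """Enforce a minimum gap between uncached calls and memoize responses by URL."""

    def __init__(self, fetch, spacing_sec=0.8, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch          # callable that performs the real request
        self._spacing = spacing_sec  # mirrors the --gh-spacing-sec default
        self._clock = clock
        self._sleep = sleep
        self._last = None            # time of the previous uncached request
        self._cache = {}             # stand-in for a persistent cache

    def get(self, url):
        if url in self._cache:
            return self._cache[url]  # cache hit: no request, no wait
        if self._last is not None:
            wait = self._spacing - (self._clock() - self._last)
            if wait > 0:
                self._sleep(wait)
        self._last = self._clock()
        self._cache[url] = self._fetch(url)
        return self._cache[url]
```

Injecting the clock and sleep functions keeps the spacing logic testable without real delays.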

Reward scoring

reward.py is a pure function over capture-time signals. No I/O, no side effects. The default weights are correctness-dominant:

| Axis | Weight | Role |
| --- | --- | --- |
| Correctness | 0.6 | Mean of per-finding verifier verdicts |
| Grounding | 0.4 | Fraction of findings with code evidence |
| Length penalty | 0.2 | Subtracted from the credit mean |
| False-positive penalty | 0.3 | Sibling axis, not folded into composite |

Verdict-to-score mapping:

| Verdict | Score |
| --- | --- |
| consistent | 1.0 |
| uncertain | 0.5 |
| contradicts | 0.0 |
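The mapping above feeds the correctness axis, which is the mean of per-finding scores. A minimal sketch follows; the function name is hypothetical, and dropping unrecognized verdicts rather than raising is an assumption.

```python
from statistics import fmean

VERDICT_SCORES = {"consistent": 1.0, "uncertain": 0.5, "contradicts": 0.0}


def correctness(verdicts):
    """Mean verdict score, or None when no recognized verdict is present."""
    # Assumption: unrecognized verdict strings are skipped, not scored as zero.
    scores = [VERDICT_SCORES[v] for v in verdicts if v in VERDICT_SCORES]
    return fmean(scores) if scores else None
```

Returning None for an empty list keeps the axis "missing" instead of turning it into a zero score.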

The format_valid gate dominates every other axis. When it is False, the composite is forced to 0.0 regardless of the other signals. When format_valid is True but all credit axes are missing, the composite is None (uncomputable, not zero).

Missing signals are never imputed as zero. An absent or unparseable signal makes that axis None with a presence flag. The credit mean renormalizes over the axes actually measured, so an uninstrumented run cannot masquerade as a failed one.
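The gate and the renormalization can be sketched together. This is an illustration under stated assumptions, not the reward.py implementation: the function signature is hypothetical, the length penalty is assumed to lie in [0, 1] and to be scaled by its weight before subtraction, and clamping the result at 0.0 is also an assumption.

```python
def composite(format_valid, correctness=None, grounding=None, length_penalty=None,
              w_correct=0.6, w_ground=0.4, w_length=0.2):
    """Composite reward: 0.0 if the format gate fails, None if nothing was measured."""
    if not format_valid:
        return 0.0  # the gate dominates every other axis

    axes = [(w_correct, correctness), (w_ground, grounding)]
    measured = [(w, v) for w, v in axes if v is not None]
    if not measured:
        return None  # uncomputable, deliberately not zero

    # Renormalize over the axes actually measured.
    credit = sum(w * v for w, v in measured) / sum(w for w, _ in measured)

    # Assumption: a missing length signal contributes no penalty.
    penalty = w_length * length_penalty if length_penalty is not None else 0.0
    return max(0.0, credit - penalty)
```

For example, a run with only a correctness score of 0.5 yields a composite of 0.5 rather than 0.3, because the missing grounding axis is left out of the denominator instead of being counted as zero.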

The false-positive penalty (weight 0.3) is kept as a sibling field on the PosteriorBreakdown, not subtracted into the composite. It measures calibrated surprise: the absolute difference between the maintainer outcome penalty and a prior expectation. Only runs with a mapped PR outcome (accepted, contested, rejected) produce a PosteriorBreakdown. Runs without a PR outcome produce a plain RewardBreakdown.

Only the default weights earn the canonical REWARD_VERSION stamp ("2026.05.28-2"). Custom weights get a content-hash suffix (REWARD_VERSION+custom-<8hex>), keyed by object identity. This prevents analysis-time weight sweeps from contaminating the canonical corpus.
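One way to read the stamping rule is sketched below. The Weights dataclass, DEFAULT_WEIGHTS, and version_stamp are hypothetical names, and two choices are assumptions: the canonical check uses object identity (so even an equal copy of the defaults would be stamped as custom), and the suffix is the first 8 hex characters of a SHA-256 over the JSON-serialized weights.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

REWARD_VERSION = "2026.05.28-2"


@dataclass(frozen=True)
class Weights:
    correctness: float = 0.6
    grounding: float = 0.4
    length_penalty: float = 0.2
    false_positive: float = 0.3


DEFAULT_WEIGHTS = Weights()


def version_stamp(weights):
    """Canonical stamp for the default weights object, hashed suffix otherwise."""
    # Assumption: identity check, not equality, decides canonical status.
    if weights is DEFAULT_WEIGHTS:
        return REWARD_VERSION
    payload = json.dumps(asdict(weights), sort_keys=True).encode("utf-8")
    return f"{REWARD_VERSION}+custom-{hashlib.sha256(payload).hexdigest()[:8]}"
```

Because the suffix depends only on the weight values, two sweeps with the same custom weights share a stamp and neither collides with the canonical version.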

Build corpus

corpus.py projects as_of-pinned annotations into JSONL training records. It writes one JSON object per run, filtered by label, reward threshold, skill, repo, status, and license.

The default admission filter is accepted-only: labels = ("accepted",). The --min-reward flag provides an alternative admission path that admits runs whose composite_reward meets a threshold, even without the accepted label. The --include-all-labels flag disables label filtering entirely.
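The admission rules above can be read as a small predicate. The sketch below is one interpretation: the function name and parameters are hypothetical stand-ins for the CLI flags, and treating --include-all-labels as admitting every run at this stage is an assumption.

```python
def admits(label, composite_reward, labels=("accepted",),
           min_reward=None, include_all_labels=False):
    """Return True if a run passes the label / reward admission stage."""
    if include_all_labels:
        return True  # label filtering disabled entirely
    if label in labels:
        return True  # default path: accepted-only
    # Alternative path: reward threshold, even without an admitted label.
    # A None composite (uncomputable) never meets a threshold.
    return (min_reward is not None
            and composite_reward is not None
            and composite_reward >= min_reward)
```

Note that an uncomputable composite is never admitted through the reward path, which is consistent with missing signals not being treated as scores.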

Each corpus build writes a content-addressed lineage.json manifest beside the JSONL. The manifest records the SHA-256 of the sorted, newline-joined session_id set, the labeler version, the reward version, the as_of pin, and the creation timestamp. This gives byte-for-byte reproducibility: the same filter set and as_of pin always produce the same hash.
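The session-set hash can be sketched in a few lines. The function name is hypothetical, and UTF-8 encoding with no trailing newline is an assumption about details the text does not specify.

```python
import hashlib


def session_set_hash(session_ids):
    """SHA-256 hex digest of the sorted, newline-joined set of session IDs."""
    # Deduplicate and sort so input order and repeats do not affect the hash.
    joined = "\n".join(sorted(set(session_ids)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Sorting before hashing is what makes the digest depend only on which sessions were admitted, not on the order in which the build visited them.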

A temporal-leakage guard drops any annotation whose valid_at (the PR merge timestamp) is lexicographically greater than the as_of pin. This prevents future information from leaking into training data.
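A minimal sketch of the guard follows. It assumes annotations are dicts, that valid_at and as_of are ISO-8601 strings in the same format and timezone (string comparison only matches chronological order under that condition), and that annotations without a valid_at are kept; all three are assumptions.

```python
def drop_future(annotations, as_of):
    """Keep annotations whose valid_at is not later than the as_of pin."""
    kept = []
    for ann in annotations:
        valid_at = ann.get("valid_at")
        # Assumption: rows with no merge timestamp cannot leak and are kept.
        if valid_at is None or valid_at <= as_of:
            kept.append(ann)
    return kept
```

Comparing strings avoids timestamp parsing, but mixed formats such as differing UTC offsets would silently break the ordering.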

An exclusion list (schema/exclusion.txt) is always enforced. Benchmark source repos (Sentry, Grafana, Cal.com, Discourse, Keycloak) are excluded from the training corpus so they remain a clean held-out evaluation set. A copyleft list (schema/copyleft.txt) is opt-in via --allow-copyleft.
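The interaction between the two lists can be sketched as a set operation. The function name, the record shape, and the idea that each list is a collection of repo identifiers are assumptions; the actual file formats of schema/exclusion.txt and schema/copyleft.txt are not described here.

```python
def filter_repos(records, excluded, copyleft, allow_copyleft=False):
    """Drop records from excluded repos always, and from copyleft repos unless allowed."""
    blocked = set(excluded)
    if not allow_copyleft:
        blocked |= set(copyleft)  # copyleft repos are blocked by default
    return [r for r in records if r["repo"] not in blocked]
```

The exclusion set is applied unconditionally, so no flag in this sketch can readmit a benchmark repo.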

Training roadmap

The data pipeline is implemented and operational. The training stages are roadmap items. The planned recipe targets an open-weight code-review model (Qwen2.5-Coder-7B-Instruct, trained with QLoRA) via rejection-filtered SFT, span-segmented SFT on ATIF reasoning and action spans, and KTO (Kahneman-Tversky Optimization) preference training on PR-comment accept and reject labels.
