Corpus and training pipeline
Pipeline overview
A bitemporal pipeline in daydream/training/ converts archived trajectories
into fine-tuning datasets. The pipeline has three stages, each exposed as a
daydream corpus sub-verb.
Harvest
harvest.py walks the archive index and assembles capture-time signals for each run:
verifier verdicts, finding records, grounding rate, and review length. It then
derives an outcome label, scores the intrinsic reward, and writes one bitemporal
observation per run to the label_observations SQLite table.
Each observation carries: the outcome label, the PR state (if applicable), the reward breakdown JSON, the composite reward scalar, the evidence SHA, the rubric JSON, the reward version, the reviewer logins (for PR rows), and a posterior flag.
Harvest is idempotent. The write layer deduplicates on
(evidence_sha, reward_version), so re-running harvest with unchanged evidence
is a no-op counted in skipped. A REWARD_VERSION bump changes the dedup key
and appends a new generation. Older as_of pins still resolve their original
scores.
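The dedup behaviour described above can be sketched with a unique constraint and INSERT OR IGNORE. This is a minimal illustration, not the project's write layer: the table is reduced to a few columns, and the function name write_observations and its (written, skipped) return shape are assumptions made for the example.

```python
import sqlite3

# Hypothetical, reduced schema. The real label_observations table carries
# more columns; only the dedup key matters for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS label_observations (
    id INTEGER PRIMARY KEY,
    run_id TEXT NOT NULL,
    evidence_sha TEXT NOT NULL,
    reward_version TEXT NOT NULL,
    composite_reward REAL,
    UNIQUE (evidence_sha, reward_version)
)
"""


def write_observations(conn, rows):
    """Append observations and return (written, skipped) counts.

    A row whose (evidence_sha, reward_version) pair already exists is
    ignored by SQLite and counted as skipped.
    """
    written = skipped = 0
    for row in rows:
        cur = conn.execute(
            "INSERT OR IGNORE INTO label_observations "
            "(run_id, evidence_sha, reward_version, composite_reward) "
            "VALUES (:run_id, :evidence_sha, :reward_version, :composite_reward)",
            row,
        )
        if cur.rowcount == 1:
            written += 1
        else:
            skipped += 1
    conn.commit()
    return written, skipped
```

Re-running with the same rows leaves the table unchanged, while the same evidence under a new reward version appends a second row instead of overwriting the first.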
GitHub API calls during harvest are rate-limited: per-request spacing defaults
to 0.8 seconds (--gh-spacing-sec), and the harvest aborts cleanly when the
rate limit is exhausted. A BackfillCache memoizes GitHub API responses across
harvest runs, so re-harvesting after a reward-version bump does not re-fetch them.
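How request spacing and memoization interact can be sketched as follows. The class SpacedCachedFetcher and its parameters are hypothetical and are not the BackfillCache API. The cache here is in-memory only, whereas the text describes one that persists across harvest runs, and the clean abort on rate-limit exhaustion is not modelled.

```python
import time


class SpacedCachedFetcher:
    """Hypothetical helper: memoize responses and space out real requests."""

    def __init__(self, fetch, spacing_sec=0.8, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch          # callable that performs the real request
        self._spacing = spacing_sec  # minimum gap between real requests
        self._clock = clock
        self._sleep = sleep
        self._cache = {}
        self._last_call = None

    def get(self, url):
        # A cache hit costs neither a request nor a delay.
        if url in self._cache:
            return self._cache[url]
        # Wait out the remainder of the spacing window, if any.
        if self._last_call is not None:
            wait = self._spacing - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)
        self._last_call = self._clock()
        result = self._fetch(url)
        self._cache[url] = result
        return result
```

The clock and sleep functions are injectable so the spacing logic can be exercised without real delays.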
Reward scoring
reward.py is a pure function over capture-time signals: no I/O, no side
effects. The default weights are correctness-dominant:
| Axis | Weight | Role |
|---|---|---|
| Correctness | 0.6 | Mean of per-finding verifier verdicts |
| Grounding | 0.4 | Fraction of findings with code evidence |
| Length penalty | 0.2 | Subtracted from the credit mean |
| False-positive penalty | 0.3 | Sibling axis, not folded into composite |
Verdict-to-score mapping:
| Verdict | Score |
|---|---|
| consistent | 1.0 |
| uncertain | 0.5 |
| contradicts | 0.0 |
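As an illustration of the mapping above, the correctness axis (the mean of per-finding verdict scores) could be computed like this. The names VERDICT_SCORES and correctness_axis are hypothetical, and skipping unrecognized verdict strings is an assumption of the sketch, not documented behaviour.

```python
# Verdict scores copied from the mapping table above.
VERDICT_SCORES = {"consistent": 1.0, "uncertain": 0.5, "contradicts": 0.0}


def correctness_axis(verdicts):
    """Mean verdict score, or None when no usable verdict is present.

    Assumption: unrecognized verdict strings are skipped instead of
    being scored as zero.
    """
    scores = [VERDICT_SCORES[v] for v in verdicts if v in VERDICT_SCORES]
    if not scores:
        return None
    return sum(scores) / len(scores)
```

Returning None for an empty list keeps a run with no verdicts distinct from a run whose findings were all contradicted.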
The format_valid gate is a dominating floor. When False, the composite
floors to 0.0 regardless of every other axis. When all credit axes are missing
but format_valid is True, the composite is None (uncomputable, not zero).
Missing signals are never imputed as zero. An absent or unparseable signal makes
that axis None with a presence flag. The credit mean renormalizes over the
axes actually measured, so an uninstrumented run cannot masquerade as a failed
one.
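The gate and renormalization rules can be sketched as a small pure function. The exact formula is an assumption: the sketch takes the weighted mean of whichever credit axes are present, then subtracts the weighted length penalty, with no clamping. The function and constant names are hypothetical.

```python
# Default weights from the table above.
CREDIT_WEIGHTS = {"correctness": 0.6, "grounding": 0.4}
LENGTH_PENALTY_WEIGHT = 0.2


def composite_reward(format_valid, correctness, grounding, length_penalty=None):
    """Return the composite scalar, 0.0 on a format failure, or None.

    Assumed formula: weighted mean over the credit axes that were
    measured, minus LENGTH_PENALTY_WEIGHT * length_penalty.
    """
    if not format_valid:
        return 0.0  # dominating floor, regardless of every other axis
    axes = {"correctness": correctness, "grounding": grounding}
    present = {name: value for name, value in axes.items() if value is not None}
    if not present:
        return None  # uncomputable, not zero
    # Renormalize over the axes actually measured.
    total_weight = sum(CREDIT_WEIGHTS[name] for name in present)
    credit_mean = sum(CREDIT_WEIGHTS[name] * value for name, value in present.items()) / total_weight
    penalty = LENGTH_PENALTY_WEIGHT * length_penalty if length_penalty is not None else 0.0
    return credit_mean - penalty
```

With only correctness measured at 1.0, the renormalized composite is 1.0 instead of 0.6, so a missing grounding signal does not drag the score down.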
The false-positive penalty (weight 0.3) is kept as a sibling field on the
PosteriorBreakdown, not subtracted into the composite. It measures calibrated
surprise: the absolute difference between the maintainer outcome penalty and a
prior expectation. Only runs with a mapped PR outcome (accepted,
contested, rejected) produce a PosteriorBreakdown. Runs without a PR
outcome produce a plain RewardBreakdown.
Only the default weights earn the canonical REWARD_VERSION stamp
("2026.05.28-2"). Custom weights get a content-hash suffix
(REWARD_VERSION+custom-<8hex>), keyed by object identity. This prevents
analysis-time weight sweeps from contaminating the canonical corpus.
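One way to produce such a stamp is sketched below. Several details are assumptions: the sketch reads "keyed by object identity" as an `is` check against a module-level default object, and it derives the 8 hex characters from a SHA-256 over canonical JSON. The names DEFAULT_WEIGHTS and version_stamp, and the weight keys, are hypothetical.

```python
import hashlib
import json
from types import MappingProxyType

REWARD_VERSION = "2026.05.28-2"

# Hypothetical default-weights object. MappingProxyType makes it read-only,
# so the canonical weights cannot be mutated in place.
DEFAULT_WEIGHTS = MappingProxyType({
    "correctness": 0.6,
    "grounding": 0.4,
    "length_penalty": 0.2,
    "false_positive_penalty": 0.3,
})


def version_stamp(weights):
    """Return the canonical stamp for the default object, else a hashed one."""
    if weights is DEFAULT_WEIGHTS:  # identity check, an assumed reading
        return REWARD_VERSION
    # Canonical JSON so equal weights hash equally whatever the key order.
    payload = json.dumps(dict(weights), sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]
    return f"{REWARD_VERSION}+custom-{digest}"
```

Under this reading, even a copy of the defaults receives a custom suffix, which errs on the side of keeping ad hoc runs out of the canonical generation.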
Build corpus
corpus.py projects as_of-pinned annotations into JSONL training records. It
writes one JSON object per run, filtered by label, reward threshold, skill,
repo, status, and license.
The default admission filter is accepted-only: labels = ("accepted",). The
--min-reward flag provides an alternative admission path that admits runs
whose composite_reward meets a threshold, even without the accepted label.
The --include-all-labels flag disables label filtering entirely.
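The label and reward admission paths can be sketched as a single predicate. The function name admits and the record keys are hypothetical. The sketch covers only the label and reward filters, and it assumes that disabling label filtering admits every record on this axis.

```python
def admits(record, labels=("accepted",), min_reward=None, include_all_labels=False):
    """Return True if the record passes the label/reward admission filter."""
    if include_all_labels:
        return True  # label filtering disabled entirely
    if record.get("label") in labels:
        return True  # default path: admitted by label
    # Alternative path: admitted by reward threshold, without the label.
    reward = record.get("composite_reward")
    return min_reward is not None and reward is not None and reward >= min_reward
```

A record with a None composite never passes the threshold path, consistent with treating an uncomputable reward as distinct from a low one.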
Each corpus build writes a content-addressed lineage.json manifest beside the
JSONL. The manifest records the SHA-256 of the sorted, newline-joined
session_id set, the labeler version, the reward version, the as_of pin, and
the creation timestamp. This gives byte-for-byte reproducibility: the same
filter set and as_of pin always produce the same session-set hash.
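The session-set hash described above takes only a few lines. The function name session_set_hash is hypothetical, and UTF-8 encoding is an assumption.

```python
import hashlib


def session_set_hash(session_ids):
    """SHA-256 hex digest of the sorted, newline-joined session_id set."""
    # set() drops duplicates and sorted() fixes the order, so the digest
    # depends only on which sessions are present.
    joined = "\n".join(sorted(set(session_ids)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Because the ids are sorted before hashing, two builds that select the same sessions in a different order yield the same digest.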
A temporal-leakage guard drops any annotation whose valid_at (the PR merge
timestamp) is lexically greater than the as_of pin. This prevents future
information from leaking into training data.
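A minimal sketch of the guard follows. It assumes timestamps are stored as fixed-width ISO-8601 strings in a single timezone, which is what makes lexical order match chronological order. The function name drop_future is hypothetical, and keeping annotations that have no valid_at is an assumption.

```python
def drop_future(annotations, as_of):
    """Keep annotations whose valid_at is not lexically greater than as_of.

    Assumption: annotations without a valid_at (no PR merge timestamp)
    are kept.
    """
    kept = []
    for annotation in annotations:
        valid_at = annotation.get("valid_at")
        if valid_at is None or valid_at <= as_of:
            kept.append(annotation)
    return kept
```

String comparison is only safe here because of the fixed-width format. Mixed timezone offsets or varying precision would need parsing into datetime objects first.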
An exclusion list (schema/exclusion.txt) is always enforced. Benchmark source
repos (Sentry, Grafana, Cal.com, Discourse, Keycloak) are excluded from the
training corpus so they remain a clean held-out evaluation set. A copyleft list
(schema/copyleft.txt) is opt-in via --allow-copyleft.
Training roadmap
The data pipeline is implemented and operational. The training stages are roadmap items. The planned recipe targets an open-weight code-review model (Qwen2.5-Coder-7B-Instruct, trained with QLoRA) via rejection-filtered SFT, span-segmented SFT on ATIF reasoning and action spans, and KTO (Kahneman-Tversky Optimization) preference training on PR-comment accept and reject labels.