osprey

Research preview · source not yet public

What it is

Osprey is an agentic coding orchestrator in Rust, built around a directed-search self-improvement loop. The agent runs as a persona-driven, multi-turn loop with streaming LLM drivers and an interactive terminal UI (TUI). Its system prompt and tool descriptions are an editable surface that a harness searches against a frozen, stratified slice of real terminal tasks under a token-penalized reward, accepting only variants whose gains clear a measured noise margin.

The unit of evaluation is a full multi-turn, tool-using agent episode on a real filesystem, averaging 48 agent steps and 554k tokens per task, so every candidate evaluation is expensive and noisy. The harness's center of gravity is statistical decision machinery for that setting.

What it's for

Osprey exists to answer a research question: can a coding agent measurably improve its own behavior by searching over the text that defines it? You can also run it as a plain coding agent through its TUI, without engaging the search loop at all.

Self-improvement loop

A directed-search harness runs the loop. Each iteration mutates the agent's editable surface, re-evaluates it against the frozen 20-task slice, and accepts or rejects the change via git tags. Published measurements so far are the baseline and the noise floor — no candidate has yet cleared the gate.

Editable surface

Per-iteration mutations are restricted to two override files, materialized by the harness and pointed at via environment variables:

Layer 1 of the composed system prompt — the base persona instructions.
Per-tool descriptions — the text surfaced to the model for each of the eight core tools.

The agent loop, persona resolver, drivers, and tool dispatcher are deliberately out of scope — they are the substrate, not the editable parameters. Expanding the surface to include scaffold knobs, persona contents, and per-tool argument schemas is on the roadmap.

Reward

score = mean(reward) − λ · mean(tokens)

reward is the per-task pass/fail outcome, encoded as 0 or 1 and scored by each benchmark task's own test harness.
tokens is total tokens consumed by the agent during the task.
λ is derived, not hand-tuned: computed at runtime from the recorded baseline so that a 10% token bloat costs roughly 5% of the reward budget on the slice. It is tied to the slice version — changing the slice forces a fresh baseline run and a re-derived λ.

The token penalty is what keeps the search honest: without it, the optimizer is free to buy marginal pass-rate with unbounded verbosity.
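
As a rough illustration of the formula and the λ derivation described above, the sketch below assumes that "reward budget" means the baseline mean reward, so λ is chosen to make a 10% rise in mean tokens cost 5% of that mean. The function names, that reading of the derivation, and the flat per-task token counts are illustrative assumptions, not the harness's actual code or data.

```python
from statistics import mean


def derive_lambda(baseline_rewards, baseline_tokens, bloat=0.10, budget_share=0.05):
    """Choose lambda so `bloat` extra tokens costs `budget_share` of the reward budget.

    ASSUMPTION: the reward budget is taken to be the baseline mean reward.
    """
    reward_budget = mean(baseline_rewards)
    extra_tokens = bloat * mean(baseline_tokens)
    return budget_share * reward_budget / extra_tokens


def score(rewards, tokens, lam):
    """score = mean(reward) - lambda * mean(tokens)"""
    return mean(rewards) - lam * mean(tokens)


# Illustrative inputs only: 14 passes out of 20, every task set to the
# reported baseline mean of 553,942 tokens (real per-task counts will vary).
baseline_rewards = [1.0] * 14 + [0.0] * 6
baseline_tokens = [553_942] * 20

lam = derive_lambda(baseline_rewards, baseline_tokens)
base = score(baseline_rewards, baseline_tokens, lam)
bloated = score(baseline_rewards, [t * 1.10 for t in baseline_tokens], lam)
penalty = base - bloated  # cost of a 10% token bloat at an unchanged pass rate
```

Under this reading, a candidate that spends 10% more tokens has to raise the mean pass rate by about 0.035 just to break even, before the accept gate's noise margin is considered.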

Slice

The slice freezes 20 benchmark tasks, stratified from a baseline run into a mix of easy passes, medium passes, a hard pass, small fails, and big fails. The mix matters: a pass-only slice has no headroom to improve against, and a fail-only slice has no floor to protect.

Accept gate

Five full re-runs of the unchanged agent established the noise floor: per-run score standard deviation 0.089 (a 12.8% coefficient of variation), with individual tasks flipping pass/fail up to 40% of the time.

A single comparison against that backdrop would accept noise as progress, so each iteration runs a paired trial — five evaluations of the incumbent champion and five of the candidate — and accepts only when the paired delta exceeds 1.5× the pooled standard deviation, a margin calibrated to roughly a 0.9% false-accept rate against the measured floor. A positive-but-sub-noise delta is rejected.
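
The arithmetic behind the gate can be checked with a short sketch. It assumes the paired delta is the difference of mean scores between the two arms, that the pooled standard deviation is computed over both arms' runs, and that run scores are independent and roughly normal with a known spread. Under those assumptions a 1.5σ margin with five runs per arm corresponds to z ≈ 2.37, a one-sided tail of about 0.9%. The function names and the example scores are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_std(a, b):
    """Pooled sample standard deviation of two groups of run scores."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return math.sqrt(pooled_var)


def accept(champion, candidate, k=1.5):
    """Accept only when the mean delta exceeds k pooled standard deviations."""
    delta = mean(candidate) - mean(champion)
    return delta > k * pooled_std(champion, candidate)


def false_accept_rate(n=5, k=1.5):
    """One-sided chance that pure noise clears the margin (normal approximation).

    With n runs per arm, the difference of means has standard deviation
    sigma * sqrt(2 / n), so a margin of k * sigma sits at z = k / sqrt(2 / n).
    """
    z = k / math.sqrt(2.0 / n)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical run scores, for illustration only.
champion = [0.60, 0.70, 0.75, 0.65, 0.80]
small_gain = [s + 0.05 for s in champion]  # positive but below the margin
large_gain = [s + 0.20 for s in champion]  # clears the margin
```

Here `accept(champion, small_gain)` is false even though the delta is positive, which is the sub-noise rejection case described above.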

Search

The harness runs a directed search on a single linear branch:

Propose — a pluggable backend writes candidate override files into a staging directory. The default backend invokes a headless coding agent; any backend matching the protocol can be substituted (a scripted sweep, a separate agent, or a human in the loop).
Apply — the harness validates the edit allowlist and scrubs for API-key-shaped literals before writing.
Evaluate — run the paired champion/candidate trials through the benchmark adapter.
Score — compute the token-penalized reward.
Accept — pin an optimize/accepted/<N> git tag to HEAD.
Reject — reset hard back to the previous accept tag.
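
The steps above can be sketched as a single iteration with a pluggable proposer. Everything named here is hypothetical: the override file names, the key-shaped pattern, the `Proposer` protocol, and the choice to reject (rather than rewrite) a candidate containing a key-shaped literal. The evaluation and gate are passed in as callables, and the git operations are left to the caller, so the sketch only shows the control flow.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Protocol

# Hypothetical override file names; the source only says there are two files.
ALLOWED_FILES = {"system_prompt_layer1.txt", "tool_descriptions.json"}
# Hypothetical "API-key-shaped" pattern: sk- followed by 20+ key characters.
KEY_SHAPED = re.compile(r"sk-[A-Za-z0-9_-]{20,}")

Overrides = Dict[str, str]  # file name -> file contents


class Proposer(Protocol):
    """Any backend exposing this method can be substituted."""

    def propose(self, iteration: int) -> Overrides: ...


@dataclass
class Decision:
    accepted: bool
    tag: Optional[str]  # tag to pin on accept, None on reject


def validate(candidate: Overrides) -> None:
    """Apply step: enforce the edit allowlist and check for key-shaped literals."""
    outside = set(candidate) - ALLOWED_FILES
    if outside:
        raise ValueError(f"edit outside allowlist: {sorted(outside)}")
    for name, text in candidate.items():
        if KEY_SHAPED.search(text):
            raise ValueError(f"key-shaped literal in {name}")


def run_iteration(
    iteration: int,
    proposer: Proposer,
    evaluate: Callable[[Optional[Overrides]], List[float]],
    gate: Callable[[List[float], List[float]], bool],
    accepted_so_far: int,
) -> Decision:
    candidate = proposer.propose(iteration)      # Propose
    validate(candidate)                          # Apply
    champion_scores = evaluate(None)             # Evaluate the incumbent
    candidate_scores = evaluate(candidate)       # Evaluate the candidate
    if gate(champion_scores, candidate_scores):  # Score and compare
        return Decision(True, f"optimize/accepted/{accepted_so_far + 1}")  # Accept
    return Decision(False, None)                 # Reject: caller resets to the last tag


class ScriptedProposer:
    """A scripted-sweep backend, one of the substitutions mentioned above."""

    def propose(self, iteration: int) -> Overrides:
        return {"system_prompt_layer1.txt": f"Be concise. (variant {iteration})"}
```

Keeping the proposer behind a one-method protocol is what lets a headless agent, a scripted sweep, or a human produce candidates without the rest of the loop changing.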

Guardrails bound the loop. It refuses to operate on main or any non-optimization branch. A file lock enforces one iteration at a time. A baseline fingerprint check halts the run if the agent's config, candidate inputs, or slice version diverge from the recorded baseline. Host provider keys are stripped from the proposal subprocess's environment (the proposer CLI keeps only its own key). One documented gap remains: the max-turns override sits outside the fingerprint's scope.
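
Two of these guardrails lend themselves to a small sketch: the baseline fingerprint and the environment stripping. The sketch assumes the fingerprint is a SHA-256 hash over a canonical JSON encoding of the three inputs named above, and that provider keys live in the conventional environment variables listed. Both are assumptions for illustration; the actual hashing scheme and variable names are not documented here.

```python
import hashlib
import json

# Assumed variable names for the three supported providers.
PROVIDER_KEY_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "OPENROUTER_API_KEY")


def fingerprint(agent_config: dict, candidate_inputs: dict, slice_version: str) -> str:
    """Hash the inputs that must match the recorded baseline."""
    payload = json.dumps(
        {
            "agent_config": agent_config,
            "candidate_inputs": candidate_inputs,
            "slice_version": slice_version,
        },
        sort_keys=True,           # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_fingerprint(recorded: str, current: str) -> None:
    """Halt the run when anything in the fingerprint's scope has drifted."""
    if recorded != current:
        raise RuntimeError("baseline fingerprint mismatch: re-run the baseline")


def proposer_env(host_env: dict, keep: str) -> dict:
    """Copy the environment, dropping every provider key except the proposer's own."""
    return {
        name: value
        for name, value in host_env.items()
        if name not in PROVIDER_KEY_VARS or name == keep
    }
```

A setting that is not part of the hashed payload cannot trip the check, which is the shape of the gap noted for the max-turns override.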

Baseline

The baseline run (model: deepseek/deepseek-v4-flash) passed 14 of 20 tasks (70%), with a mean of 553,942 tokens and 47.9 agent steps per task. A reward-hacking audit of the baseline, with explicit checks for test-file modification, reward-file writes, and solution-directory access, came back clean across all five audited trials.

Relation to prior work

Self-improvement at the paper level is a crowded field; Osprey's position in it is a deliberate set of trades. DSPy's GEPA and MIPROv2 optimize prompts for LM pipelines where rollouts are cheap; Osprey pays for end-to-end agent episodes and spends its rigor on the accept statistics. The Darwin Gödel Machine and SICA let the agent rewrite its own code, maximizing search-space breadth at the cost of auditability; Osprey confines mutation to two text files behind a hard allowlist, so any change the gate accepts lands as a readable diff pinned by a git tag. SICA's cost-penalized utility is the closest prior art to the token-penalized reward. What Osprey adds over all of these is the production-harness setting: a measured noise floor and a calibrated paired accept gate for an objective where a single evaluation is a wall-clock-capped half-hour of real terminal work.

Substrate

The execution infrastructure the loop sits on top of:

Persona-driven multi-turn agent loop. Bundled default, architect, and verifier personas; resolution searches project, user, and bundled tiers. A persona declares a system prompt and optionally a terminal tool with a JSON-schema'd argument shape.
Eight core tools (read_file, write_file, edit_file, bash, find_files, grep, list_directory, tree) exposed through a gRPC sandbox sidecar. A streaming partial-JSON parser extracts tool calls mid-stream, before the model's response is complete.
Subagent orchestration. Spawn, send-input, wait, and cancel control tools with an eight-child concurrency cap, allowlist-gated nested spawning, per-spawn model override through role resolution, and children persisted as forkable sessions viewable live in the TUI.
Bring-your-own-model. Direct Anthropic, OpenAI, and OpenRouter drivers with Server-Sent Events (SSE) streaming and per-role driver/model selection.
ATIF (Agent Trajectory Interchange Format) v1.6 trajectory logging. Standardized execution traces — agent schemas, per-turn timing, tool-call records. This is the input format the reinforcement learning (RL) roadmap consumes.
Local-first session persistence. SQLite by default; Postgres opt-in for shared installs. Sessions survive crashes and can fork from any past turn.
Sandbox. macOS isolation via sandbox-exec with a deny-by-default network policy (loopback only) and a per-repo allowlist for outbound hosts. Other platforms run unsandboxed.
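
The eight-child concurrency cap in the subagent item can be pictured with a semaphore. This is a minimal sketch in Python rather than Osprey's Rust, and it assumes that spawns beyond the cap wait for a free slot; whether the real orchestrator queues or refuses them is not stated. `SubagentPool` and `demo` are hypothetical names.

```python
import asyncio

MAX_CHILDREN = 8  # the cap named in the list above


class SubagentPool:
    """Runs child coroutines with at most `limit` active at once."""

    def __init__(self, limit: int = MAX_CHILDREN):
        self._slots = asyncio.Semaphore(limit)
        self.running = 0
        self.peak = 0  # highest number of children seen running together

    async def spawn(self, work):
        async with self._slots:  # wait here while all slots are taken
            self.running += 1
            self.peak = max(self.peak, self.running)
            try:
                return await work()
            finally:
                self.running -= 1


async def demo(n: int = 20):
    """Spawn n placeholder children and report the peak concurrency."""
    pool = SubagentPool()

    async def child():
        await asyncio.sleep(0.01)  # stand-in for a child agent's work
        return "done"

    results = await asyncio.gather(*(pool.spawn(child) for _ in range(n)))
    return pool.peak, len(results)
```

Running `asyncio.run(demo())` completes all twenty placeholder children while never holding more than eight slots at a time.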

Roadmap

Two phases scope the next round of self-improvement work. Framework choice is deferred until each phase starts so the field can be re-evaluated.

Composable evaluation & multi-environment optimization. Replace the single scalar reward with a weighted combination of dimensions — task success, token efficiency, and RLAIF (Reinforcement Learning from AI Feedback) judge scores for soft quality. Replace the single benchmark with an environment trait so any eval harness can serve as one of many objectives. Surface per-turn ATIF trajectory data so the harness can identify which turn in a multi-step run failed, not just whether the task passed.
RL training pipeline & model self-improvement. Convert ATIF trajectories into scored trajectory groups, export them as datasets for GRPO (Group Relative Policy Optimization), DPO (Direct Preference Optimization), or supervised fine-tuning (SFT), and wire the environment into a trainer that updates model weights. A model hot-swap to a local inference server closes the loop: execute → score → train → deploy → repeat. A distillation pipeline (large-model trajectories into a smaller agent model) and reward-hacking detection are explicit scope items. What is committed is the strategic intent, a model fine-tuned for Osprey's own harness; the specific integration shape is deferred.
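
To make "scored trajectory groups" concrete, the sketch below shows the group-relative normalization that GRPO is generally described as using: each trajectory's reward is compared with the mean of its own group and scaled by the group's standard deviation. It is a generic illustration of that technique, not Osprey's planned pipeline, whose integration shape is explicitly deferred. The function name and the example rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of trajectories for the same task.

    Each advantage is (reward - group mean) / (group std + eps); eps keeps
    the division defined when every trajectory in the group scored the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four attempts at one task, two passes and two fails.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group where every attempt passed carries no relative signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Groups in which every attempt passes or every attempt fails yield zero advantages, which is one reason a mixed-difficulty task set is useful for this style of training.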

Status

Osprey is under active development and its source is not yet public. The optimization loop is operational: the baseline and noise floor are measured and the automated outer loop is wired. Improvement numbers will be published as candidates clear the accept gate. This page tracks the methodology and architecture as the work progresses.