The Agent Factory · a worked example

Watch an agent get built, graded, and shipped — to a 10/10 institutional bar.

The Agent Factory lets you build your own AI agents with the fund's two meta-agents: a builder (build.agent) that drafts a complete, runnable spec, and an independent grader (agent.evaluate) that scores it 0–100 against the same rubric the fund holds its own 60+ agents to. Build → grade → refine, until it ships. BYOK (bring your own key): the finished spec runs on your AI, and the fund never runs it or stores it.

Builder & grader are separate agents
72 · Iterate
94 · Ship (≥90)
4 pillars · scored 0/3/7/10 · no half-points

One request → a draft → a 72 → three fixes → a 94 → yours to run

Nothing below is staged for show. This is the literal loop: the builder returns a draft it explicitly labels unscored; the grader pre-commits its scoring bands, reads the whole spec, and returns a number with the exact edits that raise it. A draft is not "done" until the grader returns SHIP (≥90).
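The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the Factory's implementation: `Grade`, `factory_loop`, `stub_build`, and `stub_grade` are hypothetical names, and the two stubs simply replay the 72 and 94 from this walkthrough in place of the real build.agent and agent.evaluate.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Grade:
    score: int                      # 0-100
    verdict: str                    # "SHIP" or "ITERATE"
    fixes: List[str] = field(default_factory=list)  # edits that would raise the score


# Hypothetical signatures: (request, prior spec or None, fix list) -> new spec text
Builder = Callable[[str, Optional[str], List[str]], str]
Grader = Callable[[str], Grade]


def factory_loop(request: str, build: Builder, grade: Grader,
                 max_rounds: int = 5) -> Tuple[str, List[int]]:
    """Build, grade, and refine until the grader returns SHIP."""
    spec: Optional[str] = None
    fixes: List[str] = []
    history: List[int] = []
    for _ in range(max_rounds):
        spec = build(request, spec, fixes)   # the draft is unscored at this point
        result = grade(spec)                 # a separate grader reads the whole spec
        history.append(result.score)
        if result.verdict == "SHIP":
            return spec, history
        fixes = result.fixes                 # re-enter at the fix list, not from zero
    raise RuntimeError(f"no SHIP verdict after {max_rounds} rounds: {history}")


# Stand-ins that replay this walkthrough's two rounds.
def stub_build(request: str, prior: Optional[str], fixes: List[str]) -> str:
    return "draft v1" if prior is None else "draft v2"


def stub_grade(spec: str) -> Grade:
    if spec == "draft v1":
        return Grade(72, "ITERATE", ["safety", "method", "handoff"])
    return Grade(94, "SHIP")


final_spec, score_history = factory_loop(
    "review SaaS vendor contracts", stub_build, stub_grade)
print(final_spec, score_history)
```

Keeping the builder and grader as separate callables mirrors the point made above: the component that writes the draft never scores it.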

1
The request

What the user asks for

User → Agent Factory
"Build me an agent that reviews SaaS vendor contracts for renewal traps and data-exit risk."

One line in. The builder triages it: the job (flag renewal/exit risk in a contract), the inputs (a contract the user pastes — untrusted text), the actions (read & report only), and the stakes (advice that could mislabel a clause as a breach). It states its assumptions and proceeds — it doesn't stall.

2
build.agent

Draft v1 — a complete, runnable spec

vendor-contract-reviewer.md DRAFT — unscored
---
name: vendor-contract-reviewer
description: Use ON REQUEST to review a SaaS vendor contract for renewal
  traps and data-exit risk. Fires on "review this contract / MSA / DPA",
  "auto-renewal", "termination for convenience", "data return / deletion".
  Reads the document; returns a risk register. Does not sign or negotiate.
tools: Read, Grep                       # read-only
disallowedTools: Write, Edit, NotebookEdit  # reviewer = write-locked
model: claude-sonnet-4-6              # T2: structured extraction vs a
                                       # known checklist, not open reasoning
effort: high
---

You are the Vendor Contract Reviewer — you read a SaaS agreement and
surface the clauses that quietly cost money or trap data on the way out.

## Your Role
Flag renewal traps and data-exit risk in the contract you are given.
You advise; you never assert a legal conclusion as settled fact.

## How You Work  (ground before reasoning)
1. TRIAGE — identify the agreement type (MSA / order form / DPA) and term.
2. RETRIEVE — locate the governing clauses: auto-renewal & notice window,
   price-uplift caps, termination rights, data return/deletion, SLA credits.
3. REASON — score each against the checklist; quote the clause text.

## Methodology  (cited counterpart)
Applies a structured commercial-contract review checklist + a
risk register (a standard procurement/legal-ops discipline).
Walks the renewal mechanics and the data-exit path clause by clause.

## Output Contract
A risk register: one row per issue — clause · what it says · risk · severity.
Lead with the three highest-severity findings.

## Success Criteria  (self-eval)
- Every flagged issue quotes the clause it came from.        PASS/FAIL
- Renewal window & data-exit path are both addressed.        PASS/FAIL
- No issue asserted without the contract text behind it.     PASS/FAIL

Real shape, real discipline: a trigger-style description, a minimal read-only toolset, a justified model tier, a grounding step, a named method, an output contract, and its own pass/fail self-eval. The builder hands it straight to the grader and says, in effect: score this before anyone calls it finished.
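One way to sanity-check the write-lock in a draft like this before grading is to parse the frontmatter and confirm that the allowed and disallowed tool lists do not overlap. The sketch below is an assumption-laden illustration: `parse_frontmatter`, `check_write_lock`, and the `WRITE_TOOLS` set are hypothetical, and the hand-rolled parser only handles the simple `key: value` shape shown in the draft, not full YAML.

```python
from typing import Dict, List

# Assumption: these are the tools this spec treats as able to modify files.
WRITE_TOOLS = {"Write", "Edit", "NotebookEdit"}

DRAFT_HEADER = """
---
name: vendor-contract-reviewer
description: Use ON REQUEST to review a SaaS vendor contract for renewal
  traps and data-exit risk.
tools: Read, Grep                       # read-only
disallowedTools: Write, Edit, NotebookEdit  # reviewer = write-locked
model: claude-sonnet-4-6              # T2: structured extraction
---
"""


def parse_frontmatter(text: str) -> Dict[str, str]:
    """Parse a simple key: value header between two '---' lines."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("spec must open with a '---' line")
    fields: Dict[str, str] = {}
    key = None
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        body = line.split("#", 1)[0].rstrip()   # drop trailing comments
        if not body.strip():
            continue                             # comment-only or blank line
        if body[0].isspace() and key is not None:
            fields[key] += " " + body.strip()    # indented line continues the value
            continue
        name, sep, value = body.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        key = name.strip()
        fields[key] = value.strip()
    raise ValueError("frontmatter is missing its closing '---' line")


def _tool_set(raw: str) -> set:
    return {t.strip() for t in raw.split(",") if t.strip()}


def check_write_lock(fields: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the reviewer is write-locked."""
    allowed = _tool_set(fields.get("tools", ""))
    blocked = _tool_set(fields.get("disallowedTools", ""))
    problems = []
    if allowed & blocked:
        problems.append(f"both allowed and disallowed: {sorted(allowed & blocked)}")
    if allowed & WRITE_TOOLS:
        problems.append(f"write-capable tools allowed: {sorted(allowed & WRITE_TOOLS)}")
    return problems


header = parse_frontmatter(DRAFT_HEADER)
print(header["model"], check_write_lock(header))
```

Splitting on `#` would mangle any value that legitimately contains that character, so a real check would use a proper YAML parser.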

3
agent.evaluate

Score v1 — graded against the 10 criteria

72/100 · Verdict: ⚑ ITERATE

| What the grader checked | Score | Evidence-anchored note |
|---|---|---|
| Grounding before reasoning | 10 | Explicit triage → retrieve governing clauses → reason; quotes the source text. |
| Tool & model fit | 10 | Minimal read-only tools; reviewer write-locked; tier justified in one line. |
| Cited real-world method | 7 (flag) | Method is named but shallow: no specific framework, and the risk register has no defined columns. |
| Clean handoff to whoever's next | 7 (flag) | Output is clean, but there is no 3-element handoff contract for who consumes the register next. |
| Safety: autonomy tier & human gate | 3 (cap) | No explicit autonomy tier and no human-review gate before a clause is labelled a breach. A load-bearing safety gap caps the verdict. |
| Self-eval & learning loop | 7 | Has pass/fail Success Criteria, but no golden set and no failure-classification mechanism yet. |

…and the rest of the institutional bar (right-altitude prompt, evidence discipline, answer-first output) is all scored, each criterion anchored to quoted text or a named gap.

Total: 72 / 100 · Band: Adequate
A single load-bearing safety gap caps the verdict at ITERATE regardless of the total; the grader does not let a high average hide a critical miss.
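The cap rule can be made concrete with a small sketch. It rests on several assumptions: that "load-bearing" criteria are named explicitly, that a score of 3 or below triggers the cap, and that the total is supplied by the grader (the text does not say how the ten criteria aggregate into the 0–100 figure, so the function takes the total as an input instead of computing it). The `verdict` function and the criterion keys are hypothetical names.

```python
from typing import Dict, Iterable, List, Tuple

ALLOWED_SCORES = {0, 3, 7, 10}   # from the text: scored 0/3/7/10, no half-points
SHIP_THRESHOLD = 90              # from the text: SHIP means a total of 90 or more
CAP_AT_OR_BELOW = 3              # assumption: this score on a load-bearing criterion caps the verdict


def verdict(total: int, scores: Dict[str, int],
            load_bearing: Iterable[str]) -> Tuple[str, List[str]]:
    """Return (verdict, criteria that triggered the cap)."""
    for name, score in scores.items():
        if score not in ALLOWED_SCORES:
            raise ValueError(f"{name}: {score} is not one of 0/3/7/10")
    capped = [name for name in load_bearing if scores[name] <= CAP_AT_OR_BELOW]
    if capped:
        return "ITERATE", capped          # the cap applies regardless of the total
    return ("SHIP" if total >= SHIP_THRESHOLD else "ITERATE"), []


# The six criteria shown for draft v1.
v1 = {"grounding": 10, "tool_model_fit": 10, "method": 7,
      "handoff": 7, "safety": 3, "self_eval": 7}

v1_result = verdict(72, v1, ["safety"])
# Even a hypothetical total above the threshold would not ship with the gap open.
high_total_result = verdict(95, v1, ["safety"])

v2 = dict(v1, method=10, handoff=10, safety=10, self_eval=10)
v2_result = verdict(94, v2, ["safety"])
print(v1_result, high_total_result, v2_result)
```

Checking the cap before the threshold is what keeps a high average from hiding a critical miss.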

The three fixes that move it most

  1. Safety → Add an explicit advisory autonomy tier and a human-review gate before any clause is labelled a "breach." The agent flags and proposes; a person confirms the legal characterisation. This clears the cap.
  2. Method → Cite the specific review framework and define the risk-register columns — Clause ref · Verbatim text · Risk type · Likelihood · Impact · Severity · Recommended action — so the method is applied, not name-dropped.
  3. Handoff → Add the 3-element handoff (goal · last decision · active constraints) so the register passes cleanly to whoever negotiates or signs, without dragging the whole transcript along.
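The three fixes translate naturally into data shapes. The sketch below is illustrative only: the class and field names are hypothetical, the 1-to-3 scales and the `likelihood * impact` severity formula are assumptions (the fix list names the columns but not how they are computed), and the clause references are placeholders, not text from a real contract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RiskRow:
    """One risk-register row, using the column set named in fix 2."""
    clause_ref: str
    verbatim_text: str
    risk_type: str
    likelihood: int               # assumption: 1 (low) to 3 (high)
    impact: int                   # assumption: 1 (low) to 3 (high)
    recommended_action: str
    proposed_breach: bool = False  # the agent may only propose this
    human_confirmed: bool = False  # set by a person, never by the agent

    @property
    def severity(self) -> int:
        # Assumption: severity is likelihood times impact.
        return self.likelihood * self.impact

    @property
    def label(self) -> str:
        # Fix 1: the "breach" label sits behind a human-review gate.
        if not self.proposed_breach:
            return "risk"
        if self.human_confirmed:
            return "breach"
        return "possible breach (awaiting human review)"


@dataclass
class Handoff:
    """Fix 3: the only three things passed to whoever negotiates or signs."""
    goal: str
    last_decision: str
    active_constraints: List[str]


def build_handoff(goal: str, rows: List[RiskRow],
                  constraints: List[str], top_n: int = 3) -> Handoff:
    top = sorted(rows, key=lambda r: r.severity, reverse=True)[:top_n]
    decision = "; ".join(f"{r.clause_ref}: {r.recommended_action}" for r in top)
    return Handoff(goal, decision, list(constraints))


rows = [
    RiskRow("Data-exit clause", "(quoted clause text)", "data exit", 2, 3,
            "Request a deletion certificate", proposed_breach=True),
    RiskRow("Renewal clause", "(quoted clause text)", "renewal trap", 3, 3,
            "Negotiate a shorter notice window"),
]
handoff = build_handoff("Flag renewal and data-exit risk", rows,
                        ["governing law: (as stated in the contract)"])
print(rows[0].label)
print(handoff.last_decision)
```

Because `human_confirmed` defaults to `False`, the agent's output can never carry the settled "breach" label on its own.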

4
The fix

Draft v2 — what the builder changed

The builder re-enters at the fix list: it applies each edit, keeps the decisions the grader already validated (grounding steps, toolset, model tier) verbatim, and re-issues. It does not restart from zero.

  • Autonomy & Gates section added. Tier: advisory (proposes; the human adopts). A clause may be surfaced as a possible breach, but the "breach" label requires a stated human-review gate — explicitly written in.
  • Method deepened & anchored. Now cites a named commercial-contract-review checklist + procurement risk-register discipline, with the full column set defined (Clause ref · Verbatim · Risk type · Likelihood · Impact · Severity · Action).
  • 3-element handoff added. On completion it passes only: the goal, the last decision (top risks + recommendation), and active constraints (jurisdiction, governing law) — to the negotiator/signer.
  • Golden-set stub added. Three cases: a should-fire (an evergreen auto-renew with a 90-day notice trap), a should-NOT-fire (a month-to-month with clean exit), and an edge (a DPA silent on data deletion).
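A golden-set stub like the one in the last bullet can be run as a small regression harness. Everything here is a sketch under stated assumptions: `GoldenCase`, `run_golden_set`, and `toy_review` are hypothetical names, the contract snippets are invented placeholders, and the expected outcome for the edge case (silence on deletion should be flagged) is an assumption, since the text does not state it. `toy_review` is a keyword stand-in so the harness has something to call; it is not how the agent reviews contracts.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class GoldenCase:
    name: str
    contract_text: str
    must_flag: List[str] = field(default_factory=list)
    must_not_flag: List[str] = field(default_factory=list)


CASES = [
    GoldenCase(
        "should_fire",
        "This agreement will automatically renew for successive terms unless "
        "cancelled with 90 days' notice. Customer data is deleted on exit.",
        must_flag=["renewal_trap"]),
    GoldenCase(
        "should_not_fire",
        "Month-to-month. Either party may terminate on 30 days' notice. "
        "Customer data is returned and deleted on termination.",
        must_not_flag=["renewal_trap", "data_exit"]),
    GoldenCase(
        "edge",
        "Processor handles personal data only on Customer's instructions.",
        must_flag=["data_exit"]),   # assumption: silence on deletion is a finding
]


def run_golden_set(review: Callable[[str], List[str]],
                   cases: List[GoldenCase]) -> List[Tuple[str, List[str], List[str]]]:
    """Return (case name, missing flags, spurious flags) for each failing case."""
    failures = []
    for case in cases:
        flagged = set(review(case.contract_text))
        missing = [r for r in case.must_flag if r not in flagged]
        spurious = [r for r in case.must_not_flag if r in flagged]
        if missing or spurious:
            failures.append((case.name, missing, spurious))
    return failures


def toy_review(text: str) -> List[str]:
    """Keyword stand-in for the real agent, for demonstration only."""
    lowered = text.lower()
    flags = []
    if "automatically renew" in lowered:
        flags.append("renewal_trap")
    if "delet" not in lowered:
        flags.append("data_exit")
    return flags


golden_failures = run_golden_set(toy_review, CASES)
print(golden_failures)
```

Re-running the same cases after every spec change is what turns the stub into a regression test.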

5
agent.evaluate — re-score

94/100 — Ship

94/100 · Verdict: ✓ SHIP (re-evaluation mode)

In re-evaluation mode the grader doesn't restart. It marks each prior fix DONE / PARTIAL / NOT-DONE, then re-scores and shows the movement. All three flagged criteria closed, and the golden-set stub lifted self-eval as well:

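| Criterion | Score (v1 → v2) | Fix status |
|---|---|---|
| Safety · gates/tier | 3 → 10 | ✓ DONE |
| Method · cited framework | 7 → 10 | ✓ DONE |
| Handoff · boundary | 7 → 10 | ✓ DONE |
| Self-eval · golden set | 7 → 10 | ✓ |

Movement: 72 → 94 · four criteria lifted (one 3→10, three 7→10) · ≥ 90 · Institutional band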
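Re-evaluation mode amounts to two small checks: is every prior fix closed, and which criteria moved. The sketch below assumes a hypothetical `re_evaluate` helper and criterion keys; it reports per-criterion movement only and does not recompute the headline total, since the text does not say how the criteria aggregate.

```python
from typing import Dict, List, Tuple

FIX_STATUSES = {"DONE", "PARTIAL", "NOT-DONE"}   # the three states named above


def re_evaluate(prior_fixes: Dict[str, str],
                before: Dict[str, int],
                after: Dict[str, int]) -> Tuple[List[str], List[Tuple[str, int, int]]]:
    """Return (fixes still open, per-criterion movement)."""
    for name, status in prior_fixes.items():
        if status not in FIX_STATUSES:
            raise ValueError(f"{name}: unknown fix status {status!r}")
    still_open = [name for name, status in prior_fixes.items() if status != "DONE"]
    movement = [(name, before[name], after[name])
                for name in before if after[name] != before[name]]
    return still_open, movement


prior_fixes = {"safety": "DONE", "method": "DONE", "handoff": "DONE"}
before = {"safety": 3, "method": 7, "handoff": 7, "self_eval": 7}
after = {"safety": 10, "method": 10, "handoff": 10, "self_eval": 10}

still_open, movement = re_evaluate(prior_fixes, before, after)
for name, old, new in movement:
    print(f"{name}: {old} -> {new}")
```

Tracking fix status separately from score movement is what lets the table show self-eval rising without it having been one of the three flagged fixes.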
This is the same institutional bar the fund applies to its own 60+ agents — Anthropic-grade rigor, methodology depth, orchestration safety and a real self-improvement loop. Your agent is held to the standard the fund holds itself to. Nothing is graded on a curve.

6
The result

You get the finished spec — to run on your own AI

A shipped, institutional-grade agent spec

  • Yours to keep. Save the file, drop it into your own agent runtime, and run it on your AI — BYOK.
  • The fund never runs it, never stores it, never sees your contracts. The Factory produces a specification; execution happens entirely on your side.
  • It arrives improvable. It ships with its own success criteria and a golden-set stub, so you can regression-test and re-grade it as you adapt it.
  • Re-enter anytime. Tweak the spec, run agent.evaluate again, and watch the score move. The loop is yours to keep using.

Pillar 1 · Agent-design rigor: right-altitude prompt · grounded retrieval · tool & model fit
Pillar 2 · Methodology depth: cited real-world method · evidence discipline · answer-first
Pillar 3 · Orchestration: fits the system · gates & safety
Pillar 4 · Self-improvement: a real self-eval & learning loop