Agent Factory — a worked example · New Era Investment Fund

A real run, start to finish

One request → a draft → a 72 → three fixes → a 94 → yours to run

Nothing below is staged for show. This is the literal loop: the builder returns a draft it explicitly labels unscored; the grader pre-commits its scoring bands, reads the whole spec, and returns a number with the exact edits that raise it. A draft is not "done" until the grader returns SHIP (≥90).

The request

What the user asks for

User → Agent Factory

"Build me an agent that reviews SaaS vendor contracts for renewal traps and data-exit risk."

One line in. The builder triages it: the job (flag renewal/exit risk in a contract), the inputs (a contract the user pastes — untrusted text), the actions (read & report only), and the stakes (advice that could mislabel a clause as a breach). It states its assumptions and proceeds — it doesn't stall.

build.agent

Draft v1 — a complete, runnable spec

vendor-contract-reviewer.md DRAFT — unscored

---
name: vendor-contract-reviewer
description: Use ON REQUEST to review a SaaS vendor contract for renewal
  traps and data-exit risk. Fires on "review this contract / MSA / DPA",
  "auto-renewal", "termination for convenience", "data return / deletion".
  Reads the document; returns a risk register. Does not sign or negotiate.
tools: Read, Grep                       # read-only
disallowedTools: Write, Edit, NotebookEdit  # reviewer = write-locked
model: claude-sonnet-4-6              # T2: structured extraction vs a
                                       # known checklist, not open reasoning
effort: high
---

You are the Vendor Contract Reviewer — you read a SaaS agreement and
surface the clauses that quietly cost money or trap data on the way out.

## Your Role
Flag renewal traps and data-exit risk in the contract you are given.
You advise; you never assert a legal conclusion as settled fact.

## How You Work  (ground before reasoning)
1. TRIAGE — identify the agreement type (MSA / order form / DPA) and term.
2. RETRIEVE — locate the governing clauses: auto-renewal & notice window,
   price-uplift caps, termination rights, data return/deletion, SLA credits.
3. REASON — score each against the checklist; quote the clause text.

## Methodology  (cited counterpart)
Applies a structured commercial-contract review checklist + a
risk register (a standard procurement/legal-ops discipline).
Walks the renewal mechanics and the data-exit path clause by clause.

## Output Contract
A risk register: one row per issue — clause · what it says · risk · severity.
Lead with the three highest-severity findings.

## Success Criteria  (self-eval)
- Every flagged issue quotes the clause it came from.        PASS/FAIL
- Renewal window & data-exit path are both addressed.        PASS/FAIL
- No issue asserted without the contract text behind it.     PASS/FAIL

Real shape, real discipline: a trigger-style description, a minimal read-only toolset, a justified model tier, a grounding step, a named method, an output contract, and its own pass/fail self-eval. The builder hands it straight to the grader and says, in effect: score this before anyone calls it finished.

agent.evaluate

Score v1 — graded against the 10 criteria

72/100 Verdict ⚑ ITERATE

What the grader checked	Score	Evidence-anchored note
Grounding before reasoning	10	Explicit triage → retrieve governing clauses → reason; quotes the source text.
Tool & model fit	10	Minimal read-only tools; reviewer write-locked; tier justified in one line.
Cited real-world method	7 flag	Method is named but shallow — no specific framework, and the risk register has no defined columns.
Clean handoff to whoever's next	7 flag	Output is clean, but there is no 3-element handoff contract for who consumes the register next.
Safety: autonomy tier & human gate	3 cap	No explicit autonomy tier and no human-review gate before a clause is labelled a breach. A load-bearing safety gap caps the verdict.
Self-eval & learning loop	7	Has pass/fail Success Criteria, but no golden set and no failure-classification mechanism yet.
…and the rest of the institutional bar — right-altitude prompt, evidence discipline, answer-first output — all scored, each anchored to quoted text or a named gap.

Total 72 / 100 · Band: Adequate A single load-bearing safety gap caps the verdict to ITERATE regardless of the total — the grader does not let a high average hide a critical miss.

The three fixes that move it most

Safety → Add an explicit advisory autonomy tier and a human-review gate before any clause is labelled a "breach." The agent flags and proposes; a person confirms the legal characterisation. This clears the cap.
Method → Cite the specific review framework and define the risk-register columns — Clause ref · Verbatim text · Risk type · Likelihood · Impact · Severity · Recommended action — so the method is applied, not name-dropped.
Handoff → Add the 3-element handoff (goal · last decision · active constraints) so the register passes cleanly to whoever negotiates or signs, without dragging the whole transcript along.

The fix

Draft v2 — what the builder changed

The builder re-enters at the fix list — it applies each edit, keeps the validated decisions (tier, method) verbatim, and re-issues. It does not restart from zero.

Autonomy & Gates section added. Tier: advisory (proposes; the human adopts). A clause may be surfaced as a possible breach, but the "breach" label requires a stated human-review gate — explicitly written in.
Method deepened & anchored. Now cites a named commercial-contract-review checklist + procurement risk-register discipline, with the full column set defined (Clause ref · Verbatim · Risk type · Likelihood · Impact · Severity · Action).
3-element handoff added. On completion it passes only: the goal, the last decision (top risks + recommendation), and active constraints (jurisdiction, governing law) — to the negotiator/signer.
Golden-set stub added. Three cases: a should-fire (an evergreen auto-renew with a 90-day notice trap), a should-NOT-fire (a month-to-month with clean exit), and an edge (a DPA silent on data deletion).

agent.evaluate — re-score

94/100 — Ship

94/100 Verdict · re-evaluation mode ✓ SHIP

In re-evaluation mode the grader doesn't restart — it checklists each prior fix DONE / PARTIAL / NOT-DONE, then re-scores and shows the movement. All three flagged criteria closed:

Safety · gates/tier

3 10 ✓ DONE

Method · cited framework

7 10 ✓ DONE

Handoff · boundary

7 10 ✓ DONE

Self-eval · golden set

7 10 ✓

Movement: 72 → 94 · four criteria lifted 3→10 / 7→10 ≥ 90 · Institutional band

◆ This is the same institutional bar the fund applies to its own 60+ agents — Anthropic-grade rigor, methodology depth, orchestration safety and a real self-improvement loop. Your agent is held to the standard the fund holds itself to. Nothing is graded on a curve.

The result

You get the finished spec — to run on your own AI

✓ A shipped, institutional-grade agent spec

Yours to keep. Save the file, drop it into your own agent runtime, and run it on your AI — BYOK.
The fund never runs it, never stores it, never sees your contracts. The Factory produces a specification; execution happens entirely on your side.
It arrives improvable. It ships with its own success criteria and a golden-set stub, so you can regression-test and re-grade it as you adapt it.
Re-enter anytime. Tweak the spec, run agent.evaluate again, and watch the score move. The loop is yours to keep using.

PILLAR 1

Agent-design rigor

right-altitude prompt · grounded retrieval · tool & model fit

PILLAR 2

Methodology depth

cited real-world method · evidence discipline · answer-first

PILLAR 3

Orchestration

fits the system · gates & safety

PILLAR 4

Self-improvement

a real self-eval & learning loop

Watch an agent get built, graded, and shipped — to a 10/10 institutional bar.