Loop Herding — How We Build Autonomous Systems

REFERENCE | DERIVED | Updated 2026-04-08 | Owner: Leadership

Sources: "The Dapper Mini AGI — Loop Herding, World Models, and the Future of Work" (Apr 2026); "The Dapper Intelligence Model" (Apr 7, 2026); "Loop Herding — The Discipline of Building and Graduating Autonomous Systems" (Apr 2026); "The Loop Playbook — How Every Team Member Builds Their Own Autonomous Loop" (Apr 2026); "Building the Intelligence: How the Collectibles Team Becomes a Mini-AGI" (Apr 5, 2026)


What Loop Herding Is

Loop herding is the discipline that defines how every person at Dapper builds, graduates, and manages autonomous systems.

"Your job has changed. Not in the way you think. Your job is not to 'use AI tools.' Using AI tools is like using a calculator — it makes you faster at what you were already doing, but it doesn't change what you do. Your job is to build loops. Autonomous systems that do parts of your job continuously, reliably, and improvingly — whether you're watching or not. And then your job is to make those loops better. And then your job is to build the next loop. This is loop herding."

A loop herder does three things:

  1. Builds loops. Identifies a repeating task, specifies what "good" looks like, builds an autonomous system to do it, and sets up verification.
  2. Graduates loops. Monitors performance, adjusts autonomy based on demonstrated competence, and systematically reduces human oversight as the loop earns trust.
  3. Herds a portfolio. Manages 5-15 loops simultaneously, each at a different maturity level, allocating attention where it's most needed and building new loops to cover uncovered territory.

"The master KPI is not 'how much work did you do.' It's 'how many of your loops are at Level 5, and what's next.'"


The Specifier-Verifier Shift

The fundamental mental model change underpinning loop herding:

"The future of knowledge work is not 'AI helps you do your job.' It's 'you build an always-on system that does your job, and your job becomes making that system better.'"

This builds on Andrej Karpathy's framework:

"Software 1.0 automated what you could specify (rote algorithms). Software 2.0 automates what you can verify (tasks with clear scoring). The shift for every knowledge worker is the same — stop being the person who does the work. Become the person who specifies what the work should look like, and verifies that the autonomous system is doing it right."

Three properties make a task loop-able:

  • Resettable — you can start over (git revert, feature flag off, experiment killed)
  • Efficient — you can repeat quickly (automated tests, cron jobs, not 6-month cycles)
  • Rewardable — you can score the output (growth-check /150, benchmark /25, track GREEN/YELLOW/RED)

"If your task has all three, build a loop. If it's missing one, add it. If it can't have all three — business logic that requires human judgment, novel strategic bets, relationship-dependent negotiations — that's where your irreplaceable human contribution lives."

Addy Osmani describes the role evolution:

  • Implementer — you do the work
  • Conductor — you direct one AI at a time
  • Orchestrator — you manage multiple loops in parallel

"One orchestrator manages more total work than would be possible as an implementer. Not by working harder — by working at a higher level of abstraction."


The Five Levels of Autonomy

Based on the Knight First Amendment Institute's autonomy framework, Anthropic's empirical trust data, and the academic human-in-the-loop literature. Each level maps to established management science.

Level 1: Operator

You do the work. The agent assists.

  • Agent's role: Autocomplete, suggest, draft fragments
  • Review: Every output, before accepting
  • Cadence: Real-time, continuous
  • Time here: 1-2 weeks per task type
  • Graduation signal: You can describe what "good" looks like. You know the failure modes.

Grove parallel: Low TRM — "very precise and detailed instructions, wherein the supervisor tells the subordinate what needs to be done, when, and how: a highly structured approach."

Level 2: Collaborator

You and the agent share the work.

  • Your role: Set direction, handle judgment calls and ambiguity
  • Agent's role: Generate first drafts, execute known patterns
  • Review: Every artifact before it ships — but whole artifacts, not individual actions
  • Cadence: Per-output
  • Time here: 2-4 weeks per task type
  • Graduation signal: Agent outputs need only minor edits >80% of the time

Level 3: Consultant

The agent leads. You advise.

  • Your role: Provide expertise when asked. Handle exceptions.
  • Agent's role: Plan, execute, self-check. Escalate when uncertain.
  • Review: Daily digest plus exception-based alerts
  • Cadence: Daily review + real-time escalation
  • Time here: 1-3 months per task type
  • Graduation signal: You intervene on <20% of outputs. Interventions are "adjust direction" not "fix errors."

Level 4: Approver

The agent operates independently. You approve key decisions.

  • Your role: Review weekly summaries. Approve high-stakes actions. Set strategy.
  • Agent's role: Run the full cycle autonomously. Escalate only above its authority.
  • Review: Weekly summary + exception escalation
  • Cadence: Weekly review, quarterly strategy
  • Time here: 3-6 months per task type
  • Graduation signal: Agent decisions match yours >90% of the time

Level 5: Observer

The agent is autonomous. You monitor for drift.

  • Your role: Monthly audit. Strategic corrections. Build the next loop.
  • Agent's role: Fully autonomous. Self-improving.
  • Review: Monthly audit + continuous automated monitoring
  • Cadence: Monthly deep review

"This is the destination but not the only valuable state. A well-managed Level 3 loop is more valuable than an unmeasured Level 5."


The Two Review Channels

Every loop, at every level, has exactly two review channels running simultaneously.

Channel 1: The Morning Digest

"A daily summary of what each loop did. Traffic-light format: GREEN (operating normally), YELLOW (needs attention), RED (requires intervention)."

How it changes by level:

  • Level 1-2: The digest is a learning tool — here's what the agent did, was it right?
  • Level 3: The digest is your primary work surface — read it, intervene on YELLOWs, investigate REDs.
  • Level 4-5: The digest is a confidence check — skim for RED flags, move on if GREEN.
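
A digest entry is just a mapping from a loop's daily metrics to a traffic light. A sketch with illustrative thresholds (the 20% cutoff is an assumption, not policy):

```python
def digest_status(intervention_rate: float, uncaught_errors: int) -> str:
    """Map a loop's daily metrics to a traffic light."""
    if uncaught_errors > 0:
        return "RED"      # requires intervention
    if intervention_rate > 0.2:
        return "YELLOW"   # needs attention
    return "GREEN"        # operating normally

def morning_digest(loops: dict[str, dict]) -> list[str]:
    """One line per loop: name and status."""
    return [
        f"{name}: {digest_status(m['intervention_rate'], m['uncaught_errors'])}"
        for name, m in loops.items()
    ]

print(morning_digest({
    "spec generation": {"intervention_rate": 0.05, "uncaught_errors": 0},
    "campaign design": {"intervention_rate": 0.3, "uncaught_errors": 0},
}))
# ['spec generation: GREEN', 'campaign design: YELLOW']
```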

Channel 2: The Escalation

"Real-time alerts when the loop encounters something it can't or shouldn't handle alone. Format: what happened, what the loop recommends, what it needs from you, and a deadline. Not a status update — a decision request."

What escalates varies by level:

  • Level 1-2: Everything. The human is in the loop.
  • Level 3: RED severity, low confidence, or decisions above the loop's authority.
  • Level 4: Irreversible actions, budget implications, external communications, novel situations.
  • Level 5: System failures, security concerns, strategic inflection points.
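
The escalation format and the per-level triggers above can be sketched as follows. The trigger categories come from the level descriptions; the 0.7 confidence threshold and all names are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Escalation:
    """The four-field decision request: not a status update."""
    what_happened: str
    recommendation: str
    needed_from_you: str
    deadline: str  # e.g. "EOD Friday"

def should_escalate(level: int, event: dict) -> bool:
    """Per-level escalation triggers (sketch)."""
    if level <= 2:
        return True  # the human is in the loop: everything escalates
    if level == 3:
        return (event.get("severity") == "RED"
                or event.get("confidence", 1.0) < 0.7
                or event.get("above_authority", False))
    if level == 4:
        return any(event.get(k, False) for k in
                   ("irreversible", "budget_impact", "external_comms", "novel"))
    return any(event.get(k, False) for k in
               ("system_failure", "security_concern", "strategic_inflection"))
```

Note how the triggers narrow as the level rises: a budget question escalates at Level 4 but is inside the loop's authority at Level 5.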

A critical finding from Anthropic:

"Experienced Claude Code users don't review less — they review differently. New users approve every action (5% interrupt rate). Experienced users grant autonomy and interrupt more often (9% interrupt rate) — but only when it matters. Same total oversight effort, radically different distribution. That's what good loop herding looks like."


How You Graduate a Loop

Graduation is earned through evidence, not time. The evidence comes from the loop itself.

Four Metrics

  1. Intervention rate — What percentage of outputs did you change? Trending down = graduating.
  2. Error catch rate — Does the loop's self-checks catch errors before you do? Trending up = graduating.
  3. Outcome quality — Are autonomous outputs producing good results? Measured by the loop's own tracking.
  4. Novel situation handling — When something new happens, does the loop escalate appropriately or fail silently?
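
The first two metrics can be combined into an evidence-based readiness check. A sketch: the 20% intervention ceiling echoes the Level 3 graduation signal, while the 90% catch-rate floor is an illustrative assumption; it assumes at least three review periods of data:

```python
def graduation_ready(intervention_rates: list[float],
                     catch_rates: list[float],
                     max_intervention: float = 0.2,
                     min_catch: float = 0.9) -> bool:
    """Readiness check over recent review periods (sketch)."""
    recent_iv = sum(intervention_rates[-3:]) / 3      # recent intervention rate
    recent_catch = sum(catch_rates[-3:]) / 3          # recent self-catch rate
    trending_down = intervention_rates[-1] < intervention_rates[0]
    trending_up = catch_rates[-1] > catch_rates[0]
    return (recent_iv <= max_intervention and trending_down
            and recent_catch >= min_catch and trending_up)
```

A loop with falling interventions and rising self-catches passes; a flat loop does not, no matter how long it has been running, which is the point: graduation is earned through evidence, not time.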

The Graduation Conversation

Every 2 weeks for Levels 1-3, monthly for Levels 4-5. Three questions:

"Question 1: Am I still catching errors the loop misses? Yes — stay, improve the loop's checks."

"Question 2: Am I adding value in my reviews, or rubber-stamping? Rubber-stamping — graduate."

"Question 3: Did anything go wrong that the loop didn't catch? Yes — improve the loop, reset the clock."

Two cautionary principles hold in tension:

"Loops with explicit fading protocols produce twice the value of loops where human review stays constant. Never graduating is not cautious. It's wasteful."

"Klarna graduated to full autonomy for customer service without adequate quality monitoring. Customer satisfaction collapsed. They rehired humans. Graduation without measurement is negligence."

The balance:

"Always be graduating, but never graduate without evidence."


Managing the Portfolio

"A loop herder doesn't manage one loop. They manage a portfolio."

A portfolio view at any given time:

  • Loop 1 (health monitoring) — Level 5, monthly audit
  • Loop 2 (spec generation) — Level 4, weekly approval
  • Loop 3 (campaign design) — Level 3, daily digest
  • Loop 4 (strategic analysis) — Level 2, reviewing every output
  • Loop 5 (new domain) — Level 1, hands-on

"Your attention follows the levels. Level 1 loops get the most hands-on time. Level 5 loops get a monthly glance. The portfolio-level question is: 'Am I spending my time on the right loops?'"

The meta-skill:

"A Level 3 loop that could be Level 4 with two weeks of work is often more valuable than starting a new Level 1 loop — because graduation compounds. Every loop that reaches Level 4-5 frees up time to build the next one."

The leading indicators:

  • How many loops exist? (Are you building?)
  • What's the average level? (Are you graduating?)
  • What's the intervention rate trend? (Are loops getting better?)
  • How much of your time goes to Level 1 versus Level 4-5? (Are you climbing the ladder?)
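
The first two indicators fall out of a one-line summary over the portfolio view above. A sketch; the data structure is illustrative:

```python
def portfolio_indicators(levels: dict[str, int]) -> dict:
    """Summarize a herder's portfolio: count, average level,
    and how many loops have reached Level 4-5."""
    n = len(levels)
    avg = sum(levels.values()) / n
    mature = sum(1 for v in levels.values() if v >= 4)
    return {"loops": n, "avg_level": avg, "at_level_4_plus": mature}

portfolio = {
    "health monitoring": 5,
    "spec generation": 4,
    "campaign design": 3,
    "strategic analysis": 2,
    "new domain": 1,
}
print(portfolio_indicators(portfolio))
# {'loops': 5, 'avg_level': 3.0, 'at_level_4_plus': 2}
```

Watching `avg_level` rise while `loops` also rises is the signature of a healthy portfolio: you are graduating and building at the same time.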

The Intelligence Architecture

All loops compose into a company-wide intelligence. From the executive summary:

"The bigger picture is that all of our loops compose together into something much larger — a company-wide intelligence that senses what's happening, understands what it means, decides what to do, builds the response, and runs it in production. Dorsey and Botha call this 'the company as an intelligence.' We call it the same thing. Where we differ is in how we build it: not top-down, not waiting for a finished system, but through composable loops that each person owns and graduates toward autonomy."

The Four Layers

Drawing from Dorsey and Botha's "From Hierarchy to Intelligence" (March 31, 2026), adapted for Dapper:

Layer 1: Capabilities. Atomic things the business can do. Pack drops. Marketplace operations. Challenge systems. Content production. Analytics. Mint an asset, create a listing, process a purchase, send a notification, create a challenge, distribute a reward.

"Capabilities are building blocks, not job descriptions. One person may own multiple capabilities. One capability may require coordination across people. The unit of organizational design is the capability, not the role."

Layer 2: World Model. Two sides:

  • Customer world model — what users do, what the market does, what competitors do. Built from behavioral data and external signals.
  • Company world model — everything the organization writes down, discusses, and decides. Slack threads, Google Docs, specs, meeting notes, project state.

"Hierarchy exists to route information — managers carry context up, translate decisions down. The world model replaces that routing. When every person and every loop has access to the same picture of our customers and our operations, information moves without someone having to carry it."

Layer 3: Intelligence Layer. Composes capabilities into solutions based on world model signals.

"Playoffs are starting. A collector hasn't been active in two weeks. They collect Celtics moments. Celtics are favored. The intelligence layer composes: create a challenge tied to the Celtics' run, with a reward calibrated to this collector's spending history, delivered before they think to look. Nobody designed that product. It was composed from primitives at a specific moment for a specific person."

Layer 4: Interfaces. The products themselves — they are delivery surfaces, not where value is created.

The Roadmap Inversion

"The traditional road map, where product managers hypothesize about what to build next, is any company's ultimate limiting factor. The alternative: when the intelligence layer tries to compose a solution and fails because a capability does not exist, that failure signal IS the roadmap."

The system tells you what to build by failing to do it.
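
That mechanism can be sketched directly: when composition hits a capability that doesn't exist, the miss is recorded as a roadmap entry rather than swallowed as an error. All names here are hypothetical illustrations, not real Dapper services:

```python
class MissingCapability(Exception):
    def __init__(self, name: str):
        super().__init__(name)
        self.name = name

CAPABILITIES = {"create_challenge", "send_notification"}  # what exists today
ROADMAP: list[str] = []                                   # what failures reveal

def invoke(capability: str, **kwargs):
    """Invoke a capability; a miss becomes a roadmap entry, not just an error."""
    if capability not in CAPABILITIES:
        ROADMAP.append(capability)
        raise MissingCapability(capability)
    return f"executed {capability}"  # kwargs unused in this sketch

def compose_reengagement(collector: dict):
    """Intelligence-layer sketch: challenge + calibrated reward + notification.
    Composition halts at the first missing capability."""
    invoke("create_challenge", team=collector["team"])
    invoke("calibrate_reward", history=collector["spend"])  # not built yet
    invoke("send_notification", user=collector["id"])

try:
    compose_reengagement({"id": "u1", "team": "Celtics", "spend": 120})
except MissingCapability:
    pass

print(ROADMAP)  # ['calibrate_reward']
```

The roadmap is no longer hypothesized; it is observed.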

Autonomy of the Intelligence Layer Itself

"What lets us put this system into production immediately is that the intelligence layer itself can have autonomy levels. At low autonomy, it surfaces insights from the world model — but we, as team members, create the solution, because the capabilities to execute don't exist yet or aren't trusted yet. The layer is intelligent; it just can't act on its intelligence yet."


The Three Learning Loops

Inside each autonomous system, three nested learning cycles operate at different timescales:

Loop 1: Generation (minutes). The system generates many variants against a scoring matrix and presents the highest-scoring options.

"A campaign gets generated 20 times, scored on brand alignment + targeting precision + expected conversion, and the top 3 are presented."

Loop 2: Preference (days/weeks). When the human picks variant 3 (scored 7.2) over variant 1 (scored 8.5), that's signal. The scoring matrix underweighted something the human values.

"Over time, the system's top pick converges with the human's top pick. The system learns what you actually value."

Loop 3: Impact (weeks/months). Market outcomes calibrate the scoring matrix.

"Each outer loop calibrates the one inside it: Impact calibrates preference — the human learns which of their instincts produce results. Preference calibrates generation — the scoring matrix converges with informed judgment. Generation produces better starting options each cycle."


The Hierarchy of Human Leverage

Where should humans spend their time? The answer is a hierarchy:

  • Layer 1: THE WORK. The output itself (code, spec, campaign, analysis). Durability: depreciates; today's code is tomorrow's legacy.
  • Layer 2: THE SYSTEM. The loop that produces the work. Durability: depreciates; today's harness simplifies as models improve.
  • Layer 3: THE EVAL. The scoring function that judges the work. Durability: compounds slowly; good criteria get refined but the core holds.
  • Layer 4: THE META-EVAL. The system that improves the eval from outcomes. Durability: compounds fastest; durable across model generations.

"Most people live at Layer 1. They do the work. Useful, but it scales linearly with hours. The first unlock is Layer 2: build the system. This is where most 'AI transformation' stalls — people build a loop and operate it forever, never graduating it, never climbing higher. The real leverage is at Layer 3: defining what good looks like."

"Loop herding is the practice of climbing this hierarchy."


How Agents Differ from People

Three critical differences:

"1. Agents are fully literal. A person with medium TRM will figure out your intent even if your instructions are imprecise. An agent will execute your instructions exactly as written, including the parts that are wrong. This means specification quality matters more for agents than for people."

"2. Agents are tireless and parallelizable. A person can manage one task at a time. An agent can run continuously across all tasks simultaneously. This means the bottleneck shifts from execution to judgment."

"3. Agents don't have ego, politics, or career anxiety. You can demote a loop from Level 4 to Level 2 without a difficult conversation. You can run multiple approaches in parallel without anyone feeling threatened. You can kill a loop that isn't working without a performance improvement plan. This makes agents easier to manage in some ways — but it also means you don't get the feedback signals that come from human resistance. A person who pushes back on a bad plan is providing valuable signal. An agent will execute the bad plan silently. The herder must build explicit feedback mechanisms to replace the organic feedback that comes from human judgment and resistance."


The SRE Parallel

Site Reliability Engineering has already solved many of these problems for automated systems:

  • Error budgets — a loop gets a defined tolerance for errors. A 95% accuracy SLO means 5% of outputs can be wrong before human intervention escalates.
  • SLOs — define what "good enough" means for each loop. Not perfect — good enough.
  • Runbooks — documented procedures for when things go wrong.
  • Graduated response — YELLOW at threshold, RED at critical, automatic demotion at unacceptable.
  • Blameless postmortems — when a loop fails, the question is "what does the loop need to improve?" not "whose fault is it?"
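
The error-budget idea translates directly. A sketch using the 95% accuracy SLO from the example above; the YELLOW-at-half-budget threshold is an assumption:

```python
def error_budget_status(outputs: int, errors: int, slo: float = 0.95) -> str:
    """Traffic light from error-budget consumption in a review window.
    A 95% SLO on 200 outputs allows 10 errors before the budget is gone."""
    budget = (1 - slo) * outputs                       # allowed errors
    consumed = errors / budget if budget else float("inf")
    if consumed >= 1.0:
        return "RED"      # budget exhausted: human intervention escalates
    if consumed >= 0.5:
        return "YELLOW"   # at threshold: attention needed
    return "GREEN"
```

This is the SRE framing of "good enough": the loop is not expected to be perfect, only to stay inside its budget.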

Building Your First Loop

A two-week sprint.

Week 1: Specify. Pick one repeating task you do at least weekly. Write the spec: what does a good output look like? Write the verification: how do you check? Build the simplest version — a Claude Code skill, a prompt template, a cron job.

Week 2: Verify. Run the loop for five consecutive instances. Score each output against your spec. Log what you changed, what was wrong, what was right.

After Week 2:

"If >60% of outputs were good without changes — move to Level 2. If <60% — improve the spec or the loop, run another week. Share what you learned in the team channel."

Ongoing cadence:

  • Every 2 weeks — graduation review using the loop's own metrics
  • Every month — loop improvement. What new capability can the loop handle?
  • Every quarter — new loop. What adjacent task should you automate next?

Role-Specific Examples

Product Manager

  • Level 1: You write the spec. Claude reviews it against evaluation lenses and gives feedback.
  • Level 2: Claude drafts the spec from your brief. You review against the evaluation lenses yourself.
  • Level 3: Data agent detects an opportunity, product agent generates the bet, evaluation agent scores it. You read the daily digest, adjust and approve.
  • Level 4: The loop runs weekly. Data signal -> proposal -> evaluation -> your approval -> engineering.
  • Level 5: The loop runs continuously. You review the monthly audit. Your time shifts to improving the evaluation frameworks.

Engineer

  • Level 1: Agent detects the bug. You read the ticket. You fix it manually with Claude Code.
  • Level 2: The SWE pipeline generates a PR. You review every line.
  • Level 3: The pipeline generates PRs for small/medium bugs. You review the daily digest. You deep-review only high-risk PRs.
  • Level 4: The pipeline handles bugs below a risk threshold end-to-end. You review the weekly quality report.
  • Level 5: The pipeline is self-improving. It proposes its own prompt improvements based on failure analysis.

Campaign Lead

  • Level 1: You design the campaign. Growth Engine scores it. You iterate.
  • Level 2: Growth Engine generates the campaign plan from your brief. You review and adjust.
  • Level 3: Data agent detects a seasonal opportunity. Product agent generates ideas. Growth Engine designs the top campaign. You review, approve or redirect.
  • Level 4: The loop runs weekly. Data -> design -> review -> launch -> measurement. You approve the weekly campaign stack-rank.
  • Level 5: The campaign loop runs autonomously for recurring types. You focus on novel campaign types and strategic pivots.

The Compound Effect

"30 people. 3 loops each. 50 weeks. 4,500 loop-weeks of compounding improvement per year."

"Each loop that reaches Level 4-5 frees up time for the herder to build the next loop. Each new loop covers more of the herder's domain. Each domain that's covered means better data, faster decisions, fewer surprises."

"The compound effect is the moat. Not the AI models — those are available to everyone. Not the tools — those are commoditized. The moat is the institutional muscle of building, reviewing, and graduating autonomous systems faster than anyone else. The organization that has 100 Level-4 loops running across product, engineering, data, marketing, and operations will outperform the organization that has 10 people using AI as a calculator."

"That's the gap between using AI tools and herding loops. One scales linearly with headcount. The other compounds."


What This Means for Every Team Member

Every person on the team operates at three levels simultaneously:

"Capability Builder. Build and maintain atomic capabilities. Your playbooks are executable specifications — programs the intelligence layer invokes."

"World Model Maintainer. Keep your part of the shared intelligence current. This replaces reporting up. Instead of telling your manager your status in a 1:1, you feed the world model. The world model tells everyone simultaneously, continuously, without information loss."

"Judgment Gate. Staff the decision points where the system needs human approval. The system handles everything except the decisions requiring your domain-specific judgment. Your value concentrates in the moments where human judgment is irreplaceable — and those moments become higher-leverage because you arrive at them with full context from the world model rather than partial context from a meeting."


Intellectual Foundations

The five levels of loop autonomy synthesize multiple academic frameworks:

  • Level 1 (Operator): Grove Low TRM; Dreyfus Novice; Sheridan Levels 2-4; Vygotsky Full Scaffolding; Collins Modeling
  • Level 2 (Collaborator): Grove Low-Medium TRM; Dreyfus Advanced Beginner; Sheridan Level 5; Vygotsky Partial Scaffolding; Collins Coaching
  • Level 3 (Consultant): Grove Medium TRM; Dreyfus Competent; Sheridan Levels 6-7; Vygotsky Fading Scaffolding; Collins Scaffolding
  • Level 4 (Approver): Grove High TRM; Dreyfus Proficient; Sheridan Levels 7-8; Vygotsky Minimal Scaffolding; Collins Fading
  • Level 5 (Observer): Grove High TRM + Monitoring; Dreyfus Expert; Sheridan Levels 8-9; Vygotsky Independent; Collins Independent Practice

Key insight from Grove:

"The presence or absence of monitoring is the difference between a supervisor's delegating a task and abdicating it. Monitoring persists at all TRM levels. What changes is frequency and granularity — not whether it happens."

Key insight from the scaffolding research building on Vygotsky:

"Scaffolding interventions with explicit fading protocols produced effect sizes of d=0.71, compared to d=0.32 for programs where scaffolds remained constant. Fading roughly doubles the effectiveness."

Key insight on trust from Lee and See (2004):

"Three bases of trust in automation: Performance trust — does it work? Process trust — do I understand how it works? Purpose trust — is it aligned with my goals? All three must be present for trust to be calibrated."

And the four failure modes from Parasuraman and Riley (1997):

"Use — appropriate activation of automation. Misuse — over-reliance. Disuse — under-reliance. Abuse — deploying without guardrails or graduation criteria."

"The most common failure at Dapper today is disuse — people who have proven tools but don't use them because 'it's faster to just do it myself.' That's a failure of delegation, and it has the same cost as it does with human reports: you become the bottleneck."


Further Reading

  • AI at Dapper — current AI systems and how to use them
  • How We Work — frameworks, vocabulary, and decision-making culture
  • The source documents in Google Drive for full academic citations and detailed architecture