The Cutting Edge of AI Agents — What the Team Needs to Know¶
Sources: "The Dapper Mini AGI — Loop Herding, World Models, and the Future of Work" (Apr 2026); "The Dapper Intelligence Model" (Apr 7, 2026); "Loop Herding — The Discipline of Building and Graduating Autonomous Systems" (Apr 2026); "The Loop Playbook" (Apr 2026); "Building the Intelligence: How the Collectibles Team Becomes a Mini-AGI" (Apr 5, 2026)
The December 2025 Threshold¶
In December 2025, AI coding agents crossed a capability step function.
"Coding agents basically didn't work before December... not gradually and over time, but specifically this last December." — Andrej Karpathy, February 2026
SWE-bench Verified scores rose from approximately 2% when the benchmark was introduced to 80.9% by early 2026, roughly a 40x improvement in about two years. Two independent practitioners, Karpathy and Jack Dorsey, observed the same step function, and Dorsey cited that December as the trigger for Block's restructuring.
"Any organizational assumption made before Q1 2026 about what AI can and cannot do is stale."
This is not a trend line. It is a step function. The entire organizational model described in the loop herding documents is predicated on this capability threshold being real and durable.
What AI Agents Can Do Now¶
Anthropic's internal data provides the most honest assessment:
"Engineers now use Claude in 59% of their work. Merged pull requests per engineer per day increased 67%. And 27% of AI-assisted work represents entirely new tasks that would not have happened otherwise." — Anthropic internal metrics, 2026
Dario Amodei has described "the 10% that remains" — the fraction of any workflow where human judgment, creativity, and relationship management add irreplaceable value. The architecture we are building is designed to free people from the 90% that does not require their judgment so they can invest fully in the 10% that does.
Shopify recognized this earliest:
"CEO Tobi Lutke mandated that teams 'demonstrate why they cannot get what they want done using AI' before requesting headcount. Shopify held headcount flat at approximately 8,100 while revenue grew from $5.6B to $11.56B. Revenue per employee went from $483K to over $1.3M." — Shopify public filings, 2022-2025
Specification Files as Source Code¶
Karpathy's Software 3.0 thesis:
"Prompts are the new source code. English is the new programming language. And large language models are the new CPUs."
As of April 2026, 15+ specification file formats are in active use. The ones that matter to us:
| Format | Creator | Function |
|---|---|---|
| CLAUDE.md | Anthropic | Persistent project-level instructions, hierarchical |
| AGENTS.md | OpenAI / Linux Foundation | Cross-platform universal standard |
| SKILL.md | Anthropic | On-demand workflows loaded when matching task detected |
| program.md | Kevin Gu (AutoAgent) | Human writes one file, meta-agent autonomously improves the agent |
"A well-structured playbook is a program. The SKILL.md specification ecosystem the company already uses proves this works at scale — AI agents execute structured English specifications reliably today."
The six-layer specification stack:
- Project Identity (always loaded) — AGENTS.md, CLAUDE.md
- Scoped Rules (conditionally loaded) — file-type or domain-specific
- Skills (on-demand) — SKILL.md, loaded when matching task detected
- Design System (UI work) — DESIGN.md
- External Knowledge (dynamic) — RAG, MCP servers, APIs
- Runtime Enforcement (hard constraints) — policy engines, sandboxing, kill switches
Our plugin architecture maps directly to this stack.
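To make the stack concrete, here is what a minimal on-demand skill file might look like. The frontmatter follows Anthropic's published SKILL.md convention (a name plus a description the agent matches against incoming tasks); the workflow content itself is a hypothetical example, not an actual Dapper artifact:

```markdown
---
name: pack-drop-postmortem
description: Use when asked to write a postmortem for a collectible pack drop.
---

# Pack Drop Postmortem

1. Pull sell-through, queue depth, and error rates for the drop window.
2. Compare against the last three drops of the same tier.
3. Write findings as atomic facts with source queries and dates.
4. Flag any metric that contradicts an existing knowledge-base entry.
```

Because the skill loads only when a matching task is detected, it costs no context on unrelated work.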
Context Engineering vs. Prompt Engineering¶
Tobi Lutke and Karpathy both endorsed this reframing (June 19, 2025):
"Prompt engineering = cleverly phrasing a question. Context engineering = constructing an entire information environment so the AI can solve the problem reliably."
Simon Willison decomposed it: system instructions, retrieved knowledge, tool results, conversation history. Prompt engineering covers only the first component. Context engineering covers all four.
Critical constraint: The "Lost in the Middle" phenomenon — performance is highest when relevant information occurs at the beginning or end of context. It significantly degrades for information buried in the middle.
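The ordering constraint is mechanical enough to encode. A minimal sketch of edge-first context assembly, assuming caller-supplied relevance scores (the function name and alternating-placement scheme are my own illustration, not from the sources):

```python
def assemble_context(system: str, chunks: list[tuple[float, str]]) -> str:
    """Order retrieved chunks so the highest-relevance material sits at the
    edges of the context window, where "Lost in the Middle" says models
    attend best. Each chunk is a (relevance_score, text) pair."""
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)
    # Alternate placement: best chunk first, second-best last, and so on,
    # pushing the weakest material toward the middle.
    front, back = [], []
    for i, (_, text) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(text)
    return "\n\n".join([system, *front, *reversed(back)])
```

The same idea applies to any context assembler: put the system instructions and the most load-bearing retrieved facts at the edges, and let boilerplate absorb the middle.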
A practical warning from ETH Zurich: LLM-generated specification files reduced task success by ~3% and increased costs by 20%. Agents followed instructions too literally. The corrective:
"Include only non-discoverable information in specification files. If the agent can figure it out from the codebase, don't tell it."
The Harness Matters More Than the Model (For Now)¶
At current capability levels, infrastructure around the model matters more than the model itself:
- LangChain experiment: Same model scored 52.8% on coding benchmarks (outside Top 30). With harness-only changes: 66.5% (Top 5). Nothing about the model changed.
- Vercel: Removed 80% of their agent's tools. Got 3.5x speed improvement, 20% higher success rates, 37% fewer tokens, 42% fewer steps.
"We were building tools to summarize what was already legible." — Vercel engineering team
But this advantage is shrinking. The "Agent Complexity Law" (Dai et al., 2025):
"The performance gap between agents of varying complexity will shrink as the core model improves."
The implication for us:
"Build infrastructure that compounds knowledge, not infrastructure that compensates for model weakness. A query template that teaches the model SQL will become unnecessary as models improve. But a knowledge store of 22 verified findings about your specific business — with provenance chains, contradiction tracking, staleness detection — compounds regardless of model capability."
The Three Engines of Compounding Intelligence¶
Three distinct mechanisms create the intelligence flywheel. Each is incomplete alone.
Engine 1: Optimization (the AutoResearch pattern)¶
Try variations, keep what scores higher. Requires a scalar metric.
"Karpathy's AutoResearch: 630 lines of Python, one markdown prompt, 700 experiments in 2 days. The system generates a hypothesis, runs the experiment, scores the result, and iterates — no human in the loop during the run."
"Kevin Gu's AutoAgent: human writes program.md, meta-agent writes and optimizes the agent code by hill-climbing on benchmark scores. 96.5% on SpreadsheetBench — 'every other entry was human-engineered. Ours wasn't.'"
The pattern: human defines the goal and the scoring function, machine explores the solution space.
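That division of labor fits in a few lines. A minimal hill-climbing sketch, assuming the caller supplies the experiment generator (`mutate`) and the scalar scoring function (`score`), which are the parts the human owns:

```python
import random

def hill_climb(candidate, mutate, score, iterations=100, seed=0):
    """Generic optimization loop in the AutoResearch spirit: try a
    variation, keep it only if the scalar metric improves. No human
    in the loop during the run."""
    random.seed(seed)
    best, best_score = candidate, score(candidate)
    for _ in range(iterations):
        trial = mutate(best)
        trial_score = score(trial)
        if trial_score > best_score:   # keep what scores higher
            best, best_score = trial, trial_score
    return best, best_score
```

Everything interesting lives in `score`: if the metric is wrong, the loop will faithfully optimize the wrong thing, which is why defining it stays a human job.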
Engine 2: Accumulation (the knowledge base)¶
Store every verified finding. Cross-reference. Detect staleness and contradictions.
"Karpathy (April 3, 2026): 'A large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge.' He built an evolving markdown library maintained by AI — self-healing, auditable, human-readable. The knowledge base is the artifact, not the code."
Engine 3: Adaptation (the self-improving harness)¶
The system examines its own performance and changes how it works.
"A knowledge base that accumulates but never optimizes is a library — useful but static. An optimization loop that iterates but never accumulates is memoryless — rediscovers the same dead ends. A self-improving harness without a knowledge base or formal eval is just vibes-based prompt tweaking."
Why all three are needed:
"Knowledge feeds the eval, the eval drives optimization, optimization generates learnings, learnings feed the knowledge base, adaptation improves the process. Each engine's output is another engine's input."
Knowledge Base Integrity¶
The knowledge base is AI-maintained. Without integrity mechanisms, it poisons itself — AI-generated findings compound errors. Six mechanisms prevent this:
| Mechanism | How |
|---|---|
| Provenance | Every fact traces to source query, date, sample size. Re-runnable. |
| Verification status | unverified (AI-generated, flagged) / verified (human-confirmed) / auto-verified (re-derived, matches prior) |
| Contradiction detection | Conflicting findings coexist until resolved. No silent overwrites. |
| Staleness tracking | Every number has an expiry. Stale = flagged, not served as current. |
| Blacklist | Known-wrong numbers actively blocked in output. |
| Atomic facts | Findings are 2-3 sentences with numbers and source. Not narratives. |
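The six mechanisms map naturally onto a data model. A minimal sketch, where the field names and the `servable` policy are illustrative rather than the production schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Fact:
    """An atomic finding carrying the integrity metadata described above."""
    claim: str              # 2-3 sentences with numbers and source
    source_query: str       # provenance: re-runnable query
    observed_on: date
    sample_size: int
    expires_on: date        # staleness: every number has an expiry
    status: str = "unverified"  # unverified | verified | auto-verified
    contradicts: list = field(default_factory=list)  # coexisting conflicts

    def is_stale(self, today: date) -> bool:
        # Stale facts are flagged, never silently served as current.
        return today >= self.expires_on

    def servable(self, today: date, blacklist: set) -> bool:
        # Blocked if known-wrong, stale, or still under open contradiction.
        return (self.claim not in blacklist
                and not self.is_stale(today)
                and not self.contradicts)
```

The key property is that nothing is ever silently overwritten or silently served: a fact that fails any check is surfaced as a flag, not dropped.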
Governance: What Aviation and Self-Driving Teach Us¶
"The single most dangerous transition in any automated system is the handoff from automation to human at the moment of crisis."
Three catastrophic examples:
- Air France 447: 228 dead. Pilots who had been monitoring automation for hours couldn't hand-fly when automation disconnected. The more reliable the automation, the less prepared the human when it fails.
- Boeing MCAS: 346 dead across two crashes. Single sensor, no authority limit on repeated activation, concealed from pilots.
- Uber fatality: Safety driver watching television. System alternated between classifying pedestrian as vehicle, bicycle, and unknown for 5 seconds.
The lesson: graduated autonomy with active monitoring, multiple independent safety layers, and regular manual override drills.
Applied to AI agent loops as crawl/walk/run:
| Stage | What Happens | Purpose |
|---|---|---|
| Crawl (Shadow Mode) | Agent analyzes but does not act. Proposals tracked but not executed. | Trust calibration. Show failures early. |
| Walk (Canary) | Agent handles growing % of real decisions under oversight. 1% -> 5% -> 10% -> 25% -> 50%. | Each expansion justified by data, not feeling. |
| Run (Full Autonomy) | Agent executes under policy guardrails and audit trail. | ONLY for low-risk, repetitive, reversible, well-understood tasks. |
A counterintuitive finding on trust:
"Participants exposed to automation failures earlier on were less susceptible to both automation complacency and automation bias. When deploying loops, deliberately show failures EARLY. Do not start with the agent's best performance."
This crawl/walk/run model maps to the five autonomy levels: Crawl = Levels 1-2, Walk = Level 3, Run = Levels 4-5.
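The Walk-stage ladder can be enforced in code so that expansion is data-justified by construction. A sketch with assumed gate numbers (the 95% success threshold and 200-decision minimum volume are my own placeholders, not from the sources):

```python
CANARY_STEPS = [0.01, 0.05, 0.10, 0.25, 0.50]  # the Walk-stage ladder

def next_canary_share(current: float, successes: int, total: int,
                      threshold: float = 0.95, min_volume: int = 200) -> float:
    """Expand the agent's share of real decisions only when the observed
    success rate at the current step clears the threshold over enough
    volume. Otherwise hold at the current step for human review."""
    if total < min_volume or successes / total < threshold:
        return current   # expansion not justified by data
    i = CANARY_STEPS.index(current)
    return CANARY_STEPS[min(i + 1, len(CANARY_STEPS) - 1)]
```

Logging every hold alongside every expansion also gives reviewers the early failure exposure the trust research recommends.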
The Production Pipeline: Four Zones¶
When a loop produces output that reaches production, it flows through a universal pipeline with sensitivity configured by zone:
GENERATE -> SCORE -> GATE -> DEPLOY -> MONITOR -> LEARN
Six scoring dimensions (all zones use all six — weights differ):
- Test Coverage — what percentage of behavior is verified by automated tests?
- Blast Radius — if this fails, how many users/dollars/systems are affected?
- Reversibility — can we undo this in minutes, hours, or never?
- Comprehensibility — can a second reviewer explain every line of what changed and why?
- Pattern Familiarity — known pattern with historical success, or novel territory?
- Provenance Traceability — can we trace every decision back to a requirement, a signal, or an explicit human choice?
GREEN Zone (application code, UI, tooling)¶
Composite score >=85% = auto-merge, dark launch, monitoring, graduated rollout, auto-rollback if monitoring fails. No human gate for high-scoring changes. Both probability AND consequence of failure are low.
YELLOW Zone (contract configs, strategies using audited patterns)¶
Composite score >=80% with AI + human review, staging, human-approved deploy. Yellow zone changes touch money or external-facing state, but use patterns previously validated.
RED Zone (core contract logic, payment/lending, token mechanics)¶
Composite score >=90% on ALL dimensions, no single dimension below 80%. AI loop enforces comprehensibility checks BEFORE presenting to human. Human has full veto. After human approval: external audit for code touching user funds.
BLACK Zone (protocol consensus, private keys, RNG, kill switches)¶
"AI does not generate the production artifact. Period. AI assists upstream (specs, threat models, tests) and downstream (verification, formal analysis). But the production code is human-written, human-reviewed, and human-approved. The consequences of failure are catastrophic and irreversible."
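The zone thresholds above reduce to a small gating function. This sketch uses an unweighted mean as the composite and fills one gap the text leaves open (what GREEN does below 85%, here a fallback to human review); both choices are assumptions, since the real pipeline weights dimensions per zone:

```python
DIMENSIONS = ["test_coverage", "blast_radius", "reversibility",
              "comprehensibility", "pattern_familiarity", "provenance"]

def gate(zone: str, scores: dict) -> str:
    """Apply the zone thresholds to six dimension scores (each 0-100)."""
    composite = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    if zone == "BLACK":
        return "human-written only"  # AI never generates the artifact
    if zone == "RED":
        # >=90 composite AND no single dimension below 80.
        if composite >= 90 and min(scores[d] for d in DIMENSIONS) >= 80:
            return "present to human (full veto), then external audit"
        return "reject"
    if zone == "YELLOW":
        return "AI + human review" if composite >= 80 else "reject"
    if zone == "GREEN":
        return "auto-merge + dark launch" if composite >= 85 else "human review"
    raise ValueError(f"unknown zone: {zone}")
```

Note how the RED rule's per-dimension floor changes behavior: a change can average above 90 and still be rejected because one dimension (say, blast radius) scores poorly.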
The DORA Warning¶
DORA 2025 found the AI Productivity Paradox:
"Individual engineers complete 21% more tasks and merge 98% more PRs with AI tools. But organizational delivery metrics are flat."
Why?
"Individual speed gains are offset by instability (more code churn, more bugs), review overhead (someone has to review all those PRs), and the absence of quality infrastructure (testing, version control maturity, fast feedback loops)."
The lesson:
"Building a loop that generates outputs faster is easy. Building a loop that generates good outputs reliably is hard. The review and measurement infrastructure is the product — not an afterthought."
"This is why every loop needs a track function. Without measurement, you get Klarna: fast, confident, wrong, and expensive to fix."
Who Has Receipts¶
| Company | Claim | Evidence | Verdict |
|---|---|---|---|
| Shopify | AI Before Headcount | Revenue/employee $483K->$1.3M+. OpEx 60%->29%. | Strongest case. Clean before/after. |
| Block | Company as intelligence | Cut 40%. Published essay with Sequoia. | Bold vision, thin evidence. 95% of AI code needs human modification. |
| Klarna | AI replaces 700 agents | Quality collapsed. Rehired humans. | Cautionary tale. Graduated without measurement. |
| Anthropic | Internal transformation | 59% of work uses Claude. 67% more PRs/day. | Honest. 0-20% fully delegatable. |
| Haier | Flat hierarchy at scale | $57B revenue, 23% annual growth since 2012. 12 layers -> 3. | 14 years of evidence, without AI. |
The AI-washing data:
"60% of executives made headcount cuts anticipating AI efficiency. Only 2% had actual AI implementations driving the reductions. Zero of 160 New York companies attributed layoffs to AI in legally required filings."
"Sam Altman: 'There's some AI washing where people are blaming AI for layoffs that they would otherwise do.'"
Cost Economics¶
| Workload | Cost Range |
|---|---|
| Single SWE task | $5-8 in API fees |
| Multi-step research | $5-15 |
| Coding agent session (mixed models) | $3-7 |
| Coding agent session (all-Opus) | $15-30 |
Token consumption: agentic workloads consume 5-30x more tokens than standard chat.
The plan-and-execute pattern — frontier model for planning, cheaper model for execution — reduces costs by up to 90%.
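A sketch of the pattern and its cost math. The callable interfaces and the per-call prices are assumptions for illustration, not vendor figures:

```python
def plan_and_execute(task, plan_llm, exec_llm):
    """One expensive planning call, many cheap execution calls.
    plan_llm returns a list of step strings; exec_llm runs one step."""
    steps = plan_llm(f"Break this task into numbered steps: {task}")
    return [exec_llm(step) for step in steps]

# Illustrative cost comparison with assumed per-call prices.
FRONTIER_COST, CHEAP_COST = 0.50, 0.02  # dollars per call (assumed)

def session_cost(n_steps):
    all_frontier = FRONTIER_COST * (1 + n_steps)  # frontier does everything
    mixed = FRONTIER_COST + CHEAP_COST * n_steps  # frontier plans only
    return all_frontier, mixed
```

At these assumed prices, a 20-step session costs $0.90 mixed versus $10.50 all-frontier, a roughly 91% reduction, consistent with the "up to 90%" figure.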
At Dapper's scale (30 active loop herders x 3-5 loops each, with an active loop averaging ~$10/day and only a handful of loops running on any given day): ~$50-150/day in agent costs, or $1,500-4,500/month.
"Gartner prediction: Over 40% of agentic AI projects will be canceled by end of 2027, primarily from escalating costs, unclear business value, or inadequate risk controls. The companies that succeed solve governance first, not deploy fastest."
Why This Is Not Replacement¶
"Klarna fired 700 customer service agents and publicly celebrated AI replacing them. In May 2025, CEO Sebastian Siemiatkowski reversed course: 'We focused too much on efficiency and cost. The result was lower quality.' The lesson is not that AI fails at customer service. The lesson is that removing human judgment from the loop entirely, without understanding which tasks require it, produces predictable failure."
The AF447 case demonstrates the deeper risk:
"The more reliable the automation, the less the human operator may be able to contribute to success. When the autopilot disengages and a human must suddenly understand a complex, degraded situation, the handoff is the single most dangerous transition in any automated system."
The strongest positive counterpoint:
"Haier and the Rendanheyi model. Zhang Ruimin eliminated 10,000 management jobs, broke the company into thousands of micro-enterprises, reduced hierarchy from 12 layers to 3. Result: $57B trailing 12-month revenue, 23% annual growth since full implementation in 2012. This is not an experiment. It is a $57B company that has operated this way for 14 years — and that was without AI coordination."
The question this raises:
"What happens when you add AI coordination to a model that already works without it?"
What This Means for You¶
"Performance conversations under this model are grounded in outcomes, not effort. The person who builds capabilities the intelligence layer can invoke, maintains their part of the world model, and exercises sharp judgment at gates becomes more valuable."
The tool mindset vs. the system mindset:
"Tool mindset: 'Claude helps me write better specs.' You're still the bottleneck. You scale linearly."
"System mindset: 'I built a loop that writes specs, evaluates them, ships them, and measures them. I improve the loop.' The system scales. You build the next system."
Further Reading¶
- Loop Herding — the full discipline of building and graduating autonomous systems
- AI at Dapper — current AI systems and how to use them
- How We Work — frameworks, vocabulary, and decision-making culture