Geppetto: Autonomous Product Intelligence Loop¶
Source document: Geppetto -- Autonomous Product Intelligence Loop (Architecture & Implementation) (Apr 2026)
Audience: AI Team, Engineering, Product. Team-safe.
Last updated: 2026-04-08
Status: Skills deployed, queue scaffolded, SRE deployment pending.
What Geppetto Is¶
Geppetto is the closed-loop system that connects autonomous agents into a continuous product intelligence pipeline:
Instead of humans manually noticing problems, brainstorming solutions, evaluating them, and tracking results -- the loop runs autonomously, with humans entering at decision gates.
The loop in one line:
Heimdall scans data -> finds signals. Socrates generates solution ideas from those signals. CPO or Delphi evaluates and stack-ranks the ideas. Engineering builds the top-ranked item. Heimdall measures results -> loop restarts.
What it is not:
This is NOT a replacement for human judgment. It's infrastructure that ensures:
- Problems are found before someone happens to notice them.
- Every idea gets evaluated through the same rigorous frameworks.
- Every shipped feature gets measured against its predictions.
- Learnings compound into the system automatically.
The Five Stages¶
Stage 1: SENSE -- Heimdall (Data Science Agent)¶
Continuously scans BigQuery metrics across all three collectibles products + DeFi.
Operates on three cadences:
- Daily: Health scans across 60 metrics per product. RED flags auto-trigger investigation.
- Weekly: Opportunity discovery scans for unknown unknowns -- anomalies, emerging segments, unexpected correlations.
- Weekly: Whale monitoring -- LLM narrative assessment of the most active high-value collectors.
Additional capabilities include per-user churn/upgrade prediction (BQML) and daily checkpoint execution on previously shipped experiments.
Outputs are scored signals with a structured format:
Signal Type: OPPORTUNITY | PROBLEM | TREND | RESULT
Severity: RED | YELLOW | GREEN
Domain: COLLECTIBLES | DEFI
Product: NBA | NFL | DISNEY | FLOW | PEAK_MONEY
Segment: XL | L | M | S | LAPSED | NEW
Confidence: HIGH | MEDIUM | LOW
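The structured format above can be sketched as a validated record. This is an illustrative model only -- the `Signal` class and its field names are assumptions, not Heimdall's actual implementation; the enumerated values are from the schema above.

```python
from dataclasses import dataclass

# Allowed values, taken from the signal schema above.
SIGNAL_TYPES = {"OPPORTUNITY", "PROBLEM", "TREND", "RESULT"}
SEVERITIES = {"RED", "YELLOW", "GREEN"}
DOMAINS = {"COLLECTIBLES", "DEFI"}
PRODUCTS = {"NBA", "NFL", "DISNEY", "FLOW", "PEAK_MONEY"}
SEGMENTS = {"XL", "L", "M", "S", "LAPSED", "NEW"}
CONFIDENCES = {"HIGH", "MEDIUM", "LOW"}

@dataclass
class Signal:
    """Illustrative model of a Heimdall signal (not the real class)."""
    signal_type: str
    severity: str
    domain: str
    product: str
    segment: str
    confidence: str

    def __post_init__(self):
        # Reject any value outside the documented enums.
        assert self.signal_type in SIGNAL_TYPES
        assert self.severity in SEVERITIES
        assert self.domain in DOMAINS
        assert self.product in PRODUCTS
        assert self.segment in SEGMENTS
        assert self.confidence in CONFIDENCES

sig = Signal("PROBLEM", "RED", "COLLECTIBLES", "NBA", "XL", "HIGH")
```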
Status: Plugin complete (14 skills, insight graph seeded with 22 findings + 100 canonical numbers). SRE deployment pending -- needs cron jobs on the application cluster.
Stage 2: IDEATE -- Socrates (Product Methodology Agent)¶
Receives a Heimdall signal, generates 3-5 solution hypotheses.
Uses existing Socrates frameworks: JTBD, ICE scoring, validation hierarchy, Bold Beat design.
Critical design decision:
Machine-generated ideas are capped at 0.5 confidence. Only human validation can push above 0.5.
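The cap is a one-line rule; a minimal sketch, assuming a numeric confidence score and a human-validation flag (both illustrative names):

```python
MACHINE_CAP = 0.5  # machine-generated ideas never exceed this

def effective_confidence(raw: float, human_validated: bool) -> float:
    """Clamp machine confidence at the cap; only human validation lifts it."""
    if human_validated:
        return raw
    return min(raw, MACHINE_CAP)
```

However confident the model is, an unvalidated idea reports at most 0.5.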
Different signal types get different ideation lenses:
OPPORTUNITY: Amplify / Extend / Monetize / Compound
PROBLEM: Patch / Cure / Compensate / Accept
TREND: Accelerate / Hedge / Pivot / Bold Beat
RESULT: Linked to HIT/MEDIOCRE/FAIL/SURPRISE from prior experiment
Dual output:
- Machine-readable structured format for the next agent to consume.
- Human-readable audit log with full reasoning chain.
Routing logic: NBA/NFL/Disney signals go to collectibles-cpo. Flow/Peak Money signals go to delphi.
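The routing rule is a straight product-to-agent lookup. A sketch (the table values are from the routing logic above; the function name is illustrative):

```python
# Collectibles products route to collectibles-cpo; DeFi/Flow to delphi.
ROUTES = {
    "NBA": "collectibles-cpo",
    "NFL": "collectibles-cpo",
    "DISNEY": "collectibles-cpo",
    "FLOW": "delphi",
    "PEAK_MONEY": "delphi",
}

def route(product: str) -> str:
    """Return the evaluating agent for a signal's product."""
    return ROUTES[product]
```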
Status: Skill deployed -- socrates-product-advisor:receive-signal (v1.6.0).
Stage 3: EVALUATE -- Domain Expert (CPO or Delphi)¶
The key difference from prior evaluation:
These skills evaluate proposals COMPARATIVELY, not one at a time. The question is not just "is this good?" but "given everything else we're doing, should we do THIS now?"
For Collectibles (collectibles-cpo:stack-rank):
- Runs the Board of Directors model on each proposal: growth-check (/150), economy-check, user-lens (4 archetypes), spec quality gate.
- Produces a comparative stack-rank with tension mapping: mutual exclusions, sequencing dependencies, synergies, resource conflicts.
- Assigns items to RICE-LXL buckets: Protect Revenue / Activate Daily / Grow L/XL / Deepen / Defer-Kill.
- Runs a capacity check against sprint person-months with buffer.
- Produces a CEO escalation summary: top 3 to approve, highest-stakes tension, explicit kill list.
Verdicts: SHIP / STRENGTHEN / REDESIGN / KILL.
For DeFi/Flow (delphi:stack-rank):
- Runs the 7-step evaluate on each proposal: competitive scan (mandatory first), user impact, tech feasibility, business impact, risk assessment, recommendation, messaging angle.
- Computes a weighted composite score (/100) across 6 dimensions, including 15% phase-alignment.
- Ranks proposals against the 5-phase strategic plan: Exchange Normalization -> Peak Money -> $FAN -> Credit Protocol -> Agentic Finance.
Phase-gated buckets: NOW / NEXT / LATER / NO. Verdicts: GO / NO-GO / CONDITIONAL-GO.
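The composite can be sketched as a weighted sum over per-dimension scores. Only the 15% phase-alignment weight is stated above; the other five weights and the dimension keys below are placeholders, not Delphi's actual configuration.

```python
# Weights sum to 1.0 so a 0-100 score per dimension yields a /100 composite.
WEIGHTS = {
    "competitive": 0.20,       # placeholder
    "user_impact": 0.20,       # placeholder
    "feasibility": 0.15,       # placeholder
    "business_impact": 0.20,   # placeholder
    "risk": 0.10,              # placeholder
    "phase_alignment": 0.15,   # stated in the doc: 15% phase-alignment
}

def composite(scores: dict) -> float:
    """Weighted composite (/100) across the six evaluation dimensions."""
    return sum(scores[k] * w for k, w in WEIGHTS.items())
```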
Status: Both skills deployed -- collectibles-cpo:stack-rank (v1.1.0), delphi:stack-rank (v0.4.0).
Stage 4: BUILD -- Engineering Pipeline¶
Takes the top-ranked, CEO-approved proposal and implements it. Currently connects to Jim Wheaton's SWE Pipeline:
Multi-agent system: researcher -> planner -> plan evaluator -> implementer -> 7 reviewers -> browser verifier -> self-improvement agent. Orchestrated by watcher.sh -- polls Linear, dispatches to agents.
The self-improvement loop:
Agent examines its own Claude Code logs, finds inefficiencies, proposes prompt improvements.
Connection to Geppetto:
Top-ranked queue items become Linear tickets -> Jim's pipeline picks them up.
The engineering pipeline also has a recommendation to adopt addyosmani/agent-skills (19 open-source engineering discipline skills) for TDD, code review, git workflow, and CI/CD gates -- filling gaps in testing discipline and code quality that the current pipeline does not enforce.
Stage 5: MEASURE -- Closing the Loop¶
When a feature ships, a tracking plan is registered in Heimdall with: prediction, success criteria, kill criteria, checkpoint schedule (d+1, d+7, d+30, d+90).
Heimdall track executes daily checkpoints: runs the queries, compares actual vs predicted, classifies GREEN/YELLOW/RED.
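A sketch of the Stage 5 mechanics. The checkpoint offsets (d+1, d+7, d+30, d+90) are from the tracking plan above; the GREEN/YELLOW/RED thresholds here (90% and 70% of prediction) are assumptions, not Heimdall's actual rules.

```python
from datetime import date, timedelta

CHECKPOINT_OFFSETS = [1, 7, 30, 90]  # d+1, d+7, d+30, d+90

def checkpoint_dates(ship_date: date) -> list:
    """Checkpoint schedule registered when a feature ships."""
    return [ship_date + timedelta(days=d) for d in CHECKPOINT_OFFSETS]

def classify(actual: float, predicted: float) -> str:
    """Compare actual vs predicted; thresholds are illustrative."""
    ratio = actual / predicted
    if ratio >= 0.9:
        return "GREEN"
    if ratio >= 0.7:
        return "YELLOW"
    return "RED"
```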
When the measurement window closes, a mandatory retrospective is filed: what we predicted, what happened, the gap, what we learned. Learnings auto-feed into Heimdall's self-evolving reference files. New signals generated from results -> loop restarts at Stage 1.
Product review runs in parallel: CPO weekly-review and review-sprint for Collectibles; Delphi review and 12-month scorecard for DeFi.
Status: Heimdall track skill is built. Not yet running on cron (same SRE dependency).
The Opportunity Queue¶
The queue is the coordination surface between all agents:
It's a file-based kanban board in the git repo.
Location: research-reports/opportunity-queue/
Structure:
opportunity-queue/
  SCHEMA.md -- Lifecycle rules, format reference
  collectibles/
    QUEUE.md -- Kanban index: Stack Rank, Pipeline, Measuring, Closed
    items/ -- One file per proposal (YAML frontmatter + markdown)
    archive/2026-Q2/
  defi/
    QUEUE.md
    items/
    archive/2026-Q2/
Lifecycle stages:
DETECTED -> IDEATED -> EVALUATED -> APPROVED -> IN_PROGRESS -> SHIPPED -> MEASURING -> CLOSED (or KILLED from any stage)
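The lifecycle above is a linear progression with one escape hatch. A sketch (function names are illustrative; the stage order and the KILLED-from-any-stage rule are from SCHEMA.md as described above):

```python
STAGES = ["DETECTED", "IDEATED", "EVALUATED", "APPROVED",
          "IN_PROGRESS", "SHIPPED", "MEASURING", "CLOSED"]

def next_stage(current: str) -> str:
    """Advance one step along the lifecycle; CLOSED is terminal."""
    if current == "CLOSED":
        raise ValueError("CLOSED is terminal")
    return STAGES[STAGES.index(current) + 1]

def valid_transition(src: str, dst: str) -> bool:
    """Forward-only moves, plus KILLED from any stage."""
    if dst == "KILLED":
        return True
    return STAGES.index(dst) == STAGES.index(src) + 1
```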
Who writes at each stage:
| Stage | Writer |
|---|---|
| DETECTED | Heimdall (creates the signal/item file) |
| IDEATED | Socrates (adds solution hypotheses) |
| EVALUATED | CPO or Delphi (adds score, verdict, rank) |
| APPROVED | CEO (reviews stack-rank, approves top items) |
| IN_PROGRESS | Engineering (adds PR link, branch, Linear ticket) |
| SHIPPED | Engineering (marks deployed, registers tracking plan) |
| MEASURING | Heimdall track (adds checkpoint results) |
| CLOSED | Heimdall track + CEO (files retrospective, archives) |
Why file-based:
- Agents communicate via the filesystem. Every agent can read/write markdown.
- Git gives us full audit trail, diff history, and conflict resolution for free.
- No external service dependency. No API rate limits. No auth complexity.
- Human-readable. You can open QUEUE.md and see the state of the world.
- Per-item files mean agents never write the same file simultaneously.
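Reading one item file is trivially simple, which is the point. A minimal sketch: the naive `key: value` split keeps it stdlib-only (a real reader would use a YAML parser), and the field names are illustrative, not the actual SCHEMA.md fields.

```python
def parse_item(text: str):
    """Split a queue item into (frontmatter dict, markdown body)."""
    _, fm, body = text.split("---", 2)
    meta = {}
    for line in fm.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()

example = """---
stage: EVALUATED
rank: 2
verdict: SHIP
---
# Proposal: example item
Evidence, hypothesis, and evaluation go here.
"""
meta, body = parse_item(example)
```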
What Already Existed vs. What's New¶
Already existed (strong):¶
- Heimdall: 14 skills, insight graph, self-evolving references, cross-plugin context calls
- Socrates: 8 human-prompted skills (extract-soul, design-bet, validate, bold-beat, etc.)
- CPO: Board of Directors model, growth-check /150, economy-check, user-lens, review-spec
- Delphi: 7-step evaluate, protocol-adr, defi-adr, competitive scan mandatory
- Growth Engine: design-campaign, review-campaign, launch-gate
- Jim's SWE Pipeline: multi-agent engineering system with self-improvement loop
Already existed (gaps):¶
- CPO and Delphi evaluate one proposal at a time. No comparative ranking across a queue.
- Socrates only works when a human prompts it. No machine-to-machine handoff.
- No persistent backlog connecting data signals -> product ideas -> engineering work -> measurement.
- Jim's pipeline has no intake (Tommy's detection pipeline went dark).
- No engineering discipline enforcement (TDD, code review, git workflow).
New (built for Geppetto):¶
- socrates-product-advisor:receive-signal -- bridge between Heimdall signals and product ideation
- collectibles-cpo:stack-rank -- portfolio-level comparative evaluation for collectibles
- delphi:stack-rank -- phase-aligned strategic ranking for DeFi/Flow
- Opportunity queue -- file-based kanban with SCHEMA.md, two domain queues, lifecycle rules
- Process map -- full loop diagram with status and handoff contracts
- Handoff contracts -- machine-readable schemas defining what each stage passes to the next
What Each Team Needs to Know¶
AI Team¶
Deploy Heimdall on the cluster. SRE dossier exists at research-reports/HEIMDALL-SRE-DEPLOYMENT-DOSSIER.md. Without this, the loop doesn't run. This is the #1 blocker.
Reconnect the SWE pipeline intake. Tommy's detection pipeline is dead. The opportunity queue replaces it -- top-ranked items from CPO/Delphi stack-rank become Linear tickets that watcher.sh picks up.
Consider adopting addyosmani/agent-skills for engineering discipline. Jim's pipeline has the multi-agent execution system. What it lacks: TDD enforcement, code review rigor (5-axis), git workflow (trunk-based, 100-line PRs), CI/CD gate sequencing.
The bridge between the queue and the pipeline:
APPROVED -> create Linear ticket -> watcher.sh picks it up.
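A hypothetical sketch of that bridge: scan the per-item queue files for the APPROVED stage and build a ticket payload. The actual Linear API call is omitted, and the function names, the `stage: APPROVED` marker string, and the payload fields are assumptions, not the real contract.

```python
from pathlib import Path

def approved_items(items_dir):
    """Yield queue item files currently marked APPROVED."""
    for path in sorted(Path(items_dir).glob("*.md")):
        if "stage: APPROVED" in path.read_text():
            yield path

def ticket_payload(path):
    """Build an illustrative ticket body pointing back at the queue item."""
    return {
        "title": path.stem,
        "description": f"See queue item {path.name} for full context.",
    }
```

watcher.sh would then pick the ticket up from Linear exactly as it does today; the queue only replaces the intake side.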
David's role: Heimdall SRE deployment is the critical path. Cron jobs, BigQuery access, and the insight graph are defined. It needs cluster access and a service account.
Engineering Team¶
Features will arrive with more context. Every proposal in the queue comes with: Heimdall's data evidence, Socrates's hypothesis and kill criteria, CPO/Delphi's evaluation with archetype impact and economy assessment. You'll know WHY you're building something, not just WHAT.
Every feature ships with a measurement plan. Before you start building, Heimdall registers: prediction, success criteria, kill criteria, checkpoint dates. After you ship, Heimdall measures automatically. No more "we shipped it, did it work? shrug."
What doesn't change: Your existing tools, repos, and workflows. Geppetto is additive.
Product Team¶
Data-driven idea generation happens automatically. Heimdall will surface opportunities you didn't know existed and problems before they hit your dashboard.
Your specs get evaluated through a consistent framework. CPO stack-rank runs growth-check (150pt), economy-check, user-lens, and spec quality gate on every proposal.
You'll see a ranked queue, not a pile of tickets. QUEUE.md shows: what's #1, what tensions exist between proposals, what the capacity constraint is, what should be killed.
Experiments get measured automatically. When you ship a Bold Beat or Tracer Bullet, Heimdall tracks whether it hit the needle at d+1, d+7, d+30, d+90.
What doesn't change: You still own the product decisions. The system generates and evaluates -- you decide.
The confidence cap reinforces this:
Socrates caps machine-generated ideas at 0.5 confidence specifically because ideas need human validation before committing engineering time.
Comparison to External Best Practices¶
Geppetto was benchmarked against addyosmani/agent-skills (Addy Osmani, Google Chrome engineering manager -- 19 open-source skills for AI coding agents).
Where Geppetto is stronger:
- Validation and experimentation (Socrates validate, bold-beat, design-bet) -- they have nothing like this.
- Domain-specific evaluation (CPO growth-check, economy-check, user-lens, Delphi competitive scan) -- they're purely engineering.
- Campaign-first shipping ("never launch features, always launch campaigns") -- they don't know about GTM.
Where they are stronger (and should be adopted):
- TDD (red-green-refactor) -- our product skills have zero testing guidance.
- Code review process (5-axis: correctness, readability, architecture, security, performance).
- Git workflow (trunk-based, 1-3 day branches, 100-line PRs, save point pattern).
- Chesterton's Fence ("before changing or removing anything, understand why it exists").
Recommendation: Engineering team should evaluate addyosmani/agent-skills as the engineering discipline layer within Jim's pipeline. Our skills own product process (what to build, why). Their skills own engineering process (how to build safely). Complementary, not competing.
Humans Can Enter Anywhere¶
The autonomous loop runs continuously. Human intervention can occur at any stage.
The system is designed for human oversight at decision gates -- particularly at the APPROVED stage where the CEO reviews the stack-rank. But humans are not limited to gates. A product manager can inject an idea directly into the queue. An engineer can flag that a shipped feature is behaving unexpectedly. The CEO can kill an item at any stage.
The loop serves humans. It does not replace them.
Current Blocker¶
The entire loop depends on Heimdall running on cron, which requires SRE deployment. Until that lands, the loop doesn't run; this is the #1 blocker.
Until David Wang completes the SRE deployment (BigQuery access on the application cluster, cron job scheduling, service account provisioning), the loop operates in manual mode -- agents can be invoked individually but do not run autonomously.