
Data Infrastructure

REFERENCE | DERIVED | Updated 2026-04-08 | Owner: Data Engineering (David Wang)

The data layer is the foundation everything else builds on. Every analytics skill, every health scan, every intelligence agent reads from it. None write to it. Expanding a mart is the highest-leverage improvement anyone can make, because every downstream system benefits immediately.

BigQuery Semantic Layer

All product analytics runs through a curated BigQuery semantic layer: dapperlabs-data.production_sem_open.*.

These are materialized views built from raw event data across all products (NBA Top Shot, NFL All Day, Disney Pinnacle) and organizations. The semantic layer provides a stable, well-defined interface for querying product data.

Key Principles

  • Read-only for consumers — data engineering maintains the layer; everyone else reads from it
  • Materialized views — pre-aggregated for common query patterns, reducing cost and latency
  • Schema stability — table schemas change through a controlled process, not ad hoc
  • Cross-product — same schema patterns across NBA, NFL, and Disney where applicable

Core Tables

This section needs enrichment from the engineering team.

The table list below is derived from query patterns observed in analytics investigations. Data engineering should provide the canonical schema documentation.

| Table Pattern | Purpose | Notes |
|---|---|---|
| user_spend_segmentation_history | User spending tiers (S/M/L/XL) over time | event_timestamp is DATE type, updates weekly |
| core_int_open_users | User profile and account data | Some fields access-restricted |
| Transaction tables | Purchase, sale, and trade events | Per-product (NBA, NFL, Disney) |
| Challenge tables | Challenge creation, participation, completion | Key engagement metric |
| Pack tables | Pack drops, purchases, opening events | Revenue tracking |
| Marketplace tables | Listings, sales, offers, price data | P2P trading activity |
| clientmetric_* tables | Pre-aggregated client metrics | Preferred for common questions (lower cost) |
| mart_* tables | Analytical marts | Pre-aggregated for specific use cases |

Known Data Issues

| Issue | Impact | Status |
|---|---|---|
| Buyback packs not properly tracked | Look like regular packs in the backend; skews revenue queries | Known, fix pending |
| user_spend_segmentation_history updates weekly | Cannot get daily segment transitions | By design |
| Some core_int_open_users fields access-denied | Limits certain user-level queries | Permission configuration needed |

How to Query: bqrun.sh

The standard way to execute BigQuery queries at Dapper is through the bqrun.sh wrapper script, maintained in the data plugin.

Location: backups/plugins/marketplaces/dapper-claude-code-plugins/plugins/data/skills/bigquery-concierge/scripts/bqrun.sh

What bqrun.sh Does

  • Wraps the bq query CLI command
  • Adds QID/QI instrumentation labels for cost tracking and audit
  • Enforces --use_legacy_sql=false (standard SQL mode)
  • Routes to project dapperlabs-data
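The behavior described above amounts to assembling a `bq query` invocation with the right flags and labels. A minimal Python sketch, assuming a `qid` label key for the QID/QI instrumentation (the real script's label scheme and exact flags may differ):

```python
import shlex

def build_bq_command(sql: str, qid: str, project: str = "dapperlabs-data") -> list[str]:
    """Assemble a `bq query` invocation roughly as bqrun.sh is described to:
    standard SQL enforced, project pinned, and a cost-tracking label attached.
    The `qid` label key is illustrative, not the script's actual scheme."""
    return [
        "bq", "query",
        "--use_legacy_sql=false",   # standard SQL mode, always
        f"--project_id={project}",  # route to the shared data project
        f"--label=qid:{qid}",       # instrumentation label for cost tracking / audit
        "--format=json",
        sql,
    ]

cmd = build_bq_command("SELECT 1", qid="demo-001")
print(shlex.join(cmd))
```

Keeping the command as a list (rather than an interpolated string) avoids shell-quoting bugs when the SQL contains quotes.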

Direct BQ CLI (when bqrun.sh is not available)

```sh
bq query \
  --use_legacy_sql=false \
  --format=json \
  --project_id=dapperlabs-data \
  'SELECT ... FROM `dapperlabs-data.production_sem_open.table_name` WHERE ...'
```

Query Conventions

  • Always use standard SQL (not legacy SQL)
  • Always query from production_sem_open.* unless you have a specific reason to hit raw tables
  • Use CTEs for subqueries (BQ does not support subqueries in JOINs the same way as Postgres)
  • Be aware of scan costs — prefer mart_* and clientmetric_* tables for common aggregations
  • Include date filters to limit scan scope
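Putting those conventions together, here is a hedged sketch that assembles a semantic-layer query using a CTE and a date filter to bound scan cost. The table argument and the column names (`user_id`, `event_date`) are placeholders, not real schema:

```python
def scoped_query(table: str, days: int = 7) -> str:
    """Build a standard-SQL query that follows the conventions above:
    fully-qualified semantic-layer table, a CTE instead of an inline
    subquery, and a date filter to limit scan scope. Column names are
    illustrative placeholders."""
    return f"""
WITH recent AS (
  SELECT user_id, event_date
  FROM `dapperlabs-data.production_sem_open.{table}`
  WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY)
)
SELECT event_date, COUNT(DISTINCT user_id) AS dau
FROM recent
GROUP BY event_date
ORDER BY event_date
""".strip()

print(scoped_query("some_events_table"))
```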

The bigquery-concierge Skill

The data plugin includes a bigquery-concierge skill that provides:

  • Query execution via bqrun.sh with instrumentation
  • Comprehensive schema reference (30+ tables with column types and !! warnings for gotchas)
  • Table selection guidance for non-obvious cases (net spend, WAU/MAU, product signups)
  • Product glossary and segmentation definitions

All Heimdall skills route their query execution through bigquery-concierge. They never bypass bqrun.sh.

Data Flow Architecture

```mermaid
graph TD
    PS[Product Systems - NBA/NFL/Disney] -->|Events| RE[Raw Event Data]
    RE -->|ETL - Hourly| SL[Semantic Layer - production_sem_open]
    SL -->|Read| BQC[bigquery-concierge / bqrun.sh]
    BQC -->|Query results| H[Heimdall Skills]
    BQC -->|Query results| HR[Health Reports]
    BQC -->|Query results| D[Daemon Scripts]
    SL -->|Read| AMP[Amplitude]

    H -->|Findings| IG[Insight Graph]
    IG -->|Context| H

    DW[David Wang / Data Eng] -->|Maintains| SL
    DW -->|Expands marts| SL
```

ETL Cadence

| Pipeline | Frequency | Notes |
|---|---|---|
| Core ETL | Hourly | Raw events → semantic layer |
| Segment history | Weekly | User spend segmentation recalculated |
| Canonical numbers | On demand | Updated by Heimdall health scans |

Monitoring and Dashboards

Heimdall Health Scans

Automated product health monitoring covering ~20 metrics across 3 products:

  • Quick mode (daily target): Traffic-light dashboard — GREEN/YELLOW/RED per metric
  • Exhaustive mode (weekly target): Deep dives + insight graph lint
  • Autonomous investigation: RED metrics auto-trigger investigation (8 queries max budget)
  • Persistent YELLOW: Auto-triggers trend analysis to catch slow declines
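The quick-mode traffic-light logic can be illustrated with a toy classifier. The 10% and 25% drop thresholds below are illustrative assumptions, not the scan's actual cutoffs:

```python
def traffic_light(current: float, baseline: float,
                  yellow_pct: float = 0.10, red_pct: float = 0.25) -> str:
    """Classify a metric as GREEN/YELLOW/RED based on its drop vs. baseline.
    Thresholds are illustrative; the real scan's logic is not documented here."""
    if baseline <= 0:
        return "YELLOW"  # no usable baseline: flag for a human
    drop = (baseline - current) / baseline
    if drop >= red_pct:
        return "RED"     # would auto-trigger investigation (8-query budget)
    if drop >= yellow_pct:
        return "YELLOW"  # persistent YELLOW triggers trend analysis
    return "GREEN"
```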

Health Report Cadence

| Report | Schedule | Destination |
|---|---|---|
| Full portfolio health scan | Mon + Fri, 6am PT | #growth Slack channel |
| Health report thread | Mon + Fri | Slack thread (6 messages, progressive disclosure) |
| Health report Google Doc | Mon + Fri | Shared Google Doc |

Health Reports Are Daemon-Generated

The health report runs as a KAAOS daemon cron job. It queries BigQuery via bqrun.sh, compiles the report, posts to Slack, and writes to a Google Doc. If BQ access fails, it falls back to cached data.
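The live-query-with-cached-fallback behavior might look roughly like this sketch; the command shape, cache path, and JSON cache format are assumptions, not the daemon's actual implementation:

```python
import json
import subprocess
from pathlib import Path

def query_with_fallback(cmd: list[str], cache_path: Path):
    """Run a query command; on success, refresh the cache and return live
    rows. On any failure, serve the last cached result if one exists.
    A sketch of the described behavior, not the daemon's real code."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             check=True, timeout=300).stdout
        rows = json.loads(out)
        cache_path.write_text(out)  # refresh the cache on success
        return rows, "live"
    except (subprocess.SubprocessError, OSError, json.JSONDecodeError):
        if cache_path.exists():
            return json.loads(cache_path.read_text()), "cached"
        raise  # no cache to fall back to: surface the failure
```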

Available Data Domains

| Domain | What's Covered | Key Metrics |
|---|---|---|
| Revenue | Pack sales, marketplace fees, deposit match ROI | Revenue by product, ARPU by segment, drop-over-drop performance |
| Engagement | DAU, WAU, MAU, session data, challenge participation | Active users, challenge completion rates, collection depth |
| Retention | Cohort retention, segment transitions, churn signals | D7/D30 retention, XL decay, reactivation rates |
| Marketplace | Listings, sales volume, price trends, liquidity | Floor prices, sell-through rates, trading depth, phantom supply |
| User Segments | S/M/L/XL classification, cross-product overlap | Segment sizes, transition rates, cross-product penetration |
| Campaigns | Pack drops, promotions, deposit matches | Sell-through, reactivation, incremental revenue |

Canonical Numbers

Heimdall maintains a set of verified canonical numbers (100+ entries) with provenance tracking:

  • Each number has a source query, date, and sample size
  • Numbers can be unverified (AI-generated), verified (human-confirmed), or auto-verified (re-derived, matches prior)
  • Stale numbers (>30 days without refresh) are flagged
  • Known-wrong numbers are actively blacklisted and blocked from output
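The lifecycle above can be sketched as a small state function. The 30-day staleness window comes from the text; the simplified states (this sketch folds auto-verified into verified) are an assumption:

```python
from datetime import date, timedelta

def canonical_status(verified: bool, last_refreshed: date,
                     blacklisted: bool, today: date = None) -> str:
    """Classify a canonical number per the lifecycle described above.
    Simplified sketch: auto-verified is treated the same as verified."""
    if blacklisted:
        return "blocked"     # known-wrong: must never appear in output
    today = today or date.today()
    if today - last_refreshed > timedelta(days=30):
        return "stale"       # >30 days without refresh: flagged
    return "verified" if verified else "unverified"
```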

Location: research-reports/data-science-insights/numbers/canonical-numbers.md

Known-Wrong Numbers (Blacklisted)

These numbers appeared in prior analyses and are confirmed incorrect. They must never be cited:

| Wrong Number | Context | Why Wrong |
|---|---|---|
| "160 reviews of Corey, 123 of Oleg" | Code review counts | Actual: 74 and 72 |
| "Yoon Kim 548 PRs" | PR attribution | Misattributed; actual: ~22 PRs in 3 months |

See research-reports/data-science-insights/numbers/blacklist.md for the full blacklist.

Access and Permissions

| Resource | Who Has Access | How to Get Access |
|---|---|---|
| BigQuery (production_sem_open) | Data team, approved analysts, AI agents with credentials | Request through David Wang |
| bqrun.sh | Anyone with Claude Code + data plugin installed | Install the data plugin |
| Insight graph | Read: anyone. Write: Heimdall skills + authorized sessions | Part of research-reports/ repo |
| Health reports (Slack) | Anyone in #growth channel | Join the channel |

VM BQ Access Is a Blocker

As of April 8, 2026, BigQuery credentials are not yet provisioned on the KAAOS/Heimdall daemon VM. This blocks Heimdall daemon deployment and limits health report automation. David Wang has agreed to set this up.