Data Infrastructure¶
The data layer is the foundation everything else builds on. Every analytics skill, every health scan, every intelligence agent reads from it. None write to it. Expanding a mart is the highest-leverage improvement anyone can make, because every downstream system benefits immediately.
BigQuery Semantic Layer¶
All product analytics runs through a curated BigQuery semantic layer: dapperlabs-data.production_sem_open.*.
These are materialized views built from raw event data across all products (NBA Top Shot, NFL All Day, Disney Pinnacle) and organizations. The semantic layer provides a stable, well-defined interface for querying product data.
Key Principles¶
- Read-only for consumers — data engineering maintains the layer; everyone else reads from it
- Materialized views — pre-aggregated for common query patterns, reducing cost and latency
- Schema stability — table schemas change through a controlled process, not ad hoc
- Cross-product — same schema patterns across NBA, NFL, and Disney where applicable
Core Tables¶
This section needs enrichment from the engineering team
The table list below is derived from query patterns observed in analytics investigations. Data engineering should provide the canonical schema documentation.
| Table Pattern | Purpose | Notes |
|---|---|---|
| `user_spend_segmentation_history` | User spending tiers (S/M/L/XL) over time | `event_timestamp` is DATE type; updates weekly |
| `core_int_open_users` | User profile and account data | Some fields access-restricted |
| Transaction tables | Purchase, sale, and trade events | Per-product (NBA, NFL, Disney) |
| Challenge tables | Challenge creation, participation, completion | Key engagement metric |
| Pack tables | Pack drops, purchases, opening events | Revenue tracking |
| Marketplace tables | Listings, sales, offers, price data | P2P trading activity |
| `clientmetric_*` tables | Pre-aggregated client metrics | Preferred for common questions (lower cost) |
| `mart_*` tables | Analytical marts | Pre-aggregated for specific use cases |
Known Data Issues¶
| Issue | Impact | Status |
|---|---|---|
| Buyback packs not properly tracked | Look like regular packs in backend; skews revenue queries | Known, fix pending |
| `user_spend_segmentation_history` updates weekly | Cannot get daily segment transitions | By design |
| Some `core_int_open_users` fields access-denied | Limits certain user-level queries | Permission configuration needed |
How to Query: bqrun.sh¶
The standard way to execute BigQuery queries at Dapper is through the bqrun.sh wrapper script, maintained in the data plugin.
Location: backups/plugins/marketplaces/dapper-claude-code-plugins/plugins/data/skills/bigquery-concierge/scripts/bqrun.sh
What bqrun.sh Does¶
- Wraps the `bq query` CLI command
- Adds QID/QI instrumentation labels for cost tracking and audit
- Enforces `--use_legacy_sql=false` (standard SQL mode)
- Routes to project `dapperlabs-data`
Direct BQ CLI (when bqrun.sh is not available)¶
```bash
bq query \
  --use_legacy_sql=false \
  --format=json \
  --project_id=dapperlabs-data \
  'SELECT ... FROM `dapperlabs-data.production_sem_open.table_name` WHERE ...'
```
Query Conventions¶
- Always use standard SQL (not legacy SQL)
- Always query from `production_sem_open.*` unless you have a specific reason to hit raw tables
- Use CTEs for subqueries (BQ does not support subqueries in JOINs the same way as Postgres)
- Be aware of scan costs — prefer `mart_*` and `clientmetric_*` tables for common aggregations
- Include date filters to limit scan scope
The Bigquery-Concierge Skill¶
The data plugin includes a bigquery-concierge skill that provides:
- Query execution via `bqrun.sh` with instrumentation
- Comprehensive schema reference (30+ tables with column types and `!!` warnings for gotchas)
- Table selection guidance for non-obvious cases (net spend, WAU/MAU, product signups)
- Product glossary and segmentation definitions
All Heimdall skills route their query execution through bigquery-concierge. They never bypass bqrun.sh.
Data Flow Architecture¶
```mermaid
graph TD
    PS[Product Systems - NBA/NFL/Disney] -->|Events| RE[Raw Event Data]
    RE -->|ETL - Hourly| SL[Semantic Layer - production_sem_open]
    SL -->|Read| BQC[bigquery-concierge / bqrun.sh]
    BQC -->|Query results| H[Heimdall Skills]
    BQC -->|Query results| HR[Health Reports]
    BQC -->|Query results| D[Daemon Scripts]
    SL -->|Read| AMP[Amplitude]
    H -->|Findings| IG[Insight Graph]
    IG -->|Context| H
    DW[David Wang / Data Eng] -->|Maintains| SL
    DW -->|Expands marts| SL
```
ETL Cadence¶
| Pipeline | Frequency | Notes |
|---|---|---|
| Core ETL | Hourly | Raw events → semantic layer |
| Segment history | Weekly | User spend segmentation recalculated |
| Canonical numbers | On demand | Updated by Heimdall health scans |
Monitoring and Dashboards¶
Heimdall Health Scans¶
Automated product health monitoring covering ~20 metrics across 3 products:
- Quick mode (daily target): Traffic-light dashboard — GREEN/YELLOW/RED per metric
- Exhaustive mode (weekly target): Deep dives + insight graph lint
- Autonomous investigation: RED metrics auto-trigger investigation (8 queries max budget)
- Persistent YELLOW: Auto-triggers trend analysis to catch slow declines
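The traffic-light classification can be pictured as a single CASE expression over each metric's current value versus its baseline. This is an illustrative sketch only: the `metric_snapshots` table and the -5%/-20% thresholds are assumptions, not documented values.

```sql
-- Illustrative only: metric_snapshots and the 5%/20% thresholds are assumed.
SELECT
  metric_name,
  current_value,
  baseline_value,
  CASE
    WHEN current_value < baseline_value * 0.80 THEN 'RED'     -- more than 20% below baseline
    WHEN current_value < baseline_value * 0.95 THEN 'YELLOW'  -- 5-20% below baseline
    ELSE 'GREEN'
  END AS status
FROM metric_snapshots;
```

A RED row would then trigger the autonomous investigation path described above, within its 8-query budget.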
Health Report Cadence¶
| Report | Schedule | Destination |
|---|---|---|
| Full portfolio health scan | Mon + Fri, 6am PT | #growth Slack channel |
| Health report thread | Mon + Fri | Slack thread (6 messages, progressive disclosure) |
| Health report Google Doc | Mon + Fri | Shared Google Doc |
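If the cadence above is driven by cron on the daemon VM, the schedule might look roughly like the following crontab fragment. The script path is a placeholder, and this assumes the host clock is set to PT:

```
# Hypothetical crontab sketch -- path is a placeholder, host timezone assumed PT
0 6 * * 1,5  /opt/kaaos/bin/health_scan.sh --mode full   # full portfolio scan, Mon + Fri 6am PT
```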
Health Reports Are Daemon-Generated
The health report runs as a KAAOS daemon cron job. It queries BigQuery via bqrun.sh, compiles the report, posts to Slack, and writes to a Google Doc. If BQ access fails, it falls back to cached data.
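The query-then-fallback behavior can be sketched in shell as below. Everything here is hypothetical (the cache path, variable names, and the stubbed `run_query`, which stands in for a real `bqrun.sh` call and is simulated as failing so the fallback path is visible):

```shell
#!/usr/bin/env bash
# Sketch of the daemon's BQ-with-fallback pattern. All names are hypothetical;
# the real daemon calls bqrun.sh instead of the stub below.
set -u
CACHE="$(mktemp)"
echo '{"dau": 12345, "source": "cache"}' > "$CACHE"   # previously cached result

run_query() {
  # Stand-in for: bqrun.sh "$1" -- simulated failure so the fallback runs.
  return 1
}

if ! result="$(run_query 'SELECT ...')"; then
  result="$(cat "$CACHE")"   # BQ access failed: fall back to cached data
fi
echo "$result"
rm -f "$CACHE"
```

The key property is that the report always has something to post: a failed query degrades to stale-but-labeled data rather than an empty report.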
Available Data Domains¶
| Domain | What's Covered | Key Metrics |
|---|---|---|
| Revenue | Pack sales, marketplace fees, deposit match ROI | Revenue by product, ARPU by segment, drop-over-drop performance |
| Engagement | DAU, WAU, MAU, session data, challenge participation | Active users, challenge completion rates, collection depth |
| Retention | Cohort retention, segment transitions, churn signals | D7/D30 retention, XL decay, reactivation rates |
| Marketplace | Listings, sales volume, price trends, liquidity | Floor prices, sell-through rates, trading depth, phantom supply |
| User Segments | S/M/L/XL classification, cross-product overlap | Segment sizes, transition rates, cross-product penetration |
| Campaigns | Pack drops, promotions, deposit matches | Sell-through, reactivation, incremental revenue |
Canonical Numbers¶
Heimdall maintains a set of verified canonical numbers (100+ entries) with provenance tracking:
- Each number has a source query, date, and sample size
- Numbers can be `unverified` (AI-generated), `verified` (human-confirmed), or `auto-verified` (re-derived, matches prior)
- Stale numbers (>30 days without refresh) are flagged
- Known-wrong numbers are actively blacklisted and blocked from output
Location: research-reports/data-science-insights/numbers/canonical-numbers.md
Known-Wrong Numbers (Blacklisted)¶
These numbers appeared in prior analyses and are confirmed incorrect. They must never be cited:
| Wrong Number | Context | Why Wrong |
|---|---|---|
| "160 reviews of Corey, 123 of Oleg" | Code review counts | Actual: 74 and 72 |
| "Yoon Kim 548 PRs" | PR attribution | Misattributed. Actual: ~22 PRs in 3 months |
See research-reports/data-science-insights/numbers/blacklist.md for the full blacklist.
Access and Permissions¶
| Resource | Who Has Access | How to Get Access |
|---|---|---|
| BigQuery (`production_sem_open`) | Data team, approved analysts, AI agents with credentials | Request through David Wang |
| `bqrun.sh` | Anyone with Claude Code + data plugin installed | Install the data plugin |
| Insight graph | Read: anyone. Write: Heimdall skills + authorized sessions | Part of research-reports/ repo |
| Health reports (Slack) | Anyone in #growth channel | Join the channel |
VM BQ Access Is a Blocker
As of April 8, 2026, BigQuery credentials are not yet provisioned on the KAAOS/Heimdall daemon VM. This blocks Heimdall daemon deployment and limits health report automation. David Wang has agreed to set this up.