
Data Infrastructure

REFERENCE | DERIVED | Updated 2026-04-08 | Owner: Data Engineering (David Wang)

The data layer is the foundation everything else builds on. Every analytics skill, every health scan, every intelligence agent reads from it. None write to it. Expanding a mart is the highest-leverage improvement anyone can make, because every downstream system benefits immediately.

BigQuery Semantic Layer

All product analytics runs through a curated BigQuery semantic layer: dapperlabs-data.production_sem_open.*.

These are materialized views built from raw event data across all products (NBA Top Shot, NFL All Day, Disney Pinnacle) and organizations. The semantic layer provides a stable, well-defined interface for querying product data.

Key Principles

  • Read-only for consumers — data engineering maintains the layer; everyone else reads from it
  • Materialized views — pre-aggregated for common query patterns, reducing cost and latency
  • Schema stability — table schemas change through a controlled process, not ad hoc
  • Cross-product — same schema patterns across NBA, NFL, and Disney where applicable

Core Tables

This section needs enrichment from the engineering team.

The table list below is derived from query patterns observed in analytics investigations. Data engineering should provide the canonical schema documentation.

| Table Pattern | Purpose | Notes |
|---|---|---|
| user_spend_segmentation_history | User spending tiers (S/M/L/XL) over time | event_timestamp is DATE type, updates weekly |
| core_int_open_users | User profile and account data | Some fields access-restricted |
| Transaction tables | Purchase, sale, and trade events | Per-product (NBA, NFL, Disney) |
| Challenge tables | Challenge creation, participation, completion | Key engagement metric |
| Pack tables | Pack drops, purchases, opening events | Revenue tracking |
| Marketplace tables | Listings, sales, offers, price data | P2P trading activity |
| clientmetric_* tables | Pre-aggregated client metrics | Preferred for common questions (lower cost) |
| mart_* tables | Analytical marts | Pre-aggregated for specific use cases |

Known Data Issues

| Issue | Impact | Status |
|---|---|---|
| Buyback packs not properly tracked | Look like regular packs in the backend; skews revenue queries | Known, fix pending |
| user_spend_segmentation_history updates weekly | Cannot get daily segment transitions | By design |
| Some core_int_open_users fields access-denied | Limits certain user-level queries | Permission configuration needed |

How to Query: bqrun.sh

The standard way to execute BigQuery queries at Dapper is through the bqrun.sh wrapper script, maintained in the data plugin.

Location: backups/plugins/marketplaces/dapper-claude-code-plugins/plugins/data/skills/bigquery-concierge/scripts/bqrun.sh

What bqrun.sh Does

  • Wraps the bq query CLI command
  • Adds QID/QI instrumentation labels for cost tracking and audit
  • Enforces --use_legacy_sql=false (standard SQL mode)
  • Routes to project dapperlabs-data
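The behavior described above amounts to assembling a `bq query` invocation with the right flags and labels. A minimal Python sketch, assuming a `qid` label key for the QID/QI instrumentation (the real script's label scheme and exact flags may differ):

```python
import shlex

def build_bq_command(sql: str, qid: str, project: str = "dapperlabs-data") -> list[str]:
    """Assemble a `bq query` invocation roughly as bqrun.sh is described to:
    standard SQL enforced, project pinned, and a cost-tracking label attached.
    The `qid` label key is illustrative, not the script's actual scheme."""
    return [
        "bq", "query",
        "--use_legacy_sql=false",   # standard SQL mode, always
        f"--project_id={project}",  # route to the shared data project
        f"--label=qid:{qid}",       # instrumentation label for cost tracking / audit
        "--format=json",
        sql,
    ]

cmd = build_bq_command("SELECT 1", qid="demo-001")
print(shlex.join(cmd))
```

Keeping the command as a list (rather than an interpolated string) avoids shell-quoting bugs when the SQL contains quotes.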

Direct BQ CLI (when bqrun.sh is not available)

```sh
bq query \
  --use_legacy_sql=false \
  --format=json \
  --project_id=dapperlabs-data \
  'SELECT ... FROM `dapperlabs-data.production_sem_open.table_name` WHERE ...'
```

Query Conventions

  • Always use standard SQL (not legacy SQL)
  • Always query from production_sem_open.* unless you have a specific reason to hit raw tables
  • Use CTEs for subqueries (BQ does not support subqueries in JOINs the same way as Postgres)
  • Be aware of scan costs — prefer mart_* and clientmetric_* tables for common aggregations
  • Include date filters to limit scan scope
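Putting those conventions together, here is a hedged sketch that assembles a semantic-layer query using a CTE and a date filter to bound scan cost. The table argument and the column names (`user_id`, `event_date`) are placeholders, not real schema:

```python
def scoped_query(table: str, days: int = 7) -> str:
    """Build a standard-SQL query that follows the conventions above:
    fully-qualified semantic-layer table, a CTE instead of an inline
    subquery, and a date filter to limit scan scope. Column names are
    illustrative placeholders."""
    return f"""
WITH recent AS (
  SELECT user_id, event_date
  FROM `dapperlabs-data.production_sem_open.{table}`
  WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY)
)
SELECT event_date, COUNT(DISTINCT user_id) AS dau
FROM recent
GROUP BY event_date
ORDER BY event_date
""".strip()

print(scoped_query("some_events_table"))
```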

The bigquery-concierge Skill

The data plugin includes a bigquery-concierge skill that provides:

  • Query execution via bqrun.sh with instrumentation
  • Comprehensive schema reference (30+ tables with column types and !! warnings for gotchas)
  • Table selection guidance for non-obvious cases (net spend, WAU/MAU, product signups)
  • Product glossary and segmentation definitions

All Heimdall skills route their query execution through bigquery-concierge. They never bypass bqrun.sh.

Data Flow Architecture

```mermaid
graph TD
    PS[Product Systems - NBA/NFL/Disney] -->|Events| RE[Raw Event Data]
    RE -->|ETL - Hourly| SL[Semantic Layer - production_sem_open]
    SL -->|Read| BQC[bigquery-concierge / bqrun.sh]
    BQC -->|Query results| H[Heimdall Skills]
    BQC -->|Query results| HR[Health Reports]
    BQC -->|Query results| D[Daemon Scripts]
    SL -->|Read| AMP[Amplitude]

    H -->|Findings| IG[Insight Graph]
    IG -->|Context| H

    DW[David Wang / Data Eng] -->|Maintains| SL
    DW -->|Expands marts| SL
```

ETL Cadence

| Pipeline | Frequency | Notes |
|---|---|---|
| Core ETL | Hourly | Raw events → semantic layer |
| Segment history | Weekly | User spend segmentation recalculated |
| Canonical numbers | On demand | Updated by Heimdall health scans |

Monitoring and Dashboards

Heimdall Health Scans

Automated product health monitoring covering ~20 metrics across 3 products:

  • Quick mode (daily target): Traffic-light dashboard — GREEN/YELLOW/RED per metric
  • Exhaustive mode (weekly target): Deep dives + insight graph lint
  • Autonomous investigation: RED metrics auto-trigger investigation (8 queries max budget)
  • Persistent YELLOW: Auto-triggers trend analysis to catch slow declines
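The quick-mode traffic-light logic can be illustrated with a toy classifier. The 10% and 25% drop thresholds below are illustrative assumptions, not the scan's actual cutoffs:

```python
def traffic_light(current: float, baseline: float,
                  yellow_pct: float = 0.10, red_pct: float = 0.25) -> str:
    """Classify a metric as GREEN/YELLOW/RED based on its drop vs. baseline.
    Thresholds are illustrative; the real scan's logic is not documented here."""
    if baseline <= 0:
        return "YELLOW"  # no usable baseline: flag for a human
    drop = (baseline - current) / baseline
    if drop >= red_pct:
        return "RED"     # would auto-trigger investigation (8-query budget)
    if drop >= yellow_pct:
        return "YELLOW"  # persistent YELLOW triggers trend analysis
    return "GREEN"
```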

Health Report Cadence

| Report | Schedule | Destination |
|---|---|---|
| Full portfolio health scan | Mon + Fri, 6am PT | #growth Slack channel |
| Health report thread | Mon + Fri | Slack thread (6 messages, progressive disclosure) |
| Health report Google Doc | Mon + Fri | Shared Google Doc |

Health Reports Are Daemon-Generated

The health report runs as a KAAOS daemon cron job. It queries BigQuery via bqrun.sh, compiles the report, posts to Slack, and writes to a Google Doc. If BQ access fails, it falls back to cached data.
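The live-query-with-cached-fallback behavior might look roughly like this sketch; the command shape, cache path, and JSON cache format are assumptions, not the daemon's actual implementation:

```python
import json
import subprocess
from pathlib import Path

def query_with_fallback(cmd: list[str], cache_path: Path):
    """Run a query command; on success, refresh the cache and return live
    rows. On any failure, serve the last cached result if one exists.
    A sketch of the described behavior, not the daemon's real code."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             check=True, timeout=300).stdout
        rows = json.loads(out)
        cache_path.write_text(out)  # refresh the cache on success
        return rows, "live"
    except (subprocess.SubprocessError, OSError, json.JSONDecodeError):
        if cache_path.exists():
            return json.loads(cache_path.read_text()), "cached"
        raise  # no cache to fall back to: surface the failure
```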

Available Data Domains

| Domain | What's Covered | Key Metrics |
|---|---|---|
| Revenue | Pack sales, marketplace fees, deposit match ROI | Revenue by product, ARPU by segment, drop-over-drop performance |
| Engagement | DAU, WAU, MAU, session data, challenge participation | Active users, challenge completion rates, collection depth |
| Retention | Cohort retention, segment transitions, churn signals | D7/D30 retention, XL decay, reactivation rates |
| Marketplace | Listings, sales volume, price trends, liquidity | Floor prices, sell-through rates, trading depth, phantom supply |
| User Segments | S/M/L/XL classification, cross-product overlap | Segment sizes, transition rates, cross-product penetration |
| Campaigns | Pack drops, promotions, deposit matches | Sell-through, reactivation, incremental revenue |

Canonical Numbers

Heimdall maintains a set of verified canonical numbers (100+ entries) with provenance tracking:

  • Each number has a source query, date, and sample size
  • Numbers can be unverified (AI-generated), verified (human-confirmed), or auto-verified (re-derived, matches prior)
  • Stale numbers (>30 days without refresh) are flagged
  • Known-wrong numbers are actively blacklisted and blocked from output
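The lifecycle above can be sketched as a small state function. The 30-day staleness window comes from the text; the simplified states (this sketch folds auto-verified into verified) are an assumption:

```python
from datetime import date, timedelta

def canonical_status(verified: bool, last_refreshed: date,
                     blacklisted: bool, today: date = None) -> str:
    """Classify a canonical number per the lifecycle described above.
    Simplified sketch: auto-verified is treated the same as verified."""
    if blacklisted:
        return "blocked"     # known-wrong: must never appear in output
    today = today or date.today()
    if today - last_refreshed > timedelta(days=30):
        return "stale"       # >30 days without refresh: flagged
    return "verified" if verified else "unverified"
```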

Location: research-reports/data-science-insights/numbers/canonical-numbers.md

Known-Wrong Numbers (Blacklisted)

These numbers appeared in prior analyses and are confirmed incorrect. They must never be cited:

| Wrong Number | Context | Why Wrong |
|---|---|---|
| "160 reviews of Corey, 123 of Oleg" | Code review counts | Actual: 74 and 72 |
| "Yoon Kim 548 PRs" | PR attribution | Misattributed; actual: ~22 PRs in 3 months |

See research-reports/data-science-insights/numbers/blacklist.md for the full blacklist.

Access and Permissions

| Resource | Who Has Access | How to Get Access |
|---|---|---|
| BigQuery (production_sem_open) | Data team, approved analysts, AI agents with credentials | Request through David Wang |
| bqrun.sh | Anyone with Claude Code + data plugin installed | Install the data plugin |
| Insight graph | Read: anyone. Write: Heimdall skills + authorized sessions | Part of research-reports/ repo |
| Health reports (Slack) | Anyone in #growth channel | Join the channel |

VM BQ Access Is a Blocker

As of April 8, 2026, BigQuery credentials are not yet provisioned on the KAAOS/Heimdall daemon VM. This blocks Heimdall daemon deployment and limits health report automation. David Wang has agreed to set this up.