Data Pipeline - DBT & Semantic Layer

This guide covers running DBT transformations (Bronze → Silver → Gold) and managing the Cube.js semantic layer cache.

Quick Commands

For the most common dbt commands (daily, full-refresh, cache clear, cron), see the Platform Runbook → Data Lakehouse Pipeline. This page covers advanced dbt subcommands, Elementary reports, and validation.


Quick Start

The DBT pipeline runs as a Docker container via the lakehouse profile:

# Define the compose command
DC="docker compose --env-file .env.production --profile lakehouse"

# Start the lakehouse database
$DC up -d lakehouse-db

# Install DBT dependencies
$DC run --rm lakehouse-dbt deps

# Verify connection
$DC run --rm lakehouse-dbt debug

# Run transformations
$DC run --rm lakehouse-dbt run

Part 1: DBT Transformation

Prerequisites

  • Python 3.10+
  • PostgreSQL/Neon database credentials (same as ingester)

Configure Environment

The monorepo uses a unified .env.production at the root:

# Database Connection
DB_HOST=your-neon-host.aws.neon.tech
DB_PORT=5432
DB_USER=neondb_owner
DB_PASSWORD=your-password
DB_NAME=neondb
DB_SSLMODE=require

# DBT Settings
DBT_PROFILES_DIR=.
DBT_TARGET=dev
TABLE_PREFIX=tbl_mst_
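Before starting containers, it can help to confirm the variables above actually made it into the shell. A minimal POSIX-sh check (variable names taken from the block above; `require_vars` is a helper defined here, not part of the toolchain):

```shell
# require_vars: print "OK" when every named variable is set and non-empty,
# otherwise list what is missing and return non-zero.
require_vars() {
  missing=""
  for var in "$@"; do
    eval "val=\${$var:-}"
    [ -z "$val" ] && missing="$missing $var"
  done
  if [ -n "$missing" ]; then
    echo "Missing:$missing"
    return 1
  fi
  echo "OK"
}

# Example with placeholder values; in practice these come from .env.production.
export DB_HOST=example.neon.tech DB_PORT=5432 DB_USER=owner \
       DB_PASSWORD=secret DB_NAME=neondb DB_SSLMODE=require
require_vars DB_HOST DB_PORT DB_USER DB_PASSWORD DB_NAME DB_SSLMODE
```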

Run Transformations

DC="docker compose --env-file .env.production --profile lakehouse"

# Run all models
$DC run --rm lakehouse-dbt run

# Run models + tests
$DC run --rm lakehouse-dbt build

# Full daily pipeline with Elementary reports
$DC run --rm lakehouse-dbt daily

Available Commands

| Command | Description |
| --- | --- |
| `$DC run --rm lakehouse-dbt debug` | Verify database connection |
| `$DC run --rm lakehouse-dbt run` | Run all DBT models |
| `$DC run --rm lakehouse-dbt test` | Run all DBT tests |
| `$DC run --rm lakehouse-dbt build` | Run models + tests |
| `$DC run --rm lakehouse-dbt seed` | Load seed/reference data |
| `$DC run --rm lakehouse-dbt run --select silver` | Run Silver layer only |
| `$DC run --rm lakehouse-dbt run --select gold` | Run Gold layer only |
| `$DC run --rm lakehouse-dbt docs-generate` | Generate docs |
| `$DC run --rm lakehouse-dbt daily` | Full pipeline with reports |

Note: DC is defined as docker compose --env-file .env.production --profile lakehouse
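The table above shows that dbt's node-selection syntax is passed through the wrapper, so graph operators also work. A sketch (guarded so it degrades gracefully where Docker is unavailable; `dim_study` refers to the Gold model queried in the Validation section):

```shell
# DC as defined above.
DC="docker compose --env-file .env.production --profile lakehouse"

if command -v docker >/dev/null 2>&1; then
  # "+model" selects the model plus everything upstream of it;
  # "model+" would select the model plus everything downstream.
  $DC run --rm lakehouse-dbt run --select +dim_study ||
    echo "dbt run failed; is the lakehouse profile up?" >&2
else
  echo "docker not found; skipping"
fi
```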


Part 2: Semantic Layer Cache

After DBT refresh, the Cube.js semantic layer cache must be cleared to reflect new data.

Clear Cache (Local)

docker compose --env-file .env.production --profile analytics restart cube

Clear Cache (Production)

DC="docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.production"
$DC --profile analytics restart cube

This restarts the Cube.js server, clearing all cached queries.
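To confirm the restart completed, you can probe Cube's health endpoint (this assumes the default port 4000 and Cube's documented `/readyz` route):

```shell
CUBE_READY_URL="http://localhost:4000/readyz"

if ! command -v curl >/dev/null 2>&1; then
  echo "curl not found; skipping"
elif curl -fs --max-time 5 "$CUBE_READY_URL" >/dev/null 2>&1; then
  echo "cube is ready"
else
  echo "cube not reachable yet (still restarting, or the analytics profile is down)"
fi
```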


Validation

Option 1: Query Database Directly

psql "postgresql://user:pass@host/db?sslmode=require" \
-c "SELECT COUNT(*) FROM gold.dim_study;"
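The same check can be scripted against the connection variables from `.env.production` (a sketch; add more `schema.table` names to the loop as needed):

```shell
# Build the connection string from the env vars used elsewhere in this guide.
CONN="postgresql://${DB_USER:-}:${DB_PASSWORD:-}@${DB_HOST:-}:${DB_PORT:-5432}/${DB_NAME:-}?sslmode=${DB_SSLMODE:-require}"

if command -v psql >/dev/null 2>&1; then
  for tbl in gold.dim_study; do
    printf '%s: ' "$tbl"
    psql "$CONN" -tAc "SELECT COUNT(*) FROM $tbl;" ||
      echo "query failed (check credentials / network)" >&2
  done
else
  echo "psql not found; skipping"
fi
```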

Option 2: Use AI Chat

Query via the AI chat interface:

"How many studies are in the system?"

Expected response should reflect the latest data count.

Option 3: Cube.js Playground

Open http://localhost:4000 and run:

{"measures": ["Studies.count"]}
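The Playground query maps directly onto Cube's REST API. The token below is a placeholder; in dev mode Cube typically accepts any value unless authentication is configured:

```shell
QUERY='{"measures":["Studies.count"]}'

if command -v curl >/dev/null 2>&1; then
  # GET /cubejs-api/v1/load with the JSON query URL-encoded.
  curl -fsG "http://localhost:4000/cubejs-api/v1/load" \
    -H "Authorization: ${CUBEJS_API_TOKEN:-dev-token}" \
    --data-urlencode "query=$QUERY" ||
    echo "cube API not reachable; is the analytics profile running?" >&2
else
  echo "curl not found; skipping"
fi
```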

Full Pipeline Workflow

Run the complete data pipeline:

DC="docker compose --env-file .env.production --profile lakehouse"

# 1. Ingest data from Frappe API
$DC run --rm lakehouse-ingester

# 2. Transform with DBT
$DC run --rm lakehouse-dbt run

# 3. Clear semantic layer cache (if Cube.dev is running)
docker compose --env-file .env.production --profile analytics restart cube
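The three steps above can be wrapped into one fail-fast script. This is a sketch using the service names from this guide, guarded so it skips cleanly where Docker or the compose project is unavailable:

```shell
DC="docker compose --env-file .env.production --profile lakehouse"

if ! command -v docker >/dev/null 2>&1; then
  echo "docker not found; skipping"
elif ! docker compose --env-file .env.production config >/dev/null 2>&1; then
  echo "compose project not available here; skipping"
else
  { $DC run --rm lakehouse-ingester &&   # 1. ingest from Frappe API
    $DC run --rm lakehouse-dbt run &&    # 2. transform with DBT
    docker compose --env-file .env.production --profile analytics restart cube && \
    echo "pipeline complete"; } ||       # 3. clear semantic layer cache
    echo "pipeline failed; rerun the failing step by hand" >&2
fi
```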

Elementary Data Observability

The pipeline includes Elementary for data quality monitoring:

DC="docker compose --env-file .env.production --profile lakehouse"

# Build Elementary tables
$DC run --rm lakehouse-dbt run --select elementary

# Generate HTML report
$DC run --rm lakehouse-dbt elementary-report

Troubleshooting

| Issue | Solution |
| --- | --- |
| Stale data in dashboards | Run `docker compose --env-file .env.production --profile analytics restart cube` |
| DBT connection failed | Verify `.env` credentials with `$DC run --rm lakehouse-dbt debug` |
| Models not found | Run `$DC run --rm lakehouse-dbt deps` to install packages |
| Elementary report fails | Ensure the Elementary package is installed |
| OpenLineage events not in Marquez | Verify `OPENLINEAGE_URL` is set and `marquez-api` is running |
| `dbt-ol: command not found` | Ensure `openlineage-dbt` is in `requirements.txt` and rebuild the image |

OpenLineage (Data Lineage)

The dbt pipeline uses OpenLineage via the dbt-ol CLI wrapper to emit lineage events to Marquez. This provides visual lineage graphs showing how data flows through the medallion architecture.

How It Works

  • dbt-ol is a drop-in replacement for dbt installed via the openlineage-dbt Python package
  • After each dbt-ol run or dbt-ol build, lineage events are emitted to the Marquez API
  • Marquez stores the lineage metadata and provides a Web UI for exploration
  • If OPENLINEAGE_URL is not set, dbt-ol behaves identically to standard dbt
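For example, a one-off run that emits lineage to an explicit Marquez endpoint (the values mirror the defaults shown in this section; guarded so it skips where Docker is unavailable):

```shell
DC="docker compose --env-file .env.production --profile lakehouse"

if command -v docker >/dev/null 2>&1; then
  # Override the lineage target and namespace for this run only.
  $DC run --rm \
    -e OPENLINEAGE_URL=http://marquez-api:5000 \
    -e OPENLINEAGE_NAMESPACE=ctms-lakehouse \
    lakehouse-dbt run ||
    echo "run failed; check that marquez-api is up" >&2
else
  echo "docker not found; skipping"
fi
```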

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `OPENLINEAGE_URL` | `http://marquez-api:5000` | Marquez API endpoint |
| `OPENLINEAGE_NAMESPACE` | `ctms-lakehouse` | Namespace for lineage events |

Viewing Lineage

After a dbt run, open the Marquez Web UI:

# Marquez Web UI
open http://localhost:3300

Navigate to the ctms-lakehouse namespace to explore:

  • Jobs — Each dbt model appears as a job
  • Datasets — Each table/view appears as a dataset
  • Lineage graph — Visual DAG showing source → staging → dimensions/facts → semantic

Feature Toggles

Both OpenLineage and Elementary can be individually enabled or disabled:

| Variable | Default | Description |
| --- | --- | --- |
| `ENABLE_OPENLINEAGE` | `true` (when `OPENLINEAGE_URL` is set) | Use the `dbt-ol` wrapper for lineage emission |
| `ENABLE_ELEMENTARY` | `true` | Run Elementary report generation after dbt builds |

# Disable OpenLineage for a single run
docker compose --env-file .env.production --profile lakehouse run --rm -e ENABLE_OPENLINEAGE=false lakehouse-dbt daily

# Disable Elementary for a single run
docker compose --env-file .env.production --profile lakehouse run --rm -e ENABLE_ELEMENTARY=false lakehouse-dbt daily

These toggles can also be passed through the Zynexa REST API when triggering pipelines. See Environment Variables for details.

Validate Lineage

# Check that lineage events are being received
curl -s http://localhost:5000/api/v1/namespaces/ctms-lakehouse/jobs | jq '.jobs | length'

# Should return the number of dbt models that have been run

See Also