Data Pipeline - DBT & Semantic Layer
This guide covers running DBT transformations (Bronze → Silver → Gold) and managing the Cube.js semantic layer cache.
For the most common dbt commands (daily, full-refresh, cache clear, cron), see the Platform Runbook → Data Lakehouse Pipeline. This page covers advanced dbt subcommands, Elementary reports, and validation.
Quick Start
The DBT pipeline runs as a Docker container via the lakehouse profile:
# Define the compose command
DC="docker compose --env-file .env.production --profile lakehouse"
# Start the lakehouse database
$DC up -d lakehouse-db
# Install DBT dependencies
$DC run --rm lakehouse-dbt deps
# Verify connection
$DC run --rm lakehouse-dbt debug
# Run transformations
$DC run --rm lakehouse-dbt run
Part 1: DBT Transformation
Prerequisites
- Python 3.10+
- PostgreSQL/Neon database credentials (same as ingester)
Configure Environment
The monorepo uses a unified .env.production at the root:
# Database Connection
DB_HOST=your-neon-host.aws.neon.tech
DB_PORT=5432
DB_USER=neondb_owner
DB_PASSWORD=your-password
DB_NAME=neondb
DB_SSLMODE=require
# DBT Settings
DBT_PROFILES_DIR=.
DBT_TARGET=dev
TABLE_PREFIX=tbl_mst_
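Before the first run, it can help to confirm the connection variables are actually set in your shell or env file. The sketch below is a hypothetical helper (not part of the project) that checks the `.env.production` keys listed above:

```shell
# Sketch: fail fast if required variables are missing before invoking dbt.
# check_env is a hypothetical helper; variable names match the keys above.
check_env() {
  missing=""
  for v in "$@"; do
    eval "val=\${$v:-}"
    [ -n "$val" ] || missing="$missing $v"
  done
  [ -z "$missing" ] || { echo "missing env vars:$missing" >&2; return 1; }
}

check_env DB_HOST DB_PORT DB_USER DB_PASSWORD DB_NAME DB_SSLMODE \
  || echo "set the variables above before running dbt"
```

`lakehouse-dbt debug` performs the authoritative check; this only catches an empty or unsourced env file earlier.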
Run Transformations
DC="docker compose --env-file .env.production --profile lakehouse"
# Run all models
$DC run --rm lakehouse-dbt run
# Run models + tests
$DC run --rm lakehouse-dbt build
# Full daily pipeline with Elementary reports
$DC run --rm lakehouse-dbt daily
Available Commands
| Command | Description |
|---|---|
| `$DC run --rm lakehouse-dbt debug` | Verify database connection |
| `$DC run --rm lakehouse-dbt run` | Run all DBT models |
| `$DC run --rm lakehouse-dbt test` | Run all DBT tests |
| `$DC run --rm lakehouse-dbt build` | Run models + tests |
| `$DC run --rm lakehouse-dbt seed` | Load seed/reference data |
| `$DC run --rm lakehouse-dbt run --select silver` | Run Silver layer only |
| `$DC run --rm lakehouse-dbt run --select gold` | Run Gold layer only |
| `$DC run --rm lakehouse-dbt docs-generate` | Generate docs |
| `$DC run --rm lakehouse-dbt daily` | Full pipeline with reports |
Note: `DC` is defined as `docker compose --env-file .env.production --profile lakehouse`.
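Because every command goes through `$DC`, a tiny wrapper can print the fully expanded invocation for review or copy-paste. This is a hypothetical convenience, not part of the project:

```shell
# Hypothetical dry-run helper: prints the expanded command instead of running it.
DC="docker compose --env-file .env.production --profile lakehouse"

dbt_cmd() {
  echo "$DC run --rm lakehouse-dbt $*"
}

dbt_cmd run --select gold
# prints: docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt run --select gold
```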
Part 2: Semantic Layer Cache
After DBT refresh, the Cube.js semantic layer cache must be cleared to reflect new data.
Clear Cache (Local)
docker compose --env-file .env.production --profile analytics restart cube
Clear Cache (Production)
DC="docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.production"
$DC --profile analytics restart cube
This restarts the Cube.js server, clearing all cached queries.
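The restart is asynchronous, so the container may take a few seconds before it accepts queries again. A small poll against Cube's readiness endpoint avoids querying too early; treat `/readyz` and port 4000 as assumptions about this deployment:

```shell
# Sketch: poll Cube's readiness endpoint after a restart before issuing queries.
# /readyz and port 4000 are assumptions about this deployment.
wait_for_cube() {
  url="${1:-http://localhost:4000/readyz}"
  tries="${2:-10}"
  delay="${3:-3}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    curl -fsS "$url" >/dev/null 2>&1 && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  echo "Cube not ready after $tries attempts" >&2
  return 1
}

# usage after a restart:
#   wait_for_cube && echo "cache cleared and Cube is ready"
```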
Validation
Option 1: Query Database Directly
psql "postgresql://user:pass@host/db?sslmode=require" \
-c "SELECT COUNT(*) FROM gold.dim_study;"
Option 2: Use AI Chat
Query via the AI chat interface:
"How many studies are in the system?"
Expected response should reflect the latest data count.
Option 3: Cube.js Playground
Open http://localhost:4000 and run:
{"measures": ["Studies.count"]}
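The same query can be issued against Cube's REST API (`/cubejs-api/v1/load`) instead of the Playground. The sketch below only prints the curl command (remove the leading `echo` to execute it); `CUBE_TOKEN` is a placeholder for a signed JWT, since how you mint one depends on your API secret configuration:

```shell
# Sketch: the Playground query via Cube's REST API. Prints the command (dry run);
# drop `echo` to execute. CUBE_TOKEN is a placeholder bearer token.
cube_query() {
  echo curl -s -G "http://localhost:4000/cubejs-api/v1/load" \
    -H "Authorization: ${CUBE_TOKEN:-<token>}" \
    --data-urlencode "query=$1"
}

cube_query '{"measures":["Studies.count"]}'
```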
Full Pipeline Workflow
Run the complete data pipeline:
DC="docker compose --env-file .env.production --profile lakehouse"
# 1. Ingest data from Frappe API
$DC run --rm lakehouse-ingester
# 2. Transform with DBT
$DC run --rm lakehouse-dbt run
# 3. Clear semantic layer cache (if Cube.js is running)
docker compose --env-file .env.production --profile analytics restart cube
Elementary Data Observability
The pipeline includes Elementary for data quality monitoring:
DC="docker compose --env-file .env.production --profile lakehouse"
# Build Elementary tables
$DC run --rm lakehouse-dbt run --select elementary
# Generate HTML report
$DC run --rm lakehouse-dbt elementary-report
Troubleshooting
| Issue | Solution |
|---|---|
| Stale data in dashboards | Run docker compose --env-file .env.production --profile analytics restart cube |
| DBT connection failed | Verify .env credentials with $DC run --rm lakehouse-dbt debug |
| Models not found | Run $DC run --rm lakehouse-dbt deps to install packages |
| Elementary report fails | Ensure Elementary package is installed |
| OpenLineage events not in Marquez | Verify OPENLINEAGE_URL is set and marquez-api is running |
| `dbt-ol`: command not found | Ensure `openlineage-dbt` is in requirements.txt and rebuild the image |
OpenLineage (Data Lineage)
The dbt pipeline uses OpenLineage via the dbt-ol CLI wrapper to emit lineage events to Marquez. This provides visual lineage graphs showing how data flows through the medallion architecture.
How It Works
- `dbt-ol` is a drop-in replacement for `dbt`, installed via the `openlineage-dbt` Python package
- After each `dbt-ol run` or `dbt-ol build`, lineage events are emitted to the Marquez API
- Marquez stores the lineage metadata and provides a Web UI for exploration
- If `OPENLINEAGE_URL` is not set, `dbt-ol` behaves identically to standard `dbt`
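The wrapper-selection logic above can be sketched as an entrypoint fragment. This is an illustration of the fallback behavior, not the project's actual entrypoint; the defaults and the ENABLE_OPENLINEAGE toggle come from the tables on this page:

```shell
# Illustration of the dbt-ol fallback logic, not the actual container entrypoint.
export OPENLINEAGE_URL="${OPENLINEAGE_URL:-http://marquez-api:5000}"
export OPENLINEAGE_NAMESPACE="${OPENLINEAGE_NAMESPACE:-ctms-lakehouse}"

if [ "${ENABLE_OPENLINEAGE:-true}" = "true" ] && command -v dbt-ol >/dev/null 2>&1; then
  DBT_BIN=dbt-ol   # emits lineage events to Marquez after each run/build
else
  DBT_BIN=dbt      # plain dbt: identical behavior, no lineage events
fi
echo "using $DBT_BIN"
```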
Environment Variables
| Variable | Default | Description |
|---|---|---|
| OPENLINEAGE_URL | http://marquez-api:5000 | Marquez API endpoint |
| OPENLINEAGE_NAMESPACE | ctms-lakehouse | Namespace for lineage events |
Viewing Lineage
After a dbt run, open the Marquez Web UI:
# Marquez Web UI
open http://localhost:3300
Navigate to the ctms-lakehouse namespace to explore:
- Jobs — Each dbt model appears as a job
- Datasets — Each table/view appears as a dataset
- Lineage graph — Visual DAG showing source → staging → dimensions/facts → semantic
Feature Toggles
Both OpenLineage and Elementary can be individually enabled or disabled:
| Variable | Default | Description |
|---|---|---|
| ENABLE_OPENLINEAGE | true (when OPENLINEAGE_URL is set) | Use dbt-ol wrapper for lineage emission |
| ENABLE_ELEMENTARY | true | Run Elementary report generation after dbt builds |
# Disable OpenLineage for a single run
docker compose run -e ENABLE_OPENLINEAGE=false lakehouse-dbt daily
# Disable Elementary for a single run
docker compose run -e ENABLE_ELEMENTARY=false lakehouse-dbt daily
These toggles can also be passed through the Zynexa REST API when triggering pipelines. See Environment Variables for details.
Validate Lineage
# Check that lineage events are being received
curl -s http://localhost:5000/api/v1/namespaces/ctms-lakehouse/jobs | jq '.jobs | length'
# Should return the number of dbt models that have been run
See Also
- AI Analytics Pipeline — Full architecture overview
- Environment Variables — Configuration reference