Data Pipeline Runner (Docker)
Run the full CTMS data pipeline (Ingester + DBT) using pre-built Docker images. No local Python, DBT, or dependency setup required — just Docker.
Use the Docker runner (run-ctms-data-pipeline.sh) for staging, production, and EC2 deployments. For local development with source code, see Data Pipeline - Ingester and Data Pipeline - DBT.
Prerequisites
- Docker installed and running
- Access to the ctms.devops repository
- A configured environment file (.env.ctms-data-pipeline.<client>)
- A running API gateway (Caddy + KrakenD) for local deployments
Quick Start
cd ctms.devops
# Run the full pipeline (ingester + DBT)
bash scripts/run-ctms-data-pipeline.sh --env example full-pipeline
# Or run stages individually
bash scripts/run-ctms-data-pipeline.sh --env example ingester
bash scripts/run-ctms-data-pipeline.sh --env example dbt-build
Environment Setup
1. Create Environment File
Each client/environment has its own config file in the ctms.devops root:
.env.ctms-data-pipeline.example # Example staging
.env.ctms-data-pipeline.example.prod # Example production
.env.ctms-data-pipeline.zynomi # Zynomi
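The `--env` flag maps directly onto these filenames. A minimal sketch of that resolution (hypothetical helper — the real logic lives inside run-ctms-data-pipeline.sh and may differ):

```shell
#!/bin/sh
# Resolve --env <name> to its config file and fail fast if it is missing.
# Illustrative sketch only, not the actual runner implementation.
resolve_env_file() {
  env_file=".env.ctms-data-pipeline.$1"
  if [ ! -f "$env_file" ]; then
    echo "error: $env_file not found in $(pwd)" >&2
    return 1
  fi
  printf '%s\n' "$env_file"
}
```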
2. Required Variables
# Docker Images
INGESTER_IMAGE=zynomi/ctms-ingester:latest
DBT_IMAGE=zynomi/ctms-dbt:latest
# Target Database (data warehouse)
TARGET_DB_HOST=your-db-host.supabase.com
TARGET_DB_PORT=5432
TARGET_DB_NAME=postgres
TARGET_DB_USER=postgres.your_project_ref
TARGET_DB_PASSWORD=YourPassword
TARGET_DB_SSLMODE=require
# Frappe API Source
FRAPPE_BASE_URL=https://api.localhost/api/v1
# Pipeline Settings
DLT_DESTINATION=postgres
DLT_DATASET_NAME=bronze
DLT_PIPELINE_NAME=hbct_clinical_trial_pipeline
TABLE_PREFIX=tbl_mst_
DBT_TARGET=dev
If your TARGET_DB_PASSWORD contains $ or other shell special characters, do not escape them in the env file. Docker's --env-file reads values literally without shell expansion. The runner script handles this correctly.
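To catch a missing variable before any containers start, a quick pre-flight check can grep the env file. This is an illustrative helper, not part of the runner; the variable list in the usage comment is the minimum set from above:

```shell
#!/bin/sh
# Verify each required key appears in the env file before running the pipeline.
# Hypothetical helper; extend the key list as needed.
check_required() {
  env_file="$1"; shift
  rc=0
  for key in "$@"; do
    grep -q "^${key}=" "$env_file" || { echo "missing: $key" >&2; rc=1; }
  done
  return $rc
}
# Example:
# check_required .env.ctms-data-pipeline.example \
#   INGESTER_IMAGE DBT_IMAGE TARGET_DB_HOST TARGET_DB_PASSWORD
```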
3. API Endpoint Groups
Configure which Frappe DocTypes to ingest using JSON arrays:
| Variable | Description | Example DocTypes |
|---|---|---|
| DATALAKE_APIS | Clinical/transaction data | Study, Patient, Subject, Vitals, Consent |
| DATALAKE_APIS_MASTER | Reference/lookup data | Sites, Dosages, Countries, Study Phase |
| DATALAKE_APIS_CRF | Case Report Forms | CRF form definitions |
| DATALAKE_APIS_RBAC | Access control | CTMS Roles, Permissions, Navigation |
Set an empty array [] to skip a group:
DATALAKE_APIS_CRF=[]
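For example, an environment that ingests a few clinical DocTypes, a couple of master tables, and no CRF data could look like this (DocType names are illustrative — use the ones configured for your Frappe instance):

```
DATALAKE_APIS=["Study", "Patient", "Subject"]
DATALAKE_APIS_MASTER=["Sites", "Countries"]
DATALAKE_APIS_CRF=[]
```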
Pipeline Commands
Command Reference
| Command | Description |
|---|---|
| full-pipeline | Run ingester + DBT daily (default) |
| ingester | Run bronze layer ingestion only |
| ingester-dry-run | Fetch data without writing to database |
| dbt-build | Run DBT deps + build (models + tests) |
| dbt-daily | Full DBT pipeline (deps + build + Elementary) |
| dbt-deps | Install DBT packages only |
| elementary | Run Elementary observability + report |
| pull | Pull latest Docker images |
| cleanup | Remove unused Docker resources |
Options
| Option | Description |
|---|---|
| --env <name> | Required. Environment name (e.g., example, example.prod) |
| --select <models> | DBT model selection (e.g., staging, marts, +model_name) |
| --full-refresh | Run DBT with --full-refresh to rebuild incremental models |
Usage Examples
Run Full Pipeline
bash scripts/run-ctms-data-pipeline.sh --env example full-pipeline
This will:
- Pull latest Docker images
- Run the ingester (Frappe API → Bronze layer)
- Run DBT deps, build, Elementary, and generate the observability report
Ingester Only
# Full ingestion
bash scripts/run-ctms-data-pipeline.sh --env example ingester
# Dry run (fetch data, don't write to DB)
bash scripts/run-ctms-data-pipeline.sh --env example ingester-dry-run
DBT Only
# Build models + run tests
bash scripts/run-ctms-data-pipeline.sh --env example dbt-build
# Build specific layer
bash scripts/run-ctms-data-pipeline.sh --env example dbt-build --select staging
bash scripts/run-ctms-data-pipeline.sh --env example dbt-build --select marts
# Full refresh (rebuild incremental models)
bash scripts/run-ctms-data-pipeline.sh --env example dbt-build --full-refresh
# Combined: specific models + full refresh
bash scripts/run-ctms-data-pipeline.sh --env example dbt-build --select staging --full-refresh
Elementary Reports
bash scripts/run-ctms-data-pipeline.sh --env example elementary
The report is written to: pipeline-data/dbt-reports/elementary_report.html
Running with Docker Directly
You can also run the containers directly without the runner script.
Ingester
docker run --rm \
--platform linux/amd64 \
--add-host api.localhost:host-gateway \
--env-file .env.ctms-data-pipeline.example \
-v ./pipeline-data/dlt:/app/.dlt \
-v ./pipeline-data/ingester-logs:/app/logs \
--entrypoint /bin/sh \
zynomi/ctms-ingester:latest -c "
export DB_HOST=\"\${TARGET_DB_HOST:-\$DB_HOST}\"
export DB_PORT=\"\${TARGET_DB_PORT:-\${DB_PORT:-5432}}\"
export DB_NAME=\"\${TARGET_DB_NAME:-\$DB_NAME}\"
export DB_USER=\"\${TARGET_DB_USER:-\$DB_USER}\"
export DB_PASSWORD=\"\${TARGET_DB_PASSWORD:-\$DB_PASSWORD}\"
export DB_SSLMODE=\"\${TARGET_DB_SSLMODE:-\${DB_SSLMODE:-require}}\"
exec python bot_frappe_api_to_db.py --batch
"
The env file uses TARGET_DB_* variable names (for DBT compatibility), but the ingester reads DB_*. The entrypoint wrapper maps between them inside the container, where the raw values from --env-file have no shell expansion issues.
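The fallback pattern in that wrapper is plain POSIX parameter expansion; in isolation it behaves like this (values illustrative):

```shell
#!/bin/sh
# Prefer TARGET_DB_*, fall back to a pre-existing DB_*, then a hard default.
unset DB_HOST DB_PORT TARGET_DB_PORT
TARGET_DB_HOST=db.example.com

DB_HOST="${TARGET_DB_HOST:-$DB_HOST}"          # TARGET_DB_HOST is set, so it wins
DB_PORT="${TARGET_DB_PORT:-${DB_PORT:-5432}}"  # neither is set, so the default applies

echo "$DB_HOST:$DB_PORT"  # prints db.example.com:5432
```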
DBT
DBT's built-in docker-entrypoint.sh handles TARGET_DB_* → DB_* mapping automatically:
# DBT Build
docker run --rm \
--platform linux/amd64 \
--env-file .env.ctms-data-pipeline.example \
-v ./pipeline-data/dbt-target:/app/target \
-v ./pipeline-data/dbt-logs:/app/logs \
-v ./pipeline-data/dbt-reports:/app/reports \
zynomi/ctms-dbt:latest dbt build
# DBT Deps
docker run --rm \
--platform linux/amd64 \
--env-file .env.ctms-data-pipeline.example \
-v ./pipeline-data/dbt-target:/app/target \
-v ./pipeline-data/dbt-logs:/app/logs \
-v ./pipeline-data/dbt-reports:/app/reports \
zynomi/ctms-dbt:latest dbt deps
# DBT with model selection
docker run --rm \
--platform linux/amd64 \
--env-file .env.ctms-data-pipeline.example \
-v ./pipeline-data/dbt-target:/app/target \
-v ./pipeline-data/dbt-logs:/app/logs \
-v ./pipeline-data/dbt-reports:/app/reports \
zynomi/ctms-dbt:latest dbt build --select staging
Output Directories
All pipeline output is stored under pipeline-data/:
pipeline-data/
├── dlt/ # DLT pipeline state (incremental tracking)
├── ingester-logs/ # Ingester execution logs
├── dbt-target/ # DBT compiled SQL and run results
├── dbt-logs/ # DBT execution logs
└── dbt-reports/ # Elementary HTML reports
└── elementary_report.html
The pipeline-data/dlt/ directory tracks incremental pipeline state. Deleting it forces a full re-ingestion on the next run.
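A deliberate reset can be wrapped in a small guard so that only the state directory is removed, never the logs or reports beside it. This is a hypothetical helper sketch, not part of the runner:

```shell
#!/bin/sh
# Remove only the DLT state directory under the given data root,
# forcing a full re-ingestion on the next run. Hypothetical helper.
reset_dlt_state() {
  data_root="${1:-pipeline-data}"
  [ -d "$data_root" ] || { echo "no such data root: $data_root" >&2; return 1; }
  rm -rf "$data_root/dlt"
}
```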
Data Flow
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Frappe APIs │────▶│ Ingester │────▶│ PostgreSQL │
│ (via KrakenD) │ │ (DLT Hub) │ │ Bronze Layer │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐
│ DBT │
│ Bronze → Silver │
│ Silver → Gold │
└─────────────────┘
| Layer | Schema | Description |
|---|---|---|
| Bronze | bronze | Raw data from Frappe APIs (via DLT) |
| Silver | silver | Cleaned, typed, deduplicated data |
| Gold | gold | Business-ready dimensional models (facts + dimensions) |
Cron Scheduling
For automated daily runs on EC2 or any Linux server:
# Daily at 2 AM: full pipeline (each crontab entry must be a single line;
# crontab does not support backslash line continuation)
0 2 * * * /path/to/ctms.devops/scripts/run-ctms-data-pipeline.sh --env example.prod full-pipeline >> /var/log/ctms-pipeline.log 2>&1
# Ingester every 6 hours, DBT daily at 3 AM
0 */6 * * * /path/to/ctms.devops/scripts/run-ctms-data-pipeline.sh --env example.prod ingester >> /var/log/ctms-ingester.log 2>&1
0 3 * * * /path/to/ctms.devops/scripts/run-ctms-data-pipeline.sh --env example.prod dbt-daily >> /var/log/ctms-dbt.log 2>&1
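These crontab lines do not guard against overlap: a slow full-pipeline run can still be in progress when the next trigger fires. One common mitigation, assuming util-linux flock is available on the host, is to serialize runs on a lock file (lock path and wrapper name are illustrative):

```shell
#!/bin/sh
# Run a command under an exclusive, non-blocking lock; skip if already held.
# Sketch assuming util-linux flock; the lock path is arbitrary.
run_locked() {
  lock="$1"; shift
  flock -n "$lock" "$@" || echo "previous run still in progress; skipping"
}

# Crontab usage (single line):
# 0 2 * * * /usr/bin/flock -n /tmp/ctms-pipeline.lock /path/to/ctms.devops/scripts/run-ctms-data-pipeline.sh --env example.prod full-pipeline >> /var/log/ctms-pipeline.log 2>&1
```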
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| CERTIFICATE_VERIFY_FAILED | Caddy's self-signed cert not trusted | The runner script auto-merges Caddy's CA. Ensure ctms-caddy container is running |
| Circuit breaker open | Too many failed auth attempts | Wait 5 minutes, verify TARGET_DB_PASSWORD is correct in env file |
| column "X" does not exist | DLT skipped columns with all-null values | Populate data in Frappe for the missing fields, then re-run ingester |
| relation does not exist | DLT skipped empty API responses | Verify the DocType has data in Frappe |
| Password with $ truncated | Shell expanded $ in password | Use TARGET_DB_* vars (not DB_*); Docker --env-file reads them literally |
| Ingester connects to localhost | DB_HOST not mapped | Use the runner script (handles mapping) or the entrypoint wrapper shown above |
See Also
- Data Pipeline - Ingester — Monorepo setup with Make commands
- Data Pipeline - DBT — DBT development with Make commands
- Data Ingestion Architecture — Pipeline architecture overview
- Environment Variables — Full configuration reference
- Platform Runbook — Operational procedures