
Schedule ETL Jobs — Docker Compose

Automate the data lakehouse pipeline (Ingester + dbt) to run on a schedule using Linux crontab with direct Docker Compose commands. This is the recommended approach for production deployments — it runs containers directly without depending on the Zynexa web app.

Prerequisites
  • SSH access to the server
  • Docker Compose project at /opt/ctms-deployment
  • .env.production configured with all pipeline variables
  • Lakehouse DB running: docker compose --profile lakehouse up -d lakehouse-db

For one-off or ad-hoc runs, see Trigger ETL Pipeline via REST API or Platform Runbook — Data Lakehouse Pipeline.


Quick Setup

# Open crontab editor
crontab -e

# Add the daily pipeline schedule (runs at midnight).
# NOTE: cron does not support backslash line continuations. The entry is
# wrapped here for readability but must be saved as a single line.
0 0 * * * cd /opt/ctms-deployment && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-ingester && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily \
>> /var/log/ctms-pipeline.log 2>&1

Pipeline Steps

The cron job runs two stages sequentially:

Stage          Container            Duration   What it does
1. Ingester    lakehouse-ingester   2–5 min    Extracts ~46 Frappe DocTypes via REST API → loads into bronze schema
2. dbt daily   lakehouse-dbt        5–15 min   dbt deps → dbt build → Elementary report → transforms bronze → silver → gold

The && operator ensures dbt runs only if the ingester exits successfully. If the ingester fails, dbt is skipped and the ingester's error output is captured in the log.
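The gating behavior can be sketched with stand-in commands. This is an illustration only, not part of the pipeline: true and false stand in for the ingester's exit status.

```shell
# gate() mimics the cron entry: run the "ingester" command, and only if it
# exits 0 run the "dbt" step.
gate() { "$1" && echo "dbt runs"; }

gate false || echo "ingester failed, dbt skipped"   # prints "ingester failed, dbt skipped"
gate true                                           # prints "dbt runs"
```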


Schedule Examples

Midnight Daily

0 0 * * * cd /opt/ctms-deployment && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-ingester && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily \
>> /var/log/ctms-pipeline.log 2>&1

Every 6 Hours

0 */6 * * * cd /opt/ctms-deployment && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-ingester && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily \
>> /var/log/ctms-pipeline.log 2>&1

2 AM Weekdays Only

0 2 * * 1-5 cd /opt/ctms-deployment && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-ingester && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily \
>> /var/log/ctms-pipeline.log 2>&1

Ingester and dbt at Different Times

Run ingester more frequently (to keep bronze fresh) and dbt less often:

# Ingester every 4 hours
0 */4 * * * cd /opt/ctms-deployment && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-ingester \
>> /var/log/ctms-ingester.log 2>&1

# dbt once daily at 1 AM
0 1 * * * cd /opt/ctms-deployment && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily \
>> /var/log/ctms-dbt.log 2>&1

dbt Only (No Ingestion)

If data is loaded via another mechanism and you only need dbt transformations:

0 0 * * * cd /opt/ctms-deployment && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily \
>> /var/log/ctms-dbt.log 2>&1

Cron Syntax Reference

┌───────────── minute (0–59)
│ ┌───────────── hour (0–23)
│ │ ┌───────────── day of month (1–31)
│ │ │ ┌───────────── month (1–12)
│ │ │ │ ┌───────────── day of week (0–7, 0 and 7 = Sunday)
│ │ │ │ │
* * * * * command
Expression      Schedule
0 0 * * *       Midnight daily
0 2 * * *       2 AM daily
0 */6 * * *     Every 6 hours
0 */4 * * *     Every 4 hours
0 2 * * 1-5     2 AM weekdays
30 1 * * 0      1:30 AM Sundays
0 0 1 * *       Midnight, 1st of month

Logging & Monitoring

Log File Location

All pipeline output is appended to /var/log/ctms-pipeline.log:

# View latest pipeline output
tail -100 /var/log/ctms-pipeline.log

# Follow live during a run
tail -f /var/log/ctms-pipeline.log

# Check for errors
grep -i 'error\|fail' /var/log/ctms-pipeline.log | tail -20
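Before pointing these commands at the production log, the grep pattern can be tried against a throwaway sample. The log lines below are illustrative, not real pipeline output, and the model name is made up:

```shell
# Build a small sample log, then pull the most recent error line from it.
LOG=$(mktemp)
cat > "$LOG" << 'EOF'
2024-06-01 00:00:02 ingester: 46 doctypes loaded
2024-06-01 00:04:10 dbt: Completed successfully
2024-06-02 00:00:02 ingester: 46 doctypes loaded
2024-06-02 00:06:41 dbt: ERROR compiling model silver_example
EOF
grep -i 'error\|fail' "$LOG" | tail -1   # prints the 00:06:41 ERROR line
rm -f "$LOG"
```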

Log Rotation

Prevent the log file from growing indefinitely:

# Create logrotate config (writing to /etc requires root, hence sudo tee)
sudo tee /etc/logrotate.d/ctms-pipeline > /dev/null << 'EOF'
/var/log/ctms-pipeline.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
}
EOF
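The stanza can be sanity-checked with logrotate's debug mode, which parses the config and reports what it would do without rotating anything. Shown here against a temp copy with a scratch state file, so it is safe to run as any user:

```shell
# Write the stanza to a temp file and dry-run it; -d parses and reports,
# rotating nothing. -s points at a scratch state path so no root is needed.
CONF=$(mktemp)
cat > "$CONF" << 'EOF'
/var/log/ctms-pipeline.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
}
EOF
logrotate -d -s "$(mktemp -u)" "$CONF" && echo "config OK"
rm -f "$CONF"
```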

Verify Cron is Running

# List current crontab
crontab -l

# Check if cron ran recently (systemd)
journalctl -u cron --since "1 hour ago"

# Check if cron ran recently (syslog)
grep CRON /var/log/syslog | tail -5

Verify Pipeline Results

After a cron run, check table counts in the lakehouse:

docker exec ctms-lakehouse-db psql -U ctms_user -d ctms_dlh -c "
SELECT schemaname AS schema, COUNT(*) AS tables
FROM pg_tables
WHERE schemaname IN ('bronze', 'silver', 'gold')
GROUP BY schemaname ORDER BY schemaname;
"

Expected output:

 schema | tables
--------+--------
 bronze |     63
 gold   |     28
 silver |      7

Troubleshooting

Pipeline Doesn't Run

  1. Verify crontab saved: crontab -l — should show the entry
  2. Check cron service: systemctl status cron (or crond on RHEL/Rocky)
  3. PATH issues: Cron uses a minimal PATH. Add the full docker path if needed:
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    0 0 * * * cd /opt/ctms-deployment && ...
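When diagnosing PATH or environment issues, a temporary crontab entry that dumps cron's environment makes the difference from your login shell visible. The /tmp path is arbitrary; remove the entry once done:

```shell
# Temporary debug entry: capture the environment cron actually runs with,
# then compare it against `env` in your interactive shell. Delete afterwards.
* * * * * env > /tmp/ctms-cron-env 2>&1
```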

Container Already Running Error

If a previous run hasn't finished and a new cron fires:

# Check if containers are still running
docker ps --filter name=lakehouse

# Wait and retry, or kill the stuck container
docker rm -f ctms-lakehouse-ingester ctms-lakehouse-dbt

To prevent overlap, wrap the cron command with flock:

0 0 * * * flock -n /tmp/ctms-pipeline.lock -c 'cd /opt/ctms-deployment && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-ingester && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily' \
>> /var/log/ctms-pipeline.log 2>&1

The -n flag makes flock fail immediately if the lock is held (previous run still active), preventing overlap.
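The behavior is easy to verify locally with a throwaway lock file; this is a minimal sketch, independent of the pipeline:

```shell
# First flock holds the lock for 2 seconds in the background; the second
# -n attempt fails immediately instead of queueing behind it.
LOCK=$(mktemp)
flock -n "$LOCK" -c 'sleep 2' &
sleep 0.5   # give the background flock time to acquire the lock
if flock -n "$LOCK" -c 'true'; then
    echo "lock acquired"
else
    echo "lock held; previous run still active, skipping"
fi
wait
rm -f "$LOCK"
```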

dbt Errors After Ingester Success

If the ingester succeeds but dbt exits with code 1, check the dbt output:

# Check last dbt run
grep -A 20 '=== DBT ===' /var/log/ctms-pipeline.log | tail -25

# Common causes:
# - Missing bronze tables (not all DocTypes have data yet) → harmless, skipped models
# - Column mismatch after Frappe schema changes → may need dbt-full-refresh

See Also