Schedule ETL Jobs — Docker Compose
Automate the data lakehouse pipeline (Ingester + dbt) to run on a schedule using Linux crontab with direct Docker Compose commands. This is the recommended approach for production deployments — it runs containers directly without depending on the Zynexa web app.
Prerequisites
- SSH access to the server
- Docker Compose project at /opt/ctms-deployment
- .env.production configured with all pipeline variables
- Lakehouse DB running: docker compose --profile lakehouse up -d lakehouse-db
For one-off or ad-hoc runs, see Trigger ETL Pipeline via REST API or Platform Runbook — Data Lakehouse Pipeline.
Quick Setup
# Open crontab editor
crontab -e
# Add the daily pipeline schedule (runs at midnight)
0 0 * * * cd /opt/ctms-deployment && { \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-ingester && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily; \
} >> /var/log/ctms-pipeline.log 2>&1
Note: crontab does not support backslash line continuation. Enter each schedule as a single line; the examples in this guide are wrapped only for readability. The braces group both stages so the output of each is appended to the log (without them, the redirection would apply only to the final command).
Pipeline Steps
The cron job runs two stages sequentially:
| Stage | Container | Duration | What it does |
|---|---|---|---|
| 1. Ingester | lakehouse-ingester | 2–5 min | Extracts ~46 Frappe DocTypes via REST API → loads into bronze schema |
| 2. dbt daily | lakehouse-dbt | 5–15 min | dbt deps → dbt build → Elementary report → transforms bronze → silver → gold |
The && operator ensures dbt runs only if the ingester exits successfully. If the ingester fails, dbt is skipped and the ingester's error output is captured in the log via 2>&1.
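This short-circuit behavior is plain shell semantics and can be checked with stand-in commands (here, true and false stand in for the two container runs):

```shell
# `a && b` runs b only when a exits 0; the chain's exit status is
# that of the first failing command.
true && echo "dbt stage runs"     # ingester "succeeds": dbt stage runs
false && echo "dbt stage runs"    # ingester "fails": nothing is printed
echo "pipeline exit: $?"          # prints: pipeline exit: 1
```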
Schedule Examples
Midnight Daily (Recommended)
0 0 * * * cd /opt/ctms-deployment && { \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-ingester && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily; \
} >> /var/log/ctms-pipeline.log 2>&1
Every 6 Hours
0 */6 * * * cd /opt/ctms-deployment && { \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-ingester && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily; \
} >> /var/log/ctms-pipeline.log 2>&1
2 AM Weekdays Only
0 2 * * 1-5 cd /opt/ctms-deployment && { \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-ingester && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily; \
} >> /var/log/ctms-pipeline.log 2>&1
Ingester and dbt at Different Times
Run ingester more frequently (to keep bronze fresh) and dbt less often:
# Ingester every 4 hours
0 */4 * * * cd /opt/ctms-deployment && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-ingester \
>> /var/log/ctms-ingester.log 2>&1
# dbt once daily at 1 AM
0 1 * * * cd /opt/ctms-deployment && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily \
>> /var/log/ctms-dbt.log 2>&1
dbt Only (No Ingestion)
If data is loaded via another mechanism and you only need dbt transformations:
0 0 * * * cd /opt/ctms-deployment && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily \
>> /var/log/ctms-dbt.log 2>&1
Cron Syntax Reference
┌───────────── minute (0–59)
│ ┌───────────── hour (0–23)
│ │ ┌───────────── day of month (1–31)
│ │ │ ┌───────────── month (1–12)
│ │ │ │ ┌───────────── day of week (0–7, 0 and 7 = Sunday)
│ │ │ │ │
* * * * * command
| Expression | Schedule |
|---|---|
| 0 0 * * * | Midnight daily |
| 0 2 * * * | 2 AM daily |
| 0 */6 * * * | Every 6 hours |
| 0 */4 * * * | Every 4 hours |
| 0 2 * * 1-5 | 2 AM weekdays |
| 30 1 * * 0 | 1:30 AM Sundays |
| 0 0 1 * * | Midnight, 1st of month |
Logging & Monitoring
Log File Location
All pipeline output is appended to /var/log/ctms-pipeline.log:
# View latest pipeline output
tail -100 /var/log/ctms-pipeline.log
# Follow live during a run
tail -f /var/log/ctms-pipeline.log
# Check for errors
grep -i 'error\|fail' /var/log/ctms-pipeline.log | tail -20
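For unattended monitoring, the same grep check can be wrapped in a small helper; this is a sketch, and the function name and alert hook are illustrative, not part of the pipeline:

```shell
# check_pipeline_log LOGFILE
# Returns nonzero when the last 200 log lines mention an error or
# failure, so it can gate an alert command.
check_pipeline_log() {
    tail -n 200 "$1" 2>/dev/null | grep -qiE 'error|fail' && return 1
    return 0
}

# Usage (hypothetical alert command):
#   check_pipeline_log /var/log/ctms-pipeline.log || notify-admins
```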
Log Rotation
Prevent the log file from growing indefinitely:
# Create logrotate config (requires root)
cat > /etc/logrotate.d/ctms-pipeline << 'EOF'
/var/log/ctms-pipeline.log {
daily
rotate 14
compress
missingok
notifempty
}
EOF
Verify Cron is Running
# List current crontab
crontab -l
# Check if cron ran recently (systemd)
journalctl -u cron --since "1 hour ago"
# Check if cron ran recently (syslog)
grep CRON /var/log/syslog | tail -5
Verify Pipeline Results
After a cron run, check table counts in the lakehouse:
docker exec ctms-lakehouse-db psql -U ctms_user -d ctms_dlh -c "
SELECT schemaname AS schema, COUNT(*) AS tables
FROM pg_tables
WHERE schemaname IN ('bronze', 'silver', 'gold')
GROUP BY schemaname ORDER BY schemaname;
"
Expected output:
schema | tables
--------+--------
bronze | 63
gold | 28
silver | 7
Troubleshooting
Pipeline Doesn't Run
- Verify crontab saved: crontab -l should show the entry
- Check cron service: systemctl status cron (or crond on RHEL/Rocky)
- PATH issues: cron uses a minimal PATH. Add the full PATH at the top of the crontab if needed:
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
0 0 * * * cd /opt/ctms-deployment && ...
Container Already Running Error
If a previous run hasn't finished and a new cron fires:
# Check if containers are still running
docker ps --filter name=lakehouse
# Wait and retry, or kill the stuck container
docker rm -f ctms-lakehouse-ingester ctms-lakehouse-dbt
To prevent overlap, wrap the cron command with flock:
0 0 * * * flock -n /tmp/ctms-pipeline.lock -c 'cd /opt/ctms-deployment && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-ingester && \
docker compose --env-file .env.production --profile lakehouse run --rm lakehouse-dbt daily' \
>> /var/log/ctms-pipeline.log 2>&1
The -n flag makes flock fail immediately if the lock is held (previous run still active), preventing overlap.
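The lock behavior can be observed directly. This sketch holds the lock in a background process, then shows a second flock -n attempt failing immediately instead of waiting:

```shell
# Hold the lock for 2 seconds in the background...
flock -n /tmp/ctms-pipeline.lock -c 'sleep 2' &
sleep 0.5
# ...then try again while it is held: -n exits at once with an error.
flock -n /tmp/ctms-pipeline.lock -c 'echo "acquired"' || echo "lock busy, run skipped"
wait
```

Without -n, the second invocation would block until the first released the lock, which can queue up overlapping cron runs instead of skipping them.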
dbt Errors After Ingester Success
If the ingester succeeds but dbt exits with code 1, check the dbt output:
# Check last dbt run
grep -A 20 '=== DBT ===' /var/log/ctms-pipeline.log | tail -25
# Common causes:
# - Missing bronze tables (not all DocTypes have data yet) → harmless, skipped models
# - Column mismatch after Frappe schema changes → may need dbt-full-refresh
See Also
- Schedule ETL Jobs — REST API Cron — alternative using curl to the Zynexa API
- Trigger ETL Pipeline via REST API — ad-hoc runs without SSH
- Platform Runbook — Data Lakehouse Pipeline — Docker Compose commands reference
- Environment Variables — Pipeline Orchestration — all pipeline config