Data Pipeline - Ingester
This guide covers running the data ingestion pipeline, which extracts clinical trial data from the Frappe API and loads it into the PostgreSQL bronze layer.
Quick Commands
For the most common ingester commands (run, full-refresh, cron), see the Platform Runbook → Data Lakehouse Pipeline. This page covers advanced configuration and standalone setup.
Prerequisites
- Python 3.10+
- PostgreSQL/Neon database credentials
Quick Start
The ingester runs as a Docker container via the lakehouse profile:
# Define the compose command
DC="docker compose --env-file .env.production --profile lakehouse"
# Start the lakehouse database
$DC up -d lakehouse-db
# Run ingestion
$DC run --rm lakehouse-ingester
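The two steps above can be chained so that ingestion is skipped if the database fails to start. This is a minimal sketch, not part of the platform tooling: the function name `run_ingestion` is hypothetical, and the `DC` override is an assumption added for dry runs.

```shell
# Hypothetical wrapper around the two Quick Start commands.
# DC can be overridden (for example DC=echo) to print the commands instead of running them.
run_ingestion() {
  dc="${DC:-docker compose --env-file .env.production --profile lakehouse}"

  # Start the lakehouse database; stop here if that fails.
  $dc up -d lakehouse-db || return 1

  # Run ingestion, passing any extra arguments through to the ingester.
  $dc run --rm lakehouse-ingester "$@"
}
```

Calling `run_ingestion` with no arguments runs all endpoints; extra arguments such as `--help` are forwarded to the ingester unchanged.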
Standalone Setup
The ingester can also be set up and run on its own:
1. Configure Environment
Copy the example environment file:
cp .env.production .env
Edit .env with your database credentials:
# Database Connection
DB_HOST=your-neon-host.aws.neon.tech
DB_PORT=5432
DB_USER=neondb_owner
DB_PASSWORD=your-password
DB_NAME=neondb
DB_SSLMODE=require
# Pipeline Settings
DLT_DESTINATION=postgres
DLT_DATASET_NAME=bronze
TABLE_PREFIX=tbl_mst_
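Before building the image, it can help to confirm that the file has a value for every variable shown above. The sketch below is one way to do that; the function name `missing_env_keys` is hypothetical, and treating every variable in the example as required is an assumption, not something the ingester is known to enforce.

```shell
# Hypothetical helper: print each key from the example above that is
# missing or empty in the given env file. Prints nothing if all are set.
missing_env_keys() {
  env_file="$1"
  for key in DB_HOST DB_PORT DB_USER DB_PASSWORD DB_NAME DB_SSLMODE \
             DLT_DESTINATION DLT_DATASET_NAME TABLE_PREFIX; do
    # Match "KEY=<at least one character>" at the start of a line.
    if ! grep -Eq "^${key}=.+" "$env_file"; then
      echo "$key"
    fi
  done
}
```

For example, `missing_env_keys .env` lists any keys that still need a value.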
2. Build Ingester Image
DC="docker compose --env-file .env.production --profile lakehouse"
$DC build lakehouse-ingester
3. Run Pipeline
Run ingestion for all configured endpoints:
$DC run --rm lakehouse-ingester
Available Commands
DC="docker compose --env-file .env.production --profile lakehouse"
| Command | Description |
|---|---|
| $DC run --rm lakehouse-ingester | Run all endpoints |
| $DC run --rm lakehouse-ingester --help | Show ingester options |
| $DC run --rm lakehouse-ingester --endpoint '...' | Run single endpoint |
| $DC run --rm lakehouse-ingester --purge | Drop and recreate schemas |
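To ingest a handful of endpoints rather than all of them, the single-endpoint command can be looped. The following is a sketch under assumptions: `run_endpoints` is a hypothetical name, the endpoint values are supplied by the caller (this page does not list valid ones), and the `DC` override exists only to allow a dry run.

```shell
# Hypothetical helper: run the ingester once per endpoint given as an argument.
# Continues after a failure and returns non-zero if any endpoint failed.
run_endpoints() {
  dc="${DC:-docker compose --env-file .env.production --profile lakehouse}"
  failed=0
  for ep in "$@"; do
    if ! $dc run --rm lakehouse-ingester --endpoint "$ep"; then
      echo "endpoint failed: $ep" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Running each endpoint in its own container keeps one failing endpoint from stopping the rest, at the cost of a container start per endpoint.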
API Endpoint Groups
The pipeline ingests data from three endpoint groups:
| Group | Variable | Description |
|---|---|---|
| 🔵 Transaction | DATALAKE_APIS | Clinical data (Studies, Patients, Vitals, etc.) |
| 🟢 Master | DATALAKE_APIS_MASTER | Reference data (Sites, Dosages, Countries, etc.) |
| 🟡 CRF | DATALAKE_APIS_CRF | Case Report Forms |
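If these variables are exposed as environment variables (an assumption; this page does not say where they are defined), a quick check like the one below shows which groups have a value in the current shell. The function name `configured_groups` is hypothetical.

```shell
# Hypothetical helper: print the name of each endpoint-group variable
# that is set to a non-empty value in the current shell.
configured_groups() {
  for var in DATALAKE_APIS DATALAKE_APIS_MASTER DATALAKE_APIS_CRF; do
    # Indirect lookup of the variable named in $var (names come from the fixed list above).
    eval "val=\${$var:-}"
    if [ -n "$val" ]; then
      echo "$var"
    fi
  done
}
```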