# Data Pipeline - Ingester
This guide covers running the data ingestion pipeline, which extracts clinical trial data from the Frappe API and loads it into the PostgreSQL bronze layer.
## Prerequisites
- Python 3.10+
- PostgreSQL/Neon database credentials
## Quick Start (Monorepo)

The ingester is part of the consolidated AI Analytics Pipeline monorepo:

```bash
# Clone the monorepo
git clone https://github.com/zynomilabs/ctms-data-pipeline-ai-analytics.git
cd ctms-data-pipeline-ai-analytics

# Copy environment file
make env-copy

# Install ingester dependencies
make setup-ingester

# Run ingestion
make ingester-run
```
## Standalone Setup

The ingester can also be run on its own, outside the monorepo:
### 1. Configure Environment

Copy the environment template and configure it:

```bash
cp .env.production .env
```

Edit `.env` with your database credentials:
```bash
# Database Connection
DB_HOST=your-neon-host.aws.neon.tech
DB_PORT=5432
DB_USER=neondb_owner
DB_PASSWORD=your-password
DB_NAME=neondb
DB_SSLMODE=require

# Pipeline Settings
DLT_DESTINATION=postgres
DLT_DATASET_NAME=bronze
TABLE_PREFIX=tbl_mst_
```
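Before running the pipeline, it can be useful to confirm the credentials work. Below is a minimal connectivity check, assuming `psql` is installed and `.env` is in the current directory; it is a convenience sketch, not part of the ingester itself:

```bash
# Optional sanity check: load the variables from .env into this shell
set -a; source .env; set +a

# \conninfo prints the connection details if authentication succeeds
psql "postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?sslmode=${DB_SSLMODE}" -c '\conninfo'
```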
### 2. Install Dependencies

```bash
make setup-ingester
```
### 3. Run Pipeline

Execute all configured endpoints:

```bash
make ingester-run
```
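After a run completes, you can verify that tables landed in the bronze schema. A hedged check with `psql` (reusing the `.env` variables loaded above); the exact table names depend on which endpoints are enabled, with `tbl_mst_` coming from the `TABLE_PREFIX` setting:

```bash
# List the tables the run created in the bronze schema
psql "postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?sslmode=${DB_SSLMODE}" \
  -c "SELECT table_name FROM information_schema.tables WHERE table_schema = 'bronze' ORDER BY table_name;"
```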
## Available Commands

| Command | Description |
|---|---|
| `make ingester-help` | Show ingester options |
| `make ingester-run` | Run all endpoints |
| `make ingester-endpoint ENDPOINT='...'` | Run a single endpoint |
| `make ingester-purge` | Drop and recreate schemas |
| `make ingester-summary LOG=<file>` | Parse a log for a table/record summary |
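A typical targeted run captures the output to a log and then summarizes it. The endpoint name below (`Patient`) is a hypothetical example; substitute one of your configured endpoints:

```bash
# Run one endpoint and capture its output; 'Patient' is a hypothetical
# endpoint name -- substitute one of your configured endpoints
make ingester-endpoint ENDPOINT='Patient' | tee ingester.log

# Summarize the captured log into a table/record report
make ingester-summary LOG=ingester.log
```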
## API Endpoint Groups
The pipeline ingests data from three endpoint groups:
| Group | Variable | Description |
|---|---|---|
| 🔵 Transaction | DATALAKE_APIS | Clinical data (Studies, Patients, Vitals, etc.) |
| 🟢 Master | DATALAKE_APIS_MASTER | Reference data (Sites, Dosages, Countries, etc.) |
| 🟡 CRF | DATALAKE_APIS_CRF | Case Report Forms |
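How these group variables are populated depends on the ingester's configuration parsing; the comma-separated shape and endpoint names below are illustrative assumptions, not the actual format:

```bash
# Illustrative only: the real variable format and endpoint names are
# defined by the ingester's configuration, not by this guide
DATALAKE_APIS=Study,Patient,Vitals
DATALAKE_APIS_MASTER=Site,Dosage,Country
DATALAKE_APIS_CRF=DemographicsCRF
```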