
Data Pipeline - Ingester

This guide covers running the data ingestion pipeline that extracts clinical trial data from the Frappe API and loads it into the PostgreSQL bronze layer.


Prerequisites

  • Python 3.10+
  • PostgreSQL/Neon database credentials

Quick Start (Monorepo)

The ingester is part of the consolidated AI Analytics Pipeline monorepo:

# Clone the monorepo
git clone https://github.com/zynomilabs/ctms-data-pipeline-ai-analytics.git
cd ctms-data-pipeline-ai-analytics

# Copy environment file
make env-copy

# Install ingester dependencies
make setup-ingester

# Run ingestion
make ingester-run
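
To keep a record of what was loaded, you can capture the run output and feed it to the summary target afterwards (a sketch using standard shell tools; the log file name is illustrative):

# Capture the ingestion output to a log file (the name is illustrative)
LOG_FILE=ingester-$(date +%Y%m%d).log
make ingester-run 2>&1 | tee "$LOG_FILE"

# Summarize tables and record counts from that log
make ingester-summary LOG="$LOG_FILE"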

Standalone Setup

For a standalone deployment, the ingester can also be run outside the monorepo:

1. Configure Environment

Copy the example environment file and configure it:

cp .env.production .env

Edit .env with your database credentials:

# Database Connection
DB_HOST=your-neon-host.aws.neon.tech
DB_PORT=5432
DB_USER=neondb_owner
DB_PASSWORD=your-password
DB_NAME=neondb
DB_SSLMODE=require

# Pipeline Settings
DLT_DESTINATION=postgres
DLT_DATASET_NAME=bronze
TABLE_PREFIX=tbl_mst_
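
Optionally, confirm the credentials reach the database before continuing. This check assumes the psql client is installed locally; it is not part of the Makefile:

# Export the variables from .env and attempt a trivial query
set -a; source .env; set +a

# Passwords with special characters may need URL-encoding in the connection URI
psql "postgresql://$DB_USER:$DB_PASSWORD@$DB_HOST:$DB_PORT/$DB_NAME?sslmode=$DB_SSLMODE" -c 'SELECT 1;'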

2. Install Dependencies

make setup-ingester

3. Run Pipeline

Execute all configured endpoints:

make ingester-run
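
Once the run finishes, the ingested tables should appear in the dataset set by DLT_DATASET_NAME (bronze above). A quick way to list them, assuming psql is installed and the .env variables are exported as in the earlier check:

# List the tables created in the bronze schema (name taken from DLT_DATASET_NAME)
psql "postgresql://$DB_USER:$DB_PASSWORD@$DB_HOST:$DB_PORT/$DB_NAME?sslmode=$DB_SSLMODE" -c '\dt bronze.*'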

Available Commands

| Command | Description |
|---------|-------------|
| make ingester-help | Show ingester options |
| make ingester-run | Run all endpoints |
| make ingester-endpoint ENDPOINT='...' | Run single endpoint |
| make ingester-purge | Drop and recreate schemas |
| make ingester-summary LOG=<file> | Parse log for table/record summary |
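
For example, to re-run a single endpoint after a partial failure (the endpoint name below is illustrative; make ingester-help lists the accepted values):

# 'Patient' is an illustrative endpoint name, not necessarily a real one
make ingester-endpoint ENDPOINT='Patient'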

API Endpoint Groups

The pipeline ingests data from three endpoint groups:

| Group | Variable | Description |
|-------|----------|-------------|
| 🔵 Transaction | DATALAKE_APIS | Clinical data (Studies, Patients, Vitals, etc.) |
| 🟢 Master | DATALAKE_APIS_MASTER | Reference data (Sites, Dosages, Countries, etc.) |
| 🟡 CRF | DATALAKE_APIS_CRF | Case Report Forms |
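
The variable names suggest these endpoint lists are configured through the environment; assuming they are defined in .env, you can inspect what each group currently targets:

# Show the configured endpoint lists for all three groups (assumes they are set in .env)
grep -E '^DATALAKE_APIS(_MASTER|_CRF)?=' .env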