
Data Pipeline - Ingester

This guide covers running the data ingestion pipeline that extracts clinical trial data from the Frappe API and loads it into the PostgreSQL bronze layer.


Prerequisites

  • Python 3.10+
  • PostgreSQL/Neon database credentials

Quick Start (Monorepo)

The ingester is part of the consolidated AI Analytics Pipeline monorepo:

# Clone the monorepo
git clone https://github.com/zynomilabs/ctms-data-pipeline-ai-analytics.git
cd ctms-data-pipeline-ai-analytics

# Copy environment file
make env-copy

# Install ingester dependencies
make setup-ingester

# Run ingestion
make ingester-run
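
To keep a record of what was loaded, you can capture the run output and feed it to the summary target afterwards (a sketch using standard shell tools; the log file name is illustrative):

# Capture the ingestion output to a log file (the name is illustrative)
LOG_FILE=ingester-$(date +%Y%m%d).log
make ingester-run 2>&1 | tee "$LOG_FILE"

# Summarize tables and record counts from that log
make ingester-summary LOG="$LOG_FILE"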

Standalone Setup

For a standalone deployment, the ingester can also be run outside the monorepo:

1. Configure Environment

Copy the example environment file and configure it:

cp .env.production .env

Edit .env with your database credentials:

# Database Connection
DB_HOST=your-neon-host.aws.neon.tech
DB_PORT=5432
DB_USER=neondb_owner
DB_PASSWORD=your-password
DB_NAME=neondb
DB_SSLMODE=require

# Pipeline Settings
DLT_DESTINATION=postgres
DLT_DATASET_NAME=bronze
TABLE_PREFIX=tbl_mst_
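
Optionally, confirm the credentials reach the database before continuing. This check assumes the psql client is installed locally; it is not part of the Makefile:

# Export the variables from .env and attempt a trivial query
set -a; source .env; set +a

# Passwords with special characters may need URL-encoding in the connection URI
psql "postgresql://$DB_USER:$DB_PASSWORD@$DB_HOST:$DB_PORT/$DB_NAME?sslmode=$DB_SSLMODE" -c 'SELECT 1;'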

2. Install Dependencies

make setup-ingester

3. Run Pipeline

Execute all configured endpoints:

make ingester-run
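
Once the run finishes, the ingested tables should appear in the dataset set by DLT_DATASET_NAME (bronze above). A quick way to list them, assuming psql is installed and the .env variables are exported as in the earlier check:

# List the tables created in the bronze schema (name taken from DLT_DATASET_NAME)
psql "postgresql://$DB_USER:$DB_PASSWORD@$DB_HOST:$DB_PORT/$DB_NAME?sslmode=$DB_SSLMODE" -c '\dt bronze.*'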

Available Commands

| Command | Description |
|---------|-------------|
| make ingester-help | Show ingester options |
| make ingester-run | Run all endpoints |
| make ingester-endpoint ENDPOINT='...' | Run single endpoint |
| make ingester-purge | Drop and recreate schemas |
| make ingester-summary LOG=<file> | Parse log for table/record summary |
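
For example, to re-run a single endpoint after a partial failure (the endpoint name below is illustrative; make ingester-help lists the accepted values):

# 'Patient' is an illustrative endpoint name, not necessarily a real one
make ingester-endpoint ENDPOINT='Patient'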

API Endpoint Groups

The pipeline ingests data from three endpoint groups:

| Group | Variable | Description |
|-------|----------|-------------|
| 🔵 Transaction | DATALAKE_APIS | Clinical data (Studies, Patients, Vitals, etc.) |
| 🟢 Master | DATALAKE_APIS_MASTER | Reference data (Sites, Dosages, Countries, etc.) |
| 🟡 CRF | DATALAKE_APIS_CRF | Case Report Forms |
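
The variable names suggest these endpoint lists are configured through the environment; assuming they are defined in .env, you can inspect what each group currently targets:

# Show the configured endpoint lists for all three groups (assumes they are set in .env)
grep -E '^DATALAKE_APIS(_MASTER|_CRF)?=' .env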