Data Pipeline - Ingester

This guide covers running the data ingestion pipeline, which extracts clinical trial data from the Frappe API and loads it into the PostgreSQL bronze layer.

Quick Commands

For the most common ingester commands (run, full-refresh, cron), see the Platform Runbook → Data Lakehouse Pipeline. This page covers advanced configuration and standalone setup.


Prerequisites

  • Python 3.10+
  • Docker with Compose (the commands on this page use docker compose)
  • PostgreSQL/Neon database credentials

Quick Start

The ingester runs as a Docker container under the lakehouse Compose profile:

# Define the compose command
DC="docker compose --env-file .env.production --profile lakehouse"

# Start the lakehouse database
$DC up -d lakehouse-db

# Run ingestion
$DC run --rm lakehouse-ingester

Standalone Setup

To deploy the ingester on its own, follow these steps:

1. Configure Environment

Copy the example environment file:

cp .env.production .env

Edit .env with your database credentials:

# Database Connection
DB_HOST=your-neon-host.aws.neon.tech
DB_PORT=5432
DB_USER=neondb_owner
DB_PASSWORD=your-password
DB_NAME=neondb
DB_SSLMODE=require

# Pipeline Settings
DLT_DESTINATION=postgres
DLT_DATASET_NAME=bronze
TABLE_PREFIX=tbl_mst_
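
Before building the image, it can help to confirm that every setting listed above is present in the env file. The following is a minimal sketch, not part of the ingester: check_env is a hypothetical helper, and the list of required keys is assumed to be exactly the variables shown in this section.

```shell
#!/usr/bin/env bash
# Sketch: report which settings are missing or empty in an env file.
# check_env is a hypothetical helper, not something the ingester ships.
check_env() {
  local file="$1" missing=0 key
  if [ ! -f "$file" ]; then
    echo "no such file: ${file}"
    return 1
  fi
  # Assumed required keys: the variables listed in this section.
  for key in DB_HOST DB_PORT DB_USER DB_PASSWORD DB_NAME DB_SSLMODE \
             DLT_DESTINATION DLT_DATASET_NAME TABLE_PREFIX; do
    # A key counts as set only if something follows the '=' sign.
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing or empty: ${key}"
      missing=1
    fi
  done
  return "$missing"
}
```

After sourcing the function, `check_env .env && echo "environment looks complete"` prints one line per missing key and returns a non-zero status if any are absent.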

2. Build Ingester Image

DC="docker compose --env-file .env.production --profile lakehouse"
$DC build lakehouse-ingester

3. Run Pipeline

Run the pipeline for all configured endpoints:

$DC run --rm lakehouse-ingester

Available Commands

DC="docker compose --env-file .env.production --profile lakehouse"
Command                                             Description
$DC run --rm lakehouse-ingester                     Run all endpoints
$DC run --rm lakehouse-ingester --help              Show ingester options
$DC run --rm lakehouse-ingester --endpoint '...'    Run single endpoint
$DC run --rm lakehouse-ingester --purge             Drop and recreate schemas
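
To run a handful of endpoints one after another, the single-endpoint command can be wrapped in a loop. This is a sketch under assumptions: run_endpoints and the DRY_RUN switch are hypothetical names introduced here, and the endpoint values themselves are whatever your configuration accepts for --endpoint.

```shell
#!/usr/bin/env bash
# Sketch: invoke the ingester once per endpoint name passed as an argument.
# run_endpoints and DRY_RUN are hypothetical, for illustration only.
DC="docker compose --env-file .env.production --profile lakehouse"

run_endpoints() {
  local ep
  for ep in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Print the command instead of running it.
      echo "$DC run --rm lakehouse-ingester --endpoint '$ep'"
    else
      # $DC is left unquoted on purpose so it splits into command and arguments.
      # Stop at the first endpoint that fails.
      $DC run --rm lakehouse-ingester --endpoint "$ep" || return 1
    fi
  done
}
```

Setting DRY_RUN=1 before calling run_endpoints prints the commands without starting any containers, which is a cheap way to check quoting before a real run.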

API Endpoint Groups

The pipeline ingests data from three endpoint groups:

Group            Variable               Description
🔵 Transaction   DATALAKE_APIS          Clinical data (Studies, Patients, Vitals, etc.)
🟢 Master        DATALAKE_APIS_MASTER   Reference data (Sites, Dosages, Countries, etc.)
🟡 CRF           DATALAKE_APIS_CRF      Case Report Forms