
AI Analytics Pipeline

The CTMS AI Analytics Pipeline transforms raw clinical trial data from the Frappe API into conversational insights. Users can ask natural language questions and receive answers with KPI dashboards, charts, and actionable metrics.


Architecture Overview

┌─────────────────┐    ┌──────────────┐    ┌─────────────┐
│   Frappe API    │───▶│   Ingester   │───▶│   Bronze    │
│    (Source)     │    │  (DLT Hub)   │    │   Schema    │
└─────────────────┘    └──────────────┘    └──────┬──────┘
                                                  │
                                                  ▼
┌─────────────────┐    ┌──────────────┐    ┌─────────────┐
│   Gold Layer    │◀───│     DBT      │◀───│   Silver    │
│    (Marts)      │    │  Transforms  │    │    Layer    │
└────────┬────────┘    └──────────────┘    └─────────────┘
         │
         ▼
┌─────────────────┐    ┌──────────────┐    ┌─────────────┐
│    Cube.dev     │───▶│  MCP Agent   │───▶│   Chat UI   │
│ Semantic Layer  │    │  (FastAPI)   │    │   (React)   │
│      :4000      │    │    :8006     │    │    :3000    │
└─────────────────┘    └──────────────┘    └─────────────┘

Technology Stack

Layer            Technology          Purpose
Ingestion        DLT Hub (Python)    Extract data from Frappe API to Bronze
Transformation   DBT + Elementary    Bronze → Silver → Gold with observability
Semantic         Cube.dev            Pre-aggregations, metrics, dimensions
AI Agent         FastAPI + MCP       OpenAI integration with tool calling
Chat UI          React + Vite        Embeddable chat widget
Database         PostgreSQL/Neon     Serverless PostgreSQL

Data Layers

Layer    Schema   Models                  Purpose
Bronze   bronze   Raw tables              Direct copy from Frappe API
Silver   silver   stg_*, int_*            Cleaned, typed, standardized
Gold     gold     dim_*, fct_*, cdisc_*   Analytics-ready, CDISC compliant
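
Downstream consumers (Cube.dev and ad-hoc analysis) should read only from the gold schema. The snippet below illustrates the naming convention with a direct query in Python; the DATABASE_URL variable and the gold.dim_study table are assumptions based on the dim_* convention, so substitute the actual connection string and model names.

import os
import psycopg2  # pip install psycopg2-binary

# Connection string for the Neon/PostgreSQL warehouse (assumed to be exported as DATABASE_URL).
conn = psycopg2.connect(os.environ["DATABASE_URL"])

with conn, conn.cursor() as cur:
    # gold.dim_study is a hypothetical dimension table following the dim_* convention.
    cur.execute("SELECT count(*) FROM gold.dim_study WHERE status = 'Active'")
    (active_studies,) = cur.fetchone()
    print(f"Active studies: {active_studies}")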

DBT Model Categories

  • Staging (stg_*) — Source cleaning and typing
  • Intermediate (int_*) — Business logic transformations
  • Dimensions (dim_*) — Master data entities (studies, sites, subjects)
  • Facts (fct_*) — Transactional/event data (visits, AEs, labs)
  • CDISC (cdisc_*) — CDISC SDTM domains (DM, AE, VS, LB, etc.)
  • Semantic (sem_*) — Pre-aggregated views for analytics

Repository Structure

The pipeline is consolidated in a single monorepo:

ctms-data-pipeline-ai-analytics/
├── .env.production                  # Shared environment configuration
├── Makefile                         # Master build automation
├── README.md                        # Project documentation
│
├── ctms-ingester/                   # Bronze Layer (DLT Hub)
│   ├── bot_frappe_api_to_db.py
│   └── requirements.txt
│
├── ctms-dbt/                        # Silver/Gold Layer (DBT)
│   ├── models/
│   │   ├── staging/                 # Silver layer
│   │   ├── intermediate/            # Business logic
│   │   └── marts/                   # Gold layer
│   └── dbt_project.yml
│
├── ctms-cube/                       # Semantic Layer (Cube.dev)
│   ├── model/cubes/                 # Cube definitions
│   └── cube.js
│
├── ctms-mcp-server/                 # AI Agent (MCP + FastAPI)
│   ├── index.js                     # MCP server
│   └── agent_service_mcp_client.py
│
└── ctms-ai-chat/                    # Chat UI (React)
    └── src/components/

Quick Start

Prerequisites

  • Node.js 20+
  • Python 3.10+
  • PostgreSQL/Neon database

Setup

# Clone repository
git clone https://github.com/zynomilabs/ctms-data-pipeline-ai-analytics.git
cd ctms-data-pipeline-ai-analytics

# Copy environment file
make env-copy

# Install all dependencies
make setup

Run Data Pipeline

# Ingest from Frappe API to bronze layer
make ingester-run

# Transform to silver/gold layers
make dbt-run

# Or run full pipeline
make pipeline
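
Under the hood, make ingester-run executes the DLT pipeline in ctms-ingester/bot_frappe_api_to_db.py. The sketch below shows roughly what such a pipeline looks like; the DocType names, environment variables, and pagination settings are illustrative assumptions, not the script's actual contents.

import os
import dlt
import requests

# Illustrative bronze-layer ingestion: pull Frappe DocTypes over REST and load them
# into the bronze schema with DLT. Env var names and DocTypes below are assumptions.
FRAPPE_URL = os.environ["FRAPPE_URL"]            # e.g. https://ctms.example.com
FRAPPE_TOKEN = os.environ["FRAPPE_API_TOKEN"]    # "api_key:api_secret"

def frappe_doctype(doctype: str):
    """Yield every document of a Frappe DocType via /api/resource/<DocType>."""
    resp = requests.get(
        f"{FRAPPE_URL}/api/resource/{doctype}",
        headers={"Authorization": f"token {FRAPPE_TOKEN}"},
        params={"limit_page_length": 0, "fields": '["*"]'},
        timeout=60,
    )
    resp.raise_for_status()
    yield from resp.json()["data"]

pipeline = dlt.pipeline(
    pipeline_name="ctms_frappe",
    destination="postgres",    # credentials come from DLT config / environment
    dataset_name="bronze",     # tables land in the bronze schema
)

for doctype in ["Study", "Subject", "Adverse Event"]:
    pipeline.run(frappe_doctype(doctype), table_name=doctype.lower().replace(" ", "_"))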

Start Services

# Start Cube.dev + MCP Agent together
make stack

# Or start individually in separate terminals:
make cube-dev # Terminal 1: Semantic layer :4000
make mcp-agent # Terminal 2: AI agent :8006
make chat-dev # Terminal 3: Chat UI :3000
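
make ports-check (listed below) reports port status from the Makefile; the snippet below is an equivalent smoke test in Python. Cube.dev ships a /readyz readiness endpoint, while the agent and UI routes shown here are assumptions (FastAPI's auto-generated /docs page and the Vite dev server root).

import requests

# Quick smoke test that the local stack is responding.
checks = {
    "Cube.dev":  "http://localhost:4000/readyz",  # Cube's built-in readiness probe
    "MCP Agent": "http://localhost:8006/docs",    # FastAPI interactive docs (assumed enabled)
    "Chat UI":   "http://localhost:3000/",        # Vite dev server
}

for name, url in checks.items():
    try:
        status = requests.get(url, timeout=3).status_code
        print(f"{name:<10} {url} -> {status}")
    except requests.RequestException as exc:
        print(f"{name:<10} {url} -> DOWN ({type(exc).__name__})")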

Available Commands

Setup & Environment

Command         Description
make setup      Install all project dependencies
make env-copy   Copy .env.production to .env
make help       Show all available commands

Data Pipeline

Command             Description
make ingester-run   Run bronze layer ingestion
make dbt-run        Run DBT transformations
make dbt-build      Run models + tests
make pipeline       Full pipeline (ingester + dbt)

Services

Command           Description
make stack        Start Cube + MCP Agent (background)
make stack-stop   Stop all stack services
make cube-dev     Start Cube.dev (:4000)
make mcp-agent    Start MCP Agent (:8006)
make chat-dev     Start Chat UI (:3000)

Utilities

Command            Description
make ports-check   Show status of dev ports
make kill-all      Kill all dev servers
make dbt-debug     Verify database connection

MCP Tools

The AI agent exposes these tools via the Model Context Protocol:

Tool                 Description
cube_get_meta        Get metadata about cubes, measures, dimensions
cube_query           Execute queries against semantic layer
cube_get_sql         Get generated SQL for queries
cube_list_cubes      List all available cubes
cube_describe_cube   Describe a specific cube
cube_clinical_kpis   Get pre-defined clinical KPIs
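
The tools can also be exercised outside the agent, for example from a Python MCP client connected to the Node server over stdio. This is a minimal sketch using the official mcp SDK; the argument shape passed to cube_query is an assumption, so use list_tools() or cube_get_meta to inspect the real input schemas.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the Node MCP server as a subprocess and talk to it over stdio.
    server = StdioServerParameters(command="node", args=["ctms-mcp-server/index.js"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])  # cube_query, cube_get_meta, ...

            # The query payload below is an assumed shape; check the tool's input schema.
            result = await session.call_tool(
                "cube_query",
                arguments={"query": {"measures": ["Subjects.count"],
                                     "dimensions": ["Subjects.status"]}},
            )
            print(result.content)

asyncio.run(main())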

Example Chat Queries

Ask natural language questions about clinical trial data:

"How many subjects are enrolled across all studies?"
"Show me adverse events by severity"
"What is the enrollment rate by site?"
"List all active studies with their patient counts"
"Generate a CDISC DM domain extract"

KPI Queries

"Show total subjects, active studies, and adverse event count"
"Create a dashboard with enrollment metrics by study"
"Display adverse events by system organ class"

Cube Definitions

The semantic layer includes these cubes:

Cube            Key Measures            Key Dimensions
Studies         count, avgDuration      phase, status, type
Subjects        count, avgAge           sex, race, status
Enrollments     count, screenFailRate   studyId, siteId, status
AdverseEvents   count, saeCount         severity, aeterm, studyId
VitalSigns      avgValue, count         vstestcd, studyId
LabTests        avgValue, count         lbtestcd, studyId
Visits          count, completionRate   visitType, status
Sites           count, activeSites      country, status
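
The same cubes can be queried directly through Cube's REST API, which is what the MCP tools do behind the scenes. A minimal sketch using the AdverseEvents cube is shown below; in production the Authorization header must be a JWT signed with CUBEJS_API_SECRET, while dev mode is typically more permissive.

import json
import requests

# Query the semantic layer over Cube's REST API (GET /cubejs-api/v1/load).
query = {
    "measures": ["AdverseEvents.count", "AdverseEvents.saeCount"],
    "dimensions": ["AdverseEvents.severity"],
}

resp = requests.get(
    "http://localhost:4000/cubejs-api/v1/load",
    params={"query": json.dumps(query)},
    headers={"Authorization": "<jwt-signed-with-CUBEJS_API_SECRET>"},  # placeholder token
    timeout=30,
)
resp.raise_for_status()
for row in resp.json()["data"]:
    print(row)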

Daily Operations

Running Daily Pipeline

# Full daily run with Elementary observability
make dbt-daily

This executes:

  1. dbt build — Run all models and tests
  2. dbt run --select elementary — Build observability tables
  3. edr report — Generate HTML quality report

Cache Management

After a DBT refresh, clear the Cube.dev cache so queries reflect the new data:

# Local development
make cube-cache-bust-local

# Production
make cube-cache-bust
