# A machine learning pipeline for early AKI detection using the MIMIC-IV critical care database
## Clinical motivation and study design
Acute kidney injury (AKI) is a common and serious complication in mechanically ventilated ICU patients, associated with increased mortality, prolonged hospital stays, and long-term renal impairment. Early identification of patients at risk enables timely interventions including fluid management, nephrotoxin avoidance, and early renal consultation.
This project builds a complete predictive modeling pipeline using the MIMIC-IV v3.1 critical care database. The outcome of interest is AKI Stage 2 or higher (KDIGO criteria) within 7 days of mechanical ventilation initiation.
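As a rough illustration of the outcome definition, the creatinine-based KDIGO staging logic can be sketched as follows (the function name and inputs are hypothetical; the real implementation lives in the cohort-construction notebook, and the 0.3 mg/dL rise-within-48h rule for Stage 1 is omitted here for brevity):

```r
# Stage AKI from peak serum creatinine (scr) and baseline creatinine,
# both in mg/dL, following the KDIGO creatinine ratio criteria.
# Simplified sketch: the absolute-rise rule for Stage 1 and the
# acute-rise qualifier for the SCr >= 4.0 criterion are not shown.
kdigo_stage <- function(scr, baseline) {
  ratio <- scr / baseline
  dplyr::case_when(
    ratio >= 3 | scr >= 4.0 ~ 3L,  # Stage 3: >=3x baseline or SCr >= 4.0
    ratio >= 2              ~ 2L,  # Stage 2: 2.0-2.9x baseline
    ratio >= 1.5            ~ 1L,  # Stage 1: 1.5-1.9x baseline
    TRUE                    ~ 0L   # no AKI by creatinine criteria
  )
}

# Outcome flag used in this project: Stage 2+ within the 7-day window,
# e.g. kdigo_stage(peak_scr_7d, baseline_scr) >= 2
```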
- **Data:** MIMIC-IV v3.1 — 330,000+ hospital admissions from Beth Israel Deaconess Medical Center (2008–2022)
- **Outcome:** AKI Stage 2+ within 7 days of intubation, defined by KDIGO creatinine criteria with imputed baseline
- **Models:** Logistic regression, LASSO, random forest, XGBoost, and SVM with MICE imputation across 5 imputed datasets
- **Pipeline:** Reproducible, end-to-end workflow from raw EHR data to model interpretation
The analysis is organized as five sequential R Markdown notebooks:

| Notebook | Focus | Description |
|---|---|---|
| `01_cohort_construction.Rmd` | R, Clinical | Identify mechanically ventilated patients (≥24h), apply clinical exclusion criteria (ESRD, elective surgery, pediatric), define AKI outcomes using KDIGO staging with CKD-EPI imputed baseline creatinine. |
| `02_feature_engineering.Rmd` | R | Extract 74 candidate predictors from labs (28 categories), vitals, vasopressors, fluid balance, and ICD-coded comorbidities. Derive BMI, P/F ratio, SOFA components, and driving pressure. |
| `03_imputation.Rmd` | Statistics | Characterize missingness patterns across all features. Apply MICE (Multivariate Imputation by Chained Equations) to generate 5 complete datasets with principled uncertainty propagation. |
| `04_machine_learning.Rmd` | ML | Train and evaluate logistic regression, LASSO, random forest, XGBoost, and SVM using tidymodels. Pool predictions across all 5 imputed datasets following Rubin's rules. |
| `05_feature_analysis.Rmd` | ML | SHAP-based global and local feature importance for the best-performing model. Identify which clinical variables drive AKI risk predictions. |

## Systematic exclusion criteria following CONSORT guidelines
Starting from all ICU admissions with mechanical ventilation ≥24 hours, sequential exclusion criteria were applied to arrive at a clinically homogeneous cohort of 10,085 stays suitable for AKI prediction modeling.
## Patient characteristics stratified by 7-day AKI outcome (post-MICE, Imputation 1)
| Characteristic | Overall (N=10,085) | No AKI Progression (n=8,534) | AKI Stage 2+ (n=1,551) | p-value |
|---|---|---|---|---|
| Age, years (median [IQR]) | 65.0 [53.0, 75.0] | 64.0 [53.0, 75.0] | 67.0 [56.0, 77.0] | <0.001 |
| Sex — Female | 4,175 (41.4%) | 3,542 (41.5%) | 633 (40.8%) | 0.630 |
| Sex — Male | 5,910 (58.6%) | 4,992 (58.5%) | 918 (59.2%) | |

*Full demographics table (57 characteristics) available in the complete analysis report.*
## Comparing five ML approaches with pooled predictions across 5 MICE imputations
| Model | AUROC | AUPRC |
|---|---|---|
| Logistic Regression | 0.8009 | 0.4434 |
| XGBoost | 0.8003 | 0.4363 |
| LASSO | 0.8000 | 0.4434 |
| Random Forest | 0.7923 | 0.4282 |
| SVM | 0.7855 | 0.4195 |

*Pooled predictions across 5 MICE imputations. Models ordered by descending AUROC.*
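For predictions on the probability scale, pooling across imputations reduces to averaging the per-imputation predicted probabilities before computing metrics. A sketch with yardstick (object and column names are hypothetical; `preds` is assumed to hold one prediction column per imputed dataset):

```r
library(dplyr)
library(yardstick)

# preds: one row per held-out stay, with columns .pred_1 ... .pred_5
# (predicted AKI probability from the model fit on each imputed dataset)
# and the observed outcome as a two-level factor.
pooled <- preds |>
  mutate(.pred_pooled = rowMeans(across(.pred_1:.pred_5)))

# Discrimination metrics on the pooled probabilities
roc_auc(pooled, truth = outcome, .pred_pooled, event_level = "second")
pr_auc(pooled,  truth = outcome, .pred_pooled, event_level = "second")
```

Averaging probabilities is the natural analogue of Rubin's rules for point predictions; pooling of coefficient estimates and their variances applies at the model level rather than the prediction level.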
| Model | Threshold | Sensitivity | Specificity | Youden’s J |
|---|---|---|---|---|
| Logistic Regression | 0.1600 | 71.4% | 75.3% | 0.4666 |
| XGBoost | 0.1300 | 78.5% | 69.4% | 0.4782 |
| LASSO | 0.1700 | 68.8% | 77.4% | 0.4626 |
| Random Forest | 0.1700 | 74.3% | 70.9% | 0.4522 |
| SVM | 0.0500 | 100.0% | 0.0% | 0.0000 |

*Thresholds derived by maximizing Youden’s J = sensitivity + specificity − 1 on the held-out test set.*
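Threshold selection by Youden's J can be sketched directly from an ROC curve, since `yardstick::roc_curve()` returns sensitivity and specificity for every candidate cutpoint (object and column names here are hypothetical):

```r
library(dplyr)
library(yardstick)

# test_preds: held-out predictions with the observed outcome factor and a
# pooled probability column. roc_curve() yields one row per candidate
# .threshold; Youden's J = sensitivity + specificity - 1.
best <- roc_curve(test_preds, truth = outcome, .pred_pooled,
                  event_level = "second") |>
  mutate(j = sensitivity + specificity - 1) |>
  slice_max(j, n = 1)

best$.threshold  # operating threshold of the kind reported above
```

A degenerate result like the SVM row (100% sensitivity, 0% specificity, J = 0) indicates the chosen cutpoint fell below every predicted probability, so no threshold separated the classes.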
## SHAP-based interpretation of the XGBoost model
SHAP (SHapley Additive exPlanations) values provide both global feature rankings and patient-level explanations for each prediction. This helps clinicians understand which variables drive AKI risk predictions and build trust in model outputs.
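With shapviz (one of the listed packages), the SHAP workflow for an xgboost model typically looks like the following sketch (object names such as `xgb_fit` and `X_test` are placeholders, not the notebook's actual variables):

```r
library(shapviz)
library(xgboost)

# X_test: numeric feature matrix for the held-out set (placeholder name).
# shapviz() computes SHAP values from a fitted xgboost booster.
shp <- shapviz(xgb_fit, X_pred = as.matrix(X_test))

sv_importance(shp, kind = "beeswarm")  # global ranking of AKI risk drivers
sv_waterfall(shp, row_id = 1)          # local explanation for one patient
```

The beeswarm plot gives the global feature ranking; the waterfall plot decomposes a single patient's predicted risk into per-feature contributions.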
## Assessing reliability of predicted probabilities
| Model | Brier Score |
|---|---|
| Logistic Regression | 0.1070 |
| LASSO | 0.1071 |
| XGBoost | 0.1075 |
| Random Forest | 0.1088 |
| SVM | 0.1304 |

*Brier score = mean((predicted_prob − observed_outcome)²). Lower is better. Pooled across 5 MICE imputations.*
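The Brier score itself is a one-liner; a sketch with hypothetical vectors `p` (pooled predicted probability) and `y` (observed outcome coded 0/1):

```r
# Mean squared difference between predicted probability and outcome.
# 0 is perfect; a constant 0.5 prediction scores 0.25.
brier <- mean((p - y)^2)
```

Note that the Brier score rewards calibration as well as discrimination, which is why the well-calibrated logistic models edge out the tree ensembles here despite similar AUROC.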
## Fully reproducible pipeline
Full analysis pipeline available on GitHub. All scripts are documented R Markdown files that can be rendered with knitr.
This project uses MIMIC-IV v3.1. Access requires PhysioNet credentialing and a signed data use agreement. No patient data is included in this repository.
A Shiny app for interactive cohort exploration is included in the repository (shiny_app/). Clone the repo and run locally with shiny::runApp("shiny_app/shiny_app").
**Requirements:** R ≥ 4.5 • **Key packages:** tidyverse, tidymodels, xgboost, mice, shapviz, teal, shiny
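Assuming all listed packages are available from CRAN, a one-shot setup might look like:

```r
# Install the packages used across the five notebooks and the Shiny app
install.packages(c(
  "tidyverse", "tidymodels", "xgboost", "mice",
  "shapviz", "teal", "shiny"
))
```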