Development and Validation of a Machine Learning Algorithm for Predicting the Risk of Postpartum Depression among Pregnant Women

Objective- There is a scarcity of tools to predict postpartum depression (PPD). We propose a machine learning framework for PPD risk prediction using data extracted from electronic health records (EHRs). Methods- Two EHR datasets, containing data on 15,197 women from 2015 to 2018 at a single site and 53,972 women from 2004 to 2017 at multiple sites, were used as development and validation sets, respectively, to construct the PPD risk prediction model. The primary outcome was a diagnosis of PPD within 1 year following childbirth. A framework of data extraction, processing, and machine learning was implemented to select a minimal list of features from the EHR datasets to ensure model performance and to enable future point-of-care risk prediction. Results- The best-performing model uses clinical features related to mental health history, medical comorbidity, obstetric complications, medication prescription orders, and patient demographic characteristics. Model performance, as measured by area under the receiver operating characteristic curve (AUC), was 0.937 (95% CI 0.912–0.962) and 0.886 (95% CI 0.879–0.893) in the development and validation datasets, respectively. Model performance was consistent when tested using data ending at multiple time periods during pregnancy and at childbirth. Limitations- The prevalence of PPD in the study data represented a treatment prevalence and is likely lower than the illness prevalence.


Introduction
Postpartum depression (PPD) is a potentially life-threatening mental health condition that occurs up to one year following childbirth (Stewart and Vigod, 2016). The prevalence of PPD is estimated to affect as many as 1 in 7 mothers in the US (Hahn-Holbrook et al., 2017;Wisner et al., 2013), but underdiagnosis and lack of treatment for PPD are common, especially among women with low socioeconomic status (Biaggi et al., 2016;O'Connor et al., 2019). Long-term health effects of PPD to mothers, children, and family include increased maternal and infant mortality, increased hospitalizations, impaired mother-child bonding, and impaired long-term child development (Field, 2010;Jacques et al., 2019;Stein et al., 2014;Weobong et al., 2015). The disease mechanism of PPD is multifactorial. Clinically, a history of mental illness is the most significant risk factor (Meltzer-Brody et al., 2018;Stewart and Vigod, 2016). Social determinants of health (SDoH), including poor marital relationship, low socioeconomic status, and stressful life events are also known contributors to increased PPD risk (Biaggi et al., 2016). New research indicates that there may be additional biomarkers associated with the risk for developing PPD such as excessive proinflammatory immune system activation, possible disruptions in fatty acid metabolism, disruptions in hypothalamic-pituitary-adrenal (HPA) functioning, altered neurosteroid physiology, and genetic and epigenetic signatures (Serati et al., 2016).
The importance of PPD prevention and timely intervention cannot be overstated. The American College of Obstetricians and Gynecologists (ACOG) (Committee, 2018), the American Academy of Pediatrics (AAP) (Earls et al., 2019), the US Preventive Services Task Force (O'Connor et al., 2019), and several other organizations (Stewart and Vigod, 2016) have guidelines and recommendations for universal PPD screening as part of usual care during pregnancy and the postpartum period. Current PPD prevention strategies focus on secondary rather than primary prevention, using questionnaire-based screening instruments such as the Edinburgh Postnatal Depression Scale (EPDS) (Cox et al., 1987) and Patient Health Questionnaire-9 (PHQ-9) (Löwe et al., 2004) to detect symptoms. Primary prevention techniques intervene in an illness course prior to symptom onset while secondary prevention techniques intervene soon after the symptom onset, but prior to the full manifestation of the illness. Unfortunately, it has been demonstrated that in women known to be at high risk of PPD, delaying intervention until the onset of symptoms only mildly attenuates risk for depression, while intervening with appropriately targeted prevention before the onset of symptoms substantially mitigates depression relapse risk (Cohen et al., 2006). In addition to being "too little, too late" from a clinical perspective, these screening tools present major feasibility problems for both large and smaller health systems (Beck and Gable, 2000;Gjerdingen and Yawn, 2007). In order to come into compliance with current screening recommendations, obstetric practices often require substantial change, including not just changes to clinical workflows, but also staffing changes, new electronic health records (EHR) workflow builds, collaboration with referral networks, and investment in staff and provider training. 
Even then, further challenges persist such as mental health-related stigma, limitations in provider time, attention, and expertise, and scarcity in specialized mental health treatment resources.
We argue that taking a primary prevention approach has the promise of reducing the investment and resources required to address PPD while at the same time reducing the incidence of PPD. In this work, to identify signals that may suggest elevated future risk of PPD, we propose a data-driven primary prevention approach that applies machine learning to EHR data (Loudon et al., 2016;Wang et al., 2019). EHR data can be collected and analyzed routinely on a large scale using machine learning, as demonstrated by successful data-driven clinical decision support (Shortliffe and Sepúlveda, 2018) applications that assist with decision making across clinical conditions (Goldstein et al., 2017;Liang et al., 2019;Rajkomar et al., 2018;Tomašev et al., 2019).
We developed an end-to-end framework (Fig. 1) to extract features from EHR data for processing, including demographics, clinical diagnoses, medication prescriptions, laboratory results, and unstructured clinical notes. These data are passed to an optimization process that selects important features, which are then incorporated in multiple machine learning algorithms, including regularized logistic regression, random forest, decision tree, extreme gradient boosting (XGBoost), and multilayer perceptron (MLP) (Bishop, 2006), to predict the risk of PPD. The framework was implemented and evaluated using data available at different time points during pregnancy (12, 18, 24, and 30 weeks) and at childbirth.
We aim to demonstrate that the data-driven primary intervention approach provides an opportunity for individualized therapeutic interventions such as changing screening timelines, engaging with appropriate preventive strategies, or tailoring clinician PPD counseling time according to a patient's needs. To the best of our knowledge, this study is among the first to develop an EHR-based machine learning framework for identifying women at risk for PPD (Jiménez-Serrano et al., 2015;Tortajada et al., 2009;Wang et al., 2019;Zhang et al., 2020).

Inclusion Criteria
All pregnant women with fully completed antenatal care procedures who had live births of infants were included in the study. The exclusion criteria were (1) maternal age below 18 or above 45, or (2) lack of outpatient, inpatient or emergency room encounter information in the EHR data within 1 year following childbirth. Participants with a prior history of mental illness and participants with active mental illness were not excluded to ensure clinical applicability in real implementation (Fig. 2). The study was approved by the Institutional Review Board at Weill Cornell Medicine (IRB protocol# 1711018789). Data extraction and analysis were performed in 2019.
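The eligibility rules above can be illustrated with a minimal sketch. The `Delivery` record, its field names, and the 365-day follow-up window are assumptions made for illustration; they are not the study's actual data model or extraction code.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Delivery:
    """Hypothetical per-delivery record (field names are illustrative)."""
    patient_id: str
    birth_date: date              # mother's date of birth
    delivery_date: date
    encounter_dates: List[date]   # outpatient, inpatient, or emergency room encounters


def age_at_delivery(d: Delivery) -> int:
    """Maternal age in completed years on the delivery date."""
    had_birthday = (d.delivery_date.month, d.delivery_date.day) >= (
        d.birth_date.month,
        d.birth_date.day,
    )
    return d.delivery_date.year - d.birth_date.year - (0 if had_birthday else 1)


def is_eligible(d: Delivery, follow_up_days: int = 365) -> bool:
    """Apply the two exclusion criteria: age outside 18-45, or no follow-up encounter."""
    age = age_at_delivery(d)
    if age < 18 or age > 45:
        return False
    window_end = d.delivery_date + timedelta(days=follow_up_days)
    # At least one encounter must fall within the year following childbirth.
    return any(d.delivery_date < e <= window_end for e in d.encounter_dates)
```

Note that mental health history is deliberately not part of this filter, which mirrors the decision to keep those participants in the cohort.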

Outcome-
The outcome is defined as having a diagnosis of PPD within 1 year of childbirth. A PPD diagnosis was defined using Systematized Nomenclature of Medicine (SNOMED) codes and the use of antidepressants within 1 year following childbirth (Dietz et al., 2007;Stewart and Vigod, 2016). The specific SNOMED codes for PPD definition are listed in Appendix (Table A1). The use of antidepressants was defined by Anatomical Therapeutic Chemical (ATC) codes under N06A (Petersen et al., 2018). To ensure that antidepressants were primarily used for treatment of mental health conditions, and not for other indications such as pain, we further excluded the following medications: Amitriptyline, Clomipramine, Duloxetine, Flupentixol, and Nortriptyline (Schofield et al., 2016).
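A minimal sketch of this outcome definition follows. It assumes that either a qualifying diagnosis or a qualifying antidepressant prescription within the window is sufficient (the text does not spell out how the two are combined), that the window is 365 days, and that records arrive as simple tuples. The SNOMED set is a placeholder because Table A1 is not reproduced here.

```python
from datetime import date, timedelta
from typing import Iterable, Set, Tuple

# Placeholder only: the real SNOMED concept list is given in Table A1.
PPD_SNOMED_CODES: Set[str] = {"SNOMED_PPD_PLACEHOLDER"}

# Drugs excluded because they are often prescribed for non-psychiatric indications.
EXCLUDED_DRUGS = {"amitriptyline", "clomipramine", "duloxetine", "flupentixol", "nortriptyline"}


def is_qualifying_antidepressant(atc_code: str, ingredient: str) -> bool:
    """True for ATC N06A drugs that are not on the exclusion list."""
    return atc_code.upper().startswith("N06A") and ingredient.strip().lower() not in EXCLUDED_DRUGS


def has_ppd_outcome(
    delivery_date: date,
    diagnoses: Iterable[Tuple[str, date]],           # (snomed_code, diagnosis_date)
    prescriptions: Iterable[Tuple[str, str, date]],  # (atc_code, ingredient, order_date)
    window_days: int = 365,
) -> bool:
    """Label a delivery as PPD-positive under the assumed 'diagnosis OR prescription' rule."""
    window_end = delivery_date + timedelta(days=window_days)

    def in_window(d: date) -> bool:
        return delivery_date <= d <= window_end

    has_dx = any(code in PPD_SNOMED_CODES and in_window(d) for code, d in diagnoses)
    has_rx = any(
        is_qualifying_antidepressant(atc, name) and in_window(d)
        for atc, name, d in prescriptions
    )
    return has_dx or has_rx
```

Matching the excluded drugs by ingredient name rather than by ATC code is a design choice of this sketch; it avoids depending on each drug's exact position in the ATC hierarchy.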

Data Sources-
For algorithm development, EHR data on eligible patients, including demographics, diagnoses, medication prescriptions, procedures, laboratory measurements, and social determinants of health (SDoH) such as built environment characteristics (e.g., distance to public transportation and green space), were obtained at Weill Cornell Medicine (WCM) and NewYork-Presbyterian Hospital in New York City, USA between January 2015 and June 2018. For algorithm validation, EHR data were derived from multiple health systems across New York City affiliated with the Patient-Centered Outcomes Research Institute funded New York City Clinical Data Research Network (NYC-CDRN) between August 2004 and October 2017 (Kaushal et al., 2014). We randomly selected 80% of the data from WCM as the training set, which was used for cross-validation and model tuning, and held out the remaining 20% as the test set. The NYC-CDRN data were used solely as a validation set. Both datasets were represented using the Observational Medical Outcomes Partnership (OMOP) Common Data Model to record patient demographics, encounter records, diagnostic codes, procedures, prescription medications, and laboratory measurements (Overhage et al., 2012). Diagnoses, laboratory measurements, and procedures are represented as SNOMED codes, Logical Observation Identifiers Names and Codes (LOINC), and Current Procedural Terminology (CPT) codes, respectively. Medications were standardized using the ATC classification system. In addition, marital status was extracted from unstructured clinical notes using regular expression-based searches, and individuals were classified as married or not married (single/divorced/widowed) at the time of childbirth. Age was calculated as the time difference between the patient's birth date and the delivery date.
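As an illustration of the regular expression-based marital status extraction, the sketch below looks for an explicit "marital status" field in a note. The pattern and the function name are hypothetical; the expressions actually used in the study are not given in the text.

```python
import re
from typing import Optional

# Hypothetical pattern: requires an explicit "marital status" label so that
# obstetric phrases such as "single intrauterine pregnancy" are not misread.
_MARITAL_FIELD = re.compile(
    r"marital\s+status\s*[:\-]?\s*(married|single|divorced|widowed)",
    re.IGNORECASE,
)


def classify_marital_status(note: str) -> Optional[str]:
    """Return 'married', 'not married', or None when the note has no usable mention."""
    match = _MARITAL_FIELD.search(note)
    if match is None:
        return None
    return "married" if match.group(1).lower() == "married" else "not married"
```

Anchoring on the field label trades recall for precision; free-text mentions without the label would need additional patterns.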
Mental health history before pregnancy was defined as having at least one diagnosis among organic disorders, substance-related disorders, schizophrenic/psychotic disorders, mood disorders, anxiety disorders, personality disorders, and other psychiatric disorders (Canada, 2015). Features with frequencies below 10 were omitted from the study to remove rare events during pregnancy. Missing numerical values were imputed using mean values. Discrete features, such as clinical diagnoses and prescribed medications, were coded as dummy features (Rodríguez et al., 2018). Numeric features were normalized to the range of −1 to 1. Statistical comparisons between the PPD and non-PPD groups were performed using Stata 14. Independent-sample t-tests assuming unequal variances and chi-square tests were used for continuous and categorical variables, respectively.

Fig. 1. Schematic diagram of our PPD prediction framework, describing the various steps involved in data preprocessing and risk model development.

The machine learning model training was optimized using sequential forward selection (SFS) (T. and G., 2015), a greedy search algorithm that searches for the combination of features that returns the maximum algorithm discriminatory power (Bradley, 1997). Starting with an empty feature set, SFS iteratively adds the feature that most improves the algorithm's performance until the stopping criteria for the search are reached (Fig. A1) (T. and G., 2015). Five machine learning algorithms were trained: random forest, decision tree, extreme gradient boosting (XGBoost), regularized logistic regression, and multilayer perceptron (MLP). These algorithms were developed by iteratively splitting the available data to detect collective patterns across features in a subset of the data that maximally discriminate outcome classes, followed by testing the performance on the held-out data. This training process allowed us to develop prediction algorithms that are generalizable to unseen data.
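The preprocessing steps described earlier in this section (dropping rare features, mean imputation, dummy coding, and scaling to the range of −1 to 1) can be sketched as follows. This is an illustrative sketch, not the study's code: min-max scaling is an assumption, since the text states only the target range, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple


def impute_mean(values: Sequence[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def scale_minus1_to_1(values: Sequence[float]) -> List[float]:
    """Min-max scale a numeric feature to [-1, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature carries no information
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def dummy_code(
    patient_codes: Sequence[Set[str]], min_count: int = 10
) -> Tuple[List[str], List[List[int]]]:
    """One 0/1 column per code, dropping codes seen in fewer than min_count patients."""
    counts: Dict[str, int] = {}
    for codes in patient_codes:
        for code in codes:
            counts[code] = counts.get(code, 0) + 1
    kept = sorted(code for code, n in counts.items() if n >= min_count)
    matrix = [[1 if code in codes else 0 for code in kept] for codes in patient_codes]
    return kept, matrix
```

In a train/test setting, the imputation mean and the scaling bounds would normally be estimated on the training split only and then reused on held-out data.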

Framework-
The schematic diagram of our PPD prediction framework is shown in Fig. 1.
Algorithm parameters were determined using a grid search for each algorithm that exhaustively searched for the hyperparameters that resulted in the highest model performance as measured by area under the receiver operating characteristic curve (AUC). The stopping criteria for SFS were defined as 1) no increase in the AUC of at least 0.001 over 10 consecutive iterations, or 2) the predetermined maximum number of features being reached. SFS was performed separately for women with and without a mental health history to ensure that the model can predict for both types of patients in actual use. We combined the features selected from both SFS runs into a single feature set so that a single algorithm can be used for patients with and without a history of mental illness. Using the combined features, each of the machine learning algorithms was trained using 5-fold cross-validation.
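The following sketch shows one way the SFS loop and its two stopping criteria could be written. It is an interpretation under stated assumptions rather than the authors' implementation: `score` is a caller-supplied function (for example, a cross-validated AUC), the 0.001 threshold is read as a minimum improvement over the best score so far, and the subset at the last improvement is returned.

```python
from typing import Callable, List, Sequence


def sequential_forward_selection(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],  # assumed: e.g. cross-validated AUC of a subset
    max_features: int,
    min_delta: float = 0.001,
    patience: int = 10,
) -> List[str]:
    """Greedy forward selection with a patience-based stopping rule."""
    selected: List[str] = []
    best_subset: List[str] = []
    best_score = float("-inf")
    stale = 0  # consecutive iterations without a sufficient AUC gain
    remaining = list(candidates)

    while remaining and len(selected) < max_features:
        # Try each remaining feature and keep the one giving the highest score.
        top = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(top)
        remaining.remove(top)
        current = score(selected)

        if current >= best_score + min_delta:
            best_score, best_subset, stale = current, list(selected), 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_subset


def combine_feature_sets(with_history: Sequence[str], without_history: Sequence[str]) -> List[str]:
    """Union of the features selected in the two separate SFS runs."""
    return sorted(set(with_history) | set(without_history))
```

Running the selection once per subgroup and passing both results to `combine_feature_sets` yields the single feature list that the final algorithm is trained on.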

Expert adjudicated feature selection-
Clinicians in our study team (AH and RJ) reviewed the selected features in the best-performing algorithm to validate feature inclusion and ensure algorithm interpretability. Starting with the entire list of features selected by SFS, we iteratively eliminated features that were determined to be irrelevant, reconstructed the algorithm using the adjudicated features, and measured the algorithm performance. This iterative process was performed while keeping the minimum AUC at 0.8. Features that were changed during this process are listed in the Appendix (Table A2).
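The adjudication loop can be pictured with the sketch below. It assumes that clinicians supply a list of flagged features and that a removal is kept only if the retrained model still reaches the AUC floor; the function names and the one-at-a-time order are illustrative assumptions, not the documented procedure.

```python
from typing import Callable, List, Sequence


def adjudicate_features(
    selected: Sequence[str],
    flagged_irrelevant: Sequence[str],    # features clinicians judged irrelevant
    score: Callable[[List[str]], float],  # assumed: retrains the model, returns its AUC
    min_auc: float = 0.8,
) -> List[str]:
    """Drop flagged features one at a time while the AUC stays at or above min_auc."""
    kept = list(selected)
    for feature in flagged_irrelevant:
        if feature not in kept:
            continue
        trial = [f for f in kept if f != feature]
        # Accept the removal only if the rebuilt model still meets the AUC floor.
        if trial and score(trial) >= min_auc:
            kept = trial
    return kept
```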

Zhang et al. J Affect Disord. Author manuscript; available in PMC 2022 January 15.

Evaluation-
The evaluation was performed using the held-out dataset at WCM and the entire dataset from NYC-CDRN using AUC, sensitivity, specificity, and the Brier score (Hanley and McNeil, 1982). AUC is an aggregate measure of the algorithm's ability to discriminate outcome classes across all possible classification thresholds. The Brier score measures the accuracy of probabilistic predictions as the mean squared difference between the predicted probability and the observed outcome (Rufibach, 2010). As such, a higher AUC and a lower Brier score indicate better prediction performance. To evaluate the algorithm performance in a simulated gestational period where data are being accumulated during pregnancy, we computed evaluation metrics using data available up to 5 different time points. Starting with each patient's first available pregnancy encounter, we created test datasets ending at 12, 18, 24, and 30 weeks of pregnancy, and also at childbirth, assuming that data at 12 weeks of pregnancy and at childbirth contain the least and the most complete information, respectively. Lastly, error analyses were conducted by manual chart review of patients' medical records for up to 2 years after childbirth for 150 false positives and false negatives. Machine learning algorithms were trained and evaluated using scikit-learn and seaborn in Python (3.6.5).
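For readers unfamiliar with these metrics, the sketch below computes them from scratch on binary labels and predicted probabilities. It is a plain-Python illustration rather than the scikit-learn code used in the study; the pairwise AUC is quadratic in sample size and the 0.5 threshold is an arbitrary assumption.

```python
from typing import Sequence, Tuple


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def sensitivity_specificity(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Sensitivity and specificity after thresholding the predicted probabilities."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)
```

On the toy input `y_true = [0, 0, 1, 1]` and `y_prob = [0.1, 0.4, 0.35, 0.8]`, three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75.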

Results
A total of 15,197 deliveries from January 2015 to June 2018 were included in our analysis, after excluding 124 women below age 18 or above age 45 at the time of delivery and 2,312 women without records of clinical encounters within 1 year following childbirth (Fig. 2). Study data were randomly split into training (N=12,157) and testing (N=3,040) sets, with cross-validation performed within the training set. The validation set contained 53,972 deliveries from August 2004 to October 2017, after excluding 1,903 deliveries by women below age 18 or above age 45 and 15,141 deliveries without encounters recorded within 1 year after childbirth (Fig. 2). The prevalence of depression was 6.7% (N=1,010) and 6.5% (N=3,513) in the WCM and NYC-CDRN datasets, respectively. Table 1 provides the descriptive statistics of the two datasets. We found significant differences in age, the number of emergency department (ED) visits, and racial distribution between the PPD and non-PPD groups in both the training and validation data. The average age at the time of delivery was 33.68 (SD=4.54) in the non-PPD group and 34.56 (SD=4.39) in the PPD group in the WCM dataset (p-value<0.001), and 28.87 (SD=6.20) and 30.70 (SD=6.13), respectively, in the NYC-CDRN dataset (p-value<0.001). The number of ED visits in the PPD group was higher than in the non-PPD group in both the WCM (1.68 ± 1.55 vs. 1.32 ± 1.24, p-value<0.001) and NYC-CDRN (6.30 ± 9.97 vs. 5.37 ± 6.87, p-value<0.001) datasets. The training and validation datasets had different distributions of PPD across racial groups. In the WCM data, the rate of PPD was the highest among White women (8.8%) and the lowest among Asian women (3.0%). In the NYC-CDRN data, the rate of PPD was the highest among White women (12.43%) and the lowest among Black women (4.76%).
Using SFS, 32 features were selected to be incorporated in the algorithm related to patient demographic statuses, health service utilization, mental health history, newly diagnosed mental health conditions during pregnancy, other obstetric and/or medical diagnoses during pregnancy, and vital signs. As shown in Table 2, the majority (28 out of 32) of the features included in the algorithm have statistically significant association with the outcome.
Features that are indicative of past and current mental health conditions and being single mothers were associated with higher odds of a PPD diagnosis. Additionally, complications during pregnancy such as palpitations, diarrhea, vomiting, and abdominal pain also were associated with higher odds of a PPD diagnosis. Health service utilization including medication prescriptions such as Beta blocking agents, delivery by cesarean, and emergency department (ED) visits were also associated with higher odds of a PPD diagnosis. Having an Asian race was associated with lower odds of a PPD diagnosis. Fig. A2 in the Appendix shows the Pearson correlation among the features.
Evaluation results of the algorithm performance are shown in Table 3. Logistic regression with L2 regularization was found to be the best-performing algorithm using data available up to childbirth. The AUC was 0.937 (95% CI: 0.912-0.962) and 0.886 (95% CI: 0.879-0.893) in the WCM and NYC-CDRN datasets, respectively. The AUC was lower in the validation dataset, potentially due to the lack of certain features such as marital status, which was available only in the WCM dataset. When evaluating the algorithm at different periods during pregnancy in the WCM dataset, we observed steady performance, with AUCs of 0.921, 0.919, 0.922, 0.921, and 0.937 using data extracted up to 12 weeks, 18 weeks, 24 weeks, and 30 weeks of gestation, and at childbirth, respectively. The steady performance may be explained by the early availability and invariability of the predictive features (see Table 2). In the NYC-CDRN dataset, we observed an increase in algorithm performance as more data accumulated over time, with AUCs of 0.810, 0.817, 0.821, 0.824, and 0.886 at 12 weeks, 18 weeks, 24 weeks, and 30 weeks of gestation, and at childbirth, respectively. Additionally, we report positive and negative predictive values in Table 3. While negative predictive values are close to 1 for nearly all models across time periods, positive predictive values are low, especially at the validation site. This could be explained by the relatively low prevalence of PPD and the high frequency of patients who were not diagnosed with PPD (based on our criteria) but were predicted to have it.
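The dependence of predictive values on prevalence follows from Bayes' rule and can be checked with a short sketch. The sensitivity and specificity values in the usage note are hypothetical, chosen only to illustrate the effect; they are not the values reported in Table 3.

```python
from typing import Tuple


def predictive_values(
    sensitivity: float, specificity: float, prevalence: float
) -> Tuple[float, float]:
    """Return (PPV, NPV) implied by sensitivity, specificity, and outcome prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical operating point at a prevalence close to the 6.5% seen in NYC-CDRN.
ppv, npv = predictive_values(sensitivity=0.8, specificity=0.9, prevalence=0.065)
print(f"PPV={ppv:.3f}, NPV={npv:.3f}")
```

With these illustrative inputs, the negative predictive value stays above 0.98 while the positive predictive value falls below 0.4, because false positives drawn from the large non-PPD group outnumber true positives.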
False-positive and false-negative results from the algorithm were evaluated by manual chart reviews of a randomly selected 150 cases that were incorrectly classified by the logistic regression classifier. The cases had 140 and 10 false positives and false negatives, respectively. PPD diagnosis after the study period and lack of proper coding were identified as two potential reasons for the false positives and negatives. For example, the manual chart review identified that 45% of the patients incorrectly predicted to develop PPD by the prediction algorithm were in fact women who were noted to be suffering from PPD in the clinical notes. Furthermore, 34% of the PPD mentions in the notes were made one year after childbirth, beyond our study period. Thus, the incorrect predictions were due to the lack of good coding practices for PPD, a phenomenon that is frequently observed in other observational mental health studies using EHRs (Stewart et al., 2019). The availability of predictors related to mental health history also presented challenges. For example, the error analysis identified the history of anxiety and depression on 36.4% of false negative cases through manual chart review. For these patients the mental health history was not coded in the structured EHR data. Extraction of features using natural language processing techniques may facilitate higher performance by the algorithm in future studies.

Discussion
Results from this study suggest a promising direction to leverage routinely collected EHR data to identify pregnant women at risk for PPD. Selected EHR-driven predictors characterize women's health history, pregnancy health, demographics, and healthcare utilization. Several known PPD risk factors from the literature were represented by variables extracted in the sequential feature selection process, including history of anxiety, mood disorder, and other mental disorders, antidepressant use, incidental mental health illnesses during pregnancy, cesarean section, and single motherhood (Forman et al., 2000;Stewart and Vigod, 2016). Our model further identifies additional comorbid predictors, including palpitations, diarrhea, vomiting during pregnancy, hypertensive disorders, and hypothyroidism. Among these comorbidities, thyroid dysfunction and hypertensive disorders have been associated with PPD onset in previous literature (Le Donne et al., 2017;Strapasson et al., 2018). Palpitation, a common cardiac symptom, may also be a symptom of depression that was discovered by the model (Alijaniha et al., 2016;Barsky et al., 1994). In addition, medication prescriptions of beta blocking agents and antihistamines were identified as predictors. The literature has reported the use of both beta blockers and antihistamines in association with depression, although not conclusively (Gerstman et al., 1996;Ozdemir et al., 2014;Yudofsky, 1992). Related to mode of delivery, our model selected cesarean section as a risk factor for PPD, as also studied in the previous literature (Carter et al., 2006;Xu et al., 2017). Lastly, the number of ED visits during pregnancy and postpartum may be an indicator of a lack of proper access to primary and obstetric care (Sheen et al., 2019).
As seen in our experiments, the risk computed by the PPD prediction algorithm updates in response to the new health information that accumulates over time with repeated visits during pregnancy, thereby potentially allowing care providers to take timely actions according to the risk evolution (Committee, 2018;Earls et al., 2019). With these automatically extractable features, an EHR-based prediction tool may assist with existing EHR interventions for screening to minimize variations across clinical practices in screening and information collection (Long et al., 2019). Previous studies have reported that while the rates of screening and referrals for mental health care can be high when obstetricians recognize a risk for PPD, they are low if symptoms are unnoticed by the care provider (Goodman and Tyer-Viola, 2010). Our risk prediction model, by identifying women with elevated risk, may assist with tacitly raising clinician awareness of PPD and potentially increasing screening and referral rates.

Limitations
Several limitations exist in our research. First, our study cohort, derived from the EHR of an urban academic medical center, is not representative of the general US population suffering from PPD and differs from cohorts reported in previous studies with respect to PPD prevalence (Hahn-Holbrook et al., 2017). The prevalence in our data is likely the treatment prevalence rather than the illness prevalence, as the data may not capture patients outside of the studied health system and geographical location. The prevalence may also reflect clinician coding practices for recording a diagnosis of PPD at the study sites. Persistent stigma and social consequences of having depression coded in the EHR may prevent providers from 'officially' coding the diagnosis even if it is made clinically. Further, also due to stigma, patients may withhold symptom information from providers, preventing accurate diagnosis. In addition to using diagnostic codes, we also defined PPD using antidepressant use while excluding antidepressants used for pain indications. However, it is possible that some antidepressants were used for anxiety rather than PPD. Anxiety disorders are so frequently comorbid with depression in the peripartum period that a diagnosis of one may even be a proxy for unidentified depression. Thus, we decided it was important not to exclude anxiety disorder indications even at the expense of specificity, although we recognize this as a limitation of our study. Our ongoing and future work will attempt to parse these indications further by applying natural language processing to the unstructured clinical notes.
Relatedly, in this study, we did not specifically include only patients with incident depression. This decision was meant to acknowledge the powerful effect that mental health histories have on risk for developing PPD as well as to provide a clinically meaningful risk stratification for real-world obstetric providers who have large cohorts of patients with mental health histories and those who are actively seeking treatment in their practices. Due to the lack of comprehensive screening at our health systems and clinics in the study sites, we did not capture EPDS and PHQ-9 scores to define PPD. We also did not compare effectiveness of primary prevention via the prediction algorithm to current widely recommended secondary prevention efforts via EPDS or PHQ-9 screening. However, we did compare with algorithms reported in prior literature as a potential primary intervention approach, and demonstrated improved model performance. Compared to prior work by Camdeviren et al (Camdeviren et al., 2007), Tortajada et al (Tortajada et al., 2009), and Natarajan et al (S. et al., 2017), our algorithm was built by exhaustively selecting most predictive features from a larger number of candidate features from the EHR data, with an eventual goal of integrating such risk prediction models within the EHR systems and clinical workflows. Furthermore, compared to our initial pilot work  which did not include prior mental health diagnosis and treatment history as predictors, the prediction algorithm from this study demonstrated a significant increase in AUC, sensitivity, and specificity.
We found White and Asian races to be predictive features in this study. However, race was unknown for a substantial proportion of patients in both the training and validation datasets, potentially due to lack of proper documentation in the EHR (Lee et al., 2016). This is an important area for further consideration in future studies (Sholle et al., 2019). Several future studies are being planned to address these limitations. These include a comparison of the data-driven primary intervention against usual care in a clinical trial, and additional validation work at study sites elsewhere in the US and abroad using datasets with different PPD prevalence to evaluate the algorithm's generalizability. While findings from this study show promise for PPD risk identification using available EHR data, we realize that EHR data capture only a limited portion of the aspects of patients' lives that contribute to PPD. Therefore, we will also evaluate whether the addition of patient-reported outcomes or information derived from mobile health devices, such as wearables, can contribute to higher algorithm performance. Lastly, improvements to the machine learning framework will include techniques to adjust for differing outcome distributions so that the method can be applied more generally to other populations.

Conclusions
In summary, this study demonstrates that a data-driven primary intervention approach using machine learning and EHR data may be leveraged to reduce the healthcare provider burden of identifying PPD risk. Methods created in this study may pave a path towards data-driven, accurate, and scalable clinical decision support for PPD risk identification with potential benefits through early prevention, diagnosis, and intervention.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Highlights
• Applying machine learning to electronic health records (EHR) data can preemptively identify women at higher risk of postpartum depression.
• Two datasets of multi-site EHR data were used as development and validation sets, respectively.
• Mental health history, number of emergency department visits, blood pressure, and complications during pregnancy are among the predictors in the machine learning algorithm.
• The algorithm performance, as measured by area under the receiver operating characteristic curve, was 0.937 and 0.886 in the development and validation datasets, respectively.