Early detection of pediatric health risks using maternal and child health data

Machine learning (ML)-driven diagnosis systems are particularly relevant in pediatrics given the well-documented impact of early-life health conditions on later-life outcomes. Yet, early identification of diseases and their subsequent impact on length of hospital stay for this age group has so far remained uncharacterized, likely because access to relevant health data is severely limited. Thanks to a confidential data use agreement with the California Department of Health Care Access and Information, we introduce Ped-BERT: a state-of-the-art deep learning model that accurately predicts the likelihood of 100+ conditions and the length of stay in a pediatric patient’s next medical visit. We link mother-specific pre- and postnatal period health information to pediatric patient hospital discharge and emergency room visits. Our data set comprises 513.9K mother–baby pairs and contains medical diagnosis codes, length of stay, as well as temporal and spatial pediatric patient characteristics, such as age and residency zip code at the time of visit. Following the popular bidirectional encoder representations from transformers (BERT) approach, we pre-train Ped-BERT via the masked language modeling objective to learn embedding features for the diagnosis codes contained in our data. We then fine-tune our model to predict primary diagnosis outcomes and length of stay for a pediatric patient’s next visit, given the history of previous visits and, optionally, the mother’s pre- and postnatal health information. We find that Ped-BERT generally outperforms contemporary and state-of-the-art classifiers when trained with minimum features. We also find that incorporating mother health attributes leads to significant improvements in model performance overall and across all patient subgroups in our data.
Our most successful Ped-BERT model configuration achieves an area under the receiver operating characteristic curve (ROC AUC) of 0.927 and an average precision score (APS) of 0.408 for the diagnosis prediction task, and a ROC AUC of 0.855 and APS of 0.815 for the length of hospital stay task. Further, we examine Ped-BERT’s fairness by determining whether prediction errors are evenly distributed across various subgroups of mother–baby demographics and health characteristics, or whether certain subgroups are more susceptible to prediction errors.
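As an illustration of the masked language modeling objective mentioned above, the following minimal sketch masks tokens in one patient's diagnosis-code sequence. The 15% masking rate and the 80/10/10 replacement rule are assumptions borrowed from the original BERT recipe, not settings reported here; the function name `mask_codes` and the placeholder tokens are hypothetical.

```python
import random

MASK, PAD = "[MASK]", "[PAD]"

def mask_codes(codes, vocab, mask_rate=0.15, seed=0):
    """Return (inputs, labels) for one diagnosis-code sequence.

    labels[i] holds the original code at selected positions and None
    elsewhere, so a loss would be computed only on selected positions.
    The masking rate and 80/10/10 split are assumed, not taken from the paper.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for code in codes:
        if code != PAD and rng.random() < mask_rate:
            labels.append(code)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # replace with a random code
            else:
                inputs.append(code)               # keep the original code
        else:
            inputs.append(code)                   # unselected or padding
            labels.append(None)
    return inputs, labels

# Hypothetical placeholder tokens standing in for diagnosis codes.
history = ["D01", "D02", "D03", PAD, PAD]
inputs, labels = mask_codes(history, vocab=["D01", "D02", "D03", "D04"])
```

Padding positions are never selected, so they contribute nothing to the reconstruction target.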

Linked Birth Files (Birth data): a research database created to study delivery and birth outcomes. It includes maternal antepartum and postpartum hospital records for the nine months before delivery and one year post-delivery. In addition, the linked file contains birth records and all infant readmissions occurring within the first year of life. The file contains all infants born in a given year, including births that happened in a California hospital that reported to HCAI, births that occurred in a California hospital that did not report to HCAI, and births that occurred outside California. It includes all infants and mothers, irrespective of whether they were linked to a birth record. The linked pairs of birth/delivery records have information associated with a mother/baby pair from the baby's discharge data record, the mother's discharge data record, and the birth certificate data. Linked birth files are available beginning with the 1991 calendar year reporting period (HCAI 20).
The Patient Discharge Dataset (PDD): consists of a record for each inpatient discharge from a California-licensed hospital. Licensed hospitals include general acute care, acute psychiatric, chemical dependency recovery, and psychiatric health facilities. These datasets are available starting in 1983 (HCAI 20). For more information on the data and reporting requirements, see the California Inpatient Data Reporting Manual. 35
The Emergency Department Dataset (EDD): includes information from hospitals licensed to provide emergency medical services. The EDD encounters include those patients who had face-to-face contact with the provider. If the patient left without being seen, the patient would not have had a face-to-face encounter with a provider, and therefore the EDD encounter would not be reported. These data sets are available beginning January 2005 (HCAI 20).
Our study's primary variables of interest are the primary, secondary, and tertiary ICD-9 or ICD-10 diagnosis codes recorded at each visit, together with the hospital length of stay determined at discharge. We access these data along with other relevant metadata, such as mother-baby demographics and maternal health outcomes from nine months before to 12 months after birth.

A.2 Geospatial Data
The geospatial data is constructed and made available by the Census Bureau. For California, the relevant 2010 ZCTA and county-specific shapefiles, 31,32 the 2010 ZCTA to county codes, 33 and the ZCTA to ZIP crosswalks 34 are identified and mapped to our health data for visualization and analysis purposes. For the Fairness analysis, we extract the California Census 2020 geographical division of counties into regions. Supplementary Table S0 contains a summary of the counties used for each region.

Supplementary Table S0: Geographical division of California's counties. Details of counties included in each California region to support the Fairness analysis presented in Figures 6-7 and Supplementary Figure 7.
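The mapping chain described above (residency ZIP code to ZCTA, ZCTA to county, county to region) can be sketched as a sequence of lookups. This is a minimal illustration, not the pipeline used in the study: every code, county name, and region name below is a made-up placeholder, and the real tables would come from the Census Bureau crosswalk files and Supplementary Table S0.

```python
# All keys and values are hypothetical placeholders, not real crosswalk rows.
ZIP_TO_ZCTA = {"00001": "00001", "00002": "00001", "00003": "00003"}
ZCTA_TO_COUNTY = {"00001": "County A", "00003": "County B"}
COUNTY_TO_REGION = {"County A": "Region 1", "County B": "Region 2"}

def locate(zip_code):
    """Resolve a ZIP code to (zcta, county, region); missing links yield None."""
    zcta = ZIP_TO_ZCTA.get(zip_code)
    county = ZCTA_TO_COUNTY.get(zcta) if zcta is not None else None
    region = COUNTY_TO_REGION.get(county) if county is not None else None
    return zcta, county, region

print(locate("00002"))  # a ZIP code that shares a ZCTA with another ZIP code
print(locate("99999"))  # a ZIP code absent from the crosswalk
```

Returning `None` for unmatched links, rather than raising, keeps unmatched visits visible so they can be counted before any subgroup analysis.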

C Supplementary Figures
Summary statistics of encoded input for pre-training Ped-BERT. The x-axis represents the length of a given patient history, which we optimally set to 40 periods. Each tick on the y-axis represents a diagnosis, age, location history, and padding summary for a given patient ID in the pre-training data. Heatmap values and colors represent: for (a-c), the encoded disease codes, age, and location history; for (b): the effect of zero padding since not all patients have a history length equal to 40.
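The zero padding described in this caption can be sketched as follows. The history length of 40 and the use of zero as the padding value come from the caption; padding on the right, keeping the most recent visits when a history is longer than 40, and the function name `encode_history` are assumptions made for illustration.

```python
MAX_LEN = 40  # history length stated in the caption
PAD_ID = 0    # zero padding, as in the caption

def encode_history(visit_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (padded_ids, padding_mask) for one patient history.

    Assumptions: histories longer than max_len keep their most recent
    visits, and shorter histories are padded on the right.
    The mask is 1 for real visits and 0 for padding.
    """
    trimmed = list(visit_ids)[-max_len:]
    n_pad = max_len - len(trimmed)
    return trimmed + [pad_id] * n_pad, [1] * len(trimmed) + [0] * n_pad

ids, mask = encode_history([5, 7, 9])  # hypothetical encoded diagnosis ids
```

The same mask could be applied to the age and location sequences so that all three inputs stay aligned position by position.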
Supplementary Figure S2: Intrinsic evaluation of embeddings. Learned embeddings are extracted from the pre-training stage for the base + age input embedding specification. The heatmap represents the cosine similarity for all the diagnosis codes in our data aggregated at the chapter level. Negative values (blue shades) reflect opposite similarities, and positive values (red shades) represent close similarities.
Extrinsic evaluation of embeddings. Assess the performance of the pre-trained Ped-BERT model in predicting the patient gender distribution for congenital anomalies (light gray) and tuberculosis (dark gray). The y-axis represents the number of patients predicted to have the given diseases. The results presented here rely on the base + age embeddings specification using the 'pre-training validation set'. Abbreviations: F = Female, M = Male.
Evaluation of the TDecoder pre-training task. The average precision score (APS, right y-axis) and the area under the receiver operating characteristic curve (ROC AUC, left y-axis) are computed as sample averages for the following embedding specifications: base, base + age, and base + age + county embeddings. These metrics represent comparisons between the ground truth (primary diagnosis token at the next medical encounter) and the TDecoder-predicted primary diagnosis token.
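One way to compute a chapter-level cosine similarity like the one in the intrinsic evaluation is sketched below. The aggregation rule is an assumption: code embeddings are averaged into one centroid per chapter before the cosine is taken, which may differ from the aggregation actually used. The embeddings and chapter labels are toy values.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def chapter_similarity(embeddings, chapter_of):
    """Average code embeddings per chapter, then compare chapter centroids."""
    groups = defaultdict(list)
    for code, vec in embeddings.items():
        groups[chapter_of[code]].append(vec)
    centroids = {
        ch: [sum(col) / len(vecs) for col in zip(*vecs)]
        for ch, vecs in groups.items()
    }
    chapters = sorted(centroids)
    return {(a, b): cosine(centroids[a], centroids[b])
            for a in chapters for b in chapters}

# Toy two-dimensional embeddings for three hypothetical codes.
toy_embeddings = {"c1": [1.0, 0.0], "c2": [1.0, 0.0], "c3": [-1.0, 0.0]}
toy_chapters = {"c1": "A", "c2": "A", "c3": "B"}
sim = chapter_similarity(toy_embeddings, toy_chapters)
```

In this toy case the two chapters point in opposite directions, which corresponds to the negative (blue) end of the heatmap scale.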

Table S1:
Ped-BERT hyperparameter search. Ped-BERT is tuned using the baseline embeddings specification (i.e., the sum of diagnosis embeddings and positional encoding) and by comparing the APS and ROC AUC results of the 'pre-training train set' and 'pre-training validation set'. The optimal architecture, which we present in the main text of the paper, is highlighted here in maroon (i.e., Ped-BERT v1).
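The baseline embeddings specification, the sum of diagnosis embeddings and positional encoding, can be illustrated with a small sketch. The sinusoidal form of the positional encoding is an assumption taken from the original Transformer formulation, since the caption does not state which encoding is used; the embedding table holds toy values and the function names are hypothetical.

```python
import math

def sinusoidal_position(pos, dim):
    """Sinusoidal encoding: sin on even indices, cos on odd indices.
    (Assumed form; the encoding actually used is not specified here.)"""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def embed_sequence(token_ids, table):
    """Element-wise sum of each token's embedding and its position encoding."""
    dim = len(table[0])
    return [
        [e + p for e, p in zip(table[tok], sinusoidal_position(pos, dim))]
        for pos, tok in enumerate(token_ids)
    ]

# Toy embedding table: row index = diagnosis token id, 4 dimensions each.
toy_table = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
vectors = embed_sequence([1, 0, 1], toy_table)
```

Under the same additive scheme, the base + age and base + age + county specifications would add further embedding vectors to this sum.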

Table S2:
Rare Genetic Diseases Prediction Task. Number of patients in the 'fine-tuning training set', 'fine-tuning validation set', and 'fine-tuning test set' for selected rare genetic diseases specific to pediatric patients, at the two-digit ICD-9 code level.