Predicting preeclampsia and related risk factors using data mining approaches: A cross-sectional study

Abstract
Background: Preeclampsia is a type of pregnancy hypertension disorder that has adverse effects on both the mother and the fetus. Despite recent advances in the etiology of preeclampsia, no adequate clinical screening tests have been identified to diagnose the disorder.
Objective: We aimed to provide a model based on data mining approaches that can be used as a screening tool to identify patients with this syndrome and also to identify the risk factors associated with it.
Materials and Methods: The data used to perform this cross-sectional study were extracted from the clinical records of 726 mothers with preeclampsia and 726 mothers without preeclampsia who were referred to Fatemieh Hospital in Hamadan City during April 2005–March 2015. In this study, six data mining methods were adopted, including logistic regression, k-nearest neighborhood, C5.0 decision tree, discriminant analysis, random forest, and support vector machine, and their performance was compared using the criteria of accuracy, sensitivity, and specificity.
Results: Underlying condition, age, pregnancy season, and the number of pregnancies were the most important risk factors for diagnosing preeclampsia. The accuracies of the models were as follows: logistic regression (0.713), k-nearest neighborhood (0.742), C5.0 decision tree (0.788), discriminant analysis (0.687), random forest (0.758), and support vector machine (0.791).
Conclusion: Among the data mining methods employed in this study, support vector machine was the most accurate in predicting preeclampsia. Therefore, this model can be considered as a screening tool to diagnose this disorder.


Introduction
Pregnancy blood pressure disorders are one of the most common adverse pregnancy outcomes worldwide (1). One of the most important types of these disorders is preeclampsia (2).
Preeclampsia, which usually begins after the 20th wk of pregnancy, is defined as blood pressure of at least 140/90 mm Hg on two separate occasions at least 4 hr apart, along with proteinuria of at least 0.3 g in urine collected over 24 hr (3). This syndrome, which affects 5-8% of pregnancies worldwide, is one of the leading causes of maternal and fetal mortality (4)(5)(6).
The prevalence of preeclampsia varies in different parts of Iran, with reports of 4% in rural areas and 10% in urban areas (7). Preeclampsia can lead to complications such as renal necrosis, pulmonary edema, liver rupture, hemolysis, increased liver enzymes, decreased platelet syndrome, and stroke (8). It is also associated with intrauterine fetal growth restriction, bleeding problems, preterm delivery, and low birth weight (9). In addition to threatening the mother's physical health, this disorder can lead to emotional disorders such as anxiety and depression (8). Unfortunately, no simple test is available to diagnose preeclampsia, and diagnosis is performed only by repeated visits during pregnancy, repeated blood pressure measurements, and urine analysis, which are costly and highly sensitive, and delay the diagnosis of the disorder (3). Therefore, simple alternative diagnostic methods are needed.
Classification is one such prediction method. The simplest type of classification divides subjects into two groups, such as healthy and sick. Classification is one of the main tasks in the field of data mining. Data mining, which is the science of extracting knowledge from data, identifies potential trends, unseen relationships, and hidden patterns in large datasets (10). Data mining methods are known as useful tools for diagnosing a variety of diseases or predicting clinical outcomes.
In most studies, these techniques have been more accurate than conventional methods of predicting disease (11,12). So far, various classification methods have been introduced in the field of data mining, the most common of which are logistic regression (LR), k-nearest neighborhood (k-NN), C5.0 decision tree, random forest (RF), support vector machine (SVM), and linear discriminant analysis (LDA) (13,14). LR, C5.0 decision tree, and RF, in addition to predicting disease status, can identify the risk factors related to a disease. Therefore, in this study, we aimed to select the model with the best performance among the six data mining approaches mentioned above, and to use it as a screening tool to identify mothers with preeclampsia. We also used LR, C5.0 decision tree, and RF to identify the risk factors associated with this syndrome. It should be noted that these models were built on clinical data already recorded in the hospital, so collecting the data did not incur large costs.

Software
After collecting the information, approximately 70% of the total sample (1016 people) was used for training and the remaining approximately 30% (435 people) was used to test the models. The training data were used to build and train the models, and the test data were used to assess how well each model predicted the healthy or patient class (in our study, without preeclampsia or with preeclampsia). Data were processed in the R 3.2.2 software environment. To build the models in R, the C50 package was used for the C5.0 decision tree, the e1071 package for SVM, the randomForest package for RF, and the MASS package for LDA. Model performance was compared using accuracy, sensitivity, and specificity on the test data.
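As a hedged illustration of the evaluation procedure described above (not the authors' R code), the following Python sketch performs a random 70/30 split and computes accuracy, sensitivity, and specificity from true and predicted labels. The function names `train_test_split` and `evaluate`, the random seed, and the toy labels are assumptions introduced only for illustration.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle the records and split them into training and test lists."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

def evaluate(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for labels coded 1 (case) / 0 (control).

    Assumes both classes occur in y_true, so no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy data (hypothetical, not from the study)
train, test = train_test_split(range(100))
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = evaluate(y_true, y_pred)
```

Sensitivity here is the fraction of true cases that the model flags, and specificity is the fraction of true controls that it clears, which is why both are reported alongside accuracy.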

Ethical considerations
Details of the women were collected without recording their names. In addition, individuals' information was kept confidential. The study was approved by the Vice-Chancellor for Research and Technology, Hamadan University of Medical Sciences, Hamadan, Iran (Code: 9505122624).

Logistic regression (LR)
LR is a standard method for binary classification.
In LR, Y represents the binary response variable (in this study, Y = 1 for a subject with preeclampsia and Y = 0 for a subject without preeclampsia) and $X_1, \dots, X_p$ represent the features (in our study, clinical features of patients). In this case, the probability of Y = 1 (probability of belonging to the class of mothers with preeclampsia) was calculated as follows:

$$P(Y = 1) = \frac{\exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}$$

where $\beta_0, \dots, \beta_p$ are the regression coefficients. Based on this, a person was assigned to class 1 if P (Y = 1) > C and otherwise to class 0, where C was a fixed cutoff (15, 16).
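The LR decision rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the fitted model from the study: the intercept, coefficients, feature values, and the default cutoff of 0.5 are all hypothetical.

```python
import math

def predict_probability(features, intercept, coefficients):
    """P(Y = 1) under a logistic model with the given (hypothetical) coefficients."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))  # equivalent to exp(z) / (1 + exp(z))

def classify(features, intercept, coefficients, cutoff=0.5):
    """Assign class 1 if P(Y = 1) exceeds the cutoff C, otherwise class 0."""
    probability = predict_probability(features, intercept, coefficients)
    return 1 if probability > cutoff else 0

# Hypothetical model with two features
intercept = -1.0
coefficients = [0.8, 0.5]
p_high = predict_probability([1, 2], intercept, coefficients)
p_low = predict_probability([0, 0], intercept, coefficients)
label_high = classify([1, 2], intercept, coefficients)
label_low = classify([0, 0], intercept, coefficients)
```

Lowering the cutoff C raises sensitivity at the expense of specificity, which matters when the model is intended as a screening tool.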

k-nearest neighborhood (k-NN)
The k-NN algorithm is a non-parametric method that is commonly used for classification and regression problems. It is one of the most widely used algorithms due to its simplicity and ease of implementation. To classify a new person into the healthy or patient class (in our study, without preeclampsia or with preeclampsia), the person is represented as a point in the feature space, and k-NN calculates the distance between this point and each point in the training dataset.
Euclidean distance is usually used as the distance criterion. For two points $A = (a_1, \dots, a_p)$ and $B = (b_1, \dots, b_p)$, this distance was calculated as follows:

$$d(A, B) = \sqrt{\sum_{i=1}^{p} (a_i - b_i)^2}$$

The new point was then assigned to the class most common among its k nearest neighbors, where k was a positive integer (17).
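A minimal Python sketch of the k-NN procedure described above follows. The function names, the choice k = 3, and the toy points are assumptions for illustration; the study's own value of k and its features are not given in this section.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Euclidean distance between two points with the same number of features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(new_point, training_points, training_labels, k=3):
    """Assign the label most common among the k nearest training points."""
    distances = sorted(
        (euclidean_distance(new_point, point), label)
        for point, label in zip(training_points, training_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy training data (hypothetical): two well-separated groups
training_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
training_labels = [0, 0, 0, 1, 1, 1]
label_a = knn_classify((2, 2), training_points, training_labels)
label_b = knn_classify((6, 5), training_points, training_labels)
```

Because the distance sums raw feature differences, features measured on larger scales dominate unless the data are standardized first.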

Linear discriminant analysis (LDA)
LDA is a classic classifier that uses a linear decision function for classification. In this method, a linear combination of independent variables (features) was used to separate the classes of the dependent variable as well as possible. In other words, the goal was to find a linear function that maximized the separation between the two groups. The conditional probability of the independent variables given the class label was used to predict the class label of a new case.
A function was used to maximize the distance between the mean of the groups so that the scatter within the classes was minimized and the scatter between classes was maximized (18).
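One common way to formalize this objective is Fisher's criterion. The notation below is an assumption introduced for illustration and is not taken from the source: $w$ is the vector of weights of the linear combination, $\mu_0$ and $\mu_1$ are the class mean vectors, and $S_B$ and $S_W$ are the between-class and within-class scatter matrices.

```latex
J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w},
\qquad
S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^{\top},
\qquad
S_W = \sum_{c \in \{0, 1\}} \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^{\top}
```

For two classes, the direction that maximizes $J(w)$ is proportional to $S_W^{-1}(\mu_1 - \mu_0)$, provided $S_W$ is invertible.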

Decision tree
The structure of the decision tree is similar to a tree, which includes roots, branches, and leaves.
The classification tree divided the data (parent node) into two subsets (child nodes) using a split criterion. This division continued until the response in each node was homogeneous. In a decision tree, the branches represent combinations of input features and the leaves represent the labels of the target class (in our study, 0 was the label of the without preeclampsia class and 1 was the label of the with preeclampsia class) (19).
The rules produced by the decision tree were expressed using the logical terms "if" and "then". The most common decision tree algorithms are ID3, C4.5, C5.0, and CART (16). The C5.0 decision tree, which was introduced by Quinlan in 1987, is modified from the C4.5 version (20).
The C5.0 decision tree is faster than C4.5 and produces more precise rules (21). Therefore, in this study, we used this type of decision tree.
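To make the idea of a split criterion concrete, the sketch below scores a candidate split by entropy-based information gain, the kind of criterion used by the C4.5 family of trees. The source does not state the exact criterion applied, so the criterion, the function names, and the toy labels here are assumptions.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * math.log2(p)
    return result

def information_gain(parent, left, right):
    """Reduction in entropy obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy node (hypothetical): 4 cases (1) and 4 controls (0)
parent = [1, 1, 1, 1, 0, 0, 0, 0]
gain_weak = information_gain(parent, [1, 1, 1, 0], [1, 0, 0, 0])
gain_perfect = information_gain(parent, [1, 1, 1, 1], [0, 0, 0, 0])
```

A split that yields homogeneous child nodes receives the maximum gain, which matches the stopping condition described above.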

Random forest (RF)
RF is an "ensemble learning" technique that combines a large number of decision trees, so that the variance of the combined prediction is lower than that of a single decision tree.
Each tree in the RF was built on a bootstrap sample drawn randomly from the original dataset, using the CART method with the decrease in Gini impurity as the split criterion (22).
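The three ingredients named above (bootstrap sampling, the Gini impurity decrease, and combining trees) can be sketched as follows. This is an illustrative outline, not the randomForest implementation: the function names and toy labels are assumptions, and combining trees by majority vote is the usual approach for classification rather than a detail stated in the text.

```python
import random
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Decrease in Gini impurity obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

def bootstrap_sample(records, rng):
    """Draw len(records) records with replacement from the original data."""
    return [rng.choice(records) for _ in records]

def majority_vote(tree_predictions):
    """Combine the predictions of individual trees into one class label."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Toy data (hypothetical)
parent = [1, 1, 1, 1, 0, 0, 0, 0]
decrease = gini_decrease(parent, [1, 1, 1, 0], [1, 0, 0, 0])
sample = bootstrap_sample(list(range(10)), random.Random(0))
vote = majority_vote([1, 0, 1])
```

Because each tree sees a different bootstrap sample, the trees make partly independent errors, and averaging over them is what reduces the variance relative to a single tree.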

Support vector machine (SVM)
SVM was introduced by Vapnik in 1979. Its goal is to find the best function for classification so that the members of the two classes (in our study-

1. If the number of pregnancies > 1, and without underlying condition then the possibility of belonging to the group with preeclampsia is 79.6%
2. If the sex of fetus is female, without underlying condition, and pregnancy season is summer, then the possibility of belonging to the group without preeclampsia is 87%
3. If number of children > 1, pregnancy season is winter, and with underlying condition, then the possibility of belonging to the group with preeclampsia is 86%
4. If the number of pregnancies > 1, with degree of education lower than diploma, and with underlying condition, then the possibility of belonging to the group with preeclampsia is 86%
5. If the age < 31 yr, with academic degree of education, and without underlying condition, then the possibility of belonging to the group without preeclampsia is 83%

Discussion
Due to the serious risks that preeclampsia poses to the mother and fetus, it is important to use methods that can predict this outcome.
However, despite recent advances in the etiology of preeclampsia, to date, no clinical screening tests have been identified to diagnose the disorder (25).
Identifying the underlying and predictive factors of preeclampsia can play a significant role in reducing mortality and complications in the mother and fetus. In addition to identifying the risk factors associated with preeclampsia, this study aimed to compare common data mining approaches and select the strongest model to help professionals in this field. In this section, we will first consider the most important risk factors associated with preeclampsia and then discuss the performance of the data mining models.
According to the results of the univariate analysis (Table I)

Conclusion
Among the data mining models employed in this study, the SVM model had the highest prediction accuracy. Therefore, we can conclude that this model can be used as a screening tool to help predict preeclampsia. Based on the results of the RF model, which also showed good performance in this study, the variables of underlying condition, degree of education, pregnancy season, and the number of pregnancies were the most important risk factors associated with preeclampsia.
Therefore, by controlling these factors and also regularly monitoring the blood pressure of mothers with these risk factors, the potential risks associated with this syndrome can be reduced.