Machine learning models predict coagulopathy in spontaneous intracerebral hemorrhage patients in ER

Abstract

Aims: Coagulation abnormality is one of the primary concerns for patients with spontaneous intracerebral hemorrhage admitted to the emergency room (ER). Conventional laboratory indicators require hours for coagulopathy diagnosis, which hinders appropriate intervention within the optimal window. This study evaluates the feasibility of building efficient coagulopathy prediction models using data mining and machine learning algorithms.

Methods: A retrospective cohort enrolled 1668 cases with acute spontaneous intracerebral hemorrhage from three medical centers, excluding those under antithrombotic therapies. Coagulopathy-related clinical parameters were initially screened by univariate analysis. Two machine learning algorithms, the random forest and the support vector machine, were applied with four-fold cross-validation to identify the parameters contributing most to the occurrence of coagulopathy. Model discrimination was assessed using metrics including accuracy, precision, recall, and F1 score.

Results: Albumin/globulin ratio, neutrophil count, lymphocyte percentage, aspartate transaminase, alanine transaminase, hemoglobin, platelet count, white blood cell count, neutrophil percentage, and systolic and diastolic pressure were identified as major predictors of acute coagulopathy. Compared to the support vector machine, the model based on the random forest algorithm showed better accuracy (93.1%, 95% confidence interval [CI]: 0.913-0.950), precision (92.4%, 95% CI: 0.897-0.951), F1 score (91.5%, 95% CI: 0.889-0.964), and recall score (93.6%, 95% CI: 0.909-0.964), and yielded a higher area under the receiver operating characteristic curve (AU-ROC) (0.962, 95% CI: 0.942-0.982).

Conclusion: The constructed models exhibit good prediction accuracy and efficiency. They might be used in clinical practice to facilitate targeted intervention for acute coagulopathy in patients with spontaneous intracerebral hemorrhage.


| INTRODUCTION
Spontaneous intracerebral hemorrhage (ICH) is a major global public health issue, contributing to 7.4 million cases and over 3.1 million deaths worldwide annually. 1 The case-fatality rate of ICH ranges from 35% at seven days to 59% at one year. 2-4 Survivors often present with severe disability, with fewer than 40% regaining functional independence. 4 It has been established that prolonged bleeding frequently occurs during the acute phase of ICH, which contributes to neurological deterioration and worsened outcome. 3,5 The pathogenesis of prolonged bleeding is not fully understood, but coagulation abnormalities are considered one of the most significant risk factors. 3,6 For patients under antithrombotic therapies (including antiplatelet, anticoagulant, and fibrinolytic agents), the application of hemostatic agents such as vitamin K1 is a common regimen in the neurological ICU and is recommended by ICH guidelines from the American Heart Association and the American Stroke Association. 4 Even without any antithrombotic therapy, coagulation abnormalities are common and are often accompanied by intracerebral hematoma enlargement. 7 Early detection and intervention of acute coagulopathy can significantly reduce mortality and improve outcomes. 8,9 Currently, the diagnosis of coagulopathy is based on conventional coagulation indicators such as prothrombin time (PT), international normalized ratio (INR), and activated partial thromboplastin time (APTT), which typically require a minimum of 1 to 2 hours of processing time after blood sample collection. This time lag may cause the optimal therapeutic window to be missed. Therefore, it is of great importance to develop efficient prediction models that can identify coagulopathy rapidly, so as to provide an early warning to physicians and to facilitate ancillary resource management for better treatment of patients in the emergency room (ER).
Previously, clinical parameters, including age, gender, and body temperature, have been identified as risk factors for coagulation abnormalities in acute ICH patients, yet the sensitivities and specificities varied among studies. 10-15 Precise prediction is difficult because coagulopathy during the acute phase of ICH is a multifactorial pathological process with complex mechanisms, possibly including tissue damage, hypoxemia, acidemia, inflammation, hypoperfusion, and other confounders. For ICH patients arriving at the ER, it is impractical for clinicians to screen every individual factor. Therefore, it is crucial to develop easily applicable prediction models that alert clinicians to potential coagulopathy in ICH patients.
The rise of big data analysis and machine learning algorithms offers possible strategies to build efficient prediction models and reveal hidden patterns in enormous datasets. 16-20 In this study, we used machine learning methods to develop and validate a prediction model for coagulopathy after acute ICH based on objective indicators that are routinely obtained after patients are admitted to the ER. Patients with non-aneurysmal spontaneous ICH, confirmed by computed tomography (CT) scanning, were recruited in this study.

| Study cohort and source of data
The inclusion criteria also required an interval of less than 12 hours from symptom onset to ER admission. 21,22 Because two of the medical centers, Huashan Hospital and Huashan Pudong Hospital, accept only adult patients, this study recruited only patients over 18 years old.
The exclusion criteria involve any of the following conditions: pregnancy; uncorrected shock (systolic pressure ≤90 mm Hg, or diastolic pressure ≤50 mm Hg); thrombocytopenia; cirrhosis; hypohepatia/hepatic failure; renal failure; current oral administration of antiplatelets or anticoagulants (including warfarin, clopidogrel, aspirin, rivaroxaban, and dabigatran); and incomplete data upon ER admission.

| Definition of coagulation abnormalities
Results of coagulation assessment were collected from day 0 to day 5 after ER admission. A diagnosis of coagulation abnormality was made when a patient showed either an increased international normalized ratio (INR ≥1.2) or a prolonged activated partial thromboplastin time (APTT, reference range 28-34 seconds). These indicators assess the extrinsic and intrinsic pathways of coagulation, respectively, along with the common pathway, and both the indicators and their reference ranges are commonly applied in clinical research and the literature. 23
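The labeling rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the names `CoagulationPanel` and `has_coagulopathy` are hypothetical, and treating "prolonged APTT" as strictly above the upper reference limit of 34 seconds is an assumption, since the text gives only the reference range.

```python
from dataclasses import dataclass

INR_THRESHOLD = 1.2        # stated in the text: INR >= 1.2 counts as increased
APTT_UPPER_LIMIT_S = 34.0  # upper end of the stated 28-34 s reference range


@dataclass
class CoagulationPanel:
    """One coagulation assessment (hypothetical container for illustration)."""
    inr: float
    aptt_seconds: float


def has_coagulopathy(panel: CoagulationPanel) -> bool:
    """Return True if either criterion for coagulation abnormality is met."""
    increased_inr = panel.inr >= INR_THRESHOLD
    # Assumption: "prolonged" means strictly above the reference range.
    prolonged_aptt = panel.aptt_seconds > APTT_UPPER_LIMIT_S
    return increased_inr or prolonged_aptt


if __name__ == "__main__":
    print(has_coagulopathy(CoagulationPanel(inr=1.0, aptt_seconds=30.0)))
    print(has_coagulopathy(CoagulationPanel(inr=1.3, aptt_seconds=30.0)))
```

Because either criterion alone is sufficient, a patient with a normal INR but a prolonged APTT is labeled positive in this sketch.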

| Machine learning algorithms
The data mining and model fitting were performed in Python 3.6. Two algorithms, random forest and support vector machine (SVM), were applied to predict coagulopathy among ICH patients admitted to the ER.
Random forest is a supervised learning algorithm. The "forest" in this approach is a series of decision trees that act as "weak" classifiers, which are poor predictors individually but exhibit robust prediction value in the aggregate. To classify an object from an input vector, each tree gives a classification, and the forest selects the classification that has the most votes. 27,28 In this study, the Gini index was applied as the optimization criterion, with 1000 estimators used in the calculation. The hyperparameters used in the current study were as follows: criterion="gini", bootstrap=True, max_parameter="auto", max_depth=10, n_jobs=2, min_samples_split=2.
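The voting step described above can be illustrated with a minimal sketch. It is not the implementation used in the study: the three toy "trees" are hypothetical threshold rules, and `forest_predict` is an illustrative name. Real forests grow each tree on a bootstrap sample of the training data.

```python
from collections import Counter
from typing import Callable, Sequence

# A "tree" here is any function mapping a feature vector to a class label.
Tree = Callable[[Sequence[float]], int]


def forest_predict(trees: Sequence[Tree], x: Sequence[float]) -> int:
    """Return the class that receives the most votes across all trees."""
    votes = Counter(tree(x) for tree in trees)
    # most_common(1) gives the (label, count) pair with the highest count;
    # on a tie, the label that was voted for first is returned.
    return votes.most_common(1)[0][0]


# Hypothetical weak classifiers, each looking at a simple threshold.
toy_trees = [
    lambda x: int(x[0] > 1.0),
    lambda x: int(x[1] < 0.5),
    lambda x: int(x[0] + x[1] > 2.0),
]

if __name__ == "__main__":
    # Two of the three toy trees vote for class 1 on this input.
    print(forest_predict(toy_trees, [1.5, 0.2]))
```

Each toy tree is wrong on some inputs, but the majority decision is more stable than any single vote, which is the intuition behind aggregating weak classifiers.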
SVM, also known as support vector network (SVN), is also a supervised learning method. It is trained on a set of data points that have already been labeled as belonging to one of two categories, and from these it builds a model that assigns new data points to one category or the other. This makes SVM a kind of non-probabilistic binary linear classifier. 29,30 In this study, the prediction model was trained via linear SVM, in which relative parameter contributions were derived from the weighted coefficients. To ensure the robustness of the parameter contributions, 1,000 bootstrapped sets were generated, in each of which 75% of the training set was sampled with replacement. Linear SVMs were trained on each of these bootstrapped sets.
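The bootstrap scheme described above can be sketched as follows. This is an illustration under assumptions rather than the study's code: the function names are hypothetical, the random seed is arbitrary, the linear SVM fit itself is not shown, and averaging the per-bootstrap coefficient vectors is only one plausible way to summarize the contributions.

```python
import random
from typing import List, Sequence


def bootstrap_index_sets(n_train: int, n_sets: int = 1000,
                         fraction: float = 0.75, seed: int = 0) -> List[List[int]]:
    """Draw n_sets index lists, each sampling `fraction` of the training
    set with replacement (the same index may appear more than once)."""
    rng = random.Random(seed)
    size = int(round(fraction * n_train))
    return [[rng.randrange(n_train) for _ in range(size)]
            for _ in range(n_sets)]


def mean_coefficients(coef_sets: Sequence[Sequence[float]]) -> List[float]:
    """Average one coefficient vector per bootstrap fit, feature by feature.
    Assumption: the mean is used as the summary of each contribution."""
    n_fits = len(coef_sets)
    n_features = len(coef_sets[0])
    return [sum(coefs[j] for coefs in coef_sets) / n_fits
            for j in range(n_features)]


if __name__ == "__main__":
    index_sets = bootstrap_index_sets(n_train=200, n_sets=3)
    print([len(s) for s in index_sets])
```

A linear SVM would be fitted on the rows selected by each index list, and the resulting coefficient vectors passed to `mean_coefficients`.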
This study adopts fourfold cross-validation, in which the whole data set was randomly divided into four subsets (folds). Of the four folds, three were used as training data, and the remaining one was retained as a validation data set. The cross-validation process was repeated four times, so that each of the four folds was used exactly once as validation data. The four results were then averaged to produce a single estimate. The area under the receiver operating characteristic curve (AU-ROC), precision, classification accuracy, recall score, and F1 score were used to evaluate the prediction models. Precision is the proportion of positive predictions that truly belong to the positive class. Classification accuracy is the proportion of correct predictions among all predictions. The recall score is the proportion of truly positive samples that are correctly predicted as positive. The F1 score is the harmonic mean of precision and recall, balancing both in a single number.
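The evaluation procedure above can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the shuffling seed is arbitrary, class 1 is assumed to denote coagulopathy, and AU-ROC is omitted because it needs predicted scores rather than hard labels.

```python
import random
from typing import Dict, List, Sequence, Tuple


def four_fold_splits(n_samples: int, k: int = 4,
                     seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Randomly partition sample indices into k folds and return one
    (train_indices, validation_indices) pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, folds[i]))
    return splits


def classification_metrics(y_true: Sequence[int],
                           y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def average_metrics(per_fold: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over the folds to obtain a single estimate."""
    return {name: sum(m[name] for m in per_fold) / len(per_fold)
            for name in per_fold[0]}
```

In use, a model would be fitted on each `train` index list, its predictions on the matching validation list passed to `classification_metrics`, and the four resulting dictionaries combined with `average_metrics`.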

| Cohort Characteristics
Between January 2016 and June 2019, 32,857 patients visited the ER of the above three medical centers. Among them, 3,016 patients were diagnosed with acute ICH, and 1,813 patients met the above inclusion criteria. Of these, 145 patients were excluded due to missing data for one or more covariates. Therefore, data from 1,668 patients were finally used for modeling. A flowchart of patient selection is shown in Figure 1. The demographic characteristics of recruited patients are shown in Table 1. On the contrary, parameters such as age, gender, hemorrhage location, history of hypertension, body temperature, heart rate, and pulse oxygen saturation (SpO2) showed no significant difference between the coagulopathy and non-coagulopathy groups (Table 1).

| Random forest model
In the random forest model, recruited patients were classified according to their coagulation status, and an algorithm was used to assess the importance of each clinical parameter on coagulopathy.
Parameter importance was calculated as the sum of the decrease in error when split by a variable. The importance of each clinical parameter reflects the contribution of each variable in the patient's classification into the coagulopathy or non-coagulopathy group.
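The importance calculation can be illustrated with a small sketch. It rests on assumptions that the text does not confirm: the "error" is taken to be Gini impurity (the criterion named in the Methods), each decrease is weighted by child node size, and the totals are normalized to sum to 1. All function names and the example split values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a node holding binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def impurity_decrease(parent: Sequence[int], left: Sequence[int],
                      right: Sequence[int]) -> float:
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - children


def feature_importance(splits: List[Tuple[str, float]]) -> Dict[str, float]:
    """Sum the decreases attributed to each variable over all splits,
    then normalize so that the importances add up to 1 (assumption)."""
    totals: Dict[str, float] = {}
    for feature, decrease in splits:
        totals[feature] = totals.get(feature, 0.0) + decrease
    total = sum(totals.values())
    return {f: v / total for f, v in totals.items()} if total else totals


if __name__ == "__main__":
    # A split that separates the two classes perfectly removes all impurity.
    print(impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1]))
    print(feature_importance([("AST", 0.3), ("platelet_count", 0.1), ("AST", 0.1)]))
```

A variable used in many effective splits accumulates a large total and therefore ranks high, which matches the interpretation of importance as a variable's contribution to classifying patients.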
Major indicators for coagulopathy during acute phase ICH were ranked in the upper part of Table 2.

| Model performance comparison
Model discrimination was assessed using machine learning evaluation metrics, including accuracy, precision, recall, and F1 score. The results are presented in Table 3.

| Contributors for the extrinsic and intrinsic coagulation pathway
Since INR and APTT measure the extrinsic and intrinsic coagulation pathways, respectively, the major contributors to each pathway were screened independently using the same machine learning strategies. APTT prediction compared to the random forest algorithm. The variables with a significant contribution to both pathways are sorted and listed in Table 5.

| DISCUSSION
In the present study, both random forest and SVM algorithms indicated that A/G, NEUT, LYMPH, and AST changes were the major predictors for the development of coagulopathy. The contributions of the A/G ratio and AST suggest that normal liver function plays a vital role in maintaining normal coagulation.
Hemostasis is closely related to liver function, as most coagulation factors are synthesized by liver parenchymal cells. Also, the liver's reticuloendothelial system plays a critical role in the clearance of the activated forms of the coagulation factors. The severity of coagulation abnormalities correlates with the extent of liver disturbance. 36 Yet, bilirubin was not identified as a primary parameter for coagulopathy in this study, suggesting that synthetic dysfunction may play a more critical role than hepatocellular damage or biliary obstruction in coagulopathy.
Acute leukocytosis is a well-established response to ICH. Previous prospective studies have shown that elevated admission WBC count and neutrophil count are associated with an increased risk of early neurologic deterioration in ICH 37-39 as well as in ischemic stroke. 40 Multiple studies have also reported that an increased neutrophil-to-lymphocyte ratio is associated with higher mortality and increased intracerebral remote diffusion-weighted imaging lesions in ICH, 41,42 and with worsened prognosis in glioma. 43 Although the mechanisms are not fully understood, some interactions between coagulation factors and neutrophils are described elsewhere, which may, in turn, play a role in hemostasis. Proteins of the coagulation system, such as FXa, thrombin, and fibrinogen, bind to various sites on neutrophils. This binding has complex consequences. First, it assembles coagulation complexes such as the prothrombinase complex and the contact system on the neutrophil membrane, which further impacts neutrophil functions such as chemotaxis, aggregation, degranulation, and migration. Second, neutrophil elastase degrades multiple coagulation proteins, modulating both the thrombotic and the fibrinolytic systems. 44 In fact, these interactions are recognized as a link between the coagulation and inflammation pathways.
Multiple prospective investigations have indicated that achieving early and stable blood pressure control seems safe and is associated with favorable outcomes in acute ICH patients. 45,46 While this may be due to avoidance of hypertension-induced hematoma enlargement, 47 studies have revealed that patients with a history of hypertension show lower-grade fibrin formation and higher levels of several anticoagulant factors (eg, antithrombin III, protein C and protein S, and von Willebrand factor antigen). 48,49 The clinical application of these findings warrants additional studies. In the present study, the two machine learning algorithms make it convenient to obtain a predicted probability of coagulopathy among patients with acute ICH, which is more efficient than conventional coagulation lab tests. Compared to traditional statistical methods, the random forest and SVM are better at analyzing nonlinear relationships between various biochemical markers and coagulopathy.
Notably, the strategy of machine learning models is highly practical.
All parameters used in this study are easily accessible and well estab-

| CONCLUSION
Machine learning techniques have been successfully introduced into the field of healthcare. This study provides an example of a systematic analysis of a data set on coagulopathy among ICH patients. The results above demonstrate that machine learning techniques can generate prediction models with excellent performance and high efficiency. Such methods and principles could be applied to other evaluations in the future.

ACKNOWLEDGMENT
We

CONFLICT OF INTEREST
The authors report no conflict of interest concerning the materials or methods used in this study or the findings specified in this paper.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.