Cluster-Based Ensemble Learning Model for Rapid Detection of Aortic Dissection

The data set used in this paper comes from 53213 patients and was collected at XiangYa Hospital in Hunan Province from 2008 to 2016. The data include 802 patients with aortic dissection (AD) and 52411 patients without aortic dissection. To help clinicians predict AD, we designed a clustering-based ensemble learning model: Cluster Random Under-sampling Smote-Tomek-link Bagging (CRST-Bagging). This model combines the advantages of the clustering-based compound resampling method (CRST) and the Bagging ensemble classifier, and it achieves good results on aortic dissection data sets. The experiments validate the effectiveness of the CRST sampling method on the AD data set. We compared the CRST-Bagging model with the classical ensemble models RUSBoost and SMOTE-Bagging on the AD data set. The experimental results show that the CRST-Bagging model has the best performance in the detection of AD. The model's accuracy and recall rate are 83.6% and 80.7%, respectively, and its F1 value is 82.1%, which is 4.8% and 1.6% higher than that of the RUSBoost and SMOTE-Bagging models, respectively.


Background and introduction
Aortic dissection (AD) is a medial rupture caused by intramural hemorrhage, which leads to the separation of the aortic wall layers, followed by the separation of the true and false lumens [1]. AD is a dangerous cardiovascular disease with severe morbidity, many complications and high mortality. The clinical manifestations of AD are complex and changeable, and they lack specific symptoms and signs. Because the location, degree and extent of the lesion vary, the clinical manifestations and severity also vary. In addition, clinicians tend to look for the common symptoms of AD, such as chest pain and back pain, but in painless patients atypical symptoms make the diagnosis more difficult and can easily lead to missed diagnosis and misdiagnosis [2]. According to clinical statistics, more than one third of actual AD cases are misdiagnosed [3][4][5]. Mortality can reach as high as 50% within a week of onset and between 60% and 70% within a month. Timely diagnosis of AD by clinicians, supported by scientific methods and effective techniques, is the most effective means of saving patients' lives.
In recent years, the application of artificial intelligence in medicine and health care has attracted much attention, and various computer-aided diagnosis systems have emerged one after another. Huang et al. [6] used an enhanced resampling method on electronic medical records to classify and predict major adverse cardiac events (MACEs) of acute coronary syndrome (ACS). Zhou et al. [7] proposed an interpretable pattern discovery method, from the perspective of statistical learning, to interpret clinical chest data and make classification predictions.
Song [...]
Therefore, unlike the above methods, we studied a diagnostic and predictive method for AD based on patients' routine examination data, so as to help doctors judge whether patients need further imaging examination.
The rarity of AD leads to a significant imbalance in the data set. If traditional machine learning techniques are applied to the aortic dissection data set, the model tends to fit the majority class, showing high accuracy but a low recall rate, so its generalization ability is low. Therefore, imbalanced learning [16][17] and ensemble learning [18][19][20][21] were combined to predict aortic dissection.
According to the characteristics of AD data, this paper proposes a cluster-based ensemble learning model, Cluster Random Under-sampling Smote-Tomek-link Bagging (hereinafter CRST-Bagging), to help clinicians detect AD in clinical practice. The model includes two parts: cluster-based resampling (CRST) and a Bagging classifier. The resampling method CRST combines the advantages of over-sampling and under-sampling and overcomes the difficulties that imbalanced data cause in the detection of AD. The Bagging classifier is used to improve the generalization ability of the learning model. To demonstrate the effectiveness of the CRST-Bagging approach, we compared it with the classic ensemble models RUSBoost and SMOTE-Bagging on AD data sets. The experimental results show that the proposed model is superior to the other models, which supports its effectiveness.
The main contributions of this paper can be summarized as follows:
• Missing value processing, feature screening and dimension-reduction visualization were performed on the AD data set. These methods give us a priori knowledge of the distribution of AD data, which can be used in clinical medicine to explore the pathological mechanism of AD.
• A new compound resampling method, CRST, is proposed. It combines the advantages of clustering and of the SMOTE + Tomek-Link sampling method. The method not only makes the selected samples effectively represent the characteristics of different samples, but also ensures the randomness of sampling, which can effectively reduce the imbalance ratio of AD data.
• The CRST-Bagging ensemble learning model is proposed to predict AD. In experimental comparison and analysis, the model shows excellent performance and good generalization ability on AD data sets. Therefore, this model can be used for clinical auxiliary diagnosis.
The rest of this article is arranged as follows. In the second section, we introduce the data set used in this paper, our resampling method CRST and the ensemble model for imbalanced data. In the third section, we present our experimental results and model performance evaluation. In the fourth section, the experimental results are discussed. Finally, the summary and the discussion of future work are given in the last section.

Data Overview
The dataset used in this paper consists of the examination indicators of 53213 patients, collected at XiangYa Hospital in Hunan Province from 2008 to 2016. The data include 802 patients with AD and 52411 patients without AD. The dataset has 71 feature dimensions and one label dimension (the table Description of Features is included as additional file 1 of the supplementary material). The dataset is highly imbalanced: the number of non-AD samples is approximately 65 times that of AD samples (52411/802 ≈ 65.4). In addition, this paper uses a separate test set to better verify the classification performance and generalization ability of our model. The test set includes the examination indicators of 235 patients from the same hospital, in the same data format as the data set above.
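As a quick check of the imbalance described above, the class ratio can be computed directly from the stated patient counts. This is only an illustrative sketch; the variable names are not taken from the paper.

```python
# Class counts as stated in the text.
n_ad = 802         # patients with AD (minority class)
n_non_ad = 52411   # patients without AD (majority class)

total = n_ad + n_non_ad
imbalance_ratio = n_non_ad / n_ad   # majority samples per minority sample
minority_share = n_ad / total       # fraction of AD cases in the data set

print(f"total patients: {total}")
print(f"imbalance ratio: {imbalance_ratio:.1f} : 1")
print(f"share of AD cases: {minority_share:.2%}")
```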

Data preprocessing
First, the non-numerical indicators are converted to numbers by binary coding and then standardized. Second, we computed the missing rate of samples and features in the original AD dataset, as shown in Figure 1 (the abscissa represents the features and the ordinate represents the missing rate). Six features with a missing rate of more than 50% were found: plasma antithrombin Ⅲ antigen determination (r_28, missing rate 81.5%), plasma plasminogen antigen determination (r_20, missing rate 80.7%), hypersensitivity thyrotropin (r_64, missing rate 75.6%), erythrocyte sedimentation rate (r_59, missing rate 63.8%), D-dimer (r_19, missing rate 62.6%) and free triiodothyronine (FT3) (r_62, missing rate 51.9%).
Because of the high missing rate of these six features, it is difficult to impute them, and the usual approach is to delete them. However, because the etiology and related diagnostic indicators of AD are not yet clear, we cannot directly determine whether the features with missing values are key indicators, so they cannot simply be deleted. Therefore, the XGBoost method is used to analyze feature importance [22][23]. The result is shown in Figure 2, with feature numbers on the horizontal axis and feature importance scores on the vertical axis.
From Figure 2, we find that among the six features with a missing rate greater than 50%, free triiodothyronine (FT3) and D-dimer rank in the top 10 by feature importance score, which indicates that these two features are important for detecting whether a patient suffers from AD. Therefore, we remove only four features: plasma antithrombin Ⅲ antigen determination, plasma plasminogen antigen determination, hypersensitivity thyrotropin and erythrocyte sedimentation rate. Free triiodothyronine (FT3) and D-dimer are retained and imputed together with the remaining features.
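The screening rule described above (drop a feature only if its missing rate exceeds 50% and it is not among the ten most important features) can be sketched as follows. The missing rates and the two retained features are taken from the text; the function name and data layout are illustrative assumptions, and the importance ranking itself (obtained with XGBoost in the paper) is not recomputed here.

```python
# Missing rates (in %) of the six high-missing features, as reported in the text.
missing_rate = {
    "r_28": 81.5,  # plasma antithrombin III antigen determination
    "r_20": 80.7,  # plasma plasminogen antigen determination
    "r_64": 75.6,  # hypersensitivity thyrotropin
    "r_59": 63.8,  # erythrocyte sedimentation rate
    "r_19": 62.6,  # D-dimer
    "r_62": 51.9,  # free triiodothyronine (FT3)
}

# High-missing features that the text reports as ranking in the top 10 by importance.
top10_important = {"r_19", "r_62"}


def screen_features(missing_rate, important, threshold=50.0):
    """Hypothetical helper: split features into (kept, dropped).

    A feature is dropped only when its missing rate exceeds `threshold`
    and it is not listed in `important`.
    """
    dropped = sorted(
        f for f, rate in missing_rate.items()
        if rate > threshold and f not in important
    )
    kept = sorted(f for f in missing_rate if f not in dropped)
    return kept, dropped


kept, dropped = screen_features(missing_rate, top10_important)
print("kept:", kept)
print("dropped:", dropped)
print("remaining feature count:", 71 - len(dropped))
```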
The adjusted sample set has size (53213, 67).
In this paper, missing values in the data set are filled by class-wise random filling. Compared with ordinary random filling, the method of random filling by class is to fill [...]
To gain a more systematic and in-depth understanding of the high-dimensional AD data set, this paper uses dimensionality-reduction visualization to analyze the data distribution. This allows us to understand the data more intuitively and provides information for the design of the AD prediction algorithm. Among the existing dimensionality-reduction methods, the t-SNE algorithm [24] can retain both global and local data structure. Therefore, we used t-SNE to reduce the dimensionality of the dataset and analyzed the distribution of samples through visualization. From the observed data distribution, we conclude that using a clustering algorithm to improve the under-sampling method is feasible.
The results of applying t-SNE to the AD dataset are shown in Figure 3. The red points are positive samples and the blue points are negative samples. As can be seen from Figure 3, the positive samples are agglomerated, which shows that there is a certain similarity between AD cases. Therefore, from the analysis of the data distribution, cluster-based under-sampling is effective. At the same time, the visualization shows obvious overlap in space between the two classes of samples. Therefore, it is necessary to construct a nonlinear classification model.
[...] pre-processing is carried out first (see Section 2). Then, in the second step, a clustering-based resampling algorithm is proposed to resample the imbalanced data set and reduce its imbalance ratio.
Finally, the Bagging ensemble model is used to construct a powerful nonlinear classifier to predict AD. The methods are described in detail below.
Resampling obtains a balanced data set from the original imbalanced data set by using different sampling methods. At the data level, resampling methods are mainly divided into three kinds: over-sampling, under-sampling and combined sampling.
In view of the imbalanced and clustered characteristics of AD data, we propose the Cluster Random Under-Sampling Smote + Tomek-link approach (hereafter, CRST). It is an under-sampling method that takes the cluster center as the representative point, and it combines the advantages of K-means++ and the Smote + Tomek-link sampling method.
First, the majority-class training samples are clustered by the K-means++ algorithm, where K is obtained by hyper-parameter optimization. Then random under-sampling is carried out within each cluster; the sampling rate p% can be chosen according to the actual situation. After under-sampling, the SMOTE + Tomek-link combined sampling method is used to form a new balanced data set. By iterating these operations many times, we obtain several new balanced sub-datasets.
The clustering of the majority-class samples is visualized in Figure 5. The green dots are the majority-class sample points retained after p% random under-sampling in each cluster. It can be seen that in this way the samples are drawn uniformly from each cluster, maintaining the original data distribution.
Finally, the Smote + Tomek-link (S-T) sampling method is applied to generate minority samples, which compensates for the sample loss caused by under-sampling and alleviates the imbalance ratio.
As shown in Figure 6, S-T generates minority samples through the SMOTE method, while the Tomek-link method is adopted to solve the problem of fuzzy boundaries caused by excessive generation of minority samples. This reduces the redundancy of samples. The procedure of the algorithm is shown in Table 1.
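The S-T step described above can be sketched in plain Python on small tuples. This is a minimal illustration, not the authors' implementation: the neighbour count `k`, the choice to remove both members of each Tomek link, and all function names are assumptions, and at least two minority samples are required.

```python
import math
import random


def nearest(idx, points, candidates):
    """Index (from `candidates`) of the point closest to points[idx], excluding idx."""
    return min((j for j in candidates if j != idx),
               key=lambda j: math.dist(points[idx], points[j]))


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest minority neighbours of a (a itself excluded).
        neighbours = sorted((p for p in minority if p is not a),
                            key=lambda p: math.dist(a, p))[:k]
        b = rng.choice(neighbours)
        t = rng.random()  # interpolation position between a and b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


def remove_tomek_links(points, labels):
    """Drop both members of every Tomek link.

    A Tomek link is a pair of points with different labels that are
    each other's nearest neighbour.
    """
    idx = range(len(points))
    nn = [nearest(i, points, idx) for i in idx]
    linked = {i for i in idx if nn[nn[i]] == i and labels[i] != labels[nn[i]]}
    keep = [i for i in idx if i not in linked]
    return [points[i] for i in keep], [labels[i] for i in keep]


def smote_tomek(majority, minority, k=3, seed=0):
    """Over-sample the minority class to parity, then clean Tomek links."""
    rng = random.Random(seed)
    new = smote(minority, len(majority) - len(minority), k, rng)
    points = list(majority) + list(minority) + new
    labels = [0] * len(majority) + [1] * (len(minority) + len(new))
    return remove_tomek_links(points, labels)
```

In this sketch label 0 marks the majority class and label 1 the minority class. Removing both members of a link is one common convention; removing only the majority member is an equally plausible variant.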

4. Combine the under-sampled majority class sample set and the minority class sample set into one sample set, then apply the S-T method to it to obtain a balanced data set.
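The steps that precede the S-T step, clustering the majority class and randomly under-sampling each cluster, can be sketched as follows. This is a hedged illustration rather than the authors' code: it uses a plain k-means loop with k-means++ seeding, keeps at least one sample per non-empty cluster, and assumes the data contain at least `k` distinct points and that `p` is a fraction in (0, 1]. All function names are hypothetical.

```python
import math
import random


def kmeans_pp(points, k, rng, iters=20):
    """Plain k-means with k-means++ seeding; returns one cluster index per point."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Seed the next centre with probability proportional to squared distance.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign


def cluster_random_undersample(majority, k, p, seed=0):
    """Keep roughly a fraction p of the samples in each majority-class cluster."""
    rng = random.Random(seed)
    assign = kmeans_pp(majority, k, rng)
    kept = []
    for j in range(k):
        members = [pt for pt, a in zip(majority, assign) if a == j]
        if members:
            # Assumption: retain at least one sample from every cluster.
            n_keep = min(len(members), max(1, round(p * len(members))))
            kept.extend(rng.sample(members, n_keep))
    return kept


# Toy usage: two well-separated groups of majority samples.
group_a = [(i * 0.1, 0.0) for i in range(20)]
group_b = [(100 + i * 0.1, 0.0) for i in range(20)]
subset = cluster_random_undersample(group_a + group_b, k=2, p=0.25, seed=1)
print(len(subset), "majority samples retained")
```

Sampling inside each cluster, rather than over the whole majority class, is what keeps every region of the majority distribution represented in the reduced set.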

Ensemble model based on CRST
As can be seen from the visualization analysis in Section 2.1, the aortic dissection data have high overlap, so the classification boundary is blurred and it is necessary to construct a nonlinear classification model with strong generalization ability. On the basis of the CRST sampling method proposed in Section 2.2 and the idea of Bagging [30] ensemble learning, the CRST-Bagging ensemble learning algorithm is proposed in this section. It overcomes the limitation of a single classifier.
The CRST-Bagging algorithm generates a collection of T new sub-sample sets by applying the CRST sampling method iteratively. Each sub-sample set is then used to construct a sub-classifier separately, and the complete ensemble classifier is obtained by integrating the results of the T sub-classifiers. The integration rule used in the algorithm is the Majority Vote rule [31]. For each sub-classifier, if its score for the first sample category is greater than or equal to its score for the second sample category, the first category gets one vote; otherwise the second category gets one vote. This rule is expressed by formulas (1) and (2). After the classifier is constructed, the validation set is fed to it to evaluate the model. The model structure is shown in Figure 7.
[...] are randomly sampled so that the ratio of majority to minority samples is 2:1. Then Smote and S-T are each applied so that the ratio of positive to negative samples after resampling is 1:1.
The experimental results of the seven-fold cross-validation on the original dataset (53213, 67) are shown in Table 2. Compared with the single over-sampling method Smote and the S-T method, CCST and our proposed method CRST show a large improvement in recall rate and F1 value.
Because CCST selects the samples closest to the center of each cluster, the selected samples from different clusters are too far apart and the sample distribution is uneven, so its effect is inferior to that of CRST. CRST performs better, which shows that it can reduce missed diagnoses of AD to some extent. This is because CRST combines over-sampling and under-sampling: while clustering and under-sampling the majority samples, it uses the S-T method to generate the same number of minority class samples. In this way, we not only retain the original distribution of the majority samples and select representative sample points, but also balance the number of minority samples through over-sampling.
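The Majority Vote rule used earlier in this section to integrate the sub-classifiers can be illustrated with a short sketch. It assumes that each sub-classifier outputs one score per class and that an ensemble-level tie goes to the first class; neither detail is confirmed by the text, and the class names `c1` and `c2` are placeholders.

```python
def classifier_vote(score_c1, score_c2):
    """One sub-classifier's vote: c1 if its score is at least that of c2."""
    return "c1" if score_c1 >= score_c2 else "c2"


def majority_vote(scores):
    """Combine the votes of all sub-classifiers for a single sample.

    `scores` is a list of (score_c1, score_c2) pairs, one per sub-classifier.
    Assumption: an ensemble-level tie is resolved in favour of c1.
    """
    votes = [classifier_vote(a, b) for a, b in scores]
    n_c1 = votes.count("c1")
    return "c1" if n_c1 >= len(votes) - n_c1 else "c2"


# Three hypothetical sub-classifiers scoring one sample.
print(majority_vote([(0.6, 0.4), (0.3, 0.7), (0.5, 0.5)]))
```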

Experimental settings
Seven-fold cross-validation is used in the comparative experiments. The first group of experiments was performed on the original data set (53213, 67). The second group was performed on the test set. The settings of the algorithms in these experiments are as follows:
• RUSBoost: the base learner is a C4.5 decision tree, the number of base learners is 100 and the depth is 5;
• SMOTEBagging: the number of clusters is K = 5, the base learner is a C4.5 decision tree, the number of base learners is 100 and the depth is 6;
• CRST-Bagging: the number of clusters is K = 50 and p% = 3.1%.
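A seven-fold split of the kind used in these experiments can be sketched as follows. The text does not say whether the folds were stratified; stratification by class is assumed here because it keeps the rare AD cases spread evenly across folds. The function name and toy labels are illustrative.

```python
import random


def stratified_kfold(labels, n_splits=7, seed=0):
    """Yield (train_indices, test_indices) with each class spread across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    for f in range(n_splits):
        test = sorted(folds[f])
        train = sorted(i for g in range(n_splits) if g != f for i in folds[g])
        yield train, test


# Toy labels: 14 positive (AD) and 70 negative samples.
toy_labels = [1] * 14 + [0] * 70
for train, test in stratified_kfold(toy_labels):
    n_pos = sum(toy_labels[i] for i in test)
    print(f"test size {len(test)}, positives in test fold: {n_pos}")
```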

Experimental results and analysis
The experimental results of the seven-fold cross-validation on the original dataset (53213, 67) are shown in Table 3, and Figure 8 shows the ROC curves of the algorithms. It can be seen that CRST-Bagging performs best on the AD dataset, the SMOTEBagging algorithm is second, and the RUSBoost algorithm performs worst. Compared with the RUSBoost and SMOTEBagging methods, CRST-Bagging improves classification performance and significantly improves accuracy and F1 value. This indicates that the rates of missed diagnosis and misdiagnosis in patients with AD are significantly reduced by CRST-Bagging.
The experimental results on the test set are shown in Table 4, and Figure 9 shows the ROC curves of the algorithms. As can be seen from Table 4 and Figure 9, the SMOTEBagging algorithm has the highest accuracy on the test set, while the CRST-Bagging algorithm has the highest recall rate and F1 value. The performance of CRST-Bagging on the test set is significantly better than that of the RUSBoost and SMOTEBagging algorithms, and it has stronger generalization ability. In other words, the CRST-Bagging algorithm is more likely to detect potential patients with AD.
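The measures compared above follow the standard confusion-matrix definitions, which can be sketched as below. The counts in the usage example are hypothetical and are not results from the paper; the function name is illustrative.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: positive (AD) samples predicted correctly / missed.
    fp/tn: negative samples predicted as positive / predicted correctly.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=80, fp=20, fn=20, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall is the measure tied most directly to missed diagnoses, since every false negative is an AD patient that the model fails to flag.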

Discussion
AD is a rare and high-risk cardiovascular disease. Its complex clinical manifestations and various atypical symptoms lead to frequent misdiagnosis and missed diagnosis. The rarity of the disease also leads to a significant imbalance in the data set. This paper studies the misdiagnosis and missed diagnosis of AD and the imbalanced characteristics of the data.
For the original AD dataset, in the data preprocessing stage, we performed missing value processing, feature screening and data standardization, and we examined the data distribution through dimensionality-reduction visualization. This differs from the usual "black box" approach of machine learning algorithms. These methods give us a priori knowledge of the distribution of actual medical data sets, which can inspire clinical medicine to explore the etiology and diagnostic criteria of AD.
To address the high imbalance of the AD data set, this paper proposes a clustering-based resampling method, CRST. It combines the advantages of the traditional sampling method Smote + Tomek-link and the clustering algorithm K-means++. In CRST, a certain percentage of samples is randomly selected from each cluster, which not only makes the selected samples effectively represent the characteristics of the majority-class samples, but also ensures the randomness of sampling. Experiments show that CRST effectively reduces the imbalance ratio of rare-disease medical data and relieves the obstacles that imbalanced data bring to the construction of classification models.
On the basis of CRST, this paper proposes the CRST-Bagging learning model, which incorporates the idea of ensemble learning. In the experimental comparison, the CRST-Bagging model shows excellent performance on the AD data set: the accuracy and recall rate of the model on the original AD data set are improved, and the generalization ability of the model on the test set is also good. This suggests that the model is a good diagnostic model of AD. Clinically, the misdiagnosis rate of AD is close to 40% [3][4][5]. The model performance data suggest that this algorithm can not only reduce the workload of doctors, but also reduce the misdiagnosis rate of AD and thereby help save patients' lives.

Conclusion
In this paper, a cluster-based ensemble learning model named CRST-Bagging is proposed to assist in diagnosing aortic dissection from patients' examination results. Compared with ordinary classification models, our model pays more attention to the processing of medical data sets with a high imbalance ratio. While ensuring high accuracy, CRST effectively improves the recall rate; that is, the rate of missed diagnosis and misdiagnosis is reduced. In addition, the algorithm is easy to use. In many primary hospitals where the equipment is not advanced enough, it is difficult for patients to undergo further examinations such as CT or magnetic resonance angiography (MRA). The model proposed in this paper can reduce the burden on doctors and patients to a certain extent and help diagnose AD.
The diagnosis of AD remains one of the most difficult problems in the cardiovascular field. In the future, based on the analysis results of the aortic dissection data and the auxiliary diagnostic method proposed in this paper, we will study the pathological mechanism and key diagnostic indicators of aortic dissection from the perspective of interpretability, and explore whether there is a more definite clinical diagnostic method for aortic dissection.

Ethical approval for this study was obtained from the Ethics Board of Xiangya Hospital, Central South University (201502042).

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Availability of data and materials
Data are provided by Xiangya Hospital and cannot be shared with other research groups without the necessary permission. The data used during the current study are available from the corresponding author on reasonable request. The data description supporting the conclusions of this article is included in the article (and its Appendix).

Funding
The study was financially supported by the National Natural Science Foundation of China (No. 61502537) and the Strategic Emerging Industry Technological Research and Major Technological Achievement Transformation Project, High-tech Development and Industrialization Office (No. 2019GK4013). The funding bodies had no role in the design of the study, in the collection, analysis, and interpretation of data, or in writing the manuscript.

Authors' contributions
All of the authors had full access to all of the data in the study and take responsibility for the content of the manuscript. YG designed the model and the experiment implementation. MW and LJZ wrote the code. LJL, GGZ and JML contributed to data collection and feature selection. YG, LJL and GGZ performed the results analysis. MW and LJZ drafted the initial manuscript. YG revised the manuscript. All authors read and approved the final draft of the manuscript for publication.

abdominal aorta of the adult. The task force for the diagnosis and treatment of aortic diseases of the European Society of Cardiology (ESC). Eur Heart J 2014;35:2873-926.