An accurate diagnosis of coronary heart disease by CatBoost, with easily accessible data

Coronary heart disease (CHD) has become the leading cause of death worldwide. Coronary angiography is the “gold standard” for diagnosing CHD, but its invasiveness and high cost make it difficult to apply on a large scale. This study used CatBoost to diagnose CHD from simple indicators. A total of 2642 samples, including 717 patients with CHD, were collected from 2018 to 2019. Thirty-three features were collected, covering demographic, anthropometric, questionnaire and laboratory examination indicators. Diagnostic models of CHD were built with CatBoost, random forest and logistic regression. Accuracy and area under the ROC curve (AUROC) were used to evaluate the classification performance of the models. To facilitate application, we also built a simplified model based only on the non-laboratory features. CatBoost showed the best performance in identifying patients with CHD. The accuracy of CatBoost, random forest and logistic regression was 82.5%, 75.1% and 75.8%, respectively, and the AUROC was 0.881, 0.837 and 0.832, respectively. Age, total cholesterol and family history of CHD were the three most important risk factors for diagnosing CHD. CatBoost also performed best among the simplified models, with 77.9% accuracy and 0.857 AUROC. The models can contribute to early screening and diagnosis of CHD, which would facilitate prevention and timely treatment of the disease.

Machine learning, which lies at the intersection of statistics and computer science, can detect complex nonlinear relationships among variables by minimizing the error between predictions and observations [6][7]. It has been widely used to predict diseases such as Parkinson's disease, liver disease, skin disease, diabetes and some cancers [8][9][10], as well as CHD [11][12]. Amin et al. used seven algorithms with 13 features to predict heart disease, and the Vote algorithm achieved the best accuracy, 87.4% [7]. Marateb et al. proposed a fuzzy rule-based system built on a neuro-fuzzy classifier, which reached 84% accuracy in predicting CHD [13]. Alizadehsani also predicted stenosis of the left anterior descending coronary artery, the left circumflex branch and the right coronary artery by combining SVM with a kernel fusion algorithm. The accuracy for these three vessels was 86.14%, 83.17% and 83.50%, respectively [14]. Most previous studies used clinical parameters, such as echocardiography and exercise tests, to predict CHD. Although these features are simpler to collect than coronary angiography, the examinations are time-consuming and require professional physicians. As a result, intelligent diagnosis of CHD on a large scale remains a challenge, especially in poor areas.
Therefore, the purpose of this study was to establish a new diagnostic method for CHD using low-cost, easy-to-collect features.

Dataset
From 2018 to 2019, 2642 participants were randomly recruited, including 717 patients with CHD. CHD was diagnosed if the medical records showed a history of CHD or if coronary angiography showed coronary artery stenosis ≥50% (Gensini method). This study was approved by the local Ethics Committee, and written informed consent was obtained from each participant.

Algorithms
This study used CatBoost, Random Forest Classifier (RFC) and Logistic Regression (LR) to predict CHD and to identify its risk factors. CatBoost is an algorithm that applies gradient boosting to decision trees. It handles categorical features directly and has been reported to outperform other gradient boosting methods [15]. To convert categorical values to numbers, CatBoost uses one-hot encoding governed by a one-hot-max-size threshold, random permutations and target-based statistics, and it selects new tree splits greedily. The encoding has the following steps: 1. Permuting the dataset in a random order; 2. Transforming the labels to integer numbers; 3. Converting the categorical values to numerical values. RFC combines multiple decision trees, using bagging (bootstrap sampling of the training data) and random feature selection when building the trees. The result of RFC comes from a vote of the trees, and the class with the most votes becomes the final result [16][17].
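The three encoding steps listed above can be illustrated with a minimal sketch of an ordered target statistic. This is a simplified illustration, not the CatBoost implementation: the function name, the `prior` and `weight` parameters and the single-permutation design are assumptions made for the example.

```python
import random

def ordered_target_statistic(categories, labels, prior=0.5, weight=1.0, seed=0):
    """Encode one categorical column with a simplified ordered target statistic.

    Hypothetical helper for illustration. `labels` are assumed to be
    integers 0/1 already (step 2 in the text).
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)       # step 1: random permutation of the rows
    sums, counts = {}, {}                    # running label sum / row count per category
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # step 3: smoothed mean of the labels seen so far for this category
        encoded[idx] = (s + weight * prior) / (c + weight)
        # update only after encoding, so a row never uses its own label
        sums[cat] = s + labels[idx]
        counts[cat] = c + 1
    return encoded

encoded = ordered_target_statistic(["a", "a", "b"], [1, 0, 1])
```

Because each row is encoded only from rows that precede it in the permutation, the first occurrence of every category receives the prior value, which limits leakage of the target into the feature.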
LR, a classification algorithm, served as the benchmark in this study. It is described in detail by Hosmer and Lemeshow [18]. LR applies the sigmoid (logistic) function to a linear combination of the features and outputs a value between 0 and 1.
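As a minimal sketch of how the sigmoid function turns a linear score into a value between 0 and 1, the example below uses hypothetical weights and a hypothetical bias. It does not reproduce the fitted model of this study.

```python
import math

def sigmoid(z):
    """Logistic function; the two branches avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(features, weights, bias):
    """Sigmoid of the linear combination bias + sum(w_i * x_i)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for two normalized features; the linear score here is 0.
p = predict_proba([1.0, 2.0], [0.5, -0.25], bias=0.0)
```

A sample is then assigned to the positive class when the output exceeds a chosen threshold, commonly 0.5.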

Data processing
80% of the data was used as the training set and the remaining 20% as the test set. Ten-fold cross-validation was used to test the generalization ability of the model. Features were normalized by min-max scaling, (current-min)/(max-min). The Easy Ensemble sampling technique proposed by Zhou et al. was used to optimize the results [13].
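The processing steps described above can be sketched in plain Python as follows. All function names, the random seed and the number of balanced subsets are assumptions for illustration. The balanced-subset helper shows only the undersampling idea behind Easy Ensemble (every minority sample plus an equally sized random draw from the majority class), not the full ensemble training.

```python
import random

def min_max_normalize(column):
    """Scale values to [0, 1] with (current - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out a fraction as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def balanced_subsets(labels, n_subsets=5, seed=0):
    """Return index subsets with all minority rows (label 1, assumed smaller)
    plus an equal number of randomly drawn majority rows (label 0)."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    return [minority + rng.sample(majority, len(minority))
            for _ in range(n_subsets)]

train_idx, test_idx = train_test_split_indices(2642)
cv_folds = list(k_fold_indices(train_idx, k=10))
```

The source does not state whether the minimum and maximum were computed on the training set only; doing so is the usual way to keep test-set information out of the model.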

3. Results and Discussions
This study used CatBoost, RFC and LR to build six models, three on the total feature set and three on the non-laboratory features. Table 2 and Table 3 show the confusion matrices of the six models on the test set. Based on these confusion matrices, the accuracy of CatBoost, LR and RFC on the total feature set was 82.5%, 75.1% and 75.8%, respectively, and that on the non-laboratory set was 77.9%, 75.4% and 74.7%, respectively. Figure 1 shows the sensitivity, specificity and F-score of the three algorithms on the two data sets. Figure 2 shows the ROC curves of the three algorithms on the two data sets. The results indicate that CatBoost had strong diagnostic ability for CHD whether all features or only the non-laboratory features were used. Few studies have applied CatBoost to CHD prediction. To our knowledge, this study was the first to apply the CatBoost algorithm to the diagnosis of CHD, and it performed better than random forest and logistic regression. Moreover, to our knowledge, this study was the first to develop a diagnostic model for CHD based on a Chinese population sample database.
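The evaluation measures mentioned above can be derived from a confusion matrix and from predicted scores as in the following minimal sketch. The function names are hypothetical, the F-score is assumed to be the F1 score, and AUROC is computed through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one) rather than by integrating the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = CHD, 0 = no CHD)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def safe_div(a, b):
    """Return a / b, or 0.0 when the denominator is zero."""
    return a / b if b else 0.0

def classification_metrics(tp, fp, tn, fn):
    sensitivity = safe_div(tp, tp + fn)     # true positive rate (recall)
    specificity = safe_div(tn, tn + fp)     # true negative rate
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }

def auroc(y_true, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly
    (ties count one half). Assumes both classes are present."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

m = classification_metrics(*confusion_counts([1, 1, 0, 0], [1, 0, 0, 0]))
area = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise AUROC loop is quadratic in the number of samples, which is acceptable for a test set of a few hundred participants but would be replaced by a sort-based method on larger data.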
CHD-related databases have been developed, and researchers can evaluate their methods on standardized databases such as the UCI Repository of Machine Learning Databases, the Long-Term ST Database and the Z-Alizadeh Sani dataset. Alizadehsani et al. used Naïve Bayes, C4.5 AdaBoost and sequential minimal optimization (SMO) algorithms with 16 features, including laboratory data and echocardiography, to predict CHD. The SMO algorithm had the best accuracy, 82%, and ejection fraction and segmental wall motion abnormalities were the most important features for the diagnosis of CHD [19]. Tsipouras proposed an automatic diagnosis system for coronary artery disease, an improved decision tree model based on demographic, questionnaire and blood test features together with arterial stiffness indices (PWV and ABI). The model had 73.4% accuracy, 80% sensitivity and 65% specificity [20]. These studies showed that machine learning algorithms can perform well in CHD diagnosis. Because different databases, features, experimental designs and algorithms can lead to different results, the results of different studies should not be compared directly.
The current study examined not only the applicability of the algorithms but also the feasibility of promoting this intelligent CHD diagnosis. Most previous studies used complex features, such as echocardiography, chest pain symptoms and exercise testing [21][22], which are costly, intrusive and require professional staff and equipment. Moreover, some of these tests carry risks, and some cannot be performed on people with disabilities. Thus, these features are not suitable for use in large populations, especially in poor areas. The features used in the current study are easy to collect and low-cost, or even free. In China, blood indicators, including fasting blood sugar and serum lipids, are routine examination items and can be measured at primary community health service centers. In addition, according to the Chinese National Basic Public Health Service Specification, free blood tests are available for people over the age of 65, who are also at high risk of coronary heart disease. The remaining features in this study can be collected from questionnaires. Together, these factors make it feasible to apply the diagnostic model of CHD established in this study.
The importance of features in CatBoost, which had the best performance in this study, is shown in Figure 3. In the total feature set, age contributed the most, followed by total cholesterol and family history of CHD. Aging has been shown to correlate with various cardiovascular diseases. With aging, the blood vessels age: the intima thickens, elastin declines, arterial elasticity decreases, atherosclerosis gradually forms and the vessels become blocked, eventually leading to CHD [23]. Abnormal lipid metabolism is one of the most important risk factors for CHD. There is a dose-response relationship between total cholesterol and the risk of coronary heart disease [24]. A large number of studies have confirmed that family history is an independent risk factor for CHD [25][26]. A family history of CHD not only increases the risk but also lowers the age of onset of coronary heart disease [27]. Among the non-laboratory features, age, BMI and systolic blood pressure were the three most important for the diagnosis of CHD. The role of blood pressure in coronary heart disease has been demonstrated in numerous studies. BMI is not only an independent risk factor for CHD but can also mediate CHD by increasing blood pressure, worsening blood glucose and inducing metabolic syndrome [28]. The order of importance of the variables in the model is consistent with the results of epidemiological studies of CHD, which indicates that the diagnostic model constructed in this study is reasonable.
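One generic way to rank features by their contribution to a fitted classifier is permutation importance, sketched below. This is a model-agnostic substitute chosen for illustration and is not the importance measure reported by CatBoost in Figure 3. The toy model, data and function names are hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(1 for r, y in zip(rows, labels) if model(r) == y) / len(labels)

def permutation_importance(model, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when one feature column is shuffled.

    `model` is any callable mapping a feature list to a predicted label.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)          # break the link between feature j and the label
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy model that uses only the first feature; the second one is ignored.
toy_model = lambda r: 1 if r[0] > 0.5 else 0
rows = [[0.1, 5], [0.9, 3], [0.2, 7], [0.8, 1], [0.3, 2], [0.7, 9]]
labels = [0, 1, 0, 1, 0, 1]
importances = permutation_importance(toy_model, rows, labels)
```

In this toy case the ignored feature receives an importance of exactly zero, while shuffling the informative feature lowers accuracy and gives it a positive score.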

Conclusion
Based on the results and discussion presented above, the following conclusions are drawn:
(1) The CatBoost algorithm has the best performance in CHD diagnosis, compared with random forest and logistic regression.
(2) The diagnostic model of CHD based on the CatBoost algorithm can reach high accuracy, sensitivity and specificity.
(3) Age, total cholesterol and family history of CHD were the three most important risk factors for identifying coronary heart disease.