Development of a machine learning-based multimode diagnosis system for lung cancer

As an emerging technology, artificial intelligence has been applied to identify various physical disorders. Here, we developed a three-layer diagnosis system for lung cancer involving three machine learning approaches: decision tree C5.0, artificial neural network (ANN), and support vector machine (SVM). The area under the curve (AUC) was employed to evaluate their decision powers. In the first layer, the AUCs of C5.0, ANN, and SVM were 0.676, 0.736, and 0.640, respectively, so ANN outperformed C5.0 and SVM. In the second layer, ANN was similar to SVM but superior to C5.0, with AUCs of 0.804, 0.889, and 0.825 for C5.0, ANN, and SVM, respectively. Much higher AUCs of 0.908, 0.910, and 0.849 were obtained in the third layer, where the highest sensitivity of 94.12% was found for C5.0. Based on these data, we propose a three-layer diagnosis system for lung cancer: ANN is first used as a broad-spectrum screening subsystem, based on 14 epidemiological characteristics and clinical symptoms, to screen high-risk groups; then, with 5 additional tumor biomarkers, ANN is used as an auxiliary diagnosis subsystem to identify suspected lung cancer patients; finally, C5.0 is employed to confirm lung cancer patients based on 22 CT nodule-based radiomic features.


INTRODUCTION
Lung cancer is the most common cause of cancer-related death worldwide due to insidious incidence, high metastasis, and poor prognosis [1]. As reported by the Annual Report of America in 2018, the five-year survival rate of lung and bronchus cancer ranged from 55.1% (stage I) to 4.2% (stage IV) for cases that were diagnosed from 2007 through 2013 [2]. However, only 25.3% of lung and bronchus cancer patients were diagnosed at stage I or stage II, while 66.9% of cases were diagnosed at stage III or stage IV due to the lack of an efficient early diagnostic tool for lung cancer [2].
Five-year survival analysis by stage and the examination of stage distribution indicate the potential benefits of early detection and treatment [2]. Thus, it is essential to develop a novel early diagnostic strategy that can enhance the clinical therapeutic efficacy for lung cancer.
Nowadays, chemical diagnosis, imaging diagnosis, and cytological and histological diagnosis are the primary diagnostic methods for lung cancer [3]. Among them, computed tomography (CT)-based imaging diagnosis is the primary tool for detecting lung cancer at early stages [4][5][6]. The results of the National Lung Screening Trial confirmed that low-dose CT (LDCT) adopted in the high-risk group could reduce the mortality rate of lung cancer by 20% compared with chest X-ray [6]. Several other studies also demonstrated that, to detect early lung cancer, CT scans should be implemented for high-risk groups but not for the general population, which could decrease the radiation hazard and financial costs [7][8][9]. However, it is difficult to identify the high-risk group for lung cancer. At present, the definition of the high-risk group remains controversial and is mainly based on age and smoking status [7]. Evidence showed that lung cancer could also be indicated by other epidemiological characteristics and clinical symptoms, such as a family history of cancer and hemoptysis [7,9,10].
Indeed, CT provides effective early diagnostic information on lung cancer from a macroscopic perspective, as it can clearly locate nodule sites and indicate metastasis. Radiologists distinguish benign from malignant nodules by their size, shape, density, and other characteristics [11]. However, CT images are difficult to analyze manually and require radiologists to have excellent reading skills, especially for the diagnosis of small and isolated pulmonary nodules [12,13]. It is reported that the false positive rate of LDCT screening for lung cancer is as high as 96.4% [6]. Therefore, the diagnostic efficiency of CT for lung cancer needs to be further improved. On the one hand, it is necessary to develop a method that can effectively distinguish benign from malignant CT nodules. At present, many scholars try to extract radiomic features of CT nodules and establish models to achieve the intelligent identification of benign and malignant nodules [12,14,15]. On the other hand, there is an urgent need for an auxiliary means that can enhance the diagnostic efficiency for lung cancer in combination with CT. Tumor markers have been widely used in the detection of lung cancer in recent years, such as progastrin-releasing peptide (ProGRP), vascular endothelial growth factor (VEGF), carcinoembryonic antigen (CEA), cytokeratin 19 fragment (CYFRA21-1), and neuron-specific enolase (NSE) [16,17]. Previous studies confirmed that a risk model constructed with these tumor markers could enhance the early diagnosis of lung cancer [18,19]. Tumor markers in serum provide microscopic molecular information related to the occurrence and progression of cancer, which points to a new direction for the early detection of lung cancer [16,20]. In addition, blood sampling is minimally invasive, repeatable, and easily performed, making serum an excellent matrix for lung cancer diagnosis [20,21].
Thus, the combination of tumor markers and the features of CT nodules, which offers both microscopic molecular information and macroscopic imaging information, is expected to be an ideal strategy for lung cancer diagnosis at early stages [22]. However, medical data in current studies are complex and cannot be processed adequately by traditional statistical methods. In particular, parameter analysis and information mining are challenging tasks [23]. Machine learning based on data mining technology can extract valuable knowledge and information from a large amount of incomplete and noisy data, and may therefore be suited to this work [24]. Recent studies have demonstrated that the application of machine learning significantly improves metastasis detection in lymph nodes, Ki67 scoring in breast cancer, Gleason grading in prostate cancer, and tumor-infiltrating lymphocyte scoring in melanoma [25]. Furthermore, deep machine learning models are able to predict the changes of some tumor markers in lung, prostate, gastric, and colorectal cancer [25]. Moreover, prognostic deep neural network models developed from digitized HE slides have been adopted in the diagnosis of lung cancer, melanoma, and glioma [25]. Among the various machine learning approaches, decision tree (DT) C5.0, artificial neural network (ANN), and support vector machine (SVM) have been widely applied in the development of cancer prediction models, resulting in effective and accurate diagnosis [26].
In this study, C5.0, ANN, and SVM were applied to develop an efficient multilayer diagnosis system for lung cancer based on multidimensional variables. The diagnosis system integrated epidemiological characteristics, clinical symptoms, and molecular markers with CT nodule-based radiomic features, which combined micro biomarkers with macro imaging, behavior characteristics, and laboratory research with clinical diagnosis technology.

Statistical analysis of epidemiological characteristics and clinical symptoms from 842 cases in the first-layer subsystem
The comparisons of the 14 features describing the epidemiological characteristics and clinical symptoms (between the 372 lung cancer and the 470 lung benign disease cases) are shown in Table 1. Statistical analysis showed significant differences between the two groups (P<0.05) for age by groups, age, gender, smoking status, drinking status, history of lung infection, expectoration, bloody sputum, fever or sweating, cough, and hemoptysis. There were no significant differences between the lung cancer and lung benign groups (P>0.05) for chest tightness or chest pain, family history of tumor, and family history of lung cancer. Demographic characteristics of lung cancer and lung benign disease patients in the second-layer subsystem are presented in Table 2. There were significant differences between the two groups (P<0.05) for age by groups, smoking status, history of lung infection, expectoration, bloody sputum, fever or sweating, hemoptysis, and family history of lung cancer. In contrast, there were no significant differences between lung cancer and lung benign patients (P>0.05) for age, gender, drinking status, chest tightness or chest pain, cough, and family history of tumor. As shown in Table 3, the levels of ProGRP, VEGF, CEA, and CYFRA21-1 in the lung cancer group were higher than those in the lung benign disease group (P<0.05). However, there was no statistical difference in the level of NSE between the two groups (P>0.05).

Statistical analysis of the 22 radiomic features extracted from lung CT nodules in the third-layer subsystem
The demographic characteristics of the subjects in the third-layer subsystem are shown in Supplementary Table 1. A total of 22 lung CT nodule-based radiomic features were extracted from 123 lung CT nodules, comprising 64 lung benign nodules and 59 lung cancer nodules. However, the extracted lobulation grade f13 and spiculation grade f14 were 0 in both groups and could not be analyzed statistically. As shown in Table 4, there were significant differences between the two groups (P<0.05) for the radiomic features of gray mean f1, gray variance f2, gray histogram entropy f3, seven order invariant distance f4, calcification area f11, calcification area/nodule area f12, cavity number f15, contrast f18, correlation f19, energy f20, homogeneity f21, and entropy f22. However, there were no significant differences between lung CT benign and malignant nodules (P>0.05) for the seven order invariant distance f5, f6, f7, f8, f9, f10, cavity area f16, and cavity area/nodule area f17.

Development of machine learning models
As shown in Table 5, machine learning models were constructed to distinguish lung cancer from lung benign diseases. 14 epidemiological characteristics and clinical symptoms of 638 samples, including 296 lung cancer and 342 lung benign diseases, were used as input features to develop the models of C5.0-1, ANN-1, and SVM-1 in the training set. The accuracies of C5.0-1, ANN-1, and SVM-1 models in the training set were 79.78%, 73.04%, and 77.27%, respectively. 204 samples, including 76 cases with lung cancer and 128 lung benign diseases, were used as the testing set to verify the effect of the three models. The accuracies of the C5.0-1, ANN-1, and SVM-1 models in the testing set were 69.12%, 71.57%, and 65.20%, respectively. The 14 features mentioned above and the 5 serum tumor markers levels including ProGRP, VEGF, CEA, CYFRA21-1 and NSE from 208 patients were employed as the input variables to develop the C5.0-2, ANN-2 and SVM-2 models in the training set, which included 97 lung cancer and 111 lung benign disease patients. The accuracies of C5.0-2, ANN-2, and SVM-2 models in the training set were 97.60%, 85.58%, and 98.08%, respectively. 78 samples, including 32 lung cancer and 46 lung benign diseases, were employed to test the effect of C5.0-2, ANN-2, and SVM-2 models. The accuracies of models in the testing set were 80.77%, 89.74%, and 83.33%, respectively. 22 radiomic features were extracted from 90 lung CT nodules and adopted to train the C5.0-3, ANN-3, and SVM-3 models, which included 42 lung cancer nodules and 48 lung benign nodules. The accuracies of C5.0-3, ANN-3, and SVM-3 models in the training set were 100%, 93.33%, and 100%, respectively. 33 samples, including 17 lung cancer nodules and 16 lung benign nodules, were used to test the effect of the models. The accuracies of C5.0-3, ANN-3, and SVM-3 models in the testing set were 90.91%, 90.91%, and 84.85%, respectively.
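The training and testing counts reported above (for example 638 and 204 out of 842) are close to, but not exactly, a 3:1 ratio. One possible explanation, assumed here and not stated in the text, is that each record was assigned to a set independently with a probability of 0.75. The following Python sketch illustrates that kind of random partition; the function name `partition`, the seed, and the placeholder record IDs are hypothetical.

```python
import random


def partition(records, train_fraction=0.75, seed=0):
    """Randomly assign each record to the training or testing set.

    Each record is drawn independently (an assumption about how the
    partition behaves), so the realized ratio is only approximately 3:1.
    """
    rng = random.Random(seed)
    train, test = [], []
    for record in records:
        if rng.random() < train_fraction:
            train.append(record)
        else:
            test.append(record)
    return train, test


# 842 placeholder record IDs, matching the first-layer sample size.
train_ids, test_ids = partition(range(842))
print(len(train_ids), len(test_ids))  # exact sizes depend on the seed
```

Fixing the seed makes the split reproducible, which matters when three different models are compared on the same testing set.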

Effect evaluation of machine learning models
As presented in Table 6, the testing effect of the model was evaluated by sensitivity, specificity, accuracy, PPV, NPV, and AUC.
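As an illustration of how these indexes relate to one another, the Python sketch below computes them from a 2x2 confusion matrix. The counts are inferred, not reported: with 17 lung cancer and 16 benign nodules in the third-layer testing set, a sensitivity of 94.12% and an accuracy of 90.91% for C5.0-3 are consistent with 16 true positives, 1 false negative, 14 true negatives, and 2 false positives. The function name `diagnostic_metrics` is hypothetical, and the remaining values it returns follow from those inferred counts rather than being quoted from Table 6.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return percentage metrics from confusion-matrix counts.

    tp/fn: lung cancer cases classified as cancer / benign.
    tn/fp: benign cases classified as benign / cancer.
    """
    total = tp + fn + tn + fp
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "accuracy": 100 * (tp + tn) / total,
        "ppv": 100 * tp / (tp + fp),
        "npv": 100 * tn / (tn + fn),
    }


# Inferred counts for C5.0-3 on the testing set (an assumption, see above).
c50_3 = diagnostic_metrics(tp=16, fn=1, tn=14, fp=2)
for name, value in c50_3.items():
    print(f"{name}: {value:.2f}%")
```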

DISCUSSION
Although lung cancer has no specific symptoms in its early stage, molecular abnormalities and imaging changes arise during its occurrence and development. This characteristic information can be captured and used for the diagnosis of lung cancer. However, there are different types of data, including descriptive epidemiological characteristics and clinical symptoms. In this study, C5.0, ANNs, and SVMs were employed to construct the diagnostic systems of lung cancer [28]. DTs are tree-structured schemes where the nodes represent the input variables, and the leaves correspond to decision outcomes [26]. They are widely used for classification purposes and can be intuitive [3]. ANNs are modeled on the biological neurons of the human brain and trained to generate an output as a weighted combination of the input variables [29,30]. They aim to solve a variety of classification or pattern recognition problems [26].

Table 6. Effect evaluation of machine learning models in the testing set.
Models  Accuracy (%)  Sensitivity (%)  Specificity (%)  PPV (%)  NPV (%)  AUC (95% CI)
The main advantage of ANNs is their ability to approximate any nonlinear mathematical function [31]. SVMs are based on the principle of structural risk minimization and map the data into a multidimensional space to achieve classification with a hyperplane, which gives them distinct advantages for problems with small sample sizes, nonlinearity, or high-dimensional patterns [3,31]. Every approach has its advantages and disadvantages, and it is necessary to try different methods to find a suitable model for the diagnosis of lung cancer.
Previous studies demonstrated that screening with CT reduced mortality from lung cancer in high-risk groups, but not in the general population [6][7][8][9]. The risk assessment of lung cancer involves multiple factors, including epidemiological characteristics and clinical symptoms [9,32]. In this study, 14 epidemiological characteristics and clinical symptoms from 842 subjects were investigated to build the C5.0-1, ANN-1, and SVM-1 models. The results showed that the ANN-1 model had the best performance. To our knowledge, the definitions of people at risk for lung cancer vary globally and mainly depend on age and smoking status [6][7][8][9]. Our current model determines lung cancer by integrating multiple factors including age and smoking status, which has been proved to be an effective tool for identifying lung cancer. Moreover, epidemiological characteristics and clinical symptoms can be easily obtained by a questionnaire, which is economical and physically harmless. Therefore, the ANN-1 model constructed from these data is recommended for the broad-spectrum screening of a large sample population in the first-layer subsystem, which helps to screen out the high-risk group of lung cancer from patients with pulmonary diseases.
In addition, another strategy, the use of tumor markers in the blood, may further help to select the persons who are best suited for CT scans, which will help to decrease the radiation hazard and financial costs [6]. In recent years, ProGRP, VEGF, CEA, CYFRA21-1, and NSE have been identified as tumor markers of lung cancer and are commonly adopted in clinical detection [33][34][35].
Increasing evidence suggests that the combined assessment of serum molecular markers can effectively discriminate lung cancer [35,36]. According to our results, the ANN-2 and SVM-2 models were superior to C5.0-2 by AUC comparison. These models were established with 14 features of epidemiological and clinical data and 5 serum tumor markers (ProGRP, VEGF, CEA, CYFRA21-1, and NSE) from 286 samples. In addition, the sensitivity, specificity, accuracy, PPV, and NPV of ANN-2 were higher than those of SVM-2. Therefore, we propose the use of the ANN-2 model for identifying suspected lung cancer patients from high-risk groups, which is named the auxiliary diagnosis subsystem. Further, only the suspected lung cancer patients are recommended to undergo CT scans, which will reduce the radiation hazard and alleviate the financial burden of CT scans. However, CT scans also face other challenges such as over-diagnosis and a high false-positive rate [6,8]. To overcome these obstacles, the benign and malignant lung nodules on CT images were analyzed [37]. A total of 22 radiomic features were extracted from 123 lung CT nodules, based on which the ANN-3, C5.0-3, and SVM-3 models were developed. All models showed good performance in terms of sensitivity, specificity, accuracy, PPV, NPV, and AUC. In particular, the AUCs of ANN-3 and C5.0-3 exceeded 0.9. Although there were no statistical differences by AUC comparison among the three models, C5.0-3 had the highest sensitivity of 94.12%. Hence, the C5.0-3 model is recommended for distinguishing malignant from benign lung nodules, which can be utilized for the intelligent diagnosis of lung cancer.
Based on our results, we propose an efficient diagnostic strategy for lung cancer with a three-layer system structure. The first layer, a broad-spectrum screening subsystem, is constructed from 14 epidemiological characteristics and clinical symptoms using an ANN model to screen high-risk groups from patients with pulmonary diseases. The second layer is an auxiliary diagnosis subsystem built on epidemiological characteristics, clinical symptoms, and 5 serum tumor markers of lung cancer (ProGRP, VEGF, CEA, CYFRA21-1, and NSE), with an ANN model for identifying suspected lung cancer patients from high-risk groups. The third layer, an intelligent diagnosis subsystem, is developed from 22 lung CT nodule-based radiomic features using a C5.0 model for the further confirmation of lung cancer patients. Patients with lung cancer will thus be diagnosed step by step, so as to reduce the radiation hazard, over-diagnosis, and financial costs. This strategy can be used for the on-site screening and clinical diagnosis of the high-risk population.
Permission for data and sample collection was obtained from the patients or their relatives.

Measurement of 5 serum tumor markers
Venous blood (3 mL) was collected from every fasting subject in the morning. The blood samples were kept at 37°C for 30 minutes and then centrifuged for 10 minutes at 1500g. Finally, the serum was separated and stored at -80°C for follow-up analyses. Serum ProGRP and VEGF were determined by ELISA kits (Shanghai enzyme-linked biological technology company) according to the manufacturer's instructions. Chemiluminescence detection kits (Beijing huaketai biotechnology company) were employed to detect serum CEA, CYFRA21-1, and NSE according to the experimental procedures.

Flow chart of proposed work
A machine learning-based three-layer diagnostic system for lung cancer was proposed in this study, as shown in Figure 1. The first layer was a broad-spectrum screening subsystem, which screened out the high-risk group of lung cancer from pulmonary disease patients. Its machine learning-based screening models were developed using the 14 features of epidemiological characteristics and clinical symptoms. The high-risk individuals screened by the first-layer subsystem were included in the second-layer subsystem. The second layer was a machine learning-based auxiliary diagnosis subsystem, constructed with the 14 features of epidemiological characteristics and clinical symptoms and the 5 serum tumor markers, for identifying suspected lung cancer patients from the high-risk groups. The suspected lung cancer patients identified by the second-layer subsystem were further introduced into the third-layer subsystem. The third layer was an intelligent diagnosis subsystem, which was developed from the 22 lung CT nodule-based radiomic features using machine learning models for further confirming lung cancer patients.
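The step-by-step flow described above can be sketched as a simple cascade in which a patient moves to the next layer only after a positive call from the current one. This is a minimal Python illustration, not the study's implementation: the function `three_layer_diagnosis`, the helper `threshold_on`, the score keys, and the 0.5 cutoff are all hypothetical stand-ins for the trained models.

```python
from typing import Callable, Mapping

Patient = Mapping[str, float]
Layer = Callable[[Patient], bool]


def three_layer_diagnosis(patient: Patient, screen: Layer,
                          auxiliary: Layer, confirm: Layer) -> str:
    """Route a patient through the three subsystems in order."""
    if not screen(patient):       # layer 1: epidemiology and symptoms
        return "not high-risk"
    if not auxiliary(patient):    # layer 2: adds serum tumor markers
        return "high-risk, not suspected"
    if not confirm(patient):      # layer 3: CT nodule radiomic features
        return "suspected, not confirmed"
    return "lung cancer"


def threshold_on(key: str, cutoff: float = 0.5) -> Layer:
    """Hypothetical stand-in for a trained model: thresholds a stored score."""
    return lambda patient: patient[key] >= cutoff


# Placeholder model outputs for one patient (illustrative values only).
patient = {"layer1_score": 0.8, "layer2_score": 0.7, "layer3_score": 0.2}
result = three_layer_diagnosis(
    patient,
    screen=threshold_on("layer1_score"),
    auxiliary=threshold_on("layer2_score"),
    confirm=threshold_on("layer3_score"),
)
print(result)
```

Because each layer is evaluated only when the previous one is positive, the later and costlier inputs (serum markers, CT features) are needed only for the patients who reach that layer.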

Establishment of machine learning models
Based on the random sampling function of the machine learning models, the samples were randomly divided into a training set and a testing set at a ratio of 3:1 using a partition node. The training set was employed to develop the models, and the testing set was used to evaluate their performance. The input variables differed among the three subsystems: the 14 epidemiological characteristics and clinical symptoms were used for C5.0-1, ANN-1, and SVM-1 in the first-layer subsystem; the 14 epidemiological characteristics and clinical symptoms combined with 5 serum tumor markers were used for C5.0-2, ANN-2, and SVM-2 in the second-layer subsystem; and the 22 lung CT nodule-based radiomic features were used for C5.0-3, ANN-3, and SVM-3 in the third-layer subsystem. In all subsystems, the group (0 for lung benign diseases, 1 for lung cancer) was set as the output variable. Parameters for the models were set as follows:

The input data of the ANN were required to range from 0 to 1, so the parameters that did not meet this requirement were normalized to the range from 0 to 1 using a linear function. The formula was Y = (X - Xmin) / (Xmax - Xmin), where X was the original value, Y was the transformed value, and Xmax and Xmin were the maximum and minimum among all original data, respectively.
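The linear rescaling described above corresponds to standard min-max normalization. A minimal Python sketch follows, assuming the usual form Y = (X - Xmin) / (Xmax - Xmin) applied to one variable at a time. The function name `min_max_normalize`, the handling of a constant column, and the example ages are assumptions made for illustration.

```python
def min_max_normalize(values):
    """Linearly rescale a list of numbers to the range [0, 1]."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # Constant column: the formula is undefined, so map everything
        # to 0.0 (a choice made for this sketch, not taken from the text).
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


# Hypothetical ages; the minimum maps to 0 and the maximum to 1.
ages = [35, 50, 65, 80]
scaled_ages = min_max_normalize(ages)
print(scaled_ages)
```

In practice, Xmin and Xmax would typically be taken from the training set and reused for the testing set, so that both sets are scaled consistently.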

Statistical analysis
Statistical analyses were performed with SPSS 21.0 software, and SPSS Clementine 21.0 software was used for classification analysis. The data were expressed as median (P25-P75) and analyzed with the Mann-Whitney U test. The chi-square test was employed for each contingency table. A P-value of 0.05 was taken as the significance level.
Six indexes including accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and area under the receiver operating characteristic curve (AUC) were used to evaluate the classification models.
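Of these indexes, the AUC is the only one that does not depend on a single decision threshold. It equals the probability that a randomly chosen lung cancer case receives a higher model score than a randomly chosen benign case, which is the Mann-Whitney U statistic divided by the number of case pairs. The Python sketch below computes it by direct pairwise comparison. The function name `auc_from_scores` and the scores are hypothetical, and the 95% confidence interval reported alongside the AUC is not computed here.

```python
def auc_from_scores(cancer_scores, benign_scores):
    """AUC as the fraction of (cancer, benign) pairs ranked correctly.

    Ties count as half a correct pair, matching the Mann-Whitney U
    statistic divided by the number of pairs.
    """
    correct = 0.0
    for c in cancer_scores:
        for b in benign_scores:
            if c > b:
                correct += 1.0
            elif c == b:
                correct += 0.5
    return correct / (len(cancer_scores) * len(benign_scores))


# Hypothetical model scores (higher means more likely lung cancer).
cancer = [0.9, 0.8, 0.4]
benign = [0.7, 0.3, 0.2]
auc_value = auc_from_scores(cancer, benign)
print(f"AUC = {auc_value:.3f}")
```

The pairwise loop is quadratic in the number of cases, which is acceptable for testing sets of the size used here; a rank-based computation would be preferable for much larger data sets.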