Machine learning application in personalised lung cancer recurrence and survivability prediction

Graphical abstract


Introduction
Lung cancer is the most commonly diagnosed cancer globally and the leading cause of cancer death in both sexes combined with an estimated 1.6 million deaths in 2018 [1]. Approximately 85% of patients have a group of histological subtypes collectively known as non-small cell lung cancer (NSCLC), of which lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) are the most common subtypes [2]. Lobar resection is the standard curative modality for early-stage (Stage I and II) and selected Stage III NSCLC [3]. Although the treatment of NSCLC has made great progress in the past few decades, the five-year survival rate has not improved significantly due to the initial diagnosis at a late stage. Moreover, the recurrence after surgery usually occurs very rapidly: 50-90% occur two years after surgery, and 90-95% of patients occur within five years [4]. Currently, the popularity of computed tomography has significantly increased the rate of early screening for lung cancer [5]. However, there is still a lack of a systematic and objective approach for better diagnosis and treatment of NSCLC. The present study aims to integrate genomic, clinical, diagnostic and demographic data to generate a full picture of patients in order to develop a risk prediction model for the overall survival and recurrence status for NSCLC.
The identification of individuals' overall survival and relapse requires an accurate and robust predictive model. Machine learning (ML) techniques can discover and identify patterns and rela-tionships between various variables from complex datasets so as to predict effectively future outcomes of many cancers. ML has been widely applied to cancer prognosis and prediction [6][7][8][9][10][11][12][13][14]. It has been reported that ML methods can be used to substantially (15-25%) improve the accuracy of predicting cancer susceptibility, recurrence and mortality [6]. In NSCLC, most previous predictive models for lung cancer have been developed based on risk factors such as tobacco smoking history, family history of lung cancer and occupational exposures [15][16][17]. However, these conventional risk factors generally do not provide enough information to make robust predictions or prognoses [6]. With the rapid development of genomic, proteomic and imaging technologies, more specific molecular scale information about the tumour and the patient have been discovered as powerful indicators of cancer prognosis and prediction [18,19]. In addition to well-known biomarkers of KRAS and EGFR, new genomic biomarkers, such as somatic mutations in ALK, ERBB2, TP53 have been demonstrated to be associated with lung cancer risk, response or prognosis [20][21][22]. As proteomic biomarkers, 17 circulating inflammatory proteins have demonstrated clinical utility in lung cancer prognosis [23][24][25][26]. Most recently, many researchers have analysed the quantitative features from radiological images and correlated the radiomic biomarkers with lung cancer prognosis and mutation status [27][28][29][30].
All the emerging biomarkers act as new pieces of the lung cancer puzzle. Ideally, for accurate and robust prediction of an individual's cancer prognosis, all pieces need to fit into the puzzle to generate the full picture of the patient. This means integrating carefully all histological, clinical, demographic, genomic, proteomic, metabolic and radiomic information to come up with a reasonable prognosis. Chen et al. [31] attempted to assess the survival prediction of NSCLC patients through the use of artificial neural networks (ANNs) with 10 selected genes expression as well as clinical and demographic data (sex, age, T stage and N stage). Hanai et al. [32] applied ANNs to construct a prognostic model for 125 NSCLC patients with 12 clinico-pathological variables (age, sex, smoking index, tumor size, p factor, pT, pN, stage, histology) and 5 immunohistochemical variables (p27 percentage, p27 intensity, p53, cyclin D1, retinoblastoma). Hsia et al [33] investigated the survival time in advanced lung cancer patients using ANNs from the genetic polymorphism of the p21 and p53 genes in conjunction with patients' general data (gender, age, disease type and period of lung cancer, chemical diagnosis, treatment type of chemical diagnosis, smoking habit). Marchevsky et al. [34] predicted the survival of Stage I and II NSCLC patients using clinical-pathological (age, sex, cell type, stage, tumour grade, smoking history) and immunohistochemical variables (c-erbB-3, bcl-2, Glut1, Glut3, retinoblastoma gene and p53).
The objective of our present study was to develop a risk prediction model using ML methods to predict overall survival and recurrence status for NSCLC using The Cancer Genome Atlas (TCGA) cohorts. The distinctive features of the work include integrating genomic, clinical and demographic data and introducing both copy number variation (CNV) and mutation information of a broader set of genes (15) to predict overall survival and recurrence status for both LUAD and LUSC patients. The 15 selected genes (TP53, STK11, KRAS, KEAP1, EGFR, SMARCA4, CDKN2A, BRAF, RB1, PIK3CA, NF1, ERBB2, HRAS, NRAS, AKT1) for constructing the predictive model were chosen based on reports of their significance for NSCLC [35,36]. Mutations in a number of these genes may contribute to NSCLC and represent potential therapeutic targets for these tumours. For example, targeted antibody therapies for lung cancer with mutant EGFR oncogenes include necitumumab (Portrazza, Eli Lilly and Co.), cetuximab (Erbitux, Eli Lilly and Co.) and amivantamab-vmjw (Rybrevant, Janssen Biotech).

TCGA data description
TCGA is the largest public pan-cancer biology database, which is available from the TCGA Data Portal at https://tcga-data.nci.nih.gov/tcga/. TCGA database includes genomic, transcriptomic, and epigenetic data for 33 human cancer types represented with more than 11,000 individual samples. In this work, we focus on NSCLC with two major subtypes: LUAD and LUSC. Altogether, we collected 511 representative samples of LUAD and 487 LUSC for which genomic, clinical and demographic data are available for both subtypes. Demographic data includes Age, Gender and Race. Clinical data includes Cancer Stage, TNM Stage, History of Prior Cancer Diagnosis, Overall Survival and Recurrence Status. Genomic data includes mutation and CNV information of 15 genes: TP53, STK11, KRAS, KEAP1, EGFR, SMARCA4, CDKN2A, BRAF, RB1, PIK3CA, NF1, ERBB2, HRAS, NRAS and AKT1. Among the NSCLC data, only Age and Overall Survival are two numerical variables. According to the average value, the Age and the Overall Survival variables were transformed into two categories, namely < or ! 65 years and < or ! 3 years respectively. For the clinical cancer stage variables (Cancer Stage and TNM Stage), we only consider the major stages from I to IV but not the subdivision stages like IA, IB. Fig. 1 summarize the TCGA data used in this work.

Analysis of variance (ANOVA)
ANOVA is a procedure for determining whether variation in the response variable arises within or among different population groups. In this work, one-way analysis of variance is used to determine whether there are any statistically significant differences between the means of NSCLC factors. The level of statistical significance is expressed as the p-value, which is the probability of observing the sample results given that the null hypothesis is true. Usually, a p-value threshold of 0.05 can be considered as statistically significant.

Decision trees (DTs)
DTs are important, well-established machine learning techniques, which have been used for a wide range of applications, especially for classification problems [37,38].
In this work, a popular decision tree algorithm, CART (classification and regression tree), was applied to construct binary trees [39]. The Gini index was used as the splitting rule for CART. If costs of misclassification are not specified, the Gini index is defined as: If costs of misclassification are specified, then the Gini index is defined as: where the sum extends over all k categories. p j=t ð Þ is the probability of category j at the node t and C i=j ð Þ is the probability of misclassifying a category j case as category i.
The tree structure has been optimized based on the best accuracy found using 10-fold cross-validation in MATLAB (R2017).

Artificial neural networks (ANNs)
ANNs are a set of algorithms, to simulate the functioning of a human brain, that are designed to recognize patterns, which result in data-driven models that can interpret effectively patterns in multivariate data from non-linear systems [40].
In this study, a common neural network algorithm, the feedforward neural network (FFNN) [41,42], was applied to construct a model with one hidden layer of 20 neurons using MATLAB (R2017). The maximum number of epochs for training was set to 1000. To prevent the trained network model from over-training, the training procedure stopped if the validation performance degraded for 10 consecutive epochs. The optimal trained network with the best validation performance was selected. The training function used in this work was the Levenberg-Marquardt algorithm which was designed to solve non-linear least squares problems [43]. The Levenberg-Marquardt algorithm uses the Jacobian matrix in the following Newton-like update: where J is the Jacobian matrix that contains first derivatives of the network errors with respect to the weights and biases, and e is a vector of network errors. If the scalar l is zero, this is just Newton's method, using the approximate Hessian matrix. If l is large, this becomes gradient descent with a small step size. Thus, l is decreased after each successful and is increased only when a tentative step would increase the performance function.

Support vector machines (SVMs)
SVMs are supervised learning methods in machine learning algorithms for classification and regression analysis [44].
Least-squares support vector machine (LS-SVM) [45] was used to construct non-linear classification models in this work using MATLAB (R2017). In this work, the optimal regression line (y ¼ w Á uðxÞ þ b) was found by minimizing the object function in Equation (5) while w and b are the regression weight coefficients and the bias terms of the final model.
e i is the error tolerance of the model. In this work, two parameters, u and r 2 , required by implementation of LS-SVM were set as 5 and 2 accordingly. A summary of the pros and cons of the three machine learning approaches compared in this analysis is provided in Table 1. 2.6. K-fold cross-validation K-fold cross-validation is a widely used technique for assessing the robustness of a model [46]. In k-fold cross-validation, the original sample is randomly partitioned into k equal size subsets. Of the k subsets, a single subset is retained as the validation data for testing the model, and the remaining k-1 subsets are used as training data. The cross-validation process is then repeated k times (the folds) and the k results from the folds can be averaged to pro- duce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation and each observation is used for validation exactly once. In this work, 10-fold cross-validation was applied for all three machine learning methods for estimating the prediction error.

Results and discussion
This section presents insights from machine learning methods on identification and prediction of key factors for recurrence and survivability of LUSC and LUAD. Variance analysis is used to reveal the factors with significant influence on recurrence and survivability. These findings are compared with previous studies for corroboration. Three common ML methods (decision trees, neural networks and support vector machines) are applied for building predictive models and their performance is compared in terms of their ability to accurately predict recurrence and survivability of LUSC and LUAD.

Analysis of variance on recurrence risk and survivability for NSCLC
The matrix of p-values for all one-way ANOVA tests for two subtypes of NSCLC, (a) LUAD and (b) LUSC, are shown in Fig. 2. The pvalue matrix reveals the statistically significant impact of copy number variation types (amplification or deletion) between the 15 genes investigated in the work. The signalling pathways involved in the development of lung cancer can explain this observation. The key gene mutations in these pathways are correlated with each other. Therefore, the expression of each key transcription factor may cause a series of downstream factors and crossprotein changes [47][48][49][50]. A similar result [51] reveals that the miR-3151 gene (miRNA gene family) is driven by BRAFindependent mechanisms while the TP53 gene could act as a downstream effector of miR-3151. This finding provided evidence for a causal link between BRAF mutations and TP53.
Statistically significant demographic, clinical and genomic factors for recurrence and overall survival of NSCLC are indicated in Table 2. From the demographic data, race was identified as a significant factor for both LUSC recurrence and overall survivability, with patients of white race having a lower recurrence rate but also a lower survival rate compared to patients of other races. The influence of race on incidence and survival of NSCLC has been attributed to the diversity in inherited genetic variations and an accumulation of somatic genetic events [52,53]. LUSC overall survivability was also found to be influenced by gender with female patients tending to have higher survival rates than male. This observation is rein-forced by other works [54,55]. The mechanism leading to the difference between genders is still unknown, but endocrine factors are believed to play an important role [56,57]. It is worth noting that no significant difference was found in the different age groups (<65 and ! 65) for either recurrence rate or survival rate for LUSC and LUAD, although age is a well-known risk factor for development and progression of cancer [58].
Regarding clinical predictors, cancer stage, N stage and radiotherapy were found to be significant factors for recurrence for both LUAD and LUSC. T stage was identified as a significant factor for recurrence for both LUAD and LUSC but as a significant factor for survival for LUSC only. The analysis indicated that M stage was a significant factor for survival for LUSC only. As is well known, the current cancer staging and TNM staging system for lung cancer are both essential for predicting prognosis and selecting appropriate treatment; it is derived by the International Association for the Study of Lung Cancer from a database of 94,708 patients from 46 sites across 19 countries [59,60]. Usually, patients with a high cancer stage or TNM stage have a poor prognosis and high recurrence risk [60]. This work found that patients with adjuvant radiotherapy had a significantly higher recurrence rate for both LUAD and LUSC. This observation can be explained by the fact that radiotherapy is usually given to the patients in the advanced or terminal stage before or after surgery due to the high relapse rate [61]. Table 2 also highlights the significant genomic factors. The analysis indicates that the EGFR and KRAS copy number variation can have a significant impact on survival rate for LUSC and LUAD respectively while the mutation status of NF1, ERBB2, STK11, TP53, KEAP1 and SMARCA4 can be significant factors. EGFR mutations have been used as the basis for targeted therapies such as EGFR tyrosine kinase inhibitors (EGFR-TKIs) and antibodies. The EGFR pathway is one of the recently discovered pathways that can promote lung cancer. Mutations in the EGFR gene can lead to an increase in the degree of malignancy of lung cancer. There is a significant association between sensitivity to EGFR TKIs and the types of EGFR mutations [62]. Globally, KRAS mutant tumours are the most common potential overlapping molecular subtypes in non-small cell lung cancer [63]. From a clinical point of view, KRAS-mutant lung cancer is usually associated with a worse overall survival rate than KRAS wild-type tumours, especially in advanced cancers [64,65]. However, other studies in the early stage [66] or late group [67] were inconsistent in confirming this poor survival; therefore, the prognostic significance of KRAS mutation status in lung cancer remains a controversial topic. However, recent biologic findings in KRAS, coupled with the advent of immunotherapy, may lead to the development of effective therapeutic strategies and optimal therapeutic stratification of the KRAS-mutant NSCLC in the near future [68,69]. With regard to  other gene mutations highlighted in this work, such as NF1, ERBB2, STK11, TP53, KEAP1 and SMARCA4, they also play important roles in various pathways associated with the metastasis or overall survival [70][71][72]. However, there is still a lack of effective inhibitors to block their expression.

Early stage (Stage I & II) NSCLC recurrence risk prediction
In the NSCLC dataset, there are 276 records of early stage LUSC and 295 records of early stage LUAD with recurrence information.
In order to build up a predictive model with machine learning methods, some pre-processing work of NSCLC data is necessary. First, patient records with null values or missing values have been removed from the training dataset. Second, a set of class labels have been given to the records of training dataset since all the machine learning methods applied in this work are supervised learning methods. According to the recurrence status, each patient record was classified into one of two groups: High risk and Low risk. Table 3 summarises the number of records under each classification in the training dataset of the recurrence risk for LUSC and LUAD.
To compare the performance of different machine learning methods (CART, FFNNs and LS-SVM), for prediction of early stage NSCLC recurrence risk, the receiver operating characteristic (ROC) curve for each method was generated for LUAD and LUSC respectively ( Fig. 3(a) and (b)). The ROC curve is a common method to demonstrate the diagnostic ability of a binary classifier system by plotting the true positive rate against the false positive rate at various threshold settings. The threshold refers to a boundary between the classes of a classifier system [73]. The diagonal line from the bottom left to the top right in a ROC curve represents ran- Note: P refers to the p-value of ANOVA analysis that indicates the statistical significance of each factor. Sig. refers to the significance level of p-value: 0.01 < p < 0.05 (*), 0.001 < p < 0.01(**), p < 0.001(***). dom guessing. The point in the upper left corner, (0, 1), represents the best possible prediction method with 100% sensitivity and 100% selectivity. To compare the average performance of different classifiers, it is common to calculate the area under the ROC curve (AUC) as an average performance indicator. AUC is a portion of the area of the unit square therefore its value is between 0 (worst performance) and 1 (perfect performance). From the ROC curves, it shows that the decision tree models have the best performance in recurrence prediction for both LUAD and LUSC with AUC values   Fig. 3(c) and (d) revealed the key factors on recurrence risk for LUAD and LUSC accordingly. Furthermore, from top to bottom along the branch to each leaf node of the tree, the ''if-then" rules can be generated to describe and predict whether a patient has a high or low risk of recurrence. For example, in Fig. 3(c), the left branch of the tree indicates that if a LUAD patient is in N0 stage, with adjuvant radiation therapy treatment experience and has CNV in KRAS (either in deletion or amplification), then the recurrence risk is low. From the predicted models, N stage and M stage are two important determinants of recurrence risk for both cancer subtypes, which reinforce the ANOVA analysis results in Section 4.1. As discussed before, the TNM staging indicates the level of disease progression and the malignant potential of the primary lung cancer. However, the TNM staging system may reach the limit of its usefulness in recurrence risk prediction since even patients with disease at the same stage exhibit wide variations in their incidence of recurrence after curative resection [74]. The decision tree models indicated that gender in demographics as well as N stage and M stage in clinical status are the common recurrence risk for both LUAD and LUSC. Adjuvant radiation therapy is more effective in LUAD rather than LUSC. For LUSC recurrence risk, the CNV types in KRAS, TP53 and NRAS play important roles. For LUAD recurrence risk, mutation in NF1, ERBB2 and TP53 and CNV types in EGFR and RB1 are important. The predictive models in this work have the potential to help clinicians to accurately predict the cases in which disease is likely to recur and to make personalised clinical approach schedule and follow-up timeline.

Early stage (Stage I & II) NSCLC survivability prediction
In the NSCLC dataset, there are 242 records of early stage LUSC and 249 records of early stage LUAD with survivability information. The pre-processing work of survivability training dataset is the same as recurrence risk prediction mentioned before. According to the overall survival length after initial resection treatment, each patient record was classified into one of two groups: Good and Poor. The summary of training dataset of the survivability is shown in Table 4.
The ROC curves for the three machine learning methods are shown in Fig. 4 (a) and (b) for LUAD and LUSC respectively. From the ROC curves, it shows that the decision tree model has the best performance in survivability prediction for LUAD with the AUC values as 0.767 while the neural network is slightly better than decision tree in LUSC survivability prediction with the AUC values as 0.837 and 0.815 respectively.
The decision tree models in Fig. 4(c) and (d) revealed the key factors for survivability for LUAD and LUSC accordingly. Similar to the results in Fig. 4, from top to bottom along the branch to each leaf node of the tree, the ''if-then" rules can be generated to describe and predict whether a patient is in good or poor survivability. Although the TNM stage plays an important role in survivability for both cancer subtypes, the decision tree models in this work revealed that M stage in clinical status is the common impactor of survivability of early stage NSCLC. Age and N stage are more important for LUAD rather than LUSC while T stage is more important for LUSC rather than LUAD. For genomics information, the CNV types in HRAS, CDKN2A, RB1 and NRAS play important roles in LUAD survivability while the mutation status in CDKN2A and KRAS as well as CNV types in EGFR are important in LUSC survivability. The potential of the tree models in this work offers support for clinicians to predict the individual survivability and to assist with personalised planning of future social and care needs.

Conclusion
The major contribution of this work is the construction of a more complete portrait of NSCLC patients by integrating genomic, clinical and demographic data when building predictive models using machine-learning methods. By comparing these three methods, CART tree models demonstrate good predictive performance and advantages in understandable tree-like graphs that can generate the rules to predict recurrence and survivability for LUSC and LUAD. The key factors and if-then rules revealed by the tree models can provide clinicians with a better understanding of recurrence risk and overall survivability of early stage NSCLC. The results of this work also have the potential to help clinicians to make personalised decisions on tailored treatment and follow-up plans.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.