Machine Learning for Lung Cancer Diagnosis, Treatment, and Prognosis

The recent development of imaging and sequencing technologies enables systematic advances in the clinical study of lung cancer. Meanwhile, the human mind is limited in effectively handling and fully utilizing the accumulation of such enormous amounts of data. Machine learning-based approaches play a critical role in integrating and analyzing these large and complex datasets, which have extensively characterized lung cancer through the use of different perspectives from these accrued data. In this review, we provide an overview of machine learning-based approaches that strengthen the varying aspects of lung cancer diagnosis and therapy, including early detection, auxiliary diagnosis, prognosis prediction, and immunotherapy practice. Moreover, we highlight the challenges and opportunities for future applications of machine learning in lung cancer.


Introduction
Lung cancer is one of the most frequently diagnosed cancers and the leading cause of cancer deaths worldwide. About 2.20 million new patients are diagnosed with lung cancer each year [1], and 75% of them die within five years of diagnosis [2]. High intra-tumor heterogeneity (ITH) and the complexity of cancer cells give rise to drug resistance, making cancer treatment more challenging [3]. Over the past decades, the continuous evolution of technologies in cancer research has contributed to many large collaborative cancer projects, which have generated numerous clinical, medical imaging, and sequencing databases [4][5][6]. These databases facilitate researchers in investigating comprehensive patterns of lung cancer from diagnosis, treatment, and responses to clinical outcomes [7]. In particular, current studies on -omics analysis, such as genomics, transcriptomics, proteomics, and metabolomics, have expanded our tools and capabilities for research. Cancer studies are undergoing a shift toward the integration of multiple data types at massive scale. However, using diverse and high-dimensional data types for clinical tasks requires significant time and expertise even with assistance from dimension reduction methods such as matrix and tensor factorizations [8][9][10][11], and analyzing the exponentially growing cancer-associated databases poses a major challenge to researchers. Therefore, using machine learning (ML) models to automatically learn the internal characteristics of different data types to assist physicians' decision-making has become increasingly important.
ML is a subgroup of artificial intelligence (AI) that focuses on making predictions by identifying patterns in data using mathematical algorithms [12]. It has served as an assisting tool in cancer phenotyping and therapy for decades [13][14][15][16][17][18][19], and has been widely implemented in advanced approaches for early detection, cancer type classification, signature extraction, tumor microenvironment (TME) deconvolution, prognosis prediction, and drug response evaluation [20][21][22][23][24][25][26][27]. Herein, we present an overview of the main ML algorithms that have been used to integrate complex biomedical data (e.g., imaging or sequencing data) for different aspects of lung cancer (Figure 1; Tables S1 and S2), and outline major challenges and opportunities for future applications of ML in lung cancer clinical research and practice. We hope that this review promotes a better understanding of the roles and potentialities of ML in this field.
Apply ML for early detection and auxiliary diagnosis of lung cancer

ML on early detection and diagnosis using medical imaging datasets

Early diagnosis is an important procedure for reducing deaths related to lung cancer. Chest screening using low-dose computed tomography (CT) is the primary approach for the surveillance of people with increased lung cancer risk. To promote diagnostic efficiency, computer-aided diagnosis (CAD) systems were developed to assist physicians in interpreting medical imaging data [28,29], and have been demonstrated to be a useful second opinion for physicians [30]. The traditional feature-based CAD task can be broken into three steps: nodule segmentation, feature extraction and selection, and clinical judgment inference (classification) (Figure 2). Some approaches use the measured texture features of specified nodules in CT images combined with the patient's clinical variables as input features to train an ML classifier, such as logistic regression (LR) [31][32][33] or linear discriminant analysis (LDA) [34], for malignancy risk estimation. Typically, these measurements include nodule size, type, location, count, and boundary, as well as emphysema information in CT images, and the clinical variables include the patient's age, gender, specimen collection timing, family history of lung cancer, smoking exposure, and more. However, these features are mostly subjective and arbitrarily defined, and usually fail to achieve a complete and quantitative description of malignant nodule appearances.
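A feature-based malignancy classifier of this kind can be sketched in a few lines. The example below trains a logistic regression by plain gradient descent on hand-crafted features (nodule diameter, a spiculated-boundary flag, scaled age, and scaled smoking exposure); the feature choices and toy values are illustrative assumptions, not data from any cited study.

```python
import math

def train_logreg(X, y, lr=0.1, epochs=500):
    """Fit logistic regression with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted malignancy probability
            err = p - yi                     # gradient of the log-loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict_proba(w, b, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Each row: [nodule diameter (cm), spiculated boundary (0/1),
#            age (scaled), smoking pack-years (scaled)] -- toy values.
X = [[0.4, 0, 0.5, 0.1], [0.5, 0, 0.6, 0.2], [0.6, 0, 0.4, 0.0],
     [2.1, 1, 0.7, 0.9], [1.8, 1, 0.8, 0.8], [2.5, 1, 0.6, 1.0]]
y = [0, 0, 0, 1, 1, 1]  # 0 = benign, 1 = malignant

w, b = train_logreg(X, y)
```

In practice such a model would be fit with a regularized solver on hundreds of cases, but the gradient step is the same.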
With the development of deep learning (DL) algorithms, especially convolutional neural networks (CNNs), more studies have applied DL-based models in CAD systems to improve accuracy and reduce the false positive rate and execution time during lung tumor detection (Table 1) [35,36]. Similar to the feature-based CAD system, the workflow of these models usually consists of three steps: nodule detection and segmentation, nodule feature extraction, and clinical judgment inference [37]. Compared with traditional feature-based CAD systems, a DL-based CAD system can automatically retrieve and extract intrinsic features of a suspicious nodule [38,39], and can model the 3D shape of a nodule (Figure 2). For example, Ciompi et al. [40] designed a model based on OverFeat [41,42] by extracting three 2D-view feature vectors (axial, coronal, and sagittal) of the nodule from CT scans. Recently integrated CNN models facilitate a global and comprehensive inspection of nodules for feature characterization from CT images. Buty et al. [37] designed a complementary CNN model, in which a spherical harmonic model [43] for nodule segmentation obtains the shape descriptions ("shape" features) of the segmented nodule and a deep convolutional neural network (DCNN)-based model [41] extracts the texture and intensity features ("appearance" features) of the nodule. The downstream classification relied on the combination of "shape" and "appearance" features. Similarly, Venkadesh et al. [44] used an ensemble of two different models, 2D-ResNet50-based [45] and 3D-Inception-V1 [46], to extract two feature sets of a pulmonary nodule, and then concatenated them as the input features for classification. An advantage of the ensemble CNN model is that it can accurately identify malignant nodules across different nodule sizes using the raw CT images.
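The multi-view idea (extracting axial, coronal, and sagittal planes around a nodule and concatenating the resulting features) can be illustrated with a minimal sketch. The volume layout, patch size, and voxel values below are toy assumptions; a real pipeline would pass each 2D patch through a CNN feature extractor before concatenation.

```python
def extract_three_views(volume, z, y, x, half=1):
    """Return axial, coronal, and sagittal 2D patches of size
    (2*half+1)^2 centered on voxel (z, y, x) of a 3D volume
    indexed as volume[z][y][x]."""
    rng = range(-half, half + 1)
    axial    = [[volume[z][y + dy][x + dx] for dx in rng] for dy in rng]  # fixed z
    coronal  = [[volume[z + dz][y][x + dx] for dx in rng] for dz in rng]  # fixed y
    sagittal = [[volume[z + dz][y + dy][x] for dy in rng] for dz in rng]  # fixed x
    return axial, coronal, sagittal

# Toy 5x5x5 "CT volume" whose voxel value encodes its coordinates (100z+10y+x).
vol = [[[100 * zz + 10 * yy + xx for xx in range(5)]
        for yy in range(5)] for zz in range(5)]
ax, co, sa = extract_three_views(vol, 2, 2, 2)

# A downstream classifier concatenates the flattened per-view features.
features = [v for patch in (ax, co, sa) for row in patch for v in row]
```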
Benefiting from the features extracted by state-of-the-art CNN models, clinical judgment inference can be implemented with common ML techniques, including LR, random forest (RF), support vector machine (SVM), and neural networks (NNs). Notably, some studies also employed CNN models for the final clinical judgment inference. Ardila et al. [47] proposed an end-to-end approach to systematically model both localization and lung cancer risk categorization using the input CT data alone. Their approach combined three CNN models: a Mask-RCNN [48] model for lung tissue segmentation, a modified RetinaNet [49] model for cancer region of interest (ROI) detection, and a full-volume model based on 3D inflated Inception-V1 [50,51] for malignancy risk prediction. In addition to CT images, CNN-based models are also widely used on histological images to help with lung cancer diagnosis. Compared with CT imaging, histological imaging can provide more biological information about cancer at the cellular level. To this end, AbdulJabbar et al. [52] used the Micro-Net [53] model to identify tissue boundaries, followed by an SC-CNN [54] model to segment individual cells from hematoxylin and eosin (H&E)-stained and immunohistochemistry (IHC) images. The segmented cells were then used for cell type classification to evaluate the proportion of each cell type in the images. This model helps to identify the differential evolution and immune evasion mechanisms between lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) at high resolution. Another study [55] utilized the Inception-V3 network [51] to classify whether a tissue was LUAD, LUSC, or normal from H&E-stained histopathology whole-slide images. A highlight of this study is that the model can also predict whether a given tissue has somatic mutations in several lung cancer driver genes, including STK11, EGFR, FAT1, SETBP1, KRAS, and TP53.
Note that, considering the high complexity and large size of these datasets, some studies utilized transfer learning to improve efficiency and robustness when training new models [38,55].
Though these ML algorithms are already widely used in CAD, a challenge is that only a limited number of images are labeled. Training a complex CNN model on a limited training set may result in overfitting. Recently, generative adversarial network (GAN)-based models have been used to improve the performance of discriminative classifiers by generating pseudo images [56]. Chuquicusma et al. [57] first employed a deep convolutional GAN (DCGAN) [58] model to generate synthetic lung nodule CT scans. Following their work, more recent studies have integrated GAN models with other CNN models to address the overfitting problem in lung cancer classification. Lin et al. [59] used a two-step model: a DCGAN to generate synthetic lung cancer images, and an AlexNet [41] to classify lung cancer using both the original and synthetic datasets. Similar work was done by Ren and colleagues [60], who also used DCGAN [58] for data augmentation. To improve performance, they then designed VGG-DF, a regularization-enhanced transfer learning model with pre-trained model auto-selection, for data discrimination while preventing overfitting.

ML on early detection and diagnosis using -omics sequencing datasets
Although periodic medical imaging tests are recommended for high-risk populations, implementation has been complicated by a high false discovery rate [61,62]. Therefore, there is a critical need for new techniques for the early detection of lung cancer. Recent sequencing technologies enable diverse methods for early detection of lung cancer [63]. Meanwhile, accurately classifying lung cancer subtypes is crucial in guiding optimal therapeutic decision-making. LUAD (~45%) and LUSC (~25%) are the two most common subtypes of lung cancer but are often treated similarly except for targeted therapy [64]. However, studies have indicated that LUAD and LUSC have drastically different biological signatures, and have suggested that they should be classified and treated as different cancers [65,66]. From a computational perspective, both early detection and subtype identification are classification tasks. Previous ML studies have demonstrated efficient early detection and cancer type classification on large pan-cancer sequencing datasets [67][68][69][70][71][72][73][74][75], which may provide evidence for lung cancer diagnosis. Cancer cells are characterized by many genetic variations, and the accumulation of these variations can form signatures that document the mutational patterns of different cancer types [3,5,76,77]. For this reason, recent studies have concentrated on extracting better genomic signatures as input features to boost the accuracy of their ML models.

Figure 1  Overview of ML methodologies in lung cancer
We present an overview of ML methodologies for different aspects of lung cancer therapy, including CAD from imaging datasets, lung cancer early detection based on sequencing technologies, data integration and biomarker extraction from multi-omics datasets, treatment response and prognosis prediction, and immunotherapy studies.

For early detection, blood-based liquid biopsy, including
ML, machine learning; IC50, half-maximal inhibitory concentration; HLA, human leukocyte antigen; CT, computed tomography; MALDI, matrix-assisted laser desorption/ionization; DL, deep learning; cfDNA, cell-free DNA; CAD, computer-aided diagnosis; CNV, copy number variation; RECIST, Response Evaluation Criteria in Solid Tumors; TIL, tumor-infiltrating lymphocyte.
Table 1 (recoverable excerpt)  Venkadesh et al. [44]: an ensemble of a 2D-ResNet50-based model [45] and 3D-Inception-V1 [46] trained on 16,429 CT images, taking 3D CT volume features and nodule coordinates as input; AUC 0.86-0.96 under 10-fold cross-validation. It achieves a higher AUC than other benchmarked models, but requires the nodule position to be specified, and many nodules cannot be located on CT images. AbdulJabbar et al. [52]: can annotate cell types at the single-cell level using histological images only, though annotation accuracy is affected by the reference dataset used. Coudray et al. [55]: a multi-task CNN based on the Inception-V3 network [51] trained on 1634 histological images, taking 512 × 512-pixel tiles from non-overlapping patches of whole-slide images as input; AUC 0.733-0.856 under hold-out validation. It can predict whether a given tissue has somatic mutations in STK11, EGFR, FAT1, SETBP1, KRAS, and TP53, though the mutation prediction accuracy is not very high. Compared with hold-out, cross-validation is usually more robust and accounts for more variance between possible splits in training, validation, and test data, but is more time consuming than the simple hold-out method.
cell-free DNA (cfDNA) fragments, circulating tumor DNA (ctDNA), microRNA (miRNA), methylation, exosomes, and circulating tumor cells (CTCs), to explore potential circulating tumor signatures is considered a reliable approach [63] (Figure 3). Integrating these liquid biopsy signatures, many discriminative models (SVM, RF, and LR) have been used to detect tumors with high detection rates [78][79][80][81]. For lung cancer subtype classification, somatic mutations, including single-nucleotide variants (SNVs), insertions, and deletions, usually have cancer type-specific profiles [82]. Thus, studies have leveraged somatic mutations as input features to train classifiers for LUAD-LUSC classification [83]. Many of these mutations, especially driver mutations, can change expression levels, which impacts gene function and interrupts cellular signaling processes [82]. As a result, different cancer types show different expression levels of certain proteins [84,85]. Leveraging these cancer type-specific expression profiles, ML models can use RNA sequencing as input data to categorize the malignancy (benign or malignant) and subtype (LUAD or LUSC) of patients [86][87][88][89]. Similarly, copy number variation (CNV) is reported to be highly correlated with differential gene expression [90], and can be ubiquitously detected in cancer cells. As such, CNVs can also be used to train ML models for cancer type classification in lung cancer studies [81,91,92]. Note that Daemen et al. [92] proposed a recurrent hidden Markov model (HMM) for the identification of extended chromosomal regions of altered copy numbers, which offers high classification accuracy. More recently, Jurmeister et al. [93] used DNA methylation profiles as input features to determine whether a detected malignant nodule is primary lung cancer or a metastasis of another cancer. Directly using all measured genes as input features may result in overfitting [94].
Thus, many studies used different computational approaches to select multiple cancer-associated genes to enhance their ML models (Figure 3). Some studies used ML-based algorithms for feature selection. For example, Liang et al. [80] and Whitney et al. [86] employed the least absolute shrinkage and selection operator (LASSO) method to select the optimal markers for model training; Aliferis et al. [89] utilized recursive feature elimination (RFE) [95] and univariate association filtering (UAF) models to select highly cancer-associated genes. In addition, using unsupervised models to cluster the sample population into subtypes and then identifying each cluster's marker genes is also seen in many studies [96,97]. Apart from ML-based models, some studies used statistical methods for feature selection. Raman et al. [81] designed a copy number profile abnormality (CPA) score to reinforce the CNV feature, which is more robust and less subject to variable sample quality than using CNVs directly as input features. Daemen et al. [92] integrated several statistical tests (ordinary fold changes, ordinary t-statistics, SAM-statistics, and moderated t-statistics) to select a robust set of differentially expressed genes. Aside from these single-measure signatures, some studies [81,86,88] combined the -omics signatures with clinical signatures to achieve better results. Using these tumor type-specific -omics signatures, many algorithms, including K-nearest neighbors (KNN), naive Bayes (NB), SVM, decision tree (DT), LR, RF, LDA, gradient boosting, and NNs, have demonstrated the ability to accurately detect and classify different lung cancer patterns (Table 2). Note that, to improve the accuracy of ML models, Kobayashi et al. [83] added element-wise input scaling to their NN model, which allows the model to maintain its accuracy with a small number of learnable parameters.
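Univariate association filtering can be sketched as ranking genes by a two-sample statistic and keeping the top scorers. The Welch t-statistic, gene names, and expression values below are illustrative assumptions, not data from the cited studies.

```python
import math

def t_stat(a, b):
    """Welch two-sample t-statistic between sample lists a and b."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def select_top_genes(expr_tumor, expr_normal, gene_names, k):
    """Keep the k genes with the largest |t| between tumor and normal."""
    scored = [(abs(t_stat(expr_tumor[g], expr_normal[g])), g) for g in gene_names]
    scored.sort(reverse=True)
    return [g for _, g in scored[:k]]

# Toy expression values (log scale): GENE_A is strongly differential,
# GENE_B is flat, GENE_C is noisy.
tumor  = {"GENE_A": [9.1, 8.8, 9.4], "GENE_B": [5.0, 5.2, 4.9], "GENE_C": [2.0, 7.0, 4.1]}
normal = {"GENE_A": [2.0, 2.3, 1.9], "GENE_B": [5.1, 4.8, 5.0], "GENE_C": [3.9, 4.2, 4.0]}
top = select_top_genes(tumor, normal, ["GENE_A", "GENE_B", "GENE_C"], k=1)
```

Moderated t-statistics and SAM differ mainly in how they shrink the variance estimates; the ranking step is the same.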

Figure 3 Omics analysis in lung cancer studies
Different sequencing techniques allow for the simultaneous measurement of multiple molecular features of a biological sample. To improve efficiency and reduce overfitting, statistical and ML tools perform differential analysis or feature selection. Further ML models concatenate the obtained omics features with clinical features as input for lung cancer diagnostic/prognostic prediction. DEG, differentially expressed gene; RFE, recursive feature elimination; UAF, univariate association filtering.
Apply ML to lung cancer treatment response and survival prediction

Prognosis and therapy response prediction

Sophisticated ML models have acted as supplements for cancer intervention response evaluation and prediction [98,99], and have demonstrated advances in optimizing therapy decisions that improve the chances of successful recovery (Figure 4; Table 3) [100,101]. Several metrics are available for evaluating cancer therapy response, including the Response Evaluation Criteria in Solid Tumors (RECIST) [102]. RECIST relies on imaging data, mainly CT and magnetic resonance imaging (MRI), to determine how tumors grow or shrink in patients [103]. To track tumor volume changes from CT images, Jiang et al. [104] designed an integrated CNN model. Their model used two deep networks based on a full-resolution residual network [105], adding multiple residual streams of varying resolutions so that features at different resolutions could be combined simultaneously for segmenting lung tumors. Using the RECIST criterion, Qureshi [106] set up an RF model to predict the RECIST level under EGFR tyrosine kinase inhibitor (TKI) therapy given the patient's EGFR mutation profile. To improve prediction performance, the model integrated clinical information, geometrical features, and energy features obtained from the patient's EGFR-mutant drug complex as input to train the classifiers. In a recent study, the authors defined a different metric, tumor proportional scoring (TPS), calculated as the percentage of tumor cells in digital pathology images, to evaluate lung cancer treatment response [107]. They applied Otsu thresholding [108] with an auxiliary classifier generative adversarial network (AC-GAN) model to identify positive tumor cell regions (TC+) and negative tumor cell regions (TC−).
They ultimately used the ratio of the pixel count of TC+ regions to the pixel count of all detected tumor cell regions as the TPS value.
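As a minimal sketch of that final scoring step (with a hypothetical toy mask standing in for the segmented regions), TPS reduces to a pixel ratio:

```python
def tumor_proportional_score(mask):
    """mask: 2D grid where 1 marks a positive tumor-cell pixel (TC+),
    -1 a negative tumor-cell pixel (TC-), and 0 non-tumor background.
    TPS = TC+ pixels / all tumor-cell pixels, as a percentage."""
    pos = sum(row.count(1) for row in mask)
    neg = sum(row.count(-1) for row in mask)
    total = pos + neg
    return 100.0 * pos / total if total else 0.0

# Hypothetical 3x3 segmentation output: 3 positive and 3 negative tumor pixels.
mask = [[1, 1, 0],
        [-1, 1, 0],
        [0, -1, -1]]
tps = tumor_proportional_score(mask)
```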
Another study, from Geeleher et al. [109], used the half-maximal inhibitory concentration (IC50) to evaluate drug response. The authors applied a ridge regression model [110] to estimate IC50 values for different cell lines from their whole-genome expression levels. More recently, Quiros et al. [111] established phenotype representation learning (PRL) through self-supervised learning and community detection for spatial clustering and cell type annotation on histopathological images. Their clustering results can be further used for tracking histological tumor growth patterns and identifying tumor recurrence. Their model has also demonstrated good performance in LUAD and LUSC classification.
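The ridge step has a closed form, w = (X^T X + lambda*I)^-1 X^T y, sketched here for just two hypothetical expression features on toy data (a real model would use thousands of genes and a cross-validated lambda):

```python
def ridge_fit(X, y, lam):
    """Closed-form ridge regression for exactly two features:
    solves (X^T X + lam*I) w = X^T y via 2x2 matrix inversion."""
    a00 = sum(x[0] * x[0] for x in X) + lam
    a01 = sum(x[0] * x[1] for x in X)
    a11 = sum(x[1] * x[1] for x in X) + lam
    c0 = sum(x[0] * yi for x, yi in zip(X, y))
    c1 = sum(x[1] * yi for x, yi in zip(X, y))
    det = a00 * a11 - a01 * a01
    w0 = (a11 * c0 - a01 * c1) / det
    w1 = (a00 * c1 - a01 * c0) / det
    return w0, w1

# Toy data generated so that log(IC50) = 2*expr_gene1 - 1*expr_gene2 exactly.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -1.0, 1.0, 3.0]
w0, w1 = ridge_fit(X, y, lam=0.01)
```

With a small penalty the recovered weights sit close to the generating coefficients (2, -1); larger lambda shrinks them toward zero, which is what stabilizes the genome-wide case.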

Survival prediction
Prognosis and survival prediction as a part of clinical oncology is a tough but essential task for physicians, as knowing the survival period can inform treatment decisions and help patients manage costs [112][113][114]. For most of medical history, predictions relied primarily on the physician's knowledge and experience with prior patient histories and medical records. However, studies have indicated that physicians tend to perform poorly in predicting prognosis and survival expectancy, often over-predicting survival time [115][116][117]. Statistical algorithms, such as the Cox proportional-hazards model [118], have been implemented to assist physicians' predictions in many studies [119][120][121][122], but they are not particularly accurate [12]. In comparison, ML has shown its potential to predict a patient's prognosis and survival from genomic, transcriptomic, proteomic, radiomic, and other datasets (Figure 4; Table 3). Chen et al. [123] used 3-year survival as a threshold to split patients into high-risk (survival time < 36 months) and low-risk (survival time > 36 months) groups, and then constructed an NN model to predict a patient's risk group from gene expression data and clinical variables. They tested their model on four microarray gene expression datasets and achieved an overall accuracy of 83.0% with only five identified genes correlated with survival time. Liu et al. [124] also utilized gene expression data for 3-year survival classification. Unlike Chen et al.
[123], the authors integrated three types of sequencing data (RNA sequencing, DNA methylation, and DNA mutation) to select a total of 22 genes to improve their model's stability. Meanwhile, LUADpp [125] and Cho et al. [126] used somatic mutations as input features to model 3-year survival risk classification. To select the genes most significantly associated with mortality, Cho et al. [126] used chi-squared tests, and LUADpp [125] used a published genome-wide rate comparison test [127] able to balance statistical power and precision when comparing gene mutation rates. Due to the complexity of survival prediction, multi-omics tumor data have been integrated for analysis in many studies. Compared with single-omics data, it is more challenging to accurately extract the most significant genes for prediction from multi-omics data. To address this issue, several studies [128][129][130][131] designed a similar workflow. They first constructed a matrix representing the similarity between patients based on their multi-omics data. Using the obtained matrix, they then employed an unsupervised clustering model (usually an autoencoder with K-means clustering) to categorize the patients into two clusters. The two clusters were labeled "high-risk" and "low-risk" according to the different survival outcomes between them in the Kaplan-Meier analysis. Following the survival outcome differences, the genes associated with mortality were extracted using a statistical model [128,129] or an ML model [130,131] for downstream analyses.
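The Kaplan-Meier estimator used to compare such risk groups is itself compact: at each event time, the survival probability is multiplied by (n_at_risk - deaths) / n_at_risk. A minimal sketch with hypothetical follow-up times (in months) and event indicators:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve. times: follow-up in months;
    events: 1 = death observed, 0 = censored.
    Returns [(time, S(t))] at each observed event time."""
    data = sorted(zip(times, events))
    s = 1.0
    at_risk = len(data)
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, ee in data if tt == t and ee == 1)
        if deaths:
            s *= (at_risk - deaths) / at_risk
            curve.append((t, s))
        ties = sum(1 for tt, _ in data if tt == t)
        at_risk -= ties   # remove deaths and same-time censored subjects
        i += ties
    return curve

# Hypothetical high-risk group: four deaths and two censored subjects.
curve = kaplan_meier([5, 10, 10, 20, 30, 40], [1, 1, 0, 1, 1, 0])
```

Group-level separation of two such curves is what the log-rank test (and the cluster labeling described above) formalizes.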
Apply ML to lung cancer immunotherapy

Immunotherapy response prediction

Immunotherapy has become increasingly important in recent years. It enables a patient's own immune system to fight cancer, in most cases by stimulating T cells. To date, distinct novel immunotherapy treatments have been tested for lung cancer, and several have become standard components of care. Immune checkpoint inhibitors (ICIs), especially programmed cell death protein 1 (PD-1)/programmed cell death protein ligand 1 (PD-L1) blockade therapy [132], have been demonstrated to be valuable in the treatment of patients with non-small cell lung cancer (NSCLC) [133,134]. However, immunotherapy is not yet as widely used as surgery, chemotherapy, or radiation therapy. One interpretation is that it does not work for all patients, owing to the uniqueness of each patient's tumor immune microenvironment (TIME). Therefore, estimating whether a patient will respond to immunotherapy is important for cancer treatment. Recently, AI-based technologies have been developed to predict immunotherapy responses based on immune sequencing signatures and medical imaging signatures (Figure 4; Table 3) [135].
To predict the response to PD-1/PD-L1 blockade therapy, Wiesweg et al. [136] utilized the expression profiles of 7 significant genes extracted by ML models, plus 25 cell type-specific genes, as input features to train an SVM classifier for RECIST classification. Aside from sequencing data, features from CT scans can also be used to assess a patient's RECIST level. Two recent studies [137,138] used radiomic biomarkers as well as other imaging features of tumor lesions from contrast-enhanced computed tomography (CE-CT) scans to train classifiers, including LR and RF, for RECIST classification.

Tumor-infiltrating lymphocyte evaluation
The proportion of tumor-infiltrating lymphocytes (TILs) is another important metric for immunotherapy response evaluation. To this end, using transcriptomics data, DeepTIL [139] optimized the cell deconvolution model CIBERSORT [140] to automatically compute the abundance of leucocyte subsets (B cells, CD4+ T cells, CD8+ T cells, γδ T cells, Mo-Ma-DC cells, and granulocytes) within a tumor sample. A different approach [141] utilized a total of 84 radiomic features from CE-CT scans, along with RNA sequencing of 20,530 genes, as biomarkers to train a linear elastic-net regression model to predict the abundance of CD8+ T cells. Another study [142] created a DL model to identify TILs in digitized H&E-stained images (Table 3). The methodology consisted of two CNN modules to evaluate TILs at different scales: a lymphocyte infiltration classification CNN ("lymphocyte CNN") and a necrosis segmentation CNN ("necrosis CNN"). The "lymphocyte CNN" categorizes the input image into regions with and without lymphocyte infiltration. It consists of two steps: a convolutional autoencoder (CAE) [143] for feature extraction, followed by a VGG 16-layer network [144] for TIL region classification. The "necrosis CNN" detects TILs within necrosis regions; the authors used the DeconvNet [145] model for TIL segmentation, as it has been shown to achieve high accuracy on several benchmark imaging datasets.
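Reference-based deconvolution of this kind reduces to solving bulk = S · f for cell-type fractions f given a signature matrix S. Below is a minimal sketch for two cell types and two marker genes; the marker choices and signature values are invented for illustration, and CIBERSORT itself uses nu-support-vector regression over hundreds of signature genes.

```python
def deconvolve_two_cell_types(signature, bulk):
    """Solve bulk = signature @ fractions for two cell types given two
    marker genes (an exactly determined 2x2 linear system), then clip
    and renormalize so fractions are non-negative and sum to 1."""
    (a, b), (c, d) = signature          # rows: genes; columns: cell types
    g1, g2 = bulk
    det = a * d - b * c
    f1 = (d * g1 - b * g2) / det
    f2 = (a * g2 - c * g1) / det
    f1, f2 = max(f1, 0.0), max(f2, 0.0)
    total = (f1 + f2) or 1.0
    return f1 / total, f2 / total

# Hypothetical signature: marker expression in pure CD8+ T cells vs B cells.
signature = [[10.0, 1.0],   # CD8A-like marker: high in CD8+ T cells
             [1.0, 8.0]]    # MS4A1-like marker: high in B cells
bulk = [7.3, 3.1]           # mixed tumor sample (70% / 30% mixture by design)
cd8_frac, b_frac = deconvolve_two_cell_types(signature, bulk)
```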

Neoantigen prediction
In addition to immunotherapy response prediction, ML algorithms have shed light on neoantigen prediction for immunotherapy. Neoantigens are tumor-specific mutated peptides generated by somatic mutations in tumor cells, which can induce antitumor immune responses [146][147][148]. Recent work has demonstrated that immunogenic neoantigens benefit the development and optimization of neoantigen-targeted immune therapies [149][150][151][152]. In accordance with neoantigen studies in clinical trials, state-of-the-art ML approaches have been implemented to identify neoantigens based on human leukocyte antigen (HLA) class I and II processing and presentation [153][154][155][156][157]. Using the identified somatic mutations, ML models can estimate the binding affinity of the encoded mutated peptides to the patient's HLA alleles (peptide-HLA binding affinity). Neoantigens can then be predicted based on the estimated peptide-HLA binding affinity. NetMHC [158,159], for example, is among the most widely used tools for peptide-HLA binding affinity prediction [146,163,164,165].
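The core idea of affinity-based ranking can be sketched with a toy position weight matrix (PWM) over 3-mers; real predictors such as NetMHC use trained NNs over 8-11-mer peptides, and the motif weights and candidate peptides below are invented for illustration.

```python
def pwm_score(peptide, pwm):
    """Log-odds score of a peptide under a position weight matrix:
    the sum over positions of the weight for the residue at that position.
    Residues absent from the motif receive a default penalty."""
    return sum(pwm[i].get(aa, -1.0) for i, aa in enumerate(peptide))

def rank_neoantigen_candidates(peptides, pwm, top=2):
    """Return the candidate peptides with the highest predicted binding."""
    return sorted(peptides, key=lambda p: pwm_score(p, pwm), reverse=True)[:top]

# Toy 3-mer "HLA motif": prefers L at position 1 and V at position 3.
pwm = [{"L": 2.0, "A": 0.5},
       {"A": 0.2, "G": 0.1},
       {"V": 2.5, "L": 0.5}]
candidates = ["LAV", "GAG", "AAV", "LGL"]   # hypothetical mutated peptides
ranked = rank_neoantigen_candidates(candidates, pwm, top=2)
```

In a real pipeline, the top-ranked binders per patient HLA allele would be carried forward for immunogenicity screening.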

Challenges and future perspectives
Despite the widespread use of ML in lung cancer clinical practice and research, there are still challenges to be addressed. Here, we present some examples of recent ML algorithms, especially the increasingly popular and important DL algorithms of the past decade, that can inform lung cancer therapy analyses, and we discuss the challenges for future lung cancer studies.

Imaging data analysis
Learning how to effectively extract nuance from imaging data is critical for clinical use. In earlier ML-based CAD systems, feature extraction was typically based on the image intensity, shape, and texture of a suspicious region, along with other clinical variables [166]. However, these features are arbitrarily defined and may not capture the intrinsic characteristics of a suspicious nodule. To this end, DL-based CAD systems were developed that leverage CNN models to extract features directly from raw imaging data with multilevel representations and hierarchical abstraction [167][168][169]. Contrary to previous methods, features from a CNN model are not designed by humans, and reflect the intrinsic features of the nodule in an objective and comprehensive manner. Recently, the Vision Transformer (ViT) has emerged as the state-of-the-art in computer vision [170,171]. In comparison to CNNs, ViT achieved almost 4× better computational efficiency and accuracy, and was more robust when trained on smaller datasets [172]. Although, to our knowledge, ViT models have not yet been implemented in lung cancer imaging studies, they have shown potential as a competitive alternative to CNNs in imaging data analysis.
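ViT's key preprocessing step, splitting an image into fixed-size patches that become a sequence of tokens, can be sketched as follows; the 4 × 4 toy image and 2 × 2 patch size are illustrative, and a real ViT then linearly projects each token and adds position embeddings before the transformer encoder.

```python
def image_to_patch_tokens(img, patch):
    """Split a 2D image (list of rows) into non-overlapping patch x patch
    tiles, flattening each tile into one token vector, ViT-style.
    Assumes image dimensions are divisible by the patch size."""
    h, w = len(img), len(img[0])
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([img[top + r][left + c]
                           for r in range(patch) for c in range(patch)])
    return tokens

# 4x4 toy image whose pixel value encodes its raster position.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patch_tokens(img, patch=2)
```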

Omics dataset analysis
DL is a subfield of ML that uses programmable NNs to make accurate decisions. It particularly shines on complex problems such as image classification. Above, we reviewed the utility of DL models on imaging datasets. Compared with imaging datasets, DL algorithms have been applied less frequently to lung cancer clinical studies using omics data. However, DL models have been extensively applied in other fields of omics analysis. For example, genomic data are continuous sequences, so recurrent neural network (RNN) models [173] and CNN models [174] are good tools for population genetics analysis. Moreover, because the input dimension of omics data is usually very high, many studies have used autoencoders or deep generative models for feature extraction and dimensionality reduction to improve efficiency and reduce overfitting [175]. In the meantime, self-supervised representation learning models can overcome the curse of dimensionality and integrate multi-omics data to combine information about different aspects of the same tissue samples [176]. Accompanying the development of single-cell-based [177] and spatial-based [178] technologies in molecular studies, numerous DL models are becoming more popular for computationally intensive analysis. To deal with the complexity of large genomics data, unsupervised deep clustering tools have been built for population structure identification [179] or cell population subtype annotation [180][181][182][183]. In addition, to process the complex structure of multi-omics data, graph neural network (GNN) models are increasingly popular for dataset integration [184], biomedical classification [185], prognosis prediction [186], and more. Though these studies have not been directly applied to lung cancer clinical analysis, they are a good inspiration for using DL tools to address complex lung cancer omics datasets.

Multi-view data and multi-database integration
It is now common to have access to large amounts of imaging data, multi-omics data, and clinical records from a single patient. Integrating these data provides comprehensive insight into the molecular mechanisms of lung cancer. However, these data types are typically obtained from different platforms, so platform-specific noise inevitably exists between them. For example, imaging data analysis, especially radiomics, usually comes with the challenges of complicated data normalization, fusion, and integration. To address this limitation, multimodality medical segmentation networks have been developed to jointly process multimodality medical images [187]. Similarly, for sequencing data, batch noise also exists between different databases (i.e., the batch effect).
Removing batch effects and integrating datasets from multiple platforms into a single framework that permits further analysis of the mechanisms of cancer drug resistance and recurrence is important for cancer therapy. Though biomedical studies have developed and/or benchmarked integrative tools [188][189][190][191], these efforts are not yet comprehensive and discriminating enough to guide the choice of tools in the context of a given biological question.
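As a deliberately simplified sketch of what batch-effect removal does, gene-wise centering per batch removes additive batch shifts (a toy version of the location adjustment performed by dedicated tools such as ComBat; the data and names below are synthetic):

```python
import numpy as np

def center_batches(X, batch):
    """Remove per-batch mean shifts gene-wise: a toy, simplified analogue of
    the location adjustment in batch-correction tools such as ComBat."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    Xc = X.copy()
    for b in np.unique(batch):
        idx = batch == b
        Xc[idx] -= X[idx].mean(axis=0)   # subtract this batch's gene means
    return Xc

# Two synthetic batches of 5 samples x 4 genes with opposite additive offsets.
rng = np.random.default_rng(1)
base = rng.normal(size=(10, 4))
X = np.vstack([base[:5] + 3.0, base[5:] - 2.0])
batch = np.array([0] * 5 + [1] * 5)
Xc = center_batches(X, batch)
```

Real correction methods additionally model scale differences and preserve known biological covariates, which simple centering cannot do.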

Model generalizability and robustness
Throughout this review, we find that the performance of an ML algorithm usually varies across datasets. One explanation might be the database batch effect discussed earlier; however, a lack of generalizability and robustness might be additional factors that hinder these ML models in clinical studies. In addition, to reduce overfitting, most studies used either statistical or ML models to select marker genes before classification. However, these marker genes often differ substantially between studies, indicating that the identified marker genes lack generalizability and biological interpretability. To improve the generalizability and robustness of a model, it is important to develop a better understanding of robustness issues across different ML architectures and to bridge the gap in robustness techniques among different domains. For example, recent studies in lung cancer imaging analysis have applied transfer learning, starting from a pre-trained model when training on their own datasets [38,55,192], and have thereby improved the efficiency and robustness of their CNN-based models. For sequencing datasets, transfer learning has also been used in deep NNs to provide a generalizable approach [193], which could serve as a good example for building a general and robust model for lung cancer sequencing data analysis. In addition, DL is a complex black-box model; understanding the mechanisms of a DL system in clinical studies could help to build a standardized and unified DL framework with improved performance and robustness. Explainable AI (XAI) methods provide tools for both model-specific and model-agnostic analysis [194,195]. These methods can explain a model at the local and global levels, which further helps researchers fine-tune the hyperparameters of different models efficiently [196,197].
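One widely used model-agnostic XAI technique, permutation feature importance, can be sketched in a few lines: shuffle one feature at a time and measure how much the error grows. The black-box model and data below are toy illustrations of the general idea, not any specific cited method:

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Model-agnostic explanation: the increase in error when a single
    feature is shuffled, breaking its link to the outcome."""
    rng = np.random.default_rng(seed)
    base = np.mean((predict(X) - y) ** 2)            # baseline MSE
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        errs = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])     # shuffle feature j only
            errs.append(np.mean((predict(Xp) - y) ** 2))
        importances[j] = np.mean(errs) - base        # error increase
    return importances

# Toy black-box model whose output depends only on feature 0, so only
# feature 0 should receive a nonzero importance score.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0]
imp = permutation_importance(lambda A: 2.0 * A[:, 0], X, y)
```

Because the method only queries `predict`, it applies equally to a deep NN, a random forest, or any other black box, which is precisely what makes it model-agnostic.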

Metrics for performance evaluation
Studies usually focus on the development of algorithms for clinical tasks. However, the selection of metrics for assessing these algorithms is often neglected, even though it plays an important role in ML systems [198]. Based on this review (Tables 1-3), accuracy and area under the curve (AUC) are the two most conventional metrics, yet they do not always reflect clinical needs and should be translated into clinically interpretable metrics. Compared with accuracy, sensitivity or specificity may be more relevant to clinical needs in certain circumstances, for example, for patients at high risk of emergency department visits [199].
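The distinction can be made concrete with a toy screening example in which accuracy stays high while sensitivity, often the clinically critical metric, drops sharply (illustrative code; the data are ours):

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix-derived metrics for a binary classifier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # fraction of diseased cases caught
        "specificity": tn / (tn + fp),   # fraction of healthy cases cleared
    }

# Imbalanced toy screen: 2 diseased among 10 patients. Missing one case
# barely dents accuracy but halves sensitivity.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
m = binary_metrics(y_true, y_pred)
```

Here accuracy is 0.9 while sensitivity is only 0.5, which is exactly why reporting accuracy alone can mask a clinically unacceptable miss rate.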

Clinical decision-making
A recent study estimated that the overall cost of lung cancer therapy exceeds $50,000 [200] for most patients, a heavy burden for most families. Thus, accurate prognosis prediction and decision-making will pave the way for personalized treatment. Recent DL models have been used to predict the effectiveness of a therapy/drug and to optimize combinations of therapies/drugs [201,202]. However, most existing DL models for clinical decision-making have difficulty keeping up with evolving knowledge and/or dynamically changing health care data [203]. Clinical decision support systems, including IBM Watson Health and Google DeepMind Health, have been implemented in lung cancer treatment in recent years [204,205]. Although these systems have improved the efficiency of clinical work, they remain far from perfect in clinical trials and cannot yet replace physicians [205].

Conclusion
AI grants us a different perspective on lung cancer research and allows for the exploration of decision support tools to facilitate precision oncology. In this review, we surveyed the current advances of ML algorithms in various areas of lung cancer therapy, including early detection, diagnostic decision support, prognosis prediction, drug response evaluation, and immunotherapy practice. To aid future ML development for lung cancer therapies, we thoroughly summarized the datasets (Table S1), baseline methods (Table S2), and characteristics of the methods (Tables 1-3). Lastly, we highlighted the current challenges that need to be addressed, including the limited quantity and quality of labeled medical data for training, the importance of model robustness and biological interpretability for clinical use, concerns about the metrics used for performance evaluation, and the need for data integration and batch-effect removal. As this review indicates, future lung cancer therapies will draw on both imaging and omics data, so an ML clinical decision-making tool should be a multi-modal system that handles both data types and their integration. Finally, we expect that these challenges will motivate further studies focused on lung cancer therapies.
CRediT author statement