The construction of transcriptional risk scores for breast cancer based on lightGBM and multiple omics data

: Background: Polygenic risk score (PRS) can evaluate the individual-level genetic risk of breast cancer. However, standalone single nucleotide polymorphisms (SNP) data used for PRS may not provide satisfactory prediction accuracy. Additionally, current PRS models based on linear regression have insufficient power to leverage non-linear effects from thousands of associated SNPs. Here, we proposed a transcriptional risk score (TRS) based on multiple omics data to estimate the risk of breast cancer. Methods: The multiple omics data and clinical data of breast invasive carcinoma (BRCA) were collected from the cancer genome atlas (TCGA) and the gene expression omnibus (GEO). First, we developed a novel TRS model for BRCA utilizing single omic data and LightGBM algorithm. Subsequently, we built a combination model of TRS derived from each omic data to further improve the prediction accuracy. Finally, we performed association analysis and prognosis prediction to evaluate the utility of the TRS generated by our method. Results: The proposed TRS model achieved better predictive performance than the linear models and other ML methods in single omic dataset. An independent validation dataset also verified the effectiveness of our model. Moreover, the combination of the TRS can prediction of


Introduction
Breast cancer is the most frequently diagnosed cancer in women worldwide [1]. In 2020, there were over 2 million new cases reported [2]. Although morbidity and mortality have declined in recent years due to changes in risk factors and improvements in screening tests and treatments, breast cancer survival rates vary widely across countries and regions. The establishment of effective prevention and treatment measures is essential to prevent breast cancer occurrence and reduce breast cancer mortality. Although carriers of BRCA1 and BRCA2 gene mutations confer a high risk of breast cancer, these gene mutations can be found in only a small part of breast cancer patients [3]. In recent years, genomewide association study (GWAS) identified multiple high frequency and low penetrance susceptibility variants of breast cancer [4]. The accumulation effects of these susceptibility variants can be summarized as a polygenic risk score (PRS). Researchers have developed several PRS models for breast cancer by using a large amount of single nucleotide polymorphisms (SNPs) data [5]. These studies maintained the PRS to be an effective and reliable predictor of breast cancer risk that may provide screening and prevention strategies [6][7][8].
However, the PRS calculated using SNPs data can only assess the genetic risk of an individual, while ignoring the influence of the external environmental exposure on gene expression. With the development of high-throughput omics technology, a large number of related studies based on genomics and transcriptomics emerged [9,10]. Omics data have been widely used for cancer classification based on identified gene signatures [11], gene pathways [12], and protein-protein interaction networks, etc. [13,14]. For example, Zhang Y et al. [15] proposed a novel approach to predict prognosis in glioblastoma multiforme (GBM) by integrating histopathological images and multi-omics data. Zhang X et al. [16] used XGBoost to upgrade a previously developed cancer-related lncRNA classifier to improve the prediction accuracy of lncRNA-cancer associations, which was expected to contribute to the functional annotation of lncRNAs and guide cancer treatment. Tong D et al. [17] collected the clinical, DNA methylation and miRNA expression data of colon cancer from TCGA and proposed a predictive model based on the integration of clinical data and multi-omics data, and the results showed better predictive outcomes. These high-throughput molecular markers can dynamically reflect the comprehensive effects of genetic background, environmental exposure and lifestyle habits individually [18,19]. The analyses of multiple omics data may lead to new insights into diagnosis and prognosis of breast cancer [20]. In addition, in the standard approach of PRS, the effect sizes of the genetic variants are usually estimated in linear statistical models [21]. However, linear statistical model has some limitations and only be applied when specific requirements are satisfied [22]. Advanced machine learning (ML) models [23,24] such as LightGBM can account for non-linear relationships among large-scale variables and have an increasing trend on the applications for breast cancer research. Using these ML models may further improve the prediction accuracy of breast cancer.
Here, we used multiple omics data and LightGBM model to construct a novel transcriptional risk score (TRS) for breast cancer. The results illustrated that our proposed method outperforms traditional linear models and other ML models and can effectively predict individual risk of breast cancer.

Data collection
The datasets in this study were downloaded from the cancer genome atlas (TCGA) project. Now, all TCGA data are accessible without limitations in publications or presentations according to the posted announcement from the TCGA website [25]. We collected four kinds of omics datasets on breast invasive carcinoma (BRCA), including DNA methylation data (Illumina Infinium Human DNA Methylation 450 K; Level 3) measured from 782 tumor tissues and 96 normal tissues (Paracancerous tissue), miRNA-seq data (IlluminaHiSeq_miRNASeq; Level 3) measured from 1078 tumor tissues and 104 normal tissues, mRNA expression (Illumina mRNA-seq; Level 3) measured from 1102 tumor tissues and 113 normal tissues, lncRNA expression (Illumina lncRNA-seq; Level 3) measured from 1102 tumor tissues and 113 normal tissues. We also collected the stage of tumor for the BRCA patients, including stages I-IV. According to the literature [26], the annotation of stages I and II were labelled as early-stage, stages III and IV as late-stage. For BRCA patients, most of the individuals are white, and a small number of individuals are black or African American and Asian. The ages of volunteers used in our study range from 26 to 90. Tables 1 and 2 show the sample sizes and clinical data of patients in BRCA datasets, respectively. An independent dataset (GSE66695) from the gene expression omnibus (GEO) measured by the Illumina Infinium 450 k Human DNA methylation Beadchip was used to validate the predictive performance of the proposed method. This dataset includes 80 BRCA tumor tissues and 40 normal tissues, and all volunteers are from Detroit, USA.

Data pre-processing
For DNA methylation, we retained the CpG sites that most negatively correlated with gene expression according to Firehose [27] and removed CpG sites with missing value to ensure the quality of the datasets [28]. For miRNA, mRNA, and lncRNA, two steps were performed to deal with the missing values in the datasets [29]. First, the probes were excluded if there is missing value in more than 20% of samples. Second, all data were normalized by Min-Max scaling to map the range from 0 to 1. For convenience, CpG sites of DNA methylation and probes of miRNA, mRNA and lncRNA are collectively referred to as biological variables. Table 1 also shows the summary of biological variables.

Overview of TRS model
According to the different phenotypes, we proposed to utilize multiple omics data and breast cancer status to construct two kinds of TRS models. The first phenotype only contains the normal samples (control) and tumor samples (case), which were labelled 0 and 1, respectively. The second phenotype contains the normal samples, early-stage and late-stage tumor samples, which were labelled 0, 1 and 2, respectively. We defined the above-mentioned two TRS models as TRS for case-control status and TRS for cancer stage status. The TRS can evaluate the individual risk of breast cancer and may improve the diagnosis of breast cancer. Moreover, since recent studies found the stage of cancer is highly associated with the prognosis [30], accurate construction of TRS for cancer stage status may facilitate the prediction of breast cancer prognosis. The framework of this study is shown in Figure 1. We provided an executable python program, which is available for downloading from the GitHub website (https://github.com/lab319/TRS_BRCA_omics_LightGBM). The dataset of BRCA was split into two groups as training dataset and testing dataset based on five-fold cross validation. We constructed TRS model by using MCP, LASSO, elastic net, SVM and LightGBM based on training dataset. The hyper-parameters of five models were optimized by using bayesian optimization and three-fold cross validation. The TRS of testing dataset was predicted by optimized model. The predictive performance of final models was evaluated with R 2 .

TRS based on LightGBM
LightGBM is an ensemble model of classification and regression trees (CART) [31], in which each step generates a basic CART model and adds it to the overall model. The TRS models based on LightGBM were built using a training dataset to predict TRS in a testing dataset. We defined each omic Let be the prediction of .
where * i X represents a matrix containing 2 n samples and m biological variables, * i y denotes the TRS. We used T additive CART models to predict the TRS in the training dataset.
where ( ) t i f X corresponds to an independent CART model and F is the space of CART models.
To learn the set of CART used in the TRS model, we minimize the following objective function.
l y y is a differentiable convex loss function that measures the difference between the prediction ˆi y and true phenotype i y . The K and k w respectively represent the number and value of leaf nodes in each CART model,  and  are constant coefficients. In general setting, the second-order approximation can be utilized to quickly optimize the objective function.
where i g and i h are the first and second-order gradient statistics of the loss function.
was defined as the instance set of leaf nodes. LightGBM used two techniques including gradient-based one-side sampling and exclusive feature bundling to estimate the information gain in a high speed [23]. The structure and value of each CART model can be determined by the information gain. Thus, we generated the TRS model consisting of T additive CART models. For the samples in a testing dataset, TRS * i y can be calculated by applying * i X to the TRS model.

TRS based on linear model and other ML model
To evaluate the predictive performance of LightGBM objectively, we applied the linear model and other ML models to construct TRS. The traditional linear model contains minimax concave penalty (MCP) [32], least absolute shrinkage and selection operator (LASSO) [33] and elastic net [34]. The ML model contains support vector regression (SVR) [35]. Here, we compared the TRS methods that only utilize omics data, without considering the methods that use GWAS summary statistics, such as LDpred [36], Lassosum [37] and so on. Similar to the TRS method based on LightGBM model, we used each omic dataset as the input of these models, and the corresponding phenotypes as the output.

Model training and evaluation
To ensure the robustness and stability of the model, we trained and evaluated the proposed TRS model by five-fold cross validation. The five-fold cross-validation method used in this study is shown in Figure 2. This procedure divided each omic dataset into five subsets. In each fold, one of the five subsets was used as the testing dataset and the other four subsets were put together to form a training dataset. We applied bayesian optimization and 3-fold inner cross validation to optimize the hyper-parameters of the TRS model in each training dataset. Specifically, for LASSO, we optimized the parameter "alpha". For MCP, we adjusted regularization parameter "lambda". For elastic net, the parameter "alpha" and "l1_ratio" were optimized. For SVR, we choose "rbf kernel" and optimized the regularization parameter "C". For LightGBM, the optimized parameters were "num_leaves", "n_estimators", "learning_rate", "max_depth", "max_bin", "min_split_gain", "subsample", "subsample_freq", "colsample_bytree", "min_child_sample", "min_child_weight", "reg_alpha", "reg_lambda". Finally, we obtained the TRS of each testing dataset which was predicted by the model with the optimized parameters. Each TRS was standardized based on its mean and standard deviation. The predictive performance of TRS model was evaluated by square of the Pearson correlation coefficient (R 2 ).
where ( , ) Cov Y Y represents the covariation of true phenotype and predicted TRS, ( ) Var Y is the variance of true phenotype, and ( ) Var Y is the variance of predicted TRS. In addition, for case-control status, we can also evaluate the predictive performance by the area under the receiver operating characteristic curve (AUC). The procedure divided the dataset into five folds of approximate equal size. Each fold was used as a test set separately, and the other four folds were utilized as the training set. The performance of the prediction algorithm was estimated by averaging the accuracy on five test sets.

Combination model of TRS
To further improve the predictive performance of TRS, we utilized the TRS based on each omic dataset to construct a new combination model [38,39]. We first matched a common dataset from the four kinds of omics datasets for BRCA. In the common dataset of case-control status, there are 786 tumor samples and 75 normal samples. In the common dataset of cancer stage status, there are 553 early-stage samples, 205 late-stage samples and 75 normal samples. Next, we used the TRS based on four kinds of omics datasets as new feature variables for the combination model. Then, we built the combination model using the LightGBM model. Bayesian optimization was applied to adjust hyperparameters and five-fold cross validation was used to evaluate the overall predictive performance. To evaluate the performance of combination model of TRS, we standardized the TRS based on its mean and standard deviation. The framework of the combination model is shown in Figure 3. We utilized proposed TRS model based on each omic data (DNA methylation, miRNA, mRNA and lncRNA) as new feature variables. The combination model was constructed by LightGBM model. The hyper-parameters of combination model were optimized by using bayesian optimization and 3fold cross validation. The predictive performance was evaluated by five-fold cross validation.

Predictive performance of TRS based on multiple omics data
We first compared our prediction model to the linear methods and other ML methods for casecontrol status. Figure 4.a shows the results of these methods on four kinds of omics datasets from TCGA. We observed that elastic net achieves the best performance in traditional linear models. The R 2 of SVR is 3.3, 7.7 and 0.5% higher than elastic net on DNA methylation, miRNA and lncRNA datasets and 3.1% lower than elastic net on mRNA dataset. The R 2 of our proposed model improved by 8.3, 14.8, 5.1 and 7.2% than elastic net on four kinds of omics datasets. Overall, our model outperformed other models and mRNA data exhibited better performance than other omics data.
Next, we applied our proposed model and other linear and ML methods for cancer stage status. Compared with the case-control status, this phenotype contains normal and two stage statuses of breast cancer. Thus, the predictive performance of TRS for cancer stage status is not as good. Nevertheless, the present results are consistent with the TRS for case-control status. According to the comparison results of our proposed model with other methods (Figure 4b), the LightGBM model performs the best predictive performance, outscoring other methods on four kinds of omics datasets from TCGA. The TRS based on LightGBM obtains the R 2 of 0.405, 0.371, 0.437 and 0.407, respectively. Compared with the elastic net with the highest prediction accuracy in the linear models, the R 2 of LightGBM improved by 12.8, 20.9, 9.8 and 12.4%, respectively. Compared with SVR, the R 2 of our proposed model improved by 10.4, 14.4, 10.6 and 3.3%, respectively. Moreover, the results showed mRNA data obtained better results than other omics data. Table 3 shows more detailed results about predictive performance of TRS based on multiple omics data.

Predictive performance of TRS in independent dataset
To further validate the predictive performance of the proposed TRS model, we utilized the GSE66695 as an independent validation dataset. We first applied the LightGBM algorithm to construct the TRS model for case-control status using DNA methylation dataset from TCGA. The hyper-parameters of LightGBM were optimized by bayesian optimization of 3-fold cross validation. Next, we predicted TRS of each sample of GSE66695 dataset and obtained the R 2 of 0.887. Compared to the BRCA dataset from TCGA, although the predictive performance of the independent validation dataset has slightly decreased, the proposed TRS model can still achieve satisfactory results.

Predictive performance of TRS based on combination model
In this part, we evaluated the performance of combination model of TRS. Figure 4c shows the results of TRS models based on four types of omics datasets and combination model in the common samples. For four kinds of omics datasets from TCGA, although the prediction accuracy in the common samples has decreased, we found that TRS based on mRNA still obtained the best prediction accuracy. For case-control and cancer stage status, the R 2 of combination model were 0.932 and 0.397, respectively. Compared with the TRS model based on mRNA dataset, the R 2 of combination model improved by 5.1 and 2.8%, respectively. Thus, the combination of four types of molecular data can achieve better results of TRS for case-control and cancer stage status. Table 4 shows more detailed results about predictive performance of TRS based on combination model.  The prevalence strata plot of increasing TRS derived by GSE66695. The 1st to 3th stratum can be regarded as a low-risk TRS stratum and the 5th to 10th stratum as a high-risk stratum.

Prevalence of breast cancer
Exploring the prevalence of different TRS strata has a positive impact on the prevention and treatment of breast cancer [40]. The main goal of this part is to analyze the risk stratification of casecontrol status. Thus, we divided the common samples into 10 strata of increasing TRS from the combination model and calculated the prevalence of each stratum (Figure 5a). We defined the prevalence as the proportion of breast cancer patients in each stratum.
Across the common samples, we observed that the prevalence is about 10% in the first stratum then upgrades to 100% in the second stratum and remains steady afterward. The prevalence changes significantly at one stratum because our proposed method achieved relatively accurate prediction of breast cancer risk for case-control status. The trend plot of the prevalence also indicated that the individuals with high-TRS strata have greater breast cancer risk than the individuals with lower-TRS strata. In addition, we built the LightGBM model based on the DNA methylation dataset of TCGA, and then calculated the TRS using the GSE66695 dataset. We subsequently plotted the prevalence curve (Figure 5b) and obtained a similar result with the TCGA dataset.

Associations between TRS and breast cancer risk
We investigated the relationship of TRS with different phenotypes of breast cancer in this section (Table 5). For case-control status, the association of TRS was evaluated in predicted results from the combination model by logistic regression. The TRS generated by combination model and phenotype (case-control status) of breast cancer are used as the variable (x) and outcome (y) of the logistic regression model, respectively. We observed that TRS was associated with occurrence risk of breast cancer (odds ratio (OR) = 18.48; 95% confidence interval (CI): 9.60-35.55; P = 2.46  10 -18 ), suggesting that per one standard deviation increase in TRS is associated with risk increase of breast cancer. In addition, we calculated the OR value by the association of TRS and outcome of breast cancer using the GSE66695 dataset (odds ratio (OR) = 58.19; 95%CI: 12.78-264.90; P = 1.48  10 -7 ), and further verified the above conclusion. For cancer stage status, we performed a multinomial logistic regression model to evaluate the association of TRS and set the normal sample as the reference group. The TRS was associated with early-stage breast cancer risk (OR = 21.05; 95%CI: 10.26-43.19; P = 9.63  10 -17 ) and late-stage breast cancer risk (OR = 46.62; 95%CI: 19.72-110.25; P = 2.14  10 -18 ). The results indicated higher TRS is associated with a significantly increased risk for early-stage and late-stage breast cancer.

Prognosis prediction of breast cancer
We explored whether the TRS for cancer stage status can effectively assess the prognosis of patients. According to the predicted results of tumor samples using combination model, we firstly divided 758 patients of breast cancer into high-risk and low-risk groups based on the 50th percentile of TRS. Next, we utilized the survival time and the status at the end of their survival time for each patient to generate Kaplan-Meier curves (KM curve). We observed that high-risk patients had statistically significantly worse prognosis ( Figure 6). The results showed the TRS for cancer stage may provide an effective prognostic tool of breast cancer patients. We divided patients into high-risk and low-risk groups based on the 50th TRS. The patients with low-risk group have better prognosis than those with high-risk group.

Discussion
In our study, first of all, we have developed a novel TRS method for breast cancer using multiple omics data and LightGBM model. For case-control and cancer stage status, we showed that the proposed method had better prediction performance than linear models and other ML models using multiple omics data. Meanwhile, the prediction results of five-fold cross validation demonstrated the robustness and reliability of our proposed method. Second, the combination of TRS further improved the predictive performance for breast cancer. Finally, by analyzing the trend of prevalence and associations between TRS and breast cancer risk, the results bolstered the clinical understanding and application for breast cancer TRS. In addition, we also found that our TRS models for cancer stage status can improve the prognosis prediction of breast cancer patients.
Most of the previous PRS studies focused on the analysis of individual-level genotype data (SNPs) using linear models. For example, Mavaddat et al. [15] utilized PRS derived from 313 SNPs in 69 studies of the Breast Cancer Association Consortium (BCAC) to predict the breast cancer risk and the AUC was 0.63. Khera et al. [14] derived a PRS based on 5218 SNPs in the UK Biobank and the AUC was 0.68. Although these studies obtain individual-level genetic risk of breast cancer, the current prediction accuracy still maintains at low level. In the independent validation dataset (GSE66695), our TRS model obtained the AUC of 0.98. Thus, the TRS based on multiple omics data and LightGBM not only improved the risk of predicting breast cancer, but also expanded the current definition of PRS from SNP data to genomics and transcriptomics data. In addition, some studies used gene expression data and clinical data to establish prognostic models to assess disease risk. For example, wang et al. [41] established an immune-related prognostic score in 22 breast cancer cohorts with a total of 6415 samples. Yang et al. [42] undertook a study of tumor infiltrating lymphocytes in a large group of ovarian cancer patients and found that high expression levels of the immune-related genes were associated with good prognosis in high-grade serous carcinomas. Distinct with these studies our investigation utilized the LightGBM algorithm and multi-omics data to build a transcriptional risk score (TRS) model and estimate the risk of breast cancer.
Our proposed method has the following advantages. First, the LightGBM model we used exploits gradient boosted trees to fits all biological variables simultaneously, especially high-dimensional data such as multiple omics data [23]. In addition, the LightGBM model takes advantage of ensemble learning, which helps to minimize the main causes of error in ML model such as noise, bias and variance than a single model [43]. Second, it is not easy to obtain SNP data because of ethical and legal constraints. With the development of high-throughput omics technology, related studies accumulate vast amounts of genomic and transcriptomic data which can be downloaded from many public databases. Third, as representative of genomics data, DNA methylation can be modulated by physiological and environmental exposures and provide biomarkers for diagnosis and prognosis for cancer [44][45][46]. Transcriptomics data including miRNA, mRNA, and lncRNA reveals the transcription and regulation mechanism of large-scale genes, which play an important role in determining the mechanism and treatment of cancer [47,48]. Compared to the individual-level genotype data, using multiple omics data to construct breast cancer TRS considered the interaction of genetic and environmental factors, and thus can provide higher prediction accuracy.
Although our TRS methods provide good predictive performance, they have some limitations. First, the LightGBM model has more hyper-parameters than traditional linear models such as MCP, LASSO and elastic net. Thus, we need more time to train the proposed model. We applied multithreading technology to effectively utilize computing resources and correspondingly reduced some running time. Second, the sample size of breast cancer from TCGA is relatively small compared to large-scale genome-wide association studies data. In addition, there are significantly more tumor samples than normal samples in our study. Imbalanced datasets significantly compromise the performance of most standard learning algorithms, because these models assume the balanced class distributions. Third, the results we obtained on the independent validation set are lower than the results on the TCGA breast cancer dataset. The reasons are as the following, the two datasets are collected from different studies. Besides, the sample size of the independent validation (GEO) set is much smaller than that of TCGA. In the future, one of our tasks is to collect more omics data and clinical information from public datasets (TCGA, GEO, METABRIC, et al.), more cancer datasets are needed to improve and validate the proposed TRS model for breast cancer.

Conclusions
We proposed a novel TRS model for two kinds of breast cancer phenotypes by using multiple omics data and LightGBM. The results demonstrated that our model improved the prediction accuracy of current PRS methods indeed and may provide an effective diagnosis and prognosis tool for breast cancer.