Epigenome-Wide Tobacco-Related Methylation Signature Identification and Their Multilevel Regulatory Network Inference for Lung Adenocarcinoma

Tobacco exposure is one of the major risks for the initiation and progress of lung cancer. The exact corresponding mechanisms, however, are mainly unknown. Recently, a growing body of evidence has been collected supporting the involvement of DNA methylation in the regulation of gene expression in cancer cells. The identification of tobacco-related signature methylation probes and the analysis of their regulatory networks at different molecular levels may be of a great help for understanding tobacco-related tumorigenesis. Three independent lung adenocarcinoma (LUAD) datasets were used to train and validate the tobacco exposure pattern classification model. A deep selecting method was proposed and used to identify methylation signature probes from hundreds of thousands of the whole epigenome probes. Then, BIMC (biweight midcorrelation coefficient) algorithm, SRC (Spearman's rank correlation) analysis, and shortest path tracing method were explored to identify associated genes at gene regulation level and protein-protein interaction level, respectively. Afterwards, the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway analysis and GO (Gene Ontology) enrichment analysis were used to analyze their molecular functions and associated pathways. 105 probes were identified as tobacco-related DNA methylation signatures. They belong to 95 genes which are involved in hsa04512, hsa04151, and other important pathways. At gene regulation level, 33 genes are uncovered to be highly related to signature probes by both BIMC and SRC methods. Among them, FARSB and other eight genes were uncovered as Hub genes in the gene regulatory network. Meanwhile, the PPI network about these 33 genes showed that MAGOH, FYN, and other five genes were the most connected core genes among them. These analysis results may provide clues for a clear biological interpretation in the molecular mechanism of tumorigenesis. Moreover, the identified signature probes may serve as potential drug targets for the precision medicine of LUAD.

Partial least squares (Qiu, et al., 2017) Partial least squares (PLS) is an efficient statistical regression technique that is highly suited for the analysis of high-dimensional data, a powerfully proven method for analyzing genomic and proteomic data, especially problems of classification and dimension reduction in bioinformatics and genomics (Abdi and Williams, 2013;Nguyen and Rocke, 2002;Song, et al., 2012). Suppose that the data X is an n×p matrix of n samples and p genes (the raw data set should be scaled to zero mean and unit variance), and let Y denote the n×q vector of response values, such as the indicator of classification of smokers and non-smokers. When n<p, the usual regression tools such as ordinary least squares (OLS), cannot be applied since the p×p covariance matrix X T X is singular. In contrast, PLS may be applied also to the cases, whose aims is to describe linear relationship between the predictor matrix X ∈ R n×p and the response Y ∈ R n×q ,

Y=XB+V
(1) where B ∈ R p×q is the regression coefficient matrix and V ∈ R n×q is the residual matrix. PLS regression is based on the basic principal component decomposition: where * ∈ + ,×. is the latent variables (LVs) matrix, / ∈ + 0×. and 1 ∈ + 2×. are matrices of coefficients, 3 ∈ + ,×0 and 4 ∈ + ,×2 are matrices of random errors, m is the number of LVs.
From equation (1), (2), and (3), the T is the key. The objective criterion for constructing components in PLS is to sequentially maximize the covariance between the response variable and a linear combination of the predictors. That is, in PLS, the components are constructed to maximize the objective criterion based on the sample covariance between Y and XW, thus, 5 6 = arg max <=> ? @ ?AB C (EF, H) (4) Subject to the orthogonal constraint, 5 6 J E K EF L = 0 for all 1 ≤ k < i (5) Where V ∈ + W×X is a matrix of weights.
To derive the T, PLS can all be seen as methods to construct a matrix of latent components T as a linear transformation of X, * = YV (6) If T is constructed, 1 * and is obtained as the least squares solution of Equation (2): 1 * = (* * *) Z[ * * \ (7) The matrix B regression coefficients matrix is constructed from Equation (1): ] = V(* * *) Z[ * * \ (8) The number of LVs is the only parameter of PLS which need to be decided, with the increase of LVs, the information of original data preserved is increasing, until reaching the maximal value, which is the rank of X, all the information of original data is contained in LVs.
Biweight midcorrelation coefficient algorithm (Yuan, et al., 2018) Biweight midcorrelation is considered to be a good alternative to Pearson correlation coefficient since it is more robust to outliers. In order to introduce the biweight midcorrelation coefficient (BIMC) of two numeric vectors ^= (_ B , ⋯ , _ a ) and b = (c B , ⋯ , c a ), x and y can be two column vectors of DNA methylation matrix, d e , > e are defined with i=1,…,n as follows: gli(^) = ghi(|_ e − ghi(^)|) where ghi(^) and ghi(b) are the median of vector x and y respectively. gli(•) represents the median absolute deviation of numeric vector. Based on d e and > e . The weights 5 e (n) for _ e and 5 e (o) for y e are defined as follows: where I is an indicator equation, for equation (13), the indicator equation s(1 − |> e |) is 1 if (1 − |> e |)>0 and otherwise equals to 0. The same situation occurs for equation (12). For equation (10) and (13), as the difference between y e and ghi(b) gets smaller and smaller, 5 e (o) gets closer to 1. If the difference between y e and ghi(b) is larger than T • gli(b), 5 e (o) equals to 0. The same situation occurs for equation (9) and equation (12). T is a pre-defined parameter. Let us discuss pre-defined parameter T . In practice, the bigger T , the smaller the number of values to be filtered out.
where BIMC(x, y) represents the BIMC of x and ... It should be noted that, the range of BIMC is from -1 to 1. If there is a strong positive linear relationship between DNA methylation vectors, the value of BIMC will be close to 1. If there is a strong negative linear relationship between methylation vectors, the value of BIMC will be close to -1. If there is no linear relationship or only a weak linear relationship between methylation vectors, the value of BIMC will be 0 or close to 0. SN (Sensitivity), SP (Specificity), ACC (Accuracy) are defined as follows:

Tobacco exposure Prediction Model assessment
SP = jã jã + äâ (17) ACC = jâ + jã jâ + jã + äâ + äã (18) where TP, FP, TN and FN denote true positive, false positive, true negative, and false negative, respectively. In our study, smoker samples and non-smoker samples were designated as the positive and negative samples, respectively. Correspondingly, sensitivity is the proportion of smoker samples correctly classified, specificity is the proportion of non-smoker samples correctly classified, and accuracy is the proportion of both types of samples correctly classified. In this study, currentsmokers were used as positive sample, and never-smokers were used as negative sample.     Table S7. The 33 possible related genes corresponding to proteins whose betweenness greater than or equal to 100 obtained by BIMC method using the shortest path algorithm and PPI network hgnc_symbol Ensembl Betweenness