High-dimensional mediation analysis in survival models

Mediation analysis with high-dimensional DNA methylation markers is important in identifying epigenetic pathways between environmental exposures and health outcomes. There have been some methodology developments of mediation analysis with high-dimensional mediators. However, high-dimensional mediation analysis methods for time-to-event outcome data are still yet to be developed. To address these challenges, we propose a new high-dimensional mediation analysis procedure for survival models by incorporating sure independent screening and minimax concave penalty techniques for variable selection, with the Sobel and the joint method for significance test of indirect effect. The simulation studies show good performance in identifying correct biomarkers, false discovery rate control, and minimum estimation bias of the proposed procedure. We also apply this approach to study the causal pathway from smoking to overall survival among lung cancer patients potentially mediated by 365,307 DNA methylations in the TCGA lung cancer cohort. Mediation analysis using a Cox proportional hazards model estimates that patients who have serious smoking history increase the risk of lung cancer through methylation markers including cg21926276, cg27042065, and cg26387355 with significant hazard ratios of 1.2497(95%CI: 1.1121, 1.4045), 1.0920(95%CI: 1.0170, 1.1726), and 1.1489(95%CI: 1.0518, 1.2550), respectively. The three methylation sites locate in the three genes which have been showed to be associated with lung cancer event or overall survival. However, the three CpG sites (cg21926276, cg27042065 and cg26387355) have not been reported, which are newly identified as the potential novel epigenetic markers linking smoking and survival of lung cancer patients. Collectively, the proposed high-dimensional mediation analysis procedure has good performance in mediator selection and indirect effect estimation.


Introduction
Mediation analysis based on counterfactuals has been widely used in understanding the causal pathways from an exposure to an outcome. The idea of mediation approach was firstly applied in psychology research [1][2][3], and gradually extended to other fields including epidemiology, biomedical, and clinical studies. Through mediation models, the relationships among exposure, mediator, and outcome can be characterized. Specially, the graphical model of causal mediation models can be illustrated by using a directed acyclic graph (DAG) [4]. Moreover, through mediating effect analysis, the total effect of an exposure on the outcome is decomposed into two parts. One is the natural direct effect, which is the effect of an exposure directly on the outcome that is not through mediators. Another part is the natural indirect effect, which describes the effect of an exposure on the outcome through the mediators.
Extensive works have been done in mediation analysis during the past decades, particularly in the area of causal inference [4][5][6][7][8][9]. Besides, mediation analysis has been generalized from continuous outcomes to binary outcomes [10][11][12], even to the time-to-event outcome [13,14], since many epigenetic questions involve addressing censored survival data. Recent years have seen huge progress in the extensive of mediation methods to survival models. For instance, built upon the framework of causal inference, the methodology of mediation analysis has a pervasive application with Aalen's additive hazards model, Cox proportional hazards model and accelerated failure time model [15,16].
To date most methodology in mediation analysis has been concentrated on the context of a single mediator, with only few attention relating to multiple mediators [17,18], especially in survival data [19]. However, scare attention has been received for the development of approaches to deal with mediation with high-dimensional mediators. As rapid advances in technology have generated large amount of data from genome or genetic researches, there is a broad application of mediation analysis for high-dimensional data [20][21][22][23][24][25][26][27][28][29]. For example, Zhang et al. (2016) considered estimating and testing mediation effects for high-dimensional epigenetic data and showed that DNA methylation is mediated between smoking and lung functions [20]. Nevertheless, methods for high-dimensional mediation analysis with survival outcomes are still yet to be developed. Such an extension is the aim of this work.
As a motivating example, smoking has side effects on human health, especially for lung cancer which is the leading cause of cancer mortality worldwide and creates an enormous public health burden [30]. When individual level phenotype and genotype data are available, numerous researches have indicated that mutation in CpG sites are related to tobacco smoking [31][32][33][34][35]. It is of great scientific interest to identify which methylation markers are acting as mediators between smoking and lung cancer patient's survival, as this is essential for finding the disease diagnosis markers and the treatment target in precision medicine. In practice, both Illumina Infinium HumanMethylation27 and HumanMethylation450 are widely used platform which allow to measure DNA methylation levels of roughly 27k and 450k respectively. High-dimensional methylation data are generated in both platforms [36]. Hence, it is of great importance to identify the significant mediators among the huge number of potential candidates in the survival models. Previous studies utilized continuous or binary outcome when selecting the high-dimensional mediators [20]. However, when in the context of survival analysis, such method will lose efficiency as it ignores the time-to-event information.
In this article, we aim to study the selection of mediators (DNA methylations) between the smoking exposure and the overall survival in lung cancer patients. We propose a procedure to select, estimate and test mediation effects in survival models with high-dimensional epigenetic information. The main idea is as follows. Firstly, we reduce the dimension of potential mediators from ultra high-dimensional to moderate (i.e., one that is less than the sample size) using sure independence screening (SIS) method [37]. Secondly, we conduct variable selection via Cox proportional hazards model with the minimax concave penalty (MCP) [38]. Finally, we decompose the total effect and carry out the Sobel and joint significance test for mediation effect. This is the first proposed procedure for mediator identification in the survival models, to the best of our knowledge.
The rest of the paper proceeds as follows. In the next part, we provide simulation studies to evaluate the performance of our proposed procedure and a real data application to analyze the mediation effects of high-dimensional DNA methylation markers on the causal effect of smoking on lung cancer in a epigenome-wide study. Then, we conclude the paper through discussing limitations and other feasibilities. Finally, we introduce models, assumptions and develop the proposed procedure.

Simulation studies
This section is devoted to a series of simulation studies which evaluate the performance of the proposed method with time-to-event outcome. We demonstrate the performance of the proposed procedure through mediator selection and indirect effect estimation.
Simulation results. We perform the analysis using the proposed procedure with time-toevent outcome and the simulation results are summarized in Tables 1 and 2. We present true positive rate (TPR, percentage of nonzero mediators correctly selected), the number of false positive (FP, the number of zero mediators incorrectly selected), and false discovery proportion (FDP, percentage of incorrect selection among all selected).
We first assess the accuracy of variable selection with our proposed procedure. Table 1 shows the selection results of the proposed procedure. The TPRs are high among all the censoring and sample size settings with the lowest rate of 0.7435 at the high censoring rate setting (35%). Among all the 9,996 zero-effect mediators, the highest FP is 0.2480. The false discovery proportion (FDP) among the selected mediator is lower than 0.0584 among all settings. As the sample size increases to 500 and 1000, the TPR increases to about 1. Compared with the performance of identifying significant markers with the Sobel test, the joint test has higher TPRs and also a slight higher FPs and FDPs.
To highlight the effectiveness of our approach, we compare our procedure with other methods. One is the one-step approach, i.e., using MCP-based regularization alone. Another is the naive approach, i.e., fitting the mediator model and Cox model for each mediator. Our method shows better performance than one-step method and the naive method (S1 Table & S2 Table). Besides, we also conduct other simulations. The proposed method with MCP-based regularization performs slightly better than the LASSO-based regularization (S3 Table). As the number of significant mediator decreases, there is a higher accuracy of mediator selection (S4 Table). Moreover, the TPR decreases as the censoring rate increases, especially with limited sample size and higher censoring rate (S5 Table). Additionally, we also consider the cases that mediators are dependent. The TPR increases with the increases of correlation among the mediators when the correlation is not very large, but decreases when the correlation is large due to the collinearities among the mediators (S6 Table). Overall, the performance of proposed selection procedure is good in terms of selecting significant biomarker and controlling both the FP and FDP.
Except for mediator selection, we also perform the mediation effect estimation. We evaluate the estimation of α k β k and present the results in Table 2. In general, the bias is small. Both the empirical and estimated variance are close to each other and decrease as the sample size increases. Also, the coverage probability tends to be 0.95 as the sample size increases. Besides, the accuracy of effect estimation decreases with the increase of noise and the dimension of mediators (S7 Table & S8 Table).
In summary, the simulation studies show that the selection accuracy of the high-dimensional mediation model is high and the estimators of indirect effect through the nonzero mediators have minimal bias. To further demonstrate the efficacy of the approach, we apply the propose procedure to analyze a lung cancer data set.

Data application
Lung cancer is one of the deadliest cancer worldwide [30]. It can be categorized as non-smallcell lung cancer (account for almost 85%) and small-cell lung cancer (15%) [30]. Among lung cancer patients, tobacco smoking is a common risk factors. Besides, many researches suggest that DNA methylation markers may be the potential promoters for lung cancer. For example, hypermethylation of CpG islands in the promoter regions of genes was demonstrated as a common phenomenon in lung cancer [39]. In addition, tobacco smoking was related with methylation [31]. How the smoking behaviors affect the cancer survival through the methylation is of great interest. We apply the proposed procedure to identify which methylation markers are the potential mediators between the tobacco smoking and the overall survival time.
We applied the proposed method to the TCGA (The Cancer Genome Atlas) lung cancer cohort study including lung squamous cell carcinoma and lung adenocarcinoma. There were 1299 lung cancer patients aged 33-90 years. 907 of them had DNA methylation profile measured using the Illumina Infinium HumanMethylation 450 platform. DNA methylation values were recorded for each array probe in each sample via BeadStudio software. A total of 365,307 probes were included in the analysis.
To identify the potential methylation mediators between the tobacco smoking and the overall survival, we applied the high-dimensional mediator model with smoking status assessed at their initial diagnosis (smoker/non-smoker) as the exposure variable, DNA methylation measured at the same time as the high-dimensional mediators, and the survival time as the outcome variable. The overall survival time was defined as the number of days from initial diagnosis to the death or the last follow-up date. The median survival time was 54.4 months.  Subject with no survival time, exposure and other covariates were excluded; we got 754 patients with 305 deaths observed during the follow-up. Other covariates including age at initial diagnosis, gender, tumor stage and radiotherapy (yes/no) were adjusted. Due to the fact that the relationships between methylation and the outcome are much stronger than those between exposure and methylation in the analysis data set, we add top d = 3n/log(n) CpGs using sure independence screening method based on the path from smoking to the methylation (S1 Text) in order to improve the probability to recognize significant mediators. Secondly, we run a variable selection on the CpGs screened in the first step. Finally, we carry out the significance test for the direct and indirect effects.
The analysis results are presented in Table 3. We identified 6 CpGs mediating the relationship between smoking and overall survival of lung cancer patients with Bonferroni's adjusted p-value<0.05. Since smoking generally increases the risk of lung cancer and reduces overall survival of lung cancer patients with the total effect 1.3248 (95%CI: 1.022, 1.717), we focus on the three of these mediators with the log-hazard indirect effect α k β k >0 (smoking increases the mortality), where k denotes cg21926276, cg27042065 and cg26387355. All the three genes in which methylation sites locate are correlated with lung cancer or tumor growth in previous studies. For example, the gene H19 (cg21926276 locate) is related with both lung cancer and tumor growth, methylation of which has been thought as a sensitive marker of tobacco history [40,41]. The gene CDCA3 (cg27042065 located) is also associated with lung cancer and survival of cancer patients [42][43][44], and Song and Yang (2018) have reported that gene LOC338797 (cg26387355 located) is related with progression of tumor in lung cancer patients [45]. Besides confirming the previously reported genes, the three CpGs (cg21926276, cg27042065 and cg26387355) are also identified as novel markers for the survival of lung cancer patients. Besides, other methylation sites with negative log-hazard mediation effect have not been reported so far, and they may be the potential biomarkers to extend survival time for lung cancer patients. Take cg07690349 as an example, the gene MUC5B (cg07690349 locates) is one of the secreted mucins which are large O-glycosylated proteins that participate in the protection of underlying mucosae in normal adults [46].
We are also interested in how the effect of the exposure is mediated through the DNA methylation markers. The path-specific effects of tobacco smoking on overall survival of lung cancer patients are listed in Table 4. Mediation analysis using Cox proportional hazards model discovers that the effect of having serious smoking history on increased risk of developing lung cancer is mediated through methylation markers including cg21926276, cg27042065, and cg26387355; the hazard ratio for each mediator is 1.2497(95%CI: 1.1121, 1.4045), 1.0920(95% CI: 1.0170, 1.1726), and 1.1489(95%CI: 1.0518, 1.2550), respectively. The direct effect is 1.4309 (95%CI: 1.0806, 1.9074). Interventions can be explored on these markers to improve medical care for detection and treatment of lung cancer among smokers. Besides, we also use the onestep method and the naive approach for the lung cancer data, and they fail to identify any significant mediators.
Through the mediation analysis of DNA methylation for the survival time of the lung cancer patients, we found the three CpGs mediating the smoking and the mortality. Our findings not only were in line with previous studies which found that the gene that CpGs locate were important biomarkers for lung cancer [40], but also uncovered the mediation role of the markers connecting the smoking exposure and the survival time.

Discussion
Identifying the right targets among large-scale potential epigenetic mediators is crucial in biomedical research. High-dimensional mediation analysis not only finds the potential interventional targets, but also connects the exposure and outcome through the identified targets. Finding the significant mediators can also help early detection of lung cancer and hence improve overall survival. In this article, we proposed a high-dimensional mediation survival model utilizing the time-to-event outcome in place of binary outcome to enhance accuracy of variable selection and minimize the estimation bias. Our approach involves sure independence screening, MCP penalized variable selection, as well as the Sobel and joint significance test and effect decomposition.
In this research, we established a facile and efficient procedure for high-dimensional mediation analysis with time-to-event data to select DNA methylation and estimate the effects of exposure and outcome mediated by the mediators. The proposed procedure has good performance in mediator selection and indirect effect estimation which has been showed in the simulation studies and real data analysis. We demonstrate the validity and utility of the proposed method through simulation studies and a TCGA lung cancer data example. The proposed method has high proportion in true positives and shows a well performance in controlling false positives and false discoveries. The proposed method can be widely used in biomedical data analysis, especially involving high-dimensional mediators.
For high-dimensional mediator analysis, many questions are still yet to be answered and of interest to future studies. For example, incorporating multiple phenotypes (outcomes) into a joint model with high-dimensional mediators can improve the efficiency, e.g., the joint model of survival and longitudinal [47], survival and recurrent events [48,49]. Another example is to incorporate multiple exposures into high-dimensional mediation analysis with survival outcome, since both lung cancer and methylation are associated with many risk factors. Besides, with high dimensionality and mediation model, adding interaction terms increase the model complexity dramatically. Since the selection and estimation of interaction terms are of much different interpretation, we consider this to be beyond the scope of the current paper. However, it will be interesting to consider a further study of high dimensional mediator selection with interaction terms. Further researches are needed for these method developments.

Notations and high-dimensional mediation models
Let D i denote the time from onset to an event (death) and C i be the potential censoring time. Mediation models are used to model the mechanism of the exposure's effect on the outcome mediated by the mediators. In the context of time-to-event data, the rate at time t means the probability of experiencing death within the next unit of time, given that a patient is still alive right before time t. Cox proportional hazards model [50] uses the hazard ratio as an expression of how many times greater the rate is for the smoking group relative to the nonsmoking group. For the survival outcome, we consider the following regression models to assess the mediation effects with high-dimensional mediators: where Eq (1) is the Cox proportional hazards model which describes the relationship between the exposure X, mediators M and the time-to-event variable; Eq (2) characterizes how the exposure variables influence the mediators; λ 0 (t) is the baseline hazard function; Z is the baseline covariates including gender, age and other baseline characteristics; γ is the direct effect of the exposure on the outcome; β = (β 1 ,� � �,β p ) T is the parameter vector relating the mediators to the outcome adjusting for the effect of exposure; α = (α 1 ,� � �,α p ) T is the parameter vector relating the exposure to the mediators; c k is the intercept term and e ik~N (0,σ 2 ) is the residual.

Assumptions
Assumptions about absence of confounders should be made if one intends to obtain causal conclusion from an analysis. Here, T(x,m 1 ,� � �,m p ) denotes the survival time when the exposure be set to x and the mediator is set to m k ,k = 1,2,� � �,p, and M k (x � ) denotes the value of the mediator when the exposure is set to x � . Z denotes baseline covariates such as age and gender. Except for the assumption of consistency [51], based on Huang and Yang (2017) [19], we also assume the following hypothesis which is of great importance in subsequent derivations.
(A4). For any k = 1,2,� � �,p, M k (x � )?T(x,m 1 ,� � �,m p )|Z; that is no measured or unmeasured exposure-dependent confounders between the mediators and outcome, where x � is the intervention for the exposure X with different value than x.

Proposed procedure
For estimation in the survival component, the corresponding log-partial likelihood function of (1) is given by where R i = {l:T l �T i } is the at-risk set; P i = (X i ,Z i ,M 1i ,� � �,M pi ) T and Q = (γ,θ,β 1 ,� � �,β p ) T . The goal of variable selection is to identify S ¼ fk :b k 6 ¼ 0g, a subset of Q, which contains all the variables that are the significant mediators between the exposure and the outcome. Nevertheless, the number of mediators p is much larger than the sample size n, and the traditional statistics methods for Cox regression analysis fail to work in (3). To deal with this problem, we will first apply sure independence screening (SIS) [37] method to identify a subset S 1 = {k:1�k�p} of size d = [kn/log(n)] which are the mediators with strong correlation value for the response variable. We will then conduct variable selection via MCP-based Cox model within the subset S 1 . Finally, we estimate the direct and indirect effects and perform significance test. Fig 2 illustrates the overall workflow for high-dimensional mediation analysis. Details of the proposed procedure are in the following steps: Step Sure independence screening, which is based on correlation learning, has been a general technique to reduce dimensionality from high to a small scale that is below the sample size.
Here we use d = 2n/log(n) in the place of d = n/log(n) to increase the probability for identifying important mediators [37], considering that both α k and β k have to be selected as nonzero to ensure a specific mediator to be selected.
Step 2. (MCP-penalized variable selection) Among all the screened mediators M k 2S 1 from the Step 1, we further identify the subset S 2 ¼ fk :b k 6 ¼ 0g via the penalized log-partial likelihood optimizationb where l n (β) is showed in Eq (3); P λ (�) is the penalty function that depends on the regularization parameter λ>0, which controls the strength of regularization. Tibshirani (1997) [52] proposed a penalized reweighted least squares method to solve (4). The detailed calculation and derivation process of the above equation is provided in the Supporting Information (S2 Text).
Here, we adopt the minimax concave penalty (MCP) proposed by Zhang (2010) [38] with the following derivative function where a>1 is a shape parameter. Breheny and Huang (2011) implemented the MCP procedure with the R package ncvreg [53]. Here we prefer MCP approach over other penalty, e.g. lasso, as MCP is a fast, nearly unbiased and accurate approach of penalized variable selection in highdimensional context. Besides, it has the oracle property which can select the correct model with probability tending to 1.
Step 3. (Effect decomposition) Lange and Hansen (2011) have studied direct and indirect effects for single mediator in a survival context with Aalen additive hazards model [15]. The idea is to use the counterfactual rate difference as the effect measure of the exposure changing from x to x � . Huang and Yang (2017) extend to two mediators with Cox model [19]. To extend the decomposition of direct and indirect effect to high-dimensional mediators model, we first approximate the counterfactual outcome defined as log hazard as follows Derivation of the above expression is provided in the Supporting Information (S3 Text). Then, we can express the direct effect and total indirect effect on log hazard ratio by using the above expression as and the total effect is the sum of the direct and total indirect effects.
Step 4. (Significance test) For k2S 2 , a variable M k is considered as a mediator between the exposure and outcome only if the indirect effect is significant. Here, we consider two methods to test the mediation effects, including Sobel test (i.e., product method [54]) and joint significant test (i.e., causal steps method [55]). Followed with the Sobel test for indirect effect, we have the p-value for testing the null hypothesis H 0 : α k β k = 0 of no indirect effect P raw;k ¼ 2 1 À �ð jâ kbk ĵ s a k b k Þ ( ) ; whereŝ a k b k is the estimate of the Sobel standard error (SE) [54]. We have the revised p-value via the Bonferroni's method in order to adjust for multiple comparisons P k ¼ minfP raw;k � jS 2 j; 1g; ð5Þ where |S 2 | is the number of elements in set S 2 . The joint significant test for indirect effect is based on the path-specific (i.e., X!M and M!Y) P-values [55] and does not provide an estimate. Hence, we can reject the null hypothesis of no IE k if P k <0.05, and conclude that the variable M k is the significant mediator between the exposure and outcome.