Testing Mediation Effects in High-Dimensional Epigenetic Studies

Mediation analysis has been a powerful tool to identify factors mediating the association between exposure variables and outcomes. It has been applied to various genomic applications with the hope to gain novel insights into the underlying mechanism of various diseases. Given the high-dimensional nature of epigenetic data, recent effort on epigenetic mediation analysis is to first reduce the data dimension by applying high-dimensional variable selection techniques, then conducting testing in a low dimensional setup. In this paper, we propose to assess the mediation effect by adopting a high-dimensional testing procedure which can produce unbiased estimates of the regression coefficients and can properly handle correlations between variables. When the data dimension is ultra-high, we first reduce the data dimension from ultra-high to high by adopting a sure independence screening (SIS) method. We apply the method to two high-dimensional epigenetic studies: one is to assess how DNA methylations mediate the association between alcohol consumption and epithelial ovarian cancer (EOC) status; the other one is to assess how methylation signatures mediate the association between childhood maltreatment and post-traumatic stress disorder (PTSD) in adulthood. We compare the performance of the method with its counterpart via simulation studies. Our method can be applied to other high-dimensional mediation studies where high-dimensional mediation variables are collected.


INTRODUCTION
Introduced by Baron andKenny in 1986 (Baron andKenny, 1986), mediation analysis has been broadly applied in many scientific disciplines, such as sociology, psychology, behavioral science, economics, epidemiology, public health science, and genetics (e.g., E. Shrout and Bolger, 2002;Preacher and Hayes, 2008;Hafeman and Schwartz, 2009;Pfeffer and Devoe, 2009;Imai et al., 2010;Rocca et al., 2010;Pearl, 2012;Pierce et al., 2014). Through solving a chain of relations between an exposure variable and an outcome, it helps to understand how the effect of one variable is transmitted to another variable. Thus, mediation analysis offers researchers a unique statistical tool to reveal the underlying mechanism or process of various scientific questions, especially when designing an intervention strategy. It has been further extended and developed via taking nonlinearity, interactions, various types of mediating and outcome variables, as well as missing data into account in recent developments (e.g., Imai et al., 2010;Vanderweele and Vansteelandt, 2010;Pearl, 2012;Zhang and Wang, 2013).
Recently, mediation analysis has been applied to genetic association studies in which one can evaluate how genetic variants (e.g., single nucleotide polymorphisms (SNPs)) pass effects to mediators such as gene expression or DNA methylation (DNAm) to affect a disease risk (e.g., Liu et al., 2013;Huang et al., 2014;Huang et al., 2015). The genome-wide mediation analysis provides additional insight into the causal mechanisms of complex diseases. DNAm is an epigenetic phenomenon. Its status change reflects environmental exposures on the genome. DNAm can regulate gene expressions and can be potential biomarkers for the early prevention of stress-related disorders (Klengel et al., 2014). Properly maintained DNAms are necessary for regulating chromosomal stability and gene expressions. However, they can change the DNA activity when things go wrong, and lead to unexpected consequences. A growing body of literature shows that different environmental factors can alter the level of DNAm among individuals (e.g., Guida et al., 2015;Dongen et al., 2016). Abdolmaleky et al. (2004) showed that DNAm may modulate gene-environment interactions on psychiatry disorder. Li et al. (2003) reported that exposure to xenobiotics in early life can persistently change the pattern of DNAm, resulting in potentially adverse biological effects which may explain the increased risk in adulthood of some chronic diseases. All evidences demonstrate the important role of DNAm in mediating the effect of environmental exposures on disease outcomes. Successful identification of causal DNAm as potential biomarkers can offer novel insights into the early prevention of some diseases such as stress-related disorders.
In a typical DNAm study, the number of DNAm can be much larger than the number of sample size. Mediation analysis focusing on one mediator at a time is not efficient enough to handle thousands of mediators (e.g., CpG sites). Methods for multiple mediators have been proposed assuming different data distributions with different methods. Focusing on continuous mediators, Huang and Pan (2016) developed a testing procedure using Monte-Carlo resampling method to evaluate the statistical significance. However, it is time consuming when the computing resource is limited.
Let X be an exposure variable; M j , j=1,…,k be the jth mediator; and Y be an outcome variable. Figure 1 illustrates the mediation model with a single mediator (a) and multiple mediators (b). In an epigenetic study, multiple mediators could be potentially correlated. For example, methylation signals in a given gene or region are typically correlated. Such correlation, if not properly handled, can lead to potential false positives or false negatives in traditional mediation analysis.
The high-dimensional and correlation nature of DNAm signatures ( Figure 1B) motivates us to consider a highdimensional mediation model, which is not a trivial extension of a low dimensional multiple mediator model studied in literature. Methodology development for mediation analysis with highdimensional mediators is still in its infancy. Zhang et al. (2016) proposed a high-dimensional mediation analysis method. They first applied a sure independence screening (SIS) method to reduce the data dimension from ultra-high to high, then adopted a penalized regression to shrinkage coefficients of irrelevant variables to zero. After the shrinkage, those mediators with non-zero coefficients were refit in a low-dimensional regression model for further hypothesis testing. Such penalized regression methods typically produce biased estimators, especially when correlations between predictors exist. This method thus could face potential issues with either false positives or false negatives. Huang and Pan (2016) proposed to transform the correlated mediators into independent ones, then performed the mediation analysis on the transformed variables. Such a method solves the correlation issue but faces the difficulty of interpretation, since the transformed variable is a linear combination of the original mediators and does not have a direct interpretation.
High-dimensional data analysis is typically formulated with high-dimensional penalized regression models, with the purpose to select important features that can minimize the prediction error. Popular methods include LASSO (Tibshiranit, 1996), adaptive LASSO (Zou, 2006), and elastic net (Zou and Hastie, 2005). Although these methods can do variable estimation and selection simultaneously, they cannot quantify the estimation uncertainty. There has been a flourish of recent literature on testing low-dimensional coefficients in high-dimensional sparse regression models (e.g., Zhang and Zhang, 2014;Dezeure et al., 2015;Zhang and Cheng, 2017;Wang and Samworth, 2018). These methods essentially implement a debias technique, then perform hypothesis testing using the debiased estimators (Zhang and Zhang, 2014). Following the asymptotic normality, one can obtain a p-value or construct a confidence interval for each coefficient (Van de Geer et al., 2014). Taking the high dimensionality and correlation issue into account, in this article, we adopt a highdimensional testing framework and conduct simultaneous inference under a high-dimensional sparse mediation model based on the recent de-sparsifying LASSO estimators (Zhang and Zhang, 2014). High-dimensional testing is embedded in the mediation model to handle the high dimensionality and correlation issues between mediators. We conduct extensive simulations to evaluate the performance of the methods and compare it with its counterpart. Application to two real data sets is given. Our method can be extended to other mediation analysis where high-dimensional mediators are observed. Figure 1A demonstrates a single mediation model. There are two types of effect from X to Y: (1) the direct effect from X to Y, denoted as ′ c ; and (2) the indirect effect from X to Y via the intermediate mediation variable M. The indirect effect measures the amount of mediation which comes from two sources: i) the effect from X to M, denoted as a; and ii) the effect from M to Y, denoted as b. The product of a and b defines the indirect effect. The total effect c from X to Y contains two parts, i.e., c c ab = + ′ . By fitting three different regression models, one can use the Sobel's method (Sobel, 1982) to estimate the standard error of ˆâ b from which the significance of mediation effect can be assessed.

STATISTICAL METHOD
The single mediator model shown in Figure 1A can be extended to a multiple mediator model by fitting a multiple regression model involving both the exposure and the mediator variables. The multiple mediator model is given as follows, where M j , j=1,..,k is the jth mediator variable; c represents the total effect from the independent variable X to the dependent variable Y; ′ c represents the direct effect from X to Y adjusting for the effects of multiple mediators; the indirect effect from X to Y mediated by M j is denoted by a j b j . The total mediation effect can be obtained as c c − ′ or a b j j j k = ∑ 1 . When the response variable Y is a categorical variable, method to estimate the total mediation effect based on the product measure, a j b j , is less susceptible to the scaling problem since only the b j coefficient is from a categorical regression analysis (MacKinnon, 2008  (2) As we mentioned in the Introduction section, a genomic mediation study often involves high-dimensional mediators. In many cases, the number of mediators is far beyond the sample size (k>>n). For example, the number of DNAm loci can be nearly half million, far more than the sample size. Another phenomenon for genomic mediators is that they are often correlated. Both the curse of dimensionality and correlation between mediators cause estimation problems in Model (1) and (2). Classical regression analysis cannot be directly adopted to deal with the estimation and testing problem appeared in the third equation in Model (1) and (2). To solve both the high dimensionality and correlation problem, we propose to adopt a high-dimensional testing framework which is focused on de-sparsified LASSO estimators (Zhang and Zhang, 2014). The detailed estimation and testing procedure for the proposed high-dimensional mediation testing framework is given as follows: Step 1: First apply an SIS procedure to reduce the methylation dimension from ultra-high to high dimension (Fan and Lv, 2008). According to the SIS algorithm, the top d=n/log(n) methylation variables with the largest effects were remained in the model when the response Y is a continuous variable. For a binary response, the top d=n/log(n) variables can be kept in the model. SIS theoretically guarantees that no true signals are removed from the model. The SIS step can be based on the third or the second regression equation in Model (2). For a binary response Y, Zhang et al. (2016) suggested that SIS can be done based on the second equation in Model (2). For a continuous response variable, the SIS step can be done based on the third regression equation in (2). After SIS, the number of methylation loci is reduced from k to d. We then focused our analysis to these d methylation variables to test mediation effects. Denote the remaining methylation loci after the SIS step as M j ,j=1,…,d.
Step 2: In the second step, we fit the following model, Other covariates can also be fitted to this model. Since the dimension d can still be relatively large after the SIS step, regular least squares estimation will not work well. For highdimensional data, penalized regressions are commonly applied for simultaneous variable selection and estimation. However, penalized estimators are biased and cannot be directly used for testing or confidence interval construction. Zhang and Zhang (2014) first time proposed a de-biased estimator for high-dimensional data. Let b lasso be the LASSO estimators. For a continuous response variable Y, A de-biased estimator, also called a de-sparsified estimator, is a bias-corrected estimator which can be given as, where b j is the bias-corrected coefficient of the jth methylation M j ; ˆ, b lasso l is the coefficient of the lth M l estimated by fitting a LASSO regression; Z j is the regularized residuals obtained by 1 Ω where γ j 0 represents the true regression coefficient; σ  can be calculated by using the scaled LASSO algorithm (Sun and Zhang, 2012), and Ω jj can be calculated by, Under the null that H . Similarly, we can get a p-value for each mediator based on the asymptotic normality property. Let the p-values for all the d methylation loci denoted Step 3: Let S={t:P t,b < 0.05}, which is based on the highdimensional inference in the second step. For testing H 0 :a t = 0, we denote the testing p-value as P t,a P a where t S ∈ , â t is the ordinary least squares estimator for a t and σ t is the corresponding estimated standard error, by fitting the 2nd regression equation in Model (2).
Step 4: We reject the null hypothesis of no mediation effect for M t only if both a t and b t are significant. The p-value for the joint significance test is defined as, A methylation locus has a significant mediation effect if P t * .
< 0 05 . This is also a so called intersection-union test (Berger and Hsu, 1996).
Remark 1: To make the paper self-contained, here we briefly introduce the High-dimensional mediation analysis (HIMA) method proposed by Zhang et al. (2016). The HIMA method involves three major steps: Step 1: (Screening) Use the SIS (Fan and Lv, 2008) to identify a subset of top mediators.
Step 2. (MCP-penalized estimate). Apply the MCP-based penalized regression to do simultaneous variable selection and estimation based on the variables from step 1.
Step 3. (Joint significance test). For those mediators with non-zero coefficients from step 2, fit a regression model again and get a p-value for testing each coefficient, then, taking the maximum of this p-value and the p-value for testing the α effect as the final p-value to assess the significance of the mediation effect.
Remark 2: Our method has two advantages: 1) It fits multiple mediators in one regression model and do the testing, rather than fitting and testing mediation effect one at a time. Statistically speaking, this yields more robust and efficient estimation and testing results; and 2) Different from Zhang et al. (2016), our method is a simultaneous inference in a high-dimensional sparse regression model implemented with a de-biasing technique. The de-sparsifying strategy can well handle correlations between methylation loci, as demonstrated in the simulation study.

SIMULATION STUDIES
We conduct extensive simulations to evaluate the performance of the proposed method and compare it with the HIMA method proposed by Zhang et al. (2016). In the follows, we denote our method as HDMA (high dimensional mediation analysis) and the method by Zhang et al. (2016) as HIMA. Data are generated following Model (2), where the exposure variable X is generated from a binomial distribution, i.e., B(n,0.74) in which the probability 0.74 is determined based on the proportion of drinking in the first real set (see the real data analysis section for details).
To have a fair comparison, we follow the simulation setup for the regression coefficients as given in Zhang et al. (2016). The first 8 elements of b(b j ,j = 1,…,8) are given as (0.8,0.7,0.6,0.5,0,0,0.5,0.5) T , and the first 8 elements of a(a j ,j = 1,…,8)are given as (0.35,0.25, 0.35,0.55,0.55,0.55,0,0) T . The rest of as and ′ b s are all set to zero. Under this setting, the first four methylation loci have significant mediation effects while the rest have no effect.
For the intercept terms, we set θ 2 = -4.5 and θ j ' = 1 . We also consider different correlations among the mediators, i.e., ρ = 0, and 0.8. When the direct effect ′ = c 0 , the model is a .
We evaluate the performance of our method (HDMA) in terms of false positive rate and power and compare with HIMA. We report the power (M 1 ∼M 4 ) and the type I error (M 5 ∼M 8 ) for each locus. For the rest of the k-8 loci, we report the averaged type I error rate. All simulations are based on 1000 replications under different sample sizes, i.e., n = 300 and 600 and different correlations, i.e., ρ = 0 and 0.8. Table 1 lists the results for binary responses assuming a complete mediation effect, i.e., ′ = c 0. There are several observations: (i) HIMA and HDMA have very similar power and size when there are no correlations between M (ρ = 0) under different scenarios. However, HDMA has substantially higher power than HIMA does when ρ = 0.8; (ii) The testing power decreases as the data dimension increases for both methods. For example, the power of testing M 1 is 0.754 for HDMA with k = 100, but decreases to 0.721 with k = 5000, when fixing n = 300 and ρ = 0; (iii) The power increases as the sample size increases. For example, when fixing ρ = 0.8 and k = 1000, the power increases from 0.598 to 0.951 for testing M 1 when the sample size increases from 300 to 600, a 59% increase; and (iv) HDMA is not sensitive to the correlation structures while HIMA suffers significantly from power loss when there are high correlations between the M variables. The difference is even more striking when the sample size increases from 300 to 600. For example, the power difference for testing M 1 is 0.014 for HDMA compared to 0.238 for HIMA when ρ is increased from 0 to 0.8, when fixing n = 600 and k = 1000. Similar patterns were observed for the other three M variables. Figure 2 summarizes the results with partial mediation, i.e., ′ = c 0 5 . . We consider N = 300 and 600, p = 100, 1000 and 5000, and ρ = 0 and 0.8. Corresponding to each mediator, there are four power bars. The left two correspond to the case with correlation ρ = 0, while the right two correspond to the case with ρ = 0.8. For a fixed sample size, the power typically decreases as the data dimension (p) increases. This is because of the increase of the noise features. When ρ = 0 (the independent case), HIMA and HDMA perform very similarly under different scenarios. However, when the correlation increases to ρ = 0.8, we observe a power gain by HDMA compared to HIMA under a sample size of 300. As the sample size increases from 300 to 600, we observe substantial power gain for HDMA. This shows the advantage of HDMA which can take care of the high correlation structure among the mediators. Figure 3 displays the type I error rate of the two methods. M other represents all p-8 zero effect mediators. The type I error for M other is calculated as the average type I error of the p-8 mediators. Again, each mediator has four bars. The left two correspond to ρ=0 while the right two correspond to ρ=0.8. Overall, the type I errors for the two methods are reasonably controlled, especially under a large sample size (N = 600). When the correlation is high, i.e., ρ=0.8, for some mediators such as M 5 and M 6 , HIMA has a higher false positive rate than HDMA does. This indicates the advantage of HDMA in false positive control when there are high correlations among mediators.
In summary, HDMA shows relative advantages over HIMA under different scenarios, especially when there are high correlations among mediators. As correlations are highly expected in real methylation data, HDMA can be an alternative strategy to HIMA and is generally safe to apply.

REAL DATA ANALYSIS
We apply the HDMA method to two real data sets with methylation loci as the mediators. DNAms play key roles in  regulating many cellular processes and are associated with human diseases (Robertson 2005). The first data set involves DNAm mediating the effect of alcohol consumption on epithelial ovarian cancer (EOC) status. Alcohol may induce DNAm alterations, which could trigger alcohol-induced carcinogenesis (Varela-Rey et al., 2013). In the second data set, we evaluate the effect of childhood maltreatment on post-traumatic stress disorder (PTSD) in adulthood, mediated by DNAms. It is hypothesized that childhood maltreatment affects biological processes via DNAm, which can have negative consequences late in life (e.g., Mehta et al., 2013;Klengel et al., 2016).

Case Study 1: Mediation Analysis of Alcohol Consumption, DNam, and EOC Status
The participants with age ranging from 27 to 91 were recruited between the year 1999 and year 2007 in the Mayo Clinic Ovarian Cancer. They were women of European ancestry who were invasive EOC cases and controls one-to-one matched on the basis of age (within 1-year). After eliminating missing values and other quality control, 196 cases and 202 controls were retained for further analysis. The exposure variable is alcohol consumption. Information on alcohol use was obtained via a written questionnaire asking "Do you currently drink alcoholic beverages?". DNAms are the mediators and EOC status is the outcome. We would like to identify the mediators and further quantify the mediation effect. Readers are referred to Koestler et al. (2014) and Wu et al. (2018) for more details about the data. Table 2 summarizes the lifestyle and demographic characteristics of the study population. The Student t-test or Chi-square test is used for comparisons between groups for continuous or categorical variables, respectively. As can be seen in the table, alcohol consumption is significantly lower in cases compared to controls. Enrollment year shows a significant difference in proportions between cases and controls. Thus, we include the enrollment year as a covariate in further mediation analysis.
Leukocyte-derived DNA was assayed with the Illumina Infinium HumanMethylation27 Beadchip platform and underwent quality control procedures at the Mayo Clinic Molecular Genome Facility (Koestler et al., 2014). The methylation beta values (β) of each CpG locus was logit-transformed (log(β/(1-β))) to get the M-value for further analysis. A total of 25,926 CpG sites were remained for analysis after normalization and adjusting for any batch or plate effects. Study shows that heterogeneity in white blood cells has the potential to confound DNAm measurements and statistical treatment is needed to correct for this confounding effect . Similarly, variation in celltype proportions across samples has the potential to confound the mediation effect of DNAm on the association of alcohol consumption and EOC status (Titus et al., 2017). We thus include the predicted proportions of the leukocyte sub-types for each of the study samples as covariates in the analysis, following a mixture deconvolution method by Houseman et al. (2012).
Since the response is a binary variable, we apply a logistic regression for the first and third regression equation in Model (2), while including enrollment year as a covariate. Note that the cell type data should be included whenever methylation signals are included in the model. Including the enrollment year (Enroll) and the proportion of cell type (CellType), Model (2) becomes, The coefficient estimates for the total effect is given as ĉ Alcohol =-1.310 (p-value < 0.001), indicating a significant protective effect of alcohol consumption on EOC status.
We apply the SIS algorithm to reduce the methylation dimension to 34 (n/2log(n)), then apply the HDMA and HIMA methods for further inference. Table 3 lists the findings by the two methods. Our method identified four CpGs with important mediation effects while HIMA identified two CpGs. Two CpGs, namely cg12278770 and cg03012280, overlap in two methods. A heatmap in Figure 4 shows that there are moderate correlations among the 34 CpG sites. Thus, it is not surprising to see that HDMA identifies more CpG mediators than HIMA does. CpG site cg18394848 resides in gene K-RAS. Nakayama et al. (2008) examined the K-RAS mutations in relation to extracellular signal-regulated protein kinase (ERK) activation in 58 ovarian carcinomas. Auner et al. (2009) drew a conclusion that K-RAS mutation is a common event in ovarian cancer primarily in carcinomas of lower grade, lower FIGO stage, and mucinous histotype. KEGG pathway shows that this gene is involved in the pathogenesis of ovarian cancer (Figure 5). This evidence indicates that cg18394848 could be an important epigenetic marker which mediates the effect of alcohol consumption on EOC pathogenesis. Elgaaen et al. (2010) found that gene KSP37 correlates strongly with histology, stage, and outcome in ovarian carcinomas. Thus, cg08132711 (in gene KSP37) can also be a potential epigenetic marker associated with the EOC status. Although we do not find direct literature support about the two genes FAM167B and ZFYVE19 where cg12278770 and cg03012280 are respectively located in, a two samples t-test results show that there are significant differences on methylation signals of cg12278770 and cg03012280 between cases and controls. The t-test statistics (p-value) are t cg12278770 =4.881(P<0.001) and t cg0301220 =5.415(P<0.001). It suggests that these two CpG sites may act as important players to mediate the effect of alcohol intake on EOC status (Figure 6).

Case Study 2: Mediation Analysis of Childhood Maltreatment, Dnam, and PTSD
The data came from the Grady Trauma Project study recruiting Afro-American participants from Atlanta inner-city residents, approved by the Institutional Review Board of Emory University School of Medicine and Grady Memorial Hospital (Wingo et al., 2018). A growing body of literature indicates that DNAm plays pivotal roles in the disease process of PTSD and in vulnerability and resilience to PTSD (Uddin et al., 2011;Lutz and Turecki, 2014). Studies also show that childhood maltreatment is associated with DNAm changes of multiple loci in adulthood (Mehta et al., 2013). We apply the proposed method to establish the link between childhood maltreatment and PTSD and further evaluate the mediating role of DNAm. The data set contains baseline information, cell composition, and DNAm. We adopt the modified PTSD Symptom Scale (PSS) and the Beck Depression Inventory (BDI) to classify cases and controls. Cases with current symptoms of comorbid PTSD and depression are  defined as having a PSS score ≥14 and a BDI score ≥14. Controls are defined as having neither PTSD nor depressive symptoms, as mirrored by a PSS score ≤7 and BDI score ≤7, despite being exposed to trauma (Beck et al., 1961;Foa et al., 2000;Wingo et al., 2018). We eliminate observations with missing values and exclude those with PTSD treated since the treatment might affect DNAm changes which can complicate the mediation effect. Finally, 54 controls and 74 cases are retained for further analysis. Table 4 summarizes the demographic characteristics of the study population. Ranges of age in case and control are (27.97, 57.97) and (30.69, 56.79), respectively. There is no statistical significance among the selected variables such as age, sex, and body mass index (BMI), but childhood sexual/physical abuse moderate to extreme is significantly higher for cases compared to controls. The same analysis plan as detailed in Case Study 1 is applied here. Since no clinical factors show statistical significance, we do not include any covariates in our mediation model. Next, we apply HDMA and HIMA to test which DNAm plays a mediating role between childhood maltreatment and PTSD.
The raw methylation beta values from the HumanMethylation 450k BeadChip (Illumina) are obtained via the Illumina Beadstudio program. Samples with probe detection call rates <90% and those with an average intensity value of either <50% of the experiment-wide sample mean or <2,000 arbitrary units (AU) are excluded from further analysis. The beta values are further converted to M-values and a total of 335,669 CpG sites are used for subsequent analysis. For the details of the data, readers are referred to the website http://gradytraumaproject.com/. The data set can be downloaded at https://www.ncbi.nlm.nih.gov/geo/ query/acc.cgi?acc=GSE72680. Lutz and Turecki (2014) reviewed human studies indicating that early-life experiences (e.g., childhood maltreatment) regulate life-long stress activities (e.g. psychopathological disorders) through epigenetic regulations (e.g., DNAms). Klengel et al. (2014) found that exposure to stress can induce long-lasting changes in DNAs, which may relate to the pathophysiology of depression and PTSD. This evidence suggests that a mediation model can help to understand how childhood maltreatment can alter long lasting DNAm changes which further affect phycological disorders such as PTSD. We fit the following mediation model while adjusting for the cell type effect whenever CpG sites are involved, i.e., Based on the first regression model, we identify an existing relationship between childhood maltreatment and PTSD with ˆ.  we keep n/log(n) mediators rather than n/2log(n) to avoid missing important loci, due to the small sample size. After the SIS step, 27 DNAm sites are left in the model for further analysis.  Figure 7. It is clear that there are strong correlations between some CpG sites and it is not surprising that HDMA identified one more CpG site since it can handle correlation well. We further test the methylation signal difference between cases and controls for the two CpG sites and the results show significant differences for cg06998765 (t = 4.109, P<0.001) and cg16928335 (t = 2.242, P = 0.027). Figure 8 plots the methylation signals between cases and controls for the two CpG sites. Ward et al. (2017) applied a genome-wide analysis method to analyze UK Biobank data and identified four loci associated with mood instability. Gene RPS6KL1 is located nearby one of these regions, suggesting a potential role of this DNAm on PTSD. Although we cannot find evidence to support the association between PTSD and gene SH2D1A where cg06998765 is located, a two samples t-test shows that there is a significant difference on methylation signal of cg06998765 between cases and controls. The upshot suggests that this CpG site may have an important role to mediate the effect of childhood maltreatment on PTSD (Figure 8).

DISCUSSION
A large body of literature has suggested that environmental exposures can leave epigenetic tags such as DNAm changes which further affect disease risks. Such a causal relationship can be better understood with a causal mediation model, with the hope to identify important epigenetic players (e.g., DNAm) that mediate the relationship between an exposure and a disease outcome. As biotechnology getting cheaper and cheaper, the pace of generating epigenetic data becomes faster and faster. In many applications, the number of epigenetic features can be much larger than the sample size, resulting in the so-called (ultra-) high dimensional data. These high-dimensional data provide unprecedented opportunity to reveal the molecular mechanism of many diseases. In the meantime, they also challenge the traditional mediation analysis methods which are developed for low-dimensional data.
In this work, we propose a high-dimensional mediation model to tackle issues due to high dimensionality and high correlation. Different from the HIMA approach developed by Zhang et al.   (Zhang and Zhang, 2014). Such correlations are naturally arising due to the nature of the epigenetic data. We illustrate the performance of the proposed method via simulations and case studies and compare with the HIMA method (Zhang et al, 2016). The simulation studies show that our method (HDMA) outperforms the HIMA method when there are high correlations between mediators. Thus, HDMA can be safely used in a high-dimensional mediation analysis from population studies. In the first real data analysis, four CpG sites are identified to mediate the effects between alcohol consumption and EOC status. HDMA identifies two more CpG sites than HIMA does. In the second real data analysis, of the two CpG sites identified by HDMA, one overlaps with HIMA. These CpG sites may mediate the effect of childhood maltreatment to PTSD risk in adulthood. In both real data analysis, HDMA identifies more CpG sites than HIMA does, demonstrating the superior power of HDMA over HIMA. However, further biological verification is needed to validate the results, since statistical significance does not guarantee a biological significance. Philibert et al. (2012) found that alcohol intake is linked to widespread changes in DNAm in women. Cvetkovic (2003) showed that DNAm alterations are an early step in carcinogenesis and could represent a mechanism of disease. Many such pieces of evidence point to the proper linkage of DNAm mediating the relationship between alcohol consumption and EOC status. Similar evidence also supports the linkage between childhood maltreatment and PTSD mediated by DNAm. Mehta et al. (2013) provided epigenetic support that childhood maltreatment is likely to carve long-lasting epigenetic marks, leading to adverse health outcomes such as PTSD in adulthood. Childhood abuse can increase the risk of neuropsychiatric and cardiometabolic disease via changes in epigenetic marks   (Szyf, 2012;Yang et al., 2013). These studies support the mediation role of DNAm between childhood maltreatment and the risk of developing PTSD in adulthood.
The mediation effect in this study is based on a linear effect assumption, while effects such as interactions including magnitude epistasis and sign epistasis are not considered. Such kinds of complex interactive mechanisms can complicate the model, especially under a high-dimensional setup. For example, if there are antagonistic epistatic interactions among mediators, the mediation effects between exposure and the outcome can be weakened, leading to the failure to detect the mediation effects. If there are synergistic epistatic interactions among mediators, the existence of mediators can produce a synergistic effect to enhance their mediation effect. In the event of multiple exposures, models can be even more complicated. Under these situations, it is not clear on how to model and assess the mediation effect in a high-dimensional setup. These issues imply the simplicity of the current method and also raise modeling challenges for further methodological development. We will take these into consideration in our future studies. The R code that implements the method can be found in github with weblink: https://github.com/ YuzhaoGao/High-dimensional-mediation-analysis-R/blob/ master/HDMA.R.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this manuscript will be made available by the authors, to any qualified researcher.
Requests to access the datasets should be directed to Yuehua Cui cuiy@msu.edu.

AUTHOR CONTRIBUTIONS
YG implemented the method and drafted the manuscript. HY and RF were involved in the data analysis. YZ and EG participated in the study. YC conceived the idea, designed the study, and drafted the manuscript. All authors read and approved the final manuscript.