Replicability analysis in genome-wide association studies via Cartesian hidden Markov models

Background: Replicability analysis, which aims to detect replicated signals, is attracting increasing attention in modern scientific applications. For example, in genome-wide association studies (GWAS), an association is more convincing when it can be replicated in more than one study. Since neighboring single nucleotide polymorphisms (SNPs) often exhibit high correlation, it is desirable to properly exploit the dependency information among adjacent SNPs in replicability analysis. In this paper, we propose a novel multiple testing procedure based on the Cartesian hidden Markov model (CHMM), called the repLIS procedure, for replicability analysis across two studies. It characterizes the local dependence structure among adjacent SNPs via a four-state Markov chain.

Results: Theoretical results show that the repLIS procedure controls the false discovery rate (FDR) at the nominal level α and is optimal in the sense that it has the smallest false non-discovery rate (FNR) among all α-level multiple testing procedures. We carry out simulation studies to compare the repLIS procedure with existing methods, including the Benjamini-Hochberg (BH) procedure and the empirical Bayes approach called repfdr. Finally, we apply the repLIS and repfdr procedures to replicability analyses of psychiatric disorder data sets collected by the Psychiatric Genomics Consortium (PGC) and the Wellcome Trust Case Control Consortium (WTCCC). Both the simulation studies and the real data analysis show that the repLIS procedure is valid and achieves higher efficiency than its competitors.

Conclusions: In replicability analysis, the repLIS procedure controls the FDR at the pre-specified level α and achieves higher efficiency by exploiting the dependency information among adjacent SNPs.

Electronic supplementary material: The online version of this article (10.1186/s12859-019-2707-7) contains supplementary material, which is available to authorized users.

The asymptotic optimality can be derived without essential difficulty by extending the proof of Theorem 6 in Sun and Cai (2009).

Additional Simulation Results
In this section, we carry out additional simulation studies to investigate the numerical performance of repLIS in various model settings. Note that repLIS's competitor, repfdr, is run in an idealized setting in which the proportions of the joint hypothesis states are known. In practice, however, these proportions are usually unknown, and repfdr can be more conservative. The model settings coincide with those in Scenario 2 of Simulation I. Since the BH procedure at FDR level 0.02 is too conservative to identify replicated signals, we removed it from these simulation studies. Figure 1 shows the simulation results at the more stringent FDR level of 0.02. The FDR levels of all three procedures (oracle repLIS, data-driven repLIS, and repfdr) are controlled at approximately 0.02, and both oracle and data-driven repLIS dominate repfdr as $\mu_1$ varies from 3 to 5.
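The FDR comparison above can be made concrete with a minimal sketch. It assumes that repLIS, like the LIS procedure of Sun and Cai (2009), rejects hypotheses through a step-up rule on the sorted test statistics, and that a false discovery is any rejected SNP whose true joint state is not (1, 1). The function names `step_up_reject` and `false_discovery_proportion` are hypothetical and are not taken from the paper or from the repfdr software.

```python
def step_up_reject(stats, alpha):
    """Return the indices rejected by a step-up rule on the statistics.

    Assumption: smaller statistics are stronger evidence of a replicated
    signal. The k smallest statistics are rejected, where k is the largest
    rank whose running mean of sorted statistics does not exceed alpha.
    """
    order = sorted(range(len(stats)), key=lambda i: stats[i])
    running_sum = 0.0
    k = 0
    for rank, index in enumerate(order, start=1):
        running_sum += stats[index]
        if running_sum / rank <= alpha:
            k = rank
    return sorted(order[:k])


def false_discovery_proportion(rejected, states):
    """Share of rejected SNPs whose true joint state is not (1, 1)."""
    if not rejected:
        return 0.0
    false_count = sum(1 for i in rejected if states[i] != (1, 1))
    return false_count / len(rejected)
```

Averaging the false discovery proportion over simulation replications gives an empirical FDR that can be compared with the nominal level, such as 0.02.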
The joint states $\{(H_{1,j}, H_{2,j})\}_{j=1}^{m}$ are generated from a Markov chain with transition matrix $A$, and the initial distribution $\pi$ is set to $(0.25, 0.25, 0.25, 0.25)$. We varied $A_{(1,1)(1,1)}$ from 0.5 to 0.7 in increments of 0.05. The numerical results are displayed in Figure 2. Note that the larger the value of $A_{(1,1)(1,1)}$, the more strongly the replicated signals cluster. Accordingly, a larger value of $A_{(1,1)(1,1)}$ leads to a larger ATP for repLIS.
We can also conclude from Figure 2 that the data-driven repLIS asymptotically attains the performance of the oracle repLIS, and that both oracle and data-driven repLIS uniformly outperform repfdr in finding replicated signals.
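A minimal sketch of how joint states could be simulated from a four-state Markov chain follows. The transition matrix used in the study is not available in this section, so `make_transition_matrix` builds a purely illustrative one: the state (1, 1) keeps its value with probability `a11`, every other state keeps its value with an assumed probability `stay`, and the remaining mass is split evenly. All function names, the state ordering, and the value of `stay` are assumptions.

```python
import random

# Joint states (H1, H2); the ordering is an assumption made for this sketch.
STATES = [(0, 0), (1, 0), (0, 1), (1, 1)]


def make_transition_matrix(a11, stay=0.7):
    """Build a hypothetical 4x4 transition matrix (not the one from the paper).

    The state (1, 1) stays put with probability a11; every other state stays
    put with probability `stay`. The remaining mass is split evenly.
    """
    matrix = []
    for row in range(4):
        p_stay = a11 if STATES[row] == (1, 1) else stay
        p_move = (1.0 - p_stay) / 3.0
        matrix.append([p_stay if col == row else p_move for col in range(4)])
    return matrix


def simulate_joint_states(m, a11, pi=(0.25, 0.25, 0.25, 0.25), seed=0):
    """Draw m joint states from the chain with initial distribution pi."""
    rng = random.Random(seed)
    matrix = make_transition_matrix(a11)
    current = rng.choices(range(4), weights=pi)[0]
    path = [current]
    for _ in range(m - 1):
        current = rng.choices(range(4), weights=matrix[current])[0]
        path.append(current)
    return [STATES[i] for i in path]


def mean_run_length(states, target=(1, 1)):
    """Average length of consecutive runs of `target`, a simple clustering measure."""
    runs, length = [], 0
    for state in states:
        if state == target:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)
    return sum(runs) / len(runs) if runs else 0.0
```

Under this illustrative matrix, the expected run length of (1, 1) is 1 / (1 - a11), so raising `a11` from 0.5 to 0.7 lengthens the runs of replicated signals, which is the clustering effect described above.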
2.3. The robustness of repLIS when the order of Markov dependence is incorrectly specified
Without loss of generality, we consider the case where the order of Markov dependence is set to 2. We chose the setup to be consistent with that of Scenario 2 of Simulation I where possible. Specifically, we set $\sigma_1 = \sigma_2 = 1$ and $\mu_2 = 2$. For simplicity, we consider a transition matrix indexed by the joint states $\{(0, 0), (1, 0), (0, 1), (1, 1)\}$ for $j = 1, \ldots, m - 1$. We varied $\mu_1$ from 3 to 5 in increments of 1. The simulation results are depicted in Figure 3. The oracle repLIS is implemented by using $\{A_{i,j}\}_{4 \times 4}$ to replace the corresponding $\{A_{i,j}\}_{4 \times 4}$. The performance of the data-driven repLIS is still acceptable (FDR = 0.115). This suggests that the data-driven repLIS can adapt through its parameter estimates when the order of Markov dependence is incorrectly specified. Here, the superiority of repLIS is achieved by using the information that the proportions of the joint hypothesis states are known.
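To illustrate the role of the signal parameters, the sketch below draws a pair of z-values for each joint state. Scenario 2 of Simulation I is not reproduced in this section, so the emission model is an assumption: a null SNP follows N(0, 1) and a non-null SNP in study i follows a normal distribution with mean `mu[i]` and standard deviation `sigma[i]`. The function name `simulate_z_values` is hypothetical.

```python
import random


def simulate_z_values(states, mu=(3.0, 2.0), sigma=(1.0, 1.0), seed=0):
    """Draw a pair of z-values for each joint state (H1, H2).

    Assumption: a null SNP (H = 0) follows N(0, 1), and a non-null SNP in
    study i follows N(mu[i], sigma[i]^2).
    """
    rng = random.Random(seed)
    z_values = []
    for joint_state in states:
        pair = []
        for study, h in enumerate(joint_state):
            if h == 1:
                pair.append(rng.gauss(mu[study], sigma[study]))
            else:
                pair.append(rng.gauss(0.0, 1.0))
        z_values.append(tuple(pair))
    return z_values
```

Calling the function with `mu=(3.0, 2.0)`, `mu=(4.0, 2.0)`, and `mu=(5.0, 2.0)` would mimic varying the first signal mean from 3 to 5 while the second stays at 2.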

Extend repLIS to multiple GWAS studies
To focus on the main ideas, we restrict attention to repLIS in testing two GWAS studies.
We varied $\mu_1$ from 2 to 3 in increments of 0.5, and the detailed simulation results are displayed in Figure 4.
The simulation results for three GWAS studies almost coincide with those for two GWAS studies. If desired, the robustness of repLIS can also be validated when the transition matrix is modified or the order of Markov dependence is incorrectly specified. This validation is standard and follows Sections 2.2 and 2.3, so we do not repeat it here.
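The growth of the joint state space with the number of studies can be sketched as follows. The sketch assumes that each study contributes a binary hypothesis state and that a replicated signal means the SNP is non-null in every study, which may differ from the definition used in the paper. The names `joint_states` and `replicability_statistic` are hypothetical.

```python
from itertools import product


def joint_states(n_studies):
    """Enumerate all joint states (H_1, ..., H_n) with each H_i in {0, 1}."""
    return list(product((0, 1), repeat=n_studies))


def replicability_statistic(posterior):
    """Posterior probability that the SNP is NOT non-null in every study.

    `posterior` maps each joint state tuple to its posterior probability at
    one SNP; small values support a replicated signal.
    """
    n_studies = len(next(iter(posterior)))
    return 1.0 - posterior[(1,) * n_studies]
```

With two studies the chain has four joint states, and with three studies it has eight, so the transition matrix grows from 4 x 4 to 8 x 8 under this construction.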