Nonparametric false discovery rate control for identifying simultaneous signals

It is frequently of interest to jointly analyze multiple sequences of multiple tests in order to identify simultaneous signals, defined as features tested in multiple studies whose test statistics are non-null in each. In many problems, however, the null distributions of the test statistics may be complicated or even unknown, and no existing procedures can be employed in these cases. This paper proposes a new nonparametric procedure that can identify simultaneous signals across multiple studies even without knowing the null distributions of the test statistics. The method is shown to asymptotically control the false discovery rate, and in simulations had excellent power and error control. In an analysis of gene expression and histone acetylation patterns in the brains of mice exposed to a conspecific intruder, it identified genes that were both differentially expressed and next to differentially accessible chromatin. The proposed method is available in an R package.


Introduction
Multiple hypothesis testing is now a staple of scientific research, and summary statistics, such as test statistics and p-values, from previously conducted experiments are now readily publicly available. Jointly analyzing summary statistics from different independent experiments can provide scientific insights that cannot be achieved from a single experiment alone. One important type of joint analysis is to identify features that are non-null in each of several experiments, which will be referred to as simultaneous signals. To be precise, given n features in D experiments, define I_{id} to be an unobserved non-random signal indicator that equals 0 when the null hypothesis is true for feature i in experiment d, and equals 1 when the alternative hypothesis is true. The set of simultaneous signals is defined to be

S = {i ∈ {1, …, n} : I_{i1} = ⋯ = I_{iD} = 1},  (1)

so that the set of non-simultaneous signals is S^c = {i : i ∉ S}. The problem of identifying simultaneous signals arises in many different contexts. For example, it is of interest to identify genetic variants that are associated with multiple related conditions, such as psychiatric disorders [1,27,28], to uncover potentially shared disease mechanisms. Similarly, it is useful to identify regions of the genome that are simultaneously associated with both a disease outcome and the expression of a gene, as these locations may contain important causal mutations [21,48]. As another example, identifying findings that replicate across independent experiments is a crucial component of reproducible research [8,23,24]. Finally, comparative genomics research aims to find genes associated with similar phenotypes across different animal species, in hopes of finding evolutionarily conserved genomic programs [33,39,47].
This paper studies the problem of identifying simultaneous signals under false discovery rate control. Specifically, let T_{id} be a univariate test statistic corresponding to the ith feature in the dth experiment, such that a hypothesis test based on T_{id} can be performed to infer the true value of the signal indicator I_{id}. Let δ : R^D → {0, 1} be a simultaneous signal discovery procedure, where δ(T_{i1}, …, T_{iD}) = 1 declares i ∈ S and δ(T_{i1}, …, T_{iD}) = 0 declares i ∈ S^c. False discovery rate control methods aim to maximize the number of discovered simultaneous signals while maintaining the false discovery rate

fdr(δ) = E[ Σ_{i∈S^c} δ(T_{i1}, …, T_{iD}) / max{1, Σ_{i=1}^n δ(T_{i1}, …, T_{iD})} ]  (2)

at or below α, for some prespecified α < 1.
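To make the error criterion concrete, the ratio inside (2) can be computed for a single realization as follows; the function name and toy inputs are illustrative, not from the paper:

```python
import numpy as np

def false_discovery_proportion(delta, is_simultaneous):
    """Empirical analogue of the ratio inside (2): the number of declared
    features outside S, divided by the total number of declarations
    (the max with 1 avoids dividing by zero when nothing is declared)."""
    delta = np.asarray(delta)
    is_sim = np.asarray(is_simultaneous)
    false_discoveries = np.sum(delta * (1 - is_sim))
    return false_discoveries / max(1, int(np.sum(delta)))

# Toy check: three declarations, one of which is not a simultaneous signal.
fdp = false_discovery_proportion([1, 1, 1, 0], [1, 1, 0, 0])  # equals 1/3
```

In practice the indicator of membership in S is of course unobserved; the false discovery rate (2) is the expectation of this proportion.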
There has been a great deal of recent work on methods to control the false discovery rate when identifying simultaneous signals. An ad hoc approach is to use a standard procedure, like that of Benjamini and Hochberg [7], to discover significant features separately in each of the D experiments and then to identify discoveries common to all experiments. Bogomolov and Heller [9] developed a modified version of this idea and proved that their procedure maintains false discovery rate control. Another common strategy is to summarize the D statistics for each feature i into a single scalar statistic, for example by taking the maximum of their corresponding p-values [30]. This reduces the problem to a single sequence of multiple tests, but it is unclear how to choose the best summary function. A more principled approach treats the T id as a single sequence of multivariate test statistics (T i1 , . . . , T iD ). In this framework, it has been shown that the local false discovery rate [18] is the optimal scalar summary of the multivariate test statistics [11,13,14,24]. This can be difficult to calculate in practice, so Chung et al. [13] assumed a parametric model and used the EM algorithm to estimate unknown parameters, Chi [11] proposed a Taylor expansion approximation, Du and Zhang [14] used a single-index model approximation, and Heller and Yekutieli [24] employed an empirical Bayes approach. Xiang, Zhao and Cai [51] recently introduced a new framework for the joint analysis of multiple tests from independent studies, of which simultaneous signal identification is a special case.
All of these methods assume that the null distributions of the test statistics T_{id} are known to some extent. Many assume that p-values are available, meaning that the nulls must be known exactly. Others estimate parameters of the null distributions, which still requires knowing the parametric families to which the nulls belong [40]. However, in many important problems in genomics, information about the null distributions of the T_{id} is not readily available, for at least three common reasons. First, small sample sizes can make it difficult to obtain the exact null distribution of standard test statistics [53]. Second, complex test statistics can have intractable null distributions. For example, the null distribution of the SKAT statistic [50], which tests the significance of a set of genetic variants, does not have a convenient closed form and in practice is computationally approximated. Finally, complex data types can give rise to null distributions that are difficult to model or characterize. For example, data from ChIP-seq experiments [29] are used to identify regions of the genome where transcription factors bind, but the number, size, and locations of these regions are not predetermined. This makes accurate quantification of the statistical significance of the identified regions very difficult [12].
To date, relatively little work has considered false discovery rate control when null distributions are not completely known. Some results are available given test statistics from a single experiment. Knockoff filters [2,3,4,10] assume only that the null distributions are identical and symmetric, and p-filters [5,32] assume only that the test statistics can be converted to random variables between 0 and 1 that are stochastically larger than a uniformly distributed random variable. Resampling-based procedures [15,36,49,52] do not require known null distributions, but can only be used if the raw data are available. This may not be true for some applications, such as in genetics, where it is common that only test statistics are easily accessible.
In contrast, results are lacking when there are two or more experiments of interest. Nonparametric methods for detecting the presence of simultaneous signals have been proposed [55,56], but there do not seem to exist methods for identifying them when null distributions are unknown. Section 5 describes a simultaneous signal identification problem, encountered in a study of mouse behavioral genomics [38], where the null distributions in one of the experiments were unknown. No existing false discovery rate control method can be applied to this problem. This paper proposes a novel nonparametric method for controlling the false discovery rate when identifying simultaneous signals whose test statistics have unknown null distributions. A tradeoff of its robustness is that it can be very conservative, especially if the proportion of simultaneous signals is high or the number of studies D is large; see Sections 2.2 and 2.5 for further discussion. Section 2 describes the proposed procedure and shows that it can asymptotically control the false discovery rate at the nominal level. Section 3 discusses an alternative procedure that has more power but requires much more restrictive conditions. Section 4 illustrates the performance of the proposed method in simulations, and Section 5 applies it to the mouse behavioral genomics problem. Section 6 concludes with a discussion; proofs of all technical results can be found in the Appendix.

Model
The observed data consist of D sequences of n univariate test statistics T_{id}, corresponding to features i = 1, …, n in experiments d = 1, …, D. For each T_{id}, let F^0_{id} and F^1_{id} denote the cumulative distribution functions under the null and alternative hypotheses, respectively, and let S^0_{id} and S^1_{id} denote the corresponding survival functions. Therefore

T_{id} ∼ F^{I_{id}}_{id},  (3)

where the I_{id} are unobserved non-random indicators of whether the null hypothesis is actually true or false, as introduced in Section 1. Within each sequence d, the T_{id} are assumed to be mutually independent, and the D sequences are also assumed to be mutually independent; in other words, T_{id} and T_{i'd'} are independent whenever i ≠ i' or d ≠ d'. Finally, the test statistics are assumed to be one-tailed, with larger values of T_{id} giving more evidence against the null. This is formalized in Assumption 1.

Assumption 1.
For all t, S^0_{id}(t) < S^1_{id}(t).

This paper adopts a fixed-effects model where the I_{id} are non-random quantities. A popular alternative in the multiple testing literature [24,46,51] is the random-effects framework, where the (I_{i1}, …, I_{iD}) are modeled as independent and identically distributed random D × 1 Bernoulli vectors, whose components I_{id} and I_{id'} can be dependent for d ≠ d'. The T_{id} are then assumed to be conditionally independent given (I_{i1}, …, I_{iD}), such that

T_{id} | I_{id} ∼ F^{I_{id}}_d.  (4)

These fixed- and random-effects models are closely related [20,44], and the latter can be useful for interpreting certain aspects of the proposed procedure; see Section 2.2.
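Under the random-effects alternative just described, indicator vectors with dependent components can be simulated directly; the cell probabilities below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_indicators(n, p11, p10, p01):
    """Draw (I_i1, I_i2) i.i.d. from a bivariate Bernoulli distribution with
    P(1,1) = p11, P(1,0) = p10, P(0,1) = p01, and P(0,0) = the remainder,
    so that I_i1 and I_i2 can be dependent."""
    cells = rng.choice(4, size=n, p=[p11, p10, p01, 1 - p11 - p10 - p01])
    I1 = np.isin(cells, [0, 1]).astype(int)  # first coordinate is 1 in cells 0, 1
    I2 = np.isin(cells, [0, 2]).astype(int)  # second coordinate is 1 in cells 0, 2
    return I1, I2

# P(I_i1 = 1) = P(I_i2 = 1) = 0.10 but P(both = 1) = 0.05 >> 0.01,
# so the two coordinates are positively dependent.
I1, I2 = sample_indicators(10000, 0.05, 0.05, 0.05)
```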

Two sequences of test statistics
The proposed method is first introduced assuming that only two sequences of test statistics T id are observed, i = 1, . . . , n and d = 1, 2. Section 2.5 describes a potential extension when there are more than two sequences of interest.
The overall strategy follows the framework of Storey, Taylor and Siegmund [45] for false discovery rate control in a single sequence of test statistics. The proposed procedure declares a feature i to be a simultaneous signal if

T_{i1} ≥ t and T_{i2} ≥ t,  (5)

for an appropriately chosen threshold t; equivalently, (T_{i1}, T_{i2}) must fall in the rejection region [t, ∞) × [t, ∞). The goal is to choose the threshold t that discovers the most simultaneous signals while maintaining an acceptable false discovery rate. This requires estimating the false discovery rate that would be attained by a particular threshold t. To motivate this estimator, suppose for now that S^0_{id} = S^0_d and S^1_{id} = S^1_d for all features i. This condition is much stronger than necessary and will be weakened in Assumption 2. Then under model (3), the expected proportion of false positives would equal

n^{-1} Σ_{i∈S^c} pr(T_{i1} ≥ t, T_{i2} ≥ t),

where S is the set of simultaneous signals defined in (1). The following result shows that this expected proportion can be upper-bounded by a product of marginal survival functions.

Proposition 1.
For d = 1, 2, define the marginal signal proportion π_d = n^{-1} Σ_{i=1}^n I_{id} and the mixture survival function S_d(t) = (1 − π_d) S^0_d(t) + π_d S^1_d(t). Then under Assumption 1, for any t_1 and t_2,

n^{-1} Σ_{i∈S^c} pr(T_{i1} ≥ t_1, T_{i2} ≥ t_2) ≤ S_1(t_1) S_2(t_2),

with S defined in (1).
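The flavor of this bound can be checked by Monte Carlo under an illustrative mixture (independent indicators, N(0,1) nulls, and N(3,1) alternatives; none of these choices come from the paper):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, pi1, pi2, t = 100000, 0.05, 0.05, 2.0

# Independent signal indicators and independent test statistics.
I1 = (rng.random(n) < pi1).astype(int)
I2 = (rng.random(n) < pi2).astype(int)
T1 = rng.normal(3 * I1, 1)  # null N(0,1), alternative N(3,1)
T2 = rng.normal(3 * I2, 1)

# Left side: proportion of non-simultaneous features passing both thresholds.
S = (I1 == 1) & (I2 == 1)
fp_prop = np.mean(~S & (T1 >= t) & (T2 >= t))

# Right side: product of the mixture marginal survival functions.
S1 = (1 - pi1) * norm.sf(t) + pi1 * norm.sf(t - 3)
S2 = (1 - pi2) * norm.sf(t) + pi2 * norm.sf(t - 3)
bound = S1 * S2
```

Here `fp_prop` falls strictly below `bound`, with the slack reflecting the simultaneous-signal terms discussed below.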
Proposition 1 motivates the following conservative estimator of the false discovery rate that would be attained by the rejection region [t, ∞) × [t, ∞):

fdr̂_ρ(t) = Ŝ_1(t) Ŝ_2(t) / max{Ĝ(t, t), ρ/n},  (6)

where Ŝ_d(t) = n^{-1} Σ_{i=1}^n 1(T_{id} ≥ t) is the empirical survival function of the dth sequence, Ĝ(t, t) = n^{-1} Σ_{i=1}^n 1(T_{i1} ≥ t, T_{i2} ≥ t) is the total proportion of rejected features, and ρ is a positive constant that regularizes the asymptotic properties of the proposed procedure. An alternative to (6) would be to define fdr̂_ρ(t) = 0 when Ĝ(t, t) = 0, but (6) is more convenient for proving asymptotic false discovery rate control.
This leads to the proposed nonparametric discovery procedure

t̂_ρ = min{t : fdr̂_ρ(t) ≤ α},  δ̂_ρ(T_{i1}, T_{i2}) = 1(T_{i1} ≥ t̂_ρ, T_{i2} ≥ t̂_ρ),  (7)

for some desired false discovery rate α < 1. Features with δ̂_ρ(T_{i1}, T_{i2}) = 1 are declared to be simultaneous signals. The threshold t̂_ρ maximizes the number of rejected features while keeping a conservative estimate of the true false discovery rate below α. This procedure does not require any knowledge of S^0_{id} or S^1_{id} beyond the stochastic ordering of Assumption 1. The tradeoff is that fdr̂_ρ(t) can be a very conservative estimate of the true false discovery rate. The proof of Proposition 1 shows that the slack in its bound involves the proportion of simultaneous signals π_{(1,1)} together with positive quantities C_1(t_1, t_2) and C_2(t_1, t_2), so the bound can be very loose if the proportions of marginal or simultaneous signals are high. As a result, the actual false discovery rate attained by the proposed method can be much lower than the target level α. On the other hand, there are many settings where these signal proportions are expected to be low, such as genomics studies where only a small proportion of the genomic features are expected to be associated with the outcome of interest.
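The full procedure can be sketched in a few lines, searching the observed statistics for the smallest feasible threshold; the exact form of the ρ-regularization in the denominator is an assumption here, and ρ = 0 mirrors the choice used in the paper's simulations:

```python
import numpy as np

def simultaneous_fdr_threshold(T1, T2, alpha=0.05, rho=0.0):
    """Choose the common threshold t that rejects the most features subject
    to the conservative false discovery rate estimate being at most alpha."""
    T1, T2 = np.asarray(T1), np.asarray(T2)
    n = len(T1)
    best_t, best_rejections = np.inf, 0
    for t in np.unique(np.concatenate([T1, T2])):
        S1_hat = np.mean(T1 >= t)               # empirical marginal survival
        S2_hat = np.mean(T2 >= t)
        G_hat = np.mean((T1 >= t) & (T2 >= t))  # proportion rejected
        denom = max(G_hat, rho / n)
        if denom == 0:
            continue
        if S1_hat * S2_hat / denom <= alpha and G_hat * n > best_rejections:
            best_t, best_rejections = t, G_hat * n
    return best_t, (T1 >= best_t) & (T2 >= best_t)
```

On data with a block of strong simultaneous signals, the selected threshold separates the block from the null background while the estimated false discovery rate stays below α.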

Remark 1.
The function Ĝ(t, t) in the denominator of fdr̂_ρ(t) in (6) does not always converge to S_1(t)S_2(t), even though T_{i1} and T_{i2} are independent under the fixed-effects model (3). This is because the (T_{i1}, T_{i2}) are not identically distributed: the signal indicators (I_{i1}, I_{i2}) differ across features i. This observation motivates an interesting alternative interpretation of the proposed procedure, which is more easily described using the random-effects model (4) from Section 2.1. In that framework, the (I_{i1}, I_{i2}) are modeled as identically distributed random Bernoulli tuples, and T_{i1} and T_{i2} are independent conditional on (I_{i1}, I_{i2}). Therefore the (T_{i1}, T_{i2}) are marginally identically distributed, but dependence between I_{i1} and I_{i2} will make T_{i1} and T_{i2} marginally dependent. Since Ĝ(t, t) estimates the marginal bivariate distribution function of (T_{i1}, T_{i2}), fdr̂_ρ(t) can be thought of as measuring the departure of Ĝ(t, t) from independence. In this sense, procedure (7) appears closely related to testing for independence between the T_{i1} and the T_{i2}.

Remark 2.
The rejection regions in (5) are rectangular, but many other nonrectangular shapes can be used. In fact, Heller and Yekutieli [24] showed that the optimal rejection region is a level curve of the local false discovery rate.
On the other hand, rectangular regions are simple to implement and interpret, and they allow the expected proportion of false positives to be upper-bounded using only marginal survival functions, via Proposition 1. This is crucial to the nonparametric nature of the proposed procedure, and it is not clear whether there exist non-rectangular rejection regions that have this property.

Rank transformation
The rejection region (5) uses the same threshold for both T_{i1} and T_{i2}. This may not be appropriate when the null distributions S^0_{i1} and S^0_{i2} are not comparable, because the test statistics from the two sequences will then be on different scales. A simple solution is to transform the T_{id} within each sequence to their corresponding ranks, which makes the sequences more comparable. Letting F̂_d denote the empirical distribution function of the dth sequence, this amounts to replacing fdr̂_ρ(t) with its value computed from the transformed statistics F̂_1(T_{i1}) and F̂_2(T_{i2}). The rank transformation procedure is therefore equivalent to considering rejection regions of the form [F̂_1^{-1}(t), ∞) × [F̂_2^{-1}(t), ∞) in place of (5). This procedure can be less powerful than a procedure that knows the true null distributions S^0_{id} of the T_{id}. When the nulls are known, the correct way to place the test statistics on the same scale would be to convert the T_{id} to p-values, which would replace F̂_1 and F̂_2 above with 1 − S^0_{i1} and 1 − S^0_{i2}, corresponding to rejection regions defined by equal p-value thresholds in the two sequences. These regions may not coincide with those considered by rank transformation. This can happen, for example, if S^0_{i1} = S^0_{i2} but F̂_1 ≠ F̂_2, and can result in the rank transformation procedure having lower power. This is illustrated in simulations in Section 4.1.
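The rank transformation itself is one line; `rankdata` averages tied ranks, and dividing by n puts both sequences on a common (0, 1] scale:

```python
import numpy as np
from scipy.stats import rankdata

def to_ranks(T):
    """Replace each statistic by its within-sequence rank, scaled to (0, 1]."""
    T = np.asarray(T)
    return rankdata(T) / len(T)

# Two sequences on very different scales become directly comparable.
u = to_ranks([0.1, 5.0, 2.0])
v = to_ranks([1000.0, -7.0, 3.0])
```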
An alternative to the rank transformation approach is to consider rejection regions [t_1, ∞) × [t_2, ∞) with a separate threshold for each sequence [11,14]. This allows different thresholds for the different sequences to be learned from the data, without needing knowledge of the null distributions. However, the resulting procedure seems to require very restrictive conditions in order to guarantee false discovery rate control. This is discussed in detail in Section 3.

Theoretical properties
To motivate the proposed procedure, it was temporarily assumed that S^0_{id} = S^0_d and S^1_{id} = S^1_d for all features i. However, in general the test statistics for different features may have different null and alternative distributions. Even so, the proposed method still has good properties under the following assumption about the S^0_{id} and S^1_{id}. Let I_i = (I_{i1}, …, I_{iD}) be the vector of true signal indicators corresponding to feature i in each sequence of test statistics.
Assumption 2 endows the fixed-effects model (3) with certain useful properties of the random-effects model (4), and is similar to assumptions introduced by Genovese and Wasserman [20] and others [44,45]. Assumption 2(a) posits the existence of limiting survival functions S^0_d and S^1_d, which can be thought of as the marginal null and alternative distributions of T_{id} under the random-effects model. Assumption 2(b) recovers the random-effects assumption that the components of (T_{i1}, …, T_{iD}) are independent conditional on I_i. In fact it is slightly weaker, requiring the T_{id} to be conditionally independent only when I_i = (1, …, 1), as the joint distribution G^1 defined in Assumption 2(c) need not equal the product of marginal survival functions. Finally, the π_I defined in Assumption 2(d) can be viewed as the probability that I_i = I under the random-effects model.
The main result of this paper is that the proposed procedure can achieve asymptotic false discovery rate control.

Theorem 1. Under Assumptions 1 and 2, the proposed procedure δ̂_ρ in (7) satisfies

lim sup_{n→∞} fdr(δ̂_ρ) ≤ α,

where fdr is the true false discovery rate defined in (2).
Finite-sample rather than asymptotic false discovery rate control would be ideal, and the proof of Theorem 1 suggests that this might be possible if the null and alternative distributions did not vary across features, and if the marginal survival functions S_d in fdr̂_ρ(t) (6) were known rather than estimated. The condition that ρ > 0 is necessary for technical reasons, but the simulations in Section 4 indicate that using ρ = 0 still gives good results. Procedure (7) can provably control the false discovery rate for simultaneous signals without any knowledge of the null or alternative distributions, aside from the stochastic ordering condition of Assumption 1. Because of this, it pays a price in terms of power to detect simultaneous signals. A major reason is that the bound in Proposition 1 is not tight, which causes the selected threshold t̂_ρ in (7) to be larger than necessary. In particular, the proof of Proposition 1 indicates that the bound is tightest when each sequence of test statistics has very few signals and when there are very few simultaneous signals.

Two or more sequences of test statistics
In some problems, the goal may be to discover features that are simultaneously significant across D > 2 sequences of test statistics. The proposed method can be extended to this setting by considering rejection regions of the form [t, ∞)^D for a threshold t. Under model (3) and Assumption 2, the expected number of false positives that would be discovered by this region can be upper-bounded using the following generalization of Proposition 1.
Proposition 2. Under Assumptions 1 and 2,

n^{-1} Σ_{i∈S^c} pr(T_{i1} ≥ t_1, …, T_{iD} ≥ t_D) ≤ Π_{d=1}^D S_d(t_d),

for any t_1, …, t_D, with S_d defined in Proposition 1 and S defined in (1).
Following the reasoning in Section 2.2, Proposition 2 therefore motivates a discovery procedure (8) for any number D ≥ 2 of sequences: analogous to (7), the threshold t̂_ρ is the smallest t at which the conservative false discovery rate estimate built from the product Π_{d=1}^D Ŝ_d(t) falls below α, and features with δ̂_ρ(T_{i1}, …, T_{iD}) = 1, that is, with T_{id} ≥ t̂_ρ in every sequence, are declared simultaneous signals across all D sequences. It is straightforward to extend the proof of Theorem 1 to this generalized procedure, which reduces to procedure (7) when D = 2. Procedure (8) can suffer from very low power because the bound in Proposition 2 becomes exceedingly conservative for larger D. When D > 2, this bound is derived from repeated applications of the basic bound for D = 2. This is likely not an optimal approach, and further work is necessary to design a more powerful nonparametric discovery procedure for more than two sequences. Nevertheless, simulations in Section 4 suggest that the method is still serviceable when D ≤ 3, which covers many practical problems. Furthermore, few other methods are available when the null and alternative distributions are unknown.
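A sketch of the D-sequence extension with a common threshold, taking the product of the marginal empirical survival functions as the conservative numerator (the precise form of the bound in Proposition 2 is assumed here) and ρ = 0:

```python
import numpy as np

def simultaneous_fdr_threshold_D(T, alpha=0.05):
    """T has shape (D, n): one row of statistics per sequence. Selects the
    rectangle [t, inf)^D with the most rejections subject to the estimated
    false discovery rate bound being at most alpha."""
    T = np.asarray(T)
    D, n = T.shape
    best_t, best_rejections = np.inf, 0
    for t in np.unique(T):
        S_hats = [np.mean(T[d] >= t) for d in range(D)]
        G_hat = np.mean(np.all(T >= t, axis=0))  # proportion rejected
        if G_hat == 0:
            continue
        if np.prod(S_hats) / G_hat <= alpha and G_hat * n > best_rejections:
            best_t, best_rejections = t, G_hat * n
    return best_t, np.all(T >= best_t, axis=0)
```

Because the numerator is a D-fold product, feasibility becomes harder to achieve as D grows, which is one way to see the conservativeness noted above.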

Alternative procedure
For D = 2, Remark 1 of Section 2.2 pointed out that a more natural alternative to rejection region (5) is the rectangle [t_1, ∞) × [t_2, ∞), which allows a different threshold for each sequence of test statistics. Applying Proposition 1 to this rejection region suggests the conservative false discovery rate estimator

fdr̂_ρ(t_1, t_2) = Ŝ_1(t_1) Ŝ_2(t_2) / max{Ĝ(t_1, t_2), ρ/n},  (9)

with Ŝ_d, Ĝ, and ρ as defined in (6). This can be shown to be an asymptotically uniformly conservative estimate of the false discovery rate incurred by the rejection region, and leads to the following alternative to the proposed procedure (7):

Theorem 2 concerns discovery procedures of the form

(t̂_ρ1, t̂_ρ2) = argmax_{(t_1, t_2) ∈ Π} Ĝ(t_1, t_2) subject to fdr̂_ρ(t_1, t_2) ≤ α,  (10)

where the set Π = {(∞, ∞)} ∪ {(T_{i1}, T_{i'2}) : 1 ≤ i, i' ≤ n} is the union of the point (∞, ∞) with the Cartesian product of the two sequences of observed test statistics. The t̂_ρ1 and t̂_ρ2 are chosen to maximize Ĝ(t_1, t_2), which is equivalent to maximizing the number of rejected features, subject to controlling the estimated false discovery rate bound. The t̂_ρ1 and t̂_ρ2 defined in (10) are not unique, in that there can exist multiple distinct rejection regions that maximize Ĝ(t_1, t_2). Figure 1 illustrates an example where n = 1,000, the marginal signal proportions were π_1 = 0.019 and π_2 = 0.012, there were six simultaneous signals, the null T_{id} ∼ χ²_1, and the non-null T_{id} followed a noncentral χ²_1 with noncentrality parameter 9. Each of the rectangular rejection regions at the α = 0.05 level rejects a different set of three features. Nevertheless, under certain conditions it can be shown that any (t̂_ρ1, t̂_ρ2) satisfying (10) can achieve asymptotic false discovery rate control. However, the requirement in Theorem 3 that there exist t_1 and t_2 such that lim_{n→∞} fdr̂_ρ(t_1, t_2) ≤ α turns out to be very stringent. Following the proof of Proposition 1, such t_1 and t_2 can exist only when π_{(1,1)} > π_1 π_2, where π_{(1,1)} is the proportion of simultaneous signals and π_d is the proportion of signals in study d. If this condition does not hold, the alternative procedure may not be able to maintain the false discovery rate at the nominal level. This is in contrast to Theorem 1 for the proposed procedure δ̂_ρ in (7), which guarantees asymptotic false discovery rate control without needing this condition. This matters, for example, when applying the alternative procedure (10) to sequences where no simultaneous signals actually exist: then π_{(1,1)} = 0, and the procedure may incorrectly identify one or two features as simultaneous signals because it cannot maintain the nominal false discovery rate. This alternative procedure is thus not pursued in the remainder of this paper.
Furthermore, the rank transformation described in Section 2.3 obviates the need for a different threshold for each sequence of test statistics.
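For completeness, a brute-force sketch of the two-threshold search (10) with ρ = 0; it scans the full grid Π of observed value pairs, so it is quadratic in n and only practical for small problems:

```python
import numpy as np

def two_threshold_search(T1, T2, alpha=0.05):
    """Search the grid {(T_i1, T_j2)} for the rectangle [t1, inf) x [t2, inf)
    rejecting the most features subject to the estimated false discovery
    rate bound S1_hat(t1) * S2_hat(t2) / G_hat(t1, t2) <= alpha."""
    T1, T2 = np.asarray(T1), np.asarray(T2)
    n = len(T1)
    t1_best, t2_best, most = np.inf, np.inf, 0
    for t1 in np.unique(T1):
        s1 = np.mean(T1 >= t1)
        above1 = T1 >= t1
        for t2 in np.unique(T2):
            G = np.mean(above1 & (T2 >= t2))
            if G > 0 and s1 * np.mean(T2 >= t2) / G <= alpha and G * n > most:
                t1_best, t2_best, most = t1, t2, G * n
    reject = (T1 >= t1_best) & (T2 >= t2_best)
    return t1_best, t2_best, reject
```

Note that feasibility of the constraint requires the joint rejection proportion to exceed the product of the marginal ones by a factor of 1/α, mirroring the π_{(1,1)} > π_1 π_2 condition discussed above.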

Effect of rank transformation
This section explores the effectiveness of the rank transformation described in Section 2.3 for D = 2 sequences, across a variety of null distributions with different scales. The proposed discovery procedure δ̂_ρ in (7), with ρ = 0, was applied in three ways at a nominal α = 0.05 false discovery rate. First, δ̂_ρ was applied directly to the T_{id} to demonstrate the consequences of ignoring the different scales of the test statistics. Next, p-values P_{id} were calculated from the T_{id}, and the proposed method was applied without rank transformation to −log₁₀ P_{id}, a p-value transformation common in the genomics literature. This represents an oracle version of the proposed method that uses information about the true null distributions to place the test statistics on comparable scales. Finally, δ̂_ρ was applied to the rank-transformed T_{id} to attempt to recover the oracle performance. Figure 2 reports the false discovery rates and average numbers of discovered simultaneous signals over 200 replications of the following simulation settings. Each setting had n = 10,000 features, and the two sequences were generated with either equal or unequal numbers of signals in each sequence.

Example 1 (normal null). For sequences d = 1, 2, T_{id} = Z²_{id}, where the Z_{id} were independently drawn from N(μ_{id}, d²). If I_{id} = 0, μ_{id} = 0; if I_{id} = 1, μ_{id} ∼ N(μ_d, 1) and was fixed across all replications, for various μ_d. Because the maximum of n independent N(0, d²) random variables grows like d(2 log n)^{1/2}, μ_d was chosen to be at most d(2 log n)^{1/2}. This ensured that the simultaneous signals could not be identified by simply choosing the features with T_{id} > d(2 log n)^{1/2} in both sequences.
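A generator in the spirit of Example 1 can be sketched as follows; the parameter values are illustrative, and in the actual simulations μ_d varied while the signal means were held fixed across replications:

```python
import numpy as np

rng = np.random.default_rng(5)

def example1_data(n=10000, n_signal=100, mu=4.0, d=1):
    """T_id = Z_id^2 with Z_id ~ N(mu_i, d^2); mu_i = 0 under the null and,
    under the alternative, drawn once per feature from N(mu, 1)."""
    mu_i = np.zeros(n)
    mu_i[:n_signal] = rng.normal(mu, 1, n_signal)  # first n_signal are signals
    Z = rng.normal(mu_i, d)
    return Z ** 2

T = example1_data()
```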
Example 2 (exponential null). Simulations followed Example 1 except that the T_{id} were independently generated from Exp(λ_{id}), where λ_{id} = 2d for I_{id} = 0, and λ_{id} ∼ max{0.1, N(λ_d, 0.01)} and fixed across replications for I_{id} = 1, for various λ_d. Because the maximum of n independent Exp(2d) random variables grows like (log n)/(2d), λ_d was chosen to be at least 2d/log n, so that the expected value of T_{id} for I_{id} = 1 was at most (log n)/(2d).

Example 3 (t₄ null). Simulations followed Example 1 except that T_{id} = Y²_{id}, where the Y_{id} were independently generated as d(μ_{id} + t₄), with t₄ denoting a random variable drawn from a t-distribution with four degrees of freedom. If I_{id} = 0, μ_{id} = 0; if I_{id} = 1, μ_{id} ∼ N(μ_d, 1) and was fixed across all replications, where μ_d was at most log n.
When the two sequences had the same number of signals, the rank-transformed procedure performed essentially as well as the oracle procedure that used p-values calculated with knowledge of the true null distributions. However, when the two sequences had different numbers of signals, the rank-transformed method had lower power, particularly for t₄-distributed T_{id}. As discussed in Section 2.3, this is because the rejection regions considered by the rank-transformed procedure no longer coincide with those considered by the oracle procedure.

Unknown null distributions
The main motivation of this paper was to develop a method to control the false discovery rate for discovering simultaneous signals when the null and/or alternative distributions of the T id are unknown. These simulations explore this for D = 2 sequences. In this section, each sequence of test statistics was generated with the same number of signals, following Bogomolov and Heller [9]. See Appendix A for additional simulation settings.
For each d and i, ten correlated z-scores (Z_{id1}, …, Z_{id10}) were generated from N(μ_{id}, Σ_{id}), where μ_{id} = (0, …, 0) for I_{id} = 0, and μ_{id} = (μ_{id1}, …, μ_{id10}) with μ_{idj} ∼ N(μ_d, 1) and fixed across replications for I_{id} = 1, for the same values of μ_d used in Example 1 of Section 4.1. Each Σ_{id} was equal to the empirical correlation matrix of a different set of 10 genes selected from a gene expression study of multiple myeloma, obtained from Shi et al. [43]. These z-scores were converted to correlated p-values (P_{id1}, …, P_{id10}) and combined using Fisher's method, T_{id} = −2 Σ_{j=1}^{10} log P_{idj}. The vectors of z-scores were generated independently across both d and i, so that the T_{id} were also independent.
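The group statistic can be sketched directly; the equicorrelation matrix below is an illustrative stand-in for the empirical gene correlation matrices of Shi et al. [43]:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)

def fisher_group_statistic(z):
    """Fisher combination T = -2 * sum_j log p_j of one-sided p-values.
    With correlated z-scores the null distribution of T is no longer
    chi-squared with 2J degrees of freedom, which is what makes the
    null here effectively unknown."""
    p = norm.sf(z)  # one-sided p-values
    return -2 * np.sum(np.log(p))

J = 10
Sigma = 0.5 * np.ones((J, J)) + 0.5 * np.eye(J)  # equicorrelation 0.5
z_null = rng.multivariate_normal(np.zeros(J), Sigma)
T = fisher_group_statistic(z_null)
```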
This setting models applications where group testing is applied to multiple groups of correlated genomic features. The groups, indexed by i, are independent, but the features within the groups are not. The null distribution of each T id is complicated and in practice would not be known, as it depends on the unknown correlations between the features.
The proposed procedure was implemented with ρ = 0 and rank-transformed T_{id}. To demonstrate the difficulty of this problem, three existing methods described in Section 1 were also implemented:
1. The method of Chung et al. [13] uses p-values P_{id} calculated from the T_{id}.
2. The empirical Bayes method of Heller and Yekutieli [24] uses an estimate of the null distribution of the T_{id}. It first calculates z-scores from the T_{id} using their known theoretical nulls, then assumes that the z-scores are normally distributed but with unknown mean and variance, and finally estimates the unknown parameters using the method of Efron [16]. See Efron [18] for a discussion of why using this so-called "empirical null" can be more appropriate than using the theoretical null in multiple testing problems.
3. The method of Bogomolov and Heller [9] is based on first selecting promising features from each sequence based on the P_{id}.
These existing approaches all require known null distributions, so they were implemented assuming that the T id followed χ 2 20 , which would only be true if Σ id equaled the identity matrix. Figure 3 reports the false discovery rates and average numbers of discovered simultaneous signals over 200 replications. The proposed method always maintained the false discovery rate at the nominal level, something that none of the other methods could achieve. It appears to be the only existing simultaneous signal discovery procedure that can provably control the false discovery rate in this setting when the T id have complex or unknown null distributions. The tradeoff is that the proposed method is conservative, in that the achieved false discovery rate can be much lower than the desired level α. This was discussed in Section 2.2. Additional simulation results in Appendix A, where there were no true simultaneous signals and where the two sequences had different numbers of signals, lead to similar conclusions.

Known null distributions
In many standard simultaneous signal detection problems, the null distributions of the T_{id} are known. A number of methods already exist to address these cases, such as those described above in Section 4.2. It is interesting to compare them in these standard settings to the performance of the proposed method. The simulations in this section explore this for D = 2 sequences with a non-zero number of simultaneous signals. As in Section 4.2, each sequence had the same number of signals. Figure 4 reports the false discovery rates and average numbers of discovered simultaneous signals over 200 replications of the following simulation settings. The first three examples are similar to those in Section 4.1, except that here the null distributions of the T_{id} were set to be equal for both sequences.
Example 1 (normal null). This follows Example 1 of Section 4.1, except that T_{id} = Z²_{id}, where the Z_{id} were independently drawn from N(μ_{id}, 1).
Example 2 (exponential null). This follows Example 2 of Section 4.1, except that the null rate parameter no longer depended on d.
Example 3 (t₄ null). This follows Example 3 of Section 4.1, except that T_{id} = Y²_{id}, where the Y_{id} were independently generated as μ_{id} + t₄.
Example 4 (dependent test statistics). For sequences d = 1, 2, T_{id} = Z²_{id}, where the Z_{id} were generated as in Example 1 of this section, except that all Z_{id} within a sequence d were correlated. Their correlation matrix was set equal to the empirical correlation matrix of n genes chosen from a gene expression study of multiple myeloma [43]. This violates the assumption in model (3) of independence between test statistics within a sequence.
The procedure of Chung et al. [13] discovered the largest number of simultaneous signals, but could not maintain the nominal false discovery rate even though the null distributions of the T id were known. The remaining methods were able to maintain the nominal false discovery rate throughout, even when test statistics were dependent. The methods of Heller and Yekutieli [24] and Bogomolov and Heller [9] performed similarly, with the former having somewhat higher power. The proposed method, in many cases, had comparable power when the marginal signal proportions were very low, though as previously mentioned it had very low power otherwise. It was able to control the false discovery rate even when test statistics were dependent in Example 4, though this likely was due in part to the procedure's conservativeness. As before, additional simulations in Appendix A lead to similar conclusions.

More than two sequences of test statistics
The generalized discovery procedure (8) in Section 2.5 was applied to D = 3 sequences with n = 10,000 features. In sequences d = 1, 2, T id = Z id² where the Z id were independently generated from N(μ id , 1), with μ id = 0 when I id = 0 and otherwise drawn from N(5, 1) and fixed across replications. In the third sequence, T i3 was generated from a complicated distribution meant to model data from ChIP-seq experiments [29], such as the example studied in Section 5. The proposed procedure is the only one applicable to this setting because the null distribution of the T i3 is unknown.
First, λ i1 = λ i2 were drawn from N(100, 5) when I i3 = 0 and then fixed across replications. These model population average ChIP-seq peak heights at genomic location i under the experimental and control conditions, respectively, which are equal under the null hypothesis. When I i3 = 1, λ i1 and λ i2 were independently drawn from Exp(0.001), modeling differences in average peak heights between the experimental conditions under the alternative hypothesis. Next, O il for l = 1, 2 were generated from Poisson(λ il ), modeling observed ChIP-seq peak counts. Finally, T i3 = |log(O i1 /O i2 )|, which will tend to be larger when I i3 = 1 because then λ i1 ≠ λ i2 .

Table 1: False discovery rates and average numbers of discovered simultaneous signals for D = 3 sequences of test statistics, using the proposed approach (8) with a nominal 0.05 false discovery rate level.

Table 1 reports the results over 200 replications and shows that the proposed procedure maintained the nominal false discovery rate while still detecting a substantial proportion of the true simultaneous signals. That the attained false discovery rates are much lower than the nominal 0.05 indicates that the procedure is conservative, as discussed in Section 2.5.
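The ChIP-seq-like third sequence described above can be sketched as follows. The distributions match the text (Exp(0.001) is parameterized by rate, i.e. mean 1000); the signal proportion, seed, and the guard against zero counts are illustrative assumptions not specified in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
pi3 = 0.01  # proportion of non-null features in sequence 3 (illustrative)

I3 = rng.random(n) < pi3

# Population peak heights: equal and drawn from N(100, 5) under the null,
# drawn independently from Exp(rate=0.001), i.e. scale 1000, under the alternative.
lam_null = rng.normal(100.0, 5.0, size=n)
lam1 = np.where(I3, rng.exponential(1000.0, size=n), lam_null)
lam2 = np.where(I3, rng.exponential(1000.0, size=n), lam_null)

# Observed peak counts; the max(., 1) guard avoids log(0) and is an
# implementation detail, not part of the paper's description.
O1 = np.maximum(rng.poisson(lam1), 1)
O2 = np.maximum(rng.poisson(lam2), 1)

# Absolute log-ratio statistic, large when lam1 != lam2.
T3 = np.abs(np.log(O1 / O2))
```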

Data analysis
The field of sociogenomics studies molecular correlates of social behavior [34]. Saul et al. [38] studied the transcriptomic response to social challenge in mice that were exposed to intruder mice introduced to their cages. At 30, 60, and 120 minutes after intruder removal, they collected RNA-seq data from the amygdala, frontal cortex, and hypothalamus in order to determine which genes were differentially expressed between mice exposed to the intruder and mice exposed to a nonsocial control condition. They also collected ChIP-seq H3K27ac data at 30 and 120 minutes, to identify regions of chromatin that were differentially accessible between experimental and control mice. These data are available from the Gene Expression Omnibus under accession number GSE80345.
This section analyzes these data to find mouse genes that are both differentially expressed and next to differentially accessible regions of chromatin. Integrating these pieces of evidence can identify genes whose expression changes may be directly caused by differential binding of transcription factors to nearby regions of DNA [38]. This analysis can be cast as a simultaneous signal detection problem. Each mouse gene constitutes a genomic feature i, which can be associated with both a differential expression test statistic T i1 and a test statistic T i2 for the differential accessibility of a neighboring region of chromatin. The goal is to identify genes whose T i1 and T i2 are simultaneously non-null.
Following Saul et al. [38], T i1 = Z i1² where the Z i1 were standard z-scores obtained using the edgeR software package [35]. Defining T i2 was more involved. Methods exist for calculating differential accessibility test statistics for genomic regions using ChIP-seq data [22,42,54], but these first identify regions of interest from the same data that the test statistics are computed from, which makes accurate p-values difficult to calculate [12]. This analysis instead takes a simple approach and defines T i2 = |log(O i1 /O i2 )|, where O i1 and O i2 were the observed numbers of H3K27ac reads, in the experimental and control samples respectively, within 100 kb up- and down-stream of the ith gene.
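A minimal sketch of how such windowed read counts and the resulting log-ratio statistic might be computed, assuming reads have already been reduced to 1-D coordinates on a single chromosome; the positions, the helper name window_counts, and the zero-count guard are hypothetical illustrations:

```python
import numpy as np

def window_counts(read_pos, gene_pos, window=100_000):
    """Count reads within `window` bp up- and down-stream of each gene.
    Both inputs are 1-D coordinate arrays on one chromosome."""
    read_pos = np.sort(read_pos)
    lo = np.searchsorted(read_pos, gene_pos - window, side="left")
    hi = np.searchsorted(read_pos, gene_pos + window, side="right")
    return hi - lo

# Toy read coordinates for experimental and control samples, and two genes.
reads_exp = np.array([10, 5_000, 120_000, 130_000, 400_000])
reads_ctl = np.array([90_000, 125_000, 135_000, 410_000, 500_000])
genes = np.array([100_000, 450_000])

O1 = np.maximum(window_counts(reads_exp, genes), 1)  # guard against zero counts
O2 = np.maximum(window_counts(reads_ctl, genes), 1)
T2 = np.abs(np.log(O1 / O2))
```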
The null distribution of T i2 is highly nontrivial, and the proposed method (7) is the only existing false discovery rate control procedure that can be used without knowledge of this null. Table 2 presents the genes identified at a nominal false discovery rate of 0.1. It indicates that the hypothalamus is the most transcriptionally responsive to social challenge, particularly at 30 minutes. A number of these genes have been previously implicated in mouse behavior. For example, mice without Gpr88 and Penk have been shown to exhibit low anxiety and resistance to mild stress [25,26], and Foxg1 was highlighted in Saul et al. [38] as providing evidence for the role of neuropeptide signaling and neuron differentiation. These findings raise novel mechanistic hypotheses about the molecular response to social challenge.

Discussion
Most of this paper has assumed that the test statistics are independent across features. In the single-sequence false discovery rate control problem with dependent test statistics, an important step is to estimate parameters of the null distribution of the test statistics rather than using the theoretical null [17,19,41]. The nonparametric false discovery rate bound (6) already uses empirical distribution estimates, so the proposed procedure may also be able to control the false discovery rate under dependence. More work is required to fully characterize the behavior of the proposed method with dependent test statistics.
In some cases the two sequences of p-values are not of equal importance, as in replicability analysis [8,9,23,24], which distinguishes between a primary and a follow-up study. The proposed method makes no such distinction, but could potentially be modified. Suppose for two sequences that sequence 2 were of greater interest. Then the rejection region could be defined as [t, ∞) × [ct, ∞) for some fixed constant 0 < c < 1. This may allow weaker signals to be captured from the more important study.
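This asymmetric rejection region can be sketched as a simple indicator function; the threshold values and the helper name asymmetric_reject are chosen purely for illustration:

```python
import numpy as np

def asymmetric_reject(T1, T2, t, c=0.5):
    """Indicator of the rejection region [t, inf) x [c*t, inf), 0 < c < 1.
    The lower threshold c*t on sequence 2 lets weaker signals from the
    more important study be captured."""
    return (T1 >= t) & (T2 >= c * t)

T1 = np.array([5.0, 9.0, 12.0])
T2 = np.array([8.0, 4.5, 2.0])
rejected = asymmetric_reject(T1, T2, t=8.0)  # sequence 2 threshold is 4.0
```

Only the second feature falls in the region: its T1 exceeds t and its T2 exceeds ct, even though T2 alone would miss the stricter symmetric cutoff t.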
A major outstanding issue is the suboptimal power of the proposed method. In some problems, with large numbers of test statistic sequences and/or sequences with moderate or high numbers of signals, the key bounds on the expected proportion of false positives in Propositions 1 and 2 are very loose. This is the tradeoff for the method's robustness to unknown test statistic null distributions. Developing nonparametric detection methods with good power in these settings is an important direction for future work.
by the definition of fdr ρ (t ρ ) in (6). Analogous to the proposed procedure (7), define the constrained optimization problem (11). This type of leave-one-out construction has often been used in proofs of false discovery rate control for a single sequence of test statistics [6,32,37].
It will first be shown that for any η 1 , η 2 < ∞, almost surely, for fdr ρ (t 1 , t 2 ) defined in (9). Next it will be shown that sup t1≤η1,t2≤η2 almost surely, which will complete the proof.