Familywise Error Rate Control via Knockoffs

We present a novel method for controlling the $k$-familywise error rate ($k$-FWER) in the linear regression setting using the knockoffs framework first introduced by Barber and Cand\`es. Our procedure, which we also refer to as knockoffs, can be applied with any design matrix with at least as many observations as variables, and does not require knowing the noise variance. Unlike other multiple testing procedures which act directly on $p$-values, knockoffs is specifically tailored to linear regression and implicitly accounts for the statistical relationships between hypothesis tests of different coefficients. We prove that knockoffs controls the $k$-FWER exactly in finite samples and show in simulations that it provides superior power to alternative procedures over a range of linear regression problems. We also discuss extensions to controlling other Type I error rates such as the false exceedance rate, and use it to identify candidates for mutations conferring drug-resistance in HIV.


Introduction
Multiple testing has received increasing attention with the advent of fields like genetics, technology, and astronomy which produce very high-dimensional datasets. The increasing number of hypotheses being simultaneously tested has motivated extensive research into procedures that maintain control of the familywise errors that abound when each hypothesis is only tested individually. For instance, the canonical criterion of the familywise error rate (FWER) controls the probability of falsely rejecting any of the true null hypotheses. A number of more modern landmark works have introduced new Type I error rates that allow for higher power by relaxing the FWER, including the false discovery rate (FDR) [2], the k-FWER [16,20], and the false discovery exceedance (FDX) [9,29]. Each one has a different interpretation, but all control an error rate defined over all hypotheses being tested, so that conclusions can be drawn by considering rejected hypotheses together.
Among multiple testing problems, some of most important deal with finding relationships between variables. Such investigations are often posed as a linear model where X = [X 1 , . . . , X p ] ∈ R n×p is a design matrix, β ∈ R p is a signal vector of interest, and z ∈ R n is the error term. The hypotheses of interest are which variables β j , after controlling for all other variables, contribute to the model, or have nonzero coefficients. With the ability to encode correlations between variables, linear models capture far more real-life examples than sequence models. Examples abound particularly in genetics, where one searches for relationships between parts of the genome, often in the form of single nucleotide polymorphisms or expression levels, and continuous variables such as health factors or drug response. Unfortunately, due to the dependence among the variables in the linear model, their respective tests do not in general exhibit any of the simple dependence structures, such as independence or positive dependence, that are required for many of the most powerful existing procedures.
In this work we focus on controlling the k-FWER, the probability of making at least k false discoveries, in the context of linear models. Our method uses the framework of knockoffs introduced by [1]. The idea of knockoffs is to carefully construct artificial variables that serve as controls for the original variables. Barber and Candès show that these controls are easy to construct and can be used to automatically account for variable dependence to provide finite-sample FDR control for general design matrices without knowledge of the noise variance. Controlling the FDR can be highly desirable in a high-power setting, but results can be hard to interpret when few discoveries are made, as the realized false discovery proportion may be highly variable. The k-FWER, which in the case of k = 1 reduces to the standard FWER, always has a clear interpretation by explicitly bounding the probability of k or more false discoveries, making it a useful criterion in all settings, as evidenced by its wide acceptance in the scientific community. The k-FWER also provides a fundamental building block to other Type I error rates, such as the FDX and Per Family Error Rate (PFER), as we will discuss in Section 4. We leverage the attractive features of the knockoffs framework to construct a novel procedure for controlling the k-FWER that implicitly accounts for the exact dependence structure in linear regression problems. In particular, we prove finite-sample k-FWER control for general design matrices without any knowledge of the noise variance, and show in simulations that the power can be substantially greater than state-of-the-art alternatives.
Much previous work has studied controlling the k-FWER under varying assumptions on the statistical dependence among the hypothesis test statistics or p-values. The bulk of such work has dealt with procedures that act directly on the p-values. When there are more observations than variables and the noise is i. i. d. Gaussian, ordinary least squares regression generates dependent tstatistics for all variables, allowing those procedures that can account for the dependence structure to be applied to the associated p-values. Unfortunately, the joint distribution of such p-values does not generally satisfy popular dependence assumptions such as positive regression dependence on subset [3] or multivariate total positivity [18]. Furthermore, many of the procedures that can account for general dependence structures do so nonparametrically through resampling. However, resampling procedures tend to require extra assumptions such as subset-pivotality [31] which do not hold in general in the regression setting, or only provide exact control asymptotically [27]. We mention here some work on controlling the k-FWER in finite samples and refer the reader to [12] for a more thorough review. The most popular methods for FWER control are the Bonferroni [7] and Holm's [15] procedures, neither of which require assumptions on the dependence among p-values. Under independence, the Bonferroni procedure can be improved using theŠidák correction [30], or one can employ Hochberg's step-up procedure [14]. In [20], step-down procedures generalizing Bonferroni and Holm's procedures are presented, while [27] introduces a generic step-down procedure, all for controlling the k-FWER. In addition, [26] also presents step-up procedures for controlling the k-FWER under arbitrary unknown dependence.
To avoid confusion, we point out that the recent work of [21] provides p-values for coefficients in a linear model, however they deal with a different notion of a null hypothesis than used here. In their framework, the null hypotheses are defined sequentially with respect to a growing model, wherein each time the model size is increased by one, the null hypothesis is that the new variable is uncorrelated with the response, conditional on only the variables already included in the model. In contrast, in our setting the null hypotheses are defined globally as simply whether elements of β (the full-model coefficient vector) are zero or not. For instance, if β 1 = 0 and β 2 = 0 but is correlated with (only) β 1 , we would consider selecting β 2 to be a false discovery, while in the sequential setting, β 2 would be a true discovery as long as it is selected before β 1 .
The remainder of the paper is structured as follows. Section 2 introduces notation and gives a short introduction to the knockoffs framework. Section 3 describes the knockoffs procedure for control of the k-FWER and proves this control along with tail bounds. Section 4 provides a brief discussion of how the procedure can be used to control the PFER and FDX. Section 5 compares our procedure to state-of-the-art alternatives from the literature, both in terms of practical considerations and power, in a series of simulations. Section 6 demonstrates an implementation on a real dataset from genetics, and Section 7 concludes with discussion and directions for future research.

Preliminaries for knockoffs
In this section, we introduce the knockoffs machinery of [1] at a minimal level to be sufficient for our exposition of k-FWER control. This material is largely borrowed from the reference [1]. In referring to the knockoffs framework, we always assume that the number of observations n is at least the number of variables p, the design matrix X has full column rank so that the Gram matrix X X is invertible, and the noise term z has independent Gaussian entries. We would like to briefly emphasize here that n ≥ p is necessary for the full-model multiple hypothesis testing problem to even be well-defined. For any linear regression problem, the "true" coefficient vector is only statistically well-defined modulo addition with any vector in the null space of the design matrix. If p > n, then the design matrix has a nontrivial null space, thus allowing zeros and nonzeros in the coefficient vector to arise and disappear, changing the fundamental values of the null hypotheses, without changing the data-generating process at all. Except for this non-degeneracy assumption, the knockoffs machinery works for general designs X and does not even require knowledge of noise variance σ 2 .
To start with, again, consider the linear model where the noise vector z has independent N (0, σ 2 ) entries, and each column of X has been normalized to have unit 2 -norm, that is, X j = 1 for all 1 ≤ j ≤ p. The first step of this method is to construct the knockoff design, denoted as X ∈ R n×p , that obeys where s ∈ R p has nonnegative entries and the superscript denotes matrix transpose hereafter. There are multiple ways to construct this knockoff design; see [1, Section 2.1]. The first equality forces X to have the same correlation structure among its columns as X. In the ideal case of n ≥ 2p, it can be guaranteed that the 2p column vectors of X and X are jointly linearly independent. By the second equality, for every 1 ≤ j ≤ p, the original variable X j and the knockoff counterpart X j have the same correlation with all the other 2p − 2 variables, namely, X i , X i for i = j. At a high level, we can view the knockoff design as a control group as compared to the original design X, which is treated as the case group.
Denote by X KO = [X, X] ∈ R n×2p the concatenation of the original design and the knockoff design. With X KO in hand, the next step is to generate statistics for each variable. One way to do so, suggested in [1], is by fitting the entire Lasso regularization path on the augmented design, and letting Z j be the first λ such that β j is nonzero. Formally, Defining Z j analogously for each knockoff variable X j , the knockoff statistics (using slightly different notation than in the original paper) are respectively. As pointed out in the reference paper, many alternative statistics, including some based on least-squares, least angle regression [8], and sorted 1 -penalized estimation [4], can be used instead as long as they obey the sufficiency and antisymmetry properties defined therein. The following result, due to [1], characterizes the joint distribution of the null χ j . We say j is a true null when β j = 0 and a false null otherwise. This simple lemma is very helpful in proving k-FWER control. Its proof follows from the symmetry between X j and X j if β j = 0, which is provided by the construction (2.1). The lemma shows that χ j can be interpreted as a one-bit p-value, in the sense that it has equal chance to take 1 or −1 if β j = 0. In fact when β j = 0, the knockoff symmetry characterized in (2.1) introduces exchangeability between X j and its knockoff counterpart X j in the Lasso path (2.2). Hence, X j and X j are equally likely to enter the Lasso path first. Conversely, if β j = 0, then X j is likely to enter before X j so that χ j = 1. Thus a large W j and a positive χ j provide evidence against the jth null hypothesis H 0,j : β j = 0.

k-familywise error rate control
Inspired by the interpretation of the statistics W j and χ j , it is reasonable to reject hypotheses with positive signs χ j and large W j . Parameterized by a positive integer v, the knockoffs procedure for controlling the k-FWER is as follows. Step Step 2. Let j be the index of the vth −1 in the sequence χ ρ(1) , . . . , χ ρ(p) . If fewer than v negatives appear, set j = p.
More compactly, define the threshold with the usual convention that sup ∅ = −∞. The multiplicity of W j is not accounted for since all W j are unique with probability 1. Then, the knockoffs procedure rejects all H 0,j with W j ≥ T v and χ j = +1.
Before characterizing the distribution of false discoveries made by the knockoffs procedure, we define some notation. Let N 0 = {1 ≤ j ≤ p : β j = 0} be the set of true null hypotheses and NB(m, q) denote a negative binomial random variable, which counts the number of successes before the mth failure in a sequence of independent Bernoulli trials with success probability q.
Proof of Lemma 3.1. First, we prove this lemma in the case where N 0 = {1, . . . , p}, that is, β j = 0 for all j. Conditional on all W j , Lemma 2.1 concludes that χ ρ(1) , . . . , χ ρ(p) are independent and each takes +1 and −1, respectively, with probability 1/2. Note that the permutation ρ is deterministic conditional on the W j . Recognizing that V is the number of positive χ j before the vth negative or the pth trial happens, whichever comes first, we see that V is an early stopped negative binomial random variable. In the general case, false null χ j will insert −1's into the process on the nulls, causing it to stop no later than when N 0 = {1, . . . , p}. Therefore, V is always stochastically dominated by The stochastic upper bound in Lemma 3.1 is tight in the following sense. The distribution of V can be made arbitrarily close to NB(v, 1/2) under the global null by taking p v, as in this case at least v negative χ j will appear in the sequence with high probability. Next we present the main result, which is immediate from Lemma 3.1 and the negative binomial cumulative distribution function.
Theorem 3.1. For any integer k ≥ 1 and significance 0 < α < 1, let v be the largest integer satisfying Then the knockoffs procedure with parameter v controls the k-FWER at level α, that is, P(V ≥ k) ≤ α.
As a concrete example, taking v = 4 would provide 10-FWER control at level 0.05. As one may observe from (3.1), the integer v as a function of the level α cannot be continuous. Consequently, P(V ≥ k) is in general lower than the target level α. In particular, for α ≤ 1/2 k no positive integer v satisfies (3.1), so the naive procedure must reject nothing. This matter can be easily resolved by randomization of v, as we will show in Remark 3.1.
To better understand the knockoffs procedure, we may want to know how many false rejections are made when the k-FWER is not controlled. To this end, the following result bounds the tail probability of V , or the probability of making many more rejections than expected.
Remark 3.1 (Power Improvement). As mentioned earlier, the knockoffs procedure suffers from a discretization problem, especially for small k, but this can be remedied by randomization as follows. For any desired level α ∈ (0, 1), there must exist an integer v ≥ 0 such that where the subscript v or v + 1 emphasizes the parameter of the knockoffs procedure. We can devise a mixture procedure that obeys exactly P(V ≥ k) = α by putting weights ω and 1 − ω, respectively, on the knockoffs procedures with parameters v and v + 1, where Furthermore, as with any procedure controlling the k-FWER, power can always be improved without affecting the k-FWER by always making at least k − 1 rejections. In the case of knockoffs, if we were going to make fewer than k −1 rejections, we can simply continue rejecting the indices with the largest W j and positive χ j until there are k − 1. The benefit of this modification depends on the ordering of the hypotheses induced by W j .

Controlling other error rates
This paper has been about controlling the k-FWER, but the procedure introduced can be used to control other Type I error rates as well, namely the PFER and the FDX.
Originally proposed by John Tukey in an unpublished work in 1953, the PFER is defined as E(V ), or in words, the expected number of false discoveries. The control of this error rate under general p-value dependence has not received as much attention in the literature as other error rates, although both [11] and [23] have discussed using the Bonferroni procedure for this purpose. Lemma 3.1 shows that the knockoffs procedure for controlling the k-FWER also controls the PFER at level v, as E(V ) ≤ E NB(v, 1/2) = 1/2 1−1/2 v = v. The FDX, also known as the γ-false discovery proportion, tail probability for the proportion of false positives, or false discovery excessive probability, is the probability that the FDP exceeds a specified bound γ. It can be viewed as a more stringent form of the FDR, and has received much attention recently; see, for example, [12]. A number of authors have noticed its intimate connection with the k-FWER, and many of the most successful FDX-controlling procedures in the literature can be posed as meta-procedures applied to a family of k-FWERcontrolling procedures [29,9,27]. We briefly review three such meta-procedures, any one of which could be combined with the knockoffs procedure introduced here, and defer further investigation to future work.
In [29], the authors introduced a simple and intuitive procedure which augments any FWER-controlling procedure to control the FDX. This procedure was generalized to any k-FWER-controlling procedure in [10]. Once the k-FWERcontrolling procedure makes R rejections, then if (k − 1)/R > γ, the augmentation procedure makes no rejections, but if (k − 1)/R ≤ γ, r more rejections can be made, where r satisfies (k − 1 + r)/(R + r) ≤ γ. This augmentation procedure controls the FDX exactly when the underlying k-FWER-controlling procedure also provides exact control.
A test-inversion procedure for FDX control was proposed in [9], which is similar to the closure principle of [22] for FWER control. This procedure was then investigated further in [10]. The inversion procedure runs global null hypothesis tests on every subset of hypotheses, and then finds the largest subset S whose maximal intersection with any subset for which the global null was not rejected is at most γ|S|. Note that any k-FWER-controlling procedure is also trivially a test of the global null hypothesis, rejecting whenever k or more rejections are made. Rejecting S from the inversion procedure controls the FDX exactly, and although in general it takes exponential time, for some global tests it can be run in polynomial time [9].
Given a procedure that can control the k-FWER for any k ≥ 1, [27] proposes a heuristic that aims to control the FDX. In short, given a prescribed level γ and significance α, both between 0 and 1, this heuristic uses a k-FWER-controlling procedure to make rejections for increasing k until just before the number of rejections goes above k/γ − 1. Explicitly, let R k be the number of rejections made by a procedure controlling the k-FWER. Then the Romano-Wolf heuristic definesk as the smallest k such that R k < k/γ − 1 and makes rejections as if controlling thek-FWER. Although not rigorous due to its adaptivity ink, under some dependence assumptions, the Romano-Wolf heuristic is shown to enjoy finite sample or asymptotic FDX control for step-down procedures [13,6].

Comparison with other procedures
As mentioned in the introduction, the structure and dependence between coefficients in linear regression preclude the use of many existing procedures. The state-of-the-art procedures that can be found in existing literature and provide exact finite-sample control of the k-FWER in linear regression are: (a) the generic step-down procedure of [27] applied to the least-squares pvalues (b) the step-up procedure of [26] applied to the least-squares p-values (c) the adaptation of Holm's procedure to k-FWER applied to the leastsquares p-values [20] (d) for 1-FWER, the Lasso pathwise testing procedure of [19] (e) also for the 1-FWER, the closure of any global hypothesis testing procedure, such as the χ 2 test, that can be applied to p-values with any known dependence, applied to the least-squares p-values Procedure (d) requires the user to know σ 2 exactly, and both (d) and (e) take computation that is exponential in the dimension, p, making them infeasible to use for problems of even moderate size. As a result, we only compare our procedure to (a), (b), and (c). It should be noted that the problem dimensions we considered in simulations were still limited by procedure (b), whose computation time is O(p k−1 ), since each threshold is computed as a maximum over subsets of size k − 1 from a superset of size up to p. There are also works that obtain asymptotic control of the FWER under some assumptions on the distribution of the design matrix (see, for example, [5,17]). As knockoffs applies under no assumptions on the design matrix and the error rates are controlled exactly, we do not compare to such works here.
In each of the following simulations, we performed many independent experiments to gauge how the performance of knockoffs, both in absolute terms and relative to previous methods, depends on correlation in the columns of X, the sparsity of β, and the signal to noise ratio. In each experiment, X is generated by normalizing the columns of a multivariate Gaussian matrix with independent and identically distributed rows, and β is generated by setting a pre-specified number of entries to zero, and setting the rest to the same nonzero magnitude, which is also prespecified. The following experiments are all performed in the sparse setting, as that is what the canonical statistics W that use the Lasso are best-suited for. However, nothing about the knockoffs framework to control any Type I error rate is particularly tied to sparsity, and it is of continuing interest to find different statistics W that achieve high power in all manner of settings. In all the following simulations, n = 1000, p = 450, σ 2 = 25, we control the 5-FWER at the 5% level, and we apply the modifications in Remark 3.1. The step-up procedure is implemented using the critical values suggested in [26], namely their Equation (13). For a sake of reproducibility, the code to generate these figures is available at http://wjsu.web.stanford.edu/code.html.
Our first experiment took β to have 10 nonzero elements, all with magnitude 10, and varied the pairwise correlation between the columns of X from 0 to 0.5. Figure 1 shows the power (number of true discoveries divided by β 0 ) of the knockoff procedure nearly doubling that of all alternative procedures. The power and 5-FWER of all four procedures is largely unaffected by the correlation in the columns of X.  Our second experiment generated columns for X independently, and varied the sparsity of β, with each nonzero coefficient having magnitude 10. Figure 2 shows the power of the knockoff procedure approximately doubling that of all alternative procedures in the sparsest regime and gradually losing its advantage as the sparsity approaches 10%. The 5-FWER of the knockoffs and step-down decrease as the coefficient vector becomes less sparse, with that of knockoffs becoming conservative especially quickly.
Our third experiment generated independent columns for X, used β with 10 nonzero entries, and varied the magnitude of the nonzero entries on a logarithmic scale.  Figure 3 shows the power of the knockoff procedure above all alternative procedures in the low-to middle-power regimes, while it actually has slightly less power in the very high-power regime, corresponding to a signal-to-noise ratio β 2 /σ 2 > 350. This reversal can be explained by the fact that with nonorthogonal columns and a not-extremely-sparse β, the Lasso will not perfectly select all signal variables before the non-signal variables, even when the signalto-noise ratio is extremely high [28]. As such, the Lasso-based W statistic used in knockoffs never achieves a power of 1; this phenomenon could be remedied by using one of the least-squares-based W mentioned in [1]. The 5-FWER of all four procedures is again largely unaffected by the coefficient magnitude.

Real data experiment
In this section, we apply our method to a data set on HIV drug resistance. Specifically, the data set, described and analyzed in [25] and also used in the original knockoffs paper [1], contains genotype information from samples of HIV Type 1, along with drug resistance measurements for 16 drugs across three classes. The three classes are protease inhibitors (PI), nucleoside reverse transcriptase inhibitors (NRTI), and nonnucleoside reverse transcriptase inhibitors (NNRTI), each of which has its own set of samples. Drug resistance was measured as the log-fold-increase of resistance as compared to a control, and the genetic information comes as single nucleotide polymorphisms (SNPs), so that the design matrix is binary with each column representing the presence or absence of a minor allele at a given locus.
In order to analyze the data, some cleaning was required. In particular, some samples do not have resistance measurements for some of the drugs, so these samples were removed on a drug-by-drug basis. Also, some SNPs have so few mu- tations that either their effect would be too hard to detect, or their inclusion actually causes rank-deficiency in the design matrix. As such, for each drug we only included polymorphisms with at least five mutations present in the culled sample; this was the minimum required to ensure all design matrices were full-rank. We compare our knockoffs procedure to the step-down, step-up, and Holm procedures, as well as to the original knockoffs procedure for controlling FDR at level q. As k-FWER is often used as an exploratory analysis, and to make analysis comparable with knockoffs for FDR control, we set α = 0.5 (FDR controls a mean, and with α = 0.5, k-FWER controls a median). We set k = 2 and q = 0.2, and ran all five procedures on all 16 drugs, the results of which are summarized in Table 1.
Although the ground truth is unknown in this case, there exists an approximate ground truth from treatment-selected mutation (TSM) panels [24]. These panels list mutations that were found to be statistically significantly more frequent in virus samples from individuals treated with a drug in that class than samples from individuals who had not. Thus in our experiment evaluation, we consider a SNP discovery for a given drug to be true if it has a mutation listed in the TSM panel for that drug's class.
The table shows the number of total discoveries and false discoveries made by each method on each data set. As suspected, FDR-controlling knockoffs was more powerful than any of the k-FWER-controlling procedures, but is harder to interpret as it never makes a very large number of discoveries, and thus the FDP may be quite different from q. The remaining procedures have varying levels of 2-FWER, but recall that the error rates reported are likely to be overestimates, as there may be important SNPs that the TSM panels missed. Still, we see that on this data set, the step-down and Holm procedures commit more 2-familywise errors than knockoffs, while the step-up procedure has over 10% less power than knockoffs.
To better-understand the trade-offs with the choice of k and α, we also compared the average number of "true" discoveries (discoveries confirmed by TSM panels) as a function of k and α for all four k-FWER-controlling procedures. Figure 4 shows the relative performance of knockoffs improving as the error control (k and/or α) relaxes. To understand why this is, note that the leastsquares-p-value-based procedures all have fairly constant power as k and α vary, because there is a group of variables with very strong signals whose p-values are so small that they are always rejected, almost regardless of k and α. But as discussed with regard to Figure 3, the Lasso path does not always perfectly order the very strong signals first, so knockoffs using the Lasso W statistic loses some power to reject very strong signals. However, as k and/or α increase, knockoffs easily finds the strong signals and more while the other procedures become increasingly conservative. As evidenced in Figure 4, knockoffs quickly overtakes the available alternatives when k and α are not chosen too stringently, so that a large number of rejections are made. Because of this, we expect the advantages of knockoffs to be especially pronounced in large exploratory analyses.

Discussion
This work leaves a number of important avenues open for future research. First, we mentioned in Section 3 a number of methods that translate k-FWERcontrolling procedures into procedures for controlling the FDX. Investigating the best such method could yield a powerful method for controlling another important Type I error rate. Second, [1] mentions in passing the possibility of multiple knockoffs, i.e., constructing m ≥ 1 sets of knockoffs and replacing the one-bit p-values corresponding to the χ j 's with m+1-discretized p-values. In the setting of FDR control, one can search over many one-bit p-values and need only consider what fraction, on average, may be false discoveries. However to control the k-FWER, one must keep track of every false discovery, and we may expect the extra resolution of multiple knockoffs to provide more power to distinguish true discoveries from false ones. Lastly, we feel the knockoffs framework is still a largely untapped resource for generating multiple testing procedures. The investigation of alternative W j statistics for ordering variables, and the extension to other regression settings such as logistic regression and higher-dimensional problems (p > n) are all important open subjects.
We have presented a novel method for controlling the k-FWER in the context of linear regression. Knockoffs requires no knowledge of the noise variance and implicitly takes into account the exact dependence structure of the problem, allowing it to provide considerable power improvements over state-of-the-art alternatives in a range of settings. This, along with its intuitive justification and ease of computation, makes knockoffs a useful practical tool for multiple hypothesis testing.