False discovery rate control for effect modification in observational studies

In an observational study, a difference in outcomes between the treatment and control groups might reflect bias in treatment assignment rather than a true treatment effect. A sensitivity analysis determines the magnitude of bias that would be needed to explain away as noncausal a significant treatment effect found by a naive analysis that assumed no bias. Effect modification is an interaction between the treatment and a pretreatment covariate. In an observational study there are often many possible effect modifiers, and it is desirable to be able to look at the data to identify which effect modifiers will be tested. We address two problems simultaneously: accounting for the multiplicity involved in choosing, by looking at the data, which effect modifiers to test from among many candidates, and conducting a proper sensitivity analysis. We develop an approach that provides finite-sample false discovery rate control for a collection of adaptive hypotheses identified from the data in a matched-pairs design. Along with simulation studies, an empirical study is presented on the effect of cigarette smoking on the level of lead in the blood, using data from the U.S. National Health and Nutrition Examination Survey. Other applications of the suggested method are briefly discussed.


Introduction
In a randomized study, we know the distribution of treatment assignment, but in an observational study, the distribution of treatment assignment is unknown. Consequently, in an observational study, assuming that treatment assignment is random could introduce bias in inferences about the treatment effect. If there are no unmeasured confounders, then it is possible to remove such bias by matching on observed covariates and conducting inference that assumes treatment is randomly assigned within matched sets (see, e.g., Aakvik, 2001; Gemenis and Rosema, 2014; Pimentel, Yoon and Keele, 2016). On the other hand, if there do exist unmeasured confounders, then this analysis would be biased. In such a scenario a sensitivity analysis tries to answer the question of how much bias due to unmeasured confounders would have to be present in the data to alter an inference based on the assumption of no unmeasured confounders (see, e.g., Keele and Minozzi, 2013; Zubizarreta et al., 2012).
An effect modifier is defined as a pretreatment covariate for which the treatment effect differs according to the levels of the covariate. Effect modifiers are of inherent interest for personalizing treatment. In addition, although we can test the hypothesis of no treatment effect on any subject, i.e., Fisher's sharp null hypothesis (Fisher, 1935; Neyman, 1923), without having to consider effect modifiers, consideration of possible effect modification can increase power and reduce sensitivity to bias. Larger treatment effects are less sensitive to bias from unmeasured confounders than smaller treatment effects. Taking advantage of this fact, Hsu, Small and Rosenbaum (2013) suggested forming a collection of subgroups of subjects, such that subjects within a subgroup are expected to have the same level of treatment effect, and then pooling evidence from the subgroups to make inferences about Fisher's sharp null and effects within subgroups. There is, though, an operational difficulty in such an analysis: which variables are effect modifiers, and at what levels of those variables effect modification occurs, are uncertain. It is not always possible to have a priori knowledge of effect modifiers in a study. Many studies use the data to form the subgroups and then carry out an analysis based on the learned groups. In such an analysis the data is used twice, first to identify the subgroups and then to make inferences. To avoid always finding something if we look at enough data, it is important to control the error rate for having looked at the data to identify the subgroups of interest. Hsu et al. (2015) provided an algorithm that guarantees familywise error rate (FWER) control when the subgroups are built from the data.
The false discovery rate (FDR) is another choice for error control in multiple testing that is less conservative and has more power. Both FWER and FDR control the probability of falsely rejecting at least one null when all nulls are true. They differ when at least one null is false: for example, if we reject 20 nulls, 19 of which are false, FWER regards this as a failure, while FDR regards this as a success when controlling at level 0.05, since the false discovery proportion is 1/20 = 0.05. Glickman, Rao and Schultz (2014) have strongly advocated FDR control in epidemiological studies on philosophical grounds, and many scientific communities have widely adopted FDR control as the norm for multiple testing. Our principal goal is to extend the work of Hsu et al. (2015) to a procedure that provides false discovery rate control. We show that, in the same set-up as Hsu et al. (2015), we can use the Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg, 1995) on groups defined from the data to provide FDR control on the group hypotheses.
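The BH step-up rule itself is short; a minimal sketch in Python (a hypothetical helper, not code from the paper) makes the i·q/m comparison concrete:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: return the (sorted) indices of
    rejected hypotheses. Rejects the k smallest p-values, where
    k = max{i : p_(i) <= i * q / m} and m is the number of hypotheses."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank
    return sorted(order[:k])

# Example: five group p-values; at q = 0.05 the two smallest are rejected.
print(benjamini_hochberg([0.001, 0.013, 0.04, 0.2, 0.7], q=0.05))  # → [0, 1]
```

Note the step-up character of the rule: a p-value that fails its own threshold can still be rejected if a larger p-value passes its threshold.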
The rest of the paper is organized as follows. In the following subsection we motivate our work with a real data example. Section 2.1 introduces the required notation. Section 2.2 gives a short review of sensitivity analysis in observational studies. In Section 3 we derive the main technical results of our study. A detailed simulation study is presented in Section 4, and in Section 5 we revisit our motivating example.

Motivating example: lead level in the blood of smokers
Does smoking cause an increase in the level of lead in the blood? We consider data from the U.S. National Health and Nutrition Examination Survey (NHANES) for the years 2009-2014; similar data have been studied to elaborate on a different aspect of sensitivity analysis. We consider the data on the 9,103 adults 20 years or older who can be classified as smokers or non-smokers. A smoker is an individual who reported smoking more than 100 cigarettes in his/her lifetime, has smoked every day for the last 30 days, and has smoked one or more packs per day in the last 30 days. A non-smoker is someone who reported smoking fewer than 100 cigarettes in his/her lifetime and has not smoked any cigarettes in the last 30 days. There are 1,485 smokers and 7,618 non-smokers. Following previous observational studies of the effect of smoking (Rosenbaum, 2007a, 2017), we compare heavy smokers (as defined above) to non-smokers because making the two groups differ sharply in exposure dose increases the insensitivity of the study to unobserved biases when there is an exposure effect and no bias (i.e., it increases the design sensitivity, Rosenbaum, 2004).
We control for the following pretreatment covariates: age, gender, education (encoded in five indicator variables for categories of education level: less than 9th grade, between 9th and 11th grade, high school graduate/GED or equivalent, some college or AA degree, and college graduate or above), race (categorized as Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Non-Hispanic Asian and Other Race), income-to-poverty ratio, and an indicator for whether income information is missing. We control for these pretreatment covariates by pair matching, in which each treated individual, i.e., a smoker, is matched to a control individual, i.e., a non-smoker, based on their observed covariates. In our study we use a rank-based Mahalanobis distance with a propensity score caliper (see Rosenbaum, 2010, Ch. 8 for details) to form the matched pairs. This matching is implemented using the pairmatch function of the optmatch package in R (Hansen, 2007). When considering change in the level of lead in the blood as an effect of smoking, genetic and environmental factors are potential confounding variables on which we have no information. A genetic factor can be associated with the level of lead in the blood and might also affect the smoking habits of an individual. Also, an individual in a certain industrial locality might be prone to a higher lead level, and smoking behavior could be associated with locality. Consequently, in the absence of information on locality and genetic factors, a sensitivity analysis becomes important for properly evaluating the exposure effect.
We test Fisher's sharp null of no treatment effect with Huber-Maritz M-statistics using the senmv function (with parameters inner = 0.1, trim = 1.5) of the sensitivitymv package in R (Rosenbaum, 2015). An M-statistic, proposed by Maritz (1979), is the quantity equated to zero in defining Huber's M-estimates (Huber, 1981). We consider the one-sided alternative that smoking increases the level of lead in the blood. The p-value, assuming there is no unmeasured confounding, is less than 4.55 × 10⁻¹⁵. Using the sensitivity analysis method of Rosenbaum (2007b), we can compute the minimum Γ (≥ 1) at which the conclusion that smoking causes an increase in lead level becomes sensitive to bias from unmeasured confounding, where Γ is the maximum odds ratio of being a smoker vs. a non-smoker for two subjects matched on the measured confounders. If there are no unmeasured confounders, then Γ = 1; the more unmeasured confounding there is, the larger Γ is. In this study, we find that the effect is insensitive to hidden bias until Γ = 2.6, i.e., if the influence of unmeasured confounding is less than Γ = 2.6, we still have strong evidence that smoking causes higher lead levels.
Individuals with different measured covariate values may have different magnitudes of treatment effect, i.e., different magnitudes of increase in lead level due to smoking. Thus these covariates may act as effect modifiers. Genetic and environmental factors, which are unmeasured, can also be correlated with the magnitude of treatment effect. It is not known a priori how these covariates form relevant subgroups showing similar levels of treatment effect. We use our data to form such potential subgroups. Figure 1 shows five subgroups created from a regression tree (Breiman et al., 1984) model of the rank of the absolute difference in the level of lead in the blood between smokers and non-smokers within each pair on the observed matched covariates. This is implemented using the ctree function of the R package party. Covariates not matched exactly within pairs are averaged. We elaborate on our choice of this method to build the subgroups in Section 3. We have five non-overlapping subgroups, the leaf nodes of the tree in Figure 1. The key methodological question we will address is: how can we make valid inferences and perform valid sensitivity analyses that account for the fact that we have chosen the subgroups to examine based on the data? In Section 5 we report the results of our analysis, contrasting them with those of the closed testing method of Marcus, Peritz and Gabriel (1976) as applied by Hsu et al. (2015). This analysis will show evidence that smoking causes an increase in the level of lead in the blood which is much less sensitive to bias than an analysis not incorporating effect modification.

Notation
In a matched pair study we denote by I the number of matched pairs. The matched pairs are formed based on observed covariates. In the ith matched pair there are two subjects denoted by j = 1 and 2. In shorthand we write ij to denote the jth subject of the ith matched pair. The treatment assignment for subject ij is denoted by Z ij , taking value 1 for treated or 0 for control. Let x ij be the vector of observed covariates and u ij the summary of all unobserved covariates, scaled between 0 and 1, for subject ij. By the architecture of the matching algorithm we have the constraint Z i1 + Z i2 = 1, and we seek to have x i1 = x i2 = x i . It is possible, though, that u i1 ≠ u i2 , since u i1 and u i2 are unobserved. In our motivating example of studying the effect of smoking on the level of lead in the blood, we have I = 1,485 pairs, i.e., 1,485 × 2 = 2,970 individuals.
For each subject ij, there is a pair of potential outcomes (r Tij , r Cij ) corresponding to whether the subject is treated or not, i.e., Z ij = 1 or 0 (Neyman, 1923; Rubin, 1974). To analyze the treatment effect we are interested in the difference r Tij − r Cij . For each individual we observe only one of these two responses, determined by the observed treatment assignment. We write the observed response for subject ij as R ij = Z ij r Tij + (1 − Z ij )r Cij . We collect the characteristics of the subjects that are fixed regardless of treatment assignment and denote them by F = {(r Tij , r Cij , x ij , u ij ), i = 1, 2, · · · , I; j = 1, 2}. The only variable assumed to be random is the treatment assignment Z ij ; in other words, the inference is conditional on F. The difference in response between treated and control for the ith matched pair is given by Y i = (Z i1 − Z i2 )(R i1 − R i2 ). Consider a subset G of the indices {1, 2, . . . , I}. We denote by Z G the vector of length 2 × |G| that collects the treatment assignments of matched pairs with indices in G. In the same spirit we can introduce the notation r TG , r CG , x G , u G , R G and Y G . As a final piece of notation, let 𝒵 G denote the collection of all possible treatment assignments; that is, 𝒵 G is the collection of 2^|G| vectors Z G satisfying the constraint Z i1 + Z i2 = 1 for i ∈ G. When G is the full set of indices we simply drop the subscript G from the above notation.
Consider Fisher's sharp null hypothesis of no treatment effect restricted to subgroup G; denote it by H 0,G . Suppose there are no unmeasured confounders; then the treatment assignment in each pair is exactly randomized, i.e., P r(Z ij = 1 | F G , 𝒵 G ) = 1/2 for each subject ij in the ith pair (i ∈ G). Since, under the null hypothesis of no treatment effect among the subjects in G, r CG = R G , we can calculate the null distribution of a test statistic T G = T (Z G , R G ). There are various competing candidates for the choice of the test statistic T (·, ·). For our simulation we choose Wilcoxon's signed rank statistic due to its familiarity. The theory and computation of our work hold as is for other choices of test statistic suggested in the literature, e.g., Huber's M-statistics (Rosenbaum, 2007b) and U-statistics (Rosenbaum, 2011).
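For intuition, this null randomization distribution can be computed exactly for a handful of pairs: under the sharp null with no hidden bias, each pair's treated-minus-control difference keeps its absolute value and receives a ± sign with probability 1/2 each. The sketch below (illustrative only; it assumes distinct nonzero absolute differences and enumerates all 2^I sign patterns, so it is feasible only for small I) does this for Wilcoxon's signed rank statistic:

```python
from itertools import product

def signed_rank_null(abs_diffs):
    """Exact randomization null distribution of Wilcoxon's signed rank
    statistic T = sum of ranks of pairs with a positive difference,
    enumerating all 2^I equally likely sign patterns. Assumes the
    absolute differences are distinct and nonzero."""
    ranks = [sorted(abs_diffs).index(a) + 1 for a in abs_diffs]
    dist = {}
    for signs in product([0, 1], repeat=len(ranks)):
        t = sum(r for s, r in zip(signs, ranks) if s)
        dist[t] = dist.get(t, 0) + 1
    n = 2 ** len(ranks)
    return {t: c / n for t, c in sorted(dist.items())}

def upper_p(abs_diffs, t_obs):
    """One-sided p-value P(T >= t_obs) under the randomization null."""
    dist = signed_rank_null(abs_diffs)
    return sum(p for t, p in dist.items() if t >= t_obs)

# With I = 3 pairs, T ranges over 0..6 and P(T >= 6) = 1/8.
print(upper_p([0.4, 1.2, 2.5], 6))  # → 0.125
```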

Sensitivity to hidden bias
The sensitivity analysis approach we study here is based on two principal assumptions (see Rosenbaum, 1987, 2002, for more details). First, subjects are assigned to treatment and control independently of each other. The propensity of treatment assignment is denoted by π ij = P r(Z ij = 1 | F). When π ij , even though unknown, is only a function of the observed covariates x ij , we can produce correct inference based on a paired randomized experiment, as discussed at the end of the last section. When this may not be the case, we assume that two subjects with the same observed covariates may differ in their odds of treatment assignment by at most a factor of Γ ≥ 1. That is, for two subjects ij and i′j′ with x ij = x i′j′ ,

1/Γ ≤ {π ij (1 − π i′j′ )} / {π i′j′ (1 − π ij )} ≤ Γ.

The factor Γ is termed the level of hidden bias due to unmeasured confounders. If Γ = 1 then this model corresponds to paired randomized assignment. Let γ = log(Γ). Rosenbaum (2002) shows that this assumption is equivalent to the following treatment assignment distribution:

P r(Z ij = 1 | F, 𝒵) = exp(γ u ij ) / {exp(γ u i1 ) + exp(γ u i2 )}.   (2.1)

For various approaches to sensitivity analysis in observational studies, see, e.g., Wang and Krieger (2006), Yanagawa (1984) and Yu and Gastwirth (2005).
Based on the above setting, a sensitivity analysis provides bounds for quantities that are unknown because of the unobserved u's. For example, suppose T G is the statistic to be used for testing the null H 0,G . Because the π ij are unknown, we do not know the null distribution of T G . But under model assumption (2.1) we can produce two statistics, T̄ Γ,G and T̲ Γ,G , whose distributions under the null are known and which sharply bound T G (in the sense of first order stochastic dominance) from above and below, respectively. Thus we can produce lower and upper bounds on the p-value, p̲ Γ and p̄ Γ . When Γ = 1, these two bounds coincide. At significance level α we would reject the null hypothesis if p̄ Γ=1 is less than α. An inference is sensitive to hidden bias Γ at level α if p̄ Γ ≥ α.
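For the simple sign test (counting pairs with a positive difference), these bounding distributions have a closed form: under bias at most Γ, the chance that a pair's difference is positive lies between 1/(1+Γ) and Γ/(1+Γ), so the bounding null distributions are binomial. The sketch below illustrates p̲ Γ and p̄ Γ for that test; it is a toy stand-in for the M-statistic bounds used in the paper:

```python
from math import comb

def binom_upper_tail(n, p, t):
    """P(Bin(n, p) >= t)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))

def sign_test_pvalue_bounds(n_pairs, t_obs, gamma):
    """Lower and upper bounds on the one-sided sign-test p-value when
    the within-pair odds of treatment are biased by at most gamma >= 1.
    Under the sharp null, each pair's difference is positive with a
    probability between 1/(1+gamma) and gamma/(1+gamma)."""
    p_lo = binom_upper_tail(n_pairs, 1 / (1 + gamma), t_obs)
    p_hi = binom_upper_tail(n_pairs, gamma / (1 + gamma), t_obs)
    return p_lo, p_hi

# At gamma = 1 the two bounds coincide with the usual sign-test p-value.
lo, hi = sign_test_pvalue_bounds(10, 9, 1.0)
print(round(lo, 4), round(hi, 4))  # → 0.0107 0.0107
```

Raising gamma widens the interval [p̲ Γ , p̄ Γ ]; the inference becomes sensitive once p̄ Γ crosses the significance level.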
In the context of testing the null hypothesis for a subgroup G, we can relax our model (2.1) to hold for the pairs in G only:

P r(Z ij = 1 | F G , 𝒵 G ) = exp(γ u ij ) / {exp(γ u i1 ) + exp(γ u i2 )} for i ∈ G,

for some u G ∈ [0, 1]^{2|G|}. The corresponding lower and upper bounds for the p-value are p̲ Γ(G) and p̄ Γ(G) .

Effect modification
In the presence of effect modification, our strategy is to divide the set of individuals into subgroups. These subgroups can be constructed from a priori information, but in most situations, when there is not sufficient a priori information, we would like to derive the groups from the observed data. We denote the subgroups by {G 1 , G 2 , . . . , G G }. The subgroups are non-overlapping. An ideal grouping, one that captures the effect modifiers perfectly, would aim for subgroups such that, asymptotically in the number of pairs, every matched pair in a single subgroup G g has the same treatment effect and the treatment effects of different groups differ. In finite samples there is a bias-variance trade-off in constructing the partition in terms of the sizes of the subgroups. To be practically useful, we would like a grouping with a moderate number of subgroups in which two matched pairs with similar treatment effects fall in the same subgroup. In our motivating example subjects were divided into G = 5 groups based on the data (cf. Figure 1). The procedure that identifies these 5 partitions of the data first calculates the absolute difference in the level of lead in the blood within each pair, ranks these absolute response differences, and uses the ranks as the outcome in a regression tree fit on the observed covariates. The use of a regression tree model allows us to describe these groups in terms of the observed covariates: G 1 = {female of age less than 41 years}, G 2 = {female of age between 41 and 47 years}, G 3 = {male of age less than or equal to 47 years}, G 4 = {female of age more than 47 years}, G 5 = {male of age more than 47 years}.
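The splitting criterion of a regression tree can be illustrated in miniature: for a single covariate, a CART-style split picks the cutpoint minimizing the within-group sum of squared deviations of the response, here the ranks of |Y|. The sketch below is a toy one-split version (not the ctree algorithm used in the paper):

```python
def rank(values):
    """Ranks 1..n of the values (ties broken arbitrarily)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def best_split(x, abs_y):
    """One CART-style split of pairs on covariate x, using the ranks of
    |Y| as the response: pick the threshold minimizing the within-group
    sum of squared deviations. Returns (threshold, sse)."""
    ranks = rank(abs_y)
    def sse(idx):
        if not idx:
            return 0.0
        mu = sum(ranks[i] for i in idx) / len(idx)
        return sum((ranks[i] - mu) ** 2 for i in idx)
    best = None
    for c in sorted(set(x))[:-1]:
        left = [i for i in range(len(x)) if x[i] <= c]
        right = [i for i in range(len(x)) if x[i] > c]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (c, total)
    return best

# Pairs with x <= 2 have small |Y|, pairs with x > 2 large |Y|:
# the chosen threshold recovers the change point at 2.
x = [1, 1, 2, 2, 3, 3, 4, 4]
abs_y = [0.2, 0.3, 0.1, 0.4, 2.0, 2.5, 2.2, 2.8]
print(best_split(x, abs_y)[0])  # → 2
```

A real tree recurses on each child node and prunes; only the split criterion is shown here.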
Each subgroup G g corresponds to a null hypothesis H 0,Gg of no exposure effect in any pair in the group. We then want to test the collection of G hypotheses {H 0,G | G ∈ {G 1 , G 2 , . . . , G G }}. When the groups are derived from the data, the grouping is a random quantity and this collection of hypotheses is also random. Hsu, Small and Rosenbaum (2013) tested the global (intersection) null hypothesis H 0 by first computing p-values for each of the hypotheses and then combining those G p-values. Their suggestion was to use the product of truncated p-values (Zaykin et al., 2002),

P Γ∧ = ∏ g=1,...,G (p̄ Γ,Gg )^χ(p̄ Γ,Gg ≤ α̃) ,

where α̃ is a parameter taking value in (0, 1]. Here and later in this paper we use χ(E) to denote the indicator function of an event E. For α̃ = 1 this is equivalent to Fisher's method of combining p-values. When the groups are formed a priori, Hsu, Small and Rosenbaum provided upper bounds on the null distribution of P Γ∧ which can be used to carry out a sensitivity analysis.
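The truncated product statistic is straightforward to compute; a small sketch (hypothetical helper name):

```python
def truncated_product(pvals, alpha_tilde=1.0):
    """Zaykin et al. (2002) truncated product: the product of the
    p-values that are at most alpha_tilde (an empty product is 1).
    With alpha_tilde = 1 this is the statistic behind Fisher's
    method of combining p-values."""
    prod = 1.0
    for p in pvals:
        if p <= alpha_tilde:
            prod *= p
    return prod

# Only the two p-values below 0.05 enter the product: 0.01 * 0.04.
print(round(truncated_product([0.01, 0.04, 0.5, 0.9], alpha_tilde=0.05), 6))  # → 0.0004
```

Smaller values of the statistic are stronger evidence against the intersection null; its reference distribution (and the sensitivity analysis bounds) come from the cited papers, not from this sketch.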
As we advocated earlier, in the presence of effect modifiers we would like to provide inference on the collection of hypotheses considered. Hsu et al. (2015) suggested a closed testing approach (Marcus, Peritz and Gabriel, 1976), which provides the FWER guarantee that the probability of at least one false rejection is at most α. We will develop a method that controls the FDR when choosing effect modifier hypotheses based on the data.
In our simulation in Section 4 and the analysis in Section 5 of the data set studying the effect of smoking on the level of lead in the blood, we see a considerable gain from considering effect modifiers, in the sense that the evidence from such an analysis is much less sensitive to bias than that of an analysis ignoring effect modification. Through extensive simulation, Hsu et al. showed that, in the presence of effect modification, the closed testing procedure is less sensitive to unmeasured confounders than the global test of no effect. In our simulation and data analysis we observe that the proposed procedure is more robust to unmeasured confounding than the closed testing procedure.

False discovery rate
For simultaneous hypotheses, the false discovery rate introduced by Benjamini and Hochberg (1995) is defined as the expected value of the proportion of falsely rejected hypotheses out of all rejected hypotheses. Let D g be the decision function taking the value 1 or 0 according to whether H 0,Gg is rejected or not. Let G 0 be the collection of pairs with no treatment effect. Then

FDR = E[ Σ g: Gg ⊆ G0 D g / max(1, Σ g=1,...,G D g ) ].
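The quantity inside the expectation, the false discovery proportion (FDP), can be computed per study; a minimal sketch (hypothetical helper name):

```python
def false_discovery_proportion(rejected, true_nulls):
    """Proportion of the rejected hypotheses that are true nulls; 0 when
    nothing is rejected. The FDR is the expectation of this quantity
    over repeated studies."""
    if not rejected:
        return 0.0
    return len(set(rejected) & set(true_nulls)) / len(rejected)

# 20 rejections, 1 of them a true null: FDP = 1/20 = 0.05.
print(false_discovery_proportion(list(range(20)), [19, 25, 30]))  # → 0.05
```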

Adaptive inference under effect modification
The selection of the subgroups from the data must use only information about a pair that would be unchanged if there were no treatment effect in the pair and the treatment assignment were reversed. In our motivating example we used a regression tree of the rank of the absolute difference of responses in a pair (Y i ) on the observed covariates (x i ). This approach is motivated by the fact that when the difference of responses Y i is related to the observed covariates via a non-negative function with additive noise, the absolute response regressed on the covariates will group similar effects together (Jogdeo, 1977; Hsu et al., 2015). For our theoretical result we assume that the algorithm building the groups satisfies the condition below. Condition A. The groups of pairs are mutually exclusive and are formed only as a function of |Y i | and the matched covariates x i .
The condition says that, explicitly, we are not allowed to use raw information about the treatment assignment. Let G 0 be the collection of pairs with no treatment effect. The condition above then allows us to condition on the information set I generated by the matched covariates x i and the absolute response differences |Y i | of all the pairs. Our main result is that the Benjamini-Hochberg (BH) procedure with level q applied to {p̄ Γ,G | G ∈ {G 1 , G 2 , . . . , G G }} provides the guarantee

FDR ≤ q.   (3.1)

Two lemmas are key to establishing (3.1). Proof.
If G g has no treatment effect then G g ⊆ G 0 . Thus for each i ∈ G g the information about i in I is already contained in (F Gg , 𝒵 Gg ). Therefore the propensity score of treatment assignment conditioned on (F, Z, I) is the same as that conditioned on (F Gg , 𝒵 Gg ). Thus we get

P r(Z ij = 1 | F, Z, I) = P r(Z ij = 1 | F Gg , 𝒵 Gg ) for i ∈ G g .

As a consequence of Lemma 2, following the argument of Rosenbaum (2002) we can bound the p-values for testing the random null H 0,Gg by p̲ Γ,Gg and p̄ Γ,Gg from below and above, respectively. Theorem 1. If the subgroups are built based on Condition A and the bias is no more than Γ, then the BH procedure applied to the collection {p̄ Γ,G | G ∈ {G 1 , G 2 , . . . , G G }} at level q satisfies (3.1).
The proof of the theorem is given in the Appendix. Condition A guarantees that the grouping is determined by the information set I. Different grouping strategies (that satisfy Condition A) may result in different groups, and hence have different power to reject non-null groups. The BH procedure on the p-value upper bounds guarantees that the FDR is controlled given the grouping, for any configuration of null and non-null groups. Moreover, the FDR control is guaranteed for any treatment assignment distribution of the nonnull pairs. Since the grouping is unchanged for any possible treatment assignments for the nonnull pairs when the absolute difference in response is used for group construction, by marginalizing over the information set I, the FDR is unchanged. This is formalized in the following corollary.
Proof. Given the knowledge of F and Z, for a treatment assignment Z, the information set I is determined by the treatment assignments on the non-null pairs, i.e., Z G0c . Let I(z G0c ) be the information set of the grouping strategy when the treatment assignment on the pairs in G 0 c is z G0c . The rest of the proof now follows from Theorem 1, which proves (3.1) for any information set I ≡ I(z G0c ).
A motivation for considering effect modification in the sensitivity analysis of treatment effects is that a larger treatment effect tends to be less sensitive to hidden bias than a smaller treatment effect. The next theorem proves that our procedure provides FDR control for such a sensitivity analysis.
Theorem 2. Suppose the subgroups are built based on Condition A. For hidden bias levels Γ 1 , Γ 2 , . . . , Γ G , if the bias is at most Γ g in group G g for each g, then the BH procedure applied at level q to the collection {p̄ Γg,Gg | g = 1, 2, . . . , G} controls the FDR at level q.
The proof of Theorem 2 is very similar to that of Theorem 1, and therefore omitted.
A simple three step procedure for our sensitivity analysis is as follows.
• Build mutually exclusive subgroups G 1 , . . . , G G from the data with a method satisfying Condition A.
• Choose hidden bias levels Γ 1 , . . . , Γ G and compute the p-value upper bounds p̄ Γg,Gg .
• Apply the BH procedure at the desired level q.
For the second step of choosing the Γ g values, subject matter information about what values of Γ g are of concern and information from (F, Z, I) can be used.

Simulation
Here we examine the performance of the suggested procedure on data-built groups in various simulation settings. For I = 1,000 pairs, the treatment assignment is assumed to be randomized, so that there is no hidden bias in the study. First, we consider a single observed covariate (x) that takes five possible values 1 through 5, independently for each pair with equal probabilities. The response difference for the pairs is simulated from N(μ x , 1). We denote the vector of treatment effects over the covariate values by μ = (μ 1 , μ 2 , . . . , μ 5 ). When the μ j , j = 1, 2, . . . , 5, are not all the same, there is effect modification. Based on this simulated data we fit a regression tree of the rank of the absolute response difference on the covariate x to build the groups. A second, binary covariate with no effect modification is also used in the regression fit. Thus the groups tested (and the number of groups) are possibly different in each simulation. Table 1 reports the summary of the simulation study based on 2,000 simulations. Eight separate scenarios are investigated, starting with no effect (A), then different levels of effect modification (B-D), and then various scenarios of strong effect modification (E-1-E-4).
In Table 1 the false discovery rate is controlled at the desired level q = 0.05 by both the closed testing and the BH procedures. To compare power, we compare the closed testing procedure (Marcus, Peritz and Gabriel, 1976) to the BH procedure in terms of the false non-discovery rate (FNR) (Genovese and Wasserman, 2002), defined as the expected value of the proportion of false hypotheses not rejected out of all the hypotheses not rejected. In terms of estimated FNR, the BH procedure is never worse than closed testing and is sometimes considerably better; e.g., in scenario E-4 the BH procedure has FNR 0.1% compared to 68% for the closed testing procedure at Γ = 3, and in scenario E-1 the FNRs for the BH and the closed testing procedures are 0% and 46%, respectively, at Γ = 1.
We also consider two other types of error: the proportion of pairs rejected that have no effect, and the proportion of pairs not rejected that have a treatment effect. While FDR and FNR are group-level measures, these two measures compare the methods at the pair level. In terms of both these errors BH is as good as, and often better than, closed testing, especially in scenarios of strong effect modification (E-1, E-3). For example, in scenario E-4 at Γ = 3, on average 77% of the pairs having a treatment effect are not rejected by closed testing, as compared to 0% (rounded to one significant digit) for the BH procedure.
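A miniature version of such a simulation can be sketched in a few lines. Here two fixed groups are simulated (one null, one with a unit effect), the per-group p-values come from a sign test rather than the paper's statistics and data-built groups, and the BH procedure is applied at q = 0.05; the Monte Carlo average of the false discovery proportion estimates the FDR. The group sizes and number of replications are illustrative choices:

```python
import random
from math import comb

def sign_p(diffs):
    """One-sided sign-test p-value: P(Bin(n, 1/2) >= #positive diffs)."""
    n, t = len(diffs), sum(d > 0 for d in diffs)
    return sum(comb(n, k) for k in range(t, n + 1)) / 2 ** n

def bh_reject(pvals, q):
    """Indices rejected by the Benjamini-Hochberg step-up rule."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for r, i in enumerate(order, 1):
        if pvals[i] <= r * q / m:
            k = r
    return set(order[:k])

random.seed(0)
reps, q = 200, 0.05
fdp_sum = 0.0
for _ in range(reps):
    # Group 0 is null (no effect); group 1 has a unit treatment effect.
    groups = [[random.gauss(0, 1) for _ in range(50)],
              [random.gauss(1, 1) for _ in range(50)]]
    rejected = bh_reject([sign_p(g) for g in groups], q)
    if rejected:
        fdp_sum += (0 in rejected) / len(rejected)
fdr_hat = fdp_sum / reps
print(round(fdr_hat, 4))  # Monte Carlo FDR estimate, below the nominal q
```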
The design above does not introduce any hidden bias into our simulation study; thus the true Γ = 1 and there is a treatment effect. The justification for considering this favorable situation for the power computation of a sensitivity analysis is explained in Hansen, Rosenbaum and Small (2014, Section 3) and we review it here. Even though in practice we cannot know whether we are in the favorable situation, by computing the power in this situation we are assessing the ability of our analyses to discriminate between two situations in which we know unambiguously the desired result of the sensitivity analyses. In one situation, with moderate bias and no treatment effect, we expect that any association between treatment and outcome can be explained by a bias of magnitude at most Γ, and by construction there is a risk of at most α of reporting otherwise. In the second situation, when there is no bias and there is a treatment effect, we hope to reject the null hypothesis. On the other hand, if we were considering a situation with large bias in treatment assignment and a small treatment effect, so that rejection of the null is nearly assured for all small or moderate Γ, then we would not be pleased to reject the null for small or moderate Γ, because we know we would also have rejected the null in this situation had it been true.

[Table 1: There are two covariates; only one is associated with the treatment effect. This covariate is distributed as multinomial on 5 levels. The summary is based on 2,000 simulations of random treatment assignment; for covariate value x the response difference is simulated from N(μ x , 1). μ is the vector of length 5 of μ x values.]

As seen in Table 1, when there is no or very mild effect modification, e.g., in the first three scenarios A, B and C, the two methods have a similar level of sensitivity to hidden bias. Here we say that the test is insensitive at level Γ (≥ 1) if it rejects any of the terminal node hypotheses at that level of Γ. But when there is strong effect modification of the covariate, the BH procedure is insensitive up to a much higher level of hidden bias than closed testing. For example, in scenario E-4 the average numbers of rejections by closed testing and BH at Γ = 3 are 0.66 and 3, respectively; at Γ = 20 they are 0 and 0.66, respectively.

[Table 2: Average number of terminal nodes (numbers in parentheses give the interquartile range), number of rejections, FDR, FNR, and power of closed testing and the BH procedure. There are two covariates; only one is associated with the treatment effect. This covariate is distributed as multinomial on 10 levels. The summary is based on 2,000 simulations of random treatment assignment; for covariate value x the response difference is simulated from N(μ x , 1). μ is the vector of length 10 of μ x values.]

FDR control is particularly useful when we expect a large number of groups. We carry out two sets of simulation studies to examine false discovery rate control with 10 groups and 20 groups. The simulation design is as before, with I = 1,000 pairs and equal randomized treatment assignment. There are two covariates, one distributed as multinomial with 10 and 20 labels for the two studies respectively, and the other Bernoulli. The difference in response for paired individuals is modeled as normally distributed with unit variance and mean μ(x), where x is the label of the multinomial covariate. Table 2 and Table 3 report results of simulation studies based on this design. The conclusions from Table 2 and Table 3 are consistent with those of the previous simulation. FDR is controlled throughout, and when there are no effects (first row of Table 2 and Table 3) the FDR reaches its nominal level under the BH procedure while closed testing is conservative. The power gain of the BH procedure over closed testing can be seen in the estimated FNR values in Table 2. As an example, in the last scenario at Γ = 5 the FNR is 45% for closed testing compared to 11% for the BH procedure, and in the same scenario, on average, closed testing fails to reject 62% of the pairs with a treatment effect as compared to 11% for the BH procedure. Finally, the BH procedure is much less sensitive to hidden bias than closed testing. For example, in the last scenario of Table 2 closed testing is insensitive up to hidden bias level Γ = 5, whereas the BH procedure is insensitive up to Γ = 15.
The setting of Table 3 does not include closed testing results: with 20 groups the closed testing procedure performs 2^20 comparisons, which requires an unmanageable amount of computing resources. The number of rejected nodes, the FNR and the percentage of pairs with an effect that are rejected in Table 3 show the considerable power of the proposed procedure.
We summarize our simulation results here. We considered various scenarios of effect modification along with different numbers of subgroups. While both closed testing and our procedure control the FDR at the nominal level, our sensitivity analysis BH procedure is much less sensitive to hidden bias than closed testing. In terms of power, the BH procedure had higher simulated power than closed testing in all the measures we considered in our comparisons. With a large number of subgroups closed testing becomes computationally infeasible, as its complexity is exponential in the number of subgroups, whereas the BH procedure has computational complexity at most quadratic in the number of subgroups.
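The computational contrast can be made concrete with a minimal closed testing sketch (our own illustration in Python; for simplicity we use Fisher's combination as the intersection test, whereas the paper's procedure combines sensitivity-analysis p-values): closed testing must test all 2^G − 1 nonempty intersection hypotheses, while the BH procedure only sorts G p-values.

```python
import math
from itertools import combinations

def fisher_pvalue(pvals):
    """Fisher's combination: -2 * sum(log p) is chi-square with 2m degrees of
    freedom under the null; for even df the survival function has a closed form."""
    x = -2.0 * sum(math.log(p) for p in pvals)
    m = len(pvals)
    return math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j) for j in range(m))

def closed_testing(pvals, alpha=0.05):
    """Reject H_g iff every intersection hypothesis containing g is rejected
    at level alpha. Loops over all 2^G - 1 nonempty subsets: exponential in G."""
    G = len(pvals)
    rejected = [True] * G
    for m in range(1, G + 1):
        for S in combinations(range(G), m):
            if fisher_pvalue([pvals[i] for i in S]) > alpha:
                for i in S:
                    rejected[i] = False
    return rejected
```

With G = 5 this loop visits 31 subsets; with G = 20 it would visit 1,048,575, which is why the 20-group simulation omits closed testing.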

Results for study of the effect of smoking on lead in the blood
We return to our motivating example and analyze the data on 1,485 matched pairs of individuals for the effect of smoking on the lead level in the blood. As described in Section 1.1, we used a data-based method consistent with Condition A to derive five mutually exclusive and exhaustive subgroups of the pairs, as described in Section 2.3. These subgroups contain 281, 90, 441, 283 and 390 pairs of subjects, respectively. Figure 1 shows the average of the ranks of the absolute response differences in the terminal nodes.
We aim to assess whether smoking increases the level of lead in the blood. We use a Huber–Maritz M-statistic with the specifications given in Section 1.1. In the notation of Section 2.1, for an odd function $\psi(\cdot)$ with $\psi(0) = 0$, a Huber–Maritz M-statistic for group $G$ is of the form $\sum_{i \in G} \psi(y_i/s_G)$, where $y_i$ is the difference in response between treatment and control for matched pair $i$ and $s_G$ is the median of the absolute differences in response for the matched pairs in $G$. In our analysis we take $\psi(y) = \mathrm{sign}(y)\,(\min(|y|, 1.5) - 0.1)_{+}$, where $(a)_{+} := \max(a, 0)$. For more discussion of M-statistics and related sensitivity analysis procedures see Rosenbaum (2007b, 2014).
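This statistic transcribes directly into code. The sketch below (function names are ours; the constants 1.5 and 0.1 are the ones quoted above) computes the M-statistic for one group:

```python
import numpy as np

def psi(y, trunc=1.5, inner=0.1):
    """Huber-Maritz psi: odd, zero for |y| <= inner, constant for |y| >= trunc."""
    y = np.asarray(y, float)
    return np.sign(y) * np.maximum(np.minimum(np.abs(y), trunc) - inner, 0.0)

def m_statistic(diffs):
    """M-statistic for one group: scale the treated-minus-control differences
    by their median absolute value, then sum psi of the scaled differences."""
    d = np.asarray(diffs, float)
    s = np.median(np.abs(d))
    return float(psi(d / s).sum())
```

Scaling by the median absolute difference makes the statistic insensitive to the measurement scale, and the inner trimming at 0.1 discards pairs with near-zero differences.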
We first consider the global Fisher null hypothesis of no treatment effect for any subject. In Section 1.1 we noted that the test that combines all the pairs without consideration of effect modification yields a rejection decision that is sensitive to hidden bias of Γ = 2.6. Using the group information, the second-to-last column of Table 4 reports the truncated product p-values $\bar p_{\Gamma,\wedge}$, with truncation point $\tilde\alpha = 0.20$, for the global null at different levels of hidden bias. These are computed using the truncatedP function of the sensitivitymv package in R. The decision to reject the global null based on truncated product p-values is sensitive to hidden bias of Γ = 2.8.
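The truncated product statistic multiplies only the p-values at or below the truncation point; truncatedP computes its null distribution analytically, but the idea can be sketched with a Monte Carlo null (our own illustration in Python; the function name and the 20,000-draw default are our choices):

```python
import numpy as np

def truncated_product_pvalue(pvals, trunc=0.20, n_mc=20000, seed=0):
    """Monte Carlo p-value for the truncated product statistic: the product of
    the p-values that are <= trunc (a smaller product is stronger evidence).
    Work on the log scale to avoid underflow; the product is 1 (log 0) when
    no p-value falls below the truncation point."""
    rng = np.random.default_rng(seed)
    p = np.asarray(pvals, float)
    obs = np.log(np.where(p <= trunc, p, 1.0)).sum()
    u = rng.uniform(size=(n_mc, p.size))
    null = np.log(np.where(u <= trunc, u, 1.0)).sum(axis=1)
    return (1 + np.count_nonzero(null <= obs)) / (1 + n_mc)
```

Unlike Fisher's unrestricted product, truncation keeps large, uninformative p-values from diluting the evidence carried by the small ones.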
Simes' test (Simes, 1986) can also be used to test the global null hypothesis of no treatment effect. It rejects the global null if and only if the BH procedure rejects at least one of the group hypotheses. To simplify notation, suppose the groups are ordered by their p-values, $\bar p_{\Gamma_g, G_g} \le \bar p_{\Gamma_{g'}, G_{g'}}$ for $1 \le g < g' \le G$. Simes' p-value is $\min_{g \ge 1} G\,\bar p_{\Gamma_g, G_g}/g$. Simes' p-value is valid if it combines individual p-values that are independent or have PRDS dependence (Benjamini and Yekutieli, 2001). To see this, note that $\{\bar p_{\Gamma_g, G_g} : g = 1, 2, \ldots, G\}$ conditional on (F, Z, I) is an independent collection by Lemma 1, and by Lemma 2 each $\bar p_{\Gamma_g, G_g}$ is stochastically larger than the uniform distribution on [0, 1]. Conditional on (F, Z, I), the validity of Simes' p-value then follows from Simes (1986).

Table 4: Maximum of p-values for varied hidden bias levels.

The last column of Table 4 reports the Simes p-values. When Γ_g = 2 for all groups, Simes' p-value is 1.0 × 10^{-3}. Using Simes' p-values, the decision to reject the global hypothesis is sensitive at hidden bias level Γ_g = 3.5 for each group. Thus, consideration of effect modification has provided evidence that is much less sensitive to bias than not considering effect modification: Γ = 3.5 compared to Γ = 2.6 (from Section 1.1).

We now consider testing the null hypothesis of no treatment effect for each group. Table 4 shows the maximum possible values of the p-values at different levels of hidden bias. Under the assumption of no hidden bias, the five groups have p-values 1.2 × 10^{-13}, 5.5 × 10^{-10}, < 9.99 × 10^{-16} (reported as 0), 6.3 × 10^{-11} and 2.2 × 10^{-16}, supporting the hypothesis that smoking causes an increase in the lead level in the blood. Individually, these inferences are sensitive at hidden bias levels of 2.5, 4, 2.5, 2.2 and 2.5 for the five groups. This sensitivity analysis is not multiplicity corrected. Table 5 presents the results of two multiplicity-corrected analyses: the BH procedure, which controls the FDR, and the closed testing procedure, which controls the FWER (Hsu et al., 2015). The BH procedure is less sensitive to bias than the closed testing procedure.
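Returning to Simes' combination used above for the global null, the computation is a one-liner (a Python sketch, ours):

```python
import numpy as np

def simes_pvalue(pvals):
    """Simes (1986): with ordered p-values p_(1) <= ... <= p_(G), the combined
    p-value is min over g of G * p_(g) / g (always <= 1, since the g = G term
    is at most p_(G))."""
    p = np.sort(np.asarray(pvals, float))
    G = p.size
    return float(np.min(G * p / np.arange(1, G + 1)))
```

Rejecting the global null when this value is at most α is equivalent to the BH procedure at level α rejecting at least one group hypothesis.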
For example, if the hidden bias in treatment assignment throughout the data is at most Γ = 2.8, then the BH procedure still finds an effect of smoking causing an increase in the lead level in the blood for females between ages 41 and 47 years (G_2), while the closed testing procedure fails to reject the null hypotheses for all the groups. For closed testing, the smallest Γ at which all five hypotheses are sensitive to bias is Γ = 2.6, compared to Γ = 3.5 for the BH procedure.

Discussion
In this paper we investigated a sensitivity analysis with false discovery rate control for testing the hypothesis of no treatment effect under possible effect modification. Effect modification is the variation of the magnitude of the treatment effect with pre-treatment covariates. Considering effect modification in practice can lead to stronger conclusions about the treatment effect. In the absence of prior knowledge about which covariates to consider as potential effect modifiers, we learn which potential effect modifiers to test from the data. Our main theoretical result says that under appropriate restrictions on the grouping method we can guarantee FDR control. In our simulation studies and data analysis, we used the CART algorithm to construct groups of matched treatment-control pairs from the data. The interpretability of the regression tree makes it a promising choice. There have also been newer proposals for interpretable classifiers (see e.g., Letham et al. (2015)); such algorithms can be used in practice as well. In the presence of effect modification one can consider different levels of hidden bias for different groups. In our simulations the FDR-controlling method shows more power than closed testing. This is to be expected, since FDR-controlling procedures typically have greater power than FWER-controlling procedures. When there are many different levels of effect modification, leading to a large number of groups, FDR-controlling procedures can have a large power advantage over FWER-controlling procedures. The method discussed in this article can potentially be applied to various situations involving historical controls. For example, it can assist in planning a phase III clinical trial based on single-arm phase II trial results. After a preliminary safety assessment of a new treatment in phase II, data are collected to understand the effectiveness of the treatment before conducting a full-scale phase III trial.
Often phase II trials are single-arm studies with no control group, and patients receiving the treatment are compared to historical controls. A subgroup analysis with sensitivity analysis in phase II can be a step towards planning a phase III trial. The sensitivity analysis discussed in this article would point to signals for the groups of patients for which the treatment possibly has different levels of effect, or no beneficial effect at all. These signals can be used to introduce blocking variables or to develop enrichment strategies in designing a phase III trial (Freidlin and Korn, 2014). In another application, the method can enhance comparison studies against historical data. For example, Sammarco et al. (2016) used historical control data from the NHANES database to assess the harmful effect of exposure to crude oil on petroleum hydrocarbon concentrations in the blood. The treatment group consisted of people who came in contact with crude oil in the 2010 BP/Deepwater Horizon oil spill. The poor power of that study, particularly for long-chain hydrocarbons, could be improved by a subgroup-based sensitivity analysis, since person-to-person variability in the effects of exposure to crude oil may be high and may be affected by age, background and smoking pattern.

Appendix: Proof of Theorem 1
Proof. Let $C^{(-G_g)}_r$ be the event that $r$ groups are rejected along with $G_g$. The proof follows that of Benjamini and Yekutieli (2001). Writing $T$ for the set of indices of the true null hypotheses, the FDR is
$$\mathrm{FDR} = \sum_{g \in T} \sum_{r=1}^{G} \frac{1}{r} \Pr\Big(\bar p_{\Gamma, G_g} \le \frac{r\alpha}{G},\; C^{(-G_g)}_r\Big) = \sum_{g \in T} \sum_{r=1}^{G} \frac{1}{r} \Pr\Big(\bar p_{\Gamma, G_g} \le \frac{r\alpha}{G}\Big) \Pr\big(C^{(-G_g)}_r\big) \le \sum_{g \in T} \sum_{r=1}^{G} \frac{\alpha}{G} \Pr\big(C^{(-G_g)}_r\big) = \frac{|T|\,\alpha}{G} \le \alpha.$$
The second identity follows from Lemma 1. The first inequality follows from Lemma 2 and the fact that $\bar p_{\Gamma, G_g}$, being an upper bound on the true p-value $p_{\Gamma, G_g}$, is stochastically larger than the uniform distribution on [0, 1], i.e., $\Pr(\bar p_{\Gamma, G_g} \le a) \le \Pr(p_{\Gamma, G_g} \le a) \le a$. The final identity follows since the events $C^{(-G_g)}_r$, $r = 1, \ldots, G$, are disjoint and exhaustive, so that $\sum_{r=1}^{G} \Pr(C^{(-G_g)}_r) = 1$.