Interactive identification of individuals with positive treatment effect while controlling false discoveries

Out of the participants in a randomized experiment with anticipated heterogeneous treatment effects, is it possible to identify which subjects have a positive treatment effect? While subgroup analysis has received attention, claims about individual participants are much more challenging. We frame the problem in terms of multiple hypothesis testing: each individual has a null hypothesis (stating that the potential outcomes are equal, for example) and we aim to identify those for whom the null is false (the treatment potential outcome stochastically dominates the control one, for example). We develop a novel algorithm that identifies such a subset, with nonasymptotic control of the false discovery rate (FDR). Our algorithm allows for interaction -- a human data scientist (or a computer program) may adaptively guide the algorithm in a data-dependent manner to gain power. We show how to extend the methods to observational settings and achieve a type of doubly-robust FDR control. We also propose several extensions: (a) relaxing the null to nonpositive effects, (b) moving from unpaired to paired samples, and (c) subgroup identification. We demonstrate via numerical experiments and theoretical analysis that the proposed method has valid FDR control in finite samples and reasonably high identification power.


Introduction
Subgroup identification, or identifying subgroups of the population that have some positive response to a treatment, has been a major topic in the clinical trial community and the causal inference literature (see Lipkovich et al. [1], Powers et al. [2], Loh et al. [3] and references therein). Typically, the treatment effect in the investigated population varies with the subject's gender, age, and other covariates. Identifying subjects with positive effects can help guide follow-up research and provide medical guidance. However, most existing methods do not have an error control guarantee at the level of the individual: it is possible that most subjects in the identified subgroup do not have positive effects. For example, an identified subgroup could be "female subjects younger than 40", which typically means that the average treatment effect is positive in this subgroup, yet perhaps only the 10% of them aged between 18 and 20 truly have a positive treatment effect.
This paper considers the identification of positive effects at an individual level: among the participants in a trial, which ones have a treatment potential outcome larger than their control potential outcome (we call this the subject's treatment effect)? This is a hard question to answer in general, because we only observe one or the other potential outcome. But we will give a nontrivial answer that may be powerless in the worst case, yet powerful when the covariates are informative about the treatment effects, while never making too many false claims. For each subject, the available data include the treatment assignment, covariates, and observed outcome. We will identify a set of subjects as having positive effects, with control on the error of false identification, and without any modeling assumption on how the (treated/control) outcomes vary with covariates. Below is an example to visualize the problem and prepare for our solution. (Footnote 1: BD now works at Google LLC.)

An illustrative example
Consider an experiment involving 500 subjects, each assigned to treatment or control independently with probability 1/2. Suppose each subject is associated with two simple covariates, which are independently and uniformly distributed in [0, 1]. All the data are shown in Figure 1, where the observed outcomes are in Figure 1b, with darker color indicating a higher outcome. Readers may have an intuitive guess about our question of interest, "which subjects could have a positive treatment effect if treated?", such as the subjects in the top right corner. Yet we note that the underlying ground truth can be counter-intuitive, while still being mostly correctly captured by our proposed algorithm (results in Section 1.4). Before showing the results, we formalize our question of interest in the next section.

[Figure 1: (a) Subjects in the treated group and control group, separated by two shapes and colors. (b) Observed outcomes, with darker color indicating a higher outcome.] We hope to answer the question: which individuals have a positive treatment effect?
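For concreteness, this example's setup can be simulated as below. The outcome model here is only a hypothetical stand-in (the exact data-generating process is not specified in this section), chosen to match the later description that control outcomes rise toward the top right corner while the true positive effects sit near the center of the covariate square:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(0.0, 1.0, size=(n, 2))     # two covariates, uniform on [0, 1]
A = rng.integers(0, 2, size=n)             # Bernoulli(1/2) treatment assignment

# Hypothetical ground truth: control outcome is higher toward the top-right
# corner, while the treatment effect is positive only near the center.
y_control = X[:, 0] + X[:, 1] + rng.normal(0.0, 0.25, size=n)
effect = np.where(np.linalg.norm(X - 0.5, axis=1) < 0.3, 1.0, 0.0)
y_treated = y_control + effect

Y = np.where(A == 1, y_treated, y_control)  # only one potential outcome observed
```

Under this stand-in process, subjects in the top-right corner have high observed outcomes whether or not they are treated, which is exactly the kind of confounding pattern the proposed algorithm must see through.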

Problem setup
Suppose we have n subjects in the dataset. Each subject i has potential control outcome Y_i^C, potential treated outcome Y_i^T, and treatment indicator A_i, for i ∈ [n] ≡ {1, 2, ..., n}. Our results allow the potential outcomes to be viewed either as random variables or as fixed. The treatment effect of subject i is defined as ∆_i := Y_i^T − Y_i^C, and the observed outcome is Y_i := A_i Y_i^T + (1 − A_i) Y_i^C. Person i's covariate is denoted as X_i. We first focus on Bernoulli randomized experiments without interference: (i) conditional on covariates, treatment assignments are independent fair coin flips:

P(A_1 = a_1, ..., A_n = a_n | X_1, ..., X_n) = (1/2)^n,    (1)

for any (a_1, ..., a_n) ∈ {0, 1}^n. The above setting is later extended to observational studies where the probabilities of receiving treatment can be heterogeneous and possibly unknown.
(ii) conditional on covariates, the outcome of one subject Y_i is independent of the assignment A_j of any other subject:

Y_i ⊥ A_j | X_1, ..., X_n, for any i ≠ j,    (2)

which is implied by (1) when the potential outcomes are viewed as fixed values.
We do not assume the observed data (Y_i, A_i, X_i) are identically distributed. We consider heterogeneous effects in the sense that the distribution of Y_i^T − Y_i^C varies across subjects, and aim to identify those individuals with a positive treatment effect. (If the covariates X_i are not informative about the heterogeneity in Y_i^T − Y_i^C, our identification power could be low, and this is to be expected.) We choose to formalize and frame the problem in terms of multiple hypothesis testing, by first defining the null hypothesis for subject i as having zero treatment effect:

H^zero_0i : (Y_i^T | X_i) and (Y_i^C | X_i) are equal in distribution,    (3)

or equivalently, H^zero_0i : Y_i^T = Y_i^C if the potential outcomes are fixed.² An extension is introduced in Appendix A.1, where we relax the null to those with a nonpositive effect, defined by stochastic dominance (or Y_i^T ≤ Y_i^C if the potential outcomes are fixed). Our algorithms control the error of falsely identifying subjects whose null hypothesis is true (i.e., having zero effect), and aim at correctly identifying subjects with positive effects. Let ≻ denote stochastic dominance, as above. We say a subject has a positive effect if

(Y_i^T | X_i) ≻ (Y_i^C | X_i).    (4)

When treating the potential outcomes and covariates as fixed, we simply write Y_i^T > Y_i^C. The output of our proposed algorithms is a set of identified subjects, denoted R, with a guarantee that the expected proportion of falsely identified subjects is upper bounded. Specifically, denote the set of subjects that are true nulls as H_0 := {i ∈ [n] : H^zero_0i is true}. Then the number of false identifications is |R ∩ H_0|, and the expected proportion of false identifications is a standard error metric, the false discovery rate (FDR):

FDR := E[ |R ∩ H_0| / max(|R|, 1) ].    (5)

Given α ∈ (0, 1), we propose algorithms that guarantee FDR ≤ α and have reasonably high power, defined as the expected proportion of correctly identified subjects:

Power := E[ |R ∩ Pos| / max(|Pos|, 1) ],

where Pos := {i ∈ [n] : (Y_i^T | X_i) ≻ (Y_i^C | X_i)} is the set of subjects with positive effects.
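The two error/power metrics just defined are simple to evaluate once the ground truth is known (e.g., in simulations); a minimal sketch, with function names that are ours rather than the paper's:

```python
def fdp(rejected, true_nulls):
    """False discovery proportion |R ∩ H0| / max(|R|, 1); the FDR is
    its expectation over repeated experiments."""
    rejected, true_nulls = set(rejected), set(true_nulls)
    return len(rejected & true_nulls) / max(len(rejected), 1)

def power(rejected, positives):
    """Proportion of subjects with truly positive effects that are
    identified: |R ∩ Pos| / max(|Pos|, 1)."""
    rejected, positives = set(rejected), set(positives)
    return len(rejected & positives) / max(len(positives), 1)
```

For instance, rejecting {1, 2, 3, 4} when the true nulls are {3, 4, 5} gives a false discovery proportion of 2/4 = 0.5.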

Related problem: error control in subgroup identification
We note that our problem setup is not exactly the same as most work in subgroup identification, such as Foster et al. [5], Zhao et al. [6], Imai and Ratkovic [7]. The identified subgroups are usually defined by functions of covariates, rather than a subset of the investigated subjects as in our paper. While defining the subgroup by a function of covariates makes it easy to generalize the finding in the investigated sample to a larger population, it does not seem straightforward to nonasymptotically control the error of false identifications under the former definition, which is a major distinction between previous studies and our work. Most existing work does not have an error control guarantee (see an overview in Lipkovich et al. [1], Table XV), except for a few discussing error control at the level of subgroups, as opposed to the level of individuals in our paper. The difference between FDR control at a subgroup level and at an individual level is detailed below.

(Footnote 2: Alternatively, we can treat the potential outcomes and covariates as fixed, and frame the null hypothesis as H^zero_0i : Y_i^T = Y_i^C. A last, hybrid, version (e.g., Howard and Pimentel [4]) is to treat the two potential outcomes as random with joint distribution (Y_i^T, Y_i^C) | X_i ∼ P_i, and the null posits H^zero_0i : Y_i^T = Y_i^C almost surely-P_i, meaning that P_i is supported on {(x, y) : x = y}. All our theoretical results work with any interpretation, but we stick to (3) by default.)

Subgroup FDR control
Karmakar et al. [8], Gu and Shen [9], and Xie et al. [10] discuss FDR control at a subgroup level, where the latter two have little discussion on incorporating continuous covariates and require parametric assumptions on the outcomes. Thus, we follow the setup in Karmakar et al. [8] to compare FDR control at a subgroup level (their paper) and at an individual level (our paper). Let the subgroups be non-overlapping sets {G_1, ..., G_G}. The null hypothesis for a subgroup G_g is defined as

H_0g : H^zero_0i is true for every i ∈ G_g,

or equivalently, H_0g : G_g ⊆ H_0 (recall that H_0 is the set of subjects with zero effect). Let D_g be the 0/1-valued indicator for whether subgroup G_g is identified or not. The FDR at a subgroup level is defined as the expected proportion of falsely identified subgroups:

E[ Σ_g D_g 1{H_0g is true} / max(Σ_g D_g, 1) ],

which collapses to the FDR at an individual level as defined in (5) when each subgroup has exactly one subject. Although our interactive procedure is designed for FDR control at an individual level, we propose extensions to FDR control at a subgroup level in Section 6. As a brief summary, Karmakar et al.
[8] propose to control subgroup-level FDR by constructing a p-value for each subgroup and applying the classical BH method [11]. However, this is not trivially applicable to FDR control at an individual level, because their p-values would only take the values 1/2 or 1 when each subgroup has exactly one subject, leading to zero identification power under either the classical BH procedure or the more recent AdaPT framework of Lei and Fithian [12]. In other cases, where subgroups have more than one subject, the above error control does not indicate whether subjects within a rejected subgroup are mostly non-nulls, or whether many are nulls with zero effect. Such error control can be too weak to effectively tell apart most subgroups. Our paper appears to be the first to propose methods for identifying subjects having positive effects with (finite-sample) FDR control. Individual-level inferences are more fine-grained: (a) researchers can proceed with a follow-up analysis on identified individuals whom they strongly believe benefit from the treatment; and (b) the identified individuals can have more trust in the treatment; as an example, in industry one can recommend a new product to identified customers with much higher confidence. Importantly, the identified individuals need not have been originally treated, as seen in our representative example (Figures 1, 3b). We end by noting that we also propose an extension of our method for subgroup identification, and compare it to [8] in Section 6.
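To make the contrast between the two error levels concrete, the subgroup-level false discovery proportion can be sketched as follows (a subgroup rejection is false only when every member is a true null; the function name is ours):

```python
def subgroup_fdp(identified_groups, groups, true_nulls):
    """Subgroup-level false discovery proportion: a rejected subgroup g
    counts as a false discovery only when G_g ⊆ H_0, i.e., every subject
    in it is a true null (H_0g)."""
    true_nulls = set(true_nulls)
    false_hits = sum(1 for g in identified_groups
                     if set(groups[g]) <= true_nulls)
    return false_hits / max(len(identified_groups), 1)
```

Note how weak this can be: a subgroup containing a single non-null subject is not a false discovery, even if all its remaining members are nulls, which is precisely the gap between subgroup-level and individual-level control discussed above.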

Other related error control at a subgroup level
Cai et al. [13] and Athey and Imbens [14] develop confidence intervals for the average treatment effect within subgroups, where the former assumes the size of each subgroup to be large, and the latter requires a separate sample for inference. These intervals can potentially be used to generate a p-value for each subgroup and control FDR at a subgroup level via standard multiple testing procedures, but no explicit discussion is provided. Lipkovich et al. [15], Lipkovich and Dmitrienko [16], Sivaganesan et al. [17], and Berger et al. [18] propose methods that control a different error metric: the global type-I error, which is the probability of identifying any subgroup when no subject has a nonzero treatment effect (i.e., H^zero_0i is true for all subjects). Our FDR control guarantee implies a valid global type-I error, and FDR control is more informative about the correctness of the identified subgroups/subjects when there exist subjects with nonzero effects.

An overview of our procedure
As discussed, it appears to be new and practically interesting to provide FDR control guarantees at an individual level. Another merit of our proposed method is that it allows a human analyst and an algorithm to interact, in order to better accomplish this goal.
Interactive testing is a recent idea that emerged in response to the growing practical need to allow human interaction in the process of data analysis. In practice, analysts tend to try several methods or models on the same dataset until the results are satisfying, but this violates the validity of standard testing methods (e.g., invalidating FDR control). In our context of identifying positive effects, the appealing advantages of an interactive test are that (a) an analyst is allowed to use (partial) data, together with prior knowledge, to design a strategy for selecting subjects that potentially have positive effects, and (b) it is a multi-step iterative procedure during which the analyst can monitor the performance of the current strategy and adjust the selection strategy at any step (at the cost of not altering earlier steps). Despite the analyst's flexibility to design and alter the algorithm using (partial) data, our proposed procedure always maintains valid FDR control. We name our proposed algorithm I^3 (I-cube), for interactive identification of individual treatment effects.
Figure 2: A schematic of the I^3 algorithm. All treatment assignments are initially kept hidden: only (Y_i, X_i)_{i∈[n]} are revealed to the analyst, while all {A_i} remain 'masked'. The initial candidate rejection set is R_0 = [n] (thus no subject is excluded initially and i*_0 = ∅). The false discovery proportion of the current candidate set R_t is estimated by the algorithm as FDR(R_t) (dashed lines) and reported to the analyst. If FDR(R_t) > α, the analyst chooses a subject i*_t to remove from the proposed rejection set, R_t = R_{t−1} \ {i*_t}, whose assignment A_{i*_t} is then 'unmasked' (revealed). Importantly, using any available prior information, covariates, and working model, the analyst can choose subject i*_t and shrink R_t in any manner. This process continues until FDR(R_t) ≤ α (or R_t = ∅).
The core idea that enables human interaction is to separate the information used for selecting subjects with positive effects from that used for error control, via "masking and unmasking" (Figure 2). In short, masking means we hide the treatment assignments {A_i}_{i=1}^n from the analyst. The algorithm alternates between two steps, selection and error control, until a simple stopping criterion (introduced later) is reached. Below is an intuitive description of the framework; we give the full method in Section 2.
1. Selection. Consider a set of candidate subjects to be identified as having a positive effect (whose nulls are to be rejected), denoted as the rejection set R_t at iteration t. We start with all subjects included. At each iteration, the analyst excludes possible nulls (i.e., subjects that are unlikely to have positive effects) from the previous R_{t−1}, using all the available information: outcomes Y_i and covariates X_i for all subjects i ∈ [n], the assignments A_i progressively unmasked during the error-control step, and possibly prior information. Note that our method does not automatically use prior information and the revealed data: the analyst is free to apply any black-box prediction algorithm to the available information, for instance evaluating subjects by an estimated probability of having a positive treatment effect. This step is where a human is allowed to incorporate her subjective choices.

2. Error control (and unmasking). The algorithm uses the complete data {Y_i, A_i, X_i} to estimate the FDR of the current candidate rejection set, FDR(R_t), as feedback to the analyst. If the estimated FDR is above the target level, FDR(R_t) > α, the analyst goes back to the selection step with additional information: the excluded subjects (i ∉ R_t) have their A_i unmasked (revealed), which can improve her understanding of the data and guide her choices in the next selection step.
The algorithms we propose in the main paper build on and modify the above procedure to achieve reasonably high power and develop various extensions.
Recall our illustrative example in Section 1.1; the underlying ground truth and the identifications made by Crossfit-I^3 (our central algorithm) are in Figure 3. Although the observed outcomes tend to be higher for subjects in the top right corner, the subjects with true positive effects are in the center. This discrepancy arises because the control outcome varies with covariates (as can often happen in practice) and is designed to be higher in the top right corner, so that those subjects' observed outcomes can be high regardless of whether the treatment has any effect. Nonetheless, our proposed algorithm can distinguish high observed outcomes caused by high control outcomes from those caused by positive treatment effects, and correctly identifies most true positive effects.

[Figure 3: (a) Blue round dots represent subjects with true positive effects (unknown ground truth). (b) Our algorithm identifies most subjects with positive effects, although about half of them did not receive treatment and their potential treated outcomes were not observed.]

Related work in testing
Testing procedures that allow human interaction were first proposed by Lei and Fithian [12] and Lei et al. [19] for the problem of FDR control in multiple testing, followed by several works for other error metrics, such as Duan et al. [20] and Duan et al. [21]. These papers focus on generic multiple testing problems, which operate on p-values and ignore the process of generating p-values from data. In contrast, this paper applies the idea of interactive testing directly to the observed data and proposes tests in the context of causal inference for treatment effects. We are not aware of p-values for individual identification in causal inference that are non-binary and impose no model assumption. An example of a binary p-value is 1(∆̂_i ≤ 0), where ∆̂_i can be viewed as a treatment effect estimator as defined in (8); this p-value satisfies the mirror-conservative property in Lei and Fithian [12]. However, the masking framework then results in trivial guesses, because each masked p-value takes a value in {0, 1} for every individual. We view our contribution as (1) formally framing the question of selecting individuals with positive effects as a multiple testing problem with FDR control; and (2) overcoming the challenge of designing a p-value for the individual zero-effect hypothesis without any model assumption, by connecting the key property that enables the masking framework in Lei and Fithian [12] (independence between two functions of a single numeric p-value) with the key observation of our paper (independence between the treatment assignment A_i and the vector of outcome and covariates {Y_i, X_i}).
Interactive tests originally stem from the knockoff method for regression by Barber and Candès [22], which relies on the symmetry of carefully designed test statistics. Here, we construct symmetric statistics in the framework of causal inference.

Paper Outline
The rest of the paper is organized as follows. In Section 2, we describe an interactive algorithm wrapped in a cross-fitting framework, which identifies subjects with positive effects under FDR control. We evaluate our proposed algorithm numerically in Section 3, and provide a theoretical power analysis in simple settings in Section 4. Section 5 presents an extension to the setting of observational studies, and we apply the proposed algorithm to real datasets in Section 8. More extensions can be found in Appendix A. Section 9 concludes the paper with a discussion of the potential of our proposed interactive procedures.

An interactive algorithm with FDR control
To enable us to effectively infer the treatment effect, we use the following working model:

Y_i = m(X_i) + (A_i − 1/2) ∆(X_i) + U_i,    (7)

where U_i is zero-mean noise (unexplained variance) that is independent of A_i. When working with such a model, we effectively want to identify subjects with a positive treatment effect ∆(X_i). Importantly, model (7) need not be correctly specified or accurately reflect reality in order for the algorithms in this paper to have valid FDR control. However, the more ill-specified or inaccurate the model is, the more power may be hurt; related numerical experiments are included in Appendix D.
To identify subjects with positive effects, we first introduce an estimator of the treatment effect ∆(X_i) following the working model (7). Denote the expected outcome given the covariates as m(X_i) := E(Y_i | X_i), and let m̂(X_i) be an arbitrary estimator of m(X_i) using the outcomes and covariates {Y_i, X_i}_{i=1}^n. Define the residual as E_i := Y_i − m̂(X_i); an estimator of ∆(X_i) is then

∆̂_i := 2(2A_i − 1) E_i,    (8)

which, under randomized experiments, is equivalent to the nonparametric estimator of the conditional average treatment effect [23, 24], and can be traced back to the semiparametrics literature with Robinson [25]. A critical property of ∆̂_i that later leads to FDR control is that

P(∆̂_i > 0 | {Y_j, X_j}_{j∈[n]}, {A_j}_{j≠i}) = 1/2 for every true null i.    (9)

With the rigorous argument for validity in Appendix B.1, we briefly describe the reasoning here: the conditioning in (9) corresponds to the information used for selecting subjects (recall Figure 2), because the treatment assignments A_i are hidden for candidates in R_t and the assignments are mutually independent. The above property indicates that the estimated effect ∆̂_i is no more likely to be positive than negative when the selected subject has zero effect, regardless of how the analyst decides which subject to select. Therefore, the sign of ∆̂_i can be used to estimate the number of false identifications and achieve FDR control, which we elaborate next.
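As a sanity check on this sign-symmetry property, the sketch below computes the residuals and the flipped estimate, then verifies empirically that under a global null (outcomes independent of assignments) roughly half of the ∆̂_i come out positive. The exact form ∆̂_i = 2(2A_i − 1)E_i is our reading of the estimator under the Bernoulli(1/2) design:

```python
import numpy as np

def effect_estimates(Y, A, m_hat):
    """Residuals E_i = Y_i - m_hat_i and the estimate
    Delta_i = 2(2A_i - 1)E_i; the factor 2 rescales the residual
    because treated/control each occur with probability 1/2."""
    E = Y - m_hat
    return E, 2 * (2 * A - 1) * E

# Empirical check of the key property: under a null where Y is
# independent of A, the sign of Delta_i is a fair coin flip.
rng = np.random.default_rng(1)
n = 20000
Y = rng.normal(size=n)                  # null outcomes, unaffected by A
A = rng.integers(0, 2, size=n)
_, delta_hat = effect_estimates(Y, A, np.zeros(n))
frac_positive = np.mean(delta_hat > 0)  # should be close to 1/2
```

Crucially, this symmetry holds regardless of how well m̂ approximates m, which is why a misspecified working model can hurt power but not the FDR guarantee.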

The I 3 algorithm
This section presents the I^3 algorithm and proves that it controls FDR. We introduce a modification based on cross-fitting that improves identification power in the next section.
The I^3 proceeds by progressively shrinking a candidate rejection set R_t at iteration t, starting from R_0 = [n], where [n] denotes the set of all subjects. We assume without loss of generality that one subject is excluded at each step. Denote the subject excluded at iteration t as i*_t. The choice of i*_t can use the information available to the analyst before iteration t, where we follow Lei and Fithian [12] in describing this information formally as a filtration (a sequence of nested σ-fields):

F_{t−1} := σ( {Y_i, X_i}_{i∈[n]}, {A_j : j ∉ R_{t−1}}, Σ_{i∈R_{t−1}} 1{∆̂_i > 0} ),    (10)

where we unmask (reveal) the treatment assignments A_j for subjects excluded from R_{t−1}, and the sum Σ_{i∈R_{t−1}} 1{∆̂_i > 0} is mainly used for FDR estimation as we describe later. The available information includes arbitrary functions of the revealed data, such as the residuals {E_j}_{j=1}^n defined above equation (8). Similar to property (9), for each candidate subject i ∈ R_{t−1}, we have

P(∆̂_i > 0 | F_{t−1}) = 1/2 whenever H^zero_0i is true,    (11)

which ensures the FDR control, as we explain next.
To control the FDR, the number of false identifications is estimated as in (12). The idea is to partition the candidate rejection set R_t into R_t^+ and R_t^− by the sign of ∆̂_i:

R_t^+ := {i ∈ R_t : ∆̂_i > 0},   R_t^− := {i ∈ R_t : ∆̂_i ≤ 0}.

Notice that our proposed procedure only identifies the subjects whose estimated effect is positive, i.e., those in R_t^+. Thus, the FDR is E[ |R_t^+ ∩ H_0| / max(|R_t^+|, 1) ], which we estimate by

FDR(R_t) := (1 + |R_t^−|) / max(|R_t^+|, 1).    (12)

Such FDR estimators, using the counts of individuals with positive or negative estimated effects, stem from Barber and Candès [22], and have been extensively used in the literature on interactive testing, such as Lei and Fithian [12], Lei et al.
[19], and Duan et al. [20, 21]. Overall, the I^3 shrinks R_t until the time τ := inf{t : FDR(R_t) ≤ α} and identifies only the subjects in R_τ^+, as summarized in Algorithm 1. We state the FDR control of I^3 in Theorem 1; the proof can be found in Appendix B.1.
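The shrink-until-τ loop can be sketched as follows, using the estimate from (12); `choose_next` is a hypothetical placeholder for the analyst's (or an automated heuristic's) exclusion rule, which may use anything except the masked assignments of current candidates:

```python
def i3_loop(delta_hat, alpha, choose_next):
    """Skeleton of the I^3 selection / error-control loop.
    `delta_hat[i]` is the estimated effect for subject i; `choose_next`
    maps the current candidate set to the next subject to exclude.
    Stops as soon as (1 + |R^-|) / max(|R^+|, 1) <= alpha."""
    candidates = set(range(len(delta_hat)))
    while candidates:
        r_plus = {i for i in candidates if delta_hat[i] > 0}
        r_minus = candidates - r_plus
        fdr_hat = (1 + len(r_minus)) / max(len(r_plus), 1)
        if fdr_hat <= alpha:
            return r_plus          # identify the positive-sign candidates
        candidates.remove(choose_next(candidates))
    return set()
```

Note that the estimate starts at least at 1/|R^+| because of the "+1" correction, which is what forces the algorithm to return an empty set when effects are too weak to separate from noise.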
Theorem 1. In a randomized experiment with assumptions (1) and (2), and for any analyst who updates their working model(s) at any iteration t using the information in F_{t−1}, the set R_τ^+ rejected by the I^3 algorithm has FDR controlled at level α, meaning that E[ |R_τ^+ ∩ H_0| / max(|R_τ^+|, 1) ] ≤ α for the null hypotheses (3).
Consider a simple case where model (7) is accurate for every subject, with a constant treatment effect ∆(X_i) = δ > 0. If δ is larger than the maximum noise, we have R_0^+ = [n], and the algorithm can stop at the very first step, identifying all subjects. At the other extreme, if the effect δ is too small, the algorithm may return an empty set; this makes sense because, while small average treatment effects can be learned using a large population, larger treatment effects are needed for individual-level identification. We end the section with a remark. In step 8 of Algorithm 1, we hope to exclude subjects that are unlikely to have positive effects, based on the revealed data in F_{t−1}. In other words, we should guess the sign of the treatment effect ∆̂_i, which depends on both the revealed data {Y_i, X_i} and the hidden assignment A_i. However, notice that at the first iteration we may learn/guess the opposite signs for all subjects: when all assignments {A_i}_{i=1}^n are hidden at t = 1, the likelihood of {A_i}_{i=1}^n being the true values (leading to all correct signs for ∆̂_i) is the same as the likelihood of all opposite values (leading to all opposite signs for ∆̂_i), no matter what working model we use. Consequently, the subjects with large positive effects could be guessed as having large negative effects, causing them to be excluded from the rejection set. To improve power, we propose to wrap the I^3 in a cross-fitting framework, as described in the next section.

Improving stability and power with Crossfit-I 3
Cross-fitting refers to the idea of splitting the sample into two halves. We perform the I^3 on each half separately, so that for each half, the complete data (including the assignments) of the other half are revealed to the analyst to help infer the signs of treatment effects, addressing the issue of learning the opposite signs and improving identification power.
Specifically, split the subjects randomly into two sets of equal size, denoted I and II, where I ∪ II = [n]. The I^3 (Algorithm 1) is implemented on each set separately: at the start of the I^3 on set I, the analyst has access to the complete data for all subjects in set II, and tries to identify subjects with positive effects in set I with FDR control at level α/2; the I^3 on set II is analogous. Mathematically, let the candidate rejection set when implementing the I^3 on set I be R_t(I), with initial set R_0(I) = I. The available information at iteration t is defined as

F_{t−1}(I) := σ( {Y_i, X_i}_{i∈[n]}, {A_j}_{j∈II}, {A_j : j ∈ I \ R_{t−1}(I)}, Σ_{i∈R_{t−1}(I)} 1{∆̂_i > 0} ),

which includes the complete data {Y_j, A_j, X_j} for j ∈ II at every iteration t ≥ 0. Similarly, we define R_t(II) and F_{t−1}(II) for the I^3 implemented on set II. The final rejection set is the union of the rejections in I and II (see Algorithm 2). We call this algorithm the Crossfit-I^3.
Algorithm 2 The Crossfit-I^3.
Input: Covariates, outcomes, treatment assignments {Y_i, A_i, X_i}_{i=1}^n, target level α;
Procedure:
1. Randomly split the sample into two subsets of equal size, denoted I and II;
2. Implement Algorithm 1 at level α/2, where the analyst initially knows {Y_k, X_k}_{k=1}^n ∪ {A_j}_{j∈II} and sets R_0(I) = I, getting a rejection set R_τ^+(I) ⊆ I;
3. Implement Algorithm 1 at level α/2, where the analyst initially knows {Y_k, X_k}_{k=1}^n ∪ {A_j}_{j∈I} and sets R_0(II) = II, getting a rejection set R_τ^+(II) ⊆ II;
4. Combine the two rejection sets into the final rejection set, R_τ^+(I) ∪ R_τ^+(II).
As long as the two runs of the I^3 do not exchange information, Algorithm 2 has valid FDR control (see the proof in Appendix B.2).
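The wrapper logic of Algorithm 2 is simple enough to sketch directly; here `run_i3_on_fold` is our placeholder for a full Algorithm 1 run (it may use the complete data of the other fold, but only masked assignments within its own fold):

```python
import numpy as np

def crossfit_split(n, rng):
    """Random equal split of [n] into two folds (Algorithm 2, step 1)."""
    perm = rng.permutation(n)
    return perm[: n // 2], perm[n // 2 :]

def crossfit_i3(run_i3_on_fold, n, alpha, rng):
    """Run I^3 on each fold at level alpha/2 and union the rejections.
    `run_i3_on_fold(fold, other_fold, level)` stands in for Algorithm 1."""
    fold1, fold2 = crossfit_split(n, rng)
    r1 = run_i3_on_fold(fold1, fold2, alpha / 2)
    r2 = run_i3_on_fold(fold2, fold1, alpha / 2)
    return set(r1) | set(r2)
```

Splitting the budget as α/2 per fold ensures the union bound on the two runs keeps the overall FDR at α.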
In addition to addressing the issue of learning the opposite signs of ∆̂_i in the original I^3, another benefit of the cross-fitting framework is that, with the complete data revealed for at least half of the sample, the analyst does not have to infer the missing data (the assignments A_i), which would likely require parametric probabilistic modeling and the EM algorithm. Instead, because the assignments are revealed for subjects not in the candidate rejection set (at least half of the sample), their signs of ∆̂_j can be computed exactly and used as "training data". The analyst can then employ a black-box prediction model, such as a random forest, to predict the signs of ∆̂_i for the subjects whose assignments are masked (hidden).
As an example, we propose the following automated strategy for selecting a subject at step 8 of Algorithm 1.
Algorithm 3 An automated heuristic to select i*_t in the Crossfit-I^3.
Input: Current rejection set R_{t−1}(I), and available information for selection F_{t−1}(I);
Procedure:
1. Train a random forest classifier where the label is sign(∆̂_j) and the predictors are Y_j, X_j, and the residual E_j, using the non-candidate subjects j ∉ R_{t−1}(I);
2. Estimate the probability p(i, t) of ∆̂_i being positive for each candidate i ∈ R_{t−1}(I);
3. Exclude the candidate with the smallest estimated probability, i*_t := argmin_{i∈R_{t−1}(I)} p(i, t).
We remark that, in practice, the analyst can interactively change the prediction model, for example exploring parametric models to see which fits the data better. In principle, the analyst can perform any exploratory analysis of the data in F_{t−1}(I) to decide on a heuristic or score for selecting subject i*_t; the FDR control remains valid as long as she does not use the assignments A_i of candidate subjects i ∈ R_{t−1}(I). For computational efficiency, we usually update the prediction models (or their parameters) once every 100 iterations (say).
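A dependency-free stand-in for this heuristic might look as follows; we substitute a k-nearest-neighbor vote for the random forest purely to keep the sketch self-contained (the function name and the feature layout are ours):

```python
import numpy as np

def select_exclusion(candidates, unmasked, features, sign_delta, k=5):
    """Stand-in for Algorithm 3: predict P(Delta_i > 0) for each masked
    candidate by majority vote among its k nearest unmasked 'training'
    subjects (whose signs are computable because their assignments are
    revealed), then return the candidate with the lowest probability."""
    unmasked = list(unmasked)
    labels = np.array([sign_delta[j] > 0 for j in unmasked], dtype=float)
    best, best_p = None, np.inf
    for i in candidates:
        dist = np.linalg.norm(features[unmasked] - features[i], axis=1)
        p = labels[np.argsort(dist)[:k]].mean()  # estimated P(Delta_i > 0)
        if p < best_p:
            best, best_p = i, p
    return best
```

Any classifier with a probability output can replace the vote, since the choice of i*_t only affects power, never the FDR guarantee.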
To summarize, the Crossfit-I^3 described in Algorithm 2 involves two runs of the I^3 (Algorithm 1), where step 8 of selecting a subject is allowed to involve human interaction; alternatively, step 8 can be the automated heuristic presented in Algorithm 3.
Remark 1. We contrast our algorithms for identifying individual effects with the many existing algorithms targeting alternative goals. As examples, two commonly discussed goals are testing whether any individual has any effect [26, 27, 4] and estimating the average treatment effect [28, 29, 30]. While those two questions address causal inference at an aggregate level for the investigated population, we study the problem of making claims about each individual subject, a harder problem by its nature. Therefore, it is expected that a larger effect size is required, compared to the previous two goals, to attain reasonably high power for individual effect identification. In the following sections, we demonstrate through repeated numerical experiments and theoretical analysis that the Crossfit-I^3 has reasonably high power.

Numerical experiments
To assess our proposed procedure, we first describe a baseline method, which calculates a p-value for each subject under the assumption of linear models and applies the classical BH method [11]. We call this the linear-BH procedure.

Two baselines: the BH procedure (assuming well-specified model) and Selective SeqStep+
BH procedure under linear assumptions. For the treated and control groups, we separately learn linear models that predict Y_i from X_i, denoted l_T and l_C. By imputing the unobserved potential outcome with the corresponding fitted model, we obtain estimators of both potential outcomes, and the treatment effect for subject i is estimated as their difference ∆̂_i. If the potential outcomes are linear functions of the covariates with standard Gaussian noise (which we refer to as the linear assumption), the estimated treatment effect asymptotically follows a Gaussian distribution. For each subject i ∈ [n], we calculate a p-value for the zero-effect null (3) as P_i = 1 - Φ(∆̂_i / Var̂(∆̂_i)^{1/2}), where the estimated variance Var̂(∆̂_i) is obtained from the fitted linear models and Φ denotes the CDF of a standard Gaussian. To identify subjects with positive effects under FDR control, we apply the BH procedure to these p-values. Note that, unlike our methods, the error control for BH does not hold when the linearity assumption is violated (see Appendix B.7 for the formal FDR control guarantee, and Section 7 for numerical experiments where the linear assumption holds or fails).
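The linear-BH baseline can be sketched in a few lines, assuming the one-sided Gaussian p-value form above (the variance estimation from the fitted linear models is omitted; `delta_hat` and `sd_hat` are taken as given):

```python
import math

def phi(z):
    """CDF of a standard Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def linear_bh(delta_hat, sd_hat, alpha=0.2):
    """One-sided p-values P_i = 1 - Phi(delta_hat_i / sd_hat_i) for the
    zero-effect null, followed by the BH step-up procedure: reject the k
    smallest p-values, where k is the largest rank with p_(k) <= alpha*k/n."""
    pvals = [1.0 - phi(d / s) for d, s in zip(delta_hat, sd_hat)]
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / n:
            k = rank
    return set(order[:k])  # indices of identified subjects
```

As the text warns, the validity of these p-values (and hence of BH here) hinges on the linear-Gaussian assumption.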
Selective SeqStep+. Once we construct the estimated treatment effect ∆̂_i and invoke the critical property (9) on the probability of the estimated effect's sign, we can apply the Selective SeqStep+ of Barber and Candès [22]. Specifically, we set the p-value for each individual as p_i = 1 - (1/2)·1(∆̂_i > 0), which equals 1/2 when ∆̂_i > 0 and equals 1 when ∆̂_i ≤ 0. The hypotheses are ordered by |E_i| decreasingly, and the constant c is chosen as 1/2 to maximize power. We also note that the Selective SeqStep+ can be viewed as an automated version of the I³ in which the Statistician (S) picks individuals by the ranking of |E_i| in step 8 of Algorithm 1. We expect the power of our proposed Crossfit-I³ to be higher than that of the Selective SeqStep+, because the Crossfit-I³ additionally uses information from the revealed treatment assignments when picking and ordering individuals, which is especially helpful for inferring the direction of the treatment effect.
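A minimal sketch of this baseline, assuming the standard Selective SeqStep+ stopping rule with c = 1/2 (so the c/(1-c) factor is 1): walk down the |E_i| ordering and find the largest prefix whose estimated false discovery proportion stays below α.

```python
def selective_seqstep_plus(delta_hat, e_abs, alpha=0.2):
    """Selective SeqStep+ with c = 1/2: order subjects by |E_i| decreasing,
    treat delta_hat_i > 0 as p_i = 1/2 and delta_hat_i <= 0 as p_i = 1, and
    take the largest prefix k with (1 + #nonpositive) / max(#positive, 1)
    <= alpha; reject the positive-effect subjects in that prefix."""
    order = sorted(range(len(delta_hat)), key=lambda i: -e_abs[i])
    pos, neg, k_hat = 0, 0, 0
    for k, i in enumerate(order, start=1):
        if delta_hat[i] > 0:
            pos += 1
        else:
            neg += 1
        if (1 + neg) / max(pos, 1) <= alpha:
            k_hat = k
    return {i for i in order[:k_hat] if delta_hat[i] > 0}
```

Note that the ordering uses only |E_i|, never the treatment assignments, which is exactly the limitation the Crossfit-I³ relaxes.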

Numerical experiments and power comparison
We run a simulation with n = 500 subjects. Each subject is recorded with two binary attributes (e.g., female/male and senior/junior) and one continuous attribute (e.g., body weight), denoted as a vector X_i = (X_i(1), X_i(2), X_i(3)). Among the n subjects, the binary attributes are marginally balanced, and the subpopulation with X_i(1) = 1 and X_i(2) = 1 is of size 30. The continuous attribute is independent of the binary ones and follows a standard Gaussian distribution.
The outcomes are simulated as a function of the covariates X_i and the assignment A_i following the generating model (7). Recall that we previously used model (7) only as a working model, which is not required to be correctly specified; here, we generate data from this model in simulation for a clean evaluation of the considered methods. We specify the noise U_i as a standard Gaussian, the expected control outcome as f(X_i) = 5(X_i(1) + X_i(2) + X_i(3)), and the treatment effect as in (54), where S_∆ > 0 encodes the scale of the treatment effect. In this setup, around 15% of the subjects have a large positive treatment effect, and 43% have a mild negative effect.

The FDR control level is 0.2, marked by a horizontal line in the error control plots. For all plots in this paper, the FDR and power are averaged over 500 repetitions. The Crossfit-I³ controls FDR while the linear-BH procedure does not, because the treatment effect is nonlinear; the Crossfit-I³ also achieves higher power than both the Selective SeqStep+ and the linear-BH procedure.
For the Crossfit-I³, we use random forests (with default parameters in R) to compute m̂, and use the automated selection strategy of Algorithm 3 to select a subject at step 8 of Algorithm 1. The linear-BH procedure yields a substantially higher FDR than desired because the linear assumption does not hold for the underlying truth (54) (see Figure 4), whereas the Selective SeqStep+ and our proposed Crossfit-I³ control FDR at the target level as expected. At the same time, the Crossfit-I³ appears to have higher power than both the Selective SeqStep+ and the linear-BH procedure at correctly identifying subjects with true positive effects. More experiments exploring different treatment effect setups can be found in Section 7.

Asymptotic power analysis in simple settings
In addition to the numerical experiments, we provide a theoretical power analysis in some simple cases to understand the advantages and limitations of our proposed Crossfit-I 3 .
First, consider the case without covariates. Our analysis is inspired by the work of Arias-Castro and Chen [31] and Rabinovich et al. [32], who study the power of FDR-controlling methods under a sparse Gaussian sequence model. Let there be n hypotheses, each associated with a test statistic V_i for i = 1, ..., n. They consider a class of methods called threshold procedures, in which the final rejection set takes the form R = {i : V_i ≥ T(V_1, ..., V_n)} for a data-dependent threshold T; they discuss two types of thresholds (see Appendix C for details of their results). The BH procedure is one example of a threshold procedure. Our proposed I³ can also be simplified to a threshold procedure when using an automated selection strategy at step 8 of Algorithm 1: at each iteration, we exclude the subject with the smallest absolute estimated treatment effect |∆̂_i| (this strategy satisfies our requirement of not using the assignments, since |∆̂_i| is revealed while its sign remains masked). The resulting simplified and automated I³ is a threshold procedure with V_i = ∆̂_i. Note that our original interactive procedure is highly flexible, making a general power analysis less tractable, so we analyze the power of the Crossfit-I³ with the simplified selection strategy above.
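The simplified, fully automated I³ described above can be sketched as follows, assuming the running FDR estimator has the (1 + |R⁻|)/max(|R⁺|, 1) form used by Selective SeqStep+ (the paper's exact estimator display is not reproduced here):

```python
def automated_i3(delta_hat, alpha=0.2):
    """Simplified automated I^3 as a threshold procedure: start with all
    subjects in the candidate set R, repeatedly exclude the subject with the
    smallest |delta_hat_i|, and stop as soon as the running FDR estimate
    (1 + |R-|) / max(|R+|, 1) drops to alpha; reject R+ at that time."""
    R = sorted(range(len(delta_hat)), key=lambda i: abs(delta_hat[i]))
    while R:
        pos = sum(1 for i in R if delta_hat[i] > 0)
        neg = len(R) - pos
        if (1 + neg) / max(pos, 1) <= alpha:
            return {i for i in R if delta_hat[i] > 0}
        R.pop(0)  # exclude the weakest remaining candidate
    return set()
```

Since the exclusion order depends only on |∆̂_i|, this automated version is a threshold procedure in V_i = ∆̂_i, matching the reduction used in the power analysis.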
To contextualize our power analysis, we paraphrase one of the results in Arias-Castro and Chen [31] and Rabinovich et al. [32]. Assume the test statistics V_i ~ N(µ_i, 1) are independent, with µ_i = 0 under the null and µ_i = µ > 0 otherwise. Denote the number of non-nulls by n_1, and parameterize the sparsity of the non-nulls by β ∈ (0, 1) via n_1/n = n^{-β}. Let the signal µ grow with n as µ = √(2r log n), where the signal strength is encoded by r ∈ (0, 1). Their power analysis is characterized by the signal r and the sparsity β, which are also the critical parameters characterizing power in our context, as we state below. These authors effectively prove that, for any fixed FDR level α ∈ (0, 1), no threshold procedure can have nontrivial power if r < β, while there exist threshold procedures with asymptotic power one if r > β.
Our analysis differs from theirs in the non-null distribution of the test statistics. Given n subjects, suppose the potential outcomes for subject i are distributed as Y_i(0) ~ N(0, 1) and Y_i(1) ~ N(µ_i, 1), where µ_i = 0 if subject i is null and µ_i = µ > 0 if i is non-null. Thus, the observed outcome of a null is N(0, 1), while that of a non-null is a mixture of N(µ, 1) and N(0, 1) (depending on the treatment assignment), rather than a shift of the null distribution as assumed in Arias-Castro and Chen [31]; the proof of the following result therefore involves some modifications of their proofs (see Appendix C.1).
Theorem 3. Fix an FDR level α ∈ (0, 1) and let the number of subjects n go to infinity. When there is no covariate, the automated Crossfit-I³ and the linear-BH procedure have the same asymptotic power: if r < β, their power goes to zero; if r > β, their power goes to 1/2. Further, restricted to the treated subjects, their power goes to one.

Remark 2. The power of both methods cannot converge to a value larger than 1/2 because, without covariates, we cannot differentiate between subjects with zero effect (whose outcome follows a standard Gaussian whether treated or not) and untreated subjects with positive effects (whose outcome also follows a standard Gaussian). The proportion of untreated subjects among those with positive effects is 1/2 because of the assumed randomization.
The above theorem covers the case where there are no covariates to help guess which untreated subjects have positive effects. Next, we consider the case with an "ideal" covariate X_i whose value indicates whether a subject is a non-null (i.e., has a positive effect): X_i = 1{µ_i > 0}. Here, we design the selection strategy (for step 8 of Algorithm 1) as a function of the covariates, because we expect subjects with similar covariates to have similar treatment effects. Specifically, for the I³ implemented on I, we learn a prediction of ∆̂_j from X_j using the non-candidate subjects j ∈ II: Pred(x) = (1/|II|) Σ_{j∈II} ∆̂_j 1{X_j = x}, where x ∈ {0, 1}. Then, among the candidate subjects i ∈ I, we exclude those with lower Pred(X_i) first. As we integrate information across subjects with the same covariate value, all non-null subjects (i.e., those with X_i = 1) are excluded after the nulls (with probability tending to one), regardless of whether they are treated; hence we achieve power one.
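The covariate-based score Pred(x) above is simple to compute; a direct sketch (with our own variable names, for a binary covariate):

```python
def pred_scores(candidates, noncandidates):
    """Selection scores for the ideal-covariate analysis: Pred(x) is the
    normalized sum of estimated effects among non-candidate subjects sharing
    the binary covariate value x; candidates with lower Pred(X_i) are
    excluded first.

    candidates:    dict i -> X_i in {0, 1}
    noncandidates: dict j -> (X_j, delta_hat_j)
    """
    n = max(len(noncandidates), 1)

    def pred(x):
        # (1/|II|) * sum over j in II of delta_hat_j * 1{X_j = x}
        return sum(d for (xj, d) in noncandidates.values() if xj == x) / n

    return {i: pred(x) for i, x in candidates.items()}
```

Because the score pools all subjects sharing a covariate value, treated and untreated non-nulls receive the same high score, which is what drives the power-one conclusion of Theorem 4.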
Theorem 4. Fix an FDR level α ∈ (0, 1) and let the number of subjects n go to infinity. With the covariate X_i = 1{µ_i > 0}, the power of the automated Crossfit-I³ converges to one for any fixed r ∈ (0, 1) and β ∈ (0, 1). In contrast, the power of the linear-BH procedure goes to zero if r < β. (When r > β, the power of both methods converges to one.)

Here is a short informal argument for why our power goes to one (see the detailed proof in Appendix C.2). Since the nulls can be excluded before the non-nulls, we focus on the test statistics of the non-nulls. The estimated effect ∆̂_i of a non-null converges in distribution to a Gaussian whose mean depends on the assignment (in particular, ∆̂_i → N(-µ/2, 1) for those with A_i = 0). Hence, at the time t_0 right after all the nulls are excluded (when all the non-nulls are in R_{t_0}), the proportion of positive estimated effects |R^+_{t_0}|/|R_{t_0}| converges to Φ(µ), where Φ denotes the CDF of a standard Gaussian. We can stop before t_0 and identify the subjects in R^+_t. The power goes to one because µ grows to infinity for any fixed r ∈ (0, 1), so for large n we stop before t_0 and the proportion of rejected non-nulls (which converges to Φ(µ) as argued above) also goes to one. In short, the power guarantee does not depend on the sparsity β, thanks to the designed selection strategy that incorporates covariates.
Our theoretical power analysis covers two extreme cases: one with no covariate to assist the testing procedure (Theorem 3), and one with a single "ideal" covariate equal to the indicator of the non-nulls (Theorem 4). The numerical experiments in Section 3 consider more practical settings, where the analyst is provided with a mixture of covariates informative about the heterogeneous effect (X_i(1) and X_i(3) in our example) and some uninformative ones; still, the Crossfit-I³ tends to have reasonably high power. So far, the paper has discussed the setup of a randomized experiment where each subject has probability 1/2 of being treated; in the following, we present a variant of the Crossfit-I³ for observational studies, where the probabilities of receiving treatment can depend on covariates and be unknown.

Crossfit-I 3 in observational studies
For clear notation, we denote the true propensity score (probability of receiving treatment) of subject i by π_i, and its estimate by π̂_i (introduced shortly). For observational studies, we introduce an alternative set of assumptions to replace assumption (1):

(iii) conditional on the covariates, the treatment assignments are independent: P(A_1 = a_1, ..., A_n = a_n | X_1, ..., X_n) = Π_{i=1}^n π_i^{a_i}(1 - π_i)^{1-a_i} for any (a_1, ..., a_n) ∈ {0, 1}^n, where the propensity scores {π_i}_{i=1}^n can be unknown.
(iv) the propensity scores are bounded away from 0 and 1: 0 < π_min ≤ π_i ≤ π_max < 1 for all i.

Because the propensity scores π_i are unknown, we estimate them from the revealed data; specifically, in the cross-fitting framework, we use the complete data for subjects in II. To be exact, we modify the Crossfit-I³ of Algorithm 2 in two components:

• prior to step 2 of implementing the I³, we estimate bounds on the propensity scores, π̂_min(I) and π̂_max(I), from the complete data in II. For example, we can estimate individual propensity scores by a logistic regression on the covariates X_j using the complete data from non-candidate subjects j ∈ II, and take their minimum and maximum as the estimates; and

• we modify the FDR estimator in the implementation of Algorithm 1 to account for the estimated propensity bounds, and similarly modify the procedure on set II.

We call the resulting algorithm Crossfit-I³_π, which has asymptotic FDR control when the propensity scores are well estimated (proof in Appendix B.5). Note that the Crossfit-I³_π does not modify the treatment effect estimator ∆̂_j; this does not affect the FDR control, as stated in the theorem below. As a future direction, it would be interesting to modify ∆̂_j to utilize the estimated propensity scores and potentially improve the identification power.
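The first modification, estimating the propensity bounds from the non-candidate data, can be sketched as follows; we fit a plain logistic regression by gradient ascent using only the standard library (the estimator and all names are an illustrative assumption, not the paper's exact implementation):

```python
import math

def estimate_propensity_bounds(X, A, lr=0.1, steps=2000):
    """Estimate (pi_min_hat, pi_max_hat) from non-candidate data: fit a
    logistic regression of A_j on X_j by gradient ascent on the average
    log-likelihood, then take the min and max of the fitted propensities."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for x, a in zip(X, A):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            for k in range(d):
                gw[k] += (a - p) * x[k]
            gb += a - p
        w = [wi + lr * gi / len(X) for wi, gi in zip(w, gw)]
        b += lr * gb / len(X)
    fitted = [1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
              for x in X]
    return min(fitted), max(fitted)
```

Only the two extremes of the fitted propensities are passed on to the modified FDR estimator, consistent with Remark 3's observation that the inflation depends only on the bounds, not on the individual scores.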
Theorem 5. Suppose there are n samples for identification. In the cross-fitting framework, let π̂_min(I) and π̂_max(I) be the estimated lower and upper bounds on the propensity scores based on the information in F_0(I). Denote the corresponding estimation error by ϵ^π_n(I), define ϵ^π_n(II) similarly, and let ϵ^π_n = ϵ^π_n(I) + ϵ^π_n(II). Then, under assumptions (17), (18), and (2) in observational studies, when max{ϵ^π_n(I), ϵ^π_n(II)} ≤ (1/2) max{1 - π_min, π_max}, the FDR of the Crossfit-I³_π for the null hypothesis (3) is upper bounded in terms of ϵ^π_n.

Corollary 1. The Crossfit-I³_π has asymptotic FDR control for the zero-effect null (3) when the estimation of the propensity score bounds is consistent, in the sense that ϵ^π_n → 0 as the sample size n goes to infinity.

Remark 3. The Crossfit-I³_π can have a larger FDR than the target level if the propensity score estimation is not consistent, and this inflation increases as the true bounds of the propensity scores approach 0 and 1. Nonetheless, the inflation only depends on the minimum and maximum of the propensity scores (rather than on each individual one), and only concerns the one-sided error of their estimation. Intuitively, we obtain exact FDR control if the estimated minimum propensity score is below the true minimum and the estimated maximum above the true maximum, i.e., if the estimation is conservative enough to capture the extreme cases. Such desirable FDR control comes with a risk of lower power, because conservative propensity score estimates inflate the FDR estimator in the Crossfit-I³_π, and in turn more subjects must be excluded before claiming rejections. We also propose a variant, the MaY-I³_π, which can have a stronger error control guarantee at the cost of reserving (masking) more information for FDR control, and may therefore have lower identification power. Specifically, this variant has a doubly robust asymptotic FDR control: either when the propensity scores are well estimated as above, or when the conditional outcomes given covariates, m(X_i) ≡ E(Y_i | X_i), are well estimated by m̂(X_i). In addition, another advantage of the MaY-I³_π is that it controls FDR for the nonpositive-effect null, as we detail in Appendix A.1 and show in numerical experiments in the next section.

Numerical experiments
We follow the simulation setting in Section 3.2, except with different propensity scores specified as a function of the covariates. Let the treatment effect be that in (54) with S_∆ = 3. Consider the case where the subjects with positive effects coincide with those having higher propensity scores, with the propensity scores deviating from 1/2 by at most δ ∈ (0, 0.5).
When the propensity scores are unknown, we explore several approaches: estimating the propensity scores as in the Crossfit-I³_π and MaY-I³_π; and falsely treating all propensity scores as 1/2 and implementing the original algorithms, referred to as Crossfit-I³ and MaY-I³ (detailed in Appendix A.1 for controlling the error of nonpositive effects). They are compared with the oracle algorithms in which the true propensity scores are plugged into the Crossfit-I³_π and MaY-I³_π. We are particularly interested in the sensitivity of the Crossfit-I³ and MaY-I³, because in practice one might assume the propensity scores to be 1/2 when they in fact differ.
The Crossfit-I³_π and MaY-I³_π with estimated propensity scores appear to control FDR at the target level for their respective null hypotheses (see Figure 5). They have less power than their oracle counterparts, which is expected since the latter make use of the true propensity scores.

When all propensity scores are falsely treated as 1/2, we can still implement the Crossfit-I³ and MaY-I³ (see Figure 6). In our experiments, the FDR for the zero-effect null seems to be controlled below the target level even when the true propensity scores are extreme (π_min = 0.1 and π_max = 0.9 when δ = 0.4). This coincides with our claim of doubly robust FDR control, once we notice that the MaY-I³ is equivalent to the MaY-I³_π with π̂_min = π̂_max = 1/2: in this case the propensity scores are poorly estimated, |π̂_min - π_min| = |π̂_max - π_max| = 0.4, but the FDR can still be small when the expected outcome m(X_i) = E(Y_i | X_i) is well estimated by m̂_{-I}(X_i). The FDR for the nonpositive-effect null can exceed the target level when the deviation δ is large and the propensity score estimation is poor, corresponding to the case where π_min ≤ 0.25 and π_max ≥ 0.75. The power of the Crossfit-I³ and MaY-I³ does not follow the same trend as the Crossfit-I³_π and MaY-I³_π when δ grows large, because their FDR estimators do not suffer from the conservativeness introduced by extreme propensity scores.

FDR control at a subgroup level
Our proposed interactive methods control FDR at the individual level, upper bounding the expected proportion of falsely identified subjects. In this section, we show that the idea of interactive testing extends to FDR control at the subgroup level, where we aim to identify multiple subgroups with positive effects while upper bounding the proportion of falsely identified subgroups. Recall that FDR control at the subgroup level was studied by Karmakar et al. [8], as reviewed in Section 1.3.

Problem setup
Let there be G non-overlapping subgroups G_g for g ∈ [G] ≡ {1, ..., G}. The null hypothesis for each subgroup states that every subject within it has zero effect, or equivalently, H_{0g}: G_g ⊆ H_0 (recall that H_0 is the set of true null subjects). Let D_g ∈ {0, 1} indicate whether H_{0g} is rejected, and define the FDR at the subgroup level as E[Σ_g D_g 1{G_g ⊆ H_0} / max(Σ_g D_g, 1)]. As with the algorithms at the individual level, the algorithms we propose at the subgroup level apply to both paired and unpaired samples. For simple notation, we use {Y_i, A_i, X_i} to denote the observed data for subject i when the samples are unpaired, and for pair i when the samples are paired (where Y_i = {Y_{i1}, Y_{i2}}, and similarly for A_i and X_i).
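For concreteness, the subgroup-level false discovery proportion (whose expectation over repetitions is the FDR above) can be computed directly from the decisions:

```python
def subgroup_fdp(decisions, null_subgroups):
    """Subgroup-level false discovery proportion: among the subgroups with
    D_g = 1, the fraction that are true nulls (G_g ⊆ H_0); returns 0 when
    nothing is rejected, matching the max(., 1) in the denominator.

    decisions:      dict g -> D_g in {0, 1}
    null_subgroups: set of g with H_0g true
    """
    rejected = [g for g, d in decisions.items() if d == 1]
    if not rejected:
        return 0.0
    return sum(1 for g in rejected if g in null_subgroups) / len(rejected)
```

This is the quantity averaged over 500 repetitions in the experiments below.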

An interactive algorithm to identify subgroups
We first follow the same steps as Karmakar et al. [8] to define the subgroups and generate a p-value for each subgroup. Specifically, the subgroups G_g for g ∈ [G] are defined using the outcomes and covariates {Y_j, X_j}_{j=1}^n (by an arbitrary algorithm or strategy, such as grouping subjects with the same covariates). For each subgroup G_g, we compute a p-value P_g by the classical Wilcoxon test (or by a permutation test, which obtains the null distribution by permuting the treatment assignments {A_i}_{i=1}^n). Our interactive procedure differs from Karmakar et al. [8] in how the subgroup p-values are processed. We adopt the work of Lei et al. [19], who propose an interactive procedure with FDR control for generic multiple testing problems. The key property that allows human interaction while guaranteeing valid FDR control is similar to that in the I³: the independence between the information used for selection and that used for FDR control. Here, each subgroup p-value is split into two parts: P^1_g := min{P_g, 1 - P_g}, which is revealed to the analyst for selection, and P^2_g := 2·1{P_g < 1/2} - 1, which is masked (hidden) for FDR control. Notice that for a null subgroup with a uniform p-value, (P^1_g, P^2_g) are independent, and P^2_g takes the values ±1 with probability 1/2 each; the p-values obtained by permuting the assignments are indeed uniform conditional on the outcomes and covariates. This property is analogous to properties (9) and (33) that lead to valid FDR control at the individual level. As with our proposed individual-level methods, the interactive procedure for subgroups progressively excludes subgroups and recursively estimates the FDR. Let the candidate rejection set R_t be a set of selected subgroups, starting from all subgroups, R_0 = [G]. We interactively shrink R_t using the available information, which masks (hides) the partial p-values P^2_g and the treatment assignments A_i for the candidate subgroups in R_{t-1}; the sum Σ_{g∈R_{t-1}} P^2_g is provided mainly for FDR estimation. Similar to our previously proposed interactive procedures, an FDR estimator FDR̂_subgroup(R_t) is defined from the masked parts. The algorithm shrinks R_t until the time τ := inf{t : FDR̂_subgroup(R_t) ≤ α} and identifies only the subgroups in R^+_τ, as summarized in Algorithm 4. Details of strategies for selecting subgroups based on the revealed p-values and covariates can be found in Lei et al. [19]. As a comparison, Karmakar et al. [8] apply the BH procedure to the subgroup p-values.
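The exclude-and-estimate loop of the subgroup procedure can be sketched as follows, assuming the (1 + |R⁻|)/max(|R⁺|, 1) form for the FDR estimator (the paper's exact display is not reproduced here) and a covariate-free selection rule that simply excludes the subgroup with the largest revealed P^1_g:

```python
def interactive_subgroup(pvals, alpha=0.2):
    """Sketch of the interactive subgroup procedure: each p-value P_g is
    split into P1_g = min(P_g, 1 - P_g) (revealed for selection) and
    P2_g = +1 if P_g < 1/2 else -1 (masked, used for FDR estimation).
    The loop shrinks R until the FDR estimate drops to alpha, then rejects
    the subgroups with P2_g = +1; an analyst could instead use covariates
    at the selection step."""
    R = list(pvals.keys())
    while R:
        pos = sum(1 for g in R if pvals[g] < 0.5)   # P2_g = +1
        neg = len(R) - pos                          # P2_g = -1
        if (1 + neg) / max(pos, 1) <= alpha:
            return {g for g in R if pvals[g] < 0.5}
        # exclude the least promising remaining subgroup (largest P1_g)
        R.remove(max(R, key=lambda g: min(pvals[g], 1 - pvals[g])))
    return set()
```

The code only ever reads the sign bit P²_g through the running counts, mirroring how the masked information enters the FDR estimator but not the selection.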

Numerical experiments
We compare the performance of our proposed interactive procedure for subgroup identification with the method of Karmakar et al. [8], following an experiment in their paper. Each subject is recorded with two discrete covariates X_i = {X_i(1), X_i(2)}, where X_i(1) ∈ {1, ..., 40} takes 40 levels with equal probability and X_i(2) is binary with equal probability (for example, X_i(1) could encode the city subject i lives in, and X_i(2) the gender). The treatment effect ∆(X_i) is a constant δ if X_i(1) is even, and we vary δ over six levels. We conduct the experiment in two cases: unpaired samples (n = 2000) with independent covariates, and paired samples (n = 1000) whose covariate values are the same for the two subjects within each pair.
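The unpaired-sample design above is fully specified up to the outcome noise; a generation sketch (the standard-Gaussian noise and Bernoulli(1/2) assignment are our assumptions for illustration):

```python
import random

def simulate_subgroup_data(n=2000, delta=1.0, seed=0):
    """Generate the unpaired design: X_i(1) uniform on {1, ..., 40},
    X_i(2) uniform on {0, 1}, assignment A_i ~ Bernoulli(1/2), and a
    Gaussian outcome receiving treatment effect delta when X_i(1) is even.
    Returns a list of (x1, x2, a, y) tuples."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.randint(1, 40)
        x2 = rng.randint(0, 1)
        a = rng.randint(0, 1)
        effect = delta if x1 % 2 == 0 else 0.0
        y = rng.gauss(0.0, 1.0) + a * effect
        data.append((x1, x2, a, y))
    return data
```

Grouping the output by the distinct values of (x1, x2) yields the 80 subgroups used in the comparison below, of which the 40 with even x1 are non-null.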
Recall that the subgroups can be defined by the covariates and outcomes. Here, since the covariates are discrete, we define subgroups by the distinct values of (X_i(1), X_i(2)), resulting in 80 subgroups. The interactive procedure tends to have higher power than the BH procedure (see Figures 7a and 7b) because it focuses on the subgroups that are more likely to be non-null via the excluding process, and it utilizes the covariates together with the p-values to guide the algorithm; the BH procedure, in contrast, does not account for the covariates once the p-values are calculated. Nonetheless, the interactive procedure can have lower power when the total number of truly non-null subgroups is small. We simulate the case where a subject has a positive effect δ if X_i(1) is a multiple of 4 (i.e., X_i(1)/2 is even), so that there are 20 non-null subgroups in total (previously 40). The power of the interactive procedure is then lower than that of the BH procedure (see Figure 7c) because the FDR estimator in (25) can be conservative when |R^+| is small due to a small number of true non-nulls (for example, with FDR control at α = 0.2, the estimator cannot fall below α until |R^+_t| ≥ 5).

A side note is that we define the subgroups by the distinct values of the covariates, whereas Karmakar et al. [8] suggest forming subgroups by regressing the outcomes on the covariates using a tree algorithm. In their experiments and in several numerical experiments we tried, the number of subgroups defined by the tree algorithm is usually less than ten, and we believe FDR control is less meaningful when the total number of subgroups is this small. To justify this comment, note that an algorithm with valid FDR control at level α may make zero rejections with probability 1 - α and reject all subgroups with probability α, which can happen when the total number of subgroups is small. In contrast, with a large number of subgroups, a reasonable algorithm is unlikely to jump between the extremes of making zero rejections and rejecting all subgroups; thus, controlling FDR indeed certifies that the proportion of false identifications is low for the evaluated algorithm.

Figure 7: Performance of methods for identifying subgroups with positive effects: the BH procedure and the interactive procedure (for 80 subgroups defined by the distinct values of the covariates). We vary the scale of the treatment effect under unpaired or paired samples. In both cases, the interactive procedure can have higher power than the BH procedure. When the number of non-null subgroups is too small (fewer than 20), the BH procedure can have higher power. The error bars mark two standard deviations from the center.

Explanation of the higher power achieved by the interactive procedure
Although the interactive procedure and the BH procedure use the same subgroups and corresponding p-values, the interactive procedure has two properties that can improve power over the BH procedure: (a) it excludes likely-null subgroups, making it less sensitive to a large number of nulls, whereas the BH procedure considers all subgroups at once; and (b) it additionally uses the covariates. We can evaluate the effect of these two properties separately by implementing two versions of Algorithm 4, which differ in the strategy for selecting subgroups in step (a). Specifically, the adaptive procedure selects the subgroup whose revealed (partial) p-value P^1_g is smallest (not using the covariates), while the interactive procedure selects the subgroup by an estimated probability of P^2_g being positive (using the revealed P^1_g, the covariates, and the outcomes). To see whether both properties of Algorithm 4 contribute to the power improvement over the BH procedure, we test the methods under two simulation settings. Recall that the previous experiment defines a positive treatment effect when the discrete covariate X_i(1) ∈ {1, ..., 40} is even. Here, we add another case where the treatment effect is positive when X_i(1) ≤ 20, so that the density of subgroups with positive effects is the same as before, but the treatment effect is a simpler function of the covariates. In the latter case, we would expect the interactive procedure to learn this function of the covariates rather accurately and to have higher power than the adaptive procedure, which does not use the covariates; this is confirmed in Figure 8b. In the former setting, where the treatment effect is not a smooth function of the covariates and is hard to learn, the adaptive and interactive procedures have similar power (Figure 8a). Still, both have higher power than the BH procedure because they exclude likely-null subgroups.

Figure 8: Performance of the BH procedure of Karmakar et al. [8], the adaptive procedure, and the interactive procedure under different types of treatment effect (we define 80 subgroups by the discrete values of the covariates). Our proposed interactive procedure tends to have higher power than the BH procedure because (1) it excludes likely nulls (shown by the higher power of the adaptive procedure over the BH procedure in both plots); and (2) it additionally uses the covariates (shown when the treatment effect can be learned well as a function of the covariates, in the right plot).

Additional numerical experiments for variations in treatment effect
We have presented numerical results for the proposed methods in previous sections, where the treatment effect defined in (54) has a sparse, strong positive component and a dense, weak negative component. This section presents three more examples of the treatment effect.

Linear effect
Let the treatment effect be linear in the covariates (with interaction terms), as in model (26), with scale S_∆ > 0. In this case, all subjects have treatment effects whose scale correlates linearly with the covariates. Thus, the linear-BH procedure has valid error control, as shown in Figure 9a (unlike the other cases with nonlinear treatment effects).

Sparse and strong effect that is positive
Let the treatment effect be such that subjects with X_i(3) > 1 have positive treatment effects, with scale S_∆ > 0. Although the linear-BH procedure seems to have high power, its FDR is largely inflated since the linearity assumption does not hold (see Figure 9b). In contrast, our proposed methods and the Selective SeqStep+ have valid FDR control, and our proposed methods have higher power.

Sparse and strong effect in both directions
Let the treatment effect be such that subjects with X_i(3) > 1 have positive treatment effects and those with X_i(3) < -1 have negative treatment effects, with scale S_∆ > 0; the scale and proportion of effects in the two directions are the same. The power of the Selective SeqStep+ is trivial here because the |E_i| used for ordering cannot inform the direction of the treatment effect. In contrast, the power of the Crossfit-I³ and MaY-I³ is only slightly lower than in the previous setting, since this example additionally contains negative effects (see Figure 9c).

Figure 9: FDR for the zero-effect null (3) (first column), FDR for the nonpositive-effect null (29) (second column), and power (third column) of four methods: the linear-BH procedure, Selective SeqStep+, Crossfit-I³, and MaY-I³, under three types of treatment effect, varying the scale of treatment effect S_∆ in {0, 1, 2, 3, 4, 5}. The first row shows the dense two-sided (linear) effect as in model (26). When the linear assumption holds, as in the first row, the linear-BH procedure has valid FDR control and high power, but its FDR is large when the treatment effect is a nonlinear function of the covariates, as in the last two rows. In contrast, the Selective SeqStep+, Crossfit-I³, and MaY-I³ have valid FDR control for their respective target null hypotheses. Our proposed Crossfit-I³ and MaY-I³ have higher power than the Selective SeqStep+, especially when positive effects have a scale comparable to the negative effects, as in the first and third rows, because the proposed methods can additionally use the revealed treatment assignments to infer the direction of the treatment effects.

A prototypical application to ACIC challenge dataset
We implement our proposed methods on datasets generated for the Atlantic Causal Inference Conference (ACIC) challenge, which are intended to evaluate methods for average treatment effect (ATE) estimation and use real covariates with modified outcomes to simulate heterogeneous treatment effects, heterogeneous propensity scores, etc. We take an example dataset with 500 subjects, each recorded with 22 continuous covariates. The proportion of treated subjects is 0.7, indicating that the propensity scores might not be 1/2 as in a standard randomized experiment. The actual ATE is 0.1, rather small compared to the outcome range [14, 76], but the treatment effect could be positive and large for a subgroup of subjects, and our proposed algorithms can be used to identify them. We apply the proposed methods, which control the expected proportion of falsely identifying subjects with nonpositive effect (approximately so if the propensity scores are not 1/2). Compared with the remaining subjects, those identified as having positive effects tend to have larger values of covariates 8 and 21, and smaller values of covariates 6, 15, and 17 (see Figure 10).

Summary
We discuss the problem of identifying subjects with positive effects. Most existing methods identify subgroups with positive treatment effects, and they cannot upper bound the proportion of falsely identified subjects within an identified subgroup. In contrast, we propose the Crossfit-I 3 with finite-sample FDR control (i.e., among the identified subjects, the expected proportion of subjects with zero effect is no larger than α). One advantage of the Crossfit-I 3 is that it allows human interaction: an analyst (or an algorithm) can incorporate various types of prior knowledge and covariates using any working model, and can adjust the model at any step, potentially improving the identification power. Despite this flexibility, the Crossfit-I 3 achieves valid FDR control. Notably, because the Crossfit-I 3 incorporates covariates, it can identify subjects with positive effects, including those who were not treated.
Our proposed interactive procedure can be extended to various settings: • FDR control of nonpositive effects in randomized experiments (Appendix A.1); • FDR control of zero/nonpositive effects in observational studies (Section 5 and Appendix A.1.4); • paired samples (Appendix A.2); • FDR control at a subgroup level (Section 6).
The error control for our interactive procedures is based on the independence properties between the data used for FDR control and the revealed data for interaction, such as property (9) for the zero-effect null and (33) for the nonpositive-effect null.
The idea of interactive testing can be generalized to many other problems, as long as we can construct two parts of the data that are (conditionally) independent. As an example, for alternative definitions of the zero-effect null, such as those involving conditional expectations or conditional quantiles, one can explore specific functions of {Y i , X i } that are independent of the treatment assignment under certain model assumptions, and then plug them into the Crossfit-I 3 framework. Importantly, our interactive procedures using the idea of "masking and unmasking" should be contrasted with data-splitting approaches. We remark that no test, interactive or otherwise, can be run twice from scratch (with a tweak made the second time to boost power) after the entire data has been examined; this amounts to p-hacking. We view our interactive tests as one step towards enabling experts (scientists and statisticians) to work together with statistical models and machine learning algorithms in order to discover scientific insights with rigorous guarantees.
A Extensions of the I 3

A.1 FDR control of nonpositive effects
The Crossfit-I 3 controls the false identifications of subjects with zero treatment effect, as defined in the null hypothesis (3) and its corresponding footnote. In this section, we develop a modification that additionally controls the error of falsely identifying subjects with nonpositive treatment effects, by defining a different null hypothesis.

A.1.1 Problem setup
We define the null hypothesis for subject i as having a nonpositive effect, H nonpositive 0i : Y T i ≤ Y C i (29), or equivalently, ∆ i := Y T i − Y C i ≤ 0. As before, our algorithm applies to two alternative definitions of the null hypothesis: when treating the potential outcomes and covariates as fixed, the null hypothesis is (30), and in the hybrid version where the potential outcomes are random with joint distribution (Y T i , Y C i ) | X i ∼ P i , the null posits (31): Y T i ≤ Y C i almost surely-P i . Note that the nonpositive-effect null is less strict than the zero-effect null. Thus, an algorithm with FDR control for H nonpositive 0i must have valid FDR control for H zero 0i , but the reverse need not be true. Indeed, we observe in numerical experiments (Figure 12b) that the Crossfit-I 3 does not control FDR for the nonpositive-effect null. This section presents a variant of the Crossfit-I 3 that controls false identifications of nonpositive effects, which is arguably more practical when interpreting the identified subjects. For example, when controlling FDR for the nonpositive-effect null at level α = 0.2, we can claim that the expected proportion of identified subjects with positive effects is no less than 80%.

A.1.2 An interactive procedure in randomized experiments
Recall that the FDR control of the Crossfit-I 3 is based on property (9): when the null hypothesis is true for subject i, we have P( ∆ i > 0 | {Y j , X j , E j } n j=1 ) ≤ 1/2. This statement no longer holds when the null hypothesis is defined as H nonpositive 0i in (29). Fortunately, the issue can be fixed by making the conditioning in (9) coarser and removing the outcomes, as in property (33). This is reflected in the interactive procedure by reducing the available information for selecting subject i * t (at step 8 of Algorithm 1): we additionally mask (hide) the outcome Y i of the candidate subjects i ∈ R t−1 (I) when implementing the I 3 on set I. We call the resulting interactive algorithm MaY-I 3 , as it masks the outcomes.
Specifically, the MaY-I 3 modifies the Crossfit-I 3 by redefining the available information for selecting subjects when implementing Algorithm 1 on set I as F −Y t−1 (I). To calculate ∆ i at t = 0 when Y i for all i ∈ I are masked, let m −I (X i ) be an estimator of E(Y i | X i ) learned using data from the non-candidate subjects {Y j , X j } j ∉ I , and let the residuals be E −I i = Y i − m −I (X i ). Similar to property (9) for the zero-effect null, the analogous conditional-probability bound (33) holds, leading to valid FDR control for nonpositive effects. Overall, the MaY-I 3 follows Algorithm 2, except that the estimated treatment effect ∆ i is replaced by ∆ −I i , and the available information for selection F t−1 (I) is replaced by F −Y t−1 (I). See Appendix B.3 for the proof of FDR control.
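As a concrete sketch: assuming the estimated effect takes the form ∆ i = 4(A i − 1/2)(Y i − m(X i )) used in the power analysis of Appendix C, and using a linear working model as a stand-in for the arbitrary choice of m −I allowed by the paper, the masked-outcome statistic can be computed as follows (the data is a toy simulation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy randomized experiment (illustrative stand-in for the paper's data).
n, d = 300, 3
X = rng.normal(size=(n, d))
A = rng.integers(0, 2, size=n)          # fair-coin treatment assignment
tau = np.maximum(X[:, 0], 0.0)          # heterogeneous (nonnegative) effect
Y = X @ np.array([1.0, -1.0, 0.5]) + A * tau + rng.normal(size=n)

# Candidate set I and its complement; when running I3 on I, the complete
# data of non-candidate subjects is available.
I = np.arange(n) < n // 2
notI = ~I

# m^{-I}: estimator of E[Y | X] fit on non-candidate subjects only
# (linear model purely for illustration; any working model is allowed).
Z = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Z[notI], Y[notI], rcond=None)
m_hat = Z @ beta

# Residuals and the estimated effect 4*(A - 1/2)*(Y - m^{-I}(X)).
E_hat = Y - m_hat
delta_hat = 4 * (A - 0.5) * E_hat

print("fraction of positive signs among candidates:", np.mean(delta_hat[I] > 0))
```

For candidate subjects with zero effect, A i is independent of the residual, so the sign of ∆ i is positive with probability exactly 1/2 — the symmetry that the FDR estimator exploits.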
Theorem 6. Under assumptions (1) and (2) of randomized experiments, the MaY-I 3 has valid FDR control at level α for the nonpositive-effect null hypothesis under any of the definitions (29), (30), or (31). For the last definition, FDR control also holds conditional on the covariates and potential outcomes.
Similar to Algorithm 3 for the Crossfit-I 3 , we can design an automated algorithm for the MaY-I 3 to select a subject in step 8 of Algorithm 1, but the available information F −Y t−1 (I) no longer includes the outcomes of candidate subjects. One naive strategy is to follow Algorithm 3 in the main paper, which is designed for the Crossfit-I 3 , with the outcomes removed from the predictors; however, this appears to result in less accurate prediction of the effect signs, and in turn rather low power (numerical results are in the next paragraph). Here, we take a different approach by predicting the treatment effect instead of its sign, because the treatment effect might be better predicted as a function of the covariates (without outcomes) than a binary sign, especially when the treatment effect is indeed a smooth and simple function of the covariates. Specifically, we first estimate the treatment effect for the non-candidate subjects j ∉ R t−1 (I) using a well-studied doubly-robust estimator (see Kennedy [24] and references therein), where ( µ 0 , µ 1 ) are random forests trained to predict the outcomes for the control and treated groups, respectively. Using the provided covariates X i , we can predict ∆ DR i for the candidate subjects i ∈ R t−1 (I). The subject with the smallest prediction of ∆ DR i is then excluded. This automated strategy is described in Algorithm 5.
Algorithm 5 An automated heuristic to select i * t in the MaY-I 3 . Input: current rejection set R t−1 (I), and available information for selection F −Y t−1 (I); Procedure: 1. Estimate the treatment effect for non-candidate subjects j ∉ R t−1 (I) as ∆ DR j in (34); 2. Train a random forest where the label is the estimated effect ∆ DR j and the predictors are the covariates X j , using non-candidate subjects j ∉ R t−1 (I); 3. Predict ∆ DR i for candidate subjects i ∈ R t−1 (I) via the above random forest; 4. Select the candidate i * t with the smallest predicted effect.
To summarize, we have presented two types of strategies for selecting subjects: the Crossfit-I 3 chooses the subject with the smallest predicted probability of a positive ∆ i (see Algorithm 3 in the main paper), which we denote here as the min-prob strategy; and the MaY-I 3 chooses the one with the smallest prediction of the estimated effect ∆ DR j (see Algorithm 5), which we denote here as the min-effect strategy. Note that the proposed interactive algorithm can use an arbitrary strategy as long as the available information for selection is restricted. That is, the Crossfit-I 3 can use the same min-effect strategy, and the MaY-I 3 can use the min-prob strategy (after removing the outcomes from the predictors, which we elaborate in the next paragraph). However, we observe in numerical experiments that both interactive procedures have higher power when using their original strategies (see Figure 11).
Figure 11: Power of the Crossfit-I 3 and MaY-I 3 under the min-prob and min-effect strategies, when varying the scale of treatment effect S ∆ in {0, 1, 2, 3, 4, 5}. The Crossfit-I 3 tends to have higher power when using the min-prob strategy, and the MaY-I 3 tends to have higher power when using the min-effect strategy.
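A minimal sketch of the min-effect strategy, assuming the standard AIPW form of the doubly-robust estimator (the exact form of (34) may differ) and linear outcome models standing in for the random forests:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy randomized experiment (stand-in data for illustration).
n, d = 400, 3
X = rng.normal(size=(n, d))
A = rng.integers(0, 2, size=n)
Y = X[:, 0] + A * (1.0 + X[:, 1]) + rng.normal(size=n)
pi = 0.5                              # known propensity in a randomized trial

candidates = np.arange(n) >= n // 2   # current candidate set R_{t-1}(I)
train = ~candidates                   # non-candidate subjects

def fit_arm(Xa, Ya):
    """Linear outcome model (illustrative stand-in for a random forest)."""
    Za = np.column_stack([np.ones(len(Xa)), Xa])
    beta, *_ = np.linalg.lstsq(Za, Ya, rcond=None)
    return lambda Xq: np.column_stack([np.ones(len(Xq)), Xq]) @ beta

mu1 = fit_arm(X[train & (A == 1)], Y[train & (A == 1)])
mu0 = fit_arm(X[train & (A == 0)], Y[train & (A == 0)])

# Standard AIPW (doubly-robust) effect estimate.
dr = (mu1(X) - mu0(X)
      + A * (Y - mu1(X)) / pi
      - (1 - A) * (Y - mu0(X)) / (1 - pi))

# Regress DR estimates on covariates (using non-candidates only), predict
# effects for candidates, and exclude the smallest predicted effect.
pred = fit_arm(X[train], dr[train])
pred_dr = pred(X)
cand_idx = np.where(candidates)[0]
i_star = cand_idx[np.argmin(pred_dr[cand_idx])]
print("excluded subject:", i_star)
```

Only the covariates of candidate subjects enter the prediction step, consistent with the restriction that F −Y t−1 (I) masks candidate outcomes.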

A.1.3 Numerical experiments
Before presenting the experiment results, we first describe the min-prob strategy for the MaY-I 3 , where the available information F −Y t−1 (I) does not include the outcomes of candidate subjects. As in the min-prob strategy of Algorithm 3 in the main paper, we would like to use the outcome Y i and residual E i = Y i − m(X i ) as predictors to predict the sign of the treatment effect for candidate subjects i ∈ R t−1 (I), but Y i and E i for the candidate subjects are not available in F −Y t−1 (I). Thus, we propose Algorithm 6, where we first estimate Y i and E i using the covariates (steps 1-2); steps 3-5 are similar to Algorithm 3, obtaining the probability of having a positive treatment effect.
The Crossfit-I 3 has higher power when using the min-prob strategy than the min-effect strategy because the former additionally uses the outcome as a predictor. For the MaY-I 3 , the min-effect strategy leads to higher power because the estimated treatment effect ∆ DR j in (34) can provide reliable evidence of which subjects have a positive effect. If using the min-prob strategy, it could be harder to learn an accurate prediction via Algorithm 6, where two of the predictors, Y −I (X j ) and E −I (X j ), are themselves obtained by estimation, increasing the modeling complexity. Therefore, we present the Crossfit-I 3 and MaY-I 3 with the min-prob and min-effect strategies, respectively, as preferred in numerical experiments. Nonetheless, we remark that our proposed interactive frameworks for the Crossfit-I 3 and MaY-I 3 allow arbitrary strategies for selecting subjects, and an analyst can design her own strategy based on her domain knowledge.
We compare the Crossfit-I 3 and MaY-I 3 using the same experiment as Section 3.2. In terms of error control, both the Crossfit-I 3 and MaY-I 3 control FDR for the zero-effect null at the target level (Figure 12a). When the null is defined as having a nonpositive effect, the Crossfit-I 3 can violate the error control (Figure 12b), whereas the MaY-I 3 preserves valid FDR control. In terms of power, the Crossfit-I 3 has slightly higher power since the analyst can select subjects using the information F t−1 (I) in (10), which is richer than F −Y t−1 (I) in (32) for the MaY-I 3 . To summarize, the error control of the MaY-I 3 is stricter than that of the Crossfit-I 3 , controlling false identifications of both zero and negative effects, while its power is slightly lower. We recommend the Crossfit-I 3 if one is only concerned with the error of falsely identifying subjects with zero effect, and the MaY-I 3 when it is desired to control the error of falsely identifying subjects with nonpositive effects.
Algorithm 6 The min-prob strategy for the MaY-I 3 . Procedure: 1-2. Estimate the outcome and residual of candidate subjects from their covariates as Y −I (X i ) and E −I (X i ), where E −I is learned using non-candidate subjects j ∉ R t−1 (I); 3. Train a random forest classifier where the label is sign( ∆ j ) and the predictors are ( Y −I (X j ), X j , E −I (X j )), using non-candidate subjects j ∉ R t−1 (I); 4. Predict the probability of ∆ i being positive as p(i, t) for candidate subjects i ∈ R t−1 (I); 5. Select i * t = argmin{ p(i, t) : i ∈ R t−1 (I)}.

A.1.4 Extensions to observational studies
The extension from randomized experiments to observational studies for the nonpositive-effect nulls is similar to that for the zero-effect nulls from Crossfit-I 3 to Crossfit-I 3 π : we estimate the propensity scores π i using the revealed data (detailed steps are described in Section 5). We call the resulting algorithm MaY-I 3 π . FDR control for the MaY-I 3 π holds even when the bounds on the true propensity scores approach 0 or 1; we only require (iv) that the propensity scores are strictly between 0 and 1 (assumption (35)), which is less stringent than assumption (18) for the Crossfit-I 3 π . The error control of the MaY-I 3 π is doubly robust: when the propensity score estimation is poor, the FDR is still close to the target level provided the expected outcomes are well estimated. To characterize the error of expected outcome estimation, we define a "centered" CDF Φ i (ϵ), with "upper" and "lower" bounds defined as Φ max (ϵ) := max i∈[n] Φ i (ϵ) and Φ min (ϵ) := min i∈[n] Φ i (ϵ). If the estimation error ϵ is small and the outcome distribution is symmetric and continuous, the centered CDF is close to 1/2, leading to less FDR inflation, as we describe later. We define several estimation errors when performing the I 3 on set I as follows.
• Let the error of propensity score estimation be ϵ π n (I) := max i∈I |π i − π i (I)|, where π i (I) is the estimated propensity score using the information in F −Y 0 (I).
• Let the error of expected outcome estimation be ϵ Y n (I), where m −I (X i ) is the estimated expected outcome learned using the information in F −Y 0 (I), which includes the complete data of non-candidate subjects j ∈ [n]\I. Recall that the sign of ∆ m (X i ) in the MaY-I 3 depends on the sign of Y i − m −I (X i ), whose probability of being positive deviates from 1/2 by at most max{Φ max (ϵ Y n (I)), 1 − Φ min (−ϵ Y n (I))}.
• For a candidate subject i ∈ I such that the zero-effect null hypothesis (3) is true, denote the probability of ∆ m (X i ) being positive as q i (I); its upper bound is close to 1/2 (the ideal case) when either the true propensity score is close to 1/2 or the error of the expected outcome estimation ϵ Y n (I) is small.
• The estimation error of q i (I) is upper bounded accordingly, denoted ϵ q n (I).
Similar error terms can be derived for the procedure on set II.
Theorem 7. In an observational setting satisfying assumptions (17), (35), and (2), the FDR of the MaY-I 3 π for the zero-effect null (3) is upper bounded in terms of the estimation errors above, provided ϵ q n (I) ≤ q max (I)/2 and ϵ q n (II) ≤ q max (II)/2. Note that the above theorem states the FDR guarantee for the zero-effect null (3); the error control for the nonpositive-effect null (29) is discussed in Appendix B.6, whose condition for asymptotic error control is similar, but in practice it could be hard to obtain consistent estimates of the expected outcomes. Also, the above theorem provides an upper bound on the FDR in terms of the maximum estimation error over all subjects, while in practice we expect the FDR to be close to the target level when the estimation error is small for most subjects.

A.2 Paired samples

A.2.1 Problem setup
Our discussion has focused on the case where samples are not paired; the proposed algorithms can also be extended to the paired-sample setting. Suppose there are n pairs of subjects. Let the outcomes of subjects in the i-th pair be Y ij , the treatment assignment indicators be A ij , and the covariates be X ij , for j = 1, 2 and i ∈ [n]. We deal with randomized experiments without interference, and assume that (i) conditional on covariates, the treatment assignments are independent fair coin flips; and (ii) conditional on covariates, the outcome of one subject Y i1,j1 is independent of the treatment assignment of another subject A i2,j2 conditional on A i1,j1 , for any (i 1 , j 1 ) ̸ = (i 2 , j 2 ).
As before, we can develop interactive algorithms for two types of error control (only the definitions treating the potential outcomes as random variables are presented, but the FDR control applies to all versions of the null). Next, we present the extensions of the Crossfit-I 3 for FDR control of zero effects and of the MaY-I 3 for FDR control of nonpositive effects in the paired-sample setting, obtained by a few modifications to the effect estimator.

A.2.2 Interactive algorithms for paired samples
With the pairing information, the treatment effect can be estimated without involving m as in (8), yielding ∆ paired i as used by Rosenbaum [26] and Howard and Pimentel [4], among others. This estimator satisfies the critical property guaranteeing FDR control: for a null pair i of two subjects with zero effects as in (39), the probability of ∆ paired i being positive, conditional on the information available for selection, is at most 1/2. Thus, the Crossfit-I 3 (Algorithm 2) with ∆ i replaced by ∆ paired i has valid FDR control for the zero-effect null (39), where the analyst excludes pairs using the available information, including {Y i1 , Y i2 , X i1 , X i2 } for candidate pairs i ∈ R t−1 (I), {Y j1 , Y j2 , A j1 , A j2 , X j1 , X j2 } for non-candidate pairs j ∉ R t−1 (I), and the sum Σ i∈R t−1 (I) 1{ ∆ paired i > 0} for FDR estimation. An automated strategy to exclude pair i * t (at step 8 of Algorithm 1) under paired samples is the same as Algorithm 3, except with ∆ i replaced by ∆ paired i . Recall the nonpositive-effect null under paired samples; the analogous symmetry property holds for ∆ paired i as defined in (41) of the main paper. Thus, the MaY-I 3 with ∆ i replaced by ∆ paired i has valid FDR control for the nonpositive-effect null, where the analyst progressively excludes pairs using the correspondingly masked available information. We can also implement an automated version of the MaY-I 3 where the selection of the excluded pair follows a similar procedure as Algorithm 5; the difference is that in step 1, we estimate the treatment effect for non-candidate pairs j ∉ R(I) directly as the paired difference, to avoid estimating outcomes with ( µ 0 , µ 1 ).
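The symmetry property for null pairs can be checked by a small Monte Carlo simulation. The statistic form (A i1 − A i2 )(Y i1 − Y i2 ) below is a plausible reading of a paired difference, not necessarily the paper's exact definition (41); the point is only to illustrate the ≤ 1/2 property under independent coin-flip assignments:

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check of the key symmetry property for a null pair:
# with zero effects, the outcomes are fixed regardless of treatment,
# and only the coin-flip assignments are random.
n_rep = 100_000
y1, y2 = 1.3, -0.4                      # fixed outcomes of a null pair
A1 = rng.integers(0, 2, size=n_rep)     # independent fair coin flips
A2 = rng.integers(0, 2, size=n_rep)
delta = (A1 - A2) * (y1 - y2)           # illustrative paired statistic

p_pos = float(np.mean(delta > 0))
print("P(delta > 0) for a null pair:", p_pos)
```

Here P(A1 − A2 = 1) = 1/4, so the positive-sign probability is 1/4 ≤ 1/2, which is exactly the kind of bound the FDR estimator requires for null pairs.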

A.2.3 Numerical experiments
We compare the power of the interactive procedures with and without the pairing information, using the same experiments as before. When the subjects within each pair have the same covariate values, the power under paired samples is higher than when treating them as unpaired (see Figure 13a), because the noisy variation in the observed outcomes that results from the potential control outcomes can be removed by taking the difference in outcomes within each pair.
The advantage of procedures under paired samples becomes less evident when the subjects within a pair do not match exactly. We simulate unmatched pairs by introducing a parameter ϵ such that for each pair i, the covariates of the two subjects satisfy P(X i1 (1) ̸ = X i2 (1)) = ϵ, P(X i1 (2) ̸ = X i2 (2)) = ϵ, and X i1 (3) = X i2 (3) + U (0, 2ϵ), where U (0, 2ϵ) is uniformly distributed between 0 and 2ϵ; a larger ϵ leads to a larger degree of mismatch. As ϵ increases, the power of procedures using the pairing information decreases (see Figure 13b), because the estimated treatment effect ∆ paired i becomes less accurate in the mismatched setting.
Figure 13: Power of the interactive procedures that utilize the pairing information, which is higher than treating all subjects as unpaired. (a) Exact pairs. (b) The advantage is less evident when the subjects within each pair are not exactly matched to have the same covariate values.
Figure 14: Power with or without pairing information, when the scale of treatment effect is fixed at 2 and the degree of mismatch ϵ varies. The power of algorithms without pairing information first increases and then decreases as ϵ becomes larger.
To further investigate the power decrease, we extend the definition of mismatch from ϵ ∈ (0, 1) to larger ϵ: P(X i1 (1) ̸ = X i2 (1)) = min{ϵ, 1}, P(X i1 (2) ̸ = X i2 (2)) = min{ϵ, 1}, and X i1 (3) = X i2 (3) + U (0, 2ϵ), where U (0, 2ϵ) is uniformly distributed between 0 and 2ϵ, and a larger ϵ leads to a larger degree of mismatch. As ϵ increases, the power under the unpaired samples first increases (see Figure 14). This is because the treatment effect is positive only when X i (3) > 1, which accounts for only about 15% of subjects without mismatch; thus, the pattern of the treatment effect is not easy to learn. In contrast, when there is a positive shift on X i (3), as designed in the mismatch setting above, more subjects have positive effects, so the algorithm can more easily learn the effect pattern and hence increase the power. The power can slightly decrease when the degree of mismatch is too large (ϵ > 1), because there are then fewer subjects without treatment effect, which also affects the estimation of the treatment effect.
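The mismatch mechanism described above can be sketched as follows; treating the first two covariates as binary (so that "̸ =" corresponds to a flip) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def make_pairs(n, eps):
    """Generate covariates for n pairs under the mismatch mechanism:
    the first two covariates of subject 1 differ from subject 2's with
    probability min(eps, 1), and X_{i1}(3) = X_{i2}(3) + U(0, 2*eps)."""
    X2 = np.column_stack([
        rng.integers(0, 2, size=n),     # binary covariate (assumption)
        rng.integers(0, 2, size=n),     # binary covariate (assumption)
        rng.normal(size=n),             # continuous covariate
    ]).astype(float)
    X1 = X2.copy()
    p = min(eps, 1.0)
    for k in (0, 1):
        flip = rng.random(n) < p
        X1[flip, k] = 1.0 - X1[flip, k]
    X1[:, 2] = X2[:, 2] + rng.uniform(0.0, 2.0 * eps, size=n)
    return X1, X2

X1, X2 = make_pairs(1000, eps=0.5)
mismatch_rate = float(np.mean(X1[:, 0] != X2[:, 0]))
shift = float(np.mean(X1[:, 2] - X2[:, 2]))
print("mismatch rate:", mismatch_rate, "mean shift on covariate 3:", shift)
```

The mean shift on the third covariate equals ϵ, which is the positive shift responsible for the initial power increase of the unpaired procedures.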

B Proof of error controls
The proofs are based on an optional stopping argument, as a variant of the ones presented in Lei and Fithian [12], Lei et al. [19], Li and Barber [33], and Barber and Candès [22].
Lemma 1 (Lemma 2 of Lei and Fithian [12]). Suppose that, conditionally on the σ-field G −1 , b 1 , . . . , b n are independent Bernoulli random variables with the success-probability bound of the lemma, and τ is an almost-surely finite stopping time with respect to the filtration (G t ) t≥0 ; then the stated expectation bound holds.

B.1 Proof of theorem 1
Proof. We show that the I 3 controls FDR by Lemma 1, where (a) the indicators b i satisfy the Bernoulli condition of the lemma; (b) the filtration in our algorithm satisfies F t ⊆ G t , so the time of stopping the algorithm t := min{t : FDR(R t ) ≤ α} is a stopping time with respect to G t ; and (c) C t+1 is measurable with respect to G t . Thus, by Lemma 1 and taking expectations, we can bound the FDR conditional on G −1 at the stopping time t, and the proof is completed by applying the tower property of conditional expectation. Notice that when the potential outcomes are treated as fixed, the same proof applies to the null defined as Y T j = Y C j , because the independence between A i and G −1 still holds for the nulls. In the hybrid version of the null, H zero 0i : Y T i = Y C i almost surely-P i , the above proof applies with G −1 := σ({Y j , Y T j , Y C j , X j } n j=1 ). Thus, FDR is controlled at level α conditional on the potential outcomes and covariates {Y T j , Y C j , X j } n j=1 .

B.2 Proof of theorem 2
Proof. Let the set of false rejections in R(I) be V(I). We conclude that the FDR of the I 3 implemented on set I is controlled at level α/2, following the error control of the I 3 in Section B.1, where the initial candidate rejection set is R 0 = I and thus C 0 = I ∩ H 0 . Similarly, the FDR of the I 3 implemented on set II is also controlled at level α/2. Therefore, the FDR of the combined set R(I) ∪ R(II) is controlled at level α as claimed; the proof is completed for the null (3) in the main paper after applying the tower property of conditional expectation. The FDR control also applies to the other two definitions of the null with the fixed or hybrid version of the outcomes, following the same arguments as at the end of Section B.1.
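The combining step can be made explicit. Since the sets I and II are disjoint, so are R(I) and R(II) and their false rejections V(I) and V(II), and the false discovery proportion of the union is dominated by the sum of the two individual proportions:

```latex
\mathrm{FDP}\bigl(\mathcal{R}(\mathrm{I})\cup\mathcal{R}(\mathrm{II})\bigr)
  = \frac{|\mathcal{V}(\mathrm{I})|+|\mathcal{V}(\mathrm{II})|}
         {\bigl(|\mathcal{R}(\mathrm{I})|+|\mathcal{R}(\mathrm{II})|\bigr)\vee 1}
  \;\le\; \frac{|\mathcal{V}(\mathrm{I})|}{|\mathcal{R}(\mathrm{I})|\vee 1}
        + \frac{|\mathcal{V}(\mathrm{II})|}{|\mathcal{R}(\mathrm{II})|\vee 1}.
```

Taking expectations and applying the level-α/2 bound to each half gives FDR ≤ α/2 + α/2 = α.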

B.3 Proof of theorem 6
Proof. We prove that the FDR control holds for the I 3 implemented on I; the same conclusion applies to II, so the overall FDR control is guaranteed following the proof of Theorem 2 in Section B.2. We first present the proof when the potential outcomes are viewed as random variables. Define the enlarged σ-field G ′ t , which contains more information than G t as defined in (44). We claim that Lemma 1 holds when we replace G t by G ′ t , because the distribution of b i conditional on G t is the same as that conditional on G ′ t for any t = 0, . . . , n. Similar to the proof of Theorem 1 in Section B.1, we check that the assumptions of Lemma 1 are satisfied: (a) the filtration in our algorithm satisfies F t ⊆ G ′ t , so the time of stopping the algorithm t := min{t : FDR(R t ) ≤ α} is a stopping time with respect to G ′ t ; (b) C t+1 is measurable with respect to G ′ t ; and (c) the Bernoulli condition holds for subjects with nonpositive effect i ∈ H nonpositive 0 . To see that the last assumption holds, notice that for any potential outcomes satisfying the null, the conditional probability of a positive sign is at most 1/2 for all such subjects, which proves Claim (45) and in turn the FDR control of the MaY-I 3 .
When the potential outcomes are treated as fixed, the above proof applies to the null defined in (30): the relevant conditional probability is zero or one, and the above arguments still hold. For the hybrid version of the null (31) in the main paper, the above proof applies with G −1 enlarged to include the potential outcomes. Thus, FDR is controlled at level α conditional on the potential outcomes and covariates {Y T j , Y C j , X j } n j=1 .
B.4 Preliminaries to proofs of error control under observational studies
Lemma 2. Let q i be the conditional probability of a positive estimated sign, where m is a summary statistic (mean or median) of Y i | X i learned using G −1 . Denote the maximum as q max (I) ≡ max i∈I q i , let its estimate using the information in G −1 be q max (I), and let the (one-sided) estimation error be ϵ q n (I) = max{q max (I) − q max (I), 0}. Define the FDR estimator accordingly; then, when ϵ q n (I) ≤ q max (I)/2, the FDR of the I 3 run by Analyst I at level α/2 is bounded.
Proof. By Lemma 1 and the tower property of conditional expectation, we obtain the stated bound, where the stopping time is denoted as t. The FDR at t is upper bounded in terms of ϵ π i (I) = π i − π i (I); in one line, q i − q i ≤ ϵ π i (I) − max{0, max{π i , 1 − π i } − max{Φ max (ϵ Y n (I)), 1 − Φ min (−ϵ Y n (I))}}.
Remark 5. If we consider the nulls as nonpositive effects, we obtain an analogous bound with a correspondingly defined Φ max , and we can then make the same claim as above. However, it is harder to obtain robust FDR control when the propensity scores are poorly estimated, because m(x i ) is an estimator of E(Y i | X i ), which can be very different from the expected control outcome for a subject with negative effect. (We could design an algorithm where m(x i ) estimates the expected control outcome instead, but even when that estimation is good, it would have zero power to detect a positive effect if the subject is not treated.)
B.7 Error control guarantee for the linear-BH procedure
Theorem 8. Suppose the outcomes follow a linear model: Y i = l ∆ (X i )A i + l f (X i ) + U i , where l denotes a linear function and U i is standard Gaussian noise. Then the linear-BH procedure controls the FDR for the nonpositive-effect null in (29) of the main paper asymptotically as the sample size n goes to infinity.
Note that the error control would not hold when the linear assumption is violated. For example, if the expected treatment effect E(Y T i − Y C i | X i ) is a nonlinear function of the covariates, the estimated treatment effect ∆ BH i would not be consistent; in turn, for the null subjects with zero effect, the p-values would not be valid (i.e., not stochastically dominating the uniform distribution). Hence, the linear-BH procedure would not guarantee the desired FDR control, as we show in the numerical experiments in Section 3 of the main paper.
Proof. For simplicity, we treat all the covariates as fixed values and denote the design matrix as X a = (X i : A i = a) T for a ∈ {T, C}, where we temporarily use A i = T to denote being treated (A i = 1). Under the linear assumption, the estimated outcome l a (X i ) asymptotically follows a Gaussian distribution with expected value l ∆ (X i )1{a = T } + l f (X i ). Its variance can be estimated as Var( l a (X i )) = σ 2 a X T i (X T a X a ) −1 X i , where σ 2 a is the estimated noise variance and d is the number of covariates. Note that the observed outcome also follows a Gaussian distribution, N (l ∆ (X i )1{a = T } + l f (X i ), σ 2 a ). In each estimated effect ∆ BH i , the observed outcome Y a i is independent of the estimated counterfactual outcome Y ā i , where ā is the complement of a: {a, ā} = {T, C}. Thus, the estimated effect asymptotically follows a Gaussian distribution with expected value l ∆ (X i ) (nonpositive under the null) and variance Var( ∆ BH i ) = Var( Y T i ) + Var( Y C i ), where an estimate is Var( Y a i ) = σ 2 a 1{A i = a} + Var( l a (X i ))1{A i ̸ = a}. Therefore, the resulting p-value P i , as defined in (15) of the main paper, is asymptotically valid (uniform or stochastically larger) if subject i is a null, and hence the BH procedure leads to asymptotic FDR control [34].
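A sketch of the linear-BH procedure under these assumptions, on toy data where the treatment effect really is linear; the plug-in variance below is a simplification of the exact bookkeeping in the proof, so this is an illustration rather than a faithful reproduction:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(5)

def norm_sf(z):
    """Standard Gaussian survival function."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# Toy data with a linear treatment effect, so the assumptions hold.
n, d = 500, 2
X = rng.normal(size=(n, d))
A = rng.integers(0, 2, size=n)
Y = X[:, 0] + A * (2.0 + X[:, 1]) + rng.normal(size=n)

Z = np.column_stack([np.ones(n), X])

def ols(mask):
    beta, *_ = np.linalg.lstsq(Z[mask], Y[mask], rcond=None)
    resid = Y[mask] - Z[mask] @ beta
    return beta, resid @ resid / (mask.sum() - Z.shape[1])

beta_t, s2_t = ols(A == 1)   # treated-arm fit and noise variance
beta_c, s2_c = ols(A == 0)   # control-arm fit and noise variance

# Estimated effect: observed outcome minus imputed counterfactual.
delta = np.where(A == 1, Y - Z @ beta_c, Z @ beta_t - Y)
sd = np.sqrt(s2_t + s2_c)    # crude plug-in standard deviation
pvals = np.array([norm_sf(z) for z in delta / sd])

def bh(pvals, alpha=0.2):
    """Benjamini-Hochberg step-up procedure."""
    m = len(pvals)
    order = np.argsort(pvals)
    passed = np.nonzero(pvals[order] <= alpha * np.arange(1, m + 1) / m)[0]
    rej = np.zeros(m, dtype=bool)
    if len(passed) > 0:
        rej[order[: passed[-1] + 1]] = True
    return rej

rejected = bh(pvals, alpha=0.2)
print("number identified:", int(rejected.sum()))
```

When the effect is generated nonlinearly instead, the imputed counterfactuals are biased and the null p-values lose validity, which is the failure mode shown in the numerical experiments.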

C Proof of power analysis
Our proof of the power analysis mainly uses the results of Arias-Castro and Chen [31], who consider a setup with n hypotheses, each associated with a test statistic V i for i ∈ [n]. Assume the test statistics are independent with survival function P(V i ≥ x) = Ψ i (x), which equals Ψ(x − µ i ), where µ i = 0 under the null and µ i > 0 otherwise. They focus on a class of distributions called asymptotically generalized Gaussian (AGG), whose survival function satisfies a tail condition with a constant γ > 0; for example, a normal distribution is AGG with γ = 2. They discuss a class of multiple testing methods called threshold procedures, where the final rejection set takes the form R = {i : V i ≥ τ } for some threshold τ (V 1 , . . . , V n ), and separately study two types of thresholds: the BH procedure [11], with threshold τ BH = V (ι BH ) , where ι BH is the largest index passing the BH step-up condition and V (1) ≥ . . . ≥ V (n) are the order statistics; and the Barber-Candès (BC) procedure [22], with a threshold on the absolute values of V i , where |V| := {|V i | : i ∈ [n]} is the set of sample absolute values, FDP(ν) := (|{i : V i ≤ −ν}| + 1) / max{|{i : V i ≥ ν}|, 1}, and the final rejection set consists of those with positive V i and value larger than τ BC . The stopping rule of the BC procedure is similar to that of our proposed algorithms, as detailed next.
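The BC procedure can be sketched directly from the FDP formula above; scanning the candidate thresholds in increasing order and stopping at the first one whose estimated FDP falls below α mirrors the stopping rule of the interactive algorithms:

```python
import numpy as np

rng = np.random.default_rng(6)

def bc_reject(V, alpha=0.2):
    """Barber-Candes threshold: over candidate thresholds nu in |V|,
    estimate FDP(nu) = (#{V_i <= -nu} + 1) / max(#{V_i >= nu}, 1)
    and reject {i : V_i >= nu} at the smallest nu with FDP(nu) <= alpha."""
    for nu in np.sort(np.abs(V)):
        fdp = (np.sum(V <= -nu) + 1) / max(np.sum(V >= nu), 1)
        if fdp <= alpha:
            return V >= nu
    return np.zeros(len(V), dtype=bool)

# Symmetric nulls around zero plus a block of positive signals.
V = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 1.0, 100)])
rejected = bc_reject(V, alpha=0.2)
print("rejections:", int(rejected.sum()))
```

The "+1" in the numerator is the usual correction that makes the left tail a conservative proxy for the number of false rejections on the right tail.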
Recall that in Section 4 of the main paper, we consider a simplified automated version of the I 3 that excludes the subject with the smallest absolute value of the estimated treatment effect | ∆ i |. Thus, the automated I 3 is a BC procedure where the test statistic of interest is V i = ∆ i = 4(A i − 1/2)(Y i − m n (X i )). Following the above notation and letting Φ be the CDF of the standard Gaussian, we denote the survival function for the nulls accordingly. Thus, the power for large n is smaller than 1/2 + (1 − Φ(d * ))/2 for all d * ∈ (0, ∞); in other words, the limit superior of the power is no larger than inf d * ∈(0,∞) {1/2 + (1 − Φ(d * ))/2} = 1/2.
With the limit inferior and superior of the power both bounded by 1/2, we conclude that the power converges to 1/2. In fact, the above proof implies that the power of identifying non-null subjects that are treated is one (notice that Ψ non-null, treated n (d * n ) = 1 − Φ(d * n + m n − µ) → 1 with probability one, so the limit inferior of the power for treated non-nulls is one).
Proof for the linear-BH procedure. The power of the linear-BH procedure when there are no covariates can be proved following similar steps as above, using intermediate results of Theorem 2 in Arias-Castro and Chen [31]. In their notation, the linear-BH procedure uses V i = ∆ BH i as the test statistic, and we separately discuss power in the treated group and the control group to ensure independence among the V i . The survival functions for the nulls and non-nulls in the treated group converge accordingly, where n T is the number of treated subjects. Since the above distributions converge to a Gaussian, these survival functions satisfy the AGG property asymptotically as defined in (52).
Proof. First, we claim that the power goes to zero when r < β, following the proof in Section C.1 for any threshold procedure (separately for the treated group, conditional on the control group). Then, we prove that the power converges to 1/2 when r > β: the power among untreated subjects is asymptotically zero because the survival functions for the nulls and non-nulls are the same; the power among treated subjects is asymptotically one, following the proof of Theorem 2 in Arias-Castro and Chen [31], as detailed next.
Again consider the sequence of thresholds d * n = (γr * log n) 1/γ for some r * ∈ (β, r ∧ 1). We first claim that the FDR estimator at d * n is less than any α ∈ (0, 1) for large n, or mathematically, FDR(d * n ) ≤ α; this can be verified from the proof of Theorem 2 in Arias-Castro and Chen [31]. Recall that the true stopping threshold is no larger than d * n , so the limit inferior of the power among the treated subjects is one. Therefore, the overall power converges to 1/2.

Figure 1 :
Figure 1: An illustrative example with 500 subjects, each with two recorded covariates. Every point is a subject. The treatment assignments are shown in the left plot (squares are treated, diamonds are not), and the observed outcomes in the right plot. We hope to answer the question: which individuals have a positive treatment effect?
Green round dots represent subjects identified by the Crossfit-I 3 .

Figure 3: An illustrative example with 500 subjects, each with two recorded covariates. Crossfit-I^3 identifies most subjects with positive effects, although about half of them did not receive treatment, so their potential treated outcomes were not observed.

Figure 4: FDR (left) and power (right) of Crossfit-I^3 compared with the linear-BH procedure and Selective SeqStep+, with the treatment effect specified as in model (54) and the scale S_∆ varying in {0, 1, 2, 3, 4, 5}. The FDR control level is 0.2, marked by a horizontal line in the error-control plots. For all plots in this paper, FDR and power are averaged over 500 repetitions. Crossfit-I^3 controls FDR while the linear-BH procedure does not, because the treatment effect is nonlinear. Moreover, Crossfit-I^3 can achieve higher power than both Selective SeqStep+ and the linear-BH procedure.

Let →_d denote convergence in distribution. The estimated effect satisfies ∆_i →_d N(µ, 1) for each non-null, since in the notation of Algorithm 1, m(X_i) converges to µ/2 for the non-nulls, and thus E_i →_d N(µ/2, 1) for those with A_i = 1, and E_i →_d N(−µ/2, 1) for those with A_i = 0.

Remark 4. Another variation, which we call MaY-I^3, is proposed in Appendix A.1.
FDR for the zero-effect null (3).
Power of identifying subjects with positive effects.

Figure 5: Performance of Crossfit-I^3_π̂ and MaY-I^3_π̂, which estimate the propensity scores, compared with Crossfit-I^3_π and MaY-I^3_π, which use the true propensity scores, when the treatment effect is specified as in model (21) and the propensity score deviates from 1/2 by δ, where δ varies in {0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4}. Both Crossfit-I^3_π̂ and MaY-I^3_π̂ appear to control FDR and have similar power. Their power is lower than that of Crossfit-I^3_π and MaY-I^3_π because the latter additionally use the true propensity scores.
Power of identifying subjects with positive effects.

Figure 6: Performance of Crossfit-I^3 and MaY-I^3, which falsely treat all propensity scores as 1/2, compared with Crossfit-I^3_π and MaY-I^3_π, which use the true propensity scores, when the treatment effect is specified as in model (21) and the propensity score deviates from 1/2 by δ, where δ varies in {0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4}. The power of Crossfit-I^3 and MaY-I^3 increases with δ because they do not suffer from a conservative FDR estimator as δ increases. Although the FDR for the nonpositive-effect null grows to exceed the target level when δ is larger than 0.2, FDR control for the zero-effect null appears to hold even when the true propensity scores are far from 1/2.
Only a few subgroups are non-nulls.
Effect as a discrete function of covariates.
Effect as a simpler function of covariates.

Figure 8: Power of three methods for subgroup identification: the BH procedure proposed by Karmakar et al. [8], the adaptive procedure, and the interactive procedure, under different types of treatment effect (we define 80 subgroups by discrete values of the covariates). Our proposed interactive procedure tends to have higher power than the BH procedure because (1) it excludes likely nulls (shown by the higher power of the adaptive procedure relative to the BH procedure in both plots); and (2) it additionally uses the covariates (shown when the treatment effect can be well learned as a function of covariates, in the right plot).
Sparse and strong effect that is positive (nonlinear), as in model (27).
Sparse and strong effect in both directions (nonlinear), as in model (28).

(b) Boxplot of covariates for subjects identified as having a positive effect (left) versus those not identified (right).

Figure 10: Characteristics of identified subjects: they tend to have larger values of variables 8 and 21, and smaller values of variables 6, 15, and 17, compared with subjects not identified.

Figure 11: Power of Crossfit-I^3 and MaY-I^3 under two strategies for selecting subjects, the min-prob strategy and the min-effect strategy, with the treatment effect defined in (54) of the main paper and the scale S_∆ varying in {0, 1, 2, 3, 4, 5}. Crossfit-I^3 tends to have higher power when using the min-prob strategy, and MaY-I^3 tends to have higher power when using the min-effect strategy.

Algorithm 6: The min-prob strategy to select i*_t in MaY-I^3.
Input: the current rejection set R_{t−1}(I), and the available information for selection F^{−Y}_{t−1}(I).
Procedure:
1. Predict the outcome Y_k of each subject k ∈ [n] from the covariates, denoted Y_{−I}(X_k), where Y_{−I} is learned using non-candidate subjects j ∉ R_{t−1}(I).
2. Predict the residual E
FDR for the zero-effect null (3).
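A hedged Python sketch in the spirit of step 1 of the min-prob strategy: fit an outcome model on the non-candidate subjects, then score the remaining candidates. The final selection rule used here (picking the candidate with the smallest residual, as the one assumed least likely to have a positive effect) is our illustrative assumption, not necessarily the paper's exact min-prob rule.

```python
def min_prob_select(x, y, candidates):
    """Sketch of a min-prob style selection with one covariate.
    Fit y ~ a + b*x by least squares on NON-candidate subjects, then
    among candidates pick the one with the smallest residual y - yhat."""
    train = [i for i in range(len(x)) if i not in candidates]
    xs = [x[i] for i in train]
    ys = [y[i] for i in train]
    n = len(train)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    # ordinary least squares slope and intercept on the training subjects
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / \
        sum((xi - xbar) ** 2 for xi in xs)
    a = ybar - b * xbar
    # candidate with the smallest residual relative to the fitted model
    return min(candidates, key=lambda i: y[i] - (a + b * x[i]))
```

For instance, with subjects 0–3 lying exactly on the line y = x and candidates {4, 5}, the fitted model is y = x, so the candidate with the most negative residual is selected.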

Figure 12: Performance of the two interactive methods, Crossfit-I^3 and MaY-I^3, with the treatment effect specified as in model (54) and the scale S_∆ varying in {0, 1, 2, 3, 4, 5}. MaY-I^3 controls FDR for a more relaxed null (nonpositive effects) than Crossfit-I^3, while Crossfit-I^3 has slightly higher power than MaY-I^3.

Corollary 2 (Doubly-robust FDR control). As the sample size n goes to infinity, MaY-I^3_π̂ has asymptotic FDR control for the zero-effect null (3) when either
1. (a) the propensity-score estimation is consistent, in the sense that E_{F_0(I)}[ϵ^π_n(I)] → 0; and (b) E_{F_0(II)}[ϵ^π_n(II)] → 0, and the true propensity scores are bounded away from 0 and 1; or
2. (a) the expected-outcome estimation is consistent, in the sense that ϵ^Y_n(I) → 0 almost surely over the conditional distribution given F_0(I); (b) the same holds for ϵ^Y_n(II); (c) the gap between the bounds on the true propensity scores and 1/2 exceeds the estimation error, max{π_max, 1 − π_min} − 1/2 ≥ ϵ^π_n(I) ∨ ϵ^π_n(II) almost surely; and (d) the outcome distribution is symmetric.

Figure 13: Power under paired samples with treatment effects specified by model (54), when our proposed algorithms (Crossfit-I^3 and MaY-I^3) utilize the pairing information; this power is higher than when treating all subjects as unpaired. The advantage is less evident when the subjects within each pair are not exactly matched to have the same covariate values.

Figure 14: Power of identifying subjects with positive effects for the proposed algorithms (Crossfit-I^3 and MaY-I^3), with or without pairing information, when the scale of the treatment effect is fixed at 2 and the degree of mismatch ϵ varies. The power of the algorithms without pairing information first increases and then decreases as ϵ becomes larger.
Specifically, the claim follows from the decomposition G(d*_n) = (1 − ϵ) Ψ^null_n(d*_n) + ϵ Ψ^non-null_n(d*_n) with ϵ = n^{−β}, and the fact that Ψ^non-null_n(d*_n) → 1 for the treated group and Ψ^null_n(d*_n) → n^{−r*} (by property (52)) with probability one. It follows that the true stopping threshold τ_n satisfies τ_n ≤ d*_n. Also, by Lemma 1 in Arias-Castro and Chen [31], the proportion of correctly identified non-nulls at threshold d*_n is (1/n_1) Σ_{i ∉ H_0} 1{∆^BH_i ≥ d*_n} = Ψ^non-null_n(d*_n) + o_P(1), where Ψ^non-null_n(d) for the treated group is decreasing in d and converges to 1 at d = d*_n.
(9) by definition, where we recall that H_0 is the set of true nulls. Intuitively, the number of false identifications |R^+_t ∩ H_0| can be approximately upper bounded by |R^-_t ∩ H_0|, since the number of positive signs should be no larger than the number of negative signs among the falsely identified nulls, according to property (9). Since the set of true nulls H_0 is unknown, we use |R^-_t| instead.

Algorithm 1: The I^3 (interactive identification of individual treatment effect) procedure.
Both players initialize R_0 = [n] and set t = 1.
1. S builds a prediction model m from X_i to Y_i.
2. S informs C about the residuals E_i ≡ Y_i − m(X_i).
6. S checks whether (1 + |R^-_t|)/max{|R^+_t|, 1} ≤ α.
7. If yes, S sets τ = t, reports R^+_τ, and exits.
8. Else, S picks any i*_t ∈ R_{t−1} using everything S currently knows. (S tries to pick an i*_t that S believes is null, i.e., S hopes that ∆_{i*_t} ≤ 0.)
9. C reveals A_{i*_t} to S, who then infers ∆_{i*_t} and its sign.
10. S updates R_t = R_{t−1}\{i*_t}, and correspondingly |R^+_t| and |R^-_t|.
11. Increment t and go back to Step 6.
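The accounting loop of Algorithm 1 can be sketched in a few lines of Python: candidates are excluded one at a time, and the procedure stops once the running estimate (1 + |R^-_t|)/max{|R^+_t|, 1} drops to α or below. The exclusion order here is driven by a generic `null_scores` input (higher = more null-looking), standing in as an illustrative placeholder for whatever side information S uses to pick i*_t.

```python
def i3_identify(signs, null_scores, alpha):
    """Sketch of the I^3 stopping rule. signs[i] in {+1, -1} is the sign
    of the inferred effect for subject i; null_scores[i] is the analyst's
    guess of how null-like subject i is (higher = excluded earlier)."""
    active = set(range(len(signs)))
    # exclusion order: most null-looking candidates first
    order = sorted(active, key=lambda i: null_scores[i], reverse=True)
    k = 0
    while True:
        pos = sum(1 for j in active if signs[j] > 0)
        neg = len(active) - pos
        if (1 + neg) / max(pos, 1) <= alpha:  # estimated FDR at or below alpha
            return sorted(j for j in active if signs[j] > 0)
        if k == len(order):
            return []  # estimate never fell below alpha: no identifications
        active.discard(order[k])
        k += 1
```

For example, with four positive signs and one negative sign whose score marks it as most null-like, excluding that one candidate brings the estimate from (1+1)/4 = 0.5 down to (1+0)/4 = 0.25, so at α = 0.25 the four positives are reported.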
use the same set of p-values {P_g}_{g∈[G]}, and control FDR by the classical BH procedure.

An interactive procedure for subgroup identification.
Initial state: the Explorer (E) knows the covariates and outcomes {Y_i, X_i}^n_{i=1}; the Oracle (O) knows the treatment assignments {A_i}^n_{i=1}; the target FDR level α is public knowledge.
9. O reveals {A_i}_{i∈G_{g*_t}} to E, who then infers P^2_{g*_t}.
10. E updates R_t = R_{t−1}\{g*_t}, and correspondingly |R^+_t| and |R^-_t|.
11. Increment t and go back to Step 6.
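To make the per-subgroup p-values concrete, here is a hedged Python sketch: group subjects by a discrete covariate label and compute a one-sided two-sample z-test p-value per subgroup. The test is an illustrative choice assuming unit-variance outcomes, not necessarily the construction of {P_g} used by Karmakar et al. [8]; the resulting p-values can then be passed to any standard BH implementation.

```python
import math

def subgroup_pvalues(outcomes, assignments, groups):
    """One p-value P_g per subgroup label, from a one-sided two-sample
    z statistic against the null of no positive effect (illustrative;
    assumes unit-variance outcomes). groups[i] is subject i's label."""
    labels = sorted(set(groups))
    pvals = []
    for g in labels:
        treated = [y for y, a, gg in zip(outcomes, assignments, groups)
                   if gg == g and a == 1]
        control = [y for y, a, gg in zip(outcomes, assignments, groups)
                   if gg == g and a == 0]
        n1, n0 = len(treated), len(control)
        z = (sum(treated) / n1 - sum(control) / n0) / math.sqrt(1 / n1 + 1 / n0)
        pvals.append(0.5 * math.erfc(z / math.sqrt(2)))  # 1 - Phi(z)
    return labels, pvals
```

A subgroup whose treated outcomes exceed its control outcomes gets a small p-value, while a subgroup with identical treated and control means gets exactly 0.5 under this one-sided test.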