Agglomerative Hierarchical Clustering for Selecting Valid Instrumental Variables

We propose a procedure which combines hierarchical clustering with a test of overidentifying restrictions for selecting valid instrumental variables (IV) from a large set of IVs. Some of these IVs may be invalid in that they fail the exclusion restriction. We show that if the largest group of IVs is valid, our method achieves oracle properties. Unlike existing techniques, our work deals with multiple endogenous regressors. Simulation results suggest that the method performs well in various settings. The method is applied to estimating the effect of immigration on wages.


Introduction
Instrumental variables (IV) estimation is a widely used statistical method for analysing the causal effect of treatment variables on an outcome variable when the causal pathway between them is confounded. Consistent IV estimation requires that all instruments are valid: (a) instruments must be associated with the endogenous variables (relevance condition) and (b) instruments must not affect the outcome directly or through unobserved factors (exclusion restrictions). In practice, a main challenge in IV estimation is that when there are a large number of candidate instruments, some of them may be invalid in the sense that they fail the exclusion restrictions. Many IV applications select valid instruments from the set of potential instruments merely on the basis of economic intuition, or even simply include all candidate instruments in the estimation. This practice is problematic because including invalid instruments may lead to severely biased results. It is therefore important to develop IV selection methods for settings with possibly invalid instruments, where complete knowledge about the candidate instruments' validity is absent.
The importance of developing IV selection techniques can be illustrated by a class of IV applications: shift-share IV estimation in international economics, where the instruments are constructed from many class-specific share variables. For example, Apfel (2019) estimates the effect of immigration on wages in the US labor market. The instrument for the contemporaneous immigration pattern is the lagged immigration pattern, which is constructed from 19 origin-country-specific share variables. Research in this area has documented that for the IV to be valid, each of the 19 share variables must satisfy the exclusion restriction. However, some of the shares may violate the exclusion restrictions, as they may affect the wage variable directly through a long-term dynamic adjustment process, or be correlated with unobserved demand shocks.
In this paper, we propose an IV selection and estimation method, which combines the agglomerative hierarchical clustering algorithm, a machine learning algorithm typically employed in cluster analysis, with the Sargan test for overidentifying restrictions. The estimator that we develop relies on the plurality rule (Guo, Kang, Cai, and Small, 2018) which states that the largest group of IVs consists of valid instruments, where instruments form a group if their instrument-specific just-identified estimators converge to the same value. Under the plurality rule, our method can achieve oracle selection, which means that the valid instruments can be selected consistently, and the IV estimator using the selected valid instruments has the same limiting distribution as the estimator if the true set of valid instruments were known.
Previous work has tackled the IV selection problem in the single endogenous variable case. Kang, Zhang, Cai, and Small (2016) propose a selection method based on the least absolute shrinkage and selection operator (LASSO). Windmeijer, Farbmacher, Davies, and Smith (2019) improve on this with an adaptive Lasso based method under the assumption that more than half of the candidate instruments are valid (the majority rule), which theoretically guarantees consistent selection. Guo, Kang, Cai, and Small (2018) propose the Hard Thresholding method under the sufficient and necessary identification condition (the plurality rule), a relaxation of the majority rule. Under the same identification condition, Windmeijer, Liang, Hartwig, and Bowden (2020) propose the Confidence Interval method, which has better finite sample performance. Our research adds to the literature in five ways:
1. We combine agglomerative hierarchical clustering with a traditional statistical test, the Sargan over-identification test, to yield a novel downward testing algorithm for IV selection. This new method provides the theoretical guarantee that it selects the true set of valid instruments consistently, and is computationally feasible.
2. We extend the method to settings with multiple endogenous regressors. Such an extension is not available for the aforementioned methods, but it is straightforward in our setting.
3. Our method performs well in the presence of weak valid or invalid instruments, which is an advantage over the existing methods.
4. We also discuss the application of our method to a setting with heterogeneous treatment effects.
5. Our algorithm is computationally less complex than the CIM. Also, compared with the commonly used K-means clustering, our method does not need to pre-specify the number of clusters or any starting points; the only pre-specified parameter is the critical value for the Sargan test, for which existing theory guarantees consistent selection, making our method easy to implement in practice.
We conduct Monte Carlo simulations to examine the performance of our method, and compare it with two existing methods, the Hard Thresholding method (HT) and the Confidence Interval method (CIM). The simulation results show that our method achieves oracle performance in both the single and multiple endogenous regressors settings in large samples when all the instruments are strong. Our method also works well when some of the candidate instruments are weak, outperforming HT and CIM. We apply our method to shift-share IV estimation of the dynamic effects of immigration on wages in the US.
The remainder of this paper is structured as follows. In Section 2, we state the model and assumptions and illustrate some well-established properties of the just-identified 2SLS estimator. In Section 3, we describe the basic method and the algorithm, and investigate its asymptotic properties. In Section 4, we present extensions to settings with multiple endogenous regressors and heterogeneous treatment effects, and discuss our method's performance in the presence of weak instruments. In Section 5, we provide Monte Carlo simulation results. In Section 6, we apply our method to estimate the effects of immigration on wages. Section 7 concludes.

Setup
In the following we introduce notational conventions used throughout this paper. Matrices are in bold; vectors are in bold italic. Let y be an n × 1 vector of the observed outcome, let D_1, ..., D_P be P endogenous regressor vectors (each n × 1), collected in the n × P matrix D, and let Z_1, ..., Z_J be J instrument vectors, collected in the n × J matrix Z. The error terms u and ε_p, p ∈ {1, ..., P}, are n × 1 vectors, correlated with cov(u, ε_p) ≡ σ_up; these covariances measure the endogeneity of the regressors in D. The coefficient vector of interest is β (P × 1). The J × P matrix γ contains the first-stage coefficients.
Let s be the number of invalid instruments in the set I, g the number of valid instruments in the set V, and J = g + s the total number of instruments in the set J. The mean of a variable x is μ_x = E(x), ||·|| denotes the L2-norm and |·| denotes cardinality. The n × n projection matrix is P_X ≡ X(X′X)^{-1}X′, the annihilator matrix is M_X = I − P_X, and D̂ = P_Z D are the first-stage fitted values.
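As a concrete illustration of this notation, the projection and annihilator matrices can be computed directly. The following is a minimal numerical sketch; the dimensions and data-generating values are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, J = 200, 6                      # illustrative sample size and number of IVs
Z = rng.normal(size=(n, J))        # instrument matrix Z
D = Z @ rng.normal(size=J) + rng.normal(size=n)   # a single regressor (P = 1)

# Projection matrix P_Z = Z (Z'Z)^{-1} Z' and annihilator M_Z = I - P_Z.
P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)
M_Z = np.eye(n) - P_Z
D_hat = P_Z @ D                    # first-stage fitted values D-hat

# P_Z is symmetric and idempotent; M_Z annihilates the column space of Z.
assert np.allclose(P_Z, P_Z.T)
assert np.allclose(P_Z @ P_Z, P_Z)
assert np.allclose(M_Z @ Z, 0.0)
```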

Model
We start from the model setup with a single endogenous regressor, i.e. throughout Section 2 and Section 3, P = 1. The extension of our method to multiple endogenous regressors can be found in Section 4.1. Following previous research, we adopt the following observed data model, which takes the potentially invalid instruments into account:

y = Dβ + Zα + u,    (1)

and the first-stage reduced form is

D = Zγ + ε.    (2)

Here α is a J × 1 vector with entries α_j, each of which is associated with an instrument. A nonzero entry indicates that the corresponding instrument has a direct effect on the outcome variable and hence is invalid; an IV associated with a zero entry in α is valid (Guo, Kang, Cai, and Small, 2018). The number of valid IVs is g ≡ |{j : α_j = 0}|.
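A small simulation from this model may help fix ideas. The sketch below draws data from (1) and (2) for P = 1; the instrument count, coefficient values and error covariance are assumptions chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n, J = 2000, 6
beta = 0.5                                   # true effect (assumed)
alpha = np.array([0., 0., 0., 0., 1., 1.])   # last two IVs invalid: direct effects
gamma = np.full(J, 0.4)                      # first-stage coefficients

Z = rng.normal(size=(n, J))
# Correlated errors create endogeneity: cov(u, eps) = 0.6 (assumed).
u, eps = rng.multivariate_normal([0, 0], [[1, .6], [.6, 1]], size=n).T

D = Z @ gamma + eps                 # first-stage reduced form (2)
y = D * beta + Z @ alpha + u        # outcome model (1); Z @ alpha are direct effects

g = int((alpha == 0).sum())         # number of valid IVs, g = |{j : alpha_j = 0}|
assert g == 4
```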

Assumptions
The first assumption makes sure that the just-identified estimators all exist.

Assumption 1. Existence of just-identified estimators.
For all j ∈ {1, ..., J}, we assume γ_j ≠ 0.

The next assumption is key. It states that the largest group is composed of valid IVs, where a group of IVs is defined as a set whose just-identified estimators converge to the same value (Guo, Kang, Cai, and Small, 2018).

Assumption 5. Plurality Rule
The valid instruments form the largest group, i.e. |V| > max_{c≠0} |{j : α_j/γ_j = c}|.

The assumptions above will be modified when there is more than one endogenous regressor.

Properties of Just-identified Estimators
From (1) and (2), we have the outcome-instrument reduced form y = Z(γβ + α) + u + εβ = ZΓ + v, with Γ = γβ + α. There are J just-identified 2SLS estimators. As in Windmeijer, Liang, Hartwig, and Bowden (2020), we write these estimators as β̂_j = Γ̂_j / γ̂_j, where Γ̂_j and γ̂_j are the OLS estimators of Γ_j and γ_j respectively. Then we have plim β̂_j = Γ_j/γ_j = β + α_j/γ_j. Hence, the inconsistency of β̂_j is c_j = plim(β̂_2SLS,j) − β = α_j/γ_j. Following Guo, Kang, Cai, and Small (2018), we define a group as the set of instruments whose just-identified estimators converge to the same value, G_c = {j : c_j = c}. The group consisting of all valid instruments is then G_0 = {j : c_j = 0} = V. Let there be Q groups.
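The clustering of the just-identified estimates around β + c_j can be checked numerically. In the sketch below (all design values are assumptions), the estimates of the valid IVs concentrate around the true β while those of the invalid IVs concentrate around β + α_j/γ_j:

```python
import numpy as np

rng = np.random.default_rng(2)
n, J, beta = 50_000, 6, 0.5
alpha = np.array([0., 0., 0., 0., 1., 1.])   # last two IVs invalid (assumed)
gamma = np.full(J, 0.4)

Z = rng.normal(size=(n, J))
u, eps = rng.multivariate_normal([0, 0], [[1, .6], [.6, 1]], size=n).T
D = Z @ gamma + eps
y = D * beta + Z @ alpha + u

Gamma_hat = np.linalg.lstsq(Z, y, rcond=None)[0]   # reduced-form coefficients
gamma_hat = np.linalg.lstsq(Z, D, rcond=None)[0]   # first-stage coefficients
beta_j = Gamma_hat / gamma_hat                     # the J just-identified estimates

# Valid IVs cluster around beta = 0.5; invalid IVs cluster around
# beta + c_j = beta + alpha_j / gamma_j = 0.5 + 1 / 0.4 = 3.
assert np.allclose(beta_j[:4], beta, atol=0.1)
assert np.allclose(beta_j[4:], 3.0, atol=0.2)
```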

IV Selection and Estimation Method
We explore clustering methods for IV selection and estimation. First we fit the general clustering framework to the IV selection problem, which is summarized in the minimisation problem in (3).
This general method needs a pre-specified parameter K, which is the number of clusters. We show that when K equals the number of groups, it can achieve consistent selection. However, the fact that consistent selection depends on K makes it difficult to implement in practice, as we do not have prior knowledge about the number of groups. If K is too large (larger than the number of groups), then the largest group will be split. If K is too small, then the largest group might be in a cluster with some other group. To tackle this problem, we propose a downward testing procedure which combines the agglomerative hierarchical clustering method (Ward's method) with the Sargan test for overidentifying restrictions to select the valid instruments, which allows us to systematically select K.
Note that in this section, we develop our methods and properties with P = 1. In Section 4.1 we extend the method to cases where P > 1. All proofs in the Appendix are for a general P .

Clustering Method for IV Selection
Let θ = {θ_1, ..., θ_K} be a partition of the J just-identified estimators β̂_j into K cluster cells. Let [k] be the set of identities of the just-identified estimators in cluster θ_k. The clustering result is the solution to the following minimization problem:

θ̂ = argmin_θ Σ_{k=1}^{K} Σ_{j∈[k]} ||β̂_j − β̄_k||²,    (3)

where β̄_k is the mean of all just-identified estimators β̂_j in cluster θ_k.
Based on Assumption 5, the group that consists of valid IVs is selected as the cluster that contains the largest number of just-identified estimators, θ̂_max = argmax_{θ_k} |[k]|. The set of invalid instruments is then Î = {j : β̂_j ∉ θ̂_max}. Now we show that when the number of clusters K is equal to the number of groups Q, K = Q, the partition minimizing the sum in (3) is such that each θ̂_k = G_q, i.e. each cluster is formed by a group. Define this partition as the true partition θ_0.
To see this, first note that if the partition is such that θ_k = G_q for all k, q, then for all j ∈ [k] we have plim β̂_j = plim β̄_k, and plim ||β̂_j − β̄_k||² = 0. This is the case for all k ∈ {1, ..., K}, hence the objective in (3) converges to zero at θ_0. Second, if the partition is such that some θ_k ≠ G_q, i.e. θ ≠ θ_0, then plim β̂_j ≠ plim β̄_k for some j ∈ [k], and the objective stays bounded away from zero. This means that as n → ∞ there is a unique solution to (3), namely θ = θ_0. Based on Assumption 5, the valid instruments are those contained in the largest cluster. This of course relies on the correct choice of K, namely K = Q.

Ward's Algorithm for IV Selection
To tackle the difficulty of choosing the correct value of K without prior knowledge of the number of groups, we propose a selection method which combines Ward's algorithm, a general agglomerative hierarchical clustering procedure proposed by Ward (1963), with the Sargan test for overidentification. Our selection algorithm has two parts. The first part is Ward's algorithm, as listed in Algorithm 1, which generates K clusters of the just-identified estimators for each K = 1, ..., J. After obtaining the clusters for each K, we use a downward testing procedure based on the Sargan test (Algorithm 2) to select the set of valid instruments:

1. Order the largest clusters along the clustering path by their size (number of just-identified estimators), in decreasing order.
2. Do the Sargan test on the instruments contained in the largest cluster, using the rest as controls.
3. If the test rejects, move to the next-largest cluster on the path and repeat Step 2.
4. Select the largest cluster (in terms of the number of just-identified estimators) that does not get rejected by the Sargan test. If there are multiple such clusters, select the one with the smallest Sargan statistic.
5. Select the instruments contained in the cluster from Step 4 as valid instruments.
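Ward's algorithm itself is available off the shelf, e.g. in `scipy`. The sketch below clusters a toy vector of just-identified estimates and walks the merge path, reporting the largest cluster at each K; the estimate values are made up and the Sargan testing step is omitted:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy just-identified estimates: four IVs whose estimates lie near 0.5
# (the valid group) and two near 3 (an invalid group). Values are made up.
beta_j = np.array([0.48, 0.51, 0.50, 0.53, 2.9, 3.1])

# Ward's agglomerative algorithm on the one-dimensional estimates (Algorithm 1).
link = linkage(beta_j.reshape(-1, 1), method="ward")

# Walk the merge path: cut the dendrogram at each K = J, ..., 1 and report
# the largest cluster, which is the candidate set of valid IVs at that step.
for K in range(len(beta_j), 0, -1):
    labels = fcluster(link, t=K, criterion="maxclust")
    sizes = np.bincount(labels)[1:]
    largest = np.argmax(sizes) + 1
    members = [j for j in range(len(beta_j)) if labels[j] == largest]
    print(f"K={K}: largest cluster has {sizes.max()} IVs", members)
```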
The Sargan statistic in Step 4 is given by

S(δ̂_K) = n · û(δ̂_K)′ P_Z û(δ̂_K) / (û(δ̂_K)′ û(δ̂_K)),

where δ̂_K is the 2SLS estimator using the instruments contained in the largest cluster for each K as valid instruments and controlling for the rest of the instruments, and û(δ̂_K) is the 2SLS residual. We show later that to guarantee consistent selection, the critical value for the Sargan test must diverge at an appropriate rate (ξ_n → ∞ with ξ_n = o(n); see Theorem 1). The procedure is illustrated in Figure 2, whose panels show merge steps 2 to 5 with 4, 3, 2 and 1 clusters respectively. Here, we have a situation with six instruments. Three of them are valid, as they affect the outcome variable only through the endogenous regressor, while this is not the case for the other three invalid instruments. In the graph, the circles above the real line denote the just-identified estimates of the coefficient β estimated with each of the six instruments. From left to right, we number these estimators and their corresponding instruments as No. 1 to No. 6.
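An implementation sketch of the Sargan statistic used in Step 4 follows, for P = 1, with the candidate cluster as instruments and the remaining IVs as controls; the simulated design values are assumptions:

```python
import numpy as np

def sargan_stat(y, D, Z, cluster):
    """Sargan statistic for the 2SLS model that treats the IVs in `cluster`
    as valid instruments and controls for the remaining columns of Z (P = 1)."""
    rest = [j for j in range(Z.shape[1]) if j not in cluster]
    X = np.column_stack([D, Z[:, rest]])              # regressors: D plus controls
    W = np.column_stack([Z[:, cluster], Z[:, rest]])  # instruments
    P_W = W @ np.linalg.solve(W.T @ W, W.T)
    X_hat = P_W @ X
    delta = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)  # 2SLS coefficients
    u = y - X @ delta                                  # 2SLS residuals
    # Under H0 (all IVs in `cluster` valid), approx. chi2 with |cluster|-1 df.
    return len(y) * (u @ P_W @ u) / (u @ u)

# Demo: IVs 0-3 valid, IVs 4-5 invalid (assumed design).
rng = np.random.default_rng(7)
n = 2000
alpha = np.array([0., 0., 0., 0., 1., 1.])
Z = rng.normal(size=(n, 6))
u0, eps = rng.multivariate_normal([0, 0], [[1, .6], [.6, 1]], size=n).T
D = Z @ np.full(6, 0.4) + eps
y = D * 0.5 + Z @ alpha + u0

s_valid = sargan_stat(y, D, Z, [0, 1, 2, 3])     # H0 true: moderate statistic
s_mixed = sargan_stat(y, D, Z, [0, 1, 2, 3, 4])  # invalid IV included: large
assert s_valid < s_mixed
```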
In the initial step (0) of the clustering process, each just-identified estimator has its own cluster. In step 1, we join the two estimators which are closest in terms of their Euclidean distance, i.e. those estimated with instruments No. 3 and No. 4 (the two orange circles). These two estimators now form one cluster, and we are left with five clusters. We re-check the distance between each pair of the five clusters and merge the closest two into a new cluster. We continue with this procedure until there is only one cluster left, in the bottom right graph. We evaluate the Sargan test at each step, using the instruments contained in the largest cluster. When the p-value is larger than a certain threshold, say 0.1/log(n), we stop the procedure. Ideally this will be the case at step 2 or 3 of the algorithm, because here the largest cluster (in orange) is formed only by valid IVs (No. 2, 3 and 4). If this is the case, only the valid IVs are selected as valid.

Oracle Selection and Estimation Property
In this section, we state the theoretical properties of the IV selection results obtained by Algorithm 1 and Algorithm 2 and the post-selection estimators. See Section 4.1 for detailed theoretical results developed for the general case P ≥ 1. We establish that our method can achieve oracle properties in the sense that it can select the valid instruments consistently, and that the post-selection IV estimator has the same limiting distribution as if we knew the true set of valid instruments.

Theorem 1. Consistent selection
Let ξ_n be the critical value for the Sargan test in Algorithm 2, and let V̂_dts be the set of instruments selected by Algorithm 1 and Algorithm 2. Under Assumptions 1-5, for ξ_n → ∞ and ξ_n = o(n),

lim_{n→∞} P(V̂_dts = V) = 1.

The post-selection 2SLS estimator using the selected valid instruments and controlling for the selected invalid instruments has the same asymptotic distribution as the oracle estimator:

Theorem 2. Let Z_Î = Z \ Z_V̂dts, with Z_Î, Z_V̂dts being the selected invalid and valid instruments respectively. Let β̂_V̂dts be the 2SLS estimator using Z_V̂dts as instruments and controlling for Z_Î. Then

√n (β̂_V̂dts − β) →_d N(0, σ²_or),

where σ²_or is the asymptotic variance of the oracle 2SLS estimator, i.e. the estimator that uses the true set of valid instruments and controls for the true set of invalid instruments I_0.
The proof of Theorem 2 follows from the proof of Theorem 2 in Guo, Kang, Cai, and Small (2018): consistent selection leads to the oracle properties.

Computational Complexity
Recent implementations of the hierarchical agglomerative clustering algorithm have a computational cost of O(J²) (Amorim, Makarenkov, and Mirkin, 2016). In the downward testing procedure, at most J − 1 different models need to be tested. Therefore, the computational cost of the downward testing algorithm is O(J²). This is an improvement on the CIM, which has a time complexity of O(J² log(J)) and where the maximal number of tests is J(J − 1)/2.

Extensions
In this section, we propose extensions of the method to settings with multiple endogenous regressors and with heterogeneous treatment effects. We also discuss the performance of our method in the presence of weak instruments, as compared with the HT and CI methods.

Multiple Endogenous Regressors
One shortcoming of previous methods for selecting invalid instruments is that they only allow for one endogenous regressor. In this section we therefore show how our method can be naturally extended to select invalid instruments when P > 1. First of all, the inputs of our method, the just-identified estimators, are now estimated from all combinations of P instruments out of Z_1, ..., Z_J. Hence we have C(J, P) = J!/(P!(J − P)!) instead of J just-identified estimators. Let [j] be a set of identities of any P instruments such that the model is exactly identified with these P instruments, and let Z_[j] denote the corresponding n × P instrument matrix. To guarantee that all the C(J, P) just-identified estimators exist, we modify Assumption 1 as follows:

Assumption 1.a. Existence of just-identified estimators
For all possible values of [j], let γ_[j] be the P × P matrix formed by stacking the k-th rows of γ for all k ∈ [j]. Then we assume that γ_[j] is nonsingular. The plurality assumption also needs modification for P > 1. For P = 1, Assumption 5 states that the valid instruments form the largest group, where instruments form a group if their just-identified estimators converge to the same value. If we find the largest set of just-identified estimators that converge to the same value, then this set is automatically the largest group of instruments, as each just-identified estimator is estimated by a single instrument. However, when P > 1, each just-identified estimator is estimated by multiple instruments, hence the equivalence between the largest set of just-identified estimators and the largest group of instruments may not hold. In this case, we modify the plurality rule so that it is based on combinations of P instruments instead of individual instruments. The modification starts with revisiting the asymptotics of the just-identified estimators for P > 1.
There are C(J, P) just-identified models. We write the corresponding just-identified estimators for β and π analogously to the proof of Proposition A1 in Windmeijer, Liang, Hartwig, and Bowden (2020) for the case P = 1. First, for an arbitrary [j], partition the instrument matrix as Z = (Z_1 Z_2), where Z_1 is an n × P matrix containing the [j]-th columns of Z, and Z_2 is an n × (J − P) matrix containing the remaining columns of Z. Let γ = (γ_1′ γ_2′)′ be the corresponding partition of the matrix of first-stage coefficients.
The just-identified 2SLS estimator β̂_[j], which uses Z_[j] as instruments and controls for the remaining instruments, then satisfies, by Assumption 4, plim β̂_[j] = β + γ_1^{-1} α_1, where α_1 contains the entries of α corresponding to [j]. Hence, the inconsistency of β̂_[j] is c_[j] = plim β̂_[j] − β = γ_1^{-1} α_1, and there are C(J, P) inconsistency terms c_[j]. As for P > 1 not every IV is associated with a single c, we introduce the concept of a family: a family is a set of IV combinations whose just-identified estimators converge to the same value. The family that consists of the IV combinations which generate consistent estimators is F_0 = {[j] : c_[j] = 0}. Let there be Q families. Note that when P = 1, a group of IVs automatically is a family.
Analogously to Assumption 5, we assume that F_0 is the largest family.

Assumption 5.a. New plurality rule
|F_0| > max_{c≠0} |{[j] : c_[j] = c}|.

We show in Appendix B that a combination of IVs is an element of F_0 if and only if all of the IVs used for its estimation are in fact valid. This means that the family of valid IVs consists of all combinations of P IVs out of V, and hence |F_0| = C(g, P). The inconsistency terms of the other families depend on the first-stage coefficient vectors, and hence there is no direct relation from α_[j] to c_[j]. This means that one family can be estimated with IV combinations which have different vectors α_[j]. We show this in Appendix C.
One way in which the new plurality rule can be fulfilled is given by Assumption 5.b, where [j] and [j′] are two different sets of P IVs.
The second part of the assumption makes sure that the family of valid IVs consists of more than one element, without which Assumption 5.a cannot be fulfilled. The third part makes sure that sets containing at least one invalid IV do not converge to the same value, so that the largest group of valid IVs also translates into the largest family. This can be seen as a technical assumption and is stronger than needed.
The procedure to estimate V is analogous to the one in the preceding section (see Appendix A for illustration) only that now we need to account for the presence of families.
The valid IVs are then selected as those that are involved in estimating the largest cluster.
The cluster containing the valid instruments is chosen as the one where the number of estimates in the cluster is maximal and the Sargan test is not rejected (Sargan statistic smaller than the threshold ξ J−|I|−P,n ). In cases where there are multiple such clusters, we select the cluster in which more IVs are involved.
One ambiguity arises in finite samples: asymptotically, following Assumption 5.b, the largest group (in terms of direct effects) being valid also implies that the largest family is valid. In finite samples, however, the number of IVs involved in the estimation of the largest cluster might be smaller than the number involved in the estimation of another cluster. Therefore, we could also select the valid IVs by directly choosing the family associated with the maximal number of IVs instead of the largest cluster. It is an empirical question which of the two selection rules should be used.
The method has oracle properties as stated in Theorem 1 and Theorem 2. Here we formally establish the theoretical results for the general case P ≥ 1; see Appendix D for the proofs of all theorems. The following lemma establishes that when assigning a just-identified estimator β̂_q to either (1) clusters that are formed by just-identified estimators from the same family as β̂_q, or (2) clusters that contain at least one element from a different family than β̂_q, asymptotically β̂_q will be assigned to the first type of cluster.
Lemma 1. Let β̂_j be a just-identified estimator such that β̂_j ∈ F_q, and let θ_q be a cluster such that all elements of θ_q are from F_q. For r ≠ q, let θ_r be a cluster such that at least one of its elements, β̂_l ∈ θ_r, is from a different family, β̂_l ∉ F_q. Under Assumptions 1 to 5 and Algorithm 1, when assigning β̂_j to either θ_q or θ_r, β̂_j is assigned to θ_q with probability converging to 1.
In Algorithm 1, we start from the number of clusters K = C(J, P). At each step onward, according to Step 3 in Algorithm 1, two clusters are joined into a new cluster. For a cluster that contains more than one just-identified estimator, the cluster mean can be viewed as a single estimator averaging over all the estimators in it (e.g. β̄_k and β̄_l in Step 3 of Algorithm 1). Therefore, the joining process at each step of Algorithm 1 can be viewed as assigning an estimator to a cluster. Note that a cluster that consists of just-identified estimators from the same family can be viewed as an estimator from this family.
Based on Lemma 1, along the path of Algorithm 1, estimators from different families will not be joined with each other until all the estimators from the same family have merged into one cluster. If for each family, the just-identified estimators contained in it have merged into the same cluster, then we know that the total number of clusters would be K = Q. This implies that when the number of clusters is smaller than Q, then the current clusters must be subsets of families.

Corollary 1. Under assumptions 1 to 4, in steps 3 and 4 of Algorithm 1:
To better understand why this is the case, consider the following analogy. There are N guests (the C(J, P) just-identified estimates) who belong to Q families. These N people live in a hotel which has N rooms (clusters). Each day, one room disappears, and one of the guests needs to move into the room of some other guest. The people in a family have closer ties, so the person whose room disappears will move into the room of somebody from their own family. This goes on until each family lives in one crowded room. The hotel then continues to shrink, and only now are people from different families merged into the same rooms. The largest family can be detected when all people from the same family have been merged into one room, but the people from the other families have not yet been completely merged into single rooms (or have just been merged into one room each).
In Algorithm 1, the number of clusters starts at K = C(J, P) and ends at K = 1. At each step in between, the number of clusters decreases by 1, hence there must be a step where K = Q.
Based on Lemma 1 and Corollary 1, estimators from different families are joined together only when all elements of their own family have been completely joined into their clusters. This implies in particular that when K = Q, there is a cluster θ̂_k = F_0. Therefore, the path generated by Algorithm 1 contains the true partition, as there must be one step such that K = Q.

Corollary 2. When
The theoretical results above establish that the selection path generated by Algorithm 1 covers the true set of valid instruments F 0 . Next we show that by Algorithm 2, we can locate this F 0 and select the valid instruments consistently. This consistent selection property is summarized in Theorem 1 which holds for P ≥ 1 under Assumption 1(1.a.) to Assumption 5 (5.a., 5.b.). This is also the case for Theorem 2.

Weak Instruments
A major advantage of our method is that it deals effectively with the presence of weak instruments (valid or invalid). The intuition is that the instrument-specific just-identified estimators of weak instruments tend to have a much larger magnitude than those of the strong instruments.
Hence, the Euclidean distance between these two types of estimators tends to be large, making them less likely to be joined with each other. The existing methods, HT and CIM, can face problems in the presence of weak instruments: CIM always selects the weak instruments as valid, while the first-stage hard thresholding of the HT method might keep invalid instruments and rule out valid ones under certain correlation structures among the instruments.
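The following sketch illustrates this intuition; the design, with one weak valid IV and one strong invalid IV, is an assumption for illustration. Dividing by a near-zero first-stage estimate typically throws the weak IV's just-identified estimate far from both the valid and the invalid cluster:

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta = 5000, 0.5
# IVs 0-3: strong and valid; IV 4: weak but valid; IV 5: strong but invalid.
gamma = np.array([0.4, 0.4, 0.4, 0.4, 0.02, 0.4])
alpha = np.array([0., 0., 0., 0., 0., 1.])

Z = rng.normal(size=(n, 6))
u, eps = rng.multivariate_normal([0, 0], [[1, .6], [.6, 1]], size=n).T
D = Z @ gamma + eps
y = D * beta + Z @ alpha + u

Gamma_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
gamma_hat = np.linalg.lstsq(Z, D, rcond=None)[0]
beta_j = Gamma_hat / gamma_hat       # just-identified estimates

# IVs 0-3 cluster near 0.5 and IV 5 near 3.0, while the weak IV 4 divides
# by a noisy near-zero gamma_hat and is typically far from both clusters,
# so Ward's algorithm tends to leave it in a cluster of its own.
print(np.round(beta_j, 2))
```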

Heterogeneous Treatment Effects
The instrumental variable estimator also has a local average treatment effect (LATE) interpretation: it estimates the average treatment effect for the sub-population whose treatment status can be changed by the instrument (Imbens and Angrist, 1994). Hence, LATEs naturally vary with the instruments. For example, an increase in the minimum school-leaving age and proximity to a school will induce different populations to increase their schooling. In this section we present such a setting and argue that our method can retrieve the largest group associated with a given LATE, or the whole set of different LATEs.
For simplicity, we look at a setting with a binary treatment D_i, a binary instrument Z_i and potential outcomes Y_1i and Y_0i. Under the independence, exclusion and monotonicity assumptions, Imbens and Angrist (1994) show that the IV estimand is the average treatment effect of the compliers:

(E[Y_i | Z_i = 1] − E[Y_i | Z_i = 0]) / (E[D_i | Z_i = 1] − E[D_i | Z_i = 0]) = E[Y_1i − Y_0i | D_1i > D_0i].

In the following, we show a setting in which the LATEs depend on one potentially unobserved variable U. For this, we make use of the setting in Angrist and Fernandez-Val (2010).
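The Wald/LATE equality above can be verified in a small simulation with U-dependent effects in the spirit of this section; the assignment thresholds and the effect function are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
Z = rng.integers(0, 2, size=n)          # binary instrument
U = rng.uniform(size=n)                 # unobserved type

# Latent-index assignment: always-takers (U < 0.2), compliers (0.2 <= U < 0.6).
D0 = (U < 0.2).astype(float)
D1 = (U < 0.6).astype(float)            # monotonicity: D1 >= D0
D = np.where(Z == 1, D1, D0)

tau = 1.0 + U                           # treatment effect depends on U
Y0 = rng.normal(size=n)
Y = Y0 + tau * D                        # observed outcome

# Wald estimand: (E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0]).
wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (D[Z == 1].mean() - D[Z == 0].mean())

# LATE = E[tau | complier] = 1 + E[U | 0.2 <= U < 0.6] = 1.4.
assert abs(wald - 1.4) < 0.15
```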
The treatment is determined by a latent-index assignment mechanism with index h_Z(U_i, Z_i), where h_Z(U_i, 1) ≥ h_Z(U_i, 0), and the potential outcomes depend on the variable U, with errors satisfying E(ε | U, Z) = 0.
Angrist and Fernandez-Val (2010) then show that, under a conditional effect ignorability assumption, the LATE can be written as a function of U. Next, we are interested in a setting where the by-IV treatment effects form groups. This might be the case when different compliant populations have the same U, or when different values of U lead to the same β_[j]. Keep in mind that the number of groups is Q.

Lemma 2. Under Assumptions 6, 7 and 8, in steps 3 and 4 of Algorithm 1:
This follows from the same arguments as above. In the same way, the largest cluster corresponds to the largest group G_m.
The difference to the setting with invalid IVs is that in the LATE setting not only the largest cluster contains valuable information, but also the smaller clusters contain coefficient estimates obtained with valid instruments.

Monte Carlo Simulations
First, we apply our method to simulated data. In the single regressor setting, we compare the performance of the new clustering method with that of the existing Confidence Interval method and the Two-Stage Hard Thresholding method. We therefore run simulations which closely follow the setting in Windmeijer, Liang, Hartwig, and Bowden (2020): there are 21 IVs, twelve of which are invalid and nine valid, with α = c_α(ι_6, 0.5ι_6, 0_9) and γ = c_γ ι_21, where ι_r is an r × 1 vector of ones and 0_r is an r × 1 vector of zeros. We set c_α = 1 and c_γ = 0.4. The true β is 0 and Z ∼ N(0, Σ_z) with Σ_z,jk = 0.5^|j−k|. The errors follow the joint normal specification of Windmeijer, Liang, Hartwig, and Bowden (2020). In the multiple endogenous regressors setting, we set α = (ι_12, 0_9), γ_1 = unif(1, 2), γ_2 = unif(3, 4) and, when there is a third endogenous regressor, γ_3 = unif(5, 6). The rest of the parameters are the same as before.
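The single-regressor design above can be sketched as follows; the error covariance value is an assumption for illustration, as the exact specification follows Windmeijer, Liang, Hartwig, and Bowden (2020):

```python
import numpy as np

def simulate_single(n, c_alpha=1.0, c_gamma=0.4, seed=0):
    """Single-regressor design described above: 21 IVs, twelve invalid,
    alpha = c_alpha*(iota_6, 0.5*iota_6, 0_9), gamma = c_gamma*iota_21,
    Z ~ N(0, Sigma_z) with Sigma_jk = 0.5^|j-k|, true beta = 0.
    The error covariance (0.25 here) is an assumed value."""
    rng = np.random.default_rng(seed)
    J = 21
    alpha = c_alpha * np.concatenate([np.ones(6), 0.5 * np.ones(6), np.zeros(9)])
    gamma = c_gamma * np.ones(J)
    idx = np.arange(J)
    Sigma_z = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    Z = rng.multivariate_normal(np.zeros(J), Sigma_z, size=n)
    u, eps = rng.multivariate_normal([0, 0], [[1, .25], [.25, 1]], size=n).T
    D = Z @ gamma + eps
    y = Z @ alpha + u            # true beta = 0
    return y, D, Z, alpha

y, D, Z, alpha = simulate_single(1000)
assert Z.shape == (1000, 21) and (alpha == 0).sum() == 9
```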
With this setting we estimate β for m = 100 replications. The results can be found in table 4.
Again, it appears that the performance of our estimator approaches that of the oracle estimator as the sample size grows large.
These simulations illustrate the two key advantages of our method. First, in settings with weak instruments our method potentially outperforms the existing methods. If it is not known whether weak IVs are valid or invalid, it is preferable to use our agglomerative clustering (AC) method instead of the existing methods. In the setting from WHLB, the performance of the methods is comparable. Second, our method is applicable to the case of multiple endogenous variables.

Application: Effect of Immigration on Wages
In this section we apply our method to the estimation of the effects of immigration on wages in the US. We first describe the setting and then discuss the results.
Many recent studies have tried to estimate the causal effects of immigration on labor market outcomes. 1 Most papers in the literature only estimate the contemporaneous effects of immigration on labor market outcomes. Jaeger, Ruist, and Stuhler (2020) point out that there might be general equilibrium adjustments that affect wages in the long run, for example through the attraction of capital or the responses of native labor. This calls for including lagged immigration into the regression equation. However, the instrumental variables typically used in estimation might be invalid, because of correlation with unobservable shocks or direct effects.
We estimate a linear model of wages on contemporaneous and lagged immigration, as in Basso and Peri (2015).
The key econometric challenge is that migrants select where to live endogenously; for example, migrants might choose where to live based on economic conditions in a region. This creates a bias in the estimates. A widely used estimation strategy to address this issue is the shift-share instrumental variable, also known as the Bartik instrument due to Bartik (1991).
The key idea is to interact shares of previous migrants in a base period with current, aggregate-level shifts, or inflows of migrants. This identification strategy dates back to Altonji and Card (1991) in migration economics. Goldsmith-Pinkham, Sorkin, and Swift (2020) show that the validity of the shift-share instrument depends on the validity of all shares, and that an overidentified model with all shares as instruments can be used equivalently to the just-identified model. Therefore, we use all shares s_{jlt_0} of migrants from origin country j in region l at base period t_0. We use origin-specific shares from 19 origin-country groups and base years 1970 and 1980 as separate IVs, obtaining L = 38 IVs.
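The assembly of the 38 share instruments can be sketched as follows. The region count and the random numbers standing in for the actual census shares are purely illustrative; only the 19-origins-by-two-base-years structure comes from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n_regions, n_origins = 50, 19           # region count is hypothetical
base_years = (1970, 1980)

blocks = []
for t0 in base_years:
    # shares[l, j]: share of migrants from origin group j in region l at t0;
    # random numbers stand in for the actual base-period census shares.
    raw = rng.random((n_regions, n_origins))
    shares = raw / raw.sum(axis=0, keepdims=True)
    blocks.append(shares)

# Each origin-by-base-year share is a separate instrument: L = 19 * 2 = 38.
Z = np.column_stack(blocks)
print(Z.shape)  # (50, 38)
```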
The main drawback of Bartik-type designs is that all instruments need to be valid.
Why might these instruments be invalid? Jaeger, Ruist, and Stuhler (2020) show that the shift-share IV estimator might be inconsistent, first, because of correlation of the IVs with unobserved demand shocks and, second, because of dynamic adjustment processes; for consistency, neither of the two should play a role. However, it is quite plausible that some origin-country groups did not locate randomly in the past or have had direct effects on wages. The second challenge can be tackled in part by including lagged immigration as an additional regressor, although this regressor is subject to the same endogeneity problem as before and hence should also be instrumented.² To circumvent these problems, we apply the new estimator, which allows for direct effects of many shares on wages by selecting the invalid shares. This alternative estimation of shift-share designs has also been proposed in Apfel (2019).

Table 5: Impact of Immigration

Results
The results can be found in Table 5. The first column shows results for ordinary least squares: the contemporaneous effect is 0.586, while the lagged effect is lower and negative.
When using all shares as valid IVs, both effects are higher in absolute terms but only the contemporaneous effect is marginally statistically significant. The Hansen-Sargan test for this model gives a p-value of 0.0126, which is lower than the proposed significance level of 0.1/log(n).
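The threshold 0.1/log(n) shrinks slowly with the sample size; a quick computation makes the comparison with the reported p-value of 0.0126 concrete (the sample sizes below are hypothetical, since the application's exact n is not restated here).

```python
import math

# The Hansen-Sargan p-value is compared against the threshold 0.1 / log(n).
for n in (100, 1_000, 10_000):
    print(n, round(0.1 / math.log(n), 4))
```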
When using AHC with this significance level in the downward testing procedure, two origin-country shares are selected as invalid: the shares of Mexicans in the US in 1970 and 1980. Thus, two IVs that are similar a priori, in that they are shares from the same origin country, are selected as invalid.² These shares are likely to be invalid because Mexican migrants were attracted to border regions such as Texas and California by the good economic conditions in those states, both in the base year and in later periods. California's economy has a large agricultural sector, and both states are among the wealthiest in the US. It is therefore likely that wages or unobserved productivity shocks that drove the initial settlement are correlated over time, invalidating the initial shares. Moreover, Goldsmith-Pinkham, Sorkin, and Swift (2020) find that Mexico has the highest sensitivity-to-misspecification weight, that is, the overall bias will be sensitive to any invalidity stemming from the Mexican share. Indeed, after controlling for the Mexican shares, the contemporaneous effect almost doubles, while the lagged effect triples.

² A more thorough discussion of the inconsistency of this particular IV estimator and the applicability of IV selection techniques in this setting can be found in Jaeger, Ruist, and Stuhler (2020).

Conclusion
We have proposed a novel method to select valid instruments and applied it to the estimation of the effect of immigration on wages in the US. The method can easily be applied to any other setting in which there are many candidate instruments. Another application that we will include is Mendelian randomization, the use of instrumental variables in epidemiology.
The advantages of our method are that it extends straightforwardly to the setting with multiple endogenous regressors and to the setting with heterogeneous treatment effects, and that it performs well in the presence of weak instruments.
Ways to further improve the method would be to account for the variance of each just-identified estimator in the selection algorithm, and to apply it in nonlinear models. We also plan to further explore applications of our method to models with richer forms of heterogeneity.

A. Illustration of the IV Selection Procedure for P > 1
In Figure 2, the procedure is illustrated for a situation with four IVs and two endogenous regressors. Instrument 1 is invalid, because it is directly correlated with the outcome, while the remaining three IVs (2, 3, 4) are related to the outcome only through the endogenous regressors and are hence valid.
In the first graph on the top left, we have plotted each just-identified estimate. The horizontal and vertical axes represent coefficient estimates of the effects of the first regressor (β_1) and the second regressor (β_2), respectively. Each point has been estimated with two IVs, in this case with the IV pairs 1-2, 1-3, 1-4, 2-3, 2-4 and 3-4, because there are four candidate IVs.
In the initial step (0), each just-identified estimate forms its own cluster. In step 1, we join the estimates which are closest in terms of their Euclidean distance, e.g. those estimated with pairs 2-3 and 2-4. These two estimates now form one cluster, leaving five clusters. We recompute the distance to this new cluster and continue with this procedure until only one cluster is left, in the bottom-right graph. At each step we evaluate the Sargan test, using the IVs which are involved in the estimation of the largest group. When the p-value is larger than a certain threshold, say 0.05, we stop the procedure. Ideally this is the case at step 2 or 3 of the algorithm, because there the largest group (in orange) is formed only by the valid IVs (2, 3 and 4); if so, exactly the valid IVs are selected.
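The merging step can be sketched in a few lines. The 2-D points below are made-up just-identified estimates for the four-IV example (pairs containing the invalid IV 1 land far from the valid cluster); the Sargan stopping rule is omitted, so this only illustrates how the largest cluster evolves.

```python
import numpy as np
from itertools import combinations

# One estimate per IV pair; values are invented for illustration only.
estimates = {
    (1, 2): np.array([2.00, 3.00]),
    (1, 3): np.array([3.00, 1.50]),
    (1, 4): np.array([1.00, 2.50]),
    (2, 3): np.array([0.50, 0.52]),   # pairs of valid IVs cluster
    (2, 4): np.array([0.48, 0.55]),   # tightly around the true effect
    (3, 4): np.array([0.53, 0.50]),
}

clusters = [[pair] for pair in estimates]      # step 0: singleton clusters

def dist(c1, c2):
    """Average-linkage Euclidean distance between two clusters."""
    return np.mean([np.linalg.norm(estimates[a] - estimates[b])
                    for a in c1 for b in c2])

history = []                                   # largest cluster's IVs per step
while len(clusters) > 1:
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    largest = max(clusters, key=len)
    largest_ivs = sorted({iv for pair in largest for iv in pair})
    history.append(largest_ivs)
    print(len(clusters), "clusters; largest uses IVs", largest_ivs)
```

In the intermediate steps the largest cluster is built from the valid IVs {2, 3, 4} only; this is the point at which the Sargan-based stopping rule would terminate the procedure.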
This also implies that F_0 consists of valid IVs only and that all sets {[j] : γ_{[j]}^{-1} α_{[j]} = 0} of cardinality P are elements of F_0. Hence, the following remark directly follows: Remark 2. |F_0| = \binom{g}{P}.
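Remark 2 can be checked numerically: with g valid IVs and P endogenous regressors, each just-identified estimate uses a P-element IV set, so F_0 collects all such all-valid sets. The choice of g = 9 and P = 2 below is illustrative only.

```python
from itertools import combinations
from math import comb

g, P = 9, 2                               # toy values: g valid IVs, P regressors
valid_ivs = range(1, g + 1)
F0 = list(combinations(valid_ivs, P))     # all P-element sets [j] of valid IVs
print(len(F0), comb(g, P))                # both equal binomial(g, P) = 36
```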

C. One family can consist of different vectors α
We have shown that the number of valid IVs determines the size of the family, |F_0|. However, this relation between g and |F_0| holds only when α_1 = 0.
If two sets [j] and [k] with different direct-effect vectors satisfy α_{[j]} = γ_{[j]} c and α_{[k]} = γ_{[k]} c for the same vector c, then γ_{[j]}^{-1} α_{[j]} = γ_{[k]}^{-1} α_{[k]} = c. (12) Hence, even though the number of IVs with the same value α_j is smaller than \binom{g}{P}, the largest family might still consist of combinations of invalid IVs, because the first-stage coefficient matrix also determines c_{[j]}.

D. Oracle Properties
This section gives proofs for Lemma 1 and Theorem 1. All proofs apply for the general case that P ≥ 1.

D.1. Proof of Lemma 1
Proof. We show that the probability that [j] is assigned to a cluster with elements of its own family goes to 1.
The following two statements are hence equivalent:
The Sargan test statistic has the following properties. With these properties we can show that the downward testing procedure selects the oracle model consistently, with ξ_{n,|J|−|I|−P} → ∞ for n → ∞ and ξ_{n,|J|−|I|−P} = o(n).
1. Consider |V̂| = |J|. By property 1 and the property that the critical values go to infinity, it follows that lim_{n→∞} P(S(θ̂_I) < ξ_{n,|J|−|I|−P}) = 1. Hence, the first selection V̂ that fulfills the HS criterion is the one associated with the oracle model.