Testing the impossible: identifying exclusion restrictions

Method of moments estimators are generally obtained by adopting orthogonality conditions, in which particular functions of the observed data and unknown parameters are supposed to have zero expectation. For regression models this implies exploiting presumed uncorrelatedness of the model disturbances and identifying instrumental variables. Here, utilizing non-orthogonality conditions is examined for linear cross-section multiple simultaneous regression models. By employing flexible bounds on the correlations between disturbances and regressors one avoids: (i) adoption of often incredible and unverifiable strictly zero correlation assumptions, and (ii) imprecise inference due to possibly weak or invalid instruments. The asymptotic validity of the suggested alternative form of inference is proved and its finite sample accuracy is demonstrated by simulation. It yields inference on coefficient values that is, within constraints, robust to endogeneity. It also permits a sensitivity analysis of standard least-squares or instrument-based inference, and even a test of the exclusion restrictions regarding external instruments, which are unavoidable in the standard approach yet "non-testable". The practical relevance is illustrated in a few applications borrowed from the textbook literature.


Introduction
The standard quasi-experimental approach in applied econometric research requires the adoption of so-called orthogonality conditions. An initial set of such conditions has to be justified on the basis of persuasive common sense or economic-theoretical arguments. However, opponents may easily dismiss these arguments as opportunistic subjective beliefs, since these conditions cannot be vindicated by empirical statistical evidence without the adoption of further non-testable conditions. Although the formal testing of any overidentification restrictions is feasible, its interpretation is contingent on the validity of the initial just-identifying set of non-testable orthogonality conditions. For the analysis of single regression equations this implies that at least as many excluded variables have to be proclaimed uncorrelated with the unobservable random disturbances of the equation as there are unknown coefficients of the endogenous explanatory variables, which are those that could be correlated with the disturbances. So these excluded variables, also referred to as the external instruments, should have no direct effect on the dependent variable of the equation of interest. Moreover, to yield effective inference, they should at the same time have a substantial correlation with the endogenous regressors of the relationship.
Here it will be shown that there is an alternative route towards identification by adopting non-orthogonal moment conditions, which may in fact be more credible than the just-identifying orthogonality conditions, because the moments concerned do not have to be strictly zero but may vary over an interval. A simple implementation of this yields a modification of the ordinary least-squares (OLS) estimator. It is consistent if the correlation between all regressors and the disturbance is known. The limiting distribution of this unfeasible estimator makes it possible to construct feasible asymptotic test procedures for regression coefficients. These exploit, in addition to some standard orthogonality conditions with respect to any exogenous regressors, also bounds on the degree of non-orthogonality or simultaneity of the endogenous regressors. In simulations it is demonstrated that even in quite small samples these tests have appropriate level control and impressive power. They can be converted into feasible asymptotic confidence intervals with conservative coverage, which are often more informative than intervals obtained by instrumental variables (IV) methods, especially when some instruments are weak.
Moreover, this robustified OLS-based inference makes it possible to test exclusion restrictions in the following way: it can produce the set of values of simultaneity correlations which endorses the exclusion restrictions, as well as the set under which these are rejected at a chosen significance level. Thus, depending on the span and the location of these two sets, the credibility of exclusion restrictions can be supported or refuted by statistical evidence that is very accurate even in small samples.
The methods developed here are based on assumptions that differ from those made under the standard approach. They require fewer assumptions, because no external instruments and corresponding exogeneity assumptions are needed. On the other hand, some extra assumptions have to be adopted, namely on the third and fourth moments of the regressors and disturbances, because these determine the variance of the adapted least-squares estimator. And, when it comes to making decisions on the basis of the produced inference, this inference has to be confronted with an opinion of the researcher on the likely degree of endogeneity. So, basically, strict exogeneity assumptions on external instruments are exchanged for interval assumptions with respect to the endogeneity of the regressors.
The techniques developed here generalize (by allowing for an arbitrary number of endogenous and exogenous regressors) and extend (by also enabling general tests of linear restrictions on the coefficients of both endogenous and exogenous regressors) some basic initial results already published in Kiviet (2013, Section 4) and further justified in Kiviet (2016). This approach, referred to as kinky least-squares (KLS), was triggered by some rudimentary findings dating back to Goldberger (1964, p.359) and Rothenberg (1972). That even the standard OLS estimator often beats IV or two-stage least-squares (TSLS) on a mean squared error criterion has already been demonstrated in Kiviet and Niemczyk (2012), Kiviet (2013) and Doko Chatoka and Dufour (2016). The generalized KLS procedures presented here are preferable to standard OLS, and in many cases also to IV, because they enable accurate statistical inference in models with endogenous regressors, while avoiding the hazards of weakness or invalidity of instruments.
For an overview of complicating issues undermining the accuracy of statistical inference in models with endogenous regressors due to employment of weak or invalid external instruments, see for instance Dufour (2003). In essence these complications are fourfold: (i) under weak though valid instruments standard asymptotic IV inference is inaccurate (it yields seriously biased non-normal coefficient estimates with poorly assessed standard errors, resulting in bad level control of tests), (ii) employing more sophisticated weak-instrument techniques may result in improved level control, but yields confidence sets which are often very wide or even unbounded, (iii) the use of invalid instruments produces as a rule highly inaccurate inference, whereas (iv) testing the validity of particular instruments seems only possible when a sufficient number of valid instruments is already available. During the last decades (i) and (ii) received a lot of attention in the literature. The present study addresses the two more fundamental problems (iii) and (iv). It escapes from problem (iii) by developing a formal frequentist approach that produces accurate inference in simultaneous models without employing any external instrumental variables at all, by incorporating into the analysis an interval assumption on the degree of simultaneity. When implemented as an exclusion restrictions test this approach also breaks the vicious circle of problem (iv), because it tests the validity of instruments without requiring any untested orthogonality conditions. Various other studies have addressed problem (iii). The degree of invalidity of instruments is incorporated into a frequentist analysis by Ashley (2009) and by Bayesian methods in Kraay (2012). Nevo and Rosen (2012) derive set estimates under assumptions on the signs and relative magnitudes of the simultaneity and instrument invalidity. Conley et al. (2012) augment the model with the instruments and make assumptions on their coefficients (which would be zero under correct exclusion), which next allow frequentist or Bayesian methods to obtain inference allowing for instrument invalidity. However, unlike ours, all these approaches still employ IV methods and so escape neither problem (i) nor problem (ii). To our knowledge feasible tests for (iv) have not been developed before, apart from various informal procedures, such as suggested in, for instance, Bound and Jaeger (2000) and Altonji et al. (2005).
In Section 2, after having reviewed how in a linear multiple regression equation with some endogenous regressors consistent estimators can be obtained by exploiting classic identifying orthogonality conditions, we demonstrate how this can also be achieved by adopting non-orthogonality conditions. This yields an adapted least-squares estimator which is a function of the nuisance parameter vector containing the correlations between all the regressors and the disturbances. The limiting distribution of this unfeasible estimator is presented in Section 3 (and derived in Appendices); from this a feasible test procedure for a set of general restrictions on the coefficient values readily follows. Section 4 demonstrates how this procedure can be employed for testing any exclusion restrictions relevant within the context of a classic instrumental variables based analysis. Section 5 provides simulation results on size and power of exclusion restriction tests in simultaneous models with one or two endogenous regressors. Section 6 demonstrates how the various techniques can be employed in practice by analyzing three empirical data sets used for illustrative purposes in well-known textbooks. Section 7 concludes.

Two distinct approaches towards identification
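Consider a sample of $n$ independent and identically distributed observations $\{y_i, x_i'; i = 1, \dots, n\}$ on a linear causal relationship given by $y_i = x_i'\beta + u_i$, (2.1) where $\beta$ is an unknown constant $K \times 1$ coefficient vector, the regressors $x_i$ and the disturbance $u_i$ have zero mean, the $K \times K$ matrix $\Sigma_{xx} = E(x_i x_i')$ is positive definite with all its elements finite, and $0 < \sigma_u^2 = E(u_i^2) < \infty$. Partitioning $x_i = (x_i^{(1)\prime}, x_i^{(2)\prime})'$ into $K_1$ possibly endogenous and $K_2 = K - K_1$ exogenous regressors, we have $E(x_i u_i) = \sigma_{xu} = (\sigma_{x^{(1)}u}', 0')'$, (2.2) with $\sigma_{x^{(1)}u} \in \mathbb{R}^{K_1}$, which in practice is generally unknown. Model (2.1) with zero-mean regressors may actually originate from a model with the same disturbances, the same $K$ slope coefficients corresponding to regressors with an arbitrary observation-specific mean, and an unknown intercept as well. Then taking all observations in deviation from their expectation annihilates the intercept and results in model (2.1) with zero-mean regressors.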
A non-zero but constant correlation between elements of the $K_1 \times 1$ vector of regressors $x_i^{(1)}$ and the disturbance $u_i$ may be due either to simultaneity (elements of $x_i^{(1)}$, while constituting causes for $y_i$, are causally dependent on $y_i$ themselves too), or to measurement errors in $x_i^{(1)}$, or perhaps to particular omissions in the regression specification. Generally, such non-zero correlations render all elements of the OLS estimator $\hat\beta_{OLS} = (\sum_{i=1}^n x_i x_i')^{-1} \sum_{i=1}^n x_i y_i$ biased for $\beta$ and inconsistent under common regularity conditions. A standard method to achieve consistent estimators is reviewed in the next subsection, followed in a second subsection by the development of an alternative nonstandard procedure which in a sense repairs the inconsistency of least-squares.
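The source of the inconsistency can be made explicit in one line. The following is a sketch, assuming the notation $\sigma_{xu} = E(x_i u_i)$ and $\Sigma_{xx} = E(x_i x_i')$ for the population moments and a law of large numbers for the sample averages:

```latex
\hat\beta_{OLS}
  = \beta + \Big(n^{-1}\sum_{i=1}^{n} x_i x_i'\Big)^{-1} n^{-1}\sum_{i=1}^{n} x_i u_i
  \;\xrightarrow{\;p\;}\; \beta + \Sigma_{xx}^{-1}\sigma_{xu}.
```

Because $\Sigma_{xx}^{-1}$ is in general not block diagonal, a single non-zero element of $\sigma_{xu}$ is enough to contaminate the coefficient estimates of the exogenous regressors as well.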

Exploiting orthogonality conditions
The standard approach to achieve identification and consistent estimation of $\beta$ is to find an $L_2 \times 1$ vector of observations $z_i^{(2)}$ such that for $z_i = (x_i^{(2)\prime}, z_i^{(2)\prime})'$, with $L = K_2 + L_2 \ge K$, one is willing to assume validity of the orthogonality conditions $E(z_i u_i) = 0$. (2.3) These imply $E(z_i y_i) = E(z_i x_i')\beta$, $\forall i$. Next, the "analogy principle" of the method of moments, by which expectations are replaced by corresponding sample averages, suggests as an estimator for $\beta$ the "best" solution $\hat\beta$ of $Z'y = Z'X\hat\beta$, where $X = (x_1, \dots, x_n)'$, $y = (y_1, \dots, y_n)'$ and $Z = (z_1, \dots, z_n)'$. In case $L = K$ and $Z'X$ has full rank there is a unique solution, namely $\hat\beta_{IV} = (Z'X)^{-1}Z'y$, (2.4) which realizes $Z'(y - X\hat\beta_{IV}) = 0$, thus achieving orthogonality of the residuals $\hat u_{IV} = y - X\hat\beta_{IV}$ and the instruments in the sample, similar to the zero moments $E(Z'u) = 0$. If $L > K$, while $X'Z$ has rank $K$ and $Z'Z$ has rank $L$, then orthogonality of all individual instruments and the residuals cannot be achieved, but a unique solution is found by minimizing a quadratic form in the vector $Z'(y - X\hat\beta)$, namely $(Z'y - Z'X\hat\beta)'W(Z'y - Z'X\hat\beta)$, where $W$ is some symmetric positive definite weighting matrix. This yields the estimator $\hat\beta_{WIV} = (X'ZWZ'X)^{-1}X'ZWZ'y$. Under standard regularity conditions $\hat\beta_{WIV}$ (like $\hat\beta_{IV}$) is consistent and has a limiting normal distribution. When $L > K$ the efficient Generalized Method of Moments (GMM) estimator is obtained by choosing $W$ proportional to $[Z'\mathrm{Var}(u)Z]^{-1}$, where $u = (u_1, \dots, u_n)'$. Because we have here $\mathrm{Var}(u) = \sigma_u^2 I$, this simplifies to the TSLS estimator $\hat\beta_{TSLS} = (X'P_Z X)^{-1}X'P_Z y$, (2.5) where $P_Z = Z(Z'Z)^{-1}Z'$. If $L > K$ it does not realize in the sample orthogonality of $Z$ and $\hat u_{TSLS} = y - X\hat\beta_{TSLS}$, but it does realize the orthogonality relationships $\hat X'\hat u_{TSLS} = 0$, with $\hat X = P_Z X$ the orthogonal projection of the $K$ regressors $X$ on the $L$-dimensional sub-space spanned by the instrumental variables $Z$. In the above approach, validity of all inference is based especially on validity of the orthogonality conditions (2.3).
In case $L = K$ no statistical evidence can be produced from the sample under study on this validity, because then $Z'(y - X\hat\beta_{IV})$ equals zero by construction. Self-evidently, validity (orthogonality) of the instruments $z_i^{(2)}$ requires validity of the $L_2$ zero restrictions $E(z_i^{(2)} u_i) = 0$. Thus, these zero or exclusion restrictions should be valid. However, they cannot be properly tested unless we had another $K_1$ valid instruments in order to cope with the endogeneity of $x_i^{(1)}$. But for these we would only be able to test their exclusion restrictions unless ..., and so on. More generally, when the model with $L \ge K$ instruments forms the starting point and $K$ of these instruments are presupposed to be valid, then the validity of only $L - K$ exclusion restrictions, establishing $L - K$ over-identification restrictions, can be tested. This is equivalent to testing the validity of $L - K$ instruments in addition to $K$ valid (though untested) instruments. Within the present context, testing the validity of this initial just-identifying set of $K$ instruments is simply impossible. This impossibility is highly uncomfortable, because the interpretation of the outcome of overidentification or instrument validity tests is conditional on the legitimacy of adopting $K$ non-testable zero correlation assumptions. This embodies the Achilles heel of many applied econometric studies. This vulnerability can only be concealed by wrapping this limb in non-statistical, often highly speculative, rhetorical arguments. Such a verbal cover-up often provides just meager protection against dissident views.
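The sample orthogonality properties described above can be checked numerically. The sketch below uses NumPy on simulated data; the function names `iv_estimator` and `tsls_estimator`, the data generating process and all numerical values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def iv_estimator(X, Z, y):
    """Just-identified IV: solve Z'X b = Z'y (requires L == K)."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

def tsls_estimator(X, Z, y):
    """TSLS: uses X_hat = P_Z X, the projection of X on the columns of Z."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

rng = np.random.default_rng(0)
n = 500
z1, z2, v, u = rng.standard_normal((4, n))
x = 0.8 * z1 + 0.5 * z2 + 0.6 * u + v   # endogenous regressor, correlated with u
y = 1.0 * x + u                          # illustrative true coefficient: 1
X = x[:, None]

# L = K = 1: instruments and IV residuals are orthogonal by construction
Z1 = z1[:, None]
u_iv = y - X @ iv_estimator(X, Z1, y)
print(Z1.T @ u_iv)                       # numerically zero

# L = 2 > K = 1: Z'u_hat is no longer zero, but X_hat'u_hat is
Z2 = np.column_stack([z1, z2])
b_tsls = tsls_estimator(X, Z2, y)
u_tsls = y - X @ b_tsls
X_hat = Z2 @ np.linalg.lstsq(Z2, X, rcond=None)[0]
print(Z2.T @ u_tsls, X_hat.T @ u_tsls)
```

In the just-identified case the zero value of $Z'\hat u_{IV}$ carries no information about instrument validity, which is exactly why the sample cannot speak on the just-identifying restrictions.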

Exploiting some non-orthogonality conditions as well
Next consider an alternative to the above standard approach regarding achieving identification. Instead of adopting $L \ge K$ orthogonality conditions $E(z_i u_i) = 0$, which imply zero correlation between each instrument and the disturbance, consider adopting a numerical assumption concerning the $K$ elements of the correlation vector $\rho_{xu} = (\rho_{x_1 u}, \dots, \rho_{x_K u})'$, where $\rho_{x_j u} = E(x_{ij} u_i)/(\sigma_u \sigma_j)$, $j = 1, \dots, K$, with $\sigma_j^2$ equal to the $j$-th diagonal element of matrix $\Sigma_{xx}$. Hence, suppose we replace the assumption $\rho_{zu} = (\rho_{z_1 u}, \dots, \rho_{z_L u})' = 0$ by $\rho_{xu} = r$, (2.6) where $r = (r_1, \dots, r_K)'$ with scalar $r_j$ the adopted value of the correlation between $x_{ij}$ and $u_i$, so $|r_j| < 1$, $\forall j$. This implies adopting the $K$ moment conditions $E(x_{ij} u_i) = \sigma_{x_j u} = r_j \sigma_u \sigma_j$, $j = 1, \dots, K$. (2.7) If we are still convinced of the exogeneity of the regressors $x_i^{(2)}$ we could have $r_j = 0$ for $j = K_1 + 1, \dots, K$ and $r_j \ne 0$ otherwise. Then we adopt $K_2$ standard orthogonality conditions and $K_1$ non-orthogonality conditions.
One may object that in practice one generally would not know the true values of the elements of $\rho_{xu}$, so $r$ will generally differ from $\rho_{xu}$. Although true, this will turn out to be of moderate concern, because in the analysis to follow $r_j$ will not necessarily be kept fixed, but may vary within the interval $(-1, +1)$. Moreover, in the classic approach the adopted strictly zero values for the $L$ elements of $\rho_{zu}$ may be false too, raising far more serious credibility issues, because this approach does not allow for non-zero correlations between instruments and disturbances. Using $\Sigma_x = \mathrm{diag}(\sigma_1, \dots, \sigma_K)$, the moment conditions (2.7) can be collected as $E(x_i u_i) = \sigma_u \Sigma_x r$. (2.8) Invoking again the "analogy principle", this suggests for the method of moments estimator the solution $\hat\beta$, where $n^{-1}X'y - n^{-1}X'X\hat\beta = \sigma_u S_x r$. Here $S_x$ is the sample equivalent of $\Sigma_x$. The $j$-th diagonal element of $S_x$ could be taken as the square root of either $n^{-1}\sum_{i=1}^n x_{ij}^2$ (since the regressors have zero expectation) or $(n-1)^{-1}\sum_{i=1}^n (x_{ij} - \bar x_j)^2$ with $\bar x_j = n^{-1}\sum_{i=1}^n x_{ij}$ (which may be beneficial in small samples where $\bar x_j$ may deviate seriously from zero). This yields solution $\hat\beta(r, \sigma_u) = (X'X)^{-1}X'y - \sigma_u (n^{-1}X'X)^{-1} S_x r = \hat\beta_{OLS} - \sigma_u (n^{-1}X'X)^{-1} S_x r$. (2.9) This estimator involves a correction to the OLS estimator, aiming to correct its inconsistency when $\rho_{xu} \ne 0$. Estimator $\hat\beta(r, \sigma_u)$ is unfeasible as long as $\sigma_u$ has not been replaced by a sample equivalent. Of course, the standard OLS estimator of $\sigma_u^2$, which is given by $\hat\sigma^2_{u,OLS} = \hat u_{OLS}'\hat u_{OLS}/(n - K)$, where $\hat u_{OLS} = y - X\hat\beta_{OLS}$, is inconsistent, like $\hat\beta_{OLS}$, when $\rho_{xu} \ne 0$, since $\mathrm{plim}\,\hat\sigma^2_{u,OLS} = \sigma_u^2 (1 - \rho_{xu}'\Sigma_x \Sigma_{xx}^{-1} \Sigma_x \rho_{xu})$. Thus, a feasible though $r$-based estimator, which attempts to correct $\hat\sigma^2_{u,OLS}$ for its inconsistency, is $\hat\sigma_u^2(r) = \hat\sigma^2_{u,OLS}/(1 - r'S_x S_{xx}^{-1} S_x r)$, (2.10) with $S_{xx} = n^{-1}X'X$. Now a feasible estimator which attempts to correct $\hat\beta_{OLS}$ for its inconsistency is $\hat\beta(r) = \hat\beta_{OLS} - \hat\sigma_u(r) S_{xx}^{-1} S_x r$. (2.11) It is obvious that the correction terms of (2.10) and (2.11) only really succeed in ridding the OLS estimators of their inconsistency if $r = \rho_{xu}$.
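A minimal numerical sketch of the corrected estimator follows. It assumes the corrections take the form $\hat\sigma_u^2(r) = \hat\sigma^2_{u,OLS}/(1 - r'S_x S_{xx}^{-1} S_x r)$ and $\hat\beta(r) = \hat\beta_{OLS} - \hat\sigma_u(r) S_{xx}^{-1} S_x r$ with $S_{xx} = n^{-1}X'X$; the function name `kls`, the simulated design and the parameter values are hypothetical choices made for illustration only.

```python
import numpy as np

def kls(X, y, r):
    """Least squares corrected for an assumed vector r of correlations between
    the regressors and the disturbance (r = 0 gives back OLS)."""
    n, K = X.shape
    r = np.asarray(r, dtype=float)
    S_xx = X.T @ X / n                        # sample second-moment matrix
    S_x = np.diag(np.sqrt(np.diag(S_xx)))     # diagonal matrix of regressor scales
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b_ols
    s2_ols = resid @ resid / (n - K)
    shrink = 1.0 - r @ S_x @ np.linalg.solve(S_xx, S_x @ r)
    if shrink <= 0:
        raise ValueError("r is incompatible with the sample: r'S_x S_xx^{-1} S_x r >= 1")
    s2_u = s2_ols / shrink                    # corrected disturbance variance
    b = b_ols - np.sqrt(s2_u) * np.linalg.solve(S_xx, S_x @ r)
    return b, s2_u

rng = np.random.default_rng(1)
n, rho = 20_000, 0.5
u = rng.standard_normal(n)
x1 = rho * u + np.sqrt(1 - rho**2) * rng.standard_normal(n)   # corr(x1, u) = rho
x2 = rng.standard_normal(n)                                    # exogenous regressor
X = np.column_stack([x1, x2])
beta = np.array([1.0, -0.5])
y = X @ beta + u

b_ols, _ = kls(X, y, [0.0, 0.0])     # plain OLS, inconsistent for the first coefficient
b_kls, s2 = kls(X, y, [rho, 0.0])    # correction evaluated at the true correlation
print(b_ols, b_kls, s2)
```

In this design the OLS estimate of the first coefficient settles near 1.5 rather than 1, whereas the corrected estimate evaluated at the true correlation settles near the true values. With a misspecified $r$ the correction is of course off as well, which is why $r$ is varied over an interval in what follows.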

KLS inference for multiple regressions
In order to examine the expedience of estimator (2.11) for producing inference on $\beta$ we shall first examine its limiting distribution under the (unrealistic) assumption that the $K$ values $r$ equal the true correlations $\rho_{xu}$. Under the assumptions made above, it is found that the consistent (though unfeasible) least-squares estimator $\hat\beta(\rho_{xu})$ has a limiting normal distribution with in general a rather involved variance matrix. Its asymptotic variance appears to be affected by the skewness and kurtosis of the distributions of $u_i$ and $x_i$. Substantial simplifications occur when the third and fourth moments correspond to those of the normal distribution, and especially in models with just one endogenous explanatory variable a remarkably neat result emerges. For the unfeasible estimator $\hat\beta(\rho_{xu})$, which generalizes for multiple structural regression models the KLS estimator of Kiviet (2013, Section 4), we find the following result (proof in Appendix B) for models with an arbitrary number of endogenous regressors, where all regressors and disturbances are identically distributed and have their first four moments similar to normally distributed variables.
Corollary 1.1: If in the situation of Theorem 1 one has $K = 1$, thus $\Sigma_{xx} = \sigma_x^2$ and $R = \rho_{xu}$ are scalar, then the limiting variance simplifies to $V(\rho_{xu}) = \sigma_u^2/\sigma_x^2$, which is invariant with respect to $\rho_{xu}$. A proof of Corollary 1.1 can already be found in Kiviet (2013). Another interesting special case of Theorem 1 considers the situation where just one regressor (say the first one) is endogenous ($K_1 = 1$) and all further regressors are exogenous. This leads to the following.
Corollary 1.2: If in the situation of Theorem 1 $\rho_{xu} = (\rho_1, 0, \dots, 0)'$, then the limiting variance of $\hat\beta(\rho_1)$ can be expressed in terms of the scalar $\theta = 1 - \rho_1^2 \sigma_1^2 \sigma^{11}$, with $\sigma^{11} = (\Sigma_{xx}^{-1})_{1,1}$. Moreover, for the first element of vector $\hat\beta(\rho_1)$, denoted $\hat\beta_1(\rho_1)$, this yields $n^{1/2}(\hat\beta_1(\rho_1) - \beta_1) \xrightarrow{d} N(0, \sigma_u^2 \sigma^{11})$, which is invariant with respect to $\rho_1$. Note that, surprisingly, the limiting distribution of the inconsistency-corrected least-squares estimator (KLS) of the coefficient of the one and only endogenous regressor is, irrespective of the occurrence in the model of any further exogenous regressors, equivalent to that of OLS in case all regressors are exogenous. This is no longer the case when $K_1 > 1$. To produce feasible inference the result of Theorem 1 can be exploited as follows. Suppose we are interested in testing jointly $h$ ($\le K$) linear restrictions on the coefficients of model (2.1), given by $H_0: Q\beta = q$, where $Q$ is a known $h \times K$ matrix of rank $h$ and $q$ a known $h \times 1$ vector. Now consider the feasible Wald-type test statistic $W(Q, q, r)$ of (3.1). Under $H_0$ and the conditions of Theorem 1 we have $W(Q, q, \rho_{xu}) \xrightarrow{d} \chi^2(h)$. Now for $r = (\rho', 0')'$ with $\rho \in \mathbb{R}^{K_1}$, and $\chi^2_{1-\alpha}(h)$ the $(1-\alpha)100\%$ quantile of the chi-squared distribution with $h$ degrees of freedom, the set $C_0(Q, q, \alpha) = \{\rho \in \mathbb{R}^{K_1} : W(Q, q, r) \le \chi^2_{1-\alpha}(h)\}$ represents all possible values of $\rho_{x^{(1)}u}$ for which $H_0$ does not have to be rejected at asymptotic significance level $\alpha$. Likewise, the complement of $C_0(Q, q, \alpha)$ in $\mathbb{R}^{K_1}$, given by $C_1(Q, q, \alpha) \equiv \mathbb{R}^{K_1} \setminus C_0(Q, q, \alpha)$, represents all possible values of $\rho_{x^{(1)}u}$ for which $H_0$ should be rejected at asymptotic significance level $\alpha$. The sets $C_0(Q, q, \alpha)$ and $C_1(Q, q, \alpha)$ make it possible to supplement standard OLS inference on $H_0$ with an indication of its robustness or sensitivity regarding simultaneity. Suppose that the $K_1 \times 1$ zero vector is in set $C_0(Q, q, \alpha)$; then this set represents also all non-zero values of $\rho_{x^{(1)}u}$ which corroborate under simultaneity the non-rejection of $H_0$ established under full exogeneity.
The OLS decision not to reject $H_0$ is robust regarding simultaneity as long as it obeys the restrictions set by $C_0(Q, q, \alpha)$, whereas for values of $\rho_{x^{(1)}u}$ in $C_1(Q, q, \alpha)$ $H_0$ should be rejected. When the zero vector is in set $C_1(Q, q, \alpha)$, so that standard OLS inference rejects $H_0$, this decision can be extended under simultaneity represented by all vectors $\rho_{x^{(1)}u}$ in $C_1(Q, q, \alpha)$, but should be reversed for values of $\rho_{x^{(1)}u}$ in $C_0(Q, q, \alpha)$. It is obvious that the inference based on KLS as just described will be fully robust with respect to simultaneity only if $C_0(Q, q, \alpha)$ is $\mathbb{R}^{K_1}$ or $\emptyset$. This seems highly unlikely, as becomes clear when we examine the case $K_1 = 1$ as in Kiviet (2013). KLS just extends the asymptotic validity of standard OLS for the case $\rho_{x^{(1)}u} = 0$ to asymptotic validity for either $\rho_{x^{(1)}u} \in C_0(Q, q, \alpha)$ or $\rho_{x^{(1)}u} \in C_1(Q, q, \alpha)$. This could be labelled constraint robustness.
Of course, the actual numerical assessment and representation of the above sets may be quite complicated in practice, especially when $K_1$ is (much) larger than 1, and when the number of tested restrictions $h$ is too. In the special case $K_1 = 1 \le K$ and $h = 1$, while the tested restriction concerns just the coefficient of the endogenous regressor, it is actually quite simple, as already shown in Kiviet (2013, 2016). In more general cases obvious problems regarding the numerical feasibility of this approach will occur in samples, and for choices of $r$, where the correction of $\hat\sigma_u^2$ no longer makes sense, which is the case when $r'S_x S_{xx}^{-1} S_x r \ge 1$. We will monitor this in Sections 5 and 6.

Testing exclusion restrictions
One of the paradigms of classic econometric theory is that the exclusion restrictions in a just identified model cannot be tested, and that in overidentified models one cannot test all exclusion restrictions but just a limited number of them equal to the degree of overidentification. Hence, in both cases some exclusion restrictions seem non-testable. By the methodology set out above, however, it is possible to a certain extent to test any exclusion restrictions. It makes it possible to assess, at a chosen nominal significance level, the set of all possible $\rho_{xu}$ values for which any arbitrary subset of exclusion restrictions should be rejected. If this set seems to cover the area in which the true value of $\rho_{xu}$ may reside, then one should reject validity of the variables associated with these exclusion restrictions as instruments. On the other hand, when it seems likely that the true value of $\rho_{xu}$ will not be in the assessed set, or when this set is empty, rejection of validity of the instruments under test is not indicated. Hence, at the stage of deciding whether or not the true value of $\rho_{xu}$ seems covered by a particular non-empty set, expert knowledge is required to decide on the validity or not of an instrument, as in the case regarding adopting $\rho_{zu} = 0$ or not. However, the assessed set regarding $\rho_{zu}$ may turn out to be so wide (or so narrow) that the decision becomes relatively easy. By calculating P-values of $T(Q, q, r)$ for deliberately chosen $Q$ and $q$ and all relevant values of $r$, we will show in the illustrations below how evidence on the (in)validity of instruments can be produced which in many cases may be more convincing than evidence based on purely rhetorical arguments.
Note that the procedure just sketched is not an alternative to the test for overidentifying restrictions, often referred to as the Sargan-Hansen test. This test presupposes validity of a number of external instruments equal to the number of endogenous regressors in the model, which just-identify the model, and then tests the validity of additional overidentifying instruments. The procedure discussed here can be implemented such that it produces inference on the validity of any set of candidate instruments, so also on this initial set on which standard (and incremental or difference) overidentifying restrictions tests build without any prior statistical verification.
Here we will first work out in detail this test for just-identifying exclusion restrictions for the model introduced in Section 2.1, focussing on the special case $K_1 = 1$, whereas $L = K$. Hence, the structural multiple regression model is just identified, $x_{i1}$ is the one and only endogenous regressor, and the question is whether the external scalar variable $z_{i2}$ is exogenous and thus can be used as an instrument next to the $K - 1$ regressor variables $x_i^{(2)}$, which are maintained to be exogenous. Hence, we will test the validity of $z_{i2}$ as an instrument, or the assumption $E(z_{i2} u_i) = 0$. This is untestable by the Sargan-Hansen approach, because that requires $L > K$. Attempting to test $E(z_{i2} u_i) = 0$ could be done by including $z_{i2}$ in the regression and testing by an appropriate method whether its coefficient is significant. Its insignificance would endorse (although certainly not guarantee) its valid exclusion from the regression and use as an external instrument. To test by the established methods in the augmented model $y_i = x_i'\beta + \gamma_z z_{i2} + u_i$ (4.1) the exclusion hypothesis $H_e: \gamma_z = 0$, while respecting at the same time the simultaneity of regressor $x_{i1}$, would require yet another valid external instrument, which would bring their number to $L + 1 = K + 1$, whereas we assumed that, apart from the $K - 1$ exogenous regressors $x_i^{(2)}$, the only further candidate instrument is $z_{i2}$. So, testing the exclusion restriction $\gamma_z = 0$ seems impossible indeed.
In the present situation, however, the first result of Corollary 1.2 applies, after translating it from model (2.1) into the context of augmented model (4.1). The latter we will denote as $y = X^*\beta^* + u$, where $X^* = (X, z_2)$ and $\beta^* = (\beta', \gamma_z)'$. With $1 \times (K + 1)$ matrix $Q^* = (0, \dots, 0, 1) = e_{K+1}'$ and $q = 0$, and $r_1$ still indicating the assumed value for $\rho_{x_1 u}$, we may use for this single hypothesis the t-type test statistic $t_{K+1}(r_1)$ of (4.2), where, when using the notation $S^{-1}_{x^*x^*} = (n^{-1}X^{*\prime}X^*)^{-1}$, and $e_1$ now representing the unit vector with $K + 1$ elements, we have among its ingredients the standard least-squares estimator $\hat\gamma_z = e_{K+1}'(X^{*\prime}X^*)^{-1}X^{*\prime}y$, (4.5) the scalar $\hat\theta(r_1) = 1 - r_1^2 s_1^2 e_1' S^{-1}_{x^*x^*} e_1$, (4.6) and $\hat\sigma_u^2(r_1) = y'(I - P_{X^*})y/[\tilde n\,\hat\theta(r_1)]$, (4.7) where $\tilde n$ could simply be chosen $n$, or if one wants to employ a small sample adjustment it could be taken $n - K - 1$ or, when all variables have been taken in deviation from their sample average, $n - K - 2$. It is easy to show that $\theta$ of Theorem 1 is always positive. To achieve this for $\hat\theta(r_1)$ too, we should not vary $r_1$ over the whole $(-1, +1)$ interval, but just examine $r_1^2 < (e_1' S^{-1}_{x^*x^*} e_1)^{-1}/s_1^2$. However, also in samples where $\hat\theta(r_1)$ happens to be positive but much smaller than its true value, the estimators $\hat\sigma_u^2(r_1)$, $\hat\gamma_z(r_1)$ and its estimated standard deviation may be seriously affected. This may lead to unpleasant consequences for the distribution of the test statistic. Such consequences seem more likely when $n$ is small and when $\theta$ is small. We will monitor this in the simulations in the next Section. Extending the assumptions of Theorem 1 to model (4.1) and evaluating (4.2) in $\rho_1$, we have $t_{K+1}(\rho_1) \xrightarrow{d} N(0, 1)$ under $H_e$. Hence, if $\rho_1$ were known an asymptotically exact test would be available. If $\rho_1$ is unknown, and when testing two-sided, we should seek the set of all $r_1$ values for which $[t_{K+1}(r_1)]^2 > \chi^2_{1-\alpha}(1)$, or equivalently $[t_{K+1}(r_1)]^2 - \chi^2_{1-\alpha}(1) > 0$. (4.8) The left hand side of this inequality is non-linear in the scalar $r_1$. Assuming that finding the roots of the corresponding equality over domain $|r_1| < 1$ is feasible, finding the set of $r_1$ values for which inequality (4.8) holds will be feasible too.
Under the assumption that the true value of $\rho_1$ is contained in this set, the hypothesis $\rho_{z^{(2)}u} = 0$ should be rejected. This procedure has an asymptotic significance level not larger than $\alpha$. Small sample performance may improve upon replacing $\chi^2_{1-\alpha}(1)$ by $F_{1-\alpha}(1, \tilde n)$. Instead of finding the roots of (4.8) at a particular $\alpha$, an easier and more informative approach is to construct a graph over all relevant values of $r_1$, satisfying $r_1^2 < (e_1' S^{-1}_{x^*x^*} e_1)^{-1}/s_1^2$, of the P-values of $t^2_{K+1}(r_1)$ with respect to the $F(1, \tilde n)$ distribution. For any $\alpha$ this immediately shows the range of values for $\rho_{x_1 u}$ where the test statistic rejects (or not) the exclusion restriction.
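The first step of such a graph, namely the admissible range of $r_1$ and the corrected estimate of the coefficient of the candidate instrument over a grid, can be sketched as follows. The P-values themselves require the estimated standard error of (4.2), which is not reproduced in this sketch, so only point estimates are traced. The function name `kls_path`, the simulated design and the use of $\tilde n = n$ are assumptions made for illustration; the correction is taken to be $\hat\beta^*(r_1) = \hat\beta^*_{OLS} - \hat\sigma_u(r_1)\, r_1 s_1 S^{-1}_{x^*x^*} e_1$.

```python
import numpy as np

def kls_path(X_star, y, r1_grid):
    """Corrected estimates of all coefficients for each assumed correlation r1
    between the first column of X_star (the endogenous regressor) and the
    disturbance. Returns (admissible r1 values, estimates per r1, bound on |r1|)."""
    n, K = X_star.shape
    S = X_star.T @ X_star / n
    S_inv = np.linalg.inv(S)
    s1 = np.sqrt(S[0, 0])
    b_ols = S_inv @ (X_star.T @ y / n)
    resid = y - X_star @ b_ols
    s2_ols = resid @ resid / n                      # n_tilde = n in this sketch
    bound = 1.0 / np.sqrt(s1**2 * S_inv[0, 0])      # theta_hat(r1) > 0 iff |r1| < bound
    r1 = r1_grid[np.abs(r1_grid) < bound]
    theta = 1.0 - r1**2 * s1**2 * S_inv[0, 0]
    sigma_u = np.sqrt(s2_ols / theta)
    est = b_ols[None, :] - (sigma_u * r1 * s1)[:, None] * S_inv[:, 0][None, :]
    return r1, est, bound

rng = np.random.default_rng(2)
n = 50_000
z = rng.standard_normal(n)
u = rng.standard_normal(n)
x = 0.6 * z + 0.5 * u + np.sqrt(0.39) * rng.standard_normal(n)   # corr(x, u) = 0.5
y = 1.0 * x + 0.0 * z + u                                         # z is validly excluded
r1, est, bound = kls_path(np.column_stack([x, z]), y, np.linspace(-0.95, 0.95, 39))
gamma_at_zero = est[np.argmin(np.abs(r1)), 1]         # plain OLS (r1 = 0)
gamma_at_half = est[np.argmin(np.abs(r1 - 0.5)), 1]   # at the true correlation
print(bound, gamma_at_zero, gamma_at_half)
```

In this design the coefficient of $z$ is far from zero under the OLS assumption $r_1 = 0$, yet close to zero when $r_1$ is near the true correlation 0.5. This is the pattern the P-value graph makes precise: whether the exclusion restriction is supported depends on where the true simultaneity correlation is believed to lie.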
Of course, for any $L_2^* \times 1$ subset $z_i^*$ from the $L_2 \times 1$ vector $z_i^{(2)}$ its valid exclusion from model (2.1) can be tested in a similar way, also in models where $K_1 \ge 1$. This involves a special implementation of test (3.1). Now let $X^* = (X, Z^*)$ and $Q^* = (O, I_{L_2^*})$, with $q = 0$. Calculating the P-value of $W(Q^*, r)$ with respect to the $F(L_2^*, \tilde n)$ distribution over all relevant values $r$ indicates for which values of $\rho_{xu}$ validity of the instruments $z_i^*$ seems (un)likely.

Simulation results
In this section we produce simulation evidence on the finite sample behavior of the inference techniques suggested in this study. As always, such a study is only feasible when one strongly restricts the number of parameters of the simulation design. For practical reasons one also has to constrain the grid of parameter values from the design parameter space for which the techniques are actually examined. Therefore simulated models often do not fully mimic all aspects of empirically relevant models, but just their major basic characteristics. However, sometimes one can prove that the phenomenon of interest is in fact invariant with respect to particular parameters, which implies that relatively few calculations for a discrete choice of parameter values can represent the relevant properties for a whole subspace of the design parameter space. The model introduced in Section 2 is primarily characterized by the values $n$, $K_1$, $K_2$ and $L_2$, where $L_2 \ge K_1 \ge 1$, $L_1 = K_2 \ge 0$ and $n > L_1 + L_2 \ge K_1 + K_2$, and also by $\beta$, $\Sigma_{xx}$, $\Sigma_{zz}$, $\Sigma_{zx}$, $\sigma_u^2$, $\rho_{xu}$ and $\rho_{zu}$. In Kiviet (2013) very favorable results have been produced on the finite sample accuracy of KLS inference on $\beta$ in the very simple case $L_2 = K_1 = 1$ with $L_1 = K_2 = 0$ when $\rho_{xu}$ is known. But also for the more realistic situation where $\rho_{xu}$ is unknown and an (in)correct interval $[\rho_{xu}^L, \rho_{xu}^U]$ is adopted which is supposed to contain the true value, it is shown that KLS inference is often much more useful than standard or Anderson-Rubin instrument-based inference. Because one may suppose that the simulated data in these experiments have been obtained after partialling out any exogenous regressors, the results are invariant regarding the chosen value for $K_2$, so they represent the situation for any $K \ge K_1 = 1$. For this situation, also the actually chosen values for $\beta$, $\Sigma_{xx}$, $\Sigma_{zz}$ and $\sigma_u^2$ have been shown not to affect this KLS based test.
Below, in the first subsection, we will consider the same simulation design as used in Kiviet (2013), but now examine the finite sample behavior of the exclusion restriction test, under situations where $\rho_{xu}$ is either known or unknown. Next we will present simulation results on KLS inference regarding $\beta$ and exclusion restrictions tests for models where $K \ge K_1 = L_2 = 2$. All presented results are based on 250,000 replications.

The simplest possible implementation of the exclusion restriction test
For the very simple model with $K = K_1 = L = L_2 = 1$ we will examine here some of the small sample qualities of test (4.2) on a single just-identifying exclusion restriction. Due to invariance, the results will also represent cases where $L_1 = K_2 > 0$ and $K = L > 1$. The Monte Carlo design is constructed as follows. We start from three mutually independent series ($i = 1, \dots, n$) of standard normal drawings. From these we construct the three series $u_i$, $x_i$ and $z_i$, where all coefficients do not exceed 1 in absolute value. Hence, when values for $\sigma_u > 0$, $\sigma_x > 0$, $\sigma_z > 0$, $|\rho_{xu}| < 1$, $|\rho_{zx}| \leq 1$ and $|\rho_{zu}| \leq 1$ are chosen, we can generate the series for $u_i$ and $x_i$ and find matching values for the two remaining coefficients of the $z_i$ series from (5.5) and (5.6), so that the series $z_i$ can be generated as well. However, the three chosen correlations should obey inequality (5.7) in order to ensure that the squares of these two coefficients both lie between 0 and 1. For each realization of the series $u_i$, $x_i$ and $z_i$ in the simulation replications, we may first subtract their respective sample average from each observation. In that way an arbitrary intercept of an underlying model with one further regressor and one external potential instrument (each distributed with a possibly non-zero arbitrary mean) has been partialled out.
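The construction just described can be sketched in code. The block below is a minimal illustration, not the paper's own formulas (5.5) and (5.6): it assumes unit variances, uses a standard triangular construction with hypothetical coefficient names `a`, `b` and `c`, and takes positive semi-definiteness of the 3x3 correlation matrix as the compatibility requirement.

```python
import numpy as np

def is_compatible(rho_xu, rho_zx, rho_zu):
    """True when the three correlations form a valid correlation matrix
    (its determinant is non-negative)."""
    det = (1.0 - rho_xu**2 - rho_zx**2 - rho_zu**2
           + 2.0 * rho_xu * rho_zx * rho_zu)
    return det >= 0.0

def draw_uxz(n, rho_xu, rho_zx, rho_zu, rng):
    """Draw demeaned series (u, x, z) with unit variances and the requested
    population correlations. Requires |rho_xu| < 1."""
    if not is_compatible(rho_xu, rho_zx, rho_zu):
        raise ValueError("incompatible correlations")
    a = np.sqrt(1.0 - rho_xu**2)              # weight of 2nd drawing in x
    b = (rho_zx - rho_zu * rho_xu) / a        # weight of 2nd drawing in z
    c = np.sqrt(max(1.0 - rho_zu**2 - b**2, 0.0))  # weight of 3rd drawing in z
    e1, e2, e3 = rng.standard_normal((3, n))  # three independent N(0,1) series
    u = e1
    x = rho_xu * e1 + a * e2
    z = rho_zu * e1 + b * e2 + c * e3
    # Subtract sample averages, which partials out an intercept.
    return tuple(v - v.mean() for v in (u, x, z))

rng = np.random.default_rng(12345)
u, x, z = draw_uxz(200_000, rho_xu=0.5, rho_zx=0.4, rho_zu=0.1, rng=rng)
sample_corr = np.corrcoef([u, x, z])          # rows/columns ordered u, x, z
```

With a large sample the entries of `sample_corr` should be close to the requested correlations, while a combination such as `rho_xu = rho_zx = 0.9` with `rho_zu = 0` is rejected as incompatible.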
The dependent variable is generated by a model in which the coefficient of $z_i$, denoted $\beta_z$, has true value zero. Its standard least-squares estimator (4.5) simplifies to a closed-form expression for $\hat\beta_z$, where we define the sample statistics as $r_{zu} = z'u/(z'z\,u'u)^{1/2}$ and $s_u = |\sqrt{s_u^2}|$ with $s_u^2 = u'u/n_1$ (and similarly for $r_{zx}$, $r_{xu}$, $s_x$ and $s_z$), where $n_1$ is either $n - 1$ or $n$, depending on whether deviations from the sample average have been taken or not.
For this special model we further have from (4.6), taking $r_1$ as appraisal of $\rho_1 = \rho_{xu}$, the factor

$$1 - \frac{r_1^2 s_x^2 s_z^2}{s_x^2 s_z^2 (1 - r_{zx}^2)} = \frac{1 - r_1^2 - r_{zx}^2}{1 - r_{zx}^2}, \qquad (5.10)$$

and, because $y'(I - P_X)y$ specializes here to $u'u[1 - (r_{xu}^2 - 2r_{xu}r_{zx}r_{zu} + r_{zu}^2)/(1 - r_{zx}^2)]$, we find for (4.7) the expression

$$\hat\sigma_u^2(r_1) = \frac{u'u}{n}\,\frac{1 - r_{zx}^2 - r_{zu}^2 + 2r_{zx}r_{zu}r_{xu} - r_{xu}^2}{1 - r_1^2 - r_{zx}^2}, \qquad (5.11)$$

which will be positive (as a variance estimate should) provided $r_1^2 + r_{zx}^2 < 1$, that is, provided the factor (5.10) is positive. Clearly, practical problems may emerge in cases where $r_1$ is chosen large in absolute value and $r_{zx}^2$ happens to be larger than $\rho_{zx}^2$. In the simulations we will monitor the occurrence of a non-positive factor (5.10) (which may be frequent, especially when $n$ is small and the variance of $r_{zx}$ large) but will skip such replications, because the KLS estimator $\hat\beta_z(r_1)$ is only defined when the factor (5.10) is positive. In this simple model the KLS estimator specializes to expression (5.12), which adds to $\hat\beta_z$ a correction term that is proportional to $\hat\sigma_u(r_1)$. For its estimated asymptotic variance, assuming $r_1 = \rho_1$, we find expression (5.13). From the expressions (5.12), (5.13) and (5.10) we observe that in this special model both $\hat\beta_z(r_1)$ and its asymptotic standard error are invariant with respect to $\beta$ and to $s_x$, whereas both are a multiple of $s_u/s_z$. Hence, in this simple model the exclusion restriction test statistic (4.2) will be invariant to $\beta$, $\sigma_u^2$, $\sigma_x^2$ and $\sigma_z^2$. Therefore, without loss of generality, we may set in the simulation $\beta = 0$ and $\sigma_u = \sigma_x = \sigma_z = 1$. Another invariance result is the following. If $r_1 = \rho_{xu}$ and we change the sign of two of the three correlations $\rho_{xu}$, $\rho_{zx}$ and $\rho_{zu}$, then the square of the exclusion test statistic does not change. Hence, considering in the simulations only cases where these three correlations are nonnegative (as we will) is not as restrictive as it seems at first sight.
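To make the quantities of this subsection concrete, the following sketch computes an inconsistency-corrected least-squares estimate for an assumed vector `r` of regressor-disturbance correlations. It is an illustration under stated assumptions, not the paper's code: the formulas used are the OLS estimate minus $\hat\sigma_u(r)\,S_{xx}^{-1}S_x r$, with $\hat\sigma_u^2(r)$ equal to the OLS residual variance divided by $1 - r'S_xS_{xx}^{-1}S_x r$, and all data are taken in deviation from their means. The function name `kls` and the simulated design are hypothetical.

```python
import numpy as np

def kls(y, X, r):
    """Corrected least-squares estimate for assumed correlations r between
    the columns of X and the disturbance (data assumed demeaned)."""
    n = X.shape[0]
    Sxx = X.T @ X / n
    sx = np.sqrt(np.diag(Sxx))                # regressor standard deviations
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b_ols
    v = sx * r                                # S_x r
    w = np.linalg.solve(Sxx, v)               # S_xx^{-1} S_x r
    factor = 1.0 - v @ w                      # must be positive
    if factor <= 0.0:
        raise ValueError("estimator undefined: factor is not positive")
    sigma_u = np.sqrt(resid @ resid / n / factor)
    return b_ols - sigma_u * w, sigma_u**2, factor

# Simulated example: corr(x,u) = 0.5, corr(z,x) = 0.4, corr(z,u) = 0.
rng = np.random.default_rng(2024)
n = 200_000
e1, e2, e3 = rng.standard_normal((3, n))
u = e1
x = 0.5 * e1 + np.sqrt(0.75) * e2
z = (0.4 / np.sqrt(0.75)) * e2 + np.sqrt(1.0 - 0.16 / 0.75) * e3
y = 0.3 * x + u                               # true coefficients (0.3, 0)
X = np.column_stack([x - x.mean(), z - z.mean()])
y = y - y.mean()

b_kls, s2_u, factor = kls(y, X, np.array([0.5, 0.0]))
b_ols, _, _ = kls(y, X, np.zeros(2))          # r = 0 reproduces OLS
```

In this design the corrected estimate should be close to (0.3, 0), whereas the uncorrected OLS estimate is not, and `factor` should be close to $1 - 0.25/0.84$.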
We also find an expression for $\operatorname{plim}\hat\beta_z(r_1)$ which is zero when $\rho_{zu} = 0$. Because it seems to be mostly non-zero for $\rho_{zu} \neq 0$, we are hopeful that a test based on the KLS estimator $\hat\beta_z(r_1)$ may have power for testing the invalidity of instrument $z_i$ for the regression of $y_i$ on $x_i$. From (5.9) we find that the pseudo-true value of $\hat\beta_z$ is non-zero in general, unless $\rho_{zu} = \rho_1\rho_{zx}$. Hence, even when $\rho_{zu} = 0$ it may be non-zero, unless also $\rho_1 = 0$ or $\rho_{zx} = 0$. So obviously, the exclusion restriction should not be tested on the basis of the standard OLS estimator $\hat\beta_z$.

In Tables 1 ($n = 500$) and 2 ($n = 50$) we report the rejection frequency of the two-sided unfeasible test based on the square of statistic (4.2), where we substituted the true value of $\rho_1$ for $r_1$. Since we took the generated data series in deviation from their sample mean we used $n_1 = n - 1$ and, employing the 5% nominal critical value of the F-distribution, we took it at 1 and $n - 3$ degrees of freedom. In the block of results for $\rho_{zu} = 0$ we observe in Table 1 that the asymptotic test using the true value of $\rho_1$ demonstrates very good size control when $n = 500$. According to inequality (5.7) the model is not defined for cases where $\rho_{zu}$ is moderate and both $\rho_{xu}$ and $\rho_{zx}$ are large in absolute value (indicated by "-" in the tables). As already predicted, for cases where $\rho_{xu}^2 + \rho_{zx}^2$ is close to unity we observe deterioration of the performance of the test. Settings for which some experiments produced non-positive realizations of the factor (5.10), evaluated at $\rho_1$, are indicated by a hashtag. In fact, undefinedness of KLS occurred for those cases in about 50% of the replications.
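The warning against using the OLS coefficient of $z_i$ for the exclusion test can be illustrated numerically. The sketch below assumes unit variances and uses the standard two-regressor OLS algebra, under which the probability limit of the OLS coefficient on $z_i$ is $(\rho_{zu} - \rho_{xu}\rho_{zx})/(1 - \rho_{zx}^2)$. This closed form is supplied here as an assumption consistent with the condition stated above and is not copied from (5.9).

```python
import numpy as np

def ols_pseudo_true_z(rho_xu, rho_zx, rho_zu):
    """Large-sample limit of the OLS coefficient on z when the true
    coefficient is zero and all variances equal one (assumed form)."""
    return (rho_zu - rho_xu * rho_zx) / (1.0 - rho_zx**2)

# A valid instrument (corr(z,u) = 0) in a model with simultaneity 0.5.
rng = np.random.default_rng(7)
n = 200_000
e1, e2, e3 = rng.standard_normal((3, n))
u = e1
x = 0.5 * e1 + np.sqrt(0.75) * e2                      # corr(x,u) = 0.5
z = (0.4 / np.sqrt(0.75)) * e2 + np.sqrt(1.0 - 0.16 / 0.75) * e3
y = u                                                  # beta = 0, beta_z = 0
X = np.column_stack([x, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

limit = ols_pseudo_true_z(0.5, 0.4, 0.0)               # -0.2 / 0.84
```

Although the instrument is valid here, `coef[1]` settles near `limit`, which is clearly non-zero, so a naive significance test on the OLS coefficient would wrongly flag the instrument.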
At this large sample size the power of the test is already remarkable for $\rho_{zu} = 0.05$, impressive for $\rho_{zu} = 0.1$ and outright splendid for $\rho_{zu} \geq 0.2$. Apart from cases close to the non-existence region, the rejection probability is found to be almost invariant with respect to the degree of simultaneity $\rho_{xu}$. For $\rho_{zu} \neq 0$ the rejection probability increases with the absolute value of $\rho_{zx}$, but is also very good for $\rho_{zx} = 0$, so the KLS exclusion restriction test does not suffer in any way from weak instrument problems. Table 2 presents similar findings for sample size $n = 50$. Note that all results marked by an asterisk or hashtag have been obtained from fewer than 250,000 replications, because replications in which the factor (5.10), evaluated at $\rho_1$, turned out to be non-positive have been skipped (this occurred with frequency less than 5% for cases indicated by an asterisk and over 50% for cases indicated by a hashtag). Because in smaller samples $r_{zx}$ may deviate much more from $\rho_{zx}$, we note deterioration of the test qualities over a larger band of cases approaching the non-existence area. Otherwise, however, the size properties of the test are still appropriate and power improves with the absolute value of $\rho_{zu}$, but self-evidently not as sharply as for larger samples.
In practice $r_1$ will usually deviate from $\rho_1$. Therefore, as in Kiviet (2013) for inference on $\beta$, we will now examine the merits of a feasible exclusion restriction test in this simple model when employed on the basis of an interval $[r_1^L, r_1^U]$ which is supposed to contain $\rho_1$. We investigate the three cases $r_1^L = \rho_1 - 0.1$, $r_1^U = \rho_1 + 0.1$; $r_1^L = \rho_1 - 0.2$, $r_1^U = \rho_1 + 0.2$; and $r_1^L = 0$, $r_1^U = 0.3$. From Table 3, where $n = 100$, we see that when $\rho_{xu} \in [r_1^L, r_1^U]$ the test is undersized, and still has remarkable power away from the non-existence region. The bottom two rows, where $\rho_{xu} \notin [r_1^L, r_1^U]$, show that the test can be either conservative or liberal. Far away from the non-existence region the test may still help to produce useful inference on instrument (in)validity, but its results become uninterpretable otherwise. In this table the asterisk stands for a frequency of obtaining an undefined test not exceeding 5% and a hashtag for a frequency exceeding 45%.
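One conservative way to base a decision on an interval rather than a single value is to reject only when the test rejects at every grid point of the interval, that is, to use the largest P-value over the interval. The sketch below illustrates that idea only. It is an assumption about how an interval could be used, not necessarily the rule applied in the simulations reported here, and `toy_p_value` is a hypothetical stand-in for a routine that returns the exclusion test P-value at a given $r_1$.

```python
import numpy as np

def interval_p_value(p_value_at, r_low, r_high, step=0.01):
    """Largest P-value over a grid of r values covering [r_low, r_high]."""
    grid = np.arange(r_low, r_high + step / 2.0, step)
    return max(p_value_at(r) for r in grid)

def toy_p_value(r):
    # Hypothetical P-value curve, used only to exercise the function above.
    return min(1.0, 0.02 + 2.0 * abs(r - 0.1))

p_conservative = interval_p_value(toy_p_value, 0.0, 0.3)
reject_at_5_percent = p_conservative < 0.05
```

A rule of this kind cannot reject more often than the test at the true value when the interval contains it, which matches the undersized behavior described above.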

Findings for a model with two endogenous regressors
As Section 3 on the simulation study of Kiviet and Pleus (2016) shows, designing a just identified simultaneous model with two endogenous variables such that one can easily control the degree of simultaneity, the strength of the instruments and the multicollinearity between the regressors is not self-evident. For the present purpose the situation is even more complex, because we will have to allow for possible invalidity of the instruments as well when analyzing the power of exclusion restriction tests. We proceed as follows. Let $5 \times 1$ vectors contain (for $i = 1, \dots, n$) independent drawings from a five element multivariate standard normal distribution. Now consider the linear transformation (5.14) of these vectors into $d_i = (x_i^{(1)}, x_i^{(2)}, z_i^{(1)}, z_i^{(2)}, u_i)'$, with $A = (a_{jl})$ a $5 \times 5$ upper-triangular real valued matrix. To ensure that all elements of $d_i$ have unit variance, each of the five rows of matrix $A$ should have an inner product with itself equal to unity. This directly implies $a_{55} = 1$, so that $u_i$ equals the fifth element of the vector of standard normal drawings. Note that the final column of $A$ actually represents $(\rho_{x^{(1)}u}, \rho_{x^{(2)}u}, \rho_{z^{(1)}u}, \rho_{z^{(2)}u}, 1)'$. In the simulation we will control these four correlation parameters by choosing empirically relevant values for them, as well as for six other relevant correlations, all in the $(-1, +1)$ interval.
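The 10 as yet unknown elements of $A$ will follow from these 6+4 correlations, the first four equations of (5.14), which are the expressions (5.15) through (5.18) for $x_i^{(1)}$, $x_i^{(2)}$, $z_i^{(1)}$ and $z_i^{(2)}$ respectively, and the imposed unit variance of all elements of $d_i$. The unit variance of (5.18) implies $a_{44} = (1 - \rho_{z^{(2)}u}^2)^{1/2}$. By controlling the value of $\rho_{z^{(1)}z^{(2)}}$, which follows from (5.17) and (5.18) to be $\rho_{z^{(1)}u}\rho_{z^{(2)}u} + a_{34}a_{44}$, we find

$$a_{34} = (\rho_{z^{(1)}z^{(2)}} - \rho_{z^{(1)}u}\rho_{z^{(2)}u})/a_{44}. \qquad (5.20)$$

Correlating (5.18) and (5.16) we find $\rho_{z^{(2)}x^{(2)}} = a_{44}a_{24} + \rho_{z^{(2)}u}\rho_{x^{(2)}u}$, so $a_{24} = (\rho_{z^{(2)}x^{(2)}} - \rho_{z^{(2)}u}\rho_{x^{(2)}u})/a_{44}$, and correlating (5.18) and (5.15) gives $\rho_{z^{(2)}x^{(1)}} = a_{44}a_{14} + \rho_{z^{(2)}u}\rho_{x^{(1)}u}$, hence $a_{14} = (\rho_{z^{(2)}x^{(1)}} - \rho_{z^{(2)}u}\rho_{x^{(1)}u})/a_{44}$. Due to the unit variance of (5.17) we have $a_{33} = (1 - a_{34}^2 - \rho_{z^{(1)}u}^2)^{1/2}$. Then from $\rho_{z^{(1)}x^{(2)}} = a_{33}a_{23} + a_{34}a_{24} + \rho_{z^{(1)}u}\rho_{x^{(2)}u}$ we obtain

$$a_{23} = (\rho_{z^{(1)}x^{(2)}} - a_{34}a_{24} - \rho_{z^{(1)}u}\rho_{x^{(2)}u})/a_{33}, \qquad (5.24)$$

and from $\rho_{z^{(1)}x^{(1)}} = a_{33}a_{13} + a_{34}a_{14} + \rho_{z^{(1)}u}\rho_{x^{(1)}u}$ we find

$$a_{13} = (\rho_{z^{(1)}x^{(1)}} - a_{34}a_{14} - \rho_{z^{(1)}u}\rho_{x^{(1)}u})/a_{33}. \qquad (5.25)$$

The unit variance of (5.16) yields $a_{22} = (1 - a_{23}^2 - a_{24}^2 - \rho_{x^{(2)}u}^2)^{1/2}$, and from $\rho_{x^{(1)}x^{(2)}} = a_{12}a_{22} + a_{13}a_{23} + a_{14}a_{24} + \rho_{x^{(1)}u}\rho_{x^{(2)}u}$ we get $a_{12} = (\rho_{x^{(1)}x^{(2)}} - a_{13}a_{23} - a_{14}a_{24} - \rho_{x^{(1)}u}\rho_{x^{(2)}u})/a_{22}$, while the unit variance of (5.15) gives $a_{11} = (1 - a_{12}^2 - a_{13}^2 - a_{14}^2 - \rho_{x^{(1)}u}^2)^{1/2}$. So, all elements of matrix $A$ can be expressed in the 10 correlations $\rho_{x^{(1)}u}$, $\rho_{x^{(2)}u}$, $\rho_{z^{(1)}u}$, $\rho_{z^{(2)}u}$, $\rho_{x^{(1)}x^{(2)}}$, $\rho_{z^{(1)}x^{(1)}}$, $\rho_{z^{(1)}x^{(2)}}$, $\rho_{z^{(2)}x^{(1)}}$, $\rho_{z^{(2)}x^{(2)}}$ and $\rho_{z^{(1)}z^{(2)}}$. Not all combinations of values for these correlations in the $(-1, 1)$ interval will be compatible though. Obvious requirements are

$$a_{34}^2 + \rho_{z^{(1)}u}^2 < 1, \quad a_{23}^2 + a_{24}^2 + \rho_{x^{(2)}u}^2 < 1, \quad a_{12}^2 + a_{13}^2 + a_{14}^2 + \rho_{x^{(1)}u}^2 \leq 1. \qquad (5.29)$$

We will examine just a few compatible combinations of the ten correlations which seem relevant, and consider only solutions on the basis of the positive square roots for the diagonal elements of $A$. That all elements of $d_i$ have unit variance does not lead to loss of generality.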
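The recursion described in this paragraph, working from the last column of $A$ towards the first, can be sketched as follows. The function name `build_A` and the ordering $(x^{(1)}, x^{(2)}, z^{(1)}, z^{(2)}, u)$ of the target correlation matrix `R` are assumptions made for this illustration. A non-positive value under a square root signals an incompatible set of correlations.

```python
import numpy as np

def build_A(R):
    """Upper-triangular A with A @ A.T == R for a correlation matrix R
    (unit diagonal), filled column by column from the right."""
    R = np.asarray(R, dtype=float)
    m = R.shape[0]
    A = np.zeros((m, m))
    for j in range(m - 1, -1, -1):
        rest = 1.0 - np.sum(A[j, j + 1:] ** 2)     # unit variance of row j
        if rest <= 0.0:
            raise ValueError("incompatible correlations")
        A[j, j] = np.sqrt(rest)                     # positive square root
        for i in range(j):
            # correlation between elements i and j pins down A[i, j]
            A[i, j] = (R[i, j] - A[i, j + 1:] @ A[j, j + 1:]) / A[j, j]
    return A

# Example with order (x1, x2, z1, z2, u); values are illustrative.
R = np.eye(5)
pairs = {(0, 4): 0.2, (1, 4): 0.0, (2, 4): 0.4, (3, 4): 0.2,   # with u
         (0, 1): 0.2, (0, 2): 0.3, (1, 2): 0.0,                # x1x2, z1x1, z1x2
         (0, 3): 0.1, (1, 3): 0.2, (2, 3): 0.0}                # z2x1, z2x2, z1z2
for (i, j), value in pairs.items():
    R[i, j] = R[j, i] = value

A = build_A(R)
rng = np.random.default_rng(1)
d = rng.standard_normal((100_000, 5)) @ A.T        # rows are draws of d_i
```

Here `A[3, 3]` reproduces $(1 - \rho_{z^{(2)}u}^2)^{1/2}$ and `A[2, 3]` reproduces (5.20), and the sample correlations of the columns of `d` should be close to `R`.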
The values to be chosen for $\beta_1$ and $\beta_2$ can compensate for the unit variance of $x_i^{(1)}$ and $x_i^{(2)}$ in the model $y_i = \beta_1 x_i^{(1)} + \beta_2 x_i^{(2)} + u_i$. The KLS based test of joint restrictions on $\beta_1$ and $\beta_2$ can be shown to be invariant with respect to $\beta_1$ and $\beta_2$ when the null is true, and so is the KLS based test on the joint significance of $z_i^{(1)}$ and $z_i^{(2)}$ when added to this model, both under the null and under alternatives, and it is also invariant to the scale of all regressors. So when investigating the size of the KLS test of joint restrictions on $\beta_1$ and $\beta_2$, and the rejection probability both under the null and under alternatives for the joint exclusion restrictions test, we may without loss of generality set $\beta_1 = \beta_2 = 0$. In the simulations we will again take all vectors $d_i$ in deviation from sample averages, so the results are actually about models that include an intercept as well, whereas they in fact also hold for models which yield similar vectors $d_i$ after partialling out any number of arbitrary further exogenous regressors. Table 4 presents some results for the two types of KLS tests for the ideal (but unrealistic) situation that $r = \rho_{xu}$ (the true value of the degree of simultaneity is known). We examined all 1024 combinations of $\rho_{x^{(1)}u} \in \{0.2, 0.5\}$, $\rho_{x^{(2)}u} \in \{0.0, 0.3\}$, $\rho_{z^{(1)}u} \in \{0.0, 0.4\}$, $\rho_{z^{(2)}u} \in \{0.0, 0.2\}$, $\rho_{x^{(1)}x^{(2)}} \in \{0.2, 0.6\}$, $\rho_{z^{(1)}x^{(1)}} \in \{0.3, 0.6\}$, $\rho_{z^{(1)}x^{(2)}} \in \{0.0, 0.3\}$, $\rho_{z^{(2)}x^{(1)}} \in \{0.1, 0.3\}$, $\rho_{z^{(2)}x^{(2)}} \in \{0.2, 0.5\}$ and $\rho_{z^{(1)}z^{(2)}} \in \{0.0, 0.3\}$, but present only 64 of them in Table 4. The table has two panels.
In the left one we present all 32 results for the lower values of $\rho_{z^{(1)}x^{(1)}}$, $\rho_{z^{(1)}x^{(2)}}$, $\rho_{z^{(2)}x^{(1)}}$, $\rho_{z^{(2)}x^{(2)}}$ and $\rho_{z^{(1)}z^{(2)}}$, and in the right-hand panel those for their higher values. In all experiments the chosen correlation coefficients obeyed the compatibility criteria (5.29). $R_x$ represents the rejection frequency (the estimated actual significance level) of the joint significance test on the coefficients of $x_i$; $R_z$ is the rejection frequency of the joint exclusion restrictions test, which represents its estimated actual significance level for cases where $\rho_{z^{(1)}u} = \rho_{z^{(2)}u} = 0$. For $R_z$ results marked with an asterisk we found the test to be undefined (the multivariate counterpart of the factor (5.10) being non-positive) in less than 0.1% of the replications. However, in a few of the cases not included in the table we found the exclusion test to be undefined much more frequently.
From Table 4 we observe that the size properties of both unfeasible KLS test procedures are also very reasonable in the more general model, whereas the power of the exclusion restrictions test seems fine. This being the case for the ideal situation in which $\rho_{x^{(1)}u}$ and $\rho_{x^{(2)}u}$ are supposed to be known provides the appropriate starting point for reasonably successful implementations under more realistic assumptions, as we saw in the preceding subsection.
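The size of the design grid follows from ten correlations taking two values each. The sketch below enumerates that grid and, as an added check that is not taken from the paper, counts how many combinations yield a positive definite correlation matrix. The short key names and the ordering $(x^{(1)}, x^{(2)}, z^{(1)}, z^{(2)}, u)$ are illustrative assumptions.

```python
from itertools import product
import numpy as np

# Two levels per correlation; keys name the pair of variables involved.
levels = {
    "x1u": (0.2, 0.5), "x2u": (0.0, 0.3), "z1u": (0.0, 0.4), "z2u": (0.0, 0.2),
    "x1x2": (0.2, 0.6), "z1x1": (0.3, 0.6), "z1x2": (0.0, 0.3),
    "z2x1": (0.1, 0.3), "z2x2": (0.2, 0.5), "z1z2": (0.0, 0.3),
}
# Position of each pair in a matrix ordered (x1, x2, z1, z2, u).
position = {
    "x1u": (0, 4), "x2u": (1, 4), "z1u": (2, 4), "z2u": (3, 4),
    "x1x2": (0, 1), "z1x1": (0, 2), "z1x2": (1, 2),
    "z2x1": (0, 3), "z2x2": (1, 3), "z1z2": (2, 3),
}

def corr_matrix(combo):
    """5x5 correlation matrix implied by one combination of the grid."""
    R = np.eye(5)
    for name, value in combo.items():
        i, j = position[name]
        R[i, j] = R[j, i] = value
    return R

grid = [dict(zip(levels, values)) for values in product(*levels.values())]
n_combinations = len(grid)                       # 2 ** 10
n_positive_definite = sum(
    np.linalg.eigvalsh(corr_matrix(c)).min() > 0.0 for c in grid
)
```

Each element of `grid` can be passed through `corr_matrix` to obtain the target correlations for one experiment.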

Empirical illustrations
We will produce empirical results for three different cross-section data sets.

A wage equation for employed women
As a first illustration we closely follow a textbook example on IV/TSLS estimation given in Carter Hill et al. (2012, p.415). It concerns a subset of data originating from Mroz, namely a few variables on a sample of $n = 428$ employed women. After taking all these variables in deviation from their mean, the relationship considered is a special case of model (2.1), namely model (6.1) with $K_1 = 1$ and $K_2 = 2$, where $y_i$ is the log of wage, $x_i^{(1)}$ is education in years and vector $x_i^{(2)}$ contains the variables experience in years and its square. It is assumed that $x_i^{(1)}$ is actually a proxy for an unavailable variable, which should express ability or intelligence. So, implicitly it is assumed that applying OLS to model (6.2), which contains this unavailable variable instead of its proxy, would be consistent, because its disturbance is uncorrelated with its regressors. However, regression (6.2) being unfeasible, model (6.1) is estimated using $L_2 = 2$ external instrumental variables: $z_i^{(1)}$ is the education in years of the mother and $z_i^{(2)}$ that of the father. The proxy is linked to the unavailable variable by relation (6.3), which has error term $v_i$. Substituting (6.3) into (6.2) then expresses the coefficients $\beta_1$ and $\beta_2$ and the disturbance $u_i$ of model (6.1) in terms of the parameters and disturbances of (6.2) and (6.3). It seems plausible to assume that $v_i$ is uncorrelated with the disturbance of (6.2), because otherwise $x_i^{(1)}$ would in fact be an omitted variable from regression (6.2). So we find expression (6.4) for $\rho_1$, assuming that $E(v_i^2)$ is a fraction of $\sigma_1^2$ which lies, say, between 0.1 and 0.4. Coefficient $100\beta_1$ represents the percentage wage increase per extra year of education. Since we expect $0 < \beta_1 < 0.1$, we may suppose $-1 < \rho_1 < 0$, so that the OLS estimate of $\beta_1$ will have a negative bias. If the remaining factor in (6.4) is about 2 (which it would be on the basis of the TSLS findings), then we should have a special interest in examining the range $-0.8 < \rho_1 < -0.2$. Because we deduced that $E(x_i^{(1)}u_i) \neq 0$, regressor $x_i^{(1)}$ is endogenous in regression (2.1). Its endogeneity is not due to classic simultaneity or dual or reciprocal causality, but simply stems from an omitted explanatory variable which has been replaced by a proxy variable, so the origin of the endogeneity is actually measurement error. Thus, the endogeneity of $x_i^{(1)}$ is not intrinsic here, but incurred.
That $x_i^{(1)}$ is the one and only endogenous regressor in (6.1) is due to the untested assumption $E(x_i^{(2)}v_i) = 0$. Figure 1 shows over a wide range of $r_1$ values the P-values of the single just-identifying exclusion restriction tests for the variables $z_i^{(1)}$ and $z_i^{(2)}$ respectively. Over the range $-0.9 < \rho_1 < -0.1$ (which we suppose to contain the true value of $\rho_1$) all calculated P-values are below 5% for both variables. Testing their joint exclusion (results not presented) leads to the same conclusion. Thus, overwhelming evidence has been found which forces the conclusion that these instruments are invalid. Nevertheless, the standard methods strongly support the TSLS results for the chosen specification. In the reduced form regression for $x_i^{(1)}$ to which $z_i^{(1)}$ is added, its F-test value is 73.95, whereas this is 87.74 for $z_i^{(2)}$, so both instruments seem pretty strong. Jointly they have an F-value of 55.40. Also, the Sargan test for the single over-identification restriction when using both instruments has P-value 0.54. So, according to standard practice methods, acceptance of the TSLS results seems vindicated; the invalidity of the instruments remains undetected. There is an aspect overlooked by Carter Hill et al. (2014), which seems to reveal the inconsistency of TSLS for the present model. They find a larger estimate of $\beta_1$ by OLS than by TSLS and argue that this is to be expected if variable $x_i^{(1)}$ is positively correlated with the omitted factors in the error term. However, as we demonstrate above, we expect $\rho_1$ to be negative, and hence the inconsistency of the OLS estimator of $\beta_1$ should be negative too. So, supposing TSLS to be consistent, one should expect OLS to yield a smaller estimate than TSLS. The negative sign of $\rho_1$ implies that the KLS estimator of $\beta_1$ will turn out to be larger than the OLS estimator. Figure 2 shows the TSLS asymptotic 95% confidence interval for $\beta_1$ (red dotted lines), which is invariant regarding $r_1$, and is centered at the TSLS estimate 0.0614 (red line).
It also shows the KLS estimator (blue line), which varies with $r_1$, and the KLS asymptotic 95% confidence interval (blue dotted lines). The right-hand-side graph zooms in on the area which we suppose to comprise the true value of $\rho_1$. The standard OLS nominal 95% confidence interval is indicated at $r_1 = 0$, centered around 0.108. Figure 2 shows that for substantially negative values of $\rho_1$ the consistent (when the assumptions of Section 2 apply) KLS estimators produce ludicrous values for $\beta_1$. Hence, the conclusion must not simply be that the two external instruments are invalid for model (6.1), as Figure 1 shows, but that more serious specification problems undermine this model than just endogeneity of $x_i^{(1)}$.

An analysis of the weight of newborns
We shall present another simple illustration. In Wooldridge (2010, p.116) an exercise is presented in which it is analyzed for $n = 1388$ newborns whether smoking by their mother during pregnancy affects birth weight. A model like (6.1) is analyzed with $K_1 = 1$ but now $K_2 = 3$, where $y_i$ is the log of birth weight, $x_i^{(1)}$ is the average number of packs of cigarettes smoked during pregnancy and vector $x_i^{(2)}$ contains a dummy for the gender of the baby, a variable parity, which is the birth order of the child, and the log of family income. It is assumed that $x_i^{(1)}$ is correlated with the disturbance term, because various further determinants of birth weight may be correlated with smoking behavior, such as alcohol use, health consciousness, fitness activity, stress, food, sleep and many more, and these have all been omitted from the model. It is suggested to use the price of cigarettes as an instrumental variable, because economic theory predicts that it is negatively correlated with packs smoked, whereas it does not seem likely that this price has a direct effect on birth weight.
OLS yields a coefficient estimate for packs of -0.084 with standard error 0.017, which (when consistent) would suggest that each extra cigarette smoked per day (each package containing 20 cigarettes) reduces birth weight by about 0.4%. TSLS yields an outrageous coefficient of 0.651 (positive!) with standard error 0.854. These are clearly affected by weakness of the instrument (price elasticity will be moderate because smoking is addictive), since the relevant F-value in the first-stage regression is only 1.00.
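The per-cigarette interpretation of the OLS coefficient, and the imprecision of the TSLS estimate, follow from simple arithmetic on the reported numbers. The sketch below only reworks the figures quoted above; the variable names are illustrative, and a log-point change is read approximately as a percentage change.

```python
# Reported estimates for the coefficient of packs smoked per day.
ols_coef, ols_se = -0.084, 0.017
tsls_coef, tsls_se = 0.651, 0.854
cigarettes_per_pack = 20

# Effect of one extra cigarette per day on log birth weight, in percent.
pct_per_cigarette = 100.0 * ols_coef / cigarettes_per_pack

# Ratios of estimate to standard error for both estimators.
ols_t = ols_coef / ols_se
tsls_t = tsls_coef / tsls_se
```

The OLS ratio is far from zero in absolute value, while the TSLS ratio is below one, which is in line with the weak first-stage F-value.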
To this standard evidence the procedures developed in this study can add the following. The analysis presented at the start of this section can now be interpreted as follows. Suppose $x_i^{(1)}$ is used here as a proxy for the comprehensive latent variable "life-style risks for baby's birth weight". Then the sign argument given there carries over, except that we now expect $\beta_1 < 0$. So here $\rho_1 > 0$, which would render the OLS estimator of $\beta_1$ positively biased, which suggests that an extra cigarette per day may reduce birth weight by more than 0.4%. Suppose that in fact $-0.08 > \beta_1 > -0.15$, that the ratio $E(v_i^2)/\sigma_1^2$ lies between 0.2 and 0.8, and that $0.5 < \sigma_1/\sigma_u < 2$; then it follows from (6.4) that $0.008 < \rho_1 < 0.24$. From the left-hand side of Figure 3 we can see that over this area the validity of the instrument lacks strong support. Although the exclusion test does not force rejection at a significance level smaller than 10%, in order to justify the use of the instrument a much larger P-value, say exceeding 50%, would provide much more comfort. The relatively low P-values for $\rho_1 > 0$ also do not encourage moving on to weak instrument techniques.
Assuming that the conditions to apply KLS do hold, the right-hand-side graph in Figure 3 shows that if we knew $\rho_1$ we would be able to produce highly accurate KLS inference on $\beta_1$ (blue lines; the confidence interval is so narrow that the figure barely shows it). For instance, it enables us to infer rejection of the hypothesis $\beta_1 > 0$, provided $\rho_1 > -0.05$. KLS also allows a sensitivity analysis of TSLS. It shows that the extremely wide (and hence pretty useless) TSLS confidence interval is conservative at the nominal 95% level, provided $-0.95 < \rho_1 < 0.9$. Zooming in on this figure yields the left-hand side of Figure 4, from which we can deduce that for $0 \leq \rho_1 \leq 0.35$ (which does not seem unrealistic) the confidence set $-0.36 \leq \beta_1 \leq -0.05$ has asymptotic confidence coefficient 0.95. In the right-hand side of Figure 4 we produce KLS inference on one of the coefficients of the exogenous regressors, namely the log of family income. The TSLS estimate of this coefficient is 0.064 and its 95% confidence interval is $(-0.048, 0.175)$, but KLS shows that for realistic values of $\rho_1$ this coefficient is much smaller and significantly negative. Hence, for realistic values of $\rho_1$ the TSLS interval is liberal here (actual confidence coefficient smaller than the nominal coefficient), whereas it is predominantly conservative in the right-hand side of Figure 3.
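The bounds on the simultaneity correlation quoted in this paragraph can be checked by interval arithmetic. The sketch below assumes that the magnitude implied by (6.4) is the product of three positive quantities: the absolute coefficient, the variance ratio and the ratio of standard deviations. That product form is an assumption made here because it reproduces the stated bounds exactly; it is not a quotation of (6.4).

```python
# Assumed ranges (lower, upper) for the three positive factors.
abs_coefficient = (0.08, 0.15)     # |beta_1|
variance_ratio = (0.2, 0.8)        # share of proxy error variance
sd_ratio = (0.5, 2.0)              # sigma_1 / sigma_u

# A product of positive factors is smallest at all lower bounds
# and largest at all upper bounds.
rho_low = abs_coefficient[0] * variance_ratio[0] * sd_ratio[0]
rho_high = abs_coefficient[1] * variance_ratio[1] * sd_ratio[1]
```

Under this assumed form the resulting interval matches the range of positive correlations over which Figure 3 is read.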

A wage equation for young men
The above illustrations required using the simple Corollary 1.2 only, because they concern models with just one endogenous regressor. Next we will exploit Theorem 1 in its full complexity in an empirical model where $K_1 = 2$, which is based on a classic data set originating from work by Griliches and also used for illustrative purposes on a subset ($n = 758$) of these data on young men in Hayashi (2000, p.251). In Kiviet and Pleus (2017, p.18) we used the same data to illustrate tests on establishing the endogeneity of subsets of regressors. These are built on assuming validity of two untestable (according to the classic approach) identifying orthogonality conditions. As in the first illustration, log wage is the dependent variable, but next to schooling also an iq test score is a possibly endogenous regressor, in addition to a range of exogenous controls ($K_2 = 11$), including age and experience. The external instruments used are ($L_2 = 4$): age2 (age squared), expr2 (experience squared), kww (another test score) and kww2 (kww squared). The overall Sargan test (2 degrees of freedom) has a satisfying P-value of 0.89, but it leaves two underlying just-identifying restrictions untested. Figure 5 shows (colored) contour plots for the P-values of the KLS based exclusion restrictions tests ($L_2 = 2$) of age2 and expr2 (left-hand) and kww and kww2 (right-hand) respectively. These plots have been obtained by calculating test statistic $W$ of (4.9) over a range of values for the simultaneity correlations, where $r_1$ refers to schooling and $r_2$ to iq score. For both, the grid values -0.99:0.01:0.99 have been examined. For cases where $r'S_xS_{xx}^{-1}S_xr > 0.99$ we set the P-value at 1.1, to be interpreted as "not defined". Both plots show that the statistic is defined over an ellipse. The exclusion restriction regarding the squares of both included regressors age and expr does not have to be rejected whatever the true values of $\rho_1$ and $\rho_2$ may be, since all P-values exceed 0.75.
This is pretty hard evidence (although not irrefutable) on the possible validity of these two instruments. For test score variable kww and its square the situation is different. Over a substantial area of $(\rho_1, \rho_2)$ combinations their exclusion test has P-values well below 0.1, whereas the area where it exceeds 0.7 forms just a narrow shell, covering cases where $\rho_1^2 + \rho_2^2$ is relatively large. Especially when the simultaneity is nonexistent or mild, the validity of kww and kww2 as instruments seems doubtful. Assuming that both schooling and iq are positively related to "ability", we expect both $\rho_1$ and $\rho_2$ to be mildly positive. In the left-hand side contour plot of Figure 6 we test the exclusion of the $L_2 = 4$ variables jointly. Now no P-values below 0.18 are obtained. This demonstrates that the exclusion test may have limited power when some rightly (age2 and expr2) and some wrongly (kww and kww2) excluded regressors are tested jointly. In the right-hand side of Figure 6 we test the model in which kww has been included as an exogenous regressor ($K_2 = 12$), and it is tested whether the $L_2 = 3$ squared variables seem valid external instruments for the two endogenous regressors schooling and iq. Figure 6 highlights that the inference on endogeneity of these two regressors as presented in Kiviet and Pleus (2017), which uses the model and instruments of the left-hand contour plot, although supported by a large P-value of the Sargan test, would better have been executed in the model and with the external instruments of the right-hand contour plot, because this does not discourage the use of these three instruments irrespective of the actual values of $\rho_1$ and $\rho_2$. The results just found can be used to precede a traditional TSLS analysis, as performed in Kiviet and Pleus (2017).
Alternatively, they can be interpreted as supporting the specification of the model which includes kww as an extra regressor and excludes the three squared variables ($K_2 = 12$), while treating both schooling and iq as endogenous ($K_1 = 2$), after which one can analyze its coefficients on the basis of KLS inference, thus avoiding the use of possibly weak or invalid instruments. We will do the latter here, just focusing on the coefficients of iq and kww. We perform one-sided tests on the single hypothesis that these coefficients exceed some particular value. For this value we chose their estimated values as obtained by OLS, which are 0.0029 and 0.0045 respectively. From the contours of Figure 7 one can see that, roughly, this hypothesis is rejected for iq when $\rho_1 > 2\rho_2$ and for kww when $\rho_1 > 0.9\rho_2$. Assuming both $\rho_1$ and $\rho_2$ to be about 0.3, it seems likely that the coefficient of iq is smaller than 0.0029 and that of kww larger than 0.0045. To depict similar inference results when we would allow kww to be endogenous too becomes of course much more complicated. However, it is not impossible. One could produce, next to the above results where kww is supposed to be exogenous (which could be expressed as $\rho_3 = 0$), also contours for some cases where, for instance, $\rho_3$ = 0.1:0.1:0.5. Hence, such contour plots allow one to examine the sensitivity of OLS or TSLS coefficient estimates, but also to produce inference on either instrument validity or regression coefficients, conditional on specific assumptions regarding simultaneity or to some degree robust to simultaneity.
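The elliptical shape of the region where the statistic is defined follows from the quadratic form in $r$ that is compared with the cut-off 0.99. The sketch below scans the same grid for a hypothetical case with only two regressors whose sample correlation is 0.5, in which case $S_xS_{xx}^{-1}S_x$ equals the inverse of their correlation matrix. With additional exogenous controls, the relevant 2x2 block of the inverse for all regressors would take its place. The function name and the chosen correlation are illustrative assumptions.

```python
import numpy as np

def defined_mask(C, grid, cutoff=0.99):
    """Boolean array over (r1, r2) grid points: True where the quadratic
    form r' C r does not exceed the cut-off, so the test is defined."""
    r1, r2 = np.meshgrid(grid, grid, indexing="ij")
    q = C[0, 0] * r1**2 + 2.0 * C[0, 1] * r1 * r2 + C[1, 1] * r2**2
    return q <= cutoff

grid = np.linspace(-0.99, 0.99, 199)            # -0.99:0.01:0.99
R_xx = np.array([[1.0, 0.5], [0.5, 1.0]])       # hypothetical correlation
C = np.linalg.inv(R_xx)                         # equals S_x S_xx^{-1} S_x here
mask = defined_mask(C, grid)
share_defined = mask.mean()                     # fraction of grid inside ellipse
```

Grid points outside the ellipse are the ones that would receive the placeholder P-value 1.1 in a contour plot.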

Conclusion
By incremental or difference Sargan-Hansen tests for over-identifying restrictions the validity of a subgroup of instruments can be tested, provided a sufficient number of valid (i.e. exogenous) and relevant (i.e. sufficiently strong) over- or just-identifying instruments are already available. When one wants to verify whether these latter instruments are really valid, the only route provided by the standard approach is to first adopt another non-testable set of valid identifying instruments. Thus, providing statistical evidence on the validity of all instruments is simply impossible by these tools. It mimics a situation where for a proof by mathematical induction one can prove the induction step, whereas proof of the truth of a base case is still missing, and seems even completely beyond reach.
However, a particular implementation of the KLS-based test procedure developed in this study, by which general linear restrictions can be tested in a multiple regression model with an arbitrary number of endogenous regressors without exploiting any external instruments (which is of substantial practical relevance by itself), also allows one to generate statistical evidence on the tenability of exclusion restrictions. In situations where in this way a just-identifying or over-identifying set of acceptable instruments has been established, it provides for a series of incremental Sargan-Hansen tests the essential underlying building block which was missing so far. This supporting building block seems mandatory if one still wants to employ IV-based estimation and inference. However, the general tools developed in this paper also allow one to produce inference on coefficient values while avoiding the use of external instruments altogether, and thus sidestepping all the ensuing problems such as sacrificing credibility, accuracy and power due to possible weakness or invalidity of instruments.
The tools developed here are not very demanding computationally, and can also be used to provide a sensitivity analysis of least-squares or instrumental variables based inferences with respect to less strict versions of the orthogonality assumptions on which these are built.
Of course, as always, deeper insights and further generalizations are called for. Preceding the usual Sargan-Hansen tests by a just-identifying exclusion restrictions test exacerbates the pre-test problems. Theorem 1 presupposes homoskedasticity of both disturbances and regressors, so if this is not the case one should first weigh all observations such that this is achieved as closely as possible. Developing inference methods which are robust regarding both simultaneity and heteroskedasticity, and which at the same time control size and boost power over the whole model building process, remains a challenge for future efforts.

A. Some basic derivations
We have assumed that $\{(x_i', u_i)';\ i = 1, \dots, n\}$ are independently and identically distributed with zero mean and a given variance matrix. All elements of the latter matrix are assumed to be finite. In addition, we assume that all elements of $(x_i', u_i)'$ have a symmetric distribution, whereas $E(u_i^4) = \kappa_u\sigma_u^4$ and $E(x_{ij}^4) = \kappa_x\sigma_j^4$, with $1 \leq \kappa_u < \infty$ and $1 \leq \kappa_x < \infty$. We denote the typical element of $\Sigma_{xx}$ by $\sigma_{jk}$ ($j, k = 1, \dots, K$), but for its diagonal elements we will sometimes use $\sigma_j^2 = \sigma_{jj}$. The typical element of vector $\sigma_{xu}$ can be denoted $\sigma_u\sigma_j\rho_j$, because $\rho_j = \sigma_{x_ju}/(\sigma_j\sigma_u)$. Next to $\Sigma_{xx}$ and its sample equivalent $S_{xx}$, where for the latter we defined two options at the end of Section 2, we will also use the matrices $\Sigma_x^2 = \mathrm{diag}(\sigma_1^2, \dots, \sigma_K^2)$ and $\Sigma_x = \mathrm{diag}(\sigma_1, \dots, \sigma_K)$, as well as the diagonal matrices $S_x$ and $S_x^2$. The latter has the same main diagonal as $S_{xx}$, and $S_xS_x = S_x^2$. Invoking a standard version of the central limit theorem, we can now obtain the following results, which will be exploited later. We have results (A.1) and (A.2), hence $X'u/n - \sigma_{xu} = O_p(n^{-1/2})$. The latter is found by decomposing $x_i$ into two components. For $j, k = 1, \dots, K$ we have result (A.3). This is proved by decomposing $x_{ik} = a_1x_{ij} + a_2\xi_{ik}$, where $x_{ij}$ and $\xi_{ik}$ are independent and $\xi_{ik}$ has zero mean and unit variance. Because $E(x_{ik}^2) = \sigma_k^2 = a_1^2\sigma_j^2 + a_2^2$ and $E(x_{ik}x_{ij}) = \sigma_{kj} = a_1\sigma_j^2$ we have $a_1 = \sigma_{kj}\sigma_j^{-2}$ and $a_2^2 = \sigma_k^2 - \sigma_{kj}^2\sigma_j^{-2}$. Another result that we will exploit later, which involves the Hadamard (element by element) matrix product (denoted $\odot$), is (A.4), where $R = \mathrm{diag}(\rho_1, \dots, \rho_K)$. Its limiting variance matrix is obtained by following the first option for $S_x^2$.
In what follows we also need the mutual covariances of scalar (A.1) and vectors (A.2) and (A.4).

B. Proof of Theorem 1
To find the limiting distribution of the inconsistency corrected OLS estimator $\hat\beta(\rho_{xu}) = \hat\beta_{OLS} - n\,\hat\sigma_u(\rho_{xu})(X'X)^{-1}S_x\rho_{xu}$ we examine $n^{1/2}(\hat\beta(\rho_{xu}) - \beta)$. First, we have to separate in the right-hand side expression the leading $O_p(1)$ terms from the $o_p(1)$ terms. Matrix $n^{-1}X'X = O_p(1)$ can be decomposed as $n^{-1}X'X = \Sigma_{xx} + (n^{-1}X'X - \Sigma_{xx})$, where the first component is deterministic and finite, denoted as $\Sigma_{xx} = O(1)$, and the second component is $n^{-1}X'X - \Sigma_{xx} = O_p(n^{-1/2})$; see the derivation below (A.3). Exploiting the smaller order of this second component we find $(n^{-1}X'X)^{-1} = \Sigma_{xx}^{-1} - \Sigma_{xx}^{-1}(n^{-1}X'X - \Sigma_{xx})\Sigma_{xx}^{-1} + o_p(n^{-1/2})$. Hence, this inverse has a leading $O(1)$ term, a second term of order $O_p(n^{-1/2})$ plus a remainder of smaller order.
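The order claims about the inverse of $n^{-1}X'X$ can be illustrated numerically. The sketch below uses the standard first-order expansion of a matrix inverse around a population matrix `Sigma`, which is a hypothetical 2x2 example chosen here, and compares the size of the neglected remainder with the squared size of the sampling deviation. It is an illustration of the argument, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.4], [0.4, 1.0]])      # hypothetical population matrix
Sigma_inv = np.linalg.inv(Sigma)

def expansion_check(n):
    """Return (norm of deviation, norm of first-order term, norm of remainder)
    for one sample of size n."""
    X = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
    Sxx = X.T @ X / n
    delta = Sxx - Sigma                          # O_p(n^{-1/2}) deviation
    second_term = -Sigma_inv @ delta @ Sigma_inv # first-order correction
    remainder = np.linalg.inv(Sxx) - (Sigma_inv + second_term)
    return (np.linalg.norm(delta),
            np.linalg.norm(second_term),
            np.linalg.norm(remainder))

dev, second, rem = expansion_check(100_000)
```

For a large sample the remainder is of the order of the squared deviation, so it is negligible relative to the second term, as the expansion requires.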