Confidence Sets Based on Inverting Anderson–Rubin Tests

R. Davidson and J. G. MacKinnon

Economists are often interested in the coefficient of a single endogenous explanatory variable in a linear simultaneous-equations model. One way to obtain a confidence set for this coefficient is to invert the Anderson–Rubin (AR) test. The AR confidence sets that result have correct coverage under classical assumptions. However, AR confidence sets also have many undesirable properties. It is well known that they can be unbounded when the instruments are weak, as is true of any test with correct coverage. However, even when they are bounded, their length may be very misleading, and their coverage conditional on quantities that the investigator can observe (notably, the Sargan statistic for overidentifying restrictions) can be far from correct. A similar property manifests itself, for similar reasons, when a confidence set for a single parameter is based on inverting an F-test for two or more parameters.


INTRODUCTION
Classical confidence intervals are, at least implicitly, defined by inverting a test. A confidence set at level 1 − α, which may or may not be a single bounded interval, is simply the set of parameter values for which a test at level α does not reject the null hypothesis that each value in the set is correct. This seems to imply that inverting an exact test must lead to a confidence set that has good properties. However, as we show in this paper, this is not the case when the test statistic involves more restrictions than the dimension of the confidence set.
Rather than attempting to state and prove a general result, in this paper, we deal with two special cases. The first case is the inversion of an F-test of two or more restrictions to obtain a one-dimensional confidence interval. This is not something that any sensible econometrician would do, of course, but it shows just what the issues are in a very simple context. The second case, which is the main focus of the paper, is confidence sets obtained by inverting the test proposed by Anderson and Rubin (1949). In the linear simultaneous-equations model with weak instruments, the asymptotic distributions of t-statistics often provide poor guides to their finite-sample distributions; see Staiger and Stock (1997). As a consequence, confidence intervals based on inverting t-tests often have very poor coverage. One proposed solution to this problem is to invert a test that has better finite-sample properties. In a model with just one right-hand-side endogenous variable, the Anderson-Rubin (AR) test for the value of the coefficient of that variable is exact under classical assumptions. It has therefore been suggested in several papers, including Dufour (1997), Zivot et al. (1998), and Dufour and Taamouti (2005), that one should invert the AR test to produce what we refer to as an AR confidence set.
In this paper, we argue that, although AR confidence sets have correct unconditional coverage, at least under classical assumptions, they have many undesirable properties. Although some of these properties have previously been studied, notably by Zivot et al. (1998) and Mikusheva (2010), we offer some new theoretical results, together with supporting simulation evidence. AR confidence sets do not have correct coverage conditional on the type of confidence set that actually occurs. Moreover, when they are bounded, their length depends on the value of the Sargan statistic for the validity of the overidentifying restrictions. Therefore, any AR confidence set that is actually observed does not have correct coverage. It can be empty, misleadingly short, misleadingly long, or unbounded.
Having correct coverage unconditionally, while desirable, is by itself not very useful. It is always possible to create a $1-\alpha$ confidence set with correct unconditional coverage by setting it equal to the empty set with probability $\alpha$ and to the real line with probability $1-\alpha$. However, such a confidence set provides no useful information. Unfortunately, when the instruments are weak, the AR confidence set may not be much more informative than this. Even when the instruments are strong, the AR confidence set never has correct conditional coverage. Forchini and Hillier (2003) have argued that the AR statistic is not in fact pivotal, because it does not depend on the parameter of interest everywhere in the parameter space, and that confidence sets based on it are therefore invalid. Our paper is concerned with the more detailed properties of AR confidence sets, but some of the issues that arise here are related to this important point.
It is well known that AR confidence sets may be unbounded. In general, when the instruments in a linear simultaneous-equations model are sufficiently weak, a confidence set with correct coverage must be unbounded with positive probability; see Gleser and Hwang (1987) and Dufour (1997). Thus, the possible unboundedness of AR confidence sets can actually be seen as a positive feature. What is less widely appreciated is that AR confidence sets may be empty or extremely small. They can thus provide a very misleading impression of how much information the sample provides about the parameter of interest.
The problem of confidence sets that are empty or very small can arise whenever we invert a test that has more degrees of freedom than the number of parameters in which we are interested. In the next section, we show that it can occur when we invert an F -test in the classical normal linear regression model. In Section 3, we introduce AR confidence sets and show that there are four types. In Section 4, we explore the important relationship between AR confidence sets and the Sargan statistic for overidentification. In Section 5, we use simulation experiments to study the properties of AR confidence sets. In Section 6, we briefly discuss alternative ways of forming confidence sets in regression models estimated by instrumental variables. We conclude in Section 7.
INVERTING F-TESTS

Consider the classical normal linear model
$$ y = x\beta + X_2\beta_2 + Z\gamma + u, \qquad u \sim N(0, \sigma^2 I), \qquad (2.1) $$
where $y$ and $x$ are $n\times 1$ vectors, $X_2$ is an $n\times k_2$ matrix, and $Z$ is an $n\times k_3$ matrix. We assume that the $k \equiv k_2 + k_3 + 1$ columns of $x$, $X_2$, and $Z$ are linearly independent. Suppose we attempt to construct a confidence set for $\beta$ by inverting the F-test for the joint hypothesis $H(\beta_0)\colon \beta = \beta_0,\ \beta_2 = 0$, assuming of course that the true $\beta_2$ is indeed zero. The null model can be written as $y - x\beta_0 = Z\gamma + u$, and the alternative as
$$ y - x\beta_0 = X\delta + Z\gamma + u, \qquad (2.2) $$
where $X \equiv [x \;\; X_2]$. Clearly, (2.1) and (2.2) are just different parametrizations of the same model. The F-statistic for a test of $H(\beta_0)$ at nominal level $\alpha$ is
$$ F(\beta_0) = \frac{(y - x\beta_0)' P_{M_Z X}\, (y - x\beta_0)/(k_2+1)}{(y - x\beta_0)' M_{[X\,Z]}\, (y - x\beta_0)/(n-k)}. \qquad (2.3) $$
Any value of $\beta_0$ for which $F(\beta_0) \le q$, where $q$ is the $1-\alpha$ quantile of the $F_{k_2+1,\,n-k}$ distribution, belongs to the confidence set formed by inverting the F-test.
Here and throughout the paper, $P_B$ denotes, for any matrix $B$, the orthogonal projection $B(B'B)^{-1}B'$ on to the columns of $B$, and $M_B \equiv I - P_B$ is the complementary orthogonal projection on to the orthogonal complement of the space spanned by the columns of $B$; for details, see Davidson and MacKinnon (2004), Section 2.3. These matrices are symmetric and idempotent.
The inequality $F(\beta_0) \le q$ can be expressed as a quadratic inequality in $\beta_0$,
$$ x' P_{M_Z X} x\; \beta_0^2 \;-\; 2\, x' P_{M_Z X}\, y\; \beta_0 \;+\; y'\bigl(P_{M_Z X} - c M_{[X\,Z]}\bigr) y \;\le\; 0, \qquad (2.4) $$
where $c \equiv (k_2+1)q/(n-k)$. The discriminant of the quadratic is
$$ \Delta \equiv \bigl(x' P_{M_Z X}\, y\bigr)^2 - x' P_{M_Z X}\, x \;\, y'\bigl(P_{M_Z X} - c M_{[X\,Z]}\bigr) y. \qquad (2.5) $$
The probability that $\Delta < 0$ is the probability of obtaining an empty confidence set, because the coefficient of $\beta_0^2$ in (2.4) is always positive. Therefore, if the corresponding quadratic equation has no real roots, the quadratic function is everywhere positive, and the inequality is satisfied nowhere. This probability is, of course, less than $\alpha$.
From (2.5), we see that $\Delta < 0$ if and only if
$$ y'\bigl(P_{M_Z X} - c M_{[X\,Z]}\bigr) y \;>\; \frac{\bigl(x' P_{M_Z X}\, y\bigr)^2}{x' P_{M_Z X}\, x}. \qquad (2.6) $$
The right-hand side of this inequality is the squared norm of the projection of $y$ on to the direction of $P_{M_Z X}\, x$. However, because $P_{M_Z X}\, x = P_{M_Z X} M_Z x = M_Z x$, the right-hand side of (2.6) is simply $y' P_{M_Z x}\, y$.

If we subtract $y' P_{M_Z x}\, y$ from both sides of (2.6), the first term inside the parentheses on the left-hand side becomes $P_{M_Z X} - P_{M_Z x}$. Because this difference is itself an orthogonal projection, on to a space of dimension $k_2$, the inequality (2.6) can then be rewritten as
$$ y'\bigl(P_{M_Z X} - P_{M_Z x}\bigr) y \;>\; c\, y' M_{[X\,Z]}\, y, \qquad (2.7) $$
which can be rearranged as
$$ \frac{y'\bigl(P_{M_Z X} - P_{M_Z x}\bigr) y / k_2}{y' M_{[X\,Z]}\, y/(n-k)} \;>\; \frac{(k_2+1)q}{k_2}. \qquad (2.8) $$
The left-hand side of this inequality is distributed as $F_{k_2,\,n-k}$, and so the probability that $\Delta < 0$ can readily be calculated. The numerical value depends on the nominal coverage $1-\alpha$, the sample size $n$, and the numbers $k$ and $k_2$ of regressors in the model (2.1). Suppose, without loss of generality, that the true value of $\beta$ is zero and the true value of $\sigma$ is one. Then, the confidence set covers zero if and only if it is non-empty, that is, $\Delta > 0$, and the two real roots of the quadratic have opposite signs. The product of the roots is the ratio of the last term on the left-hand side of (2.4) to the coefficient of $\beta_0^2$. Because the latter is always positive, the roots have opposite signs if and only if
$$ y'\bigl(P_{M_Z X} - c M_{[X\,Z]}\bigr) y \;<\; 0, \qquad (2.9) $$
because this inequality implies that $\Delta > 0$; compare (2.6). The inequality (2.9) can be rewritten as
$$ \frac{y' P_{M_Z X}\, y/(k_2+1)}{y' M_{[X\,Z]}\, y/(n-k)} \;<\; q, \qquad (2.10) $$
and the probability that the inequality is satisfied is, of course, just $1-\alpha$, because the left-hand side of (2.10) is distributed as $F_{k_2+1,\,n-k}$. Consider next the statistic for the F-test of the part of $H(\beta_0)$ that has nothing to do with $\beta_0$, namely, that $\beta_2 = 0$. This statistic is
$$ F_2 = \frac{y'\bigl(P_{M_Z X} - P_{M_Z x}\bigr) y / k_2}{y' M_{[X\,Z]}\, y/(n-k)}. \qquad (2.11) $$
From (2.7), the left-hand side of (2.10) can be rewritten as
$$ \frac{y' P_{M_Z x}\, y + k_2 F_2 s^2}{(k_2+1) s^2}, $$
and so, if we write $s^2 = y' M_{[X\,Z]}\, y/(n-k)$, the coverage event can be expressed as
$$ \frac{y' P_{M_Z x}\, y + k_2 F_2 s^2}{(k_2+1) s^2} \;<\; q, $$
or, equivalently,
$$ y' P_{M_Z x}\, y \;<\; s^2\bigl((k_2+1)q - k_2 F_2\bigr). \qquad (2.12) $$
The two sides of this inequality are independent, and the left-hand side is distributed as $\chi^2(1)$. Therefore, conditional on $F_2$ and $s^2$, coverage is given by the cumulative distribution function (CDF) of $\chi^2(1)$ evaluated at the right-hand side of (2.12). It is almost never equal to $1-\alpha$, and the larger the value of $F_2$, the shorter the interval. The inequality (2.12) can never be satisfied if its right-hand side is negative. From (2.8) and (2.11), we see that the event $\Delta < 0$ can be written as $k_2 F_2 > (k_2+1)q$. It follows that, whenever the statistic $F_2$ for $\beta_2 = 0$ is sufficiently large, the confidence interval defined by (2.4) must be the empty set.
When the confidence interval does exist, its length is the distance between the two roots of the quadratic in (2.4), that is, $2\sqrt{\Delta}/\,x' P_{M_Z X}\, x$. It can be seen from (2.5) and (2.11) that
$$ \Delta = x' P_{M_Z X}\, x \;\, s^2\bigl((k_2+1)q - k_2 F_2\bigr), $$
and so the length of the interval, when it exists, is
$$ 2 s \left(\frac{(k_2+1)q - k_2 F_2}{x' P_{M_Z X}\, x}\right)^{1/2}. \qquad (2.13) $$
As noted earlier, the coverage of the confidence interval defined by (2.4) is given by the CDF of $\chi^2(1)$ evaluated at the right-hand side of (2.12). From (2.13), this coverage can also be expressed as the CDF of $\chi^2(1)$ evaluated at one quarter of the squared length of the interval multiplied by $x' P_{M_Z X}\, x$. It is evident from expression (2.13) that, if $\beta_2$ differed substantially from a zero vector, and if $F_2$ were consequently a large number, $(k_2+1)q - k_2 F_2$ would be negative, and there would not exist a bounded interval. This could happen either by chance or because $\beta_2 \ne 0$. It seems very unsatisfactory that the length, and even the existence, of a confidence interval for $\beta$ should depend on the value of $\beta_2$.
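The inversion just described is easy to sketch numerically. The following Python sketch (the function name `invert_F_test` and the convention of passing the critical value `q` directly, rather than computing it with scipy, are our own choices for illustration) builds the quadratic inequality (2.4) from the data and returns the interval between its roots, or `None` when the confidence set is empty.

```python
import numpy as np

def proj(B):
    """Orthogonal projection matrix on to the column span of B."""
    return B @ np.linalg.pinv(B)

def invert_F_test(y, x, X2, Z, q):
    """Confidence set for beta from inverting the F-test of
    H(beta0): beta = beta0, beta2 = 0, via the quadratic in (2.4).
    q is the 1 - alpha quantile of F(k2+1, n-k), supplied by the caller.
    Returns None when the set is empty, else the root pair (lower, upper)."""
    n = len(y)
    X = np.column_stack([x, X2])
    XZ = np.column_stack([X, Z])
    k, k2 = XZ.shape[1], X2.shape[1]
    PMZX = proj((np.eye(n) - proj(Z)) @ X)      # P_{M_Z X}
    MXZ = np.eye(n) - proj(XZ)                  # M_{[X Z]}
    c = (k2 + 1) * q / (n - k)
    a = x @ PMZX @ x                            # coefficient of beta0^2, > 0
    b = x @ PMZX @ y                            # half the linear coefficient
    d = y @ PMZX @ y - c * (y @ MXZ @ y)        # constant term
    disc = b**2 - a * d                         # discriminant, as in (2.5)
    if disc < 0:                                # no real roots: empty set
        return None
    r = np.sqrt(disc)
    return ((b - r) / a, (b + r) / a)

# Small demo with artificial data (beta = 0 and beta2 = 0 in truth).
rng = np.random.default_rng(0)
n = 40
Z = rng.standard_normal((n, 2))
X2 = rng.standard_normal((n, 2))
x = rng.standard_normal(n)
y = Z @ np.array([1.0, -1.0]) + rng.standard_normal(n)
cs = invert_F_test(y, x, X2, Z, q=1e6)          # huge q: a set must exist
```

With a very large critical value the constant term of the quadratic is very negative, so the set is a wide interval containing zero; with a realistic critical value the set can be empty, exactly as the text explains.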
Of course, in the context of the classical linear model (2.1), it makes no sense to base a confidence interval for $\beta$ on the statistic $F(\beta_0)$ defined in (2.3). The usual interval is instead based on the t-statistic for $\beta = \beta_0$. However, as we see in the next section, inverting an AR test is very much like inverting $F(\beta_0)$.

ANDERSON-RUBIN CONFIDENCE SETS
We deal with the simultaneous two-equation linear model
$$ y_1 = \beta y_2 + Z\gamma + u_1, \qquad (3.1) $$
$$ y_2 = W\pi + u_2 = Z\pi_1 + W_2\pi_2 + u_2. \qquad (3.2) $$
Here, $y_1$ and $y_2$ are $n$-vectors of observations on endogenous variables, $Z$ is an $n\times k$ matrix of observations on exogenous variables, and $W$ is an $n\times l$ matrix of exogenous instruments with the property that $S(Z)$, the subspace spanned by the columns of $Z$, lies in $S(W)$, the subspace spanned by the columns of $W$. The $n\times(l-k)$ matrix $W_2$ is constructed in such a way that $S(Z, W_2) = S(W)$, and $W$ is assumed to have full column rank. Equation (3.1) is a structural equation, and (3.2) is a reduced-form equation. The model is overidentified if $l > k+1$, the number of overidentifying restrictions being $l-k-1$.

The disturbance vectors $u_1$ and $u_2$ are assumed to be serially uncorrelated and homoscedastic, with mean zero and contemporaneous covariance matrix
$$ \Sigma \equiv \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}. $$
For the AR test to be exact, we also need the disturbances to be normally distributed.
The AR statistic for a test of the hypothesis that $\beta = \beta_0$ is
$$ AR(\beta_0) = \frac{(y_1 - \beta_0 y_2)' P_1 (y_1 - \beta_0 y_2)/(l-k)}{(y_1 - \beta_0 y_2)' M_W (y_1 - \beta_0 y_2)/(n-l)}, \qquad (3.3) $$
where $P_1 \equiv P_W - P_Z$ is the orthogonal projection on to $S(M_Z W_2)$. This statistic is, of course, minimized at the limited information maximum likelihood (LIML) estimator $\hat\beta_{LIML}$.
Let $q$ now denote the $1-\alpha$ quantile of the $F(l-k,\, n-l)$ distribution. Then, $\beta_0$ belongs to the confidence set at level $1-\alpha$ if and only if $AR(\beta_0) \le q$. This inequality can be reformulated as
$$ \beta_0^2\; y_2' A\, y_2 \;-\; 2\beta_0\; y_1' A\, y_2 \;+\; y_1' A\, y_1 \;\ge\; 0, \qquad (3.4) $$
where $A \equiv c M_W - P_1$ and $c \equiv q(l-k)/(n-l)$. Zivot et al. (1998) have studied this inequality in some detail. For convenience, we summarize their results here. The discriminant of the quadratic equation obtained by replacing the inequality in (3.4) by an equality is
$$ D \equiv 4\bigl(y_1' A\, y_2\bigr)^2 - 4\; y_1' A\, y_1 \;\, y_2' A\, y_2; \qquad (3.5) $$
compare (2.5). If $D < 0$, the equation has no real roots, so that the inequality (3.4) is either always or never satisfied. It is always satisfied if $y_2' A\, y_2 > 0$, and so, in this case, the confidence set is the entire real line. It is never satisfied if $y_2' A\, y_2 < 0$, which implies that the confidence set is empty. If $D > 0$, the equation has two real roots. If $y_2' A\, y_2 < 0$, the confidence set is the interval between them, while, if $y_2' A\, y_2 > 0$, it is the set composed of the disjoint union of the open infinite interval from the upper root to $+\infty$ and that from the lower root to $-\infty$.
Regardless of the sign of $D$, in the knife-edge case for which $y_2' A\, y_2 = 0$, the inequality (3.4) is satisfied with equality when $\beta_0 = y_1' A\, y_1/(2\, y_1' A\, y_2)$, and the confidence set extends unboundedly from this value to the right or left according as $y_1' A\, y_2$ is negative or positive. In the doubly knife-edge case with $y_2' A\, y_2 = y_1' A\, y_2 = 0$, the confidence set is the whole real line if $y_1' A\, y_1 \ge 0$, and is empty if $y_1' A\, y_1 < 0$.
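The case analysis above can be collected into a small classifier. The sketch below is ours, not code from the paper; it takes the three quadratic coefficients $y_2'Ay_2$, $y_1'Ay_2$ and $y_1'Ay_1$, and reports which type of confidence set arises, writing the inequality $AR(\beta_0) \le q$ as the requirement that the quadratic be non-negative.

```python
import math

def ar_set_type(a2, a1, a0):
    """Classify the AR confidence set {b : a2*b^2 - 2*a1*b + a0 >= 0},
    where a2 = y2'Ay2, a1 = y1'Ay2, and a0 = y1'Ay1 in the notation of
    Section 3. Returns a label, plus the two roots when they exist."""
    D = 4.0 * a1 * a1 - 4.0 * a2 * a0             # discriminant (3.5)
    if a2 == 0.0:                                  # knife-edge cases
        if a1 == 0.0:
            return ("whole line",) if a0 >= 0.0 else ("empty",)
        return ("unbounded ray",)                  # one-sided infinite set
    if D < 0.0:                                    # no real roots
        return ("whole line",) if a2 > 0.0 else ("empty",)
    r1 = (2.0 * a1 - math.sqrt(D)) / (2.0 * a2)
    r2 = (2.0 * a1 + math.sqrt(D)) / (2.0 * a2)
    lo, hi = min(r1, r2), max(r1, r2)
    if a2 < 0.0:                                   # interval between the roots
        return ("bounded interval", lo, hi)
    return ("two rays", lo, hi)                    # real line with a hole
```

For example, coefficients $(-1, 0, 1)$ give the bounded interval $[-1, 1]$, while $(1, 0, -1)$ give the two rays extending outward from $-1$ and $1$.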
It follows that the confidence set is unbounded whenever $y_2' A\, y_2 \ge 0$. This condition can be rewritten as $c\, y_2' M_W\, y_2 - y_2' P_1\, y_2 \ge 0$.
By using the definition of $c$ and a little algebra, we can rewrite this inequality as
$$ \frac{y_2' P_1\, y_2/(l-k)}{y_2' M_W\, y_2/(n-l)} \;\le\; q. \qquad (3.6) $$
The quantity on the left-hand side of (3.6) is the ordinary F-statistic for $\pi_2 = 0$ in (3.2), and $q$ is the critical value for a test at level $\alpha$ based on this statistic. Thus, as Zivot et al. (1998) have shown, the AR confidence set is unbounded whenever the statistic for $\pi_2 = 0$ in (3.2) is less than $q$; that is, whenever we cannot reject the hypothesis that the instruments that are not also explanatory variables (namely, the columns of $W_2$) have no explanatory power for $y_2$.
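This equivalence is purely algebraic, so it can be checked numerically for any data set. In the hedged sketch below (the data and the candidate critical values are invented for illustration), the sign of $y_2'Ay_2$ agrees with the comparison between the first-stage F-statistic and $q$ for every $q$ tried.

```python
import numpy as np

def proj(B):
    """Orthogonal projection on to the column span of B."""
    return B @ np.linalg.pinv(B)

# Artificial data: one exogenous column in Z, l - k = 4 extra instruments.
rng = np.random.default_rng(1)
n, l, k = 60, 5, 1
Z = rng.standard_normal((n, k))
W2 = rng.standard_normal((n, l - k))
W = np.column_stack([Z, W2])
y2 = Z @ rng.standard_normal(k) + 0.3 * W2[:, 0] + rng.standard_normal(n)

P1 = proj(W) - proj(Z)                 # P_1 = P_W - P_Z
MW = np.eye(n) - proj(W)
# First-stage F-statistic for pi2 = 0 in the reduced form (3.2).
F = (y2 @ P1 @ y2 / (l - k)) / (y2 @ MW @ y2 / (n - l))
for q in (0.5, 2.5, 10.0):             # q plays the role of the critical value
    c = q * (l - k) / (n - l)
    y2Ay2 = c * (y2 @ MW @ y2) - y2 @ P1 @ y2   # y2'Ay2 with A = c*M_W - P_1
    # unbounded AR set  <=>  first-stage F fails to exceed q
    assert (y2Ay2 >= 0) == (F <= q)
```

The loop succeeds for any draw of the data, because the two conditions are the same inequality rearranged.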
It should be noted that the confidence set cannot be empty if the model is just identified. In that case, the LIML estimator $\hat\beta_{LIML}$ is equal to the instrumental-variables (IV) estimator $\hat\beta_{IV}$, which satisfies the estimating equation $y_2' P_1 (y_1 - \hat\beta_{IV}\, y_2) = 0$. In this case, the image of $P_1$ is one-dimensional. Thus, the vector $y_1 - \hat\beta_{LIML}\, y_2$ is orthogonal to all vectors in that image, and we see from (3.3) that $AR(\hat\beta_{LIML})$ is zero. Consequently, $\hat\beta_{LIML}$ always belongs to the confidence set in the just-identified case, whatever the significance level.
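The fact that the AR statistic vanishes at the IV estimate in the just-identified case is easy to verify numerically. In the sketch below, the function name `ar_stat` and the simulated data are our own; the statistic is computed as an F-ratio built from the $P_1$- and $M_W$-projections of $y_1 - \beta_0 y_2$, following the form of (3.3).

```python
import numpy as np

def proj(B):
    """Orthogonal projection on to the column span of B."""
    return B @ np.linalg.pinv(B)

def ar_stat(beta0, y1, y2, Z, W):
    """AR statistic for beta = beta0, following the form of (3.3)."""
    n, l, k = len(y1), W.shape[1], Z.shape[1]
    P1 = proj(W) - proj(Z)
    MW = np.eye(n) - proj(W)
    r = y1 - beta0 * y2
    return (r @ P1 @ r / (l - k)) / (r @ MW @ r / (n - l))

# Just-identified example: l = k + 1, one extra instrument beyond Z.
rng = np.random.default_rng(2)
n, k = 50, 2
Z = rng.standard_normal((n, k))
W = np.column_stack([Z, rng.standard_normal(n)])
y2 = W @ rng.standard_normal(k + 1) + rng.standard_normal(n)
y1 = 0.7 * y2 + Z[:, 0] + rng.standard_normal(n)

P1 = proj(W) - proj(Z)
beta_iv = (y2 @ P1 @ y1) / (y2 @ P1 @ y2)
val_at_iv = ar_stat(beta_iv, y1, y2, Z, W)   # exactly zero in theory
```

Because the image of $P_1$ is one-dimensional here, the IV residuals are orthogonal to it, so `val_at_iv` is zero up to rounding, and the IV/LIML estimate always lies inside the confidence set.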
There is no point calculating an AR confidence set whenever the inequality (3.6) holds, because a set that consists of the entire real line, perhaps with a hole in the middle, tells us nothing useful about the value of β. In contrast to the confidence set, the identifiability test statistic does provide valuable information, because it provides a natural measure of the strength of the instruments; see Stock and Yogo (2005).
The fact that some types of AR and other confidence sets are unbounded when the instruments are sufficiently weak can be viewed as a consequence of a fundamental result in Dufour (1997), namely, that no valid confidence set which is almost surely bounded exists in the neighbourhood of a point where the parameter is not identified.
Unconditionally, the AR confidence set always has the correct coverage. However, once we observe what type of set it is, that is no longer the case. By construction, the empty set undercovers, and the real line overcovers. The bounded interval and the disjoint interval can either overcover or undercover. This is, of course, the case for any confidence set that may be empty or the whole real line.

RELATIONS WITH THE SARGAN TEST
The Sargan statistic for overidentifying restrictions (Sargan, 1958) should be computed whenever $l-k-1 > 0$, that is, unless the system is just identified. It is most commonly computed as $1/\hat\sigma_1^2$ times the minimized value of the IV criterion function,
$$ S = \frac{(y_1 - \hat\beta_{IV}\, y_2 - Z\hat\gamma)' P_W (y_1 - \hat\beta_{IV}\, y_2 - Z\hat\gamma)}{\hat\sigma_1^2} = \frac{\hat u_1' P_1\, \hat u_1}{\hat\sigma_1^2}, \qquad (4.1) $$
where $\hat\beta_{IV}$ is the IV (or two-stage least-squares) estimate of $\beta$, and the estimated variance $\hat\sigma_1^2$ denotes $n^{-1}\hat u_1' M_Z\, \hat u_1$, with $\hat u_1 \equiv y_1 - \hat\beta_{IV}\, y_2$. The equality in (4.1) follows from the fact that $P_W = P_1 + P_Z$, because $Z$ is orthogonal to the IV residuals. It is evident that the numerator of the expression on the right-hand side of (4.1) would be identical to the numerator of the AR statistic (3.3) if $\hat\beta_{IV}$ were replaced by $\beta_0$. Because $\hat\beta_{IV}$ minimizes the numerator, the latter will always be no smaller than the former. That is why the AR statistic has $l-k$ degrees of freedom in the numerator, while the Sargan statistic (which, of course, is not exact) has $l-k-1$. It is not hard to show that the numerator of (3.3) can be rewritten as
$$ (y_1 - \hat\beta_{IV}\, y_2)' P_1 (y_1 - \hat\beta_{IV}\, y_2) + (\hat\beta_{IV}\, y_2 - \beta_0 y_2)' P_1 (\hat\beta_{IV}\, y_2 - \beta_0 y_2). \qquad (4.2) $$
The first term in (4.2) is the numerator of the Sargan statistic (4.1). Thus, if the Sargan and AR statistics had the same denominator, the latter would always be no smaller than the former. This is not always true in finite samples, because the denominators are not the same, although they both estimate $\sigma_1^2$ consistently under the null. However, there is inevitably a very strong tendency for large values of the Sargan statistic to be associated with large values of the AR statistic.
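The decomposition (4.2) is an exact identity, because the IV residuals are orthogonal to $P_1 y_2$, so the cross term vanishes. A numerical check, with invented data, is:

```python
import numpy as np

def proj(B):
    """Orthogonal projection on to the column span of B."""
    return B @ np.linalg.pinv(B)

# Invented overidentified model: k = 2 exogenous columns, l - k = 4 extras.
rng = np.random.default_rng(3)
n, l, k = 80, 6, 2
Z = rng.standard_normal((n, k))
W = np.column_stack([Z, rng.standard_normal((n, l - k))])
y2 = W @ rng.standard_normal(l) + rng.standard_normal(n)
y1 = 0.5 * y2 + Z @ np.array([1.0, -1.0]) + rng.standard_normal(n)

P1 = proj(W) - proj(Z)                      # annihilates Z
beta_iv = (y2 @ P1 @ y1) / (y2 @ P1 @ y2)   # IV/2SLS estimate of beta
u1_hat = y1 - beta_iv * y2
sargan_num = u1_hat @ P1 @ u1_hat           # numerator of the Sargan statistic
for beta0 in (-1.0, 0.0, 0.5, 2.0):
    r = y1 - beta0 * y2
    ar_num = r @ P1 @ r                     # numerator of the AR statistic
    extra = (beta_iv - beta0) ** 2 * (y2 @ P1 @ y2)
    # the identity (4.2): AR numerator = Sargan numerator + shift term
    assert abs(ar_num - (sargan_num + extra)) < 1e-6 * max(1.0, ar_num)
```

Since the shift term is non-negative, the AR numerator can never fall below the Sargan numerator, exactly as the text argues.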
In order to analyse the statistical properties of the AR confidence set and its relationship to the Sargan statistic, we need to specify a data-generating process (DGP). Following Davidson and MacKinnon (2008), we use the DGP
$$ y_1 = \beta y_2 + u_1, \qquad u_1 = r v_1 + \rho v_2, \qquad (4.3) $$
$$ y_2 = a w + u_2, \qquad u_2 = v_2, \qquad (4.4) $$
where $w \in S(W)$ is an $n$-vector with $\|w\|^2 = 1$, and
$$ v_1, v_2 \sim N(0, I) \text{ independently}, \qquad r \equiv (1-\rho^2)^{1/2}. \qquad (4.5) $$
This DGP is a special case of the more general model specified by (3.1) and (3.2). However, by varying the three parameters $\beta$, $\rho$, and $a$, we can generate AR or Sargan statistics with the same distributions as those generated by any DGP contained in the more general model; see Section 3 of Davidson and MacKinnon (2008). The vector $w$ lies in the direction of $W\pi$ in (3.2); it provides all the explanatory power of all the instruments in the matrix $W$. Because it is only $S(W)$ that matters, we are perfectly free to perform a linear transformation on $W$ that makes this the case. Similar reasoning shows that we can ignore any exogenous covariates in the matrix $Z$ by the expedient of replacing all other variables by the respective residuals from regressing them linearly on $Z$ (i.e., by premultiplying each variable by $M_Z$).
By normalizing the vector $w$ to have squared length unity, we are implicitly using weak-instrument asymptotics; see Staiger and Stock (1997). The strength of the instruments is measured by the parameter $a$. The square of this parameter is called the scalar concentration parameter; see Phillips (1983, p. 470) and Stock et al. (2002).
Finally, the fact that we can, without loss of generality, set the variances σ 2 1 and σ 2 2 in (4.4) to unity is a consequence of the fact that all the statistics we consider are homogeneous of degree zero with respect to y 1 and y 2 separately.
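A draw from this DGP can be sketched as follows; the helper name `dgp_draw` and the use of a QR factorization to construct an orthonormal instrument matrix (whose first column serves as the unit-length vector $w$) are our own implementation choices.

```python
import numpy as np

def dgp_draw(beta, rho, a, n, l, rng):
    """One draw from the DGP (4.3)-(4.5) as reconstructed here:
    y2 = a*w + v2 and y1 = beta*y2 + r*v1 + rho*v2, with
    r = sqrt(1 - rho^2), ||w|| = 1, and v1, v2 independent N(0, I)."""
    # Orthonormal n x l instrument matrix; its first column is w.
    W = np.linalg.qr(rng.standard_normal((n, l)))[0]
    w = W[:, 0]
    v1 = rng.standard_normal(n)
    v2 = rng.standard_normal(n)
    u1 = np.sqrt(1.0 - rho ** 2) * v1 + rho * v2
    y2 = a * w + v2
    y1 = beta * y2 + u1
    return y1, y2, W
```

With this construction, the structural and reduced-form disturbances both have unit variance and correlation $\rho$, and the concentration parameter is $a^2$.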
As we have seen, the AR confidence set is a bounded interval if and only if $D > 0$ and $y_2' A\, y_2 < 0$. In this case, the length of the interval is the distance between the two roots of the quadratic in (3.4), which is $-\sqrt{D}/\,y_2' A\, y_2$. Under the DGP given by (4.3)–(4.5), the limit of this ratio as $a \to \infty$ is zero. The quantity that has a non-trivial limit as $a \to \infty$ is thus the length of the interval times $a$. It can be shown that this limit is twice the square root of
$$ c\, u_1' M_W\, u_1 \;-\; u_1'\bigl(P_1 - P_w\bigr) u_1. \qquad (4.6) $$
The first term in (4.6) is $c$ times a random variable that follows the $\chi^2(n-l)$ distribution. The second term is an independent random variable that follows the $\chi^2(l-k-1)$ distribution. Of course, both these quantities would have to be divided by $\sigma_1^2$ if we did not set it to unity. The distribution of the second term, and its independence from the first term, both follow from the fact that the matrix $P_1 - P_w = P_W - P_Z - P_w$ projects on to the $l-k-1$ components of $W$ that do not lie in $S(w, Z)$.
Expression (4.6) is random and can be either positive or negative. It is most likely to be negative when α is large, so that q, the 1 − α quantile of F (l − k, n − l), is small. There is evidently a non-empty confidence set only when it is positive. Because we are considering the limit as a → ∞, there is no danger that the set will be unbounded.
Thus, from (4.7), we have
$$ \check\sigma_1^2 S = u_1'\bigl(P_1 - P_{P_1 y_2}\bigr) u_1 = u_1' P_1\, u_1 - \frac{(u_1' P_1\, y_2)^2}{y_2' P_1\, y_2}. \qquad (4.10) $$
If we replace $y_2$ by $a w + u_2$ and retain only the leading-order terms as $a \to \infty$, the term that is subtracted in the rightmost expression here tends to $(w' u_1)^2 = u_1' P_w\, u_1$, where the equality follows from the fact that $w' w = 1$. Thus, in the limit,
$$ \check\sigma_1^2 S = u_1'\bigl(P_1 - P_w\bigr) u_1. \qquad (4.11) $$
It is easy to see that $\hat\beta_{IV}$ tends to $\beta$ as $a \to \infty$. From (4.9),
$$ \hat\beta_{IV} = \beta + \frac{y_2' P_1\, u_1}{y_2' P_1\, y_2}. $$
Because the second term in the rightmost expression here is $O(a)/O(a^2) = O(a^{-1})$, that expression vanishes as $a \to \infty$. The consistency of $\hat\beta_{IV}$ implies that
$$ \check\sigma_1^2 \to \frac{u_1' M_W\, u_1}{n-l} $$
as $a \to \infty$. Thus, the first term in (4.6) can be replaced by $q(l-k)\check\sigma_1^2$, because $c(n-l) = q(l-k)$.

Similarly, by (4.11), the second term can be replaced by $\check\sigma_1^2 S$. We conclude that, in the limit as $a \to \infty$, the length of the bounded AR interval, if it exists, is simply
$$ \frac{2\check\sigma_1}{a}\bigl(q(l-k) - S\bigr)^{1/2}. \qquad (4.12) $$
This is a deterministic function of $\check\sigma_1$ and $S$, which is proportional to the former and non-linear in the latter. As $S$ increases, the interval becomes shorter and eventually ceases to exist. Although the result (4.12) is strictly true only in the limit, it can be expected to provide a good guide whenever $a$ is reasonably large (i.e., whenever the instruments are reasonably strong). It implies that, when the AR confidence set is a bounded interval, its coverage will vary inversely with the magnitude of the Sargan statistic. This may be especially problematic in practice if, as will very often be the case, the overidentifying restrictions are not quite satisfied. In consequence, observed Sargan statistics may well tend to be larger than they should be by chance, and bounded AR intervals consequently shorter.
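As a hedged sketch of the limiting result, the reconstructed form of (4.12) can be coded directly. The function below (its name and signature are ours) simply evaluates $2\check\sigma_1 a^{-1}(q(l-k)-S)^{1/2}$ and reports that no bounded interval exists when $q(l-k) - S$ is negative.

```python
import math

def limiting_ar_length(sigma1, S, q, df, a):
    """Limiting (a -> infinity) length of the bounded AR interval, following
    the reconstructed form of (4.12); df stands for l - k. Returns None
    when q*df - S < 0, i.e. when no bounded interval exists in the limit."""
    val = q * df - S
    if val < 0.0:
        return None
    return 2.0 * sigma1 / a * math.sqrt(val)
```

A quick tabulation confirms the qualitative claims in the text: the length shrinks as the Sargan statistic $S$ grows, and the interval disappears once $S$ exceeds $q(l-k)$.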
The fundamental reason for the result that the AR confidence set depends on the value of the Sargan statistic is that the AR statistic has more than one degree of freedom. The Sargan statistic plays exactly the same role in (4.12) as $k_2$ times the statistic $F_2$ did in (2.13). In obtaining (2.13), there was no need to consider a limiting argument. The only reason we needed a limiting argument to obtain (4.12) is that the Sargan statistic does not have a unique distribution independent of the model parameters when $a$ is finite. Figure 1 illustrates all four types of confidence set by graphing $AR(\beta_0)$ against $\beta_0$. The dashed horizontal line is the critical value, $q$. Two variants of the bounded interval case are shown: in one of these, the interval is very short, and in the other it is quite long. What type of set we obtain depends on $\alpha$. In particular, the probability that the set is empty diminishes as $\alpha$ becomes smaller and $q$ consequently becomes larger. All five confidence sets in the figure are for samples drawn from the same DGP, for which the instruments are moderately weak. The figure illustrates the fact that sampling variation can produce radically different AR confidence sets.
The confidence sets considered in Zivot et al. (1998), which were constructed by inverting the LR test and variants of the LM test, share a number of properties with AR confidence sets. They, too, can be unbounded, and so they have incorrect coverage conditional on being bounded or not. What differentiates AR confidence sets from these others is that, because the AR statistic depends on the Sargan statistic, an AR confidence set can be empty, and, as the figure illustrates, a bounded interval can be very much too short. Thus, we cannot interpret an observed AR confidence set, even a bounded interval, in the way we would like to interpret a confidence interval. On average, at least when the model is well identified, bounded intervals must overcover, in order to offset the failure of empty sets to cover at all. However, there will always be bounded intervals like the one shown in the top panel of Figure 1, which give the misleading impression that we have estimated β much more accurately than is actually the case.
The AR set in the just-identified case is not subject to this criticism. This is a natural consequence of the fact that, in this case, the Sargan statistic is zero. In fact, it is easy to see that, for a just-identified model, the AR test statistic is equal to the version of the LM statistic proposed by Kleibergen (2002) and Moreira (2009).

PROPERTIES OF ANDERSON-RUBIN CONFIDENCE SETS
In this section, we use simulation experiments to study various properties of AR confidence sets, including their conditional coverage. We generate artificial data from the DGP specified by (4.3)–(4.5). Because this DGP uses weak-instrument asymptotics, the sample size does not matter much once it exceeds a threshold. In Davidson and MacKinnon (2010), we found that the performance of various test statistics for $\beta$ changed very little once $n$ exceeded 400. We therefore set $n = 400$ in all our experiments. For each DGP, we generated 500,000 simulated data sets.
The key parameters in our experiments are $a$, $\rho$, and $l-k$. To save space, we report results only for $l-k = 7$, which means that the model is moderately overidentified. The results for substantially smaller or larger values of $l-k$ might look quite different, but that would primarily be because $a$ needs to increase with $l-k$ in order to keep the strength of the instruments constant. The basic structure of the results does not seem to change much with $l-k$. Figure 2 shows how the frequencies of the four types of 95% AR confidence set depend on $a$ and $\rho$. The figure has four panels, which correspond to four different values of $a$. The value of $\rho$, which varies from 0.00 to 0.99 by increments of 0.01, is on the horizontal axis. Negative values are not included, because the figures would simply be symmetric around $\rho = 0$.
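A small Monte Carlo in the spirit of these experiments can be sketched in a few lines. This version is far smaller than the paper's (500 replications, $n = 50$, no exogenous covariates, and an assumed critical value $q = 2.1$ standing in for the 5% quantile of the relevant F distribution); it simply counts how often each type of AR set occurs under weak instruments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, a, rho, q = 50, 8, 1.0, 0.5, 2.1   # q: assumed 5% critical value
c = q * l / (n - l)                       # no covariates, so l - k = l here
counts = {"empty": 0, "bounded": 0, "rays": 0, "line": 0}
for _ in range(500):
    W = np.linalg.qr(rng.standard_normal((n, l)))[0]  # orthonormal instruments
    w = W[:, 0]
    v1, v2 = rng.standard_normal(n), rng.standard_normal(n)
    y2 = a * w + v2
    y1 = np.sqrt(1.0 - rho**2) * v1 + rho * v2        # beta = 0
    P = lambda v: W @ (W.T @ v)                       # P_W v (W orthonormal)
    def quad(u, v):                                   # u'Av, A = c*M_W - P_W
        return c * (u @ v - u @ P(v)) - u @ P(v)
    a2, a1, a0 = quad(y2, y2), quad(y1, y2), quad(y1, y1)
    D = a1 * a1 - a2 * a0                             # sign of the discriminant
    if D < 0.0:
        counts["line" if a2 > 0.0 else "empty"] += 1
    else:
        counts["bounded" if a2 < 0.0 else "rays"] += 1
```

With $a = 1$ the instruments are very weak, so the unbounded outcomes ("line" and "rays") occur frequently, in line with what Figure 2 reports for small $a$.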
When a is small, the bounded interval may either overcover slightly (when ρ is small) or undercover severely (when ρ is large and a = 1). When a is not small, the bounded interval always overcovers, as it must do in order to offset the undercoverage associated with the empty set.
The two-segment confidence set undercovers when ρ is small. However, as ρ increases, its coverage increases, and it eventually overcovers. This type of confidence set does not occur when a = 8.
The coverage of the bounded interval changes dramatically when we condition on the Sargan statistic. When the latter rejects at the nominal 0.10 level, the bounded interval always undercovers, often severely. In contrast, when it fails to reject at the 0.50 level, the bounded interval always overcovers except for larger values of ρ when a = 1. This overcoverage is generally quite extreme. For example, when a = 8, the 95% bounded AR interval always covers at least 99.8% of the time when the Sargan statistic fails to reject at the 0.50 level.
These results suggest that the length of a bounded AR confidence interval will generally provide a poor guide to the precision with which the parameter $\beta$ is estimated. To investigate this conjecture, we have calculated the dispersion of $\hat\beta_{LIML}$ as the difference between its 0.025 and 0.975 quantiles over the 500,000 replications. In Figure 4, we compare this with the median and with the 0.01 and 0.99 quantiles of the lengths of the 95% AR confidence sets when they are bounded intervals. Ideally, the median length of the bounded AR intervals should be very similar to the dispersion of the estimates, and the upper-tail and lower-tail quantiles of interval length should not be too much higher or lower than the median.
The three left-hand panels of Figure 4 show results for 95% AR confidence sets when they are bounded intervals, and the three right-hand panels show results for conventional Wald intervals based on $\hat\beta_{LIML}$. It is appropriate to compare AR intervals with those based on $\hat\beta_{LIML}$, because the AR statistic is minimized at $\hat\beta_{LIML}$. Results are presented for $a = 4$, $a = 8$, and $a = 16$. We do not present results for smaller values of $a$, because most of the AR confidence sets were unbounded (see Figure 2) and because it is unreasonable to expect any method to produce reliable results in these cases. Note that the vertical axis is logarithmic.
It is evident that the median length of the bounded 95% AR interval is generally a poor guide to the dispersion of $\hat\beta_{LIML}$. The former always overestimates the latter, and the problem does not go away as $a$ becomes larger. Moreover, the length of the bounded AR intervals evidently varies greatly. When $a = 4$, the upper-tail quantile of the distribution of their lengths can be more than 80 times the dispersion of $\hat\beta_{LIML}$, while the lower-tail quantile can be no more than 1/4 of the dispersion. Of course, as the theory of Section 4 makes clear, there are a few bounded intervals that are just barely longer than zero, but these are evidently well to the left of the 0.01 quantile. For large $a$, this occurs whenever $q(l-k) - S$ in (4.12) is just barely positive.
Whereas the median length of the AR intervals always overstates the dispersion of $\hat\beta_{LIML}$, that of the Wald LIML intervals always understates it (although just by a small amount when $a = 16$). The lengths of the Wald intervals vary much less than those of the AR intervals.
The conventional Wald intervals improve more rapidly as a increases than do the AR intervals. When a = 8, and even more so when a = 16, the former have much better properties than the latter. The median length of the Wald intervals is just slightly smaller than the dispersion of the LIML estimates, while the median length of the AR intervals is much greater. Moreover, the distribution of the lengths of the Wald intervals is much tighter than that of the AR intervals. For a = 16, even the 0.99 quantile of the former is always smaller than the median length of the latter.

WHICH CONFIDENCE SETS SHOULD WE USE?
The goal of this paper is simply to study the properties of AR confidence sets, not to settle the more difficult problem of which confidence set(s) to use when making inferences about β in (3.1) when the instruments are not strong. In Davidson and MacKinnon (2014), a companion paper, we tackle the latter problem.
The results of Mikusheva (2010) suggest that inverting the CLR test can work very well (and that inverting the LM test generally works poorly) and Mikusheva has discussed how to invert the CLR test without using simulation. Davidson and MacKinnon (2014) propose an explicit algorithm for inverting asymptotic CLR tests and find that, in large samples, CLR confidence sets perform extremely well (unconditionally), even when the instruments are very weak.
The result of Dufour (1997), that any confidence set with correct coverage must be unbounded with positive probability in the neighbourhood of a point at which the parameter of interest is not identified, has consequences for any such possibly unbounded confidence set. The coverage probabilities conditional on the set being bounded, or on being unbounded, are not, in general, equal to the unconditional coverage probability. This fact is not an argument in favour of trying to use conditional coverage probabilities; rather, it underscores the undesirability of making inferences on nearly unidentified parameters. Table 1 compares the behaviour of AR confidence sets and those obtained by inverting the CLR test for our reference case with n = 400 and l − k = 7, and for various values of a and ρ. It shows the percentages of the time when 95% confidence sets are empty, the whole real line, and the real line with a hole or bounded. It also shows the coverage probabilities conditional on the confidence set being neither empty nor the whole real line.
Only the AR confidence sets can be empty, but both can be unbounded with weak instruments. Both types of set exhibit coverage conditional on being bounded that is very different from the unconditional coverage when there is a non-negligible probability of an unbounded set. Like the AR sets, the CLR sets have very satisfactory unconditional coverage, but their conditional coverage is often quite unsatisfactory. The CLR sets always have a higher probability of being bounded than do the AR sets, and of course they cannot be empty. When a = 8, the CLR sets are all bounded, at least in our simulations, so that conditional and unconditional coverages are the same, and very close to the nominal level. However, there are still empty AR sets, and so coverage for bounded sets exceeds the nominal level.
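The four possible shapes of the AR set (empty, bounded interval, real line with a hole, whole real line) arise because inverting the AR test amounts to solving a quadratic inequality in β. The sketch below illustrates this in Python for the simplest version of the model, with one endogenous regressor, no included exogenous variables, and homoscedastic errors; the function names and this stripped-down setup are our own illustration, not code from the paper.

```python
import numpy as np
from scipy.stats import f

def solve_quadratic_set(A, B, C, tol=1e-12):
    """Classify the set {beta : A*beta**2 + B*beta + C <= 0}."""
    disc = B * B - 4.0 * A * C
    if abs(A) < tol:
        # Degenerate case: the inequality is effectively linear in beta.
        return ("half-line or degenerate", None)
    if disc <= 0:
        # No real roots: the quadratic has the sign of A everywhere.
        return ("empty", None) if A > 0 else ("whole real line", None)
    r1 = (-B - np.sqrt(disc)) / (2.0 * A)
    r2 = (-B + np.sqrt(disc)) / (2.0 * A)
    lo, hi = min(r1, r2), max(r1, r2)
    if A > 0:
        return ("bounded", (lo, hi))          # the interval [lo, hi]
    return ("real line with hole", (lo, hi))  # everything outside (lo, hi)

def ar_confidence_set(y, Y, W, alpha=0.05):
    """AR confidence set for beta in y = beta*Y + u, instruments W (n x l).

    AR(b) = [u(b)' P_W u(b) / l] / [u(b)' M_W u(b) / (n - l)],
    with u(b) = y - b*Y.  The event AR(b) <= F critical value is a
    quadratic inequality in b, so the set takes one of four shapes.
    """
    n, l = W.shape
    PW = W @ np.linalg.solve(W.T @ W, W.T)   # projection onto span(W)
    c = f.ppf(1.0 - alpha, l, n - l) * l / (n - l)
    Q = PW - c * (np.eye(n) - PW)            # AR(b) <= crit  <=>  u(b)' Q u(b) <= 0
    return solve_quadratic_set(Y @ Q @ Y, -2.0 * (y @ Q @ Y), y @ Q @ y)
```

The sign of the leading coefficient Y'QY determines whether the set is bounded (positive) or unbounded (negative), and the sign of the discriminant determines whether any real roots exist, which is why all four shapes can occur in Table 1.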
In our view, there exist no circumstances in which one should use an AR confidence set. The dependence on the value of the Sargan statistic means that, even when the set is a bounded interval, its length provides very unreliable information about the precision with which the parameter of interest has been estimated. Moreover, as the instruments become stronger, the AR interval continues to perform poorly; see Figure 4. Thus, the real defect of AR confidence sets is that, even when the instruments are strong, so that unbounded sets do not occur, the length of the sets can lead one to believe that the parameter of interest has been estimated either more or less precisely than is in fact the case.
CLR confidence sets certainly have merit and may well be worth using in practice, at least when the sample size is not too small. However, in addition to sometimes having seriously unsatisfactory conditional coverage, they have two practical disadvantages. First, they cannot readily be extended to handle two or more right-hand-side endogenous variables; see Mikusheva (2010). Second, because commonly used CLR tests are based on the LR statistic, they cannot easily deal with heteroscedasticity of unknown form.

Note to Table 1: Columns 3–6 give the percentage for each type of confidence set: R = whole real line; ∅ = empty set; hole = real line with a hole; bounded = finite bounded interval. Columns 7–9 give the unconditional coverage, followed by coverage conditional on a set that is the real line with a hole, and finally coverage conditional on a bounded interval. For each pair of values for a and ρ, the results for the AR set are in the first line, and the results for the CLR set are in the second line.
Normally, higher power of an underlying test is reflected in shorter confidence intervals. Because the length of a bounded AR confidence set depends on the value of the Sargan statistic, test power must also depend on it. Without conditioning on the Sargan statistic, the relationship between the power of the AR test and that of other tests is complicated; see Davidson and MacKinnon (2008). It is shown there that the CLR test is, except for a few configurations of a and ρ, at least as powerful as the AR test. This is not the case for the LM test of Kleibergen (2002) and Moreira (2009), which is often much less powerful.
In contrast to CLR confidence sets, confidence intervals based on Wald tests can readily handle any number of endogenous variables and can easily be modified to allow for heteroscedasticity of unknown form. In our view, this type of interval has far more merit than it is generally given credit for. In Davidson and MacKinnon (2008), we proposed a procedure for bootstrapping t-tests on β, called the restricted efficient (RE) bootstrap. In Davidson and MacKinnon (2010), we proposed a wild bootstrap variant of this procedure, called the wild restricted efficient (WRE) bootstrap, which allows for heteroscedasticity of unknown form. Both procedures seem to work extraordinarily well, very much better than the pairs bootstrap or semiparametric bootstraps that do not impose restrictions, except when the instruments are extremely weak.
It is conceptually straightforward to invert t-tests based on either β̂_LIML or β̂_IV that have been bootstrapped using either the RE or WRE bootstraps. The idea is simply to locate the ends of the interval at the points where the bootstrap P value is equal to α. This procedure can be computationally intensive, however. The problem is that, because the bootstrap DGP must impose a restriction on β, it is necessary to generate a different set of bootstrap samples for every candidate value of the upper and lower limits of the confidence interval. If the interval has a hole, which is possible, it is also necessary to generate a set of bootstrap samples for every candidate value of the limits of the hole. Thus, forming one confidence set can involve generating a great many bootstrap samples.
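The inversion just described can be sketched as a grid search: each candidate β0 is retained if its bootstrap P value is at least α, and each candidate requires its own set of bootstrap samples because the bootstrap DGP imposes β0. The Python sketch below uses a plain restricted residual bootstrap for the IV t-statistic, not the full RE or WRE procedures; the function names and the simplified model with no exogenous regressors are our own illustration.

```python
import numpy as np

def iv_tstat(y, Y, W, beta0):
    """2SLS t-statistic for H0: beta = beta0 in y = beta*Y + u."""
    PW = W @ np.linalg.solve(W.T @ W, W.T)
    Yh = PW @ Y                               # fitted values from the first stage
    beta_hat = (Yh @ y) / (Yh @ Y)
    u = y - beta_hat * Y
    s2 = (u @ u) / (len(y) - 1)
    se = np.sqrt(s2 / (Yh @ Y))
    return (beta_hat - beta0) / se

def bootstrap_pvalue(y, Y, W, beta0, B=199, rng=None):
    """Restricted residual bootstrap P value for H0: beta = beta0 (schematic)."""
    rng = np.random.default_rng(rng)
    n = len(y)
    tau = abs(iv_tstat(y, Y, W, beta0))
    # Bootstrap DGP imposes beta0: restricted structural residuals, plus
    # reduced-form residuals, resampled jointly to keep their dependence.
    u_r = y - beta0 * Y
    u_r = u_r - u_r.mean()
    pi_hat = np.linalg.lstsq(W, Y, rcond=None)[0]
    v = Y - W @ pi_hat
    v = v - v.mean()
    exceed = 0
    for _ in range(B):
        idx = rng.integers(0, n, n)
        Y_b = W @ pi_hat + v[idx]
        y_b = beta0 * Y_b + u_r[idx]
        if abs(iv_tstat(y_b, Y_b, W, beta0)) >= tau:
            exceed += 1
    return (exceed + 1) / (B + 1)

def invert_bootstrap_test(y, Y, W, grid, alpha=0.05, B=199, seed=0):
    """Confidence set = grid points not rejected by the bootstrap test.
    A fresh set of B bootstrap samples is drawn for every candidate beta0,
    which is what makes the inversion computationally intensive."""
    return [b0 for b0 in grid
            if bootstrap_pvalue(y, Y, W, b0, B=B, rng=seed) >= alpha]
```

Using the same seed for every candidate (common random numbers) makes the P value vary smoothly along the grid, which helps when locating the endpoints by bisection instead of a full grid search.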
In Davidson and MacKinnon (2014), we present some simulation results for confidence intervals based on t-statistics and the RE bootstrap, and we find that they generally work quite well, especially those based on β̂_LIML. In large samples, they perform almost as well as asymptotic CLR intervals, provided the instruments are not very weak, and in small samples they perform much better.

CONCLUSION
It seems natural to make inferences about a parameter or parameters by inverting an exact test, such as the AR test, because the resulting confidence set necessarily has correct coverage unconditionally. However, we argue that this is a very bad idea whenever the test has more degrees of freedom than there are parameters of interest. By considering two special cases, namely, inverting an F -test and inverting an AR test for a scalar parameter, we show that the resulting confidence set provides very little useful information about the parameter of interest. It may be empty, extremely short, or excessively long. In the case of the AR confidence set, it may also be unbounded, although that is a problem shared by all confidence sets with good unconditional coverage when the instruments are weak.
The basic problem was explained in Section 2 in the context of inverting an F -test to obtain a confidence set for a single parameter in a linear regression model. The problem arises because the confidence set depends not only on what the data tell us about that parameter but also on what they tell us about a number of additional restrictions. When those restrictions are moderately incompatible with the data, the confidence set will be a misleadingly short interval. When they are very incompatible with the data, it will not exist at all.
As we saw in Section 4, exactly the same problem arises in the context of the AR test. The additional restrictions are now the overidentifying restrictions that may be tested using a Sargan test. In this case, because the Sargan test is not exact, our results are necessarily asymptotic. When the Sargan statistic is moderately large, the AR confidence set is a misleadingly short interval. When it is very large, the AR confidence set does not exist. The simulation results in Section 5 provide strong support for our theoretical results, and show that AR confidence sets can be very misleading even when the instruments are strong.
We emphasize that, although our analysis has treated only two special cases in detail, the conclusion about the misleading nature of confidence sets based on inverting tests with extra degrees of freedom, over and beyond those needed for inference on the number of parameters