Efficient closed-form estimation of large spatial autoregressions

Newton-step approximations to pseudo maximum likelihood estimates of spatial autoregressive models with a large number of parameters are examined, where the parameter space is allowed to grow slowly as a function of sample size. These approximations have the same asymptotic efficiency properties as maximum likelihood under Gaussianity but are of closed form. Hence they are computationally simple and free from compactness assumptions, thereby avoiding two notorious pitfalls of implicitly defined estimates of large spatial autoregressions. When the initial estimate is least squares, the Newton step can also lead to weaker regularity conditions for a central limit theorem than those extant in the literature. A simulation study demonstrates excellent finite-sample gains from Newton iterations, especially in large multiparameter models for which grid search is costly. A small empirical illustration shows improvements in estimation precision with real data.


1 Introduction
Spatial autoregressive (SAR) models, introduced by Cliff and Ord (1973), are popular tools for modelling cross-sectionally dependent economic data. The pre-eminent feature of such models is the presence of one or more 'spatial weight' matrices, which parsimoniously capture the dependence between units in the sample. Such dependence need not be geographic in nature; indeed, the spatial weight matrix is known by other terms such as 'adjacency matrix', 'network link matrix' and 'sociomatrix'. For $n \times 1$ vectors $y_n$ and $u$ of responses and unobserved disturbances, respectively, and an $n \times k$ covariate matrix $X_n$, the SAR model is

$$y_n = \sum_{i=1}^{p} \lambda_{0in} W_{in} y_n + X_n \beta_{0n} + u, \qquad (1.1)$$

where the elements of the $n \times n$ spatial weight matrices $W_{in}$ are inverse economic distances and $\lambda_{0n} = (\lambda_{01n}, \ldots, \lambda_{0pn})'$ and $\beta_{0n}$ are unknown parameter vectors. Subscripting with $n$ permits treatment of triangular arrays, an important issue for spatial models in general (see Robinson (2011)), and for SAR models even more so due to various normalizations of the $W_{in}$ that make them $n$-dependent. This paper justifies computationally straightforward estimation for the parameters of (1.1) with the same asymptotic properties as pseudo maximum likelihood estimates.
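To fix ideas, model (1.1) and its reduced form can be sketched numerically. The weight construction below (random 5%-density links, row-normalised, zero diagonal) is purely hypothetical and not a design from this paper; the reduced form $y_n = S_n^{-1}(X_n\beta_{0n} + u)$ is used to generate data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 2, 2

def make_weight(n, rng):
    # Hypothetical weight matrix: random sparse links, zero diagonal
    # (as the model requires), rows normalised to sum to one.
    W = (rng.random((n, n)) < 0.05).astype(float)
    np.fill_diagonal(W, 0.0)
    rs = W.sum(axis=1, keepdims=True)
    rs[rs == 0] = 1.0
    return W / rs

W = [make_weight(n, rng) for _ in range(p)]
lam0 = np.array([0.3, 0.2])        # spatial parameters, sum |lam_i| < 1
beta0 = np.array([1.0, -0.5])
X = rng.standard_normal((n, k))
u = rng.standard_normal(n)

# S_n = I_n - sum_i lam_i W_i; reduced form y = S^{-1}(X beta + u)
S = np.eye(n) - sum(l * Wi for l, Wi in zip(lam0, W))
y = np.linalg.solve(S, X @ beta0 + u)
```

With row-normalised weights and $\sum_i|\lambda_{0i}| < 1$, $S_n$ is strictly diagonally dominant and hence invertible, so the reduced form is well defined.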
SAR models allow dependence to occur across a very generalized notion of space: so long as every pair of units can be mapped to a point on the real line, a spatial weight matrix may be constructed. The flexible nature of the SAR model means that it may be used to model a very wide range of phenomena. Thus it has found application in many fields of economics such as development economics (Case (1991), Helmers and Patnam (2014)), industrial organization (Pinkse, Slade, and Brett (2002)), trade (Conley and Dupor (2003)) and peer effects (Hsieh and van Kippersluis (2018)), to name only a few examples. Another frequently used approach to model cross-sectional dependence is the 'common factor' technique; see e.g. Chudik and Pesaran (2015) for a review.
Estimation of SAR models has long been considered in the regional science literature, see e.g. Anselin (1988). Rigorous asymptotic theory for instrumental variables (IV) estimation was initially provided by Kelejian and Prucha (1998), leading to the present flourishing theoretical literature. Lee (2002) studied ordinary least squares (OLS) estimation of SAR models, stressing the need for lack of sparsity in the spatial weight matrix to establish desirable asymptotic properties such as consistency and efficiency. This was followed by Lee (2004), a seminal contribution that provided a taxonomical asymptotic theory for Gaussian pseudo maximum likelihood estimates (PMLE) of SAR models. More recently, Prucha (2013, 2020) has provided general theory for such models in a panel data setting.
The flexible nature of SAR modelling is further embellished by the seamless ability to integrate more than one spatial weight matrix in the model (1.1), thus permitting simultaneous connections between units across a number of channels. This is an accurate representation of typical economic situations, e.g. countries are 'connected' by both geographical proximity as well as trade ties. Furthermore, in many economic settings the sample partitions naturally into $p$ clusters or groups, leading to block diagonal structure for the spatial weight matrix $V_n = \mathrm{diag}(V_{1n}, \ldots, V_{pn})$, where $V_{in}$ is $m_i \times m_i$ and $\sum_{i=1}^{p} m_i = n$. To permit the modelling of heterogeneous spillover effects across clusters, one may take $W_{in}$ to be the $n \times n$ block diagonal matrix with the $m_i \times m_i$ dimensional $i$-th diagonal block given by $V_{in}$. This approach has been suggested by Robinson (2015, 2018). Here and more generally, the specification (1.1) is termed a 'higher-order' SAR model if $p > 1$, see e.g. Blommestein (1983), Lee and Liu (2010), Li (2017), Han, Hsieh, and Lee (2017), Kwok (2019).
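The block-diagonal construction just described can be sketched as follows. The helper name `higher_order_weights` is ours, and the per-cluster matrices $V_i$ are taken as given: each $W_i$ is $n \times n$, zero except for $V_i$ in the $i$-th diagonal block, so each $\lambda_i$ governs spillovers within cluster $i$ only.

```python
import numpy as np

def higher_order_weights(V_blocks):
    """Given cluster weight matrices V_1,...,V_p (V_i is m_i x m_i),
    return W_1,...,W_p where W_i is n x n (n = sum of m_i) and is zero
    everywhere except for V_i placed in the i-th diagonal block."""
    sizes = [V.shape[0] for V in V_blocks]
    n = sum(sizes)
    offsets = np.cumsum([0] + sizes)
    Ws = []
    for i, V in enumerate(V_blocks):
        W = np.zeros((n, n))
        a, b = offsets[i], offsets[i + 1]
        W[a:b, a:b] = V          # spillovers confined to cluster i
        Ws.append(W)
    return Ws
```

Note that $\sum_i W_i$ recovers the block diagonal matrix $V_n = \mathrm{diag}(V_{1n}, \ldots, V_{pn})$.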
In the study of higher-order SAR models, Robinson (2015, 2018) have suggested that p, k be allowed to diverge slowly to infinity as functions of sample size. The motivation for such generality is typically threefold: first, it is desirable to permit a richer model as the sample size permits. Second, clustered data as mentioned in the previous paragraph naturally imply asymptotic regimes with increasing p. For instance, when m i = m for each i = 1, . . . , p, we have n = mp and the results of Lee (2004) imply that p → ∞ is necessary for consistent estimation, analogous to the problems created in the spatial statistics literature by 'infill asymptotics', see e.g. Lahiri (1996). Finally, a theory that allows the model dimension to grow with sample size provides a more incisive analysis of large models in practice, much as typical asymptotic theory with a fixed parameter space itself can be thought as providing an approximation in finite samples.
The estimation of such increasing-order SAR models has been studied by Robinson (2015, 2018) using IV, OLS and PMLE approaches. The first two methods have the advantage of being in closed-form, while even for p = 1 PMLE (in)famously requires grid search and the inversion of an n × n matrix in every iteration, leading to many ingenious solutions for faster computation, see e.g. Ord (1975) and Pace and Barry (1997). The computational cost of PMLE in SAR type models is particularly salient as data sets increase in size, as stressed by Zhu, Huang, Pan, and Wang (2020). Modern network data sets are amenable to modelling via SAR techniques and can feature, or accommodate, large parameter spaces but computation remains a serious challenge. Han, Lee, and Xu (2020) provide a discussion of the problems and propose a Bayesian solution.
These problems are naturally exacerbated if p > 1, with grid search requiring more iterations to converge and each iteration requiring inversion of an n × n matrix, as well as risk of convergence to local optima. Furthermore, the requirement of a compact parameter space for λ 0n can severely restrict the admissible parameter values (see Gupta and Robinson (2018)). On the other hand, under Gaussianity the PMLE becomes the MLE and is efficient. This property is shared by OLS, but under rather delicate and specific conditions even for p = 1 (see Lee (2002)). Thus the IV/OLS and PMLE approaches each have their advantages and it is desirable to combine the positive properties of both.
One method of obtaining closed-form estimates with the same asymptotic covariance matrix as a target estimate is to use Newton-type iterations commencing from an initial consistent estimator that is straightforward to compute. The approach dates back at least to Fisher (1925) and LeCam (1956). It enjoys the added attraction of avoiding a potentially complicated consistency proof for an implicitly defined estimate, as well as the compactness assumptions this typically entails. As a result, the technique has been used in a vast variety of settings, see e.g. Rothenberg and Leenders (1964) (simultaneous equations), Hartley and Booker (1965) (nonlinear least squares), Janssen, Jurečkova, and Veraverbeke (1985) (M -estimation), Rothenberg (1984) (generalized least squares), Hualde and Robinson (2011), Kristensen and Linton (2006), Robinson (2005) (time series and adaptive estimation), Andrews (1997) (generalized method of moments), Kasahara and Shimotsu (2008), Kristensen and Salanié (2017) (structural estimation), De Luca, Magnus, and Peracchi (2018) (generalized linear models) and Frazier and Renault (2017) (efficient two-step estimation), to name just a few.
In this paper we use IV and OLS estimates as initial estimates to form a single Newton-step asymptotic approximation to the Gaussian PMLE, with $p = p_n$ and $k = k_n$ allowed to diverge as functions of $n \to \infty$. The approach has been studied in the case of fixed-dimensional SAR models by Robinson (2010) and Lee and Yu (2013), but the previous discussion hints at its particular usefulness when considering large models. One avoids grid search over a high-dimensional parameter space, compactness assumptions on this space and the inversion of a large ($n \times n$) matrix for every search iteration, as well as various headaches related to convergence and local optima. When commencing from IV estimates, this leads to closed-form efficient estimates under Gaussianity. As suggested by the results of Lee (2002) and Gupta and Robinson (2015), commencing the iteration from OLS preserves the efficiency property. However, we show that the Newton step cancels out certain terms of large stochastic order, which allows for weaker rate conditions than those imposed in these papers.
In a simulation study, we demonstrate that the Newton step can lead to much improved estimates in finite samples, both in terms of bias and efficiency. While a single step is sufficient to establish desirable asymptotic properties, in our simulation study we also explore the finite sample implications of additional Newton steps, reporting results with up to six iterations. We find large finite sample gains in both bias and mean squared error that are robust to heavy tailed error distributions. We also observe fast convergence of iterations, which conforms to extant theoretical observations. The gains are particularly notable when the parameter space and sample size is large, a situation in which PMLE becomes computationally onerous. In a small illustration with real world data, we show that the estimates work well in practice and lead to more precise results.
We collect some frequently used notation here for the convenience of the reader. For a generic matrix $A$, denote $\|A\| = (\overline{\eta}(A'A))^{1/2}$, with $\overline{\eta}(\cdot)$ and $\underline{\eta}(\cdot)$ denoting the largest and smallest eigenvalues, respectively, of a symmetric positive semidefinite matrix. Note that if $A$ is a vector then $\|A\|$ is simply its Euclidean norm. Let $\|A\|_R$ denote the maximum absolute row sum norm of $A$. For any parameter $\tau$, function $f(\tau)$ and generic estimate $\hat{\tau}$, we will write $\hat{f} \equiv f(\hat{\tau})$.
We denote true parameter values with a 0 subscript and suppress the argument for a quantity evaluated at a true parameter value, i.e. $f(\tau_0) \equiv f$.

2 Approximations to Gaussian PMLE
The ($-2/n$ times) log pseudo Gaussian likelihood function for model (1.1) at any admissible point $\theta = (\lambda', \beta')'$ is given by

$$Q_n(\theta, \sigma^2) = \log \sigma^2 - \frac{2}{n} \log \det S_n(\lambda) + \frac{1}{n\sigma^2} \left\| S_n(\lambda) y_n - X_n \beta \right\|^2, \qquad (2.1)$$

where $S_n(\lambda) = I_n - \sum_{i=1}^{p_n} \lambda_{in} W_{in}$, with $I_n$ denoting the $n \times n$ identity matrix. If $S_n$ is invertible, (1.1) admits the reduced form $y_n = S_n^{-1} X_n \beta_{0n} + S_n^{-1} u$, and we define $R_n = A_n + B_n$, where $A_n = (G_{1n} X_n \beta_{0n}, \ldots, G_{p_n n} X_n \beta_{0n})$, $B_n = (G_{1n} u, \ldots, G_{p_n n} u)$, $G_{in}(\lambda) = W_{in} S_n^{-1}(\lambda)$, $i = 1, \ldots, p_n$, and so $R_n = (W_{1n} y_n, \ldots, W_{p_n n} y_n)$.
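As a numerical sketch, the objective (2.1) can be coded directly. The explicit form used below, $\log\sigma^2 - (2/n)\log\det S_n(\lambda) + \|S_n(\lambda)y_n - X_n\beta\|^2/(n\sigma^2)$, is the standard Gaussian SAR pseudo-likelihood (constants dropped) and is stated here as an assumption consistent with the definition of $S_n(\lambda)$ in the text.

```python
import numpy as np

def Q_n(theta, sig2, y, X, W_list):
    """-2/n times the Gaussian log pseudo-likelihood of a SAR model,
    up to an additive constant. theta stacks (lambda', beta')'.
    Assumes det S(lambda) > 0 on the admissible region."""
    n = len(y)
    p = len(W_list)
    lam, beta = theta[:p], theta[p:]
    S = np.eye(n) - sum(l * Wi for l, Wi in zip(lam, W_list))
    _, logdet = np.linalg.slogdet(S)      # log |det S(lambda)|
    resid = S @ y - X @ beta              # equals -R^y(theta)
    return np.log(sig2) - (2.0 / n) * logdet + resid @ resid / (n * sig2)
```

The residual $S_n(\lambda)y_n - X_n\beta$ equals $-R_n^y(\theta)$ in the notation used below, so the quadratic term agrees with either parameterisation.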
Defining $R_n^y(\theta) = R_n \lambda + X_n \beta - y_n$, the derivative of (2.1) with respect to $\theta$ at any admissible $\theta, \sigma^2$ is

$$\frac{\partial Q_n(\theta, \sigma^2)}{\partial \lambda} = \frac{2}{n} t_n(\lambda) + \frac{2}{n\sigma^2} R_n' R_n^y(\theta), \qquad \frac{\partial Q_n(\theta, \sigma^2)}{\partial \beta} = \frac{2}{n\sigma^2} X_n' R_n^y(\theta),$$

where $t_n(\lambda) = (\operatorname{tr} G_{1n}(\lambda), \ldots, \operatorname{tr} G_{p_n n}(\lambda))'$. The Hessian with respect to $\theta$ at any admissible point in the parameter space is

$$H_n(\theta, \sigma^2) = \frac{2}{n} \begin{pmatrix} P_n(\lambda) & 0 \\ 0 & 0 \end{pmatrix} + \frac{2}{n\sigma^2} \begin{pmatrix} R_n' R_n & R_n' X_n \\ X_n' R_n & X_n' X_n \end{pmatrix},$$

where $P_n(\lambda)$ is the $p_n \times p_n$ matrix with $(i, j)$-th element given by $\operatorname{tr}(G_{jn}(\lambda) G_{in}(\lambda))$.
Let $Z_n$ be an $n \times r_n$ matrix of instruments, with $r_n \ge p_n$, and write $D_n = (R_n, X_n)$ for the regressor matrix, so that the IV and OLS estimates are

$$\tilde{\theta}_n = (D_n' P_{Z_n} D_n)^{-1} D_n' P_{Z_n} y_n, \qquad \bar{\theta}_n = (D_n' D_n)^{-1} D_n' y_n,$$

where $P_{Z_n} = Z_n (Z_n' Z_n)^{-1} Z_n'$. Define the respective 'one-step' estimates $\hat{\theta}_n$ and $\check{\theta}_n$ by

$$\hat{\theta}_n = \tilde{\theta}_n - \tilde{H}_n^{-1} \tilde{\xi}_n, \qquad (2.7)$$
$$\check{\theta}_n = \bar{\theta}_n - \bar{H}_n^{-1} \bar{\xi}_n, \qquad (2.8)$$

where $\xi_n$ denotes the score with respect to $\theta$, and tildes (bars) denote evaluation at the IV (OLS) estimates of $\theta$ and the implied estimates of $\sigma^2$. We observe that other initial estimates, such as the GMM estimates of Kelejian and Prucha (1999) and Lee (2007), can also be used. However, we choose initial estimates that are available in closed form for computational ease. While consistent initial estimates are needed to obtain a desirable asymptotic theory, even in the fixed-dimension parametric case these are permitted to be $n^{\psi}$-consistent, where $\psi < 1/2$; see Robinson (1988) and references therein.
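A minimal sketch of the one-step update (2.7): $\sigma^2$ is replaced by the residual variance implied by the initial estimate, and the score and Hessian below are the standard Gaussian SAR expressions. This is stated as an assumption rather than a transcription of the paper's exact formulas; any common constant in score and Hessian cancels in $H^{-1}\xi$.

```python
import numpy as np

def one_step(theta, y, X, W_list):
    """Single Newton step on the (-2/n) Gaussian log pseudo-likelihood
    in the (lambda, beta) block, with sigma^2 concentrated out at the
    initial estimate theta. Returns the updated parameter vector."""
    n, p = len(y), len(W_list)
    lam, beta = theta[:p], theta[p:]
    S = np.eye(n) - sum(l * Wi for l, Wi in zip(lam, W_list))
    Sinv = np.linalg.inv(S)
    G = [Wi @ Sinv for Wi in W_list]                 # G_i(lambda)
    R = np.column_stack([Wi @ y for Wi in W_list])   # (W_1 y, ..., W_p y)
    Ry = R @ lam + X @ beta - y                      # R^y(theta)
    sig2 = Ry @ Ry / n                               # implied variance
    t = np.array([np.trace(Gi) for Gi in G])
    xi = np.concatenate([
        2.0 * t / n + 2.0 * (R.T @ Ry) / (n * sig2),   # score wrt lambda
        2.0 * (X.T @ Ry) / (n * sig2),                 # score wrt beta
    ])
    P = np.array([[np.trace(Gi @ Gj) for Gj in G] for Gi in G])
    H = np.block([
        [2.0 * P / n + 2.0 * (R.T @ R) / (n * sig2), 2.0 * (R.T @ X) / (n * sig2)],
        [2.0 * (X.T @ R) / (n * sig2),               2.0 * (X.T @ X) / (n * sig2)],
    ])
    return theta - np.linalg.solve(H, xi)
```

Further iterations simply feed the output back in as the new `theta`; no grid search or repeated $n \times n$ determinant evaluation over a parameter grid is needed.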
While our theorems below establish desired asymptotic properties for the one step estimates, from a practical point of view more iterations may be desirable. In fact, these also improve the statistical rate of convergence to the target PMLE, yielding an even faster statistical counterpart to the famous quadratic numerical rate of convergence of Newton estimates, see for example Theorem 2 of Robinson (1988) and p. 312-313 of Ortega and Rheinboldt (1970). We examine this issue in more detail in the next section and also the Monte Carlo study.

3 Asymptotic properties
The following assumptions are discussed in Lee (2002, 2004) and Robinson (2015, 2018), amongst other spatial papers in which they are routinely employed. These conditions are by no means the weakest possible set, but we opt for tractability to convey the main message, especially in view of the large number of spatial parameters involved. For example, stochastic regressors can easily be accommodated but complicate the notation.
Assumption 1. $u = (u_1, \ldots, u_n)'$ has iid elements with zero mean and finite variance $\sigma_0^2$.
Assumption 2. For $i = 1, \ldots, p_n$, the elements of $W_{in}$ are uniformly $O(1/h_n)$, where $h_n$ is some positive sequence which may be bounded or divergent, but is always bounded away from zero and such that $n/h_n \to \infty$ as $n \to \infty$. The diagonal elements of each $W_{in}$ are zero.
Assumption 3. $S_n$ is non-singular for all sufficiently large $n$.
Assumption 5. The elements of $X_n$ are constants and are uniformly bounded in $n$, in absolute value, for all sufficiently large $n$.
Assumption 6. The elements of Z n are constants and are uniformly bounded in absolute value, for all sufficiently large n.
Assumption 11. $E u_i^4 \le C$ for $i = 1, \ldots, n$.
Let $\Psi_n$ be an $s \times (p_n + k_n)$ matrix of constants with full row rank. The claims of the following theorems also hold when $p_n$ and $k_n$ are fixed, but we state and prove the results for the more challenging case when these diverge.
To more transparently illustrate the implications of our weaker rate conditions, consider data collected in a 'farmer-district' type of environment, such as in Case (1991). Suppose that there are $D$ districts, each containing $m$ farmers, so that $n = Dm$, and $D, m \to \infty$ simultaneously. There is independence across districts, but equal dependence within districts, yielding $h_n = m - 1$ (see Lee (2002, 2004) for a more detailed discussion). Then, with fixed $p_n$, the rate conditions imposed by Lee (2002) and Gupta and Robinson (2015) are more restrictive than ours in this setting. The author thanks an anonymous referee for suggesting this illustration. We note that Robinson (2010) obtained asymptotic normality, indeed efficiency, in a semiparametric setup with $p_n = 1$ requiring only $h_n \to \infty$ if the disturbances are symmetrically distributed or the weight matrix is symmetric. This condition would likely need to be suitably amended as $p_n \to \infty$.
If $h_n$ is bounded as $n \to \infty$, a more complicated analysis is required to establish that one-step estimates achieve the PMLE asymptotic covariance matrix, because the information equality does not hold asymptotically. Denote $\mu_l = E u_i^l$ for natural numbers $l$, and introduce, with $i, j = 1, \ldots, p_n$, the $p_n \times p_n$ matrix $\Omega_{\lambda\lambda,n}$ with $(i, j)$-th element

$$\frac{4\mu_3}{n\sigma_0^4} \sum_{r=1}^{n} c_{rr,in} b_{r,jn} + \frac{\mu_4 - 3\sigma_0^4}{n\sigma_0^4} \sum_{r=1}^{n} c_{rr,in} c_{rr,jn}$$

and the $k_n \times p_n$ matrix $\Omega_{\lambda\beta,n}$ with $i$-th column

$$\frac{2\mu_3}{n\sigma_0^4} \sum_{r=1}^{n} c_{rr,in} x_{r,n},$$

where $c_{pq,in}$ is the $(p, q)$-th element of $C_{in}$, $b_{jn} = G_{jn} X_n \beta_{0n}$ with $t$-th element $b_{t,jn}$ ($j = 1, \ldots, p_n$ and $t = 1, \ldots, n$) and $x_{p,n}$ is the $p$-th column of $X_n'$. These quantities enter the definition in (3.5). When $h_n$ is bounded OLS cannot be consistent (see Lee (2002)), so the following theorem considers only initial IV estimates.
Theorem 3.2. Let Assumptions 1-7 hold. Suppose that $h_n$ is bounded away from zero and that there is a real number $\delta > 0$ such that $E|u_i|^{4+\delta} \le C$ for $i = 1, \ldots, n$. In addition, assume that conditions (3.6) and (3.7) hold. Then $n^{1/2} \Psi_n (\hat{\theta}_n - \theta_{0n})$ is asymptotically normal, where the asymptotic covariance matrix exists, and is positive definite, by (3.6).
The rate condition (3.7) can simplify depending on the value of $\delta$, i.e. the order of the finite moments assumed for $u_i$. As $\delta$ grows larger, the last term in the rate condition becomes redundant; indeed, the numerator therein tends to $p_n^2 k_n^2$ as $\delta \to \infty$, which is evidently dominated by the numerator of the other rate restriction. In the 'farmer-district' setting discussed earlier, $h_n = m - 1$ is bounded in this case. To further illustrate the rate condition, suppose that we are in the just identified case $p_n = r_n$. Then (3.7) requires $p_n^5 k_n^3 + p_n^4 k_n^4 + p_n^3 k_n^5 + (p_n k_n)^{2 + 8/\delta} = o(n)$, and the term involving $\delta$ dominates the other three if $\delta \le 8/3$.
As indicated earlier, further iterations of the Newton step can improve the rate of statistical convergence to the target as well as finite-sample properties. To see this, let $\hat{\theta}_n^{\ell}$ be the $\ell$-th Newton iteration towards the PMLE. By Theorem 2 of Robinson (1988), the distance between $\hat{\theta}_n^{\ell}$ and the PMLE is stochastically bounded in terms of the distance between the initial estimate and the PMLE, an identical bound holding also for $\hat{\theta}_n^{\ell+1}$. A factor that depends on $\ell$ is suppressed in the stated stochastic bound, indicating that the bound is not uniform in $\ell$. Because the results of Gupta and Robinson (2018) and this paper establish the rate of convergence of one-step Newton estimates, this yields the rate at which the iterations approximate the target estimate in a statistical sense, pointwise in $\ell$.
4 Finite-sample performance of Newton-step estimates

4.1 Fixed number of neighbours (bounded $h_n$)
We examine the finite-sample performance of $\hat{\theta}_n$ in this section, since the Newton step changes the limiting distribution of the initial IV estimate, while OLS requires divergent $h_n$ to be consistent.
Following Das, Kelejian, and Prucha (2003) and the design in Gupta and Robinson (2015), define $W_{in}^*$ as the symmetric circulant matrix whose first row has ones in the $i$ positions on either side of the diagonal (wrapping around) and zeros elsewhere (see Davis (1979), p. 73). Thus $W_{in}^c$ is also a symmetric circulant matrix, with first row given by $w_{1j,in}^*/2i$. This is an example of spatial weight matrices with bounded $h_n$.
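This circulant design can be sketched as follows. The exact placement of the ones in the first row is our reading of the description above and should be treated as an assumption; the resulting $W_i^c$ has $h_n = 2i$ nonzero entries per row, each equal to $1/2i$.

```python
import numpy as np

def circulant_weight(n, i):
    """Symmetric circulant weight matrix: first row has ones in the i
    positions on either side of the diagonal (wrapping around), divided
    by 2i so every row sums to one. Diagonal is zero."""
    first = np.zeros(n)
    for j in range(1, i + 1):
        first[j] = 1.0
        first[n - j] = 1.0
    first /= 2 * i
    # row r of a circulant matrix is the first row shifted r places
    return np.array([np.roll(first, r) for r in range(n)])
```

Because the first row satisfies $w_{1j} = w_{1,n-j}$, the resulting circulant matrix is symmetric, and its rows sum to one by construction.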
The choices of $\lambda_i$ satisfy the sufficient condition $\sum_{i=1}^{p} |\lambda_i| < 1$ for invertibility of $S_n$.
With the aim of comparing initial IV estimates and MLE to Newton-step estimates, we first report three statistics: Monte Carlo mean, Monte Carlo mean squared error, and relative root Monte Carlo mean squared error, the latter being a straightforward ratio of the root MSE for IV and the iterated estimate. We also examine the use of more than one iteration in finite samples, and for this recall the notation $\hat{\theta}_n^{\ell}$ for the $\ell$-th Newton iteration. Our results are reported for $\ell = 1, 3, 6$. The set of instruments that we use for our initial estimates are the linearly independent columns of $Z = (W_1^c X, \ldots, W_p^c X, X)$. In Tables 1 and 2, we report the Monte Carlo mean of our estimates for standard normal and $t_6$ errors, respectively. For standard normal errors, we notice that the initial IV estimate can be heavily biased but Newton iterations improve matters, sometimes spectacularly. Indeed, for $p = 6$ and $n = 200$ the performance of $\tilde{\theta}_n$ can be appalling, with $\tilde{\lambda}_5 < 0$. However, after six Newton steps this has improved to 0.1216, and even three iterations lead to a significant improvement. The reduction of bias from Newton iterations is not a universal feature, but broadly speaking the Newton steps reduce bias in the estimates, even for smaller values of $p$.
As the sample size increases the iterations converge substantially, with typically little to choose between $\hat{\theta}_n^3$ and $\hat{\theta}_n^6$ for $n = 800$. However, for $n < 800$ we notice that three iterations usually do the job quite satisfactorily, especially when $p < 6$.
For $t_6$ errors, Table 2 paints a similar picture to Table 1. Once again, the noticeable 'rogue' estimate is for $\lambda_5$ when $p = 6$ and $n = 200$. Considering that all our simulations start from the same seed, this outlier may possibly be attributed to a bad draw. As in the normal-errors case, results are quite stable for larger $n$ and smaller $p$, and typically show bias reduction due to Newton steps and near convergence after three iterations.
Tables 3 and 4 report mean squared error (MSE) for the IV estimates and iterated estimates with $N(0, 1)$ and $t_6$ errors, respectively. As may be expected, MSE is very high for designs that combine the largest values of $p$ with the smallest values of $n$. The efficiency improvement due to the Newton step is apparent, with iterations leading to very clear improvements (i.e. reductions) in MSE. These gains can be spectacular in many cases, for example for the $\lambda_i$ estimates when $p = 6$ and $n = 800$. These patterns of improvement with iteration are similar for both error distributions, but the magnitude of MSE is generally much larger for $t_6$ errors, a distribution with heavier tails than the normal.
In Tables 5 and 6, we report the ratio of the Monte Carlo root mean squared error of $\tilde{\theta}_n$ to that of $\hat{\theta}_n^{\ell}$, $\ell = 1, 3, 6$, abbreviating this quantity to RRMSE. An RRMSE of two indicates that the RMSE of the IV estimate is twice that of the Newton iteration it is being compared to. Our results in Table 5 show that Newton iterations can lead to tremendous finite-sample gains in MSE. These gains are present in 100% of the cases considered, but are generally larger for the spatial parameters $\lambda_i$ than for the regression parameters $\beta_i$.
We discuss the spatial parameter estimates first. Note that greater sample sizes bring greater MSE gains, the gains often more than doubling from $n = 200$ to $n = 800$, and sometimes even tripling. As observed for the means in Table 1, there is usually not much to choose between the third and sixth iterations. With $n = 800$ we nearly always obtain Newton estimates with RMSE a quarter of that for IV, and occasionally even a fifth of the IV RMSE.
In most cases three iterations are enough to achieve these superb gains.
These patterns for the $\lambda_i$ qualitatively repeat themselves when the errors are $t_6$, as seen in Table 6. In this case, when $n = 800$ we achieve RMSE improvements over IV by a factor of at least 2.15 whenever three iterations are carried out, with factors of three commonly seen and one case with nearly a fourfold improvement. The factors of efficiency improvement that we observe in our results can dominate similar precedents in other settings. Indeed, the greatest relative root MSE improvement that Robinson (2005) finds in his fractional time series setting is roughly 2.09 (see Table 4 of that paper).
Moving to the estimates of the regression parameters $\beta_1$ and $\beta_2$, in both Tables 5 and 6 we see almost universal improvement over IV. The exceptions are four cases out of a total of 54 in Table 6, for the $t_6$ case. These RMSE gains are not as spectacular as for the $\lambda_i$, but grow noticeably as both $n$ and $p$ increase. Indeed, for $p = 6$ and $n = 800$ we observe that the RMSE of the IV estimate can sometimes be almost one and a half times as large as that of the Newton iterations. For $n \ge 400$, IV performs worse than the Newton iterations almost uniformly (there are only two exceptions, for $t_6$ errors) over both $\beta_1$ and $\beta_2$, the values of $p$, the number of iterations and the error distribution. Thus there is evidence of the usefulness of Newton iterations even for the regression parameters, albeit the gains are greater for the spatial parameters.
Finally, we also present the RRMSE of MLE (denoted $\dot{\theta}_n$) to our proposed iterated estimates in Table 7 for $N(0, 1)$ errors. Naturally, we anticipate MLE to outperform iterated IV estimates for smaller sample sizes and, because our iterations target the MLE limiting covariance matrix, a reasonable aim is to approach the RMSE of the MLE as $n$ grows larger. Indeed, we find that this is the case. Recall that our estimates are designed to approximate but not outperform MLE: the main focus of the paper is computational simplicity. Our estimates are available in closed form and can be computed much faster than those requiring grid search and inversion of an $n \times n$ matrix. Thus, approaching the MLE in RMSE as $n$ grows is an encouraging and desirable property of our estimates. Finally, we observe that the RMSE of $\dot{\beta}_n$ is much closer to the iterated estimates than is the case for $\dot{\lambda}_n$. For the latter, larger sample sizes are needed for the RRMSE to approach unity.

4.2 Growing number of neighbours (divergent $h_n$)
In this section we explore the performance of the Newton-step estimates when the number of neighbours diverges with sample size, i.e. $h_n \to \infty$. This design, with diverging $h_n$, also allows us to study the performance of iterations on OLS starting values. For each $i = 1, \ldots, p$, we generate an $n \times n$ matrix $W_{in}^*$ as $w_{rs,in}^* = \Phi(-d_{rs,i}) \, 1\{c_{rs,i} < n^{1/3}/100\}$ if $r \ne s$, and $w_{rr,in}^* = 0$, where $\Phi(\cdot)$ is the standard normal cdf, $d_{rs,i} \sim$ iid $U[-3, 3]$, and $c_{rs,i} \sim$ iid $U[0, 1]$. This construction generates $W_{in}^*$ with approximately $n^{1/3}\%$ (up to the closest integer) nonzero elements. These $W_{in}^*$ are then symmetrized and normalized by spectral norm to ensure stability, yielding the final set of $W_{in}$ that we employ. The remaining design details are as in the previous subsection. To conserve space, we report results only for $N(0, 1)$ errors.
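A sketch of this weight generation follows; the helper name is ours, and $\Phi$ is computed via the error function. Note that the final spectral-norm scaling guarantees a spectral radius of one, so $S_n$ is invertible whenever $\sum_i|\lambda_i| < 1$.

```python
import numpy as np
from math import erf, sqrt

def divergent_hn_weight(n, rng):
    """Divergent-h_n design: off-diagonal entries Phi(-d) * 1{c < n^(1/3)/100}
    with d ~ U[-3,3] and c ~ U[0,1], then symmetrised and divided by the
    spectral norm, giving roughly n^(1/3) percent nonzero entries."""
    Phi = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))
    d = rng.uniform(-3.0, 3.0, size=(n, n))
    c = rng.uniform(0.0, 1.0, size=(n, n))
    W = Phi(-d) * (c < n ** (1.0 / 3.0) / 100.0)
    np.fill_diagonal(W, 0.0)
    W = 0.5 * (W + W.T)             # symmetrise
    W /= np.linalg.norm(W, 2)       # spectral-norm normalisation
    return W
```

Because the entry density is $n^{1/3}/100$, the number of neighbours per row grows like $n \cdot n^{1/3}/100$, so $h_n \to \infty$ as the design requires.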
Tables 8 and 9 display the Monte Carlo mean of $\tilde{\theta}_n$, $\hat{\theta}_n^1$, $\hat{\theta}_n^3$, $\bar{\theta}_n$, $\check{\theta}_n^1$ and $\check{\theta}_n^3$. Convergence of the iterations is achieved after three Newton steps, so we do not report a sixth iteration as in the previous subsection. In fact, convergence is practically achieved by just a single iteration from the IV starting values $\tilde{\theta}_n$, as Table 8 indicates. Table 9 suggests that a third iteration has more influence for OLS starting values, but modestly so. Tables 10 and 11 report MSE for the same sets of estimates and we find a similar pattern: for IV starting values one iteration seems to do the job and reduces MSE. On the other hand, for OLS starting values the first iteration increases MSE but the third iteration reduces it, following which performance is stable, and so we do not report further iterations.
In Table 12 we report the RRMSE of the estimates studied above. We notice that IV estimates improve in MSE with a single Newton step, and subsequent iterations do not help much because convergence is achieved. On the other hand, when starting with the OLS values $\bar{\theta}_n$, further iterations are beneficial and yield more efficient estimates; convergence is completely achieved after three iterations in this case. We also find that Newton steps, whether they commence from $\tilde{\theta}_n$ or $\bar{\theta}_n$, give greater efficiency gains for the spatial parameters $\lambda_i$ than for the regression coefficients $\beta_i$. This matches the results in the previous subsection. Because the $\lambda_i$ correspond to the potentially endogenous spatial lags $W_{in} y_n$, we might expect initial estimates of these to have greater potential for improvement compared to the $\beta_i$.

4.3 Heteroskedastic errors
In this design, we confirm the robustness of our findings to heteroskedasticity in the error distribution. We generate the errors using multiplicative heteroskedasticity via the regressors, and report only the bounded $h_n$ weight matrices of Section 4.1 and designs with Gaussian errors to conserve space. Specifically, we employ a $N(0, h_{jn})$ distribution for the errors, where

$$h_{jn} = n \left( \sum_{r=1}^{n} (|x_{r1}| + |x_{r2}|) \right)^{-1} (|x_{j1}| + |x_{j2}|),$$

see Liu and Yang (2015) and also Lin and Lee (2010). Monte Carlo mean, mean squared error and RRMSE of IV estimates to iterated Newton-step estimates are presented in Tables 13-15. We find the same qualitative patterns as were observed for the homoskedastic designs presented earlier, with the 'rogue' IV estimate for $\lambda_5$ appearing again because we start our simulations from the same seed. As far as quantitative results are concerned, the improvements due to the Newton step are generally smaller than in the homoskedastic case but still substantial.
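The heteroskedastic variance scheme can be sketched as below (the helper name is ours). By construction the variances $h_{jn}$ average to one, so the overall error scale matches the homoskedastic designs.

```python
import numpy as np

def hetero_variances(X):
    """Multiplicative heteroskedasticity via the regressors:
    h_jn = n * (|x_j1| + |x_j2|) / sum_r (|x_r1| + |x_r2|).
    Errors are then drawn as u_j ~ N(0, h_jn)."""
    s = np.abs(X[:, 0]) + np.abs(X[:, 1])
    return len(s) * s / s.sum()   # variances, mean exactly one

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
h = hetero_variances(X)
u = rng.normal(0.0, np.sqrt(h))   # heteroskedastic disturbances
```

Scaling the variances to average one keeps the signal-to-noise ratio comparable across the homoskedastic and heteroskedastic designs.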

5 Empirical illustration
In this small empirical illustration we show that the Newton-step estimates perform well in practice and can lead to more precise estimation. The example is based on Kolympiris, Kalaitzandonakes, and Miller (2011) (KKM), and is also studied in Gupta and Robinson (2015). KKM seek to model the venture capital funding (provided by venture capital firms (VCFs)) for dedicated biotechnology firms (DBFs) with a SAR model. The hypothesis is that the level of VC funding for a DBF increases with the number of VCFs located in close proximity. Denoting by $d_{lk}$ the distance in miles between the $l$-th and $k$-th DBFs, we estimate

$$y = \sum_{i=1}^{p} \lambda_i W_i^b y + X\beta + u, \qquad (5.1)$$

where $W_i^b$ is the (row-normalised) weight matrix having off-diagonal $(l, k)$-th element equal to 1 if $i - 1 < d_{lk} \le i$, $i = 1, \ldots, p$, and also if $d_{lk} = 0$ for $i = 1$. Thus the matrices are based on each one of $p$ sequential 1-mile rings from the origin DBF. $y$ is the vector of natural logs of the amount of VC funding (million \$) received by each of $n = 816$ DBFs.
We first focus on estimates of the main parameters of interest, the $\lambda_i$ in (5.1). We estimate (5.1) with $p = 2, 4, 6$ using initial IV estimates and the Newton-step estimates that we have justified theoretically. We only report the Newton step for a single iteration, as convergence is already achieved.
Like Gupta and Robinson (2015), we find that only $\lambda_1$ and $\lambda_2$ are statistically significant at the 1% level, and the magnitude of our parameter estimates is also close to their findings, with our results reported in Table 16. The table reports t-statistics in parentheses. In square brackets we report, for each parameter estimate, the ratio of the IV standard error to the Newton-step standard error, and find that the former can exceed the latter by as much as 12.53%. Thus the iteration scheme we propose can lead to more accurate inference in practice, as the estimates are more precise.
As far as the β i are concerned, our simulations generally show that the efficiency gains are smaller for these as compared to the λ i . Table 17 reports standard error ratios and absolute t-statistics for exclusion tests and confirms this. Indeed, all standard error ratios are very close to unity and the t-statistics are practically identical. We note that our proposed iteration does not make the estimation precision of the β i worse and improves the estimation precision of the λ i , leading to an overall improvement in estimation quality.
We give a very brief description of the explanatory variables in $X_n$ and refer the reader to KKM for details. The covariates include the number of proximate VCFs and DBFs to capture the effects of being in areas of high VCF or DBF concentration. Firm-specific characteristics include the distance from each DBF to its funding VCFs, the average age of each funding VCF, exposure of VCFs through syndication and an indicator for foreign VCF investment. Variables controlling for DBF-specific factors include firm age, dummies for receiving a grant and being in an R&D tax credit state, a cost of business index for the DBF's home state, distance to the closest university and the number of non-biotech establishments in the DBF's zip code. Two further variables recognize that additional factors can affect the cost of doing business in ways that influence the VC funding levels of a given DBF.

A Proofs of theorems
Write $a_n = p_n + k_n$, $b_n = r_n + k_n$, $c_n = p_n k_n^2 + k_n$ and $\tau_n = n^{1/2}/a_n^{1/2}$.
Proof of Theorem 3.1. (i) By the mean value theorem, (2.7) implies that

$$\hat{\theta}_n - \theta_{0n} = \tilde{\theta}_n - \theta_{0n} - \tilde{H}_n^{-1}\tilde{\xi}_n = \left(I_{a_n} - \tilde{H}_n^{-1} H_n^*\right)\left(\tilde{\theta}_n - \theta_{0n}\right) - \tilde{H}_n^{-1} \xi_n, \qquad (A.1)$$

where $H_n^* = \partial^2 Q_n(\theta_n^*, \tilde{\sigma}_n^2)/\partial\theta\partial\theta'$ and $\|\theta_n^* - \theta_{0n}\| \le \|\tilde{\theta}_n - \theta_{0n}\|$, with each row of the Hessian matrix evaluated at a possibly different $\theta_n^*$. The latter point is a technical comment that we take as given in the remainder of the paper whenever a mean value theorem is applied to a vector of values. For any $s \times 1$ vector $\alpha$, we can use (A.1) to write $\tau_n \alpha' \Psi_n (\hat{\theta}_n - \theta_{0n})$ in terms of $\tau_n \alpha' \Psi_n \tilde{H}_n^{-1}$, which is bounded by Theorem 3.1 of Gupta and Robinson (2015). We conclude that the first term on the RHS of (A.2) is negligible. We have

$$E\|\phi_n\|^2 \le \sum_{i=1}^{p_n} E\left(n^{-1}\operatorname{tr} C_{in} - n^{-1}\sigma_0^{-2} u' C_{in} u\right)^2 = \sum_{i=1}^{p_n} \operatorname{var}\left(n^{-1}\sigma_0^{-2} u' C_{in} u\right) = O\left(p_n/nh_n\right)$$

(see (A.20) in the proof of Theorem 3.3 and Lemma B.2 in Gupta and Robinson (2018)), so that the corresponding term is also negligible, in a similar way to the preceding proofs. Thus we need to establish the asymptotic distribution of $-\tau_n \alpha' \Psi_n \Xi_n^{-1} \xi_n$, which is established under the assumed conditions in Theorem 3.4 of Gupta and Robinson (2018).

B Lemmas
In the subsequent lemmas the assumptions of the theorems that they are used to prove are taken to hold.
Lemma B.6. $\tilde{H}_n - H_n$ and $\bar{H}_n - H_n$ are $O_p\!\left(\max\left(c_n/n,\ p_n^2/h_n^2,\ \ldots\right)\right)$.