Nonlinear Factor Models for Network and Panel Data ∗

Factor structures or interactive eﬀects are convenient devices to incorporate latent variables in panel data models. We consider ﬁxed eﬀect estimation of nonlinear panel single-index models with factor structures in the unobservables, which include logit, probit, ordered probit and Poisson speciﬁcations. We establish that ﬁxed eﬀect estimators of model parameters and average partial eﬀects have normal distributions when the two dimensions of the panel grow large, but might suﬀer from incidental parameter bias. We show how models with factor structures can also be applied to capture important features of network data such as reciprocity, degree heterogeneity, homophily in latent variables, and clustering. We illustrate this applicability with an empirical example to the estimation of a gravity equation of international trade between countries using a Poisson model with multiple factors


Introduction
Factor structures or interactive effects are convenient devices to incorporate latent variables in panel data models. They are commonly used to capture aggregate shocks that might have heterogeneous impacts on the agents in macroeconomic models, and multidimensional individual heterogeneity that might have time varying effects in microeconomic models. More generally, the inclusion of these structures serves to account for dependences along the cross-section and time series dimensions in a parsimonious fashion. While methods for linear factor models are well-established, there are very few studies that develop methods for nonlinear factor models.
(We provide a literature review at the end of this section.) Nonlinear models are commonly used when the outcome variable is discrete or has a limited support. In this paper we introduce factor structures in single-index nonlinear specifications such as the logit, probit, ordered probit and Poisson models.
The model that we consider is semiparametric. It includes an outcome, strictly exogenous covariates, and a fixed number of factors and factor loadings. The parametric part is the distribution of the outcome conditional on the covariates, factors and loadings, which is specified up to a finite dimensional parameter. The nonparametric part is the distribution of the factors and loadings conditional on the covariates. In other words, our model is of the "fixed effects" type because we do not impose any restriction on the relationship between the observed covariates and the unobserved factors and loadings. This flexibility allows us to capture features of economic behavior more realistically, but poses important challenges to estimation and inference.
The objects of interest are the model parameter and average partial effects (APEs), which are averages of functions of the data, parameter, factors and loadings. The APEs measure the effect of covariates on moments of the outcome conditional on the covariates, factors and loadings. We consider a fixed effects estimation approach that treats the factors and loadings as parameters to be estimated. As it is well-known in the panel data literature, the resulting estimators generally suffer from the incidental parameter problem coming from the high-dimensionality of the estimated parameter (Neyman and Scott, 1948).
We derive asymptotic theory for our estimators of the model parameter and APEs under sequences where the two dimensions of the panel pass to infinity with the sample size. Even establishing consistency is complicated in our setting because the dimension of the estimated parameters increases with the sample size. We develop a new proof of consistency that relies on concavity of the log-likelihood function on a single-index that captures the dependence on covariates, parameter, factors and loadings. However, unlike Fernández-Val and Weidner (2016), we need to deal with the complication that our log-likelihood function is not concave in all the estimated parameters because the factors and loadings enter multiplicatively in the index. We also establish that our estimators are normally distributed in large samples, but might have biases of the same order as their standard deviations. For example, we find that the estimator of the model parameter is asymptotically unbiased in the Poisson model, but is biased in logit and probit models. Following the recent panel data literature, we develop analytical and split-sample corrections for the case where the estimator has asymptotic bias. One specific feature of our estimator is that the bias depends on the number of factors. In particular, we show that the bias grows proportionally with the number of factors in examples.
We discuss implementation details of our methods including the computation of the estimator and selection of the number of factors. Thus, we propose an EM-type algorithm based on Chen (2014) and a concrete proposal to estimate the number of factors based on the eigenvalue ratio test of Ahn and Horenstein (2013). The estimator of the number of factors requires to specify an upper bound for the number of factors, but does not rely on any arbitrary choice of penalty function or other tuning parameter. We do not provide asymptotic theory for this estimator, but show that it performs well in numerical simulations. Formally deriving the theory is rather challenging, because it requires to study the asymptotic properties of the initial fixed effects estimators of the parameters and factor structure obtained from a specification with too many factors, which is a difficult problem even in linear panel factor model (Moon and Weidner 2015). We leave this analysis to future research.
We also introduce factor structures as practical tools to model network data. We show how the inclusion of latent factors is useful to incorporate important features of the network such as reciprocity, degree heterogeneity, homophily on latent variables, and clustering (Snijders, 2011;Graham, 2015). We focus on directed networks with unweighted and weighted outcomes. These cover binary response models for network formation where the outcome is an indicator for the existence of a link between sender and receiver, and count data models for network flows where the outcome is a measure of the volume of flow between sender and receiver. As we shall discuss, our factor model provides a parsimonious reduced-form specification that captures the important network features mentioned above. The statistical treatment of the network factor model is identical to the panel factor model after noticing that a network is isomorphic to a panel after labeling the senders as individuals and the receivers as time periods.
We illustrate the use of the factor structure in network data with an application to gravity equations of trade between countries. We estimate a Poisson model where the outcome is the volume of trade and the covariates include typical gravity variables such as the distance between the countries or whether the country pair belongs to a currency union or a free trade area. The unobserved factors and loadings serve to account for scale and multilateral resistance effects, unobserved partnerships, presence of multinational firms, and differences in natural resources or industrial composition. We find that accounting for these multiple unobserved factors changes the effects of the gravity variables, making all of them to have the expected signs while keeping most of them to be statistically significant.

Literature review:
This paper contributes to the econometric panel data and network data literatures. Regarding the panel data literature, our statistical analysis relies on the recent developments in fixed effects methods. We refer to Fernández-Val and Weidner (2018) for a recent review on fixed effects estimation of nonlinear panel models with additive individual and time effects, and to Bai and Wang (2016) for a recent review on fixed effects estimation of linear factor or interactive effects panel models. Since the first draft of this paper appeared in Chen et al. (2014), Boneva and Linton (2017) and Ando and Bai (2016) have considered special cases of nonlinear factor models. Boneva and Linton (2017) analyzed a probit model using the common correlated random effects approach of Pesaran (2006), and Ando and Bai (2016) a logit model using a Bayesian approach with data augmentation. Our analysis is different in the modeling assumptions and estimation method. 1 The most closely related work is Wang (2018). This paper derives the asymptotic distribution of the estimators of the factors and loadings in nonlinear single index models without covariates. By contrast, we focus on covariate coefficients and average partial effects and treat the factors and loadings as nuisance parameters. Accordingly, we view our results as complementary to the results in Wang (2018).
In terms of the network literature, our paper is related to the recent work on the application of panel fixed effects methods to network data including Fernández-Val and Weidner (2016) Graham (2017), and Yan (2018). These papers account for degree heterogeneity by including additive unobserved sender and receiver effects. Additive effects, however, do not capture other network features such as homophily in latent factors and clustering. Graham (2016) considered a binary response model of network formation with all these features plus state dependence, for the case where the network is observed at multiple time periods. Compared to Graham (2016), our method can capture all these features, except for state dependence, applies to ordered and count outcomes in addition to binary outcomes, and only requires observing the network at one time period. A stream of the statistic literature has considered nonlinear factor network models using a random effects approach including Hoff et al. (2002), Hoff (2005), Krivitsky et al. (2009), andHandcock et al. (2007). Unlike the fixed effects approach that we adopt, the random effects approach assumes independence between covariates and factors and between covariates and loadings. This assumption is regarded as implausible for most economic applications where the loadings reflect unobserved individual heterogeneity and 1 We refer to Boneva and Linton (2017) and Ando and Bai (2016) for more detailed comparisons with our analysis. some of the covariates are individual choice variables. There is also a recent econometric literature on structural models of strategic network formation where the main focus is on identification. We refer to de Paula (2017) for an excellent up-to-date review on this topic. The focus of our paper is on estimation and inference.
Finally, there is an extensive literature in international economics on the estimation of the gravity equation including Harrigan (1994), Eaton and Kortum (2001), Anderson and van Wincoop (2003), Santos Silva and Tenreyro (2006), Helpman et al. (2008), Charbonneau (2012) and Jochmans (2017). We refer to Head and Mayer (2014) for a recent review on this literature. These papers estimate models with additive unobserved sender and receiver country effects to account for scale or multilateral resistence effects. Our innovation to this literature is the inclusion of multiple unobserved factors to account for not only scale effects, but also unobserved partnerships, and homophily induced by differences in natural resources, industrial composition or other country characteristics.
To sum-up, our paper makes the following contributions. First, we derive asymptotic theory for fixed effects estimators of model parameters and APEs in a class of nonlinear single-index factor models that include logit, probit, ordered probit and Poisson models. Second, we provide bias corrections for fixed effects estimators of model parameters and APEs. Third, we propose an estimator of the number of factors in nonlinear single-index models with factor structure. Fourth, we bring in the factor structure to model important features of network data such as reciprocity, degree heterogeneity, homophily in latent factors and clustering in a reduced form fashion. Fifth, we apply our methods to the estimation of a gravity equation of trade between countries and confirm the importance of the gravity variables even after conditioning on multiple unobserved latent factors.

Outline:
In Section 2, we introduce the model and estimators. Section 3 discusses the statistical issues in the estimation and inference of factor models with a simple example. Section 4 derives asymptotic theory for our estimators. Section 5 provides implementation details for the estimators of the parameters and number of factors. Section 6 describes the results of the empirical application to the gravity equation and a calibrated simulation. The proofs of the main results and other technical details are given in the Appendix.

Model
We observe the data {(Y ij , X ij ) : (i, j) ∈ D}, where Y ij is a scalar outcome variable and X ij is a d x -dimensional vector of covariates. The subscripts i and j index individuals and time periods in traditional panels, but they might index different dimensions in other data structures such as network data. In our empirical application, for example, we use country trade network data where Y ij is the volume of trade between country i and country j, and X ij includes gravity variables such as the distance between country i and country j. We assume that the outcome is generated by where f is a known density function with respect to some dominating measure, β is d x -dimensional parameter vector, and α i and γ j are R-vectors of unobserved effects. We collect these effects in the I × R matrix α = (α 1 , . . . , α I ) , and the J × R matrix γ = (γ 1 , . . . , γ J ) , which are further stacked in the R(I + J)-vector φ n = (vec(α) , vec(γ) ) . We make explicit in φ n that the number of unobserved effects changes with the sample size because it will have important effects on the asymptotic theory. We assume that the dimension of the unobserved effects R is known, and provide a practical method to estimate R in Section 5. The effects α i and γ j are unobserved factors and factor loadings. In panel data they represent individual and time effects that in economic applications capture individual heterogeneity and aggregate shocks, respectively. In network data α i and γ j represent unobserved characteristics of senders and receivers that affect the network flow. The model is semiparametric because we do not specify the distribution of the unobserved effects nor their relationship with the covariates. This flexibility is important for economic applications where some of the covariates are choice variables with values determined in part by the unobserved effects. The conditional distribution f represents the parametric part of the model.
The model has a single-index specification because the covariates and unobserved effects enter f through the index z ij = X ij β + α i γ j . The parameter β is a quantity of interest because it measures the effect of the covariates on the distribution of the outcome controlling for the unobserved effects. For example, in network data β can measure homophily in an observable characteristic W if X ij includes (W i − W j ) 2 as one of its components. The unobserved effects have a factor or interactive structure because they enter the index z ij multiplicatively through π ij = α i γ j . The standard additive structure α 1i + γ 1j can be seen as a special case of the factor structure with R = 2, α i = (α 1i , 1) , and γ j = (1, γ 1j ) . More generally, in panel data applications the factor structure allows one to incorporate multiple aggregate shocks γ t with heterogeneous effects across agents α i , or multidimensional individual heterogeneity α i with time-varying returns γ t . For example, we can have productivity and monetary shocks with heterogeneous effects across industries, or multiple dimensions of individual ability and skills with time-varying returns in the labor market.
One of the contributions of the paper is to introduce factor structures to network data. In this case the factor structure serves to capture important network features in an unspecified or reduced-form fashion. For example, degree heterogeneity can be captured with the additive structure α 1i + γ 1j mentioned above, and reciprocity by allowing Y ij to be arbitrarily related to Y ji even after conditioning on the covariates and unobserved effects. Another important feature is homophily in latent factors, in addition to the homophily on observed factors captured by X ij .
Assume that there is a latent factor ξ i such that the flow between i and j increases or decreases with the distance between ξ i and ξ j as measured by (ξ i − ξ j ) 2 . This type of homophily can also be captured by a factor structure with R = 3, α i = (ξ 2 i , 1, −2ξ i ) and γ j = (1, ξ 2 j , ξ j ). The factor structure can also account for clustering or transitivity of links due to latent factors. Assume that there is a cluster of individuals such that there are more flows within the cluster. This would be captured by a factor structure with R = 1, α i = ξ i I i and γ j = χ j I j , where ξ i and χ j are positive cluster effects on the sender and receiver, and I i is an indicator for cluster membership. The factor structure can also account for combinations of these network features. Indeed, one of its advantages is that the researcher has the flexibility of specifying some features and leaving other features unspecified. For example, in the trade application we use a specification that includes additive effects to account explicitly for degree heterogeneity and multiple interactive effects to account for the possibility of having homophily in latent factors and clustering without explicitly modelling any of them.
We consider three running examples throughout the analysis: Example 1 (Linear model). Let Y ij be a continuous outcome. We can model the conditional distribution of Y ij using the Gaussian linear model where ϕ is the density function of the standard normal and σ is a positive scale parameter.
Example 2 (Binary response model). Let Y ij be a binary outcome and F be a cumulative distribution function of the standard normal or logistic distribution. We can model the conditional distribution of Y ij using the probit or logit model Example 3 (Count response model). Let Y ij be a count or non-negative integer-valued outcome, and ψ(·; λ) be the probability mass function of a Poisson random variable with parameter λ > 0.
We can model the conditional distribution of Y ij using the Poisson model

Average Partial Effects
In addition to the model parameter β, we might be interested in average partial effects (APEs).
These effects are averages of the data, parameters and unobserved effects. They measure the effect of the covariates on moments of the distribution of the outcome conditional on the covariates and unobserved effects. The leading case is the conditional expectation, where the partial effects are differences or derivatives of this expression with respect to the components of X ij . We denote generically the partial effects by ∆(Y ij , X ij , β, α i γ j ) = ∆ ij (β, α i γ j ), where the restriction that they depend on α i and γ j through π ij is natural given the model for the conditional density of Y ij . We allow the partial effect to depend on Y ij to cover scale and other parameters not included in the single-index. The APE is Example 1 (Linear model). The variance σ 2 in the linear model can be expressed as an APE Example 2 (Binary response model). If X ij,k , the kth element of X ij , is binary, its partial effect on the conditional probability of Y ij is

4)
where β k is the kth element of β, and X ij,−k and β −k include all elements of X ij and β except for the kth element. If X ij,k is continuous and F is differentiable, the partial effect of X ij,k on the conditional probability of Y ij is Example 3 (Count response model). If X ij,k , the kth element of X ij , is binary, its partial effect on the conditional probability of Y ij in the Poisson model is where β k is the kth element of β, and X ij,−k and β −k include all elements of X ij and β except for the kth element. If X ij,k is continuous, the partial effect of X ij,k on the conditional expectation (2.7)

Fixed effects estimator
We adopt a fixed effects approach and treat the unobserved effects φ n as a vector of nuisance parameters to be estimated. Let be the conditional log-likelihood function of the data constructed from the parametric part of the model. The fixed effects estimator is This problem has a unique solution with probability one for β under the assumption that z → log f (· | z) is concave. This assumption holds for all the cases that we consider including logit, probit, ordered probit and Poisson models. The solution for φ n is only unique up to normalization -see Remark 1 below. Obtaining the solution to (2.8) can be computationally challenging because the objective function is not concave in the parameter φ n and the high-dimensionality of the parameter space. In Section 5 we provide an iterative method based on Chen (2014) to obtain the estimates. This method performs well in simulations.
Remark 1 (Normalization). As in linear factor models, the solution to the problem (2.8) for φ n = (vec(α) , vec(γ) ) is only unique up to normalization because the log-likelihood function is invariant under the transformation α → αA and γ → γA −1 for any non-singular R×R matrix A.
The estimators β and δ are invariant to the normalization used to eliminate this indeterminancy.
Moreover, we can always reparametrize the model in (2.1) with respect to φ n in a way that the true value of φ n satisfies the adopted normalization. This invariance allows us to choose different normalizations for different purposes. For example, we use a standard normalization for linear factor models in the computation of the estimators, whereas we employ another normalization to derive the asymptotic distributions of the estimators in the Appendix. We refer to Robertson and Sarafidis (2015) for a discussion on the effect of the normalization in the context of linear factor models.

A Simple Motivating Example
We illustrate the statistical issues that arise in the estimation of factor models with a simple example. This example is analytically tractable and might be of practical interest as it provides an estimator of the variance of a random variable in network and panel data allowing for flexible patterns of dependence. The analysis in this section is mainly heuristic leaving technical details such as the derivation of the orders of some remainder terms in the asymptotic expansions for Section 4.
Consider a version of Example 1 without covariates where Y ij | φ n ∼ N (α i γ j , σ 2 ). Assume that the observations Y ij are independent over i and j, and that there is no missing data, i.e. D = D 0 . The quantity of interest is the scale parameter σ 2 , which can be treated as an APE.
This is a linear factor model where φ n can be obtained using the principal component algorithm of Bai (2009). Then, the plug-in estimator of σ 2 is To analyze the properties of σ 2 , it is useful to consider an asymptotic expansion of α i γ j around α i γ j as I, J → ∞. This yields where ≈ means equal up to terms of lower order. Plugging this expansion in (3.1) shows that σ 2 behaves asymptotically as a sample variance with R(I + J) estimated fixed effects corresponding to the α i 's and γ t 's. Then, standard degrees of freedom calculations give We carry out 50,000 simulations with σ 2 = 1, and α i and γ j drawn independently from multivariate normal distributions with mean zero and covariance function I R , the identity matrix of order R. Table 1 compares the bias of σ 2 with the asymptotic approximation (3.2) in datasets with I, J ∈ {10, 25, 50}, and R ∈ {1, 2, 3}. We only report the results for J ≤ I since all the expressions are symmetric in I and J. Comparing the two rows in each panel of the table, we find that the asymptotic bias provides a very accurate approximation to the finite-sample bias of the estimator for all the sample sizes and numbers of factors.
The bias of σ 2 can be removed using analytical and split-sample methods. Thus, an analytical bias corrected estimator can be formed as A split-sample bias corrected estimator can be formed as where · and · are the ceil and floor functions. As in nonlinear panel data, we expect these corrections to remove most of the bias of the estimator without increasing dispersion. Moreover, constructing confidence intervals around the corrected estimators should help bring coverage probabilities close to their nominal levels. We confirm these predictions in a numerical simulation. Table 2 reports the bias, standard deviation and RMSE of the uncorrected and bias corrected estimators, together with coverage probabilities of 95% confidence interval constructed around them. The results are based on 50,000 simulations of datasets generated as in Table 1 with I, J ∈ {10, 25, 50}, and R = 3. The confidence intervals around the estimator σ 2 ∈ { σ 2 , σ 2 ABC , σ 2 SBC } are constructed as σ 2 (1 ± 1.96 2/(IJ)), where we use that the asymptotic variance of all the estimators is 2σ 4 /(IJ). We find that the corrections offer huge improvements in terms of bias reduction and coverage of the confidence intervals. The corrections increase the dispersion for small sample sizes, but always reduce the RMSE. In this case the analytical correction slightly outperforms the split-sample correction.

Asymptotic Theory
We derive the asymptotic distribution of the estimators of the model parameter and APEs under sequences where I and J grow with the sample size at the same rate. We focus on these sequences because they are the only ones that deliver a non-degenerate limit distribution. Moreover, they are very natural choices for network data where I = J. Throughout this section, all the stochastic statements are conditional on the realization of the unobserved effects φ n and should therefore be qualified with almost surely. We shall omit this qualifier to lighten the notation.

Model parameter
We consider single-index models with strictly exogenous covariates and unobserved effects that enter the density of the outcome through z ij = X ij β + π ij , where π ij = α i γ j . These models cover the linear, probit and Poisson specifications of Examples 1-3. We focus on strictly exogenous covariates because for some data structures of interest such as network data there is no natural ordering of the observations. The results can be extended to predetermined covariates when one of the dimensions is time, see the earlier version of the paper . Let be the conditional log-likelihood coming from the parametric part of the model. We denote the denote the values of β, α i , γ j , and π ij that generated the data. We drop the argument z ij when the derivatives are evaluated at the true value of the index z 0 ij := X ij β 0 +π 0 ij , i.e., ∂ z q ij := ∂ z q ij (z 0 ij ). Let X = {X ij : (i, j) ∈ D}, α 0 = (α 0 1 , . . . , α 0 I ) , and γ 0 = (γ 0 1 , . . . , γ 0 J ) .
We make the following assumptions: Assumption 1 (Nonlinear Factor Model). Let ε > 0 and let B 0 ε be a bounded subset of R that contains an ε-neighborhood of z 0 ij for all i, j, I, J.
The number of factors R is known.
(ii) Asymptotics: we consider limits of sequences where I n /J n → κ 2 , 0 < κ < ∞, as n = |D| → ∞. We shall drop the indexing by n from I n and J n in the following.

(iii) Smoothness and moments
, q ≤ 4, are uniformly bounded over I, J for some ν > 0. In addition, X ij is bounded uniformly over i, j, I, J.
(iv) Concavity: for all I, J, the function z → ij (z) is strictly concave over z ∈ R a.s. Furthermore, there exist positive constants b min and b max such that for all z ∈ B 0 ε , b min ≤ −∂ z 2 ij (z) ≤ b max a.s. uniformly over i, j, I, J.
(v) Strong factors: (vi) Generalized non-collinearity: for any matrix A, define the coprojection matrix as M A := where I denotes the identity matrix of appropriate size and the superscript † denotes the Moore-Penrose generalized inverse. Let α 0 := (α 0 1 , . . . , α 0 I ) and X k be a I × J matrix with elements X ij,k , i = 1, . . . , I, j = 1, . . . , J. The d x ×d x matrix D(γ) with elements (vii) Missing data: there is a finite number of missing observations for every i and j, that is, The two cases considered in Assumption 1(i) are designed for different data structures. Case (b) is more suitable for network data because it allows for reciprocity between the observations (i, j) and (j, i), whereas case (a) is more suitable for panel data where there is no special relationship between these observations. Assumption 1(i) also imposes that the number of factors is known. We provide a practical method to choose the number of factors in Section 5. We also recommend checking the sensitivity to this number by reporting the maximum value of the average log-likelihood and the parameter estimates for multiple values of R. We provide an example in the empirical application of Section 6. Assumption 1(i) − (iii) are similar to Fernández-Val and Weidner (2016), so we do not discuss them further here. The concavity condition in Assumption 1(iv) holds for the logit, probit, ordered probit and Poisson models. The strong factor and generalized noncollinearity conditions in Assumption 1(v) − (vi) were previously imposed in Bai (2009) and Weidner (2015, 2017) for linear models with interactive effects. Generalized noncollinearity rules out covariates that do not display variation in the two dimensions of the dataset. Boneva and Linton (2017) and Ando and Bai (2016) impose very similar conditions to Assumption 1, so we refer to these papers for further discussion.
We introduce more notation that is convenient to simplify the expressions in the asymptotic distribution. Let Ξ ij be a d x -dimensional vector defined by the following population weighted least squares projection for each component of E(∂ z 2 ij X ij ), Also define the residual of the projectionX Finally, let E := plim I,J→∞ , D i := {j : (i, j) ∈ D} and D j := {i : (i, j) ∈ D}.
The following theorem establishes the asymptotic distribution of β defined in (2.8).
Theorem 1 (Asymptotic distribution of β). Suppose that Assumption 1 holds, that the following limits exist Remark 2 (Panel Data). In case (a) of Assumption 1(i), the asymptotic variance of β simplifies by the fact that the scores ∂ z ijXij and ∂ z jiXji are uncorrelated and the information equality.
Theorem 1 shows that β is consistent and normally distributed, but can have bias of the same order as its standard deviation. The scaling factors in the expressions for B ∞ and D ∞ are such that those expressions are of order one, for example, we can express B ∞ equivalently as where all sums explicitly appear as part of a sample average. We verify the presence of bias in our running examples.
Example 1 (Linear model). In this case Example 2 (Binary response model). In this case for any function G and j = 0, 1, 2. Substituting these values in the expressions of the bias of Theorem 1 for the probit model yields The asymptotic bias is therefore a positive definite matrix weighted average of the true parameter value as in the case of the probit model with additive individual and time effects in Fernández-Val and Weidner (2016). The bias grows linearly with the number of factors because and E (∂ z 2 ij ) and E ∂ z 2 ijXijX ij are bounded uniformly in i, j.
Example 3 (Count response model). In this case where the symbol ! denotes the factorial function, so that Substituting these values in the expressions of the bias of Theorem 1 yields

Average Partial Effects
We use additional assumptions to derive the asymptotic distribution of the estimator of the APEs.
They involve smoothness conditions on the partial effect function (β, π) → ∆ ij (β, π) needed to obtain the limit distribution of δ from the limit distribution of ( β, φ n ) via delta method. For a vector of nonnegative integer numbers Assumption 2 (Partial effects). Let > 0, and let B 0 ε be a subset of R dx+1 that contains an ε-neighborhood of (β 0 , π 0 ij ) for all i, j, I, J.
We are now ready to present the asymptotic distribution of δ defined in (2.9).
Theorem 2 (Asymptotic distribution of δ). Suppose that the assumptions of Theorem 1 and Assumption 2 hold, and that the following limits exist: Remark 3 (Panel Data). In case (a) of Assumption 1(i), the term involving the cross products Γ ji Γ ij drops out from the asymptotic variance V δ ∞ .
Theorem 2 shows that δ is consistent and normally distributed, but can have bias of the same order as its standard deviation. The first two terms of the bias come from the bias of β. They drop out when either β does not have bias or the APE is estimated from a bias corrected estimator of β. We verify the presence of bias in two of the running examples.
Example 1 (Linear model). In this case B ∞ = D ∞ = 0 and Substituting these values in the expressions of the bias of Theorem 2 yields where we use (4.2). This result formalizes the analysis in Section 3 Example 2 (Binary response model). Let ∆ ij (β, π) be as defined in either (2.4) or (2.5). Using the notation previously introduced for this example, the expressions of B As for the model parameter, these bias terms grow linearly with the number of factors R.

Bias correction and Inference
Theorems 1 and 2 establish that the estimators of the model parameter and APEs have a bias of the same order as their standard deviations in some models. In this section, we describe how to apply recent developments in nonlinear panel data to correct the bias from the estimators. To simplify the notation we assume that there is no missing data. 2 We consider a generic estimator θ of the parameter θ, which may correspond to the model parameter or an APE. In this notation, Theorems 1 and 2 show that θ can have a bias The intuition behind this structure is that there are J observations that are informative to estimate each α i and I observations that are informative to estimate each γ j .
An analytical correction based on Hahn and Newey (2004) and Fernández-Val and Weidner (2016) can be formed as A split-sample correction based on Dhaene and Jochmans (2015) and where θ −i is the estimator in the subpanel {(k, j) : k = 1, . . . , I; j = 1, . . . , I, k = i, j = i}, that is, the original panel leaving out the observations corresponding to the entity i as either sender or receiver.
The discussion of bias correction so far is applicable very generally to network and panel models with two-way fixed effects. We now specialize it to our nonlinear models with interactive fixed effects. For the analytic bias correction and for variance estimation we require consistent estimators for the quantities B ∞ , D ∞ , W ∞ , and Σ ∞ defined in Theorem 1. Let B, D, W and Σ be the corresponding sample analogs, obtained by simply dropping expectations and plugging in the fixed effect estimators for the true value of the parameters. For example, Once those sample analogs are constructed, then the analytic bias correction of β reads Analogously, we can construct sample analogs for B δ ∞ , D δ ∞ , (D β ∆) ∞ , defined in Theorem 2, in order to construct δ ABC . Also, let V δ be the sample analog of V δ ∞ .
Theorem 3 (Asymptotic Distribution of β ABC and δ ABC ). Under the conditions of Theorem 1, If, in addition, the conditions of Theorem 2 hold, then Theorem 3 shows that analytic bias correction can be used to obtain estimators of β 0 and δ 0 that are asymptotically unbiased. It also shows that the simple plug-in estimators of the asymptotic variances are consistent, thus allowing to perform asymptotically valid hypothesis tests and to construct asymptotically valid confidence intervals for β 0 and δ 0 .
Showing that the Jackknife corrected estimators β JBC and δ JBC have the same asymptotic distribution as β ABC and δ ABC requires an additional homogeneity assumption, which guarantees that the unconditional distribution of the data is stationary across i and j. This assumption ensures that the terms B and D in the bias expansion of θ are the same as in the bias expansions of the half-panel estimatesθ I,J/2 andθ I/2,J , so that forming the Jackknife linear combination θ SBC indeed cancels those bias terms. In other words, the data distribution should not systematically differ across the subsamples used for the Jackknife correction (Dhaene and Jochmans, 2015; Fernández-Val and Weidner, 2016).
The derivation of the asymptotic distribution of the leave-one-out correction θ NBC furthermore requires a third-order bias expansion (i.e., up to terms of order 1/I 2 ), because in the expression of θ NBC the estimators θ andθ I−1 are multiplied by the factors I and (I − 1) that grow with the sample size. We have not worked out those higher-order expansion here, but we refer to Sun and Dhaene (2017) for an example of higher-order expansions in nonlinear panel models.
Chen (2014) analyzed the convergence guarantees for this algorithm. She showed that the algorithm converges to a local maximum of the log-likelihood. Since the log-likelihood can have multiple local maxima, we recommend to run the algorithm for several initial values and choose the solution that yields the highest value of the log-likelihood.
Remark 4 (Additive Effects). Separate additive effects in both dimensions can be treated as one known factor of ones with unknown loading and one known loading of ones with unknown factor. They can therefore be included by imposing the constraints that the second column of α (k) and the first column of γ (k) are equal to vectors of ones in part (b) of step (ii). Other known factors with unknown loadings or known loadings with unknown factors can be incorporated similarly by imposing constraints in part (b) of step (ii).

Estimating the Number of Factors
The problem of estimating the number of factors R has been extensively discussed for linear factor models without covariates, see for example, Bai and Ng (2002) Ahn and Horenstein (2013). These methods can be extended to linear models with covariates, provided that an appropriate preliminary estimator β of the regression parameters β is available that does not require knowing R. In this case the existing methods are applied to the residuals Y ij − X ij β. If there exists an upper bound for the number of factors, R max ≥ R, then the preliminary estimator β is given by the least squares estimator with R max factors, see Moon and Weidner (2015). These methods can also be extended to the nonlinear factor models that we consider. For example, the various information criteria in Bai and Ng (2002) are all based on minimizing the sum of squared residuals plus a penalty function, and can be adapted to the likelihood problem in the spirit of classic model selection criteria (AIC, BIC, etc), see Ando and Bai (2016) for an example of this approach. 3 It is less obvious, however, how to extend the eigenvalue ratio (ER) test of Ahn and Horenstein (2013) to nonlinear models. This method is attractive because it does not depend on somewhat arbitrary functional form assumptions or tuning parameters. It only requires to specify R max , but there is no penalty function or any other tuning parameter. Assuming that there exists an upper bound R max > R, we propose adapting this method to nonlinear factor single-index models using the following algorithm: Algorithm 2 (Estimation of R). (1) Obtain preliminary estimates β, α and γ using Algorithm 1 with R = R max . (2) Compute preliminary estimates of the factor structure as the I × J matrix π with elements π ij := α i γ j . By construction, rank( π) ≤ R max . (3) Apply the eigenvalue ratio criterion of Ahn and Horenstein (2013) to π in order to estimate R, that is, where λ r ( π π ) denotes the r'th largest eigenvalue of π π .
Remark 5 (Additive Effects). When the specification includes factors with known loadings and/or loadings with known factors, π ij is the estimator of the part of the factor structure that does not contain known factors and known loadings and R max refers to the number of factors in this part.
This algorithm can be seen as a natural generalization of the Ahn and Horenstein (2013) to single-index models. Indeed, if we applied it to the linear model which corresponds to the eigenvalue ratio criterion of Ahn and Horenstein (2013) applied to the residuals Y ij − X ij β. Based on this coverage of the linear model, we conjecture that R is a consistent estimator of R under suitable conditions. To formalize this argument, a key step is to establish the consistency of the preliminary estimator β, extending the results of Moon and Weidner (2015) from linear to nonlinear models, and the properties of the estimator of the factor structure π. The main technical challenge is to characterize π, which is not even available for the  linear model with covariates and R > R 0 . We leave this analysis to future research. In the rest of the section we show that the method performs well in numerical simulations.
6 Application to Gravity Equation

Gravity Equation with Multiple Latent Factors
The gravity equation is a fundamental empirical relationship in international economics. We estimate a gravity equation of trade between countries using data from Helpman et al. (2008) on bilateral trade flows and other trade-related variables for 157 countries in 1986. 4 The data set contains a network of trade data where both i and j index countries as senders (exporters) and receivers (importers), such that I = J = 157. The outcome Y ij is the volume of trade in thousands of constant 2000 US dollars from country i to country j, and the covariates X ij include determinants of bilateral trade flows such as the logarithm of the distance in kilometers between country i's capital and country j's capital and indicators for common colonial ties, currency union, regional free trade area (FTA), border, legal system, language, and religion. Table 4 reports  We estimate a Poisson model with the following specification of the intensity where α 2i and γ 2i are R 2 -dimensional vectors of factors and factor loadings. This model is a special case of Example 3 with α i = (α 1i , 1, α 2i ) , γ j = (1, γ 1j , γ 2j ) , and R = 2 + R 2 . We explicitly include additive importer and exporter effects to account for scale and multilateral resistance effects following Eaton and Kortum (2001) and Anderson and van Wincoop (2003). Moreover, we also include interactive country effects to capture possible clustering and homophily induced by latent factors such as country trade partnerships, presence of multinationals or immigrant communities, or differences in natural resources or industrial composition. Table 5 reports the estimates and standard errors of the parameter β. 5 We consider specifications with different number of interactive effects, R 2 , in addition to the additive effects . The last row of the table reports the maximum value of the average log-likelihood, L( β, φ n )/n. We report two sets of standard errors corresponding to the dependence structures of cases (a) and (b) of Assumption 1(i). The standard errors in brackets account for possible reciprocity in the data. In this case, the method of Section 5 selects R 2 = 3 factors when R max = 4 and R max = 5. We take R 2 = 3 as our preferred specification, but we also note that, relative to the standard errors, the estimates are not very sensitive to the R 2 in the range of values that we consider. One possible concern with the use of the Poisson model in the trade application is the excess zeros, i.e. the high probability of zero trade. 6 In this case, however, it does not seem to be a problem because the estimated model with R 2 = 3 predicts a probability of zero trade of 0.61, which is higher than the observed probability of 0.55.
We find that the sign of most of the effects is robust to the inclusion of latent factors. The only exceptions are the effects of common religion and language, which in the specification with only additive effects have counterintuitive negative signs that turn positive in our preferred specification. Comparing across columns, we observe that the model without factors seems to exaggerate the role of common border, whereas it downplays the effect of distance and colonial links. For example, increasing by 10% the distance reduces by 6.9% the volume of trade and sharing border increases it by 36% according to our preferred specification with R 2 = 3, whereas the same effects are 6% and 71% according to the specification with R 2 = 0. Except for language, all the coefficients are individually significant at the 5% level. Overall, increasing the number of factors makes the estimates less precise due to the loss of degrees of freedom. This observation showcases 5 We do not report estimates of APEs because in the specification of the Poisson model that we use the parameters can be interpreted as elasticities. 6 We thank an anonymous referee for raising this issue.  Notes: all the columns include importer and exporter additive effects.
Standard errors in parenthesis. Standard errors robust to reciprocity in brackets. * Number of factors selected with R max = 5. Log-likelihood is multiplied by 100. a trade-off in estimation between efficiency and robustness to richer dependence structures in the unobservables. Finally, accounting for reciprocity slightly increases the standard errors, but does not change the statistical significance of the estimates.

Calibrated Monte Carlo Simulation
We evaluate the finite-sample properties of our estimation and inference methods in a Monte Carlo simulation that mimics the trade application. The design is calibrated to the Poisson model with additive importer and exporter country effects and one factor. We analyze the performance of the estimator of β in terms of bias, dispersion and inference accuracy. To speed up computation, we include only one covariate: the log distance. More specifically, we generate Y ij from a Poisson distribution with intensity exp(X ij β + α 1i + γ 1j + α 2i γ 2j ) independently across i and j, where X ij takes the values of log-distance in the trade data set, and β and { α 1i , α 2i , γ 1i , γ 2i } 157 j=1 , are equal to the estimates of the parameter, importer effects, exporter effects, factors and factor loadings.
We repeat this procedure in 1, 000 simulations for four different sample sizes: I = 50, I = 75, I = 100 and I = 157 (full sample in the application). For each sample size and simulation, we draw a random sample of I countries both as importers and exporters without replacement, so that the number of observations is I ×(I −1). For each simulated sample, we reestimate the model parameter and standard errors, and construct 95% confidence interval for the model parameter. Table 6 reports the bias (Bias), standard deviation (SD), and root mean squared error (RMSE) of the estimator of the parameter β, together with the ratio of average standard error to the simulation standard deviation (SE/SD), and the empirical coverage in percentage of a confidence interval with 95% nominal value (p;95). We estimate models with four different numbers of factors in addition to the additive effects, R 2 ∈ {1, 2, 3, R * 2 }, where R * 2 is the number of factors selected by the method of Section 5 with R max = 4, which can vary across simulations. The results for the bias, SD and RMSE are reported in percentage of the true parameter value. We find that the bias is smaller than the standard deviation for every sample size. When we use the true number of factors R 2 = 1, the confidence intervals cover the parameter in more than 95% of the simulations. The excess coverage is due to the overestimation of the dispersion of the estimators by the standard errors. Selecting the number of factors does not introduce bias, but increases the dispersion of the estimator of the parameter. The additional variability yields slight undercoverage of the confidence intervals for small sample sizes. On the other hand, adding unnecessary factors to the specification increases the bias and dispersion of the estimator, but the confidence intervals continue having good coverage properties. This robustness to the inclusion of too many factors is consistent with the theoretical results of Moon and Weidner (2015) for linear factor models. Overall, the simulations show that the asymptotic theory of Section 4 provides a good approximation to the finite-sample behavior of the estimator.

A.1 Notation and Normalization
Remember the log-likelihood defined in the main text, and also define the rescaled version, For the true value of the fixed effect parameters φ 0 = (vec(α) 0 , vec(γ 0 ) ) we impose the normal- Imposing φ ∈ Φ is an infeasible normalization, because the true value of the parameters appears in the definition of Φ. However, all our final asymptotic results are on the estimators β and δ, which are invariant to the chosen normalization for φ, that is, those results on β and δ also hold unchanged for any other normalization, and imposing φ ∈ Φ is simply a matter of convenience for the following proofs. There is always a need for a normalization choice when estimating the factor loadings and factors in interactive fixed effect models, because the model only depends on the product α i γ j , which is unchanged under the transformation for some invertible R × R matrix A. Notice that in the definition of Φ there are R 2 normalization constraints, which is exactly enough to uniquely determine the R 2 continuous degrees of freedom of the matrix A. In addition, there is still a discrete sign change possible (α i → −α i and γ j → −γ j ), and we assume in the following that this discrete choice is specified somehow (e.g. by imposing α 11 > 0) to make the estimator φ unique. The details of this final discrete choice do not matter, as long as the same sign normalization is imposed on φ and φ 0 .
Our normalization constraints in the definition of Φ are linear in φ. It is this linearity which makes this particular normalization attractive for our purposes. In particular, instead of imposing this normalization directly we can also impose it via a quadratic penalty function by defining the penalized objective function where b > 0 is some constant, and V is a d φ × R 2 matrix, which depends on α 0 and γ 0 , and is implicitly defined by Thus, the above penalty term can also be expressed as where . F denotes the Frobenius norm. The constrained estimator in (A.1) can then equivalently be obtained by solving the unconstrained problem and we also define
Proof of Lemma 1. For all z 1 , z 2 ∈ B 0 ε a second order Taylor expansion of ij (z 1 ) around z 2 gives wherez ∈ [min(z 1 , z 2 ), max(z 1 , z 2 )]. Let e ij := ∂ z ij /b min . Using (A.4) we find that Note that the penalty term of the objective function does not enter here, because it is zero when evaluated both at the estimator and at the true values of the parameters. Let e be the I × J matrix with entries e ij . Let X k be the I × J matrix with entries X k,ij , k = 1, . . . , d β . Let β · X = k β k X k . In matrix notation, the above inequality reads Analogous to the consistency proof for linear regression models with interactive fixed effects in Bai (2009) and Moon and Weidner (2017) we can conclude that where we used that e.g.
Tr X k P α 0 e ≤ rank X k P α 0 e X k P α 0 e ≤ X k e , Tr e P α 0 e ≤ rank e P α 0 e e P α 0 e ≤ e 2 .
Lemma D.6 in Fernández-Val and Weidner (2016) shows that under Assumption 1, ∂ z = O P (I 5/8 ), where ∂ z is the I × J matrix with entries ∂ z ij . We thus also have e = O P (I 5/8 ).

A.3 Inverse Expected Incidental Parameter Hessian
We define the expected incidental parameter Hessian for the log-likelihood with and without the penalty term as Our definition of L * (β, φ) = n −1/2 L(β, φ) includes the factor n −1/2 , which makes sure that the eigenvalues of H * remain of order one asymptotically as I, J → ∞ at the same rate. Similarly, the factor 1/ √ n in the second term of H makes sure that the eigenvalues of b √ n V V remain of order one asymptotically. The Hessian matrix H * has R 2 zero eigenvalues corresponding to the R 2 flat directions in the log-likelihood described by the transformations (A.2) that leave the likelihood unchanged. Correspondingly, the matrix V V is exactly of rank R 2 , making sure that H has no more zero eigenvalues and is invertible, as formally shown by Lemma 2 below. Those considerations explain why we have chosen the penalty term b 2 φ V V φ and the pre-factor n −1/2 in our definition of L(β, φ) in (A.3) above.
Here, H * (αα) is a block-diagonal IR × IR matrix with R × R diagonal blocks, and H * (γγ) is a block-diagonal JR × JR matrix with R × R diagonal blocks, that is wpa1 (with probability approaching one), where existence of c > 0 is guaranteed by our strong factor Assumptions 1(v). The result of the last display implies that We have thus obtained a spectral bound for H −1 . This turns out to be the key step in the proof.
The remainder of the proof is just a relatively straightforward expansion of H −1 . Namely, using and therefore From the expressions for D and A above one finds that D is block-diagonal with entries of order one, and A max = O(n −1/2 ), which implies A

A.4 Local Concavity of the Objective Function
The consistency results for β and φ(β) in Lemma 1 provide initial convergence rates, implying that we only need to consider a shrinking neighborhood around β 0 and φ 0 for the remaining asymptotic analysis. The following lemma shows that the objective function L(β, φ) is strictly concave in such a local neighborhood. Later in the proof this strict concavity will allow us to apply the general expansion results in Fernández-Val and Weidner (2016).
Analogously to the expected incidental parameter Hessian H at the true parameters that was discussed above, we now introduce the following notation for incidental parameter Hessian (without expectations, and not necessarily at the true parameters), Lemma 3. Let Assumption 1 be satisfied, and let r β = r β,n = o P (1) and r φ = r φ,n = o P (n 1/4 ).
Proof. Let ij (β, π ij ) := ij (z ij ), where π ij = α i γ j and z ij = X ij β + α i γ j . Then, We decompose the Hessian into the contribution from the first and from the second derivative of Notice that H(β, φ) has the same structure as H. Analogously to the bound (A.8) derived in the proof of Lemma 2 we can thus show that there exists a constant c > 0 such that wpa1 we have, for φ ∈ B(r φ , φ 0 ) and β ∈ B(r β , β 0 ), The new terms that need to be accounted for here are the first derivative terms F (β, φ), which are zero in expectation at the true parameter and therefore did not show up in our discussion of H above. The goal in the following is to show that F (β, φ) = o P (1), or equivalently F (αγ) (β, φ) = o P (1), within the shrinking neighborhood of the true parameters. Here, . refers to the spectral norm.
For ease of notation we consider R = 1 in the remainder of this proof. Then, F (αγ)ij (β, φ) = − 1 √ n ∂ π ij (β, α i γ j ). A Taylor expansion gives The spectral norm of the I × J matrix with entries ∂ β k π ij (β ij ,π ij ) is bounded by the Frobenius norm of this matrix, which is of order √ n, since we assume uniformly bounded moments for ∂ β k π ij (β ij ,π ij ). The spectral norm of the I ×J matrix with entries (α i γ j −α 0 i γ 0 j )∂ π 2 ij (β ij ,π ij ) is also bounded by the Frobenius norm of this matrix, which equals for φ ∈ B(r φ , φ 0 ) and β ∈ B(r β , β 0 ), where we also used that αγ − α 0 γ 0 Combining the result in the last display with (A.8) we find that there exists a constant c > 0 such that wpa1 we have, for φ ∈ B(r φ , φ 0 ) and β ∈ B(r β , β 0 ), We have thus shown that L(β, φ) is indeed strictly concave (or that −L(β, φ) is strictly convex) within this shrinking neighborhood.

A.5 Stochastic Expansion
Once we have the consistency result of Lemma 1 and the local strict concavity result of Lemma 3, then the derivation of the stochastic expansion of the fixed effect estimators β and δ does not rely on the specific single index and interactive fixed effect structure of our model. Some of the conceptual issues indeed become more transparent when ignoring that structure. Therefore, in this subsection, let ij (β, α i , γ j ) := ij (X ij β + α i γ j ) and ∆ ij (β, α i , γ j ) := ∆ ij (β, π ij ). Remember that our fixed effect estimators β and γ maximize the objective function , and the corresponding plug-in estimator reads δ = ∆( β, φ). For partial derivatives of ij (β, α i , γ j ) and ∆( β, φ) we use superscripts in the following, expectations are always conditional on φ and are indicated by a bar, and arguments are omitted when evaluated at the true parameters. For example, α i α i ij is the d α × d α expected Hessian matrix of ij (β, α i , γ j ) with respect to α i evaluated at the true parameters. This is the notation also used in Fernández-Val and Weidner (2018), but here the α i and γ j are vectors of length d α and d γ , respectively. For our interactive fixed effect model we have d α = d γ = R, but this is not used in the rest of this subsection. The advantage of this generality is that, for example, the following formulas are also applicable to models where in addition to the interactive effects we include separate additive effects in the single index.
It is convenient to make the log-likelihood information-orthogonal between β and the incidental parameters. This can be achieved by the transformation 7 * This transformation corresponds to the reparameterization α The log-likelihood with respect to these parameters is ij (β, α * i + ξ , which gives our definition of * ij after renaming (α * i , γ * j ) as (α i , γ j ) again.
where the d α × d β matrices ξ (α) i , and the d γ × d β matrices ξ (γ) j are a solution to the system of Analogously, let the d α -vectors ψ j be solutions to the system of equations Finally, let (1) is simply the probability limit of W , that is, Theorem 4 (Stochastic Expansion for β and δ). Let Assumption 1 be satisfied. We then have where the d β -vector U has elements Furthermore, if also Assumption 2 holds, then Proof. # Expansion of β. Our assumptions together with results of Lemma 1, 2 and Lemma 3 guarantee that the conditions of Theorem B.1 and Corollary B.2 in Fernández-Val and Weidner (2016) are satisfied, so that by applying that corollary we have Here, tilde symbols indicate deviations from expectation, for example, L βφ = L βφ − L βφ , with L βφ = EL βφ . Analogous to the proof of Theorem C.1 in Fernández-Val and Weidner (2016), and also using the above Lemma 2 again, one can then show that the terms in U (1) only contribute asymptotic bias, namely In component notation we can now rewrite the above terms as follows (remember that we define the Hessian matrix H with a negative sign) Combining the above gives the expansion for β − β 0 in the theorem.
# Expansion of δ. Again, our assumptions and lemmas guarantee that the conditions of Theorem B.4 in Fernández-Val and Weidner (2016) are satisfied, so that by applying that theorem we have Again, following the logic in the proof of Theorem C.1 in Fernández-Val and Weidner (2016) one finds that U ∆ only contributes asymptotic bias, namely In component notation we can now rewrite the above terms as follows (again, remember that we define the Hessian matrix H with a negative sign) Combining the above gives the expansion for δ − δ 0 in the theorem.

A.6 Proof of Main Text Theorems
Proof of Theorem 1. According to Theorem 4 we have √ n β − β 0 = W −1 ∞ U + o P (1). The first term in U is 1 √ n (i,j)∈D * β ij , where in main text notation we have * β ij = ∂ z ijXij . Assumption 1(i) guarantees that * β ij has mean zero (a linear combination of scores evaluated at the true parameters) and is either independent across all (i, j), or only correlated within pairs (i, j) and (j, i). This term therefore only contributes variance, no bias, to the limiting distribution of β.
Applying the Lindeberg-Levy CLT and the Cramer-Wold device we find where we use that * β ij = ∂ z ijXij . This is the formula for Σ ∞ given in Theorem 4, and this formula covers both case (a) and case (b), because independence across pairs (i, j) ↔ (j, i) is of course a special case of dependence across those pairs.
All the remaining terms in U contribute asymptotic bias but no variance. We consider case (a) of Assumption 1(i) in the following, but one can easily verify that the additional bias terms stemming from correlation across pairs (i, j) ↔ (j, i) are asymptotically negligible, so that the same asymptotic bias expressions are obtained in case (b).
Using * β k α i ij = γ 0 j ∂ z 2 ijXij,k and α i α i ih = γ 0 j γ 0 j ∂ z 2 ij and α i ij = γ 0 j ∂ z ij we obtain and also using * β k α i α i ih = γ 0 j γ 0 j E ∂ z 3 ijXij,k and the Bartlett identity Here, we also used the Bartlett identity E * β ij * β ij Analogous to the proof of Theorem 1 one can show for the bias terms that Using the above and the expansion in Theorem 4 gives the statement of Theorem 2.
Proof of Theorem 3. Under the conditions of Theorem 1, B → P B ∞ , D → P D ∞ , W → P W ∞ , and Σ → P Σ ∞ . If, in addition, the conditions of Theorem 2 hold, then also V δ → P V δ ∞ , and the sample analogs of B Once we have established the consistency of the estimators of the bias terms, the asymptotic distributions of the analytical corrections β ABC and δ ABC follow as corollaries of Theorems 1 and 2, respectively. For example, by Slutsky's theorem.