Network and Panel Quantile Effects Via Distribution Regression

This paper provides a method to construct simultaneous confidence bands for quantile functions and quantile effects in nonlinear network and panel models with unobserved two-way effects, strictly exogenous covariates, and possibly discrete outcome variables. The method is based upon projection of simultaneous confidence bands for distribution functions constructed from fixed effects distribution regression estimators. These fixed effects estimators are bias corrected to deal with the incidental parameter problem. Under asymptotic sequences where both dimensions of the data set grow at the same rate, the confidence bands for the quantile functions and effects have correct joint coverage in large samples. An empirical application to gravity models of trade illustrates the applicability of the methods to network data.


Introduction
Standard regression analyzes average effects of covariates on outcome variables. In many applications it is equally important to consider distributional effects. For example, a policy maker might be interested in the effect of an education reform not only on the mean but also the entire distribution of test scores or wages. Availability of panel data is very useful to identify ceteris paribus average and distributional effects because it allows the researcher to control for multiple sources of unobserved heterogeneity that might cause endogeneity or omitted variable problems. The idea is to use variation of the covariates over time for each individual or over individuals for each time period to account for unobserved individual and time effects. In this paper we develop inference methods for distributional effects in nonlinear models with two-way unobserved effects. They apply not only to traditional panel data models where the unobserved effects correspond to individual and time fixed effects, but also to models for other types of data where the unobserved effects reflect some grouping structure such as unobserved sender and receiver effects in network data models.
We develop inference methods for quantile functions and effects. The quantile function corresponds to the marginal distribution of the outcome in a counterfactual scenario where the treatment covariate of interest is set exogenously at a desired level and the rest of the covariates and unobserved effects are held fixed, extending the construction of Chernozhukov Fernandez-Val and Melly (2013) for the cross section case. The quantile effect is the difference of quantile functions at two different treatment levels. Our methods apply to continuous and discrete treatments by appropriate choice of the treatment levels, and have causal interpretation under standard unconfoundedness assumptions for panel data. The inference is based upon the generic method of Chernozhukov, Fernandez-Val, Melly, and Wuthrich (2016) that projects joint confidence bands for distributions into joint confidence bands for quantile functions and effects. This method has the appealing feature that applies without modification to any type of outcome, let it be continuous, discrete or mixed.
The key input for the inference method is a joint confidence band for the counterfactual distributions at the treatment levels of interest. We construct this band from fixed effects distribution regression (FE-DR) estimators of the conditional distribution of the outcome given the observed covariates and unobserved effects. In doing so, we extend the distribution regression approach to model conditional distributions with unobserved effects. This version of the DR model is semiparametric because not only the DR coefficients can vary with the level of the outcome as in the cross section case, but also the distribution of the unobserved effects is left unrestricted. We show that the FE-DR estimator can be obtained as a sequence of binary response fixed effects estimators where the binary response is an indicator of the outcome passing some threshold. To deal with the incidental parameter problem associated with the estimation of the unobserved effects (Neyman and Scott (1948)), we extend the analytical bias corrections of Fernandez-Val and Weidner (2016) for single binary response estimators to multiple (possibly a continuum) of binary response estimators. In particular, we establish functional central limit theorems for the fixed effects estimators of the DR coefficients and associated counterfactual distributions, and show the validity of the bias corrections under asymptotic sequences where the two dimensions of the data set pass to infinity at the same rate. As in the single binary response model, the bias corrections remove the asymptotic bias of the fixed effects estimators without increasing their asymptotic variances.
We implement the inference method using multiplier bootstrap (Giné and Zinn, 1984). This version of bootstrap constructs draws of an estimator as weighted averages of its influence function, where the weights are independent from the data. Compared to empirical bootstrap, multiplier bootstrap has the computational advantage that it does not involve any parameter reestimation. This advantage is particularly convenient in our setting because the parameter estimation require multiple nonlinear optimizations that can be highly dimensional due to the fixed effects. Multiplier bootstrap is also convenient to account for data dependencies. In network data, for example, it might be important to account for reciprocity or pairwise clustering. Reciprocity arises because observational units corresponding to the same pair of agents but reversing their roles as sender and receiver might be dependent even after conditioning on the unobserved effects. By setting the weights of these observational units equal, we account for this dependence in the multiplier bootstrap. In addition to the previous practical reasons, there are some theoretical reasons for choosing multiplier bootstrap. Thus, Chernozhukov, Chetverikov and Kato (2016) established bootstrap functional central limit theorems for multiplier bootstrap in high dimensional settings that cover the network and panel models that we consider.
The methods developed in this paper apply to models that include unobserved effects to capture grouping or clustering structures in the data such as models for panel and network data. These effects allow us to control for unobserved group heterogeneity that might be related to the covariates causing endogeneity or omitted variable bias. They also serve to parsimoniously account for dependencies in the data. We illustrate the wide applicability with an empirical example to gravity models of trade. In this case the outcome is the volume of trade between two countries and each observational unit corresponds to a country pair indexed by exporter country (sender) and importer country (receiver). We estimate the distributional effects of gravity variables such as the geographical distance controlling for exporter and importer country effects that pick up unobserved heterogeneity possibly correlated with the gravity variables. We uncover significant heterogeneity in the effects of distance and other gravity variables across the distribution, which is missed by traditional mean methods. We also find that the Poisson model, which is commonly used in the trade literature to deal with zero trade in many country pairs, does not provide a good approximation to the distribution of the volume of trade due to heavy tails.
Literature review. Unlike mean effects, there are different ways to define distributional and quantile effects. For example, we can distinguish conditional effects versus unconditional or marginalized effects, or quantile effects versus quantiles of the effects. Here we give a brief review of the recent literature on distributional and quantile effects in panel data models emphasizing the following aspects: (1) type of effect considered; (2) type of unobserved effects in the model; and (3) asymptotic approximation. For the unobserved effects, we distinguish models with one-way effects versus two-way effects. For the asymptotic approximation we distinguish short panels with large N and fixed T versus long panels with large N and large T , where N and T denote the dimensions of the panel. We focus mainly on fixed effects approaches where the unobserved effects are treated as parameters to be estimated, but also mention some correlated random effects approaches that impose restrictions on the distribution of the unobserved effects. This paper deals with inference on marginalized quantile effects in large panels with two-way effects, which has not been previously considered in the literature. Indeed, to the best of our knowledge, it is the first paper to provide inference methods for quantile treatment effects from panel and network models with two-way fixed effects. Koenker (2004) introduced fixed effects quantile regression estimators of conditional quantile effects in large panel models with one-way individual effects using shrinkage to control the variability in the estimation of the unobserved effects. Lamarche (2010) discussed the optimal choice of a tuning parameter in Koenker's method. In the same framework, Kato, Galvao, and Montes-Rojas (2012), Galvao, Lamarche and Lima (2013), Galvao and Kato (2016) and Arellano and Weidner (2016) considered fixed effects quantile regression estimators without shrinkage and developed bias corrections. All these papers require that T pass to infinity faster than N , making it difficult to extend the theory to models with two-way individual and time effects. Graham, Hahn and Powell (2009) found a special case where the fixed effects quantile regression estimator does not suffer of incidental parameter problem. Machado and Santos Silva (2018) has recently proposed a method to estimate conditional quantile effects in a location-scale model via moments.
In short panels, Rosen (2012) showed that a linear quantile restriction is not sufficient to point identify conditional effects in a panel linear quantile regression model with unobserved individual effects. Chernozhukov, Fernandez-Val, Hahn and Newey (2013) and Chernozhukov, Fernandez-Val, Hoderlein, Holzmann and Newey (2015) discussed identification and estimation of marginalized quantile effects in nonseparable panel models with unobserved individual effects and location and scale time effects under a time homogeneity assumption. They showed that the effects are point identified only for some subpopulations and characterized these subpopulations. Graham, Hahn, Poirier and Powell (2015) considered quantiles of effects in linear quantile regression models with two-way effects. Finally, Abrevaya and Dahl (2008) and Arellano and Bonhomme (2016) developed estimators for conditional quantile effects in linear quantile regression model with unobserved individual effects using correlated random effects approaches. None of the previous quantile regression based methods apply to discrete outcomes.
Finally, we review previous applications of panel data methods to network data. These include Charbonneau (2017), Cruz-Gonzalez, Fernandez-Val and Weidner (2016), Dzemski (2017), Fernandez-Val and Weidner (2016), Graham (2016Graham ( , 2017, Jochmans (2018), and Yan, Jiang, Fienberg and Leng (2016), which developed methods for models of network formation with unobserved sender and receiver effects for directed and undirected networks. None of these papers consider estimation of quantile effects as the outcome variable is binary, whether or not a link is formed between two agents.
Plan of the paper. Section 2 introduces the distribution regression model with unobserved effects for network and panel data, and describes the quantities of interest including model parameters, distributions, quantiles and quantile effects. Section 3 discusses fixed effects estimation, bias corrections to deal with the incidental parameter problem, and uniform inference methods. Section 4 provides asymptotic theory for the fixed effects estimators, bias corrections, and multiplier bootstrap. Section 5 and 6 report results of the empirical application to the gravity models of trade and a Monte Carlo simulation calibrated to the application, respectively. The proofs of the main results are given in the Appendix, and additional technical results are provided in the Supplementary Appendix.
Notation. For any two real numbers a and b, a ∨ b = max{a, b} and a ∧ b = min{a, b}. For a real number a, a denotes the integer part of a. For a set A, |A| denotes the cardinality or number of elements of A.

Model and Parameters of Interest
2.1. Distribution Regression Model with Unobserved Effects. We observe the data set {(y ij , x ij ) : (i, j) ∈ D}, where y ij is a scalar outcome variable with region of interest Y, and x ij is a vector of covariates with support X ⊆ R dx . 1 The variable y ij can be discrete, continuous or mixed. The subscripts i and j index individuals and time periods in traditional panels, but they might index other dimensions in more general data structures. In our empirical application, for example, we use a panel where y ij is the volume of trade between country i and country j, and x ij includes gravity variables such as the distance between country i and country j. Both i and j index countries as exporters and importers respectively. The set D contains the indexes of the pairs (i, j) that are observed. It is a subset of the set of all possible pairs D 0 := {(i, j) : i = 1, . . . , I; j = 1, . . . , J}, where I and J are the dimensions of the panel. We introduce D to allow for missing data that are common in panel and network applications. For example, in the trade application I = J and D = D 0 \ {(i, i) : i = 1, . . . , I} because we do not observe trade of a country with itself. We denote the total number of observed units by n, i.e. n = |D|.
Let v i and w j denote vectors of unspecified dimension that contain unobserved random variables or effects that might be related to the covariates x ij . In traditional panels, v i are individual effects that capture unobserved individual heterogeneity and w j are time effects that account for aggregate shocks. More generally, these variables serve to capture some forms of endogeneity and group dependencies in a parsimonious fashion. We specify the conditional distribution of y ij given (x ij , v i , w j ) using the distribution regression (DR) model with unobserved effects where Λ y is a known link function such as the normal or logistic distribution, which may vary with y, x → P (x) is a dictionary of transformations of x such us polynomials, bsplines and tensor products, β(y) is an unknown parameter vector, which can vary with y, and (v, y) → α(v, y) and (w, y) → γ(w, y) are unspecified measurable functions. This DR model is a semiparametric model for the conditional distribution because y → θ(y) := (β(y), α 1 (y), . . . , α I (y), γ 1 (y), . . . , γ J (y)) is a function-valued parameter and the dimension of θ(y) varies with I and J, although we do not make this dependence explicit. We shall treat the dimension of P (x) as fixed in the asymptotic analysis When y ij is continuous, the model (1) has the following representation as an implicit nonseparable model by the probability integral transform where the error u ij represents the unobserved ranking of the observation y ij in the conditional distribution. The parameters of the model are related to derivatives of the conditional quantiles.
The DR coefficients therefore are proportional to (minus) derivatives of the conditional quantile function, and ratios of DR coefficients correspond to ratios of derivatives.
Remark 1 (Parametric models). There are many parametric models that are special cases of the DR model. Thus, Chernozhukov, Fernández-Val and Melly (2013) and Chernozhukov, Fernandez-Val, Melly, and Wuthrich (2016) showed that the standard linear model, Cox proportional hazard model and Poisson regression model are encompassed by the DR model in the cross section case. These inclusions carry over to the panel versions of these models with two-way unobserved effects. 2 We use the convention inf{∅} = +∞. 3 Indeed, Λy(x ij β(y) + α(vi, y) + γ(wj, y)) = u at y = Qy ij (u | xij, vi, wj). Differencing this expression with respect to x k ij yields where λy(z) = ∂Λy(z)/∂z. Note that the first term of the right hand side does not depend on k and is positive because y → Fy ij (y | xij, vi, wj) = Λy(x ij β(y) + α(vi, y) + γ(wj, y)) is strictly increasing at y = Qy ij (u | xij, vi, wj).

2.2.
Estimands. In addition to the model parameter β(y), we are interested in measuring the effect on the outcome of changing one of the covariates holding the rest of the covariates and the unobserved effects fixed. Let x = (t, z ) , where t is the covariate of interest or treatment and z are the rest of the covariates that usually play the role of controls. One effect of interest is the quantile (left-inverse) function (QF) ij is a level of the treatment that may depend on t ij , and k ∈ {0, 1}. We provide examples below. Note that in the construction of the counterfactual distribution F k , we marginalize (x ij , v i , w j ) using the empirical distribution. The resulting effects are finite population effects. We shall focus on these effects because conditioning on the covariates and unobserved effects is natural in the trade application. 4 We construct the quantile effect function (QEF) by taking differences of the QF at two treatment levels We can also obtain the average effect using the relationship between averages and distributions. Thus, the average effect is where µ k is the counterfactual average obtained from F k as The integral in (2) is over the real line, but the formula nevertheless is applicable to the case where the support of dF k is discrete or mixed.
The choice of the levels t 0 ij and t 1 ij is usually based on the scale of the treatment: • If the treatment is binary, ∆(τ ) is the τ -quantile treatment effect with t 0 ij = 0 and t 1 ij = 1. • If the treatment is continuous, ∆(τ ) is the τ -quantile effect of a unitary or one standard deviation increase in the treatment with t 0 ij = t ij and t 1 ij = t ij + d, where d is 1 or the standard deviation of t ij .
• If the treatment is the logarithm of a continuous treatment, ∆(τ ) is the τ -quantile effect of doubling the treatment (100% increase) with t 0 ij = t ij and t 1 ij = t ij + log 2.
4 The distinction between finite and infinite population effects does not affect estimation, but affects inference (Abadie, Athey, Imbens and Wooldridge, 2014). The estimators of infinite population effects need to account for the additional sampling variation coming from the estimation of the distribution of (xij, vi, wj).
For example, in the trade application we use the levels t 0 ij = 0 and t 1 ij = 1 for binary covariates such as the indicators for common legal system and free trade area, and t 0 ij = t ij and t 1 ij = t ij + log 2 for the logarithm of distance.
All the previous estimands have causal interpretation under the standard unconfoundedness or conditional independence assumption for panel data where the conditioning set includes not only the observed controls but also the unobserved effects.

Fixed Effects Estimation and Uniform Inference
To simplify the notation in this section we write P (x ij ) = x ij without loss of generality, and define α i (y) := α(v i , y) and γ j (y) := γ(w j , y).

Fixed Effects Distribution Regression
Estimator. The parameters of the PDR model can be estimated from multiple binary regressions with two-way effects. To see this, note that the conditional distribution in (1) can be expressed as Accordingly, we can construct a collection of binary variables, and estimate the parameters for each y by conditional maximum likelihood with fixed effects. Thus, θ(y) := ( β(y), α 1 (y), . . . , α I (y), γ 1 (y), . . . , γ J (y)), the fixed effects distribution regression estimator of θ(y) := (β(y), α 1 (y), . . . , α I (y), γ 1 (y), . . . , γ J (y)), is obtained as for y ∈ Y. When the link function is the normal or logistic distribution, the previous program is concave and smooth in parameters and therefore has good computational properties. See Fernandez-Val and Weidner (2016) and Cruz-Gonzalez, Fernandez-Val and Weidner (2016) for a discussion of computation of logit and probit regressions with two-way effects.
The quantile functions and effects are estimated via plug-in rule, i.e., Remark 2 (Computation). When Y is not finite, we replace Y by a finite subsetȲ such that the Hausdorff distance betweenȲ and Y goes to zero at a rate faster than 1/ √ n. For example, if Y is an interval [y,ȳ], thenȲ can be a fine mesh of √ n log log n equidistant points covering Y, i.e.,Ȳ = {y, y + d, y + 2d, . . . ,ȳ} for d = (ȳ − y)/( √ n log log n). If Y is the support of y ij ,Ȳ can be a grid of √ n log log n sample quantiles with equidistant indexes.
3.2. Incidental Parameter Problem and Bias Corrections. Fixed effects estimators can be severely biased in nonlinear models because of the incidental parameter problem (Neyman and Scott, 1948). These models include the binary regressions that we estimate to obtain the DR coefficients and estimands. We deal with the incidental parameter problem using the analytical bias corrections of Fernandez-Val and Weidner (2016) for parameters and average partial effects (APE) in binary regressions with two-way effects. We note here that the distributions F 0 (y) and F 1 (y) can be seen as APE, i.e., they are averages of functions of the data, unobserved effects and parameters.
The bias corrections are based on expansions of the bias of the fixed effects estimators as I, J → ∞. For example, where nR This result generalizes the analysis of Fernandez-Val and Weidner (2016) from a single binary regression to multiple (possibly a continuum) of binary regressions. This generalization is required to implement our inference methods for quantile functions and effects.
The expansion (4) is the basis for the bias corrections. Let B 5 Fernandez-Val and Weidner (2016) considered the case where n = IJ, i.e., there is no missing data, so that I/n = 1/J and J/n = 1/I.
We also use the corrected estimators F k as the basis for inference and to form a bias corrected estimator of the average effect.
Remark 3 (Shape Restrictions). If the bias corrected estimator y → F k (y) is nonmonotone on Y, we can rearrange it into a monotone function by simply sorting the values of function in a nondecreasing order. Chernozhukov, Fernandez-Val and Galichon (2009) showed that the rearrangement improves the finite sample properties of the estimator. Similarly, if the F k (y) takes values outside of [0, 1], winsorizing its range to this interval improves the finite sample properties of the estimator (Chen, Chernozhukov, Fernández-Val, Kostyshak and Luo (2018)).
We use the generic method of Chernozhukov, Fernández-Val, Melly and Wüthrich (2016) to construct confidence bands for quantile functions and effects from confidence bands for the corresponding distributions. Let D denote the space of weakly increasing functions, mapping Y to [0, 1]. Assume we have a confidence band I k = [L k , U k ] for F k , with lower and upper endpoint functions y → L k (y) and y → U k (y) such that L k , U k ∈ D and L k (y) ≤ U k (y) for all y ∈ Y. 6 We say that I k covers F k if F k ∈ I k pointwise, namely L k (y) ≤ F k (y) ≤ U k (y) for all y ∈ Y. If U k and L k are some data-dependent bands, we say that I k is a confidence band for F k of level p, if I k covers F k with probability at least p. Similarly, we say that the set of bands {I k : k ∈ K} is a joint confidence band for the set of functions {F k : k ∈ K} of level p, if I k covers F k with probability at least p simultaneously over k ∈ K. The index set K can be a singleton to cover individual confidence bands or K = {0, 1} to cover joint confidence bands. In Section 4 we provide a multiplier bootstrap algorithm for computing joint confidence bands based on the joint asymptotic distribution of the bias corrected estimators { F k : k ∈ K}.
The following result provides a method to construct joint confidence bands for {Q k = F ← k : k ∈ K}, from joint confidence bands for {F k : k ∈ K}. Lemma 1 (Chernozhukov, Fernández-Val, Melly and Wüthrich (2016, Thm. 2(1))). Consider a set of distribution functions {F k : k ∈ K} and endpoint functions {L k : k ∈ K} and {U k : k ∈ K} with components in the class D. If {F k : k ∈ K} is jointly covered by 6 If [L k , U k ] is a confidence band for F k that does not obey the constraint L k , U k ∈ D, we can transform This Lemma establishes that we can construct confidence bands for quantile functions by inverting the endpoint functions of confidence bands for distribution functions. The geometric intuition is that the inversion amounts to rotate and flip the bands, and these operations preserve coverage.
We next construct simultaneous confidence bands for the quantile effect function τ → ∆(τ ) defined by The basic idea is to take appropriate differences of the bands for the quantile functions Q 1 and Q 0 as the confidence band for the quantile effect. Specifically, suppose we have the set of confidence bands Chernozhukov, Fernández-Val, Melly and Wüthrich (2016) showed that a confidence band for the difference Q 1 − Q 0 of size p can be constructed as Lemma 2 (Chernozhukov, Fernández-Val, Melly and Wüthrich (2016, Thm. 2(2))). Consider a set of distribution functions {F k : k = 0, 1} and endpoint functions {L k : k = 0, 1} and {U k : k = 0, 1}, with components in the class D. If the set of distribution functions {F k : k = 0, 1} is jointly covered by the set of bands {I k : k = 0, 1} with probability p, then the quantile effect function

Asymptotic Theory
This section derives the asymptotic properties of the fixed effect estimators of y → β(y) and {F k : k ∈ K}, as both dimensions I and J grow to infinity. We focus on the case where the link function is the logistic distribution at all levels, Λ y = Λ, where Λ(ξ) = (1 + exp(−ξ)) −1 . We choose the logistic distribution for analytical convenience. In this case the Hessian of the log-likelihood function does not depend on y it , leading to several simplifications in the asymptotic expansions. In particular, there are various terms that drop out from the second order expansions that we use to characterize the structure of the incidental parameter bias of the estimators β(y) and F (y). For the case of single binary regressions, Fernandez-Val and Weidner (2016) showed that the properties of fixed effects estimators are similar for the logistic distribution and other smooth log-concave distributions such as the normal distribution. Accordingly, we expect that our results can be extended to other link functions, but at the cost of more complicated proofs and derivations to account for additional terms.
We make the following assumptions: Assumption 1 (Sampling and Model Conditions).
(i) Sampling: The outcome variable y ij is independently distributed over i and j conditional on all the observed and unobserved covariates (ii) Model: For all y ∈ Y, where y → β(y), y → α(·, y) and y → γ(·, y) are measurable functions.
(iii) Compactness: X VW, the support of (x ij , v i , w j ), is a compact set, and y → α(v i , y) and y → γ(w j , y) are a.s. uniformly bounded on Y.
(iv) Compactness and smoothness: Either Y is a discrete finite set, or Y ⊂ R is a bounded interval. In the latter case, we assume that the conditional density function f y ij (y | x ij , v i , w j ) exists, is uniformly bounded above and away from zero, and is uniformly continuous in y on int(Y), uniformly in ( (v) Missing data: There is only a fixed number of missing observations for every i and j, that is, max i (J − |{(i , j ) ∈ D : i = i}|) ≤ c 2 and max j (I − |{(i , j ) ∈ D : j = j}|) ≤ c 2 for some constant c 2 < ∞ that is independent of the sample size.
(vi) Non-collinearity: The regressors x ij are non-collinear after projecting out the twoway fixed effects, that is, there exists a constant c 3 > 0, independent of the sample size, such that (vii) Asymptotics: We consider asymptotic sequences where I n , J n → ∞ with I n /J n → c for some positive and finite c, as the total sample size n → ∞. We drop the indexing by n from I n and J n , i.e. we shall write I and J.
Remark 4 (Assumption 1). Part (i) holds if (y ij , x ij ) is i.i.d. over i and j, v i is i.i.d. over i, and w j is i.i.d. over j; but it is more general as it does not restrict the distribution of (x ij , v i , w j ) nor its dependence across i and j. We show how to relax this assumption allowing for a form of weak conditional dependence in Section 4.4. Part (ii) holds if the observed covariates are strictly exogenous conditional on the unobserved effects and the conditional distribution is correctly specified for all y ∈ Y. We expect that our theory carries over to predetermined or weakly exogenous covariates that are relevant in panel data models, following the analysis Fernandez-Val and Weidner (2016). We focus on the strict exogeneity assumption because it is applicable to both panel and network data, and leave the extension to weak exogeneity to future research. Part (iii) imposes that the support of the covariates and the unobserved effects is a compact set. For fixed values y it is possible to obtain asymptotic results of our estimators without the compact support assumption, see e.g. Yan, Jiang, Fienberg and Leng (2016), but deriving empirical process results that hold uniformly over y is much more involved without this assumption. The compact support assumption guarantees that the conditional probabilities of the events {y ij ≤ y} are bounded away from zero and one, that is, the network of binarized outcomes 1{y ij ≤ y} is assumed to be dense. In the network econometrics literature Charbonneau (2017), Graham (2017) and Jochmans (2018) provide methods that are also applicable to sparse networks. Part (iv) can be slightly weakened to Lipschitz continuity with uniformly bounded Lipschitz constant, instead of differentiability. It covers discrete, continuous, and mixed outcomes with mass points at the boundary of the support such as censored variables. For the mixed outcomes, the data generating process for the mass points can be arbitrarily different from the rest of the support because the density y → f y ij (y | ·) only needs to be continuous in the interior of Y. If the panel is balanced, part (vi) can be stated as x ij . This is the typical condition in linear panel models requiring that all the covariates display variation in both dimensions. The asymptotic sequences considered in part (v) are convenient because they exactly balance the order of the bias and standard deviation of the fixed effect estimator yielding a non-degenerate asymptotic distribution. 4.1. Asymptotic Distribution of the Uncorrected Estimator. We introduce first some further notation. Denote the q th derivatives of the cdf Λ by Λ (q) , and define Λ and let α x,i (y) and γ x,j (y) be the d x -vectors with components α x,i (y) and γ x,j (y), where Notice that x ij,k (y) is defined using projections of x ij instead of x ij,k . Also, while the locations of α x,i (y) and γ x,j (y) are not identified, x ij (y) and x ij,k (y) are uniquely defined. Analogous to the projection of For example, if x ij,k = x ij , then Ψ ij,k (y) = 1. Furthermore, we define 7 are the subsets of observational units that contain the index i and j, respectively. In the previous expressions, k (y) and Ψ ij,k (y) are scalars for each k ∈ K, that we stack in the |K| × 1 vectors Let ∞ (Y) be the space of real-valued bounded functions on Y equipped with the supnorm · Y , and denote weak convergence (in distribution). We establish a functional central limit theorem for the fixed effects estimators of y → β(y) and y → F (y) in Y. All stochastic statements are conditional on Theorem 1 (FCLT for Fixed Effects DR Estimators). Let Assumption 1 hold. For all y 1 , y 2 ∈ Y with y 1 ≥ y 2 we assume the existence of ij,k (y) xij(y) = 0, and we can therefore equivalently , Ω(y 2 , y 1 ) := Ω(y 1 , y 2 ) , and W (y 1 ) := V (y 1 , y 1 ). Then, in the metric space as stochastic processes indexed by y ∈ Y, where y → Z (β) (y) and y → Z (F ) (y) are tight zeromean Gaussian processes with covariance functions (y 1 , and (y 1 , y 2 ) → Ω(y 1 , y 2 ), respectively.
Assumption 1(vi) guarantees the invertibility of W (y) and W (y). Notice that W (y) is equal to the limit of W (y) because Λ by the properties of the logistic distribution. This information equality follows by the correct specification condition in Assumption 1(ii). By Assumption 1(v), we could have used √ IJ instead of √ n, 1/J instead of I/n, and 1/I instead of J/n. We prefer the expressions in the theorem, because they might provide a more accurate finite-sample approximation.
Remark 5 (Comparison with binary response models). Fernandez-Val and Weidner (2016) derived central limit theorems (CLTs) for the fixed effects estimators of coefficients and APEs in panel regressions with two-way effects. Pointwise, for given y ∈ Y, Theorem 1 yields these CLTs. Moreover, it covers multiple binary regressions by establishing the limiting distribution of β(y) and F (y) treated as stochastic processes indexed by y ∈ Y. This generalization is key for our inference results and does not follow from well-known empirical process results. We need to deal with a double asymptotic approximation where both I and J grow to infinity, and to bound all the remainder terms in the second order expansions used by Fernandez-Val and Weidner (2016) uniformly over y ∈ Y. We refer to the appendix and supplementary material for more details.
k (y) = 0, and ∂ β F k (y) = 0 (see footnote 7). In fact, in that case F k is equal to the empirical distribution function, namely by the first order conditions of the fixed effects logit DR estimator with respect to the fixed effect parameters. This property provides another appealing feature to choose the logistic distribution.

4.2.
Bias Corrections. Theorem 1 shows that the fixed effects DR estimator has asymptotic bias of the same order as the asymptotic standard deviation under the approximation that we consider. The finite-sample implications are that this estimator can have substantial bias and that confidence regions constructed around it can have severe undercoverage. We deal with these problems by removing the first order bias of the estimator.
We estimate the bias components using the plug-in rule. Define Λ ij,k (y) in the definitions of α x (y), γ x (y), α Ψ (y), and γ Ψ (y) yields the corresponding estimators. We plug-in these estimators to obtain .
The following theorem shows that the estimators of the asymptotic bias and variance are consistent, uniformly in y ∈ Y. For a matrix A, we denote by A the Frobenius norm of A, i.e. A = trace(AA ).
Theorem 2 (Uniform Consistency of Estimators of Bias and Variance Components). Let Assumption 1 hold. Then, Bias corrected estimators of β(y) and F (y) are formed as and .

It can be shown that sup
, that is, the difference between those alternative bias corrected estimators is asymptotically negligible. There is no obvious reason to prefer one over the other, and we present result for F k in the following, which equivalently hold for F * k . 8 Remark 7 (Alternative Approaches). The conditional approach of Charbonneau (2017) and Jochmans (2018) for the logit model with two-way effects could be also adopted to estimate the coefficient β(y). However, this approach does not produce estimators of F (y) as it is based on differencing-out the unobserved effects. The bias correction method proposed is analytical in that it requires explicit characterization and estimation of the bias. A natural alternative is a correction based on Jackknife or bootstrap following the analysis of Cruz-Gonzalez, Fernandez-Val and Weidner (2016), Dhaene and Jochmans (2015), Fernandez-Val and Weidner (2016), Hahn and Newey (2004), and Kim and Sun (2016) for nonlinear panel models. We do not consider any of these corrections because they require repeated parameter estimation that can be computationally expensive in this case.
Combining Theorem 1 and 2 we obtain the following functional central limit theorem for the bias corrected estimators.

Corollary 1 (FCLT for Bias Corrected Fixed Effects DR Estimators
as stochastic processes indexed by y ∈ Y, where Z (β) (y) and Z (F ) (y) are the same Gaussian processes that appear in Theorem 1.

Uniform Confidence Bands and Bootstrap.
We show how to construct pointwise and uniform confidence bands for y → β(y) and y → F (y) on Y using Corollary 1. The uniform bands for F can be used as inputs in Lemmas 1 and 2 to construct uniform bands for the QFs τ → Q k (τ ) = F ← k (τ ), k ∈ K, and the QEF τ → ∆(τ ) on T .
Let B ⊆ {1, . . . , d x } be the set of indexes for the coefficients of interest. For given y ∈ Y, ∈ B, k ∈ K, and p ∈ (0, 1), a pointwise p-confidence interval for β (y), the 'th component and a pointwise p-confidence intervals for F k (y) is where Φ denotes the cdf of the standard normal distribution, σ β (y) is the standard error of β (y) given in (13), and σ F k (y) is the standard error of F k (y) given in (14). These intervals have coverage p in large samples by Corollary 1 and Theorem 2.
We construct joint uniform bands for the coefficients and distributions using Kolmogorov-Smirnov type critical values, instead of quantiles from the normal distribution. A uniform p-confidence band joint for the vector of functions {β (y) : ∈ B, y ∈ Y} is where t where , , the square root of the ( , ) element of the matrix W (y) −1 . Similarly, a uniform p-confidence band joint for the set of distribution functions {F k (y) : k ∈ K, y ∈ Y} is where t where σ 1/2 k,k , the square root of the (k, k) element of the matrix Ω(y, y). The previous confidence bands also have coverage p in large samples by Corollary 1 and Theorem 2.
The maximal t-statistics used to construct the bands I β and I F are not pivotal, but their distributions can be approximated by simulation after replacing the variance functions of the limit processes by uniformly consistent estimators. In practice, however, we find it more convenient to use resampling methods. We consider a multiplier bootstrap scheme that resamples the efficient scores or influence functions of the fixed effects estimators β(y) and F (y). This scheme is computationally convenient because it does not need to solve the high dimensional nonlinear fixed effects conditional maximum likelihood program (3) or making any bias correction in each bootstrap replication. In these constructions we rely on the uncorrected fixed effects estimators instead of the bias corrected estimators, because they have the same influence functions and the uncorrected estimators are consistent under the asymptotic approximation that we consider.
To describe the standard errors and multiplier bootstrap we need to introduce some notation for the influence functions of θ(y) and F (y). Let θ = (β, α 1 , . . . , α I , γ 1 , . . . , γ J ) be a generic value for the parameter θ(y), the influence function of θ(y) is the (d e i,I is a unit vector of dimension I with a one in the position i, e j,J is defined analogously, H(θ) † is the Moore-Penrose pseudo-inverse of H(θ), and is minus the Hessian of the log-likelihood with respect to θ, which does not depend on y in the case of the logistic distribution. 9 The influence function of The standard error of β (y) is constructed as the square root of the ( , ) element of the sandwich matrix n −2 (i,j)∈D ψ y ij ( θ(y))ψ y ij ( θ(y)) . Similarly, the standard error of F k (y) is constructed as The following algorithm describes a multiplier bootstrap scheme to obtain the critical values for a set of parameters indexed by ∈ B ⊆ {1, . . . , d x } and a set of distributions indexed by k ∈ K ⊆ {0, 1}. This scheme is based on perturbing the first order conditions of the fixed effects estimators with random multipliers independent from the data.
Algorithm 1 (Multiplier Bootstrap). (1) LetȲ be some grid that satisfies the conditions of Remark 2. (2) Draw the bootstrap multipliers {ω m ij : (i, j) ∈ D} independently from the data as Here we have normalized the multipliers to have zero mean as a finite-sample adjustment. (3) For each y ∈Ȳ, obtain the bootstrap draws of θ(y) as θ m (y) = θ(y) (4) Construct the bootstrap draw of the maximal t-statistic for the parameters, t (13), and ψ y ij, (θ) is the component of ψ y ij (θ) corresponding to β . Similarly, construct the bootstrap draw of the maximal t-statistic for the distributions, 9 We use the Moore-Penrose pseudo-inverse because H(θ) is singular if we do not impose a normalization on the location of αi(y) and γj(y). (14). (5) The next result shows that the multiplier bootstrap provides consistent estimators of the critical values of the inferential statistics. The proof follows from Theorem 2.2 of Chernozhukov, Chetverikov and Kato (2016). (10) and (12), respectively.
Theorem 3 together with Theorems 1 and 2 guarantee the asymptotic validity of the confidence bands I β and I F defined in (9) and (11)  4.4. Pairwise Clustering Dependence or Reciprocity. The conditional independence of Assumption 1(i) can be relaxed to allow for some forms of conditional weak dependence. A form of dependence that is relevant for network data is pairwise clustering or reciprocity where the observational units with symmetric indexes (i, j) and (j, i) might be dependent due to unobservable factors not accounted by unobserved effects. 10 In the trade application, for example, these factors may include distributional channels or multinational firms operating in both countries.
The presence of reciprocity does not change the bias of the fixed effects estimators, but affects the standard errors and the implementation of the multiplier bootstrap. The standard error of β (y) becomes Similarly, the standard error of F k (y) needs to be adjusted to In the previous expressions we assume that if (i, j) ∈ D then (j, i) ∈ D to simplify the notation. The modified multiplier bootstrap algorithm becomes: Algorithm 2 (Multiplier Bootstrap with Pairwise Clustering).
(1) LetȲ be some grid that satisfies the conditions of Remark 2.
(2) Draw the bootstrap multipliers {ω m ij : (15), and and ψ y ij, (θ) is the component of ψ y ij (θ) corresponding to β . Similarly, construct the bootstrap draw of the maximal t-statistic for the distributions, t (1) The clustered multiplier bootstrap preserves the dependence in the symmetric pairs (i, j) and (j, i) by assigning the same multiplier to each of these pairs. 4.5. Average Effect. A bias corrected estimator of the average effect can be formed as where Here the integral is over the real line, and C is an operator that extends F k (y) from Y to R as a step function, that is, it maps any f : Y → R to Cf : R → R, where Cf (y) = 0 for y ≤ inf Y, Cf (y) = 1 for y ≥ sup Y, and Cf (y) = f (sup{y ∈ Y : y ≤ y}) otherwise. The following central limit theorem for the bias corrected estimator of the average effect is a corollary of Theorem 1 and 2 together with the functional delta method.
Remark 8 (Support of Y ). The condition that Y dF k (y) = 1 guarantees that Y is the support of the potential outcome corresponding to the distribution F k , so that (2) yields the average potential outcome under F k . Together with Assumption 1, this condition is satisfied when Y is discrete with finite support Y, or continuous or mixed with bounded support Y and conditional density bounded away from zero in the interior of Y. This support condition is not required for the estimation of the quantile effects.
We can construct confidence intervals for the average effect using Corollary 2. Let Then, σ ∆ is an estimator of σ ∆ , the standard deviation of the limit process Z (∆) in (18), and , is an asymptotic p-confidence interval for ∆. The normal critical value Φ −1 (1 − p/2) can be replaced by a multiplier bootstrap critical value t (∆) (p) obtained from Algorithm 1 as The standard errors and critical values of the average effects can be adjusted to account for pairwise clustering following the procedure described in Section 4.4. Thus, the pairwise clustering robust standard error is

Quantile Effects in Gravity Equations for International Trade
We consider an empirical application to gravity equations for bilateral trade between countries. We use data from Helpman, Melitz and Rubinstein (2008), extracted from the Feenstra's World Trade Flows, CIA's World Factbook and Andrew Rose's web site. These data contain information on bilateral trade flows and other trade-related variables for 157 countries in 1986. 11 The data set contains network data where both i and j index countries as senders (exporters) and receivers (importers), and therefore I = J = 157. The outcome y ij is the volume of trade in thousands of constant 2000 US dollars from country i to country j, and the covariates x ij include determinants of bilateral trade flows such as the logarithm of the distance in kilometers between country i's capital and country j's capital and indicators for common colonial ties, currency union, regional free trade area (FTA), border, legal system, language, and religion. Following Anderson and van Wincoop (2003), we include unobserved importer and exporter country effects. 12 These effects control for other country specific characteristics that may affect trade such as GDP, tariffs, population, institutions, infrastructures or natural resources. We allow for these characteristics to affect differently the imports and exports of each country, and be arbitrarily related with the observed covariates. Table 1 reports descriptive statistics of the variables used in the analysis. There are 157× 156 = 24, 492 observations corresponding to different pairs of countries. The observations with i = j are missing because we do not observe trade flows from a country to itself. The trade variable in the first row is an indicator for positive volume of trade. There are no trade flows for 55% of the country pairs. The volume of trade variable exhibits much larger standard deviation than the mean. Since this variable is bounded below at zero, this indicates the presence of a very heavy upper tail in the distribution. This feature also makes quantile methods specially well-suited for this application on robustness grounds. 13 The previous literature estimated nonlinear parametric models such as Poisson, Negative Binomial, Tobit and Heckman-selection models to deal with the large number of zeros in the volume of trade (e.g., Eaton and Kortum, 2001, Santos Silva and Tenreyro, 2006, and Helpman, Melitz and Rubinstein, 2008. 14 These models impose strong conditions on the process that generates the zeros and/or on the conditional heteroskedasticity of the volume of trade. The DR model deals with zeros and any other fixed censoring points in a very flexible and natural fashion as it specifies the conditional distribution separately at the mass point. In particular, the model coefficients at zero can be arbitrarily different from the model coefficients at other values of the volume of trade. Moreover, the DR model can also accommodate conditional heteroskedasticity. Figure 1 shows estimates and 95% pointwise confidence intervals for the DR coefficients of log distance and common legal system plotted against the quantile indexes of the volume 11 The original data set includes 158 countries. We exclude Congo because it did not export to any other country in 1986. 12 See Harrigan (1994) for an earlier empirical international trade application that includes unobserved country effects. 13 In results not reported, we find that estimates of average effects are very sensitive to the trimming of outliers at the top of the distribution. 14 See Head and Mayer (2014) for a recent survey on gravity equations in international trade.  (08) of trade. We report uncorrected and bias corrected fixed effects estimates obtained from (3) and (7), respectively. The confidence intervals are constructed using (8). The x-axis starts at .54, the maximum quantile index corresponding to zero volume of trade. The region of interest Y corresponds to the interval between zero and the 0.95-quantile of the volume of trade. The difference between the uncorrected and bias corrected estimates is the same order of magnitude as the width of the confidence intervals for the coefficient of log distance. We find the largest estimated biases for both coefficients at highest quantiles of the volume of trade, where the indicators 1{y ij ≤ y} take on many ones. The signs of the DR coefficients indicate that increasing distance has a negative effect and having a common legal system has a positive effect on the volume of trade throughout the distribution. Recall that the sign of the effect in terms of volume of trade, y ij , is the opposite to the sign of the DR coefficient. Figures 2 and 3 show estimates and 95% uniform confidence bands for distribution and quantile functions of the volume of trade at different values of the log of distance and the common legal system. The left panels plot the functions when distance takes the observed levels (dist) and two times the observed values (2*dist), i.e. when we counterfactually double all the distances between the countries. The right panels plot the functions when all the countries have the same legal system (legal=1) and different systems (legal=0). The confidence bands for the distribution are obtained by Algorithm 1 with 500 bootstrap replications and standard normal multipliers, and a grid of valuesȲ that includes the sample quantiles of the volume of trade with indexes {.54, .55, . . . , .95}. The bands are joint for the two functions displayed in each panel. The confidence bands for the quantile functions are obtained by inverting and rotating the bands for the corresponding distribution functions using Lemma 1.
where y is the integer part of y, λ ij,k = exp(x ij,k β+ α i + γ j ), and θ = ( β, α 1 , . . . , α I , γ 1 , . . . , γ J ) is the Poisson fixed effects conditional maximum likelihood estimator We find that distance and common legal system have heterogeneously increasing effects along the distribution. For example, the negative effects of doubling the distance grows more than proportionally as we move up to the upper tail of the distribution of volume of trade. Putting all the countries under the same legal system has little effects in the extensive margin of trade, but has a strong positive effect at the upper tail of the distribution. The Poisson estimates lie outside the DR confidence bands reflecting heavy tails in the conditional distribution of the volume of trade that is missed by the Poisson model. 15 Figure 5 shows confidence bands of the quantile effects that account for pairwise clustering. The bands are constructed from confidence bands from the distributions using Algorithm 2 15 This misspecification problem with the Poisson model is well-known in the international trade literature. The Poisson estimator is treated as a quasi-likelihood estimator and standard errors robust to misspecification are reported (Santos Silva and Tenreyro, 2006).
with 500 bootstrap draws and standard normal multipliers. Accounting for unobservables that affect symmetrically to the country pairs has very little effect on the width of the bands in this case. . Estimates and 95% uniform confidence bands for the quantile effects of log distance and common legal system on the volume of trade.

Montecarlo Simulation
We conduct a Montecarlo simulation calibrated to the empirical application of Section 5. The outcome is generated by the censored logistic process x ij is the value of the covariates for the observational unit (i, j) in the trade data set, σ L = π/ √ 3, the standard deviation of the logistic distribution, and ( β, α 1 , . . . , α I , γ 1 , . . . , γ J , σ) are Tobit fixed effect estimates of the parameters in the trade data set with lower censoring point at zero. 16 We consider two designs: independent errors with u s ij ∼ i.i.d U(0, 1), and pairwise dependent errors with , where e s ij ∼ i.i.d N (0, 1) and Φ is the standard normal CDF. 17 In both cases the conditional distribution function of y s ij is a special case of the DR model (1) with link function Λ y = Λ, the logistic distribution, for all y, β(y) = σ L (e 1 y − β)/ σ, α i (y) = −σ L α i / σ, and γ j (y) = −σ L γ j / σ, where e 1 is a unit vector of dimension d x with a one in the first component. As in the empirical application, the region of interest Y is the interval between zero and the 0.95quantile of the volume of trade in the data set. All the results are based on 500 simulated panels {(y s ij , x ij ) : (i, j) ∈ D}.
Figures 6 and 7 report the biases, standard deviations and root mean square errors (rmses) of the fixed effects estimators of the DR coefficients of log-distance and legal system as a function of the quantiles of y ij in the design with independent errors. 18 All the results are in percentage of the true value of the parameter. As predicted by the large sample theory, the fixed effects estimator displays a bias of the same order of magnitude as the standard deviation. As in fig. 1, the bias is more severe for the coefficient of log distance. The bias correction removes most of the bias and does not increase the standard deviation, yielding a reduction in rmse of about 5% for the coefficient of log distance at the highest quantile indexes. Figure 8 reports the biases, standard deviations and rmses of the estimators of the counterfactual distributions at two levels of log-distance as a function of the quantiles of y ij in the design with independent errors. The levels of distance in these distributions are the same as in the empirical application, i.e. k = 0 and k = 1 correspond to the observed values and two times the observed values, respectively. All the results are in 16 We upper winsorize the volume of trade yij at the 95.5% quantile to reduce the effect of outliers in the Tobit estimation of the parameters. 17 The Spearman rank correlation between u s ij and u s ji in the design with pairwise-dependent errors is 0.73. 18 The design with pairwise dependent errors produces similar results, which are not reported for the sake of brevity. percentage of the true value of the functions. In this case we find that the uncorrected and bias corrected estimators display small biases relative to their standard deviations, and have similar standard deviations and rmses at both treatment levels. Indeed the standard deviations and rmses are difficult to distinguish in the figure as they are almost superposed.
In results not reported, we find very similar patterns in the design with pairwise dependent errors and for the estimators of the counterfactual distributions at the same two levels of legal as in the empirical application. Table 2 shows results on the finite sample properties of 95% confidence bands for the DR coefficients and counterfactual distributions in the design with independent errors. The confidence bands are constructed by multiplier bootstrap with 500 draws, standard normal weights, and a grid of valuesȲ that includes the sample quantiles of the volume of trade with indexes {.54, .55, . . . , .95} in the trade data set. For the coefficients, it reports the average length of the confidence bands integrated over threshold values, the average value of the estimated critical values, and the empirical coverages of the confidence bands. For the distributions, it reports the same measures averaged also over the two treatment levels and where the coverage of the bands is joint for the two counterfactual distributions. 19 For comparison, it also reports the coverage of pointwise confidence bands using the normal distribution, i.e. with critical value equal to 1.96. The last row computes the ratio of the standard error averaged across simulations to the simulation standard deviation, integrated over threshold values for the coefficients and over thresholds and treatment levels for the distributions. We consider standard errors and confidence bands with and without accounting for pairwise clustering. All the results are computed for confidence bands centered at the uncorrected fixed effects estimates and at the bias corrected estimates. For the coefficients, we find that the bands centered at the uncorrected estimates undercover the true coefficients, whereas the bands centered at the bias corrected estimates have coverages close to the nominal level. The joint coverage of the bands for the distributions is close to the nominal level regardless of whether they are centered at the uncorrected or bias corrected estimates. We attribute this similarity in coverage to the small biases in the uncorrected estimates of the distributions found in fig. 8. As expected, pointwise bands severely undercover the entire functions. The standard errors based on the asymptotic distribution provide a good approximation to the sampling variability of both the uncorrected and bias corrected estimators. Accounting for pairwise clustering in this design where it is not necessary has very little effect on the quality of the inference. Table 3 reports the same results as table 2 for the design with pairwise dependent errors. The bands that do not account for pairwise clustering undercover the functions because the standard errors underestimate the standard deviations of the estimators. Compared to the design with independent errors, the critical values are similar but the bands that 19 The joint coverage of the bands for the quantile functions and quantile effect is determined by the joint coverage of the bands of the distribution functions in our construction. We refer to Chernozhukov, Fernández-Val, Melly and Wüthrich (2016) for a numerical analysis on the marginal coverage of the bands for the quantile effects.

RMSE of Counterfactual Distribution −− ldist, k=1
Quantile of Trade % of True Parameter Value Uncorrected Bias Corrected Figure 8. Bias, standard deviation and root mean squared error for the estimators of the counterfactual distributions of log-distance. account for clustering are wider due to the increase in the standard errors. To sum up, inference methods robust to pairwise clustering perform well in both designs, whereas inference methods that do not account for clustering undercover in the presence of pairwise dependence. The bias corrections are effective in reducing bias and bringing the coverage  probabilities of the bands close to their nominal level for the coefficients, whereas they have little effect for the distributions.

Appendix A. Proofs of Main Text Results
We present the proofs of Theorems 1 and 2, and relegate various technical details to the supplementary appendix. Once Theorems 1 and 2 are shown, the proof of Theorem 3 for the multiplier bootstrap follows from Theorem 2.2 in Chernozhukov, Chetverikov and Kato (2016). The uniform confidence bands I F for the cdfs in (11) obtained by the multiplier bootstrap can then be inverted and differenced to obtain uniform confidence bands for the quantile function and quantile effects, see Chernozhukov, Fernández-Val, Melly and Wüthrich (2016) and also Lemma 1 and 2 above. This appendix thus contains the proofs of all the main results that are new to the current paper. The proofs for all of the lemmas below are given in the supplementary appendix. All stochastic statements in the following are conditional on {(x ij , v i , w j ) : (i, j) ∈ D}.
Let s y be the n-vector obtained by stacking the elements s y,ij across all observations (i, j) ∈ D. Similarly, let Λ (1) y be the n × n diagonal matrix with diagonal elements given by Λ (1) y,ij , (i, j) ∈ D. Finally, let w be the n × (d x + I + J) matrix with rows given by w ij , (i, j) ∈ D. We define the n × n symmetric idempotent matrix where † is the Moore-Penrose pseudoinverse. For the elements of this matrix we write Q y,ij,i j . We have (Q y s y ) ij = (i ,j )∈D Q y,ij,i j s y,i j . The constraint ∃ θ : π y,ij = w ij θ y in (A.1) can then equivalently be written as 20 The matrix Q y projects onto the column span of Λ w. This projector acts in the space of weighted index vectors Λ (1) y,ij 1/2 π y,ij : (i, j) ∈ D , and the weighting of each π y,ij by Λ (1) y,ij y,ij is simply the expected Hessian for observation (i, j).
A.1. Technical Lemmas. We require some results for the proofs of the main theorems below. The following lemma provides an asymptotic expansion of π y,ij − π 0 y,ij . Lemma 3 (Score expansion of fixed effect estimates). Under Assumption 1, for y ∈ Y and (i, j) ∈ D, we have and the remainder r y,ij satisfies sup y∈Y max (i,j)∈D |r y,ij | = o P (n −1/2 ).
The expansion in the preceding lemma is a second-order stochastic expansion, because it does not only describe the terms linear in the score s y , but also the terms quadratic in s y . We need to keep track of those quadratic terms, because they yield the leading order incidental parameter biases that appear in Theorem 1. The remainder r y,ij contains higher-order terms in s y (cubic, quartic, etc), which turn out not to matter for the result in Theorem 1. Note also that Λ (2) y,ij = ∂ π 3 y,ij . Thus, the term quadric in the score is proportional to the third derivative of the objective function.
We now want to decompose the projector Q y into the parts stemming from x ij , e i,I and e j,J , respectively. We have already introduced the d x -vector x y,ij = x ij (y) in Section 4.1. Let x y be the n × d x matrix with rows given by (1) y x y was also introduced in Section 4.1. Invertibility of W y is 20 In matrix notation the constraint can be written as πy = w θy, and we thus have Qy Λ  ij , respectively. Let x y,ij is defined as the part of x y,ij that is orthogonal to the fixed effects under a metric given by Λ x y = 0, which implies that Next, define the projection matrices y,ij and i∈D j Λ (1) y,ij , respectively, and therefore It is not exactly true that Q y , but Lemma 5 shows that this is approximately true in a well-defined sense.
Lemma 5 (Properties of Q y ). Under Assumption 1, Remark 9 (Bias of π y,ij ). According to part (i) of this lemma the remainder term Q y has elements uniformly bounded of order n −1 , and it can easily be seen from (A.5) that the same is true for Q (1) y , because the elements of x y are also uniformly bounded under our assumptions. By contrast, Q (2) y and Q (3) y have elements of order J −1 and I −1 , respectively, that is, of order n −1/2 . Using this and the fact that s y,ij has variance one and is independent across observations (i, j) we find where we use that Q y is idempotent in the second step, and (A.6) in the third step. Combining this with Lemma 3 one finds that the leading order bias term in π y,ij − π 0 y,ij is given by which then translates into corresponding bias terms for all other estimators as well.
For the following lemma, let Z k (y) be as defined in and before Theorem 1 in the main text.
Lemma 6 (Properties of score averages). Under Assumption 1, Regarding part (i) of this lemma, notice that pointwise we have (Q y s y ) ij = O P (n −1/4 ), because (A.7) implies that E (Q y s y ) ij 2 = O P (n −1/2 ). However, after taking the supremum over y, i, j the term is growing faster than n −1/4 . The rate o P (n −1/6 ) in part (i) of the lemma is crude, but sufficient for our purposes.
By combining this with Lemma 3 we obtain √ n β y − β 0 and r (β) where we also use that Λ (1) y,ij and x y,ij are uniformly bounded under our assumptions. For the term linear in the score we find where in the second step we used (A.4), and the final step follows from part (ii) of Lemma 6.
Employing again (A.4) we find and according to part (iv) of Lemma 6 we thus have uniformly in y ∈ Y. Combining the above gives the result for √ n β y − β 0 y in the theorem.
The projection Ψ y,ij,k = Ψ ij,k (y), defined just before (6) in the main text, can be written in terms of the matrix Q Using ij,k (y) x y,ij,k we obtain According to part (iii) of Lemma 6 the vector T asymptotically.
The terms quadratic in s y read where for the term quadratic in π y,ij,k − π 0 y,ij,k in the expansion of F y,k − F y,k we do not insert (A.10) but rather insert (A.9), and we ignore the terms involving β y −β 0 y here -they give contributions quadratic in the score s y , but only of smaller order, and we therefore rather include those in the remainder term r (F ) y,k below. Using again (A.12) we find Using part (v) of Lemma 6, and our previous result for T (2,β) y , we thus obtain uniformly in y ∈ Y and k ∈ K.
The remainder term of the expansion reads Our assumptions guarantee that Λ are all uniformly bounded. Lemma 3 guarantees that r y,ij = o P (n −1/2 ), uniformly over y, i, j, and using Lemma 5(ii) this also implies that Q (FE) r y ij = o P (n −1/2 ), uniformly over y, i, j.
By a mean value of expansion in π y around π 0 y we thus obtain The proof of consistency for the other estimators in Theorem 2 is analogous.
21 Note that instead of B (β) y (πy) we could simply write B (β) (πy) here, because all the dependence on y is through the parameter πy. The only reason to write B (β) y (πy) is to avoid confusion with the notation B (β) (y) in the main text.

Supplementary Appendix
The following proof of Lemma 3 also relies on the results of Lemma 5 and Lemma 6(i), whose proof is presented afterwards, without using Lemma 3 of course.
Proof of Lemma 3. Define Q ⊥ y := I n − Q y , which is the n × n symmetric idempotent matrix that projects onto the space orthogonal to the column span of Λ (1) y 1/2 w, with w = (w ij : (i, j) ∈ D). In component notation we have Q ⊥ y,ij,i j = δ ii δ jj − Q y,ij,i j , where δ .. refers to the Kronecker delta. We also define π * y,ij := Λ (1) y,ij 1/2 π y,ij , π * 0 y,ij := Λ (1) y,ij which is simply a rescaling of π y,ij by Λ (1) y,ij depends on the true parameter values, but for the analysis here it is more convenient to work with * y,ij (π * y,ij ) than with y,ij (π y,ij ). After the rescaling we have s y,ij = ∂ π * * y,ij := ∂ π * * y,ij π * 0 y,ij and 1 = ∂ π * 2 * y,ij := ∂ π * 2 * y,ij π * 0 y,ij , that is, the variance of the score and the Hessian of * y,ij (π * y,ij ) evaluated at the true parameter values are normalized to one. Equation (A.2) can be rewritten as Q ⊥ y π * y = 0. where π * y is the n-vector with elements π * y,ij . Solving (A.1) is then equivalent to minimizing the function (i,j)∈D * y,ij (π * y,ij ) + (i,j)∈D π * y,ij (i ,j )∈D Q ⊥ y,ij,i j µ y,i j over π * y and µ y , where the µ y are the Lagrange multipliers corresponding to the constraint Q ⊥ y π * y = 0, which is equivalent to existence of θ such that π y = w θ y . The FOCs with respect to π * y read ∂ π * * y ( π * y ) + Q ⊥ y µ y = 0, where ∂ π * * y ( π * y ) and µ y are n-vectors obtained by stacking the elements of ∂ π * * y,ij ( π * y ) and µ y,ij for all (i, j) ∈ D. Existence of µ y that satisfy those FOCs is equivalent to In addition to this first order condition we have the constraint Q ⊥ y π * y = 0, which implies that π * y = Q y ξ y for some ξ y ∈ R n , that is, we only need to consider parameters π * y that can be represented as Q y ξ y . In the following we perform three expansion steps for the loglikelihood function (or for the corresponding score function), each time restricting π * y,ij − π * 0 y,ij further.

#
Step 1: We assume uniform boundedness of all parameters and variables that enter into the single index. Therefore, for all y ∈ Y and (i, j) ∈ D we have π 0 y,ij ∈ [π min , π max ], where [π min , π max ] is some bounded interval. By strict convexity of minus the logistic loglikelihood function there exist constants c min and c max such that 0 < c min ≤ Λ (1) y,ij ≤ c max < ∞ for all y ∈ Y and (i, j) ∈ D. Hence, π * 0 y,ij ∈ [c 1/2 min π min , c 1/2 max π max ], for all y ∈ Y and (i, j) ∈ D. Define Π bnd := [c 1/2 min π min − , c 1/2 max π max + ], where > 0 is an arbitrary finite constant. In the following we only need to consider values of π * y,ij inside Π bnd .

3
. By the result of Lemma 6(i) we know that there exists a sequence κ n = o(1) such that wpa1 sup y∈Y max (i,j)∈D (Q y s y ) ij ≤ κ n n −1/6 , which implies that sup y∈Y (i,j)∈D (Q y s y ) ij 3 ≤ n 1/2 κ 3 n .

VICTOR CHERNOZHUKOV, IVAN FERNANDEZ-VAL, AND MARTIN WEIDNER
Here, Π * y,n is the boundary of Π * y,n within the set of all π * y that satisfy the constraint Q ⊥ y π * y = 0. For any ζ ∈ R n with Q ⊥ y ζ = 0 we have Q y ζ = ζ, and by applying Cauchy-Schwarz inequality we thus find where we also used that Q y Q y = Q y and employed Lemma 5(iii). By applying (S.5) to ζ ij = π * y,ij − π * 0 y,ij + (Q y s y ) ij we find that for π * y ∈ Π * y,n we have and also using Lemma 6(i) we thus have Hence, when applying (S.4) to π * y ∈ Π * y,n with (Q y ζ y ) ij = π * y,ij − π * 0 y,ij + (Q y s y ) ij , then the term 4b 3 (Q y ζ y ) ij is of order o P (1) and the term (Q y s y ) ij 3 is of smaller order than (Q y ζ y ) 2 ij . In addition, note that π * 0 y,ij − (Q y s y ) ij ∈ Π * y,n . Thus, by applying (S.4) with (Q y ζ y ) ij = π * y,ij −π * 0 y,ij +(Q y s y ) ij , and using (S.6) we find that with probability approaching one we have (i,j)∈D * y,ij π * y,ij − (i,j)∈D * y,ij π * 0 y,ij − (Q y s y ) ij > 0, for all π * y ∈ Π * y,n . (S.7) Thus, we have a convex set Π * y,n such that the convex function π * y → (i,j)∈D * y,ij π * y,ij takes a smaller value inside the set Π * y,n than on any point of its boundary Π * y,n (within the set of all π * y that satisfy the constraint Q ⊥ y π * y = 0). This guarantees that the minimizer of the objective function needs to be inside the set Π * y,n , that is, we have π y ∈ Π * y,n , which implies sup y∈Y (i,j)∈D π * y,ij − π * 0 y,ij + (Q y s y ) ij 2 ≤ κ 2 n n 1/2 = o P (n 1/2 ), and by the inequality (S.5) with ζ ij = π * y,ij − π * 0 y,ij + (Q y s y ) ij , and Lemma 6(i), we find Step 2: An expansion of (S.1) in π * y,i j around π * 0 y,i j up to second order yields where π * y,i j is a value between π * 0 y,i j and π * y,i j . By combining this expansion with the constraint Q ⊥ y ( π * y − π * 0 y ) = 0, which implies that π * y − π * 0 y = Q y ξ y , for some ξ y , we obtain Using our initial convergence rate result (S.8) in part 1 of this proof, and Lemma 5, and also uniform boundedness of all the derivatives of * y,i j (π * ) within Π bnd , we find Hence, by Lemma 6(i), Using (S.9) instead of (S.8), and reapplying the same argument a second time we obtain sup y∈Y max (i,j)∈D π * y,ij − π * 0 y,ij = o P (n −1/6 ) + O P (κ 4 n ).
Proof of Lemma 4. We prove this lemma by showing that the smallest eigenvalue of W y is bounded from below, uniformly over y ∈ Y. We know that there exists b min > 0 such that Λ (1) y,ij ≥ b min , uniformly over y, i, j. Then, where existence of c 3 > 0 is guaranteed by Assumption 1(vi).
Proof of Lemma 5. We showed that Q y = Q (1) . We now want to find the bound on Q Then, where we use the Moore-Penrose pseudo-inverse †, because H y has one zero-eigenvalue with corresponding eigenvector v = (1 I , −1 J ), that is, v is a column vector with I ones follows by J minus ones. 22 We can therefore write H † y = H y + vv /(I + J) −1 − vv /(I + J).
The matrix H y = ∂ φyφ y (i,j)∈D y,ij (π 0 y,ij ) is simply the (I + J) × (I + J) Hessian matrix of minus the log-likelihood function with respect to all the fixed effects φ y = (α y , γ y ) . We decompose H y = D y + R y , where Note that the additively separable structure αi(y) + γj(y) is invariant to adding a constant to all the αi(y) and subtracting the same constant to all the γj(y).
where A y is the I × J matrix with entries A y,ij = Λ (1) y,ij if (i, j) ∈ D, and zero otherwise. Fernandez-Val and Weidner (2016) shows that this incidental parameter Hessian satisfies where A max refers to the maximum over the absolute values of all the elements of the matrix A. Note that Lemma D.1 in Fernandez-Val and Weidner (2016) is for the "expected Hessian", but for our logit model we have H y = EH y , conditional on regressors and fixed effects, so the distinction between Hessian and expected Hessian is irrelevant here. Also, in Fernandez-Val and Weidner (2016) the Hessian is not indexed by y, but the derivation of the bound there is in terms of global constants b min , b max and thus holds uniformly over y. Finally, Fernandez-Val and Weidner (2016) does not allow for missing observations, but since we only allow for a finite number of missing observations for every i and j that can only have a negligible effect on the Hessian matrix.
We thus have Intermediate results for the proof of of Lemma 6 The proof of Lemma 6 requires several intermediate results from the theory of stochastic processes, which are presented in the following. The notation a n b n means that a n ≤ C b n for some constant C that is independent of the sample size n. It is also convenient to define I := {1, 2, . . . , I} and J := {1, 2, . . . , J}. In this section we assume that Y is a bounded interval, and that y ij is continuously distributed with density bounded away from zero. The results for the case where y ij is discrete follow directly from Fernandez-Val and Weidner (2016). Results for a mixed distribution of y ij follow by combining the results for the continuous and discrete cases.
Bounds on sample averages over the score ∂ π y,ij . For every i ∈ I we define the empirical process Here, for ease of notation, we use the subscript J to denote the sample size, corresponding to the balanced panel case where |D i | = J. Following standard notation we write G J,i F := sup f ∈F |G J,i f |. Every element of F correspond to exactly one y ∈ Y, and in the following we write f y : y → 1( y ≤ y) for that element. Since E 1{y ij ≤ y} = Λ y,ij , where ∂ π y,ij := ∂ π y,ij (π 0 y,ij ) = Λ y,ij − 1{y ij ≤ y}.
Our goal is to show that max i∈I G J,i F = o P I 1/6 by using the following theorem. The theorem uses standard notation G n for the empirical process, denoting the sample size by n (not J or |D i |), and without an extra index i. The definition of J(δ, F) is given on p.239 of van der Vaart and Wellner (1996). All that matters to us is that J(1, F) only depends on F (not on the probability measure or on the empirical process) and that for the F defined in (S.12) we have J(1, F) < ∞, because F is VC class. An obvious envelope function for that F is F :ỹ → 1, which satisfies F n = 1. The minimal measurable majorant of {G n f : f ∈ F} is denoted by G n * F , and is identical to G n F for our purposes. Lemma S.1 (Restatement of Theorem 2.14.1 in van der Vaart and Wellner, 1996, for INID case). Let G n be the empirical process of an i.n.i.d. sample. 23 Let F be a P -measurable class of measurable functions with measurable envelope F . Then, for p ≥ 2, where F n is the L 2 (P n )-seminorm and the inequality is valid up to a constant depending only on the p involved in the statement. 23 An example is (S.12). In that example the sample size is n = |Di|. We require results for nonidentically distributed samples, because yij conditional on regressors and fixed effects is independent across j under our assumptions, but not identically distributed.
Proof. In van der Vaart and Wellner (1996) the theorem is stated for empirical processes from iid samples , but their proof relies only on symmetrization arguments (their Lemma 2.3.1) and sub-Gaussianity of the symmetrized process, which continue to hold for INID samples that we consider here (our y ij are conditionally independent, but not identically distributed).
Corollary S.1. Under Assumption 1 we have sup y∈Y max i∈I 1 |D i | j∈D i ∂ π y,ij = o P n 1/12 , and sup y∈Y max j∈J 1 |D j | i∈D j ∂ π y,ij = o P n 1/12 .
Proof. The definition (S.12) implies (S.13), so we want to show max i∈I G J,i F = o P I 1/6 . Applying Lemma S.1 for the function class Fwith the envelope function F :ỹ → 1 we find for p ≥ 1, where J(1, F) is a finite constant, independent of i, as noted earlier above. By Markov's inequality we thus find max i∈I G J,i F = O P (I 1/p ). Choosing p > 6 gives the desired result.
The second statement sup y∈Y max j∈J 1 √ |D j | i∈D j ∂ π y,ij = o P n 1/12 can be shown analogously.
Bounds on weighted sample averages over the score ∂ π y,ij . We also need results on sample averages of the form e.g. 1 y,hij are weights that also depend on the index y. The following lemma is useful for that purpose.
Lemma S.2. Suppose Z 1 (t), . . . , Z n (t) are independent, stochastic processes indexed by t ∈ T which are suitably measurable. Let B i denote a measurable envelope of Let B i denote a measurable envelope of {Z i (t), t ∈ T }. Let T be equipped with the pseudometric Let N ( , T, d n ) denote the covering number of T under d n balls of radius . Let Then X n − EX n * T P,p J n (1, T ) B n P,p .
Proof. The proof is analogous to the proof of Theorem 2.14.1 in van der Vaart and Wellner (1996), p.239, with a few notational adjustments.
denote the symmetrized version of X n , where ε = (ε i ) n i=1 are independent Rademacher. By Lemma 2.3.6 in van der Vaart and Wellner (1996) the L p (P ) norm of X n − EX n * T is bounded by the L p (P ) norm of 2 X o n * T . Let P ε denote the distribution of ε. Then by the standard argument, conditional on (Z i ) n i=1 , X o n is sub-Gaussian with respect to d n : Hence by Corollary 2.2.5 in van der Vaart and Wellner (1996), we conclude By a change of variables the right side is bounded by which is further bounded by B n J n (1, T ).
Every L p -norm is bounded by a multiple of the Ψ 2 -Orliczs norm. Hence where E ε is the expectation conditional on (Z i ) n i=1 . Take expectations over (Z i ) n i=1 to obtain the lemma.
Using Lemma S.2 we obtain the following corollary.
Corollary S.2. Let Assumption 1 hold. For y ∈ Y, h ∈ {1, . . . , I + J}, i ∈ I and j ∈ J, let b y,j be real numbers, which can depend on the sample size n, and on the regressors and fixed effects, but not on the outcome variable, and assume that sup y∈Y max h∈{1,...,I+J} max i∈I max j∈I max b y,hij ∂ π y,ij = O P n 1/6 , and (ii) sup Proof. For part (i) we apply Lemma S.2 with T = Y and y,hij ∂ π y,ij , for given h ∈ {1, . . . , I + J}, we can use constant envelope B i = B and the bound J n (1, T ) ≤ C, which can be established using standard arguments, we find that A h = sup y∈Y and therefore max h∈{1,...,I+J} A h = O P (n 1/6 ) as desired.
For the first result in part (ii), we apply Lemma S.2 with T = Y and Verification of the conditions of the lemma gives the desired result. The second result in part (ii) follows analogously.
FCLT for weighted sample averages over the score ∂ π y,ij . The following theorem will be used in the proof of part (ii) and (iii) of Lemma 6. Lemma S.3 (Theorem 2.11.11 in van der Vaart and Wellner, 1996). For each n, let Z n1 , . . . , Z n,mn be independent stochastic processes indexed by an arbitrary index set F. Suppose that there exists a Gaussian-dominated semimetric ρ on F such that mn =1 E (Z n (f ) − Z n (g)) 2 ≤ ρ 2 (f, g), for every f, g ∈ F, for every ρ-ball B(ε) ⊂ F of radius less than ε and for every n. Then the sequence mn =1 (Z ,n − E Z ,n ) is asymptotically tight in ∞ (F). It converges in distribution provided it converges marginally.
A semi-metric ρ is Gaussian-dominated if it is bounded above by a Gaussian semi-metric. Any semi-metric such that ∞ 0 log N ( , F, ρ)d < ∞ is Gaussian dominated.

Proof of Lemma 6
Proof of Lemma 6, Part (i). We have Combining the above we find that Q y s y = Q (1) y s y + Q (2) y s y + Q (3) y s y + Q (rem) y s y indeed satisfies sup y∈Y max (i,j)∈D (Q y s y ) ij = o P (n −1/6 ).
Z ,n is a triangular array, because W y and x y,ij both implicitly depend on n, implying that Z ,n 1 = Z ,n 2 for n 1 = n 2 . Remember that the probability measure we use throughout is conditional on x, α 0 , γ 0 , implying that the Z ,n are independent (but not identically distributed) across , according to our assumptions. Using the Lyapunov CLT it is easy to verify that all the finite dimensional marginals (Z n (f 1 ), Z n (f 2 ), . . . , Z n (f p )) of the stochastic process Z n converge weakly to a zero mean Gaussian limit process (Z (β) (f 1 ), Z (β) (f 2 ), . . . , Z (β) (f p )). It is also easy to show that the second moments of the limit process are given by EZ (β) (f 1 )Z (β) (f 2 ) = W −1 In order to conclude that the process Z n is weakly convergent we also need to show that Z n is tight. For this we employ Lemma S.3 above with m n = n and metric ρ given in (S.18). This ρ is Gaussian dominated on F, because we have log N (ε, F, ρ) log(K/ε), for 0 < ε < K, 0, for ε ≥ K, for some constant K > 0, implying that (S.19) is satisfied.
To verify condition (i) of Lemma S.3, we calculate where for the second inequality we multiplied with Z n F /η inside the expectation, which is larger than one for Z n F > η; for the third inequality we used that Z n F ≤ sup y W −1 y x y,i j ∞ / √ n; and for the final conclusion we used that sup i,j E sup y W −1 y x y,ij 2+δ is uniformly bounded.
for sufficiently large choice of C. In the last step we also use that Y is bounded, which together with B(ε) ⊂ F implies that the possible values of ε are bounded, so that we can always choose C sufficiently large to guarantee that 1 t 3 C 1 ≤ 1 & C ≤ ε = 0.
Thus, we can apply Lemma S.3 to find that n =1 (Z ,n − E Z ,n ) Z (β) , where Z (β) is a tight zero mean Gaussian process with second moments given above.
The proof of part (iii) of Lemma 6 is analogous.
Proof of Lemma 6, Part (iv) and (v). Decomposing Q y s y = Q Using that E Q y,j = |D j | −1 i∈D j x y,ij Λ (2) y,ij which are of order O P (1), uniformly over y and i and j. By employing part (ii) of Corollary S.2 we thus find that x y,ij Λ (2) y,ij |D i | 1/2 |D j | 1/2 |D i | −1/2 j ∈D i ∂ π y,ij |D j | −1/2 i ∈D j ∂ π y,i j x y,ij Λ (2) y,ij |D i | 1/2 |D j | 1/2 y,i ∂ π y,ij , where e (n) which is of order one, uniformly over y and i. Thus, by applying Corollary S.1, and Corollary S.2 with b (n) y,hij equal to the elements of the d x -vector e (n) y,i (i.e. no j-dependence), we find that C (3,β) y = O P (n −1/4 )o P n 1/12 O P n 1/6 = o P (1). Combining the above we conclude