Identification and Estimation of Unconditional Policy Effects of an Endogenous Binary Treatment: An Unconditional MTE Approach

This paper studies the identification and estimation of policy effects when treatment status is binary and endogenous. We introduce a new class of marginal treatment effects (MTEs) based on the influence function of the functional underlying the policy target. We show that an unconditional policy effect can be represented as a weighted average of the newly defined MTEs over the individuals who are indifferent about their treatment status. We provide conditions for point identification of the unconditional policy effects. When a quantile is the functional of interest, we introduce the UNconditional Instrumental QUantile Estimator (UNIQUE) and establish its consistency and asymptotic distribution. In the empirical application, we estimate the effect of changing college enrollment status, induced by a higher tuition subsidy, on the quantiles of the wage distribution.


Introduction
An unconditional policy effect is the effect of a change in a target covariate on the (unconditional) distribution of an outcome variable of interest. When the target covariate has a continuous distribution, we may be interested in shifting its location and evaluating the effect of such a shift on the distribution of the outcome variable. For example, we may consider increasing the number of years of education for every worker in order to improve the median of the wage distribution. When the change of the covariate distribution is small, such an effect may be referred to as the marginal unconditional policy effect.
In this paper, we consider a binary target covariate that indicates the treatment status. In this case, a location shift is not possible, and the only way to change its distribution is to change the proportion of treated individuals. We analyze the impact of such a marginal change on a general functional of the distribution of the outcome. For example, when the functional of interest is the mean of the outcome, this corresponds to the marginal policy-relevant treatment effect (MPRTE) of Carneiro, Heckman and Vytlacil (2010, 2011). For the case of quantiles, we obtain an unconditional quantile effect (UQE). Previously, Firpo, Fortin and Lemieux (2009) proposed using an unconditional quantile regression (UQR) to estimate the UQE. However, we show that their identification strategy can break down under endogeneity. We provide an extensive analysis of the resulting asymptotic bias of the UQR estimator.
The first contribution of this paper is to introduce a new class of unconditional marginal treatment effects (MTEs) and show that the corresponding unconditional policy effect can be represented as a weighted average of these unconditional MTEs. The new MTEs are based on the influence function of the functional of the outcome distribution that we care about. The weights depend on the characteristics of the subpopulation who is indifferent about the treatment take-up. For example, the UQE is a weighted average of the unconditional MTEs built on the influence function of quantiles. This shows that the MPRTE and UQE belong to the same family of parameters. To the best of our knowledge, this was not previously recognized in either the literature on MTEs or the literature on unconditional policy effects.
To illustrate the usefulness of this general approach, we provide an extensive analysis of the unconditional quantile effects. This is empirically important since the UQR estimator proposed by Firpo, Fortin and Lemieux (2009) is consistent for the UQE only if a certain distributional invariance assumption holds. Such an assumption is unlikely to hold when the treatment status is endogenous. We note that treatment endogeneity is the rule rather than the exception in economic applications.
The second contribution of this paper is to provide a closed-form expression for the asymptotic bias of the UQR estimator when endogeneity is neglected. The asymptotic bias can be traced back to two sources. First, the subpopulation of the individuals at the margin of indifference might have different characteristics than the whole population. We refer to this source of the bias as the marginal heterogeneity bias. Second, the treatment effect for a marginal individual might be different from an apparent effect obtained by comparing the treatment group with the control group. It is the marginal subpopulation, not the whole population or any other subpopulation, that contributes to the UQE. We refer to the second source of the bias as the marginal selection bias.
For the case with a binary treatment, the UQR of Firpo, Fortin and Lemieux (2009) does not account for the endogeneity and the difference between the marginal subpopulation and the whole population in terms of their covariate distributions and treatment effects. As a result, it can be plagued with biases from both sources. We show that in some cases, even if the treatment status is exogenous, the presence of other covariates can render the UQR estimator inconsistent. For example, this can occur in situations where the treatment status is partly determined by the covariates that also affect the outcome variable, that is, when the selection equation and the outcome equation have covariates in common. In such a case with no endogeneity, there is no marginal selection bias, but the UQR estimator can still suffer from a severe marginal heterogeneity bias. This intriguing result cautions against the use of the UQR without careful consideration.
One may ask if, as in a linear model, the asymptotic bias of the UQR estimator can be signed based on a correlation coefficient. The answer to this question is negative. The reason is that the asymptotic bias can be non-uniform across quantiles. We could have a positive bias for the 10th quantile and a negative bias for the 11th quantile. Strong assumptions have to be imposed on the data generating process to sign the bias a priori.
The third contribution of this paper is to show that if assumptions similar to instrument validity are imposed on the policy variables under intervention, the resulting UQE can be point identified using the local instrumental variable approach as in Carneiro and Lee (2009). Based on this, we introduce the UNconditional Instrumental QUantile Estimator (UNIQUE) and develop methods of statistical inference based on the UNIQUE when the binary treatment is endogenous. We take a nonparametric approach but allow the propensity score function to be either parametric or nonparametric. We establish the asymptotic distribution of the UNIQUE. This is a formidable task, as the UNIQUE is a four-step estimator, and we have to pin down the estimation error from each step.
variate vector. In our setting, selection into treatment follows a threshold-crossing model, where we use the exogenous variation of an instrument to obtain different counterfactual scenarios. Martinez-Iriarte (2020) introduces the quantile breakdown frontier in order to perform a sensitivity analysis to departures from the distributional invariance assumption employed by Firpo, Fortin and Lemieux (2009).
Estimation of the MTE and parameters derived from the MTE curve is discussed in Heckman, Urzua and Vytlacil (2006), Carneiro and Lee (2009), Carneiro, Heckman and Vytlacil (2010, 2011), and, more recently, Sasaki and Ura (2021). All of these studies make a linear-in-parameters assumption regarding the conditional means of the potential outcomes, which yields a tractable partially linear model for the MTE curve. This strategy is not very helpful in our case because our newly defined unconditional MTE might involve a nonlinear function of the potential outcomes. For example, for the case of quantiles, there is an indicator function involved. Using our expression for the weights, we can write the UQE as a quotient of two average derivatives. One of them, however, involves the estimated propensity score as a regressor, as in the setting of Hahn and Ridder (2013). We provide conditions, different from those in Hahn and Ridder (2013), under which the error from estimating the propensity score function, either parametrically or nonparametrically, does not affect the asymptotic variance of the UNIQUE. This may be of independent interest.
Outline. Section 2 introduces the new MTE curve and shows how it relates to the unconditional policy effect. Section 3 presents a model for studying the UQE under endogeneity. Section 4 considers intervening an instrumental variable in order to change the treatment status and establishes the identification of the corresponding UQE. Section 5 introduces and studies the UNIQUE under a parametric specification of the propensity score. Section 6 provides simulation evidence. In Section 7 we revisit the empirical application of Carneiro, Heckman and Vytlacil (2011) and focus on the unconditional quantile effect. Section 8 concludes. The Appendix contains all proofs as well as some additional explanations or derivations of some results in the main text and considers the UNIQUE under a nonparametric specification of the propensity score.
Notation. For any generic random variable $W_1$, we denote its CDF and pdf by $F_{W_1}(\cdot)$ and $f_{W_1}(\cdot)$, respectively. We denote its conditional CDF and pdf conditional on a second random variable $W_2$ by $F_{W_1|W_2}(\cdot|\cdot)$ and $f_{W_1|W_2}(\cdot|\cdot)$, respectively.

Policy Intervention and Unconditional Policy Effects
We employ the potential outcomes framework. For each individual, there are two potential outcomes: $Y(0)$ and $Y(1)$, where $Y(0)$ is the outcome had she received no treatment and $Y(1)$ is the outcome had she received treatment. We assume that the potential outcomes are given by $Y(0) = r_0(X, U_0)$ and $Y(1) = r_1(X, U_1)$ for a pair of unknown functions $r_0$ and $r_1$. The vector $X \in \mathbb{R}^{d_X}$ consists of observables and $(U_0, U_1)$ consists of unobservables. Depending on the individual's actual choice of treatment, denoted by $D$, we observe either $Y(0)$ or $Y(1)$, but we can never observe both. The observed outcome is denoted by $Y$:
$$Y = D\,Y(1) + (1-D)\,Y(0) = D\,r_1(X, U_1) + (1-D)\,r_0(X, U_0). \tag{1}$$
Following Heckman and Vytlacil (1999, 2001a, 2005), we assume that selection into treatment is determined by a threshold-crossing equation
$$D = 1\{V \le \mu(W)\}, \tag{2}$$
where $W = (Z, X)$ and $Z \in \mathbb{R}^{d_Z}$ consists of covariates that do not affect the potential outcomes directly. In the above, the unknown function $\mu(W)$ can be regarded as the benefit from the treatment and $V$ as the cost of the treatment. Individuals decide to take up the treatment if and only if its benefit outweighs its cost. Alternatively, we can think of $\mu(W)$ as the utility and $V$ as the disutility from participating in the program. While we observe $(D, W, Y)$, we observe neither $U := (U_0, U_1)$ nor $V$. Also, we do not restrict the dependence among $U$, $W$, and $V$. Hence, they can be mutually dependent and $D$ could be endogenous.
To change the treatment take-up rate, we manipulate $Z$. More specifically, we assume that $Z$ is a vector of continuous random variables and consider a policy intervention that changes $Z$ into $Z_\delta = G(W, \delta)$ for a vector of smooth functions $G(\cdot, \cdot) \in \mathbb{R}^{d_Z}$. We assume that $G(W, 0) = Z$ so that the status quo policy corresponds to $\delta = 0$. For example, we can take $G(W, \delta) = Z + \delta$. In this case, we change the value of $Z$ by $\delta$ for everyone in the population. When $Z$ has two components, say, $Z = (Z_1, Z_2)$, we may consider intervening on only one component, for example, $G(W, \delta) = (Z_1 + s(\delta), Z_2)$ for some function $s(\delta)$ with $s(0) = 0$. More general interventions, such as $G(W, \delta) = Z(1 + \delta)$, $Z + \delta Z^2$, or $Z + \delta Z X$ when $d_X = d_Z = 1$, are allowed. Even though $Z$ can be correlated with other variables, observed or not, we assume that $Z$ does not contain causal factors of other variables so that the induced change in $Z$ does not cause other variables to change. However, we do not rule out that other variables may cause $Z$ or that there is a common factor that causes $Z$ and other variables, leading to a nonzero correlation between $Z$ and other variables.
When we induce the covariate $Z$ to change, the distribution of this covariate will change. However, we do not specify the new distribution a priori. Instead, we specify the policy rule that pins down how the value of the covariate will be changed for each individual in the population. Our intervention may then be regarded as a value intervention. This is in contrast to a distribution intervention that stipulates a new covariate distribution directly. An advantage of our policy rule is that it is directly implementable in practice while a hypothetical distribution intervention is not. The latter intervention may still have to be implemented via a value intervention, which is our focus here.
With the induced change in $Z$, the selection equation becomes
$$D_\delta = 1\{V \le \mu(Z_\delta, X)\}. \tag{3}$$
The outcome equation, in turn, is now
$$Y_\delta = D_\delta\,Y(1) + (1-D_\delta)\,Y(0). \tag{4}$$
These two equations are the same as the status quo equations; the only exception is that $Z$ has been replaced by $Z_\delta$. We have maintained the structural forms of the outcome equation and the treatment selection equation. Importantly, we have also maintained the stochastic dependence among $U$, $W$, and $V$, which is manifested through the use of the same notation $U$, $W$, and $V$ in equations (3) and (4) as in equations (1) and (2). Our policy intervention has a ceteris paribus interpretation at the population level: we apply the same form of intervention on $Z$ for all individuals in the population but hold all else, including the causal mechanism and the stochastic dependence among the status quo variables, constant. Note that the model described by (3) and (4) coincides with the model given in (1) and (2) if we set $\delta = 0$. For notational convenience, when $\delta = 0$, we drop the subscript, and we write $Y$ and $D$ for $Y_0$ and $D_0$, respectively. It may be useful to reiterate that, regardless of the value of $\delta$, the pattern of stochastic dependence among $U$, $W$, and $V$ is the same. In particular, the conditional distribution of $(U, V)$ given $W$ is invariant to the value of $\delta$.
Under the status quo policy regime, the propensity score is $P(w) := \Pr(D = 1|W = w)$. In view of (2), we can represent it as
$$P(w) = \Pr(V \le \mu(W)|W = w) = F_{V|W}(\mu(w)|w).$$
If the conditional CDF $F_{V|W}(\cdot|w)$ is a strictly increasing function for all $w \in \mathcal{W}$, the support of $W$, we have
$$D = 1\{V \le \mu(W)\} = 1\{F_{V|W}(V|W) \le F_{V|W}(\mu(W)|W)\} = 1\{U_D \le P(W)\},$$
where $U_D := F_{V|W}(V|W)$ measures an individual's relative resistance to the treatment. It can be shown that $U_D$ is uniform on $[0, 1]$ and is independent of $W$.
Under the counterfactual policy regime, $W = (Z, X)$ becomes $W_\delta := (Z_\delta, X)$, and we have
$$D_\delta = 1\{V \le \mu(W_\delta)\} = 1\{F_{V|W}(V|W) \le F_{V|W}(\mu(W_\delta)|W)\} = 1\{U_D \le P_\delta(W)\}, \tag{5}$$
where $P_\delta(W) := F_{V|W}(\mu(W_\delta)|W)$ is a function of $W$ and $\delta$. Note that $U_D$ in (5) is still defined as $F_{V|W}(V|W)$, and so it does not change under the counterfactual policy regime. This is to say that relative to others, an individual's resistance to the treatment is maintained across the two policy regimes. In particular, $U_D$ is still uniform on $[0, 1]$ and independent of $W$. The propensity score under the new policy regime is then equal to
$$\Pr(D_\delta = 1|W = w) = P_\delta(w).$$
To maintain the generality of the policy change, we will not specify the policy function $G(\cdot, \delta)$ for now, but we assume that the policy will increase the participation rate (in expectation) by $\delta$. That is, under the intervention $G(\cdot, \delta)$, $P_\delta(\cdot)$ satisfies
$$E[P_\delta(W)] = E[P(W)] + \delta,$$
and for $\delta = 0$, $P_0(w) = P(w)$ for all $w \in \mathcal{W}$, the support of $W$.
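The equivalence between the threshold-crossing rule and the representation in terms of $U_D$ can be checked numerically. The following is a minimal sketch under assumptions that are not in the paper: there is no $X$, the index is the hypothetical $\mu(z) = 0.5z$, and $V$ is standard normal and drawn independently of $Z$, so that $F_{V|W}$ is the standard normal CDF. Under endogeneity the same representation applies with the appropriate conditional CDF.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # F_{V|W} under the independence assumption made here

def simulate_selection(n=20000, seed=0):
    """Draw (D, U_D, P(W)) from D = 1{V <= mu(W)} with the hypothetical mu(z) = 0.5 * z."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)        # instrument Z (no X, to keep the sketch short)
        v = rng.gauss(0.0, 1.0)        # unobserved cost V, drawn independently of Z
        d = 1 if v <= 0.5 * z else 0   # threshold-crossing rule
        u_d = STD_NORMAL.cdf(v)        # U_D = F_{V|W}(V | W)
        p = STD_NORMAL.cdf(0.5 * z)    # P(W) = F_{V|W}(mu(W) | W)
        draws.append((d, u_d, p))
    return draws

draws = simulate_selection()
# Share of draws where the two representations of D agree (expected to be 1.0).
agreement = sum(d == int(u_d <= p) for d, u_d, p in draws) / len(draws)
# U_D should look uniform on [0, 1], so its sample mean should be close to 0.5.
mean_u_d = sum(u_d for _, u_d, _ in draws) / len(draws)
print(agreement, round(mean_u_d, 3))
```

The sketch only illustrates the change of variables from $V$ to $U_D$; it does not simulate a policy change.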
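Let $\mathcal{F}^*$ be the space of finite signed measures $\nu$ on $\mathcal{Y} \subseteq \mathbb{R}$ with distribution function $F_\nu(y) = \nu(-\infty, y]$ for $y \in \mathcal{Y}$. We endow $\mathcal{F}^*$ with the usual supremum norm: for two distribution functions $F_{\nu_1}$ and $F_{\nu_2}$ associated with the respective signed measures $\nu_1$ and $\nu_2$, we define $\|F_{\nu_1} - F_{\nu_2}\|_\infty := \sup_{y \in \mathcal{Y}} |F_{\nu_1}(y) - F_{\nu_2}(y)|$. Let $F_Y(y) = \nu_Y(-\infty, y]$, where $\nu_Y$ is the measure induced by the distribution of $Y$. Define $F_{Y_\delta}$ similarly. Clearly, both $F_Y$ and $F_{Y_\delta}$ belong to $\mathcal{F}^*$. We consider a general functional $\rho : \mathcal{F}^* \to \mathbb{R}$ and study the general unconditional policy effect.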

Definition 1. General Unconditional Policy Effect
The general unconditional policy effect for the functional $\rho$ is defined as
$$\Pi_\rho := \lim_{\delta \to 0} \frac{\rho(F_{Y_\delta}) - \rho(F_Y)}{\delta}$$
whenever this limit exists.
The definition above is the same as that of the marginal partial distributional policy effect defined in Rothe (2012). Examples of $\rho$ are the mean, as in the marginal policy-relevant treatment effect of Carneiro, Heckman and Vytlacil (2010, 2011), and the quantiles, as in the unconditional quantile effects of Firpo, Fortin and Lemieux (2009). In order to ensure that $\Pi_\rho$ exists, we will assume a certain smoothness of $\rho$ and the closeness of $F_{Y_\delta}$ to $F_Y$.
We first consider a Hadamard differentiable functional ρ. For completeness, we provide the definition of Hadamard differentiability below.
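Definition 2. $\rho : \mathcal{F}^* \to \mathbb{R}$ is Hadamard differentiable at $F \in \mathcal{F}^*$ if there exists a linear and continuous functional $\dot\rho_F : \mathcal{F}^* \to \mathbb{R}$ such that for any $G \in \mathcal{F}^*$ and $G_\delta \in \mathcal{F}^*$ with $\lim_{\delta \to 0} \|G_\delta - G\|_\infty = 0$, we have
$$\lim_{\delta \to 0} \frac{\rho(F + \delta G_\delta) - \rho(F)}{\delta} = \dot\rho_F(G).$$
To see how we can use the Hadamard differentiability to obtain $\Pi_\rho$, we write
$$\frac{\rho(F_{Y_\delta}) - \rho(F_Y)}{\delta} = \frac{\rho(F_Y + \delta G_\delta) - \rho(F_Y)}{\delta}, \qquad G_\delta := \frac{F_{Y_\delta} - F_Y}{\delta}. \tag{6}$$
As long as we can show that $\lim_{\delta \to 0} \|G_\delta - G\|_\infty = 0$ for some $G$, we obtain $\Pi_\rho = \dot\rho_{F_Y}(G)$. Next, we provide sufficient conditions for $\lim_{\delta \to 0} \|G_\delta - G\|_\infty = 0$. We first make a support assumption.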

Assumption 2. Regularity Conditions
(a) For d = 0, 1, the random variables (Y(d), U D , W) are absolutely continuous with joint density

Assumption 3. Domination Conditions
In Assumption 2 "for all w ∈ W" can be replaced by "for almost all w ∈ W", and the supremum over w ∈ W can be replaced by the essential supremum over w ∈ W.
The results of Lemma 1 can be used to approximate $F_{Y_\delta}(y)$ in (7) by taking an expansion around $\delta = 0$.
Lemma 2. Let Assumptions 1-3 hold. Then
$$F_{Y_\delta}(y) = F_Y(y) + \delta\, E\left[\left\{F_{Y(1)|U_D,W}(y|P(W), W) - F_{Y(0)|U_D,W}(y|P(W), W)\right\}\dot P(W)\right] + R_F(\delta; y),$$
where the remainder $R_F(\delta; y)$ satisfies $\sup_{y \in \mathcal{Y}} |R_F(\delta; y)| = o(|\delta|)$ as $\delta \to 0$.
Remark 1. Lemma 2 provides a linear approximation to $F_{Y_\delta}$, the CDF of the outcome variable under $D_\delta$. Essentially, it says that the proportion of individuals with outcome below $y$ under the new policy regime, that is, $F_{Y_\delta}(y)$, will be equal to the proportion of individuals with outcome below $y$ under the existing policy regime, that is, $F_Y(y)$, plus an adjustment given by the marginal entrants.
Consider $\delta > 0$ and $P_\delta(w) > P(w)$ for all $w \in \mathcal{W}$ as an example. In this case, because of the policy intervention, the individuals who are on the margin, namely those with $u_D = P(w)$, will switch their treatment status from 0 to 1. Such a switch contributes to $F_{Y_\delta}(y)$ by the amount $F_{Y(1)|U_D,W}(y|P(w), w) - F_{Y(0)|U_D,W}(y|P(w), w)$, averaged over the distribution of $W$ for a certain subpopulation. We will show later that the subpopulation is exactly the group of individuals who are on the margin under the existing policy regime.
Remark 2. The linear approximation to $F_{Y_\delta}$ is uniform over $\mathcal{Y}$ as $\delta \to 0$. We need the uniform approximation because we consider a general Hadamard differentiable $\rho$. In the special case when $\rho$ does not depend on the whole distribution, the uniformity of the approximation over the whole support $\mathcal{Y}$ may not be necessary. In the quantile case, we maintain the uniformity because we consider all quantile levels in $(0, 1)$.
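Lemma 2 ensures that if we take
$$G(y) := E\left[\left\{F_{Y(1)|U_D,W}(y|P(W), W) - F_{Y(0)|U_D,W}(y|P(W), W)\right\}\dot P(W)\right], \tag{8}$$
then we have $\lim_{\delta \to 0} \|G_\delta - G\|_\infty = 0$ for $G_\delta$ defined in (6). Hence, under the Hadamard differentiability of $\rho$, we obtain
$$\Pi_\rho = \dot\rho_{F_Y}(G) = \int_{\mathcal{Y}} \psi(y, \rho, F_Y)\, dG(y), \tag{9}$$
where $\psi(y, \rho, F_Y)$ is the influence function of $\rho$ at $F_Y$. It is defined as
$$\psi(y, \rho, F_Y) := \dot\rho_{F_Y}(\Delta_y - F_Y) = \lim_{\epsilon \to 0} \frac{\rho(F_Y + \epsilon(\Delta_y - F_Y)) - \rho(F_Y)}{\epsilon},$$
where $\Delta_y$ is the probability measure that assigns mass 1 to the single point $\{y\}$. Plugging (8) into (9) yields the following theorem.
Theorem 1. Let Assumptions 1-3 hold. Assume further that $\rho : \mathcal{F}^* \to \mathbb{R}$ is Hadamard differentiable. Then
$$\Pi_\rho = \int_{\mathcal{W}} E\left[\psi(Y(1), \rho, F_Y) - \psi(Y(0), \rho, F_Y) \,\middle|\, U_D = P(w), W = w\right] \dot P(w) f_W(w)\, dw.$$
Definition 3. The unconditional marginal treatment effect for the $\rho$ functional is
$$\mathrm{MTE}_\rho(u, w) := E\left[\psi(Y(1), \rho, F_Y) - \psi(Y(0), \rho, F_Y) \,\middle|\, U_D = u, W = w\right]. \tag{10}$$
With the above definition, we can write $\Pi_\rho$ as
$$\Pi_\rho = \int_{\mathcal{W}} \mathrm{MTE}_\rho(P(w), w)\, \dot P(w) f_W(w)\, dw.$$
The general unconditional policy effect $\Pi_\rho$ can be represented as a weighted average of $\mathrm{MTE}_\rho(u, w)$ over the marginal subpopulation: for the individuals with $W = w$, only those for whom $u_D = P(w)$ will contribute to the unconditional effect. Among the group defined by $W = w$, there is a subgroup of individuals who are indifferent between participating and not participating: those for whom $u_D = P(w)$, that is, those for whom $v$ satisfies $F_{V|W}(v|w) = P(w)$. A small incentive will induce a change in the treatment status for only this subgroup of individuals. It is the change in their treatment status, and hence the change in the composition of $Y(1)$ and $Y(0)$ in the observed outcome $Y$, that changes its unconditional characteristics such as the quantiles.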
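As a worked illustration, the unconditional MTE can be specialized to two familiar functionals. The display below is a sketch that assumes the unconditional MTE is the conditional mean of the difference in influence functions, $E[\psi(Y(1), \rho, F_Y) - \psi(Y(0), \rho, F_Y) \mid U_D = u, W = w]$, in line with the weighted-average representation described above. It uses the standard influence functions of the mean and of the $\tau$-quantile; the notation $\rho_\tau$, $y_\tau$ and the condition $f_Y(y_\tau) > 0$ are assumptions made for the illustration.

```latex
\begin{align*}
\text{Mean:}\quad
  \psi(y, \rho, F_Y) &= y - E[Y], \\
  \mathrm{MTE}_\rho(u, w) &= E\left[Y(1) - Y(0) \mid U_D = u, W = w\right]. \\[6pt]
\text{$\tau$-quantile:}\quad
  \psi(y, \rho_\tau, F_Y) &= \frac{\tau - 1\{y \le y_\tau\}}{f_Y(y_\tau)}, \\
  \mathrm{MTE}_{\rho_\tau}(u, w)
    &= \frac{E\left[1\{Y(0) \le y_\tau\} - 1\{Y(1) \le y_\tau\} \mid U_D = u, W = w\right]}{f_Y(y_\tau)} \\
    &= \frac{F_{Y(0)|U_D,W}(y_\tau \mid u, w) - F_{Y(1)|U_D,W}(y_\tau \mid u, w)}{f_Y(y_\tau)}.
\end{align*}
```

In the mean case the constant $E[Y]$ cancels in the difference, which recovers the usual MTE; in the quantile case the constant $\tau$ cancels and only the two conditional CDFs at $y_\tau$ remain.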

The Role of $\dot P(w)$
Theorem 1 shows that the unconditional effect depends also on $\dot P(w)$. Under Assumption 2(c), we have $\int_{\mathcal{W}} \frac{\partial P_\delta(w)}{\partial \delta} f_W(w)\, dw = 1$ for any $\delta \in N_\varepsilon$, hence $\int_{\mathcal{W}} \dot P(w) f_W(w)\, dw = 1$. Thus the integrals in $\Pi_\rho$ can be regarded as a weighted mean with the weight given by $\dot P(w)$. It is important to note that $\dot P(w)$ depends on the form of the policy function $G(\cdot, \delta)$, which determines how we change the propensity score and who the marginal entrants are. Different policy interventions result in different changes in the propensity score, which then lead to different sets of marginal entrants and different unconditional effects.
For intuition on this, consider the case where $\delta > 0$ and $P_\delta(w) \ge P(w)$ for all $w \in \mathcal{W}$. Then we have
$$\dot P(w) = \lim_{\delta \to 0} \frac{P_\delta(w) - P(w)}{\delta} = \lim_{\delta \to 0} \frac{P_\delta(w) - P(w)}{E[P_\delta(W)] - E[P(W)]}.$$
Thus, $\dot P(w)$ measures the relative contribution to the overall improvement in the participation rate (i.e., $\delta$) for the individuals with $W = w$. For each value of $W$, only individuals on the margin ("the marginal individuals") will change their treatment status and contribute to the overall improvement in the participation rate. The relative "thickness" of the margin depends on $w$ and is measured by $\dot P(w)$.
We can use Figure 1 to convey the intuition behind $\dot P(w)$. The figure illustrates the marginal individuals under the existing and new policy regimes. Under the existing policy regime, the marginal individuals lie on the 45-degree line in the $(P(w), u_D)$-plane. For easy reference, we call it the marginal curve, which is the set of points $\{(P(w), u_D) : u_D = P(w)\}$. Under the new policy regime, the marginal curve is now $\{(P(w), u_D) : u_D = P_\delta(w)\}$. Note that we can rewrite $u_D = P_\delta(w)$ as $u_D = P(w) + [P_\delta(w) - P(w)]$. Thus the new marginal curve can be obtained by shifting every point on the original marginal curve up by $P_\delta(w) - P(w)$. The magnitude of the upward shift is approximately $\dot P(w)\delta$, which is, in general, different for different values of $w$. The integral of the difference of the two marginal curves (i.e., the area of the gray region) weighted by the marginal density $f_W(\cdot)$ of $W$ is equal to
$$\int_{\mathcal{W}} [P_\delta(w) - P(w)] f_W(w)\, dw = E[P_\delta(W)] - E[P(W)] = \delta.$$
To understand the weight $f_W(w)\dot P(w)$ that appears in Theorem 1, let $\epsilon$ be a small positive number. Then, $f_W(w)\epsilon$ measures the proportion of individuals for whom $W$ is in $[w - \epsilon/2, w + \epsilon/2]$. Note that for $W \in [w - \epsilon/2, w + \epsilon/2]$, the propensity scores under $D$ and $D_\delta$ are approximately $P(w)$ and $P_\delta(w)$. The proportion of the individuals for whom $W \in [w - \epsilon/2, w + \epsilon/2]$ and who have switched their treatment status from 0 to 1 is then approximately equal to $f_W(w)\epsilon \cdot [P_\delta(w) - P(w)]$. Scaling this by $\delta$, which is the overall proportion of the individuals who have switched the treatment status, we obtain $f_W(w) \cdot [P_\delta(w) - P(w)]/\delta \cdot \epsilon$. Thus, we can regard $f_W(w) \cdot [P_\delta(w) - P(w)]/\delta$ as the density function of $W$ among those who have switched the treatment status from 0 to 1 as a result of the policy intervention. Mathematically, we have
$$\Pr\left(W \in [w - \epsilon/2, w + \epsilon/2] \mid D = 0, D_\delta = 1\right) = \frac{\Pr\left(W \in [w - \epsilon/2, w + \epsilon/2], D = 0, D_\delta = 1\right)}{\Pr(D = 0, D_\delta = 1)} \approx f_W(w)\,\frac{P_\delta(w) - P(w)}{\delta}\,\epsilon.$$
Taking the limit as $\epsilon \to 0$, we obtain
$$\lim_{\epsilon \to 0} \frac{\Pr\left(W \in [w - \epsilon/2, w + \epsilon/2] \mid D = 0, D_\delta = 1\right)}{\epsilon} = f_W(w)\,\frac{P_\delta(w) - P(w)}{\delta}.$$
Thus $f_W(w)[P_\delta(w) - P(w)]/\delta$ is the density of $W$ among those who respond positively to the policy intervention, that is, those with $D = 0$ and $D_\delta = 1$. Graphically, $f_W(w)[P_\delta(w) - P(w)]/\delta$ is the conditional density of $W$ conditional on $(P(W), U_D)$ being in the gray region in Figure 1.
Letting $\delta \to 0$, we have $f_W(w)[P_\delta(w) - P(w)]/\delta \to f_W(w)\dot P(w)$. That is, $f_W(w)\dot P(w)$ is the limit of the density of $W$ among those with $D = 0$ and $D_\delta = 1$. We can therefore refer to $f_W(w)\dot P(w)$ as the density of the distribution of $W$ over the marginal subpopulation that consists of all marginal individuals. In view of the above interpretation of $f_W(w)\dot P(w)$, Theorem 1 shows that the unconditional effect is equal to the change in the influence functions for the marginal individuals, weighted by the density of the distribution of $W$ over those marginal individuals.
Noting that $f_W(w)$ is the density of the distribution of $W$ over the entire population, we can regard $\dot P(w)$ as the Radon-Nikodym (henceforth RN) derivative of the subpopulation distribution with respect to the population distribution. Even if $\dot P(w)$ is not positive for all $w \in \mathcal{W}$, the RN interpretation is still valid. In this case, the distribution with density $f_W(w)\dot P(w)$ with respect to the Lebesgue measure is a signed measure.
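The weight $\dot P(w)$ can be made concrete in a simple parametric case. The sketch below rests on assumptions that are not in the paper: no $X$, $Z \sim N(0,1)$, a probit propensity score $P(z) = \Phi(az)$ with the hypothetical slope $a = 0.8$, and a location-shift policy $Z_\delta = Z + s(\delta)$ with $s(\delta)$ chosen so that the participation rate rises by exactly $\delta$. Under these assumptions $\dot P(z) = \phi(az)\sqrt{2\pi(1 + a^2)}$, and the code checks numerically that the weights integrate to one against $f_Z$ and that a finite difference of $P_\delta$ matches the analytic derivative.

```python
from math import pi, sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal: cdf, pdf, inv_cdf
A = 0.8             # hypothetical slope of the probit index

def propensity(z, shift=0.0):
    """P_delta(z) = Phi(A * (z + s)) for a shift s; s = 0 gives the status quo P(z)."""
    return PHI.cdf(A * (z + shift))

def shift_for(delta):
    """Shift s(delta) with E[P_delta(Z)] = E[P(Z)] + delta when Z ~ N(0, 1)."""
    # Uses E[Phi(A * (Z + s))] = Phi(A * s / sqrt(1 + A^2)), so E[P(Z)] = 0.5.
    return sqrt(1 + A * A) / A * PHI.inv_cdf(0.5 + delta)

def p_dot(z):
    """Analytic derivative of P_delta(z) with respect to delta at delta = 0."""
    return PHI.pdf(A * z) * sqrt(2 * pi * (1 + A * A))

# Riemann sum of p_dot(z) * f_Z(z) over a wide grid: the weights should integrate to one.
step = 0.01
grid = [-8 + step * i for i in range(1601)]
total_weight = sum(p_dot(z) * PHI.pdf(z) * step for z in grid)

# Finite-difference check of p_dot at a single point.
delta = 1e-4
z0 = 0.5
finite_diff = (propensity(z0, shift_for(delta)) - propensity(z0)) / delta
print(round(total_weight, 4), round(p_dot(z0), 4), round(finite_diff, 4))
```

In this example the weight is largest where $\phi(az)$ is largest, that is, near $z = 0$, so the marginal subpopulation is concentrated around propensity scores close to one half rather than spread like the population of $Z$.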

$\dot P(w)$ and MPRTE
While Theorem 1 covers general functionals, it does not cover the mean functional $\rho(F_Y) = \int_{\mathcal{Y}} y\, dF_Y(y)$ unless $\mathcal{Y}$ is a bounded set. When $\mathcal{Y}$ is unbounded, the mean functional is not continuous on $(\mathcal{F}^*, \|\cdot\|_\infty)$ and hence is not Hadamard differentiable (see, for example, Exercise 7 in Chapter 20 of van der Vaart (2000)). In such a case, we opt for a direct approach by first showing that the remainder from Lemma 2 satisfies $\lim_{\delta \to 0} \frac{1}{\delta} \int_{\mathcal{Y}} y\, dR_F(\delta; y) = 0$, and then exploiting the linearity of the mean functional to obtain
$$\Pi_\rho = \lim_{\delta \to 0} \frac{E[Y_\delta] - E[Y]}{\delta} = \int_{\mathcal{Y}} y\, dG(y). \tag{11}$$
The result in (11) holds under Assumption 4, a stronger version of Assumption 3.
Corollary 1. Let Assumptions 1, 2, and 4 hold. Then, for the mean functional, we have
$$\Pi_\rho = \int_{\mathcal{W}} E\left[Y(1) - Y(0) \mid U_D = P(w), W = w\right] \dot P(w) f_W(w)\, dw. \tag{12}$$
Heckman and Vytlacil (2001b, 2005) consider the policy-relevant treatment effect defined as
$$\mathrm{PRTE}_\delta := \frac{E[Y_\delta] - E[Y]}{E[D_\delta] - E[D]} = \frac{E[Y_\delta] - E[Y]}{\delta}. \tag{13}$$
Taking the limit $\delta \to 0$ yields the marginal policy-relevant treatment effect (MPRTE) of Carneiro, Heckman and Vytlacil (2010): $\mathrm{MPRTE} = \lim_{\delta \to 0} \mathrm{PRTE}_\delta$. Carneiro, Heckman and Vytlacil (2010, 2011) show that MPRTE can be represented in terms of the marginal treatment effect curve. Indeed, if we drop $X$ for simplicity and assume that $Z$ is independent of $U$ conditional on $U_D$, the usual MTE curve is $\mathrm{MTE}(u) := E[Y(1) - Y(0) \mid U_D = u]$, and
$$\mathrm{MPRTE} = \lim_{\delta \to 0} \int_0^1 \mathrm{MTE}(u)\, \frac{F_P(u) - F_{P_\delta}(u)}{\delta}\, du, \tag{14}$$
where $F_{P_\delta}$ is the CDF of the random variable $P_\delta := P_\delta(Z)$ and $F_P$ is the CDF of $P := P(Z)$. In the Appendix we show that equations (12) and (14) are equivalent. The expression in (12) has the advantage of explicitly depending on $\dot P(w)$ which, as we mentioned, has the interpretation as the density of the distribution of $W$ over the marginal individuals. This shows that our results include the existing results on MPRTE as special cases.
We note that unlike the mean functional, which is linear, the quantile functional is not linear. In the latter case and for more general nonlinear functionals, we cannot use the law of iterated expectations as in the linear case to represent an unconditional effect as the expectation of the corresponding conditional effect. Because of this fundamental difference, our results are not straightforward extensions of existing results in the literature.

Understanding the Unconditional MTE
To further understand the unconditional MTE, we provide another perspective here. This subsection is for heuristics only; rigorous developments have already been given in the previous subsections.
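If $\dot\rho_F$ is the continuous derivative of $\rho$ at $F_Y$, we can expect
$$\Pi_\rho = \dot\rho_{F_Y}\left(\lim_{\delta \to 0} \frac{F_{Y_\delta} - F_Y}{\delta}\right).$$
In the spirit of Heckman and Vytlacil (2001b, 2005), we may then regard $[F_{Y_\delta}(y) - F_Y(y)]/\delta$ as a policy-relevant treatment effect: it is the effect of the policy on the percentage of individuals whose value of $Y$ is less than $y$. The effect is tied to a particular value $y$, and we obtain a continuum of policy-relevant effects indexed by $y \in \mathcal{Y}$ if $y$ is allowed to vary over $\mathcal{Y}$. The limit $\lim_{\delta \to 0} [F_{Y_\delta}(y) - F_Y(y)]/\delta$ can then be regarded as a continuum of marginal policy-relevant treatment effects indexed by $y \in \mathcal{Y}$. By the results of Carneiro and Lee (2009), for each $y \in \mathcal{Y}$, the marginal policy-relevant treatment effect can be represented as a weighted average of the following policy-relevant "distributional" MTE:
$$\mathrm{MTE}_d(u, w; y) := E\left[1\{Y(1) \le y\} - 1\{Y(0) \le y\} \mid U_D = u, W = w\right] = F_{Y(1)|U_D,W}(y|u, w) - F_{Y(0)|U_D,W}(y|u, w).$$
Applying $\dot\rho_{F_Y}$ to $\mathrm{MTE}_d(u, w; \cdot)$ gives $E[\psi(Y(1), \rho, F_Y) - \psi(Y(0), \rho, F_Y) \mid U_D = u, W = w]$. This shows that $\dot\rho_F \circ \mathrm{MTE}_d$ is exactly the unconditional MTE defined in (10). The unconditional MTE is therefore a composition of the underlying influence function with the policy-relevant distributional MTE.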

Unconditional Quantile Regressions under Endogeneity
In the rest of the paper, we focus on the quantile functional $\rho_\tau(F_Y) := \inf\{y : F_Y(y) \ge \tau\}$ for $\tau \in (0, 1)$, and in this section, we apply our results to study the unconditional quantile regression of Firpo, Fortin and Lemieux (2009) in great detail. We provide an in-depth analysis of the sources of the asymptotic bias of the UQR estimator.
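The influence function of the quantile functional has the well-known closed form $\psi(y, \rho_\tau, F_Y) = (\tau - 1\{y \le y_\tau\})/f_Y(y_\tau)$. The sketch below compares this formula with a numerical derivative of the $\tau$-quantile along the contaminated distribution $(1 - \epsilon)F_Y + \epsilon\Delta_y$. Taking $F_Y$ to be standard normal is an assumption made purely for illustration, and the function names are hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist()  # F_Y is taken to be standard normal in this sketch

def quantile_influence(y, tau):
    """psi(y, rho_tau, F_Y) = (tau - 1{y <= y_tau}) / f_Y(y_tau)."""
    y_tau = PHI.inv_cdf(tau)
    return (tau - (1.0 if y <= y_tau else 0.0)) / PHI.pdf(y_tau)

def contaminated_quantile(y, tau, eps):
    """tau-quantile of the mixture (1 - eps) * F_Y + eps * (point mass at y)."""
    above = PHI.inv_cdf(tau / (1 - eps))          # valid if the point mass sits above the quantile
    if above < y:
        return above
    below = PHI.inv_cdf((tau - eps) / (1 - eps))  # valid if the point mass sits at or below it
    if below >= y:
        return below
    return y                                      # the quantile falls on the jump at y

def numerical_derivative(y, tau, eps=1e-6):
    """Finite-difference derivative of the quantile in the direction of the point mass."""
    return (contaminated_quantile(y, tau, eps) - PHI.inv_cdf(tau)) / eps

for y in (-1.0, 1.0):
    print(y, round(quantile_influence(y, 0.5), 4), round(numerical_derivative(y, 0.5), 4))
```

The influence function takes only two values, one for outcomes at or below $y_\tau$ and one for outcomes above it, which is why the unconditional MTE for quantiles involves an indicator function of the potential outcomes.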

UQR with a Binary Regressor
First, we review the method of UQR proposed by Firpo, Fortin and Lemieux (2009), and show how the identification strategy breaks down when $D$ is endogenous. Let $y_\tau$ be the $\tau$-quantile of $Y$, and let $y_{\tau,\delta}$ be the $\tau$-quantile of $Y_\delta$. That is, $\Pr(Y \le y_\tau) = \Pr(Y_\delta \le y_{\tau,\delta}) = \tau$. We are interested in the behavior of $y_{\tau,\delta}$ as $\delta \to 0$.

Definition 4. Unconditional Quantile Effect
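The unconditional quantile effect (UQE) is defined as
$$\Pi_\tau := \lim_{\delta \to 0} \frac{y_{\tau,\delta} - y_\tau}{\delta}.$$
By definition, we have
$$F_{Y_\delta}(y) = \Pr(D_\delta = 1)\, F_{Y_\delta|D_\delta}(y|1) + \Pr(D_\delta = 0)\, F_{Y_\delta|D_\delta}(y|0).$$
When $W$ is not present, Corollary 3 in the working paper Firpo, Fortin and Lemieux (2007) makes the following assumption to achieve identification:
$$\Pr(Y_\delta \le y \mid D_\delta = d) = \Pr(Y \le y \mid D = d) \quad \text{for } d = 0, 1. \tag{16}$$
We refer to this as distributional invariance, and it readily identifies the counterfactual distribution:
$$F_{Y_\delta}(y) = \Pr(D_\delta = 1)\, F_{Y|D}(y|1) + \Pr(D_\delta = 0)\, F_{Y|D}(y|0).$$
Under some mild conditions, Firpo, Fortin and Lemieux (2007) obtain
$$\Pi_\tau = \frac{1}{f_Y(y_\tau)} \left[F_{Y|D}(y_\tau|0) - F_{Y|D}(y_\tau|1)\right]. \tag{17}$$
The distributional invariance assumption given in (16) is key for obtaining the above identification result. It requires that the conditional distribution of the outcome variable conditioning on the treatment status remains the same across the two policy regimes. If treatments are randomly assigned under both policy regimes, then $\Pr(Y_\delta \le y \mid D_\delta = d) = \Pr(Y(d) \le y) = \Pr(Y \le y \mid D = d)$, and the distributional invariance assumption is satisfied. When we allow for $D_\delta$ to be correlated with $U$, however, distributional invariance does not hold in general. For example, when $d = 1$, $\Pr(Y_\delta \le y \mid D_\delta = 1) = \Pr(r_1(X, U_1) \le y \mid U_D \le P_\delta(W))$, whereas $\Pr(Y \le y \mid D = 1) = \Pr(r_1(X, U_1) \le y \mid U_D \le P(W))$. These two conditional probabilities are different under the general dependence of $(W, U, U_D)$.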

Asymptotic Bias of the UQR Estimator
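In order to obtain an expression for the asymptotic bias in UQR when distributional invariance does not hold, we work with the same potential outcomes and threshold-crossing model as in equations (1)-(3). The following corollary follows directly from Theorem 1 by choosing $\rho$ to be the quantile functional.
Corollary 2. Let Assumptions 1-3 hold. Assume further that $f_Y(y_\tau) > 0$. Then
$$\Pi_\tau = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \left[F_{Y(0)|U_D,W}(y_\tau|P(w), w) - F_{Y(1)|U_D,W}(y_\tau|P(w), w)\right] \dot P(w) f_W(w)\, dw. \tag{18}$$
The next corollary decomposes the unconditional quantile effect into an apparent component that neglects the adjustment given by $\dot P(w)$ and a bias component.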

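Corollary 3. Let the assumptions in Corollary 2 hold. Then $\Pi_\tau = A_\tau - B_\tau$, where
$$A_\tau = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \left[F_{Y|D,W}(y_\tau|0, w) - F_{Y|D,W}(y_\tau|1, w)\right] f_W(w)\, dw, \tag{19}$$
$B_\tau = B_{1\tau} + B_{2\tau}$,
$$B_{1\tau} = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \left[F_{Y|D,W}(y_\tau|0, w) - F_{Y|D,W}(y_\tau|1, w)\right] \left[1 - \dot P(w)\right] f_W(w)\, dw,$$
and
$$B_{2\tau} = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \left\{\left[F_{Y|D,W}(y_\tau|0, w) - F_{Y|D,W}(y_\tau|1, w)\right] - \left[F_{Y(0)|U_D,W}(y_\tau|P(w), w) - F_{Y(1)|U_D,W}(y_\tau|P(w), w)\right]\right\} \dot P(w) f_W(w)\, dw.$$
To facilitate understanding of Corollary 3, we can define and organize the average influence functions (AIF) in a table:
$$\begin{array}{lll}
 & \text{By treatment status } D & \text{At the margin } U_D = P(w) \\
Y(1) & \psi_{1,D}(w) := E_w[\psi_\tau(Y(1)) \mid D = 1] & \psi_{1,U_D}(w) := E_w[\psi_\tau(Y(1)) \mid U_D = P(w)] \\
Y(0) & \psi_{0,D}(w) := E_w[\psi_\tau(Y(0)) \mid D = 0] & \psi_{0,U_D}(w) := E_w[\psi_\tau(Y(0)) \mid U_D = P(w)] \\
\text{Difference} & \psi_{\Delta,D}(w) := \psi_{1,D}(w) - \psi_{0,D}(w) & \psi_{\Delta,U_D}(w) := \psi_{1,U_D}(w) - \psi_{0,U_D}(w)
\end{array}$$
where $\psi_\tau(\cdot)$ is short for $\psi(\cdot, \rho_\tau, F_Y)$, the influence function of the quantile functional. In the above, $E_w[\cdot]$ stands for the conditional mean operator given $W = w$. For example, $\psi_{1,D}(w) = E[\psi_\tau(Y(1)) \mid D = 1, W = w]$.
The unconditional quantile effect $\Pi_\tau$ is the average of the difference $\psi_{\Delta,U_D}(w)$ with respect to the distribution of $W$ over the marginal subpopulation. The average apparent effect $A_\tau$ is the average of the difference $\psi_{\Delta,D}(w)$ with respect to the distribution of $W$ over the whole population. It is also equal to the limit of the UQR estimator of Firpo, Fortin and Lemieux (2009), where the endogeneity of the treatment selection is ignored.$^{8}$ Note that if our model contains no covariate $W$, then
$$A_\tau = \frac{1}{f_Y(y_\tau)} \left[F_{Y|D}(y_\tau|0) - F_{Y|D}(y_\tau|1)\right].$$
This is identical to the unconditional quantile effect given in (17). The discrepancy between $\Pi_\tau$ and $A_\tau$ gives rise to the asymptotic bias $B_\tau$ of the UQR estimator of Firpo, Fortin and Lemieux (2009):
$$B_\tau := A_\tau - \Pi_\tau = \underbrace{\int_{\mathcal{W}} \psi_{\Delta,D}(w) \left[1 - \dot P(w)\right] f_W(w)\, dw}_{B_{1\tau}} + \underbrace{\int_{\mathcal{W}} \left[\psi_{\Delta,D}(w) - \psi_{\Delta,U_D}(w)\right] \dot P(w) f_W(w)\, dw}_{B_{2\tau}}. \tag{21}$$
It is easy to see that $B_{1\tau}$ and $B_{2\tau}$ given above are identical to those given in Corollary 3.
The decomposition in Equation (21) traces the asymptotic bias back to two sources. The first one, $B_{1\tau}$, captures the heterogeneity of the apparent effects averaged over two different subpopulations. For every $w$, $\psi_{\Delta,D}(w)$ is the average apparent effect for the individuals with $W = w$. These effects are averaged over two different distributions of $W$: the distribution of $W$ for the marginal subpopulation (i.e., $\dot P(w) f_W(w)$) and the distribution of $W$ for the whole population (i.e., $f_W(w)$). $B_{1\tau}$ is equal to the difference of these two average effects. If the effect $\psi_{\Delta,D}(w)$ does not depend on $w$, then $B_{1\tau} = 0$. If $\dot P(w) = 1$ for all $w$, then the distribution of $W$ over the whole population is the same as that over the subpopulation, and hence $B_{1\tau} = 0$ as well. For $B_{1\tau} \neq 0$, it is necessary that there is an effect heterogeneity (i.e., $\psi_{\Delta,D}(w)$ depends on $w$) and a distributional heterogeneity (i.e., $\dot P(w) \neq 1$ so that the distribution of $W$ over the marginal subpopulation is different from that over the whole population). To highlight the necessary conditions for a nonzero $B_{1\tau}$, we refer to $B_{1\tau}$ as the marginal heterogeneity bias.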
The second bias component, $B_{2\tau}$, embodies the second source of the bias and has a difference-in-differences interpretation. Each of $\psi_{\Delta,D}(\cdot)$ and $\psi_{\Delta,U_D}(\cdot)$ is the difference in the average influence functions associated with the counterfactual outcomes $Y(1)$ and $Y(0)$. However, $\psi_{\Delta,D}(\cdot)$ is the difference over the two subpopulations who actually choose $D = 1$ and $D = 0$, while $\psi_{\Delta,U_D}(\cdot)$ is the difference over the marginal subpopulation. So $B_{2\tau}$ is simply the average of this difference in differences with respect to the distribution of $W$ over the marginal subpopulation. This term arises because the change in the distributions of $Y$ for those with $D = 1$ and those with $D = 0$ is different from that for those whose $U_D$ is just above $P(w)$ and those whose $U_D$ is just below $P(w)$. Thus we can label $B_{2\tau}$ as a marginal selection bias. It vanishes if $\psi_{\Delta,D}(w) = \psi_{\Delta,U_D}(w)$ for almost all $w$ or, equivalently, if
$$E_w[\psi_\tau(Y(1)) \mid D = 1] - E_w[\psi_\tau(Y(1)) \mid U_D = P(w)] = E_w[\psi_\tau(Y(0)) \mid D = 0] - E_w[\psi_\tau(Y(0)) \mid U_D = P(w)].$$
The condition resembles the parallel-paths assumption or the constant-bias assumption in a difference-in-differences analysis. If $U_D$ is independent of $(U_0, U_1)$ given $W$, then this condition holds and $B_{2\tau} = 0$. In general, when $U_D$ is not independent of $(U_0, U_1)$ given $W$, and $W$ enters the selection equation, we have $B_{1\tau} \neq 0$ and $B_{2\tau} \neq 0$, hence $\Pi_\tau \neq A_\tau$. If $\dot P(w)$ is not identified, then $B_{1\tau}$ is not identified. In general, $B_{2\tau}$ is not identified without additional assumptions. Therefore, in the absence of additional assumptions, the asymptotic bias cannot be eliminated, and $\Pi_\tau$ is not identified.
$^{8}$ To see why this is the case, we note that, in its simplest form, the UQR involves regressing the "influence function" $(\tau - 1\{Y_i \le \hat y_\tau\})/\hat f_Y(\hat y_\tau)$ on $D_i$ and $W_i$ by OLS and using the estimated coefficient on $D_i$ as the estimator of the unconditional quantile effect. Here $\hat f_Y(y_\tau)$ is a consistent estimator of $f_Y(y_\tau)$ and $\hat y_\tau$ is a consistent estimator of $y_\tau$. It is now easy to see that the UQR estimator converges in probability to $A_\tau$ if the conditional expectations in (19) are linear in $W$.
It is not surprising that in the presence of endogeneity, the UQR estimator of Firpo, Fortin and Lemieux (2009) is asymptotically biased. The virtue of Corollary 3 is that it provides a closed-form characterization and clear interpretations of the asymptotic bias. To the best of our knowledge, this bias formula is new in the literature. If point identification cannot be achieved, then the bias formula can be used in a bound analysis or sensitivity analysis. From a broad perspective, the asymptotic bias $B_\tau$ is the unconditional quantile counterpart of the endogeneity bias of the OLS estimator in a linear regression framework.9

9 The bias decomposition is not unique. Corollary 3 gives only one possibility. We can also write $B_\tau$ as the sum of two alternative components.

Remark 3. Consider a setting of full independence: $V \perp U_0 \perp U_1 \perp W$ (i.e., every subset of these variables is independent of the rest). In this case, $B_{2\tau} = 0$. By equations (18) and (19), the UQE and the apparent effect then average the same effect, $\psi_{\Delta,D}(w) = \psi_{\Delta,U_D}(w)$, over the marginal subpopulation and over the whole population, respectively. In general, therefore, we will still have a marginal heterogeneity bias term given by $B_{1\tau}$ unless $\dot P(w) = 1$ or this effect does not depend on $w$. In general, both conditions fail if some covariate (i.e., $X$) in $W$ enters both the outcome equations and the selection equation nontrivially. In this case, the unconditional quantile regression estimator of Firpo, Fortin and Lemieux (2009), which converges to $A_\tau$, will be asymptotically biased even though there is no endogeneity.
In the presence of endogeneity, it is not easy to evaluate or even sign the asymptotic bias. In general, the joint distribution of (U, U D ) given the covariate W is needed for this purpose. This is not atypical. For a nonlinear estimator such as the unconditional quantile estimator, its asymptotic properties often depend on the full data generating process in a nontrivial way. This is in sharp contrast with a linear estimator such as the OLS in a linear regression model whose properties depend on only the first few moments of the data. Figure 2 shows the non-uniformity of the asymptotic bias, computed as B τ := A τ − Π τ . It is derived from a model (details can be found in Section B of the Appendix) where the endogeneity is controlled by a single correlation parameter ρ. The case ρ = 0 corresponds to the case of an exogenous treatment and illustrates the marginal heterogeneity bias.

Unconditional Quantile Effect under Instrumental Intervention
In the previous section, we have shown that the UQR estimator of Firpo, Fortin and Lemieux (2009) is asymptotically biased under endogeneity or marginal heterogeneity, and the UQE is in general unidentified. In this section, we impose additional assumptions on $Z$ in order to achieve the point identification of the UQE. The solution we propose to tackle the endogeneity problem for the quantile functional can be easily generalized to deal with a general functional. To save space, we do not pursue this straightforward generalization.

9 (continued) The interpretations of $\tilde B_{1\tau}$ and $\tilde B_{2\tau}$ are similar to those of $B_{1\tau}$ and $B_{2\tau}$ with obvious and minor modifications. In this case, it is even more revealing to call $\tilde B_{1\tau}$ the marginal heterogeneity bias, because a necessary condition for a nonzero $\tilde B_{1\tau}$ is that there is an effect heterogeneity among the marginal subpopulation.

Instrumental Intervention
We impose the following additional assumptions on $Z$, taken directly from Heckman and Vytlacil (1999, 2001a, 2005).
Assumption 5. Relevance and Exogeneity
(a) $\mu(Z, X)$ is a non-degenerate random variable conditional on $X$.
(b) $(U_0, U_1, V)$ is independent of $Z$ conditional on $X$.
Assumption 5(a) is a relevance assumption: for any given level of X, the variable Z can induce some variation in D. Assumption 5(b) is referred to as an exogeneity assumption. The two assumptions are essentially the conditions for a valid instrumental variable, hence we will refer to Z as the instrumental variable and refer to the intervention as the instrumental intervention.
Assumption 5(b) allows us to write Assumption 5(b) then implies that (U 0 , U 1 , U D ) is independent of Z conditional on X. If conditioning on the covariate X that enters the structural functions r 0 (·, U 0 ) and r 1 (·, U 1 ) is not enough to ensure the independence of (U 0 , U 1 , U D ) from Z, we may augment the conditioning set to include additional control variables, say X C . The control variables in X C do not enter r 0 , r 1 , or µ. However, we can think that X C enters r 0 , r 1 , and µ trivially so that with some abuse of notation we can write where X S is the structural variable that enters r 0 , r 1 and µ nontrivially and X C is the control variable that enters r 0 , r 1 and µ trivially (i.e., r 0 , r 1 and µ are constant functions of X C given other variables). Letting X = (X S , X C ) and W = (Z, X S , X C ) , the above model then takes the same form as the model in (1) and (2). With this conceptual change, our results remain valid for the case with additional controls.
Using the influence function of the quantile functional, we define an unconditional marginal treatment effect for the τ-quantile, which will be a basic building block for the unconditional quantile effect.
Definition 5. The unconditional marginal treatment effect for the τ-quantile is defined as $\mathrm{MTE}_\tau(u, x) := E[\psi_\tau(Y(1)) - \psi_\tau(Y(0)) \mid U_D = u, X = x]$.
The $\mathrm{MTE}_\tau$ is different from the quantile analogue of the marginal treatment effect of Lee (2009) and Yu (2014). An unconditional quantile effect cannot be represented as an integrated version of the latter.
The $\mathrm{MTE}_\tau$ given in Definition 5 is a special case of the general unconditional MTE in (10). Here, because of Assumption 5(b), conditioning on $Z$ is not necessary and has been dropped. Define $\Delta(y_\tau) := \psi_\tau(Y(1)) - \psi_\tau(Y(0)) = [1\{Y(0) \le y_\tau\} - 1\{Y(1) \le y_\tau\}]/f_Y(y_\tau)$, which underlies the above definition of $\mathrm{MTE}_\tau(u, x)$. The random variable $\Delta(y_\tau)$ can take three values: $1/f_Y(y_\tau)$, $0$, and $-1/f_Y(y_\tau)$. For a given individual, $\Delta(y_\tau) = 1/f_Y(y_\tau)$ when the treatment induces the individual to "cross" the τ-quantile $y_\tau$ of $Y$ from below, and $\Delta(y_\tau) = -1/f_Y(y_\tau)$ when the treatment induces the individual to "cross" the τ-quantile $y_\tau$ of $Y$ from above. In the first case, the individual benefits from the treatment, while in the second case the treatment harms her. The intermediate case, $\Delta(y_\tau) = 0$, occurs when the treatment induces no quantile crossing of any type. Thus, the unconditional expected value $f_Y(y_\tau) \times E[\Delta(y_\tau)]$ equals the difference between the proportion of individuals who benefit from the treatment and the proportion of individuals who are harmed by it. For the UQE, whether the treatment is beneficial or harmful is measured in terms of quantile crossing. Among the individuals with characteristics $U_D = u$ and $X = x$, $\mathrm{MTE}_\tau(u, x)$ is then equal to the rescaled (by $1/f_Y(y_\tau)$) difference between the proportion of individuals who benefit from the treatment and the proportion of individuals who are harmed by it. Thus, $\mathrm{MTE}_\tau(u, x)$ is positive if more individuals increase their outcome above $y_\tau$, and it is negative if more individuals decrease their outcome below $y_\tau$.
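The quantile-crossing interpretation can be illustrated with simulated potential outcomes. The sketch below assumes that $\Delta(y_\tau)$ is the difference of the quantile influence functions at $Y(1)$ and $Y(0)$, that is, $[1\{Y(0) \le y_\tau\} - 1\{Y(1) \le y_\tau\}]/f_Y(y_\tau)$. The data-generating process and the values of `y_tau` and `f_y_tau` are hypothetical.

```python
import numpy as np


def delta_quantile(y0, y1, y_tau, f_y_tau):
    """Difference of quantile influence functions between Y(1) and Y(0)."""
    below0 = (y0 <= y_tau).astype(float)   # untreated outcome at or below the quantile
    below1 = (y1 <= y_tau).astype(float)   # treated outcome at or below the quantile
    return (below0 - below1) / f_y_tau


rng = np.random.default_rng(0)
y0 = rng.normal(size=10_000)                              # hypothetical Y(0)
y1 = y0 + rng.normal(loc=0.3, scale=0.5, size=10_000)     # hypothetical Y(1)
y_tau, f_y_tau = 0.0, 0.4                                 # illustrative quantile and density

delta = delta_quantile(y0, y1, y_tau, f_y_tau)

# Shares of individuals crossing the quantile from below (benefit) or from above (harm).
share_benefit = np.mean((y0 <= y_tau) & (y1 > y_tau))
share_harmed = np.mean((y0 > y_tau) & (y1 <= y_tau))
net_share = f_y_tau * delta.mean()   # equals share_benefit - share_harmed
```

Averaging `delta` within cells of $(U_D, X)$ instead of over the whole sample would give the corresponding conditional quantity, which is not observable in real data because only one potential outcome is seen per individual.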
To bring our setting closer to the case of the shift of a continuous covariate that Firpo, Fortin and Lemieux (2009) study, we consider the intervention in (22), $Z_{1\delta} = Z_1 + g(W)\,s(\delta)$, where $g(\cdot)$ is a measurable function and $s(\delta)$ is a smooth function satisfying $s(0) = 0$. That is, we intervene to change the first component of $Z$. Note that while $s(\delta)$ is the same for all individuals, $g(W)$ depends on the value of $W$ and hence it is individual-specific. Thus, we allow the intervention to be heterogeneous.10 In empirical applications, $Z$ may consist of a few variables, and $Z_1$ is the target variable that we consider changing. The unconditional policy effect is specific to the variable $Z_1$ on which we choose to intervene in order to improve the treatment adoption rate.
Lemma 3. Assume that (i) $(V, X)$ are absolutely continuous random variables with joint density (vi) $s(\delta)$ is a differentiable function in a neighborhood of zero and $s(0) = 0$. Then , .
Corollary 4. Let Assumptions 1-3 and 5, and the assumptions of Lemma 3 hold. Assume further that $f_Y(y_\tau) > 0$. Then, the unconditional quantile effect of the shift in $Z$ given in (22) is .

Identification of the UQE
To investigate the identifiability of $\Pi_\tau$, we study the identifiability of the unconditional $\mathrm{MTE}_\tau$ and the weight function or the RN derivative separately. The proposition below shows that $\mathrm{MTE}_\tau(u, x)$ is identified for every $u = P(w)$ for some $w \in \mathcal{W}$.
Proposition 1. Let Assumptions 2(a), 2(b), and 5(b) hold. Then, for every $u = P(w)$ with $w \in \mathcal{W}$, we have
$$\mathrm{MTE}_\tau(u, x) = -\frac{1}{f_Y(y_\tau)} \frac{\partial E[1\{Y \le y_\tau\} \mid P(W) = u, X = x]}{\partial u}.$$
Proposition 1 can be proved using Theorem 1 in Carneiro and Lee (2009). In the Appendix we provide a self-contained proof that is directly connected to the idea of shifting the propensity score. For a more general functional $\rho$, an analogous identification result holds for every $u$ equal to $P(w)$ for some $w \in \mathcal{W}$.
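Now we turn to the identification of the RN derivative $\dot P(w)$ given in Corollary 4. Under Assumption 5(b), the propensity score becomes $P(w) = F_{V|X}(\mu(z, x) \mid x)$. It is now clear that $\dot P(w)$ can be represented using $\partial P(w)/\partial z_1$ and $g(w)$. We formalize this in the following proposition.
Proposition 2. Let Assumption 5 and the assumptions in Lemma 3 hold. Then
$$\dot P(w) = \frac{g(w)\,\partial P(w)/\partial z_1}{E[g(W)\,\partial P(W)/\partial z_1]}.$$
Since $g(w)$ is known and $\partial P(w)/\partial z_1$ is identified, $\dot P(w)$ is also identified. As in the case of $\mathrm{MTE}_\tau$, Assumption 5(b) plays a key role in identifying $\dot P(w)$. Without the assumption that $V$ is independent of $Z$ conditional on $X$, the derivative of the propensity score with respect to $z_1$ contains a second term, which reflects the dependence of the conditional distribution of $V$ on $z_1$. The presence of this second term invalidates the identification result in (23). Using Propositions 1 and 2, we can represent $\Pi_\tau$ as $\Pi_\tau = E[\mathrm{MTE}_\tau(P(W), X)\,\dot P(W)]$, with $\mathrm{MTE}_\tau$ and $\dot P$ given by the two propositions. All objects in this representation are point identified, hence $\Pi_\tau$ is point identified.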

Unconditional Instrumental Quantile Estimation
This section is devoted to the estimation and inference of the UQE under the instrumental intervention in (22). We assume that the propensity score function is parametric, and we leave the case with a nonparametric propensity score to Section D of the Appendix. In order to simplify the notation, we set $g(\cdot) \equiv 1$ for the remainder of this paper.11 Letting $m_0(y_\tau, P(w), x) := E[1\{Y \le y_\tau\} \mid P(W) = P(w), X = x]$ and using (24), we have
$$\Pi_\tau = -\frac{1}{f_Y(y_\tau)} \frac{E[\partial m_0(y_\tau, P(W), X)/\partial z_1]}{E[\partial P(W)/\partial z_1]}.$$
$\Pi_\tau$ consists of two average derivatives and a density evaluated at a point, some of which depend on the unconditional τ-quantile $y_\tau$. Altogether $\Pi_\tau$ depends on four unknown quantities. The method of unconditional instrumental quantile estimation involves first estimating the four quantities separately and then plugging these estimates into $\Pi_\tau$ to obtain the estimator $\hat\Pi_\tau$. See (32) for the formula of $\hat\Pi_\tau$. We consider estimating the four quantities in the next few subsections. For a given sample $\{O_i\}_{i=1}^n$, we will use $P_n$ to denote the empirical measure. The expectation of a function $\chi(O)$ with respect to $P_n$ is then $P_n\chi = n^{-1}\sum_{i=1}^n \chi(O_i)$.

Estimating the Quantile and Density
For a given τ, we estimate $y_\tau$ using the (generalized) inverse of the empirical distribution function $\hat F_Y$ of $Y$: $\hat y_\tau = \inf\{y : \hat F_Y(y) \ge \tau\}$. The following asymptotic result can be found in Serfling (1980).12
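As a small illustration of the quantile step, the sketch below computes the generalized inverse of the empirical distribution function, taken here to be the smallest sample value $y$ with $\hat F_Y(y) \ge \tau$. The function name is an assumption made for this example.

```python
import math

import numpy as np


def empirical_quantile(y, tau):
    """Smallest order statistic y_(k) such that k / n >= tau, for tau in (0, 1]."""
    y_sorted = np.sort(np.asarray(y, dtype=float))
    n = y_sorted.size
    k = max(math.ceil(n * tau), 1)   # index of the order statistic (1-based)
    return float(y_sorted[k - 1])


sample = [3, 1, 4, 1, 5]
median_hat = empirical_quantile(sample, 0.5)
```

Unlike interpolating quantile rules, this version always returns an observed sample value, which matches the inverse-CDF definition.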

Lemma 4. If the density $f_Y(\cdot)$ of $Y$ is positive and continuous at $y_\tau$, then $\hat y_\tau - y_\tau = P_n\psi_Q(y_\tau) + o_p(n^{-1/2})$.

We use a kernel density estimator to estimate $f_Y(y)$. We maintain the following assumptions on the kernel function and the bandwidth.
Assumption 6. Kernel Assumption

11 The presence of $g(\cdot)$ amounts to a change of measure: from a measure with density $f_W(w)$ to a measure with density $g(w) f_W(w)$. When $g(\cdot)$ is not equal to a constant function, we only need to change the population expectation operator $E[h(W)]$ that involves the distribution of $W$ into $E[h(W)g(W)]$ and the empirical average operator $P_n[h(W)]$ into $P_n[h(W)g(W)]$. All of our results will remain valid.
12 See Section 2.5.1. Actually, Serfling (1980) provides a better rate for the remainder.
The non-standard condition nh 3 ↑ ∞ is due to the estimation of y τ . Since we need to expandf Y (ŷ τ ) −f Y (y τ ), which involves the derivative off Y (y), we have to impose a slower rate of decay for h to control the remainder. The details can be found in the proof of Lemma 5. We note, however, that nh 3 ↑ ∞ implies the usual rate condition nh ↑ ∞.
The estimator of f Y (y) is then given bŷ Lemma 5. Let Assumptions 6 and 7 hold. Then Furthermore, for the quantile estimatorŷ τ of y τ that satisfies Lemma 4, we havê In order to isolate the contributions off andŷ τ , we can use Lemma 5 to writê The first pair of terms on the right-hand side of (25) represents the dominant term and reflects the uncertainty in the estimation of f Y . The second pair of terms reflects the error from estimating y τ . In order to ensure that R f Y = o p (n −1/2 h −1/2 ), we need nh 3 ↑ ∞, as stated in Assumption 7.

Estimating the Average Derivatives
To estimate the two average derivatives, we make a parametric assumption on the propensity score, leaving the nonparametric specification to the appendix.
Assumption 8. The propensity score P(Z, X, α 0 ) is known up to a finite-dimensional vector α 0 ∈ R d α .
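Under Assumption 8, the parameter $\Pi_\tau$ can be written as
$$\Pi_\tau = -\frac{1}{f_Y(y_\tau)} \frac{T_2}{T_1},$$
where $T_1 := E[\partial P(Z, X, \alpha_0)/\partial z_1]$ and $T_2 := E[\partial m_0(y_\tau, P(Z, X, \alpha_0), X)/\partial z_1]$. First, we estimate $T_1$, the mean of the derivative of the propensity score, by $T_{1n}(\hat\alpha) := P_n\,\partial P(Z, X, \hat\alpha)/\partial z_1$, where $\hat\alpha$ is an estimator of $\alpha_0$. To save space, we slightly abuse notation and write $\partial P(Z, X, \hat\alpha)/\partial z_1$ for the derivative $\partial P(z, x, \hat\alpha)/\partial z_1$ evaluated at $(z, x) = (Z, X)$.
We adopt this convention in the rest of the paper.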
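A minimal sketch of the first average derivative under a parametric propensity score follows. It assumes a logistic specification, for which the derivative of $P$ with respect to $z_1$ is $\alpha_1 P(1 - P)$, and fits the parameters by Newton's method. The simulated data, the coefficient values, and the helper names are hypothetical.

```python
import numpy as np


def logistic(index):
    return 1.0 / (1.0 + np.exp(-index))


def fit_logit(design, d, max_iter=50, tol=1e-10):
    """Maximum likelihood for a logit model via Newton-Raphson."""
    alpha = np.zeros(design.shape[1])
    for _ in range(max_iter):
        p = logistic(design @ alpha)
        score = design.T @ (d - p)
        hessian = (design * (p * (1.0 - p))[:, None]).T @ design
        step = np.linalg.solve(hessian, score)
        alpha = alpha + step
        if np.max(np.abs(step)) < tol:
            break
    return alpha


def t1_logit(design, alpha, z1_col):
    """Sample mean of dP/dz1 = alpha_1 * P * (1 - P) under the logit form."""
    p = logistic(design @ alpha)
    return float(np.mean(alpha[z1_col] * p * (1.0 - p)))


rng = np.random.default_rng(0)
n = 5_000
z1 = rng.standard_normal(n)                    # hypothetical instrument
x = rng.standard_normal(n)                     # hypothetical covariate
design = np.column_stack([np.ones(n), z1, x])
alpha_true = np.array([0.2, 0.8, -0.5])        # illustrative coefficients
d = (rng.uniform(size=n) < logistic(design @ alpha_true)).astype(float)

alpha_hat = fit_logit(design, d)
t1_hat = t1_logit(design, alpha_hat, z1_col=1)
```

A probit specification would follow the same pattern, with the normal density of the index replacing $P(1 - P)$ in the derivative.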
Lemma 6. Suppose that (a)α admits the representationα − α 0 = P n ψ α 0 + o p (n −1/2 ), where ψ α 0 (W i ) is a mean-zero d α × 1 random vector with E ψ α 0 (W i ) 2 < ∞, and · denotes the Euclidean norm; (b) the variance of exists for all z and x and for α in an open neighborhood around α 0 ; (d) for α ∈ A 0 , a neighborhood around α 0 , the map α → E ∂ 2 P(Z,X,α) ∂α∂z 1 is continuous and a uniform law of large numbers holds: sup α∈A 0 P n Then, T 1n (α) − T 1 has the following stochastic approximation We can rewrite the main result of Lemma 6 as Equation (27) has the same interpretation as equation (25). It consists of a pair of leading terms that ignores the estimation uncertainty inα but accounts for the variability of the sample mean, and another pair that accounts for the uncertainty inα but ignores the variability of the sample mean. We estimate the second average derivative T 2 by see (29) for an explicit construction. We can regard T 2n (ŷ τ ,m,α) as a four-step estimator. The first step estimates y τ , the second step estimates α 0 , the third step estimates the conditional expectation m 0 (y, P(Z, X, α 0 ), X) using the generated regressor P(Z, X,α), and the fourth step averages the derivative (with respect to Z 1 ) over X and the generated regressor P(Z, X,α).
We use the series method to estimate m 0 . To alleviate notation, define the vectorw(α) := (P(z, x, α), x) andW i (α) := (P(Z i , X i , α), X i ) . We writew =w (α 0 ) andW i =W i (α 0 ) to suppress their dependence on the true parameter value α 0 .Bothw(α) andW i (α) are in R d X +1 . Let φ J (w(α)) = (φ 1J (w(α)), . . . , φ J J (w(α))) be a vector of J basis functions ofw(α) with finite second moments. Here, each φ jJ (·) is a differentiable basis function. Then, the series estimator of m 0 (y τ ,w(α)) ism(ŷ τ ,w(α)) = φ J (w(α)) b (α,ŷ τ ), whereb(α,ŷ τ ) is: The estimator of the average derivative T 2 is then We use the path derivative approach of Newey (1994) to obtain a decomposition of T 2n (ŷ τ ,m,α) − T 2 , which is similar to that in Section 2.1 of Hahn and Ridder (2013). To describe the idea, let {F θ } be a path of distributions indexed by θ ∈ R such that F θ 0 is the true distribution of O := (Y, Z, X, D). The parametric assumption on the propensity score need not be imposed on the path. 13 The score of the parametric submodel is S . For any θ, we define where m θ , y τ,θ , and α θ are the probability limits ofm,ŷ τ , andα, respectively, when the distribution of O is F θ . Note that when θ = θ 0 , we have T 2,θ 0 = T 2 . Suppose the set of scores {S(O)} for all parametric submodels {F θ } can approximate any zero-mean, finite-variance function of O in the mean square sense. 14 If the function θ → T 2,θ is differentiable at θ 0 and we can write for some mean-zero and finite second-moment function Γ(·) and any path F θ , then, by Theorem 2.1 of Newey (1994), the asymptotic variance of In the next lemma, we will show that θ → T 2,θ is differentiable at θ 0 . Suppose, for the moment, this is the case. Then, by the chain rule, we can write To use Theorem 2.1 of Newey (1994), we need to write all these terms in an outer-product form, namely the form of the right-hand side of (30). 
To search for the required function Γ(·), we follow Newey (1994) and examine one component of T 2,θ at a time by treating the remaining components as known. The next lemma provides the conditions under which we can ignore the error from estimating the propensity score in our asymptotic analysis. The following notation is used: Lemma 7. Assume that (a) (Z, X) is absolutely continuous with density f ZX (z, x) and (i) f ZX (z, x) is continuously differentiable with respect to z 1 in Z × X , (ii) for each w −1 ∈ W −1 , f Z 1 |W −1 (z 1 |w −1 ) = 0 for any z 1 on the boundary of Z 1 (w −1 ) , the support of Z 1 conditional on W −1 = w −1 .
(b) m(y τ ,w) is continuously differentiable with respect to z 1 for all orders, and for a neighborhood Θ 0 of θ 0 , the following holds: The next lemma establishes a stochastic approximation of T 2n (ŷ τ ,m,α) − T 2 and provides the influence function as well. The assumptions of the lemma are adapted from Newey (1994). These assumptions are not necessarily the weakest possible.
Then, we have the decomposition Lemma 8 characterizes the contribution of each stage to the influence function of T 2n (ŷ τ ,m,α). The contribution from estimating m 0 , given by P n ψ m 0 , corresponds to the one in Proposition 5 of Newey (1994) (p. 1362).
Equation (33) consists of six influence functions and a bias term. The bias term B f Y (y τ ) arises from estimating the density and is of order O(h 2 ). The six influence functions reflect the impact of each estimation stage. The rate of convergence ofΠ τ is slowed down through P n ψ f Y (y τ ), which is of order O p (n −1/2 h −1/2 ). We can summarize the results of Theorem 2 in a single equation:Π where ψ Π τ collects all the influence functions in (33) except for the bias, and If nh 5 → 0, then the bias term is o(n −1/2 h −1/2 ). The following corollary provides the asymptotic distribution ofΠ τ .

Corollary 5. Under the assumptions of Theorem 2 and the assumption that $nh^5 \to 0$, $\sqrt{nh}\,(\hat\Pi_\tau - \Pi_\tau) \Rightarrow N(0, V_\tau)$.

From the perspective of asymptotic theory, the following terms are all of order $O_p(h) = o_p(1)$ and hence can be ignored in large samples: $\sqrt{nh}\,P_n\psi_Q(y_\tau)$, $\sqrt{nh}\,P_n\psi_{\partial P}$, $\sqrt{nh}\,P_n\psi_{\alpha_0}$, $\sqrt{nh}\,P_n\psi_{\partial m_0}$, and $\sqrt{nh}\,P_n\psi_{m_0}$. The asymptotic variance $V_\tau$ is then determined by the density estimation term alone. However, $V_\tau$ ignores all estimation uncertainties except that in $\hat f_Y(y_\tau)$, and we do not expect it to reflect the finite-sample variability of $\sqrt{nh}\,(\hat\Pi_\tau - \Pi_\tau)$ well. To improve the finite-sample performance, we keep the dominating term from each source of estimation errors and employ a sample counterpart of $E[h\psi_{\Pi_\tau}^2]$ to estimate $V_\tau$. The details can be found in Section C of the Appendix.

Testing the Null of No Effect
We can use Corollary 5 for hypothesis testing on $\Pi_\tau$. Since $\hat\Pi_\tau$ converges to $\Pi_\tau$ at a nonparametric rate, in general, the test will have power only against a local departure of a nonparametric rate. However, if we are interested in testing the null of a zero effect, that is, $H_0: \Pi_\tau = 0$ vs. $H_1: \Pi_\tau \neq 0$, we can detect a parametric rate of departure from the null. The reason is that, by (26), $\Pi_\tau = 0$ if and only if $T_2 = 0$, and $T_2$ can be estimated at the usual parametric rate. Hence, instead of testing $H_0: \Pi_\tau = 0$ vs. $H_1: \Pi_\tau \neq 0$, we can test the equivalent hypotheses $H_0: T_2 = 0$ vs. $H_1: T_2 \neq 0$. Our test is based on the estimator $T_{2n}(\hat y_\tau, \hat m, \hat\alpha)$ of $T_2$. In view of its influence function given in Lemma 8, we can estimate the asymptotic variance of $T_{2n}(\hat y_\tau, \hat m, \hat\alpha)$ and form the studentized test statistic $T^o_{2n}$. By Lemma 8 and using standard arguments, we can show that $T^o_{2n} \Rightarrow N(0, 1)$ under the null. To save space, we omit the details here.
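The mechanics of such a test can be sketched as follows. The sketch assumes that the variance estimator is the sample second moment of estimated influence-function values and that the statistic is $\sqrt{n}\,T_{2n}/\sqrt{\hat V}$. Both forms are assumptions made for illustration, and the function name and inputs are hypothetical.

```python
import math

import numpy as np


def zero_effect_test(t2_hat, influence_values, level=0.05):
    """Two-sided test of H0: T2 = 0 based on a standard normal limit."""
    psi = np.asarray(influence_values, dtype=float)
    n = psi.size
    v_hat = float(np.mean(psi**2))                    # assumed variance estimator
    stat = math.sqrt(n) * t2_hat / math.sqrt(v_hat)   # studentized statistic
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))   # 2 * (1 - Phi(|stat|))
    return stat, p_value, p_value < level


# Illustrative inputs: 100 influence values with unit second moment.
psi_values = np.tile([1.0, -1.0], 50)
stat, p_value, reject = zero_effect_test(0.196, psi_values)
```

Because the statistic does not involve the density estimate, its scaling uses $\sqrt{n}$ rather than $\sqrt{nh}$.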

Simulation Evidence
For our simulation, we consider the following model: $Y(0) = U_0$, $Y(1) = \beta + U_1$, and $D = 1\{V \le Z\}$, where $Z \in \mathbb{R}$. By Corollary 4, and assuming that $f_{U_0|V} = f_{U_1|V}$, we obtain an explicit expression for $\Pi_\tau$.15 In order to compute $\Pi_\tau$ numerically, we assume that $Z$ is standard normal and independent of $(U_0, U_1, V)$, which is jointly normal with mean 0 and a variance-covariance matrix governed by $\rho$. Here, $\rho$ is the correlation between $U_0$ and $V$, and between $U_1$ and $V$. It is also the parameter that governs the endogeneity of $D$.

15 The details can be found in Section B of the Appendix.
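A draw from a design of this type can be sketched as below. The covariance matrix is an assumption: unit variances with a common correlation `rho` between every pair of $(U_0, U_1, V)$, chosen only because it stays positive definite for the values of `rho` used later. The function name is hypothetical.

```python
import numpy as np


def simulate(n, beta, rho, seed=0):
    """Draw (Y, D, Z) with Y(0) = U0, Y(1) = beta + U1, D = 1{V <= Z}."""
    rng = np.random.default_rng(seed)
    # Assumed equicorrelated covariance of (U0, U1, V); not taken from the paper.
    cov = np.array([[1.0, rho, rho],
                    [rho, 1.0, rho],
                    [rho, rho, 1.0]])
    u0, u1, v = rng.multivariate_normal(np.zeros(3), cov, size=n).T
    z = rng.standard_normal(n)            # instrument, independent of (U0, U1, V)
    d = (v <= z).astype(int)              # selection equation
    y = np.where(d == 1, beta + u1, u0)   # observed outcome
    return y, d, z


y, d, z = simulate(n=20_000, beta=0.5, rho=0.5)
```

With `rho = 0` the treatment is exogenous, and larger values of `rho` strengthen the dependence between selection and potential outcomes.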
Estimation of $\Pi_\tau$ requires estimating $y_\tau$, $f_Y$, $T_1$, and $T_2$. The quantiles are estimated in the usual way. To estimate $f_Y$, we use a Gaussian kernel with bandwidth $h = 1.06 \times \hat\sigma_Y \times n^{-1/5}$, where $\hat\sigma_Y$ is the sample standard deviation of $Y$. This is Silverman's rule of thumb.
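The density step can be sketched with a Gaussian kernel and the rule-of-thumb bandwidth stated above. The function names and the simulated sample are assumptions made for this example.

```python
import numpy as np


def silverman_bandwidth(y):
    """Rule-of-thumb bandwidth h = 1.06 * sd(Y) * n^(-1/5)."""
    y = np.asarray(y, dtype=float)
    return 1.06 * np.std(y, ddof=1) * y.size ** (-1.0 / 5.0)


def gaussian_kde_at(y, point, h=None):
    """Kernel density estimate (1 / (n h)) * sum K((Y_i - point) / h) with Gaussian K."""
    y = np.asarray(y, dtype=float)
    if h is None:
        h = silverman_bandwidth(y)
    u = (y - point) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return float(np.mean(kernel) / h)


rng = np.random.default_rng(0)
y = rng.standard_normal(5_000)          # hypothetical outcome sample
f_hat_at_zero = gaussian_kde_at(y, 0.0)
```

In the estimator, `point` would be the estimated quantile $\hat y_\tau$ rather than a fixed value.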
To estimate T 1 , we use a probit model. To estimate T 2 , we run a cubic series regression.
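The cubic series step can be sketched as follows, assuming that $m_0$ is the conditional mean of $1\{Y \le y_\tau\}$ given the propensity score and that its derivative with respect to the instrument follows from the chain rule. For brevity the sketch uses the true propensity score $\Phi(z)$ of an exogenous version of the simulation design instead of a fitted probit. All names are hypothetical.

```python
import math

import numpy as np


def cubic_series_average_derivative(y, y_tau, p_hat, dp_dz1):
    """Average of d m(P) / d z1, with m fitted by a cubic polynomial in P."""
    indicator = (y <= y_tau).astype(float)
    basis = np.column_stack([np.ones_like(p_hat), p_hat, p_hat**2, p_hat**3])
    coef, *_ = np.linalg.lstsq(basis, indicator, rcond=None)
    dm_dp = coef[1] + 2.0 * coef[2] * p_hat + 3.0 * coef[3] * p_hat**2
    return float(np.mean(dm_dp * dp_dz1))   # chain rule: dm/dP * dP/dz1


def simulate_t2(beta, n=20_000, seed=0):
    rng = np.random.default_rng(seed)
    u0, u1, v, z = rng.standard_normal((4, n))   # independent draws, so rho = 0
    y = np.where(v <= z, beta + u1, u0)
    p = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))   # P(z) = Phi(z)
    dp = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)            # dP/dz = phi(z)
    y_tau = np.quantile(y, 0.5)
    return cubic_series_average_derivative(y, y_tau, p, dp)


t2_null = simulate_t2(beta=0.0)   # no effect: close to zero
t2_alt = simulate_t2(beta=1.0)    # positive shift: negative average derivative
```

A negative value of the average derivative corresponds to a positive quantile effect, since a higher propensity lowers the probability that the outcome falls below the quantile.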

Testing the Null Hypothesis of No Effect
If we set $\beta = 0$, then by (36), $\Pi_\tau = 0$ and so the null hypothesis of a zero effect holds. The test statistic $T^o_{2n}$ is constructed following equation (35). Because the test statistic does not involve estimating the density (and also $T_1$), the test has nontrivial power against $1/\sqrt{n}$ departures (i.e., $\beta = c/\sqrt{n}$ for some $c \neq 0$) from the null. To simulate the power function of the nominal 5% test, we consider a range of 25 values of $\beta$ between −1 and 1. The endogeneity, governed by the parameter $\rho$, takes five values: 0, 0.25, 0.5, 0.75, and 0.9. We perform 1,000 simulations with 1,000 observations each. For different values of τ, the power functions are shown below in Figure 3. The test has the desired level, except for the extreme quantile τ = 0.1, where under high endogeneity, the rejection probability does not increase fast enough. Omitted is the power function for the median, τ = 0.5, which is basically indistinguishable from that of τ = 0.4. Furthermore, simulation results not reported here show that the power functions for τ = 0.6, 0.7, 0.8, 0.9 are very similar to those of τ = 0.4, 0.3, 0.2, 0.1, respectively.

Empirical Coverage of Confidence Intervals
In this subsection we investigate the empirical coverage of confidence intervals built using $\hat V_\tau$, the variance estimator given in (A.34). We use a grid of $\beta$ that takes values −1, −0.5, −0.25, 0, 0.25, 0.5, and 1. For the endogeneity parameter, $\rho$, we take the values 0, 0.25, 0.5, 0.75, and 0.9. Finally, τ takes the values from 0.1 to 0.9 with an increment of 0.1. We note that, for values of $\beta \neq 0$, where the effect is not 0, we need to numerically compute the value of $\Pi_\tau$. We perform 1,000 simulations with 1,000 observations each. The results are reported in the tables below for τ = 0.1 and τ = 0.5. It is clear that the confidence intervals have reasonable coverage accuracy in almost all cases.

Empirical Application
Table 2: Empirical coverage of 95% confidence intervals for τ = 0.5.

We estimate the unconditional quantile effect of expanding college enrollment on (log) wages. The outcome variable $Y$ is the log wage, and the binary treatment is the college enrollment status. Thus $p = \Pr(D = 1)$ is the proportion of individuals who ever enrolled in a college. Arguably, the cost of tuition ($Z_1$) is an important factor that affects the college enrollment status but not the wage. In order to alter the proportion of enrolled individuals, we consider a policy that subsidizes tuition by a certain amount. The UQE is the effect of this policy on the different quantiles of the unconditional distribution of wages when the subsidy is small. This policy shifts $Z_1$, the tuition, to $Z_{1\delta} = Z_1 + s(\delta)$ for some $s(\delta)$, which is the same for all individuals, and induces the college enrollment to increase from $p$ to $p + \delta$. Note that we do not need to specify $s(\delta)$ because we look at the limiting version as $\delta \to 0$. In practice, we may set $s(\delta)$ equal to a small percentage of the total tuition, say 1%.

We use the same data as in Carneiro, Heckman and Vytlacil (2010) and Carneiro, Heckman and Vytlacil (2011): a sample of white males from the 1979 National Longitudinal Survey of Youth (NLSY1979). The web appendix to Carneiro, Heckman and Vytlacil (2011) contains a detailed description of the variables. The outcome variable $Y$ is the log wage in 1991. The treatment indicator $D$ is equal to 1 if the individual ever enrolled in college by 1991, and 0 otherwise. The other covariates are AFQT score, mother's education, number of siblings, average log earnings 1979-2000 in the county of residence at age 17, average unemployment 1979-2000 in the state of residence at age 17, urban residence dummy at age 14, cohort dummies, years of experience in 1991, average local log earnings in 1991, and local unemployment in 1991. We collect these variables into a vector and denote it by $X$.
We assume that the following four variables (denoted by Z 1 , Z 2 , Z 3 , Z 4 ) enter the selection equation but not the outcome equation: tuition at local public four-year colleges at age 17, presence of a four-year college in the county of residence at age 14, local earnings at age 17, and local unemployment at age 17. The total sample size is 1747, of which 882 individuals had never enrolled in a college (D = 0) by 1991, and 865 individuals had enrolled in a college by 1991 (D = 1). We compute the UQE of a marginal shift in the tuition at local public four-year colleges at age 17.
To estimate the propensity score, we use a parametric logistic specification. To estimate the conditional expectation function $m_0$, we use a series regression using both the estimated propensity score and the covariates $X$ as the regressors. Due to the large number of variables involved, a penalization of $\lambda = 10^{-4}$ was imposed on the $L_2$-norm of the coefficients, excluding the constant term, as in ridge regressions. We compute the UQE at the quantile levels τ = 0.1, 0.15, . . . , 0.9. For each τ, we also construct the 95% (pointwise) confidence interval. Figure 4 presents the results. The UQE ranges between 0.22 and 0.47 across the quantiles with an average of 0.37. When we estimate the unconditional mean effect, we obtain an estimate of 0.21, which is somewhat consistent with the quantile cases. We interpret these estimates in the following way: a small increase $\delta$ in college enrollment induced by an additive change in tuition raises the quantiles of (log) wages by between $0.22 \times \delta$ and $0.47 \times \delta$. For example, for $\delta = 0.01$, we obtain an increase in the quantiles of the wage distribution of between 0.22% and 0.47%.
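The penalized series regression can be sketched with a closed-form ridge solution that leaves the constant term unpenalized. The scaling of the penalty (added directly to the cross-product matrix) is an assumption, since the exact normalization is not stated here. The function name and the simulated data are hypothetical.

```python
import numpy as np


def ridge_excluding_constant(features, target, lam=1e-4):
    """Ridge coefficients (intercept first) with no penalty on the intercept."""
    n, k = features.shape
    design = np.column_stack([np.ones(n), features])
    penalty = lam * np.eye(k + 1)
    penalty[0, 0] = 0.0                    # the constant term is not penalized
    return np.linalg.solve(design.T @ design + penalty, design.T @ target)


rng = np.random.default_rng(0)
features = rng.standard_normal((200, 3))   # hypothetical series terms
target = 1.0 + 2.0 * features[:, 0] + 0.1 * rng.standard_normal(200)

coef_small = ridge_excluding_constant(features, target, lam=1e-4)
coef_large = ridge_excluding_constant(features, target, lam=1e8)
```

With a tiny penalty the fit is almost identical to least squares, and with a very large penalty the slopes shrink toward zero while the intercept stays near the sample mean of the target.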

Conclusion
In this paper we study the unconditional policy effect with an endogenous binary treatment. Framing the selection equation as a threshold-crossing model allows us to introduce a novel class of unconditional marginal treatment effects and represent the unconditional effect as a weighted average of these unconditional marginal treatment effects. When the policy variable used to change the participation rate satisfies a conditional exogeneity condition, it is possible to recover the unconditional policy effect using the proposed UNIQUE method.
To illustrate the usefulness of the unconditional MTEs, we focus on the unconditional quantile effect. We find that the unconditional quantile regression estimator that neglects endogeneity can be severely biased. The bias may not be uniform across quantiles. Any attempt to sign the bias a priori requires very strong assumptions on the data generating process. More intriguingly, the unconditional quantile regression estimator can be inconsistent even if the treatment status is exogenously determined. This happens when the treatment selection is partly determined by some covariates that also influence the outcome variable.
We find that the unconditional quantile effect and the marginal policy-relevant treatment effect can be seen as part of the same family of policy effects. It is possible to view the latter as a robust version of the former. Both of them are examples of a general unconditional policy effect. To the best of our knowledge, this connection has not been established in either literature.

A Additional Proofs
Proof of Lemma 1. Using the selection equation Hence, where the order of integration can be switched because the integrands are non-negative. It then follows that Under Assumptions 2(b) and 2(c), we can differentiate both sides of (A.2) with respect to δ under the integral sign to get Under Assumptions 2(b.i) and 2(c.ii), f Y(1)|U D ,W (ỹ|P δ (w), w)∂P δ (w)/∂δ is continuous in δ for eachỹ ∈ Y (1) and w ∈ W. In view of Assumptions 2(b.ii) and 2(c.iii), we can invoke the dominated convergence theorem to show that the map δ → Using the selection equation, we can write F Y(0)|D δ (ỹ|0) as where the orders of integrations can be switched because the integrands are non-negative. Therefore, Using Assumptions 2(b) and 2(c), we have The continuity of δ → ∂δ follows from the same arguments for the conti- Proof of Lemma 2. For any δ in N ε , we have We proceed to take the first order Taylor expansion of δ → (p + δ) f Y(1)|D δ and δ → (1 − p − δ) f Y(0)|D δ around δ = 0, which is possible by Lemma 1. Using (A.3), we have and 0 ≤δ 1 ≤ δ. The middle pointδ 1 depends on δ. For the case of d = 0, we have a similar expansion:   (A.9) and 0 ≤δ 0 ≤ δ. The middle pointδ 0 depends on δ. Hence where the remainder R F (δ; y) is The next step is to show that the remainder in (A.11) is o(|δ|) uniformly over y ∈ Y = Y (0) ∪ Y (1) as δ → 0, that is, lim δ→0 sup y∈Y R F (δ;y) δ = 0. Using (A.7) and (A.9), we get Assumption 3 allows us to take the limit δ → 0 under the integral signs. Also, by Lemma 1, both are continuous in δ. We get the desired result by noting that lim δ→0 sup y∈Y R F (δ;y) δ = 0. So, uniformly over y ∈ Y as δ → 0 Proof of Theorem 1. Most of the proof is in the main text immediately preceding the statement of the theorem. We only need to show that is differentiable. To this end, we consider the limit This and Assumptions 2(b.iii) and 2(c.iii) allow us to use the dominated convergence theorem to obtain and hence G (y) is indeed differentiable with derivative given above. 
Therefore, Proof of Corollary 1. We can see that Hence it suffices to show that lim δ→0 δ −1´Y ydR F (δ; y) = 0. Under Assumption 4, we have As in the proof of Lemma 2, each term in the above upper bound converges to zero as δ → 0. Therefore, lim δ→0 1 δˆY ydR F (δ; y) = 0, as desired.

Derivation of Equation (14). Note that
Depending on the value of P δ relative to the index U D , we observe the potential outcome Y (0) or Y (1) . By the independence of P δ from U D and the law of iterated expectations, we have Plugging (A.13) back into (A.12), we get Going back to (13) using (A.14), we get Taking the limit in (A.15) as δ → 0 yields Proof of Corollary 2. It follows from Theorem 1 by considering ψ(y, ρ, Proof of Corollary 3. For each d = 0 and 1, we have Proof of Lemma 3. For a given δ, s(δ) satisfies Pr(D δ = 1) = p + δ. But Note that G * (W, s) takes the same form as G(W, δ) but their second arguments are different. So Note that s(0) = 0. We need to find the derivative of the implicit function s(δ) with respect to δ. Define t(δ, s) = p + δ −ˆW F V|W (µ(G * (w, s), x)|w) f W (w)dw. (A.17) By Theorem 9.28 in Rudin (1976), we need to show that t (δ, s) is continuously differentiable for (δ, s) in a neighborhood of (0, 0). We do this by showing that the partial derivatives of (A.17) with respect to δ and s exist and are continuous (See Theorem 9.21 in Rudin (1976)).
For the partial derivative with respect to $\delta$, we have $\partial t(\delta, s)/\partial\delta = 1$, which is obviously continuous in $(\delta, s)$. For the partial derivative with respect to $s$, we use Assumption (iii) in the lemma to obtain The function is trivially continuous in $\delta$. In view of the continuity of $f_{V|W}(v \mid w)$ in $v$ for almost all $w$, the dominated convergence theorem implies that $\partial t(\delta, s)/\partial s$ is also continuous in $s$. Therefore, we can apply the implicit function theorem to obtain $s'(\delta)$ in a neighborhood of $\delta = 0$. Taking the derivative of (A.16) with respect to $\delta$, we get
Next, we have .
It then follows that , and .
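As a reading aid for the implicit-function step, the generic form of the derivative can be sketched as follows. The sketch uses only two facts from the proof: $s(\delta)$ is defined so that $t(\delta, s(\delta)) = 0$ with $t$ as in (A.17), and $\partial t(\delta, s)/\partial\delta = 1$. It assumes that $\partial t/\partial s$ is nonzero near $(0, 0)$, as the implicit function theorem requires, and it does not reproduce the explicit expression for $\partial t/\partial s$.

```latex
\[
  \frac{d}{d\delta}\, t\big(\delta, s(\delta)\big)
  = 1 + \left.\frac{\partial t(\delta, s)}{\partial s}\right|_{s = s(\delta)} s'(\delta) = 0
  \quad\Longrightarrow\quad
  s'(\delta) = - \left[ \left.\frac{\partial t(\delta, s)}{\partial s}\right|_{s = s(\delta)} \right]^{-1}.
\]
```

Evaluating at $\delta = 0$ and using $s(0) = 0$ gives $s'(0) = -\big[\partial t(0, s)/\partial s\,|_{s = 0}\big]^{-1}$.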
Proof of Corollary 4. It follows from an application of Lemma 3 to Corollary 3.
Proof of Proposition 1. Note that for any bounded function $L(\cdot)$, we have But, using $D = 1\{U_D \le P(W)\}$, we have where the last line follows because $U = (U_0, U_1)$ is independent of $Z$ given $X$ and $U_D$. Now where the first equality uses the law of iterated expectations, the second equality uses the independence of $U_D$ from $X$, and the last equality uses $U_D \sim$ uniform on $[0, 1]$. Similarly, So we have By taking $L(\cdot) = 1\{\cdot \le y_\tau\}$, we have Under Assumptions 2(a) and 2(b), we can invoke the fundamental theorem of calculus to obtain for any $u$ such that there is a $w \in \mathcal{W}$ satisfying $P(w) = u$.
Proof of Lemma 5. We have We write this concisely as This completes the proof of the first result.
We now prove the second result of the lemma. Since $K(u)$ is twice continuously differentiable, we use a Taylor expansion to obtain for some $\tilde{y}_\tau$ between $\hat{y}_\tau$ and $y_\tau$. The first and second derivatives are To find the order of f Y (y), we calculate its mean and variance.
We have for any $y$. That is, for any $\varepsilon > 0$, there exists an $M > 0$ such that when $n$ is large enough. Suppose we choose $M$ so large that we also have $\Pr\big(\sqrt{n}\,|\tilde{y}_\tau - y_\tau| > M\big) < \varepsilon/2$ when $n$ is large enough. Then, when $n$ is large enough, by the Lipschitz continuity of $K(\cdot)$ with Lipschitz constant $L_K$. When Using the second condition on $K(\cdot)$, we have, for Hence, in both cases, h 2 f Y (ỹ τ ) − f Y (y τ ) $= O_p(n^{-1/2} h^{-3/2})$. As a result, Combining this with (A.20), we obtain In view of (A.19), we then have Now, using Lemma 4, we can write and the $o_p(n^{-1/2})$ term is the error of the linear asymptotic representation of $\hat{y}_\tau - y_\tau$. In order to obtain the order of $R_{f_Y}$, we use the following results: The rate on the derivative of the density can be found on page 56 of Pagan and Ullah (1999). Therefore, because, by Assumption 7, $h \downarrow 0$, so that $O_p(n^{-1/2} h^2) = o_p(n^{-1/2})$. We need to show that $\sqrt{nh}\, R_{f_Y} = o_p(1)$. We do this term by term. First, as long as $nh^2 \uparrow \infty$, which is guaranteed by Assumption 7, since it is implied by $nh^3 \uparrow \infty$.
since, by Assumption 7, $nh^3 \uparrow \infty$. Therefore,
Proof of Lemma 6. We have the following decomposition: Under Condition (b) of the lemma, we have For the first term, we have by applying the mean value theorem coordinate-wise where $\tilde{\alpha}$ is a vector with (not necessarily equal) coordinates between $\alpha_0$ and $\hat{\alpha}$. Under Conditions (c) and (d) of the lemma, we have Using (A.21) together with the linear representation of $\hat{\alpha}$ in Condition (a), we obtain We can then write
Proof of Lemma 7. Recall that $m_0(y_\tau, \tilde{w}(\alpha_\theta)) := m_0(y_\tau, P(w, \alpha_\theta), x)$. In order to emphasize the dual roles of $\alpha_\theta$, we define $\tilde{m}_0(y_\tau, u, x; P(\cdot, \alpha_{\theta_2})) = E\big[1\{Y \le y_\tau\} \mid P(W, \alpha_{\theta_2}) = u, X = x\big]$.
Proof of Lemma 8. First, we prove that the decomposition in (31) is valid. We start by showing that is differentiable at $\theta_0$. For this, it suffices to show that each of the four derivatives below exists at $\theta = \theta_0$: By Lemma 7, the last derivative exists and is equal to zero at $\theta = \theta_0$. We deal with the remaining three derivatives in (A.24) one at a time. Consider the first derivative. Under Condition (a.iii) and Condition (b) with = 0 of the lemma, we have ˆW ∂m 0 (y τ ,w (α 0 )) ∂z 1 sup Hence, the contribution associated with the first derivative is simply the influence function of $T_{2n}(y_\tau, m_0, \alpha_0) - T_2$. Now, for the second derivative in (A.24), Theorem 7.2 in Newey (1994) shows that the assumptions of the lemma imply the following:
1. There is a function $\gamma_{m_0}(o)$ and a measure $\hat{F}_{m_0}$ such that
2. The following approximation holds
It can be shown that $\gamma_{m_0}(\cdot)$ equals $\psi_{m_0}(\cdot)$ defined in the lemma (cf. Proposition 5 of Newey (1994)).

B Asymptotic Bias of the UQR Estimator with Exogenous Treatment
Consider the model where $(Z, X) \in \mathbb{R}^2$ is independent of $(U_0, U_1, V)$. For simplicity, we assume that $\mu(Z, X) = Z + X$ and consider $Z_\delta = Z + s(\delta)$. The selection equation under the new or counterfactual policy regime is $D_\delta = 1\{V \le Z + s(\delta) + X\}$.
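To make the counterfactual selection rule concrete, the following sketch simulates it and recovers $s(\delta)$ numerically. It is an illustration under assumptions that are not in the text: $V$, $Z$ and $X$ are drawn as independent standard normals, and the function names `treated_share` and `shift_for_target` are hypothetical. Because $D_\delta = 1$ exactly when $V - Z - X \le s(\delta)$, the shift $s(\delta)$ is the $(p + \delta)$-quantile of $V - Z - X$.

```python
import numpy as np

def treated_share(v, z, x, s):
    """Share of units with D = 1{V <= Z + s + X}."""
    return np.mean(v <= z + s + x)

def shift_for_target(v, z, x, delta):
    """Return s(delta) such that the treated share is approximately p + delta.

    D_delta = 1 iff V - Z - X <= s, so s(delta) is the (p + delta)-quantile
    of V - Z - X, where p is the baseline treated share (s = 0).
    """
    p = treated_share(v, z, x, 0.0)
    return np.quantile(v - z - x, p + delta)

# Assumed design (not from the paper): independent standard normal draws.
rng = np.random.default_rng(0)
n = 200_000
v, z, x = rng.standard_normal((3, n))

p = treated_share(v, z, x, 0.0)
delta = 0.05
s_delta = shift_for_target(v, z, x, delta)
print(f"p = {p:.3f}, s({delta}) = {s_delta:.3f}, "
      f"new share = {treated_share(v, z, x, s_delta):.3f}")

# Central finite-difference approximation to s'(0).
eps = 0.01
s_prime_0 = (shift_for_target(v, z, x, eps)
             - shift_for_target(v, z, x, -eps)) / (2 * eps)
print(f"s'(0) is roughly {s_prime_0:.2f}")
```

Under this assumed design, $V - Z - X \sim N(0, 3)$, so $s'(0)$ equals the reciprocal of its density at zero, $\sqrt{6\pi} \approx 4.34$, which the finite difference should approximate up to simulation noise.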
In this equation, $\hat{T}_{2n} = T_{2n}(\hat{y}_\tau, \hat{m}, \hat{\alpha})$, $\hat{T}_{1n} = T_{1n}(\hat{\alpha})$, Most of these plug-in estimates are self-explanatory. For example, $\hat{\psi}_{\alpha,i}$ is the estimated influence function for the MLE when $P(W_i, \hat{\alpha}) = P(W_i \hat{\alpha})$ and $P_\partial(a) = \partial P(a)/\partial a$. If the propensity score function does not take a linear index form, then we need to make some adjustment to $\hat{\psi}_{\alpha,i}$. We only need to find the influence function for the MLE, which is an easy task, and then plug $\hat{\alpha}$ into the influence function.
The only remaining quantity that needs some explanation is $\hat{\psi}_{m,i}$, which involves a nonparametric regression of To see why this may be consistent for $E\big[\partial \log f(W_i)/\partial z_1 \mid \tilde{W}_i(\alpha_0)\big]$, we note that using integration by parts, the above is just a series approximation to E The consistency of $\hat{V}_\tau$ can be established by using the uniform law of large numbers. The arguments are standard but tedious. We omit the details here.

D Unconditional Instrumental Quantile Estimation under Nonparametric Propensity Score
In this section, we drop Assumption 8 and estimate the propensity score nonparametrically using the series method. Relative to the results in Section 5, we only need to modify Lemma 6, since Lemma 8 shows that we do not need to account for the error from estimating the propensity score.
Let $\hat{P}(w)$ denote the nonparametric series estimator of $P(w)$. The estimator of $T_1 := E\big[\partial P(W)/\partial z_1\big]$ is now
$$T_{1n}(\hat{P}) := \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \hat{P}(W_i)}{\partial z_1}.$$
The estimator of $T_2$ is the same as in (28) but with $P(W_i, \hat{\alpha})$ replaced by $\hat{P}(W_i)$:
$$T_{2n}(\hat{y}_\tau, \hat{m}, \hat{P}) := \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \hat{m}(\hat{y}_\tau, \hat{P}(W_i), X_i)}{\partial z_1}, \tag{A.35}$$
where, as in (28), $\hat{m}$ is the series estimator of $m$. The nonparametric UNIQUE becomeŝ T 2n (ŷ τ ,m,P) T 1n (P) . (A.36) The following lemma follows directly from Theorem 7.2 of Newey (1994).
Lemma 9. Let Assumption (a) of Lemma 7 and Assumptions (a) and (c) of Lemma 8 hold. Assume further that $P(z, x)$ is continuously differentiable with respect to $z_1$ for all orders, and that there is a constant $C$ such that $\left|\partial^\ell P(z, x)/\partial z_1^\ell\right| \le C$ for all $\ell \in \mathbb{N}$. Then
$$T_{1n}(\hat{P}) - T_1 = \mathbb{P}_n \psi_{\partial P_s} + \mathbb{P}_n \psi_{P_s} + o_p(n^{-1/2}),$$
where we define $\psi_{\partial P_s} := \partial P(W)/\partial z_1 - T_1$ and $\psi_{P_s} := -(D - P(W)) \times \partial \log f_W(W)/\partial z_1$.
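As a rough illustration of the quantity in Lemma 9, the sketch below computes $T_{1n}(\hat{P})$ as the sample average of the $z_1$-derivative of a series-estimated propensity score. Several choices are assumptions made for illustration and are not taken from the text: a polynomial basis in scalar $(z, x)$, least-squares fitting of $D$ on that basis, the simulated design, and the hypothetical names `poly_basis` and `t1n_series`.

```python
import numpy as np

def poly_basis(z, x, degree):
    """Monomials z^a x^b with a + b <= degree, and their derivatives in z."""
    cols, dcols = [], []
    for a in range(degree + 1):
        for b in range(degree + 1 - a):
            cols.append(z**a * x**b)
            # d/dz of z^a x^b is a * z^(a-1) * x^b (zero when a == 0).
            dcols.append(a * z**(a - 1) * x**b if a > 0 else np.zeros_like(z))
    return np.column_stack(cols), np.column_stack(dcols)

def t1n_series(d, z, x, degree=3):
    """Average derivative in z of a least-squares series fit of D on (z, x)."""
    basis, dbasis = poly_basis(z, x, degree)
    coef, *_ = np.linalg.lstsq(basis, d, rcond=None)
    return np.mean(dbasis @ coef)

# Assumed design: D = 1{V <= Z + X} with independent standard normal draws,
# so P(z, x) = Phi(z + x) and E[dP/dz] = 1 / sqrt(6 * pi).
rng = np.random.default_rng(1)
n = 50_000
v, z, x = rng.standard_normal((3, n))
d = (v <= z + x).astype(float)

t1n = t1n_series(d, z, x)
print(f"T1n = {t1n:.3f}, target = {1 / np.sqrt(6 * np.pi):.3f}")
```

The least-squares polynomial fit is only one possible series estimator. It is used here because its derivative is available in closed form; a series logit would be a natural alternative when fitted probabilities must stay in $[0, 1]$.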
We summarize the results of Theorem 3 in a single equation: where $\psi_{\Pi_\tau}$ collects all the influence functions in (A.37) except for the bias, $R_\Pi$ is absorbed in the $o_p(n^{-1/2} h^{-1/2})$ term, and B The bias term is $o_p(n^{-1/2} h^{-1/2})$ by Assumption 7. The following corollary provides the asymptotic distribution of $\hat{\Pi}_\tau$.
Corollary 6. Under the assumptions of Theorem 3, The asymptotic variance takes the same form as the asymptotic variance in Corollary 5. Estimating the asymptotic variance and testing for a zero unconditional effect proceed exactly as in the case with a parametric propensity score. We omit the details to avoid repetition. From the perspective of implementation, there is no substantive difference between a parametric approach and a nonparametric approach to the propensity score estimation.