Semiparametric Additive Transformation Model under Current Status Data

We consider efficient estimation of the semiparametric additive transformation model with current status data. A wide range of survival and econometric models can be incorporated into this general transformation framework. We apply a B-spline approach to simultaneously estimate the linear regression vector, the nondecreasing transformation function, and a set of nonparametric regression functions. We show that the parametric estimate is semiparametric efficient in the presence of multiple nonparametric nuisance functions. An explicit, consistent B-spline estimate of the asymptotic variance is also provided. All nonparametric estimates are smooth and are shown to be uniformly consistent, with convergence rates faster than the cubic rate n^{-1/3}. Interestingly, we observe a convergence-rate interference phenomenon: the convergence rates of the B-spline estimators are all slowed down to equal the slowest one. Constrained optimization is not required in our implementation. Numerical results illustrate the finite-sample performance of the proposed estimators.


Introduction
We consider efficient estimation of the following semiparametric additive transformation model: where H(·) is a monotone transformation function, the h_j(·)'s are smooth regression functions (with possibly different degrees of smoothness), and ε has a known distribution F(·) with support R. A wide range of survival and econometric models can be incorporated into this general transformation framework, e.g., Huang & Rossini (1997); Shen (1998); Huang (1999); Banerjee et al. (2006, 2009). In particular, model (1) can readily be applied to a failure time T by letting U = log T. We obtain the partly linear additive Cox model, i.e., Huang (1999), by assuming F(s) = 1 − exp(−e^s) and H(u) = log A(e^u), where A is an unspecified cumulative hazard function. Specifically, the hazard function of T, given the covariates (z, w), has the form where a(t) is the baseline hazard function, β̄ = −β and h̄_j = −h_j. If instead we take F(s) = e^s/(1 + e^s), model (1) becomes the partly linear additive proportional odds model. Motivated by this close connection with survival models, we focus in this paper on current status data, which arise not only in survival analysis but also in demography, epidemiology, econometrics and bioassay. More specifically, we observe X = (V, Δ, Z, W), where V ∈ R is a random examination time and Δ = 1{U ≤ V}. We assume that U and V are independent given (Z, W). Under current status data, model (1) is also related to the semiparametric binary models studied in econometrics. Using the link function F(·), we assume that the probability of Δ = 1, given the covariates (Z, W, V), is of the expression: Note that Banerjee et al. (2006) and Banerjee et al. (2009) have carried out extensive statistical estimation and hypothesis testing for model (3) (without the h_j terms), assuming F(·) to be the log-log and logistic link functions, respectively.
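Since the displays for model (1) and expression (3) are lost above, the following LaTeX sketch reconstructs them under the sign convention implied by the Cox special case (where β̄ = −β and h̄_j = −h_j); the exact displays in the original may differ:

```latex
% Model (1), with U = \log T and error \epsilon \sim F:
H(U) = \beta' Z + \sum_{j=1}^{d} h_j(W_j) + \epsilon .

% With \Delta = 1\{U \le V\} and H increasing, expression (3) follows:
P(\Delta = 1 \mid Z, W, V)
  = P\Big\{\epsilon \le H(V) - \beta' Z - \sum_{j=1}^{d} h_j(W_j)\Big\}
  = F\Big(H(V) - \beta' Z - \sum_{j=1}^{d} h_j(W_j)\Big).
```

Under this convention, plugging F(s) = 1 − exp(−e^s) and H(u) = log A(e^u) into (3) recovers the partly linear additive Cox hazard a(t) exp(β̄'z + Σ_j h̄_j(w_j)), consistent with the sign flips stated above.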
An extensive discussion of the relation between (3) and survival models can be found in Doksum & Gasko (1990). Recently, a similar transformation model was considered by Chen & Tong (2010), but for right-censored data. They showed that the monotone transformation function is root-n estimable, a rate that can never be achieved with current status data. This is the key theoretical difference between the two types of survival data.
In this paper, we employ the B-spline approach to simultaneously estimate the vector β, the monotone H and the smooth h_j's. The corresponding estimates are denoted β̂, Ĥ and ĥ_j. In contrast, Ma & Kosorok (2005) applied the penalized NPMLE approach to (1) (with d = 1), which yields a non-smooth step function Ȟ and a penalized estimate ȟ. Our B-spline framework has the following theoretical and computational advantages over the existing penalized NPMLE approach: 1. Our B-spline estimate Ĥ is smooth and uniformly consistent, whereas Ȟ is always discontinuous (regardless of the smoothness of its true function H_0) and has a bias that does not vanish asymptotically. More importantly, the convergence rate of our Ĥ (ĥ) is shown to be faster than that of Ȟ (ȟ), i.e., faster than O_P(n^{−1/3}). We therefore expect more accurate inferences to be drawn from Ĥ (ĥ).
2. We are able to give an explicit B-spline estimate of the asymptotic covariance of β̂, from which asymptotic confidence intervals for β can easily be constructed. Its consistency is proven under very weak conditions. In contrast, the block jackknife approach in Ma & Kosorok (2005) requires more computation and is not theoretically justified.
3. Our spline estimation algorithm requires much less computation than the isotonic-type algorithm used in Ma & Kosorok (2005), since the number of jumps in the step function is typically of much larger order than the number of knots we choose for estimating H and the h_j's.
The remainder of the paper is organized as follows. Section 2 describes the B-spline estimation procedure. The asymptotic properties such as consistency and convergence rates of the estimates are obtained in Section 3. The asymptotic distribution of the parametric component is studied in Section 4, and its efficient information and the corresponding explicit B-spline estimate are given in Section 5. Simulation studies are presented in Section 6.1. We close with an appendix containing technical details.

Assumptions
We first define some notation. For any vector v, v^{⊗2} = vv′. The notations ≳ and ≲ mean greater than, or smaller than, up to a universal constant. We write A_n ≍ B_n if A_n ≲ B_n and A_n ≳ B_n. The notations P_n and G_n are used for the empirical distribution and the empirical process of the observations, respectively. Furthermore, we use operator notation for expectations: for every measurable function f and the true probability measure P, Pf = ∫ f dP. We next present some model assumptions.
M1. U and V are independent given (Z, W ).

M2. (a)
The covariates (Z, W) are assumed to belong to a bounded subset of R^{l+d}. (b) The joint density of (Z, V, W) with respect to Lebesgue measure is bounded away from zero, and the joint density of (V, W) is bounded away from infinity.
M4. The residual error distribution F(·) is assumed known with support R. Denote the first, second and third derivatives of F by f, ḟ and f̈, respectively. We assume that (a) (f(u) ∨ |ḟ(u)| ∨ |f̈(u)|) ≤ M < ∞ on all of R, and f(u) is bounded away from zero on any compact subset of R; Since we employ smooth B-spline estimation rather than penalized NPML estimation, our residual-error Condition M4 is much less restrictive than that in Ma & Kosorok (2005), and may apply to a more general class of semiparametric transformation models. Note that Condition M4(b) ensures the concavity of the function s ↦ δ log F(s) + (1 − δ) log{1 − F(s)} for δ = 0, 1.
It is easy to verify, after some algebra, that Condition M4 is satisfied by the following two general classes of residual error distribution functions.

B-spline Estimation Framework
From now on, we change the signs of β and the h_j for simplicity of exposition. In addition, we re-center H(v) to H(v) − H(l_v) so that H(l_v) = 0, for identifiability. The additional parameter H(l_v) is absorbed into the vector β, i.e., the first coordinate of z is set to one. Given a single observation x = (v, δ, z, w), the log-likelihood of model (1) is written as We assume that β ∈ B, a bounded open subset of R^l, and that its true value β_0 is an interior point of B. Before specifying the parameter spaces for H and the h_j's, we first introduce the Hölder ball H_c^r(Y), a class of smooth functions widely used in nonparametric estimation, e.g., Stone (1982, 1985). Any f ∈ H_c^r(Y) is J < r times continuously differentiable on Y, and its J-th derivative is uniformly Hölder continuous with exponent κ ≡ r − J ∈ (0, 1], i.e., sup_{y_1, y_2 ∈ Y, y_1 ≠ y_2} |f^{(J)}(y_1) − f^{(J)}(y_2)|/|y_1 − y_2|^κ ≤ c. Functions in the Hölder ball can always be approximated by a basis expansion, i.e., where ‖·‖_∞ denotes the supremum norm. We assume the following parameter space Condition P1 for the smooth h_j.
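After the sign change described above, the omitted log-likelihood display presumably takes the following form; this is a reconstruction, consistent with the concavity of s ↦ δ log F(s) + (1 − δ) log{1 − F(s)} noted under Condition M4(b):

```latex
\ell(x; \beta, H, h)
  = \delta \,\log F\Big(H(v) + \beta' z + \sum_{j=1}^{d} h_j(w_j)\Big)
  + (1-\delta)\,\log\Big\{1 - F\Big(H(v) + \beta' z
      + \sum_{j=1}^{d} h_j(w_j)\Big)\Big\}.
```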
P1. For j = 1, . . . , d and some known c_j, we assume that the parameter space for h_j is H_j, where and that the corresponding spline space is based on a system of basis functions B_j = (B_{j1}, . . . , B_{jK_j})′ of degree d_j ≥ r_j − 1.
As seen from the previous examples, it is reasonable to assume that H(·) is differentiable and strictly increasing over [l_v, u_v], so that g ≡ log Ḣ is well defined. This reparametrization removes the strict monotonicity and positivity constraints on H, and thus avoids constrained optimization in the computation. The parameter space Condition P2 for g is specified below.
P2. For some known c_0, we assume that the parameter space for g is G, where and that the corresponding spline space is Similarly, we define G′_n = {H(v) = ∫_{l_v}^{v} exp{g(s)} ds : g ∈ G_n}. By some algebra, we can show that Note that in the theoretical proofs and numerical calculations the exact values of the c_j are not needed. Instead, only the boundedness condition, equivalently the compactness of the parameter spaces and spline spaces, is required. We assume this boundedness condition, which could be relaxed by invoking chaining arguments, only to simplify our theoretical derivations.
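As a concrete illustration of the reparametrization H(v) = ∫_{l_v}^{v} exp{g(s)} ds, the following Python sketch builds an unconstrained quadratic B-spline for g and recovers a strictly increasing H with H(l_v) = 0. The interval, knot placement and coefficients are all hypothetical choices for illustration:

```python
import numpy as np
from scipy.interpolate import BSpline

l_v, u_v = 0.2, 1.8                     # hypothetical examination-time range
degree = 2                              # quadratic spline, as in the simulations
interior = np.linspace(l_v, u_v, 7)[1:-1]                  # 5 interior knots
knots = np.r_[[l_v] * (degree + 1), interior, [u_v] * (degree + 1)]
n_basis = len(knots) - degree - 1

rng = np.random.default_rng(0)
gamma = rng.normal(size=n_basis)        # unconstrained spline coefficients for g
g = BSpline(knots, gamma, degree)       # g = log H'

def H(v, grid_size=200):
    """H(v) = int_{l_v}^{v} exp{g(s)} ds, computed by the trapezoid rule."""
    out = []
    for vi in np.atleast_1d(v):
        s = np.linspace(l_v, vi, grid_size)
        y = np.exp(g(s))
        out.append(np.sum((y[:-1] + y[1:]) / 2 * np.diff(s)))
    return np.asarray(out)

vs = np.linspace(l_v, u_v, 50)
Hv = H(vs)                              # strictly increasing, Hv[0] = 0
```

Because exp{g(·)} is positive for any coefficient vector, monotonicity of H holds automatically and no constrained optimization is needed, which is exactly the point of the reparametrization.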
In this paper, we propose the B-spline approach to estimate H and h j 's as follows.
The log-likelihood (4) for observation i can thus be reparametrized as The corresponding B-spline estimate α̂ is defined as We can also write α̂ = (β̂′, ĝ, ĥ_1, . . . , ĥ_d). Some tedious algebra reveals that the Hessian matrix of which guarantees the existence of α̂. See further discussion of computational feasibility in the simulation section. The above estimation procedure also applies to other linear sieves approximating the Hölder ball (or, more generally, the Hölder space), e.g., wavelets.
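A minimal end-to-end sketch of the resulting unconstrained sieve maximum likelihood computation, using the logistic link F = expit (one choice covered by Condition M4) and a hypothetical data-generating model; this illustrates the idea and is not the authors' implementation:

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import minimize
from scipy.special import expit          # logistic F, for illustration

rng = np.random.default_rng(1)
n, l_v, u_v = 300, 0.2, 1.8

# --- simulate current status data (hypothetical model, scalar beta) ---
z = rng.uniform(0.5, 1.5, n)
v = rng.uniform(l_v, u_v, n)
beta_true = 0.5
H_true = lambda t: np.log1p(t) - np.log1p(l_v)      # smooth, H_true(l_v) = 0
delta = rng.binomial(1, expit(H_true(v) + beta_true * z))

# --- quadratic B-spline basis for g = log H' ---
degree, n_interior = 2, 4
interior = np.linspace(l_v, u_v, n_interior + 2)[1:-1]
knots = np.r_[[l_v] * (degree + 1), interior, [u_v] * (degree + 1)]
q = len(knots) - degree - 1
grid = np.linspace(l_v, u_v, 201)
B = np.column_stack([BSpline(knots, np.eye(q)[k], degree)(grid)
                     for k in range(q)])

def H_of(gamma, v):
    """H(v) = int exp(g), via a cumulative trapezoid rule on the grid."""
    incr = np.exp(B @ gamma)
    cum = np.concatenate([[0.0],
                          np.cumsum((incr[:-1] + incr[1:]) / 2 * np.diff(grid))])
    return np.interp(v, grid, cum)

def negloglik(theta):
    beta, gamma = theta[0], theta[1:]
    p = np.clip(expit(H_of(gamma, v) + beta * z), 1e-10, 1 - 1e-10)
    return -np.sum(delta * np.log(p) + (1 - delta) * np.log1p(-p))

theta0 = np.zeros(1 + q)
fit = minimize(negloglik, theta0, method="BFGS")   # unconstrained optimization
beta_hat = fit.x[0]
```

The optimization is over the low-dimensional vector (β, γ), so it is far cheaper than isotonic-type algorithms whose dimension grows with the sample size.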

Consistency and Rates of Convergence
In this section, we show that our B-spline estimates are consistent and that the convergence rates of the nonparametric estimates interfere with one another. Define where ‖·‖_2 is the L_2 norm. We now give the main theorem of this section. THEOREM 1. Suppose that Conditions M1–M4 and P1–P2 hold. If K_j/n → 0 for j = 0, 1, . . . , d, then we have More specifically, we further prove that If we further require that K_j ≍ n^{1/(2r_j+1)} for j = 0, . . . , d, then we have where r = min_{0≤j≤d} r_j.
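The rate displays in Theorem 1 are lost above. One plausible rendering, consistent with standard sieve-estimation rates and with the n^{−r/(2r+1)} rate and the interference phenomenon discussed below (the exact form of the distance d is our reconstruction):

```latex
d(\hat\alpha, \alpha_0)
  = O_P\!\Big( \sum_{j=0}^{d} \big( \sqrt{K_j/n} + K_j^{-r_j} \big) \Big),
\qquad\text{and, with } K_j \asymp n^{1/(2r_j+1)},\quad
d(\hat\alpha, \alpha_0) = O_P\big(n^{-r/(2r+1)}\big),
\quad r = \min_{0\le j\le d} r_j .
```

Each summand balances the stochastic error √(K_j/n) against the spline approximation error K_j^{−r_j}; taking the minimum smoothness r in the final rate is precisely what forces every component estimate down to the slowest rate.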
According to Theorem 1, the smooth Ĥ achieves a faster convergence rate, namely O_P(n^{−r/(2r+1)}), than the n^{−1/3} rate derived in the penalized estimation context, see Ma & Kosorok (2005), whenever g_0 and the h_{j0}'s are at least continuously differentiable, i.e., r > 1. More importantly, we can further show that Ĥ is uniformly consistent, i.e., ‖Ĥ − H_0‖_∞ = o_P(1), by applying Lemma 2 of Chen & Shen (1998). REMARK 1. The above theorem also holds when we employ constrained monotone B-splines to approximate H. However, such constrained optimization usually requires additional computational effort; see Zhang et al. (2010). REMARK 2. From Theorem 1 we observe the interesting convergence-rate interference phenomenon: the convergence rate of each B-spline estimate is forced to equal the slowest one.
Ma & Kosorok (2005) also show that the convergence rate of the penalized estimate ȟ is unfortunately slowed down to O_P(n^{−1/3}) by the NPMLE Ȟ, regardless of the degree of smoothness of h_0. One possible way to achieve the optimal rate for each nonparametric estimate is to extend the recent mixed-rate asymptotic results of Radchenko (2008) to the semiparametric setup.
Since we assume that r > 1/2, the convergence rate given in (11) is always o_P(n^{−1/4}). Such a rate is usually fast enough to guarantee regular asymptotic behavior of β̂, i.e., √n-consistency and asymptotic normality. Indeed, in the next section we improve the suboptimal rate for β̂ in (11) to the optimal √n rate, and further show that β̂ is semiparametric efficient.

Weak Convergence of the Parametric Estimate
In this section, we study the weak convergence of the spline estimate β̂ in the presence of multiple nonparametric nuisance functions. We first calculate the semiparametric efficient information based on the projection onto a nonorthogonal sumspace. Let . Denote by θ_0 the true value of θ. The score functions (operators) for β, g and the h_j are calculated separately as We assume that a ∈ L_2(H) ≡ {a : ∫_{l_v}^{u_v} a^2(s) dH(s) < ∞} and b_j ∈ L_2^0(w_j), so that all the score functions defined above are square integrable. To calculate the efficient score function ℓ̃_β, we need to find the projection of ℓ̇_β onto the sumspace For simplicity, we write ℓ̇_β(X; α_0) and ℓ̇_β(X; α̂) as ℓ̇_β0 and ℓ̇_β̂, respectively; the same notational rule applies for k = 1, . . . , l. Similarly, we write ℓ̃_β(X; α_0) and ℓ̃_β(X; α̂) as ℓ̃_β0 and ℓ̃_β̂, respectively. By the two-stage projection approach of Sasieni (1992), we have for every b_jk ∈ L_2^0(w_j), j = 1, . . . , d and k = 1, . . . , l. By slightly modifying the proof of Lemma 4 in Ma & Kosorok (2005), we can show that the above nonorthogonal projection is well defined and that b†(·) exists, by the alternating projection Theorem A.4.2 in Bickel et al. (1993).
Define Π_j and Π_a as the corresponding projection operators, respectively. Define We say that a function f(s, t) belongs to a uniform Hölder ball H_c^r in t relative to s if the Hölder bound holds uniformly in s. Here, we impose model assumptions implying that both b†_jk and a†_k belong to Hölder balls for all j = 1, . . . , d and k = 1, . . . , l.

M5. We assume that
in Condition M6 when V and W are independent and the components of W are pairwise independent.
where I_0 is the efficient information matrix, defined as E[ℓ̃_β0 ℓ̃′_β0].

B-spline Estimate of the Efficient Information
In this section, we give an explicit B-spline estimate of the efficient information as a by-product of establishing the asymptotic normality of β̂. Indeed, it is simply the observed information matrix obtained by treating the semiparametric model as a parametric one after the B-spline approximation, i.e., with H_j = H_jn and G = G_n. Specifically, we treat ℓ_i(α) defined in (7) as if it were a parametric likelihood ℓ_i(β, γ_0, γ_1, . . . , γ_d).
We construct the corresponding information estimator for (β′, γ_0, γ_1, . . . , γ_d)′: Parametric inference theory implies that the information estimator for β is of the form Some calculations further reveal that where 1_k denotes the l-vector with its k-th element equal to one and all others zero. We use (18) as our estimator of I_0.
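Computationally, treating the model as parametric means the information estimate for β is the Schur complement of the γ-block in the full observed information matrix: the β-block of the inverse of the full information is the inverse of this complement. A small Python sketch with a toy matrix (the partition size p = 1 and the numbers are hypothetical):

```python
import numpy as np

def beta_information(info, p):
    """beta-block information from the full observed information for
    (beta, gamma): the Schur complement I_bb - I_bg I_gg^{-1} I_gb."""
    I_bb, I_bg, I_gg = info[:p, :p], info[:p, p:], info[p:, p:]
    return I_bb - I_bg @ np.linalg.solve(I_gg, I_bg.T)

# toy 3x3 positive-definite "observed information", scalar beta (p = 1)
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])
I_beta = beta_information(A, p=1)
se_beta = 1.0 / np.sqrt(I_beta[0, 0])   # standard-error estimate for beta
```

By the block-inverse identity, I_beta equals the reciprocal of the (β, β) entry of the inverted full information, so confidence intervals for β follow directly from the fitted Hessian without any resampling.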
We need the following additional assumption for Theorem 3.
M7. We assume that

Numerical Results

Simulations
We perform a Monte-Carlo study to assess the finite-sample performance of our proposed method.
To compare with the penalized NPMLE in Ma & Kosorok (2005), we adopt the same setting used in their paper. We simulate current status data from the partly linear additive Cox model, a special case of the general transformation model. We choose The regression coefficients are β_1 = 0.3 and β_2 = 0.25. The covariate Z_1 is Uniform[0.5, 1.5] and Z_2 is Bernoulli with success probability 0.5. We take W to be Uniform[1, 10] and h(w) = sin(w/1.2 − 1) − k_0. Censoring times follow a standard exponential distribution conditioned to lie in the interval [0.2, 1.8]. The sample sizes are n = 400 and n = 1600, with 400 realizations for each sample size. In practice, the numbers of knots for H and the h_j need to be determined. Common model selection criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) can be employed to select the optimal number of knots. In this paper, we determine K_0, K_1, . . . , K_d by the AIC. In our simulation, we use quadratic splines to approximate both h and the function g in H.
Then, AIC = −2 Σ_{i=1}^n ℓ_i(α̂) + 2(K_0 + K_1 + 2). In our experience, fewer than ten knots are generally adequate to achieve a reasonable approximation, provided h and H are not overly erratic. Figure 1 shows the AIC scores under different combinations of K_0 and K_1 for one realization of the simulation with sample size n = 1600; the optimal choices of K_0 and K_1 are both 5. The estimates of h and H for various values of K_0 and K_1 are plotted in Figure 2. In the left panel of Figure 2, we fix K_0 = 5 and plot the estimated h with K_1 = 3, 5, 10. When K_1 is small (e.g., K_1 = 3), the estimator shows a large bias; when K_1 is large (e.g., K_1 = 10), it displays wiggly behavior. In the right panel of Figure 2, we fix K_1 = 5 and plot the estimated H with K_0 = 5, 7, 10. As the number of knots increases, the estimated H shows a similarly wiggly shape. Hence, the number of knots should be chosen with care.
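The knot-selection step can be sketched as follows; the log-likelihood values are hypothetical, and the penalty count K_0 + K_1 + 2 follows the AIC formula above:

```python
def aic(loglik, K0, K1):
    """AIC = -2 * loglik + 2 * (K0 + K1 + 2), as in the criterion above."""
    return -2.0 * loglik + 2 * (K0 + K1 + 2)

# hypothetical maximized log-likelihoods on a grid of knot numbers
logliks = {(3, 3): -910.2, (5, 5): -901.7, (10, 10): -899.9}
best = min(logliks, key=lambda kk: aic(logliks[kk], *kk))   # -> (5, 5)
```

With these illustrative numbers the criterion picks (K_0, K_1) = (5, 5): once additional knots yield only marginal likelihood gains, the penalty term dominates, matching the behavior seen in Figure 1.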
Simulation results show that our B-spline estimation procedure performs quite well for the semiparametric transformation model. The biases and standard errors of the spline estimates of β_1 and β_2 are given in Table 1. The table shows that the sample biases of both β̂_1 and β̂_2 are small. The ratio of the standard errors for the two sample sizes is close to 2, consistent with a √n convergence rate for β̂_1 and β̂_2. The estimated standard errors from (18) (denoted ESD) are also displayed in Table 1 and are very close to the simulation results. Although our proposed method tends to overestimate the standard error slightly, the overestimation lessens as the sample size increases. The 95% confidence intervals constructed from (18) generally have coverage close to the nominal value. Histograms of β̂_1 and β̂_2 are shown in Figure 3; their marginal distributions are clearly Gaussian. The left panel of Figure 4 displays the spline estimate of h(w), and the monotone estimate Ĥ is given in the right panel of Figure 4. The dashed line is the true function, the solid line is the average estimate over the 400 realizations, and the dash-dotted lines are the 95% pointwise confidence bands for h(w) or H(v) under the true model, obtained by taking the 2.5 and 97.5 percentiles of the 400 estimates at each w or v.
Compared with the penalized method of Ma & Kosorok (2005), our spline-based method has four clear advantages. First, computing our spline estimate Ĥ is much less expensive than the cumulative sum diagram approach used in Ma & Kosorok (2005): the number of B-spline basis functions (hence the number of knots), e.g., K_0 = 5 and K_1 = 5, is typically much smaller than the sample size n, so the dimension of the estimation problem is greatly reduced. Second, our estimate of the transformation function H is smooth with a higher convergence rate, yielding the narrower confidence interval for H shown in the right panel of Figure 4. Third, we obtain an explicit consistent estimate Î of the information, whereas the block jackknife approach proposed in Ma & Kosorok (2005) is not theoretically justified. Finally, our implementation requires no constrained optimization.

Application: Calcification data
We illustrate the proposed method on a dataset from a calcification study. Yu et al. (2001) investigated calcification of intraocular lenses, an infrequently reported complication of cataract treatment. The objective of the study is to understand the effect of clinical covariates on the time to calcification of the lenses after implantation. Patients were examined by an ophthalmologist to determine the status of calcification at a random time ranging from zero to thirty-six months after implantation of the intraocular lenses. The severity of calcification was graded into five categories from zero to four. In our analysis, we treat patients with severity > 1 as calcified and those with severity ≤ 1 as not calcified. The data can be treated as current status data because only the examination time and the calcification status at examination are available. The covariates of interest are Z_1, incision length; Z_2, gender (0 for female and 1 for male); and W, age at implantation divided by 10. The original dataset has 379 records; we remove one record with a missing measurement, resulting in a sample size of n = 378. This dataset was also studied by Xue et al. (2004). We fit the semiparametric additive transformation model, taking the error distribution F to be either the extreme value distribution or the logistic distribution. We approximate h and log Ḣ by quadratic splines; the optimal numbers of knots for h and log Ḣ are 6 and 5, respectively. The estimates and their corresponding estimated standard errors for the parametric components are given in Table 2. The estimates of h(w) under the two error distributions are displayed in the left panel of Figure 5, and the estimates of H(v) are plotted in the right panel of Figure 5. The analysis gives very similar results for the two error distributions. From Table 2, both incision length and gender are insignificant at the 5% level of significance.
From the left panel of Figure 5, h(w) increases steadily from age 50, peaks around age 60, and decreases gradually thereafter, which suggests that patients aged around 60 tend to enjoy a longer time to calcification. The estimated transformation function Ĥ in the right panel of Figure 5 displays clearly nonlinear behavior, indicating that the transformation is necessary.
The above analysis could be further refined by incorporating an unknown scale parameter into the residual error distribution F(·). Our general B-spline estimation framework can easily handle this type of transformation model as well.
We thank Yu for providing the calcification data, and Professors Michael Kosorok and Donglin Zeng for many helpful comments and suggestions that improved the paper.
For any β_1, β_2 ∈ B, h_1, h_2 ∈ Π_{j=1}^d H_jn and g_1, g_2 ∈ G_n, we have The first and second inequalities above follow, respectively, from the fact that l*(β_0, h_n, H_n) is strictly positive for sufficiently large n by (A.8), and from Condition M4(a). As shown in (A.9), the functions in the class K are Lipschitz continuous in (β, h, g). Therefore, combining Lemma 2 with Theorem 2.7.11 of van der Vaart & Wellner (1996), we obtain that where M = max_{0≤j≤d} 4c_j. Finally, we apply Lemma 3.4.2 of van der Vaart & Wellner (1996) to the uniformly bounded class of functions K to obtain (A.7). □ LEMMA 4. Suppose the following Conditions (B1)–(B3) hold.
If α̂ is consistent and I_0 is invertible, then we have LEMMA 5. (i) If a(s, t) = a(s_1, s_2, t) ∈ H_c^r(S_1 × S_2 × T) in t relative to s_1 and s_2, then ∫_{S_1} a(s_1, s_2, t) ds_1 ∈ H_{c′}^r(S_2 × T) in t relative to s_2. then f(a(s, t)) ∈ H_{c′}^r(S × T) in t relative to s.

Proof:
Let ⌊r⌋ denote the largest integer smaller than r, and write D_t^m a(s, t) for the m-th derivative of a(s, t) with respect to t, m = 0, 1, . . . , ⌊r⌋. (i) Since D_t^m a(s_1, s_2, t) is bounded for 0 ≤ m ≤ ⌊r⌋, the dominated convergence theorem allows us to differentiate inside the integral to obtain which implies that D_t^m(∫_{S_1} a(s_1, s_2, t) ds_1) is bounded for 0 ≤ m ≤ ⌊r⌋. Using this and the fact that for all s_2 and t_1 ≠ t_2, we conclude that ∫_{S_1} a(s_1, s_2, t) ds_1 ∈ H_{c′}^r(S_2 × T) in t relative to s_2 for some c′ < ∞.
(ii) The result holds because is bounded for 0 ≤ m ≤ ⌊r⌋. Note also that for i < ⌊r⌋, (iii) When 0 < α ≤ 1, the result follows from the observation that Using the chain rule, the above observation and part (ii) of the lemma, the desired result is obtained by induction for general β. □ Denote for all α ∈ N_0 and k = 1, . . . , l.
Proof: In view of (12)–(14), after some algebra we can bound the left-hand side of (A.10) by The compactness of G_n and the H_jn implies that the third and fifth terms above are both of order ‖Q_θ − Q_θ0‖_2^2. The second term can be bounded further by Considering the compactness of G and G_n, the second term is also of order ‖Q_θ − Q_θ0‖_2^2. Assumption M4(a), together with the Cauchy–Schwarz inequality, implies that ‖Q_θ − Q_θ0‖ Since we assume that the density of W is bounded away from zero and infinity, we have, using the identifiability condition ∫_0^1 h_j(w_j) dw_j = 0, that Assumption M7 implies that the fourth term is of order ‖H − H_0‖_2^2. Considering the form of d(α, α_0), we conclude the proof. □

Proof of Theorem 1
Recall that h = (h_1, . . . , h_d), and denote by h_0, h_n and ĥ the corresponding true value, B-spline approximation and sieve estimate, respectively. Recall that l*(β_0, h_n, H_n) is bounded away from zero for sufficiently large n, as implied by (A.8). Then, by the definition of α̂, we have P_n log{l*(β̂, ĥ, Ĥ)/l*(β_0, h_n, H_n)} ≥ 0, which implies, by the inequality α log x ≤ log{1 + α(x − 1)} for any x > 0 and α ∈ (0, 1), Lemma 3 implies that (P_n − P)ζ(β̂, ĥ, Ĥ) = o_P(1), since K_j/n = o(1) for j = 0, 1, . . . , d. Thus, Pζ(β̂, ĥ, Ĥ) ≥ o_P(1) by (A.11). Let U_n(X) = l*(β̂, ĥ, Ĥ)/l*(β_0, h_n, H_n). By (A.8) we know that PU_n(X) = 1 + o_P(1), which further implies Pζ(β̂, ĥ, Ĥ) ≤ o_P(1) by the concavity of s ↦ log s. Hence Pζ(β̂, ĥ, Ĥ) = o_P(1). This forces, by the strict concavity of s ↦ log s and Conditions M4(a), P1 and P2, It is easy to verify that ER_n^2 = o(1) if E|R_n| = o(1). Thus, we further have Combining the above with the identifiability Condition M3, we can show that β̂ − β_0 = o_P(1). This, in turn, implies that Since the joint density of (V, W) is bounded away from zero by M2(b), we have Considering that ∫_0^1 h_j(w_j) dw_j = 0 for h_j ∈ H_j ∪ H_jn and that the joint density of (V, W) is bounded away from infinity, we have Σ_{j=1}^d ‖ĥ_j − h_jn‖_2 + ‖Ĥ − H_n‖_2 = o_P(1). The spline approximation results (A.2) and (A.3) conclude the proof of (9). □

Proof of Theorem 2
We apply Lemma 4 to prove this theorem. We first check Condition B1. Clearly, P_n ℓ̇_β̂ = 0, since β̂ maximizes ℓ(β, g, h_1, . . . , h_d), β̂ is consistent and β_0 is an interior point of B. Following the analysis on page 2282 of Ma & Kosorok (2005), we can write, with ā† According to Lemma 5 and the dominated convergence theorem, we know that by (6) and the assumption that K_j ≍ n^{1/(2r_j+1)}.

Proof of Theorem 3
For simplicity, we write S_k(X; α_0, w_k) and S_k(X; α̂, w_k) as S_k^0[w_k] and Ŝ_k[w_k], respectively. Based on the definitions of I_0 and (19), their (k, k′)-th entries can be written as It is easy to show that E sup_{α∈N_0, w_k∈W_n} |S_k(X; α, w_k)|^2 ≤ const. < ∞ (A.27), since A and W_k are both assumed compact. Note that (A.27) implies that {S_k(x; α, w_k) : α ∈ N_0, w_k ∈ W_n} is P-Glivenko–Cantelli. Then, by Corollary 9.27 of Kosorok (2008), uniformly over w_k, w_k′ ∈ W_n, we have Uniformly over w_k, w_k′ ∈ W_n, we also have where the last inequality follows from (A.10) (together with the consistency of α̂) and (A.27). Combining (A.28) and (A.29), we obtain