Oracle estimation of parametric transformation models

Abstract: Transformation models, like the Box-Cox transformation, are widely used in regression to reduce non-additivity, non-normality, and heteroscedasticity. The question of whether one may or may not treat the estimated transformation parameter as fixed in inference about other model parameters has a long and controversial history (Bickel and Doksum, 1981; Hinkley and Runger, 1984). While the frequentist wisdom is that uncertainty regarding the true value of the transformation parameter cannot be ignored, in practice, difficulties in interpretation arise if the transformation is regarded as random rather than fixed. In this paper, we suggest a golden mean methodology that attempts to reconcile these philosophies. Penalized estimation yields oracle estimates of transformations that enable treating the transformation parameter as known when the data indicate one of a prespecified set of transformations of scientific interest. When the true transformation is outside this set, rigorous frequentist inference is still achieved. The methodology permits multiple candidate values for the transformation, as is common in applications, while simultaneously accommodating variable selection in the regression model. Theoretical issues, such as selection consistency and the oracle property, are rigorously established. Numerical studies, including extensive simulations and real data examples, illustrate the practical utility of the proposed methods.


Introduction
In regression analysis, it is sometimes worthwhile to transform the response variable Y , the explanatory vector X, or both, in order to reveal some basic properties of the data (Tukey, 1977, page 93). Using such transformations, one hopes to achieve the following three goals. Firstly, one may obtain a linear model in which the mean response is a known function of a linear transformation of the explanatory variables. Secondly, one may remove heteroscedasticity from the residual error. Thirdly, one may isolate a normal or nearly normal error distribution.
Parametric transformation models were pioneered in the early work of Box and Cox (1964) on the power transformation of the response in the linear model
$$ h(Y, \lambda_0) = \beta_0^T X + \varepsilon, \qquad (1) $$
where, for some $\lambda_0$ and regression parameter $\beta_0$, the covariate $X$ includes the constant 1, and
$$ h(y, \lambda) = \begin{cases} (y^{\lambda} - 1)/\lambda, & \lambda \neq 0, \\ \log y, & \lambda = 0, \end{cases} \qquad (2) $$
is a power transformation indexed by $\lambda$. The residual $\varepsilon$ is often assumed to be a mean-zero normal random variable with variance $\sigma_0^2$, independent of $X$. Parameter estimation with a sample of independent and identically distributed data, denoted by $(Y_i, X_i)$, $i = 1, \ldots, n$, has been well studied. In practice, since $\lambda_0$ is unknown, it is estimated by $\hat\lambda$, the maximizer of the profile likelihood. After $\lambda_0$ is estimated, the parameters $\beta_0$ and $\sigma_0$ may be estimated via least squares with the transformed responses $h(Y_i, \hat\lambda)$. Inference for the regression coefficient vector $\beta_0$ in the transformed model is challenging. As pointed out by Bickel and Doksum (1981), the estimation of $\beta_0$ depends on the estimation of $\lambda_0$: the transformation parameter is not generally orthogonal to the other model parameters. There is substantial empirical evidence demonstrating the potential for rather large variance inflation associated with the estimation of $\lambda_0$. Hence, inference for $\beta_0$ needs to take into account the uncertainty regarding the true value of $\lambda_0$. More formally, the construction of confidence intervals for $\beta_0$ requires the adjusted information matrix, which reflects the decrease in information due to the estimation of $\lambda_0$.
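To make the profile-likelihood step concrete, here is a minimal R sketch that estimates $\lambda$ by the Box-Cox profile likelihood using the standard MASS implementation; the simulated data and all object names are illustrative only, not part of the methods of this paper.

```r
## Minimal sketch: profile-likelihood estimation of the Box-Cox parameter.
## The data are simulated for illustration (true lambda_0 = 0, the log scale).
library(MASS)

set.seed(1)
n <- 200
x <- rnorm(n)
y <- exp(5 + x + rnorm(n))              # positive responses, log-scale model

fit <- lm(y ~ x)
bc  <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]     # maximizer of the profile likelihood

## Given lambda_hat, beta and sigma are estimated by least squares on the
## transformed responses h(y, lambda_hat), as described above.
h <- function(y, lambda) if (lambda == 0) log(y) else (y^lambda - 1) / lambda
fit_transformed <- lm(h(y, lambda_hat) ~ x)
```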
In practice, the uncertainty regarding $\lambda_0$ is rarely taken into account. It is common to treat $\hat\lambda$ as fixed (Hinkley and Runger, 1984; Carroll and Ruppert, 1988), or to restrict $\lambda_0$ to some finite set. For example, it has been recommended that $\lambda_0 \in \{0, \pm 1/2, \pm 1, \pm 2\}$ (see Carroll, 1982, for an analysis of estimators restricted to a finite set). This may be justified as in Hinkley and Runger (1984), who explain that the regression parameters have meaning only with reference to particular scales, or at best have a partial interpretation on general scales (Brillinger, 1982; Stoker, 1986). For the Box-Cox transformation model, estimating $\lambda_0$ is akin to determining the right scale for the data. As argued by Hinkley and Runger (1984), "This leads to the conclusion that when inference about parameters refers to specified scales of measurement (as must usually be the case), no allowance need be made for selecting those scales with the aid of the data". These principles lead Hinkley and Runger (1984) to reject the position of Bickel and Doksum (1981) regarding the need to account for the uncertainty regarding $\lambda_0$.
In this paper, we attempt to reconcile these conflicting philosophies. A golden mean methodology is presented which provides a theoretically justified framework in which $\hat\lambda$ can be regarded as fixed when the data indicate that $\lambda_0$ belongs to some finite candidate set, but which otherwise takes into account the uncertainty regarding $\hat\lambda$. We propose a regularization procedure that maximizes a penalized version of the log-likelihood with respect to $\beta$, $\lambda$, and $\sigma$. The penalization consists of a weighted sum of the $\ell_1$ distances of $\lambda$ from a prespecified set of values, with the weights calculated from the data. The procedure is shown to correctly shrink $\hat\lambda$ to the true value when that value is in the set, with the resulting inferences for the other model parameters adaptive to whether or not the true $\lambda_0$ is contained in the candidate set. When $\lambda_0$ is in the set, the limit distribution is equivalent to that of an oracle estimator for which $\lambda_0$ is known a priori. This theoretical finding supports treating $\hat\lambda$ as fixed, as advocated by Hinkley and Runger (1984). When $\lambda_0$ is not in the set, the joint estimator of $\lambda_0$ and the other model parameters is asymptotically equivalent to the unpenalized estimator, with inferences corresponding to those of Bickel and Doksum (1981), which account for the uncertainty in $\hat\lambda$.
The approach we take here is sensible in practice, since one typically has a small set of candidate transformations of interest, where the transformations are inherently meaningful and yield straightforward model interpretation. If one uses a standard model selection criterion, such as AIC, restricting the transformations to this set, then one is implicitly finding the best-fitting model among those models. This best-fitting model may of course be misspecified if the true model is not contained in that set. Our approach is conceptually different in that it will not select a model in the finite set unless it is the true model. Thus, our approach might be viewed as providing a goodness-of-fit assessment for procedures which restrict models to the finite set.
The issues discussed above for the Box-Cox response transformation model occur quite generally in regression models involving transformations of either the response $Y$ or the covariates $X$. A comprehensive overview is given in the definitive text of Carroll and Ruppert (1988). In Section 2, we formulate a unifying model for the mean of the response in which there may exist multiple transformation parameters influencing the response, the covariates, and the relationship of the mean of the response to the covariates. This includes model (1) as a special case, and also permits generalized linear models for categorical outcomes in which the link function is specified up to an unknown parameter $\lambda_0$. A broadly applicable shrinkage approach is discussed, in which each transformation parameter is shrunk towards values in a candidate set, with the level of shrinkage determined by a weight calculated from the data. The approach further allows shrinkage of regression parameters, enabling simultaneous variable and transformation selection. The theoretical results described above for the Box-Cox model are demonstrated to be valid in the unifying model. If the true parameter lies in the candidate set, then, with probability converging to one, the estimated parameter will equal this value in finite samples, and when it does, the corresponding inferences may regard the estimated value as fixed. On the other hand, if the true value does not lie in the set, then the usual asymptotic results for joint estimation of all parameters apply. We refer to this adaptation as the oracle properties of the estimator.
The oracle properties discussed above are valid for fixed values of the parameters. The construction of confidence regions for parameters typically demands stronger results, such as uniform convergence. As in previous theoretical work on variable selection with oracle properties, the convergence is not uniform over the parameter space (Leeb, 2009; Pötscher and Schneider, 2010). While uniform convergence does not hold, we prove that uniform convergence over arbitrarily large subsets of the parameter space does hold. Here, arbitrarily large means that the subset on which uniform convergence fails can be chosen to have arbitrarily small Lebesgue measure. For Box-Cox transformation models, the subsets on which uniform convergence fails consist only of points that are very close to candidate transformations, and thus, by the Hinkley and Runger (1984) paradigm, no allowance is needed for selecting those scales. In this work, we construct confidence regions which asymptotically achieve the nominal coverage level over arbitrarily large subsets of the parameter space. Furthermore, we show that for any continuous and bounded prior on the parameter space, these confidence regions asymptotically attain the desired coverage level.
One may view the proposed penalization methods in the spirit of earlier work on variable selection, inspired by the seminal lasso paper (Tibshirani, 1996). With a suitable choice of tuning parameter, one may consistently select important covariates by shrinking the coefficients of unimportant covariates to zero, with the coefficient estimates for the important covariates having an asymptotic distribution equivalent to that of an oracle estimator with the unimportant covariates known a priori. Penalty functions yielding such estimators include the adaptive lasso (Zou, 2006), the smoothly clipped absolute deviation (SCAD; Fan and Li, 2001), the minimax concave penalty (Zhang, 2010), the smooth integration of counting and absolute deviation (Lv and Fan, 2009), and the log penalty (Friedman, 2012). Our penalization strategy adapts that of Zou (2006), with unpenalized estimates of the unifying model providing the necessary weights for penalized estimation. A major technical innovation is that we permit simultaneous shrinkage towards multiple values of interest, as needed with transformation models. In addition, we present results which allow the size of the value-of-interest set to converge to infinity as the sample size grows. As in the variable selection setting, if one constructs the weights to be large when the transformation is close to the candidate values, then the penalty enforces shrinkage to those candidate values, with the tuning parameter providing the necessary counterbalance.
In Section 2, we present our unifying model and penalization method, along with the associated theoretical results. Section 3 presents their specialization to the Box-Cox model (1). Simulation studies are discussed in Section 4, with real data examples used to illustrate the methods in Section 5. Concluding remarks are given in Section 6. Detailed proofs and additional simulation results may be found in the Appendix. The R code for the algorithm can be found in Goldberg et al. (2016).

Data and model
Let $V_1, \ldots, V_n$ be independent $d$-dimensional random vectors with distribution function $G(v)$. The observations $V_i$ can be pairs $(X_i, Y_i)$ of explanatory and response variables, but are not limited to this setting. Let the parameter space $\Theta$ be a compact subset of $\mathbb{R}^p$. Let $\mathcal{F} = \{F(v; \theta), \theta \in \Theta\}$ be a family of distributions. Denote the density of $F(v; \theta)$, with respect to some dominating measure $\nu$, by $f(v; \theta)$. Define the log-likelihood of the data with respect to the family $\mathcal{F}$ as $\ell_n(\theta) = \sum_{i=1}^n \log f(V_i; \theta)$, let $\tilde\theta_n$ denote its maximizer, and define
$$ \Gamma(\theta) = -E\left\{ \frac{\partial^2 \log f(V; \theta)}{\partial \theta\, \partial \theta^T} \right\}, \qquad \Delta(\theta) = E\left\{ \frac{\partial \log f(V; \theta)}{\partial \theta} \, \frac{\partial \log f(V; \theta)}{\partial \theta^T} \right\}, $$
and $\Lambda(\theta) = \Gamma(\theta)^{-1} \Delta(\theta) \Gamma(\theta)^{-1}$. This setting allows for model misspecification, in that the true distribution $G$ need not be in $\mathcal{F}$ (see White, 1982, for discussion).
We assume the following conditions.
(A1) G has a density g with respect to the dominating measure ν.
(A2) $\theta_0$ is an inner point of $\Theta$ and is the unique maximizer of $E\{\log f(V; \theta)\}$.
The conditions above ensure consistency of $\tilde\theta_n$ for $\theta_0$ and that $\sqrt{n}(\tilde\theta_n - \theta_0) \to_d N(0, \Lambda(\theta_0))$ (White, 1982, Theorem 3.1). When $G(v)$ is in the family of distributions $\mathcal{F}$, we return to the usual setting of maximum likelihood estimation: $G(v) \equiv F(v; \theta_0)$ and $\Lambda(\theta_0) = I(\theta_0)^{-1}$, where $I(\theta)$ is the information matrix at $\theta$. However, the above conditions do not require that the assumed model be correctly specified, permitting model misspecification, similarly to Lu et al. (2012), who studied likelihood-based variable selection under misspecification. For some misspecified models, the parameter $\theta_0$ being estimated may still be meaningful. This generality is important for the transformation model, as discussed in Section 3.
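For concreteness, a minimal R sketch of the sandwich estimate $\hat\Lambda_n = \hat\Gamma_n^{-1} \hat\Delta_n \hat\Gamma_n^{-1}$ is given below, using numerical derivatives of a generic log-density; the interface and the use of the numDeriv package are assumptions for illustration, not the paper's implementation.

```r
## Minimal sketch: robust sandwich variance Lambda = Gamma^{-1} Delta Gamma^{-1}
## for a generic log-density logf(theta, v) of a single observation v.
library(numDeriv)

sandwich_variance <- function(theta_hat, V, logf) {
  n <- nrow(V)
  # Per-observation score vectors at theta_hat (n x p matrix)
  scores <- t(sapply(seq_len(n), function(i) grad(logf, theta_hat, v = V[i, ])))
  Delta  <- crossprod(scores) / n                  # empirical E[score score^T]
  # Minus the Hessian of the average log-likelihood
  Gamma  <- -hessian(function(th) mean(sapply(seq_len(n),
                     function(i) logf(th, V[i, ]))), theta_hat)
  Ginv <- solve(Gamma)
  Ginv %*% Delta %*% Ginv                          # Lambda-hat
}
```

When the model is correctly specified, $\Gamma$ and $\Delta$ coincide with the information matrix, and the sandwich reduces to $I(\theta_0)^{-1}$, recovering the usual maximum likelihood variance.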

Penalized estimators
Let $\theta = (\theta_1, \ldots, \theta_p)^T$ be the vector of parameters. For each component of $\theta$, one can select a set of values of interest. We refer to these sets of values of interest as candidate sets. In the variable selection problem (see, for example, Fan and Li, 2001), for each component, the value of interest is zero, and thus the candidate set is of size one. In power transformations of the response, or of both sides (Carroll and Ruppert, 1988), typical values of interest for the power parameter $\lambda_0$ are $0, \pm 1/2, \pm 1, \pm 2$; thus the candidate set for $\lambda_0$ is $\{0, \pm 1/2, \pm 1, \pm 2\}$. In power transformation models, there may be no interest in shrinking the estimator of the regression parameter $\beta_0$, in which case the candidate sets for the components of $\beta_0$ have size 0. Of course, the candidate set may be different for each parameter, as would occur, for example, when performing variable selection together with a power transformation (see the discussion of this model in Yeo, 2005). In this case, one can simultaneously shrink the power transformation parameter in the Box-Cox response model towards a finite set of values and perform variable selection in which each regression parameter is shrunk towards zero.
Let $\{\theta_{1j}, \ldots, \theta_{k_j j}\}$ be the candidate set for the $j$th component of the parameter vector, $k_j \in \{0, 1, 2, \ldots\}$. Here $k_j = 0$ is allowed, corresponding to no values of interest for that component.
We define the penalized log-likelihood function
$$ \Phi_n(\theta) = \ell_n(\theta) - \sum_{j=1}^{p} a_{nj} \sum_{k=1}^{k_j} \hat{w}_{kj}\, |\theta_j - \theta_{kj}|, $$
where $\hat{w}_{kj} = |\tilde\theta_{nj} - \theta_{kj}|^{-\gamma}$ for some $\gamma > 0$ are weights, and $a_n = (a_{n1}, \ldots, a_{np})^T$ is a vector of tuning parameters with positive components. Let $\hat\theta_n$ denote the maximizer of $\Phi_n(\theta)$. In the next subsection, we demonstrate that these penalized estimators are selection consistent in the sense that, with probability converging to one, a particular parameter estimator will equal a candidate value for finite $n$ if that candidate value is the true value. This generalizes earlier work on variable selection (Fan and Li, 2001; Zou, 2006; Lv and Fan, 2009; Zou and Zhang, 2009; Friedman, 2012), where, with probability converging to one, a particular regression parameter estimator will equal 0 for finite $n$ if the corresponding covariate is unimportant. Moreover, the resulting estimators are oracle, having a limiting normal distribution whose variance equals that of an unpenalized estimator in which it is known a priori which of the candidate values are the true parameter values.
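A minimal R sketch of this penalized objective may help fix ideas; the generic loglik function, the list-of-vectors representation of the candidate sets, and all names are hypothetical illustrations of the penalty structure, not the authors' code.

```r
## Minimal sketch: penalized log-likelihood Phi_n(theta) with adaptive weights.
## candidate_sets: list of numeric vectors, one per component (may be empty,
## corresponding to k_j = 0); theta_tilde: unpenalized estimate supplying the
## weights w_kj = |theta_tilde_j - theta_kj|^{-gamma}; a_n: tuning parameters.
penalized_loglik <- function(theta, loglik, candidate_sets, theta_tilde,
                             a_n, gamma = 1) {
  penalty <- 0
  for (j in seq_along(theta)) {
    for (cand in candidate_sets[[j]]) {
      w_kj <- abs(theta_tilde[j] - cand)^(-gamma)
      penalty <- penalty + a_n[j] * w_kj * abs(theta[j] - cand)
    }
  }
  loglik(theta) - penalty
}
```

In a variable selection problem each candidate set is simply {0} and the penalty reduces to the adaptive lasso; with a candidate set such as $\{0, \pm 1/2, \pm 1, \pm 2\}$ for a transformation parameter, a candidate close to the unpenalized estimate receives a large weight and exerts a correspondingly strong pull.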

Theoretical properties
Let $A = \{j : \theta_{0j} \notin \{\theta_{1j}, \ldots, \theta_{k_j j}\}\}$ be the set of indices of components whose true values are not candidate values, and let $A^C$ be its complement. Without loss of generality, we assume that $A = \{1, \ldots, p_1\}$, $A^C = \{p_1 + 1, \ldots, p\}$, and that for all $j \in A^C$, $\theta_{1j} = \theta_{0j}$. Write $\theta_0 = (\theta_{01}^T, \theta_{02}^T)^T$, where $\theta_{01}$ is a $p_1$-dimensional vector of parameters corresponding to the indices in $A$, and $\theta_{02}$ is a $p_2 = (p - p_1)$-dimensional vector corresponding to the indices in $A^C$. Accordingly, we write $\hat\theta_n = (\hat\theta_{n1}^T, \hat\theta_{n2}^T)^T$. In the following we present the main theoretical results of the paper. We first show that, asymptotically, one can estimate $\theta_0$ as if the components of $\theta_{02}$ were known. In other words, with probability tending to 1, $\hat\theta_{n2} = \theta_{02}$, and the asymptotic variance matrix of $\hat\theta_{n1}$ achieves the information bound of the estimation problem in which $\theta_{02}$ is known. We then use this result to derive pointwise asymptotically valid confidence regions for $\theta_0$.

Theorem 1. Assume conditions (A1)-(A4), that $a_{nj}/n^{1/2} \to 0$, and that $a_{nj} n^{(\gamma-1)/2} \to \infty$ as $n \to \infty$. Then $P(\hat\theta_{n2} = \theta_{02}) \to 1$, while for all $j \in A$ and for all $k = 1, \ldots, k_j$, $P(\hat\theta_{nj} = \theta_{kj}) \to 0$. Moreover, $\sqrt{n}(\hat\theta_{n1} - \theta_{01}) \to_d N(0, \Lambda_{11})$, where $\Lambda_{11}$ is the asymptotic variance matrix of the oracle estimator of $\theta_{01}$ computed with $\theta_{02}$ known.

The proof of the theorem appears in Appendix A.1. These results do not require that the model be correctly specified, with the limiting variance $\Lambda_{11}$ being robust to model misspecification. In addition, as noted previously, these asymptotic results are pointwise and may not hold for contiguous sequences converging to parameter values that include candidate shrinkage points. The implications of this fact for the theoretical properties of the resulting confidence intervals and regions are discussed in the next subsection.
In general, one can have a different regularization constant for each parameter that has a non-empty candidate set. For the transformation model in Section 3, there is only one candidate set; therefore only one $a_n$ is used, with tuning parameter selection carried out using standard methods. In the more general case with multiple parameters being shrunk, as in variable selection, one typically uses the same regularization constant for all candidate sets (see Zou, 2006, on the adaptive lasso, as well as other key papers on penalization). This is in part because the use of multiple tuning parameters tends to complicate the analysis, owing to the need to select the multiple tuning parameters simultaneously, and in part because there is little evidence in the literature to suggest that this yields marked improvements in the empirical performance of penalization methods.

Inference
We now derive general pointwise asymptotically valid confidence intervals and confidence regions for the components of $\theta_0$. First, we need some definitions. Define the empirical analogue of $A$ by $A_n = \{j : \hat\theta_{nj} \neq \theta_{kj} \text{ for all } k = 1, \ldots, k_j\}$, with complement $A_n^C$. For a set of indices $J$, let $\theta^{(J)}$ denote the corresponding subvector of $\theta$, and similarly for matrices and sets; in particular, let $J_{n1} = J \cap A_n$ and $J_{n2} = J \cap A_n^C$. In the following we discuss confidence regions for a subset of parameters in which one treats parameters that have been shrunk to a candidate value as fixed. The confidence regions may then be constructed using standard methods for maximum likelihood estimators.
Theorem 2. Let $J = \{j_1, \ldots, j_r\}$ be a set of indices, and let $D_s$ be a set satisfying $P(Z_s \in D_s) = 1 - \alpha$, where $Z_s$ is a Gaussian random vector with mean 0 and identity variance matrix of dimension $s$. For fixed $\theta_0$ with parameter subset $\theta_0^{(J)}$, define the set $C_n$ as
$$ C_n = \left\{ \theta^{(J)} : \theta^{(J_{n2})} = \hat\theta_n^{(J_{n2})} \text{ and } \sqrt{n}\, \bigl(\hat\Lambda_n^{(J_{n1})}\bigr)^{-1/2} \bigl(\hat\theta_n^{(J_{n1})} - \theta^{(J_{n1})}\bigr) \in D_s \right\}, \qquad (5) $$
where $s$ is the cardinality of $J_{n1}$ and $\hat\Lambda_n^{(J_{n1})}$ is the estimated asymptotic variance matrix of $\hat\theta_n^{(J_{n1})}$, computed treating the components in $J_{n2}$ as known. Then $\liminf_{n\to\infty} P\{\theta_0^{(J)} \in C_n\} \ge 1 - \alpha$.

See the proof in Appendix A.2. The above theorem can be simplified when there is only one component of the vector $\theta_0$ for which there are values of interest. This result is useful in the Box-Cox model (1), when the regression parameters have no candidate values.
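As an illustration, the following R sketch checks whether a hypothesized subvector lies in a Wald-type region of the form (5), treating shrunk components as fixed and taking $D_s$ to be the usual chi-squared ellipsoid; all names, and the ellipsoidal choice of $D_s$, are assumptions for illustration.

```r
## Minimal sketch: membership test for a Wald-type region in the spirit of
## Theorem 2. Components shrunk to a candidate value are treated as fixed;
## the remaining components get a chi-squared ellipsoid from the sandwich
## variance Lambda_hat (subsetted to the indices J).
wald_region_contains <- function(theta0_J, theta_hat_J, Lambda_hat_J, n,
                                 shrunk, alpha = 0.05) {
  # shrunk: logical vector, TRUE where the estimate equals a candidate value
  if (any(theta0_J[shrunk] != theta_hat_J[shrunk])) return(FALSE)
  free <- !shrunk
  s <- sum(free)
  if (s == 0) return(TRUE)
  d <- theta_hat_J[free] - theta0_J[free]
  stat <- n * drop(t(d) %*% solve(Lambda_hat_J[free, free, drop = FALSE]) %*% d)
  stat <= qchisq(1 - alpha, df = s)      # TRUE if inside the 1 - alpha region
}
```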
The above result provides guarantees regarding the coverage probabilities of confidence intervals and regions for fixed $\theta_0$. These guarantees are not uniform in $\theta_0$, owing to the lack of uniform (in $\theta$) convergence of the limit distributions of estimators based on penalization procedures. The difficulties occur at points in the parameter space that are arbitrarily close to shrinkage values. Hence, assuming our model is correctly specified, the conventional definition of a confidence region, that is,
$$ \liminf_{n\to\infty}\, \inf_{\theta \in \Theta} P_{\theta}\{\theta^{(J)} \in C_n\} \ge 1 - \alpha, $$
cannot be satisfied. This raises fundamental questions concerning the practical utility of the pointwise confidence regions in Theorem 2. In particular, one may not know a priori whether the true parameter value is sufficiently separated from shrinkage points to yield valid inferences.
To address this issue, we now investigate the extent to which poor performance of the confidence regions may occur under weak a priori assumptions on the true parameter values with a correctly specified model. We say that a sequence of (potentially random) sets $C_n$ is an asymptotically almost-everywhere confidence set for $\theta_0$ in $\Theta$ if there is a sequence of parameter subspaces $\Theta_n \subset \Theta$ such that the Lebesgue measure of the sets $\Theta \setminus \Theta_n$ converges to zero as $n \to \infty$ and
$$ \liminf_{n\to\infty}\, \inf_{\theta \in \Theta_n} P_{\theta}\{\theta^{(J)} \in C_n\} \ge 1 - \alpha. \qquad (7) $$
We have the following result.
Theorem 3. Let $J = \{j_1, \ldots, j_r\}$ be a set of indices. Let $C_n$ be defined as in (5). Assume (A3) and that (A4) holds for all inner points of $\Theta$. Then the sequence $C_n$ is an asymptotically almost-everywhere confidence set in $\Theta$.
See the proof in Appendix A.3. The above result shows that the confidence regions defined in Theorem 2 guarantee asymptotic coverage probabilities of $1 - \alpha$ uniformly on a large subset of $\Theta$. Here, a large subset means that the Lebesgue measure of the difference between $\Theta$ and the subset can be made arbitrarily small. Moreover, the following corollary shows that, for any continuous and bounded prior on $\Theta$, the probability that $\theta$ lies in the confidence region asymptotically attains the nominal coverage rate $1 - \alpha$.
Corollary 2. Let $C_n$ be defined as in (5). Assume (A3) and that (A4) holds for all inner points of $\Theta$. Assume that $\theta$ is a random vector with bounded density $\pi_\theta(\vartheta)$, $\vartheta \in \Theta$. Then $P\{\theta^{(J)} \in C_n\} \to 1 - \alpha$ as $n \to \infty$. The proof appears in Appendix A.4.

Generalization to growing number of shrinkage points
We now consider the case in which the number of shrinkage points for each component of $\theta$ may grow with the sample size. This permits scenarios with infinitely many shrinkage points, a set-up which, to our knowledge, has not been considered previously in the penalization literature. Let $\Theta_j^{(n)}$ denote the candidate set for the $j$th component at sample size $n$; these sets can change with $n$. We define the penalized log-likelihood function $\Phi_n^{(n)}(\theta)$ as before, with the candidate sets $\Theta_j^{(n)}$ in place of fixed candidate sets, where $\hat{w}_1, \ldots, \hat{w}_p$ and $a_n$ are defined as before. Let $\hat\theta_n$ denote the maximizer of $\Phi_n^{(n)}(\theta)$. Let $B$ and $B^C$ be the analogues, for all $n = 1, 2, \ldots$, of the index sets $A$ and $A^C$. Without loss of generality, we assume that $\theta_0 = (\theta_{01}^T, \theta_{02}^T)^T$, where $\theta_{01}$ is a $p_1$-dimensional vector of parameters corresponding to the indices in $B$, and $\theta_{02}$ is a $p_2 = (p - p_1)$-dimensional vector corresponding to the indices in $B^C$. Accordingly, we write $\hat\theta_n = (\hat\theta_{n1}^T, \hat\theta_{n2}^T)^T$. We also need definitions regarding the size and denseness of the sets $\Theta_j^{(n)}$, quantified below by the maximal candidate-set size $\eta_n$ and a measure $\delta_{nj}$ of the separation between $\theta_{0j}$ and the neighboring candidate points. The proof of Theorem 4 appears in Appendix A.5. Theorem 4 shows that one can obtain an asymptotically oracle estimator of $\theta_0$ even when the candidate set size tends to infinity, or when the limit of the candidate set is dense in $\Theta$, or both. Specifically, the candidate set size can grow to infinity at a rate of up to $n^{1/2 - \epsilon}$ for an arbitrarily small $\epsilon > 0$. The candidate set can also converge to a dense set such that, in the limit, for each given $j$, there are points arbitrarily close to $\theta_{0j}$. The distance between $\theta_{0j}$ and its neighboring points can decrease at a rate not less than $n^{-1/2 + \epsilon}$ for an arbitrarily small $\epsilon > 0$. For the case in which both the size and the denseness grow with $n$, the results of Theorem 4 hold when choosing $\gamma = 1$, $a_{nj} \asymp n^{\alpha_1}$ for some $\alpha_1 > 0$, $\eta_n \asymp n^{\alpha_2}$ for some $\alpha_2 > 0$ such that $\alpha_1 + \alpha_2 < 1/2$, and $\delta_{nj} \asymp n^{1/2 - \alpha_3}$ for $\alpha_1 + \alpha_2 < \alpha_3 < 1/2$.

Application to transformation models
In this section, we discuss the application of the general results in Section 2 to the Box-Cox transformation model. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be $n$ independent and identically distributed observations, where the $X_i$ are random vectors and the $Y_i$ are positive random variables. Assuming the usual Box-Cox transformation model defined by (1)-(2), the log-likelihood for this model is
$$ \ell_n(\theta) = \sum_{i=1}^n \left\{ -\log \sigma - \tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\varepsilon_i^2 + (\lambda - 1)\log Y_i \right\}, $$
where $\theta = (\beta^T, \sigma, \lambda)^T$ and $\varepsilon_i = \{h(Y_i, \lambda) - \beta^T X_i\}/\sigma$. Following Hernandez and Johnson (1980), we note that this model may not be correctly specified, but conditions (A1)-(A4) may still be satisfied (see also Bickel and Doksum, 1981; Yeo and Johnson, 2000). The parameter being estimated corresponds to the maximizer of the expectation of the likelihood function with respect to the true underlying distribution of $(X, Y)$. Such a parameter may still have a meaningful interpretation in the regression context. Let $\tilde\theta_n$ be the maximum likelihood estimator, that is, the solution of the score equation $\partial \ell_n(\theta)/\partial\theta = 0$, which involves the derivative $\partial h(y, \lambda)/\partial\lambda$. The matrix $\Lambda(\theta) = \Gamma(\theta)^{-1} \Delta(\theta) \Gamma(\theta)^{-1}$ can be consistently estimated by the sandwich estimator $\hat\Lambda_n = \hat\Gamma_n^{-1} \hat\Delta_n \hat\Gamma_n^{-1}$, with $\hat\Gamma_n$ and $\hat\Delta_n$ the empirical versions of $\Gamma$ and $\Delta$ evaluated at $\tilde\theta_n$. The goal is estimation of the regression parameters which is adaptive to the unknown power transformation. Denote the candidate set for $\lambda$ by $\{\lambda_1, \ldots, \lambda_k\}$ and define the penalized log-likelihood function
$$ \Phi_n(\theta) = \ell_n(\theta) - a_n \sum_{j=1}^{k} \hat{w}_n^j\, |\lambda - \lambda_j|, \qquad (11) $$
where $\theta = (\beta^T, \sigma, \lambda)^T \in \mathbb{R}^p$, $a_n$ is a regularization constant, and $\hat{w}_n^j = |\tilde\lambda_n - \lambda_j|^{-\gamma}$, where $\tilde\lambda_n$ is the maximum likelihood estimator of $\lambda_0$ without adaptive selection. Let $\hat\theta_n = (\hat\beta_n^T, \hat\sigma_n, \hat\lambda_n)^T$ be the maximizer of (11).

Lemma 1. Assume conditions (A1)-(A4), that $a_n/n^{1/2} \to 0$ and that $a_n n^{(\gamma-1)/2} \to \infty$ as $n \to \infty$. Assume also that $E\{h(Y, \lambda_0) \mid X\} = \beta_0^T X$ and $\operatorname{var}\{h(Y, \lambda_0) \mid X\} = \sigma_0^2$. Then, when $\lambda_0$ is contained in the candidate set, $\hat\theta_n$ has the oracle limiting distribution in which $\lambda_0$ is treated as known; otherwise, $\sqrt{n}(\hat\theta_n - \theta_0)$ has the same limiting normal distribution as the unpenalized maximum likelihood estimator.
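A possible implementation of the penalized estimation of $\lambda$ is sketched below in R, reusing MASS::boxcox for the profile likelihood of $\lambda$; the function name, the grid, and the interpolation step are illustrative assumptions, not the authors' code (which is available in Goldberg et al., 2016).

```r
## Minimal sketch: adaptive penalized Box-Cox fit. The profile log-likelihood
## of lambda is penalized by a_n * sum_j w_j * |lambda - lambda_j| with
## adaptive weights w_j = |lambda_tilde - lambda_j|^{-gamma}.
library(MASS)

adaptive_boxcox <- function(formula, data, candidates = c(0, 1/2, -1/2, 1, -1),
                            a_n, gamma = 1, grid = seq(-2, 2, by = 0.005)) {
  # y = TRUE keeps the response in the fit so boxcox() works inside a function
  fit <- lm(formula, data = data, y = TRUE)
  bc  <- boxcox(fit, lambda = grid, plotit = FALSE)
  lambda_tilde <- bc$x[which.max(bc$y)]          # unpenalized MLE of lambda
  if (any(lambda_tilde == candidates))           # infinite weight: the penalty
    return(lambda_tilde)                         # forces exact selection
  eval_pts <- sort(unique(c(bc$x, candidates)))  # include candidates exactly
  ll  <- approx(bc$x, bc$y, xout = eval_pts)$y   # interpolated profile log-lik
  pen <- sapply(eval_pts, function(l)
    a_n * sum(abs(lambda_tilde - candidates)^(-gamma) * abs(l - candidates)))
  eval_pts[which.max(ll - pen)]                  # penalized estimate of lambda
}
```

Because the weight attached to a candidate $\lambda_j$ blows up as $\tilde\lambda_n$ approaches $\lambda_j$, a candidate near the unpenalized estimate is selected exactly once the tuning parameter $a_n$ is large enough, which is the mechanism behind the selection consistency in Lemma 1.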

Simulation study
We conducted simulations to evaluate the performance of the penalized estimators of the power transformation model. The data were generated from the Box-Cox model (1), under transformation (2). We considered a one-dimensional covariate vector that was generated from the standard normal distribution. The error terms were generated from the standard normal distribution. We considered five values for the true transformation parameter: $\lambda_0 = 0, 1/2, -1/2, 1, -1$.
The candidate set for the transformation parameter is $A_\lambda = \{0, 1/2, -1/2, 2, -2\}$; note that $A_\lambda$ does not include $1$ or $-1$. For our method, we first computed the maximum likelihood estimators of $\lambda_0$ and $\beta_0$ without penalization, denoted by $\tilde\lambda_n$ and $\tilde\beta_n$, respectively. Then we obtained our proposed estimators by maximizing (11). The tuning parameter $a_n$ was selected using 5-fold cross-validation. The weight parameter $\gamma$ was set to 1. For comparison, we computed results for the unpenalized estimator and for the oracle estimator, in which the maximum likelihood estimator is calculated with the true value of the transformation parameter.
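Cross-validated selection of $a_n$ could proceed along the following lines; this R sketch scores each candidate $a_n$ by the held-out log-likelihood on the original scale (including the Jacobian term $(\lambda - 1)\sum \log y$), reusing the adaptive_boxcox() sketch above, and the criterion is an illustrative assumption rather than the exact procedure used in the simulations.

```r
## Minimal sketch: K-fold cross-validation for the tuning parameter a_n.
cv_select_an <- function(y, x, an_grid, K = 5,
                         candidates = c(0, 1/2, -1/2, 2, -2)) {
  folds <- sample(rep(seq_len(K), length.out = length(y)))
  h <- function(y, l) if (l == 0) log(y) else (y^l - 1) / l
  score <- sapply(an_grid, function(a_n) {
    sum(sapply(seq_len(K), function(k) {
      tr  <- folds != k
      dat <- data.frame(y = y[tr], x = x[tr])
      lam <- adaptive_boxcox(y ~ x, dat, candidates, a_n = a_n)
      fit <- lm(h(y[tr], lam) ~ x[tr])
      beta <- coef(fit); sig <- summary(fit)$sigma
      res <- h(y[!tr], lam) - cbind(1, x[!tr]) %*% beta
      # held-out Gaussian log-likelihood on the original scale
      sum(dnorm(res, sd = sig, log = TRUE)) + (lam - 1) * sum(log(y[!tr]))
    }))
  })
  an_grid[which.max(score)]
}
```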
For each method, we computed the bias and the median absolute deviation of the estimates divided by 0.6745. The reason we computed the median absolute deviation instead of the sample standard deviation is that the maximum likelihood and penalized estimators may produce a few outliers due to instability of the estimation. In addition, for the maximum likelihood and penalized estimators, we computed the median of the estimated standard errors and the empirical coverage probability of Wald-type 95% confidence intervals. The standard error of the maximum likelihood estimator was estimated using standard likelihood theory. The standard error of the adaptive estimator was estimated based on the asymptotic results established in Lemma 1. The empirical coverage probabilities of 95% confidence intervals based on the model-based standard errors are provided for the maximum likelihood and adaptive estimators. For the maximum likelihood estimator, we also computed the coverage probabilities when $\tilde\lambda_n$ is treated as fixed. Finally, for the adaptive estimator, we include the selection frequency of the true power transformation parameter when it is contained in the candidate set, and the selection frequency of the values in the candidate set when the true power transformation parameter is not contained in the candidate set. We report these results in Appendix B, Tables 3-5. Based on the simulation results, we make the following observations: (i) The oracle estimators of the regression parameters generally have much smaller median absolute deviation than the maximum likelihood estimator. (ii) When the true power transformation parameter is contained in the candidate set, the adaptive and oracle estimators show very comparable performance in terms of both bias and median absolute deviation. In addition, the medians of the estimated standard errors are all close to the median absolute deviations, with the empirical coverage probabilities close to the nominal level. The adaptive method selects the true power transformation parameter with high frequency, and the selection performance improves as the sample size increases. (iii) When the true power transformation parameter is not contained in the candidate set, the adaptive and maximum likelihood estimators show comparable performance, and the estimation performance improves as the sample size increases, as expected. In addition, for the adaptive method, the selection frequency of the values in the candidate set is low and decreases to 0 as the sample size increases. (iv) When using the maximum likelihood estimator and treating $\tilde\lambda_n$ as fixed, the confidence intervals may severely undercover. Findings (i)-(iv) support the theoretical results in Sections 2 and 3.
Next, we conducted simulations to evaluate the performance of the proposed adaptive estimator when varying the signal-to-noise ratio. In particular, we study the inflation in the standard error estimates of the maximum likelihood estimator compared with the adaptive estimator, and the power of the Wald test of $\beta_{02} = 0$ based on the adaptive estimator, as the signal-to-noise ratio varies. We consider the same simulation settings with $\lambda_0 = 0, 1/2$, and $-1/2$, and set $\beta_{02} = 0.05, 0.1, 0.25, 0.5, 0.75$, and $1.0$. For each setting, we conducted 500 runs with sample size $n = 100$. The true value $\lambda_0$ is contained in the candidate set. Therefore, the standard error estimates of the adaptive estimator are obtained as if $\lambda_0$ were known, as long as the corresponding estimator is shrunk to a value in the candidate set. The mean standard error ratios of the maximum likelihood estimator over the adaptive estimator for $\beta_{02}$ are plotted in the upper-left panel of Figure 1, while the power of the Wald test of $\beta_{02} = 0$ based on the adaptive estimator is plotted in the other panels of Figure 1. We observe that, compared with the adaptive estimator, the maximum likelihood estimator shows larger standard error inflation as the signal-to-noise ratio increases, while there is almost no inflation when the signal-to-noise ratio is close to 0. These observations agree with the findings in Bickel and Doksum (1981) and Doksum and Wong (1983). Furthermore, the power of the Wald test increases with the signal-to-noise ratio, as expected, and the adaptive estimator has significantly improved power compared with the maximum likelihood estimator for most $\beta_{02}$ values under all scenarios.
Finally, we conducted simulations to evaluate the performance of the proposed adaptive estimator in terms of prediction intervals and to compare it with the maximum likelihood estimator. Here the prediction interval is constructed following the method of Cho et al. (2001). The coverage probabilities of the corresponding 95% prediction intervals and their standard errors are reported in Table 6 in the Appendix. Based on the simulation results, we observe that the prediction intervals constructed from both estimators give reasonable coverage probabilities. In addition, the coverage probabilities of the adaptive method tend to have smaller standard deviations compared with the maximum likelihood method when $\lambda_0 = 0, 1/2$, and $-1/2$, i.e., when $\lambda_0$ is contained in the candidate set, while they are comparable when $\lambda_0 = 1, -1$, i.e., when $\lambda_0$ is not contained in the candidate set.

Data examples
In this section, we apply the proposed adaptive selection method to two datasets studied in Box and Cox (1964) and Hinkley and Runger (1984). The R code for the algorithm can be found in Goldberg et al. (2016).
For the textile example, the response is cycles to failure. There are three explanatory variables, $v_1$, $v_2$, $v_3$, denoting the factor levels. A linear regression model was considered for the Box-Cox transformation of the response. The sample size is $n = 27$. The analysis results for the maximum likelihood estimator and the proposed adaptive estimation method are given in Table 1 ($\lambda$, the power transformation parameter; $\sigma$, the standard deviation of the normal error; intercept, $v_1$, $v_2$, and $v_3$, the regression parameters; Est., the estimates; SE, the estimated standard errors of the estimates). The maximum likelihood estimator gives $\tilde\lambda = -0.06$, and the results of the maximum likelihood estimator are the same as those reported in Table 5 of Hinkley and Runger (1984). For the proposed adaptive estimation method, we considered the candidate set $A_\lambda = \{0, 1/2, -1/2, 1, -1\}$, as suggested in Hinkley and Runger (1984). Moreover, we used 3-fold cross-validation for choosing the tuning parameter. The adaptive estimator gives $\hat\lambda = 0$. It is noted that the estimated standard errors of the adaptive estimates of the regression parameters are much smaller than those of the maximum likelihood estimates. This agrees with our simulation findings, since when $\hat\lambda$ is in the candidate set, it is taken as known when making inference about the regression coefficient vector. Next, we consider the biological example from Hinkley and Runger (1984). The data consist of survival times of animals in a 3 × 4 factorial experiment, in which one unit of time equals ten hours. The two factors are treatment, with four levels (A-D), and poison, with three levels (I, II, III). There are 48 subjects, with four subjects in each of the twelve treatment/poison combinations. As in Hinkley and Runger (1984), we consider a saturated model with 11 dummy variables indicating the combinations, with the combination of treatment A and poison I taken as the baseline. The maximum likelihood estimator gives $\tilde\lambda = -0.82$, which agrees with the results in Hinkley and Runger (1984). The maximum likelihood estimates are given in Table 2. For the proposed adaptive estimation method, we used the same candidate set as in the textile example, with 4-fold cross-validation for choosing the tuning parameter. The penalization method also gives $\hat\lambda = -0.82$; that is, $\hat\lambda$ is not in the candidate set. Interestingly, in this example, the penalized and unpenalized estimates of the parameters are the same, indicating that there was no shrinkage of the transformation parameter. Therefore, the other estimates from our method are all the same as the maximum likelihood estimates.
These two examples show the adaptivity of the proposed method: the power transformation parameter is estimated given a candidate set, and the regression parameters are then estimated, and inference carried out, accordingly. The textile example provides a rigorous treatment of the scenario in which the estimated transformation is shrunk to a candidate value and may be treated as fixed, while the biological example illustrates correct inference in the scenario in which the uncertainty in estimating the transformation must be addressed via standard likelihood inference.

Concluding remarks
We have assumed throughout that the number of parameters in the model is finite, although we allowed the size of the candidate set for each of these parameters to grow to infinity as a function of the sample size. It would be interesting to consider the asymptotic behavior and oracle properties of the proposed estimators in a model in which the number of parameters is allowed to grow with the number of observations. Similar research questions were investigated by Fan and Li (2001), Zou and Zhang (2009), and others in the context of variable selection. It seems that, for candidate sets of finite size greater than one, such proofs might be adapted; this is a topic for future research.
The proposed penalized estimators are nonregular estimators. The irregularity arises because the estimators behave differently for parameters that are in the candidate set and for nearby points. The irregularity poses challenges when constructing confidence regions for the parameters of interest. Discussion can be found in Leeb (2009), Pötscher and Schneider (2010), and others, for the variable-selection setting in which each parameter has at most a single candidate point. In this work, we proposed a definition of an asymptotically almost-everywhere confidence region and showed that our penalization-based procedure yields inferences satisfying this definition. That is, the confidence region attains its nominal coverage except on a set of parameter values close to the shrinkage points, a set having asymptotically measure zero. Additional investigation is needed to understand the properties of these nonregular estimators under general parametric models involving transformation, regression, and scale parameters when multiple candidate values are considered.
The proposed approach also promises to simplify prediction on the original, untransformed scale. Typically, when fitting transformation models, one needs to account for the estimation of both the regression parameter and the transformation parameter when making inference on the original scale. Such back-transformation procedures are greatly complicated by the estimation of the transformation parameter. With our approach, one may ignore the estimation of the transformation parameter when it is shrunk to one of the candidate values; otherwise, one must account for the estimation, as in other methods. The study of prediction on the original scale is thus an interesting topic for future research.

A.2. Proof of Theorem 2
We need to prove that $\liminf_{n} P\{\theta_0^{(J)} \in C_n\} \ge 1 - \alpha$. First, in order that $\theta_0^{(J)} \in C_n$, we need $J_{n2} = J_2$, where $J_2 = J \cap A^C$. Hence, we can bound the coverage probability from below by restricting to the event $A_n = A$, where the next-to-last inequality holds since, when $A_n = A$, we also have $J_{n2} = J_2$, and where the last inequality follows from Theorem 1. Recall that, by Theorem 1, $\sqrt{n}\,(\hat\Lambda_n^{(J_{n1})})^{-1/2}(\hat\theta_n^{(J_{n1})} - \theta_0^{(J_{n1})})$ converges weakly to a Gaussian random vector $Z$ with mean 0 and identity variance matrix of dimension equal to the cardinality of $J_1$. It thus follows from the Portmanteau Lemma (van der Vaart, 2000, Lemma 2.2) that the coverage probability converges, since $P(Z \in D_s) = 1 - \alpha$. Substituting (20) into (19), the result follows.

A.3. Proof of Theorem 3
We define the sets $\Theta_n$ as follows. For each pair $(j, k)$, $j \in \{1, \ldots, p\}$ and $k \in \{1, \ldots, k_j\}$, and for $\varepsilon > 0$, we exclude from $\Theta$ the parameter values whose $j$th component lies within distance $n^{-\delta}$ of the candidate value $\theta_{kj}$ without being equal to it, for some fixed $0 < \delta < 1/2$. Clearly, the Lebesgue measure of the sets $\Theta \setminus \Theta_n$ converges to zero as $n \to \infty$. We now need to show that (7) holds.
It is enough to show that, for every sequence $\theta_n \in \Theta_n$ that converges to some $\theta_0 \in \Theta$, the coverage statement (22) holds along the sequence. Indeed, if (7) does not hold, then there is a sequence $\{\theta_n\}$, $\theta_n \in \Theta_n$, for which the coverage falls short of $1 - \alpha$ by at least some $\varepsilon > 0$, as in (23). By the compactness of $\Theta$, this sequence has a subsequence $\theta_{n_k}$ that converges to some limit $\theta_0$ for which (23) holds, and therefore (22) does not hold. Fix a sequence $\theta_n \in \Theta_n$ that converges to some $\theta_0 \in \Theta$. Note that the local expansion of $\Phi_n$ about $\theta_n$ decomposes into terms $T_1^{(n)}(u)$, $T_2^{(n)}(u)$, and $T_3^{(n)}(u)$, where $\bar\theta$ is between $\theta_n$ and $\theta_n + u/\sqrt{n}$. By the Lindeberg-Feller central limit theorem, $T_1^{(n)}(u)$ converges weakly to a Gaussian limit. By the law of large numbers, Assumptions (A3) and (A4), and the continuous mapping theorem, $T_2^{(n)}(u) \to_{P_{\theta_n}} -u^T \Gamma(\theta_0) u$. Hence, since $\theta_n \to \theta_0$, there exists a constant $C$ such that the bound (25) holds for all $n$ large enough. Since $\tilde\theta_n$ is consistent for $\theta_n$, the bound carries over to the weights. Consider now the limiting behavior of $T_3^{(n)}$. Recall that $\hat{w}_{kj} \equiv |\tilde\theta_{nj} - \theta_{kj}|^{-\gamma}$. By (25), for all $n$ large enough, and for every $k = 1, \ldots, k_j$ and $j \in \{1, \ldots, p\}$, the weights are bounded on both sides, where the left inequality follows from the definition of $\Theta_n$. Also, for all $n$ large enough, the sign of $\theta_{nj} + u_j/\sqrt{n} - \theta_{kj}$ is constant, and hence the penalty contribution is asymptotically linear in $u$. We conclude that the localized objective converges weakly to a limit process $\Psi(u)$, where $W \sim N(0, \Delta(\theta_0))$. Simple algebra shows that the maximizer of $\Psi(u)$ is $\hat{u} = \Gamma(\theta_0)^{-1} W$. Using the same arguments as in the proof of Theorem 1, we can show that all the conditions of the Argmax Theorem (Kosorok, 2008, Theorem 14.1) hold, and thus the penalized estimator is asymptotically normal along the sequence $\{\theta_n\}$. Similarly to the proof of Theorem 2, it now follows from the Portmanteau Lemma (van der Vaart, 2000, Lemma 2.2) that the coverage converges to the nominal level, where by construction $P(Z \in D_p) = 1 - \alpha$.

A.5. Proof of Theorem 4
Define

Appendix B: Simulation results
As described in Section 4, we conducted the following simulations. First, we considered five values for the true transformation parameter: $\lambda_0 = 0, 1/2, -1/2, 1, -1$. The regression parameters were set as $\beta_0 = (\beta_{01}, 1)^T$, where $\beta_{01}$ is the intercept parameter. For $\lambda_0 = 0, 1/2$ we chose $\beta_{01} = 5$; for $\lambda_0 = 1$, $\beta_{01} = 8$; for $\lambda_0 = -1/2$, $\beta_{01} = -5$; and for $\lambda_0 = -1$, $\beta_{01} = -8$. The different choices of $\beta_{01}$ were made to ensure positive responses. The candidate set for the transformation parameter is $A_\lambda = \{0, 1/2, -1/2, 2, -2\}$; note that $A_\lambda$ does not include $1$ or $-1$. For each setting, we conducted 500 simulation runs with sample sizes of $n = 100, 400, 800$, and $1600$. The simulation results for $\lambda_0 = 0, 1/2$, and $-1/2$ are summarized in Tables 3 and 4; for these cases, $\lambda_0$ is contained in the candidate set $A_\lambda$. The simulation results for $\lambda_0 = 1, -1$ are summarized in Table 5; for these cases, $\lambda_0$ is not contained in the candidate set $A_\lambda$. We also conducted a comparison between the prediction intervals based on the proposed adaptive method and the method of Cho et al. (2001). We considered the same simulation settings as above with sample sizes of $n = 100, 400$. Five prediction points are used, namely $x_0 = (1, 0), (1, 1), (1, -1), (1, 2)$, and $(1, -2)$, with the first component being the intercept. The coverage probabilities of the corresponding 95% prediction intervals and their standard errors are reported in Table 6. (Legend for Tables 3-5: MAD, the median absolute deviation of the estimates divided by 0.6745; ESE, the median of the estimated standard errors; CP, the empirical coverage probabilities of the Wald-type 95% confidence interval; Freq., the selection frequency of the true power transformation parameter over 500 runs. In Table 6, the numbers in parentheses are the standard deviations of the coverage probabilities $\times 10^3$.)

R Code
Supplementary material (doi: 10.1214/15-EJS1083SUPP; .zip): the R files containing the code for the case study analyses.