SiAM: A hybrid of single index models and additive models

Abstract: While popular, single index models and additive models have potential limitations, a fact that leads us to propose SiAM, a novel hybrid combination of these two models. We first address model identifiability under general assumptions. The result is of independent interest. We then develop an estimation procedure by using splines to approximate unknown functions and establish the asymptotic properties of the resulting estimators. Furthermore, we suggest a two-step procedure for establishing confidence bands for the nonparametric additive functions. This procedure enables us to make global inferences. Numerical experiments indicate that SiAM works well with finite sample sizes, and is especially robust to model structures. That is, when the model reduces to either the single-index or the additive scenario, the estimation and inference results are comparable to those based on the true model, while when the model is misspecified, the superiority of our method can be very great.


Introduction
Because of the complexity of data sets in practice, there has been much interest in developing statistical analysis tools for problems involving high dimensional covariates. Examples of these models include additive models [AM, 13] and single index models [SiM, 15]. A common feature of these models is that they achieve dimension reduction [30] to circumvent the "curse of dimensionality" [1] while retaining the flexibility of nonparametric regression.
Additive models (AM) [AM, 13,39,40] and additive partially linear models [APLM, 13,17,19,25,32] corresponding to continuous response variables have been well studied in the literature [13]. The latter parsimoniously specifies the relationship between the response variable and some of the covariates in a linear function form, and the relationship between the dependent variable and the remaining covariates in a form of additive nonlinear unknown functions. The APLM enjoys the simplicity property of the linear model and the flexibility of the AM, due to the combination of parametric and nonparametric components. For estimation, [4] applied a backfitting procedure, proposed by [3], to approximate the additive components. [19] proposed to estimate the nonparametric components by polynomial splines [28,30]. After the spline basis is chosen, the coefficients can be estimated by least squares, leading to great gains computationally when contrasted with backfitting.
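The computational appeal of the spline approach of [19] is that, once the basis is chosen, the additive fit reduces to a single least-squares problem. The following Python sketch illustrates this (the paper's own code is in R; the truncated-power basis and the particular functions below are hypothetical stand-ins for a B-spline basis):

```python
import numpy as np

def spline_basis(x, knots, degree=3):
    # Truncated-power spline basis; a simple stand-in for a B-spline basis.
    cols = [x ** d for d in range(1, degree + 1)]
    cols += [np.clip(x - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0, 1, size=(n, 2))
true_mean = np.sin(2 * np.pi * X[:, 0]) + (X[:, 1] - 0.5) ** 2
y = true_mean + rng.normal(0, 0.1, n)

# One spline block per covariate plus a global intercept; after the basis
# is chosen, the whole additive fit is one least-squares problem.
knots = np.linspace(0.1, 0.9, 5)
D = np.column_stack([np.ones(n),
                     spline_basis(X[:, 0], knots),
                     spline_basis(X[:, 1], knots)])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
rmse = np.sqrt(np.mean((D @ coef - true_mean) ** 2))
```

No backfitting loop is needed: the additive structure is encoded in the block design matrix, which is the computational gain noted above.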
Although the AM and APLM are flexible and widely used for data exploration [13,14,35], their limitations are also evident from their relatively special structures. For instance, they can be used only for the additive case and are unable to reflect interactions of two or more variables, which we may encounter in the analysis of complex biomedical data.
Single index models (SiM), another attempt to gain dimension reduction, have attracted great attention for estimating a conditional mean function because they relax the restrictive assumptions imposed on parametric models of conditional mean functions, such as linear or generalized linear models [12,16], and therefore gain more flexibility. There are various estimation procedures for single-index models. See [15] for a comprehensive survey and various applications of single-index models. An advantage of the SiM and its various extensions over additive models is that it can take interactions of multiple variables into account, which are frequently encountered in the analysis of complex biomedical data, for example from gene regulatory networks, while still enjoying dimension reduction. However, SiM have their own limitations: they assume a common nonlinear structure for a linear combination of the predictors with different weights, and cannot reflect the nonlinear main effect of each predictor when this feature truly exists.
To overcome the limitations of AM and SiM while still enjoying their advantages, we propose SiAM, a combination of these two structures. However, such a combination immediately raises several concerns. (i) Under what assumptions is the resulting model identifiable? (ii) The resulting model contains two classes of nonparametric functions, the first for the single index part and the second for the individual components; whether they can be treated equally and whether similar criteria can be applied is unclear. (iii) Are the resulting models robust? That is, when the true model is a sole SiM or AM, do the hybrid model and its associated methods perform the same (or almost the same) as the true model and its associated methods? Technically, it is much more challenging to develop estimation and inference procedures for such a combination because of the complexity of the model structure. As a result, establishing theory for these procedures, such as asymptotic properties, is much more difficult.
In this paper, we address these concerns, and provide an alternative and more flexible tool for data exploration. To gain simplicity in the implementation, we apply spline approximation to estimate each unknown component function. This strategy has been applied in the recent literature on estimation and inference for semiparametric models [33,39]. Further, we suggest a two-step procedure for establishing confidence bands for the nonparametric additive functions. This procedure enables us to make global inference for the nonparametric functions.
The paper is organized as follows. Section 2 presents the modelling framework, and addresses model identifiability. Section 3 proposes estimation for the single index and nonparametric components. Section 4 establishes asymptotic properties for the resulting estimators. The asymptotic normality for index estimators and the rates of convergence for the nonparametric estimators are developed. Section 5 describes the two-step procedure and presents the simultaneous confidence band. Section 6 illustrates the numerical performance of the proposed method through simulation experiments. The last section provides remarks and discussions. All proofs are provided in an Appendix. Additional tables and graphs are also provided in the Supplemental Material [24].

The models and identifiability
To combine additive models and single index models, we propose single index additive models (SiAM), given by

E(Y | X) = g(X⊤α) + Σ_{j=1}^p m_j(X_j),   (1)

where X = (X_1, · · · , X_p)⊤ is the p-dimensional continuous covariate vector, X⊤α is referred to as the index component, and g(·), m_1(·), · · · , m_p(·) are unknown smooth functions. We briefly discuss the incorporation of discrete covariates in Section 7.
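A small draw from a SiAM-type mean function can make the structure concrete. The specific g, m_j and α below are hypothetical choices for illustration only, and the centering constraints introduced later in this section are not imposed on the m_j here:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 500
alpha = np.array([2.0, 1.0, 1.0, 0.0])
alpha /= np.linalg.norm(alpha)             # ||alpha|| = 1, alpha_1 > 0
g = np.exp                                 # index link g (hypothetical)
m = [np.sin, np.cos, np.square, np.abs]    # additive components m_1..m_4

X = rng.uniform(0, 1, size=(n, p))
mean = g(X @ alpha) + sum(mj(X[:, j]) for j, mj in enumerate(m))
Y = mean + rng.normal(0, 0.5, n)
```

The single summand g(X⊤α) carries interactions among all covariates with nonzero index weights, while each m_j carries a covariate-specific nonlinear main effect.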
Since single-index models and additive models are two special cases of SiAM, two obvious constraints for identifiability borrowed from these two special cases are ‖α‖ = 1 with α_1 > 0, and E{m_j(X_j)} = 0, respectively. However, these two constraints alone are not sufficient for identifiability. To see that identifiability can fail even with these two constraints, a simple example is that 2{(X_1 + X_2)/√2}^2 = 2X_1^2 + 2X_2^2 − 2{(X_1 − X_2)/√2}^2, so that (after centering) the same regression function can be written with two different index directions and two different sets of additive components. To achieve identifiability, we first decompose each m_j as the sum of a linear term and a term orthogonal to the space of linear functions. That is, we write m_j(x_j) = β_j x_j + m̃_j(x_j) with E{m̃_j(X_j)} = E{X_j m̃_j(X_j)} = 0. Collecting β = (β_1, . . . , β_p)⊤, with β orthogonal to α, the conditional expectation in an SiAM can be written as

E(Y | X) = g(X⊤α) + X⊤β + Σ_{j=1}^p m_j(X_j),   (2)

where each m_j now satisfies both moment constraints. We call (2) the canonical form of SiAM. It is natural to require the canonical form to be unique. The following theorem gives sufficient conditions for the parameters to be identified. Although these are not necessary conditions, in view of our previous counterexample these conditions are reasonably weak.
Theorem 2.1. Suppose each X_j has a density function supported on an interval S_j ⊆ ℝ and X has a joint positive density on the interior of ∏_j S_j. Consider (2), with ‖α‖ = 1, α_1 > 0, α⊥β, and E{m_j(X_j)} = E{X_j m_j(X_j)} = 0. The model is identified in either of the following two situations.

(i) g and m_j, j = 1, . . . , p, are second order differentiable, g″ ≢ 0 on the support of X⊤α, and α has at least three nonzero components.
(ii) g and m_j, j = 1, . . . , p, are second order differentiable, g″ is a nonconstant continuous function on the support of X⊤α, and α has at least two nonzero components.
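The canonical decomposition of m_j into a linear term and an orthogonal remainder can be checked numerically. The sketch below is a Monte Carlo illustration for one hypothetical component m_j(x) = x² − 1/3 with X_j ~ Unif(0, 1); the linear coefficient is the least-squares projection β_j = cov(X_j, m_j(X_j))/var(X_j), and the remainder satisfies both moment constraints:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 200_000)
mj = x ** 2 - 1.0 / 3.0                    # E{m_j(X_j)} = 0 for Unif(0,1)

beta_j = np.cov(x, mj)[0, 1] / np.var(x)   # linear part of m_j
m_tilde = mj - beta_j * (x - x.mean())     # centred orthogonal remainder

# For this m_j one can verify analytically that beta_j = 1 and that
# E{m_tilde} = E{X_j m_tilde} = 0; the Monte Carlo averages confirm it.
```

This is exactly the decomposition m_j(x_j) = β_j x_j + m̃_j(x_j) used to define the canonical form (2), with the linear parts collected into X⊤β.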

Estimation
The full SiAM model is

Y = g(X⊤α_0) + X⊤β_0 + Σ_{j=1}^p m_j(X_j) + ε,   (3)

with ‖α_0‖ = 1, α_{01} > 0, α_0⊥β_0, E{m_j(X_j)} = E{X_j m_j(X_j)} = 0, and ε the error term satisfying E(ε | X) = 0. In what follows, for any vector ζ = (ζ_1, . . . , ζ_s)⊤ ∈ ℝ^s, denote ‖ζ‖_∞ = max_{1≤ℓ≤s} |ζ_ℓ| and ‖ζ‖_2 = (|ζ_1|^2 + · · · + |ζ_s|^2)^{1/2}, and write A^⊗2 = AA⊤ for any matrix A. Let L(α, β, m) denote the population least squares criterion (4), and let ϕ_α(u) and Γ_α(u) be as defined in (5). For given α and m, the corresponding β which minimizes L(α, β, m) is given in (6). According to Theorem 2 of [38], by assuming that X has a joint positive density function on an open convex subset of ℝ^p, for given m the minimum of L(α, β, m) with α⊥β is attained uniquely at the true parameters α_0 and β_0 of model (3), where {ω(α)}^+ denotes the Moore-Penrose inverse of ω(α).

We approximate the nonparametric functions g(·) and m_j(·) by means of B-splines, and the estimators of α, β, g(·) and m_j(·) can be obtained by minimizing an objective function that is the sample analog of (4). However, obtaining those estimators simultaneously is computationally very challenging, because the estimates of the parameters and of the nonparametric functions intrinsically depend on each other. We therefore apply an iterative algorithm that minimizes the objective function with respect to one parameter vector while fixing the others. Such iterative algorithms are commonly used for estimation in partially linear single-index models (PLSiMs) and converge well [5,20,31,36,38]. When the nonparametric functions are given, estimation of the parameters follows the same procedure as in [38]. When the parameters are fixed, we estimate nonparametric additive functions instead of the single nonparametric univariate function in PLSiMs, but the computational convergence property should be the same. Let X_i = (X_ij, 1 ≤ j ≤ p)⊤ and Y_i be the i-th realization of X and Y. The estimation is achieved in three steps.

Step I. For given α and m_j(·), the estimate of β is obtained from a sample analog of (6).
We first estimate ϕ_α(u) and Γ_α(u) given in (5) by means of B-splines. Let B(u) = {B_1(u), . . . , B_K(u)}⊤ be a set of q-th order B-spline basis functions, where K is the number of basis functions, which increases with the sample size n; thus the number of interior knots is K − q. Moreover, the distance between neighboring knots satisfies the assumptions given in [26]. Then the spline estimators of ϕ_α(u) and Γ_α(u) are given as ϕ̂_α(u) = B(u)⊤λ̂ and Γ̂_α(u) = B(u)⊤Θ̂, where λ̂ and Θ̂ are spline coefficient estimates obtained from least squares with responses Y_i and X_i, respectively. Then the estimate of β is given by the sample analog of (6).

Step II. The estimate of α, denoted α̃, is obtained by minimizing the corresponding sample criterion, where m̂_j(·) and ĝ(·) are the spline estimates of m_j(·) and g(·) from the previous step of the iteration. Then we let α̂ = α̃/‖α̃‖.
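The spline regressions in Step I can be sketched in a few lines. In the toy below (illustrative only; a truncated-power basis stands in for B-splines, and g and α are hypothetical), the same design matrix B is regressed on Y to get ϕ̂_α and on the columns of X to get Γ̂_α:

```python
import numpy as np

def power_basis(u, knots, degree=3):
    # Truncated-power basis; a stand-in for the B-spline basis B(u).
    cols = [u ** d for d in range(degree + 1)]
    cols += [np.clip(u - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(3)
n, p = 3000, 3
alpha = np.array([2.0, 1.0, 2.0]); alpha /= np.linalg.norm(alpha)
X = rng.uniform(0, 1, size=(n, p))
U = X @ alpha
Y = np.sin(np.pi * U) + rng.normal(0, 0.2, n)

# One least-squares fit with multiple responses: Y gives phi_hat,
# the columns of X give Gamma_hat.
B = power_basis(U, np.quantile(U, [0.2, 0.4, 0.6, 0.8]))
H, *_ = np.linalg.lstsq(B, np.column_stack([Y, X]), rcond=None)
phi_hat = B @ H[:, 0]        # estimate of phi_alpha at the sample points
Gamma_hat = B @ H[:, 1:]     # estimate of Gamma_alpha, one column per X_j
```

Because both conditional-mean estimates share the basis B, the two fits reuse one design matrix, which keeps Step I cheap inside the iteration.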
Step III. For given α̂ and β̂, we estimate g(u) and m_j(x_j) by the spline estimates ĝ(u) = B(u)⊤γ̂ and m̂_j(x_j) = b_j(x_j)⊤δ̂_j, where {b_j(·)} are sets of basis functions for j = 1, . . . , p defined as follows. Let {B_j(·)} be sets of q-th order B-spline basis functions for j = 1, . . . , p. To ensure the identifiability constraints E{m_j(X_j)} = E{X_j m_j(X_j)} = 0, these basis functions are centered and orthogonalized against linear functions.

We iterate Steps I-III until convergence. The initial estimates α̂_ini, β̂_ini and ĝ_ini(·) of α, β and g(·) are obtained by fitting the partially linear single-index model Y = g(X⊤α) + X⊤β + ε*_1 by the method of [38]. The initial estimates of m_j are obtained by fitting the corresponding additive model.
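The iteration over α and the function estimates can be illustrated with a drastically simplified sketch. Below, for p = 2 we parametrize α = (cos t, sin t): for each candidate direction, g and a single m_1 are fitted by spline least squares (Step III), and the direction is then updated by minimizing the residual sum of squares (Step II). All function choices are hypothetical and the profiling over a scalar t replaces the paper's full iterative scheme:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def basis(u, knots=np.linspace(0.1, 0.9, 5), degree=3):
    z = (u - u.min()) / (u.max() - u.min() + 1e-12)  # rescale to [0, 1]
    cols = [z ** d for d in range(degree + 1)]
    cols += [np.clip(z - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(4)
n = 3000
X = rng.uniform(0, 1, size=(n, 2))
t0 = np.pi / 3                                       # true direction
Y = (np.exp(X @ np.array([np.cos(t0), np.sin(t0)]))
     + 0.3 * np.sin(2 * np.pi * X[:, 0])
     + rng.normal(0, 0.05, n))

def rss(t):
    # Fit g and m_1 jointly by least squares for a fixed direction t.
    a = np.array([np.cos(t), np.sin(t)])
    D = np.hstack([basis(X @ a), basis(X[:, 0])])
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return float(np.sum((Y - D @ coef) ** 2))

res = minimize_scalar(rss, bounds=(0.01, np.pi / 2 - 0.01), method="bounded")
t_hat = res.x
alpha_hat = np.array([np.cos(t_hat), np.sin(t_hat)])
```

The angle parametrization enforces ‖α‖ = 1 automatically; in higher dimensions the paper's normalization α̂ = α̃/‖α̃‖ plays the same role.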

Asymptotic properties
In this section, we study the large-sample properties of the SiAM estimators, which are obtained by minimizing the objective function in which B*(X_i) = {b_1(X_i1)⊤, . . . , b_p(X_ip)⊤}⊤ and δ = (δ_1⊤, . . . , δ_p⊤)⊤. For given α and β, the spline coefficient estimators are γ̂(α, β) and δ̂(α, β), and the corresponding estimators of the nonparametric functions are ĝ(u; α, β) and m̂_j(x_j; α, β). We obtain the estimators α̂ and β̂ of α and β as the minimizers of the profiled criterion. The spline estimators of the nonparametric functions g(u) and m_j(x_j) are then given as ĝ(u; α̂, β̂) and m̂_j(x_j; α̂, β̂), respectively.
We next introduce some notation and definitions. Let α_0 and β_0 be the true parameters in model (3). Define the Hilbert space H as a collection of additive functions with finite L_2 norm, where x = (x_1, . . . , x_p)⊤. Moreover, define the projection P onto H, write Z = (Z_1, . . . , Z_p)⊤, and let X̃ = X − P(X). For any positive sequences a_n and b_n, let a_n ≪ b_n denote that a_n/b_n = o(1), and let a_n ≍ b_n mean that lim_{n→∞} a_n/b_n = c for some nonzero constant c. Let r with r ≥ 2 be the smoothness order of the functions m_j(·) as given in Condition (C2) in the Appendix. Denote var(Y | X = x) = σ^2(x). We assume that α and β are in a neighborhood of α_0 and β_0.

Remark 1. The asymptotic expansion stated in Theorem 4.1 can be used to conduct inference on the parameters, such as constructing confidence intervals and Wald test statistics. We estimate the components of the expansion by their sample analogs, and the residuals are estimated accordingly; the covariance matrix Σ given in (14) is then estimated by its plug-in version. Let S_0 be the support of X⊤α_0. The following theorem presents the global convergence rates of the estimators of the nonparametric functions.

Theorem 4.2. Under Conditions (C1)-(C4) in the Appendix, and

Goal
Although the one-step spline approximation can quickly estimate multiple nonparametric functions, according to [29] no asymptotic distribution is available for the resulting estimators. In this section, we therefore adopt a refined two-step spline estimation procedure, as proposed for additive models in [27] and [21], based on which asymptotic confidence bands are constructed for global inference on the nonparametric functions.

Oracle nonparametric estimator
Next we describe the oracle estimator for m_j(x_j); the oracle estimator for g(u) can be defined analogously. By "oracle" we mean the estimation of one of the component functions of the SiAM model when all parameters and all the other functions are known.
Without loss of generality, we let j = 1. Assuming that g(u), m_j(x_j) for j ≥ 2, α_0 and β_0 are known, we rewrite model (3) as a univariate regression of a partial residual on X_1, as in (15). We then obtain the oracle estimator m̂_1^OR(x_1) of m_1(x_1) as the least squares spline estimator based on a set of B-spline basis functions of the same spline order as B(x_1) in (7), but with a different number L of basis functions.

We propose a smooth simultaneous confidence band (SCB) for m_1(·) by studying the asymptotic behavior of the maximum of the normalized deviation of the spline estimate. To construct asymptotic SCBs for m_1(·) over the interval x_1 ∈ S_1 with confidence level 100(1 − α)%, α ∈ (0, 1), we need to find two functions l_n(x_1) and u_n(x_1) such that

P{l_n(x_1) ≤ m_1(x_1) ≤ u_n(x_1) for all x_1 ∈ S_1} → 1 − α as n → ∞.   (16)

In practice, we consider a variant of (16) and construct SCBs over a finite subset S_{n,1} of S_1, with S_{n,1} becoming denser as n → ∞. Without loss of generality, we let S_1 = [a, b] for two finite numbers a and b, and partition [a, b] by N_n equally spaced points a = ξ_0 < ξ_1 < · · · < ξ_{N_n} = b; an asymptotic band over these points then follows from Theorem 5.1, where ε_i is given in (15).
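What a band over N_n grid points looks like can be illustrated with a simplified oracle-style sketch: a spline fit of m_1 is widened by a Šidák-type cutoff that treats the N_n = 20 grid points as independent Gaussians. This cutoff, the truncated-power basis, and the assumption of known error variance are all crude stand-ins for the paper's construction, not the paper's method:

```python
import numpy as np
from scipy.stats import norm

def basis(x, knots=np.linspace(0.1, 0.9, 5), degree=3):
    cols = [x ** d for d in range(degree + 1)]
    cols += [np.clip(x - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(5)
n, sigma = 1000, 0.2
x1 = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x1) + rng.normal(0, sigma, n)  # oracle partial residual

D = basis(x1)
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
cov = sigma ** 2 * np.linalg.inv(D.T @ D)             # coefficient covariance

grid = np.linspace(0.025, 0.975, 20)                  # N_n = 20 grid points
Bg = basis(grid)
se = np.sqrt(np.einsum("ij,jk,ik->i", Bg, cov, Bg))   # pointwise std. errors

# Sidak-type simultaneous cutoff over 20 points, versus pointwise 1.96.
Q = norm.ppf(0.5 + 0.5 * 0.95 ** (1 / 20))
lower = Bg @ coef - Q * se
upper = Bg @ coef + Q * se
```

The simultaneous cutoff exceeds the pointwise normal quantile, which is exactly the band-width inflation discussed in Remark 3 below.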

Remark 3.
Compared to the pointwise confidence interval with width 2z_{1−α/2}σ̂_n(x_1), the width of the confidence band (17) is inflated by the factor Q_{N_n}(α)/z_{1−α/2}.

Remark 4.
To construct the SCB based on Theorem 5.1, we propose a finite sample approximation scheme to compute the cutoff value Q_{N_n}(α).
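One generic way to approximate such a cutoff is to simulate the maximum absolute value of a Gaussian vector over the grid and take its (1 − α) quantile. The paper's exact scheme is not reproduced in this excerpt; in particular, the AR(1)-style correlation matrix below is a made-up placeholder for the correlation structure of the normalized spline estimator on the grid:

```python
import numpy as np

rng = np.random.default_rng(6)
N_n, alpha_lvl, reps = 20, 0.05, 20_000

# Placeholder correlation between grid points (geometric decay).
R = 0.5 ** np.abs(np.subtract.outer(np.arange(N_n), np.arange(N_n)))
L = np.linalg.cholesky(R)

# Simulate max |Z| over the grid and take the (1 - alpha) quantile.
maxima = np.abs(L @ rng.standard_normal((N_n, reps))).max(axis=0)
Q_hat = np.quantile(maxima, 1 - alpha_lvl)
```

The resulting cutoff necessarily exceeds the pointwise normal quantile and grows slowly with N_n, consistent with the inflation noted in Remark 3.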

Two-step estimator
Since g(u), m_j(x_j) for j ≥ 2, α and β are unknown in reality, we replace the true functions and parameters by their estimators ĝ(u), m̂_j(x_j) for j ≥ 2, α̂ and β̂ from Section 3 to obtain the two-step estimator of m_1(x_1), denoted m̂_1^SS(x_1). The following theorem gives the uniform efficiency of the two-step estimator m̂_1^SS(x_1).

Remark 5.
Based on the uniform rate given in Theorem 5.2, the difference between the two-step estimator and the oracle estimator is asymptotically negligible, so the asymptotic 100(1 − α)% confidence band for m_1(x_1) is given by the band in (17) with m̂_1^OR replaced by m̂_1^SS. Moreover, to obtain the result in Theorem 5.2, we need the spline estimator in the first step to be undersmoothed, with the number of basis functions satisfying L ≫ n^{1/(2r+1)}. The number of basis functions in the second step has the optimal order L* ≍ n^{1/(2r+1)}.

Remark 6. As suggested by a reviewer, since the effect of any single covariate appears in both the single-index part and the additive part, it is of interest to make inferences on the function
where x*_2, · · · , x*_p are fixed values of x_2, . . . , x_p. By analogy with the inference for m_1 above, we can construct confidence bands for (19) using the plug-in estimator ĥ. However, unlike the previous inference for m_1, a theoretical investigation requires the joint asymptotics of different estimated components; we are not able to provide a rigorous justification in this paper and only demonstrate this using simulations.

Simulations
We conducted simulation studies to investigate the finite sample performance of SiAM, the additive model (AM) and the partially linear single-index model (PLSiM). The algorithm is implemented in R and the code can be obtained from the second author upon request. The first example, Example 1, involves four covariates. The generating model is not presented in the canonical form; the associated canonical form (2) can be found numerically. For example, we find by numerical integration that β = (−1.236, 2.034, 0.938, −2.313)⊤.
We generated the covariates from a multivariate Gaussian distribution with cov(X_ij, X_ij′) = ρ^{|j−j′|} and marginally transformed the covariates into [0, 1] using the cumulative distribution function of the standard normal distribution. We use cubic B-splines with the number of internal knots equal to ⌊n^{1/9}⌋, which is the theoretically optimal order when using cubic splines, where ⌊a⌋ denotes the largest integer no greater than a. Although it is possible to choose the number of internal knots in a more data-adaptive way, such a fixed choice is much more convenient; even a small number of internal knots can provide a flexible fit to various functions, and this strategy is commonly adopted in the literature on regression splines [21,27,34].
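The covariate-generation step and the knot-count rule are easy to reproduce. The following Python sketch (the paper's own code is in R) draws correlated Gaussians with covariance ρ^{|j−j′|}, maps them to [0, 1] through the standard normal CDF, and computes ⌊n^{1/9}⌋:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
n, p, rho = 500, 4, 0.2
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X = norm.cdf(Z)                          # each margin is Unif(0, 1)

n_knots = int(np.floor(n ** (1 / 9)))    # internal knots for cubic splines
```

For n = 500 the rule gives a single internal knot, and for n = 1000 it gives two, which is why the fixed choice is so convenient in practice.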
We let ρ = 0.2 and choose n ∈ {200, 500, 1000} and σ = 0.5, a total of 3 settings. In each setting, 100 data sets are generated and fitted using SiAM. First we consider the estimation of standard errors for the parameters α and β. It is easy to obtain standard error estimates based on the asymptotic normality results. On each generated data set we compute an estimate of the standard errors, and the averages of these over the 100 data sets are reported in Table 1, on rows indicated by "Est". The sample standard errors of the estimated parameter values based on the 100 data sets are taken as the empirical standard errors, reported on rows indicated by "Emp". It is seen that the estimated standard errors are reasonably close to the empirical values, especially for large sample sizes.
For an illustration of the construction of the confidence band, Figure 1 and Supplemental Material Figures 6 and 7 show the 95% confidence bands obtained on one data set with σ = 1, for the functions g, m_1, . . . , m_4, as well as for h defined in (19) with x*_2 = · · · = x*_p = 1/2. To construct these bands, except for h, which only uses a one-step estimator, we use 2⌊n^{1/7}⌋ + 1 internal knots for the first-stage estimator and ⌊n^{1/9}⌋ + 1 internal knots in the second stage, a form similar to that recommended in previous works such as [27]. We set N_n = 20. To investigate the coverage of the confidence bands, 500 data sets are generated in each parameter setting and the results are reported in Table 2. The coverage improves with sample size. The more severe under-coverage of the band for m_1 with n = 200 is possibly due to the relatively large estimation bias at this sample size, which can also be seen, for example, in Figures 3-5 of the Supplemental Material.

We now use this example to illustrate that the performance critically depends on the choice of a good initial estimator. As mentioned before, our initial estimators for α and β are obtained as in [38]. Under the setting n = 500, we add independent normal perturbation errors with standard deviation σ_p to these initial estimators. The trajectories of the estimation error ‖α̂ − α‖_2/p + ‖β̂ − β‖_2/p over the iterations are shown in Figure 2, together with the results using the unperturbed initial estimator based on [38]. It is seen that even with a small perturbation σ_p = 0.1 the results become worse and do not seem to converge to the correct values. Thus a good selection of initial values is important in the estimation, and one may even use multiple starting values to safeguard it.
The Supplemental Material has additional results. We report the estimation errors of α, β, g, m_1, . . . , m_4 in Table 4 (the quantities here refer to those in the canonical form). For α the estimation error is defined as ‖α̂ − α‖, and similarly for β; for g it is defined by an integrated squared error criterion. From Table 4 of the Supplemental Material, we see that as the sample size increases or the noise decreases, the estimation errors become smaller, as expected. For σ = 1, Figures 3-5 of the Supplemental Material show the estimated nonparametric curves for 20 generated data sets, together with the truth (solid red curve), to visually illustrate how the estimation accuracy improves with sample size.
Our next aim is to compare the performance of SiAM with two of its special cases, PLSiM and AM. We generated data from the following three examples. Examples 2 and 3 correspond to PLSiM and AM, respectively, and Example 4 represents a more general model that is actually not within the class of SiAM.
We consider the same parameter settings as before and in each case generate 100 data sets. Whichever example is the true generating model, we fit the data using the three different models: SiAM, PLSiM and AM. Of course we expect the estimation to be best when the model used in fitting matches the true generating model. However, comparing estimation errors across different models is generally not appropriate. For example, if the true model is SiAM while an AM is applied in model fitting, the estimator is expected to be consistent for the "best approximation" of the SiAM by an AM, which is not necessarily the additive part of the true generating model. In particular, it is difficult to determine numerically what quantity the AM is estimating when the true model is SiAM.
Thus we compare the performance of the different methods in terms of prediction accuracy, by independently generating 500 observations from the true model. The prediction error is defined as the average of (Ŷ_i − Y_i)^2 over these observations, where Ŷ_i is the predicted response value and Y_i is the generated true response. The prediction errors are reported in Table 3. Among the three fitting methods, the prediction errors are smallest when the true model is used in data fitting. However, the prediction errors for SiAM are close to those of the best fitting model in Examples 2 and 3, and much smaller than those of AM and PLSiM in Example 1. This illustrates that the cost of overfitting with a more flexible model is relatively small compared to the cost of misspecification. Finally, in Example 4, for which all fitting models are misspecified, SiAM still has by far the smallest prediction error, which is most evident in the low-noise setting.
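The prediction-error comparison can be reproduced in miniature. In the hedged sketch below, data come from a non-additive truth (an index-type interaction, chosen arbitrarily for illustration), and a crude additive fit (a quadratic additive design standing in for an AM) is compared with the oracle regression function on a held-out test set:

```python
import numpy as np

rng = np.random.default_rng(7)

def truth(X):
    return np.exp(X[:, 0] + X[:, 1])       # non-additive mean function

def design(X):
    # Additive quadratic design: a crude stand-in for an additive model.
    return np.column_stack([np.ones(len(X)), X, X ** 2])

Xtr = rng.uniform(0, 1, (2000, 2))
Ytr = truth(Xtr) + rng.normal(0, 0.2, 2000)
Xte = rng.uniform(0, 1, (500, 2))
Yte = truth(Xte) + rng.normal(0, 0.2, 500)

coef, *_ = np.linalg.lstsq(design(Xtr), Ytr, rcond=None)
pe_additive = np.mean((design(Xte) @ coef - Yte) ** 2)  # misspecified fit
pe_oracle = np.mean((truth(Xte) - Yte) ** 2)            # irreducible error
```

The misspecified additive fit pays for the interaction it cannot represent, mirroring the gap between AM and the true model reported in Table 3.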

Discussion
In this paper, we have proposed a new model, SiAM, that combines the additive model (AM) and single index model (SiM), and have developed statistical theory for model identifiability. We have further developed a two-step procedure for making global inferences for nonparametric functions. In brief, the model and the proposed methods have the following properties: (i) the estimators of the index parameters have been shown to be asymptotically normal, and the estimators of the nonparametric functions have optimal rates of convergence; (ii) the two-step estimators have the oracle property; (iii) the proposed methods show promising performance in finite sample situations; and (iv) the implemented algorithm is computationally efficient. Using regression splines, the implementation of the method is much simpler than that of the backfitting-based or profile-based estimation. Because SiAM contains the single index component and additive components, it can detect interactions among the covariates as well as uncover possible nonlinear main effects, while SiM or AM can only achieve one or the other. Thus, the proposed model is more flexible than those two models.
As a starting point, SiAM can be used for flexible exploratory analysis. If no main effect is detected, one may simplify the model to a SiM, and if the single index component is not significant, one may reduce the model to an additive one. It appears possible that SiAM can be modified by using penalization to develop a variable selection procedure for identifying which elements should enter in the index component, and which ones can be treated as additive components: we will consider this in future work. Moreover, as a future research topic, it would be interesting to develop a method for adaptively choosing among the AM, SiM and our proposed SiAM. Such a model selection problem can be tackled by the "structural adaptation" strategy proposed in [11]. The detailed method and the associated theory need further investigation.
In this article, we have focused on modeling with fixed dimensional covariates. It is of interest to extend the methods to high dimensional SiAM. However, the theory and implementation of such an extension is much more complicated and warrants further study.
In this article we developed our theory assuming that all predictors are continuous. If some predictors are discrete, they can be added to the linear part of the model: it is straightforward to extend our theory to this slightly more general case, with a few complications of notation only.

Software
The algorithm is implemented in R and the code can be obtained from the second author upon request.

A.1. An example about identifiability
Here we give an example illustrating that if data are generated from a SiAM model but an additive model is used to fit the data, the results can be totally misleading.
Suppose the data are generated from a SiAM whose covariates are distributed as Unif(0, 1). If we use an additive model to fit these data instead of a SiAM, we can examine exactly what is obtained.
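The mismatch can be made numerically concrete. In the hypothetical instance below (our own choice of g, m_1 and α, noiseless for clarity), an additive fit in the X_1 direction targets the conditional mean E(Y | X_1 = x), which differs visibly from the true component m_1:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200_000
X = rng.uniform(0, 1, (n, 2))
a = np.array([1.0, 1.0]) / np.sqrt(2)     # hypothetical index direction
m1 = lambda x: np.sin(2 * np.pi * x)      # hypothetical additive component
Y = (X @ a) ** 2 + m1(X[:, 0])            # SiAM-type truth, no noise

# What an additive model targets in the X_1 direction: E[Y | X_1 = x],
# approximated here by binning X_1 into 20 cells.
bins = np.linspace(0, 1, 21)
idx = np.digitize(X[:, 0], bins) - 1
cond_mean = np.array([Y[idx == b].mean() for b in range(20)])
centers = (bins[:-1] + bins[1:]) / 2

# Maximum discrepancy between the (centred) additive target and m_1.
gap = np.max(np.abs((cond_mean - cond_mean.mean()) - m1(centers)))
```

Even with infinite data the additive fit would recover E(Y | X_1 = x) up to centering, not m_1, because part of the index effect (X⊤α)² is projected onto the X_1 direction; the discrepancy `gap` is of order one here, not a small-sample artifact.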

A.2. Assumptions
Denote the space of r-th order smooth functions as C^(r)[0,1] = {ϕ : ϕ^(r) ∈ C[0,1]}. Let C^{0,1}(X_w) be the space of Lipschitz continuous functions on X_w, equipped with the C^{0,1}-norm ‖ϕ‖_{0,1}. To establish the consistency and asymptotic normality of the proposed estimators, we need the following regularity conditions.
(C1) The density function f_{U(α)}(·) of the random variable U(α) = X⊤α is bounded away from 0 on S_U, and f_{U(α)}(·) ∈ C^{0,1}(S_U) for α in a neighborhood of α_0, where S_U = {X⊤α : X ∈ S} and S is a compact support set of X. The density function f_{X_j}(x_j) of the random variable X_j is bounded away from 0 on the support S_j of X_j for j = 1, . . . , p.
(C2) The nonparametric functions satisfy g ∈ C^(r)(S_U) and m_j ∈ C^(r)(S_j), 1 ≤ j ≤ p, for some integer r ≥ 2, and the spline order q satisfies q ≥ r.
(C3) The conditional variance function var(Y | X = x) = σ^2(x) is measurable and bounded above by C_σ, for some constant 0 < C_σ < ∞.
(C4) The functions h_0 and h_j given in (12) belong to C^{0,1}.
Conditions (C1)-(C4) are commonly used in the nonparametric smoothing literature; for example, see [7] and [41].