Smoothing Spline ANOVA Models: R Package gss

This document provides a brief introduction to the R package gss for nonparametric statistical modeling in a variety of problem settings including regression, density estimation, and hazard estimation. Functional ANOVA (analysis of variance) decompositions are built into models on product domains, and modeling and inferential tools are provided for tasks such as interval estimates, the “testing” of negligible model terms, the handling of correlated data, etc. The methodological background is outlined, and data analysis is illustrated using real-data examples.


Introduction
Nonparametric function estimation using stochastic data, otherwise known as smoothing, has been studied by generations of statisticians. While scores of methods have proved successful for univariate smoothing, ones practical in multivariate settings number far less. Smoothing spline ANOVA models are a versatile family of smoothing methods that are suitable for both univariate and multivariate problems.
In this article, we introduce the package gss for R (R Core Team 2014), which embodies suites of functions implementing smoothing spline ANOVA models in the settings of Gaussian and non-Gaussian regression, density estimation, and hazard estimation. The first public release of gss dates back to 1999, when the total number of R packages on CRAN, the Comprehensive R Archive Network, was in the dozens. The package was originally designed as a front end to RKPACK (Gu 1989), a collection of Ratfor routines for Gaussian regression. Over the years, new functionalities have been added, numerical efficiency improved, and the user interface refined; gss has now matured into a comprehensive package that can be used in a great variety of problem settings. As active development tapers off, gss is likely to remain stable in the foreseeable future, and it is time to compile an introductory document for the current version of the package.
A treatise on the theory and practice of smoothing spline ANOVA models can be found in a recently updated monograph by the author (Gu 2013), which contains detailed discussions of gss with extensive code illustrations. This document is intended as a software-oriented executive summary of the monograph, and for the exposition of the key methodological ingredients, technical discussions are unavoidable. Attempts have been made to keep the technical discussions lucid if not rigorous, yet readers unfamiliar with the methodology may still find the contents more mathematical than they might prefer.
The rest of the article is organized as follows. In Section 2, the methodological background is outlined, and in Section 3, the basic functionality of the gss suites is introduced with limited illustrations. Model configurations via explicit and implicit means are described in Section 4, and extra features such as semiparametric models and mixed-effect models are discussed in Section 5. After a brief discussion of the numerical engines in Section 6, three real-data examples are presented in Section 7 to demonstrate data analysis using gss facilities. Section 8 notes some further software utilities without detailed elaboration, and Section 9 concludes the article with a brief summary.
Chunks of executable R code are collected in the supplementary material. The output of the code, which includes a few figures, is however not reproduced in this document.

Methodological background
Observing Y_i ∼ N(η(x_i), σ²), i = 1, . . . , n on x_i ∈ [0, 1], one may estimate η(x) via the minimization of a penalized least squares functional,

$$\frac{1}{n}\sum_{i=1}^{n}\big(Y_i - \eta(x_i)\big)^2 + \lambda\int_0^1\big(\eta''(x)\big)^2\,dx. \tag{1}$$

The minimization takes place in the function space {f : ∫_0^1 (f''(x))² dx < ∞} of infinite dimension, and the solution is known as a smoothing spline. The minimizer of (1) is a piecewise cubic polynomial, so it is called a cubic smoothing spline.
The method in (1) has been generalized from univariate Gaussian regression to a variety of estimation problems on generic domains, which include Gaussian and non-Gaussian regression, density estimation, and hazard estimation. Key ingredients of the methodology are outlined below.
Smoothing splines are not to be confused with penalized regression splines, for which estimation takes place in a finite-dimensional function space pre-configured using basis functions of local support. Penalized regression splines are numerically more vulnerable to the curse of dimensionality, as the number of local-support basis functions explodes when the dimension of the domain goes up.

Penalized likelihood estimation
The penalized least squares functional in (1) is a special case of the general penalized likelihood functional,

$$L(\eta) + \lambda J(\eta), \tag{2}$$

where L(η) is usually taken as the minus log likelihood of the observed data and J(η) is a quadratic "roughness" functional; the minimization of (2) takes place in H = {f : J(f) < ∞}. With different configurations of L(η) and J(η), the minimizer of (2) may or may not be a piecewise polynomial, but it is called a smoothing spline for being the solution to a penalized minimization problem.
Function evaluations η(x) appear in L(η), so the evaluation functional [x]f = f(x) should be continuous in H. A space in which the evaluation functional is continuous is known as a reproducing kernel Hilbert space possessing a reproducing kernel R(x, y), a nonnegative-definite function satisfying the reproducing property, ⟨f, R(x, ·)⟩ = f(x), ∀f ∈ H, where ⟨·, ·⟩ is the inner product in H.
J(f) is usually a square semi norm in H, with a null space N_J = {f : J(f) = 0} of finite dimension, and a square full norm in an orthogonal complement H_J of N_J, which possesses its own reproducing kernel R_J(x, y). When L(η) depends on η only through the evaluations η(x_i), i = 1, . . . , n, as is the case in (1), the minimizer of (2) resides in the space N_J ⊕ span{R_J(x_i, ·), i = 1, . . . , n}, of finite dimension (Kimeldorf and Wahba 1971).
For (1), N_J = span{1, x}, and an orthogonal complement of N_J is H_J = {f : ∫_0^1 f dx = ∫_0^1 f′ dx = 0, ∫_0^1 (f″)² dx < ∞}.

Functional ANOVA decomposition
On a product domain X = X₁ × X₂, a function η(x) = η(x_⟨1⟩, x_⟨2⟩) admits the two-way functional ANOVA decomposition

$$\eta = \{A_1 + (I - A_1)\}\{A_2 + (I - A_2)\}\,\eta = \eta_\emptyset + \eta_1 + \eta_2 + \eta_{12}, \tag{3}$$

where I is the identity operator, A₁, A₂ are averaging operators acting respectively on arguments x_⟨1⟩, x_⟨2⟩ that satisfy A1 = 1, η_∅ is the constant term, η₁, η₂ are the main effects, and η₁₂ is the interaction; subscripts in brackets denote coordinates of a point on a multi-dimensional domain while ordinary subscripts are reserved for multiple points. Examples of averaging operators include Af = ∫_a^b f(x)dx/(b − a) and Af = Σ_{i=1}^m f(x_i)/m. The two-way ANOVA decomposition of (3) also implies a one-way ANOVA decomposition on X, η = η_∅ + η_x, where η_x = η₁ + η₂ + η₁₂ for η₁, η₂, and η₁₂ as in (3). Similar constructions in more than two dimensions can be done directly or recursively.
Selective term elimination in ANOVA decompositions helps to combat the curse of dimensionality in estimation; it also facilitates the interpretation of the fitted models. Models containing only main effects are known as additive models (Hastie and Tibshirani 1990).

Efficient approximation
As noted earlier, the minimizer of (2) often resides in N_J ⊕ span{R_J(x_i, ·), i = 1, . . . , n}, permitting numerical computation, but the computation is generally of the order O(n³). When L(η) depends on η via more than a finite number of function evaluations, however, the exact solution is usually intractable numerically.
Under mild conditions, as n → ∞ and λ → 0, the minimizer η̂ of (2) in H converges to the true η₀ at a rate U(η̂ − η₀) = O_p(n⁻¹λ^{−1/r} + λ^p), where U(η − η₀) is a setting-specific quadratic loss, r > 1 codes the "smoothness" of functions in H implied by J(f), and p ∈ [1, 2] depending on how smooth η₀ is; in the setting of (1), U(g) = ∫_0^1 g²(x)f(x)dx for f(x) the limiting density of {x_i}, r = 4, and p = 2 when η₀ is sufficiently smooth. Now consider the space

$$\mathcal{H}^* = \mathcal{N}_J \oplus \mathrm{span}\{R_J(z_j, \cdot),\ j = 1, \ldots, q\}, \tag{4}$$

where {z_j} is a random subset of {x_i}. The minimizer η̂* of (2) in H* converges to η₀ at the same rate as η̂ when qλ^{2/r} → ∞, hence is an efficient approximation of η̂. The optimal convergence rate O_p(n^{−pr/(pr+1)}) is achieved at λ ≍ n^{−r/(pr+1)}, so one needs q ≍ n^{2/(pr+1)+ε}, ∀ε > 0. One may calculate η̂* for practical estimation, as is done in most of the gss suites, and the computation is of the order O(nq²).
Unlike penalized regression splines, the space H * is data-adaptive rather than pre-configured, with resources allocated only where needed. This does not make the estimation problem any easier, but does make the numerical tasks much more manageable on high-dimensional domains.
Unless q = n, the minimizer of (1) in H * is no longer a piecewise cubic polynomial, but we will abuse the terminology and still call it a cubic spline.

Cross-validation
For the method in (2) to work in practice, a critical issue is the proper selection of smoothing parameters, the λ in front of J(η) and the θ's hidden in J(η) for tensor product splines.
For Gaussian regression as in (1), write Y = (Y₁, . . . , Yₙ)ᵀ and Ŷ = (η̂_λ(x₁), . . . , η̂_λ(xₙ))ᵀ, where the dependence of η̂ = η̂_λ on λ is spelled out. One has Ŷ = A(λ)Y, where A(λ) is the so-called smoothing matrix and the argument λ also represents the θ's if present. One may select the smoothing parameters by minimizing the generalized cross-validation score of Craven and Wahba (1979),

$$V(\lambda) = \frac{n^{-1}\mathbf{Y}^\top\big(I - A(\lambda)\big)^2\mathbf{Y}}{\big[n^{-1}\,\mathrm{tr}\big(I - \alpha A(\lambda)\big)\big]^2} \tag{5}$$

for α = 1, which is designed to track the mean square error n⁻¹Σ_{i=1}^n (η̂_λ(x_i) − η₀(x_i))². Cross-validation occasionally yields severe undersmoothing, and a fudge factor α = 1.4 proves to be effective in curbing undersmoothing on "bad" samples while maintaining the generally favorable performances of cross-validation on "good" ones.
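For concreteness, the score (5) can be computed directly from a smoothing matrix; the function below is a sketch for illustration only and is not part of gss.

R> gcv <- function(y, A, alpha = 1) {
+      n <- length(y)
+      r <- (diag(n) - A) %*% y                  # residuals (I - A(lambda)) Y
+      sum(r^2) / n / (sum(1 - alpha * diag(A)) / n)^2
+  }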

Bayesian confidence intervals
The square norm J(f ) in H J and the reproducing kernel R J (·, ·) are "inverses" of each other, and λJ(η) in (1) acts like the minus log likelihood of a Gaussian process prior with E η(x) = 0 and E η(x)η(y) ∝ R J (x, y). The minimizerη yields the posterior mean under such a prior and one may also calculate the posterior standard deviation s(x), yielding Bayesian confidence intervals,η(x) ± 1.96 s(x), for η(x) (Wahba 1983). Doing the analysis on individual terms in an ANOVA decomposition, one gets the component-wise intervals (Gu and Wahba 1993).
When L(η) is non-quadratic, such as with non-Gaussian regression, one may consider its quadratic approximation at η̂, Q_η̂(η); the approximate posterior minus log likelihood Q_η̂(η) + λJ(η) is Gaussian with mean η̂, based on which approximate Bayesian confidence intervals can be constructed.

Kullback-Leibler projection
To "test" the negligibility of an ANOVA term, one may compare η ∈ H₀ against η ∈ H₀ ⊕ H₁. Lacking sampling distributions in settings with infinite-dimensional nulls, the classical testing approach is of little help in this situation. Instead, an approach based on the Kullback-Leibler geometry was developed in Gu (2004): one calculates an estimate η̂ ∈ H₀ ⊕ H₁, obtains its Kullback-Leibler projection η̃ ∈ H₀ by minimizing a setting-specific KL(η̂, η) over η ∈ H₀, then inspects an "entropy decomposition," KL(η̂, η_c) = KL(η̂, η̃) + KL(η̃, η_c), an exact or approximate identity, where η_c is a degenerate fit such as a constant regression function or a uniform density. When KL(η̂, η̃)/KL(η̂, η_c) is small, one loses little by cutting out H₁.
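In gss, this is carried out by the project methods noted in the next section. A schematic call follows, with fit a fitted model containing terms x1, x2, and x1:x2; the name of the component holding the KL ratio is assumed here.

R> project(fit, include = c("x1", "x2"))$ratio     # small ratio: x1:x2 is negligible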

Basic functionality: Estimation and inference
Listed in Table 1 are gss suites with brief descriptions. To be presented in this section are their basic user interfaces with core arguments and some defaults. The syntax of the gss suites has been designed to resemble that of the lm and glm suites in base R. Each suite has a fitting function and utility functions for the evaluation of the fits, and most also have a method project implementing the Kullback-Leibler projection or a variant thereof.

Suite       Purpose/Feature
ssanova     Gaussian regression
ssanova9    Gaussian regression, with correlated data
ssanova0    Gaussian regression, legacy routine
gssanova    Non-Gaussian regression, direct cross-validation
gssanova1   Non-Gaussian regression, indirect cross-validation
gssanova0   Non-Gaussian regression, legacy routine with indirect CV
ssden       Density estimation
ssden1      Density estimation, in high dimensions
sscden      Conditional density estimation
sscden1     Conditional density estimation, for large samples
ssllrm      Conditional density estimation, with discrete y in f(y|x)
sshzd       Hazard estimation, with or without covariate
sshzd1      Hazard estimation, with continuous covariate
sscox       Estimation of relative risk in proportional hazard models

Table 1: gss suites and their purposes/features.

Regression suites
For Gaussian regression, one may use ssanova or ssanova0. For non-Gaussian regression, one may use gssanova, gssanova1, or gssanova0. For Gaussian regression with correlated data, one may use ssanova9.

ssanova
The following sequence loads the NOx data included in gss and calculates the minimizer of (1) in H * of (4), with λ minimizing V (λ) in (5) for α = 1.4.
R> data("nox", package = "gss") R> fit <-ssanova(log(nox)~equi, data = nox) The default α = 1.4 is based on simulations of limited scales (Kim and Gu 2004), and may be overridden via an argument alpha. One may also choose to select λ using the GML score of Wahba (1985) through method = "m". To evaluate the fit on a grid, one may try where est is a list withη(x) in est$fit and s(x) in est$se. The fitted curve can then be plotted along with the Bayesian confidence intervals.
R> plot(nox$equi, log(nox$nox), col = 3)
R> lines(xx, est$fit)
R> lines(xx, est$fit + 1.96 * est$se, col = 5)
R> lines(xx, est$fit - 1.96 * est$se, col = 5)

The default q in (4) is set at 10n^{2/9}, again based on simulations (Kim and Gu 2004), and can be overridden via an argument nbasis. Due to the random selection of {z_j} in (4), repeated calls to ssanova generally return slightly different fits unless q = n, but one may ensure the same selection of {z_j} by using an argument seed, or one may pass the same {z_j} into subsequent calls through id.basis = fit$id.basis.
To fit an additive model, use a model formula of the form y ~ x1 + x2; to add the interaction, use y ~ x1 * x2.

gssanova and gssanova1
Non-Gaussian regression models can be fitted using gssanova or gssanova1, which calculate the minimizer of (2) in H* of (4); gssanova uses direct cross-validation for smoothing parameter selection, while gssanova1 employs the indirect cross-validation of Gu (1992a). Discussions concerning direct and indirect cross-validation can be found in Gu (2013, Section 5.2). Practical performances of gssanova and gssanova1 for different distribution families were compared in Gu (2013, Sections 5.4, 8.6) via limited simulations.
The syntax is similar to that of glm in base R, but the argument family only takes character strings from the list "binomial", "poisson", "Gamma", "inverse.gaussian", "nbinomial", "weibull", "lognorm", and "loglogis"; only one link is used for each family, one that is free of constraint. The likelihood term L(η) in (2) is of the form

$$L(\eta) = \frac{1}{n}\sum_{i=1}^{n} l\big(\eta(x_i);\,Y_i\big),$$

where l(η; y) is the minus log likelihood of a single observation. The link and the minus log likelihood l(η; y) for the exponential families binomial, Poisson, Gamma, and inverse Gaussian are listed in Table 2, along with those of the negative binomial family. For the binomial family, the response may be two columns of (y_i, m_i − y_i), or a column of y_i/m_i with m_i entered via an optional argument weights. For the negative binomial family, the response is either two columns of (y_i, ν_i), or a column of y_i with a common ν to be estimated along with η(x).
The following sequence generates some synthetic binomial data and calculates a cubic spline logistic fit; gssanova fits can be evaluated using predict, with η̂(x) and s(x) returned on the link scale, and the fit can then be plotted along with Bayesian confidence intervals on the probability scale.
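A minimal sketch follows; the data-generating mechanism and the evaluation grid are arbitrary choices made here for illustration.

R> x <- runif(200)
R> p <- plogis(sin(2 * pi * x))                     # true success probability
R> y <- rbinom(200, 5, p)                           # binomial counts of size 5
R> fit <- gssanova(cbind(y, 5 - y) ~ x, family = "binomial")
R> xx <- seq(0, 1, length = 101)
R> est <- predict(fit, data.frame(x = xx), se = TRUE)  # eta and se on the logit scale
R> plot(x, y / 5, col = 3)
R> lines(xx, plogis(est$fit))
R> lines(xx, plogis(est$fit + 1.96 * est$se), col = 5)
R> lines(xx, plogis(est$fit - 1.96 * est$se), col = 5)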
The Weibull, log normal, and log logistic families are the same accelerated life models as implemented in the survreg suite of the survival package (Therneau and Grambsch 2000; Therneau 2014). The response is of the form Y = (X, δ, Z), where X = min(T, C) is the follow-up time for T the lifetime of an item and C the right-censoring time, δ = I_[T≤C] is the censoring indicator, and Z < X is a possible left-truncation time at which the item enters surveillance. Of interest is the estimation of the hazard function of T as it depends on a covariate u. For the accelerated life models, log T follows a location-scale family distribution, with density σ⁻¹f((log t − µ)/σ) and survival function 1 − F((log t − µ)/σ), where f(z) is a probability density on (−∞, ∞) and F(z) is the associated distribution function; the Weibull family has f(z) = we^{−w} for w = e^z, the log normal family has f(z) = φ(z), and the log logistic family has f(z) = w/(1 + w)² for w = e^z. One is to estimate the location parameter η = µ as a function of the covariate u, and up to terms not involving η, the minus log likelihood of Y = (X, δ, Z) is given by

$$-\delta\,\log f\!\Big(\frac{\log X - \eta}{\sigma}\Big) - (1 - \delta)\,\log\Big(1 - F\Big(\frac{\log X - \eta}{\sigma}\Big)\Big) + \log\Big(1 - F\Big(\frac{\log Z - \eta}{\sigma}\Big)\Big).$$

Instead of a Surv(...) construct for survreg, the response for gssanova should be a matrix of two or three columns, with X_i in the first column, δ_i in the second, and Z_i in an optional third. The scale parameter σ is assumed to be common and is estimated along with η(u), in the form of ν = σ⁻¹.
The following sequence loads the Stanford heart transplant data included in gss and fits a Weibull regression, with the age at transplant as the covariate.
R> data("stan", package = "gss")
R> fit <- gssanova(cbind(futime, status) ~ age, data = stan, family = "weibull")
R> est <- predict(fit, stan, se = TRUE, inc = "age")
R> ix <- order(stan$age)                            # ordering for plotting the fitted curve

ssanova0 and gssanova0
The original ssanova and gssanova referred to in Gu (2002) have been renamed ssanova0 and gssanova0, which calculate the minimizer of (2) in H using the legacy RKPACK routines. The numerical treatments in RKPACK take advantage of a special structure that only holds for q = n, and the O(n³) algorithms in RKPACK often execute faster than the O(nq²) algorithms powering ssanova, gssanova, and gssanova1. Repeated calls to ssanova0 or gssanova0 yield identical results, as q = n in this setting. Unfortunately, important modeling tools such as the Kullback-Leibler projection and the mixed-effect models (see the section on optional arguments) are numerically incompatible with the algorithms implemented in RKPACK, and an α > 1 in (5) is difficult to insert in the RKPACK routines. With limited capabilities compared to the alternatives, these legacy suites are largely obsolete.

ssanova9
For Gaussian regression with correlated data, Y ∼ N(η(x), σ²W⁻¹), where Y = (Y₁, . . . , Yₙ)ᵀ and η(x) = (η(x₁), . . . , η(xₙ))ᵀ, one may set

$$L(\eta) = \frac{1}{n}\big(\mathbf{Y} - \eta(\mathbf{x})\big)^\top W\,\big(\mathbf{Y} - \eta(\mathbf{x})\big)$$

in (2), where W may depend on a few correlation parameters. One may use ssanova9 in this setting, with the smoothing parameters selected along with correlation parameters by a cross-validation score derived in Han and Gu (2008).
An argument cov specifies W in ssanova9 calls. For W⁻¹ known, use cov = list("known", w), where w contains W⁻¹. For longitudinal observations, use cov = list("long", id), where id is a factor; when data at the same id levels are grouped together in Y, W⁻¹ = I + γZZᵀ, where ZZᵀ is block-diagonal with blocks of the form 11ᵀ. For serially correlated data, one may set cov = list("arma", c(p, q)) to use an ARMA(p, q) model for ε = Y − η(x).
More generally, one may enter W −1 via cov = list(fun = fun, env = env, init = init), with W −1 to be evaluated by cov$fun(gamma, cov$env); env contains constants and init provides initial values for the correlation parameters γ, which should be on scales free of constraint.
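A minimal sketch with simulated longitudinal data follows; the data-generating mechanism is arbitrary.

R> id <- gl(20, 5)                                  # 20 subjects, 5 visits each
R> x <- rep(seq(0, 1, length = 5), 20)
R> y <- sin(2 * pi * x) + rep(rnorm(20), each = 5) * 0.3 + rnorm(100) * 0.1
R> fit <- ssanova9(y ~ x, cov = list("long", id))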

Density estimation suites
To estimate a probability density f (x) on X , one may use ssden or ssden1. To estimate a conditional density f (y|x) on X × Y, use sscden or sscden1. For Y discrete, one may use ssllrm to fit f (y|x), which is regression with cross-classified responses.

ssden
A probability density f(x) is positive and integrates to one. Consider a logistic density transform f(x) = e^{η(x)}/∫_X e^{η(x)}dx, which can be made one-to-one by cutting out η_∅ in a one-way ANOVA decomposition η = η_∅ + η_x. Observing independent samples X_i from f(x), one may set

$$L(\eta) = -\frac{1}{n}\sum_{i=1}^{n}\eta(X_i) + \log\int_{\mathcal{X}} e^{\eta(x)}dx$$

in (2).
The following sequence draws samples from a normal mixture and fits a cubic spline to the log density.
R> u <-runif(100) R> x <-rnorm(100) * 0.1 R> x <-ifelse(u > 1/3, x + 0.7, x + 0.3) R> fit <-ssden(~x) The domain X does contribute to estimation through X e η(x) dx, and it is a good idea to specify X explicitly via domain = data.frame(x = c(a, b)); if unspecified, as is the case in the call above, the domain is set internally, extending the data range by 5% on both ends.
To use the fitted univariate density, one has the usual d-, p-, and q-functions (dssden, pssden, and qssden), but no r-.
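Continuing the fit above, a brief sketch; the evaluation grid and plotting choices are arbitrary.

R> xx <- seq(min(x), max(x), length = 101)
R> hist(x, breaks = 20, freq = FALSE, col = "gray90")
R> lines(xx, dssden(fit, xx))                       # fitted density
R> qssden(fit, c(0.05, 0.5, 0.95))                  # selected quantiles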

sscden
Consider a logistic conditional density transform f(y|x) = e^{η(x,y)}/∫_Y e^{η(x,y)}dy, which can be made one-to-one by cutting out η_∅ + η_x in an ANOVA decomposition η = η_∅ + η_x + η_y + η_{x,y}. Observing (X_i, Y_i), i = 1, . . . , n, one may set

$$L(\eta) = -\frac{1}{n}\sum_{i=1}^{n}\Big\{\eta(X_i, Y_i) - \log\int_{\mathcal{Y}} e^{\eta(X_i, y)}dy\Big\}$$

in (2) for the estimation of f(y|x). Domains X and Y are generic, and can be product domains themselves, so η_y and η_{x,y} may be further decomposed.
The following sequence loads the penny thickness data included in gss and fits a tensor product spline to the log conditional density.
R> data("penny", package = "gss") R> fit <-sscden(~year * mil,~mil, data = penny) The first formula in a sscden call specifies model terms and the second formula lists the y-variables; η ∅ and terms in η x are removed internally. It would be a good idea to specify Y explicitly via ydomain.
The fitted f(y|x) can be evaluated using dsscden, and when Y is a real interval, as is the case here, one may also use psscden and qsscden. The code below plots selected quantiles of f(y|x) as functions of x.
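The following is a hedged sketch; the call signature qsscden(fit, p, x), with x a data frame of x-values, is assumed from the d-/p-/q- conventions noted above.

R> xx <- data.frame(year = seq(min(penny$year), max(penny$year), length = 51))
R> plot(penny$year, penny$mil, col = 3)
R> for (p in c(0.05, 0.25, 0.5, 0.75, 0.95))
+      lines(xx$year, qsscden(fit, p, xx), col = 5)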

ssllrm
For Y binary, the estimation of f(y|x) reduces to logistic regression, and for Y discrete in general, it generalizes logistic regression to settings with cross-classified responses. To accommodate features specific to an all-discrete Y, ssllrm was created.

ssden1
Integrals of the form ∫_X h(x)e^{η(x)}dx have to be performed in ssden, which is numerically prohibitive for a high-dimensional X. Jeon and Lin (2006) set

$$L(\eta) = \frac{1}{n}\sum_{i=1}^{n} e^{-\eta(X_i)} + \int_{\mathcal{X}} \eta(x)\rho(x)dx$$

in (2) for ρ(x) a known density, with the resulting estimate f(x) ∝ e^{η(x)}ρ(x). When ρ(x) = Π_γ ρ_γ(x_⟨γ⟩) on X = Π_γ X_γ, ∫_X η(x)ρ(x)dx can be calculated as sums of products of univariate integrals, and ANOVA structures in η(x) have the same conditional independence implications; the marginal density estimates on X_γ are natural choices for ρ_γ(x_⟨γ⟩).
The approach is implemented in ssden1 with the same user interface as ssden. Utility functions dssden, cdssden, cpssden, and cqssden also work with ssden1 fits, except that the f (x) values returned from dssden are unnormalized. A variant of Kullback-Leibler projection is implemented in project, one that avoids integrals of the form X h(x)e η(x) dx.

sscden1
For the estimation of f(y|x), integrals of the form ∫_Y h(X_i, y)e^{η(X_i,y)}dy are performed repeatedly in sscden, which can be numerically formidable for n large despite the typically low dimension of Y; the drag here is the number of distinct X_i's. One may set

$$L(\eta) = \frac{1}{n}\sum_{i=1}^{n}\Big\{e^{-\eta(X_i, Y_i)} + \int_{\mathcal{Y}} \eta(X_i, y)\rho(X_i, y)dy\Big\}$$

in (2) for some ρ(x, y) satisfying ∫_Y ρ(x, y)dy = 1, ∀x, with the resulting estimate f(y|x) ∝ e^{η(x,y)}ρ(x, y). The term η_x is needed in η for the approach to work, though it cancels out in the fitted f(y|x). The integrals ∫_Y η(X_i, y)ρ(X_i, y)dy are linear combinations of integrals of basis functions, which only need to be computed once and for all.
The approach is implemented in sscden1, with syntax the same as for sscden except for the specification of ρ(x, y) via an argument rho. The default rho = list("xy") "estimates" ρ(x, y) internally using ssanova, and rho = list("y") orders a marginal density estimate on Y from ssden for use as ρ(x, y). One may also create ρ(x, y) externally and pass it into sscden1 through rho = list(fun = fun, env = env), to be evaluated via rho$fun(x, y, rho$env, outer), where env contains constants and outer indicates whether to return a vector of ρ(x_i, y_i) or a matrix of ρ(x_i, y_j).
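For instance, with the penny data from above, one might try the following sketch; the computational payoff shows only for larger samples.

R> fit1 <- sscden1(~ year * mil, ~ mil, data = penny, rho = list("y"))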
The utility functions for the evaluation of f (y|x) also work for sscden1 fits. A variant of Kullback-Leibler projection is implemented in project, but the tool is not as useful here; with ρ(x, y) in the scene, conditional independence may not be inferable just from the ANOVA structures in η(x, y).

Hazard estimation suites
Observing Y_i = (X_i, δ_i, Z_i, U_i), where X = min(T, C), δ = I_[T≤C], Z < X is a possible left-truncation time, and U is a possible covariate, one is to estimate the hazard λ(t, u) = e^{η(t,u)}. The accelerated life models via gssanova are parametric in t, and as more flexible alternatives, one may use sshzd or sshzd1 for fully nonparametric estimation. In case a proportional hazard model holds, λ(t, u) = λ₀(t)λ₁(u), one may also use sscox to estimate the relative risk λ₁(u).

sshzd
With or without a covariate, one may set

$$L(\eta) = -\frac{1}{n}\sum_{i=1}^{n}\Big\{\delta_i\,\eta(X_i, U_i) - \int_{Z_i}^{X_i} e^{\eta(t, U_i)}dt\Big\}$$

in (2), where U_i drops out if absent in the data, and Z_i = 0 if there is no left-truncation.
The following sequence fits a log hazard of the form η = η ∅ + η t + η u + η t,u to the Stanford heart transplant data.
R> data("stan", package = "gss") R> fit <-sshzd(Surv(futime, status)~futime * age, data = stan) Data scatter more evenly on the t * = √ t scale, and futime is thus transformed; the hazard on the original time scale should be e η( The Surv(t, delta, z = 0) construct has a similar appearance as that in the survival package, but is parsed differently here; the t main effect must appear among the model terms.

sscox
The following sequence estimates the relative risk of age in stan and evaluates the fit on the data points.

R> fit <- sscox(Surv(time, status) ~ age, data = stan)
R> risk <- predict(fit, data.frame(age = stan$age), se = TRUE)

Only the ordering of X_i matters here, so one may use either the original time or the transformed futime and get the same fit. The fitted relative risks e^{η̂_u(u_i)} are in risk$fit and the standard errors s(u_i) of η̂_u(u_i) are in risk$se. To estimate the base hazard given r_i = η̂_u(u_i), one may use sshzd with offset.
R> r <- log(risk$fit)                               # fitted log relative risk
R> fit.b <- sshzd(Surv(futime, status) ~ futime, data = stan, offset = r)
R> tt <- seq(1, max(stan$futime), length = 51)
R> hzd.b <- hzdcurve.sshzd(fit.b, tt, se = TRUE)

Note that the relative risk and the base hazard in a proportional hazard model can be estimated jointly using sshzd through the penalized full likelihood.

sshzd1
Parallel to ssden1 and sscden1, integrals of the form ∫_{Z_i}^{X_i} h(t, U_i)e^{η(t,U_i)}dt can be avoided by introducing a known ρ(t, u) in the likelihood functional. The approach is implemented in sshzd1, with the same syntax and utility functions as sshzd. The default rho = list("marginal") uses as ρ(t, u) a covariate-free hazard estimate from sshzd, and rho = list("weibull") fits a Weibull regression using gssanova and uses as ρ(t, u) the resulting hazard estimate λ(t, u) = νt^{ν−1}e^{−νη(u)}.

Model configurations
Model terms are specified by the model formula as in lm, and schematically, the configurations of the terms are through the construction of the "inverse" of

$$J(\eta) = \sum_\beta \theta_\beta^{-1} J_\beta(\eta_\beta), \qquad R_J(x, y) = \sum_\beta \theta_\beta R_\beta(x, y).$$

For η_β a main effect, say η₁(x_⟨1⟩), R_β(x, y) = R₁(x_⟨1⟩, y_⟨1⟩) is simply a kernel on X₁, and for η_β an interaction, say η₁₂(x_⟨1⟩, x_⟨2⟩), R_β(x, y) = R₁(x_⟨1⟩, y_⟨1⟩)R₂(x_⟨2⟩, y_⟨2⟩) is the product of kernels on X₁ and X₂. Besides the model formula, model configurations are through the specifications of the marginal kernels.

Numerical vectors
For x_γ a numerical vector on X_γ = [a, b], the default marginal kernel is the "inverse" of ∫_a^b (f″(x))² dx, and terms η_β involving x_γ satisfy the side conditions ∫_a^b η_β dx_γ = 0. The domain may be set internally by extending the data range 5% on each end, or may be specified explicitly via type = list(x = list("cubic", c(a, b))). In the density estimation suites, specifications via domain or ydomain also get used in kernel generation.
A useful alternative to the default kernel is a periodic cubic spline kernel, the "inverse" of ∫_a^b (f″(x))² dx for f(x) periodic on [a, b]. Such a kernel is specified via type = list(x = list("per", c(a, b))), and terms involving such an x_γ are periodic in x_γ.
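A schematic example with hypothetical data on [0, 1]:

R> x <- runif(200)
R> y <- sin(2 * pi * x) + rnorm(200) * 0.2          # signal periodic on [0, 1]
R> fit <- ssanova(y ~ x, type = list(x = list("per", c(0, 1))))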

Numerical matrices
The marginal domains X γ are generic, which can be mathematically multi-dimensional. Two such cases are the Euclidean space (−∞, ∞) d and the unit sphere S. With a matrix x in the model formula, one may entertain multi-dimensional marginals, usually geographic/spatial, in tensor product splines.
The default kernel for a numerical matrix x is a thin-plate spline kernel of order m = 2, which, for x_γ ∈ (−∞, ∞)^d, is the "inverse" of

$$J(f) = \sum_{\nu_1 + \cdots + \nu_d = m} \frac{m!}{\nu_1!\cdots\nu_d!} \int_{(-\infty,\infty)^d} \Big(\frac{\partial^m f}{\partial x_1^{\nu_1}\cdots\partial x_d^{\nu_d}}\Big)^2 dx.$$

One needs 2m > d, and to specify an order other than the default m = 2, use type = list(x = list("tp", m)). Unless more detailed customization is done via type, terms η_β involving such an x_γ add up to zero on the data points along x_γ, Σ_i η_β(. . . , x_i⟨γ⟩, . . . ) = 0. Thin-plate splines are invariant to arbitrary shifting and rotation of the coordinate system on the domain.
For x with two columns, one may alternatively treat it as x_γ ∈ S and specify a spherical spline via type = list(x = list("sphere", m)), where m = 2, 3, 4 with 2 the default order; technical details can be found in Wahba (1981).
The first column of x should be the latitude in the range [−90, 90] and the second column the longitude in the range [−180, 180]. Spherical splines are invariant to rotations of the coordinates on S, and terms η_β involving such an x_γ satisfy ∫_S η_β dx_γ = 0.
To preserve a matrix element in a data frame, one may use the as-is function I(...), as shown below.
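The sketch below is illustrative only; the variable names and data are hypothetical.

R> lat <- runif(100, -90, 90)
R> lon <- runif(100, -180, 180)
R> y <- sin(lat / 180 * pi) + rnorm(100) * 0.1
R> dat <- data.frame(y = y, geog = I(cbind(lat, lon)))  # I(...) keeps the matrix intact
R> fit <- ssanova(y ~ geog, data = dat, type = list(geog = list("sphere", 2)))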

Factors
On a discrete X_γ, a function f(x) is a vector and a reproducing kernel R(x, y) is a square matrix, and a quadratic "roughness" functional can be written as a quadratic form J(f) = fᵀJf for J a nonnegative-definite matrix. In the column space of J with the square norm fᵀJf, the reproducing kernel is J⁺, the Moore-Penrose inverse of J.
Discrete variables enter the model formula as factors. For x nominal, the kernel is the one associated with the shrinkage of the level effects toward their mean, under which terms involving x sum to zero over the levels of x.

Optional arguments
We now discuss a couple of optional arguments that help to broaden the scope of application.

Parametric terms
In some applications, semiparametric models of the form ζ(x, z) = η(x) + zᵀβ are desirable, where η is "nonparametric" and zᵀβ comprises the parametric terms. Replacing η by ζ in the likelihood L(η) in (2), and minimizing the resulting functional with respect to η and β, one obtains the so-called partial splines.
Partial spline models can be fitted in the regression and hazard estimation suites; due to normalization, such models make little sense for density estimation. To enter parametric terms z₁β₁ + z₂β₂ + z₁z₂β₁₂, say, use partial = ~z1 * z2, which acts just like a formula in lm. Variables in partial are expected to be numerical vectors, and the partial terms are centralized internally; centralization serves the same purpose as the averaging operators for the ANOVA terms. Like the ANOVA terms, the partial terms may also be selectively included/excluded in predict or hzdrate.sshzd. To maintain model identifiability, one should avoid "overlaps" among the variables in x and z. A brief sketch follows.
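The data-generating mechanism below is arbitrary, chosen only to illustrate the call.

R> dat <- data.frame(x = runif(100), z1 = runif(100), z2 = runif(100))
R> dat$y <- sin(2 * pi * dat$x) + 0.5 * dat$z1 - 0.3 * dat$z2 + rnorm(100) * 0.1
R> fit <- ssanova(y ~ x, data = dat, partial = ~z1 + z2)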

Random effects
Mixed-effect models are widely used for the modeling of correlated data, such as those arising in longitudinal studies. Consider a mixed-effect model ζ(x, z) = η(x) + zᵀb, where b ∼ N(0, B). One may estimate η and b jointly via the minimization of

$$L(\eta + z^\top b) + \lambda J(\eta) + b^\top \Sigma\, b,$$

where Σ ∝ B⁻¹, with the proportionality constant absorbed into the tuning parameters. Mixed-effect models for the log hazard are known as frailty models.
The random effects zᵀb in a mixed-effect model appear identical to the parametric terms zᵀβ in a partial spline model, except that they need to be penalized via bᵀΣb. The matrix Σ is typically structured, containing correlation parameters γ that are to be selected along with the smoothing parameters.
The terms zᵀβ in partial spline models are often more important than η(x), whereas the random effects zᵀb are a nuisance and are usually set to zero when the fits are evaluated. The two kinds of terms may coexist, but one would need different symbols to distinguish the two z's.
Random effects are specified via Z = (z₁, . . . , zₙ)ᵀ and Σ. One may use an optional argument random in the regression and hazard estimation suites, as a formula or a list. With

random = ~1 | id
where id is a factor with p levels, Z = diag(1, . . . , 1) when data at the same id levels are grouped together, with the 1's (vectors of ones) possibly of different lengths, and Σ = γI_p with a single γ; this typically characterizes correlations in longitudinal data. With a grouping factor gid added to the specification, where gid is either the same as id or a coarser grouping with levels collapsed from those of id, Z is the same as above, but Σ is now block-diagonal with blocks of the form γ_k I_{p_k}, for Σ_k p_k = p; this can be used to characterize correlations in clustered data. More generally, one may use a list

random = list(z = z, sigma = list(fun, env), init = init)

where z contains Z, sigma specifies Σ to be evaluated via sigma$fun(gamma, sigma$env), and init holds initial values for γ, presumably on scales free of constraint.
For Gaussian regression with longitudinal data, ssanova9 with cov = list("long", id) should be used when p is comparable to n, and ssanova with random = ~1 | id should be used when p is much smaller than n. Random effects find little use in density estimation due to normalization, but zᵀb for a univariate ζ can be propagated into multivariate versions for use in ssllrm (Gu and Ma 2011), which accepts random.

Numerical engines
Apart from ssanova0, gssanova1, and gssanova0, the estimation is done in two nested loops, with the outer loop minimizing a cross-validation score V (λ) with respect to smoothing parameters plus possible correlation parameters, and with the inner loop minimizing (2) with fixed tuning parameters. The inner loop follows Newton iteration with safeguards such as step-halving, and the outer loop uses the quasi-Newton iteration of Dennis and Schnabel (1996) as implemented in the R function nlm; the inner loop is absent in ssanova where the estimate is analytical.
It can be shown that for η ∈ H* as in (4), J(η) = cᵀQc, where Q = Σ_β θ_β Q_β with (Q_β)_{jk} = R_β(z_j, z_k), and c_j are the coefficients of the basis functions R_J(z_j, ·). Fixing the θ's in J(η) = Σ_β θ_β⁻¹ J_β(η_β), the outer loop with a single λ is a simple task, and the starting values for the θ iteration via quasi-Newton are obtained through two passes of the fixed-θ outer loop:
1. Set θ̃_β⁻¹ ∝ tr(Q_β), then minimize V(λ) with respect to λ to obtain η̃.
2. With η̃ from Step 1, reset the θ_β's according to the magnitudes of the fitted components η̃_β, then minimize V(λ) with respect to λ once more to obtain the θ̌_β's.
Step 1 is invariant to arbitrary scalings of J_β(η_β), allowing equal opportunity for all terms. Step 2 grants more allowances to terms that showed strengths in Step 1. The ensuing θ iteration fixes λ and starts from the θ̌_β's.
The starting values θ̌_β proved to be highly effective, often leaving only the "last 20%" of the performance for the θ iteration to pick up. When the number of θ_β's is large, quasi-Newton iteration with numerical derivatives is slow to converge, and one may simply take the fit from Step 2 and forgo the θ iteration. This can be done by setting skip.iter = TRUE in the fitting functions.
If correlation parameters are involved in the process, as is the case when random is specified or when ssanova9 is called upon, the computational savings from skip.iter = TRUE are less dramatic.
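A schematic call forgoing the θ iteration, on simulated data with an arbitrary mechanism:

R> dat <- data.frame(x1 = runif(100), x2 = runif(100))
R> dat$y <- sin(2 * pi * dat$x1) + dat$x2^2 + rnorm(100) * 0.1
R> fit <- ssanova(y ~ x1 * x2, data = dat, skip.iter = TRUE)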

Examples
Three examples follow, one involving longitudinal data, another concerning density estimation on a truncated domain, and a third having a two-dimensional marginal domain.

Treatment of bacteriuria
A group of 72 patients with acute spinal cord injury and bacteriuria (bacteria in urine) were randomly assigned to two treatment groups, 36 each. A weekly binary indicator of bacteriuria was recorded for every patient over 4 to 16 weeks. The data are listed in Joe (1997, Section 11.4). Removing the week-1 data, in which the disease indicator is positive for all, one has n = 820. The data are included in gss as a data frame.
R> data("bacteriuria", package = "gss") R> bact <-bacteriuria There are 30 distinctive x i 's (15 time points by 2 trt levels), and id 3 and 38 had complete follow-up under the two trt levels, so the selection of {z j } can be done deterministically.

Miscellaneous
Optional arguments affiliated with model formulas, such as subset and weights, are accepted by the fitting functions in gss, where weights specifies multiplicity counts for duplicated data, except in Gaussian regression where it calls for weighted least squares. The regression suites also accept offset.
The cosine diagnostics of Gu (1992b) can be obtained for fitted regression models by using the method summary, with a flag diag = TRUE.
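For example, with a fitted regression object fit from one of the earlier sequences:

R> summary(fit, diag = TRUE)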
For density estimation using data that are subject to sampling bias, one may use ssden with an optional argument bias. Further details can be found in Gu (2013, Sections 7.6.4-7.6.5).
In case one needs to use marginal kernels that are not "canned" in gss, "custom" types can be defined for variables and be entered via type = list(x = list("custom", ...)). Further details can be found in Gu (2013, Section A.1.3).

Summary
The gss package offers nonparametric modeling tools in many of the standard data analysis settings. It also defines some non-standard settings, such as regression with cross-classified responses.
Functional ANOVA decompositions are built in on multivariate domains, with continuous, discrete, and even multi-dimensional marginals. Semiparametric models and mixed-effect models can also be entertained. Smoothing parameters are selected by cross-validation, and the "testing" of model terms can be done using Kullback-Leibler projection.