Additive models in censored regression

https://doi.org/10.1016/j.csda.2009.02.008

Abstract

Additive models in censored regression are considered. A randomly weighted version of the backfitting algorithm that allows for the nonparametric estimation of the effects of the covariates on the response is provided. Given the high computational cost involved, binning techniques are used to speed up the computation in the estimation and testing process. Simulation results and the application to real data reveal that the predictor obtained with the additive model performs well, and that it is a convenient alternative to the linear predictor when some nonlinear effects are expected.

Introduction

Let $Y$ be a lifetime which is observed under censoring from the right. Let $X=(X_1,\dots,X_p)$ be a vector of $p$ covariates. Put $f(x)=E[\psi(Y)\mid X=x]$ for the regression function of $\psi(Y)$ on $X$, so the model becomes

$$\psi(Y)=f(X)+\varepsilon=f(X_1,\dots,X_p)+\varepsilon, \tag{1}$$

where the error term satisfies $E[\varepsilon\mid X]=0$. Here $\psi$ denotes a time transformation such as the logarithm. Taking $\psi(y)=\ln y$ is useful in regression analysis because $\psi(Y)$ is no longer restricted to $(0,\infty)$. Indeed, under (1) we have, provided that $X$ and $\varepsilon$ are independent,

$$F_{Y\mid X}(y\mid X)=F_W\bigl(e^{-f(X)}y\bigr),\qquad y\ge 0,$$

where $F_{Y\mid X}$ and $F_W$ are the cumulative distribution functions of $Y$ given $X$ and of the transformed error $W=e^{\varepsilon}$, respectively. This is the so-called accelerated failure time model, widely used to analyze survival data in the regression framework. Note that an increasing value of $f(X)$ results in a decreasing value of the time acceleration factor $e^{-f(X)}$, thus leading to a better survival prognosis.
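The distribution function of the accelerated failure time model follows directly from the regression equation with the logarithmic transformation. A short derivation, assuming as in the text that the covariates and the error are independent, and writing $W=e^{\varepsilon}$:

```latex
\begin{aligned}
F_{Y\mid X}(y\mid X)
  &= P(Y\le y\mid X) = P\bigl(\ln Y\le \ln y\mid X\bigr) \\
  &= P\bigl(f(X)+\varepsilon\le \ln y\mid X\bigr) \\
  &= P\bigl(e^{\varepsilon}\le e^{-f(X)}\,y\mid X\bigr) \\
  &= F_W\bigl(e^{-f(X)}\,y\bigr), \qquad y>0 .
\end{aligned}
```

Independence is what allows the conditional law of $W$ given $X$ to be replaced by the unconditional $F_W$ in the last step.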

In the censored setup, we observe $(X_1,Z_1,\delta_1),\dots,(X_n,Z_n,\delta_n)$, independent observations with the same distribution as $(X,Z,\delta)$, where $Z=\min(Y,C)$, $C$ is the right-censoring variable assumed to be independent of $Y$, and $\delta=I(Y\le C)$. Unlike in the “iid” scenario (in which each observation receives mass or weight $1/n$), the weight associated with the $i$th observation $(X_i,Z_i,\delta_i)$ under censoring will typically be the jump of the Kaplan–Meier estimator at the point $Z_i$ ($i=1,\dots,n$), namely

$$W_i=\frac{\delta_i}{n-\mathrm{Rank}\,Z_i+1}\prod_{\mathrm{Rank}\,Z_j<\mathrm{Rank}\,Z_i}\left[1-\frac{\delta_j}{n-\mathrm{Rank}\,Z_j+1}\right],$$

where $\mathrm{Rank}\,Z_i$ is the rank of $Z_i$ among the ordered $Z$’s and where (in the case of ties) uncensored observations are assumed to precede the censored ones. When the error distribution is unknown, an approach that leads to consistent estimators is choosing $f$ in order to minimize

$$\min_{f\in\mathcal{F}}\sum_{i=1}^{n}W_i\bigl(\psi(Z_i)-f(X_{i1},\dots,X_{ip})\bigr)^2,$$

where the family $\mathcal{F}$ represents the a priori knowledge on the true regression. See Stute (1993, 1996, 1999) for the parametric linear and nonlinear case, in which it is assumed that $f\in\{f(\cdot;\beta)\}_{\beta}$, and see Orbe et al. (2003) for the partly linear case $f(X)=\beta_1X_1+\dots+\beta_{p-1}X_{p-1}+g(X_p)$. A different estimator based on ranks was proposed by Chen et al. (2005) for the partial linear model. Gannoun et al. (2005) used the Kaplan–Meier weights in the context of quantile regression. Another possible approach (which we will not follow here) is that based on the so-called synthetic data; see for example Leurgans (1987) and Qin and Jing (2000), who considered the parametric linear case and the partial linear model, respectively (see also Liang and Zhou (1998) for the latter setup).
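The Kaplan–Meier weights described above can be computed in a few lines. The following is a minimal sketch in Python with NumPy; the function name `kaplan_meier_weights` and its argument names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def kaplan_meier_weights(z, delta):
    """Jumps of the Kaplan-Meier estimator at each observed time.

    z     : observed times Z_i = min(Y_i, C_i)
    delta : censoring indicators (1 = uncensored, 0 = censored)
    Returns the weights W_i in the original order of the observations.
    """
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n = len(z)
    # Sort by time; among tied times, uncensored observations come first.
    order = np.lexsort((-delta, z))
    d = delta[order]
    ranks = np.arange(1, n + 1)
    # Factor [1 - delta_j / (n - rank_j + 1)] for every observation j.
    factors = 1.0 - d / (n - ranks + 1)
    # Product over all j ranked strictly before i (empty product = 1).
    prev_prod = np.concatenate(([1.0], np.cumprod(factors)[:-1]))
    w_sorted = d / (n - ranks + 1) * prev_prod
    # Put the weights back in the original observation order.
    w = np.empty(n)
    w[order] = w_sorted
    return w
```

Without censoring every weight reduces to 1/n. Censored observations receive weight zero and their mass is redistributed to later uncensored times; when the largest observation is censored, the weights sum to less than one.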

In some applications the linear model can be very restrictive. This constraint can be avoided by replacing the linear structure by a nonparametric structure. Here we consider a flexible approach to estimate the regression function $f(x)$ through a semiparametric model under which the effect of each covariate on the response is represented in an additive way, the qualitative form of this effect being unknown otherwise. We assume the additive model (Hastie and Tibshirani, 1990)

$$f(X)=\alpha+f_1(X_1)+\dots+f_p(X_p), \tag{3}$$

where $\alpha$ is a constant and $f_1,\dots,f_p$ are one-dimensional functions. If the influence of the covariates $X_j$ is linear, then the corresponding partial functions can be expressed parametrically as $f_j(X_j)=\beta_jX_j$. Therefore, the model given in (3) nests the linear model. Moreover, on assuming that effects are additive, these types of models maintain the interpretability of linear models. Yet, at the same time, they incorporate the flexibility of nonparametric smoothing methods because, rather than following a fixed parametric form, the effect of each of the covariates, $X_j$, depends on a totally unknown function, $f_j$, which is only required to possess a certain degree of smoothness so that it can be estimated. The compromise between flexibility, dimensionality and interpretability ranks these types of models among the statistical tools with the greatest capacity for data analysis in different fields of research.

Additive models have been used to relax the linearity assumption in the Cox proportional hazards model, which is the most popular regression model for censored survival data. For example, Huang (1999) introduced efficient estimation for a partly additive Cox model. Huang and Liu (2006) considered a nonparametric link function which controls for the effect of the parametric predictor under proportional hazards. See also Ganguli and Wand (2006) and the references therein for extensions of the Cox proportional hazards model via additive regression. However, the proportional hazards assumption may not hold in some applications, and hence there is a need for additive models that offer a valid alternative to Cox regression. To the best of our knowledge, additive models for the censored accelerated failure time model have not been explored so far. This is the gap we fill with the present work.

The layout of this paper is as follows. In Section 2 we give a description of the weighted kernel smoothing backfitting we use for fitting additive models with censored response. Moreover, in Section 2.1 we discuss the bandwidth selection problem and some related practical issues. To assess the validity of this estimation procedure, a simulation study is performed in Section 3. In Section 4 we apply the proposed methodology to real data. Finally, we conclude with a discussion in Section 5.

Fitting censored additive models

This section describes an algorithm for fitting the model effects $f_1,\dots,f_p$ in (3) for censored response. The algorithm discussed below is a modification of the backfitting algorithm (Opsomer, 2000) used for fitting additive models. The backfitting algorithm cycles through the covariates $X_j$ ($j=1,\dots,p$), and estimates each $f_j$ by applying local linear kernel smoothers to the partial residuals. These residuals are obtained by removing the estimated effects of the other covariates. Although our focus
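A weighted backfitting cycle with local linear smoothers can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the Gaussian kernel, the fixed number of cycles, the weighted-mean centring of each effect, and the names `local_linear_smooth` and `weighted_backfit` are all choices made here for the example.

```python
import numpy as np

def local_linear_smooth(x, r, w, h):
    """Weighted local linear fit of r on x, evaluated at each x[i].

    w are observation weights (e.g. Kaplan-Meier jumps) and h is the
    bandwidth of a Gaussian kernel (the kernel is an assumption here).
    """
    fitted = np.zeros(len(x))
    for i, x0 in enumerate(x):
        u = x - x0
        k = w * np.exp(-0.5 * (u / h) ** 2)   # kernel times observation weight
        s0, s1, s2 = k.sum(), (k * u).sum(), (k * u * u).sum()
        t0, t1 = (k * r).sum(), (k * u * r).sum()
        det = s0 * s2 - s1 ** 2
        if det > 1e-10 * s0 * s2:
            # Intercept of the weighted least-squares line fitted around x0.
            fitted[i] = (s2 * t0 - s1 * t1) / det
        elif s0 > 0:
            # Near-singular local design: fall back to a local constant fit.
            fitted[i] = t0 / s0
    return fitted

def weighted_backfit(X, y, w, h, n_iter=20):
    """Backfitting for y = alpha + sum_j f_j(X[:, j]) + error with weights w.

    h holds one bandwidth per covariate. Returns alpha and an (n, p) array
    whose column j is the estimate of f_j at the observed X[:, j].
    """
    n, p = X.shape
    w = w / w.sum()
    alpha = np.sum(w * y)                      # weighted mean of the response
    f = np.zeros((n, p))
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals: remove the constant and the other effects.
            partial = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = local_linear_smooth(X[:, j], partial, w, h[j])
            f[:, j] -= np.sum(w * f[:, j])     # centre the effect
    return alpha, f
```

For a censored response, `y` would hold the transformed observed times and `w` the Kaplan–Meier weights, so that censored observations contribute nothing to the local fits. A convergence check on the change in `f` between cycles could replace the fixed `n_iter`.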

Simulation study

A simulation study was conducted to assess the finite-sample behavior of our proposed algorithm. Given the vector of covariates $X=(X_1,X_2,X_3)$ in $\mathbb{R}^3$, the response variable $Y$ was generated according to the model $\ln Y=\sum_{j=1}^{3}f_j(X_j)+\varepsilon$, with $\varepsilon\sim N(0,1)$. The censoring variable $C$ was drawn independently from a Uniform$[0,a]$ distribution. Note that the constant $a$ determines the expected proportion of censored responses. We have chosen several values for $a$ in order to get censoring percentages of about 0%, 15%, 33%, 50% and
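The data-generating scheme can be sketched as below. The partial effects and the covariate distribution are not given in this excerpt, so the ones used here are hypothetical placeholders chosen only for illustration, as is the function name `simulate_censored_additive`.

```python
import numpy as np

def simulate_censored_additive(n, a, rng):
    """Draw (X, Z, delta) from ln Y = f1(X1) + f2(X2) + f3(X3) + eps."""
    # Covariate distribution: an assumption of this sketch.
    X = rng.uniform(-1.0, 1.0, size=(n, 3))
    # Hypothetical partial effects, not the ones used in the paper.
    f = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2]
    y = np.exp(f + rng.standard_normal(n))   # lifetime Y, eps ~ N(0, 1)
    c = rng.uniform(0.0, a, size=n)          # censoring C ~ Uniform[0, a]
    z = np.minimum(y, c)                     # observed time Z = min(Y, C)
    delta = (y <= c).astype(int)             # 1 if uncensored, 0 if censored
    return X, z, delta

if __name__ == "__main__":
    # Larger a means later censoring times, hence fewer censored responses.
    for a in (2.0, 10.0, 50.0):
        _, _, delta = simulate_censored_additive(1000, a, np.random.default_rng(0))
        print(f"a = {a:5.1f}: censored proportion = {1 - delta.mean():.2f}")
```

In practice the value of `a` would be tuned by trial until the empirical censoring proportion matches the target percentage.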

Application to real data

Between January 1974 and May 1984, the Mayo Clinic conducted a double-blinded randomized trial in Primary Biliary Cirrhosis (PBC) of the liver. A total of $n=312$ patients agreed to participate in this clinical trial. The data were analyzed in 1986 for presentation in the clinical literature (see Fleming and Harrington (1991)). The main variable of interest (the response $Y$) was the number of days between registration and death, possibly censored because of the end of the study or liver transplantation. By July 1986, 125

Discussion

In this paper we introduce a new approach to the estimation of additive models in censored regression. Specifically, we propose an extension of the accelerated failure time regression model through additive regression, which constitutes a novelty in the context of censored regression. Weighted backfitting based on kernel smoothers was used to estimate the model, and the smoothing windows (bandwidths) were selected by cross-validation. Using cross-validation bandwidths implies

Acknowledgements

We thank two anonymous referees for their careful reading of the paper and for suggestions that improved its presentation. Javier Roca Pardiñas was supported by grants MTM2005-00818 (European FEDER funding included) and SEJ2004-04583/ECON. Jacobo de Uña Álvarez was supported by grants MTM2005-01274 (European FEDER funding included) and MTM2008-03129. Both authors also acknowledge support by grant PGIDIT07PXIB300191PR.

References (36)

  • P.H.C. Eilers et al. Flexible smoothing with B-splines and penalties. Statistical Science (1996)
  • J. Fan et al. Fast implementation of nonparametric curve estimators. Journal of Computational and Graphical Statistics (1994)
  • T.R. Fleming et al. Counting Processes and Survival Analysis (1991)
  • B. Ganguli et al. Additive models for geo-referenced failure time data. Statistics in Medicine (2006)
  • A. Gannoun et al. Non-parametric quantile regression with censored data. Scandinavian Journal of Statistics (2005)
  • W. Guo. Inference in smoothing spline analysis of variance. Journal of the Royal Statistical Society Series B (2002)
  • W. Härdle et al. Bootstrap inference in semiparametric generalized additive models. Econometric Theory (2004)
  • T.J. Hastie et al. Generalized Additive Models (1990)