Derivatives diagnostics and robustness for smoothing splines

https://doi.org/10.1016/S0167-9473(03)00170-1

Abstract

Diagnostic measures for nonparametric regression using splines are given. Measures that incorporate the information provided by the derivatives of the fitted curve have the potential to identify different types of influential observations. These influential observations can substantially influence the spline fit. A robust nonparametric procedure can thus be developed by downweighting these influential observations. A numerical example and simulation results illustrate the techniques. Several applications are also given for a variety of nonparametric regression models.

Introduction

Consider first the model

$$y_k = f(t_k) + \varepsilon_k, \qquad k = 1, \ldots, n,$$

where $y_k$ are observations at design points $t_k$, $f(t)$ is a smooth curve and $\varepsilon_k$ are zero-mean, uncorrelated random variables. Many competing methods for estimating the curve $f(t)$ are available, for example, kernel-based methods and smoothing splines. Nonparametric regression using splines has been a rapidly growing branch of statistics in recent years. Suppose the estimate is the minimizer of

$$\sum_{k=1}^{n} \{y_k - f(t_k)\}^2 + \lambda \int f''(t)^2 \, dt$$

over the class of all twice differentiable functions $f$, where $\lambda$ is referred to as the smoothing parameter. $\lambda$ plays a key role in controlling the trade-off between the goodness of fit, represented by the residual sum of squares, and the "smoothness" of the estimate, measured by the integral of the square of the second derivative.
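The role of the smoothing parameter in the penalized criterion above can be illustrated with a minimal sketch. It is not the smoothing spline itself: as an assumption made for brevity, the integral of the squared second derivative is replaced by a sum of squared second differences of the fitted values on an equally spaced grid (a Whittaker-type discrete analogue). The function names `second_difference_matrix` and `penalized_fit`, the simulated data, and the two values of `lam` are illustrative choices, not taken from the paper.

```python
import numpy as np

def second_difference_matrix(n):
    """Return the (n-2) x n matrix D with (D f)_i = f_i - 2 f_{i+1} + f_{i+2}."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def penalized_fit(y, lam):
    """Minimize sum (y_k - f_k)^2 + lam * sum (second differences of f)^2.

    This is a discrete stand-in for the smoothing spline criterion, valid
    for equally spaced design points. Returns the fitted values and the
    hat matrix H, so that fitted = H @ y.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    D = second_difference_matrix(n)
    H = np.linalg.solve(np.eye(n) + lam * D.T @ D, np.eye(n))
    return H @ y, H

# Illustrative data (not from the paper): a noisy sine curve.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.2, size=t.size)

rough, _ = penalized_fit(y, lam=0.01)    # small lambda: follows the data closely
smooth, _ = penalized_fit(y, lam=100.0)  # large lambda: pulled toward a straight line
```

With `lam=0` the fit reproduces the data exactly, and data lying on a straight line are reproduced for any `lam`, because a straight line has zero second differences and so incurs no penalty.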

For smoothing splines, deletion diagnostics and local influence diagnostics have been developed. Thomas (1991) proposed local influence diagnostics for the smoothing parameter in spline smoothing based on Cook's (1986) approach. Recently, local influence diagnostics for partially linear models and nonlinear mixed-effects models were given by Zhu et al. (2003) and Lee and Xu (2003), respectively. Among deletion diagnostics with the smoothing parameter held fixed, the existing diagnostics for smoothing splines are mostly based on the change in the fitted values, for example, Eubank (1985), Eubank and Gunst (1986) and Fung et al. (2002). In addition to the fitted values, the slope and the second derivative also supply important information for characterizing the behavior of plane curves. However, no diagnostic has yet been developed that incorporates the information provided by the slope and the second derivative in nonparametric regression using splines. Such diagnostics are developed here and their properties are investigated. Secondly, as indicated by Carroll and Ruppert (1988, p. 175), neither diagnostics nor robust methods alone are as useful as the appropriate combination of both. The more that is learned about the influential observations, the more likely it is that a sensible robust method can be developed. Therefore, a sensible robust nonparametric regression method is proposed by appropriately downweighting these influential observations. The influence measures alone can be used to detect observations inconsistent with the model, while the associated robust method provides an alternative fit.

One motivating example is given in the next section to illustrate these issues. Some influence measures are developed in Section 3, while the associated robust nonparametric regression methods are described in Section 4. In Section 5, the proposed diagnostics and robust methods are applied to the data used in Section 2. In addition, a numerical simulation is set up under different situations, including influential observations of different types, different sample sizes, noise levels, and mean functions. Several extensions to a variety of nonparametric regression models are given in Section 6. Finally, a concluding discussion is given in Section 7.

Section snippets

Motivating examples

The data are taken from a study by Brinkman (1981) (see also Simonoff, 1996). Eighty-eight measurements for three variables, NOx, C, and E, were recorded from an experiment in which ethanol was burned in a single-cylinder automobile test engine, where NOx is the concentration of nitric oxide (NO) and nitrogen dioxide (NO2) in engine exhaust, normalized by the work done by the engine, C is the compression ratio of the engine, and E is the equivalence ratio at which the engine was run. Different

Diagnostics

In this section, the influence measures for assessing the case influence on the slope and the second derivative of the fitted curve are proposed.
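The idea of measuring case influence on the slope and the second derivative can be sketched as follows. This is only an illustration under stated assumptions, not the measures defined in the paper: the smoothing spline is replaced by a discrete second-difference smoother on an equally spaced grid, derivatives are approximated by finite differences with `numpy.gradient`, and each influence value is simply the sum over design points of the squared change when one observation is deleted (given zero weight). The names `weighted_fit`, `derivatives` and `deletion_influence`, the simulated data, and the planted outlier are hypothetical.

```python
import numpy as np

def weighted_fit(y, lam, w):
    """Minimize sum w_k (y_k - f_k)^2 + lam * sum (second differences of f)^2."""
    n = y.size
    D = np.diff(np.eye(n), n=2, axis=0)  # (n-2) x n second-difference matrix
    return np.linalg.solve(np.diag(w) + lam * D.T @ D, w * y)

def derivatives(fit, t):
    """Finite-difference slope and second derivative at the design points."""
    slope = np.gradient(fit, t)
    return slope, np.gradient(slope, t)

def deletion_influence(t, y, lam):
    """Row i holds the summed squared change in (fitted values, slope,
    second derivative) when observation i is deleted, with lam held fixed."""
    n = y.size
    full = weighted_fit(y, lam, np.ones(n))
    slope_full, curv_full = derivatives(full, t)
    out = np.zeros((n, 3))
    for i in range(n):
        w = np.ones(n)
        w[i] = 0.0  # zero weight is equivalent to deleting case i
        fit_i = weighted_fit(y, lam, w)
        slope_i, curv_i = derivatives(fit_i, t)
        out[i] = [np.sum((full - fit_i) ** 2),
                  np.sum((slope_full - slope_i) ** 2),
                  np.sum((curv_full - curv_i) ** 2)]
    return out

# Illustrative data (not from the paper): a sine curve with one planted outlier.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 41)
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.05, size=t.size)
y[20] += 3.0
influence = deletion_influence(t, y, lam=5.0)
```

Comparing the three columns of `influence` is one way to see whether an observation that barely moves the fitted values still changes the shape of the curve, which is the kind of information the derivative-based measures are meant to capture.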

For ease of exposition and consistency with the later development in Section 6, let $f(t)=\sum_{j=1}^{p} a_j B_j(t)$, where $p$ is the number of suitably chosen basis functions, usually large enough to ensure the accuracy of the approximation (for example, $p=n+2$ when all knots are included), and $B_j(t)$ are basis functions, for example, the commonly used B-splines. Thus,

What next? robust nonparametric regression

A variety of robust methods have been developed for nonparametric regression (see Eubank, 1988, pp. 173–176; Simonoff, 1996, pp. 200–203; He et al., 2002). Most of them are M-type estimators whose goal is to reduce the influence of outliers in the estimation process rather than to identify them. In this section, by contrast, a robust method is proposed that first identifies influential observations and then downweights them.
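The "identify first, then downweight" strategy can be sketched in a few lines. Everything specific here is an assumption made for illustration rather than the procedure of the paper: the smoother is a discrete second-difference analogue of the smoothing spline, the influence score is the absolute deleted residual scaled by a MAD-based estimate, cases are flagged one at a time, and `cutoff=3.0`, `weight=0.0` and `max_flags=5` are placeholder values (the excerpt does not give the paper's warning limit or weights). Any other diagnostic, such as a derivative-based one, could be substituted for the score.

```python
import numpy as np

def robust_fit(y, lam, cutoff=3.0, weight=0.0, max_flags=5):
    """Flag the most influential case, downweight it, refit, and repeat.

    Returns the final fitted values and the indices of flagged cases.
    """
    n = y.size
    D = np.diff(np.eye(n), n=2, axis=0)  # second-difference matrix
    K = lam * D.T @ D
    w = np.ones(n)
    flagged = np.zeros(n, dtype=bool)
    while True:
        A = np.linalg.inv(np.diag(w) + K)
        fit = A @ (w * y)
        h = w * np.diag(A)                 # leverages of the weighted fit
        d = (y - fit) / (1.0 - h)          # deleted (leave-one-out) residuals
        keep = ~flagged
        # Robust scale from the unflagged cases (MAD, normal-consistent).
        scale = 1.4826 * np.median(np.abs(d[keep] - np.median(d[keep])))
        score = np.where(keep, np.abs(d) / scale, 0.0)
        worst = int(np.argmax(score))
        if score[worst] <= cutoff or flagged.sum() >= max_flags:
            return fit, np.flatnonzero(flagged)
        flagged[worst] = True
        w[worst] = weight                  # downweight the flagged case

# Illustrative data (not from the paper): a sine curve with one planted outlier.
rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 41)
truth = np.sin(2 * np.pi * t)
y = truth + rng.normal(scale=0.05, size=t.size)
y[20] += 3.0

initial, _ = robust_fit(y, lam=5.0, cutoff=np.inf)  # no case is ever flagged
robust, flagged = robust_fit(y, lam=5.0)
```

Flagging one case per pass is a deliberate choice in this sketch: a single large outlier also inflates the residuals of its neighbours, so recomputing the scores after each downweighting step reduces the chance of flagging good observations along with it.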

When the smoothing parameter is

Numerical illustrations

In this section, one simulated example and one real data example are presented. In these examples, the influential observations discussed in Section 2 are identified. In addition, a thorough simulation experiment was set up for a range of scenarios, including different sample sizes, replications and noise levels. The simulation results attest to the validity of the warning limit proposed in the previous section. The simulation results also illustrate the effectiveness of the

Spline smoothing in generalized linear models

Consider the standard generalized linear model in which each component of the response vector has a distribution taking the form

$$f(y_k;\theta_k,\varphi)=\exp\left\{\frac{y_k\theta_k-m(\theta_k)}{u(\varphi)}+c(y_k,\varphi)\right\},$$

where $\theta_k$ and $\varphi$ are scalar parameters, and $m(\cdot)$, $u(\cdot)$ and $c(\cdot)$ are specific functions. The dependence of the response $y_k$ on the associated explanatory variable $t_k$ can be modeled through the link function $d(\cdot)$, where $\theta_k=d(\alpha+\beta t_k)$. Hereafter, $u(\varphi)=1$ and the natural link are assumed. Also, let $\theta_k=f(t_k)$, where $f$ is estimated by the penalized

Discussion

As illustrated in the examples given in Section 5, the identified influential observations might have a great impact on the behavior of the fitted curve and, in turn, lead to different interpretations of the data. The robust method thus provides a sensible alternative fit. The influence measures in Section 3 might still underestimate the case influence, since only the differences at the design points are used. It is also sensible to use integration rather than summation in developing these

Acknowledgements

I would like to thank a referee, an associate editor, and the editor, Professor Kontoghiorghes, for helpful suggestions that led to a substantial improvement in the paper. I would also like to thank Professor Kosorok at Madison, USA, for useful comments on an earlier version of this manuscript. The author is partly supported by Taiwan NSC Grant (Project: NSCg2-2118-M-029-004).

References (24)

  • R.L. Eubank et al. Diagnostics for penalized least-squares estimators. Statist. Probab. Lett. (1986)
  • K. Kafadar. Choosing among two-dimensional smoothers in practice. Comput. Statist. Data Anal. (1994)
  • T. Lee. Smoothing parameter selection for smoothing splines: a simulation study. Comput. Statist. Data Anal. (2003)
  • N.D. Brinkman. Ethanol fuel - A single-cylinder engine study of efficiency and exhaust emissions. SAE Trans. (1981)
  • R.J. Carroll et al. Transformations and Weighting in Regression. (1988)
  • R.D. Cook. Assessment of local influence (with discussion). J. Roy. Statist. Soc. Ser. B (1986)
  • P. Craven et al. Smoothing noisy data with spline functions. Numer. Math. (1979)
  • R.L. Eubank. Diagnostics for smoothing splines. J. Roy. Statist. Soc. Ser. B (1985)
  • R.L. Eubank. Spline Smoothing and Nonparametric Regression. (1988)
  • W.K. Fung et al. A note on local influence based on normal curvature. J. Roy. Statist. Soc. Ser. B (1997)
  • W.K. Fung et al. Influence diagnostics and outlier tests for semiparametric models. J. Roy. Statist. Soc. Ser. B (2002)
  • G. Golub et al. Generalized cross validation as a method for choosing a good ridge parameter. Technometrics (1979)