An improved strategy for regression of biophysical variables and Landsat ETM+ data

https://doi.org/10.1016/S0034-4257(02)00173-6Get rights and content

Abstract

Empirical models are important tools for relating field-measured biophysical variables to remote sensing data. Regression analysis has been a popular empirical method of linking these two types of data to provide continuous estimates for variables such as biomass, percent woody canopy cover, and leaf area index (LAI). Traditional methods of regression are not sufficient when resulting biophysical surfaces derived from remote sensing are subsequently used to drive ecosystem process models. Most regression analyses in remote sensing rely on a single spectral vegetation index (SVI) based on red and near-infrared reflectance from a single date of imagery. There are compelling reasons for utilizing greater spectral dimensionality, and for including SVIs from multiple dates in a regression analysis. Moreover, when including multiple SVIs and/or dates, it is useful to integrate these into a single index for regression modeling. Selection of an appropriate regression model, use of multiple SVIs from multiple dates of imagery as predictor variables, and employment of canonical correlation analysis (CCA) to integrate these multiple indices into a single index represent a significant strategic improvement over existing uses of regression analysis in remote sensing.

To demonstrate this improved strategy, we compared three different types of regression models to predict LAI for an agro-ecosystem and live tree canopy cover for a needleleaf evergreen boreal forest: traditional (Y on X) ordinary least squares (OLS) regression, inverse (X on Y) OLS regression, and an orthogonal regression method called reduced major axis (RMA). Each model incorporated multiple SVIs from multiple dates and CCA was used to integrate these. For a given dataset, the three regression-modeling approaches produced identical coefficients of determination and intercepts, but different slopes, giving rise to divergent predictive characteristics. The traditional approach yielded the lowest root mean square error (RMSE), but the variance in the predictions was lower than the variance in the observed dataset. The inverse method had the highest RMSE and the variance was inflated relative to the variance of the observed dataset. RMA provided an intermediate set of predictions in terms of the RMSE, and the variance in the observations was preserved in the predictions. These results are predictable from regression theory, but that theory has been essentially ignored within the discipline of remote sensing.

Introduction

Biogeochemical cycling models are increasingly run in a spatially explicit mode, requiring as model drivers moderate to high spatial resolution surfaces of land cover and leaf area index (LAI) derived from satellite imagery Bonan, 1993, Reich et al., 1999, Running et al., 1999. Mapping of continuous variables like LAI from high-resolution imagery such as Landsat TM or ETM+ has largely depended on modeling empirical relationships derived from single-date spectral vegetation indices (SVIs). The most important of these are the normalized difference vegetation index (NDVI) and its counterpart, the simple ratio (SR) Chen & Cihlar, 1996, Fassnacht et al., 1997, White et al., 1997. These and other ratio-based indices, although important, utilize only a fraction of the spectral information available in many image datasets (Cohen, Spies, & Fiorella, 1995). Moreover, with the cost of ETM+ data substantially reduced from that of its predecessor, TM, there are increasing opportunities to utilize multiple dates of imagery in these analyses.

Traditional methods for empirical modeling of continuous variables, such as LAI, from SVIs rely on ordinary least squares (OLS) regression (Steel & Torrie, 1980), a technique that has important limitations for such applications (Curran & Hay, 1986). In particular, a violation of assumptions about measurement error can have undesirable effects on OLS estimates of the biophysical variable. In spite of the cogent arguments against the use of OLS regression in remote sensing offered by Curran and Hay (1986), we could find only one subsequent remote sensing paper that heeded their advice (Larsson, 1993). Alternative regression models may provide improved estimates of biophysical variables in remote sensing. Several such models are discussed in the literature, but almost all of that literature is outside of our discipline.

When conducting regression analyses that utilize multiple SVIs and multi-date data, it would be useful to construct a single, integrated index to represent the multiple predictor variables. This would facilitate visual assessment of model strength and whether the integrated relationship is linear. An integrated index could also help in subsequent analyses, or for screen viewing and interpretation (similar to the NDVI or SR). Additionally, an integrated index would be useful for comparisons among possible model formulations. Most important, however, is that certain regression procedures are best conducted in a simple linear context, and thus rely on a single predictor variable. These needs can be met using a statistical tool known as canonical correlation analysis (CCA).

The goal of this paper is to demonstrate an improved strategy for regression modeling of biophysical variables in remote sensing. That strategy includes use of multiple SVIs from multiple dates of ETM+ imagery, development of a CCA-based index that integrates these, and choice of an appropriate type of regression model. We test three regression-modeling approaches: traditional OLS (RegT), inverse OLS (RegI), and reduced major axis (RMA). The test is done for two biophysical variables, one in each of two different biomes, to highlight the general applicability of the analyses and results. At an agricultural site, we model LAI for two separate dates; at a boreal forest site, live tree cover is modeled. The objective is to compare and contrast the three regression approaches in terms of basic statistical characteristics of the predicted variables relative to the statistical characteristics of the observed variables. Numerous examples of such comparisons exist in the general literature, and the lessons learned are applied in various disciplines. However, the two papers in remote sensing literature that address this issue Curran & Hay, 1986, Larsson, 1993 have been essentially ignored. With continued use of regression in remote sensing, and an increased reliance on Landsat imagery to drive ecosystem process models with regression-derived surfaces, it is imperative that we consider the weaknesses of our common methods and the potential strengths of alternative methods.

The value of SVIs for modeling the relationship between vegetation variables and reflectance data is well established. In particular, since their inception, the SR (Birth & McVey, 1968) and the NDVI (Rouse, Haas, Schell, & Deering, 1974) have dominated the remote sensing and related literature (e.g., Chen & Cihlar, 1996, Huete et al., 1985, Sellers, 1987, Tucker, 1979, Turner et al., 1999). Modifications to these indices have been proposed to account for background effects associated with incomplete canopy cover (Huete, 1988), some of which take advantage of shortwave-infrared reflectance Brown et al., 2000, Nemani et al., 1993.

Although these “ratio-based” indices have the advantage of being simple to understand and apply, an alternative set of indices, called “n-space indices” (Jackson, 1983), is designed to more fully exploit the spectral domain of reflectance data. Numerous such indices exist, including the Perpendicular Vegetation Index (Richardson & Wiegand, 1977) and the widely used Tasseled Cap, which consists of the brightness, greenness (Kauth & Thomas, 1976), and wetness (Crist & Cicone, 1984) indices. The Tasseled Cap indices, in particular, provide standardized coefficients for all spectral bands of Landsat MSS and TM data. One study, across a forested scene containing bare ground, brush, and broadleaf and needleleaf forests of varying ages, demonstrated that brightness, greenness, and wetness accounted for 85% of the total spectral variability contained in a single date of TM reflectance data (Cohen et al., 1995); this, in comparison to only 52% in the red and near-infrared bands.

The temporal domain of spectral data can greatly enhance our ability to map vegetation Helmer et al., 2000, Lefsky et al., 2001, Loveland et al., 2000, Oetter et al., 2001. Incorporating a temporal series of data into SVIs directly, however, has been given minimal attention. Malila (1980) described the change magnitude and angle calculations, which are indices of a sort, required for change vector analysis (CVA). Whereas CVA was designed for two spectral dimensions and two dates of imagery, it has been crudely extended to three spectral dimensions (Virag & Colwell, 1987), and the magnitude calculation has been generalized to n-dimensions by Lambin and Strahler (1994). The concept for generalizing CVA angles and magnitudes for n spectral and/or temporal dimensions was described by Cohen and Fiorella (1998), but they stopped short of implementing the procedure. Collins and Woodcock (1996) developed a two-date Tasseled Cap transformation for use in change detection, but an n-date transformation was not attempted.

Principal components analysis (PCA) is an attractive means of incorporating spectral data from numerous dates into a small set of axes that contain most of the spectral information contained in the full multispectral, multitemporal dataset Eastman & Fulk, 1993, Richards, 1984. Although not vegetation indices per se, the value of PCA for reducing the size of the spectral–temporal dataset is great. However, there are two important problems using PCA for spectral–temporal analyses. First, the resulting axes are dataset-dependent. Although this means that it is difficult to generalize the interpretation of PCA axes to other datasets, this is not unique to PCA, as the same problem is common to all correlation-based, empirical analyses. The second and more meaningful problem is that the coefficients for PCA axes are normally obtained without regard for the axes' relationships with the variable we are interested in predicting (e.g., LAI).

In the simple linear case, OLS regression analysis is an empirical approach for modeling the relationship between two observed variables, X and Y. The form of the OLS regression model isY=β01X+ε,where Y is the variable to be predicted, X is the variable Y is predicted from, β0 is the intercept, β1 is the slope of the relationship between X and Y, and ε is error. Data for the analysis are supplied by paired observations of the two variables. Commonly, one variable is difficult or costly to measure (e.g., vegetation attributes from field sampling), and the other is relatively easy or inexpensive to observe (e.g., SVIs from remote sensing). Although often the intent of OLS regression is to determine the feasibility of estimating or predicting the expensive variable from observations of the inexpensive one, sometimes the analysis is used simply to determine the form and strength of the relationship between the two variables. If the objective is the latter, it is not particularly important which variable is X and which is Y, and common in the remote sensing literature are both X and Y representing the vegetation variable and the SVI Butera, 1986, Chen & Cihlar, 1996, Cohen, 1991, White et al., 1997. If we are interested in actually using the regression model to predict one variable from the other, however, the distinction between X and Y becomes very important. This is because of specifications and assumptions associated with OLS regression.

One specification is that Y is the dependent variable and X is the independent variable (Steel & Torrie, 1980). Although it can be argued that spectral response is dependent on vegetation state and not the other way around, much of the remote sensing literature reports the vegetation attribute being modeled as the dependent variable. Curran and Hay (1986) discuss this as the “specification problem”. Specification in this manner is important because of an assumption associated with OLS regression: that the independent variable, X (e.g., an SVI), is measured without error (Steel & Torrie, 1980). As the coefficients for the regression equation are calculated by minimizing the sums of squares of error in Y (e.g., LAI), illustrated graphically by Curran and Hay (1986), the result is an attenuation (or compression) of the variance of LAI predictions. In other words, values above the mean of Y tend to be underpredicted and values below the mean tend to be overpredicted (e.g., Cohen et al., 2001, Hudak et al., 2002).

An alternative form of OLS regression is inverse estimation (Brown, 1979), also known as inverse regression (Cohen, 1991) and calibration (Scheffé, 1973). Curran and Hay (1986) refer to this as X on Y regression and illustrate the concept graphically. With inverse estimation, the specification problem is addressed in that the dependent and independent variables are properly assigned, or “specified” (e.g., X is the vegetation variable and Y is the SVI). In practice then, to predict the vegetation variable, the coefficients for the OLS regression model are derived using Eq. (1) and then the equation must be inverted to solve for X, such that X=(Yβ0)/β1. The error term in Eq. (1), ε, is expressed as prediction residuals for each observation.

Although for remote sensing, inverse estimation eliminates the specification problem, it does not address the more important problem that X is assumed to be measured without error. Curran and Hay (1986) provide an in-depth discussion of sources of error for both X and Y in remote sensing. The impact of measurement error in X when using inverse estimation is known to be the opposite to that of its effect using the Y on X form of OLS regression, i.e., amplification of the variance of predicted biophysical values such that values above the mean of X are overestimated and those below the mean are underestimated.

Recognizing that there are errors in both X and Y, Curran and Hay (1986) tested three alternative methods to predict grassland LAI from the SR: Wald's grouping method, RMA, and an alternative least squares procedure that incorporates a priori knowledge of relative errors in X and Y. They recommended using Wald's method or RMA if no estimates of measurement error are available, and the alternative least squares procedure if such estimates are available. From a practical perspective, it will be rare for analysts to have precise estimates of error from all the various sources associated with measurements of vegetation and spectral variables. As such, it is perhaps more prudent to make no assumptions regarding the relative amounts of measurement error and use RMA or Wald's method. In spite of the convincing arguments made by Curran and Hay (1986), we could find only one subsequent remote sensing article that used one of these methods (Larsson, 1993), where RMA was used to predict woodland canopy cover from single-date NDVI measurements.

RMA is one of a class of similar models known as orthogonal regression, total least squares regression, or errors-in-variables modeling, depending on the discipline in which the specific technique was developed (Van Huffel, 1997). Orthogonal regression minimizes the sum of squared orthogonal distances from measurement points to the model function. The RMA version of orthogonal regression is graphically depicted in Curran and Hay (1986). Van Huffel (1997) contains examples of orthogonal regression's usage in astronomy, meteorology, 3-D motion estimation, biomedical signal processing, and multivariate calibration. RMA, specifically, is quite commonly applied in allometry Conrad & Gutmann, 1996, Gower et al., 1999, Nicol & Mackauer, 1999, Niklas & Buchman, 1994. Besides making no assumptions about errors in X and Y, RMA likewise makes no assumptions about dependency. Conrad and Gutmann (1996) refer to RMA as geometric mean regression, in that the slope (β1) is defined as the ratio of sample standard deviation for Y over the sample standard deviation for X, thus preserving in the model the relative variance structure of the sample dataset. The effect of this is to minimize or eliminate any attenuation or amplification of predictions. For RMA, β0 is defined as the sample mean of Y minus the quantity β1 times the sample mean of X. One important component of the slope term (β1) is that it must be given the sign (+/−) of the correlation between X and Y (Conrad & Gutmann, 1996), which is not given by Curran and Hay (1986) or Larsson (1993). The form of the regression model is identical to Eq. (1), but the calculations of β0 and β1 are different. Mathematical similarities in the formulations of the two OLS and the RMA regression models mean that the model intercepts are all equivalent, as are the coefficients of determination. What differ among these models are the root mean square errors (RMSEs) and the slopes of the relationships.

OLS regression has both simple (single X) and multiple (several X) forms (Steel & Torrie, 1980). The use of OLS regression in its multiple form, Y on multiple X, is familiar to most remote sensing analysts conducting regression modeling. Although much less familiar, there is also a formulation for multiple X inverse calibration (Brown, 1979). A simple application of RMA requires one X and one Y. Thus, to incorporate n-space indices and/or temporal datasets into an RMA, the multiple X dataset must be linearly combined into a single X variable. In essence, we must develop a new, integrated index that is a linear combination of the multiple X indices (or bands) from a single date or multiple dates. As discussed earlier, this need is directly facilitated by CCA.

CCA is a generalized form of multiple regression that permits the examination of interrelationships between two sets of variables (multiple X's and multiple Y's) (Tabachnick & Fidell, 1989). CCA maximizes the correlation between a composite of variables from one set with a composite of variables from another set. When there is only one X (i.e., vegetation variable, such as LAI), CCA provides a set of coefficients for the Y's that aligns them with the variation in the X variable. When those coefficients are applied to the Y variables, the result is a CCA score for each observation. CCA scores are indexed values in the same way that brightness, greenness, or wetness (or NDVI) values are indexed values. However, with CCA, the alignment is dataset-specific, whereas with the Tasseled Cap or NDVI, the formulations are generalized and fixed.

Section snippets

Methods

This work was conducted in a temperate broadleaf agro-ecosystem, consisting of corn and soybeans, and a boreal needleleaf evergreen forest. The biophysical variable of interest within the agro-ecosystem was LAI, which was modeled for two separate measurement dates. The dominant tree species in the boreal forest is black spruce, and the variable we modeled was percent tree cover. This work was done in the context of the BigFoot project, which was designed to provide local validation of global

Agricultural example—LAI at AGRO

A scatterplot of corn and soybean greenness from the July measurement date (Fig. 1) revealed that these two crops represented different populations and were best modeled separately. This was done for both July and August dates, yielding four separate modeling sets (Table 2). For all four model sets, the three regression approaches had equivalent coefficients of determination and model intercepts. The differences among approaches were expressed in the slope term, with RegT, having the least, RegI

Boreal forest example—tree cover at NOBS

The models for tree cover at NOBS exhibited similar relative characteristics as those at AGRO (Table 2). The coefficients of determination and the intercepts were all identical. The only difference among modeling approaches was the slope of the relationships, with RegT having the lesser value, RegI having the greater value, and RMA having an intermediate value. Likewise, the cross-validation results indicated identical correlations between tree cover and the CCA axis and essentially no overall

Discussion and summary

This paper presents to a remote sensing audience a verification of existing regression theory. The remote sensing literature contains little of regression theory, and even less of the numerous options for its application. With rare exception, the remote sensing literature contains rote application of OLS (ordinary least squares) regression, without ever questioning its disadvantages relative to other forms of regression. Questioning these disadvantages and demonstrating alternative regression

Acknowledgements

This research was funded by NASA's Terrestrial Ecology Program. We greatly thank Karin Fassnacht for thoughtful and lively discussion and for her review of early drafts, and Robert Kennedy for his contributions to the discussion.

References (59)

  • A.R Huete et al.

    Spectral response of a plant canopy with different soil backgrounds

    Remote Sensing of Environment

    (1985)
  • R.D Jackson

    Spectral indices in n-space

    Remote Sensing of Environment

    (1983)
  • E.F Lambin et al.

    Change-vector analysis in multitemporal space: a tool to detect and categorize land-cover change processes using high temporal-resolution satellite data

    Remote Sensing of Environment

    (1994)
  • D Oetter et al.

    Land cover mapping in an agricultural setting using multiseasonal Thematic Mapper data

    Remote Sensing of Environment

    (2001)
  • P Reich et al.

    An approach to spatially distributed modeling of net primary production (NPP) at the landscape scale and its application in validation of EOS NPP products

    Remote Sensing of Environment

    (1999)
  • J.A Richards

    Thematic mapping from multitemporal image data using principal components transformation

    Remote Sensing of Environment

    (1984)
  • S Running et al.

    A global terrestrial monitoring network integrating tower fluxes, flask sampling, ecosystem modeling and EOS data

    Remote Sensing of Environment

    (1999)
  • P.J Sellers

    Canopy reflectance, photosynthesis, and transpiration II. The role of biophysics in the linearity of their interdependence

    Remote Sensing of Environment

    (1987)
  • C Song et al.

    Classification and change detection using Landsat TM data: when and how to correct atmospheric effects?

    Remote Sensing of Environment

    (2001)
  • P.M Teillet et al.

    Radiometric cross-calibration of the Landsat-7 ETM+ and Landsat-5 TM sensors based on tandem data sets

    Remote Sensing of Environment

    (2001)
  • C.J Tucker

    Red and photographic infrared linear combinations for monitoring vegetation

    Remote Sensing of Environment

    (1979)
  • D Turner et al.

    Relationships between leaf area index and Landsat TM spectral vegetation indices across three temperate zone sites

    Remote Sensing of Environment

    (1999)
  • J.E Vogelmann et al.

    Effects of Landsat 5 Thematic Mapper and Landsat 7 Enhanced Thematic Mapper Plus radiometric and geometric calibrations and corrections on landscape characterization

    Remote Sensing of Environment

    (2001)
  • Berterretche, M. (2002). Comparison of regression and geostatistical methods to develop LAI surfaces for NPP modeling....
  • G.S Birth et al.

    Measuring the color of growing turf with a reflectance spectrophotometer

    Agronomy Journal

    (1968)
  • G Brown

    An optimization criterion for linear inverse estimation

    Technometrics

    (1979)
  • Burrows, S., Gower, S., Clayton, M., Mackay, D., Ahl, D., Norman, J., & Diak, G. Application of geostatistics to...
  • M.K Butera

    A correlation and regression analysis of percent canopy closure versus TMS spectral response for selected forest site in the San Juan National Forest, Colorado

    IEEE Transactions on Geoscience and Remote Sensing

    (1986)
  • J.L Campbell et al.

    BigFoot: characterizing land cover, LAI, and NPP at the landscape scale for EOS/MODIS validation. Field Manual Version 2.1

    (1999)
  • Cited by (362)

    View all citing articles on Scopus
    View full text