Application of spectra cross-correlation for Type II outliers screening during multivariate near-infrared spectroscopic analysis of whole blood

https://doi.org/10.1016/j.chemolab.2011.04.015

Abstract

In this study, a simple screening algorithm was developed to prevent the occurrence of Type II errors, that is, samples with high prediction error that are not detected as outliers. The method is used to distinguish “good” from “bad” spectra and to prevent a false negative condition in which poorly predicted samples appear to lie within the calibration space yet have inordinately large residual or prediction errors. The detection and elimination of this type of sample, which is a true outlier but not easily detected, is extremely important in medical decisions, since such erroneous data can lead to considerable mistakes in clinical analysis and medical diagnosis. The algorithm is based on a cross-correlation comparison between sample spectra measured over the region of 4160–4880 cm⁻¹. The correlation values are converted using Fisher's z-transform, and a z-test of the transformed values is performed to screen out the outlier spectra. This approach provides a tuning parameter that can be used to decrease the percentage of samples with high analytical (residual) errors. The algorithm was tested using a dataset with known reference values to determine the number of false negative and false positive samples. Its performance was tested on several hundred blood samples prepared at different hematocrit (24 to 48%) and glucose (30 to 500 mg/dL) levels using blood component materials from thirteen healthy human volunteers. Experimental results illustrate the effectiveness of the proposed algorithm in finding and screening out Type II outliers, in terms of sensitivity and specificity, and its ability to lower the prediction error on future or validation datasets.
To our knowledge this is the first paper to introduce a statistically useful screening method based on spectra cross-correlation to detect the occurrence of Type II outliers (false negative samples) for routine analysis in a clinically relevant application for medical diagnosis.

Introduction

Near-infrared spectroscopy (NIRS) is a powerful analytical tool widely used to measure chemical and physical properties using a variety of sample presentation techniques [1], [2]. The use of NIRS to measure glucose in human whole blood has attracted much attention during the past twenty years, and many research teams have developed different types of instrumentation and processing approaches to measure this low level, non-specific analyte in human tissue matrices [3], [4]. NIRS requires multivariate calibration to determine the proportionality relationship between the measured spectroscopic signals and the component concentrations that are to be inferred from the spectra. The goal is to establish a calibration model that can be applied to the spectra of unknown (blind, future) samples to estimate concentration (or property) values [5]. Methods such as multilinear regression (MLR), classical least squares (CLS), principal components regression (PCR), and partial least squares (PLS) were developed and are today commonly used for the development of such calibration models [6], [7], [8], [9].

In this context, PLS has been found to be a useful and robust regression method for calibrating multivariate NIR spectral data [10], [11]. Success in building a PLS calibration model depends upon the quality of the dataset used to construct it. Under ‘real world’ measurement conditions, a variety of events can perturb the spectra or present unusual samples during measurement, and these can constitute bona fide outlier conditions [12]. Detection of outliers is very important for the robustness and predictive strength of a PLS model, and if they are appropriately removed prior to calibration, or during routine analysis, greater accuracy is achieved. Therefore, the first step towards obtaining high quality calibration and real-time forward prediction analysis is to detect and extract outliers during the analysis process [13], [14], [15]. Generally speaking, there are three types of outliers influencing the quality of the model: 1) outliers in reference data (y-outliers), 2) outliers in spectral data (x-outliers) and 3) outliers in both x and y. In this paper, the detection of sample outliers is based on examination of the spectra (x-outliers).

Over the years, several algorithms have been developed for outlier detection in different kinds of datasets, and the theory and application of these methods have been broadly discussed [16], [17], [18], [19], [20]. Outlier detection in the NIRS arena is often based on data with high leverage characteristics [21], [22], [23], [24]. High leverage samples can be identified from the scores of a principal components analysis (PCA) model, using statistics such as Hotelling's T² (based on the Mahalanobis distance, MD) and the Q-statistic. MD and spectral F-ratio (SFR) metrics are most frequently used for outlier detection [25], [26]. Both MD and SFR metrics depend on estimated parameters of the multivariate distribution, and samples with an MD or SFR above the designated threshold are considered outliers. The establishment of MD and SFR thresholds is uncertain under real world data conditions, and neither of these methods has provided a clear correlation between residual prediction error and the size of the outlier detection metric. Such an outlier metric is unacceptable for use in medical devices, where outlier metrics must be relied upon to determine potential confidence in estimating analytical error, and where analysis results are applied to diagnostic decisions.
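
For readers unfamiliar with the conventional leverage metrics mentioned above, the following is a minimal NumPy sketch of how Hotelling's T² and the Q-statistic are commonly computed from a PCA model. It is a textbook formulation, not the procedure used in this study; the function name hotelling_t2_and_q and the choice of n_components are illustrative assumptions, and no threshold is applied because the source does not state one.

```python
import numpy as np

def hotelling_t2_and_q(X, n_components):
    """Per-sample Hotelling's T^2 and Q residual from a PCA model of X.

    X has one spectrum per row. Illustrative helper, not from the paper.
    """
    Xc = X - X.mean(axis=0)                           # mean-center each variable
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                           # loadings (variables x components)
    T = Xc @ P                                        # scores (samples x components)
    score_var = s[:n_components] ** 2 / (X.shape[0] - 1)
    t2 = np.sum(T ** 2 / score_var, axis=1)           # squared Mahalanobis distance in score space
    residual = Xc - T @ P.T                           # part of each spectrum the model misses
    q = np.sum(residual ** 2, axis=1)                 # Q-statistic (squared residual norm)
    return t2, q
```

A sample would normally be flagged when T² or Q exceeds a chosen limit. As the paragraph above notes, setting that limit is the uncertain part.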

The application of outlier identification in medical devices such as blood count analyzers, glucose meters, cardiac devices, blood pressure recording devices, medical imaging platforms, etc., is extremely important as well. The occurrence of medical errors, especially in operating rooms and intensive care units, remains a persistent and serious problem. The urgency and scope of this problem prompt both academia and industry to develop outlier detection strategies that alert clinicians to the occurrence of random or systematic errors and aid in eliminating clinical mistakes. Detection sensitivity, speed performance, and automation are among the parameters investigated during algorithm development in order to improve quality of care in hospital and clinical settings. For the reader's convenience, Refs. [27], [28], [29], [30], [31], [32] provide an overview of a few outlier detection methods used for medical devices.

The objective of the spectra cross-correlation algorithm proposed here is to use only spectral information to determine that no false negatives (Type II outliers) exist. The algorithm provides tuning parameters for adjusting an analyzer system to mitigate false negatives during real-time Type II outlier detection. By Type II outliers or false negatives, we mean samples that pass normal outlier metrics but have large residual or analytical errors caused by spectral anomalies not caught by conventional metrics. They therefore pass outlier screening when they should not.

The cross-correlation algorithm includes four main steps. First, a cross-correlation matrix is computed between all calibration sample spectra. Since the correlation is a Pearson correlation-based calculation, the distribution is approximately normal. In the second step, to form a more rigorously normally distributed statistic, each of the cross-correlation matrix values is converted using Fisher's z-transform [33], [34]. In the third step, a z-test of the transformed values is performed, and in the fourth and final step, outlier spectra are recognized using a decision (tuning) parameter. The tuning parameter was developed at the same time as the calibration algorithm in order to decrease the percentage of false negative and false positive samples. The algorithm was tested on blood samples with blood glucose and hematocrit levels ranging over 30–500 mg/dL and 24–48%, respectively. These samples were blended from the blood constituents of healthy human volunteers (n = 13).
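
The four steps can be sketched in Python with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the exact form of the z-test is not given in this excerpt, so the sketch assumes each Fisher z value is compared with a reference value z_ref, and that a spectrum passes when the fraction of comparisons at or above z_ref reaches a tuning fraction. The names calibration_z, passes_screen, z_ref and min_fraction_ones, and the synthetic spectra, are hypothetical.

```python
import numpy as np

def fisher_z(r):
    """Fisher's z-transform, z = 0.5 * ln((1 + r) / (1 - r)) = arctanh(r)."""
    r = np.clip(r, -1.0 + 1e-12, 1.0 - 1e-12)   # keep z finite when |r| is 1
    return np.arctanh(r)

def calibration_z(calibration):
    """Steps 1-2: Pearson correlations between all pairs of calibration
    spectra (one spectrum per row), transformed to Fisher z values."""
    r = np.corrcoef(calibration)
    upper = np.triu_indices_from(r, k=1)        # each pair once, no diagonal
    return fisher_z(r[upper])

def passes_screen(calibration, candidate, z_ref, min_fraction_ones):
    """Steps 1-4 for one candidate spectrum (assumed decision rule)."""
    cal_c = calibration - calibration.mean(axis=1, keepdims=True)
    cand_c = candidate - candidate.mean()
    # Step 1: Pearson correlation of the candidate with every calibration spectrum
    r = (cal_c @ cand_c) / (np.linalg.norm(cal_c, axis=1) * np.linalg.norm(cand_c))
    # Step 2: Fisher z-transform
    z = fisher_z(r)
    # Step 3 (assumed): score 1 where z reaches the reference value, else 0
    ones = z >= z_ref
    # Step 4 (assumed): accept when enough comparisons scored 1
    return bool(ones.mean() >= min_fraction_ones)

# Synthetic demonstration data (not from the study)
rng = np.random.default_rng(0)
x = np.linspace(4160.0, 4880.0, 200)            # wavenumber axis, cm^-1
base = np.exp(-((x - 4400.0) / 150.0) ** 2) + 0.5 * np.exp(-((x - 4700.0) / 80.0) ** 2)
calibration = base * rng.uniform(0.9, 1.1, size=(30, 1)) + rng.normal(0.0, 1e-4, size=(30, x.size))
good = base + rng.normal(0.0, 1e-4, size=x.size)     # resembles the calibration set
bad = base + 0.05 * np.sin(x / 20.0)                 # carries a spectral distortion

good_ok = passes_screen(calibration, good, z_ref=6.0, min_fraction_ones=0.9)
bad_ok = passes_screen(calibration, bad, z_ref=6.0, min_fraction_ones=0.9)
```

Because the decision uses only spectra, it can be applied to a new sample before any reference value is available, which is the property the screening approach relies on. The sketch does not guard against a constant spectrum, for which the correlation is undefined.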

Instrumentation

Spectroscopic data were collected with an in-house Fourier transform NIR spectrometer operating between 4000 and 8000 cm⁻¹ with a spectral resolution of 32 cm⁻¹. The moveable mirror in the spectrometer scans continuously and produces approximately eight interferograms per second. The blood and background (saline, NaCl 0.9%) samples were pumped through a borosilicate flow cell with a nominal pathlength of 1 mm that was temperature controlled at 34 ± 0.5 °C.

Procedures

Human blood is collected from healthy human

Algorithm description

In the first step of the algorithm, each single-beam spectrum is restricted to the spectral interval between 4160 and 4880 cm⁻¹. The literature and empirical findings demonstrate that this spectral region exhibits the greatest specificity for glucose [36], [37]. A characteristic absorption spectrum (after noise scaling) of whole blood over the region of 4000 to 8000 cm⁻¹ is shown in Fig. 1a. An enlarged portion of the 4160–4880 cm⁻¹ region is shown in Fig. 1b. For the outlier algorithm, a

Results and discussion

A PLS model was built with 1060 spectra as mentioned in Section 2. zref of Eq. (4) was set to different values around its average, from 5 to 7. For each zref value, the ‘percent of ones’ (%O's) was varied between 70 and 95% in 5% increments. For each combination of zref and %O's, other statistics, such as the standard error of cross-validation (SECV), mean residual, total number of samples, and number of outlier samples, were calculated.
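
The tuning sweep described above can be sketched as a simple grid evaluation. This is a hedged illustration, not the authors' code: the 0.5 step for zref is an assumption (the text gives only the range 5 to 7), the input z_values (one row of Fisher z values per sample) and the labels truly_bad (samples with unacceptably large residual error) are hypothetical inputs, and the decision rule is the same assumed rule of flagging a sample when its fraction of ones falls below %O's.

```python
import numpy as np

Z_REF_GRID = np.arange(5.0, 7.01, 0.5)          # 5.0 ... 7.0 (step is an assumption)
PCT_ONES_GRID = np.arange(70, 96, 5) / 100.0    # 0.70 ... 0.95 in 5% increments

def sweep_tuning(z_values, truly_bad):
    """Evaluate every (z_ref, %O's) pair.

    z_values  : array (n_samples, n_calibration) of Fisher z values
    truly_bad : boolean array (n_samples,), True where the reference
                method shows a large prediction error
    """
    truly_bad = np.asarray(truly_bad, dtype=bool)
    results = []
    for z_ref in Z_REF_GRID:
        fraction_ones = (z_values >= z_ref).mean(axis=1)
        for pct in PCT_ONES_GRID:
            flagged = fraction_ones < pct                # screened out as outliers
            tp = int(np.sum(flagged & truly_bad))
            fn = int(np.sum(~flagged & truly_bad))       # Type II: bad sample not caught
            fp = int(np.sum(flagged & ~truly_bad))       # good sample rejected
            tn = int(np.sum(~flagged & ~truly_bad))
            results.append({
                "z_ref": float(z_ref),
                "pct_ones": float(pct),
                "false_negatives": fn,
                "false_positives": fp,
                "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
                "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
            })
    return results
```

Each row of the result pairs one setting with its false negative and false positive counts, so a setting can be chosen according to how strongly Type II errors are to be penalized relative to rejected good samples.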

For all testing, outliers were defined using the

Conclusions

The objective of this research was to incorporate a scientifically valid technique to prevent errors in analysis (Type II) for unattended spectroscopy using multivariate analysis in clinical/medical applications. To that end, we have proposed and demonstrated a new algorithm based on spectra correlation to detect Type II outlier occurrence for routine analysis in a clinically relevant application for medical diagnosis. We have tested the algorithm against standard methods for multivariate outlier

Acknowledgement

The authors would like to gratefully acknowledge colleagues and the anonymous reviewers for all their valuable comments and useful suggestions, which made the manuscript clearer and more focused.

References (51)

  • P. Geladi et al., Anal. Chim. Acta (1986)
  • D.L. Massart et al., Anal. Chim. Acta (1986)
  • Y.Z. Liang et al., Chemom. Intell. Lab. Syst. (1996)
  • M. Daszykowski et al., Chemom. Intell. Lab. Syst. (2007)
  • B. Walczak et al., Chemom. Intell. Lab. Syst. (1998)
  • J.A. Fernández Pierna et al., Chemom. Intell. Lab. Syst. (2002)
  • T. Lillhonga et al., Anal. Chim. Acta (2005)
  • M. Imhoff et al., Clin. Anaesthesiol. (2009)
  • C.E. Metz, Semin. Nucl. Med. (1978)
  • R. De-Maesschalck et al., Chemom. Intell. Lab. Syst. (2000)
  • H.W. Siesler et al., Near-Infrared Spectroscopy: Principles, Instruments, Applications (2002)
  • D.A. Burns et al.
  • O.S. Khalil, Diabetes Technol. Ther. (2004)
  • K. Maruo et al., Appl. Spectrosc. (2003)
  • H. Mark et al., Chemometrics in Spectroscopy (2007)
  • H. Martens et al., Multivariate Calibration (1992)
  • M.H. Kutner et al., Applied Linear Regression Models (2004)
  • D. Massart et al., Handbook of Chemometrics and Qualimetrics, Part B (1997)
  • K. Beebe et al., Chemometrics, A Practical Guide (1998)
  • T. Næs et al., A User-Friendly Guide to Multivariate Calibration and Classification (2002)
  • V. Barnett et al., Outliers in Statistical Data (1995)
  • P.J. Rousseeuw et al., Robust Regression and Outlier Detection (1987)
  • P.J. Rousseeuw, Robust Estimation and Identifying Outliers (1990)
  • D.A. Belsley et al., Regression Diagnostics (1980)
  • S. Weisberg, Applied Linear Regression (2005)