Model validation and error estimation in multi-block partial least squares regression

doi:10.1016/j.chemolab.2011.06.001

Chemometrics and Intelligent Laboratory Systems

Volume 117, 1 August 2012, Pages 42-53

https://doi.org/10.1016/j.chemolab.2011.06.001

Abstract

While validation of Partial Least Squares Regression (PLSR) models has been discussed extensively, validation tools tailored to Multi-block Partial Least Squares Regression (MBPLSR) have not yet been discussed in the literature. This paper introduces validation tools for estimating predictive ability and model stability in MBPLSR models at the block level and at the global level. Predictive ability at the block and global levels is estimated by calculating the predictive power of block and global parameters. Model stability is estimated by checking the stability of block model parameters and global parameters. By comparing error plots for model stability and predictive ability, the user can decide on the number of components to be used. The number of components to be chosen depends on the data set and the purpose of the investigation.

Highlights

► Multi-block PLSR detects and displays y-relevant co-variation patterns in the data.
► Tools to assess the relative importance of the individual blocks are proposed.
► Prediction ability of each block is visualized.
► Model stability in individual blocks is estimated.
► Optimal rank in global and block models is determined.

Introduction

A multi-block data set can be analyzed by means of different multi-block methods, e.g. Consensus Principal Component Analysis (CPCA) and Multi-block Partial Least Squares Regression (MBPLSR). Both methods provide the user with an efficient graphical overview of sample and variable variation patterns between and within the data blocks [1]. The powerful visualization tools provided by these multi-block methods make it easy to interpret the results. However, one should bear in mind that interpretations made by practitioners on the basis of visually detected patterns may be misleading. This raises the need for formal validation of the outcomes of a multi-block analysis. So far, this topic has not attracted sufficient attention. Recently, we reported a study concerning the interpretation and validation of visually detected patterns in both the global and block results of CPCA [2]. We intend to undertake a similar study within the framework of MBPLSR. This method is a prevalent approach in the analysis of multi-block data sets. It is employed in different fields of science, e.g. analysis of environmental data sets [3] and spectral data sets [4], [5], modeling of pharmaceutical processes [6] and monitoring of complex chemical processes [7].

Validation of the MBPLSR model can be approached from two different points of view: 1) Since MBPLSR is a data analytical technique which enables the user to set up predictive models, one possible way of validation is to validate the predictive ability of the model. 2) As MBPLSR is also a multi-block data analytical technique, it is of paramount interest to validate the contribution of the different blocks to the overall model and to assess how the global MBPLSR model is related to each block. In this paper, two validation strategies for MBPLSR are presented and illustrated on the basis of a data set from a study which aimed at characterizing natural variability in microbiology. New methods for validating the visually identified patterns from the results of MBPLSR, both at the global and block levels, are introduced, thus allowing the user to formally validate these patterns. The paper is organized as follows. In order to introduce the block and global parameters of MBPLSR that are used for visualization, MBPLSR is described and its algorithm is given in Section 2.2. The Root Mean Square Errors of X and Y (RMSE_X and RMSE_Y) are calculated for validation purposes in Section 2.3. Section 3 presents the multi-block data set that is used as an example in this paper. Global and block score plots, which are important visualization tools in MBPLSR, are illustrated together with our proposed validation tools in Section 4. We end the paper with a conclusion in Section 5.
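
The exact definitions of RMSE_X and RMSE_Y are given in Section 2.3 of the paper and are not reproduced in this excerpt. As a rough illustration of the general idea only, the following Python sketch computes a segment-wise cross-validated RMSE for a response. The helper names (rmse, cv_segments, cross_validated_rmse_y), the contiguous segment layout, and the pluggable fit_predict callable are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def rmse(residuals):
    """Root mean square error over all entries of a residual array."""
    r = np.asarray(residuals, dtype=float)
    return float(np.sqrt(np.mean(r ** 2)))

def cv_segments(n_samples, n_segments):
    """Split sample indices into M contiguous segments (assumed layout)."""
    return np.array_split(np.arange(n_samples), n_segments)

def cross_validated_rmse_y(X, Y, n_segments, fit_predict):
    """RMSE of held-out predictions, one segment left out at a time.

    fit_predict(X_train, Y_train, X_test) -> Y_pred can be any regression
    routine, e.g. a PLSR or MBPLSR model with a fixed number of components.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    all_idx = np.arange(len(Y))
    residuals = np.empty_like(Y)
    for test_idx in cv_segments(len(Y), n_segments):
        train_idx = np.setdiff1d(all_idx, test_idx)
        Y_pred = fit_predict(X[train_idx], Y[train_idx], X[test_idx])
        residuals[test_idx] = Y[test_idx] - Y_pred
    return rmse(residuals)
```

Repeating the call with a fit_predict routine for each number of components a = 1, …, A would give an error curve of the kind that is compared when choosing the model rank.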

Section snippets

Notation

We follow the notation commonly used in chemometrics, e.g. Martens & Martens [8]: matrices and vectors are written in bold face, matrices as upper-case letters and vectors as lower-case letters. By the indices b = 1, …, B we denote blocks of variables, by m = 1, …, M cross-validation segments of samples and by a = 1, …, A the PLSR components. By X = [X^1 X^2 … X^b … X^B] we denote the multi-block descriptor data set consisting of B blocks. Measurements pertaining to the same measurement technique, e.g.
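
As a small illustration of the block layout in this notation, the following Python sketch builds a concatenated matrix from B blocks that share the same samples and records which columns belong to which block. The function name concatenate_blocks and the use of column slices are assumptions for this example, not part of the paper.

```python
import numpy as np

def concatenate_blocks(blocks):
    """Return the concatenated matrix [X^1 ... X^B] and one column slice per block."""
    blocks = [np.asarray(Xb, dtype=float) for Xb in blocks]
    n_samples = blocks[0].shape[0]
    if any(Xb.shape[0] != n_samples for Xb in blocks):
        raise ValueError("all blocks must be measured on the same samples (rows)")
    slices, start = [], 0
    for Xb in blocks:
        # Columns start .. start + K_b - 1 hold block b.
        slices.append(slice(start, start + Xb.shape[1]))
        start += Xb.shape[1]
    return np.hstack(blocks), slices
```

With the slices at hand, block b can be recovered from the concatenated matrix as X[:, slices[b]].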

Multi-block data set

The multi-block data set which is used as the explanatory data in this study consists of four data blocks with a different number of variables in each block; all of the variables are measured on the same 88 microbiological samples. The original multi-block data set is described in detail in references [1], [21]. Fig. 3 illustrates the multi-block data set. The multi-block explanatory data set contains Fourier Transform Infrared (FTIR) spectra and Amplified Fragment Length Polymorphism (AFLP)

MBPLSR of a multi-block data set

In order to find the common variation pattern in the explanatory multi-block data set which can predict the Sakacin P sensitivity of the strains in the study, MBPLSR was performed. For the MBPLSR algorithm we used super score deflation of X and Y. We have also tested the MBPLSR algorithm where only Y is deflated. The two algorithms lead to different block scores, and consequently the obtained RMSE plots are expected to be different. The obtained validation results were slightly different (results not
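
The paper's own algorithm is given in its Section 2.2 and is not reproduced in this excerpt. To make the idea of super score deflation of X and Y concrete, here is a minimal Python sketch of a NIPALS-style multi-block PLS loop as it is commonly described in the multi-block literature. The function name, the column centring, the unit-length block and super weights, the absence of block scaling, and the convergence settings are all assumptions of this sketch and may differ from the authors' implementation.

```python
import numpy as np

def mbpls_super_score_deflation(blocks, Y, n_components, tol=1e-10, max_iter=500):
    """Illustrative MB-PLS with super-score deflation of X and Y (sketch only)."""
    Xb = [np.asarray(X, dtype=float) for X in blocks]
    Xb = [X - X.mean(axis=0) for X in Xb]                 # column-centre each block
    Yr = np.asarray(Y, dtype=float).reshape(len(Xb[0]), -1)
    Yr = Yr - Yr.mean(axis=0)                             # column-centre Y
    n = Yr.shape[0]
    T_super = np.zeros((n, n_components))
    T_block = [np.zeros((n, n_components)) for _ in Xb]
    for a in range(n_components):
        # Start from the y column with the largest remaining variance.
        u = Yr[:, [int(np.argmax(Yr.var(axis=0)))]]
        for _ in range(max_iter):
            t_b = []
            for X in Xb:
                w = X.T @ u
                t_b.append(X @ (w / np.linalg.norm(w)))   # block score
            T = np.hstack(t_b)                            # block scores side by side
            w_super = T.T @ u
            t = T @ (w_super / np.linalg.norm(w_super))   # super (global) score
            q = Yr.T @ t / (t.T @ t)                      # Y loadings
            u_new = Yr @ q / (q.T @ q)
            converged = np.linalg.norm(u_new - u) <= tol * np.linalg.norm(u_new)
            u = u_new
            if converged:
                break
        tt = (t.T @ t).item()
        for b in range(len(Xb)):
            p = Xb[b].T @ t / tt                          # block loadings on the super score
            Xb[b] = Xb[b] - t @ p.T                       # deflate X block with the super score
            T_block[b][:, a] = t_b[b][:, 0]
        Yr = Yr - t @ q.T                                 # deflate Y with the super score
        T_super[:, a] = t[:, 0]
    return T_super, T_block, Yr
```

In this sketch, deflating every block with the super score makes successive super scores mutually orthogonal. A variant that deflates only Y would skip the update of Xb[b] and, as noted above, would yield different block scores.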

Conclusion

Multi-block techniques such as MBPLSR provide the user with powerful visualization tools that aim at a better understanding of the data. Sample and variable variation patterns are detected between and within the data blocks. However, the patterns that are identified visually should not be taken for granted and should be validated before they are interpreted. Indeed, the identification of patterns by visual inspection can be misleading. In order to avoid misinterpreting the identified block and

Acknowledgments

The authors are grateful for financial support by the Nordic Centre of Excellence on Food, Nutrition and Health “Systems biology in controlled dietary interventions and cohort studies” (SYSDIET) funded by NordForsk, by the Foundation for Research Levy on Agricultural Products in Norway and by grant 203699 (New statistical tools for integrating and exploiting complex genomic and phenotypic data sets) financed by the Research Council of Norway.

References (24)

S. Hassani et al., Analysis of -omics data: graphical interpretation- and validation tools in multi-block methods, Chemometrics and Intelligent Laboratory Systems (2010)
M.Z. Jaafar et al., Multiblock analysis of environmental measurements: A case study of using Proton Induced X-ray Emission and meteorology dataset obtained from Islamabad Pakistan, Chemometrics and Intelligent Laboratory Systems (2011)
L.P. Brás et al., Multiblock PLS as an approach to compare and combine NIR and MIR spectra in calibrations of soybean flour, Chemometrics and Intelligent Laboratory Systems (2005)
M. Jing et al., Multiblock partial least squares regression based on wavelet transform for quantitative analysis of near infrared spectra, Chemometrics and Intelligent Laboratory Systems (2010)
L.P. Brás et al., Modelling and identification of individual stage contributions in an industrial pharmaceutical process by multiblock PLS
S.W. Choi et al., Multiblock PLS-based localized process diagnosis, Journal of Process Control (2005)
R.W. Gerlach et al., Partial least-squares path modelling with latent variables, Analytica Chimica Acta (1979)
I.E. Frank et al., Prediction of wine quality and geographic origin from chemical measurements by partial least-squares regression modeling, Analytica Chimica Acta (1984)
I.E. Frank et al., A multivariate method for relating groups of measurements connected by a causal pathway, Analytica Chimica Acta (1985)
A. Kohler et al., Data preprocessing: SNV, MSC and EMSC pre-processing in biospectroscopy
A. Kohler et al., Interpreting several types of measurements in bioscience
H. Martens et al., Multivariate Analysis of Quality: An Introduction (2001)

