Model validation and error estimation in multi-block partial least squares regression
Highlights
► Multi-block PLSR detects and displays y-relevant co-variation patterns in the data.
► Tools to assess the relative importance of the individual blocks are proposed.
► Prediction ability of each block is visualized.
► Model stability in individual blocks is estimated.
► Optimal rank in global and block models is determined.
Introduction
A multi-block data set can be analyzed with several multi-block methods, e.g. Consensus Principal Component Analysis (CPCA) and Multi-block Partial Least Squares Regression (MBPLSR). Both methods give the user an efficient graphical overview of sample and variable variation patterns between and within the data blocks [1]. The powerful visualization tools provided by these methods make the results easy to interpret. However, interpretations that practitioners base on the visual detection of patterns may be misleading, which raises the need for formal validation of the outcomes of a multi-block analysis. So far, this topic has not attracted sufficient attention. We recently reported a study on the interpretation and validation of visually detected patterns in both the global and block results of CPCA [2]. Here we undertake a similar study within the framework of MBPLSR, a prevalent approach to the analysis of multi-block data sets. It is employed in different fields of science, e.g. the analysis of environmental data sets [3] and spectral data sets [4], [5], the modeling of pharmaceutical processes [6] and the monitoring of complex chemical processes [7].
Validation of an MBPLSR model can be approached from two points of view: 1) Since MBPLSR is a data analytical technique that enables the user to set up predictive models, one way of validating it is to validate the predictive ability of the model. 2) As MBPLSR is also a multi-block technique, it is of paramount interest to validate the contribution of the different blocks to the overall model and to assess how the global MBPLSR model is related to each block. In this paper, two validation strategies for MBPLSR are presented and illustrated on a data set from a study that aimed at characterizing natural variability in microbiology. New methods are introduced for validating the visually identified patterns in the MBPLSR results, at both global and block levels, thus allowing the user to validate these patterns formally. The paper is organized as follows: To introduce the block and global parameters of MBPLSR that are used for visualization, MBPLSR is described and its algorithm is given in Section 2.2. The Root Mean Square Errors of X and Y (RMSEX and RMSEY) are calculated for validation purposes in Section 2.3. Section 3 presents the multi-block data set that is used as an example in this paper. Global and block score plots, which are important visualization tools in MBPLSR, are illustrated together with our proposed validation tools in Section 4. We end the paper with a conclusion in Section 5.
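As a rough illustration of the kind of quantity meant by RMSEY, the following Python sketch computes a root mean square error of prediction by segmented cross-validation. It is not the procedure of Section 2.3, whose exact definitions are not reproduced here. The function names (`rmse`, `cv_segments`, `cross_validated_rmsey`), the contiguous segment layout and the user-supplied `fit`/`predict` callables are all assumptions made for illustration.

```python
import numpy as np


def rmse(residuals):
    """Root mean square of all entries in a residual array."""
    residuals = np.asarray(residuals, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))


def cv_segments(n_samples, n_segments):
    """Split the sample indices into M contiguous cross-validation segments.

    Contiguous segments are an assumption; replicate-wise or random
    segments may be more appropriate for a given data set.
    """
    return np.array_split(np.arange(n_samples), n_segments)


def cross_validated_rmsey(X, Y, fit, predict, n_segments):
    """RMSE of Y predictions obtained by segmented cross-validation.

    fit(X_train, Y_train) returns a model object and
    predict(model, X_test) returns predicted Y values; both are
    hypothetical callables supplied by the user.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    residuals = np.empty_like(Y)
    for test_idx in cv_segments(len(X), n_segments):
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[test_idx] = False          # leave segment m out
        model = fit(X[train_mask], Y[train_mask])
        residuals[test_idx] = Y[test_idx] - predict(model, X[test_idx])
    return rmse(residuals)
```

Any regression method can be plugged in through `fit` and `predict`; repeating the calculation for a = 1, …, A components gives an error curve from which a rank can be chosen.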
Section snippets
Notation
We follow the notation commonly used in chemometrics, e.g. Martens & Martens [8]: Matrices and vectors are written in bold face, matrices as upper-case letters and vectors as lower-case letters. The indices b = 1, …, B denote blocks of variables, m = 1, …, M denote cross-validation segments of samples and a = 1, …, A denote PLSR components. By X = [X1 X2 … Xb … XB] we denote the multi-block descriptor data set consisting of B blocks. Measurements pertaining to the same measurement technique, e.g.
Multi-block data set
The multi-block data set used as the explanatory data in this study consists of four data blocks with a different number of variables in each block; all variables are measured on the same 88 microbiological samples. The original multi-block data set is described in detail in references [1], [21]. Fig. 3 illustrates the multi-block data set. The multi-block explanatory data set contains Fourier Transform Infrared (FTIR) spectra and Amplified Fragment Length Polymorphism (AFLP)
MBPLSR of a multi-block data set
MBPLSR was performed in order to find the common variation pattern in the explanatory multi-block data set that can predict the Sakacin P sensitivity of the strains in the study. For the MBPLSR algorithm we used super score deflation of X and Y. We have also tested the MBPLSR algorithm where only Y is deflated. The two algorithms lead to different block scores, and consequently the obtained RMSE plots are expected to be different. The obtained validation results were slightly different (results not
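Super score deflation can be sketched in Python as follows. This is a minimal sketch in the spirit of the NIPALS-type multi-block PLS algorithms in the literature, not the authors' algorithm from Section 2.2. It assumes mean-centered, pre-scaled blocks and a two-dimensional Y, and the function names `mbpls_component` and `mbpls_super_score_deflation` are hypothetical.

```python
import numpy as np


def mbpls_component(blocks, Y, tol=1e-10, max_iter=500):
    """Extract one multi-block PLS component (NIPALS-style iteration).

    blocks is a list of (n x p_b) arrays and Y an (n x k) array,
    all assumed to be mean-centered.
    """
    u = Y[:, [0]].copy()                      # start from the first y-column
    t_old = None
    for _ in range(max_iter):
        block_weights, block_scores = [], []
        for Xb in blocks:
            w = Xb.T @ u
            w = w / np.linalg.norm(w)         # unit-length block weights
            block_weights.append(w)
            block_scores.append(Xb @ w)       # block scores t_b
        T = np.hstack(block_scores)           # one column per block
        w_super = T.T @ u
        w_super = w_super / np.linalg.norm(w_super)
        t_super = T @ w_super                 # super (global) scores
        q = Y.T @ t_super / (t_super.T @ t_super)
        u = Y @ q / (q.T @ q)
        if t_old is not None and (
            np.linalg.norm(t_super - t_old) < tol * np.linalg.norm(t_super)
        ):
            break
        t_old = t_super
    return t_super, block_scores, block_weights, w_super, q


def mbpls_super_score_deflation(blocks, Y, n_components):
    """Extract components, deflating every X block and Y on the super scores."""
    blocks = [np.array(Xb, dtype=float) for Xb in blocks]
    Y = np.array(Y, dtype=float)
    super_scores = []
    for _ in range(n_components):
        t, *_ = mbpls_component(blocks, Y)
        tt = float(np.sum(t ** 2))
        # Remove from each block and from Y the part explained by t.
        blocks = [Xb - t @ (Xb.T @ t).T / tt for Xb in blocks]
        Y = Y - t @ (Y.T @ t).T / tt
        super_scores.append(t)
    return np.hstack(super_scores), blocks, Y
```

Because every block is deflated on the same super scores, successive super scores in this sketch are mutually orthogonal. A variant in which only Y is deflated would leave `blocks` unchanged inside the loop.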
Conclusion
Multi-block techniques such as MBPLSR provide the user with powerful visualization tools that aim at a better understanding of the data. Sample and variable variation patterns are detected between and within the data blocks. However, the patterns that are identified visually should not be taken for granted and should be validated before they are interpreted. Indeed, the identification of patterns by visual inspection can be misleading. In order to avoid misinterpreting the identified block and
Acknowledgments
The authors are grateful for financial support from the Nordic Centre of Excellence on Food, Nutrition and Health “Systems biology in controlled dietary interventions and cohort studies” (SYSDIET) funded by NordForsk, from the Foundation for Research Levy on Agricultural Products in Norway and from grant 203699 (New statistical tools for integrating and exploiting complex genomic and phenotypic data sets) financed by the Research Council of Norway.
References (24)
- et al. Analysis of -omics data: graphical interpretation- and validation tools in multi-block methods. Chemometrics and Intelligent Laboratory Systems (2010)
- et al. Multiblock analysis of environmental measurements: A case study of using Proton Induced X-ray Emission and meteorology dataset obtained from Islamabad Pakistan. Chemometrics and Intelligent Laboratory Systems (2011)
- et al. Multiblock PLS as an approach to compare and combine NIR and MIR spectra in calibrations of soybean flour. Chemometrics and Intelligent Laboratory Systems (2005)
- et al. Multiblock partial least squares regression based on wavelet transform for quantitative analysis of near infrared spectra. Chemometrics and Intelligent Laboratory Systems (2010)
- et al. Modelling and identification of individual stage contributions in an industrial pharmaceutical process by multiblock PLS
- et al. Multiblock PLS-based localized process diagnosis. Journal of Process Control (2005)
- et al. Partial least-squares path modelling with latent variables. Analytica Chimica Acta (1979)
- et al. Prediction of wine quality and geographic origin from chemical measurements by partial least-squares regression modeling. Analytica Chimica Acta (1984)
- et al. A multivariate method for relating groups of measurements connected by a causal pathway. Analytica Chimica Acta (1985)
- et al. Data preprocessing: SNV, MSC and EMSC pre-processing in biospectroscopy
- Interpreting several types of measurements in bioscience
- Multivariate Analysis of Quality: An Introduction