Process PLS: Incorporating substantive knowledge into the predictive modelling of multiblock, multistep, multidimensional and multicollinear process data

Chemical production processes benefit from intelligent data analysis. Previous work showed how process knowledge can be included in a structural equation modelling framework. While predictive models increase process value, currently available methods have limitations that hinder applicability to many (industrial) processes. This paper describes the Process PLS algorithm, which can analyze multi-block, multistep and/or multidimensional processes. Process PLS was benchmarked on a simulated crude oil distillation process. Analysis of 22 empirical data sets from a production process at Nouryon illustrated how Process PLS solves limitations of PLS path modelling. In the analysis of the benchmark Val de Loire data, Process PLS revealed substantively meaningful effects which the recently proposed Sequential Orthogonalized PLS path modelling completely missed. Process PLS is a promising approach that enables data-driven analysis of process data using information on the complex process structure, to demonstrably increase insight into the underlying system, making model-based predictions much more valuable. © 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Improving chemical production processes in an intelligent and data-driven manner is key to the Industry 4.0 directive (Hermann et al., 2016; Lasi et al., 2014). The challenge of reducing cost and waste, while maintaining product quality and cost-effectiveness, is becoming increasingly relevant due to societal perception, international legislation and planet stewardship (Centi, 2008). Modern process analytical technologies, such as online NIR spectrometers, often help in analyzing and tuning specific parts of production processes, but the data is generally not used to the fullest. The large amounts of historical data from monitoring long-running processes contain a wealth of information, but extracting this information in order to make optimal process management decisions remains challenging (Bhat and McAvoy, 1990; Gau and Stadtherr, 2002; Luyben, 1990).
Statistical process models may provide deep insight into industrial chemical processes, both to aid understanding of normal operating conditions and to, for example, predict yields for a certain set of process parameters or observables. Partial Least Squares (PLS) regression is one of many methods often used in Statistical Process Control (Gao and Yang, 2003; Jose et al., 2011). PLS regression enables researchers to extract information about the relation between a (large) number of predictor variables and one or more dependent variables. The main shortcoming of PLS regression is that it does not take into account available information about the architecture of the chemical production process and the expected associations between observed process parameters. Methods such as Multiblock-PLS or Sequential Orthogonalized (SO)-PLS have been developed to separate predictor variables according to architectural information (Kourti et al., 1995; MacGregor et al., 1994). However, Multiblock-PLS lacks the option to evaluate relationships between the predictor blocks, and SO-PLS can only evaluate these under very limiting constraints.
In previous work, it was shown that a structural equation modelling (SEM) approach enables the incorporation of various types of process knowledge into imposed associations between process variables within a predictive PLS model (van Kollenburg et al., 2020). Because of the inherent non-normality of industrial process data, PLS path modelling (PLS-PM, also called SEM-PLS, or PLS-SEM) (Hair et al., 2017; Tenenhaus et al., 2005; Wold, 1975) was the logical choice for this work (for details on how PLS-SEM and factor-based SEM differ, see ). Application of PLS-PM to process data led to novel understanding of a production process of Nouryon, which in turn led to catalyst usage optimization in a predictive maintenance setting, advancing sustainable production goals. But while PLS-PM may provide insights into processes when used correctly (Benitez et al., 2020), the model has several mathematical restrictions that limit the allowed model complexity to a degree lower than that needed to analyze most industrial processes. One of the main limitations is that PLS-PM assumes that each block of variables can be described with a single latent variable. This so-called unidimensionality assumption is rarely appropriate in physical systems like those found in applications on industrial processes (Ferrer et al., 2008). Additionally, it was found that multicollinearity, which is common in chemical and physical processes, may lead to unstable models with, for example, standardized regression coefficients greater than 1. Such unstable model parameters limit intuitive model interpretation. The interested reader is referred to Rönkkö et al. (2016) for a discussion on the fundamental limitations of PLS-PM.
Recently, Romano et al. (2019) used SO-PLS as a basis for path modelling in order to solve some of the restrictions of PLS-PM. But while SO-PLS is a useful method for predicting some outcome based on multiple blocks of predictor variables, the path modelling version SO-PLS-PM has critical shortcomings when used for many exploratory and confirmatory path modelling applications. Firstly, the model requires specification of a hierarchy of the predictor blocks. In exploratory analysis, specifying a priori which predictor block is most important may not be appropriate or even possible. Secondly, the implementation of SO-PLS-PM does not provide a fixed set of latent variables that describe the data. As a result, indirect effects between blocks cannot be calculated straightforwardly (Naes et al., 2020), even though this is a core part of path modelling. Standard methods for indirect effects, like multiplying path coefficients (Keith, 2019), do not lead to meaningful results. Reproduced correlations calculated from the model parameters are therefore unavailable. Having fixed estimates for the latent variables in a model is crucial for the interpretation of the direct and indirect effects, which form the foundation of any path model (Nitzl et al., 2016; Streiner, 2005).
Another common challenge in multi-block PLS solutions, including SO-PLS-PM, is the selection of the optimal number of latent variables for each block. This issue is caused by the imposed hierarchy and the need for orthogonalization of the predictor blocks. Depending on the number of latent variables in one predictor block, the target block is deflated differently (i.e., different information is removed). However, it could be that some of that information was better explained by the second block of predictors. To optimize this step, many combinations must be tried to arrive at an optimal solution. Such brute-force approaches increase the computational intensity tremendously, especially when the steps must be (repeatedly) cross-validated.
To overcome the crucial shortcomings of PLS-based path modelling, we developed an algorithm called Process PLS. Process PLS can be used to incorporate substantive process knowledge into predictive modelling and, as will be shown, provides stable model parameters in situations of highly correlated variables, as are common in industrial production data. Process PLS improves on existing methods i) by finding the most important predictors based on observed data instead of a priori specified hierarchies, ii) by providing fixed sets of latent variables which describe the data optimally, thereby making indirect paths meaningful, and iii) by dealing with highly correlated multi-block data, all whilst maximizing model interpretability.
In the next section, the algorithm, methodology, and implementation of Process PLS are described in detail. In Section 3, Process PLS is benchmarked by analyzing a simulated production process, and the modelling results are directly compared to results obtained with PLS-PM. In Section 4, an empirical data set of a chemical production process is analyzed to show the applicability of Process PLS in the analysis of complex chemical data and the comparison of batch-to-batch variations. The improvements of Process PLS over SO-PLS-PM and PLS-PM are illustrated in Section 5 by analyzing the Val de Loire data (Dabbs and Aksay, 2000), which was also analyzed by Romano et al. (2019). The analyses show how PLS-PM provides nonsensical model parameters and how SO-PLS-PM fails to find substantively important direct effects. Process PLS does not show any of these problems. Conclusions based on the results and topics for future work are reported in Section 6.

Methodology
Process PLS is developed for situations in which measured variables can be grouped into different 'blocks' and where those blocks may be interrelated. Fig. 1 shows a rather basic example of a process where twelve measured variables are grouped into four blocks. Between some of the blocks, relationships are specified. In Fig. 1, the colors portray how the Process PLS algorithm estimates the model in three steps: 1) model specification, 2) latent variable estimation, and 3) path coefficient calculation.

Step (1) Model specification
Model specification involves two sub-steps. First, the measured variables are grouped according to where in the process they are measured, as indicated in blue in Fig. 1. This creates what is known as the outer model, or measurement model (Section 2.1.1). After the variables are assigned to their respective blocks, directional relationships or effects between the blocks are specified (Section 2.1.2). Existing expert knowledge about the relationships between parts of the process can be tested by incorporating such relationships into the model, and new relationships may be discovered by including yet unknown relationships. This network of relationships between blocks creates the inner model, or structural model (green arrows in Fig. 1). The estimation of the outer model (in red in Fig. 1) results in estimates of the latent variables of each block (this process is described in Section 2.2). The relationships between the blocks in the inner model (the green arrows in Fig. 1) are modelled in the final steps of the Process PLS algorithm (Section 2.3).

Sub-step (1a) outer model specification
Suppose there are N measurements of P variables collected in an N × P data matrix V. If the process has M distinct parts, we can use the information about where each variable is measured to group the variables into M blocks (the blocks being representations of the distinct parts of the process). This grouping can be used to partition the data matrix as V = [V1, V2, ..., VM], which will later be used in the latent variable estimation step. It is advisable to organize the blocks in the order in which they naturally occur in the process (actual restrictions will be discussed in Section 2.1.2). The data for the model in Fig. 1 could be denoted V = [V1, V2, V3, V4], with V1 = [v1, v2, v3], V2 = [v4, v5, v6], V3 = [v7, v8, v9] and V4 = [v10, v11, v12].

Sub-step (1b) inner model specification
Any structural relationships between the blocks (i.e., the parts of the process) can be modelled by specifying directional effects between blocks. Suppose that in a chemical process the variables in some block l are known to, or are assumed to, affect those in block k. This information is stored in a lower-triangular connectivity matrix, C. The matrix C is of size M × M and has as elements:

c_nm = 1 if block m has a specified directional effect on block n, and c_nm = 0 otherwise,

where n, m ∈ {1, ..., M} and n > m. The resulting matrix defines the inner model, or structural model. The inner model is used for both the latent variable estimation (Section 2.2) and the explained variance calculation (Section 2.4). As an example, the connectivity matrix of the path model in Fig. 1 is

C =
[ 0 0 0 0 ]
[ 0 0 0 0 ]
[ 1 0 0 0 ]
[ 1 1 0 0 ]

The ones in the first column indicate which blocks are predicted from the first block (i.e., whether there is an arrow in the path model from block 1 to that block). In this case, block 1 has an effect on blocks 3 and 4, and block 2 has an effect solely on block 4.
Process PLS is currently only suited for recursive path models (i.e., directed acyclic graphs). That is, since C is lower-triangular, blocks can only predict blocks that have a higher 'block number'. This also means that recycle/feedback loops cannot be directly modelled. The effect of a feedback loop on other blocks may still be evaluated by using the magnitude of the feedback as a manifest variable. There is no hierarchy in the importance of blocks, as there is in multi-block methods, but some considerations may aid the modelling process. For example, ordering blocks based on possible causality automatically avoids physically impossible effects (not counting possible feedback control loops). Then, as a basic premise, it is recommended to set elements of C to 1 when blocks are physically connected, or when process knowledge indicates or suggests that there is a directional relation between the blocks (van Kollenburg et al., 2020a). In this manner, Process PLS can be used for confirmation of prior domain knowledge and estimation of relationships between blocks within a process.
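As an illustration of the connectivity matrix and the recursiveness restriction, the following sketch (plain numpy; names and data layout are illustrative, not the pathmodelr implementation) builds C for the Fig. 1 example and checks that it is strictly lower-triangular:

```python
import numpy as np

M = 4
C = np.zeros((M, M), dtype=int)
# C[n, m] = 1 when block m+1 has a specified effect on block n+1 (0-based indices).
# Fig. 1 example: block 1 affects blocks 3 and 4, block 2 affects block 4.
effects = [(3, 1), (4, 1), (4, 2)]          # (target block, predictor block), 1-based
for target, predictor in effects:
    C[target - 1, predictor - 1] = 1

# A recursive (acyclic) model requires C to be strictly lower-triangular:
assert np.all(np.triu(C) == 0), "feedback loops are not allowed"

# Row m of C lists the predictor blocks of target block m:
predictors_of_block4 = np.flatnonzero(C[3]) + 1   # -> blocks 1 and 2
```

Reading off rows of C in this way is how the predictor sets for the latent variable estimation (Section 2.2) and the inner model regressions (Section 2.3) are determined.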

Step (2) Estimation of the outer model
The workhorse powering Process PLS is the SIMPLS algorithm (de Jong, 1993). SIMPLS uses as input a predictor matrix X and a target matrix Y. The algorithm finds the best combination of predictors to optimally predict the target data, then finds another combination to best predict the remaining target data, and repeats this process until some criterion is reached (e.g., a percentage of explained variance in the target data). The output of SIMPLS is a set of latent variables ξ, a matrix R containing X-weights, a matrix L containing X-loadings, and a matrix Q containing Y-loadings.
When the outer model is specified, Process PLS estimates one set of latent variables for each of the M blocks. Each set ξ̂m is an N × Wm matrix containing the N scores on the Wm latent variables of block m, which are estimated using the SIMPLS algorithm. These sets of latent variables are represented in Fig. 1 as the red squares. Since dimension reduction is at the basis of PLS, Wm ≤ Pm. How the regression is performed depends on whether a block functions as a predictor in the inner model (Section 2.2.1), or only as a target (Section 2.2.2).

Predictor blocks.
The latent variables of a block m that functions as a predictor in the inner model are estimated through a PLS2 regression (using SIMPLS) where the measured variables of block m are used as predictors, Xm, and the measured variables from the blocks that are being predicted from block m are combined into a target matrix Ym:

Xm = Vm,  Ym = [Vk : ckm = 1]

The optimization by the PLS2 algorithm ensures that the resulting latent variables of block m are both predictive for the variables in Ym and contain a large portion of the variance in Xm.
As an example: to estimate the latent variable(s) of block 1 in Fig. 2, variables 1, 2 and 3 are used as predictors, X1 = [v1, v2, v3], and the variables of the blocks predicted from block 1 (blocks 3 and 4) are combined into the target matrix Y1 = [v7, v8, v9, v10, v11, v12]. The X-loadings, denoted by Lm, are saved to calculate the explained variance (Section 2.3.2).

Target blocks
When a block m only functions as target (i.e., it is never used as predictor), the latent variables are also estimated through PLS2 regression, by using the measured variables of block m as target variables, Ym, and the measured variables from all blocks that predict Ym combined into a predictor matrix Xm:

Ym = Vm,  Xm = [Vl : cml = 1]

This ensures that the latent variables at block m can be predicted as well as possible.
In the example presented in Fig. 2, the variables of block 4, Y4 = [v10, v11, v12], will be predicted from the variables of its predictor blocks 1 and 2: X4 = [v1, v2, v3, v4, v5, v6]. In this situation, the Y4-scores are estimates of the latent variable(s) ξ̂4 (de Jong, 1993).
In summary, latent variables for blocks are always estimated by predicting the matrix Ym from Xm using the SIMPLS algorithm, but the construction of these matrices depends on the function of a block in the model. When block m functions as a predictor (regardless of whether it is endogenous or exogenous), Ym is a matrix in which the target blocks are combined and Xm = Vm. The resulting Xm-scores are used as estimates for the latent variables ξm, and the X-weights are used as outer model coefficients relating the manifest variables to the latent variables. When block m only functions as target, the Xm matrix is a combination of the predictor blocks and Ym = Vm; the Ym-scores are used as estimates for the latent variables ξm, and the Y-loadings are used as outer model coefficients relating the manifest variables to the latent variables.
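The estimation step can be sketched as follows. This is a minimal SIMPLS in plain numpy, following de Jong (1993), applied to toy data for a predictor block; it is not the pathmodelr implementation, and the cross-validated choice of the number of latent variables is omitted:

```python
import numpy as np

def simpls(X, Y, n_lv):
    """Basic SIMPLS (de Jong, 1993): returns scores T, X-weights R,
    X-loadings L and Y-loadings Q. X and Y are assumed column-centred."""
    n, p = X.shape
    T = np.zeros((n, n_lv)); R = np.zeros((p, n_lv))
    L = np.zeros((p, n_lv)); Q = np.zeros((Y.shape[1], n_lv))
    V = np.zeros((p, n_lv))                  # basis for deflating the covariance
    S = X.T @ Y                              # cross-covariance matrix
    for a in range(n_lv):
        r = np.linalg.svd(S, full_matrices=False)[0][:, 0]  # dominant direction
        t = X @ r
        norm_t = np.linalg.norm(t)
        t, r = t / norm_t, r / norm_t        # unit-length scores
        l, q = X.T @ t, Y.T @ t              # loadings for this component
        v = l - V[:, :a] @ (V[:, :a].T @ l)  # orthogonalise the deflation vector
        v /= np.linalg.norm(v)
        S -= np.outer(v, v @ S)              # deflate the cross-covariance
        T[:, a], R[:, a], L[:, a], Q[:, a], V[:, a] = t, r, l, q, v
    return T, R, L, Q

# Toy predictor block: block 1 (3 variables) predicts blocks 3 and 4 (6 variables)
rng = np.random.default_rng(0)
V1 = rng.standard_normal((100, 3))
Y1 = V1 @ rng.standard_normal((3, 6)) + 0.1 * rng.standard_normal((100, 6))
V1 -= V1.mean(0); Y1 -= Y1.mean(0)
T1, R1, L1, Q1 = simpls(V1, Y1, n_lv=2)  # T1 estimates the latent variables of block 1
```

A useful property exploited throughout Section 2 is that the SIMPLS scores come out mutually orthogonal, which is what makes the explained variance calculations below additive over components.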

Explained variance in the outer model
The outer model loadings are related to the blue arrows in Fig. 1. The associated explained variance, which we denote here as R²m, describes how much variance in the measured variables can be described by the Process PLS model and is calculated from the loadings. That is, when a block m is a predictor, the variance in the observed variables, Vm, that can be explained by (i.e., is summarized in) the latent variables ξ̂m is calculated from the X-loadings, Lm (see Section 2.2.1), by the following equation:

R²m = trace(Lm Lmᵀ) / (N − 1)

where N indicates the number of observations and the trace is the sum of the diagonal elements of a matrix (the block scaling described in Section 2.4.1 makes the total sum of squares of Vm equal to N − 1). When a block m functions as a target only, the explained variance R²m is instead calculated from the Y-loadings, Qm, by the following equation:

R²m = trace(Qm Qmᵀ) / (N − 1)
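As a numerical check of this relation, the following sketch uses SVD-derived scores as a stand-in for the SIMPLS scores (the identity holds for any orthonormal score matrix on centred data) and verifies that the loadings-based explained variance equals the residual-based definition:

```python
import numpy as np

rng = np.random.default_rng(1)
Vm = rng.standard_normal((200, 5))
Vm -= Vm.mean(0)                       # centred block data

# Orthonormal scores (here from an SVD; SIMPLS scores share this property)
U = np.linalg.svd(Vm, full_matrices=False)[0]
T = U[:, :2]                           # two "latent variables"
L = Vm.T @ T                           # loadings, as in Section 2.2.1

ss_total = np.sum(Vm**2)
r2_from_loadings = np.trace(L.T @ L) / ss_total
r2_from_residuals = 1 - np.sum((Vm - T @ L.T)**2) / ss_total
# Both routes give the same explained variance:
assert np.isclose(r2_from_loadings, r2_from_residuals)
```

Under the block scaling of Section 2.4.1, ss_total equals N − 1, which recovers the formula above.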

Regression coefficient calculation
Regression coefficients form the basis of the inner model of Process PLS. After the (sets of) latent variables (represented as red squares in Fig. 1) are estimated, the relationships between the blocks are found through a second set of PLS2 regressions. A PLS2 regression is performed for each block m that functions as a target block, irrespective of whether it also serves as a predictor block. The set of latent variables, ξ̂m, is regressed on the matrix χm:

χm = [ξ̂n : cmn = 1]

which combines the latent variables of the n predictor blocks, as defined in the connectivity matrix (Section 2.1.2). The elements of the connectivity matrix with a value of 1 are represented as the green arrows in Fig. 1. Since both the target block and the predictor block(s) may have multiple latent variables, the resulting regression coefficients form a matrix, Bm. This matrix can be regarded as a combination of sub-matrices, where each Wn × Wm sub-matrix Bm,n contains the regression coefficients of the Wn latent variables of predictor block n for predicting the Wm latent variables of the target block m. Note that blocks that do not function as predictors for block m are not represented in Bm.

Calculating explained variance of the inner model
In order to interpret and study the inner model, it is intuitive to look at how well predictor blocks are able to predict target blocks. The standard way of looking at predictive power is to evaluate how much variance in a target block can be explained by the predictors. Due to overlapping terminology between the explained variance of measured variables in the outer model (Section 2.2.3) and the explained variance of the latent variables in the inner model predictions, we denote the latter as P²m ('rho squared'). The explained variance of predictions is straightforwardly calculated in the Process PLS implementation by subtracting the sum of squared errors between ξ̂m and its prediction χm Bm from the total sum of squares of ξ̂m. Note that the hat (^) notation is used for latent variable estimates and is not to be confused with the predicted values obtained from calculating χm Bm. Due to the block scaling employed in the presented methodology (Section 2.4.1), the formula for calculating the explained variance in block m through prediction from χm simplifies to:

P²m = 1 − SS(ξ̂m − χm Bm) / (N − 1)

where SS means taking the sum of squares of the term in parentheses.
The explained variance through prediction P²m is, however, not specific to a single path between blocks. The prediction χm Bm is the sum of the contributions of each predictor block and can thus be written as

χm Bm = Σn ξ̂m,n Bm,n

Each predictor block has its own set of regression coefficients Bm,n for the predictor variables ξ̂m,n, where the subscript n indicates which predictor block these belong to. The partial explained variance of a specific block z can be calculated by multiplying P²m with the ratio between the sum of squares of the partial prediction ξ̂m,z Bm,z and the total sum of squares of all partial predictions for block m:

P²m.z = P²m · SS(ξ̂m,z Bm,z) / Σn SS(ξ̂m,n Bm,n)
As an example, P²4.1 would be the parameter related to the diagonal green arrow in Fig. 1. Attributing explained variance to specific inner model effects in this way means that Process PLS can be interpreted as in standard path modelling (i.e., SEM or PLS-PM), even if multiple latent variables are modelled per block.
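A compact numerical sketch of these inner-model calculations (ordinary least squares is used here for brevity, whereas Process PLS uses a second PLS2 step to cope with collinear latent variables; all names and the toy data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 300
xi1 = rng.standard_normal((N, 2))         # LVs of predictor block 1 (2 LVs)
xi2 = rng.standard_normal((N, 1))         # LV of predictor block 2
xi4 = xi1 @ rng.standard_normal((2, 1)) + 0.5 * xi2 \
      + 0.3 * rng.standard_normal((N, 1))  # target block 4 LV
for a in (xi1, xi2, xi4):
    a -= a.mean(0)

chi4 = np.hstack([xi1, xi2])              # combined predictor LVs for block 4
B4, *_ = np.linalg.lstsq(chi4, xi4, rcond=None)   # inner-model coefficients

pred = chi4 @ B4
P2_4 = 1 - np.sum((xi4 - pred)**2) / np.sum(xi4**2)

# Partial attribution: split the prediction per predictor block
parts = [xi1 @ B4[:2], xi2 @ B4[2:]]
ss_parts = np.array([np.sum(p**2) for p in parts])
P2_partial = P2_4 * ss_parts / ss_parts.sum()     # P²_{4.1}, P²_{4.2}
```

By construction, the partial explained variances sum to P²m, which is what allows each green arrow in the path diagram to carry its own share of the explained variance.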
Predictions for unobserved data can be calculated from the model parameters using observed data as inputs. In that case, start from the outer model parameters to calculate the LVs and use the inner model parameters to calculate the fit as described in the previous paragraph. Other statistics, such as the mean absolute error or the root mean square error, can also be calculated to assess the predictivity of the model. However, attribution of partial errors to specific predictor blocks may not be as convenient as the attribution of partial explained variance.

Software implementation
The Process PLS methodology presented in this work was implemented in the statistical programming language R, version 3.5.0 (R Core Development Team, 2014), and is freely available as a package called "pathmodelr", which can be found online (van Kollenburg et al., 2020) and is distributed under the GPL-3 license. The package "pathmodelr" makes use of other useful R packages, specifically "dplyr" (Reeve et al., 2014), "reshape2" (Wickham, 2007), "listenv", and "R6" for data handling, "caret" (Kuhn, 2008) for cross-validation, and "ggplot2" and "plspm" (Sánchez and Trinchera, 2013) for visualization. More technical details can be found in the accompanying help files of the package. Some technical details on how pathmodelr scales data and implements cross-validation can also be found in Appendix A.

Benchmarking
Analyzing simulated data as a 'gold standard' enables researchers to translate (Process PLS) modelling results into meaningful process knowledge and understanding. This is helpful since statistical correlations between variables may differ from the relationships described by the laws of thermodynamics.

Data
To benchmark Process PLS, a simulated crude oil distillation process was analyzed. This process is available in the demo version of the Mobatec Modeller software (Mobatec B.V., Rotterdam, the Netherlands), and a detailed P&ID (Piping and Instrumentation Diagram) of the process is provided in the supplemental material. Crude oil feed enters a "preheating section" before entering the distillation column. There, due to a difference in pressure, a phase flash makes the lighter, more volatile components (the top product) turn into vapor and head up towards the rectifying section. Heavier, less volatile components (the bottom product) remain liquid and go down towards the stripping section. The top product is used in a blend for gasoline and the bottom product is used for making diesel fuel. For the current application, the quality parameter is the ratio between the top product and bottom product formation rates.
Data on P = 18 process variables, reported in Table 1, were generated, representing more than 25 hours of steady-state production with measurements collected every two seconds, resulting in N = 45081 observations. Since the Modeller is based on closed-form mathematical equations, a process in full steady state will not have any variation over time. Therefore, to obtain data which resembles real-world processes, naturally occurring fluctuations were introduced by randomly varying the feed composition, temperature, and pressures. All data fluctuated within the normal operating conditions, so the data is fault-free.
Because the variables are measured at different locations in the process, measurements made at the same clock time do not necessarily correspond to the same fraction of chemicals from the feed. Since we know the time it takes for the feed to reach each part of the plant (labelled 'lag' in Table 1), aligning the data is straightforward. For example, the feed at time t is related to measurements of the top product at time t + 2300 s. As the measurements are taken every two seconds, we shift all measurements of the bottom product variables up 3000/2 = 1500 cells (see Fig. 2 for a graphical representation of the alignment procedure). In case the process lags are not known as first-principles knowledge for a production process, data-driven methods may be used to optimize the shifts of the variables in the time dimension, either by optimizing the relation between process variables and a fixed target variable (Souza and Araújo, 2011) or by doing so dynamically for each process variable separately (Offermans et al., 2021). The aligned data used in the data analysis below (N = 43581 observations) are available for the interested reader online (van Kollenburg et al., 2020b).
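As a minimal sketch of this alignment step (plain numpy; the function name and toy data are illustrative, not part of pathmodelr), shifting each block by its known lag and trimming all blocks to the common window reproduces the stated reduction from 45081 to 43581 observations:

```python
import numpy as np

def align_blocks(blocks, lags_seconds, sampling_interval=2.0):
    """Shift each block forward by its process lag (converted to samples)
    and trim all blocks to the overlapping window, so that row j of every
    aligned block refers to the same fraction of feed material."""
    shifts = [int(round(lag / sampling_interval)) for lag in lags_seconds]
    max_shift = max(shifts)
    n = blocks[0].shape[0]
    return [b[s : n - (max_shift - s)] for b, s in zip(blocks, shifts)]

# Toy example: a feed variable (lag 0 s) and a bottom-product variable
# measured 3000 s (= 1500 samples at 2 s intervals) downstream.
n_obs = 45081
material = np.arange(n_obs, dtype=float)     # index of the feed fraction
feed = material.reshape(-1, 1)
bottom = (material - 1500).reshape(-1, 1)    # bottom sees the fraction fed 1500 samples earlier
feed_a, bottom_a = align_blocks([feed, bottom], lags_seconds=[0, 3000])
# After alignment both blocks have 45081 - 1500 = 43581 rows,
# and row j of each block now refers to the same material.
```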

Analysis
The outer model was specified by grouping variables together according to where in the process they were measured. In total, the variables can be grouped into 4 process blocks (feed, preheater I, preheater II, separation) and a Yield block (containing the ratio between top and bottom product). It is important to note that the process variables for each block are clearly multidimensional (i.e., they do not pertain to a common construct). Therefore, a single latent variable will most likely not be able to describe the data to satisfactory levels.
The inner model was designed following the three steps proposed by van Kollenburg et al. (2020a). That is, to first specify effects representing the process architecture, then include expert knowledge, and finally include effects for the prediction of product quality parameters. In the current application, effects are specified to represent the physical architecture of the process (Fig. 3a). The second step is not included in this simple example. Lastly, effects from each process block to each output block are included as part of the prediction model, resulting in the full Process PLS inner model as shown in Fig. 3b.
For comparison purposes, two more path modelling approaches were used to analyze the data. Firstly, we analyzed the data using PLS-PM, specifying the same model configuration as described above. The PLS-PM model was considered reflective, using the centroid weighting scheme. Secondly, as PLS-PM is only able to model one latent variable per block, we also estimated a Process PLS model where the number of latent variables for each block was restricted to one. The inner model regression coefficients of PLS-PM were squared to approximate (partial) explained variances (P²m.z, see Section 2.3.2). Note that P²m.z is only related to the regression of latent variables and not to how much of the variance in the observed data is described by the outer model (R²m, see Section 2.2.3). For example, a block with a high P²m.z but a low R²m does not explain much of the observed data. Therefore, to compare different models, especially when they have differing numbers of latent variables, it is important to consider both P²m.z and R²m. To quantify how much of the observed data can be explained by a predictor block, one can simply calculate the product P²m.z · R²m, which will be presented below as well.
The Process PLS analysis was performed with the package 'pathmodelr' in R. For each block, the number of latent variables was optimized using the supplied cross-validation procedure. All manifest variables were standardized. All other settings were left at their defaults. The PLS-PM model was estimated using the PLS-SEM toolbox in MATLAB (Aria, 2020). Full code for estimating the models and reproducing the results can be found together with the data online (van Kollenburg et al., 2020b).

Results and discussion
The main results from the inner model are found in Table 2. The results for the unrestricted Process PLS are also shown in Fig. 4. As expected, the Process PLS model with a single latent variable per block and PLS-PM provided similar results, albeit that Process PLS had somewhat higher P²m.z values for the Yield block. Process PLS with an unrestricted number of latent variables had comparable P²m.z values between the process blocks, but large differences were found related to the prediction of the Yield. Taking the sum of the partial explained variances for the Yield block showed that PLS-PM was able to explain a total of 39% of the variance in the Yield block, Process PLS with one LV per block explained 88%, and the unrestricted Process PLS 93%.
Since model performance is not only related to the inner model parameters, the values in Table 2 were multiplied by the corresponding R²m (see Table 3), which provides a much richer comparison. What becomes immediately clear is that models with a single latent variable per block are not nearly able to explain as much variance in the observed data as the model with multiple latent variables per block. The inner model parameters in Table 2 are similar for the prediction of the process blocks. But when we take into account how much variance is actually described in those process blocks, the superiority of Process PLS becomes clear: single-LV models only explain a third to a half of the variance in the process variables, while having multiple LVs allows for explaining up to 95% of the variance in the data (see Table 4).
It is important to note that the Yield block only contained a single variable (top/bottom ratio), so a single latent variable suffices to describe it. For blocks with a single observed variable, PLS-PM automatically uses the observed variable as proxy for the latent variable (i.e., the loading is set to 1), and R²m always equals 1 for such blocks. Hence the values for PLS-PM in Tables 2 and 4 for the Yield block are indeed the same (i.e., P²Yield.z × 1). However, not all variation in the Yield block is necessarily related to other variables in the process, and optimizing all covariances may not benefit from using all information in the Yield block. Since Process PLS is designed to optimize the covariances between all blocks, the algorithm does not by definition take the variable as proxy for the latent variable. Instead, in both Process PLS models R²Yield = 0.89. Again, this shows how Process PLS works as intended: it does not by definition focus on an end node simply because it is at the end of the process. Each block is a priori equally important, and only covariance and predictive power are the forces driving the results.
It may seem remarkable that the Feed has little to no direct effect on the process Yield (top/bottom ratio). In most practical settings, experience has shown that process input has a significant effect on the process output. However, for a well-controlled production facility, as was simulated for this demonstration, the production settings are constantly adapted to guarantee stable and high production quality despite changes in external production factors, including feed variations. Because the intermediate production steps are constantly adapted, it is their operation variance by which the Yield is affected and not the Feed variance itself. In this process the Yield is affected primarily by the Separation and secondarily by Preheater II. This is partly caused by a proximity effect, but also by the fact that these production steps are the main targets of the control routines that are in place to guarantee high production stability and quality.
In general, the total variance explained in each process block is lower in the PLS-PM analysis than in the Process PLS analysis. Fundamentally, it was shown that PLS-PM does not work well for multidimensional data (it only uses a single latent variable per block) or in the presence of multicollinearity (the inner model is estimated with ordinary least squares). Additional numerical simulation experiments were performed to illustrate in detail how Process PLS handles such situations much better than PLS-PM. The results of these studies can be found in Appendices B and C.

Multi-batch analysis
For batch-wise industrial production processes, evaluating and understanding production variations between batches is essential to minimize production cost and maximize production sustainability. In Van Kollenburg et al. (2020a), it is shown that path modelling using PLS-PM can yield an unprecedented insight into batch-to-batch variations in process relationships and how they correlate with variations in production cost. For the process presented, the relationships fitted by PLS-PM could, per batch, be related to the batch cost using a second statistical model. Being able to quantify such relationships leads to a better understanding of the process and can improve communication across different layers of the automation pyramid.
Although PLS-PM proved to be a fruitful method for this study, it is theoretically still suboptimal. Industrial processes are complex and consist of many different sub-processes that occur simultaneously at different production steps in the plant. PLS-PM, however, can only describe the single most dominant (sub)process per production step, as it can only fit a single latent variable per production step. It therefore misses many details of the production data and may model different processes for different batches, depending on which process is most dominant in each batch. As Process PLS can model multiple sub-processes per production step, it is a superior alternative to PLS-PM for this type of study.
To illustrate that Process PLS indeed surpasses PLS-PM for quantifying and understanding batch-to-batch variation, data from 22 batches of a semi-batch process of Nouryon were analyzed. Eleven of these batches were also analyzed in previous work (van Kollenburg et al., 2020a). The process consists of several interconnected process units, such as heaters, reaction vessels and a catalyst section. When the catalytic efficiency drops below a threshold, the process is halted, a new batch of catalyst is introduced, and production is resumed. The batches had considerably variable yield, leading to fluctuating production costs. Relating this cost to the performance of the batch in terms of process variables results in a better understanding of the batch variations and can even yield a soft sensor that predicts the cost of a running batch in real time.

Data
Measurements for 34 process variables, such as pressures and temperatures, are available for each of the 22 production batches. Four variables relate to the incoming feed composition, 28 variables relate to the actual production, and two variables relate to the product quality. Note that this data was originally collected for purposes other than statistical analysis. All variables are available at hourly intervals, except for the product quality variables, which are available approximately twice a day. The quality variables are synchronized to the hourly measurements by assigning the last known quality sample to the hourly samples. This corresponds to how the quality checks are used in practice. Note, however, that due to this imputation by replication, the standard errors related to the prediction of these quality samples will be underestimated.
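The last-known-value synchronization amounts to a simple forward fill; the sketch below is a generic illustration with made-up numbers, not the actual preprocessing code used for the Nouryon data:

```python
# Forward-fill sparse quality measurements onto an hourly grid:
# each hourly sample is assigned the most recent known quality value.
def forward_fill(values):
    """Replace None entries with the last observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# quality measured roughly twice a day; None = no measurement that hour
hourly_quality = [0.95, None, None, None, 0.97, None, None]
print(forward_fill(hourly_quality))
# → [0.95, 0.95, 0.95, 0.95, 0.97, 0.97, 0.97]
```

Because each quality value is replicated over several hourly samples, the replicated samples are not independent observations, which is why the paper notes that the associated standard errors are underestimated.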
For each batch a cost value is available, which relates the fixed amount of catalyst to the varying product yield. A complete overview of the batches used in this analysis, including the number of hourly measurements available and the cost value (in arbitrary units), is given in Table 5. Longer batches (more hourly measurements) generally have a lower cost value, as more product can be produced. Path models on batches with fewer samples will have a higher uncertainty (standard error), which introduces an artificial relationship between modelling accuracy and cost for a batch.

Analysis
Both the outer and inner model specified for path modelling of the production data are illustrated in Fig. 5. The outer model was specified by grouping the observed variables into nine blocks. The four composition variables of the input were grouped in a block 'Input', the two product quality variables were grouped in a block 'Quality', and the remaining process variables were grouped corresponding to the process step in which they are measured. These process blocks are ordered counterclockwise in Fig. 5 based on their physical location. Only the 'Flow' block is moved to the beginning, as it influences the flow in all other blocks.
The inner model (the relationships between the blocks) was specified in three steps. First, effects corresponding to the physical connections between the process units were specified. Second, relationships between blocks that are not directly physically connected were identified and included in the inner model using expert knowledge from Nouryon. Third, all relationships from all blocks to the Quality block were included.
The model illustrated in Fig. 5 was estimated for all 22 batches using three methods: Process PLS, Process PLS restricted to a single LV per block, and PLS-PM. As unrestricted Process PLS can model all sub-processes for the blocks in each batch, it is expected to find the same inner model associations and thus more stable and comparable results across batches than the other two methods. In other words: Process PLS copes better with batch-to-batch variations and should reveal clearer patterns in the data.
The batch-to-batch variations in process performance as modelled by the different path modelling methods were related to batch-to-batch variations in cost value. For this, correlation coefficients were calculated between the cost value and each of the path effects over all batches and for each modelling method. Such results can help process operators understand which process relationships are affecting the production cost and by how much. When the entire strategy is implemented for on-line monitoring, it could even be used as a soft sensor that estimates the production cost in real time.
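The cost-correlation step amounts to correlating, across batches, each per-batch path-effect size with the batch cost value; a minimal sketch with entirely hypothetical numbers (the actual effect sizes and costs are not reproduced here):

```python
import numpy as np

# hypothetical data: one path effect (e.g. Cooler > Heater) estimated per batch,
# alongside each batch's cost value (arbitrary units)
effect_size = np.array([0.82, 0.75, 0.90, 0.60, 0.70])
batch_cost  = np.array([1.1, 1.4, 0.9, 2.0, 1.6])

# Pearson correlation between the path effect and cost over all batches
r = np.corrcoef(effect_size, batch_cost)[0, 1]
print(round(r, 2))  # strongly negative here: larger effect, lower cost
```

In the paper, this calculation is repeated for every path effect and every modelling method, yielding the correlation profiles shown in Fig. 8.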

Results and discussion
Fig. 6 shows the explained variance of the measured variables in the outer model, with variations across batches represented as standard errors. These results show that unrestricted Process PLS is superior in describing the measured data for all blocks and across all batches. As discussed previously, a well-behaved production process has little variance in the quality variables, and only a small part of that variance is related to the process variables. Because Process PLS creates the outer model such that the covariance between blocks is optimized, only limited variance is extracted from the Quality variables. Process PLS thus accurately reflects the substantively different data underlying the blocks. These first results, also compared to restricted Process PLS, confirm that multiple sub-processes are present in most parts of the production process and that multiple LVs are required for a path model to accurately describe them. Moreover, the fact that a linear model like Process PLS can describe most of the variation in the data so accurately is promising, considering that industrial processes may have many non-linear relations (Wang, Liu & Srinivasan, 2010).
The inner model parameters found in each batch by unrestricted Process PLS, restricted Process PLS and PLS-PM are shown in Fig. 7. The horizontal axes show the path effects that comprise the inner model, denoted as 'predictor > target'. The vertical axes represent the explained variance in the target block by the predictor block. Each line represents a single production batch. In these figures, it can be observed that the model parameters found for the different batches using unrestricted Process PLS adhere much more strongly to an overall pattern. For most of the modelled effects, the variation over the batches is larger for the single-LV methods, and less of a general trend in the lines can be observed. This illustrates that unrestricted Process PLS indeed models the same production process, including all sub-processes, in each batch and each block, while the single-LV methods do not. The results found for the path effect from Flow to Reactor exemplify the advantage of (unrestricted) Process PLS especially well. The reactor is at the heart of this production process and comprises multiple sub-processes that need to be carefully controlled by adapting flow rates. Restricted Process PLS can only model one of those processes for each batch. As this process can differ from one batch to another, the size of the relationship from Flow to Heater shows much more batch variation, as can be observed in Fig. 7B. Unrestricted Process PLS models all sub-processes in every batch, leading to much smaller variation in the size of the relationship from Flow to Heater. Moreover, the fact that unrestricted Process PLS models the data measured for these two steps virtually perfectly (Fig. 6A) guarantees that the relationship size between these steps is accurate and is not subject to the path modelling method being unable to accurately describe the data in the first place, as is the case with the single-LV methods.
For all path effects to the Quality block, the sizes show more variation over the batches, even for unrestricted Process PLS. This is because the explained variance for the data in this block by Process PLS is lower and suffers more from batch variation than the explained variance of the other blocks. It should be noted that although PLS-PM can explain more variance in the Quality data, the path effect sizes from all blocks to it are neither higher nor more structured than those found using Process PLS. As discussed before, Process PLS may describe less of the variance for an end block, but it takes covariation of the end block with the incoming blocks into account (in contrast to PLS-PM). The path effect sizes that are eventually modelled are therefore comparable to those found by PLS-PM.
Additional process understanding could be obtained by correlating the cost values to the variations in path effects. The correlations are depicted in Fig. 8. Especially interesting are the strong (negative) correlations between the cost value and the path effect sizes for Cooler > Heater and Tank1 > Heater as found with unrestricted Process PLS. For most batches, the effect size for Cooler > Heater is generally high, while the effect size for Tank1 > Heater is low (Fig. 6A). It is interesting that the base effects may be very different, but that variations in those effects are still highly related to cost. These results have been validated by correspondence with experts.
From a methodological point of view, the Process PLS results indicated that the process under analysis was well-designed and well-controlled. This came, of course, as no surprise to the people at Nouryon. At the same time, because so much was known about the process itself and Process PLS performed so much in agreement with expectations, we are confident that Process PLS is a good method to analyze data from production processes.
It should be noted that Process PLS outputs much more information than the inner model parameter coefficients discussed here, and that exploring that information can lead to even more (added) process understanding. Such information is not discussed for conciseness, but from the information that is discussed it is already clear that Process PLS, by modelling multiple latent variables per block, is a superior choice for modelling complex industrial chemical processes.

Data
For the analysis of industrial production processes, variables were grouped according to the physical section they were connected to, not because of similarity in function. In traditional applications of path models, variables that reflect the same underlying construct (e.g., verbal intelligence, or the taste of wine) are grouped together. Therefore, a comparison between Process PLS, SO-PLS-PM and PLS-PM is included for a situation in which the variables are grouped according to the construct they pertain to.
One such application, based on a wine tasting process, was analyzed by Romano et al. (2019) to illustrate the SO-PLS-PM methodology and compare it to the long-standing PLS-PM. The authors used the Val de Loire data, which has P = 27 sensory measurements (i.e., variables) on N = 21 wines that originated from the Val de Loire region in France. The sensory measurements correspond to different aspects of the wine tasting process (i.e., the experience). Five variables pertain to the smell of the wine at rest, three variables pertain to the view of the wine, ten to the smell of the wine after shaking, and ten to the taste of the wine. In addition to these sensory variables, each wine was assigned a Global quality score.
As noted by Martens et al. (2007), who first analyzed the Val de Loire data with PLS-PM, there is 'severe multicollinearity' between the latent variables of some blocks. In this original application, the authors already opted for PLS regressions for the inner model estimation instead of the standard ordinary least squares regression. However, since Romano et al. (2019) still used PLS-PM as a benchmark for SO-PLS-PM, standard PLS-PM was included in the current comparative study as well, to illustrate the issues in an empirical application where unidimensionality can be assumed.

Analysis
The measurement model for this analysis is straightforwardly specified, since the variables are already grouped into five blocks: "Smell at rest" (A), "View" (B), "Smell after shaking" (C), "Tasting" (D) and "Global quality" (E). For comparison purposes, the model is specified as it was in Romano et al. (2019), resulting in a fully connected inner model (see Fig. 9). The R script for analyzing the wine data using Process PLS can be found online (van Kollenburg et al., 2020b).

Results and discussion
The results of the Process PLS and PLS-PM analyses, as well as the SO-PLS-PM results of Romano et al. (2019), are reported in Table 7. One of the most striking results is that, according to PLS-PM, 1.25 (or 125%) of the variance of the Quality latent variable could be explained by the latent variable of Tasting. This nonsensical result is known to occur in least squares regression when multicollinearity is an issue. Both the PLS-PM implementation in R and the PLS-SEM Toolbox in MATLAB gave the exact same result, so clearly PLS-PM is not appropriate for this data set. Any (further) comparison with PLS-PM was therefore deemed unreliable and thus redundant.
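One plausible mechanism behind such a >100% figure can be reproduced in a small sketch: in standardized least squares regression, the explained variance R² decomposes into per-predictor contributions βj·rj (standardized coefficient times correlation with the response), and under severe multicollinearity an individual contribution can exceed 1. The data below are artificial, not the wine data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.standard_normal(n)
e = rng.standard_normal(n)
x2 = x1 + 0.1 * e       # x2 is almost a copy of x1: severe multicollinearity
y = 10 * x1 - 9 * x2    # response is an exact combination of the predictors

def z(v):  # standardize to mean 0, standard deviation 1
    return (v - v.mean()) / v.std()

X = np.column_stack([z(x1), z(x2)])
yz = z(y)
beta, *_ = np.linalg.lstsq(X, yz, rcond=None)           # standardized OLS
r = np.array([np.corrcoef(X[:, j], yz)[0, 1] for j in range(2)])
contrib = beta * r      # per-predictor "explained variance" contribution
print(contrib, contrib.sum())  # contributions sum to R², but one exceeds 1
```

The contributions still sum to the (valid) total R², yet the inflated coefficients make a single predictor appear to explain far more than 100% of the variance.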
SO-PLS-PM and Process PLS find somewhat similar relationships, but since the effect sizes are estimated with different algorithms, they are difficult to compare directly. The most notable differences are related to the effects on the Global quality block. Any substantive reasoning would reveal that each part of a tasting experience is, to a certain extent, directly related to the quality of a wine (though the direction of causality may be debated). SO-PLS-PM does not capture these effects, as 3 out of 4 direct effects equal zero. In SO-PLS-PM, direct effects can be set to zero due to the inherent choice of the number of latent variables each time a regression is done. This can lead to specific predictor blocks being excluded, with all explainable variance attributed to a single block (Tasting in this case). Process PLS, on the other hand, did find the expected direct effects and attributes the explained variance of Global quality to each of the blocks.
Since the sample size is limited and (almost) the entire population of Val de Loire wines was analyzed, further methods of inference, like significance tests and bootstrap procedures, are not appropriate (Edwards and Bennett, 1991; Gorard, 2017; Johnson, 1999; Wasserstein and Lazar, 2016). From a substantive point of view, this also means that the estimated model only describes the tasting process of Val de Loire wines and not of wines in general.
Next to the direct effects presented in Table 7, path modelling allows for the investigation of indirect effects, too. For example, the Smell at rest block not only directly affects the perceived Global quality, but also indirectly affects it through its effect on Tasting. The total effect of one block on another consists of the direct effect plus any indirect effects (Alwin and Hauser, 1975). These total effects are also known as reproduced correlations, and comparing them to observed correlations provides an indication of how well the model fits the data. For Process PLS, one can straightforwardly calculate total effects from the results presented above (Keith, 2019). These calculations are part of the standard output in the pathmodelr package.
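Following standard path-analysis calculus, with the direct effects collected in a matrix B (entry B[i, j] = direct effect of block i on block j), the total effects of an acyclic path model sum the products of coefficients over all paths: T = B + B² + B³ + … = (I − B)⁻¹ − I. The sketch below uses hypothetical effect sizes, not the Table 7 estimates, and is a generic illustration rather than the pathmodelr implementation:

```python
import numpy as np

# hypothetical direct effects for a 3-block chain:
# SmellAtRest -> Tasting -> Quality, plus a direct SmellAtRest -> Quality path
B = np.array([
    [0.0, 0.5, 0.3],   # SmellAtRest -> Tasting (0.5), -> Quality (0.3)
    [0.0, 0.0, 0.4],   # Tasting -> Quality (0.4)
    [0.0, 0.0, 0.0],
])

# total effects: direct effects plus all indirect paths
T = np.linalg.inv(np.eye(3) - B) - np.eye(3)

# SmellAtRest -> Quality: direct 0.3 + indirect 0.5 * 0.4 = 0.5
print(T[0, 2])
```

For a recursive (acyclic) model B is nilpotent, so the geometric series is finite and the matrix inverse simply accumulates every chain of direct effects.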
Importantly, the 'total effects' as reported in SO-PLS-PM applications follow a different method of calculation and therefore should not be compared to the standard calculation of total effects. The 'total effects' in SO-PLS-PM are calculated from a different model, namely a PLS2 model with a single predictor and a single response block. As such, the term 'total effect' in SO-PLS-PM is somewhat misplaced and should be interpreted as the unconditional, maximal covariance between two blocks.

Conclusions
This paper demonstrated that Process PLS solves several of the key problems of PLS-PM (i.e., PLS-based path modelling). Process PLS can be used to accurately analyze multidimensional, multiblock data. Process PLS also provides stable model parameters in the presence of multicollinearity, a condition that leads to unstable parameters in PLS-PM and that is highly common for data typically analyzed using path modelling. Process PLS facilitates the comparison of results from different data sets, as the sign ambiguity present in PLS-PM has been solved. These benefits were demonstrated by applying Process PLS and PLS-PM to the same (simulated) data sets, where PLS-PM produced less-than-satisfactory results. Additionally, the a priori specification of hierarchies of predictor blocks, as is a prerequisite for SO-PLS-PM, is no longer required.
One of the core foundations during the development of Process PLS was the need for clear model interpretation. Process PLS is based on concepts that are intuitive to anyone familiar with regression analysis, so that it can be used by many researchers. Terms like explained variance, latent variables, and regression coefficients are in the standard vocabulary of most researchers doing statistical analysis. Additionally, since Process PLS is estimated based on ordinary PLS regression, established and proven tools may be used straightforwardly.
In order to benchmark Process PLS results, a simulated data set of crude oil distillation was analyzed. This benchmark showed that Process PLS worked exactly as intended. It was able to accurately model relationships between the parts of the process and, due to the availability of multiple latent variables per block, it could explain much more of the variance in the observed data than PLS-PM could. Additional numerical simulations showed in detail how Process PLS is well-suited for the analysis of multicollinear and multidimensional data. Such data has detrimental effects on PLS-PM results.
The results of a multi-batch analysis of a semi-batch production process at Nouryon showed how important it is to use multidimensional blocks, without which between-batch comparison is hardly meaningful. The fact that the Process PLS results were so much in line with the substantive expectations from existing process knowledge supports the conclusion that Process PLS can be used as a data-driven technique for evaluating process performance. More benchmarking studies need to be performed, however, to investigate how Process PLS can be used in decision making. An extension of Process PLS for streaming data (i.e., with online model updating) would be a logical next development for this purpose.
Before applying Process PLS to the data sets, an alignment procedure was followed to link observations from distinct parts of the production facility together, following the flow of a specific collection of materials through the system. This procedure was case-specific and was carried out to improve the validity of the applications. The alignment required detailed knowledge of the production process. While crucial for the presented work, neither a priori process knowledge nor the temporal alignment need be prerequisites for future Process PLS applications.
A comparison between Process PLS and SO-PLS-PM (and PLS-PM) was presented for a traditional multi-block application: a wine tasting data set. The analysis of the Val de Loire data showed that some similarities exist between the SO-PLS-PM and Process PLS results, as both models can extract important relationships. However, SO-PLS-PM failed to find important direct effects that would be expected from a substantive point of view. SO-PLS-PM only found these relationships through so-called 'total effects', which are obtained from separate models (which disregard all other blocks). The fact that SO-PLS-PM fixed some of the crucial effects of wine properties on wine quality to zero indicates that researchers must be extremely wary not to miss such important relationships in other applications of SO-PLS-PM. At the same time, the application showed that Process PLS does what it was designed to do, which is to prioritize effects based on predictive power and not on pre-defined hierarchies, as it provided substantively accurate results. In the same application, a core issue of PLS-PM was illustrated: due to the multicollinearity, a single predictor nonsensically explained 125% of the variance in Quality.
Throughout this paper, the motivations behind the development of Process PLS were illustrated, and its benefits over other methods are manifold. Process PLS can be used for multidimensional and multicollinear data without the need for user-specified hierarchies, and model results can be straightforwardly interpreted. Moreover, Process PLS optimizes the covariance between blocks while the computational complexity remains low. The simple approach of Process PLS keeps interpretation close to that of PLS-PM, which will be beneficial to future developments and applications.
This paper did not intend to be a comprehensive overview of different process analytical methods. The choice for PLS was substantiated by theoretical arguments, such as distributional assumptions about industrial process data. At the same time, an important practical argument was that, in contrast to many other machine learning algorithms, PLS is rather easy to explain to applied workers in industry. Optimizing relationships between production units is essential in the industrial setting, and since PLS is well known in the (chemical) industry, a PLS-based approach has been shown to greatly improve communication between researchers and applied workers.
On a comparative note: if a research field uses principal component analysis, rather than PLS, to extract information from blocks, a good path modelling extension is the recently developed NetPCA (Codesido et al., 2021). If factor analysis is the standard (e.g., in the social sciences), structural equation models will be the natural way to communicate. From a practical point of view, those models will therefore definitely serve their purpose. From a statistical point of view, however, a comparative study across research fields to evaluate which models work best for which specific research questions would be a welcome addition to the literature. Of course, Process PLS will be suited to many other applications, but models from other fields may also aid the industrial research field.
Process PLS has been implemented in a free-to-use R package called pathmodelr. The software is freely available for researchers to easily analyze their own processes. For training purposes, the data and code used for the benchmarking studies are publicly accessible (van Kollenburg et al., 2020b). While the programming experience required to perform Process PLS is minimal, a drag-and-drop GUI would make for even easier use. The authors welcome anyone to build upon the existing R package, provided that the GPL-3 license is upheld. New implementations may extend the Process PLS algorithm presented in this paper with moderation effects and multi-group analysis (Henseler, 2007; Henseler and Chin, 2010) and with methods for robustness checks (Sarstedt et al., 2020). Several promising applications of Process PLS to various types of processes are already underway and will be reported in due time. In summary, Process PLS has taken a large step towards a general framework for statistical process analysis, which can be easily applied in many research fields.

Declaration of Competing Interest
None.

Fig. 2. Graphical representation of the data alignment process (not to scale).

Fig. 3. Schematic overview of the inner part of the Process PLS model specification of the crude oil distillation process, starting with a) the architecture and b) including the prediction model.

Fig. 4. Full Process PLS model, with inner model results. The colors correspond to the three steps of the Process PLS algorithm.

Fig. 5. Full specification of the outer and inner model used for path modelling of the data from the Nouryon process.

Fig. 6. Summary of the outer model R²-values for each process block across the production batches, modelled using Process PLS (red), restricted Process PLS (green) and PLS-PM (blue). The standard deviations are represented symmetrically, though no values greater than 1 were observed.

Fig. 7. P²-values for each path effect per production batch (each line represents one batch), modelled using Process PLS (A), restricted Process PLS (B) and PLS-PM (C).

Fig. 8. Correlations between batch cost and batch performance in terms of path model effects.

Table 1. Process variables, units, time until the feed reaches that variable's physical location (lag), and block assignment. For the Yield block, the ratio between the top and bottom product outlets is used as the single predictor variable.

Table 2. Inner model results (P²m.z) for the oil distillation process.

Table 3. Explained variances, R²m, in the outer model, per block m.

Table 4. Direct comparison of model performance. Values in the cells are calculated by multiplying the explained variance P²m.z from the inner model (Table 2) with the explained variance R²m of the outer model (Table 3).

Table 5. Overview of all batches with their sample sizes and cost values.