PLS for Big Data: A unified parallel algorithm for regularised group PLS

Partial Least Squares (PLS) methods have been heavily exploited to analyse the association between two blocks of data. These powerful approaches can be applied to data sets where the number of variables is greater than the number of observations and in the presence of high collinearity between variables. Different sparse versions of PLS have been developed to integrate multiple data sets while simultaneously selecting the contributing variables. Sparse modeling is a key factor in obtaining better estimators and identifying associations between multiple data sets. The cornerstone of the sparse PLS methods is the link between the singular value decomposition (SVD) of a matrix (constructed from deflated versions of the original data) and least squares minimization in linear regression. We review four popular PLS methods for two blocks of data. A unified algorithm is proposed to perform all four types of PLS including their regularised versions. We present various approaches to decrease the computation time and show how the whole procedure can be scalable to big data sets. The bigsgPLS R package implements our unified algorithm and is available at https://github.com/matt-sutton/bigsgPLS. MSC 2010 subject classifications: Primary 6202, 62J99.


Introduction
In this article, we review the Partial Least Squares (PLS) approach to big data. When one faces the task of statistically analysing data resulting from the observation of cases on a large number of variables, interpretation of the results can be difficult. Analysing each variable separately is time consuming, and describing the results using graphs and numerical indicators is space consuming. One way to circumvent this problem is to select a few of the more important variables while discarding the others. However, selecting the best variables is not easy, not to mention that interesting interactions between these variables will remain unexplored. A better approach is to create a few new variables that combine the original ones in a clever way. Using linear combinations is a good strategy, since such new variables are easy to compute. (Being composed of all the original variables, they are called components and, not being directly observed, are often also called "latent variables".) Furthermore, the weights of each component fully characterise the relationships between the new variables and the original ones, which makes interpretation of the former in terms of the latter easy. This approach is, for example, at the core of Principal Components Analysis (PCA), a statistical technique that computes the weights (of each linear combination) in such a way that the variance of each component is maximal, under the constraint that components are orthogonal. Geometrically, constructing the first component consists in finding the one-dimensional subspace (i.e., a line) such that the data projected onto this subspace have maximal variance. (Broadly speaking, this line gives the direction in which the cloud of data points is most elongated.) The second component is obtained by finding another one-dimensional subspace, orthogonal to the first one, that maximises the variance of the data projected onto it.
And so on for subsequent components, stopping after a small number of components have been extracted. In many cases, these first few components are sufficient to recover a large proportion of the overall multidimensional variability present in the original data set, thereby performing a reduction of dimension while keeping most of the information. In practice, these directions (given by the weights of each linear combination) are easily obtained thanks to a linear algebra tool called the Singular Value Decomposition (SVD); see (Meyer 2000, Figure 5.12.1, page 413) for a very nice geometric explanation of how the SVD of a matrix A describes the ellipsoid that one gets when a unit sphere is distorted by the linear transformation associated with A. Note that if the original data have been centered first, the requested orthogonality between components translates directly into their uncorrelatedness, rendering the information brought by each component independent of the others (at least in the ideal world of Gaussian distributions). This facilitates their interpretation. Also, an interesting graph often used in PCA, the so-called correlation circle, enables one to rapidly grasp how PCA components are linked (i.e., correlated) to the original variables, making interpretation in terms of the original variables possible.
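As a concrete illustration of this PCA-via-SVD connection, the following minimal numpy sketch (illustrative only; the data and dimensions are arbitrary, and the paper's own implementation is in R) extracts components from a centred data matrix and checks the two properties described above: orthogonality of components and decreasing variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 observations of 5 variables, with two correlated columns
X = rng.normal(size=(100, 5))
X[:, 0] += 3 * X[:, 1]              # induce an elongated direction
X = X - X.mean(axis=0)              # centre each column

# The right singular vectors of the centred data are the PCA weights
U, s, Vt = np.linalg.svd(X, full_matrices=False)
components = X @ Vt.T               # scores: one column per component

variances = components.var(axis=0)  # ordered, largest first
```

The singular values come out in decreasing order, so the component variances do too, and the score columns are mutually orthogonal by construction.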
The above description focuses on the analysis of a single set of data. Sometimes, however, one is interested in the simultaneous analysis of multiple blocks of data, each comprising a large number of variables. The strategy used in PCA can be extended to this situation, giving rise to a statistical method called Projection to Latent Structures (PLS). Originally developed by Wold (1966) as a set of iterative algorithms involving optimisation of quantities as in the classical Least Squares method, it was termed Partial Least Squares (PLS), a name still often used nowadays. PLS methods perform a wide range of multivariate supervised and unsupervised statistical techniques on multiple blocks of data. Their goal is thus to analyse multiple relationships among several sets of variables measured on the same objects. They construct new variables known as scores (or components), which are linear combinations of the original variables. Instead of maximising variances as in PCA, and since PLS focuses on relationships between sets of data, linear combinations are here obtained by maximising a covariance (or correlation) criterion. Once again, this maximisation is easily achieved in practice using an SVD; see Lafaye de Micheaux et al. (2017) for a rigorous mathematical description of this link along with its proof. The current article focuses on PLS modeling when there are only two blocks of data. One distinguishes between the case when the two blocks are interchangeable (symmetric situation) and when they are not (asymmetric situation). In the latter case, the goal is to predict one set of data from the other. (As a simple analogy, think about "correlation between two variables" versus "simple linear regression between two variables".) 
In the two-block case, the PLS acronym (for Partial Least Squares or Projection to Latent Structures) usually refers to one of four related methods, which are detailed in Section 2.2: (i) Partial Least Squares Correlation (PLSC), also called PLS-SVD (Krishnan et al. 2011, Abdi & Williams 2013, Rohlf & Corti 2000); (ii) PLS in mode A (PLS-W2A, for Wold's Two-Block, Mode A PLS) (Vinzi et al. 2010, Cak et al. 2016, Wegelin 2000); (iii) PLS in mode B (PLS-W2B), also called Canonical Correlation Analysis (CCA) (Guo & Mu 2013, Hardoon et al. 2004, Hotelling 1936); and (iv) Partial Least Squares Regression (PLS-R, or PLS2) (Wold et al. 2001, Rosipal & Krämer 2006, Geladi & Kowalski 1986).
The first three methods model a symmetric relationship between the data, aiming to explain the shared correlation or covariance between the data sets, while the fourth method (PLS-R) models an asymmetric relationship, in which one block of predictors is used to explain the other block. These methods are now widely used in many fields of science, such as genetics (Boulesteix & Strimmer 2007, Ji et al. 2011, Liquet et al. 2016), neuroimaging (McIntosh et al. 1996, Roon et al. 2014) and imaging-genetics (Lorenzi et al. 2016, Liu & Calhoun 2014). Recently, some authors have started to modify these methods using sparse modelling techniques; see e.g., (Lê Cao et al. 2008, Dhanjal et al. 2009, Witten et al. 2009, Chung & Keleş 2010, Chun & Keleş 2010). These techniques refer to methods in which a relatively small number of covariates influence the model. They are powerful methods in statistics that provide improved interpretability and better estimators, especially for the analysis of big data. For example, in imaging-genetics, sparse models have demonstrated great advantages for the identification of biomarkers, leading to more accurate classification of diseases than many existing approaches (Lin et al. 2014). This is not really surprising. Indeed, as nicely summarised by Wegelin (2000): "The coefficients computed in a PLS analysis are well-defined and easy to interpret. PLS is especially useful when the columns of X or of Y are collinear or nearly collinear, or when there are more variables than observations (p > n or q > n), since few other methods are available in such a case." Among the few statistical methods that can deal with two blocks of data, one can obviously mention classical multivariate linear regression (a direct competitor of case (iv) when q > 1). Nevertheless, this method fails in the presence of collinearity or when there are more variables than observations, since the least squares estimator is not unique in that case (X^T X is not invertible).
A way to circumvent this problem is to impose sparsity on the coefficients through a shrinkage lasso approach; see Friedman et al. (2010) and their associated R package glmnet (Friedman et al. 2018). By imposing the assumption of sparsity, these methods provide stable estimates of the coefficients, and often have improved predictive performance. However, ultra high dimensional data sets (p >> n) pose a computational challenge in fitting these methods. This problem has only been addressed recently by Zeng & Breheny (2017a), but only for a univariate response (see their R package biglasso (Zeng & Breheny 2017b)).
In this article, we consider big data extensions to PLS methods which incorporate sparsity in fitting the PLS weights. Similarly to standard PLS methods, these sparse approaches permit dimension reduction by constructing components as linear weighted combinations of the original data. However, sparse PLS approaches allow some of these weights to be set to zero. Consequently, a variable in the original data set might well be used in constructing only a subset of the PLS components. An advantage of having several components in PLS is to visualise the data projected on various 2D planes spanned by pairs of components (as illustrated in Figure 3) in order to gain better insight into the underlying structure of the data. Also, these sparse approaches are often only considered for two blocks of data in a regression context, whereas standard PLS can deal with several blocks of data in a more general analysis. Closely related to lasso regression is (multivariate) ridge regression (Brown & Zidek 1980), which shrinks the coefficients using an L2 norm (instead of an L1 norm for the lasso). This different type of shrinkage does not lead to sparsity of the coefficients, though, and consequently does not perform a reduction of dimension. Another potential competitor of PLS worth mentioning is Principal Component Regression (PCR). Once again, this method is only applicable in a regression context. It solves the problem of data collinearity and reduces the number of regressor variables by replacing the original regressor variables with a few orthogonal (hence non-collinear) principal components obtained via a PCA of X. Compared to PLS, these X-components are obtained independently of the response Y, which tends to lead to lower predictive performance than PLS (Yeniay & Goktas 2002). A nice discussion that explores the relationships between PLS, PCR, ridge regression and the lasso can be found in Hastie et al. (2009, Section 3.6).
A last competitor method worth mentioning is the Vector Generalised Additive Model (VGAM); see Yee & Wild (1996) for details, and the VGAM R package (Yee 2018). This very flexible technique relates the response to the regressors through a sum of non-linear transformations of the original regressors. Results are difficult to interpret though, especially when the number of variables is large. Finally, note that inference in PLS is done either by using resampling or bootstrap techniques (see Abdi & Williams (2013, Chapter 23)) or by computation of its degrees of freedom (Kraemer & Sugiyama 2011) thanks to the R package plsdof (Kraemer 2018). Inference for the lasso can also be done using resampling techniques, or by using a covariance test statistic (Lockhart et al. 2014) thanks to the R package selectiveInference (Tibshirani et al. 2017).
In Section 2, we survey the standard PLS methods. The optimisation criteria and algorithmic computation are detailed. Gathering an accurate description of all these methods in a single document, along with their complete mathematical proofs, constitutes a valuable addition to the literature; see also (Lafaye de Micheaux et al. 2017, Wegelin 2000). In Section 3 we present the sparse versions of the four types of PLS, as well as a recent group and sparse group version. A new unified algorithm is then presented in Section 4 to perform all four types of PLS, including the regularised versions. Various approaches to decrease the computation time are proposed. We explain how the whole procedure can be made scalable to big data sets (any number of measurements, or variables). In Section 5, we demonstrate the performance of the method on simulated data sets, including the case of a categorical response variable. Our algorithm is implemented in the R programming language (R Core Team 2017) and is available at https://github.com/matt-sutton/bigsgPLS as a comprehensive package called bigsgPLS that implements parallel computations, either using several cores or a GPU device.

Notation
Let X := X_0 : n × p and Y := Y_0 : n × q be the two matrices (or "blocks") of data. They comprise n observations collected on p and q variables, respectively. Before analysis, the X and Y matrices are transformed by subtracting their column averages. Scaling each column by its standard deviation is also often recommended (Geladi & Kowalski 1986). To make explicit the columns of an n × r matrix A, we write A := [a_1, . . . , a_r] := (a_j). We also note A_{•h} := [a_1, . . . , a_h] for the submatrix of the first h columns (1 ≤ h ≤ r), and A_{•h̄} := [a_{h+1}, . . . , a_r] for the submatrix of the remaining ones. For two zero-mean vectors ũ and ṽ of the same size, we note Cov(ũ, ṽ) = ũ^T ṽ and Cor(ũ, ṽ) = ũ^T ṽ / √((ũ^T ũ)(ṽ^T ṽ)). The scaling factor (n − 1)^{−1} is omitted. Let X^+ be the Moore-Penrose (generalised) inverse of X. We denote the space spanned by the columns of X by I(X). The orthogonal projection matrix onto I(X) is denoted P_X = XX^+, and P_X^⊥ = I − P_X denotes the orthogonal projection matrix onto the space orthogonal to I(X). When the inverse of X^T X exists, we have X^+ = (X^T X)^{−1} X^T. The L_p vector norm (p = 1, 2) of an n-length vector x is ‖x‖_p = (Σ_{i=1}^n |x_i|^p)^{1/p}. The Frobenius norm of an n × r matrix A is ‖A‖_F = ‖vec(A)‖_2, where the vec operator transforms A into an nr × 1 vector by stacking its columns. The soft thresholding function is g_soft(x, λ) = sign(x)(|x| − λ)_+, where (a)_+ = max(a, 0). Finally, ⊗ denotes the Kronecker product (Lütkepohl 2005, (3), p. 662).
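Since the soft thresholding function g_soft plays a central role in the penalised methods below, a small numpy sketch may help fix ideas (the example values are arbitrary):

```python
import numpy as np

def g_soft(x, lam):
    """Soft thresholding g_soft(x, lam) = sign(x) * (|x| - lam)_+, componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([-3.0, -0.5, 0.0, 0.2, 2.0])
shrunk = g_soft(x, 1.0)              # -> [-2.0, 0.0, 0.0, 0.0, 1.0]

# The Frobenius norm of a matrix is the L2 norm of its vectorisation
A = np.arange(6.0).reshape(2, 3)
frob = np.linalg.norm(A.ravel(), 2)  # equals np.linalg.norm(A, 'fro')
```

Entries whose magnitude falls below the threshold are set exactly to zero, which is the mechanism that produces sparse weight vectors.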

The four standard PLS methods
In this subsection, we survey the four standard PLS methods (i)-(iv) introduced in Section 1. The PLS methods are used to iteratively construct a small number H of pairs of meaningful linear combinations ξ_h = X_0 w_h and ω_h = Y_0 z_h (h = 1, . . . , H) of the original X- and Y-variables, in the sense that they should have maximal covariance (or correlation). These linear combinations reveal the linear relationships between the two blocks of data. They are called component scores, or latent variables, while w_h and z_h are the (adjusted) weights. The construction of the components leads to decompositions of the original matrices X and Y of the form:

X = Ξ_H C_H^T + F_H^X,    Y = Ω_H D_H^T + F_H^Y,    (1)

where Ξ_H = (ξ_j), Ω_H = (ω_j) are called the X- and Y-scores, C_H = (c_j), D_H = (d_j) are the X- and Y-loadings, and F_H^X, F_H^Y are the residual matrices. The decomposition model states that the blocks X and Y can be expressed as a weighted combination of H latent features plus some noise. For any set of X- and Y-scores we have the decomposition

X = P_{Ξ_H} X + P_{Ξ_H}^⊥ X,    Y = P_{Ω_H} Y + P_{Ω_H}^⊥ Y.

Thus the elements of the decomposition model can always be written as

C_H = X^T Ξ_H (Ξ_H^T Ξ_H)^{−1},    D_H = Y^T Ω_H (Ω_H^T Ω_H)^{−1},    F_H^X = P_{Ξ_H}^⊥ X,    F_H^Y = P_{Ω_H}^⊥ Y,

which may be simplified when certain orthogonality properties hold for Ξ_H and Ω_H. We can interpret the X- and Y-loading matrices C_H and D_H as regression coefficients in the reconstruction of the X and Y blocks. In this way, the size of the loadings describes the importance of the score for reconstructing the original data.
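The interpretation of the loadings as regression coefficients can be checked numerically. The following numpy sketch (illustrative only, with arbitrary toy scores standing in for the X-scores) computes the loadings by least squares and verifies that the residual block is orthogonal to the scores:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, H = 50, 6, 2
X = rng.normal(size=(n, p))
Xi = rng.normal(size=(n, H))          # stand-in X-scores (plays the role of Xi_H)

# Loadings C_H^T are the least-squares coefficients of X on the scores,
# so that X = Xi C_H^T + F with the residual F orthogonal to the scores
CT = np.linalg.lstsq(Xi, X, rcond=None)[0]
F = X - Xi @ CT

score_residual_overlap = np.linalg.norm(Xi.T @ F)   # ~0 by construction
```

The reconstruction X = Xi·C^T + F holds exactly by definition of F, and the near-zero overlap confirms that F is the part of X not explained by the scores.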
Although the four PLS methods all conform to the unified framework, the methods provide different loadings and error matrices due to the calculation of the scores. We now detail the four classical cases (i)-(iv). For each method, the scores may be calculated as linear combinations ξ_h = Xw_h and ω_h = Y z_h. We state the PLS objective functions in terms of the data X and Y for the adjusted weights w_h and z_h (or w̃_h and z̃_h for unnormalised versions). We comment on the focus of the analysis in each of these methods and survey important properties. Additional technical details may be found in Wegelin (2000) and Lafaye de Micheaux et al. (2017).
(i) For PLS-SVD, the roles of X and Y are symmetric, and the analysis focuses on modeling shared information (rather than prediction) as measured by the cross-product matrix R = X^T Y. Note that R is proportional to the empirical covariances between X- and Y-variables when the columns of X and Y are centered. When the columns are standardised, R is proportional to the empirical correlations between X- and Y-variables. In this setting, the method is sometimes called PLSC, for Partial Least Squares Correlation (Krishnan et al. 2011). For PLS-SVD, the weights at step h, (w_h, z_h), are defined as the solution to the optimisation problem:

(w_h, z_h) = argmax_{w, z} Cov(Xw, Y z),    subject to ‖w‖_2 = ‖z‖_2 = 1 and w^T w_j = z^T z_j = 0, for 1 ≤ j < h.

PLS-SVD searches for orthonormal directions w_h and orthonormal directions z_h (h = 1, . . . , H) such that the score vectors have maximal covariance. In contrast to the other PLS methods, the X-scores ξ_h and Y-scores ω_h are in general not mutually orthogonal (Rosipal & Krämer 2006), which can make interpretation of the results more delicate.

(ii) For PLS-W2A, the weights at step h, (u_h, v_h), are defined as the solution to the optimisation problem:

(u_h, v_h) = argmax_{u, v} Cov(X_{h−1}u, Y_{h−1}v),    subject to ‖u‖_2 = ‖v‖_2 = 1,

where X_{h−1} and Y_{h−1} are deflated versions of X_0 and Y_0 whose construction ensures ξ_h^T ξ_j = ω_h^T ω_j = 0, for 1 ≤ j < h. PLS-W2A thus searches for successive X-score vectors (resp. Y-score vectors) that are orthogonal to the previous ones. The first pair (ξ_1, ω_1) of X- and Y-score vectors is the one with maximal covariance. The next pairs are searched for using successively deflated (i.e., after removing the information contained in the previous pairs of scores) versions of X_0 and Y_0.

(iii) For CCA, the weights at step h, (w̃_h, z̃_h), are defined as the solution to the optimisation problem:

(w̃_h, z̃_h) = argmax_{w̃, z̃} Cor(Xw̃, Y z̃),    subject to ξ^T ξ_j = ω^T ω_j = 0, for 1 ≤ j < h.

Classical CCA relates X and Y by maximising the correlation between the scores (also called canonical variates) ξ_h = Xw̃_h and ω_h = Y z̃_h, but without imposing a unit norm on the adjusted weight (or canonical) vectors w̃_h and z̃_h. Using the change of variables w̃ = (X^T X)^{−1/2} w and z̃ = (Y^T Y)^{−1/2} z, we have the following equivalent objective for (w_h, z_h):

(w_h, z_h) = argmax_{w, z} w^T (X^T X)^{−1/2} X^T Y (Y^T Y)^{−1/2} z,    subject to ‖w‖_2 = ‖z‖_2 = 1 and w^T w_j = z^T z_j = 0, for 1 ≤ j < h.
However, when p > n, this optimisation is not feasible because (X^T X)^{−1} and (Y^T Y)^{−1} are not uniquely defined. A common approach to this problem is to use the Moore-Penrose inverse (Mardia et al. 1979, p. 287; Nielsen 2002, p. 75). However, this approach can produce a meaningless solution, with correlations trivially equal to one. Moreover, in this case, a small change in the data can lead to large changes in the weights and scores (Wegelin 2000, pp. 26-27). An alternative approach is to perform regularisation on the sample covariance matrices. Regularisation was first introduced to the CCA method by Vinod (1976) and later refined by Leurgans et al. (1993). This method is known as regularised CCA (rCCA) or canonical ridge analysis. The regularisation is imposed by replacing the matrices X^T X and Y^T Y with X^T X + λ_x I_p and Y^T Y + λ_y I_q, respectively, in the optimisation criterion.
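The following numpy sketch (illustrative only; the data, dimensions and the values of λ_x and λ_y are arbitrary) shows the ridge-regularised construction in the p > n case, where ordinary CCA would fail:

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

rng = np.random.default_rng(2)
n, p, q = 30, 40, 5                       # p > n, so X^T X is singular
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
Y = rng.normal(size=(n, q)); Y -= Y.mean(axis=0)

lam_x, lam_y = 1.0, 1.0                   # ridge penalties make X^T X + lam*I invertible
Kx = inv_sqrt(X.T @ X + lam_x * np.eye(p))
Ky = inv_sqrt(Y.T @ Y + lam_y * np.eye(q))

# First pair of regularised canonical weights via the SVD
U, s, Vt = np.linalg.svd(Kx @ (X.T @ Y) @ Ky, full_matrices=False)
w, z = Kx @ U[:, 0], Ky @ Vt[0]
rho = np.corrcoef(X @ w, Y @ z)[0, 1]     # regularised canonical correlation
```

With λ_x, λ_y > 0 the regularised correlation stays strictly below one, avoiding the degenerate solutions produced by the Moore-Penrose approach.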
The regularisation parameters λ_x and λ_y should be nonnegative; when they are strictly positive, the regularised covariance matrices are guaranteed to be nonsingular. Note that ordinary CCA is obtained at λ*_x = λ*_y = 0, and PLS-SVD is obtained with λ*_x = λ*_y = 1, where λ*_x and λ*_y denote the regularisation parameters rescaled to the interval [0, 1] (so that λ* = 1 corresponds to λ → ∞). Other approaches exist; see e.g., (Witten et al. 2009, eq. (13)). (iv) PLS-R (also called PLS1 if q = 1 or PLS2 if q > 1) is a regression technique that predicts one set of data from another, hence termed asymmetric, while describing their common structure. It finds latent variables (also called component scores) that model X and simultaneously predict Y. While several algorithms have been developed to solve this problem, we focus on the two most well known variants. The first is an extension of the Nonlinear estimation by Iterative PArtial Least Squares (NIPALS) algorithm, modified by Wold et al. (1984) to obtain a regularised component based regression tool. The second is the Statistically Inspired Modification of PLS (SIMPLS) developed by de Jong (1993). Other PLS regression algorithms can be found in Lindgren & Rännar (1998); see also Alin (2009). For PLS-R, the weights at step h, (u_h, v_h), are defined as the solution to the optimisation problem:

(u_h, v_h) = argmax_{u, v} Cov(X_{h−1}u, Y_{h−1}v),    subject to ‖u‖_2 = ‖v‖_2 = 1, with deflation ensuring ξ_h^T ξ_j = 0 for 1 ≤ j < h.

The first pair (ξ_1, ω_1) of X- and Y-score vectors is the one with maximal covariance. The next pairs are searched for using successively deflated versions of X_0 and Y_0, where, in contrast to PLS-W2A, both blocks are deflated with respect to the X-scores. Two equivalent versions of the NIPALS algorithm are found in the literature, depending on whether z_h is scaled (Höskuldsson 1988) or not (Wold et al. 1984, Tenenhaus 1998). We note that both algorithms provide equivalent regression parameters, and differ only in the calculation of the Y-scores and loadings. At the end of both algorithms, the fitted values Ŷ_H are computed (Phatak & de Jong 1997, eq. (20)). In addition to the usual decomposition equations (1) that will be explicated below, the PLS regression algorithm includes an additional "inner relationship" which relates the Y-scores Ω_{•h} to the X-scores Ξ_{•h} explicitly:

Ω_{•h} = Ξ_{•h} Δ_h + E_h,

where Δ_h is the h × h diagonal matrix with entries b_j = ω_j^T ξ_j / (ξ_j^T ξ_j) and E_h is a residual matrix. The second commonly used PLS-R algorithm, called SIMPLS (de Jong 1993), calculates the PLS latent components directly as linear combinations of the original variables. The objective function to optimise is (Chun & Keleş 2010, eq. (3)):

(w_h, z_h) = argmax_{w, z} Cov(Xw, Y z),    subject to ‖w‖_2 = ‖z‖_2 = 1 and w^T X^T X w_j = 0 for 1 ≤ j < h.

Computational details
As described in Section 2.2, the PLS algorithms iteratively construct linear combinations of the original data. These quantities are calculated recursively using

ξ_h = X_{h−1} u_h    and    ω_h = Y_{h−1} v_h,

where u_h and v_h are called the weight vectors (also direction vectors, saliences, or effective loading weight vectors), while w_h and z_h are called the adjusted weights.
Since the adjusted weights define the score vectors in terms of the original data matrices (as opposed to the deflated matrices), the size of the elements of the weight vector can be interpreted as the effect of the corresponding variables in the component score. On the other hand, the weight vectors u_h and v_h are defined in terms of the deflated matrices and cannot be interpreted this way. However, these deflated weights offer easier computational implementation. Each of the four PLS methods can be described as an optimisation problem for the weights (u_h, v_h) coupled with a deflation method to ensure the required orthogonality. Generally, this optimisation problem can be stated as:

(u_h, v_h) = argmax_{u, v} Cov(X_{h−1}u, Y_{h−1}v),    subject to ‖u‖_2 = ‖v‖_2 = 1,    (3)

which, setting M_{h−1} = X_{h−1}^T Y_{h−1}, is equivalent to solving

(u_h, v_h) = argmax_{‖u‖_2 = ‖v‖_2 = 1} u^T M_{h−1} v.

The solution to this problem may be calculated by taking the first left and right singular vectors from the SVD of M_{h−1}. The remaining deflation methods and initialisations for the PLS methods are given in Table 1.
Table 1: The four PLS methods with the corresponding initialisation and deflation steps.
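A minimal numpy sketch of this iterate-SVD-then-deflate scheme, here with the PLS-W2A deflation in which both blocks are projected onto the orthogonal complement of their scores (the data and dimensions are arbitrary; the paper's own implementation is in R):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q, H = 40, 10, 6, 2
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
Y = rng.normal(size=(n, q)); Y -= Y.mean(axis=0)

Xh, Yh = X.copy(), Y.copy()
xi_list, om_list = [], []
for h in range(H):
    M = Xh.T @ Yh                          # M_{h-1} = X_{h-1}^T Y_{h-1}
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0]                  # first left/right singular vectors
    xi, om = Xh @ u, Yh @ v                # score vectors xi_h, omega_h
    # PLS-W2A deflation: project out the information carried by the scores
    Xh = Xh - np.outer(xi, xi @ Xh) / (xi @ xi)
    Yh = Yh - np.outer(om, om @ Yh) / (om @ om)
    xi_list.append(xi); om_list.append(om)
```

Because each deflated block is orthogonal to the score just extracted, successive X-scores (and Y-scores) come out mutually orthogonal, as the method requires.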

Penalised PLS
All of the previous PLS methods can be written in terms of a single optimisation problem coupled with an appropriate deflation to ensure the required orthogonality constraints. In this section, we introduce the framework for penalised partial least squares in the unified PLS methodology. Several penalisations are then considered and presented in a unified algorithm that can perform all four PLS methods and their regularised versions.

Finding the PLS weights
We focus on a particular class of algorithms designed to induce sparsity in the PLS weights. For this class of algorithms, the weights (u_h, v_h) at step h are defined as the solution to the optimisation problem:

(u_h, v_h) = argmin_{u, v} ‖M_{h−1} − u v^T‖_F^2 + P_{λ1}(u) + P_{λ2}(v),    (4)

where P_{λ1} and P_{λ2} are convex penalty functions with tuning parameters λ_1 and λ_2, and the matrix M_{h−1} is defined using the appropriate deflation (see Table 1). The resulting objective function can be recognised as the Lagrangian of the penalised matrix decomposition introduced by Witten et al. (2009). If the penalty functions are homogeneous of order one, P(cx) = cP(x) for all c > 0, then the weights (u_h, v_h) can be found by iteratively calculating

ũ = argmin_u ‖M_{h−1} − u v^T‖_F^2 + P_{λ1}(u),    u = ũ / ‖ũ‖_2,    (5)

ṽ = argmin_v ‖M_{h−1} − u v^T‖_F^2 + P_{λ2}(v),    v = ṽ / ‖ṽ‖_2.    (6)

This algorithm for finding penalised weights was studied in Allen et al. (2014) for regularised principal component analysis. We note that similar variants of this process have been proposed elsewhere, but these methods are presented as stand-alone methods such as sparse PCA (Shen & Huang 2008), sparse CCA (Witten et al. 2009), sparse PLS regression or sparse canonical PLS (Lê Cao et al. 2008). Our work provides a clear, unified framework for multiple PLS methods that allows for calculation of the weights with sparsity-inducing penalisations. Note that the optimisation problem (4) is a biconvex problem and thus multiple optimal solutions may exist (Tseng 1988). By iteratively solving the optimisation problems in (5) and (6) we are guaranteed to improve or retain the same objective value in problem (4). However, if the initial values for u and v are poorly chosen, it is possible for the method to get caught in a local optimum or to end up oscillating between solutions (Netrapalli et al. 2015). We use the SVD solution to initialise our algorithm, an approach that is common for implementing these types of algorithms (Allen & Tibshirani 2010, Witten et al. 2009).
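A numpy sketch of this alternating scheme with lasso sparsifiers and the SVD initialisation discussed above (the data, penalty values and iteration count are arbitrary; the λ/2 soft threshold is the lasso sparsifier used by the sparse PLS method):

```python
import numpy as np

def g_soft(x, lam):
    """Componentwise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def penalised_weights(M, lam1, lam2, n_iter=50):
    """Alternate the penalised updates for u and v, starting from the SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0]                  # SVD initialisation
    for _ in range(n_iter):
        ut = g_soft(M @ v, lam1 / 2)       # sparsify, then rescale to unit norm
        u = ut / np.linalg.norm(ut)
        vt = g_soft(M.T @ u, lam2 / 2)
        v = vt / np.linalg.norm(vt)
    return u, v

rng = np.random.default_rng(4)
M = rng.normal(size=(20, 8))
M[:3, :] += 5.0                            # three informative X-variables
u, v = penalised_weights(M, lam1=4.0, lam2=0.0)
```

With a large enough λ_1, the weights on uninformative variables are thresholded to exactly zero while the returned vector keeps unit norm.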

Deflation and the PLS weights
Computing the penalised versions of the four different PLS methods is achieved by alternating between two subtasks: solving (5) and (6) for the weights, and matrix deflation. Without the penalties P_{λ1} and P_{λ2}, the matrix deflation enforces certain orthogonality constraints for each of the four standard PLS methods. However, with either penalty P_{λ1} or P_{λ2}, these deflations do not ensure any orthogonality constraints. Although these constraints are lost, Witten et al. (2009) state that it is not clear that orthogonality is desirable, as it may be at odds with sparsity. That is, enforcing the additional orthogonality constraints may result in less sparse solutions. Similar to Witten et al. (2009) and Lê Cao et al. (2008), we use the standard deflation methods in our implementation of the penalised PLS methods. Alternative matrix deflations have been proposed for sparse PCA (Mackey 2009). However, these methods have not been extended to the general penalised PLS framework.
Another key observation is that for the NIPALS PLS regression, PLS-W2A and CCA, the scores are defined in terms of the deflated matrices: ξ_h = X_{h−1} u_h and ω_h = Y_{h−1} v_h. Consequently, the sparse estimators given by solving (5) and (6) perform variable selection on the deflated matrices. Thus the latent components formed using these methods have the interpretation given by Remark 1 below. In our implementation, we also calculate the adjusted weights w_h and z_h (or w̃_h and z̃_h), where ξ_h = Xw_h and ω_h = Y z_h. These weights allow for direct interpretation of the selected variables in the PLS model. Note that although w_h and z_h allow for direct interpretation of the selected variables, the sparsity is enforced on u_h and v_h. So if u_h and v_h are sparse, this does not necessarily mean that the adjusted weights w_h and z_h will be sparse.

Remark 1
The first latent variable ξ_1 = Xu_1 is built as a sparse linear combination (with weights in u_1) of the original variables. The next latent variable ξ_2 = P_{ξ_1}^⊥ Xu_2 is the part of the sparse linear combination (with weights in u_2) of the original variables that has not already been explained by the first latent variable. More generally, the h-th latent variable is built as a sparse linear combination of the original variables, from which we extract (by projection) the information not already brought by the previous latent variables.
We note that an alternative SIMPLS formulation for the penalised PLS methods was proposed in a regression setting by Allen et al. (2013). In the SIMPLS method the weights are directly interpreted in terms of the original variables, so w h = u h and z h = v h . Although this method allows for direct penalisation of the weights, the orthogonality conditions still do not hold. We have incorporated this method and a similar variant for PLS-W2A into our R package bigsgPLS to allow for direct penalisation of the weights.

The penalised PLS methods
Computationally, the PLS method is an efficient approach to sparse latent variable modeling. The main computational cost is in solving for the PLS weights as described in equations (5) and (6). The cost of solving for these weights is penalty-specific but can be minimal in a number of useful applications. We detail a few examples where these equations have been solved analytically and provide an algorithm that treats the penalised versions of the four PLS cases (i)-(iv).

Sparse PLS
The sparse PLS method uses lasso penalties in the optimisation problems (5) and (6): P_{λ1}(ũ) = λ_1 ‖ũ‖_1 and P_{λ2}(ṽ) = λ_2 ‖ṽ‖_1. These penalties induce the desired sparsity of the weight vectors u_h = ũ_h / ‖ũ_h‖_2 and v_h = ṽ_h / ‖ṽ_h‖_2, thanks to the well known properties of the L_1-norm or lasso penalty (Tibshirani 1994). The closed form solution for this problem is

ũ_h = g_soft(M_{h−1} v, λ_1 / 2),    ṽ_h = g_soft(M_{h−1}^T u, λ_2 / 2),

where g_soft(·, λ) is the soft thresholding function, with the understanding that the function is applied componentwise. To unify these results with the ones to come, we introduce the sparsifier functions S_u and S_v to denote analytical functions that provide the solution for the weights. The sparsifiers are functions of the data M, the fixed weight u (or v) and additional penalty specific parameters θ_u (or θ_v). For sparse PLS we have

S_u(M, v, θ_u) = g_soft(M v, λ_1 / 2),    S_v(M, u, θ_v) = g_soft(M^T u, λ_2 / 2),

where θ_u = λ_1 and θ_v = λ_2.

Group PLS
There are many statistical problems in which the data has a natural grouping structure. In these problems, it is preferable to estimate all coefficients within a group to be zero or nonzero simultaneously. A leading example is in gene expression data, where genes within the same gene pathway have a similar biological function. Selecting a group amounts to selecting a pathway (Palermo et al. 2011). Variables can be grouped for other reasons, for example, when we have categorical covariates in our data. The categorical data is coded by its factor levels using dummy variables, and selection or exclusion of this group of dummy variables is equivalent to selection of the categorical covariate (Nguyen & Rocke 2002).
Let us consider a situation where both matrices X and Y can be divided respectively into K and L sub-matrices (i.e., groups) X^{(k)} : n × p_k and Y^{(l)} : n × q_l, where p_k (resp. q_l) is the number of covariates in group k (resp. l). The aim is to select only a few groups of X which are related to a few groups of Y. We define M^{(k,·)} = X^{(k)T} Y and M^{(·,l)} = Y^{(l)T} X. Group selection is accomplished using the group lasso penalties (Yuan & Lin 2006) in the optimisation problems (5) and (6):

P_{λ1}(ũ) = λ_1 Σ_{k=1}^K √p_k ‖ũ^{(k)}‖_2,    P_{λ2}(ṽ) = λ_2 Σ_{l=1}^L √q_l ‖ṽ^{(l)}‖_2,

where ũ^{(k)} and ṽ^{(l)} are the subvectors of the (unscaled) weights ũ and ṽ corresponding to the variables in group k of X and group l of Y, respectively. This penalty is a group generalisation of the lasso penalty. Depending on the tuning parameter λ_1 ≥ 0 (or λ_2 ≥ 0), the entire weight subvector ũ^{(k)} (or ṽ^{(l)}) will be zero, or nonzero together. The closed form solution of the group PLS method for the k-th subvector of ũ and the l-th subvector of ṽ is given by

ũ^{(k)} = (1 − λ_1 √p_k / (2 ‖M^{(k,·)} v‖_2))_+ M^{(k,·)} v,    ṽ^{(l)} = (1 − λ_2 √q_l / (2 ‖M^{(·,l)} u‖_2))_+ M^{(·,l)} u.

The sparsifier functions are applied groupwise, with θ_u = (p_1, . . . , p_K, λ_1) and θ_v = (q_1, . . . , q_L, λ_2). A proof of these equations is given in (Liquet et al. 2016).
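A numpy sketch of the groupwise sparsifier (the group sizes, data and λ are arbitrary; the √p_k factor and the factor 2 follow the closed form of Liquet et al. (2016)):

```python
import numpy as np

def group_sparsifier(Mv, groups, lam):
    """Groupwise thresholding: a whole block of weights is zeroed or shrunk.

    Implements u^(k) = (1 - lam * sqrt(p_k) / (2 ||Mv^(k)||_2))_+ Mv^(k)."""
    out = np.zeros_like(Mv)
    for idx in groups:                     # idx = indices of one group
        sub = Mv[idx]
        norm = np.linalg.norm(sub)
        if norm > 0:
            scale = max(0.0, 1.0 - lam * np.sqrt(len(idx)) / (2.0 * norm))
            out[idx] = scale * sub
    return out

Mv = np.array([0.1, -0.2, 0.1, 4.0, 3.0, -2.0])
groups = [np.arange(0, 3), np.arange(3, 6)]   # two groups of three variables
ut = group_sparsifier(Mv, groups, lam=1.0)    # weak group dropped entirely
```

The first group has a small norm and is eliminated as a block, while the second group survives with all of its entries merely shrunk, which is exactly the all-in-or-all-out behaviour described above.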

Sparse group PLS
One potential drawback of gPLS is that it includes a group in the model only when all individual weights in that group are non-zero. However, sometimes we would like both sparsity of groups and sparsity within each group. For example, if the predictor matrix contains genes, we might be interested in identifying particularly important genes in pathways of interest. The sparse group lasso (Simon et al. 2013) achieves this within-group sparsity. The sparse group selection in the PLS methodology is accomplished using the sparse group lasso penalties in the optimisation problems (5) and (6):
$$\lambda_1\!\left((1-\alpha_1)\sum_{k=1}^{K}\sqrt{p_k}\,\|\tilde{u}^{(k)}\|_2 + \alpha_1\|\tilde{u}\|_1\right) \quad\text{and}\quad \lambda_2\!\left((1-\alpha_2)\sum_{l=1}^{L}\sqrt{q_l}\,\|\tilde{v}^{(l)}\|_2 + \alpha_2\|\tilde{v}\|_1\right).$$
The sparse group penalty introduces tuning parameters $\alpha_1$ and $\alpha_2$ which provide a link between the group lasso penalty ($\alpha_1 = 0$, $\alpha_2 = 0$) and the lasso ($\alpha_1 = 1$, $\alpha_2 = 1$). Depending on the combination of $\alpha_1$ and $\lambda_1$ (or $\alpha_2$ and $\lambda_2$) the (unscaled) weight subvector $\tilde{u}^{(k)}$ (or $\tilde{v}^{(l)}$) will be eliminated entirely, or sparsely estimated. The adaptation of the sparse group penalty for the PLS method was first considered in Liquet et al. (2016). The closed form solution of the sparse group PLS method for the $k$-th subvector of $\tilde{u}$ is given by
$$\tilde{u}^{(k)} = \begin{cases} 0 & \text{if } \|g_1\|_2 \leq \dfrac{\lambda_1(1-\alpha_1)\sqrt{p_k}}{2}, \\[1ex] \left(1 - \dfrac{\lambda_1(1-\alpha_1)\sqrt{p_k}}{2\,\|g_1\|_2}\right) g_1 & \text{otherwise}, \end{cases}$$
where $g_1 = g_{\mathrm{soft}}\!\left(M^{(k,\cdot)}\tilde{v}, \lambda_1\alpha_1/2\right)$. Similarly, the $l$-th subvector of $\tilde{v}$ is given by
$$\tilde{v}^{(l)} = \begin{cases} 0 & \text{if } \|g_2\|_2 \leq \dfrac{\lambda_2(1-\alpha_2)\sqrt{q_l}}{2}, \\[1ex] \left(1 - \dfrac{\lambda_2(1-\alpha_2)\sqrt{q_l}}{2\,\|g_2\|_2}\right) g_2 & \text{otherwise}, \end{cases}$$
where $g_2 = g_{\mathrm{soft}}\!\left(M^{(\cdot,l)}\tilde{u}, \lambda_2\alpha_2/2\right)$. The corresponding sparsifier functions are again applied groupwise, with $\theta_u = (p_1, \ldots, p_K, \lambda_1, \alpha_1)$ and $\theta_v = (q_1, \ldots, q_L, \lambda_2, \alpha_2)$.

Other penalties
The penalties discussed so far have enforced general sparsity or sparsity for a known grouping structure in the data. Extensions to the group structured sparsity in the partial least squares setting have also been considered in terms of overlapping groups (Chen & Liu 2012), or additional grouping restrictions (Sutton et al. 2018). The penalisations considered so far have all resulted in closed form solutions for the updates of $u$ and $v$. We note here that this is not always the case. The fused lasso penalty (Tibshirani et al. 2005) is defined by
$$P_{\mathrm{fused}}(\tilde{u}) = \lambda_1 \sum_{j=2}^{p} |\tilde{u}_j - \tilde{u}_{j-1}| + \lambda_2 \sum_{j=1}^{p} |\tilde{u}_j|.$$
The first term in this penalty causes neighboring coefficients to shrink together, and will cause some to be identical, while the second causes regular lasso shrinkage of the parameters for variable selection. Unlike the previous methods, a closed form solution for the fused lasso cannot be directly achieved, because the penalty is not a separable function of the coordinates. Since there is no closed form solution, we cannot write a sparsifier function, and we therefore do not consider this method further. We note that methods exist that can solve the fused lasso problem, either by reparameterisation, dynamic programming or path based algorithms. In particular, Witten et al. (2009) have considered solving problems of the form (5) and (6) with the fused lasso penalty. In their paper, they propose a sparse and fused penalised CCA; however, in their derivation they assume $X^TX = I$ and $Y^TY = I$. In our framework, this method would be a sparse and fused penalised PLS-SVD.

The unified algorithm
We now have all the ingredients to propose a unifying algorithm. Algorithm 1 allows for a unified computation of all four PLS versions (i)-(iv), with the possibility to add sparsity. Adjusted weights can also be computed and, if the number of requested components $H$ is greater than 1, a deflation step is executed. Note that, if $Y$ is taken equal to $X$, this algorithm performs PCA, as well as its sparse PCA versions; in that case, the optimised criteria are simply restated in terms of variance instead of covariance.
Remark 2 On line 10, we impose that $\|u_1\|_2 = \|v_1\|_2 = 1$ and $u_{1,i} > 0$ where $i = \operatorname{argmax}_{1\leq j\leq p} |u_{1,j}|$, to ensure uniqueness of the results. Note that $w_h$ and $z_h$ of lines 22, 24 and 28 correspond to $\tilde{w}_h$ and $\tilde{z}_h$ in the text.
At this point, it is worthwhile noting that when p and q are small compared to n, one can slightly modify Algorithm 1 by using the recursive equations that express M h in terms of M h−1 , instead of using the recursions on X h and Y h . These recursions are provided in Table 1 of Section 2.3. This should increase speed of execution of the algorithm.

Algorithm 1 Sparse and non-sparse PLS algorithm for the four cases (i)-(iv)
Require: $X_0 = X$ (with $n$ rows), $Y_0 = Y$, $H$, Case, $\epsilon$, $\theta_x$, $\theta_y$, $S_u$, $S_v$
1: Extract $\lambda_x$ and $\lambda_y$ as the first elements of $\theta_x$ and $\theta_y$ respectively
2: $M_0 \leftarrow X_0^T Y_0/(n-1)$; $P \leftarrow I_p$ and $Q \leftarrow I_q$
For each component $h = 1, \ldots, H$:
8: Apply the SVD to $M_{h-1}$ and extract the first triplet $(\delta_1, u_1, v_1)$ of singular value and vectors
9: Set $u_h \leftarrow u_1$ and $v_h \leftarrow v_1$
10: while $u_h$ has not converged $(*)$ do [Sparsity step, if $\lambda_x > 0$ or $\lambda_y > 0$]
11: end while [End of sparsity step]
14: If Case (i) then [Adjusted weights step]
17: If Case (ii) then
20: If Case (iv) then
26: end if [End of adjusted weights step]
30: If Case (i) or (iii) then [Deflation step]
$(*)$ Convergence of a vector $t$ is tested on the change in $t$, i.e., $\|t_{\mathrm{old}} - t_{\mathrm{new}}\|_2 / \|t_{\mathrm{new}}\|_2 < \epsilon$, where $\epsilon$ is "small", e.g., $10^{-6}$.
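As a plain-Python sketch (not the bigsgPLS code), the core of the unified loop in regression mode might look as follows. `S_u` and `S_v` stand for any sparsifier functions of the kind introduced earlier; the $n \times n$ projector is formed only for clarity.

```python
import numpy as np

def unified_pls(X, Y, H, S_u, S_v, theta_u, theta_v, eps=1e-6, max_iter=500):
    """Sketch of the unified PLS loop: SVD start, alternating sparsifier
    updates until convergence, then deflation by the X-score."""
    n = X.shape[0]
    X_h, Y_h = X.copy(), Y.copy()
    U, V = [], []
    for h in range(H):
        M = X_h.T @ Y_h / (n - 1)
        W, d, Zt = np.linalg.svd(M, full_matrices=False)
        u, v = W[:, 0], Zt[0, :]           # first singular triplet start
        for _ in range(max_iter):
            u_new = S_u(M, v, theta_u)
            u_new = u_new / np.linalg.norm(u_new)
            v_new = S_v(M, u_new, theta_v)
            v_new = v_new / np.linalg.norm(v_new)
            delta = np.linalg.norm(u_new - u) / np.linalg.norm(u_new)
            u, v = u_new, v_new
            if delta < eps:                # convergence test (*)
                break
        U.append(u)
        V.append(v)
        # deflation: remove the X-score direction from both blocks
        xi = X_h @ u
        proj = np.outer(xi, xi) / (xi @ xi)
        X_h = X_h - proj @ X_h
        Y_h = Y_h - proj @ Y_h
    return np.column_stack(U), np.column_stack(V)
```

With the unpenalised sparsifiers $S_u(M, v, \cdot) = Mv$ and $S_v(M, u, \cdot) = M^Tu$, the inner loop reproduces the leading singular vectors of $M$.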
Moreover, one can use various approaches to deal with the cases when $n$, $p$ or $q$ are too large in our algorithm, making some objects too large for the computer's memory. These can be divided into chunk approaches and streaming (or incremental) approaches, which are presented in the next subsections. Of course, any combination of these approaches can be used if necessary. Some of these approaches might also increase the computation speed, even in a context where all objects would fit into memory.

Matrix multiplication using chunks
To scale Algorithm 1 to big data (i.e., very large $n$, $p$ and $q$), we can use a simple idea to multiply two very large matrices that are too big to fit into the computer's memory.
Let us divide the total number $n$ of rows of $X$ (resp. of $Y$) into blocks $X_{(g)}$ (resp. $Y_{(g)}$), $g = 1, \ldots, G$, of (approximately) the same size. We have
$$X^T Y = \sum_{g=1}^{G} X_{(g)}^T Y_{(g)}.$$
The number of blocks $G$ has to be chosen so that each product $X_{(g)}^T Y_{(g)}: p \times q$ can be computed within the available RAM. Note that all these products can be performed in parallel if the required computing equipment is available.
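The blockwise cross-product can be sketched as follows (an illustrative numpy sketch; in practice each row block would be read from disk, or handled by a separate process, but the arithmetic is identical):

```python
import numpy as np

def crossprod_chunked(X, Y, G):
    """Compute X^T Y as the sum of G blockwise products X_(g)^T Y_(g);
    only one pair of row blocks needs to be in memory at a time."""
    n, p = X.shape
    q = Y.shape[1]
    bounds = np.linspace(0, n, G + 1, dtype=int)
    M = np.zeros((p, q))
    for g in range(G):
        rows = slice(bounds[g], bounds[g + 1])
        M += X[rows].T @ Y[rows]   # each p x q block product fits in RAM
    return M
```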

SVD when p or q is very large
The main step of our algorithm is the computation of the first triplet $(\delta_1, u_1, v_1)$ in the SVD of the $(p \times q)$ matrices $M_{h-1}$. The irlba R package (Baglama & Reichel 2015) can be used to compute this triplet quite easily for values of $p$ and $q$ as large as 50,000. This package is based on an augmented implicitly restarted Lanczos bidiagonalisation method.
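In Python, the analogous truncated computation of a single leading triplet can be done with scipy's Lanczos-based `svds` (a rough analogue of irlba; the matrix size here is illustrative):

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
M = rng.standard_normal((2000, 40))

# compute only the leading singular triplet (delta_1, u_1, v_1)
u1_mat, d1, v1t = svds(M, k=1)
delta1, u1, v1 = d1[0], u1_mat[:, 0], v1t[0]
```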
When $p$ (or $q$) is much larger, another approach is necessary to compute the SVD of $M_{h-1}$; see e.g., Liang et al. (2016). Suppose that $p$ is large but not $q$, which is common in several applications, so that $p \gg q$. Algorithm 1 in Liang et al. (2016) is now presented to highlight the elements needed in our algorithm. We can partition a large matrix $M: p \times q$ by rows into a small number $s$ of submatrices (or chunks) $M_i: g \times q$, $i = 1, \ldots, s$, with $p = sg$. We can take $g$ much larger than $q$ as long as it is still possible to compute the SVD of these submatrices. Let $M_i = U_i D_i V_i^T$ be the SVD of $M_i$, where $U_i: g \times g$, $D_i: g \times q$ and $V_i: q \times q$. Define $\tilde{U}: p \times p$ as the block diagonal matrix with blocks $U_1, \ldots, U_s$, and $H: p \times q$ as the matrix obtained by stacking the blocks $D_1V_1^T, \ldots, D_sV_s^T$, so that $M = \tilde{U}H$. Let $H = U_H D_H V_H^T$ be the SVD of $H$. Note that $H$ is as large as $M$, so one may wonder what has been gained with this approach. But $D_i$ being a $g \times q$ rectangular diagonal matrix, $D_iV_i^T$ has $g - q$ zero row vectors at its bottom. Consequently, the matrix $H$ contains only $sq$ non-zero row vectors. Now let $\bar{H} = RH: p \times q$ be a rearrangement in rows of $H$ such that its first $sq$ row vectors are non-zero and the $p - sq$ zero rows are at its bottom. We then have to compute $U^*D^*V^{*T}$, the SVD of the (much smaller) $sq \times q$ matrix formed by the first $sq$ rows of $\bar{H}$, from which an SVD of $M$ follows. For our PLS algorithm, we only need to compute the first triplet in the SVD of $M$, namely $\delta_1 = D^*_{1,1}$, $v_1$ the first column of $V^*$, and $u_1$ the first column of $\tilde{U}U_H$. It follows that only the $q$ first triplets of the SVDs of the $M_i$'s are required. So, overall, we "only" have to compute $s$ truncated ($q$-triplet) SVDs (of the $M_i$'s) and one truncated (first-triplet) SVD of the $sq \times q$ matrix, which is easily obtained from these truncated SVDs. Moreover, we can compute $u_1$ from $v_1$ using the simple formula $u_1 = Mv_1/\|Mv_1\|_2$ (using a chunk approach).
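The split-and-merge computation of the first triplet can be sketched in numpy as follows (an illustration of the idea, with `s` row chunks; only the per-chunk $D_iV_i^T$ blocks and one small SVD are needed):

```python
import numpy as np

def first_triplet_split_merge(M, s):
    """First singular triplet of M (p x q) via row chunks: SVD each chunk,
    stack the small D_i V_i^T blocks (s*q x q in total), then one small SVD."""
    chunks = np.array_split(M, s, axis=0)
    blocks = []
    for M_i in chunks:
        # reduced SVD: only q triplets per chunk are needed
        _, D_i, V_i_t = np.linalg.svd(M_i, full_matrices=False)
        blocks.append(D_i[:, None] * V_i_t)   # the q x q block D_i V_i^T
    H_small = np.vstack(blocks)               # the non-zero rows of H
    _, D_star, V_star_t = np.linalg.svd(H_small, full_matrices=False)
    delta1, v1 = D_star[0], V_star_t[0]
    u1 = M @ v1 / delta1   # recover u_1 with one (chunkable) product
    return delta1, u1, v1

rng = np.random.default_rng(2)
M = rng.standard_normal((300, 5))
delta1, u1, v1 = first_triplet_split_merge(M, s=6)
```

Since the block diagonal matrix of left chunk factors has orthonormal columns, the singular values of the stacked blocks are exactly those of $M$, which the sketch verifies against a direct SVD.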
When q is larger than p, we just partition M in columns instead of rows. When both p and q are large, one can adapt Algorithm 2 in Liang et al. (2016) which generalises the above. They even propose a third algorithm for the case of online (streaming) SVDs.
Note that these algorithms based on the split-and-merge strategy possess an embarrassingly parallel structure and thus can be efficiently implemented on a distributed or multicore machine.

Incremental SVD when n is large
We want to compute the truncated SVD of the matrix $M_h = X_h^TY_h$ when $n$ is very large (and the $X$- and $Y$-matrices are split in blocks, or chunks, of size $n/G$ for some given $G$). One can use the divide and conquer approach presented in subsection 4.1 to first compute the matrix $M_h = X_h^TY_h$ and then evaluate the SVD of this matrix. We present here an alternative to the approach of Cardot & Degras (2017), considering an incremental version of the SVD.

Remark 3 Note that this approximation is in fact exact when H = rank(M n ).
So if we want to use this approach in our algorithm, we would have to compute all the singular elements and not only the first triplet. This being said, if for example $q$ is not too large (e.g., $q = 1$, which is common in applications), this is not a problem anymore. Moreover, one weight vector is easily obtained from the other, since $u_1 = Mv_1/\|Mv_1\|_2$ and $v_1 = M^Tu_1/\|M^Tu_1\|_2$ (Lafaye de Micheaux et al. 2017). Note also that $u_1$ is the first eigenvector of the $p \times p$ matrix $(Y^TX)^TY^TX$ whereas $v_1$ is the first eigenvector of the $q \times q$ matrix $(X^TY)^TX^TY$. We only need to compute either $u_1$ (if $p < q$) or $v_1$ (if $q \leq p$), from which we obtain the other one.
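These relations are easy to check numerically (an illustrative sketch with small random matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 8))
Y = rng.standard_normal((50, 3))
M = X.T @ Y

# u_1 from the SVD of M ...
u1 = np.linalg.svd(M)[0][:, 0]
# ... is (up to sign) the leading eigenvector of M M^T = X^T Y Y^T X
eigvals, eigvecs = np.linalg.eigh(M @ M.T)   # eigenvalues in ascending order
u1_eig = eigvecs[:, -1]
# the other weight follows directly: here q <= p, so compute v_1 from u_1
v1 = M.T @ u1 / np.linalg.norm(M.T @ u1)
```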
At this point, one can write the updated approximation in terms of the previous singular elements and the new data. It then suffices to perform the SVD of the matrix $Q_{n+1}$ of dimension $(H+1) \times (H+1)$.
To keep the approximation $M^{(H)}_{n+1}$ of $M_{n+1}$ at rank $H$, the row and column of $\Delta_{n+1}$ containing the smallest singular value are deleted, and the associated singular vectors are deleted from $U_{n+1}$ and $V_{n+1}$.
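A rank-one version of this incremental update can be sketched as follows. This is an illustrative numpy sketch in the spirit of classical incremental SVD algorithms (e.g. Brand 2002); the names `P`, `Qb` and `K` are chosen for the example and need not match the notation above, and the sketch assumes the new data leaves the current subspaces (non-zero residuals). When the current rank-$H$ approximation is exact, the truncated update coincides with the best rank-$H$ approximation of the updated matrix.

```python
import numpy as np

def incremental_svd_update(U, D, V, x, y):
    """Update a rank-H SVD M ~ U diag(D) V^T to approximate M + x y^T:
    form a small (H+1) x (H+1) core matrix, take its SVD, rotate the
    bases, and drop the smallest triplet to return to rank H."""
    H = len(D)
    x_par = U.T @ x
    x_res = x - U @ x_par          # component of x outside span(U)
    rx = np.linalg.norm(x_res)
    y_par = V.T @ y
    y_res = y - V @ y_par          # component of y outside span(V)
    ry = np.linalg.norm(y_res)
    P = np.column_stack([U, x_res / rx])    # extended orthonormal bases
    Qb = np.column_stack([V, y_res / ry])
    # small core matrix of dimension (H+1) x (H+1)
    K = np.zeros((H + 1, H + 1))
    K[:H, :H] = np.diag(D)
    K += np.outer(np.append(x_par, rx), np.append(y_par, ry))
    Uk, Dk, Vkt = np.linalg.svd(K)
    # rotate and truncate: keep the H largest triplets
    return (P @ Uk)[:, :H], Dk[:H], (Qb @ Vkt.T)[:, :H]
```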
This incremental way to compute the SVD provides a promising alternative for handling very large sample sizes (especially when q is not too large). Moreover, the incremental SVD is well suited to a data stream context.

Numerical Experiments
In this section, we use the previously mentioned packages and our new R package to carry out a short simulation study in order to illustrate the numerical behavior of the new proposed approach. The experiments have been conducted using a laptop with a 2.53 GHz processor and 8 GB of memory. The parallel strategy utilises four processor cores.
We present two simulations to illustrate the good performance of the proposed approaches and the scalability to large sample sizes of our algorithm. The first simulation considers the PLS-R model (case (iv)) on group structure data while the second simulation presents an extension of PLS approaches to discriminant analysis.

Group PLS model
We generate data with a group structure: 20 groups of 20 variables for $X$ ($p = 400$) and 25 groups of 20 variables for $Y$ ($q = 500$). To highlight the scalability of our algorithm, we generate two big matrices from the following models, linked by $H = 2$ latent variables:
$$X = \Xi_H C_H^T + F_H^X, \qquad Y = \Xi_H D_H^T + F_H^Y,$$
where the matrix $\Xi_H = (\xi_1, \xi_2)$ contains the 2 latent variables $\xi_1$ and $\xi_2$, whose entries have all been independently generated from a standard normal distribution. The rows of the residual matrix $F_H^X$ (respectively, $F_H^Y$) have been generated from a multivariate normal distribution with mean zero and covariance matrix $\Sigma_X = 1.5^2 I_p$ (resp. $\Sigma_Y = 1.5^2 I_q$).
Among the 20 groups of $X$, only four groups, each containing 15 true variables and five noise variables, are associated with the response variables of $Y$. We set the $p$-vector $c_1$ (first column of the $C_H$ matrix) to have 15 1's, 30 $-1$'s and 15 1.5's, the other entries all being set to 0. Each set of 15 non-zero coefficients is assigned randomly to one group, along with 5 zero coefficients corresponding to noise variables. The vector $c_2$ is chosen in the same way as $c_1$. The two columns of $D_H$ are $q$-vectors containing 15 $-1$'s, 15 $-1.5$'s and 30 1's, with the remaining entries equal to 0, so that the matrix $Y$ has a similar group structure with four groups containing the signal. Finally, the sample size is set to $n = 560{,}000$ observations, which corresponds to storage requirements of approximately 5 GB for each matrix, for a total exceeding the 8 GB of memory available on our laptop.
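The simulation design can be sketched at small scale as follows. This is a hypothetical numpy sketch: the signal groups are placed deterministically at the start of the coefficient vectors rather than assigned at random, $c_2$ is simply set equal to $c_1$, and $n$ is reduced so the example runs in memory.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, H = 1000, 400, 500, 2   # small-n stand-in for n = 560,000

Xi = rng.standard_normal((n, H))          # the two latent variables

# c1: four signal groups of 20 variables (15 signal + 5 noise each)
c1 = np.zeros(p)
c1[0:15] = 1.0                            # group 1
c1[20:35] = -1.0                          # group 2
c1[40:55] = -1.0                          # group 3
c1[60:75] = 1.5                           # group 4
C = np.column_stack([c1, c1])             # c2 chosen like c1

d1 = np.zeros(q)
d1[0:15] = -1.0
d1[20:35] = 1.0
d1[40:55] = 1.0
d1[60:75] = -1.5
D = np.column_stack([d1, d1])

# X = Xi C^T + F_X and Y = Xi D^T + F_Y with N(0, 1.5^2) residuals
X = Xi @ C.T + 1.5 * rng.standard_normal((n, p))
Y = Xi @ D.T + 1.5 * rng.standard_normal((n, q))
```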
The top four plots of Figure 1 show the results of the group PLS estimated with only n = 100 observations. For such a sample size, the usual group PLS can be used without any computational time or memory issues. In this case, group PLS manages to select the relevant groups and performs well to estimate the weight vectors u 1 and v 1 related to the first component and the weight vectors u 2 and v 2 related to the second component.
The bottom four plots of Figure 1 show the results of the group PLS estimated on the full data set, which can only be analysed by using the extended version of our algorithm for big data. In this run, we use $G = 100$ chunks to enable the matrix multiplications. The execution time was around 15 minutes for two components ($H = 2$); obtaining the first component took less than 2 minutes. We can observe that the signal has been perfectly identified and estimated, which is expected for such a huge amount of information.
Note that for validation purposes, the extended version of our algorithm for big data has been executed and exactly matched the usual algorithm on the small data set (n = 100). Note that the values of u 1 (resp. v 1 ) have been rescaled so that its norm equals that of the original c 1 (resp. d 1 ).

Case of regularised PLS-DA
We consider here the case of qualitative response variables for discriminant analysis. In this framework, PLS approaches have often been used (Nguyen & Rocke 2002) by recoding the qualitative response as a dummy block matrix $Y: n \times c$ indicating the class of each sample ($c$ being the number of categories). One can also directly apply PLS regression on the data as if $Y$ were a matrix with continuous entries (from now on called PLS-DA). Note that Barker & Rayens (2003) give some theoretical justification for this approach. A group and a sparse group version have been proposed by Liquet et al. (2016) using only penalties on the loadings related to the variables in $X$. Our unified algorithm is then naturally extended in the same way to deal with categorical variables. We illustrate it on a big data set defined as follows. Let $A_k$ be the set of indices $(i,j)$ of the $i$-th observation and $j$-th variable that are associated with the corresponding gray cell as shown in Figure 2.
We have, for all $i = 1, \ldots, n$ and $j = 1, \ldots, p$,
$$X_{i,j} = \sum_{k=1}^{6} \mu_k \mathbb{1}\{(i,j) \in A_k\} + \epsilon_{i,j},$$
where $\mu^T = (\mu_1, \ldots, \mu_6) = (-1.0, 1.5, 1.0, 2.5, -0.5, 2.0)$ and $\epsilon_{i,j} \overset{iid}{\sim} N(0, 1)$. As illustrated in Figure 2, the matrix $X$ is composed of 6 groups of $p_k = 100$ variables ($p = \sum_{k=1}^{6} p_k = 600$) and each of the 3 categories of the response variable is linked to two groups of variables. We used a sample size of $n = 486{,}000$, which corresponds to storage requirements of approximately 5 GB for the $X$ matrix. We use $G = 100$ chunks for computing the different matrix products. The run took around 9 minutes for a model using 2 components. The relevant groups have been selected in both components. We randomly sample 9,000 observations and present in Figure 3 their projection onto the two components estimated on the full data set. A nice discrimination of the 3 categories of the response variable is observed.

The EMNIST data set
The now famous MNIST data set (LeCun & Cortes 2010) has become a standard benchmark for classification systems. The Extended MNIST (EMNIST), another freely available data set, "constitutes a more challenging classification task involving letters and digits, and that shares the same image structure and parameters as the original MNIST task" (Cohen et al. 2017). This data set consists of n = 280, 000 handwritten digit images. It contains an equal number of samples for each digit class (0 to 9). The images are already split into a training set of 240,000 cases and a test set of 40,000 cases. Each case is a 28 × 28 pixels gray-scale image whose intensity values range in [0,255]. Overall, if imported into R, they would use around 1.6 GB. We applied our PLS-DA algorithm on this data set, where the matrix X : n × p (p = 784) contains the images, and the matrix Y : n × q (q = 10) indicates the label of each image. We ran a PLS-DA model with 20 latent components which enabled us to get an accuracy (percentage of correct classification) of 86% in around 3 minutes using 2 cores. Note that in Cohen et al. (2017), linear classifiers produced an accuracy of around 85% for the EMNIST data. We also considered a larger version of the EMNIST data set where each image in the training set is rotated 5 and 10 degrees in both a clockwise and counterclockwise direction. The enlarged data set contains 1,200,000 images and uses 7.5 GB of memory. It is more representative of data sets which are too big to be loaded into a standard R session. The PLS-DA model with 20 latent components obtained an accuracy of 85.5% in around 18 minutes with 2 cores.
We also explored different versions of our algorithm by varying the number of chunks (ng option) to compute the different cross product matrices required during the run. The algorithm was run 10 times on the rotated data set and we investigated the effect of using 2 to 6 cores on the computation time to obtain the first component.
The results illustrated in Figure 4 indicate that increasing the number of chunks (for a fixed number of cores) leads to better computational performance. However, it is worth noting that for ng set to a small value (say 5), increasing the number of cores had a negative effect on computation time. This increase may be due to the computational overhead involved in splitting and recombining the data sets across more cores. This computational trade-off can be hard to predict, but we have found that using more cores together with a larger ng improves the computational performance on big data.

Conclusion and future work
This paper surveys four popular partial least squares methods and unifies these methods with recent variable selection techniques based on penalised singular value decomposition. We present a general framework for both symmetric and asymmetric penalised PLS methods and showcase some possible convex penalties. A unified algorithm is described and implemented for the penalised PLS methods, and we offer further extensions to deal with massive data sets (n, p and q very large). A full comparison in terms of time and memory of the different proposed extensions is an open area of future research.
Aside from computational issues, it is unclear whether retaining the deflation steps of the usual PLS methods is appropriate under penalisation. In particular, we note that the orthogonality constraints of the original PLS methods are not retained by the penalised methods. Further development of our methods could seek to preserve these orthogonality constraints. We are pursuing this open area using ideas from Tibshirani & Taylor (2011) and Mackey (2009) for the simple lasso penalty. However, further investigation is required in the context of more complex penalties such as group or sparse group penalties.