Weighted sparse principal component analysis

Sparse principal component analysis (SPCA) has been shown to be a fruitful method for the analysis of high- dimensional data. So far, however, no method has been proposed that allows to assign elementwise weights to the matrix of residuals, although this may have several useful applications. We propose a novel SPCA method that includes the ﬂ exibility to weight at the level of the elements of the data matrix. The superior performance of the weighted SPCA approach compared to unweighted SPCA is shown for data simulated according to the prevailing multiplicative-additive error model. In addition, applying weighted SPCA to genomewide transcription rates obtained soon after vaccination, resulted in a biologically meaningful selection of variables with components that are associated to the measured vaccine ef ﬁ cacy. The MATLAB implementation of the weighted sparse PCA method is freely available from https://github.com/katrijnvandeun/WSPCA.


Introduction
The use of high-throughput experimentation is becoming more popular in both research and practice. Examples include the use of spectroscopic techniques (visible/near-infrared spectroscopy in food processing [1], nuclear magnetic resonance (NMR)); spectrometric techniques (e.g., mass-spectrometry of urine samples for diagnosis [2]); imaging techniques (e.g., functional magnetic resonance in brain research [3]) and of high throughput screening (e.g., genomewide screening of blood samples of patients [4]). These advanced measurement technologies result in vast amounts of data which characteristically contain several thousands or even millions of different variables measured on a limited number of samples. In the general statistical literature such very wide data are called high-dimensional [5]. In many disciplines, including chemistry and biology, dimension reduction techniques such as singular value decomposition and principal component analysis are very popular tools for the analysis of multivariate data [6]. Yet, with the increasing number of measurements, more advanced dimension reduction techniques are often needed for several reasons. First, often there is a need to automatically select the most important variables within the large set for further investigation [7] because such sparseness in the number of features at play corresponds to biological reality. Examples of this include transcriptional regulation -which is steered by a few transcription factors only -and the limited number of interacting elements in biological networks [8]. Furthermore, as shown in Ref. [9] PCA performs poorly in case of many more variables than samples (in terms of yielding inconsistent estimates of the loadings), an issue that can also be addressed by selecting a subset of variables having non-zero loadings on the components. Yet another reason for imposing sparseness is to address issues related to the interpretation of the estimates. In principal component analysis (PCA) for example, this would be typically done by inspecting the loadings of each of the variables. With thousands of variables such inspection is an infeasible task, and as shown in the important paper of [10], the common practice of neglecting small loadings is flawed. Hence, sparse PCA (SPCA) techniques [11,12] have been proposed that perform variable selection by making the components dependent on a limited number of variables in the sense that many of the loadings are constrained to be zero. Furthermore, SPCA methods have better statistical properties in terms of consistency of the estimates than ordinary PCA in the high-dimensional setting [13]. Hence, not surprisingly, SPCA techniques and other sparse dimension reduction techniques have become a very popular tool for the analysis of high-dimensional data. Yet, this is a field of statistical research that is still growing and subject to improvements.
Like classical PCA, the majority of SPCA methods have several drawbacks. One problem of both PCA and SPCA is the implicit assumption of independent and identically distributed noise [14,15]. Homoscedastic error is often not a realistic assumption: in analytical chemistry, for example, the error increases proportionally with the concentration of the analyte [16]; the same holds for microarray fluorescence intensities, which have a variance that increases with their mean [17,18]. This kind of error variance that increases with the strength of the signal is known as multiplicative error and implies heteroscedasticity (yet not the other way around). Also for fMRI data the assumption of homoscedastic additive noise is unrealistic given the strong structural dependencies in such data [14]. In these situations, methods that do not account for the heteroscedasticity may yield incorrect estimates of the standard errors but may stil be unbiased: As long as the data (signal with noise) scatter symmetrically around the true model scores (signal only), PCA will yield unbiased estimates. However, for analytical chemistry and micro-array expression data usually a model is assumed that contains both additive and multiplicative error [19]; this is a model with heteroscedastic noise that is no longer symmetrically scattered around the true score and hence implies biased estimates in ordinary least squares approaches. An elementwise weighted least squares approach with weights that are inversely proportional to the noise can be used to resolve this issue [19,20]. For classical PCA, Wentzell et al. [18] developed such an approach that incorporates differential weighting of the residuals with inversely proportional weights (see also [21]). No such approach that can weight at the level of single measurements has been proposed yet for sparse PCA.
In summary, researchers and others working with high-dimensional data in disciplines like chemometrics, bioinformatics, and neuroinformatics, are in need of sparse dimension reduction techniques that allow for a differential weighting of each of the elements in the matrix of residuals. Beyond the possibility of incorporating general noise structures, this approach has several other uses. One is that it naturally accommodates for missing data (by introducing zero weights). Furthermore, weighted PCA has also been used for differential weighting of the variables and of the observations. One family of examples that calls for such differential weighting are problems of chemical process monitoring and process control in which one may wish to form principal components for process forecasting. The components in question are based on variables or observations measured at different time points, so more weight being needed for more recent information. A second group of examples may be found in standard atmospheric science. When variables correspond to spatial locations, and weights are needed to account for an uneven spacing between locations (see further [22,23]). Also in these cases, an extension of SPCA to a weighted SPCA (WSPCA) method is useful. Hence, we propose a sparse PCA method that allows to weight each of the elements of the data matrix individually in this paper.
The remainder of this paper is structured as follows: First we introduce the sparse PCA model and the elementwise weighted approach, including a discussion of the estimation of the model parameters and the algorithm. The performance of the novel WSPCA method, compared to state-of-the-art SPCA methods, is assessed in a simulation study. Next, we show how the inclusion of weights that account for the heterogeneity of the measurement error improves prediction for publicly available data from a study on genomewide transcription in samples obtained from subjects vaccinated against influenza [24].

Methods
We will make use of the following formal notation: matrices will be denoted using bold uppercase letters, the transpose by the superscript T , vectors using bold lowercase, and scalars using lowercase italics.
Furthermore, we will use the convention to indicate the cardinality of a running index by the capital of the letter used to run the index (e.g., this paper deals with J variables with j running from 1 to J), see Ref. [25].

Formal model
We consider the following low rank approximation of the data matrix X , with T of size I Â R a matrix of component scores of the I observational units on R underlying components, with P of size J Â R a matrix with loadings of the J variables on the R components, and with E an I Â J matrix of residuals. The number of components, R, is supposed to be small and T is subject to an identification constraint; here we will impose T T T ¼ I, which implies orthonormality of the component scores. Furthermore, we will consider model (1) to be subject to sparseness constraints on the loadings: that is, many of the p jr are assumed to be zero but it is unknown which ones these are. Related sparse models have been proposed in the work of [11,14,[26][27][28]. Except for [14] all these papers implicitly rely on the assumption of independent and identically distributed noise. Standard estimation of the classical PCA model in (1) is based on minimizing the least squares criterion X À TP T 2 with respect to T and P and such that T T T ¼ I. This comes down to minimizing the squared Frobenius norm of the residuals: jjEjj 2 ¼ P i;j e 2 ij . To obtain sparse loadings, the state-of-the-art approach is to add a sparsity inducing penalty such as the lasso, that is, to add jjPjj 1 ¼ P j;r pjr to the least squares criterion (see, e.g. Refs. [11,27,29,30]), min T;P X À TP T 2 þ λkPk 1 : The lasso penalty is tuned by the metaparameter λ ! 0 with larger values implying more sparseness and shrinkage to zero. When λ ¼ 0, a non-sparse PCA approach is obtained. Note that the lasso is a shrinkage and selection operator meaning that it shrinks all loadings to zero and some exactly to zero [31].
Instead of adding the penalty to the ordinary least squares criterion in (2), however, we will rely on penalizing an elementwise weighted least squares criterion. The resulting objective function hence becomes with W a given matrix of nonnegative weights (e.g., reciprocal to the standard error of the noise or 0 in case of missing data) and ∘ denoting the elementwise or Hadamard product: ðW ∘ XÞ ij ¼ w ij x ij . (If all weights are put equal to 1, the first term in (4) becomes an ordinary least squares objective function, which goes with the implicit assumption that the residuals e ij ¼ x ij À P r t ir p jr are independent and identically distributed.) To allow for more flexibility and the use of stability selection as a tool for model selection [32] (see the model selection section further on in this paper), we finally propose a slightly more general objective function: The penalty term now includes a matrix B, which contains sets of prespecified, nonnegative weights that allow to tune the strength of the penalty for the individual loadings. Through these matrices we can accommodate the adaptive [30] or randomized lasso [32]; these may be preferred over the ordinary lasso (which is obtained if all entries of B are put equal to one). The ordinary lasso may be biased in the sense that the non-zero coefficients are shrunken too much [33,34], and this issue is addressed by the adaptive lasso that penalizes coefficients that have more information content less, for example, by using weights that are inversely proportional to the coefficients resulting from a non-sparse analysis. The randomized lasso, used in combination with a resampling scheme, has been put forward as a powerful and simple procedure that is consistent in terms of variable selection in conditions where the ordinary lasso is not; see Ref. [32] for details.

Algorithm
To find optimizers of the objective function (5), we will make use of an alternating procedure in which each of the arguments (T and P) is updated conditional upon fixed values for R, λ, W and B (how to obtain R, λ, and B will be discussed in the Section on tuning and model selection). The non-standard elementwise weighted least squares problem is solved by relying on a majorize minimize (MM) procedure [35]. This is a numerical technique that introduces surrogate functions to solve complicated optimization problems. Kiers et al. [36] discusses how to set up a majorizing function for the weighted least squares problem that takes the form of an ordinary least squares problem. Because MM is closed under summation (that is, the sum of majorizing functions is a majorizing function) the problem posed in (5) can be reformulated as a standard SPCA problem. Several procedures have been proposed to solve the SPCA problem, these include both procedures for sparse component weights [12] and for sparse component loadings [11,27,37] (see Ref. [29] for a discussion of the two approaches). All these methods rely on an alternating scheme where T and P are updated in turn. In our procedure, for the estimation of the sparse loadings P, we follow the univariate soft thresholding approach taken by Ref. [27] because this is a closed form and computationally efficient solution to the conditional optimization problem. Furthermore, it allows to solve the problem for a prespecified number of zero-coefficients by adapting the lasso tuning parameter throughout the iterative procedure (see Refs. [26,27] for further details). For the estimation of the component score matrix T, we make use of a standard -though not so generally known -matrix algebra result [38] to estimate all components simultaneously ( [27] uses a sequential deflation approach and thereby looses control over the orthogonality of the component scores).
The following iterative scheme underlies our algorithm (assuming that W, B, R, and λ are known): 1. T and P are initialized with T ðtÞ and P ðtÞ , respectively, and with T ðtÞ subject to the constraint: ðT ðtÞ Þ T T ðtÞ ¼ I. 2. Calculate the update T ðtþ1Þ of T conditional upon T ðtÞ ; P ðtÞ under the constraint. Set T ðtÞ ¼ T ðtþ1Þ . 3. Calculate the update P ðtþ1Þ of P conditional upon T ðtÞ ; P ðtÞ . Set P ðtÞ ¼ P ðtþ1Þ . 4. Check the stop criteria: If met, terminate, else return to step 2.
The details of the procedure, including a derivation of the parameter estimates, are given in the appendix. Here, we discuss a few important properties of the procedure. A first one is convergence. In each step, it is guaranteed that the loss is non-increasing. Hence, given that the loss is bounded from below by zero, the procedure converges to a fixed point (for suitable starting values, unsuitable values including, e.g., P ¼ 0). Termination of the algorithm is based on two criteria: A first one is convergence of the loss function values (when the difference in loss between subsequent iterations is smaller than some threshold, the procedure terminates) and a second one is a maximum number of iterations. A second important property of the procedure is that it is a local optimization heuristic. Therefore we use a multistart approach: The algorithm is run multiple times with different initializations and the solution with the lowest value of (5) is retained as the final solution.
The algorithm avoids costly operations in terms of computation time or memory. Specifically, because of 1) the separability of the loss function in the variables and 2) the orthogonality of the component scores, the estimation of the loadings becomes a univariate soft thresholding problem [31]. This means that all loadings can be calculated in parallel similar to performing a number of simple regressions (this is regression with only one predictor).

Tuning and model selection
The derivation of the parameter estimates is conditional upon fixed values for the weights W, B, the number of components R, and the lasso tuning parameter λ. Here we present a model selection strategy for the latter three; the discussion of how to obtain the weights w ij is deferred to the Results section.
Several strategies have been proposed for selecting the number of components, including the use of scree plots [22,39] and cross-validation [40]. Also for the tuning of the lasso parameter, cross-validation is a highly popular model selection tool [31]. Yet, this approach is known to result in too many non-zero coefficients (i.e., it results in a superset of the relevant variables [41]). An alternative that allows to control the false positive error rate is stability selection [32]. Here, we propose to first select the number of components based on a scree plot and to subsequently tune the lasso with stability selection; see Ref. [42] for a similar proposition. Support for a sequential approach which involves the scree test as a first step was recently found by Vervloet et al. [39]: using simulated data they showed that the sequential approach outperformed a simultaneous cross-validation strategy (in a non-sparse setting). Nevertheless, more research into model selection for SPCA methods is needed.
The first step in our model selection approach is to determine the number of components via a scree test. The scree test is based on a visual inspection of the proportion of variance accounted (VAF) for by the components resulting from the non-sparse weighted PCA analysis of the data; we refer to the Results section for the details of the calculation of the VAF by each component.
Given a fixed number of components R, a second step is to tune the lasso using stability selection in combination with a randomized lasso (i.e., the elements of B are set equal to one or to a constant value α 2 ½0:2; 0:8, which comes down to either b ij ¼ 1 or b ij ¼ α determined in a random fashion; see Ref. [32] for more details). Stability selection is a resampling-based method where different subsamples are generated (by sampling from the observations with replacement) in order to estimate the selection probability of each loading (i.e., the probability that the loading in question is nonzero): Given some level of sparsity, each subsample is analyzed with WSPCA and for each loading the proportion of samples in which it is non-zero is recorded as an estimate of the selection probability. Only loadings with a high selection probability (values between 0.6 and 0.9 are recommended in Ref. [32]) are considered truly non-zero. Meinshausen and Bühlmann [32] propose to tune the level of sparsity by controlling the expected number of false positives (this is, coefficients that are wrongly estimated to be non-zero). This can be done by making use of the following theorem. Let π thr denote the set probability threshold (e.g., π thr ¼ 0:9), EðVÞ the expected number of false positives, and q Λ the number of coefficients with selection probability equal to or larger than π thr . Under the assumption of exchangeability and that the selection by the weighted sparse PCA procedure is not worse than random guessing, Meinshausen and Bühlmann [32] have shown that with J denoting the number of variables. From this inequality it follows that for a given value for the probability threshold (e.g., π thr ¼ 0:9) and upperbound on the expected number of false positives (e.g., The latter equation (7) allows to tune λ: λ must be chosen such that the number of coefficients with estimated selection probability exceeding π thr is equal to q Λ . For further details on how to find this λ we refer to the original publication.
The result of the stability selection procedure, is that the set of nonzero loadings is determined for each component. However, estimates of the actual values of the non-zero loadings are not obtained and, if needed, have to be estimated in an additional analysis that constraints loadings not belonging to the stable set to zero and yields non-zero estimates for the stable set. This can be accomplished, for example, by a non-sparse constrained PCA analysis or a constrained sparse PCA analysis that is tuned such that no additional zeros are introduced. Variables with zero coefficients on all components can be removed from the data to speed up the calculations. In the appendix, we give more detail on how such a constrained analysis can be performed and the accompanying MATLAB code is online available: https://github.com/katrijnvand eun/WSPCA.

Results and discussion
In this section we will illustrate the added value of the weighted sparse PCA approach in two ways. First, in a simulation study, we show that weighted sparse PCA outperforms non-sparse and/or non-weighted approaches in recovering the loadings for data generated under a model which has both additive and multiplicative noise. Second, in an empirical analysis of transcriptomics data we show that a weighted sparse PCA approach (with weights accounting for differences in measurement quality of the intensities) gives better predictions than a non-weighted sparse PCA approach.

Simulation study
In analytical chemistry and also for microarray gene expression data, a hybrid model containing both additive and multiplicative noise is assumed to underlie the data [16,19]: with μ ij the signal or true score, η ij eN ð0; σ 2 η Þ the multiplicative error, and εeN ð0; σ 2 ε Þ the additive error. Here, we will assume for the offset α ¼ 0.
With microarray expression data, the convention is to log transform the data so our interest is mainly in lnðx ij Þ. Furthermore, we assume that a PCA model underlies the log transformed data. This means that in absence of noise, lnðx ij Þ ¼ lnðμ ij Þ ¼ P r t ir p jr or x ij ¼ expð P r t ir p jr Þ. Hence, in presence of noise the following model including both additive and multiplicative noise is assumed: Of crucial importance for the weighted SPCA approach proposed in this paper, are the weights. As noted before, these should be proportional to the reciprocals of the error variance of the log-transformed observed data, see Ref. [16] for the derivation of this result. In the simulation study, we will use the model parameters to calculate these weights. With real data this is not possible. One option is to use the expressions derived in Section 3.4 of Rocke and Durbin [19], another may be to calculate the variances over replicate data if these are available.
To obtain some intuitions on the additive-multiplicative error model in (8) for log-transformed data and how the analysis of such data may benefit from a differential weighting, we plotted data with and without noise, both for the transformed and untransformed scores. In Fig. 1 this is done for nine different levels of additive and multiplicative noise. We created a vector of equally spaced values between À3 and 3 representing the log transformed true scores lnðμ ij Þ; these are taken as the reference values on the x-axis. Against these values, several other scores have been plotted for each of the nine cases. On the one hand this is lnðμ ij Þ itself (the red dots on the first bisector), but also the true scores before log transform are shown (μ ij , the green line), in addition to perturbed scores before (x ij , the orange dots) and after log transform (lnðx ij ), the blue dots). Finally, the black line represents the weights w ij as expressed by (12). The case of data with no noise at all is depicted in the upperleft corner of Fig. 1: Because the true and perturbed scores are the same, some of the lines overlay and the weights are not defined because of the zero variances in (10). The leftmost column contains data with multiplicative noise only while the top row contains data with additive noise only (which corresponds to the homoscedastic case that is implicitly assumed in typical SPCA methods). Of primary interest are the transformed scores (which are the blue dots representing the observed data that will be the input for the SPCA analysis) and the red line representing the true model scores that -ideally-are recovered by the weighted analysis. As long as the blue dots scatter symmetrically around the red line, the unweighted analysis should be able to recover the model scores μ ij . However, when the scatter is no longer symmetric, as in the case with multiplicative noise ¼ 0:01 and additive noise ¼ 0:05 depicted in the middle row, rightmost column of Fig. 1, there will be bias unless focus is on the part of the data showing symmetric scatter. In this particular panel (σ 2 ε ¼ 0:05 and σ 2 η ¼ 0:01), this is precisely what the weights suggested by Rocke and Durbin do: The data corresponding to lnðμ ij Þ < 0 on the abscissa get a zero weight and this is the region that introduces bias. Using the same kind of reasoning, it can also be seen that the approximation by Rocke and Durbin [19] is problematic when there is no multiplicative noise and only additive noise (middle and rightmost column in the top row). There almost all weight is put on a very limited number of observations; hence, when using the weighted approach, with weights calculated as suggested by Ref. [19], poor performance can be expected in case of homoscedastic noise. This was also pointed out in Ref. [43].
In the simulation study, we manipulated the level of additive and multiplicative noise using the same levels as for the illustrative example, namely 0 (no noise), 0.01, and 0.05. We also manipulated the number of observations units (I ¼ 100 and I ¼ 1000) and the number of variables (J ¼ 100 and J ¼ 1000) in addition to the level of sparsity (with two levels: no sparsity or 50% meaning that half of the loadings are set equal to zero). The number of components R was fixed to two. Using a fully crossed design, this leads to 3 Â 3 Â 2 Â 2 Â 2 ¼ 72 conditions. In each condition we generated ten replicated data sets, so overall we generated 720 data sets.
To generate true scores lnðμ ij Þ in accordance with a (sparse) PCA model, we created a matrix of component scores T and a matrix of sparse loadings P as follows: An initial matrix of size I Â J was created by sampling from the standard normal distribution. From the singular value decomposition of this matrix, orthonormal component scores T were obtained by taking the R left singular vectors corresponding to the largest singular values. Loadings were obtained by generating values from a uniform distribution. In the condition with 50% sparsity, half of the loadings, selected in a random way, were set equal to zero yielding P. On average the GINI index was equal to 0.69 for the loading matrices generated with sparseness and 0.36 for those generated without sparseness. True scores were then calculated as μ ij ¼ expð P r t ir p jr Þ. To these multiplicative and/or additive noise was added under model equation (8). The final step was to take the log transform. Yet, in case of relatively large additive noise, negative x ij may result in a number of cases for which the log transform is not defined. Such cases were treated as missing data, this is with w ij ¼ 0. Weights as defined by Rocke and Durbin were obtained by calculating Note that in those cases without additive noise (i.e., σ 2 , meaning that all weights are equal and, hence, that the weighted PCA approach is equivalent to an unweighted approach. In case of no noise at all (both σ 2 η ¼ 0 and σ 2 ε ¼ 0) the weights are not defined. In this condition of the simulation study, we set all weights equal to one. When there is no multiplicative noise, σ 2 ij , which means that more weight is given to data with higher true scores.
To check the added value of imposing sparseness and weighting, four different analyses were performed to cover all possible combinations (weighted vs. unweighted combined with sparse vs. non-sparse). Thus each of the data sets was analyzed with classical PCA, weighted PCA, sparse PCA (using the PMA R package [11]) and with our newly developed weighted sparse PCA method. For the weighted approaches we used the weights defined by (12) making use of the true values of μ ij and the error variances σ 2 ε and σ 2 η . Note that each of the four methods used could deal with missing data; for the weighted approaches this was possible by setting w ij ¼ 0; PMA allows for missing values in the data matrix by excluding them from the computations (see Ref. [11]), and classical PCA was performed using our own implementation of weighted sparse PCA (setting λ ¼ 0, i.e., no sparseness, and w ij ¼ 0 for the missing values and w ij ¼ 1 elsewhere). Both sparse analyses were run using information on the true number of non-zero loadings. As the data are generated with orthogonal T, the PMA package was run with the option of orthogonal component scores.
The performance of the methods was assessed at two levels: first, at the level of the loadings to assess configurational similarity, and, second, at the level of the reconstructed scores to measure fit. First, the recovery of the loadings was assessed by calculating Tucker's coefficient of congruence [44]. Let b p jr be the estimated loading of variable j on component r, then Tucker's congruence between the estimated and true loadings for component r is (13) with φ r ¼ 1 indicating perfect congruence and values closer to zero indicating lower congruence. We calculated the average congruence over the two components, taking into account that sparse PCA is invariant under reflection and permutation; this means that in case of a negative congruence value the estimated loadings were reflected and that φ was calculated as the maximum over all possible permutations of the components. PCA and WPCA have rotational freedom; this was taken into account by rotating towards the true loading matrix. Note that rotational freedom -in general -does not apply to sparse PCA techniques as optimality is defined by a penalized loss and rotation will result in higher values of the penalty term. MATLAB code implementing this procedure is available online:https://github.com/katrijnvandeun/WSPCA. Fig. 2 displays boxplots of the congruence coefficients obtained with the four different PCA analyses in the conditions with 50% of sparseness in the generated loadings, I ¼ 100, and J ¼ 1000 (this setting corresponds to the case of high-dimensional data; plots for the other values of I and J can be found online: https://github.com/katrijnvandeun/WSPCA). The leftmost panel (MULT ¼ 0) represents the conditions where there is no multiplicative noise. Within that panel, the results at the left were obtained in the condition with no noise at all (this is, also no additive noise ADD ¼ 0) while the results in the middle and right subpanel were obtained for the conditions with additive noise. In absence of any noise, all methods should perform extremely well. The non-sparse PCA methods and WSPCA indeed have almost perfect congruence (i.e., φ ¼ 1). SPCA, on the other hand, has low congruence in this noise-free condition. The reason for this is the higher sensitivity of the SPCA implementation by Witten et al. [11] to local optima: WSPCA relies on a multistart procedure including one rational start (namely initializing with the singular vectors) and ten random starts while the SPC function in the PMA R package of Witten et al. [11] relies on a single SVD-based rational start. When there is no multiplicative noise but some additive noise, WSPCA does not perform as well as the other methods as could be expected (see the top row of Fig. 1). When there is no additive noise but only multiplicative noise (the subpanels at the left in the middle panel MULT ¼ 0:01 and right panel MULT ¼ 0:05), WSPCA performs equally well as PCA and WPCA as it correctly imposes sparseness and outperforms SPCA because it uses a multistart procedure. Note that PCA and WPCA have been rotated to the true structure while WSPCA does not rely on information on the true structure. In practical data analysis situations, such information is not available: without the additional rotation, PCA and WPCA perform worse than WSPCA (see Fig. 3). When there is both additive and multiplicative noise, WSPCA clearly outperforms the other methods as it is able to downweight the observations that introduce bias. Summarizing, in the majority of the conditions, weighted sparse PCA outperforms the three other methods in recovering the true loadings. A second measure of performance that we assessed, is the deviation between the true data scores lnðμ ij Þ ¼ P r t ir p jr and the fitted data b x ij ¼ P r b t ir b p jr , expressed by the badness-of-recovery (BOR) statistic, When BOR ¼ 0 this means that the data are perfectly fitted by the model, with higher values meaning worse fit. The log transformed values of the BOR statistics plus one are summarized by the boxplots in Fig. 4;  also in this case a zero means perfect fit. The reason for displaying the log-transformed values is the occurrence of extreme outliers of the BOR statistic in (14). The boxplots confirm the results observed for the recovery of the loadings: In presence of hybrid (i.e., mixed additivemultiplicative) noise WSPCA outperforms the other methods, when there is no noise at all, all methods have (close to) perfect fit, and when there is only additive noise WSPCA does not outperform the other methods. WSPCA outperforms SPCA in the conditions with both additive and multiplicative noise because this is a situation that results in biased estimates when ordinary (penalized) least squares is used (as previously discussed in the introduction). It may seem surprising that WSPCA also outperforms WPCA -which is a more flexible model -in these conditions with hybrid noise. This is because here performance is measured as the deviation from the true scores (and not from the observed data) with worse performance of WPCA because of overfitting. Note that (not visible in the Figure) in the conditions without any noise, perfect fit is only obtained for the non-sparse methods while SPCA and WSPCA have BOR values that differ -although barely-from zero. This is due to the penalty in the optimization criterion resulting in lower values of the optimization criterion (5) for solutions that do not have perfect fit. For this optimization criterion, as discussed above in relation to the presence of local optima, WSPCA has a lower value than SPCA. Fig. 5 displays boxplots of the badness-of-fit for the four different PCA analyses in the conditions with data generated without sparseness, I ¼ 100, and J ¼ 1000 (this setting corresponds to the case of highdimensional data; plots for the other values of I and J can be found online: https://github.com/katrijnvandeun/WSPCA). Note that the sparse PCA methods were tuned to yield zeros for 50% of the loadings although the data were generated with non-sparse loading matrices. Hence, a reasonable expectation is that PCA and WPCA outperform SPCA and WSPCA. 1 Furthermore, WPCA may be expected to perform better than all other methods when there is both additive and multiplicative noise present in the data. The BOR statistics in Fig. 5 support this except for the condition with large additive noise (ADD ¼ 0:05) and medium multiplicative noise (MULT ¼ 0:01). In this condition PCA seems to perform slightly better than WPCA. Note that when there is no multiplicative noise, PCA outperforms WPCA. This confirms the observation made before (see the discussion of Fig. 1).

Description of the data
Nakaya et al. studied the response to vaccination against influenza at several levels of the cellular organization. Microarray gene expression data and antibody titers are publicly available for two seasons (see the GSE29617 and GSE29614 series in the Gene Expression Omnibus at htt p://www.ncbi.nlm.nih.gov/geo/). The titers are included as a dependent variable that we will try to predict on the basis of the component scores. Here we concentrate mainly on the data resulting from the collection of peripheral blood mononuclear cells (PBMC) three days after vaccination in 26 subjects 2 vaccinated with trivalent inactivated influenza vaccine (TIV) in the year 2008; the data from the 2007 season will be used for external validation, yet, given their very limited sample size (9 subjects only), will not be used to build the weighted sparse PCA model. Hybridization was to the Affymetrix Human Genome U133 Plus 2.0 and U133 þ PM arrays for the 2007 and 2008 seasons, respectively; these arrays share 54 675 probesets which form the variables in our analyses. Note that the data are ultra-high-dimensional (over 50 000 variables for only 26 observations).

Data preprocessing
To obtain the expression values, we used the affyPLM package [45] with default settings yielding Robust Multichip Average (RMA) values. The baseline corrected values were obtained by subtracting the RMA values at baseline from the RMA values three days after vaccination. This pre-processing procedure was applied both to the 2007 and 2008 data. Weights were obtained as follows: From the GESTr R package [46], estimates of the variance components were obtained; the true expression scores μ ij in equation (8) were estimated by the observed expression scores. In Fig. 6, a histogram of the weights is shown for the 2008 season: almost all weights lie in the range ½0:6 À 1:0. Hence, we do not expect the results of the weighted analysis to differ a lot from those of the unweighted analysis.
Plasma hemagglutination-inhibition (HAI) antibody titers were obtained as described in the original paper [24]. First, we calculated the difference in antibody titers at day 28 with the titers at baseline for each of the three influenza strains that compose TIV. The maximal difference was retained and this score was log 2 transformed.

PCA analyses
To determine the number of components we performed a weighted PCA analysis of the data and plotted the variance accounted for (VAF) by each component in Fig. 7; following the recommendations by Willett and Singer [47], this is Note that with this criterion, the VAF accounted for by the first R components is equal to the sum of the VAF by each of the r ¼ 1; …; R components. The first three components stand out compared to the remaining components. Hence, we select three components for further analysis. We also performed an unweighted analysis; this resulted in VAFr values that only differ in the fourth decimal from those obtained with the weighted PCA. Next, we determine the level of sparsity using expression (7): Taking the default values EðVÞ ¼ 1 and π thr ¼ 0:90 (see Meinshausen and Bühlmann [32]) yields q λ ¼ 209 non-zero coefficients for a single component or 627 in total. We ran two different types of SPCA analyses: an unweighted SPCA analysis using the PMA package and a weighted sparse PCA analysis using the WSPCA method that was proposed in the present paper. Both analyses were tuned such that 627 non-zero loadings over the three components were obtained. For the PMA analysis a single rational start based on the singular value decomposition was used; for the stability selection part of the WSPCA analysis, 250 subsamples were used for each level of sparsity with the level of sparsity being gradually decreased until the desired number of 627 non-zero loadings was attained. Each of the subsamples was subjected to a WSPCA with a single rational start.
The performance of the unweighted and weighted SPCA in terms of how well their component scores fit the antibody titers is summarized in Table 1. First we trained the model by obtaining loadings and component scores for the 2008 data as well as regression weights (by regressing the antibody titers for the vaccinations in 2008 on the component scores derived from the transcriptomics data for 2008). These model parameters were then used for the prediction of the 2007 antibody titers (the independent test set): Component scores were derived from the 2007 data using the loadings obtained for the 2008 data; these components scores where then used as the predictor scores in the regression model with intercept and regression weights as obtained for the 2008 data. As shown in Table 1, the squared correlation between the observed and fitted antibody titers for the training set were 0.15 and 0.20 for the unweighted and weighted solutions respectively. For the 2007 test set, the squared correlation between predicted and observed scores obtained for the unweighted and weighted SPCA analysis were 0.26 and 0.43 respectively. Hence, although the same number of components and level of sparsity was used and the same type of initial configuration, the components resulting from our weighted SPCA procedure had better predictive quality than those obtained from the unweighted SPCA analysis   performed with the PMA package. Different variables were selected by the two procedures which is likely the result of weighting. The results of the WSPCA analysis with stability selection were further checked for a biologically meaningful selection. A list was generated containing the 627 probesets with non-zero loadings. Probesets not annotated to gene symbols were eliminated and those mapping to the same gene were summarized using the median score. A hypergeometric test was used to analyse the enrichment in blood transcriptional module (BTM), using a false discovery rate (FDR) threshold of 0.001 (a high FDR threshold is suggested for the use of hypergeometric tests in module enrichments [48]). The BTM are sets of manually annotated correlated gene sets which were obtained from meta-analysis of over 500 gene expression studies obtained from human blood. These are widely used for the interpretation of human blood gene expression data as they are more tissue-adapted compared to general pathways and are generally regarded as well annotated. The enrichment analysis was performed using the tmod R package [49]. The script for the analysis is available from https://github.com/katrijnvandeun/WSPCA.
The enriched modules are presented in Fig. 8. Overall, the genes identified by WSPCA are mainly enriched in pathways of the subcompartment of the immune system responsible for the first rapid response to pathogens (innate immunity), as expected for gene expression data collected soon after vaccination and as originally reported. Mainly, the genes identified were found to be enriched in innate immune cells such as dendritic cells, monocytes and neutrophils, and in essential functions of the innate immune system such as the recognition of pathogens via pattern recognition receptors and the induction of interferon responses. Additionally, the genes were also enriched in functions associated with cell cycle and transcription, which in addition to being general processes involved in both active responses to interventions and housekeeping activities, also underlie the immunological activities referred to above. Overall, the profile presented here is representative of the early transcriptional response to vaccination.

Discussion of the results
In this section we assessed the performance of weighted sparse PCA compared to unweighted and/or nonsparse PCA under various conditions in a simulation study and for real data in an empirical application. The main factors of interest in the simulation study were presence versus absence of additive and multiplicative noise and of sparseness. When both additive and multiplicative noise are present, weighted PCA clearly outperforms unweighted PCA. When only multiplicative noise is present, the weighted and unweighted approaches perform equally well. However, when only additive noise is present, unweighted PCA outperforms weighted PCA. Hence, in this situation we advise against using a weighted sparse PCA approach. In analytical chemistry usually both additive and multiplicative noise are present but this should in general be checked, for example, by means of the maximum likelihood estimates of the additive and multiplicative error variance parameters [16]. With respect to sparseness, the results obtained in the simulation study show that it is important to impose the correct amount of sparseness: The sparse PCA methods outperformed the nonsparse methods when there was sparseness in the loadings while the nonsparse methods outperformed the sparse methods in case of no sparseness. Various model selection tools have been proposed to tune the proper amount of sparseness in sparse PCA, including the case of no sparseness at all. These include cross-validation [12], stability selection [32], and the BIC [50].

Conclusion
In this paper, we have introduced an approach to sparse PCA that allows to weight each of the elements of the data matrix individually. Such a weighted sparse approach was illustrated to be useful in terms of an improved recovery of the loadings in a simulation study and an improved prediction of antibody titers on the basis of components derived from a weighted sparse PCA analysis of early gene signatures from a study of vaccination against influenza. Overall the simulation results support the use of a weighted sparse PCA approach when the data are subject to both additive and multiplicative noise.
To account for dependence, for example between observations related in time, the weighted sparse PCA method may be extended in a similar way as presented in Ref. [51]. Another useful extension may be the development of an iteratively reweighted least squares procedure, as such a procedure does not rely on prior knowledge for the weighting. The algorithm presented here already contains the key ingredients to implement such an extension.

Author contributions
KVD developed the WSPCA procedure and performed the analyses. MC was involved in the analysis conception and design; All authors were involved in drafting the manuscript or critically revising it for important intellectual content. All authors had full access to the data and approved the manuscript before it was submitted by the corresponding author.

Declaration of competing interest
MC is an employee of the GSK group of companies and reports owning shares in GSK.