A Comparison of Sparse Partial Least Squares and Elastic Net in Wavelength Selection on NIR Spectroscopy Data

Elastic net (Enet) and sparse partial least squares (SPLS) are frequently employed for wavelength selection and model calibration in the analysis of near infrared spectroscopy data. Both methods perform variable selection and model calibration simultaneously, and both tend to select wavelength intervals rather than individual wavelengths when the predictors are multicollinear. In this paper, we compare Enet and SPLS in interval wavelength selection and model calibration for near infrared spectroscopy data. The results from both simulated and real spectroscopy data show that Enet tends to select fewer predictors as key variables than SPLS; it therefore yields a more parsimonious model that is easier to interpret. SPLS obtains a much lower mean squared prediction error (MSE) than Enet, so SPLS is more suitable when the goal is better model fitting accuracy. These conclusions still hold for strongly correlated NIR spectroscopy data whose predictors present group structures: Enet is sparser than SPLS, and the selected predictors (wavelengths) form successive segments.


Introduction
One characteristic of near infrared (NIR) spectroscopy data is that the number of predictors is much larger than the number of observations. Taking the corn data [1] as an example, there are 700 predictors but only 80 samples. Thus a central problem in building a calibration model for NIR data is how to select a set of important predictors from a large number of candidate covariates. Wavelength selection for spectroscopy is a classic topic [2], and many methods have been proposed, such as VIP [3], MWPLS [4,5], and MC-UVE [6]. A drawback of these algorithms is that model calibration and wavelength selection are separated into two steps: the calibration model is established first, and the variable selection procedure is then performed based on that model. Recently, sparse variable selection methods [7][8][9][10][11][12][13][14][15][16] have gained much attention for dealing with high-dimensional data from various fields. One advantage of sparse methods is that they perform model calibration and variable selection simultaneously. In addition, a sparse algorithm can shrink some estimated coefficients to exactly zero, so the predictors corresponding to zero-valued coefficients are eliminated from the calibration model. This is extremely useful for model interpretation. There are now many useful sparse methods for NIR spectroscopy data [17][18][19][20][21][22][23]. In this paper, we focus on two of them: elastic net (Enet) [17] and sparse partial least squares (SPLS) [18]. Both Enet and SPLS can obtain sparse coefficients by choosing appropriate parameters.
Another feature of NIR spectroscopy data is multicollinearity among the predictors. Neighboring predictors form continuous wavelength intervals and are highly correlated. This raises the question of which strategy should be adopted for model calibration and wavelength selection: selecting a single wavelength at a time, or selecting an entire interval of adjacent, strongly correlated wavelengths? On the one hand, selecting the entire variable group can give better calibration and prediction accuracy than selecting a single predictor from the group when multicollinearity or high correlation is present among the group variables [24][25][26]. On the other hand, an interval of wavelengths with strong pairwise correlations should be regarded as a natural group when the interval is associated with a particular type of chemical bonding, so the predictors in the same group should enter or leave the calibration model together. For these two reasons, sparse methods for NIR spectroscopy data should be able to select group variables (wavelength intervals), which is called the group effect in [17]. Fortunately, both Enet and SPLS can automatically group multicollinear predictors and select (or eliminate) the entire predictor group from the model simultaneously. Therefore, Enet and SPLS are two potentially powerful methods for NIR spectroscopy data. In fact, many studies [27][28][29][30][31][32][33][34][35][36][37][38] have applied Enet or SPLS to the analysis of NIR spectroscopy data. The purpose of this article is to compare their performance on NIR spectroscopy data.
The remainder of this paper is organized as follows: Section 2 offers the basic theory of Enet and SPLS. Sections 3 and 4 give the experimental results on simulation data and real data sets, respectively. In Section 5, we give the conclusion and make a brief discussion.

Theory of Enet and SPLS
2.1. Sparsity of Enet and SPLS. We consider the following linear model for variable selection and estimation:
$$\mathbf{y} = \mathbf{X}\beta + \varepsilon, \tag{1}$$
where $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the regression coefficient vector and $\varepsilon$ is usually Gaussian noise, namely, $\varepsilon \sim N(0, \sigma^2 \mathbf{I})$. $\mathbf{y} = (y_1, y_2, \ldots, y_n)^T$ is the response and $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p)$ is the predictor matrix, where $\mathbf{x}_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T$ is the $j$th ($j = 1, 2, \ldots, p$) predictor. For simplicity, we also assume that the response is centered and the predictors are standardized to have zero mean and unit length, namely,
$$\sum_{i=1}^{n} y_i = 0, \quad \sum_{i=1}^{n} x_{ij} = 0, \quad \sum_{i=1}^{n} x_{ij}^2 = 1, \quad j = 1, 2, \ldots, p. \tag{2}$$
The traditional method for obtaining the regression coefficients in the linear model (1) is ordinary least squares (OLS). The OLS solution $\hat{\beta}(\mathrm{OLS}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ is generally not sparse (the term "sparsity", as used here, refers to the linear model (1) having many zero-valued regression coefficients). OLS often overfits and has poor predictive performance when applied to highly correlated data. To date, there are many ways to deal with this issue. OLS with the $\ell_1$-norm constraint, which is called LASSO [7], may be the most important one [39], as LASSO can perform variable selection and estimation simultaneously.
Figure 1: Two-dimensional LASSO penalty (blue) and Enet penalty (black). $\hat{\beta}(\mathrm{OLS})$ is the ordinary least squares solution, and the contours reflect the estimates of $\hat{\beta}$ with equal deviation in terms of squared error loss. The Enet penalty is strictly convex, so the optimal solution is located in one corner of the Enet.
Enet [17] is an improved version of the LASSO that uses two regularization parameters and can be expressed by the following penalized least squares problem:
$$\hat{\beta}(\mathrm{Enet}) = (1 + \lambda_2) \arg\min_{\beta} \left\{ \|\mathbf{y} - \mathbf{X}\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1 \right\}, \tag{3}$$
where $\lambda_1$ and $\lambda_2$ are two nonnegative regularization parameters; $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ is the $\ell_1$-norm; and $\|\beta\|_2 = (\sum_{j=1}^{p} \beta_j^2)^{1/2}$ is the $\ell_2$-norm. If $\lambda_2 = 0$, Enet is exactly equivalent to LASSO. The scale factor "$1 + \lambda_2$" should be "$1 + \lambda_2/n$" when the predictors are not standardized to have mean zero and $\ell_2$-norm one. The Enet penalty "$\lambda_2 \|\beta\|_2^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j|$" is a combination of the $\ell_1$-norm and the $\ell_2$-norm. The $\ell_1$-norm constraint induces sparsity; namely, it can shrink small coefficients to exactly zero. The $\ell_2$-norm constraint addresses potential singularity and produces lower prediction error. The Enet constraint can be seen as a mixed norm, which is like a fish net (that is why it is called elastic net) (see Figure 1). The Enet ball has corners on the coordinate axes, where all but one parameter is exactly zero. Geometrically, the loss contours tend to touch the ball at a corner, where some of the parameters are exactly zero. So Enet shrinks some coefficients to exactly zero when the Enet constraint is active.
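As an illustration of how the two penalties interact, the following is a minimal sketch of a coordinate descent solver for the penalized criterion above, followed by the (1 + lambda_2) rescaling. It is not the solver used in this paper; the function names (`soft_threshold`, `naive_enet`, `enet`) and the toy data are assumptions made only for this example.

```python
def soft_threshold(z, t):
    """Return sgn(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def naive_enet(X, y, lam1, lam2, n_iter=200):
    """Coordinate descent for ||y - X b||^2 + lam2 * ||b||_2^2 + lam1 * ||b||_1.

    X is a list of n rows with p entries each; y is a list of n values.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta for the starting value beta = 0
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # inner product of predictor j with the residual that excludes it
            z = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * beta[j]
            new = soft_threshold(z, lam1 / 2.0) / (col_ss[j] + lam2)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def enet(X, y, lam1, lam2):
    """Rescale the naive solution by the factor (1 + lam2)."""
    return [(1.0 + lam2) * b for b in naive_enet(X, y, lam1, lam2)]


# Toy orthonormal design (hypothetical numbers, not from the paper).
X = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y = [3.0, 0.2, 0.0]
print(enet(X, y, lam1=1.0, lam2=1.0))  # the second coefficient is exactly 0.0
print(enet(X, y, lam1=1.0, lam2=0.0))  # lam2 = 0 reduces to the LASSO
```

In this orthonormal toy design the rescaled Enet solution coincides with the LASSO solution, which is one way to see the role of the scale factor; with correlated predictors the two generally differ.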
An important special case arises when the ridge parameter $\lambda_2$ becomes sufficiently large. In fact, when $\lambda_2 \to \infty$, the Enet estimate becomes
$$\hat{\beta}_j(\mathrm{Enet}) = \left( |\mathbf{y}^T \mathbf{x}_j| - \frac{\lambda_1}{2} \right)_+ \operatorname{sgn}(\mathbf{y}^T \mathbf{x}_j), \quad j = 1, 2, \ldots, p, \tag{4}$$
where $(z)_+$ and $\operatorname{sgn}(z)$ are, respectively, defined as follows:
$$(z)_+ = \begin{cases} z, & z > 0, \\ 0, & z \le 0, \end{cases} \qquad \operatorname{sgn}(z) = \begin{cases} 1, & z > 0, \\ 0, & z = 0, \\ -1, & z < 0. \end{cases} \tag{5}$$
Equation (4) is called univariate soft thresholding (UST) [40], and it shows that the Enet coefficients can be estimated by UST when $\lambda_2$ is large enough.
Partial least squares (PLS) [41][42][43] is a widely used statistical tool that reduces the dimensionality of high-dimensional data by constructing latent components. PLS finds the first $K$ components by iteration to model the relationship between the X-matrix and the y-response. Each component (score) $\mathbf{t}$ is a linear combination of the original predictors, namely, $\mathbf{t} = \mathbf{X}\mathbf{w} = w_1 \mathbf{x}_1 + w_2 \mathbf{x}_2 + \cdots + w_p \mathbf{x}_p$. Generally, no weight in the vector $\mathbf{w}$ obtained by PLS is zero; thus PLS does not automatically select the relevant predictors. Although PLS can deal with ill-posed problems and improve prediction accuracy, the resulting model remains hard to interpret. So sparse partial least squares (SPLS) [18] was proposed to obtain a sparse solution. SPLS can be seen as a generalized PLS with a built-in variable selection procedure. SPLS finds its first sparse principal component by the following optimization problem:
$$\min_{\mathbf{w}, \mathbf{c}} \left\{ -\kappa\, \mathbf{w}^T \mathbf{M} \mathbf{w} + (1 - \kappa)(\mathbf{c} - \mathbf{w})^T \mathbf{M} (\mathbf{c} - \mathbf{w}) + \lambda_1 \|\mathbf{c}\|_1 + \lambda_2 \|\mathbf{c}\|_2^2 \right\} \quad \text{s.t. } \mathbf{w}^T \mathbf{w} = 1, \tag{6}$$
where $\mathbf{M} = \mathbf{X}^T \mathbf{y} \mathbf{y}^T \mathbf{X}$, $\mathbf{w}$ and $\mathbf{c}$ are direction vectors that are kept close to each other, $0 < \kappa \le 0.5$, $\lambda_1 \ge 0$, and $\lambda_2 \ge 0$. Equation (6) induces sparsity by imposing the Enet penalty. It should be pointed out that the penalty acts on the surrogate direction vector $\mathbf{c}$ instead of the original direction vector $\mathbf{w}$, and that $\mathbf{w}$ and $\mathbf{c}$ are calculated by an alternating iterative algorithm in which solving an Enet problem is a crucial step. For a univariate response $\mathbf{y}$, $\hat{\mathbf{w}} = \mathbf{X}^T \mathbf{y} / \|\mathbf{X}^T \mathbf{y}\|$ is the direction vector of PLS, and $\hat{c}_j = (|\hat{w}_j| - \lambda_1/2)_+ \operatorname{sgn}(\hat{w}_j)$ ($j = 1, 2, \ldots, p$) for sufficiently large $\lambda_2$.
SPLS is also an iterative algorithm: it finds the first direction vector, then the second, and so on until $K$ weight vectors have been obtained.
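The univariate soft thresholding rule and its analogue for the first SPLS direction vector can be sketched in a few lines. This is an illustrative sketch of the large-ridge-parameter limit described above, not the full alternating SPLS algorithm; the helper names and the toy data are assumptions introduced here.

```python
import math


def soft_threshold(z, t):
    """Return sgn(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def xty(X, y):
    """Compute X'y for X given as a list of n rows with p entries each."""
    n, p = len(X), len(X[0])
    return [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]


def ust_enet(X, y, lam1):
    """UST estimate: (|y'x_j| - lam1/2)_+ * sgn(y'x_j) for each predictor j."""
    return [soft_threshold(z, lam1 / 2.0) for z in xty(X, y)]


def pls_direction(X, y):
    """First PLS direction vector X'y / ||X'y|| for a univariate response."""
    z = xty(X, y)
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]


def ust_spls_direction(X, y, lam1):
    """Soft-thresholded surrogate direction c for the first SPLS component."""
    return [soft_threshold(w, lam1 / 2.0) for w in pls_direction(X, y)]


# Toy data (hypothetical numbers): the identity design gives X'y = y.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [2.0, -2.0, 1.0]
print(ust_enet(X, y, lam1=2.0))            # weakest predictor is set to 0.0
print(ust_spls_direction(X, y, lam1=1.0))  # same pattern on the unit-norm scale
```

Both rules zero out the predictor with the weakest inner product with the response and shrink the others toward zero by the same amount.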

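2.2. Group Variables (Wavelength Intervals) Selection of Enet and SPLS.
Owing to the strict convexity of the Enet penalty, suppose that
$$\hat{\beta}_i(\lambda_1, \lambda_2)\, \hat{\beta}_j(\lambda_1, \lambda_2) > 0; \tag{7}$$
then
$$\frac{1}{\|\mathbf{y}\|_1} \left| \hat{\beta}_i(\lambda_1, \lambda_2) - \hat{\beta}_j(\lambda_1, \lambda_2) \right| \le \frac{1}{\lambda_2} \sqrt{2(1 - \rho_{ij})}, \tag{8}$$
where $\rho_{ij} = \mathbf{x}_i^T \mathbf{x}_j$ is the sample correlation coefficient of the predictors $\mathbf{x}_i$ and $\mathbf{x}_j$. Equation (8) gives an upper bound on the absolute difference of the regression coefficients and indicates that Enet enables group variable (wavelength interval) selection. Namely, if two predictors are strongly correlated ($\rho_{ij} \to 1$), the corresponding regression coefficients are almost identical. So strongly correlated predictors (wavelength intervals) enter or leave the model together, in the form of groups or intervals.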
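To make the grouping bound concrete, the short sketch below evaluates the right-hand side sqrt(2(1 - rho)) / lambda_2 for a few correlation values. The function name `grouping_bound` and the chosen values are assumptions for illustration only; the bound is the one given in [17] for the unscaled Enet coefficients.

```python
import math


def grouping_bound(rho, lam2):
    """Upper bound sqrt(2 * (1 - rho)) / lam2 on |b_i - b_j| / ||y||_1."""
    if lam2 <= 0:
        raise ValueError("lam2 must be positive")
    return math.sqrt(2.0 * (1.0 - rho)) / lam2


# The bound shrinks to zero as the correlation approaches 1.
for rho in (0.5, 0.9, 0.99, 1.0):
    print(rho, round(grouping_bound(rho, lam2=1.0), 4))
```

A larger ridge parameter tightens the bound for every correlation level, which is consistent with choosing a large ridge parameter when the group effect is desired.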
PLS is often computed by the NIPALS [44] or SIMPLS [42] algorithm; in this study we employ only NIPALS to obtain the SPLS solution. SPLS-NIPALS can select more than one predictor at each step, and because the response $\mathbf{y}$ is deflated, the eigenvector $\mathbf{X}^T \mathbf{y} / \|\mathbf{X}^T \mathbf{y}\|$ is proportional to the current correlations between the predictors and the response. This means that, if there is a group of highly correlated predictors, SPLS can select (or eliminate) the whole group simultaneously.
2.3. Tuning the Parameters in Enet and SPLS. Two regularization parameters ($\lambda_1$, $\lambda_2$) are used in Enet. The sparsity parameter $\lambda_1$ can be replaced by the fraction ($s$) of the $\ell_1$-norm, since $s$ is bounded and ranges from 0 to 1. In practice, the range of $s$ can be divided equally into 100 values, and the ridge parameter $\lambda_2$ can be set to a large number in consideration of the group effect and UST.
There are four parameters in total ($\kappa$, $\lambda_1$, $\lambda_2$, $K$) in SPLS. A small $\kappa$ (e.g., $\kappa = 0.5$) is used to avoid local optima in the iteration. The ridge parameter $\lambda_2$ should be set sufficiently large to obtain a UST solution that depends only on the LASSO penalty parameter $\lambda_1$. Thus only the sparsity parameter $\lambda_1$ and the number of principal components $K$ need to be tuned in practice. In addition, the parameter $\lambda_1$ can be replaced by $\eta$ if the soft-thresholded direction vector is set to be
$$\hat{c}_j = \left( |\hat{w}_j| - \eta \max_{1 \le i \le p} |\hat{w}_i| \right)_+ \operatorname{sgn}(\hat{w}_j), \quad j = 1, 2, \ldots, p, \tag{9}$$
where $0 \le \eta \le 1$. Compared with $\lambda_1$, the advantage of using $\eta$ is that it is confined to [0, 1], so its range can be divided equally into 100 values in practice. $K$ should not be too large; for example, it can be set from 1 to 15. Thus we use 100 × 15 = 1500 grid points to search for the optimal combination of model parameters.
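The thresholding rule with a parameter confined to [0, 1] and the resulting 100 × 15 search grid can be sketched as follows. The name `eta_threshold`, the variable names, and the exact spacing of the 100 values (0.00 to 0.99 here) are assumptions for illustration; the source only states that the unit interval is divided equally.

```python
def eta_threshold(w, eta):
    """Return c with c_j = (|w_j| - eta * max_i |w_i|)_+ * sgn(w_j)."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    cut = eta * max(abs(v) for v in w)
    c = []
    for v in w:
        mag = abs(v) - cut
        if mag <= 0.0:
            c.append(0.0)
        else:
            c.append(mag if v > 0 else -mag)
    return c


# 100 thresholding values times 15 candidate numbers of components.
etas = [i / 100.0 for i in range(100)]  # assumed spacing: 0.00, 0.01, ..., 0.99
grid = [(eta, K) for eta in etas for K in range(1, 16)]

print(len(grid))
print(eta_threshold([0.6, -0.8, 0.1], eta=0.5))
```

Because the cut-off is relative to the largest weight, the same grid can be reused across data sets with different scales, which is the practical advantage over tuning the LASSO penalty directly.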
The criterion used for tuning the parameters is the mean squared prediction error of tenfold cross-validation, which is defined as follows:
$$\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i^{-v} \right)^2, \tag{10}$$
where $y_i$ is the measured value of the $i$th ($i = 1, 2, \ldots, n$) sample and $\hat{y}_i^{-v}$ is the predicted value obtained by leaving out the $v$th fold, which contains sample $i$.
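A minimal sketch of the tenfold cross-validated mean squared prediction error is given below. The `fit` and `predict` arguments stand for any calibration method (Enet or SPLS in this paper); the mean-only model used in the usage lines is a placeholder assumption, as are all function names.

```python
import random


def kfold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cv_mspe(X, y, fit, predict, k=10, seed=0):
    """Average squared error of predictions made with each fold held out."""
    n = len(y)
    sq_err = 0.0
    for fold in kfold_indices(n, k, seed):
        held = set(fold)
        train = [i for i in range(n) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            sq_err += (y[i] - predict(model, X[i])) ** 2
    return sq_err / n


# Placeholder model (an assumption): predict the training mean of y.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x_row):
    return model


X = [[float(i)] for i in range(20)]
y = [2.0] * 20
print(cv_mspe(X, y, fit_mean, predict_mean, k=10))  # constant response
```

In a tuning loop, this quantity would be evaluated at every grid point and the parameter combination with the smallest value retained.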

Simulation Study
The purpose of this section is to compare Enet and SPLS from several aspects when the true model is known.
3.1. Example 1: Study on the Cases of n > p and n < p. In this example, overdetermined ($n > p$) and underdetermined ($n < p$) data sets are simulated to investigate the real-world cases in spectral analysis. We simulate a sparse model with varying numbers of observations and predictors and varying sample correlations. The simulation data are generated via the linear model (1) with $\sigma = \sqrt{8}$. The $n \times p$ design matrix $\mathbf{X}$ is drawn from a multivariate normal distribution $N(0, \Sigma)$ whose covariance matrix $\Sigma$ has entries $\Sigma_{ij} = \rho^{|i-j|}$ ($i, j = 1, 2, \ldots, p$). This covariance structure is chosen to mimic NIR spectroscopy data, as it makes neighboring predictors more strongly correlated (see Figure 2). We consider $\rho = 0.5$, 0.7, and 0.9 and six combinations of the remaining settings, which include $n$ and $p$. Thus 18 combinations in total are discussed, where the first 9 cases are overdetermined and the last 9 cases are underdetermined. The model calibration accuracy is measured by the relative prediction error (RPE), which is defined in terms of $\hat{\beta}_j$, the estimate of $\beta_j$ ($j = 1, 2, \ldots, p$), and the results are listed in Table 1. We can easily see that SPLS outperforms Enet in terms of RPE and "C" in almost all the cases, where "C" is the number of predictors that are correctly selected into the model, but SPLS tends to select many more uninformative predictors (denoted by "IC" in Table 1) than Enet. Both Enet and SPLS select almost all of the correct predictors contained in the true model, so the two methods perform similarly in this respect. "C + IC" is the total number of predictors selected into the model, and we can see that Enet tends to select a smaller set of key variables than SPLS. As the correlation among predictors increases, the number of selected predictors and the estimation accuracy of both methods change only slightly.
In sum, Enet tends to select fewer predictors as key variables than SPLS; it therefore yields a more parsimonious model that is easier to interpret. SPLS obtains a much smaller calibration error than Enet, so SPLS is more suitable when the goal is better model fitting accuracy.
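The correlated design used in this example, with covariance entries equal to the correlation parameter raised to the power |i - j|, can be simulated without any matrix factorization by a first-order autoregressive recursion across predictors. The sketch below is one way to do this under that assumption; the function names, seed, and sizes are illustrative and are not the settings of Table 1.

```python
import math
import random


def ar1_cov(p, rho):
    """Covariance matrix with entries rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def ar1_row(p, rho, rng):
    """One draw from N(0, Sigma), Sigma_ij = rho ** |i - j|, via AR(1)."""
    row = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho * rho)  # keeps every predictor at unit variance
    for _ in range(p - 1):
        row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
    return row


def simulate_design(n, p, rho, seed=0):
    """n independent rows of the correlated design."""
    rng = random.Random(seed)
    return [ar1_row(p, rho, rng) for _ in range(n)]


def sample_corr(a, b):
    """Pearson correlation of two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


X = simulate_design(n=2000, p=5, rho=0.9, seed=1)
first = [row[0] for row in X]
second = [row[1] for row in X]
print(ar1_cov(3, 0.5))
print(round(sample_corr(first, second), 2))  # close to rho for large n
```

Neighboring columns are the most strongly correlated and the correlation decays geometrically with distance, which mimics adjacent wavelengths in a spectrum.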

3.2. Example 2: Comparison of the Two Methods in Handling Multicollinearity. In NIR spectroscopy analysis, it is preferable to select wavelength intervals rather than individual wavelength points [25]. In this section, we simulate a sparse model to evaluate the group variable selection of Enet and SPLS. We first generate three independent latent variables $\mathbf{k}_i \sim N(0, 5^2)$ ($i = 1, 2, 3$), then let the sample size be $n = 240$ and the number of predictors be $p = 30$. The response and the 30 predictors are generated from these latent variables, with independent noise terms $\varepsilon_j \sim N(0, \mathbf{I}_{240})$ ($j = 1, 2, \ldots, 30$). We can easily see that predictors 1 to 6, 7 to 13, and 14 to 30 constitute three variable groups, and the predictors in the same group are multicollinear. The first two groups are associated with the response, and the third group is mixed into the model as noise. In this simulation, 100 data sets are generated, and for each data set, the 240 samples are divided into training, validation, and test sets of sizes 120, 60, and 60, respectively. The training set is for building the model, the validation set is for tuning the model parameters when doing cross-validation, and the test set is for testing the performance of the model. Both Enet and SPLS are applied to these 100 data sets, and the corresponding results are shown in Table 2 and Figure 3. To sum up, both Enet and SPLS perform well on strongly correlated data in which the predictors present group structure, which coincides with the theoretical analysis of the two methods. Table 2 shows that SPLS performs better than Enet in terms of MSE (see (14)). Figure 3 shows that the estimated coefficients of predictors from the same group are more consistent for Enet than for SPLS. In addition, Enet is more likely to eliminate the uninformative variable groups.
We can see that the predictors in the true model (the 1st to 13th predictors) are selected by both Enet and SPLS, but SPLS also selects some uninformative predictors (among the 14th to 30th predictors), whereas Enet selects almost none. So Enet is still the winner in terms of variable selection and model interpretation when handling multicollinearity.
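A data set with the same group layout (predictors 1 to 6, 7 to 13, and 14 to 30) and the 120/60/60 split can be sketched as below. The loadings, the response weights in `coef`, the unit noise level on the response, and the reading of the latent standard deviation as 5 are all assumptions made for illustration; they are not the generating equations used in this paper.

```python
import math
import random

# Column index ranges (0-based) for predictors 1-6, 7-13 and 14-30.
GROUPS = [range(0, 6), range(6, 13), range(13, 30)]


def simulate_groups(n=240, seed=0, coef=(3.0, -1.5), latent_sd=5.0):
    """Draw (X, y) with three predictor groups driven by three latent variables.

    coef holds hypothetical weights of the first two latent variables in y;
    the third group never enters y and therefore acts as noise.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        z = [rng.gauss(0.0, latent_sd) for _ in GROUPS]
        row = [0.0] * 30
        for k, cols in enumerate(GROUPS):
            for j in cols:
                row[j] = z[k] + rng.gauss(0.0, 1.0)  # latent signal plus noise
        X.append(row)
        y.append(coef[0] * z[0] + coef[1] * z[1] + rng.gauss(0.0, 1.0))
    return X, y


def split_indices(n, sizes=(120, 60, 60), seed=0):
    """Random training / validation / test index sets of the given sizes."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    a, b = sizes[0], sizes[0] + sizes[1]
    return idx[:a], idx[a:b], idx[b:b + sizes[2]]


def sample_corr(a, b):
    """Pearson correlation of two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


X, y = simulate_groups()
train, valid, test = split_indices(len(y))
col = lambda j: [row[j] for row in X]
print(round(sample_corr(col(0), col(5)), 2))  # same group: strongly correlated
print(round(sample_corr(col(0), col(6)), 2))  # different groups: near zero
```

Under these assumed settings, predictors within a group share one latent variable and are therefore nearly collinear, while predictors from different groups are almost uncorrelated.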

Real Data Sets
The mean squared error (MSE) is used to measure prediction accuracy in the real data analyses. MSE is defined as follows:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \tag{14}$$
where $\hat{y}_i$ is the estimate of $y_i$ ($i = 1, 2, \ldots, n$) and $n$ is the sample size of the data set. In this study, each real data set is divided into a training set and a testing set, and the training MSE (Train MSE) and testing MSE (Test MSE) are reported based on 100 replications.
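The evaluation protocol described here, namely repeated random training/testing splits with the MSE averaged over replications, can be sketched as follows. The `fit` and `predict` callables, the mean-only placeholder model, and all function names are assumptions for illustration.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between measured and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def random_split(n, n_train, rng):
    """Random partition of 0..n-1 into n_train training and n - n_train testing indices."""
    idx = list(range(n))
    rng.shuffle(idx)
    return idx[:n_train], idx[n_train:]


def replicate(X, y, fit, predict, n_train, n_rep=100, seed=0):
    """Average Train MSE and Test MSE over n_rep random splits."""
    rng = random.Random(seed)
    train_total, test_total = 0.0, 0.0
    for _ in range(n_rep):
        tr, te = random_split(len(y), n_train, rng)
        model = fit([X[i] for i in tr], [y[i] for i in tr])
        train_total += mse([y[i] for i in tr], [predict(model, X[i]) for i in tr])
        test_total += mse([y[i] for i in te], [predict(model, X[i]) for i in te])
    return train_total / n_rep, test_total / n_rep


# Placeholder model (an assumption): predict the training mean of y.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x_row):
    return model


print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Replacing the placeholder with an Enet or SPLS fit at its tuned parameters would give the Train MSE and Test MSE columns of the kind reported in the tables.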
4.1. Corn Data Set. The first data set is taken from [1]; it consists of 80 samples of corn measured on three different NIR spectrometers. The wavelength range is 1100-2498 nm at 2 nm intervals, giving 700 predictors (or variables). The spectra are measured by three instruments called "m5", "mp5", and "mp6", which yields three predictor matrices called "m5spec", "mp5spec", and "mp6spec", respectively. The predictors of the three matrices are generally strongly correlated (see Figure 4). Taking "m5spec" as an example, 93.4% of the predictors have correlation coefficients above 0.92, and 49.4% have correlation coefficients above 0.99. The moisture, oil, protein, and starch values for each of the samples are also included as response variables and stored in the response matrix "propvals". In this study, we combine the three predictor matrices with the four responses to compare the performance of Enet with that of SPLS.
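The predictor count follows directly from the wavelength grid, and the share of strongly correlated predictors can be computed from the correlation matrix. The sketch below checks the first and shows one possible way to compute the second; interpreting the percentages as shares of distinct predictor pairs is an assumption, and the toy matrix is hypothetical rather than the corn spectra.

```python
import math


def wavelength_grid(start_nm, stop_nm, step_nm):
    """Inclusive list of wavelengths from start_nm to stop_nm."""
    return list(range(start_nm, stop_nm + 1, step_nm))


def corr(a, b):
    """Pearson correlation of two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def fraction_above(X, threshold):
    """Share of distinct predictor pairs whose correlation exceeds threshold."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    pairs = [(a, b) for a in range(p) for b in range(a + 1, p)]
    hits = sum(1 for a, b in pairs if corr(cols[a], cols[b]) > threshold)
    return hits / len(pairs)


print(len(wavelength_grid(1100, 2498, 2)))  # number of corn predictors

# Toy matrix: columns 1 and 2 are proportional, column 3 is unrelated.
toy = [[1.0, 2.0, 1.0], [2.0, 4.0, -1.0], [3.0, 6.0, -1.0], [4.0, 8.0, 1.0]]
print(fraction_above(toy, 0.92))
```

Applied to a real spectral matrix with 700 columns, the pairwise loop visits 244,650 pairs, which is still feasible in plain Python but would normally be done with a numerical library.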
For each combination, the 80 samples are divided into a training set and a testing set of sizes 50 and 30, respectively. The training set is used to establish the model, and the testing set is used to test the model performance. Train MSE, Test MSE, and the number of key predictors selected into the model (Num of selected) are reported based on 100 replications. The results are shown in Table 3 and Figures 5 and 6. Table 3 and Figure 5 show that SPLS obtains better calibration accuracy than Enet, but Enet establishes a sparser model that is easier to interpret. These results coincide with those obtained from the simulated data. The testing MSE is close to the training MSE in all situations for both Enet and SPLS, which illustrates that the two methods are suitable for NIR spectroscopy data. The two methods obtain consistent results on the three predictor matrices, with only slight differences, so Enet and SPLS are not sensitive to noise in the data. In addition, SPLS obtains a smaller fitting error, whereas Enet selects far fewer predictors as key variables. So Enet is more suitable when the focus is on model interpretability, and SPLS should be employed when the focus is on model calibration accuracy. Figure 6 shows that the coefficient paths obtained by the two methods are segmentally zero-valued or nonzero-valued. This means that successive wavelength intervals are selected into or eliminated from the model. Both Enet and SPLS exhibit the group effect on NIR spectroscopy data, in which the predictors from a neighboring wavelength interval are strongly correlated and can be seen as a group. However, Enet selects fewer variable groups than SPLS, so the group effect is more pronounced for Enet than for SPLS on NIR spectroscopy data.
4.2. Gasoline Data Set. The second data set, taken from [49], is another NIR spectral data set containing the NIR spectra and octane numbers of 60 gasoline samples. The NIR spectra were measured using diffuse reflectance as log(1/R) from 900 nm to 1700 nm in 2 nm intervals, giving 401 wavelengths (predictors) (see Figure 7). The 60 samples are divided into a training set and a testing set of sizes 38 and 22, respectively. As for the corn data set, three indices are reported in Table 4 based on 100 replications. Clearly, SPLS has much better estimation accuracy, while Enet selects far fewer predictors as key variables. Figure 8 shows the regression coefficient paths over 100 replications with randomly chosen training and testing sets, and it tells that Enet just almost ...
Figure 5: The comparison of Enet and SPLS on the corn data set. The three measures "trainMSE", "testMSE", and "Num of selected" are scaled to unit one. The results of Enet and SPLS are marked by the numbers "1" and "2", respectively. The results for the three predictor matrices "m5spec", "mp5spec", and "mp6spec", combined with the four responses, are shown by deepskyblue, orange, and grey bars, respectively.
4.3. Buckwheat Data Set. The corn and gasoline data sets above are two public NIR spectroscopy data sets. The third NIR spectroscopy data set, called "bwX", is from our lab and consists of 40 observations of buckwheat measured by a FieldSpec 3 spectrometer. The NIR spectroscopy wavelength range is 780-2500 nm at 2 nm intervals; thus it contains 861 predictors. The NIR spectra were measured using diffuse reflectance as log(1/R) (see Figure 9). Starch in buckwheat is measured as the response in this study (called "bwy"). Starch is a vital nutrient in buckwheat, and fast detection of starch is very important in practice. The 40 samples are divided into a training set and a testing set of sizes 30 and 10, respectively. 100 replications are performed on the buckwheat data set, and the results are reported in Table 5 and Figure 10. Similar to the results for the gasoline data set, Table 5 and Figure 10 show that SPLS obtains a much lower prediction error, while Enet tends to select fewer wavelength intervals or predictors as important variables.

Conclusion and Discussion
Enet and SPLS are two popular model calibration and selection methods for NIR spectroscopy data. In NIR data, the number of predictors is much larger than the sample size, and neighboring predictors form continuous, multicollinear wavelength intervals. The two methods can not only select more predictors than the sample size but also exhibit the group effect. In other words, Enet and SPLS can automatically group multicollinear predictors and select or eliminate the entire predictor group from the model simultaneously for "large p and small n" data. So the two methods are well suited to NIR spectroscopy data. The purpose of this article is to give advice on which method should be used when dealing with NIR data in practice. The results from both simulated and real spectroscopy data show that Enet tends to select fewer predictors as key variables than SPLS; it therefore yields a sparser, more parsimonious model that is easier to interpret. SPLS obtains a much smaller model calibration error than Enet, so SPLS is more suitable when the goal is better fitting accuracy. More importantly, these conclusions still hold for strongly correlated data whose predictors present group structures. In addition, the two methods obtain consistent results when the predictor matrices differ slightly, so they are not sensitive to noise in the data.
Figure 6: The coefficient paths of predictor matrix "m5spec" with four responses from the corn data set. The left and right four panels are generated by Enet and SPLS, respectively. All the panels show that the coefficient paths are segmentally zero-valued or nonzero-valued, so the two methods select successive wavelength intervals as key variables.
As mentioned above, SPLS tends to select a large number of predictors on high-dimensional NIR spectroscopy data. Although the SPLS reference [18] states that (6) is proposed to obtain a sufficiently sparse solution, the solution is not so sparse in practice, especially compared with Enet. In this situation, one can use two or more steps to further reduce the number of predictors. In other words, one can first employ SPLS to roughly select the predictors and then use another sparse method, such as Enet, to refine the remaining candidate predictors.
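The two-step idea can be sketched as a screening stage followed by a refinement stage. In the simplified sketch below, screening keeps the predictors whose weight in the first PLS direction survives a relative threshold, and refinement applies univariate soft thresholding to the retained predictors only. Both stages are simplifications chosen for brevity (a single direction instead of a full SPLS fit, and the large-ridge-parameter limit instead of a full Enet fit); the function names and the toy data are assumptions.

```python
def soft_threshold(z, t):
    """Return sgn(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def inner_products(X, y):
    """Compute X'y for X given as a list of n rows with p entries each."""
    n, p = len(X), len(X[0])
    return [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]


def screen_by_direction(X, y, eta):
    """Stage 1: keep predictors with |x_j'y| above eta times the largest one.

    Thresholding X'y relative to its maximum selects the same predictors as
    thresholding the unit-norm direction vector relative to its maximum.
    """
    z = inner_products(X, y)
    cut = eta * max(abs(v) for v in z)
    return [j for j in range(len(z)) if abs(z[j]) > cut]


def refine_by_ust(X, y, kept, lam1):
    """Stage 2: univariate soft thresholding restricted to the kept columns."""
    z = inner_products(X, y)
    coef = {}
    for j in kept:
        b = soft_threshold(z[j], lam1 / 2.0)
        if b != 0.0:
            coef[j] = b
    return coef


# Toy data (hypothetical numbers): the identity design gives X'y = y.
X = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
y = [4.0, -3.0, 1.0, 0.2]
kept = screen_by_direction(X, y, eta=0.2)
print(kept)
print(refine_by_ust(X, y, kept, lam1=3.0))
```

The first stage removes the weakest predictor and the second stage removes one more, so the final model is sparser than the screened set, which is the purpose of the refinement step.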

Data Availability
The three real data sets used in this study, as well as the corresponding instructions, are available in the electronic supplementary material. The corn [1] and gasoline [49] data sets are two public spectroscopy data sets, and the buckwheat data set is from our lab and can be used freely.

Conflicts of Interest
The authors declare that they have no conflicts of interest.