Local Strategy Combined with a Wavelength Selection Method for Multivariate Calibration

One of the essential factors influencing the prediction accuracy of multivariate calibration models is the quality of the calibration data. A local regression strategy, together with a wavelength selection approach, is proposed to build multivariate calibration models based on partial least squares regression. The local algorithm is applied to create a calibration set of spectra similar to the spectrum of an unknown sample; the synthetic degree of grey relation coefficient is used to evaluate the similarity. A wavelength selection method based on simple-to-use interactive self-modeling mixture analysis minimizes the influence of noisy variables, and the most informative variables of the most similar samples are selected to build the multivariate calibration model based on partial least squares regression. To validate the performance of the proposed method, ultraviolet-visible absorbance spectra of mixed solutions of food coloring analytes in a concentration range of 20–200 µg/mL were measured. Experimental results show that the proposed method not only enhances the prediction accuracy of the calibration model, but also greatly reduces its complexity.


Introduction
Multivariate full-spectrum data analysis has recently become an active topic in analytical chemistry. One of the goals of multivariate spectral analysis is to construct a calibration model that relates spectral data to the chemical or physical properties of an analytical sample [1]. In complex samples, it is difficult to discriminate overlapping peaks [2]. Therefore, multivariate calibration methods, such as principal components regression (PCR) [3] and partial least squares regression (PLSR) [4], have been extensively used in multivariate spectral analysis. In particular, PLSR has proven to be a very powerful multivariate statistical tool for quantitative analysis because of its ability to handle problems such as collinearity and band overlaps in the spectral data [5]. It has been shown that PLSR with global samples can yield precise prediction models.
With the development of full-spectrum regression methods, the choice of the most appropriate calibration data is crucial for obtaining calibration models that perform well in predicting new samples [6]. According to the direction of selection, calibration set selection methods fall into two categories: sample selection and variable selection (the latter also called wavelength selection).
In the local strategy, the samples (calibration subset) that are spectrally most similar to the one to be predicted are selected from a database (calibration set), and the calibration model is then built from this subset.

A total of 97 working solutions containing various ratios of amaranth, carmine, tartrazine, and sunset yellow FCF were prepared by appropriate dilution with distilled water (concentrations of 0-200 µg/mL). The spectra were collected over wavelengths of 407-605 nm at 1-nm intervals using a cuvette with a path length of 1 cm and referenced to an air background, resulting in 198 variables. The wavelength range 407-605 nm was used throughout this paper because the absorbance of all food coloring samples is absent above 605 nm and the absorbance of the amaranth and carmine samples converges below 407 nm. Each recorded spectrum was the average of 10 successive scans. The 97 samples were then randomly divided into a calibration set (70 samples) and a prediction set (27 samples). It should be noted that, to obtain stable and reliable prediction results, the calibration set must be uniformly distributed throughout the sample space.

Data Pre-Processing
In practical multivariate analysis, proper pretreatment of the spectral data is necessary. In this paper, Savitzky-Golay (SG) smoothing was applied to remove noise and distortion from the original spectra. The parameters were set as follows: degree of polynomial p = 2, number of smoothing points 2l + 1 = 21. The original and smoothed absorption spectra of the 97 food coloring samples are shown in Figure 1a,b. In the original spectra (Figure 1a), distortion can often be observed, especially at high concentrations near the maximum absorption positions of the samples, which leads to nonlinearity in the spectra. Figure 1b displays the spectra after SG smoothing. Comparison with Figure 1a shows that the smoothed spectra maintain the important features of the original spectra, such as the maximum absorption positions and the overall shape. Although SG smoothing produces a better estimate of the spectral data, the spectra clearly overlap, and the datasets still include nonlinearity and irrelevant variables. Figure 1c shows the absorption spectra of 60 µg/mL aqueous solutions of the single components amaranth, carmine, tartrazine, and sunset yellow FCF. As can be seen, the spectra of amaranth and carmine overlap, and bands in the sunset yellow FCF spectrum overlap with the absorbing regions of the other analytes. Thus, straightforward UV-VIS absorbance measurements cannot distinguish these compounds, and multivariate calibration is a suitable choice for overcoming this problem.
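As an illustration of the smoothing step, the following is a minimal NumPy sketch of Savitzky-Golay smoothing with the parameters stated above (polynomial degree 2, 21 points). The function name `savgol_smooth`, the edge-padding choice, and the one-dimensional interface are assumptions made for illustration and are not taken from the original work; library routines such as `scipy.signal.savgol_filter` provide the same filter with other edge treatments.

```python
import numpy as np

def savgol_smooth(y, half_width=10, degree=2):
    """Savitzky-Golay smoothing of a 1-D signal (hypothetical helper).

    The window has 2 * half_width + 1 points (21 by default) and a local
    polynomial of the given degree is fitted by least squares.
    """
    y = np.asarray(y, dtype=float)
    k = np.arange(-half_width, half_width + 1, dtype=float)
    # Design matrix with columns 1, k, k^2, ... for the local polynomial fit.
    A = np.vander(k, degree + 1, increasing=True)
    # Row 0 of the pseudo-inverse gives the weights that return the fitted
    # value at the window centre (k = 0).
    weights = np.linalg.pinv(A)[0]
    # Assumption: repeat the end values so the output keeps the input length.
    padded = np.pad(y, half_width, mode="edge")
    # Reversing the kernel turns the convolution into a sliding weighted sum.
    return np.convolve(padded, weights[::-1], mode="valid")
```

For a matrix of spectra (one spectrum per row), the same function could be applied row by row, for example with `np.apply_along_axis(savgol_smooth, 1, spectra)`.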

Local Strategy
The local strategy is based on the selection of a calibration subset from a spectral database for each unknown sample. This method is especially suitable for spectra that carry grouping information according to different compositions. Each unknown sample requires the development of a specific model with a new subset of spectrally similar samples. The selection of the calibration subset is a critical step that considerably affects the precision and accuracy of the subsequent calibration. The similarity between each predicted sample and the samples in the calibration set is computed using the synthetic degree of grey relation coefficient (S-GRC), and the calibration subset is composed of the samples with the highest S-GRC values. This calculation is described in detail below. To achieve the best prediction performance with the local strategy, the number of samples in the calibration subset for each prediction sample needs to be evaluated. In this study, leave-one-out (LOO) cross-validation is applied, and the root mean square error of cross-validation (RMSECV) is calculated to determine the number of samples in the calibration subset. The subset size giving the lowest RMSECV is taken as optimal.
Grey system theory [18] is a useful mathematical tool for analyzing systems when only a limited amount of information is available. It has been widely applied in various fields [19,20]. Grey relation analysis (GRA) is a tool of grey system theory used for determining whether sequences are closely related [21,22]. Here we propose the S-GRC to evaluate the similarity between the absorption spectra of samples more fully by analyzing both the absolute deviation and the change rates of the sequences.
The S-GRC between the reference sequence $X_i$ and the sequence $X_j$, denoted $\rho(X_i, X_j)$, is computed from the absolute degree of GRC $\varepsilon_{ij}$ and the relative degree of GRC $\gamma_{ij}$, where the sequences $X_i$ and $X_j$ are $n \times 1$ vectors from the prediction set and the calibration set, respectively, $n$ is the number of wavelengths, $i = 1, 2, \cdots, m_p$, $j = 1, 2, \cdots, m_c$, $m_p$ and $m_c$ are the numbers of samples in the prediction set and the calibration set, respectively, and $\theta \in [0, 1]$ is the weight of the change rates. In this paper, the relative degree of GRC focuses on the geometrical difference between spectral sequences, so the effect of bias between different spectra can be eliminated. It is therefore better than the absolute degree of GRC at discriminating overlapping spectra, and the weight value $\theta$ is set to 0.2. The absolute degree $\varepsilon_{ij}$ represents the difference between sequences in absolute deviation and is calculated from $|s_i|$, $|s_j|$, and $|s_i - s_j|$, which are in turn computed from the elements $x_i(k) \in X_i$ and $x_j(k) \in X_j$. The relative degree $\gamma_{ij}$ describes the difference in geometry between sequences and is calculated from the pointwise coefficients $\gamma(x_j(k), x_i(k))$, in which $\xi \in [0, 1]$ is the distinguishing coefficient. According to [22], the value is generally set at $\xi = 0.5$. Here, $\min_{l=1,\cdots,m_c} |x_l(k) - x_i(k)|$ and $\max_{l=1,\cdots,m_c} |x_l(k) - x_i(k)|$ are, respectively, the minimum and maximum values of the $1 \times m_c$ deviation vector, where $l$ runs from 1 to $m_c$. The symbols used in the calculation of S-GRC are summarized in Table 1.
Table 1. Summary of symbols used in the calculation of S-GRC.

Symbol              Representation
$\theta = 0.2$      weight of the rates of change
$\xi = 0.5$         distinguishing coefficient
$\rho(X_i, X_j)$    synthetic degree of GRC between sequences $X_i$ and $X_j$
$\varepsilon_{ij}$  absolute degree of GRC between sequences $X_i$ and $X_j$
$\gamma_{ij}$       relative degree of GRC between sequences $X_i$ and $X_j$
$n$                 number of wavelengths
$m_p$               number of samples in the prediction set
$m_c$               number of samples in the calibration set
S-GRC: synthetic degree of grey relation coefficient.
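To make the similarity ranking more concrete, the sketch below computes a grey relational coefficient between one prediction spectrum and every calibration spectrum, using the per-wavelength minimum and maximum deviations over the calibration set and the distinguishing coefficient $\xi = 0.5$ described above. The exact functional form is an assumption: it follows the commonly used Deng form of the grey relational coefficient, averaged over wavelengths, and covers only the relative term. The absolute degree and the $\theta$ weighting are not reproduced, and the name `relative_grc` is hypothetical.

```python
import numpy as np

def relative_grc(x_pred, X_cal, xi=0.5):
    """Mean grey relational coefficient of each calibration spectrum.

    x_pred : (n,) spectrum of the sample to be predicted
    X_cal  : (m_c, n) calibration spectra, one per row
    Returns an (m_c,) array; larger values mean more similar spectra.
    """
    x_pred = np.asarray(x_pred, dtype=float)
    X_cal = np.asarray(X_cal, dtype=float)
    dev = np.abs(X_cal - x_pred)     # |x_l(k) - x_i(k)| for every l and k
    d_min = dev.min(axis=0)          # minimum over calibration samples, per k
    d_max = dev.max(axis=0)          # maximum over calibration samples, per k
    denom = dev + xi * d_max
    # Assumed Deng-type coefficient; where every deviation at a wavelength is
    # zero the ratio is undefined and is set to 1 (perfect agreement).
    coef = np.ones_like(dev)
    np.divide(d_min + xi * d_max, denom, out=coef, where=denom > 0)
    return coef.mean(axis=1)
```

Sorting the calibration samples by this value in descending order, for example with `np.argsort(scores)[::-1]`, gives the kind of ordering from which a calibration subset is drawn.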

Wavelength Selection
With the local strategy, a new calibration model is built for each unknown sample. Compared with the global method, the computational load of the local strategy is therefore greatly increased. Traditional wavelength selection methods are complex when used within a local strategy. Thus, considering both effectiveness and time consumption, simple-to-use interactive self-modeling mixture analysis (SIMPLISMA) is used as the wavelength selection method for the local model in this paper.
The goal of wavelength selection is to eliminate noisy variables and to improve the prediction performance. Wavelength selection based on SIMPLISMA provides analytical wavelengths with a high signal-to-noise ratio (called pure variables) for the prediction model, so that the influence of variables irrelevant to the studied properties can be eliminated. A pure variable is one that contains intensity contributions from only one component in the mixture [23], and SIMPLISMA assumes that every component in the mixture under study has a pure variable (e.g., a wavenumber) with a finite intensity for that component and zero intensity for all other components in the mixture [24]. Since pure-variable intensities are directly proportional to the concentrations of the associated components, a calibration model constructed with pure wavelengths provides better results than one constructed with the entire spectrum. The number of selected variables is a critical parameter that determines the stability and accuracy of the model. When too few variables are selected, the robustness and accuracy of the model may suffer from the loss of informative variables. On the other hand, when too many variables are used, uninformative variables may enter the model and weaken its performance. Here, in order to keep the wavelength selection fast, the determining coefficient function defined in Section 2.5, which requires no cross-validation, is used to determine the proper number of selected wavelengths.

Simple-to-Use Interactive Self-Modeling Mixture Analysis
In the SIMPLISMA method [25], the pure variable is determined from the purity value $p_{ij}$, which is based on the standard deviation divided by the sum of the mean and a constant, $\sigma_i / (\mu_i + \alpha)$. Here $p_{ij}$ represents the purity value of the selected variable ($i = 1, 2, \ldots, n$; $j$ indexes the pure variables, $j = 1, 2, \ldots, r$; $r$ is the number of components). All of the $p_{ij}$ values are plotted in the form of a spectrum, the so-called purity spectrum, and the wavenumber with the highest intensity represents the $j$th pure variable. $\alpha$ is the noise level, which is typically 1-5% of the maximum mean; here, a value of 1% is used. $\mu_i$ is the mean and $\sigma_i$ is the standard deviation of the $i$th column vector of the original spectral data matrix $D_{m \times n}$, where $m$ is the number of samples.
The weight factor $w_{ij}$ is a determinant-based function used to remove all contributions correlated with the previously selected pure variables (Equation (11)). Note that $w_{i,1} = 1$ for $j = 1$. Here, $p_{j-1}$ is the wavenumber of the $(j-1)$th pure variable, and $C$ is the correlation-around-the-origin matrix, which is computed from $D(\lambda)$, the original data matrix $D$ scaled by the length $\lambda$. The standard deviation spectrum $s_{i,j}$ is the spectrum that has the closest relationship with the original data. SIMPLISMA continues to search for pure variables until the maximum number of components is reached. The number of pure variables is determined on the basis of the determining coefficient $R_j$ of the $j$th pure variable, which is computed from $R_{sj}$, the relative total intensity of the standard deviation spectrum of the $j$th pure variable. If the $j$ pure variables are representative of the entire mixture system, then all the other variables in the dataset are linear combinations of these $j$ pure variables, resulting in an $R_{s(j+1)}$ value of nearly zero. As $R_{s(j+1)}$ approaches 0, the determining coefficient $R_j$ becomes relatively high once the proper number of pure variables has been reached. Additional details regarding this method can be found in [23,25].
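The pure-variable search can be sketched in NumPy as follows. This is a hedged illustration based on the commonly published SIMPLISMA formulation rather than the authors' code: the length scaling $\lambda_i = \sqrt{\mu_i^2 + (\sigma_i + \alpha)^2}$, the use of $C = D(\lambda)^{\mathsf T} D(\lambda) / m$, and the function name `simplisma_pure_variables` are assumptions. The first weight is set to 1 as stated above, and the stopping rule based on the determining coefficient is left out, so the number of pure variables is passed in explicitly.

```python
import numpy as np

def simplisma_pure_variables(D, n_pure, noise_pct=1.0):
    """Return column indices of successive pure variables (sketch).

    D         : (m, n) data matrix, m samples by n wavelengths
    n_pure    : number of pure variables to extract
    noise_pct : noise level alpha as a percentage of the maximum mean
    """
    D = np.asarray(D, dtype=float)
    m, n = D.shape
    mu = D.mean(axis=0)
    sigma = D.std(axis=0)
    alpha = noise_pct / 100.0 * mu.max()
    # Assumed length scaling of each column before forming C.
    lam = np.sqrt(mu**2 + (sigma + alpha) ** 2)
    D_scaled = D / lam
    C = D_scaled.T @ D_scaled / m      # correlation around the origin
    selected = []
    for _ in range(n_pure):
        if not selected:
            w = np.ones(n)             # w_{i,1} = 1 for the first pure variable
        else:
            # Determinant of the sub-matrix of C built from variable i and the
            # pure variables already chosen; it is near zero when variable i
            # is a linear combination of the chosen ones.
            w = np.array([
                np.linalg.det(C[np.ix_([i] + selected, [i] + selected)])
                for i in range(n)
            ])
        purity = w * sigma / (mu + alpha)   # purity spectrum for this step
        selected.append(int(np.argmax(purity)))
    return selected
```

In practice the loop would be stopped by a criterion such as the determining coefficient described above instead of a fixed `n_pure`.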

Model Evaluation
The root mean square error of prediction (RMSEP) of the prediction set was used to evaluate the accuracy of the models, with the RMSEP calculated as:
$$\mathrm{RMSEP} = \sqrt{\frac{1}{m_p} \sum_{j=1}^{m_p} \left(\hat{y}_j - y_j\right)^2}$$
where $\hat{y}_j$ and $y_j$ are the predicted and reference concentrations of the $j$th sample, and $m_p$ is the number of samples in the prediction set. Statistical analysis was performed using the Wilcoxon matched-pairs signed-ranks test between the reference and predicted concentrations of the different methods [26]. In this paper, an "exact" test was used and two-tailed p values were calculated. Differences were considered statistically significant at p < 0.05.
For the local strategy, the RMSECV of the LOO procedure was used to optimize the number of samples in the calibration subset, which is given by:
$$\mathrm{RMSECV} = \sqrt{\frac{1}{N} \sum_{j=1}^{N} \left(y_j - \hat{y}_{(j)}\right)^2}$$
where $y_j$ is the concentration of the $j$th sample in the calibration set, $\hat{y}_{(j)}$ is the value predicted in cross-validation without the $j$th sample, and $N$ is the number of samples in the calibration subset, $N \le m_c$. Leave-one-out cross-validation (LOOCV) was also used to select the optimum number of PLS factors. LOOCV works by temporarily removing one sample from the calibration set and then predicting that sample from the remaining ones. Since the concentrations of the samples in the calibration set are known, the prediction errors can be calculated. The LOOCV process is repeated as many times as there are samples in the calibration set. The squared prediction errors are summed and expressed as the prediction residual error sum of squares (PRESS), which is calculated by:
$$\mathrm{PRESS} = \sum_{j=1}^{m_c} \left(y_j - \hat{y}_{(j)}\right)^2$$
where $m_c$ is the number of calibration samples, $y_j$ is the reference concentration of the $j$th sample, and $\hat{y}_{(j)}$ is the estimated concentration. Resembling the F test, the number of PLS factors is determined by computing the ratio between two successive values of PRESS. If $\mathrm{PRESS}_k / \mathrm{PRESS}_{k-1}$ exceeds 1, $(k-1)$ PLS factors are used in the model.
For comparison, the local algorithm based on the Mahalanobis distance was also calculated, which is given by:
$$MD_j^2 = \left(T_j - T_{pred}\right) \mathrm{COV}^{-1} \left(T_j - T_{pred}\right)^{\mathsf T}$$
where $MD_j$ is the Mahalanobis distance between the $j$th sample and the predicted sample, $T_j$ and $T_{pred}$ are the scores of the $j$th sample and the predicted sample, and COV is the covariance matrix of the scores matrix of the calibration set.
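The evaluation quantities described in this section are simple to compute once predictions are available. The sketch below shows a root mean square error (usable for both RMSEP and RMSECV), PRESS, and the successive-PRESS-ratio rule for choosing the number of PLS factors. The function names and the convention that `press_values[k - 1]` holds the PRESS obtained with `k` factors are assumptions for illustration; obtaining the cross-validated predictions themselves (for example from a PLSR implementation) is outside this sketch.

```python
import numpy as np

def rmse(y_ref, y_pred):
    """Root mean square error between reference and predicted values."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_ref) ** 2)))

def press(y_ref, y_cv):
    """Prediction residual error sum of squares from LOOCV predictions."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_cv = np.asarray(y_cv, dtype=float)
    return float(np.sum((y_ref - y_cv) ** 2))

def select_n_factors(press_values):
    """Pick the number of factors with the successive PRESS ratio rule.

    press_values[k - 1] is assumed to be the PRESS with k factors. The first
    k for which PRESS_k / PRESS_(k-1) exceeds 1 gives k - 1 factors; if the
    ratio never exceeds 1, all tested factors are kept.
    """
    for k in range(2, len(press_values) + 1):
        if press_values[k - 1] / press_values[k - 2] > 1:
            return k - 1
    return len(press_values)
```

With a hypothetical PRESS sequence such as `[10.0, 4.0, 3.0, 3.5, 2.0]`, the ratio first exceeds 1 at the fourth entry, so the rule keeps three factors even though a later value is lower.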

Calibration Subset Selection
A key factor influencing prediction accuracy is the choice of an adequate size and distribution of the calibration subset. In this study, the number of samples in the calibration subset that gives the minimum RMSECV in the LOO procedure was identified. Although the resulting calibration model may not be the best possible, it was robust and achieved acceptable accuracy. The calibration subset selection steps are as follows:
1. Sort the calibration set $A_{m \times n}$ in descending order of S-GRC values. Here, m = 70 and n = 198.
2. Select the first $N_{sub}$ samples of the calibration set to compose the calibration subset. Here, $N_{sub}$ = 3 or 4 or 5.
3. Apply the regression method, PLSR, to the absorbance data in the calibration subset using the LOOCV method, and calculate the RMSECV.
4. Set $N_{sub} = N_{sub} + 1$ and repeat steps (2) and (3) until $N_{sub}$ > 70.
In Figure 4, the thick curves are the absorbance spectra of the predicted samples, and the fine curves are the absorbance spectra of the calibration subset samples. It is obvious that not only the samples with the smaller absolute deviation, but also samples with similar rates of change have been selected in the calibration subset. The concentrations of each sample in the calibration subset are listed in Table 2. The S-GRC values between the predicted sample and the samples in the calibration subset are also presented.
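The subset-size search described in the numbered steps of this section can be sketched as a loop over candidate sizes with a leave-one-out error for each. To keep the example self-contained, the regressor is passed in as a callable, and the demonstration regressor `ols_fit_predict` is ordinary least squares, used here only as a stand-in for PLSR. All function names are hypothetical, and the sketch assumes similarity scores (for example S-GRC values) have already been computed.

```python
import numpy as np

def loo_rmsecv(X, y, fit_predict):
    """Leave-one-out RMSECV for a regressor given as a callable."""
    n = len(y)
    errors = np.empty(n)
    for j in range(n):
        keep = np.arange(n) != j               # hold out sample j
        y_hat = fit_predict(X[keep], y[keep], X[j:j + 1])
        errors[j] = y[j] - y_hat[0]
    return float(np.sqrt(np.mean(errors ** 2)))

def choose_subset_size(X_cal, y_cal, similarity, fit_predict, n_min=3):
    """Grow the subset in order of decreasing similarity; keep the best size."""
    order = np.argsort(similarity)[::-1]       # step 1: descending similarity
    best_n, best_err = None, np.inf
    for n_sub in range(n_min, len(order) + 1): # steps 2 to 4
        idx = order[:n_sub]
        err = loo_rmsecv(X_cal[idx], y_cal[idx], fit_predict)
        if err < best_err:
            best_n, best_err = n_sub, err
    return best_n, best_err, order[:best_n]

def ols_fit_predict(X_train, y_train, X_test):
    """Stand-in regressor: least squares with intercept (NOT PLSR)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef
```

Because a separate search runs for every sample to be predicted, the cost grows with both the calibration set size and the prediction set size, which is consistent with the longer computation times reported for local strategies.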

Wavelength Selection
The spectral data in the calibration subset of each prediction sample were analyzed using the SIMPLISMA approach with a noise level α of 1%, as described in Equations (8)-(13). The number of selected wavelengths was determined by the coefficient defined in Equation (14).
Figures 5-8 illustrate the wavelength selection process for examples of one-component, two-component (samples 1 and 2), and four-component samples. As can be seen in Figure 5, the variables with a relatively high intensity (between 517-572 nm, in Figure 5a) are relatively pure, and the variable with the highest intensity (at 542 nm, shown in Figure 5a) is the first pure variable. By contrast, a variable with a low intensity (between 407-507 nm, in Figure 5a) has contributions from several components. After the effect of the first pure variable is eliminated by using Equations (11) and (12), the second purity spectrum is obtained, as shown in Figure 5b. The second purity spectrum leads to the selection of the next pure variable, i.e., 447 nm, which is accepted as the second pure variable. The other pure variables are selected in the same way until the number of pure variables equals 7. As shown in Figure 5e, the seventh purity spectrum has an odd shape and intensities of nearly zero. Such erratic behavior is a strong indication that the spectrum consists only of noise. This is also confirmed by the determining coefficient curve in Figure 5f, which shows a sudden change of the determining coefficient at j = 6. This indicates that the relative total intensity of the standard deviation spectrum of the seventh pure variable, defined in Equation (15), is nearly zero and that the seventh purity spectrum contains no useful information other than noise. Thus, the number of selected wavelengths is 6, and the SIMPLISMA process ends when the number of pure variables equals 7. In the same way, the purity spectra and determining coefficients obtained while processing the calibration subset spectra for the four-component sample, two-component sample 2, and the one-component sample are shown in Figures 6-8, respectively.
As can be seen, the numbers of selected wavelengths for the four-component sample (amaranth (50 µg/mL), carmine (60 µg/mL), tartrazine (40 µg/mL), and sunset yellow FCF (30 µg/mL)), the two-component sample (amaranth (100 µg/mL) and carmine (50 µg/mL)), and the one-component sample (amaranth (160 µg/mL)) are 25, 17, and 5, respectively.

As shown in Figure 9a-d, 6, 25, 17, and 5 wavelengths were selected, respectively, for the specific examples according to the determining coefficients. The number of selected wavelengths (indicated by the vertical lines) in Figure 9b,c is clearly greater than that in Figure 9a,d. This is due to interference between the components with overlapping spectra in the four-component sample (Figure 9b) and two-component sample 2 (Figure 9c).
The final size of the calibration subset for each unknown sample in the prediction set is shown in Table 3. It can be seen that the one-component samples and the two-component samples such as the first, 6-10, and 16-21 prediction samples have a small number of samples and wavelengths in the calibration subset (fewer than 12). Two-component samples such as the 15th, 22nd, and 23rd prediction samples have relatively larger calibration subset sizes because the spectra of the two components clearly overlap. For the four-component samples, the sizes of the calibration subsets are usually more than 20, so as to obtain enough similar samples for building a prediction model.

Comparative Performance of the Various Methods
The RMSEPs of the prediction set (27 samples) and the computational time obtained with the proposed method are listed in Table 4. For comparison, the table also lists the RMSEPs and computational times obtained, with the same calibration and prediction sets, by a global model with entire spectra, by a global model combined with the GA, UVE, MWPLSR, and SPA wavelength selection methods, and by a local model with the Euclidean distance, Mahalanobis distance, and spectral angle mapper similarity criteria. It should be noted that, after a calibration set or subset was selected with one of the aforementioned methods, calibration equations were calculated using PLSR, and LOOCV was used to select the optimum number of PLS latent variables. The results of different runs with UVE can differ, so the time and RMSEP of UVE are the averages of 10 runs.
As shown in Table 4, all local strategies with different similarity criteria improve the prediction performance, producing a smaller RMSEP for each component than the global model with entire spectra. In particular, the local strategy with the S-GRC similarity criterion provides relative RMSEP improvements for amaranth, carmine, tartrazine, and sunset yellow FCF of more than 60%, 57%, 12%, and 40%, respectively, over the global model with entire spectra. In addition, because each query sample requires the development of a specific model with a new subset of samples, all local strategies need long computational times, more than 120 s. As expected, lower RMSEPs were also obtained with the wavelength selection methods. However, the GA, MWPLS, and SPA methods need cross-validation to determine the proper number of selected variables, so their calculation time is clearly longer than that of the SIMPLISMA and UVE methods. Considering both prediction performance and calculation time, SIMPLISMA and UVE were chosen as the wavelength selection methods for the calibration subsets selected by the local strategy in this paper. Comparison of the RMSEPs obtained by these two methods with the same local strategy shows that SIMPLISMA produced better predictions with smaller RMSEPs. Although the RMSEP of carmine obtained by the local model combined with SIMPLISMA is slightly larger than that obtained with UVE, there is no significant difference between 1.76 µg/mL and 1.73 µg/mL. In addition, the final size of the selected calibration set obtained by the local strategy based on S-GRC combined with SIMPLISMA wavelength selection is $[4, 38] \times [3, 25]$, which is considerably smaller than that with UVE. These results also indicate that the proposed method is more effective. Here, $[4, 38] \times [3, 25]$ means that the number of samples in the selected calibration sets is between 4 and 38, and the number of selected variables in these sets is between 3 and 25.
Statistical analysis was also performed using the Wilcoxon matched-pairs signed-ranks test. The p-values between the reference and predicted concentrations for the different methods are listed in Table 5. As can be seen in Table 5, the Wilcoxon tests found no statistically significant differences between the reference and predicted concentrations for any component with any method (p > 0.05). This means that it is hard to evaluate the prediction performance of the different models based on Wilcoxon tests alone. Therefore, the standard error of the prediction residuals was also calculated. As shown in Tables 4 and 5, the standard errors of the prediction residuals and the RMSEPs are almost equal and show a similar trend. This again supports the conclusion that both local strategy and wavelength selection approaches can improve the prediction performance of multivariate calibration models, especially the local strategy based on S-GRC, wavelength selection based on SIMPLISMA, and the combination of both.

Conclusions
In this paper, we proposed a local strategy based on a similarity criterion named S-GRC. RMSECV was applied to determine the number of spectrally similar samples for each unknown sample. Wavelength selection was performed according to the ranking of pure variables by SIMPLISMA. In order to generate rapid predictions, a determining coefficient was used to ascertain the proper number of selected variables without cross-validation. For comparison, a global model with entire spectra, a global model with different wavelength selection methods (GA, UVE, MWPLS, SPA), and a local model with different similarity criteria (Euclidean distance, Mahalanobis distance, and spectral angle mapper) were also developed. With ultraviolet-visible spectral data of food coloring analytes, it was shown that the proposed methods can select an optimized calibration subset for building a high-performance prediction model in a reasonable time frame. In conclusion, the local strategy based on S-GRC, combined with the SIMPLISMA wavelength selection method, is recommended for building a robust model for multivariate calibration, especially for resolving spectra with partial overlaps.