Towards improvement in prediction of iodine value in edible oil system based on chemometric analysis of portable vibrational spectroscopic data

Iodine value (IV) is a significant parameter to illustrate the quality of edible oil. In this study, three portable spectroscopy devices were employed to determine IV in mixed edible oil system, a new Micro-Electro-Mechanical-System (MEMS) Fourier Transform Infrared Spectrometer (MEMS-FTIR), a MicroNIRTM1700 and an i-Raman Plus-785S. Quantitative model was built by Partial least squares (PLS) regression model and four variable selection methods were applied before PLS model, which are Monte Carlo uninformative variables elimination (MCUVE), competitive reweighted sampling (CARS), bootstrapping soft shrinkage approach (BOSS) and variable combination population analysis (VCPA). The coefficient of determination (R2), and the root mean square error prediction (RMSEP) were used as indicators for the predictability of the PLS models. In MicroNIRTM1700 dataset, MCUVE gave the lowest RMSEP (2.3440), in MEMS-FTIR dataset, CARS showed the best performance with RMSEP (2.2185), in i-Raman Plus-785S dataset, BOSS gave the lowest RMSEP (2.5058). They all had great improvements than full spectrum PLS model. Four variable selection methods take a smaller number of variables and perform significant superiority in prediction accuracy. It was demonstrated that three new portable instruments would be suitable for the on-site determination of edible oil quality in infrared and Raman field.

The theory of FT-NIR is relied on the absorption of electromagnetic radiation which wavenumbers range from 800 to 2500 nm. The spectra produced by FT-NIR mainly corresponds to overtones and combinations of vibrational modes referring to C-H, C=C, C=O and N-H chemical bonds which arises from overlapping absorptions 2,4,5 . And it has proved to be a dependable tool for measurement of biological and chemical systems on account of wide range overtone bands. The main limitation of FT-NIR is its dependence on reference methods, its low sensitivity to minor constituents and its dependency on intricate calibration procedures 6,7 . So aiming to predict indirectly, appropriate chemometric tools should be used for multivariate calibration which are highly indispensable for the advanced technology of spectroscopy. Some computational approaches such as PCA and PLS, which allow the processing of abundant variables that then need data reduction process.
Raman spectroscopic technique is also based on the vibrational transitions occurring. Raman scattering depends on the change of the molecular polarizability and is useful for the in vivo or on-site study 8,9 . It is also widely used to analyze food components such as proteins, lipids, and water in food science.
The application of vibratory spectroscopy and chemometrics in oil has been reported by many researchers. Lucyna Dymińska et al. 10 used infrared and Raman methods to determine the iodine values of unsaturated plant oil. Cleiton A. Nunes 11 assessed quality parameters, adulteration and authenticity of edible oils and fats by vibrational method and chemometrics. Nor Fazila Rasaruddin et al. 12 also tested the IV of palm oils by FT-NIR. Li et al. 13,14 reported the use of FT-NIR for rapid measurement of iodine value, saponification number and cis and trans content of edible oil. However, fewer investigations about portable vibrational spectroscopy methods application were reported and influence of variable selection on Raman spectroscopy was rarely systematic studied. In our study, both BOSS and VCPA were first applied in Raman spectra.
Variable selection methods are well recognized in chemometrics and industrial applications. The elimination of variables which do not contribute to any inference is highly desirable for several reasons. For example, in NIR, absorption bands of fundamental frequency vibrations and combination of vibrations make it possible for quantitative analysis. Generally, NIR doesn't need sample preparation. Several properties can be predicted according to a single spectrum simultaneously. However, adverse issues are also inevitable, as absorption bands are usually overlapping. Moreover, spectroscopy characterizes a chemical sample with thousands of wavelength variables, which may include lots of irrelevant information for calibrations like noise or background, often resulting in a negative effect to the whole modeling. Therefore, suitable chemometrics algorithm is necessary to deal with NIR spectrum, with the purpose to eliminate the uninformative variables effectively by using variable selections.
Vibrational spectroscopy has been proved to be a reliable method of rapidly determining the physical and chemical properties in edible oil. It has provided a responsive alternative for the commonly used methods applied in the industries. However, more applications of on-site test should be developed. This study has three proposes. First is to investigate the feasibility of using MEMS-FTIR, MicroNIR TM 1700 and i-Raman Plus-785S to quantify IV of edible oil based on PLS regression models. Second is to investigate the influence of variable selection methods especially BOSS and VCPA on the robustness and predictability of calibration models developed by PLS. Last one is to demonstrate the potential of three portable devices for the on-site analysis of edible oils in the view of IV.

Materials and Methods
Agents and reagents. Potassiumiodide  Sample preparation. Soybean oil, olive oil, peanut oil and blend oil products were obtained from local supermarket. Iodine value were operated by the standard titration method which is based on the official methods introduced in the method for animal and vegetable fats and oils-determination of iodine value (ISO 3961:1996, MOD). 59 samples were prepared by mixing the four kinds of oil with the concentration of soybean oil, olive oil, peanut oil, blend oil from 0% to 85.46%, 0% to 69.34%, 0% to 88.35%, 0% to 85.46%, respectively.

Instruments.
MicroNIR1700. MicroNIR1700 is a micro NIR spectrometer developed and manufactured by JDSU. The instrument uses a Linear Variable Filter (LVF) as a light-splitting element. The LVF is a special band-pass filter, which is specially fabricated into a wedge-shaped coating in a specific direction. Since the center wavelength of the passband and the film Layer thickness, the wavelength of the filter penetrates linearly in the wedge direction, which plays a role of spectroscopy. LVF is coupled to a linear array detector (128-pixel uncooled InGaAs photodiode array). Dual integrated vacuum tungsten light source, 16-bit A/D converter.

MEMS-FTIR.
MEMS-FTIR is a long wavelength near infrared spectrometry machine developed by HAMAMATSU in Japan. The MEMS-FTIR is a Fourier transform infrared spectrometer which is compact and with low cost. A Michelson interferometer and an infrared detector are grouped together in a small space. The MEMS-FTIR is formed by a fingertip size FT-IR engine, a control board, a photo-detector, input/output fibers, etc. Its size is 75 × 100 × 27 mm. Spectral measurement or absorption measurement can be done simply by connecting to a PC via USB. It is very suitable for on-set in-situ test analysis.
i-Raman Plus-785S. i-Raman Plus is a portable raman instrument developed by B&W Tek, Inc Company. It uses innovative intelligent spectral processing technology, high efficiency thin back-illuminated CCD detector, lower cooling temperature, resulting in better signal to noise ratio and higher dynamic range. The i-Raman ® Plus-785S has a maximum integration time of up to 30 minutes and has the unique advantage of detecting weak Raman signals. It combines both high resolution and wide spectral range with spectral ranges up to 3200 cm −1 and optimal resolutions up to 4.5 cm −1 .

MEMS-FTIR.
NIR spectra were collected with 5 mm quartz cuvette. The spectra were acquired over the range 1100~2100 nm (middle gain resolution, 2000 ms scans) at room temperature. Between each spectrum, the quartz cuvette was rinsed by the next sample.
i Raman Plus-785S. Raman spectra were acquired with 5 mm quartz cuvette over the range 175~3200 cm −1 at room temperature. The resolution is 4.5 cm −1 . Dark current was calibrated every 30 minutes and background was collected by the next sample.
Where c j is utilized on the conjunction of the addition of the original data and random variables, β j is on half of the regression coefficients of the jth variable when ignore the ith calibration sample, and n is the calibration samples number. β j denote the mean value, and s(β j ) stands for the standard deviation of all β ij for the jth variable, and β ij is obtained through leave-one-out approach. The criterion of eliminating redundant variables is achieved as the equation below: Here c ( ) j stands for stability criterion of the jth variable in the dataset originally; and c max( ) artif stands for the absolute value of the maximum value for c ( ) j from the added random variables. In MCUVE, Monte Carlo sampling strategy is brought in the UVE instead of leave-one-out method: random choosing M samples from all the calibration samples to set up PLS models to calculate the regression coefficient β, then repeating the process for N times. So Eq. (2) convert into the following equation: Here, β ij denotes the regression coefficient of the jth wavelength in partial regression model, which is established by the ith M random chosen samples.

CARS.
CARS is proposed also based on absolute value of regression coefficients with the purpose evaluating the significance of variables 15 . Monte Carlo is employed for sampling. To carry out feature selection and leaving out variables with small absolute regression coefficients in compulsive way, the exponentially decreasing function (EDF) is adopted. Through an EDF run, the ratio of wavelengths retained is processing in the ith sampling run follows the equation: where a and k are two constants. They can be computed as: Adaptive reweighted sampling (ARS) is adopted to realize a competitive feature selection in the view of the regression coefficients. This step follows the principle 'survival of the fittest' which is the basic theory of Darwin's Evolution Theory 19 .
In the end, cross validation is employed to select the subsets according to the lowest RMSECV.
BOSS. BOSS  (1) K subsets are generated by using BSS, all the variables are assigned with equal weights (w).
(2) Build K PLS sub-models with all the subsets and pick out best models with the lowest RMSECV.
(3) Add up all the normalized regression vector to acquire new weights for variables.
new subsets are generated by WBS according to new weights. This way guarantees that we have larger probabilities to select the variables which have larger absolute value of regression coefficients.
The subset which has the lowest RMSECV during the iteration is selected as the optimal variable subset by repeating step (2-4).
VCPA. The optimizing variable subset is selected rely on binary matrix sampling (BMS) and EDF. In each iteration run, BMS and model population analysis (MPA) are carried out for once. After N EDF runs, 14 variables are remained which are considered be the most significant. Then RMSECV of all the combinations is calculated and the lowest RMSECV is recorded. In the end, the optimal subset with the lowest RMSECV is selected in the final run 17 .

Partial Least Squares Regression (PLS).
PLS is a two-block regression method which is aimed to model the relationship between measured spectrum matrix X and a response vector y. Eqs (9) and (10) illustrate the PLS model 21 .
Here T is score matrix, P is the loading matrix. q as a y-loading vector, E A and f A are residual matrix of X and y-vector.

Model Validation.
To assess the performance of four promising variable selection approaches, namely CARS, SCARS, BOSS and SBOSS. Mean-centered was applied before modeling, and the optimal number of latent variables was determined by 5-fold cross validation. RMSEC (Root mean square error of calibration), RMSEP, Q cv 2 and Q test 2 were used to evaluate model performance. Standard deviation (SD) in 50 runs was employed to evaluate the robustness of PLS model. Simultaneously, the number of optimal latent variables (nLVs) and variables selected number (nVAR) were also reported. While y i is the experimental of the predicted properties, and ŷ i and y i represent predicted and average respectively. Ncal is the number of calibration samples of the training set. RMSEP and Q test 2 hold the equation following the same as RMSEC and Q cv 2 .
i n i 1 2 Each method was repeated for 50 times to assess the stability. The standard derivation (SD) was employed to calculate stability with Eq. (11). Where X i and X are predicted and average value, separately. n stands for the number of all samples. The smaller the value of the stability, the more stable is the method.

Results and Discussion
The dataset was separated into Calibration set (36 samples) and independent test set (23 samples) by K-Stone sampling 22,23 . For preprocessing, centering was employed in all datasets before modeling. For MCUVE, the Monte Carlo sampling number is set to 500. The regression coefficients of every variable were recorded. A coefficient matrix was developed after 500 iterations. Then, all the variables were ranked in accordance with their reliability index. In our study, 5-fold cross validation was employed to decide the number of variables. With all these settings, we ran MCUVE to estimate its predictive performance. For CARS run, the number of Monte Carlo sampling runs was 100. In BOSS, the bootstrap number was set to 1000. Several parameters also influence VCPA strategy, EDF runs (50 times), BMS sampling runs (1000), ω, the number of the left variables in the final run of EDF (14), σ, the ratio of best models of k sub-models (10%). We ran VCPA with the settings as in the parentheses. All the four variables selection methods were repeated for 50 times to assess the prediction accuracy and robustness.
Infrared spectra features. Fig. 1 showed the raw spectra of mixed edible samples on MicroNIR1700, MEMS-FTIR and iRaman Plus-985S. MEMS-FTIR has wider spectrum range than.  IV of edible oil. Both the mean and standard deviation were given in Table 2.

Quantitation of IV by variable selection and PLS.
As for MicroNIR1700, the four variable selection methods didn't give great improvements compared to the full spectra PLS model. MCUVE gave the best performance with RMSEP (2.3440), it increased 1.86% than full   Table 2).
CARS achieved a good prediction with the least variables, we can see that from both Figs 4 and 5. the reason may be that variables are heavily collinear and therefore the model's variance could be reduced with fewer variables. BOSS and VCPA also had fewer variables, but they retained variables around 700 nm and 800 nm which were uninformative, that was the reason why the model of BOSS and VCPA performed worse. There were no absorption peaks around 700 nm and 800 nm, so it didn't have any information. Figure 2 (3) in Supplementary Materials showed the regression coefficient path of each variable from one run of CARS with the100 runs of sampling. We can see that in the first sampling run, the absolute value of regression coefficient of each variable was very small. However, with the number of sampling runs increased, the coefficients of some variables became larger and larger while others got smaller and smaller. Especially, the regression coefficients even decreased to zero if the relevant variables were knocked out by CARS. Therefore, the corresponding variable has more chances to survive if the absolute regression coefficient performs larger. It is also essential to analyze the regression coefficient path of each wavelength as shown in Fig. 2 (2)(3) of MEMS-FTIR dataset (Supplementary Materials). As previously mentioned, each line reflected the changing of regression coefficient of one variable. During CARS iteration, some significant variables were chosen while other ineligible ones were ignored. In MEMS-FTIR dataset, CARS performed the best with the lowest RMSEP (2.2185), followed by MCUVE, VCPA and BOSS. CARS also got the minimum standard deviation (0.1144). The reason why CARS presented the best may be that it selected most of the informative variables around 1392 nm, 1414 nm which corresponded to the combination of the C-H stretching and vibration with other vibration modes of the concerned molecule. Most strong absorption bands of the calibration samples were observed at CH and CH 2 over-tones. These overtones occurred at 1207, 1391, around 1408, 1715 and 1734 nm together with minor absorption bands in the range between 2083 and 2202 nm (Fig. 5).
The absorption bands around 1207 nm comprised the second overtones of C-H, while those between 1612-1818 nm were attributed to the first overtones of C-H which comprises CH 3 , CH 2 and HC-CH. The reason why BOSS played the worst is that BOSS chose the variables with much noise (around 1100 and 2100 nm) and missed the informative variables (around 1715 nm).
As for Raman data, it is obvious to see that variable selection had a great influence to the PLS model. BOSS and VCPA improved a lot compared to the full spectrum PLS model, nevertheless MCUVE and CARS showed bad results. Figure 5 demonstrated selected variables of four variable selection methods. It is noticeable that both MCUVE and CARS retained the variables between 175 and 220 nm that are mainly noise. Variables belong to the region are uninformative, moreover, MCUVE missed the informative variables around 1745, 1655, 1438, 1301 and 1068 nm, hence MCUVE had a bad result even worse than full spectrum PLS model. It should be noticed that denoising ability of MCUVE and CARS are weak (Fig. 4). VCPA ranked the second place in all the models with good stability. It retained informative variables efficiently on account of the employment of BMS and MPA. BMS is a sampling approach that each variable has the same opportunity to take part in the sampling process, which let it be a suitable sampling choice of variable selection. Moreover, the same with CARS, EDF makes VCPA select fewer variables. The RMSECV of every EDF run was presented in Supplementary Material. It was demonstrated that the trend of decreased RMSECV is accordance with the EDF. The RMSECV is decreasing with the shrinking variable space, which means that the remaining variables were gradually selected toward the optimal variable subset. At last, the optimal subset from all the combinations among 14 variables was found.  In general, the results in Table 2 showed relatively good predictions which indicated that the calibration models are robust. It indicated that the predictions of MicroNIR1700, MEMS-FTIR and i-Raman-785s were comparable to their corresponding reference methods for IV determination and therefore the three portable devices based on edible oil analysis is suitable for on-site measurement of IV for edible oil or other biodiesel production. Variable selection is necessary for quantitative model to improve the prediction results and ensure the reliability.

Conclusion
In our study, we discussed the influence that variable selection methods MCUVE, CARS, BOSS and VCPA has on the MicroNIR1700, MEMS-FTIR and i-Raman-785s PLS calibration modeling for the vibratory spectroscopy analysis IV of edible oil. The results showed that the three portable spectroscopy devices were capable of providing a rapid and accurate measurement of IV of edible oil destined for biodiesel production with a proper calibration and a responsive model. Once the calibrations are in place, portable device is a fast and easy to use method for the IV measurement in an on-site environment. It drastically reduces the time from routine IV value quality control analysis and does not involve the use of any chemical reagents. Conclusively, it's possible to use portable vibratory spectroscopy as an edible oil quality control tool for IV measurement and more robust PLS and prediction models can be obtained based on variable selection methods.

Data Availability
All data included in this study are available upon request by contact with the corresponding author.