Rapid non-destructive analysis of lignin using NIR spectroscopy and chemometrics

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2021 The Authors. Food and Energy Security published by John Wiley & Sons Ltd. 1College of Engineering and Technology, Southwest University, Chongqing, China 2Chongqing College of Electronic Engineering, Chongqing, China


| INTRODUCTION
Snow pear, a common and popular fruit in China, has been used in traditional Chinese medicine (TCM) for over 2000 years as a remedy for respiratory symptoms, constipation, and alcoholism (Li et al., 2012). Near-infrared (NIR) spectroscopy (Ibáñez et al., 2019; Song et al., 2019; Zhang et al., 2018) combined with process analytical technology has proven to be a useful, non-destructive spectral technique (Ibáñez et al., 2019; Ozdemir et al., 2019) for rapid quantitative analysis of internal chemical components in fruits and food (Xia et al., 2018).
Studies have shown that stone cell content plays a critical role in determining the fruit quality of pears, which is not the case in other fruits. More specifically, a higher concentration and larger size of stone cells lead to thicker pulp and a deterioration of taste. Although genetic variation is a key factor that modulates stone cell content (Tao et al., 2009), lignin content has also been found to play a direct, positive role in modulating the size and content of stone cells in snow pears (Cai et al., 2010; Xue et al., 2019). The more lignin is synthesized (Xue et al., 2020), the more stone cells are formed. Accumulation of stone cell mass increases the pulp density and worsens the taste. Therefore, lignin content can be used as a distinguishing indicator of the internal quality of 'Snow' pear.
In recent research, lignin has been extracted, purified, and analyzed by ultraviolet (UV) spectroscopy, Fourier-transform infrared (FTIR) spectroscopy, and 1H nuclear magnetic resonance (1H NMR) spectroscopy (Cai et al., 2010). In the acetyl bromide soluble lignin (ABSL) method (Fukushima et al., 2000), the lignin is dissolved in a 25% solution of acetyl bromide in glacial acetic acid and its absorbance is read at 280 nm. UV spectrophotometry is a simple and inexpensive method for the determination of lignin. Ionization difference UV spectrophotometry (Δε method) can be used to analyze lignin dissolved under neutral and alkaline conditions with a UV spectrophotometer (Goldmann et al., 2017). Hydropyrolysis (HyPy) (Beramendi-Orosco et al., 2004) has been described as a potential rapid method for obtaining lignin-enriched residues from samples, which can be used for a simple estimation of lignin content. Lignin extracted from cordgrass, switchgrass, and corn stover can also be analyzed and characterized by using an ethyl acetate-ethanol-water organic solvent pretreatment method (Cybulska et al., 2012). Another method is CuO oxidation (De la Cruz et al., 2015), in which lignin is oxidatively hydrolyzed into phenol monomers that are quantified by high-performance liquid chromatography or gas chromatography-mass spectrometry. The Klason method is another classical method for analyzing lignin content (Assis et al., 2017; Bunzel et al., 2011). All of these traditional measurements are chemical methods that require large amounts of reagents such as H2SO4, HCl, and other acids, and therefore call for extra safety precautions and hazardous waste treatment. These methods can be highly accurate but are time-consuming, which makes them impractical for analyzing the lignin content of a large number of samples quickly.
Therefore, in this study, we propose feature wavelength analysis with NIR spectroscopy as a non-destructive and, in particular, rapid method for analyzing lignin content in 'Snow' pears.
NIR spectroscopy has become a simple, rapid, and non-destructive analytical technique (Assis et al., 2017; Fan et al., 2016; Lee & Han, 2016) for effectively measuring the chemical composition and the physical and chemical properties of samples, and it has been widely used in the inspection of fruits, vegetables, meat, and other foods. Recently, much attention has been paid to the rapid detection of soluble solid content (SSC), total acid (TA), firmness, and other internal quality attributes of fruits such as apples, pears, and watermelons by using NIR spectroscopy combined with chemometrics (Ibáñez et al., 2019; Li et al., 2016; Luo et al., 2018; Xu et al., 2019). In addition, the prediction of lignin content in other plant materials by NIR spectroscopy with chemometric methods has been reported. Camila Assis's team used NIR spectroscopy in the range of 10,000-4000 cm−1, ordered predictors selection (OPS), and partial least squares (PLS) to build multivariate calibration models for estimating the lignin content in different parts of sugarcane (Assis et al., 2017). The OPS algorithm selected fewer variables but achieved greater predictive capacity, with RMSEP, Rp, and RPD values for the middle stalk of 0.61, 0.95, and 3.24, respectively. Hao Yang's team employed an inexpensive and portable NIR spectrometer combined with multiplicative scatter correction (MSC), particle swarm optimization (PSO), and the kernel extreme learning machine (KELM) algorithm to rapidly determine the lignin content in wood. There have been extensive studies on the determination of lignin in rice straw (Hu et al., 2018), fine root (Elle et al., 2019), Pinus radiata wood (Fahey et al., 2018), Pinus pinaster (Pnb) wood (Alves et al., 2020), Cryptomeria japonica (Horikawa et al., 2019), pulp wood feedstock (Liang et al., 2020), flax fiber (Huang & Yu, 2019), and other materials by using NIR spectroscopy combined with variable selection algorithms and modeling approaches.
However, there are far fewer studies on the determination of lignin content in fruits; in fact, we found only one. Xiaohui Sheng's team used NIR spectroscopy and the uninformative variable elimination (UVE) algorithm to non-destructively detect lignin content in Korla fragrant pear (Sheng et al., 2020). After SNV pretreatment, the SEP of the UVE-PLS model was 1.36%, Rp² was 0.87, and RPD was 2.03. In this study, a wider range of methods and algorithms is applied to the determination of lignin content in snow pears by NIR spectroscopy.
In general, a near-infrared spectrometer acquires a large number of spectral data points, which suffer from serious collinearity and inevitably contain interference and useless information. If the partial least squares regression (PLSR) model is established on the full wavelength range, the model not only becomes more complex but may also lose predictive performance. Therefore, it is important to identify effective variables before the PLSR model is established.
To improve the predictive ability and efficiency of the PLSR model, a series of variable selection methods have been studied; they are classified into interval selection and individual selection. Synergy interval partial least squares (SiPLS), interval partial least squares (iPLS), moving window partial least squares (MWPLS) (Li et al., 2009), changeable size moving window partial least squares (CSMWPLS) (Luo et al., 2018), backward interval partial least squares (BiPLS) (Assis et al., 2017), and interval combination optimization (ICO) belong to the interval selection methods. These methods do not consider the collinearity among wavelength variables within an interval. Generally, the selected variables are concentrated in several intervals and need to be further simplified. Typical individual selection methods include Monte Carlo uninformative variable elimination (MCUVE) (Li et al., 2014; Yan et al., 2019), competitive adaptive reweighted sampling (CARS) (Bai et al., 2019; Jiang et al., 2015; Yan et al., 2019), genetic algorithm (GA) (Du et al., 2019), successive projections algorithm (SPA) (Li et al., 2014; Liu et al., 2017), and bootstrapping soft shrinkage (BOSS) (Deng et al., 2016). MCUVE selects useful variables according to the stability of the variables; the number of selected variables is usually large and the prediction result is poor. The CARS algorithm (Li et al., 2009) forcibly removes variables whose regression coefficients have small absolute values. However, the absolute value of a regression coefficient can change with the sample space, so useful variables may be eliminated. The GA algorithm has a high risk of overfitting and tends to produce locally optimal solutions, resulting in low computational efficiency and poor prediction performance.
The SPA algorithm can reduce the collinearity among variables to some extent, but the selected variables may contain few useful variables or even interference variables. The BOSS algorithm is a wavelength selection method based on the variable space, which selects feature wavelengths according to the absolute value of the regression coefficient in the variable space by using weighted bootstrap sampling (WBS). However, the BOSS algorithm ignores the frequency of variables as an important feature, so the selected variables may not be optimal.
To address these problems of the existing wavelength selection methods, a bootstrapping soft shrinkage approach combined with the frequency and regression coefficient of variables (FRCBOSS) is proposed in this paper. The FRCBOSS algorithm inherits the advantages of weighted bootstrap sampling (WBS) from the BOSS method. The frequency and regression coefficient of the variables are chosen as evaluation indexes, so that the optimal variables can be selected. In this study, the PLSR model is established on the feature wavelengths selected by the FRCBOSS method for rapid and non-destructive analysis of lignin content in 'Snow' pears. The research process of this work is as follows: (1) a Fourier-transform near-infrared (FT-NIR) spectrometer is used to acquire NIR spectra of the samples; (2) the pears are ground into dry powder and the lignin content reference value is measured by the Klason method; (3) Savitzky-Golay smoothing (SG), standard normal variate (SNV), multiplicative scatter correction (MSC), first derivative (D1), and their combinations are applied in the NIR spectra preprocessing; (4) SiPLS, CARS, SPA, BOSS, FRCBOSS, and their combinations are used to select the useful wavelengths, and PLSR models for predicting lignin content in 'Snow' pears are established and compared; (5) the PLSR model established on the 19 variables selected by the improved variable selection method FRCBOSS is shown to have the best prediction ability for lignin content in 'Snow' pears.

| Sample description
In July 2020, a total of 195 'Snow' pears were purchased from a reliable local fruit market in Chongqing, China. To ensure the accuracy of the experiment, the pears were washed, numbered, and then stored in an ice box (4 °C, 75% RH) for 24 h before the experiment was carried out (Yuan et al., 2020). As shown in Figure 1, three measuring points (P1, P2, and P3) were marked around the equator of each pear at an angular distance of 120°.
FIGURE 1 The diagram of the three measuring points. P1, P2, and P3 were marked around the equator of the pear at an angular distance of 120°.

| Spectra acquisition
The NIR diffuse reflectance spectra of 'Snow' pears were acquired with an FT-NIR spectrometer (Nicolet iS50, Thermo Fisher Scientific, USA). The spectrometer consisted of a high-sensitivity InGaAs detector, a near-infrared light source (a 20 W tungsten lamp), a processor, a Michelson interferometer, a beam splitter, optical fibers, and probes. NIR spectra were recorded in the range of 10,000-4000 cm−1 at a resolution of 4 cm−1. For each sample, the reflectance spectra acquired at three positions around the equator were averaged.

| Lignin content reference value measurement
Once the spectra acquisition of each 'Snow' pear was completed, the lignin content reference value was measured immediately (see Table 1). The intact pear was first made into dry powder as follows (Figure 2): (1) an aliquot of 200 g of fresh pear flesh (from 1 cm outside the core to 2 mm under the pericarp) was obtained from one intact pear; (2) a hydraulic press was used to press the 200 g of pear flesh to eliminate the juice, and the residue was collected; (3) the residue was dried at 50 °C for 5 days and crushed with a miniature pulverizer (IKA A11, Germany); (4) the crushed residue was passed through a 100-mesh sieve; (5) the sieved powder (50 g) was extracted in water and ethanol for 10 h and 5 h, respectively, with a Soxhlet extraction device; (6) the final dry pear powder was obtained after drying in a cabinet (65 °C).
Then, the lignin content was measured by the traditional Klason method as follows (Figure 3): (1) dry pear powder (500 mg) was put into a 250 ml beaker, 30 ml of 72% H2SO4 was added, and external agitation was applied until the mixed solution was completely homogeneous; (2) the mixed solution was stirred and kept at 30 °C for 2 h; (3) the beaker with the mixed solution was placed in a boiling water bath for 2 h; (4) the mixed solution was diluted with deionized water to reduce the sulfuric acid concentration and autoclaved at 120 °C and 15 psi for 1 h; (5) the treated solution was poured into a sand core funnel (2.5 cm diameter, 1.6 μm particle retention, dried and weighed beforehand), filtered, and washed acid-free with distilled water during filtration; (6) the funnel containing lignin and acid-insoluble residue particles was put into a drying cabinet (65 °C) for 5 h until the weight was constant. Non-structural compounds, which may interfere with the measurement of lignin content in the Klason method, should be removed beforehand (Assis et al., 2017; Bunzel et al., 2011). Formula (1) shows the calculation of lignin content in the Klason method.
$$W_L = W_1 - W_2 \tag{1}$$
where $W_L$ is the weight of lignin in the 'Snow' pear sample, $W_1$ is the weight of the funnel containing lignin, and $W_2$ is the weight of the sand core funnel.
The lignin content of the 195 pears ranged from 67.75 to 84.57 mg/g. As shown in Figure 4, the lignin content of the samples approximately followed a normal distribution. Using the Kennard-Stone (KS) algorithm, 146 pear samples were selected as the calibration set and the remaining 49 pear samples were used as the prediction set. The lignin distributions of the calibration set and the prediction set are shown in Table 1.
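The Kennard-Stone split mentioned above can be illustrated with a minimal sketch. It assumes the common formulation of the algorithm (start from the two most distant samples, then repeatedly add the sample whose nearest selected neighbour is farthest away) with Euclidean distance; the function name `kennard_stone` and the toy data are hypothetical and are not taken from the study.

```python
import math

def kennard_stone(samples, n_select):
    """Return the indices of n_select samples chosen by the Kennard-Stone rule."""
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances between all samples (spectra).
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in (first, second)]
    while len(selected) < n_select:
        # Add the candidate whose nearest selected neighbour is farthest away.
        best = max(remaining, key=lambda i: min(dist[i][j] for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy one-dimensional "spectra" (hypothetical values, not the pear data).
toy = [[0.0], [1.0], [2.0], [3.0], [10.0]]
calibration = kennard_stone(toy, 3)
prediction = [i for i in range(len(toy)) if i not in calibration]
```

With the real data, `samples` would hold the 195 averaged spectra and `n_select` would be 146, leaving 49 samples for prediction.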

| Spectra data pretreatment
The NIR diffuse reflectance spectra of 'Snow' pears exhibited baseline shifts, transformation, and noise due to differences in light propagation distance and sample shape. The spectra were therefore preprocessed before modeling and analysis. To obtain better pretreatment results, the following methods were applied to the NIR spectra: (1) Savitzky-Golay smoothing (SG, 11 points); (2) area normalization (NORM); (3) standard normal variate (SNV); (4) first derivative (D1); (5) SG (11 points) combined with SNV; (6) SG (11 points) combined with NORM; (7) SG (11 points) and NORM combined with D1. The dataset consists of 195 'Snow' pear samples measured on an FT-NIR spectrometer. The average reflectance spectrum from three positions contains 3112 wavelength points at intervals of 4 cm−1 in the range of 10,000-4000 cm−1. The dataset was divided into a calibration set (146 samples) and a prediction set (49 samples) by using the Kennard-Stone (KS) algorithm (see Table 1).
FIGURE 2 The flow chart of dry pear powder production.

FIGURE 3 The flow chart of measuring lignin content by using the Klason method.
6 of 15 | WU et al.

| Weighted bootstrap sampling (WBS)
Weighted bootstrap sampling (WBS) is similar to the bootstrap sampling (BSS) method. It is a statistical technique for random sampling with replacement that assigns different weights to different objects. In each trial, objects with larger weights are more likely to be selected. In practice, the weight vector of the variables is normalized, so the weight of each variable lies between 0 and 1.
The probability of each variable being selected is given by Formula (2), where $n$ is the number of variables, $\omega_i$ is the weight of the $i$-th variable, and $R$ is the number of variables retained from the previous iteration. A variable with a larger weight has a greater probability of being selected in the sampling process. In ordinary bootstrap sampling, where the number of draws with replacement equals the number of objects $R$, the fraction of distinct objects selected approaches 0.632 (that is, $1 - 1/e$) as $R$ goes to infinity. The number of selected variables therefore decreases gradually in each iteration, and the variable space shrinks softly.
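The following sketch illustrates weighted bootstrap sampling and the 0.632 limit. It assumes that the selection probability of a variable is its weight divided by the sum of all weights, which is one natural reading of the description above and not necessarily the exact form of Formula (2); the function name `weighted_bootstrap` is hypothetical.

```python
import random

def weighted_bootstrap(weights, n_draws, rng):
    """Draw n_draws variable indices with replacement and return the distinct ones."""
    total = sum(weights)
    # ASSUMPTION: selection probability = weight / sum of weights.
    probs = [w / total for w in weights]
    drawn = rng.choices(range(len(weights)), weights=probs, k=n_draws)
    return sorted(set(drawn))

rng = random.Random(0)

# Equal weights: the fraction of distinct variables is close to 1 - 1/e.
n_vars = 5000
subset = weighted_bootstrap([1.0] * n_vars, n_vars, rng)
fraction = len(subset) / n_vars

# Unequal weights: a variable with zero weight is never drawn.
skewed = weighted_bootstrap([0.0, 1.0, 3.0], 50, rng)
```

Because duplicates are discarded, each round keeps fewer distinct variables than the number of draws, which is the mechanism behind the soft shrinkage of the variable space.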

| New weights by FRC
Assume that the NIR spectral data matrix is $X_{n \times p}$, where $n$ is the number of samples (rows) and $p$ is the number of variables (columns), and that the vector $y_{n \times 1}$ is the measured property of interest. According to Beer's law, the relationship between the spectral matrix and the concentration vector is:
$$y = X\beta + e \tag{3}$$
where $e$ is the random error vector and $\beta$ is the regression coefficient vector. N subsets are generated in the variable space by weighted bootstrap sampling (WBS), and N PLS sub-models are built on these subsets. In each iteration, the root mean squared error of cross validation (RMSECV) of each of the N sub-models is calculated, together with the regression coefficients of the sub-models. The $N\delta$ best models are then selected from the N models; this number is denoted k. The frequency with which each variable appears in the k models is counted and normalized to give the frequency vector $f_{1 \times p}$. Meanwhile, the absolute values of the regression coefficient matrix $\beta_{k \times p}$ of the k models are summed over the models. Finally, the summed regression coefficient vector is normalized to obtain $\beta_{1 \times p}$.
Here, sum(·) denotes the summation operation, norm(·) denotes the norm (modulus) operation used for normalization, and abs(·) denotes the absolute value operation. Once the frequency vector f and the regression coefficient vector β have been obtained, the weight of the variables ω is updated by Formula (5), where α is the fusion coefficient of frequency and regression coefficient, with a value between 0 and 1. When α is 0, the weight ω is determined by the regression coefficient only. The weight vector is normalized again after fusion. According to Formula (5), the weight of each variable thus combines two important indicators: frequency and regression coefficient.
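The weight update described above can be sketched as follows. Two points are assumptions rather than facts from the text: normalization is taken to mean scaling a vector so that its entries sum to 1, and the fusion is written as the convex combination ω = α·f + (1 − α)·β, which is consistent with the statement that α = 0 leaves only the regression coefficient but may differ from the exact Formula (5). The function names and toy numbers are hypothetical.

```python
def normalize(values):
    """Scale non-negative values so that they sum to 1 (assumed meaning of norm)."""
    total = sum(values)
    return [v / total for v in values]

def frc_weights(subsets, coefs, n_vars, alpha):
    """Fuse selection frequency and |regression coefficient| over the k best sub-models.

    subsets: k lists holding the variable indices used by each retained sub-model
    coefs:   k lists holding the matching regression coefficients
    """
    freq = [0.0] * n_vars
    beta = [0.0] * n_vars
    for indices, values in zip(subsets, coefs):
        for i, b in zip(indices, values):
            freq[i] += 1.0       # how often variable i appears in the k models
            beta[i] += abs(b)    # sum(abs(.)) over the k models
    f = normalize(freq)
    beta = normalize(beta)
    # ASSUMPTION: convex combination controlled by the fusion coefficient alpha.
    fused = [alpha * fi + (1 - alpha) * bi for fi, bi in zip(f, beta)]
    return normalize(fused)

# Toy example with k = 2 retained sub-models and 4 variables.
toy_subsets = [[0, 1], [1, 2]]
toy_coefs = [[0.5, -1.0], [2.0, 0.5]]
w = frc_weights(toy_subsets, toy_coefs, 4, alpha=0.4)
w_beta_only = frc_weights(toy_subsets, toy_coefs, 4, alpha=0.0)
```

With the settings reported later for FRCBOSS (N = 2300 sub-models and δ = 0.05), the number of retained sub-models would be k = Nδ = 115.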

| FRC-BOSS wavelength selection method
In the FRC-BOSS algorithm, the regression coefficient and frequency of each variable are used as evaluation indexes for its weight. The larger the weight, the higher the probability that the variable is selected in the next round of sampling. The iteration repeats until the number of variables in the subsets equals 1.
The specific implementation steps of FRCBOSS are as follows.
Step 1. N subsets are generated according to the weights of the variables by using weighted bootstrap sampling. The variables initially have the same weight, which ensures that each variable has the same probability of being selected at the beginning of the iteration.
Step 2. The root mean squared error of cross validation (RMSECV) of each subset is calculated by using the partial least squares (PLS) algorithm.
Step 3. The Nδ optimal models, that is, the subsets with the lowest RMSECV, are selected from the N subsets.
Step 4. The weight of each variable is recalculated according to formula (5) in each iteration.
Step 5. If the number of remaining variables p1 > 1, return to step 1 for the next iteration; otherwise, execute step 6.
Step 6. The subset with the lowest RMSECV found during the iterations is chosen as the optimal variable set.
FIGURE 4 The lignin content distribution of all pears. a.u. stands for arbitrary unit.
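Steps 1 to 6 can be outlined in a short sketch. It is not the authors' implementation: the PLS fitting is replaced by a caller-supplied function `evaluate(subset)` that returns an RMSECV value and one coefficient per selected variable, the weight fusion is assumed to be a convex combination, the number of draws in the next round is assumed to be the mean subset size of the current round, and `max_iter` is an added safety cap. All names and the toy evaluator are hypothetical.

```python
import random

def frcboss(n_vars, evaluate, n_models=200, delta=0.1, alpha=0.4, seed=0, max_iter=100):
    """Outline of the FRCBOSS loop; evaluate(subset) -> (rmsecv, coefficients)."""
    rng = random.Random(seed)
    weights = [1.0] * n_vars                  # Step 1: equal initial weights
    n_draws = n_vars
    k = max(1, round(n_models * delta))       # number of retained sub-models
    best_rmsecv, best_subset = float("inf"), []
    for _iteration in range(max_iter):        # safety cap (assumption)
        models = []
        for _model in range(n_models):        # Step 1: weighted bootstrap sampling
            drawn = rng.choices(range(n_vars), weights=weights, k=n_draws)
            subset = sorted(set(drawn))
            rmsecv, coefs = evaluate(subset)  # Step 2: score the sub-model
            models.append((rmsecv, subset, coefs))
        models.sort(key=lambda m: m[0])       # Step 3: keep the k best sub-models
        top = models[:k]
        if top[0][0] < best_rmsecv:           # bookkeeping for Step 6
            best_rmsecv, best_subset = top[0][0], top[0][1]
        freq = [0.0] * n_vars                 # Step 4: frequency and |coefficient|
        beta = [0.0] * n_vars
        for _rmsecv, subset, coefs in top:
            for i, b in zip(subset, coefs):
                freq[i] += 1.0
                beta[i] += abs(b)
        f_sum = sum(freq)
        b_sum = sum(beta) or 1.0              # avoid division by zero
        # ASSUMPTION: convex fusion of normalized frequency and coefficient.
        weights = [alpha * f / f_sum + (1 - alpha) * b / b_sum
                   for f, b in zip(freq, beta)]
        mean_size = sum(len(m[1]) for m in models) / n_models
        if mean_size <= 1:                    # Step 5: stop at one variable
            break
        n_draws = max(1, int(mean_size))      # ASSUMPTION: next number of draws
    return best_subset, best_rmsecv           # Step 6

def toy_evaluate(subset):
    """Stand-in for a PLS sub-model: variables 2 and 5 are the informative ones."""
    informative = {2, 5}
    hits = sum(1 for i in subset if i in informative)
    rmsecv = 1.0 - 0.4 * hits + 0.01 * len(subset)
    coefs = [1.0 if i in informative else 0.05 for i in subset]
    return rmsecv, coefs

best_subset, best_rmsecv = frcboss(20, toy_evaluate)
```

In practice `evaluate` would fit a cross-validated PLS model on the selected columns of the spectral matrix; the toy evaluator only serves to make the control flow runnable.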

| Model building and evaluation
In this paper, the correlation coefficient of calibration (Rc), the root mean square error of cross validation (RMSECV), the correlation coefficient of prediction (Rp), and the root mean square error of prediction (RMSEP) are chosen as indicators to evaluate the performance of the calibration and prediction models (Tian et al., 2018). In the definitions of Rc and RMSECV, $n$ is the number of samples in the calibration set, $y_{i,a}$ is the measured value of the $i$-th sample in the calibration set, $y_{i,p}$ is the predicted value of the $i$-th sample in the calibration set, and $y_{i,cm}$ is the mean of the measured values in the calibration set. In the definitions of Rp and RMSEP, $m$ is the number of samples in the prediction set, $y_{j,a}$ is the measured value of the $j$-th sample in the prediction set, $y_{j,p}$ is the predicted value of the $j$-th sample in the prediction set, and $y_{j,pm}$ is the mean of the measured values in the prediction set.
After preprocessing, SiPLS, CARS, SPA, BOSS, FRCBOSS, and their combinations were used to select the useful variables, and PLSR models for predicting lignin content in 'Snow' pears were established and compared. The models were evaluated with Rc, RMSECV, Rp, and RMSEP.
Figure 5 shows the raw NIR diffuse reflectance spectra of all pear samples; the abscissa is the wavenumber in the range of 10,000-4000 cm−1 and the ordinate is absorbance. The trends of the NIR spectra are similar and mainly reflect overtone and combination information of C-H, O-H, and other chemical bond stretching. The two strong absorption peaks around 5155 and 6944 cm−1 are due to the third overtone of the O-H. There is a small absorption peak around 8403 cm−1, which is related to the third overtone of the C-H functional group (Ma et al., 2018).
These NIR spectra of pear samples not only contain the hidden information of the different components of the fruit, but also include baseline shifts, transformation, and noise. Therefore, the raw NIR spectra need to be preprocessed before modeling and analysis.
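The evaluation indexes introduced at the start of this section can be computed as in the sketch below. The formulas are assumptions that use only the quantities named in the text (measured values, predicted values, and the mean of the measured values): RMSE = sqrt(Σ(y_p − y_a)² / n) and R = sqrt(1 − Σ(y_p − y_a)² / Σ(y_a − ȳ_a)²). The function names and toy values are hypothetical.

```python
import math

def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(measured, predicted)) / n)

def correlation(measured, predicted):
    """ASSUMED form: R = sqrt(1 - SSE / SST), with SST taken about the measured mean."""
    mean = sum(measured) / len(measured)
    sse = sum((p - a) ** 2 for a, p in zip(measured, predicted))
    sst = sum((a - mean) ** 2 for a in measured)
    return math.sqrt(max(0.0, 1.0 - sse / sst))

# Toy lignin values in mg/g (hypothetical, not taken from Table 1).
y_measured = [70.0, 75.0, 80.0]
y_predicted = [71.0, 75.0, 79.0]
toy_rmse = rmse(y_measured, y_predicted)
toy_r = correlation(y_measured, y_predicted)
```

Applied to the calibration set under cross validation these functions would give RMSECV and Rc, and applied to the prediction set they would give RMSEP and Rp.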

| Effect of different pre-processing on the PLS model for lignin
PLS models are established using the full-spectrum data of the calibration set coupled with the corresponding reference values for lignin in pears. The Unscrambler X version 10.4 software is used to preprocess the raw NIR spectra by different methods. As shown in Figure 6, the characteristics of the spectra are more obvious after preprocessing with the five different methods. SG and NORM preprocessing are used to remove the random noise and the adverse effects in the spectra. SG can improve the signal-to-noise ratio of the sample signal but fails to eliminate the additive effect. NORM eliminates the adverse effects caused by large differences in the scale of the data.
The effects of the different pretreatments on the full-spectrum PLSR models, compared with the model based on the raw spectra, are provided in Table 2. The results show that the performance of the PLSR model was improved by the different pretreatments. For the prediction of lignin, the preprocessed spectra show lower RMSECV (ranging from 1.730% to 1.893%) and higher Rc (ranging from 0.760 to 0.806). NORM spectra give better calibration statistics than SG spectra, and the combination of NORM and SG preprocessing further improves the accuracy and robustness of the full-spectrum PLSR model. Notably, the spectra preprocessed by NORM and SG provide the strongest calibration with the lowest RMSECV. Therefore, the PLSR model based on the spectra preprocessed by the SG and NORM methods is the optimal model in the comparison in Table 2 (bold font with *).
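Two of the pretreatments compared in Table 2 can be written down in a few lines. This is a minimal sketch under stated assumptions: area normalization is taken to divide each spectrum by the sum of its absolute values, and SNV is taken to centre each spectrum and scale it by its sample standard deviation; the exact definitions used by the software may differ. The function names and toy spectrum are hypothetical.

```python
import statistics

def area_normalize(spectrum):
    """Divide a spectrum by the sum of its absolute values (assumed definition)."""
    area = sum(abs(x) for x in spectrum)
    return [x / area for x in spectrum]

def snv(spectrum):
    """Standard normal variate: centre one spectrum and scale it to unit deviation."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)   # sample standard deviation (n - 1)
    return [(x - mean) / sd for x in spectrum]

# Toy three-point spectrum (hypothetical values).
toy_spectrum = [1.0, 2.0, 3.0]
toy_norm = area_normalize(toy_spectrum)
toy_snv = snv(toy_spectrum)
```

Both operations act on one spectrum at a time, so they can be applied to calibration and prediction samples independently.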

| Optimization of PLS models by variable selection
In this study, synergy interval partial least squares (SiPLS) (Jiang et al., 2012), an optimized version of interval partial least squares (iPLS), is first used to select effective regions from the full wavelength range in order to simplify the PLSR model. In SiPLS, the full wavelength range is divided into 19 subintervals, and subintervals 2, 9, 17, and 19 are selected in synergy as the effective regions, with 655 variables in total (see Figures 7 and 8). CARS is then applied to the variables selected by SiPLS (SiPLS-CARS); the number of sampling runs is set to 70, and 10-fold cross validation is used to select the optimal number of sampled variables. Figure 8 shows that an optimal variable subset is obtained, with 28 sampled variables (effective wavelengths). SPA combined with SiPLS (SiPLS-SPA) can further reduce the number of effective variables and overcome the limitations of the SPA algorithm with low S/N or insufficiency in multivariate calibration. The final number of effective wavelengths selected by SiPLS-SPA is 19 (Figure 8). The BOSS method is used to choose effective variables from those selected by SiPLS (SiPLS-BOSS).
In BOSS, the maximum number of latent variables is set to 10, the number of cross-validation folds to 5, and the number of iterations to 1000. After running SiPLS, the redundant variables are removed by BOSS and 16 variables are retained (see Figure 8). In FRCBOSS, the number of sub-models N is 2300, the ratio of optimal models δ is 0.05, and the fusion coefficient of frequency and regression coefficient α is 0.4. SiPLS-FRCBOSS is then applied and the optimum solution, which consists of 19 variables, is obtained (see Figure 8). SiPLS, SiPLS-CARS, SiPLS-SPA, SiPLS-BOSS, and SiPLS-FRCBOSS are applied to select the characteristic wavelengths related to the quantitative analysis of lignin in pears, and the five variable selection algorithms are compared for NIR model optimization.
The less correlated variables are removed, and more concise and effective variable subsets are provided by the methods mentioned above. According to the different optimization strategies, the numbers of variables eliminated by each method can be arranged in the order SiPLS-BOSS > SiPLS-FRCBOSS ≥ SiPLS-SPA > SiPLS-CARS > SiPLS. The numbers of variables selected by SiPLS-SPA and SiPLS-FRCBOSS are similar, but those selected by SiPLS-FRCBOSS are more concentrated. The number of variables selected by SiPLS-CARS is somewhat larger than that selected by SiPLS-SPA, SiPLS-BOSS, and SiPLS-FRCBOSS. The BOSS method combines the strategies of model population analysis (MPA) and weighted bootstrap sampling (WBS) to extract useful information according to the weight of each variable (Deng et al., 2016). The weight of each variable is determined only by the absolute values of the regression coefficients. The method follows the principle of soft shrinkage, in which unimportant variables are not removed directly but are assigned smaller weights. The variable set with the lowest RMSECV is selected in the BOSS method. Although BOSS selects the fewest variables, important variables may be eliminated at the same time. The number of variables selected by FRCBOSS is a little larger than that selected by BOSS, because FRCBOSS selects the useful information according to both the absolute values and the stability of the regression coefficients and can therefore reselect variables removed by the BOSS method. SPA minimizes the collinearity of the NIR spectra through projection operations and selects only the variables with minimum redundant information for multivariate calibration (Xiaobo et al., 2010). The CARS method forcibly removes a large number of variables according to the absolute value of the regression coefficient in the early stage of variable selection (Li et al., 2009). However, the absolute value of a regression coefficient changes with the sample space, so the eliminated variables may include useful ones.
The effective wavelengths of the 'Snow' pear dataset extracted by the SiPLS, SiPLS-SPA, SiPLS-CARS, SiPLS-BOSS, and SiPLS-FRCBOSS methods have similar characteristics (see Figure 8). The SiPLS method selects the wavelengths of four synergy subintervals, so its number of variables is larger. The comparison of the five variable selection methods shows that SiPLS-FRCBOSS has the highest efficiency in selecting the characteristic absorptions that are closely related to the various chemical structures in the near-infrared spectrum.
FIGURE 5 The raw NIR spectra of 195 'Snow' pears.
The effective wavenumbers selected by the SiPLS-FRCBOSS method are mainly around 4000 cm−1, 4002 cm−1, 4007 cm−1, 4009 cm−1, 4048 cm−1, 4050 cm−1, 4262 cm−1, 4264 cm−1, 4312 cm−1, 4785 cm−1, 7176 cm−1, 7232 cm−1, 7234 cm−1, 7239 cm−1, 7342 cm−1, 7367 cm−1, 7376 cm−1, 7459 cm−1, and 7469 cm−1 (see Table 3 and Figure 8). For lignin, absorbance bands between about 4000 cm−1 and 4350 cm−1 and around 4785 cm−1 are assigned to combination bands of O-H stretching and C-O stretching, as well as C-H stretching and C=C stretching (Baillères et al., 2002; Yang et al., 2016). The strong absorption signal from 7342 cm−1 to 7450 cm−1 corresponds to the first overtone of O-H stretching from phenolic groups present in lignin (Yonenobu & Tsuchikawa, 2003). These characteristic absorptions are preserved in the variable subset based on SiPLS-FRCBOSS. In contrast, because of partial collinearity, some overlapping peaks selected by other methods, such as 9367 cm−1, 9623 cm−1, and 9664 cm−1, are easily removed from the variable subset by SiPLS-FRCBOSS, which ultimately affects the accuracy and robustness of the PLS model. On the other hand, the water in pears also has strong absorption peaks in the spectral range of 7176-7250 cm−1 selected by the five variable selection methods, which mainly include the first overtone of O-H stretching and a combination band (stretching and deformation) (Schwanninger et al., 2011). The broad and overlapping O-H bands due to water reduce the stability of the near-infrared analysis model for pear properties.
FIGURE 6 The NIR spectra of 'Snow' pears after different pretreatment methods.

| Modeling and comparison
FIGURE 9 Comparison of NIR models based on different wavelength selection methods.

PLS models based on the variables selected by SiPLS, SiPLS-SPA, SiPLS-CARS, SiPLS-BOSS, and SiPLS-FRCBOSS were established to measure lignin content in 'Snow' pears non-destructively. The results are listed in Table 3. In addition, the values predicted with NIRS are compared with the reference values obtained with the traditional Klason method in Figure 9. In cross validation, the PLS models optimized with selected variables show lower RMSECV (1.458%-1.580%) and higher Rc (0.838-0.863) than the full-spectrum models, which provides preliminary evidence that variable selection is feasible for model optimization. SiPLS-SPA and SiPLS-CARS simplify the PLS model successfully (Figure 9b and c), but the prediction ability of the resulting models is also reduced: Rp is 0.826 and 0.807 and RMSEP is 1.248 and 1.303, respectively (see Table 3). The PLS model based on the 16 variables selected by SiPLS-BOSS (the SiPLS-BOSS-PLS model, Figure 9d) has good prediction ability, with an Rp of 0.850 and an RMSEP of 1.154 (Table 3). However, it does not predict as well as the SiPLS-FRCBOSS-PLS model, probably because the BOSS algorithm selects feature wavelengths only according to the absolute value of the regression coefficient and neglects the frequency of variables in the variable space, so that some useful wavelengths are removed. The results of PLS modeling based on the 19 variables selected by SiPLS-FRCBOSS are encouraging (Figure 9e), and the SiPLS-FRCBOSS-PLS model has the best prediction ability (see Table 3), with an Rp of 0.880 and an RMSEP of 1.004%. Although SiPLS-FRCBOSS selects slightly more variables than SiPLS-BOSS, the FRCBOSS algorithm can reselect useful wavelengths removed by the BOSS method. This indicates that the FRCBOSS algorithm proposed in this paper can overcome this disadvantage of BOSS in selecting variables. The Klason method is a traditional way to measure lignin content, but it is destructive and slow, whereas NIRS is non-destructive and fast. Therefore, compared with the traditional measurement method, NIRS is better suited to the analysis of lignin content in 'Snow' pears.
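For readers who wish to reproduce the comparison statistics, the sketch below shows one common way to compute a root mean square error and a correlation coefficient between reference (Klason) and predicted (NIR) values. The definitions used in this study are not restated in this section, so the formulas here are assumptions: RMSEP is taken as the square root of the mean squared prediction error, and Rp as the Pearson correlation coefficient. The sample values are hypothetical and are not data from Table 3.

```python
import math


def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(reference, predicted):
    """Pearson correlation coefficient between reference and predicted values."""
    if len(reference) != len(predicted) or len(reference) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_r * var_p)


if __name__ == "__main__":
    # Hypothetical lignin contents (%), for illustration only.
    klason = [6.0, 7.0, 8.0, 9.0]
    nir = [6.5, 6.5, 8.5, 8.5]
    print(f"RMSEP = {rmse(klason, nir):.3f}, Rp = {pearson_r(klason, nir):.3f}")
```

Applied to the cross-validation predictions instead of the prediction set, the same two functions would give RMSECV and Rc under the same assumed definitions.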
In recent years, lignin has also been determined in other materials (such as wood, rice straw, fine roots, and flax fiber). For lignin in sugarcane, the obtained values are in the range of 16.23%-33.85% with a mean value of 23.44%. The OPS method combined with PLS regression allows simpler, more interpretable, and more predictive models to be built for the determination of lignin in sugarcane, with Rc, RMSECV, Rp, and RMSEP for bagasse-with-juice equal to 0.94, 0.71, 0.94, and 0.65, respectively (Assis et al., 2017). The lignin content in pulp wood ranges from 14.82% to 34.2%, with an average of 26.43%. NIR spectra obtained with a portable instrument, combined with KELM, PSO, and MSC methods, were used to predict the lignin content in various types of pulp wood, with Rp = 0.981, RMSEP = 0.958, and RPD = 5.015 (Liang et al., 2020). The mean values of the three major lignin monomers (H, S, and G) in rice are 0.202%, 0.821%, and 1.33%, respectively, and the average total lignin (H+S+G) is 2.353%. Four modified partial least squares models were built for rapid prediction of H, S, G, and H+S+G, with acceptable determination coefficients for calibration and external validation that ranged from 0.85 to 0.93 and from 0.82 to 0.88, respectively (Hu et al., 2018). The range of lignin content in the fine root samples (10%-43%) exceeds that typically reported in the literature; the lignin contents of grass species fine roots suggest a mean of 20.5% ± 7.7%. The authors used CARS-PLS to select the most relevant wavelengths for root lignin prediction from the full spectrum, with Rp, RMSEP, and RPD of 0.86, 2.82%, and 2.67, respectively (Elle et al., 2019). Lignin contents of flax samples ranged from 1.27% to 7.06%, with an SD of 1.37 and an average of 3.35%. For the lignin PLS model, the reduced range (6900-5600 cm−1) gave significantly better model fit than the wider range (12,500-4000 cm−1), with Rc of 0.936, Rp of 0.769, RMSEC of 0.351%, RMSEP of 0.455%, and RPD of 2.366 (Huang & Yu, 2019).
In this study, the mean lignin content is 7.727%, and the SiPLS-FRCBOSS-PLS model built for the determination of lignin in pears has Rc of 0.863, Rp of 0.880, RMSECV of 1.458%, and RMSEP of 1.004%. All of the data above are summarized in Table 4. With NIR spectroscopy, the prediction ability for lignin in sugarcane, wood, and root is better than that in rice and pear. The lignin content in rice and pear is very low, which adversely affects both the determination of the lignin reference value by the Klason method and the acquisition of the NIR spectrum. Table 4 shows that the models for measuring lignin in 'Snow' pear and rice straw perform at a comparable level. Non-destructive and rapid analysis of lignin content in 'Snow' pears is therefore feasible using NIR spectroscopy combined with the SiPLS-FRCBOSS-PLS model, which gave the best results among the models compared in this study.
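One way to make the effect of low lignin content concrete is to express each RMSEP relative to the mean lignin content of the corresponding material. This normalization is not reported in the studies cited above; it is added here only as a rough illustration. It assumes that each RMSEP is expressed in percentage points of lignin content and that the quoted mean applies to the same sample set as the quoted RMSEP, neither of which is confirmed in the text.

```python
# (material, mean lignin content in %, RMSEP) as quoted in the text.
STUDIES = [
    ("sugarcane (bagasse-with-juice)", 23.44, 0.65),
    ("pulp wood", 26.43, 0.958),
    ("flax", 3.35, 0.455),
    ("'Snow' pear (this study)", 7.727, 1.004),
]


def relative_rmsep(rmsep, mean_content):
    """RMSEP as a fraction of the mean lignin content (assumed same units)."""
    if mean_content <= 0:
        raise ValueError("mean content must be positive")
    return rmsep / mean_content


def rank_by_relative_error(studies):
    """Return (material, relative RMSEP) pairs, smallest relative error first."""
    rows = [(name, relative_rmsep(rmsep, mean)) for name, mean, rmsep in studies]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, rel in rank_by_relative_error(STUDIES):
        print(f"{name}: RMSEP is {100 * rel:.1f}% of the mean lignin content")
```

Under these assumptions, the relative error is a few percent for sugarcane and pulp wood and roughly 13% for pear and flax, which is consistent with the observation that low-lignin materials are harder to model.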

| CONCLUSIONS
Our experimental results show that the FRCBOSS algorithm proposed in this paper has advantages over the BOSS method in selecting effective variables, because the BOSS algorithm considers only the absolute value of the regression coefficient and neglects the frequency of variables in the variable space. The PLS model based on variables selected by the FRCBOSS method was applied to non-destructive testing of the lignin content in 'Snow' pears with promising results. The very low lignin content in rice and pear presents a particular challenge for determining the lignin reference value by the Klason method and for acquiring the NIR spectrum. The models presented here are an effective and viable option for measuring the low lignin content in 'Snow' pear and rice straw. The application of NIR spectroscopy combined with the SiPLS-FRCBOSS-PLS algorithm is feasible for rapid and non-destructive analysis of lignin content in 'Snow' pears.