Determination of Cellulose Crystallinity of Banana Residues Using Near Infrared Spectroscopy and Multivariate Analysis

Crystallinity is an important property of lignocellulosic biomass because of its significant effect on acid/enzymatic hydrolysis. Normally, physicochemical analyses, such as powder X-ray diffraction and nuclear magnetic resonance, are used to determine the crystallinity content. However, these analytical methods are expensive and laborious. In this context, methods that rapidly predict the crystallinity are important, even if used only for screening calibration. Thus, we intend to show the potential of near-infrared spectroscopy (NIRS) and chemometrics to replace reference methods in crystallinity determination. The results show that NIRS can be used to determine crystallinity in banana residues using partial least squares regression, providing good coefficients of determination (R²cal,pred ≥ 0.82), low relative errors (< 14%) and a good range error ratio (≥ 7.7). The interpretation of the regression coefficients, the multivariate figures of merit and the external validation results indicate a strong relationship between the NIR spectrum and crystallinity in banana samples.


Introduction
Cellulose is a natural polymer consisting of a linear chain of β(1→4)-linked glucose molecules. Each repeating unit contains hydroxyl groups able to form hydrogen bonds between cellulose chains, which govern the physical properties of cellulose. 1,2 The intrachain hydrogen bonding between hydroxyl groups and oxygens stabilizes the linkage and results in the linear configuration of the cellulose chain. 2 During cellulose formation, van der Waals forces and intermolecular hydrogen bonds between hydroxyl groups and oxygens of adjacent molecules promote the aggregation of cellulose chains to form microfibrils. These microfibrils contain two different regions. The crystalline region consists of highly ordered cellulose molecules, while the molecules in the amorphous region are less ordered. 2,3 The major part of cellulose (approximately 2/3 of the total cellulose) is in the crystalline form. 4 Seven different crystalline forms of cellulose have been identified by X-ray diffraction (XRD), with distinct physical and chemical characteristics. 5 The extensive hydrogen bonding and compact structure of crystalline cellulose hinder the hydrolysis process, while the amorphous region tends to be easily hydrolyzed by acids/enzymes. 6 Normally, a partial hydrolysis occurs, which removes the amorphous regions from cellulose, increasing the proportion of the crystalline region that is resistant to further hydrolysis. 7 Therefore, concentrated acids and/or a high amount of enzymes are used in acid and enzymatic hydrolysis, respectively, to reduce the crystallinity of cellulose as much as possible and fully convert it to the amorphous state. 8 Because the crystallinity of a lignocellulosic material is considered one of the main factors influencing the effectiveness of acid/enzymatic hydrolysis, [9][10][11] it is important to know the level of crystallinity before initiating subsequent steps, so that the quantities of reagents required can be optimized and the costs and time of analysis reduced.
XRD and solid-state 13C nuclear magnetic resonance (NMR) are currently used to determine the crystallinity of lignocellulosic biomass. 12,13 However, they are not always suitable for crystallinity estimation, especially for screening purposes. Moreover, the disadvantages of the XRD and NMR techniques, such as being complex, time consuming and expensive, limit their use. In this context, near-infrared (NIR) spectroscopy is a fast, non-destructive and simple technique, suitable to replace the traditional methods. 14 This technique, based on vibrational spectroscopy, makes it possible to reveal physical properties such as the crystallinity content, [15][16][17][18] because cellulose crystallinity, which involves intermolecular hydrogen bonds and crystalline networks, is clearly evident in the infrared spectra. As C-O and O-H stretching and C-H deformation are the vibrational modes predominating in the NIR region, this region is expected to be influenced by the crystallinity. 15

To evaluate a physical property using NIR spectra, multivariate methods such as partial least squares (PLS) can be used to build a regression model that makes the quantification possible. The process requires a calibration data set, for which the reference values for the property of interest and the measured NIR spectra are known for all samples. [17][18] Kelley et al. 19 used NIR and PLS regression models to determine the crystallinity content in loblolly pine wood, and the results obtained were of poor quality, with R²cal and R²pred of 0.52 and 0.15, respectively, for a model with 2 latent variables (LV). Qu et al. 17 investigated the ability of NIR to predict the crystallinity of wood. For a PLS model with 8 LV, it was possible to achieve R²cal and R²pred of 0.93 and 0.72, respectively. Jiang et al. 16 also evaluated wood crystallinity. These authors obtained excellent results (R²cal and R²pred values of 0.95 and 0.86, respectively), showing that the NIR data were well correlated with the crystallinity determined by X-ray diffraction. They also obtained satisfactory results for the range error ratio (RER) and the relative standard deviation (RSD). However, the quality of the models in the works mentioned above was not assured by statistical parameters such as the figures of merit.
In this work, NIR spectra and multivariate methods have been applied to rapidly determine the crystallinity of cellulose in banana residues with satisfactory results. The quality of the models obtained is ensured by the determination of the figures of merit, RER and RSD values, an external validation set, and interpretation of the regression coefficients.

Samples
Sixty-nine samples of banana were obtained and subjected to further analysis. They are distributed among stalk, stem, rhizome, rachis and leaves. The identification, fraction, origin, species and year of harvest of these samples are indicated in Table 1.
Approximately 500 g of each biomass was cut into small pieces, mixed, and dried at 105 °C in an oven until constant weight. The samples were then ground in a Romer micro mill (Romer Labs, São Paulo, Brazil) equipped with a number 10 mesh and then sieved through a number 40 mesh.
After sieving, the samples were submitted to an extraction process (ethanol 95%, 100 °C, 1500 psi) in a Dionex ASE 200 system (Thermo Fisher Scientific, Waltham, MA, USA) to assess whether the extractives have substantial influence on the cellulose crystallinity.

XRD analysis
The reference values of crystallinity were determined by XRD. The diffractograms were recorded using an X-ray diffractometer (XRD 7000 Shimadzu) with Cu Kα radiation, a voltage of 30 kV and a current of 20 mA. The scanning range was from 2θ = 5° to 50° at a scan speed of 0.071° s⁻¹.
There are several methods in the literature for calculating the crystalline content from the diffractogram, 20 and two of them were applied in this work. In the first one, designated method A, the crystallinity index (CI) of a given sample was calculated by subtracting the minimum intensity between the 101 and 002 peaks (amorphous band, Iam) from the maximum intensity of the 002 peak, which represents the crystalline portions (Ic), and then taking the ratio between the difference and the total intensity, 12 according to equation 1. Figure 1a shows an example of the crystalline and amorphous peaks used in this equation.

$$\mathrm{CI}\ (\%) = \frac{I_c - I_{am}}{I_c} \times 100 \qquad (1)$$

The second approach, called method B, is a deconvolution method. Individual peaks were fitted by Gaussian functions, as shown in Figure 1b. For this purpose, a peak fitting program (PeakFit; www.systat.com) was used, and iterations continued until the convergence of χ² ‡, which corresponds to an R² value greater than 0.94 for all deconvolutions. The sum of the areas under the fitted crystalline peaks (Ic), designated as 101, 10$\bar{1}$ and 002 in Figure 1b, and the area of the broad amorphous band (Iam) were used to calculate the CI 13,21 according to equation 2.

$$\mathrm{CI}\ (\%) = \frac{I_c}{I_c + I_{am}} \times 100 \qquad (2)$$
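The two calculations can be sketched in a few lines of Python. This is only an illustration: it assumes that equation 1 is the usual peak-height ratio, CI = 100 (Ic − Iam)/Ic, and that equation 2 is the area ratio, CI = 100 Ic/(Ic + Iam). The function names and the numerical inputs are hypothetical and are not taken from the banana diffractograms.

```python
def crystallinity_index_intensity(i_crystalline, i_amorphous):
    """Method A (assumed form): peak-height ratio, in percent.

    i_crystalline: maximum intensity of the 002 peak (Ic).
    i_amorphous:   minimum intensity taken as the amorphous band (Iam).
    """
    return 100.0 * (i_crystalline - i_amorphous) / i_crystalline


def crystallinity_index_area(crystalline_areas, amorphous_area):
    """Method B (assumed form): area ratio after deconvolution, in percent.

    crystalline_areas: areas of the fitted crystalline peaks (summed to Ic).
    amorphous_area:    area of the broad amorphous band (Iam).
    """
    total_crystalline = sum(crystalline_areas)
    return 100.0 * total_crystalline / (total_crystalline + amorphous_area)


# Hypothetical values, for illustration only.
ci_a = crystallinity_index_intensity(1000.0, 500.0)
ci_b = crystallinity_index_area([10.0, 5.0, 25.0], 160.0)
print(ci_a, ci_b)
```

Because method A uses a single valley intensity while method B integrates the whole amorphous band, the two functions generally return different values for the same diffractogram.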

NIR analysis
The NIR diffuse reflectance spectra were acquired using a FOSS XDS spectrometer (FOSS, Hillerød, Denmark) equipped with a Rapid Content Analyzer (RCA) module. Spectra from 1100 to 2500 nm were collected at a grating resolution specified as 0.5 nm. Three spectra were recorded for each sample, and the average spectrum was used for data analysis.

Data analysis
Diffractograms were explored by principal component analysis (PCA) on mean-centered raw data to reveal the hidden structure within the XRD data set. In this method, a small set of orthogonal principal components that maximizes the variance in the data set is defined. The dimensionality of the data set is reduced, providing a visual representation of the relationships between banana samples and variables. 22

The collected NIR spectra of the banana fractions were used to construct a regression model that relates the matrix (X) containing the spectral data to the vector (y) representing the crystallinity content. PLS was used to obtain the calibration models. In this method, 22 factors (latent variables) that relate X and y are obtained by maximizing the covariance between the X scores (t) and y, such that Xw = t.
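To make the role of the scores t = Xw concrete, the following is a minimal NIPALS-style PLS1 sketch in Python with NumPy. It is an illustration under stated assumptions, not the Unscrambler implementation used in this work: the function names `pls1_fit` and `pls1_predict` are hypothetical, X and y are mean-centered internally, and no cross validation or variable selection is included. For each latent variable, the weight vector w is taken proportional to Xᵀy, which is the direction that maximizes the covariance between the scores and y.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit a PLS1 model; returns the regression vector b and intercept b0."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean          # mean-centered residual blocks
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                        # direction of maximal covariance
        w /= np.linalg.norm(w)
        t = E @ w                          # X scores for this latent variable
        tt = t @ t
        p = E.T @ t / tt                   # X loadings
        c = f @ t / tt                     # regression of y on the scores
        E = E - np.outer(t, p)             # deflate X
        f = f - c * t                      # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)    # coefficients in the X variables
    b0 = y_mean - x_mean @ b
    return b, b0


def pls1_predict(X, b, b0):
    """Predict y for new spectra X."""
    return np.asarray(X, dtype=float) @ b + b0
```

The regression vector b returned here plays the same role as the regression coefficients interpreted later in the text; with as many latent variables as the rank of X, the model reduces to ordinary least squares.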
For quantification, the NIR spectra were pretreated by a Savitzky-Golay second derivative 23 computed using a window of 31 points and a second-order polynomial. The original data set was randomly split into a calibration set (75% of the samples) and a prediction set (25% of the samples). The number of LV in the calibration model was determined by cross validation, based on the occurrence of the minimal residual variance, 24 or visually when the minimum did not exist, to avoid overfitting. 25 An automatic uncertainty test (Martens' uncertainty test) was applied to select the significant variables in the multicomponent model. 26 Prediction performance was evaluated using the following parameters: the coefficient of determination in calibration (R²cal), in cross validation (R²cv) and in external validation (R²pred); the root mean square error of calibration (RMSEC); the root mean square error of cross validation (RMSECV); the root mean square error of prediction (RMSEP); the range error ratio (RER); 27 the RSD; 14 the number of LV; and the number of outliers excluded.

‡ In χ², v o is the experimental value and v e is the expected value.
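The random split and two of the evaluation parameters can be sketched as follows. This is a hedged illustration using only the Python standard library: the function names are hypothetical, the seed is arbitrary, R² is assumed to be 1 − SSres/SStot (some software reports the squared correlation instead), and the RMSE divides by the number of samples without any degrees-of-freedom correction.

```python
import math
import random


def split_calibration_prediction(n_samples, calibration_fraction=0.75, seed=0):
    """Randomly assign sample indices to calibration and prediction sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_cal = round(calibration_fraction * n_samples)
    return sorted(indices[:n_cal]), sorted(indices[n_cal:])


def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(reference, predicted):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


calibration, prediction = split_calibration_prediction(69)
print(len(calibration), len(prediction))
```

The same `rmse` function gives RMSEC, RMSECV or RMSEP depending on whether the predictions come from the calibration fit, from cross validation or from the external prediction set.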
The modeling is incomplete without interpretation of the regression coefficients. From the chemical point of view, a suitable interpretation of the regression coefficients in terms of a cause-effect relationship is highly desirable. 28 Additionally, to ensure the performance of the models, figures of merit were evaluated. 14 Multivariate data analyses (PCA and PLS) were performed using the Unscrambler 10.2 (Camo Software, Oslo, Norway), and the calculation of the figures of merit was conducted using the PLS-toolbox 6.7 (Eigenvector Research, Wenatchee, WA, USA) for Matlab 7.2 software (MathWorks, South Natick, MA, USA).

Results and Discussion
NIR spectra from the banana residues are shown in Figure 2a, with the greatest variation occurring in the regions of 1400-1600 and 1900-2400 nm. The main bands are located at 1428-1430, 1920, 2100, 2270 and 2329 nm. The band at 1428-1430 nm is assigned to amorphous regions in cellulose (first overtone of O-H stretching), while the band at 1920 nm is attributed to the O-H stretch/O-H bend of polysaccharides, which overlaps with the water band. 29 The broad band at 2100 nm can be assigned to OH stretching + CH deformation in cellulose. Both bands at 2270 and 2329 nm are from polysaccharides. 18,29,30 The first one is related to CH₂ stretching + CH₂ deformation from crystalline fractions of cellulose, and the second is related to the CH stretching + CH deformation combination from semi/crystalline regions.
Figure 2b shows the NIR spectra after being pretreated with the second derivative (window size of 31 points and second degree polynomial) to remove the baseline offset and to elucidate the peaks corresponding to the crystalline and amorphous structures.
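The effect of this pretreatment can be illustrated with a short NumPy sketch of a Savitzky-Golay second derivative. It is an assumption-laden example rather than the routine used to produce Figure 2b: the function name is hypothetical, the wavelengths are assumed to be equally spaced, and only interior points are returned, so the output is shorter than the input by window − 1 points. A constant offset and a linear slope both vanish in the second derivative, which is why the baseline is removed.

```python
import numpy as np


def savgol_second_derivative(y, window=31, polyorder=2, delta=1.0):
    """Second derivative by local least-squares polynomial fitting.

    y:      spectrum sampled at equally spaced wavelengths.
    window: odd number of points in the moving window.
    delta:  wavelength spacing (e.g. 0.5 for 0.5 nm steps).
    """
    if window % 2 == 0 or window <= polyorder or polyorder < 2:
        raise ValueError("need an odd window > polyorder and polyorder >= 2")
    half = window // 2
    # Powers of the offsets -half..half inside one window: columns 1, x, x^2, ...
    offsets = np.arange(-half, half + 1)
    design = np.vander(offsets, polyorder + 1, increasing=True)
    # Row 2 of the pseudo-inverse gives the fitted x^2 coefficient;
    # the second derivative at the window center is twice that coefficient.
    coeffs = 2.0 * np.linalg.pinv(design)[2] / delta**2
    # Slide the window over the spectrum (interior points only).
    return np.correlate(np.asarray(y, dtype=float), coeffs, mode="valid")


# Synthetic check: a quadratic baseline has a constant second derivative.
wavelength = np.arange(200) * 0.5
spectrum = 1.5 + 0.2 * wavelength + 0.3 * wavelength**2
d2 = savgol_second_derivative(spectrum, window=31, polyorder=2, delta=0.5)
```

For real spectra, a library routine with edge handling (for example a Savitzky-Golay filter from a signal-processing package) would normally be preferred over this bare-bones version.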
The mean and standard deviation plots of the CI obtained by the two methods discussed in the previous section for each fraction in the 69 banana samples are shown in Figure 3. The highest ranges in CI were observed for banana stem, calculated by both method A (37.81-56.60) and method B (6.65-23.56), followed by rhizome calculated by method A (27.40-40.99). The lowest ranges were observed for leaves (10.09-12.42) and rhizome (6.68-10.10), both obtained by method B. The stalk ranged from 41.14 to 52.70 and from 7.61 to 14.27 for methods A and B, respectively, while the rachis presented a range of 61.88-66.74 and 22.62-26.79 for methods A and B, respectively. Finally, the CI values for leaves calculated by method A varied from 29.39 to 34.16.
Crystallinity in banana residues was reported for the first time by Guimarães et al., 31 but only for the pseudostem fraction. The values of CI in this work obtained by method A (intensity of peaks) are higher (ca. 10%) than the values reported. 31 This small difference could be due to distinct species, cultivars, soils and years of sampling.
The values of CI calculated by method A are always higher than the values obtained using method B (Figure 3), most likely due to the underestimation of the amorphous peak intensity, because the valley is used to estimate the amorphous contribution (see Figure 1a) in the method that uses the intensities. 13,32

Principal component analysis
The analysis of the PCA scores based on the mean-centered diffractograms (Figure 4a) shows significant overlap, with some tendency towards separation between banana fractions. The 3 rachis samples are clearly separated from the other fractions in PC1.
The first two PC explained 82% and 6% of the total variance, respectively. Some discrimination could be observed between the leaves/rhizome group and the stalk/stem group, owing to the similarity in CI content within each group. Three samples from rachis showed greater dissimilarity, most likely due to their high crystallinity content, which provides a different spectral profile. PC1 and PC2 are characterized, respectively, by positive and negative loadings (Figure 4b) at 22° < 2θ < 23° and 15° < 2θ < 17°, which are typical of crystalline structures. 18 PC1 differentiates the banana fractions (stem and rachis) with positive scores associated with crystalline parts. Based on the PC2 loadings, the negative bands attributed to crystalline parts differentiate the banana fractions (stem, stalk and rachis) with crystalline characteristics from the leaves and rhizome, which have more amorphous characteristics.
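The percentages of explained variance quoted above are the kind of quantity obtained from a PCA of mean-centered data. A minimal sketch via the singular value decomposition is given below; the function name and the toy data are hypothetical, and the calculation is not claimed to reproduce the Unscrambler output.

```python
import numpy as np


def pca_explained_variance(data, n_components=2):
    """PCA of mean-centered data via SVD.

    Returns the scores, the loadings and the fraction of total variance
    explained by each of the first n_components principal components.
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)            # mean-centering
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)                # variance fractions
    scores = u[:, :n_components] * s[:n_components]
    loadings = vt[:n_components]
    return scores, loadings, explained[:n_components]


# Toy data set (rows = samples, columns = variables), for illustration only.
toy = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
scores, loadings, explained = pca_explained_variance(toy)
print(explained)
```

In the diffractogram case, each row would be one banana sample and each column one 2θ position, so the loadings can be plotted against 2θ as in Figure 4b.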

Parameters for model evaluation and validation of PLS models
PLS regression models were built on the mean-centered NIR spectra after second-derivative pretreatment (31 points and 2nd-degree polynomial) and feature selection using Martens' uncertainty test. Table 2 shows the results from the regression models to predict the crystallinity percentage by the two different methods (A and B). Six LV were employed in both models. This high number of LV could be explained by the fact that different crystalline forms 5 absorb in different regions of the spectrum (as seen in the regression coefficients), so that a single factor cannot explain all the variability, which justifies using more factors to model this physical property.
A linear fit was obtained between the reference and predicted crystallinity, with R²cal,val of 0.89 and 0.82, respectively, for method A, and R²cal,val of 0.86 and 0.85 for method B (see Figure 5).
The values of RMSE for calibration, cross-validation and prediction were larger for method A because the crystallinity reference values were much higher for this method, and so the relative error is a better parameter to compare the results from the two methods. The relative errors were significantly different, being 6.5% for method A and twice as high (13.9%) for the crystallinity determined by area (method B). Both RER values were above 4, indicating that, according to the American Association of Cereal Chemists (AACC), 33 both models are qualified for screening calibration, and method A is appropriate for quality control, with an RER value equal to 11.
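The RER criterion can be written out explicitly. In the sketch below, the RER is taken as the range of the reference values divided by the RMSEP, and the threshold of 4 for screening comes from the text above; the threshold of 10 for quality control is an assumption based on commonly quoted AACC guidance and should be checked against the cited method. The function names and the numerical inputs are hypothetical.

```python
def range_error_ratio(reference_values, rmsep):
    """RER: range of the reference values divided by the prediction error."""
    return (max(reference_values) - min(reference_values)) / rmsep


def classify_rer(rer, screening_threshold=4.0, quality_control_threshold=10.0):
    """Map an RER value to a suggested use of the calibration.

    The quality-control threshold is an assumed, commonly quoted value.
    """
    if rer >= quality_control_threshold:
        return "quality control"
    if rer >= screening_threshold:
        return "screening"
    return "not recommended"


# Hypothetical reference range and error, chosen to give RER = 11.
rer = range_error_ratio([20.0, 42.0, 64.0], 4.0)
print(rer, classify_rer(rer))
```

With these assumed thresholds, an RER of 11 falls in the quality-control class and an RER of 7.7 in the screening class, which matches the way the two models are described here.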
The results found for method A are in accordance with results from the literature 17 for a model with approximately the same number of LV (7) and similar R²cal (0.87) and R²val (0.83). Kelley, Elder, and Groom, 19 when evaluating the crystallinity of wood, obtained a poor correlation between crystallinity and NIR spectra (R²cal,val < 0.50). Jiang et al. 16 also evaluated the crystallinity of wood samples and obtained excellent results using the full spectrum (Vis-NIR), with R²cal,val of 0.95 and 0.86 for a 5 LV model. Their percentage error (6%) was the same as that reported in Table 2, but better RER values were obtained in this work (11 versus 6). It should be noted that including the visible spectral region did not improve our crystallinity model.
Regarding method B, the literature also reports satisfactory results. The crystallinity of tacrolimus solid dispersions evaluated by NIR, 18 when the area contributions from the crystalline and amorphous phases of the diffractograms were considered, produced good results (R²cal,val of 0.99 and 0.93, respectively).
The intensity method (method A) gives an empirical measurement that allows rapid comparison of the crystallinity of samples. This method is useful for comparing the relative differences among samples and should not be used for estimating the real crystallinity. The major problem with this method is that the minimum position between the 002 and the 101 peaks (Figure 1a) is usually not aligned with the maximum of the broad amorphous cellulose band, which is likely higher, and so the Iam value for the intensity method could be significantly underestimated, resulting in an overestimation of the CI. 13 This justifies the higher crystallinity values calculated by method A when compared to method B.
Although the intensity method (method A) does not provide the best estimate of cellulose crystallinity, this method presented the best regression model and is also the reference method most commonly used in the literature for crystallinity determination in biomass by NIR spectroscopy. 16,34 The main source of error in method B is most likely the overestimation of the amorphous contribution given by the broad band in Figure 1b. A quick way to solve the problem would be to subtract the amorphous contribution from the diffractogram using an amorphous pattern. 13,18,35 Besides, none of these Gaussian functions could model the scattering pattern perfectly throughout the entire angle range. 36 So this method tends to give higher amorphous values and lower CI. Xu et al. 19 suggested that, when studying crystallinity in biomasses, attention should be paid to cellulose rather than whole biomass, and that the Rietveld method 36,37 for CI calculation should be preferred over the intensity methods.
To complete the modeling, regression coefficients from PLS models on pretreated data for method A (Figure 6a) and method B (Figure 6b) were interpreted together with the derivative spectra (Figure 6c).
They exhibit typical bands of crystalline cellulose at 1480, 1589, 1830, 1906, 1962 and 2070 nm (all associated with the O-H stretch, 1st overtone). A negative relationship was found in the regions of 1340, 1428/1430, and 1704 and at 2064 nm (O-H combination), with bands typical of amorphous cellulose. 18,30,38 Typical polysaccharide bands (1669 and 2270 nm) were also found for the two regression coefficients. For both models, negative coefficients correspond to a direct relationship because these regression coefficients were obtained from the second derivative spectra. The most prominent bands of crystalline cellulose reported in the literature are at 1480, 1589 and 2070 nm. 30,38,39 All of them presented higher regression coefficients in the PLS model for method A, which indicates that method A is better able to capture the relevant information for determining the crystallinity than method B. The same is observed for the amorphous bands, with larger regression coefficients for model A than for model B (1340, 1428/1430, 1704 and at 2064 nm).
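The sign reversal introduced by the second derivative can be seen from a single band. As an illustrative assumption only (the NIR bands were not fitted with this model), consider a Gaussian band of height A > 0 centered at λ0 with width σ:

```latex
g(\lambda) = A \exp\!\left(-\frac{(\lambda-\lambda_0)^2}{2\sigma^2}\right),
\qquad
g''(\lambda) = \left(\frac{(\lambda-\lambda_0)^2}{\sigma^4} - \frac{1}{\sigma^2}\right) g(\lambda),
\qquad
g''(\lambda_0) = -\frac{A}{\sigma^2} < 0 .
```

At the band center, the second derivative is negative and becomes more negative as the band height A increases. A band whose intensity grows with crystallinity therefore receives a negative regression coefficient when the model is built on second derivative spectra.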
All the figures of merit for multivariate calibration, such as sensitivity (SEN), analytical sensitivity (γ), selectivity (SEL), signal-to-noise ratio, limit of detection (LOD) and limit of quantification (LOQ) were calculated, 14 and the results obtained are acceptable (Table 3).
The RMSEP and RMSEC values were less than 4%, and the deviation values between the reference and predicted values were less than 10%. Few outliers were removed (< 3%). The SEL of these methods indicates that 4% and 7% of the information modeled in methods A and B, respectively, is due to the analyte.
The SEN values are directly affected by the pretreatment used. The derivative spectrum has small intensities, requiring large regression coefficients for the conversion to analyte concentration and leading to small sensitivity values. 40 Therefore, the low sensitivity values (10⁻⁵ and 10⁻⁴) obtained in this work are not surprising, given the derivative pretreatment. The inverse of the analytical sensitivity (γ⁻¹) expresses the minimum concentration difference that is discernible by a method considering the random experimental noise, 14 and presented values smaller than 0.0011%.
The LODs obtained (0.0034 and 0.0012% of crystallinity) are very low compared to the minimum experimental value (20%).The LOQs of 0.0113 and 0.0039% were also lower than the minimum value observed (20%), thus confirming the applicability of both models.
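For orientation, the figures of merit discussed above can be computed from the regression vector and an estimate of the instrumental noise. The sketch below uses definitions that are common in the multivariate calibration literature and are assumed here rather than taken from this work: SEN = 1/‖b‖, γ = SEN/δx, LOD = k·δx/SEN and LOQ = 10·δx/SEN, where δx is the standard deviation of the spectral noise. The factor k = 3 is also an assumption (3.3 is used by some authors), and the function name and inputs are hypothetical.

```python
import math


def figures_of_merit(regression_coefficients, noise_std,
                     lod_factor=3.0, loq_factor=10.0):
    """Sensitivity-based figures of merit for an inverse calibration model.

    regression_coefficients: PLS regression vector b.
    noise_std:               standard deviation of the spectral noise.
    lod_factor, loq_factor:  assumed multipliers for LOD and LOQ.
    """
    norm_b = math.sqrt(sum(b * b for b in regression_coefficients))
    sen = 1.0 / norm_b                     # sensitivity
    gamma = sen / noise_std                # analytical sensitivity
    return {
        "SEN": sen,
        "gamma": gamma,
        "gamma_inv": 1.0 / gamma,          # minimum discernible difference
        "LOD": lod_factor * noise_std / sen,
        "LOQ": loq_factor * noise_std / sen,
    }


# Hypothetical regression vector and noise level, for illustration only.
fom = figures_of_merit([3.0, 4.0], noise_std=0.02)
print(fom)
```

Under these definitions, large regression coefficients give a small SEN, which is consistent with the low sensitivity values reported for the derivative-pretreated spectra.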
The linearity can be confirmed by the plots in Figure 5, which show that the points are reasonably well distributed around the diagonal line, ensuring that both methods A and B follow linear trends.

Conclusions
The results demonstrated that NIR spectra together with multivariate analysis can be used to determine the crystallinity content in banana residues, independent of the method used to measure the crystallinity. For both models, satisfactory results were obtained, providing R²cal,pred ≥ 0.82 and reasonable results for RMSEC, RMSEP, RER, RSD and the multivariate figures of merit. Additionally, the regression coefficients were interpretable from the chemical perspective.
Method B, as presented here, could provide a more accurate measure of the crystallinity of lignocellulosic biomass, and thus better predictions, if the contributions from the amorphous pattern are considered. The most popular method for estimating CI, method A, produces significantly higher values than the other method. However, it is simple to use and is thus recommended as a time-saving empirical measure of relative crystallinity. 12 It was shown that NIR combined with multivariate analysis can be used for screening calibration and quality control to estimate the crystallinity content in biomass. Thus, the key conclusion of this study is that NIR is a nondestructive, rapid and very important method to reduce the time and costs of crystallinity content prediction.

Figure 1. Diffractogram of a banana sample (stem) illustrating the two most common methods for calculating the crystallinity index, CI: (a) by the intensity method and (b) by the area method.

Figure 3. Mean and standard deviation of crystallinity content determined for all botanical fractions by two different methods: (a) intensity and (b) peak deconvolution.

Figure 5. (a) Plot of reference vs. predicted values from calibration and external validation sets for cellulose crystallinity determined by method A. (b) Plot of reference vs. predicted values from calibration and external validation sets for cellulose crystallinity determined by method B.

Figure 6. Regression coefficients from PLS models for the cellulose crystallinity (a) determined by method A; (b) determined by method B and (c) spectra pretreated by the second derivative.

Table 1. Identification of the banana samples. a Not identified.

Table 2. Parameters and statistics for model validation of the PLS models. a CI determined by intensity; b CI determined by area deconvolution; c original matrix size (69 × 2800); d dimensionless statistics; Out.: outliers; Cal.: calibration; CV: cross-validation; Pred.: predicted.

Table 3. Results from figures of merit for the PLS models. Units: (%⁻¹) for SEN and γ; (%) for γ⁻¹, LOD and LOQ. a CI determined by intensity; b CI determined by area deconvolution.