Fast Classification of Geographical Origins of Honey Based on Laser-Induced Breakdown Spectroscopy and Multivariate Analysis.

Traceability of honey is highly required by consumers and food administration with the consideration of food safety and quality. In this study, a technique named laser-induced breakdown spectroscopy (LIBS) was used to fast trace geographical origins of acacia honey and multi-floral honey. LIBS emissions from elements of Mg, Ca, Na, and K had significant differences among different geographical origins. The clusters of honey from different geographical origins were visualized with principal component analysis. In addition, support vector machine (SVM) and linear discrimination analysis (LDA) were used to quantitively classify the origins. The results indicated that SVM performed better than LDA, and the discriminant results of multi-floral honey were better than acacia honey. The accuracy and mean average precision for multi-floral honey were 99.7% and 99.7%, respectively. This study provided a fast approach for geographical origin classification, and might be helpful for food traceability.


Introduction
Honey is a natural sweet product produced by bees from the nectar of flowers [1], which mainly consists of carbohydrates, water, proteins, minerals, amino acids, phenols, and vitamins, etc. Because of its high nutrients and healthy benefits, honey has been considered as an important health product around the world, especially in China. It has been demonstrated that honey can improve immune systems and oral health, prevent side effects linked with cancers treatment, heal wounds, etc. [2,3]. However, the constitutes of honey has regional features, and it is likely influenced by the climate, altitude, and other environmental factors [4,5]. Hence, the supplemental information concerning the geographical origin should be given according to Food Safety Law of the People's Republic of China.
In order to determine the geographical origins of honey, some analytical methods have been proposed by researchers. Chemical analysis methods including high performance liquid chromatography-mass spectrometry/mass spectrometry [6], isotope ratio mass spectrometry [7], inductively coupled plasma optical emission spectroscopy [8][9][10], and gas chromatography mass spectrometry [9] were used to discriminate the geographical and botanical origins of honey. Regional and botanical differences in chemical substances contribute to the discrimination. However, the sample preparation of these methods is time-consuming, and lots of reagents are needed.

LIBS Measurement
Before LIBS analysis, the honey samples (8 g) were added in 12-well plates. No other sample preparation was needed. A laboratory-assembled LIBS device was used to analyze samples, the detailed of which has been introduced in previous research [20]. In this experiment, a laser (Vlite 200, Beamtech, Beijing, China) was used to ablate samples at the second-harmonics wavelength (532 nm), with ablation energy of 80 mJ and frequency of 1 Hz. The focal length of lens is 100 mm, and the lens-to-sample-distance (LTSD) in this case was 99 mm. The plasma light was collected by a UV-NIR achromatic mirror system (CC52, Andor, Belfast, UK), and transferred to an Echelle spectrograph (ME 5000, Andor, Belfast, UK), finally detected by an intensified charge coupled device (ICCD, DH334T-18F-03, Andor, Belfast, UK). The delay time, integration time, and relative gain of ICCD were 2 Sensors 2020, 20, 1878 3 of 12 µs, 10 µs, and 26, respectively. Before experiment, the intensity of ICCD was calibrated by a deuterium tungsten halogen source (DH-2000-BAL-CAL, Ocean Optics, Largo, FL, USA), and the wavelength of spectrograph was calibrated with a mercury argon lamp (HG-1, Ocean Optics, Largo, FL, USA). LIBS measurement was performed by single shot scanning in an ablation region of 10 × 10 mm with a resolution of 1 mm. Hence, 100 successive shots were performed for each sample, and 100 spectra were collected.

Multivariate Analysis
All spectra from the same origin were used for representing regional characteristics. A total of 4000 spectra were obtained for each origin. In order to establish and verify discriminant model, thirty samples (3000 spectra) were randomly assigned to a calibration set, and the rest (1000 spectra) were in the prediction set. In this study, principal component analysis (PCA) was used to quantitively visualize the distribution of honey (the calibration samples) with score plots, and linear discriminant analysis (LDA) and support vector machine (SVM) were used for quantitatively classifying geographical origins. PCA, LDA, and SVM analysis was done in the MATLAB (v2018, The MathWorks Inc., Natick, MA, USA).
PCA is an unsupervised cluster algorithm which reduces data dimensions through projecting variables into some principal components (PCs) with maximal variations [21]. It can serve as a useful first step before classification of samples [22]. Because the number of original variables was large (more than 20,000), and lots of them were redundant variables. Hence, the first few principal components could be used to visualize sample distribution in score plots and represent the majority of spectral information. In addition, the loadings represent the contributions to PCs, which could be used to determine feature variables. In PCA model, LIBS spectra were used as inputs.
LDA and SVM were two popular multivariate analysis algorithms, both of which has been widely used in solving classification problems [23,24]. LDA is a supervised classification algorithm based on Bayes' formula, which linearly transforms the samples into a lower dimensional space, so that the samples belong to the same class cluster together [25]. The objective of LDA is to determine the best fit parameters for classification. It is simply to carry out and can be computed fast enough for in-line application. Hence, spectral sensors combined with LDA has been widely applied in food quality control [26,27], and produce good results. However, because of the strong dependence of assumption in its derivation, factors such as noise, non-Gaussian data distribution, and outliers might have a detrimental effect on LDA's performance [28]. Hence, SVM that performed good in discrimination was also used to classify the geographical origins of honey.
SVM is a supervised non-parametric statistical learning algorithm, which has been used for solving complex separations [23]. There is no assumption made on the data distribution. First, kernel function was used to map the data into a higher dimensional feature space which is separable with linear algorithms. Then, a hyperplane with maximum margin was determined to separate different classes. In order to solve multi-class separations in this case, one-against-one multiclass method was used.
In contrast to PCA, the dependent variables (group labels) are also considered in LDA and SVM when modeling. In this case, the independent variables (X) were the first few PCs, the dependent variables are the group labels of geographical origins. Moreover, 10-folds cross validation were used to avoid overfitting.
In addition, confusion matrix, accuracy, mean average precision (MAP), precision, and recall of each model were used to evaluate model quality. Confusion matrix is a commonly used tool representing classification results. On a confusion matrix, the row corresponds to the output class, and the column corresponds to the target class. Each cell represents the number of samples belongs to target class whereas classified as predicted class. Hence, the diagonal cells correspond to observations that are correctly classified. The off-diagonal cells correspond to incorrectly classified observations. Other figures of merit including accuracy, MAP, precision, and recall could be calculated from confusion

One-Way ANOVA Test
In order to examine the regional difference of LIBS emissions, one-way ANOVA test was performed. In this case, 33 of peak intensity of main emissions was considered as dependent variables, and the class label was considered as independent variable. The main emission lines can be identified according to National Institute of Standards and Technology (NIST) database [29]. Values were reported as the mean ± standard deviation (SD), and one-way ANOVA test was performed by SPSS (ver. 25.0, SPSS Inc., Chicago, IL, USA). Duncan's test was used to determine the significance level (p < 0.05).

Spectral Characteristics of Honey
The LIBS spectra offer fingerprint data of honey that contains the regional information. Because of the climate, temperature, and environmental factors, honey from different geographical origins might have different elemental constitutes [4]. The elemental difference could be visualized through LIBS spectra. Figure 1 shows the LIBS average spectra of honey from different origins. Two categories (acacia honey and multi-floral honey) from three different origins were analyzed. Each peak wavelength in LIBS spectrum represented the specific element that could be identified in NIST database, and the peak intensity was related to elemental concentration. As shown in Figure 1, the tendency of LIBS spectrum from different origins was similar. However, slight difference in peak intensity of different honey origins could be found. The emissions (Mg II 279.55, Mg II 280.27, and Mg I 285.21 nm) from A1 (acacia honey, Shaanxi) and M3 (multi-floral honey, Hubei) were significantly stronger than those from other origins. In addition, the emissions (Na I 589.00 and Na I 589.59 nm) from A1 (acacia honey, Shaanxi) were stronger than those from other origins, which indicated high Na concentration in multi-floral honey from Hebei. Other differences such as emissions of Ca I 422.67, K I 766.49, and K I 769.90 nm could also be observed. Due to the variation of constitutes in single group, it was hard to distinguish the origins with above mentioned rules. Hence, multivariate methods were further used to visualize the clusters and discriminate the geographical origins. Table 2 shows the peak intensity of main emissions of honey. One-way ANOVA test was performed for six different groups. The emission marked in bold showed that the peak intensity had significant difference among at least five groups. It indicated that the emissions of Mg I 285.21, Ca II 393.37, Na I 589.00, Na I 589.59, K 766.49, and K I 766.90 nm might have distinguished differences among the five groups, which played an important role in discrimination. Moreover, the peak intensity of emission of Na I 589.00 nm has significant difference among the six groups. It might be considered as a feature emission for geographical and varietal classification. The significant differences of these emissions might provide fundamental signatures for the multivariate classification of honey origins.

PCA Analysis
PCA analysis was used to visualize the clusters with scores plots, and determine the important variables with loadings plots. First, all honey (including acacia honey and multi-floral honey) within different geographical origins were visualized through PCA analysis. The contribution of the first three principal components accounted for 88.2% of explained variance, with PC1, PC2, and PC3 of 75.5%, 8.5%, and 4.2%, respectively. Figure 2a shows score plots of six different groups. In general, six groups entangled with each other. It might be credited to complex reciprocal effect of botanical and geographical origins. It was hard to distinguish with PCA analysis.

PCA Analysis
PCA analysis was used to visualize the clusters with scores plots, and determine the important variables with loadings plots. First, all honey (including acacia honey and multi-floral honey) within different geographical origins were visualized through PCA analysis. The contribution of the first three principal components accounted for 88.2% of explained variance, with PC1, PC2, and PC3 of 75.5%, 8.5%, and 4.2%, respectively. Figure 2a shows score plots of six different groups. In general, six groups entangled with each other. It might be credited to complex reciprocal effect of botanical and geographical origins. It was hard to distinguish with PCA analysis.
Therefore, PCA analysis was separately performed for acacia honey and multi-floral honey, the score plots of which are shown in Figure 2b,c. The contribution of the first three principal components for acacia honey and multi-floral honey accounted for 89.5% and 88.9% of the explained variance, respectively. Apparently, the classification result of multi-floral honey was better than that of acacia honey. The samples from Shanxi, Qinghai, and Hubei provinces clustered more compact, and could be separated. The loadings of PCA indicate the contribution of each variable, which can be used to determine feature variables. The larger absolute value of the loading, the more importance of the variable. In addition, positive value indicates a positive link, and negative value indicates a negative link. Because the first three principal components contributed most of the total variance (>85%), their loadings were used to determine important variables. Figure 3 shows loading plots of the first three principal components for all honey, acacia honey, and multi-floral honey. Similar trends could be observed for these three plots, and the variables with large absolute loadings corresponded to the main emissions. Therefore, PCA analysis was separately performed for acacia honey and multi-floral honey, the score plots of which are shown in Figure 2b,c. The contribution of the first three principal components for acacia honey and multi-floral honey accounted for 89.5% and 88.9% of the explained variance, respectively. Apparently, the classification result of multi-floral honey was better than that of acacia honey. The samples from Shanxi, Qinghai, and Hubei provinces clustered more compact, and could be separated.
The loadings of PCA indicate the contribution of each variable, which can be used to determine feature variables. The larger absolute value of the loading, the more importance of the variable. In addition, positive value indicates a positive link, and negative value indicates a negative link. Because the first three principal components contributed most of the total variance (>85%), their loadings were used to determine important variables. Figure 3 shows loading plots of the first three principal components for all honey, acacia honey, and multi-floral honey. Similar trends could be observed for these three plots, and the variables with large absolute loadings corresponded to the main emissions. Most of loadings of PC1 (except spectral range of CN emissions) is positive, which indicated that there is a positive link between the variable and the information contained in PC1. As shown in Figure 3, the major elements of C, H, O, and N contributed to the discrimination, as well as the mineral elements of Mg, Ca, Na, and K. Most of loadings of PC1 (except spectral range of CN emissions) is positive, which indicated that there is a positive link between the variable and the information contained in PC1. As shown in Figure  3, the major elements of C, H, O, and N contributed to the discrimination, as well as the mineral elements of Mg, Ca, Na, and K.

Quantitative Discrimination
Because the variables of full spectrum were over 20,000, it might lead to overfitting and worsen calculation speed [30]. In this case, principal components after dimensional reducing were used to construct models. The first few principal components with accumulated variance over 95% were used to represent the raw variables. The number of PCs for all honey, acacia honey, and multi-floral honey were 26, 23, and 29, respectively. Then, these variables were used as the inputs of LDA and SVM models.
Confusion matrix was used to evaluate the performance (Figure 4). For all honey, the accuracy of LDA and SVM models were 84.1% and 83.1%, respectively. In LDA model, 93 acacia honey samples from Shaanxi province was misclassified as Shanxi province, and 531 acacia honey samples from Shanxi province were misclassified as those from Jilin province. In SVM model, the largest misclassification was from the samples from Shanxi province; 795 acacia honey samples from Shanxi province was misclassified as those from Jilin province. The results indicated that it was hard to discriminate the samples from Shanxi province and Jilin province. For acacia honey, the accuracy of LDA and SVM models were 74.1% and 82.6%, respectively. The low accuracy was also originated from the misclassification between Shanxi province and Jilin province. The recall of Shanxi province and Jilin province in LDA model were 74.8% and 59.9%, whereas 63.0% and 88.3% in the SVM model. For multi-floral honey, the accuracy of the LDA and SVM models were 98.6% and 99.7%, respectively. In the LDA model, only 33 samples from Shanxi province were misclassified as those from Qinghai Province. The performance of the SVM model was better than the LDA model.

Quantitative Discrimination
Because the variables of full spectrum were over 20,000, it might lead to overfitting and worsen calculation speed [30]. In this case, principal components after dimensional reducing were used to construct models. The first few principal components with accumulated variance over 95% were used to represent the raw variables. The number of PCs for all honey, acacia honey, and multi-floral honey were 26, 23, and 29, respectively. Then, these variables were used as the inputs of LDA and SVM models.
Confusion matrix was used to evaluate the performance (Figure 4). For all honey, the accuracy of LDA and SVM models were 84.1% and 83.1%, respectively. In LDA model, 93 acacia honey samples from Shaanxi province was misclassified as Shanxi province, and 531 acacia honey samples from Shanxi province were misclassified as those from Jilin province. In SVM model, the largest misclassification was from the samples from Shanxi province; 795 acacia honey samples from Shanxi province was misclassified as those from Jilin province. The results indicated that it was hard to discriminate the samples from Shanxi province and Jilin province. For acacia honey, the accuracy of LDA and SVM models were 74.1% and 82.6%, respectively. The low accuracy was also originated from the misclassification between Shanxi province and Jilin province. The recall of Shanxi province and Jilin province in LDA model were 74.8% and 59.9%, whereas 63.0% and 88.3% in the SVM model. For multi-floral honey, the accuracy of the LDA and SVM models were 98.6% and 99.7%, respectively. In the LDA model, only 33 samples from Shanxi province were misclassified as those from Qinghai Province. The performance of the SVM model was better than the LDA model.  In addition, a comparison of modeling performance in three classifications is listed in Table 3. Accuracy and mean average precision were used to evaluate the performance. In general, the SVM model performed better than the LDA model. The accuracy for all honey, acacia honey, and multi-floral honey were 83.1%, 82.6%, and 99.7%, and the mean average precision were 79.3%, 89.5%, and 99.7%. The discrimination performance of all honey was worse than acacia honey or multi-floral honey. The geographical origin of multi-floral honey was well classified, with accuracy and mean average precision of 99.7% and 99.7%. Great differences have been found among these three origins, which were mainly caused by the botanical difference. The flowers in Qinghai Province were mainly rhodiola rosea, chrysanthemum, codonopsis pilosula, and hippophae rhamnoides, etc. The flowers in Shanxi Province were mainly chaste, jujube, and acacia, etc. The flowers in Hebei Province were mainly Chinese medical plants, such as goldthread, chrysanthemum, etc. The constituents of honey might be affected by the regional difference of botanical variety.

Conclusions
Laser-induced breakdown spectroscopy was successfully used to discriminate the geographical origins of honey. Spectral intensity of emissions from Mg, K, Ca, and Na showed slight difference among different origins. One-way ANOVA test indicated emissions from Na I 589.00 nm had significant difference among six groups, which might be considered as feature emission for geographical origin discrimination. Different clusters of origins in multi-floral honey could be separated in PCA score plot, whereas the samples from all honey (including acacia honey and multi-floral honey) and acacia honey were entangled with each other. Emissions from major elements C, H, O, and N as well as Mg, Ca, Na, and K had large loading values, which indicated the importance in each principal component. In addition, the geographical origins of all honey, acacia honey, and multi-floral honey were quantitatively discriminated with LDA and SVM. In general, the SVM model performed better than the LDA model. For acacia honey, the accuracy and mean average precision were 82.6% and 89.5%. Some deep learning methods such as convolutional neural networks might be used to further improve the performance. Excellent discriminant result was achieved in multi-floral honey, with accuracy and mean average precision of 99.7% and 99.7%, respectively. It might be credited to the regional difference in botanical variety. The results indicated the feasibility of the utilization of LIBS for discriminating the geographical origins of honey, which might provide an approach for food traceability.