The Honey Volatile Code: A Collective Study and Extended Version

Background: The present study comprises the second part of a new theory related to honey authentication based on the implementation of the honey code and the use of chemometrics. Methods: One hundred and fifty-one honey samples of seven different botanical origins (chestnut, citrus, clover, eucalyptus, fir, pine, and thyme) and from five different countries (Egypt, Greece, Morocco, Portugal, and Spain) were subjected to analysis of mass spectrometry (GC-MS) in combination with headspace solid-phase microextraction (HS-SPME). Results: Results showed that 94 volatile compounds were identified and then semi-quantified. The most dominant classes of compounds were acids, alcohols, aldehydes, esters, ethers, phenolic volatiles, terpenoids, norisoprenoids, and hydrocarbons. The application of classification and dimension reduction statistical techniques to semi-quantified data of volatiles showed that honey samples could be distinguished effectively according to both botanical origin and the honey code (p < 0.05), with the use of hexanoic acid ethyl ester, heptanoic acid ethyl ester, octanoic acid ethyl ester, nonanoic acid ethyl ester, decanoic acid ethyl ester, dodecanoic acid ethyl ester, tetradecanoic acid ethyl ester, hexadecanoic acid ethyl ester, octanal, nonanal, decanal, lilac aldehyde C (isomer III), lilac aldehyde D (isomer IV), benzeneacetaldehyde, alpha-isophorone, 4-ketoisophorone, 2-hydroxyisophorone, geranyl acetone, 6-methyl-5-hepten-2-one, 1-(2-furanyl)-ethanone, octanol, decanol, nonanoic acid, pentanoic acid, 5-methyl-2-phenyl-hexenal, benzeneacetonitrile, nonane, and 5-methyl-4-nonene. Conclusions: New amendments in honey authentication and data handling procedures based on hierarchical classification strategies (HCSs) are exhaustively documented in the present study, supporting and flourishing the state of the art.


Introduction
The high consumer demand for authentic products along with the pressure on the market with products of low quality, distributed by cheap producing countries, creates the need for a multi-optional handling of natural-based products. A typical example of such products comprises honey-the sweet viscous solution obtained through the action of honeybees (Apis mellifera). The main types of honey include nectar and honeydew honeys. Nectar honeys are produced via the collection of the nectar of flowers by the honeybees.
On the other hand, honeydew honeys are characterized by the presence of secretions of plant-sucking insects (Hemiptera) living in the parts of the plants or conifer trees [1]. Given the historical meaning and symbolism of honey through the welfare of many civilizations [2], the latter has been subjected to exhaustive research. Apart from the basic components which are sugars and moisture, there are plenty of micro-constituents including minerals, phenolic compounds, organic acids, proteins, free amino acids, vitamins and volatile compounds and traces of lipid acids that have attracted researchers [3][4][5][6][7].

Analysis of Gas Chromatography-Mass Spectrometry in Combination with Headspace Solid-Phase Microextraction (HS-SPME/GC-MS)
The experimental strategy for the isolation and semi-quantification of honey volatile compounds along with HS-SPME/GC-MS equipment and analysis conditions are given in details in previous work [10]. The mass spectral library used for the identification of volatile compounds was Wiley 7 (2005) of the National Institute of Standards and Technology (NIST). Only the volatile compounds that had ≥80% similarity with those of the Wiley MS library were tentatively identified using the GC-MS spectra. Data were expressed as concentration (Canalyte, mg/kg) based on the ratio of peak areas of the isolated volatile compounds to that of the internal standard (benzophenone) multiplied by the final concentration of the internal standard, assuming a response factor (RF) equal to 1 for all the compounds. An additional method of identification was considered and included the calculation of Kovats indices using a mixture of n-alkanes (C8-C20) which was supplied by Supelco (Bellefonte, PA, USA). The standard mixture was dissolved in n-hexane. The retention time of the standards was determined according to the temperature-programmed run used in the analysis of honey samples. MS and Kovats indices data were compared to those found in the Wiley MS library. A solvent delay of 5 min was inserted in the program, in order to avoid the elution of ethanol, in which the internal standard was dissolved. Each sample was analyzed in duplicate and the results were averaged.

Statistical Analysis
Multivariate analysis of variance (MANOVA), linear discriminant analysis (LDA), k-nearest neighbors (k-NN), and factor analysis (FA) were applied to the semi-quantitative data of volatile compounds. The first part of the statistical analysis included the botanical origin differentiation of clover, citrus, chestnut, eucalyptus, fir, pine, and thyme honeys based on the use of the significant volatiles (p < 0.05) which served as the independent variables. The botanical origin served as the grouping variable consisted of 7 groups (clover, citrus, chestnut, eucalyptus, fir, pine, and thyme honeys). In the second part of the statistical analysis, honey samples were grouped according to the honey code. The expectation was to investigate whether classification results could be improved.
For the MANOVA analysis, the indices of the multivariate hypothesis such as Pillai's Trace, Wilks' Lambda, Hotelling's Trace, and Roy's Largest Root were computed to determine whether there was a multivariate effect of volatile compounds (p < 0.05) on the botanical origin or the honey code of honey. The size of the effect was further evaluated by consideration of partial eta squared values (η 2 ). It should also be stressed that the lower the value of Wilks' Lambda, the higher the differences between groups of objects. Considering only the significant volatiles, LDA was then applied to classify honey samples according to group membership based on the use of original and cross-validation methods. LDA provides linear discriminant functions originating from the combinations of the significant variables (all independent variables are entered together/simultaneously in the analysis) multiplied by the standardized canonical discriminant function coefficients plus a constant, characteristic for each discriminant function [18]. Moreover, the tolerance test was also computed in the analysis. Tolerance may be defined as the proportion of a variable's variance not accounted for by other independent variables in the created discriminant function. It practically shows that a variable with very low tolerance contributes little information to the predictive model and may cause computational problems [19].
For the k-nearest neighbors analysis, the botanical origin of samples or the honey code served as the target parameter, while the significant volatiles (p < 0.05) served as the features. The number of the k-nearest neighbors was set by default equal to the higher number provided by the SPSS software-that is, 3-5. The classification ability of the constructed model was estimated by the application of training and holdout partitions. In the training sample, 70% of the cases were randomly assigned to partitions, while the rest of the cases were assigned to the holdout sample. The distances of an unknown object from all the members of the training set were calculated using the Euclidean distance in multi-dimensional space. The k-smallest distances between the unknown object and the training set sample were then identified. K is normally a small odd number, and the unknown variable is allocated to the class with the majority of these k distances [10]. For the performance of feature selection among the population of significant variables, k-NN analysis was run again using only the specified predictors that built the model (usually [3][4][5] in the previous step, in order to reduce the number of predictors to those having the lower error rate and k distance.
Afterwards, a data mining technique such as factor analysis (FA) was applied to the significant parameters in order to explain the total variance of the constructed model in the multidimensional space. At the same time, FA provides a reduction in the variables, in order to gain the most pronounced ones with the higher correlation and communalities of independent latent variables. The communalities indicate the common variance shared by factors with specific variables. A higher communality indicates that a larger amount of the variance in the variable has been extracted by the factor solution. For the most effective data collection during FA, communalities should be ≥0.4. The extraction method was principal component analysis (PCA). The rotation method used was Varimax with Keiser Normalization. The Varimax rotation is used in statistical analysis to simplify the expression of a particular sub-space in terms of just a few major items each. The actual coordinate system (practically constant) is the orthogonal basis that is being rotated to align with these coordinates. The sub-space can be defined with either PCA or FA. Varimax maximizes the sum of the variances of the squared loadings (squared correlations between variables and factors [20]. The accuracy and strength of factor analysis was supported further by the Kaiser-Meyer-Olkin (KMO) test, which comprises a measure of how well suited the data is for factor analysis. The acceptable value considered was that of KMO ≥ 0.50. In addition, the effectiveness and suitability of factor analysis was explored using Bartlett's test of sphericity. This test highlights the hypothesis that the correlation matrix is an identity matrix, which would indicate that the variables incorporated into the model are unrelated and therefore unsuitable for structure detection. Small probability values (p < 0.05) indicate that a factor analysis may be useful with data treatment [20]. Statistical analysis was run using the SPSS (version 20.0, IBM, Armonk, NY, USA) statistics software.

Results and Discussion
3.1. Volatile Compounds of Clover, Citrus, Chestnut, Eucalyptus, Fir, Pine, and Thyme Honeys A considerable number of volatile compounds were putatively identified and semi-quantified. In total, 94 volatile compounds of different classes were isolated. Table 1 lists the volatile compounds  Figure 1 shows a typical chromatogram of clover honey from Egypt, indicating with numbers some selected key volatile compounds. In supplementary material (Figures S1-S6), typical chromatograms of citrus, chestnut, eucalyptus, fir, pine, and thyme honeys from the investigated regions are given.  Figure 1 shows a typical chromatogram of clover honey from Egypt, indicating with numbers some selected key volatile compounds. In supplementary material (Figures S1-S6), typical chromatograms of citrus, chestnut, eucalyptus, fir, pine, and thyme honeys from the investigated regions are given. Clover honeys showed higher amounts (mg/kg) of 2-methylbutanal, 3-methylbutanal, 3-methyl-1-butanol, and 2-methyl-1-butanol compared to the other honey types. The aforementioned compounds comprise isoleucine-and leucine-derived volatiles and are found in a wide range of foods such as honey, beer, cheese, coffee, chicken, fish, chocolate, olive oil, and tea. These volatile compounds are associated with a fruity and malty flavor [21,22].
Citrus honeys were characterized by the higher amounts (mg/kg) of lilac aldehyde C (isomer III), dill ether, methylanthranilate and herboxide (isomer II). These compounds have been reported previously to dominate the volatile profile of citrus honeys among other honey types [3,4].
Chestnut honeys showed the highest amounts (mg/kg) of benzaldehyde, benzeneacetaldehyde, 1-octene, and furfural. The latter volatile compound may serve as an indicator of heat resistance/treatment of chestnut honeys.
Eucalyptus honeys were characterized by the presence of heptane and beta-damascenone, since these compounds were found in higher amounts (mg/kg). The presence of hydrocarbons in honey is a typical phenomenon, whereas beta-damascenone is a cyclic carotenoid derivative and possesses a rose odor [22].
Pine honeys had higher amounts (mg/kg) of acetic acid, octane, alpha-pinene and beta-thujone. It is remarkable that beta-thujone was identified only in pine honeys, comprising a key volatile marker, Clover honeys showed higher amounts (mg/kg) of 2-methylbutanal, 3-methylbutanal, 3-methyl-1-butanol, and 2-methyl-1-butanol compared to the other honey types. The aforementioned compounds comprise isoleucine-and leucine-derived volatiles and are found in a wide range of foods such as honey, beer, cheese, coffee, chicken, fish, chocolate, olive oil, and tea. These volatile compounds are associated with a fruity and malty flavor [21,22].
Citrus honeys were characterized by the higher amounts (mg/kg) of lilac aldehyde C (isomer III), dill ether, methylanthranilate and herboxide (isomer II). These compounds have been reported previously to dominate the volatile profile of citrus honeys among other honey types [3,4].
The latter volatile compound may serve as an indicator of heat resistance/treatment of chestnut honeys.
Eucalyptus honeys were characterized by the presence of heptane and beta-damascenone, since these compounds were found in higher amounts (mg/kg). The presence of hydrocarbons in honey is a typical phenomenon, whereas beta-damascenone is a cyclic carotenoid derivative and possesses a rose odor [22].
In the case of thyme honeys, 1-octen-3-ol, thymol methyl ether, carvacrol methyl ether, octyl ether, 4,7,7-trimethylbyciclo (3.3.0)-octan-2-one, and eugenol were found in higher amounts (mg/kg) compared to the other honey types. Ethers give an ether-like odor, whereas 1-octen-3-ol provides a floral and grassy odor. Furthermore, 1-Octen-3-ol is referred to as "mushroom alcohol", given the fact that is the main flavor component of mushrooms [25]. Eugenol is a phenylpropanoid volatile that has a pleasant, spicy, and clove-like odor [26]. More specifically, 62 of the 94 volatile compounds were found to be significant (p < 0.05) for the botanical origin differentiation of honeys. Afterwards, these volatiles were subjected to LDA. The minimum tolerance level of the analysis was set at 0.001. Results showed that 4-terpineol, borneol, para-cymene, carvacrol methyl ether, thymoquinone, thymol, and eugenol did not pass the tolerance test. Therefore, these volatile compounds were excluded (SPSS program) a priori from the discriminant analysis. Therefore, 56 volatile compounds were subjected to LDA.
The first discriminant function recorded the higher eigenvalue (127.977) and a canonical correlation of 0.996, accounting for 71.5% of total variance. The second discriminant function recorded a much lower eigenvalue (39.902) and a canonical correlation of 0.988, accounting for 22.3% of total variance. The third discriminant function recorded a much lower eigenvalue (4.530) and a canonical correlation of 0.905, accounting for 2.5% of total variance. Similarly, the fourth discriminant function recorded an even lower eigenvalue (2.764) and a canonical correlation of 0.857, accounting for 1.5% of total variance. Moreover, the fifth discriminant function had a lower eigenvalue (2.360) and a canonical correlation of 0.838, accounting for 1.3% of total variance. Finally, the sixth discriminant function had the lowest eigenvalue (1.446) and a canonical correlation of 0.769, accounting for 0.8% of total variance. All six discriminant functions accounted for 100% of total variance.  The classification rate was 95.4% using the original and 81.5% using the cross-validation method. Supplementary Table S1 shows the significant variables (markers of botanical origin of clover, citrus, chestnut, eucalyptus, fir, pine, and thyme honeys) that were ordered by absolute size of correlation within function. The higher the absolute value of correlation, the best discrimination the variable provides within the discriminant function. The most pronounced markers of the botanical origin of honey are marked with an asterisk. In particular, octanoic acid ethyl ester, nonanoic acid ethyl ester, decanoic acid ethyl ester, dodecanoic acid ethyl ester, decanal, nonanal, 5-methyl-4-nonene, hexanoic acid ethyl ester, heptanoic acid ethyl ester, 4-ketoisophorone, octanol, tetradecanoic acid ethyl ester, geranyl acetone, 6-methyl-5-hepten-2-one, 1-(2-furanyl)-ethanone, alpha-isophorone, and decanol contributed to discriminant function 1, whereas hexadecanoic acid ethyl ester contributed to discriminant function 2. It should be remembered that the first two discriminant functions explained 93.8% of total variance.
On the contrary, the botanical origin classification rate of honeys, based on the cross-validation method followed the sequence: clover (62.5%), citrus (80.6%), chestnut (66.7%), eucalyptus (75%), fir (100%), pine (94.9%), and thyme (57.1%). In total, 25% of clover honeys (two samples) were allocated to pine honeys, whereas 12.5% of samples (one sample) were allocated to thyme honeys. In total, 19.4% of citrus honeys (six samples) were allocated to pine honeys. In total, 33.3% of chestnut honeys (one sample) were allocated to fir honeys. In total, 25% of eucalyptus honeys (one sample) were allocated to chestnut honeys. In total, 5.2% of pine honeys (two samples) were allocated in equal proportions (2.6%) to clover and chestnut honeys, respectively. Finally, 42.8% of thyme honeys were allocated to clover (11.4%) (four samples), citrus (2.9%) (one sample), chestnut (11.4%) (four samples), and pine (17.1%) (six samples) honeys, respectively (Table 2).  It should be stressed that these results were accepted, given the fact that cross validation is a more pessimistic method of classification of a group of objects, since in cross validation, each case is classified by the functions derived from all cases rather than that particular case. The errors obtained in both classification techniques (original and cross-validation) reveal important findings regarding honey authentication. These errors show the contribution of numerous plants in the produced honeys, even in cases of honeydew honeys. Honeydew honeys are harvested after nectar honeys. Therefore, the contribution of flowering plants in honeydew honeys is quite common.
In addition, present findings may be related to beekeeper practices during the harvesting of different honey types. However, the classification results obtained support previous studies in the literature that focus on honey authentication using volatile compound analysis and chemometrics [3][4][5]8,11]. The results obtained in the present study, which is collective in nature, show a clear differentiation of honeydew vs. nectar honeys.

K-Nearest Neighbors
For the k-NN analysis, the number of samples was randomly assigned to training and holdout partitions. The training sample consisted of 110 honey samples (72.8%), while the holdout sample consisted of 41 samples (27.2%). All cases (151 honey samples) (100%) were used in the statistical analysis, comprising a valid procedure. The overall classification rate was 77.3% for the training and 87.8% for the holdout sample and was satisfactory in both cases. The botanical origin classification rate of honey types for the training sample followed the sequence: clover (75%), citrus (70.8%), chestnut (0%), eucalyptus (0%), fir (95.8%), pine (92.3%), and thyme (58.3%). However, the zero classification rates of chestnut and eucalyptus honeys are attributed to the limited honey samples, since the majority of them were assigned to the holdout sample. Of the eight clover honey samples subjected to training analysis, six were assigned to clover and two to thyme honeys. Similarly, of the 24 citrus honey samples, 17 samples were assigned to citrus, one to clover, four to pine, and two to thyme honeys, respectively. One chestnut honey sample was assigned to eucalyptus honeys. Of the 24 fir honey samples, 23 were assigned to fir and only one to pine honeys. A similar trend was obtained for pine honeys-in which, 25 samples were assigned to pine and only one to fir honeys, respectively. Finally, of the 24 thyme honey samples, 14 were assigned to thyme, six to pine, three to clover, and one to citrus honeys, respectively.
Regarding the holdout sample classification rate, this was higher than the training sample by 10.5%. The botanical origin classification rate of honey types for the holdout sample followed the sequence: citrus (85.7%), chestnut (0%), eucalyptus (100%), fir (100%), pine (100%), and thyme (81.8%). The clover honeys were assigned previously to training sample. Therefore, no classification results were obtained. Of the seven citrus honey samples subjected to holdout analysis, six were assigned to citrus and one to pine honeys. Of the two chestnut honeys, one was assigned to citrus and one to eucalyptus honeys, respectively. Finally, of the 11 thyme honey samples, nine samples were assigned to thyme and two to pine, respectively (Table 3). Table 3. Classification of clover, citrus, chestnut, eucalyptus, fir, pine, and thyme honeys according to botanical origin using the 56 volatile compounds and K-Nearest Neighbors (k-NN) analysis. Among the 55 significant volatile compounds (predictors), the most effective predictors (k-nearest neighbors) that built the model were acetic acid ethyl ester, formic acid ethyl ester, and 2-methylbutanal. Based on this observation, the k-NN analysis was run again by performing feature selection in order to investigate whether the classification results could be improved. The selected features were the aforementioned volatile compounds.
For the specified k-nearest neighbors analysis, the sample was divided again to training and holdout partitions. The training sample consisted of 100 honey samples (66.2%), whereas the holdout sample consisted of 51 (33.8%). Both partitions explained 100% of the procedure, indicating that all cases were valid. The analysis stopped when the absolute ratio was less than or equal to the minimum change, which was inserted by default equal to 0.01. The overall classification rate was 83% for the training and 74.5% for the holdout sample. The individual botanical classification rate was differentiated given the fact that the sample assignment to training and holdout partitions was also differentiated (Table 4). Considering the total standard error of the forced predictors, but also that of the individual predictors, the model built with k = 3 (nearest neighbors) was more applicable. The 10 predictors were the three specified predictors, followed by furfural, lilac aldehyde C, benzaldehyde, nonanol, para-cymene, 5-methyl-4-nonene, and nonane. The total error rate of the three specified features was 0.74. The respective individual error rate for the seven predictors left was: 0.39, 0.27, 0.25, 0.21, 0.18, 0.17, and 0.17.
On the contrary, the total error rate of the model built with k = 4 nearest neighbors in relation to the three specified features was 0.75. The individual error rate for the seven predictors left was: 0.41, 0.28, 0.25, 0.23, 0.21, 0.20, and 0.20 for furfural, lilac aldehyde C, phenylethylalcohol, 5-methyl-4-nonene, 1-octen-3-ol, and nonane, respectively. The selection of the predictors during k-NN analysis with k = 3 and k = 4 nearest neighbors according to the botanical origin of honey is shown in Figure 3.
FA During factor analysis, the 56 significant volatile compounds were reduced to 16 principal components (PCs) based on the rule of an eigenvalue greater than one. The rotated component matrix is given in Supplementary Table S2. The total variance explained was 80.524% (ca. 80.52%).
The fitness of data for factor analysis was estimated by the Kaiser-Meyer-Olkin (KMO) test, which comprises a measure of the effective performance of factor analysis, to a set of data, indicating the sampling adequacy. The acceptable value was considered that of KMO ≥ 0.50. The suitability and applicability of factor analysis was further evaluated using Bartlett's test of sphericity. This test highlights the hypothesis that the correlation matrix is an identity matrix, which would indicate that the variables incorporated into the model are unrelated and therefore unsuitable for structure detection. Small probability values (p < 0.05) indicate that a factor analysis may be useful with data treatment [20]. The value of the KMO test was 0.636. Furthermore, Bartlett's test of sphericity had the following qualitative values: X 2 = 10,587.564, df = 1540, and p = 0.000.  The factors that best explained the rotated component matrix were the following volatile compounds: Decanol (PC1, 10.454% of total variance), undecanoic acid ethyl ester (PC2, 10.380% of total variance), para-cymene (PC3, 7.768% of total variance), 2-hydroxyisophorone (PC4, 7.699% of total variance), nonane (PC5, 5.509% of total variance), dill ether (PC6, 5.058% of total variance), lilac aldehyde C (PC7, 4.412% of total variance), lilac aldehyde D (PC8, 4.352% of total variance), acetic acid ethyl ester (PC9, 3.918% of total variance), 5-methyl-2-phenylhexenal (PC10, 3.905% of total variance), decane (PC11, 3.596% of total variance), 1-(2-furanyl)-ethanone (PC12, 3.563% of total variance), benzeneacetonitrile (PC13, 3.240% of total variance), nonanol (PC14, 2.400% of total variance), 2-methylbutanal (PC15, 2.156% of total variance), and hexanoic acid ethyl ester (PC16, 2.113% of total variance). showed that there was a statistically significant effect of the honey code on the semi-quantitative data of volatile compounds. More specifically, 65 of the 94 volatile compounds were found to be significant (p < 0.05) for the classification of honeys according to the honey code. Then, these volatiles were subjected to LDA. Results showed that 4-terpineol, (1R)-1,7,7-trimethyl-bicyclo-(2.2.1)-heptan-2-one, carvacrol methyl ether, thymoquinone, thymol, and eugenol did not pass the tolerance test. Therefore, these volatile compounds were excluded from the discriminant analysis. In that sense, 59 volatile compounds contributed to the structure matrix as shown in Supplementary Table S3.
Results showed that four canonical discriminant functions were formed: Wilks' Lambda = 0.000, X 2 = 1009.601, df = 236, p = 0.000 for the first function; Wilks' Lambda = 0.014, X 2 = 502.352, df = 174, p = 0.000 for the second function; Wilks' Lambda = 0.090, X 2 = 284.776, df = 114, p = 0.000 for the third function; and Wilks' Lambda = 0.369, X 2 = 117.597, df = 56, p = 0.000 for the fourth function. The first discriminant function recorded the higher eigenvalue (72.605) and a canonical correlation of 0.993, accounting for 87.7% of total variance. The second discriminant function recorded a much lower eigenvalue (5.321) and a canonical correlation of 0.917, accounting for 6.4% of total variance. The third discriminant function recorded a lower eigenvalue (3.124) and a canonical correlation of 0.87o, accounting for 3.8% of total variance. Similarly, the fourth discriminant function recorded the lowest eigenvalue (1.709) and a canonical correlation of 0.794, accounting for 2.1% of total variance. All four discriminant functions accounted for 100% of total variance.
In Figure 4,  The classification rate was 96.0% using the original and 86.1% using the cross-validation method. The classification rate of honeys according to the honey code, based on the original method, followed the sequence: CCC (88.1%), E (100%), F (100%), P (100%), and T (97.1%). In total, 12.9% of CCC honeys were allocated to the E group (4.8%, two samples) and to the P group (7.1%, three samples). In total, 2.9% (three samples) of T honeys were allocated to the P group.
For the cross-validation method, the classification rate of honeys followed the sequence: CCC (81%), E (75%), F (93.5%), P (92.3%), and T (80%). In total, 4.8% (two samples) of CCC honeys were allocated to the E group; 2.4% (one sample) were allocated to the F group, and 11.9% of samples (five samples) were allocated to the P group. In total, 25% (one sample) of E honeys were allocated to CCC honeys. In total, 3.2% of F honeys were allocated in equal proportions to the E (one sample) and P (one sample) groups, respectively. The same holds for P honeys-in which, two samples (5.2% of the total population) were allocated to the E (one sample, 2,6%) and F (one sample, 2,6%) groups, respectively. Finally, 14.3% of T honeys were allocated to P honeys (five samples), and 5.7% (two samples) to E honeys (Table 5). The classification rate was 96.0% using the original and 86.1% using the cross-validation method. The classification rate of honeys according to the honey code, based on the original method, followed the sequence: CCC (88.1%), E (100%), F (100%), P (100%), and T (97.1%). In total, 12.9% of CCC honeys were allocated to the E group (4.8%, two samples) and to the P group (7.1%, three samples). In total, 2.9% (three samples) of T honeys were allocated to the P group.
For the cross-validation method, the classification rate of honeys followed the sequence: CCC (81%), E (75%), F (93.5%), P (92.3%), and T (80%). In total, 4.8% (two samples) of CCC honeys were allocated to the E group; 2.4% (one sample) were allocated to the F group, and 11.9% of samples (five samples) were allocated to the P group. In total, 25% (one sample) of E honeys were allocated to CCC honeys. In total, 3.2% of F honeys were allocated in equal proportions to the E (one sample) and P (one sample) groups, respectively. The same holds for P honeys-in which, two samples (5.2% of the total population) were allocated to the E (one sample, 2,6%) and F (one sample, 2,6%) groups, respectively. Finally, 14.3% of T honeys were allocated to P honeys (five samples), and 5.7% (two samples) to E honeys ( Table 5).
As can be observed, the classification of honeys according to the honey code based on original and cross validation methods provided higher prediction rates compared to the differentiation of honeys according to botanical origin. In Table 3, the key volatile compounds for the discrimination of honeys according to the honey code are marked with an asterisk.  100.0 a 96.7% of grouped cases using the original method were correctly classified. b Cross validation was performed only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases rather than that particular case. c 88.1% of cross validated grouped cases were correctly classified.

K-Nearest Neighbors
The training sample consisted of 111 honey samples that represented 73.5% of the total sample population. Similarly, the holdout sample consisted of 40 samples that represented 26.5% of the sample population. All cases (151 honey samples) (100%) were used in the statistical analysis, comprising a valid procedure. The scale features (predictors) (57 statistically significant volatile compounds) were normalized during analysis.
The overall classification rate was 81.1% for the training and 80% for the holdout sample and was satisfactory in both cases. The classification rate of honey types according to the honey code followed the sequence: CCC (86.7%), E (25%), F (96.2%), P (96.4%), and T (47.8%). Of the 30 CCC samples subjected to training analysis, one sample was assigned to the E group and three samples to the P group. Similarly, of the four E honey samples, three honey samples were assigned to the CCC group and only one sample to the E group. Of the 26 F honeys, 25 samples were assigned to the F group and only one sample to the P group. Of the 28 P honeys, 27 samples were assigned to the P group and only one sample to the F group. Finally, of the 23 T honey samples, 11 were assigned to T, six to CCC, and six to P, respectively. For the training sample, the honey code had the following hierarchy: CCC > E, F, P > T, CCC > T, and was in general applicable.
The classification rates for the holdout sample were improved for the F, P, and T honey groups. The classification rate of honey samples according to the honey code for the holdout sample followed the sequence: CCC (83.3%), F (100%), P (100%), and T (50%). As can be observed, the hierarchy in honey lettering was maintained by 3/5 cases (F, P > T and CCC > T). Of the 12 CCC honey samples assigned to the holdout sample, 10 were assigned to the CCC group and two to the P group. F and P honey samples were perfectly assigned to the respective groups. Finally, of the 12 T honey samples, seven were assigned to T group, two to the CCC, and four to the P group (Table 6). Among the 57 significant volatile compounds (predictors), the most effective predictors (k-nearest neighbors) that built the model were formic acid ethyl ester, acetic acid ethyl ester, and heptane. Therefore, the k-NN analysis was repeated by performing feature selection in order to investigate whether the classification results could be improved. The selected features were the aforementioned volatile compounds.
For the specified k-nearest neighbors, the sample was divided again to the training and holdout partitions. The training sample consisted of 110 honey samples (72.8%), whereas the rest of the honey samples represented the holdout sample (27.2%). Similar to the botanical origin differentiation of honeys, the analysis stopped when the absolute ratio was less than or equal to the minimum change which was inserted by default equal to 0.01.
The overall classification rate was 89.1% for the training and 63.4% for the holdout sample. The classification rate of the training sample was improved, whereas that of the holdout sample was decreased. However, discrepancies among the two methods followed (simple k-NN and k-NN analysis with feature selection) may be attributed to the number of the predictors assigned to the model and the random dividing of the sample to training and holdout partitions in relation to sample size.
The classification rate of honey samples according to the honey code for the training sample followed the sequence: CCC (87.5%), E (0%), F (100%), P (93.1%), and T (88.5%). These results show that the honey code was satisfactorily applicable given the hierarchy followed: CCC > E, F > P > T and CCC > T. The classification rate of honey samples according to the honey code for the holdout sample followed the sequence: CCC (50%), E (0%), F (72.7%), P (80%), and T (55.6%). Similarly, the honey code was satisfactorily applicable given the hierarchy followed: CCC > E, F, P > T, and CCC > T. The classification results along with each sample assignment are given in Table 7.
The model was built with k = 3, k = 4, and k = 5 nearest neighbors, which were automatically selected. Similar to the botanical origin differentiation of honeys, the forced predictors' error was considered for the selection of the best model. The 10 predictors that were obtained during the k-NN analysis with k = 3 neighbors were the three forced predictors (heptane, formic acid ethyl ester, acetic acid ethyl ester) followed by dodecanoic acid ethyl ester, benzaldehyde, nonanal, alpha-terpinene, geranyl acetone, 5-methyl-4-nonene, and lilac aldehyde B (isomer II). The total error rate of the three specified features was 0.6818. The respective individual error rate for the seven aforementioned predictors was: 0.30, 0.2727, 0.2455, 0.1818, 0.1616, 0.1455, 0.1364, and 0.1364.
On the other hand, the total error rate of the model built with k = 4 nearest neighbors in relation to the three specified features was 0.6636. The individual error rate for the six predictors left was: 0.3000, 0.1909, 0.1636, 0.1364, 0.1273, and 0.1273, for furfural, lilac aldehyde C (isomer III), decanoic acid, lilac aldehyde D (isomer IV), pentanoic acid, and nonane.
Finally, the total error rate of the model built with k = 5 nearest neighbors in relation to the three specified features was 0.6545. The individual error rate for the seven predictors left was: 0.2909, 0.20, 0.1364, 0.1545, 0.1182, 0.1091, and 0.1091, for furfural, lilac aldehyde C (isomer III), benzeneacetaldehyde, octanoic acid ethyl ester, nonanoic acid ethyl ester, nonanoic acid, and pentanoic acid. The selection of the predictors during k-NN analysis with k = 3, k = 4, and k = 5 nearest neighbors according to the honey code is shown in Figure 5. Table 7. Classification of clover, citrus, chestnut, eucalyptus, fir, pine, and thyme honeys according to the honey code using the forced predictors and, in total, 10 volatile compounds and k-NN analysis.

Conclusions
Results of the present study showed that specific volatile compounds in combination with polyparametric statistical techniques such as MANOVA, LDA, k-NN and FA, may provide exhaustive information about the botanical origin of honey. Even though honey samples were harvested in different parts of the world, the classification of honeys according to botanical origin was, in general, very satisfactory. At the same time, the application of the honey code to the collective set of data and the use of the aforementioned statistical techniques resulted in a higher classification rate of the honey samples. The use of hierarchical classification strategies (HCSs) may expand the state of the art and flourish the complicated topic of "Honey authentication" by highlighting with numbers the distinction/differentiation rates, for instance, of monofloral or blends of nectar or honeydew honeys with specific and intense flavors.
Supplementary Materials: The following are available online at http://www.mdpi.com/2304-8158/8/10/508/s1, Table S1: Standardized canonical discriminant function coefficients of the discrimination model-structure matrix of the LDA model according to the botanical origin of honey; Table S2: Rotated component matrix of volatile compounds used for the botanical origin differentiation of clover, citrus, chestnut, eucalyptus, fir, pine, and thyme honeys; Table S3: Standardized canonical discriminant function coefficients of the discrimination model-structure matrix of the LDA model according to the honey code; Table S4: Rotated component matrix of volatile compounds used for the distinction of clover, citrus, chestnut, eucalyptus, fir, pine, and thyme honeys according to the honey code; Figure S1: A typical gas chromatogram of citrus honey (no. 4) from Spain indicating selected key volatile compounds. 10: Herboxide (isomer II). 11: Lilac aldehyde C (isomer III). 12: Dill ether. IS: internal standard; Figure S2: A typical gas chromatogram of chestnut honey (no. 2) from Portugal indicating selected key volatile compounds. 13: Octane. 14: 5-methyl-2-Furaldehyde. 15