A Comparison of Methods for Classification of Flue-cured Tobacco Aroma Types

It is well established that flue-cured tobacco aroma types in China are divided into light, medium and heavy. To identify an optimal scheme for discriminating the spatial distribution of flue-cured tobacco aroma types, this study compared three methods, Back Propagation Neural Networks (BP NN), Support Vector Machines (SVM) and Discriminant Analysis (DA), each applied to different numbers of chemical indices. The experimental results indicated that, by and large, the number of chemical indices has no clear relationship with the accuracy. Additionally, the classification performance of BP NN was superior to that of the other methods. On the whole, the best scheme, with an accuracy of 81.18% and a kappa value of 0.72, was obtained when the BP model was combined with 9 chemical indices. Finally, the optimal spatial distribution map was produced in ArcGIS 9.3.


INTRODUCTION
According to the Food and Agriculture Organization (FAO), flue-cured tobacco is planted and sold in more than 150 countries and regions around the world. Among them, China is the major tobacco producing and consuming country. Generally, flue-cured tobacco (FCT) in China is divided into three aroma types (light, medium and heavy). The types are usually determined by experts. However, flue-cured tobacco aroma types are inevitably associated with leaf chemical composition. Only a few studies have focused on identifying flue-cured tobacco aroma types from routinely measured chemical compositions at the national scale (Bi et al., 2006; Yang et al., 2014; Zhang et al., 2013). For example, Bi et al. (2006) combined eight chemical variables with discriminant analysis to classify the FCT aroma types of Yunnan, Henan and Liaoning provinces and achieved a relatively good result. Yang et al. (2014) obtained very good classification results using SVM with various chemical compositions. By building a Fisher discriminant formula with 67 chemical compositions of FCT, Yan et al. discriminated the aroma types of FCT in 11 main tobacco producing provinces of China.
It is well acknowledged that the Artificial Neural Network (ANN), the Support Vector Machine (SVM) and Discriminant Analysis (DA) are among the most popular methods for supervised classification. With its great capability for nonlinear approximation and pattern recognition, the ANN has been widely used in many fields, such as engineering, medical science, agriculture, finance and national defense (Widrow, 1988). Currently, the Back Propagation (BP) neural network is one of the most widely used ANNs. For example, Kavdır (2004) used a back propagation neural network classifier to differentiate 2- and 3-week-old sunflower plants from common cocklebur weeds of similar size, shape and color. Being a powerful classifier, the SVM has also been widely used in fields where ANNs have dominated. For instance, Kolios and Stylios (2013) used a series of traditional and modern algorithms to investigate Land Use and Land Cover (LULC) changes in a coastal area and reported that the SVM classifier gave the best overall accuracy for the study area. Zheng et al. (2015) applied SVMs to discriminate various crop types in a complex cropping system in the Phoenix Active Management Area and the models achieved very high overall classification accuracy. DA is also a widely used supervised classifier. For example, Marey-Pérez and Rodríguez-Vicente (2011) used discriminant analysis to revisit the factors determining forest management by farmers in northwest Spain. Riveiro-Valiño et al. (2009) used discriminant analysis to validate the types of dairy farms obtained from the combinatorial method for Galicia. Nieuwenhuizen et al. (2010) compared discriminant analysis and a neural network for determining the reflectance properties of sugar beet and volunteer potato and found that the neural network gave the best classification results.
In the current study, ANN, SVM and DA are compared for identifying flue-cured tobacco aroma types at the national scale in China based on routinely measured chemical compositions of flue-cured tobacco leaves. The results are expected to provide valuable information for regional planning and decision making in producing high-quality flue-cured tobacco with different aroma types.

METHODOLOGY
BP neural network: The Back Propagation (BP) neural network algorithm is a popular method for classification because of its strong nonlinear mapping ability and high learning accuracy. The BP neural network is trained with the error back-propagation algorithm, which is based on the gradient of the network's error function. In this study, a multilayer feed-forward network consisting of an input layer, a hidden layer and an output layer was applied to classify the aroma types of flue-cured tobacco (Fig. 1).
In Figure 1, each layer contains a number of neurons. $X_k = (x_{k1}, x_{k2}, \ldots, x_{kM})^T$ and $Y_k = (y_{k1}, y_{k2}, \ldots, y_{kP})^T$ are the k-th input and output samples of the BP network. The numbers of input, hidden and output layer neurons are M, I and P, respectively.
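The three-layer structure above can be illustrated with a minimal NumPy sketch. It is not the authors' Matlab implementation: the logistic sigmoid activation, the squared-error loss, the absence of bias terms, the random initial weights, the learning step size `eta=0.5` and the random input sample are all assumptions made for illustration. Only the layer sizes (18, 10 and 3, the values reported for M, I and P) come from the text.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid activation (assumed form)."""
    return 1.0 / (1.0 + np.exp(-x))

class BPNetwork:
    """Three-layer feed-forward network trained by error back-propagation."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical initial weights; the paper does not report them.
        self.w_hidden = rng.normal(0.0, 0.5, size=(n_in, n_hidden))
        self.w_out = rng.normal(0.0, 0.5, size=(n_hidden, n_out))

    def forward(self, x):
        h = sigmoid(x @ self.w_hidden)   # hidden-layer outputs
        y = sigmoid(h @ self.w_out)      # output-layer outputs
        return h, y

    def train_step(self, x, t, eta=0.5):
        """One gradient-descent update for a single sample (x, t)."""
        h, y = self.forward(x)
        error = t - y                                   # output error
        delta_out = error * y * (1.0 - y)               # local gradient, output layer
        delta_hidden = (delta_out @ self.w_out.T) * h * (1.0 - h)
        self.w_out += eta * np.outer(h, delta_out)      # weight modification
        self.w_hidden += eta * np.outer(x, delta_hidden)
        return 0.5 * float(np.sum(error ** 2))          # loss before the update

# Layer sizes M = 18, I = 10, P = 3 as reported in the text.
net = BPNetwork(n_in=18, n_hidden=10, n_out=3)
x = np.random.default_rng(1).random(18)    # made-up sample, not real tobacco data
t = np.array([1.0, 0.0, 0.0])              # one-hot target for one aroma type
losses = [net.train_step(x, t) for _ in range(200)]
```

Repeating the update on one sample only shows that the error decreases; a real classifier would iterate over all training samples and monitor performance on held-out data.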
The sigmoid function, which is continuous, differentiable and non-linear, was used as the activation function for the hidden and output layers:

$$f(x) = \frac{1}{1 + e^{-x}}$$

The input and output formulas of the i-th neuron in the hidden layer and of the p-th neuron in the output layer, the output error of the p-th neuron in the output layer and the weight-modification formula are expressed with the following notation: w is the weight between neurons; v and y are the input and output values of the output layer, respectively; $T_k = (t_{k1}, t_{k2}, \ldots, t_{kP})$ is the expected output; n is the number of iterations; $\eta$ is the learning step size; $\delta$ is the local gradient; and k denotes the k-th sample. Here, M, I and P were 18, 10 and 3, respectively. More information on the back propagation neural network can be found in Hecht-Nielsen (1989) and Johnson and Wichern (1992).

SVM: The Support Vector Machine (SVM) developed by Vapnik (Li et al., 2009) is a statistical learning technique based on VC (Vapnik-Chervonenkis) dimension theory, which minimizes both prediction error and model complexity (Li et al., 2009). SVMs overcome some problems of ANNs, such as over-fitting and local minima. As shown in Figure 2a, the input vectors are mapped from the input space to a high-dimensional feature space by a nonlinear transformation function. An optimal separating hyperplane can then be constructed in this feature space. In Figure 2b, circles and triangles represent samples of different classes and the samples drawn with solid lines are the support vectors. The optimal hyperplane is obtained by solving:

$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i} \xi_i$$

subject to $y_i (w^T \phi(x_i) + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $i = 1, 2, 3, \ldots$, where C is the penalty coefficient. The larger the value of C, the more severe the penalty.
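To make the role of the penalty coefficient C concrete, the following sketch evaluates the soft-margin objective and the slack variables for a given hyperplane. It assumes the identity feature map (φ(x) = x) instead of a nonlinear transformation, and the data, weights and function name `soft_margin_objective` are hypothetical.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return 0.5*||w||^2 + C*sum(xi) and the slack vector xi.

    Assumes phi(x) = x (a linear SVM); labels y must be +1 or -1.
    """
    margins = y * (X @ w + b)
    # Smallest slack satisfying y_i (w.x_i + b) >= 1 - xi_i and xi_i >= 0.
    slack = np.maximum(0.0, 1.0 - margins)
    objective = 0.5 * float(w @ w) + C * float(slack.sum())
    return objective, slack

# Made-up two-dimensional samples; the third lies on the hyperplane.
X = np.array([[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([1.0, -1.0])
b = 0.0

obj_small_c, slack = soft_margin_objective(w, b, X, y, C=1.0)
obj_large_c, _ = soft_margin_objective(w, b, X, y, C=10.0)
print(slack)                      # [0. 0. 1.]
print(obj_small_c, obj_large_c)   # 2.0 11.0
```

The same margin violation costs 1 unit when C = 1 and 10 units when C = 10, which is the sense in which a larger C penalizes misclassified or within-margin samples more severely.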
SVMs are binary classifiers. Several methods have been designed to deal with multi-class classification problems (Hu et al., 2010; Hsu and Lin, 2002). The one-against-one method was adopted in the current study. For a k-class classification problem (k > 2), a total of k(k-1)/2 classifiers are constructed and each of them is trained on data from two different classes. In this case, three classifiers were constructed to identify the light, medium and heavy aroma types, one for each pair of types.

Discriminant analysis: Discriminant analysis is a multivariate statistical procedure in which a data set containing p variables is separated into a number of previously defined groups using a linear combination of features. Given a set of p independent variables with known class k, discriminant analysis attempts to find linear combinations of the predictors (the discriminant function, D). The function D (Eq. 11) is expected to differentiate the k groups of samples and to estimate group membership and its probability according to Fisher's discriminant procedure. For each group i (i = 1, …, k), the discriminant function D is defined as follows:

$$D_i = b_1 x_1 + b_2 x_2 + \cdots + b_p x_p + c \qquad (11)$$

where $b_1, b_2, \ldots, b_p$ are discriminant coefficients, $x_1, x_2, \ldots, x_p$ are independent variables and c is a constant. The centroids summarizing the group information are calculated as follows:

$$\bar{D}_i = b_1 X_1 + b_2 X_2 + \cdots + b_p X_p + c$$

where $X_1, X_2, \ldots, X_p$ denote the mean values of the independent variables in the corresponding discriminant function for group i. The discriminant function D is designed to maximize the distance between the centroids. The group membership of a new case is calculated based on the centroids. When the average discriminant score is lower than zero, the new case is assigned to the group with the lower centroid, and vice versa.
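The discriminant function and centroid rule can be sketched for the simplest case of two groups. This is an illustration under stated assumptions, not the study's three-group analysis: the coefficients are obtained from the pooled within-group scatter matrix, the constant c is chosen so that the midpoint between the two centroids scores zero, and the data and function names are made up.

```python
import numpy as np

def fit_fisher(X0, X1):
    """Two-group Fisher discriminant: returns coefficients b and constant c."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-group scatter matrix.
    S = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    b = np.linalg.solve(S, m1 - m0)
    # Assumed convention: the midpoint of the group means has score zero.
    c = -0.5 * float(b @ (m0 + m1))
    return b, c

def score(x, b, c):
    """Discriminant score D = b1*x1 + ... + bp*xp + c."""
    return x @ b + c

# Hypothetical samples with two variables per case.
X0 = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [3.0, 2.0]])
X1 = np.array([[6.0, 5.0], [7.0, 6.0], [6.0, 7.0], [5.0, 6.0]])

b, c = fit_fisher(X0, X1)
centroid0 = score(X0.mean(axis=0), b, c)   # lower centroid
centroid1 = score(X1.mean(axis=0), b, c)   # higher centroid

new_case = np.array([2.5, 2.5])
group = 0 if score(new_case, b, c) < 0 else 1
print(b, c, centroid0, centroid1, group)
```

With this convention a negative score assigns the new case to the group with the lower centroid, matching the rule stated above.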
Accuracy evaluation: Overall accuracy and the kappa coefficient (Fleiss, 1971) were used to evaluate the classification performance. The kappa coefficient is defined as:

$$\kappa = \frac{P(A) - P(C)}{1 - P(C)}$$

where P(A) is the proportion of times that the methods agree and P(C) is the proportion of times that one expects them to agree by chance. Agreement is almost perfect when kappa is between 0.81 and 1.00, substantial between 0.61 and 0.80, moderate between 0.41 and 0.60, fair between 0.21 and 0.40, slight between 0 and 0.20 and poor when kappa < 0 (Landis and Koch, 1977). The relative improvement of overall accuracy and of the kappa coefficient was used to measure how much the better-performing models improved classification accuracy over the reference methods:

$$R = \frac{\mathrm{CAE} - \mathrm{CAR}}{\mathrm{CAR}} \times 100\%$$

where CAE and CAR are the overall accuracy or kappa coefficient of the better-performing model and of the reference method, respectively. All analyses were done in Matlab 7.0.
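The evaluation measures can be sketched as small functions operating on a confusion matrix. The matrix below is hypothetical, chance agreement P(C) is assumed to be estimated from the row and column marginals and relative improvement is assumed to be the percentage gain over the reference value.

```python
import numpy as np

def overall_accuracy(cm):
    """Proportion of samples on the diagonal of the confusion matrix."""
    return float(np.trace(cm)) / float(cm.sum())

def kappa(cm):
    """Kappa = (P(A) - P(C)) / (1 - P(C)), with P(C) from the marginals."""
    n = float(cm.sum())
    p_a = float(np.trace(cm)) / n
    p_c = float(cm.sum(axis=0) @ cm.sum(axis=1)) / (n * n)
    return (p_a - p_c) / (1.0 - p_c)

def relative_improvement(ca_e, ca_r):
    """Percentage improvement of the better model (ca_e) over the reference (ca_r)."""
    return (ca_e - ca_r) / ca_r * 100.0

# Hypothetical 3x3 confusion matrix (rows: reference type, columns: predicted type).
cm = np.array([[20, 5, 0],
               [3, 25, 2],
               [1, 4, 30]])

oa = overall_accuracy(cm)                # 75 correct out of 90 samples
k = kappa(cm)
ri = relative_improvement(0.80, 0.64)    # made-up accuracies
print(round(oa, 4), round(k, 4), round(ri, 1))
```

For this made-up matrix the kappa value falls in the 0.61 to 0.80 band, which the Landis and Koch scale labels substantial agreement.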

Data:
From 2003 to 2007, 186 tobacco leaf samples of grade C3F were collected from representative flue-cured tobacco planting counties across China. Among them, 27 records with light, medium and heavy aroma types were used to train the classifiers and the remaining samples were unclassified. Eighteen routinely measured chemical compositions of flue-cured tobacco leaves were used to classify the aroma types in the current study: water-soluble total sugar, total plant alkaloid, total nitrogen, protein, reducing sugar, total volatile acids, total volatile alkali, ratio of nitrogen to nicotine, ratio of sugar to nicotine, ratio of potassium to chlorine, petroleum ether extracts, pH, potassium, chloride, nitrate, sulfate, ash content and alkalinity of water-soluble ash content. The descriptive statistics of these chemical parameters are given in Table 1.
The first five chemical compositions (water-soluble total sugar, total plant alkaloid, protein, total nitrogen and reducing sugar) are the most widely used indicators for evaluating flue-cured tobacco leaf quality (Hu et al., 2010; Wang et al., 1998; Du et al., 2000). Therefore, these five chemical compositions were used as basic indicators. The others were added to the classifiers one after another. In total, 42 classifiers developed with BP, SVM and DA were evaluated in the current study (Table 2).
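The count of 42 classifiers follows from 14 indicator sets (5 to 18 indicators) for each of the three methods. The sketch below enumerates them. The order in which the 13 additional indicators are added is assumed to be the order in which they are listed in the Data section, and the names `SVM5` or `DA5` are hypothetical labels patterned on the `BP5` and `BP9` names used in the text.

```python
from itertools import product

BASIC = [
    "water-soluble total sugar", "total plant alkaloid", "protein",
    "total nitrogen", "reducing sugar",
]
# Assumed order of addition: the order of the list in the Data section.
ADDITIONAL = [
    "total volatile acids", "total volatile alkali",
    "ratio of nitrogen to nicotine", "ratio of sugar to nicotine",
    "ratio of potassium to chlorine", "petroleum ether extracts", "pH",
    "potassium", "chloride", "nitrate", "sulfate", "ash content",
    "alkalinity of water-soluble ash content",
]
METHODS = ("BP", "SVM", "DA")

def build_classifier_specs():
    """Map a label such as 'BP9' to the list of indicators it would use."""
    specs = {}
    sizes = range(len(BASIC), len(BASIC) + len(ADDITIONAL) + 1)  # 5..18
    for method, n in product(METHODS, sizes):
        specs[f"{method}{n}"] = BASIC + ADDITIONAL[: n - len(BASIC)]
    return specs

specs = build_classifier_specs()
print(len(specs))          # 42
print(len(specs["BP9"]))   # 9
```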

RESULTS
The overall accuracy and kappa coefficients of the classifiers are shown in Table 2. The mean values of overall accuracy and kappa coefficient were 78% and 0.66 for the BP models, 67% and 0.50 for the SVM models and 67% and 0.50 for the DA models. The results showed that the BP models performed better than the SVM and DA methods. The relative improvements of BP over SVM and DA were 16.5% for overall accuracy and 33.6% for the kappa coefficient.
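As a rough check of these figures, the relative improvement can be recomputed from the rounded means quoted above, assuming it is the percentage gain of the BP value over the reference value:

```latex
\begin{align*}
R_{\mathrm{OA}} &= \frac{78 - 67}{67} \times 100\% \approx 16.4\% \\
R_{\kappa} &= \frac{0.66 - 0.50}{0.50} \times 100\% = 32.0\%
\end{align*}
```

Both values are close to the reported 16.5% and 33.6%. The small gaps are presumably due to rounding of the mean values; the text does not state how the reported figures were computed.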
Compared with the basic BP classifier (BP5), the BP models with more indicators performed better. The average relative improvements were 13% for overall accuracy and 25% for the kappa coefficient (Table 2). Among them, the BP models with 9, 12 and 18 indicators had the highest classification performance, with relative improvements above 15% for overall accuracy and above 30% for the kappa coefficient. Hence BP9, which achieved this high performance with the fewest indicators of the three, was taken as the optimal classifier for identifying flue-cured tobacco aroma types. Its confusion matrix is shown in Table 3. A total of 151 samples were classified into their well-known aroma types. The overall accuracy and kappa coefficient of BP9 were 81.18% and 0.72, respectively. These results suggest that the classification produced substantial agreement.
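The reported overall accuracy of BP9 is consistent with the sample counts given in the text, 151 correctly classified samples out of the 186 collected:

```latex
\[
\mathrm{OA} = \frac{151}{186} \approx 0.8118 = 81.18\%
\]
```

The kappa value of 0.72 lies between 0.61 and 0.80, the range that Landis and Koch (1977) label substantial agreement.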
The spatial distribution map of the flue-cured tobacco aroma types was built in ArcGIS 9.3 based on the optimal result produced by BP9 (Table 4). Most of the samples were classified into their well-accepted types. The main areas producing flue-cured tobacco of the heavy aroma type are Henan, Anhui, Jiangxi and Hunan provinces and the eastern region of Shandong province. Areas producing flue-cured tobacco of the medium aroma type are mostly concentrated in Guizhou, Chongqing, Hubei, central Shandong, northwestern Hunan and northeastern Yunnan. Areas producing flue-cured tobacco of the light aroma type are mainly distributed in Yunnan and Fujian provinces. However, not all results of the current study were consistent with the well-known types. For instance, Yunnan province, located in southwestern China, is well known for yielding flue-cured tobacco of the light aroma type. In the current study, more than half of the samples in Yunnan were classified into the light aroma type and the rest were classified into the medium and heavy aroma groups (Fig. 3). These results were in agreement with a previous study (Yang et al., 2014) and further confirmed that care should be taken in regional planning and decision making for flue-cured tobacco production in these areas.

Fig. 3: Spatial distribution of flue-cured tobacco aroma types

CONCLUSION
In this study, three kinds of supervised classifiers were evaluated for identifying flue-cured tobacco aroma types using routinely measured chemical compositions. The results showed that the BP model with 9 indicators outperformed the others. The overall accuracy and kappa coefficient of BP9 were 81.18% and 0.72, respectively. The spatial distribution map showed that most samples were classified into their well-accepted aroma types. In the future, we will further explore the specific chemical indices that allow the optimal model to achieve the best performance.

Table 1: Descriptive statistics of chemical compositions of tobacco leaves

Table 2: Overall accuracy (OA) and kappa coefficients of the classifiers

Table 3: Relative improvement of OA and kappa coefficients for BP