CATEGORIZATION OF GELAM, ACACIA AND TUALANG HONEY ODOR-PROFILE USING K-NEAREST NEIGHBORS

Honey authenticity with respect to honey type is an issue of great importance and interest in agriculture. Current research documents that specific types of honey have their own uses in the medical field. However, it is quite challenging to classify different types of honey with the naked eye alone. This work demonstrates a successful application of an electronic nose (E-nose) as an instrument for identifying the odor-profile patterns of three common honeys in Malaysia (Gelam, Acacia and Tualang honey). The E-nose produced odor measurement signals in the form of numeric resistance (Ω). The readings were pre-processed using a normalization technique to bring the features onto a standardized scale. Mean features were extracted, and boxplots were used as the statistical tool to present the data pattern for the three types of honey. The extracted mean features were used as input features to a K-Nearest Neighbors (KNN) classifier and evaluated using several splitting ratios. Excellent results were obtained: KNN achieved 100% accuracy, sensitivity and specificity using k=1, a 90:10 training-to-testing ratio and Euclidean distance. The findings confirm the ability of the KNN classifier, as an intelligent classification method, to classify different honey types from E-nose measurements. KNN outperformed the other classifiers considered, required less parameter optimization and achieved promising results.


INTRODUCTION
Honey is a natural food that can be used as a food additive, medicinal food and food preservative; it is yellow in color, sticky and sweet in taste. It is collected from the exudate of trees and the nectar of blossoms (Shafiee, Minaei, Moghaddam-Charkari, & Barzegar, 2014). Honey types can be differentiated by where the honey is collected (botanical origin), which sometimes also influences honey quality, market price and appearance (Seisonen, Kivima, & Vene, 2015). Honey has a unique compound structure. A large number of compounds with many benefits to human health are present in a honey sample. It is rich in nutrients, and much research has been undertaken on honey as a resource for medicinal use. In Malaysia, the three most common honey types are Gelam, Acacia and Tualang honey (Chua, Rahaman, Adnan, & Eddie Tan, 2013). Gelam honey, which is smooth, has a strong penetrating odour, is 99% soluble in warm water and appears as an amber liquid (Kassim, Achoui, Mustafa, Mohd, & Yusoff, 2010), is a monofloral honey produced by Apis mellifera from the floral source Melaleuca spp. (Gelam tree). Acacia honey, also referred to as Robinia pseudoacacia honey (Marghitas et al., 2010), has a milder taste than the others, is transparent to light yellow in color and does not crystallize. Tualang honey is produced by Apis dorsata, a bee that builds its hives on the Tualang tree (Koompassia excelsa) (Bashkaran et al., 2011), and is collected in the rain forest of Peninsular Malaysia. It is quite challenging to classify honeys into their groups since they look quite similar in color. Sometimes buyers or entrepreneurs are deceived by sellers about the authenticity of the honey, and honey prices are manipulated.
Therefore, the present research is a necessary study to classify three common types of honey in Malaysia using an E-nose and a KNN approach. Raw odor data taken from the E-nose are pre-processed using a normalization technique, followed by mean feature extraction and boxplots as the statistical presentation tool. The classification result is determined using the KNN approach.

RELATED RESEARCH REVIEW
Since honey comes in various types and has a high price value, there are several methods for honey detection, whether to classify honey by type or to distinguish pure honey from adulterated honey. Detection techniques can be separated into three categories: chemical, image and electronic. Chemical detection for honey has been widely performed using physicochemical and bioactive properties (Isla et al., 2011) and a combination of Solid Phase Micro-Extraction (SPME) and Gas Chromatography Mass Spectrometry (GC-MS) (Soria, Sanz, & Martinez-Castro, 2009). Image detection studies have been reported by other researchers using the ARGUS image processing system and machine vision to analyze the color, shape and texture of honey. Computer vision systems and artificial neural networks have already been explored for honey characterization (Shafiee et al., 2014). Compared with these detection devices, the E-nose is less time consuming, does not need complex operating conditions or an expert operator, and costs less (Boeker, 2014). E-nose analysis is one of the common detection methods in the food industry and is precise in the detection of honey (N. Zahed, M.S.Najib & N.F.Azhani, 2016). An E-nose is a system that mimics the human olfactory system by evaluating the chemical profile of complex compounds. Its function depends on an array of sensors with overlapping sensitivity (Westenbrink et al., 2015). It produces a signal reading and shows the recognized pattern via software on a computer to which it is connected by a universal serial bus (USB) cable. Unlike other analytical instruments, the E-nose identifies mixtures of organic samples without having to identify the individual components present in the samples (Huang, Liu, Zhang, & Wu, 2015).
Data measured by the E-nose undergo signal processing. Signal processing starts with the signal from the E-nose, continues with normalization as data pre-processing, and then proceeds to feature extraction (M.S. Najib, S.H.Azih. N.Zahed, M.F.Zahari & W.M.A.Mamat, 2016). In feature extraction, the mean features are used to compare samples and are verified with a statistical tool, the boxplot. In pre-processing, normalization is a vital step for increasing classification accuracy (Halim, Najib, Ghazali, & Zahari, 2014). This technique generally serves to accommodate the multiple ranges and units of values in multidimensional data by scaling and translating so that each dimension has zero mean and unit variance. The boxplot is a statistical tool for expressing the specific characteristics of the data in a group of datasets. This technique was introduced by J.W. Tukey in 1977 (Tukey, 1977). Intelligent classification of honey using an E-nose has already been carried out by Zakaria et al., who used a Probabilistic Neural Network (PNN) to classify 18 different samples of honey (Zakaria et al., 2011). In that paper, the data were measured using a Cyranose 320 E-nose with 32 non-selective sensors. The sample was heated before data measurement, and the experiment was repeated 5 times. The data were then pre-processed using fractional measurement. Pattern recognition was presented using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Classification of honey using an E-nose has also been performed with an Artificial Neural Network (ANN) by Simona Benedetti, using the sensor readings as the input data of the system (Branco, Kidd, & Pickard, 2006). In that paper, an E-nose model 3320 with 22 different sensors was used for data measurement. The paper gives no details about data pre-processing.
The experiment was repeated three times and feature extraction was done using PCA. Table 1 summarizes the classification rates of the existing intelligent classification methods. Neither PNN nor ANN reached a 100% classification rate.
Table 1: Classification rate of honey using PNN and ANN
Intelligent classification method    Classification rate (%)
PNN                                  92.59
ANN                                  83.50
To increase the performance of honey classification from E-nose measurements, intelligent classification using K-Nearest Neighbors (KNN) is proposed. In the literature, KNN has already been studied for honey classification from a chemical data perspective, and the accuracy obtained did not reach 100% for all samples (Maamor et al., 2014).
The principal idea of KNN is that the most frequent class label among the nearest neighbors is selected as the class label for a test sample. This approach is suitable for text-based problems as well as visual pattern recognition, in which similarity is compared in terms of the "k" nearest inputs in the neighbourhood of the dataset (Lam et al., 2014). KNN has a major benefit in that it is very easy to implement and computationally efficient (Devak, Dhanya, & Gosain, 2015), owing to its fewer calculations, high accuracy and better timeliness compared with other machine learning algorithms (Li, Wang, Tang, & Tian, 2014). Moreover, this approach has been shown to be very effective in time series classification problems (González, Bergmeir, Triguero, Rodríguez, & Benítez, 2016). In this approach, a new sample is compared with the training samples in terms of the nearest objective function value in the training space (Liu & Sun, 2011). Training and testing data are declared through a data splitting process. In a 10:90 training-to-testing split, the first portion of the data is declared as training data and the remaining portion is used for testing (Naughton, Stokes, & Carthy, 2010). Measuring the similarity of each case is one of the steps in the KNN algorithm. There are different ways to calculate similarity, for instance by using geographical graphs, questionnaires, colour and so on. To make the calculation simpler, all similarity characteristics are converted to numeric values, and an example of such a calculation is the Euclidean distance (Xia, Xiong, Luo, Dong, & Zhang, 2015).
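As an illustration of the voting idea and the Euclidean distance described above, the following is a minimal sketch in Python (not the MATLAB implementation used in this work). The function names and the four-sensor readings are hypothetical values chosen only for demonstration; they are not measured data.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=1):
    """Label `query` by majority vote among its k nearest training cases."""
    order = sorted(range(len(train_x)),
                   key=lambda i: euclidean(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical normalized readings for sensors S1..S4 (not the paper's data).
train_x = [
    [0.25, 0.40, 0.55, 0.90],  # Gelam
    [0.27, 0.42, 0.53, 0.92],  # Gelam
    [0.20, 0.35, 0.50, 0.80],  # Acacia
    [0.22, 0.33, 0.48, 0.82],  # Acacia
    [0.33, 0.60, 0.95, 0.70],  # Tualang
    [0.35, 0.62, 0.97, 0.72],  # Tualang
]
train_y = ["Gelam", "Gelam", "Acacia", "Acacia", "Tualang", "Tualang"]

query = [0.34, 0.61, 0.96, 0.71]
predicted = knn_predict(train_x, train_y, query, k=1)
print(predicted)
```

With k=1 the vote reduces to copying the label of the single closest training case, which is why this setting needs no further tuning beyond the choice of distance.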

Sample Preparation
Three samples of pure honey were obtained from Terengganu Honey hunter Co-operative Limited, a project run in collaboration with the Terengganu Agriculture Department. The samples were kept at room temperature to avoid adulteration of the honey. All types of honey were collected from Hutan Simpan Merchang, Terengganu. The selected types, Gelam, Acacia and Tualang honey, are among the most common honeys in Malaysia. The same volume, 50 ml, was taken of each type of honey and placed in three labelled vials.

E-nose Setup
The E-nose system was set up as shown in Figure 1. The main components of the E-nose system are a computer to acquire and display the measured odor data, a microcontroller circuit to read the response to the test sample, a fan to spread the sample odor at a constant flow, a sensor array as the detector for the honey sample and a chamber housing that traps the sample odor so that it does not mix with surrounding odors. The computer was installed with dedicated software on which the E-nose results were displayed. A USB cable connected the computer to the circuit board containing the microcontroller. Custom software read the output from the microcontroller and displayed the result in numeric form. The fan spreads the odor from the sample so that the sensor array responds and gives a signal for the odor under test. A special pipe was used to carry the sensor wiring to the circuit board. The honey sample was placed under the sensor array. Since the E-nose system uses non-specific sensors for detection, four gas sensors were selected for this research. The main sensors used were of the conducting polymer or metal oxide type. The sensor array consists of four different types of sensors (S1, S2, S3 and S4).

Data Measurement
In one experiment, a sample was measured for 120 s (data collection), followed by a cleaning phase of 320 s (E-nose neutralisation). For each sample (honey type), the experiment was repeated five times. The E-nose was then put into rest mode for 30 minutes, after which it was switched back to active mode. This process was repeated for the remaining samples. Table 2 shows the analytical conditions for measurement using the E-nose system. The total output from the E-nose is a numeric matrix of size 3000x4, referred to as the raw data. It comprises 1000x4 readings for the Gelam test, 1000x4 for the Acacia test and 1000x4 for the Tualang test.

Data Pre-processing
The measured raw data were pre-processed using a normalization technique to reduce error and normalize the range of the data. Several equations can be used for normalization; the one selected in this research is Eq. (1):
$$R = \frac{R'}{R_{\max}} \qquad (1)$$
In Eq. (1), $R'$ represents each data reading in ohms from the E-nose measurement, $R_{\max}$ is the maximum reading in each data measurement and $R$ is the normalized reading. The data were then processed using a clustering technique, which is used in data mining and is common in statistical data analysis. The aim of this analysis is to group the data according to their cluster (group) using a function in MATLAB. The whole set of normalized data forms a matrix of 3000 measurements x 4 sensors. These data were clustered by sample into a matrix of 1000 measurements x 4 sensors for each of Gelam, Acacia and Tualang honey. After the datasets were grouped by sample, the data were reduced by clustering by experiment, since each sample was measured in 5 different experiments. Each experiment consists of a matrix of 200 measurements x 4 sensors. The data were then condensed by calculating the mean for all the experiment clusters. The final mean data comprise 30 sets of mean measurements x 4 sensors: 10x4 Gelam odor data, 10x4 Acacia odor data and 10x4 Tualang odor data.
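The normalization and mean-reduction steps can be sketched as follows in Python. This is an illustrative sketch under stated assumptions, not the MATLAB procedure used in this work: it assumes that the maximum reading is taken per sensor column within one experiment, and that mean cases are formed by averaging consecutive blocks of rows. The helper names, the block size and the toy resistance values are all hypothetical.

```python
def normalize_by_max(readings):
    """Scale each sensor column by its own maximum (reading / column max).

    Assumes strictly positive resistances, so the column maximum is non-zero.
    """
    n_sensors = len(readings[0])
    col_max = [max(row[j] for row in readings) for j in range(n_sensors)]
    return [[row[j] / col_max[j] for j in range(n_sensors)] for row in readings]

def block_means(readings, block_size):
    """Average consecutive blocks of rows, giving one mean row per block."""
    means = []
    for start in range(0, len(readings), block_size):
        block = readings[start:start + block_size]
        means.append([sum(col) / len(block) for col in zip(*block)])
    return means

# Toy experiment: 4 readings x 4 sensors, in ohms (made-up values).
raw = [
    [100.0, 50.0, 20.0, 400.0],
    [200.0, 100.0, 40.0, 800.0],
    [150.0, 75.0, 30.0, 600.0],
    [50.0, 25.0, 10.0, 200.0],
]
normalized = normalize_by_max(raw)
cases = block_means(normalized, block_size=2)
print(cases)
```

After normalization every sensor column peaks at 1, so sensors with very different resistance ranges contribute on a comparable scale to the distance used by the classifier.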

Feature Extraction
A boxplot, also known as a box-and-whisker diagram, is one of the statistical analysis tools used in this research and was produced using MATLAB. Generally, it is used to present the distribution or full range of variation of the data. It summarizes the data with five values: the minimum, the maximum, the median, the 1st quartile (Q1) and the 3rd quartile (Q3). The range between Q1 and Q3 is the interquartile range (IQR). Data that fall outside the range are presented as outliers.
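The quantities shown in a boxplot can be computed directly, as in the minimal Python sketch below. Two points are assumptions rather than details taken from this work: outliers are flagged with the common 1.5 x IQR fence convention, and quartiles use the "inclusive" interpolation of Python's statistics module, which may differ slightly from MATLAB's quartile rule. The sample values are made up.

```python
import statistics

def boxplot_summary(values, whisker=1.5):
    """Five-number summary, IQR and outliers beyond whisker * IQR fences."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "iqr": iqr,
        "outliers": outliers,
    }

# Made-up values with one obvious outlier.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(summary)
```

A narrow IQR, as reported later for some sensors, simply means that Q1 and Q3 computed this way lie close together.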

Intelligent Classification
The KNN classifier is an intelligent classification technique that can be run in MATLAB, which provides configurable functions for this algorithm. To perform classification in this system, the distance between data points was measured by applying a selected rule, and the testing data were compared with the training data. The procedure started with input and output assignment, in which the input and output of the system were clearly declared. The input is the mean of the clustered data for each sample and the output is the class of each sample. The values assigned for input and output remained the same throughout. The second step was data preparation. To find the best performance of intelligent classification using KNN, the data underwent data splitting (the split-sample technique). Part of the total data is subsampled as 'training' data and the remainder as 'testing' data, prepared according to the training-to-testing ratio. This approach was accepted by Cool et al, 1987 and has been practised by other researchers using ratios of 70:30, 60:40 and 50:50 (Surendiran & Vadivel, 2011). In this step, the total data were split at ratios from 10:90 to 90:10 before being input to the system. The next step was assigning the prepared training and testing data. The training data were assigned to the system first, followed by the testing data. The system automatically calculated the class of the testing data based on the training data. A confusion matrix was then constructed to count the true and false cases in the classification result. Lastly, the performance of honey classification using KNN was evaluated by statistical error analysis: the error of the classification result was calculated as the mean square error (MSE). Figure 4 indicates the final data cases after clustering the data. There are 10 cases for each sample, presented as lines in the 2D graph.
Each graph represents a different data pattern depending on the odor sample. The data are used as the input for intelligent classification using the KNN technique. Sensor 1 is a methane CNG gas sensor (S1), sensor 2 is a carbon monoxide gas sensor (S2), sensor 3 is an alcohol sensor (S3) and sensor 4 is a CO/combustible gas sensor (S4). In Figure 2 and Figure 3, the sensor with the highest response to the odorant sample is S4, owing to its high sensitivity to propane, which is one of the compounds found in Gelam and Acacia honey. In Figure 4, the most responsive sensor is S3, since benzoic compounds are among the highly volatile compounds in Tualang honey.
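The splitting procedure described in this section, with training-to-testing ratios swept from 10:90 to 90:10 and a 1-nearest-neighbor rule, can be sketched in Python as below. This is a hedged illustration only: the data are synthetic clusters generated around arbitrary centres, the split is assumed to be applied per class with the first portion used for training, and the helper names are hypothetical. It does not reproduce the results of this work.

```python
import math
import random

def nearest_label(train, query):
    """1-NN: return the label of the training case closest to `query`."""
    return min(train, key=lambda case: math.dist(case[0], query))[1]

def split_per_class(cases_by_class, train_pct):
    """Use the first train_pct percent of each class for training."""
    train, test = [], []
    for label, cases in cases_by_class.items():
        n_train = len(cases) * train_pct // 100
        train += [(x, label) for x in cases[:n_train]]
        test += [(x, label) for x in cases[n_train:]]
    return train, test

def accuracy(train, test):
    """Fraction of test cases whose 1-NN label matches the true label."""
    hits = sum(nearest_label(train, x) == y for x, y in test)
    return hits / len(test)

# Synthetic stand-in data: 10 cases x 4 sensors per honey type.
rng = random.Random(0)
centres = {"Gelam": 0.2, "Acacia": 0.5, "Tualang": 0.8}  # arbitrary values
cases_by_class = {
    label: [[c + rng.uniform(-0.05, 0.05) for _ in range(4)]
            for _ in range(10)]
    for label, c in centres.items()
}

results = {}
for train_pct in range(10, 100, 10):
    train, test = split_per_class(cases_by_class, train_pct)
    results[f"{train_pct}:{100 - train_pct}"] = accuracy(train, test)

for ratio, acc in results.items():
    print(ratio, f"{acc:.2f}")
```

Because the synthetic clusters are well separated, this toy sweep is not expected to show the dependence on the ratio reported for the measured data; it only shows how each ratio yields its own training and testing sets.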

Statistical Analysis
In this section, the data distributions for the three types of honey are presented using boxplots, including the minimum, maximum, median, 1st quartile, 3rd quartile, interquartile range and outliers (if present). To produce the boxplots in MATLAB, the data cases for each sample, a 10x4 matrix, were used as input. The distributions in the boxplots are organized by sensor. Figures 5 to 8 show the boxplot patterns for feature 1 (sensor 1) to feature 4 (sensor 4), respectively, for Gelam, Acacia and Tualang honey. The figures show that the boxplot pattern differs among the honeys for each feature, which means that the sensors respond differently to the odor of each honey type.
In Figure 5, the minimum, maximum, median, 1st quartile, 3rd quartile and interquartile range clearly differ among the honeys. The maximum data reading is 0.27 for Gelam, 0.23 for Acacia and 0.35 for Tualang. Figure 6 shows that Tualang honey has a lower interquartile range than the other honeys, because the values vary little across the cases of Tualang honey response to feature 2. In Figure 7, all five distribution values for Tualang honey are constant, meaning that the value is fixed across the cases of Tualang response to feature 3. In Figure 8, although the maximum values for Gelam and Tualang honey are quite similar, the other distribution values differ enough to separate the groups.
Table 3 indicates the mean data for each sample that was assigned as the class in the KNN classifier. From all the input data, one mean was calculated to represent each group. The mean values for Gelam, Acacia and Tualang are 0.6648, 0.6166 and 0.6967, respectively.
Table 4 shows the percentage similarity results of the KNN classifier using k=1 and varying the distance, the rule and the training-to-testing ratio, from a total of 200 measurement data for the three samples of honey. No observation is shared between the training and testing data. The training-to-testing ratios from 10:90 to 90:10 produced different accuracy rates. The lowest training ratio, 10:90, has the lowest performance. The performance increases as the ratio of training to testing data increases. The accuracy for the 70:30 to 90:10 splitting ratios shows consistent improvement. Based on the table, the successful configurations are the 70:30 to 90:10 training-to-testing splits using Euclidean distance with the nearest, random and consensus rules.
Across all training-to-testing ratios, the Correlation distance shows lower performance than the other distances for every rule. Figure 8 shows the honey classification graph at the 90:10 ratio using the KNN method. There are 180 training data and 20 testing measurements for each sample. The graph shows that every sample is classified correctly according to its sample mean. Table 5 focuses on the error calculation using the Mean Square Error (MSE) to check the performance of the KNN classifier with k=1, distance=Euclidean and rule=nearest, identified above as the best-performing configuration for classifying the three common types of honey in Malaysia. The MSE is zero at the 90:10, 80:20 and 70:30 ratios. The highest error occurs at the 10:90 training-to-testing ratio. Thus, the classification results indicate that increasing the training-to-testing ratio increases the performance of the classifier.
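The performance measures discussed above (accuracy, sensitivity, specificity and MSE) can be derived from a confusion matrix, as in the Python sketch below. The exact MSE definition used in this work is not stated, so the sketch assumes the squared error is 0 for a correct label and 1 for a wrong one, which makes the MSE equal to the misclassification rate. The labels, predictions and function names are hypothetical examples.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def sensitivity_specificity(matrix, k):
    """One-vs-rest sensitivity and specificity for class index k."""
    total = sum(map(sum, matrix))
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(row[k] for row in matrix) - tp
    tn = total - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)

def accuracy(matrix):
    """Share of all cases that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(map(sum, matrix))

def mse_01(y_true, y_pred):
    """Assumed MSE: mean of 0/1 squared errors (1 when the label is wrong)."""
    return sum((t != p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

labels = ["Gelam", "Acacia", "Tualang"]
y_true = ["Gelam", "Gelam", "Acacia", "Acacia", "Tualang", "Tualang"]
y_pred = ["Gelam", "Acacia", "Acacia", "Acacia", "Tualang", "Tualang"]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix)
print(accuracy(matrix), mse_01(y_true, y_pred))
for k, label in enumerate(labels):
    print(label, sensitivity_specificity(matrix, k))
```

Under this assumed definition, a zero MSE corresponds to a purely diagonal confusion matrix, which is consistent with 100% accuracy, sensitivity and specificity being reported together.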

CONCLUSION
In this research work, honey classification performance based on an E-nose and the KNN classifier was investigated. The odor profiles of three types of pure honey, Gelam, Acacia and Tualang, were classified after the E-nose measurements were pre-processed using normalization, mean features were extracted and the data were represented graphically using boxplots. Unlike existing methods, the samples in this work did not undergo heating before data measurement. The Codex Alimentarius standard mentions that one important aspect of honey adulteration is the application of excessive heat, which can affect honey quality through loss of volatile compounds and reduction of enzyme activity. The overall analysis showed that the E-nose was able to distinguish between the three types of honey, with significantly different odor-profile patterns across the four sensor array features. This was supported by the graphical representation using boxplots. KNN classification was successfully demonstrated, achieving 100% accuracy, sensitivity and specificity using a 90:10 ratio and Euclidean distance. Compared with the existing intelligent classification methods for honey, namely the probabilistic neural network and the artificial neural network, the KNN approach achieved a higher classification rate. Unlike the other methods, parameter optimization for the KNN approach was done by varying the distance, the rule and the training-to-testing splitting ratio. The observations showed that increasing the training-to-testing ratio gave better accuracy, sensitivity and specificity. The contribution of this paper is to show that KNN can be one of the intelligent classification methods for honey, producing a 100% accuracy rate.