Enhancing Electronic Nose Performance by Feature Selection Using an Improved Grey Wolf Optimization Based Algorithm

The electronic nose is a widely used artificial olfactory system for the detection and classification of volatile organic compounds. The high dimensionality of the data collected by electronic noses can hinder pattern recognition. Feature selection is therefore an essential stage in building a robust and accurate model for gas recognition. This paper proposes an improved grey wolf optimizer (GWO) based algorithm for feature selection and applies it to electronic nose data for the first time. The proposed algorithm employs two mechanisms. The first consists of two novel binary transform approaches, which are used to search electronic nose data for the feature subset that maximizes the classification accuracy while minimizing the number of features. The second is an adaptive restart approach, which attempts to further enhance the search capability and stability of the algorithm. The proposed algorithm is compared with five efficient feature selection algorithms on three electronic nose data sets. Three classifiers and multiple assessment indicators are used to evaluate the performance of the algorithms. The experimental results show that the proposed algorithm can effectively select feature subsets that are conducive to gas recognition, which can improve the performance of the electronic nose.


Introduction
Feature selection is an important technique in pattern recognition applications. In practice, data usually contain many redundant features, which can greatly affect classification accuracy and computational complexity. By eliminating the influence of redundant features on the classification process, feature selection plays an important role in reducing the dimension of the data, improving the accuracy of the model, and providing deeper insight into the data [1].
Feature selection can be roughly divided into three categories: filter, wrapper, and embedded [2]. The filter method sorts the features according to predefined criteria, and the feature selection process is independent of the classification. The wrapper method wraps the classifier in the search algorithm and it is guided by the objective function. In the embedded method, the selection of variables is integrated into the training process.
An electronic nose is a system for the detection and classification of volatile organic compounds that imitates human olfaction [3,4]. An electronic nose is composed of a gas sensor array, and the collected data are classified by a pattern recognition algorithm [5]. In recent years, electronic noses have played an important role in various fields, such as environmental monitoring,

• This paper applies a grey wolf optimization based algorithm to the feature selection of electronic nose data for the first time. Two novel binary transform approaches and an adaptive restart approach are employed in the proposed algorithm to enhance the classification accuracy and reduce the dimension of electronic nose data, which further enhances the performance of the electronic nose.

• The proposed algorithm is compared with other classical feature selection algorithms on three electronic nose datasets using multiple assessment indicators. The experimental results show that the proposed algorithm is superior to the other algorithms on all datasets, which indicates that it can be applied to different types of electronic nose and achieve positive results.

• The effects of the two proposed binary transform approaches and the adaptive restart approach are investigated using multiple assessment indicators, which shows that the two mechanisms help to select better odor features and enhance the accuracy of the final gas recognition.
The organization of this paper is as follows: Section 2 provides a description of the proposed algorithm. Section 3 introduces the data set and the feature extraction methods that are used in the experiment. The experimental results are discussed in Section 4. Finally, Section 5 concludes the paper.

Methodology
This section describes the proposed algorithm for solving the feature selection problem of gas sensor data. Its main parts are the original grey wolf optimization, the evaluation, the binary transform approach, and the adaptive restart approach, which are described in more detail in the following subsections.

Grey Wolf Optimization
Grey wolves are social animals with special skills in catching prey. Through cooperation and a strict hierarchy, wolves can catch prey in the shortest time. The leader of the pack is called the alpha, and alpha wolves are responsible for making decisions on predation. Beta wolves are on the second level and are responsible for assisting the alpha wolves in making decisions. Delta wolves have to submit to the alpha and beta wolves, but they can command the omega wolves; they are responsible for monitoring the surrounding environment and warning the pack in case of danger. Omega wolves have to follow the commands of all other levels.
In the mathematical model of GWO, α, β, and δ represent the first, second, and third best solutions. The remaining search agents are collectively referred to as ω. The search is guided by α, β, and δ. While hunting, the grey wolves gradually approach the prey and surround it [32]. In the mathematical model of this behavior, $\vec{X}_p(t)$ and $\vec{X}(t)$ represent the position vectors of the prey and of a grey wolf at iteration t, $\vec{rand}_1$ and $\vec{rand}_2$ are random vectors in the range of 0 to 1, and a is a number that decreases linearly from 2 to 0 over the course of the iterations according to Equation (5), where $M_{Iter}$ is the maximum number of iterations of the algorithm.
It is assumed that α, β, and δ are the three best solutions found during the search. Thus, the three wolves with the three smallest fitness values are retained as α, β, and δ in each iteration, and they guide the model in updating the positions of the other search agents according to Equation (6). α, β, and δ keep searching for prey during the hunting process. Algorithm 1 outlines the grey wolf optimization (GWO) algorithm.

Algorithm 1 Grey wolf optimization
1: Initialize the number of iterations for optimization $N_{iter}$
2: Initialize the positions of N grey wolves $X_i$, i = 1, 2, ..., N
3: Calculate the fitness value of each grey wolf
4: Choose the best three grey wolves as $X_\alpha$, $X_\beta$, $X_\delta$ based on their fitness
5: t ← 0
6: while t < $N_{iter}$ do
7:   Update the position of the wolves using Equation (6)
8:   Update a, A, and C
9:   Calculate the fitness of each grey wolf
10:   Update the first three grey wolves $X_\alpha$, $X_\beta$, $X_\delta$
11:   t ← t + 1
12: end while
13: return best fitness, $X_\alpha$
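The loop of Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch, assuming the standard continuous GWO update rules from the literature (A = 2a·rand1 − a, C = 2·rand2, D = |C·X_leader − X|, and a new position equal to the average of the three leader-guided moves). The function name `gwo_minimize`, the bounds, the population size, and the choice to keep the best leaders found so far are illustrative assumptions, not settings taken from this paper.

```python
import random


def gwo_minimize(fitness, dim, n_wolves=10, n_iter=50, lower=-5.0, upper=5.0, seed=0):
    """Continuous GWO sketch; bounds, sizes, and seed are assumed values."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_wolves)]

    def leaders(population):
        # Copies of the three wolves with the smallest fitness: alpha, beta, delta.
        return [list(w) for w in sorted(population, key=fitness)[:3]]

    alpha, beta, delta = leaders(wolves)
    for t in range(n_iter):
        a = 2.0 - t * (2.0 / n_iter)  # decreases linearly from 2 towards 0
        for w in wolves:
            for d in range(dim):
                moves = []
                for leader in (alpha, beta, delta):
                    A = 2.0 * a * rng.random() - a   # A = 2a * rand1 - a
                    C = 2.0 * rng.random()           # C = 2 * rand2
                    D = abs(C * leader[d] - w[d])    # distance to this leader
                    moves.append(leader[d] - A * D)
                # Average of the three leader-guided moves, clipped to the bounds.
                w[d] = min(upper, max(lower, sum(moves) / 3.0))
        # Assumption: previous leaders stay candidates, so alpha never gets worse.
        alpha, beta, delta = leaders(wolves + [alpha, beta, delta])
    return fitness(alpha), alpha
```

For example, `gwo_minimize(lambda x: sum(v * v for v in x), dim=3)` returns the best fitness found on a sphere function together with the corresponding position.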

Evaluation
The fitness value is the evaluation criterion used in searching for the optimal solution. The fitness function, which guides the algorithm toward the optimal solution, needs to ensure that the selected solution achieves high classification accuracy with different classifiers.
K-nearest neighbor (KNN) is a commonly used classification method. The test samples are classified with KNN by analyzing the categories of K training samples closest to the test samples in feature space. KNN is easy to implement and has great performance in multi-class classification, hence it was used to calculate the fitness of the proposed algorithm.
In order to find the optimal feature subset, the evaluation of selected feature subsets must consider the following two aspects:
• Maximum classification accuracy
• Minimum number of features
Considering these two factors, the fitness function is defined in Equation (10), where f is the fitness value, $P_R(D)$ is the error rate on the test set with the selected features under decision D, |F| and |S| are the lengths of the original eigenvector and of the eigenvector with the selected features, and α and β are the weights balancing the classification accuracy and the eigenvector length, with α ∈ [0,1] and β = 1 − α. The experimental data set was divided into a train set and a test set. 10-fold cross-validation was used to train the classification models on the train set to prevent overfitting, and the model with the best performance was used to calculate the error rate on the test set.
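The definitions above suggest a fitness of the form f = α·P_R(D) + β·|S|/|F|. The sketch below assumes that form and evaluates it with a small pure-Python KNN. The helper names, the default α = 0.99, and the rule that an empty subset receives the worst fitness are assumptions made for illustration, not values stated in this paper, and the 10-fold cross-validation step is omitted for brevity.

```python
import math
from collections import Counter


def knn_error_rate(train_X, train_y, test_X, test_y, mask, k=5):
    """Error rate of a k-NN classifier restricted to features where mask[d] == 1."""
    idx = [d for d, bit in enumerate(mask) if bit]
    errors = 0
    for x, y in zip(test_X, test_y):
        # Rank training samples by Euclidean distance in the selected subspace.
        neighbours = sorted(
            zip(train_X, train_y),
            key=lambda s: math.dist([s[0][d] for d in idx], [x[d] for d in idx]),
        )[:k]
        predicted = Counter(label for _, label in neighbours).most_common(1)[0][0]
        errors += predicted != y
    return errors / len(test_y)


def fitness(mask, train_X, train_y, test_X, test_y, alpha=0.99, k=5):
    """f = alpha * error + beta * |S| / |F|, beta = 1 - alpha (alpha is an assumed value)."""
    if not any(mask):
        return 1.0  # assumption: an empty feature subset gets the worst fitness
    beta = 1.0 - alpha
    error = knn_error_rate(train_X, train_y, test_X, test_y, mask, k)
    return alpha * error + beta * sum(mask) / len(mask)
```

With a weight α close to 1, the error rate dominates and the subset length mainly breaks ties between subsets of similar accuracy.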

Binary Transform Approach
Each dimension of the solution obtained by the original GWO is a continuous value. Because feature selection is a binary problem, each dimension of the solution needs to be restricted to the binary values 0 and 1. S-shaped and V-shaped transform functions are usually used to convert continuous values to binary values [33]. Two new approaches for mapping search agents to binary vectors are introduced in the following.
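For orientation, the conventional S-shaped and V-shaped transforms mentioned above can be sketched as follows. This is a minimal sketch of the commonly used sigmoid and |tanh| variants, not of the two approaches proposed in this paper; the function names are illustrative.

```python
import math
import random


def s_shaped(x):
    """Sigmoid transfer function: maps a continuous value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def v_shaped(x):
    """|tanh(x)| transfer function: large magnitudes give probabilities near 1."""
    return abs(math.tanh(x))


def binarize_s(position, rng):
    """Set each bit to 1 with probability S(x_d)."""
    return [1 if rng.random() < s_shaped(x) else 0 for x in position]


def binarize_v(position, previous_bits, rng):
    """Flip the previous bit with probability V(x_d); otherwise keep it."""
    return [1 - b if rng.random() < v_shaped(x) else b
            for x, b in zip(position, previous_bits)]
```

The S-shaped rule sets a bit directly from the transformed value, whereas the V-shaped rule decides whether to flip the bit the agent already holds.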

Approach1
In this approach, the main function is formulated in Equation (11), where $x^d_{binary}$ is the binary value of each search agent in dimension d, and $x^d_1$, $x^d_2$, and $x^d_3$ are calculated using Equations (12)-(14).
In these equations, rand is a random number between 0 and 1; $x^d_\alpha$, $x^d_\beta$, and $x^d_\delta$ are the values of the alpha, beta, and delta wolves in dimension d; $A^d_1$ and $D^d_\alpha$ are calculated using Equations (3) and (7), $A^d_2$ and $D^d_\beta$ using Equations (3) and (8), and $A^d_3$ and $D^d_\delta$ using Equations (3) and (9).

Approach2
In this approach, the effect of the position vector on the transformation is also considered. The main function is formulated in Equation (15), where $x^d_{binary}$ is the binary value of each search agent in dimension d, rand is a random number between 0 and 1, and $NBT^d$ is a continuous value between 0 and 1 that is calculated from the position vector and the value of dimension d as in Equation (16), where $x^d_s$ is the value of the position vector in dimension d, calculated by the original GWO.

Adaptive Restart GWO
The original GWO has a certain probability of falling into a local optimum, which affects its search capability [34], and restarting is a highly economical strategy when an algorithm struggles on complex problems [35]. Thus, an adaptive restart method is proposed to enhance the search capability of the algorithm. When the minimum fitness of iteration t + 1 is greater than or equal to the minimum fitness of iteration t, the search process is assumed to have fallen, or to be tending to fall, into a local optimum, and a portion of the search agents is reinitialized randomly. The adaptive restart approach emphasizes the dynamic reinitialization of search agents according to the optimal fitness of each iteration. The number of randomly reinitialized search agents is calculated by Equation (17).
In Equation (17), $f_t$ is the minimum fitness value of iteration t, $N_s$ is the number of search agents, and round represents the rounding operation. According to Equation (10), a higher fitness value means a higher error rate or a larger number of selected features, which represents a poorer search effect in the iteration. Because the fitness lies in [0,1], the number of restarted search agents varies from 0 to $N_s$ according to the search effect. Finally, the overall pseudocode of ARGWO with the two proposed binary transform approaches (ARGWO1 and ARGWO2) can be found in Algorithm 2.

Algorithm 2 Adaptive restart GWO
1: Initialize the number of iterations for optimization $N_{iter}$
2: Initialize the positions of N grey wolves $X_i$, i = 1, 2, ..., N
3: Calculate the fitness value of each grey wolf
4: Choose the best three grey wolves as $X_\alpha$, $X_\beta$, $X_\delta$ based on their fitness
5: t ← 0
6: fitness_N ← the best fitness calculated from the initialized wolves
7: while t < $N_{iter}$ + 1 do
8:   Update the position of each wolf using Equation (6)
9:   Update a, A, and C
10:   Update $x_1$, $x_2$, and $x_3$ using Equations (12)-(14), or get the position vector of each wolf
11:   Transform each wolf's position into a binary vector using Equation (11) with $x_1$, $x_2$, $x_3$, or Equation (15) with the wolf's original position vector
12:   Calculate the fitness of each wolf
13:   fitness_L ← fitness_N
14:   fitness_N ← best fitness in iteration t
15:   if fitness_N ≥ fitness_L then
16:     Calculate $N_{restart}$ using Equation (17)
17:     Select $N_{restart}$ wolves randomly from all search agents to reinitialize
18:   end if
19:   Update the first three grey wolves $X_\alpha$, $X_\beta$, $X_\delta$ based on fitness
20:   t ← t + 1
21: end while
22: return best fitness, $X_\alpha$
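Steps 15-18 of Algorithm 2 can be sketched as below. The sketch assumes that the restart count takes the form round(f_t × N_s), which is consistent with the statement that the count ranges from 0 to N_s for a fitness in [0,1], and that restarted wolves are redrawn as random binary vectors. Both points, together with the function names, are assumptions rather than details confirmed by the text.

```python
import random


def n_restart(best_fitness, n_agents):
    """Assumed form of the restart count: round(f_t * N_s), with f_t in [0, 1]."""
    return int(round(best_fitness * n_agents))


def adaptive_restart(wolves, prev_best, curr_best, rng):
    """Reinitialize randomly chosen wolves when the best fitness did not improve."""
    if curr_best < prev_best:
        return []  # the search improved, so nothing is restarted
    chosen = rng.sample(range(len(wolves)), n_restart(curr_best, len(wolves)))
    for i in chosen:
        # Assumption: a restarted wolf is redrawn as a random binary vector.
        wolves[i] = [rng.randint(0, 1) for _ in wolves[i]]
    return chosen
```

Under this assumed rule, a stagnating search with a poor best fitness restarts most of the pack, while a stagnating search that is already close to zero fitness restarts few or no wolves.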

Datasets and Feature Extraction
Three gas sensor data sets in different application domains were used in this experiment, which will be described in more detail in the following subsections.

Dataset1
Dataset1 is the sensor array data collected by Vergara et al., which is publicly available in the UCI Machine Learning Repository for detecting different gases [36,37]. Dataset1 contains 13,910 samples collected by 16 chemical sensors (four each of TGS2600, TGS2602, TGS2610, and TGS2620), which were exposed to six different gases (Ammonia, Acetaldehyde, Acetone, Ethylene, Ethanol, and Toluene) at various concentrations. The information on dataset1 is presented in Table 1.
Two distinct types of features were extracted from the response signal:
• Features defined as the maximal resistance change relative to the baseline ($\Delta R$) and its normalized version.
• Features reflecting the increasing/decreasing transient part of the sensor response over the whole measurement process, which are formulated in Equation (18), where R[k] is the resistance of each sensor measured at time k and α is a smoothing parameter between 0 and 1. Six features of the rising and falling stages of the sensor response were extracted by using three different α values (0.1, 0.01, 0.001).
Thus, for each sensor, eight features were extracted, and each sample forms a 128-dimensional eigenvector (16 sensors × 8 features). In Table 1, each value in the column "No. samples" is the number of samples obtained in the corresponding gas environment. Dataset1 covers six kinds of gases, and the total number of samples obtained over the six gases is 13,910.
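The transient features can be illustrated with an exponential moving average of the resistance increments. The sketch assumes the recursion y[k] = (1 − α)·y[k−1] + α·(R[k] − R[k−1]) with y[0] = 0, which is how this kind of feature is commonly described for this public dataset, and takes the maximum of y for the rising stage and the minimum for the falling stage. Both the recursion and the function names are assumptions here, not a reproduction of Equation (18).

```python
def ema_transform(resistance, alpha):
    """y[k] = (1 - alpha) * y[k-1] + alpha * (R[k] - R[k-1]), with y[0] = 0 (assumed form)."""
    y = [0.0]
    for k in range(1, len(resistance)):
        y.append((1.0 - alpha) * y[-1] + alpha * (resistance[k] - resistance[k - 1]))
    return y


def transient_features(resistance, alphas=(0.1, 0.01, 0.001)):
    """Max (rising stage) and min (falling stage) of the transform for each alpha."""
    features = []
    for alpha in alphas:
        y = ema_transform(resistance, alpha)
        features.extend([max(y), min(y)])
    return features
```

Three smoothing values with two extrema each give the six transient features per sensor mentioned above.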

Dataset2
Dataset2 is the time series data collected by a chemical detection platform composed of eight chemoresistive gas sensors (TGS2611, TGS2612, TGS2610, TGS2600, two TGS2602, and two TGS2620) for detecting mixtures of Ethylene with Methane or Carbon Monoxide, and it is publicly available in the UCI Machine Learning Repository [38]. There are 180 samples in dataset2. In order to reduce the difficulty of classification, the pattern recognition of dataset2 was treated as a binary classification problem. An example of the time series of dataset2 is shown in Figure 1(left).
It is necessary to extract the hidden features contained in the gas sensor data in order to improve the classification accuracy [39]. For each sensor, 16 features were extracted, and each sample forms a 128-dimensional eigenvector (8 sensors × 16 features). The feature extraction methods applied to dataset2 are outlined in Table 2.

Dataset3
Dataset3 is a time series dataset for the detection of wine quality [40,41]. Six gas sensors (Table 3) were used to detect different quality wines (high quality, medium quality, low quality) and ethanol, and 300 samples were collected. The example of the time series of dataset3 is outlined in Figure 1(right). The same feature extraction methods were used in dataset3, and the details can be seen in Table 2. For each sensor, 16 features were extracted, and the sample of 96-dimensional eigenvector was formed (6 sensors × 16 features).
In order to comprehensively introduce the data sets that were used, the number of attributes, the number of samples, and other important information on the three data sets are summarized in Table 4.

Table 2
Feature Type | Description
Maximum response | Maximum response value of the curve minus the baseline
Derivative | Maximum and minimum derivative of the sensor value
Time constant | The times at which the sensor value reaches 30%, 60%, and 90% of its maximum response
Integral | The integral of the curve, calculated by $I = \int_{T_{gasOn}}^{T_{gasOff}} (x(t) - baseline)\,dt$
Equal interval value | Ten values of the sensor data obtained at equal time intervals from $T_{gasOn}$ to $T_{gasOff}$

Table 4
Dataset | Name | No. attributes | No. samples | No. classes
2 | Gas sensor array exposed to turbulent gas mixtures | 128 | 180 | 2
3 | Wine quality inspection | 96 | 300 | 4
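The feature types listed in Table 2 can be sketched for a single sensor curve as follows. The sketch assumes a uniformly sampled signal that spans the interval from gas-on to gas-off and takes the first sample as the baseline. These assumptions, the trapezoidal integral, and the function name are illustrative, and the sketch does not try to reproduce the exact set of 16 features per sensor.

```python
def extract_features(signal, dt=1.0, n_equal=10):
    """Feature-type sketch; signal[0] is taken as the baseline (an assumption)."""
    baseline = signal[0]
    response = [v - baseline for v in signal]
    peak = max(response)
    derivative = [(response[i + 1] - response[i]) / dt for i in range(len(response) - 1)]

    def time_to_reach(fraction):
        # First time at which the response reaches the given fraction of its peak.
        return next(i * dt for i, v in enumerate(response) if v >= fraction * peak)

    # Trapezoidal approximation of the integral of (x(t) - baseline).
    integral = sum((response[i] + response[i + 1]) * dt / 2.0
                   for i in range(len(response) - 1))
    # Sensor values sampled at equal time intervals over the whole window.
    step = (len(signal) - 1) / (n_equal - 1)
    equal_interval = [signal[round(i * step)] for i in range(n_equal)]
    return {
        "max_response": peak,
        "derivative_max": max(derivative),
        "derivative_min": min(derivative),
        "time_constants": [time_to_reach(f) for f in (0.3, 0.6, 0.9)],
        "integral": integral,
        "equal_interval": equal_interval,
    }
```

Concatenating such per-sensor values over all sensors of the array yields the eigenvector of one sample.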

Result and Discussion
The details of the feature selection algorithms used for comparison are described in Section 4.1. Section 4.2 compares the proposed algorithm with well-known algorithms using multiple assessment indicators on three data sets. In Section 4.3, the effect and superiority of the proposed binary transform approaches in the GWO based feature selection algorithm are discussed. Section 4.4 analyzes the adaptive restart approach.

Description of the Compared Algorithm
In the experiment, in addition to the original BGWO, two widely used algorithms for gas sensor data feature selection and two efficient meta-heuristic algorithms were compared with the proposed algorithm. The compared algorithms are: Support Vector Machine-Recursive Feature Elimination (SVM-RFE) [42], Max-Relevance and Min-Redundancy (mRMR) [43], Binary Grey Wolf Optimization (BGWO) [27], Discrete Binary Particle Swarm Optimization (BPSO) [44], and Genetic Algorithm (GA) [45]. KNN is a classifier with few parameters and high classification accuracy, so it was used as the wrapped classifier for all meta-heuristic algorithms in this study; the experimental results show that the classification achieves the best performance when k is 5. Each data set was divided into a train set and a test set with a ratio of 7:3. The error rate on the test set and the number of selected features were used to guide the search direction of all meta-heuristic algorithms, and each of them was run independently 20 times. mRMR and SVM-RFE were calculated on the complete datasets. Gaussian and linear kernels were used in SVM-RFE, and the better of the feature subsets selected with the two kernel functions was taken as the final result. The parameter settings for all algorithms are outlined in Table 5. All parameters were set according to multiple experiments and the relevant literature to ensure the fairness of the experiment. 10-fold cross-validation was conducted for all algorithms in order to reduce the influence of overfitting.

Comparison of the Proposed Algorithm and Other Algorithms
In the first experiment, adaptive restart GWO with two binary transform approaches (ARGWO1 and ARGWO2) which have been proposed in Section 2 were compared with five feature selection algorithms (mRMR, SVM-RFE, BPSO, GA, BGWO) on three electronic nose datasets.
KNN [46], SVM [47], and Random Forest (RF) [48] were used to calculate the classification accuracy of the feature subset selected by each algorithm to ensure the reliability of the accuracy evaluation. Each data set was randomly divided into a training set and a testing set with a ratio of 7:3, and the classification accuracy on the test set was used to evaluate each feature subset as an estimate of its performance on unseen data. For each algorithm, the feature subset with the fewest features among those reaching the maximum classification accuracy over multiple runs was regarded as the optimal result. The classification accuracy of the optimal feature subset selected by each algorithm on the three data sets is shown in Table 6. ARGWO1 and ARGWO2 achieve excellent performance: ARGWO2 achieves the highest average classification accuracy on all data sets, and the average classification accuracy of ARGWO1 on dataset2 and dataset3 is second only to that of ARGWO2. In order to judge whether the models overfit, the accuracy on the training sets under the different training models is shown in Table 7, which indicates that the training was not overfitted.
In order to evaluate the classification performance of the feature subsets more comprehensively, the F1-score was used in this experiment. The F1-score is the harmonic mean of precision and recall [49], and it is formulated in Equation (19), where F1 is the F1-score obtained by each classifier on each data set, TP is the number of samples correctly predicted as class k, FP is the number of samples incorrectly predicted as class k, and FN is the number of samples belonging to class k but predicted as other classes. The F1-scores of the optimal feature subsets selected by each algorithm are outlined in Table 8. ARGWO2 and ARGWO1 achieve the best and second best average F1-score, respectively. From Tables 6 and 8, we can see that the classification performance on dataset1 with the different classifiers is far lower than on the other datasets, which is mainly due to the impact of sensor drift on classification accuracy. The experimental results show that selecting appropriate features through a feature selection algorithm can suppress the impact of sensor drift to a certain extent, and the proposed algorithm achieves positive performance in compensating for the drift effect.
Table 9 shows the length of the optimal feature subset selected by each algorithm. ARGWO2 achieves the minimum average eigenvector length. In the assessment of a feature selection algorithm, however, the classification performance is more important than the length of the feature subset. Therefore, in determining the optimal result of each algorithm, the classification performance was given priority, and the shortest subset was chosen among the feature subsets with the highest classification performance, which also explains the values chosen for the parameters α and β in the fitness function. Table 10 outlines the Wilcoxon test calculated on the classification accuracy and average fitness obtained by the different algorithms.
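The per-class F1-score described above, F1 = 2TP / (2TP + FP + FN), can be sketched as follows. Averaging the per-class scores into a single macro F1 is an assumption made here for the multi-class case, and the function names are illustrative.

```python
def f1_per_class(y_true, y_pred, k):
    """F1 for class k from its true positives, false positives, and false negatives."""
    tp = sum(t == k and p == k for t, p in zip(y_true, y_pred))
    fp = sum(t != k and p == k for t, p in zip(y_true, y_pred))
    fn = sum(t == k and p != k for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores (assumed aggregation)."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, k) for k in classes) / len(classes)
```

For a binary problem such as dataset2, the same functions apply with two classes.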
In this experiment, the average classification accuracy of the feature subsets obtained over multiple runs for each dataset and each classifier was regarded as an individual element presented to the Wilcoxon test, and ARGWO2 achieves a significant improvement over most of the other approaches.
From the above experiments, we can conclude that the proposed algorithm outperforms the other methods in classification performance and number of selected features. In addition, ARGWO2 selects fewer features while obtaining the highest classification performance on all data sets, which indicates that ARGWO2 can achieve positive performance on the data collected by different types of electronic nose. By using the proposed algorithm for feature selection, useful information can be extracted from the gas response signal to enhance the performance of the electronic nose. The feature subsets selected by the algorithms with the KNN wrapper have a similar classification accuracy ranking on the three classifiers, which suggests that KNN is an effective wrapped classifier for the meta-heuristic algorithms. The effect of the two proposed mechanisms on the GWO based feature selection algorithm is discussed in the following subsections.

The Effect of Binary Transform Approach on the Proposed Algorithm
Section 4.2 shows that the proposed algorithm outperforms the original BGWO algorithm in classification performance and the number of selected features. In this section, GWO with the sigmoid function (BGWO), GWO with approach1 (GWO1), and GWO with approach2 (GWO2) were used to study the effect of the binary transform approach on GWO for feature selection. In order to control the variables, the adaptive restart approach was not added to GWO1 and GWO2 in this experiment. The fitness value is a comprehensive evaluation of the accuracy and length of a feature subset that guides the search direction of the proposed algorithm; thus, it is an important index for the evaluation of a GWO based algorithm, and three fitness related assessment indicators were used in this experiment [27]:
• The best fitness is the minimum fitness value obtained by running the algorithm M times, as formulated in Equation (22).
• The worst fitness is the maximum fitness value obtained by running the algorithm M times, as formulated in Equation (23).
• The mean fitness is the average fitness value obtained by running the algorithm M times, as formulated in Equation (24).
Figure 2 shows the fitness values obtained by the three algorithms in 20 independent runs on all the datasets. According to Equation (10), a feature subset with lower fitness, which means a lower error rate and fewer selected features, represents better search performance. The figure shows that the overall performance of GWO1 and GWO2 is better than that of BGWO. Figures 3-5 outline the best, mean, and worst fitness values obtained by the three algorithms on all the data sets. GWO1 has a slight advantage over BGWO, and GWO2 achieves much better performance than the other two algorithms.
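Following the verbal definitions of the three indicators above, they can be written out as below. This is a sketch in which $f_i$ is assumed to denote the final fitness value of the i-th of M independent runs; the layout of the original Equations (22)-(24) may differ.

```latex
\begin{align}
\mathrm{Best}  &= \min_{i = 1, \dots, M} f_i \\
\mathrm{Worst} &= \max_{i = 1, \dots, M} f_i \\
\mathrm{Mean}  &= \frac{1}{M} \sum_{i=1}^{M} f_i
\end{align}
```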
The Wilcoxon test was used to verify the significance of the difference between the above algorithms. The average fitness on the three datasets was regarded as an individual element presented to the Wilcoxon test, and the average fitness values over multiple runs were used to calculate the p-value between different algorithms. From the experimental results, GWO1 and GWO2 achieve a significant improvement over BGWO, with p-values of 0.0081 and 0.0076, respectively. The binary transform approach therefore has a positive influence on the search capability of GWO for feature selection, and an appropriate approach makes it easier to find the optimal feature subset. Compared with the sigmoid function, approach1 and approach2 both improve the search capability of the algorithm further. Moreover, approach2 achieves an outstanding advantage, possibly because it takes into account the information in the whole position vector of a search agent rather than relying on only one element of the position vector in the binary transformation. Therefore, combining more information related to the search agents may be a good way to control the binary transform process.

The Effect of Adaptive Restart Approach on the Proposed Algorithm
In this section, GWO with the adaptive restart approach (ARGWO1, ARGWO2) and without it (GWO1, GWO2) were used to study the effect of the adaptive restart approach on the algorithm. Figures 6 and 7 show the best, worst, and average fitness values obtained by these four algorithms on all the data sets. The mean fitness and the worst fitness of the algorithm are reduced by using the adaptive restart approach, and the best fitness is also reduced in some cases, which indicates that the adaptive restart approach effectively improves the search capability. The Wilcoxon test was also used in this experiment: the average fitness on the three datasets was regarded as an individual element presented to the Wilcoxon test, and the average fitness values over multiple runs were used to calculate the p-value between the proposed algorithm with and without adaptive restart. The p-value between ARGWO1 and GWO1 was 0.0288, and the p-value between ARGWO2 and GWO2 was 0.0036, so the performance of the proposed algorithm is significantly improved by using adaptive restart. Std is a measure of the variation of the optimal result obtained by the algorithm over multiple runs [50], and it was used as the index to evaluate the stability of the algorithm in this experiment. Std is formulated as in Equation (25).
where $f_i$ is the final fitness value of independent run i and $f_{mean}$ is the mean fitness. The average Std values over all data sets obtained by the above four algorithms in this experiment are outlined in Figure 8. The figure shows that the stability and repeatability of the GWO based algorithm can be improved by using the adaptive restart approach. The adaptive restart approach determines the number of search agents that need to be reinitialized from the optimal fitness value and the number of search agents in each iteration. By analyzing the current search effect, adaptive restart can dynamically affect the search direction of the algorithm when the algorithm has fallen, or tends to fall, into a local optimum, so as to escape the local optimal solution and select more useful features from the gas response signal.
From all these experiments, we can conclude that the proposed algorithm outperforms the other algorithms on all datasets, which indicates that it can be applied to different types of electronic nose and effectively enhance their gas sensing performance. The performance of the GWO based algorithm for feature selection of electronic nose data is effectively improved by the proposed binary transform approaches and adaptive restart. The fitness value of the selected feature subset can be reduced by using the proposed binary transform approaches, especially approach2. The stability and search capability of the GWO based feature selection algorithm can be further enhanced by using the adaptive restart approach. Overall, the proposed algorithm can effectively select more favorable feature subsets for gas recognition; however, because the adaptive restart approach does not strongly alter the search behavior of GWO, there is still a certain probability of falling into a local optimum during the search. In order to obtain the optimal feature subsets, it is therefore usually necessary to run the algorithm multiple times.
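The run-level indicators used in this section (best, worst, and mean fitness, and Std) can be computed from the final fitness values of the independent runs as sketched below. The sample normalization by M − 1 in Std is an assumption, since a population form dividing by M is equally common; the function name is illustrative.

```python
import math


def fitness_statistics(final_fitness):
    """Best, worst, mean, and Std of the final fitness over M independent runs."""
    m = len(final_fitness)
    mean = sum(final_fitness) / m
    # Assumption: sample standard deviation with an M - 1 denominator.
    std = math.sqrt(sum((f - mean) ** 2 for f in final_fitness) / (m - 1))
    return {
        "best": min(final_fitness),
        "worst": max(final_fitness),
        "mean": mean,
        "std": std,
    }
```

A smaller Std over the runs indicates that the algorithm reaches similar fitness values regardless of its random initialization, which is the sense of stability used above.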

Conclusions
This paper proposes a novel method for enhancing the performance of the electronic nose by feature selection using an improved grey wolf optimization based algorithm. Two novel binary transform approaches and adaptive restart were employed in the proposed algorithm. The proposed algorithm was compared with five typical feature selection algorithms on three electronic nose datasets. Three classifiers and multiple assessment indicators were used to evaluate the performance of each algorithm. The results showed that the proposed algorithm outperformed the original BGWO and the other algorithms. In addition, the search capability of GWO can be effectively improved by using adaptive restart and the proposed binary transform approaches. The proposed binary transform approaches search electronic nose data for the feature subset that maximizes the classification accuracy while minimizing the number of features, and adaptive restart further enhances the search capability and stability of the algorithm, which helps to obtain more favorable feature subsets for pattern recognition from high-dimensional electronic nose data. In summary, the proposed algorithm is a promising method for enhancing classification accuracy and reducing dimension, which can effectively enhance the performance of different types of electronic nose. In the future, the proposed algorithm can be combined with more kinds of features to improve the performance.

Conflicts of Interest:
The authors declare no conflict of interest regarding the design of this study, analyses, and the writing of this manuscript.

Abbreviations
The following abbreviations are used in this manuscript: