A Micro-GA Embedded PSO Feature Selection Approach to Intelligent Facial Emotion Recognition

This paper proposes a facial expression recognition system using evolutionary particle swarm optimization (PSO)-based feature optimization. The system first employs modified local binary patterns, which conduct horizontal and vertical neighborhood pixel comparison, to generate a discriminative initial facial representation. Then, a PSO variant embedded with the concept of a micro genetic algorithm (mGA), called mGA-embedded PSO, is proposed to perform feature optimization. It incorporates a nonreplaceable memory, a small-population secondary swarm, a new velocity updating strategy, a subdimension-based in-depth local facial feature search, and a cooperation of local exploitation and global exploration search mechanism to mitigate the premature convergence problem of conventional PSO. Multiple classifiers are used for recognizing seven facial expressions. Based on a comprehensive study using within- and cross-domain images from the extended Cohn Kanade and MMI benchmark databases, respectively, the empirical results indicate that our proposed system outperforms other state-of-the-art PSO variants, conventional PSO, classical GA, and other related facial expression recognition models reported in the literature by a significant margin.


I. INTRODUCTION
Facial emotion recognition has opened up a new era for human-computer interaction, and has provided benefits to a wide range of computer vision applications, such as healthcare, surveillance, event detection, personalized learning, and robotics [1]-[7]. Robust emotion classification relies heavily on effective facial representation. However, identifying significant discriminative facial features that could represent the characteristics of each expression remains a challenging task. This paper aims to deal with such challenges to produce effective and optimized discriminative facial representations to benefit real-time facial expression recognition.
In comparison with other feature selection methods, evolutionary computation (EC) algorithms show powerful global search capabilities, and have been widely accepted as efficient techniques for feature selection [8]. Among different EC algorithms, the particle swarm optimization (PSO) algorithm is motivated by the flocking behaviors of birds, and has been extensively used for feature optimization owing to its low computational cost and fast convergence speed. However, conventional PSO tends to converge prematurely and, therefore, become trapped in local optima [8]. As a result, in this paper, a PSO variant embedded with the concept of a micro genetic algorithm (mGA) is proposed. Known as mGA-embedded PSO, the proposed algorithm incorporates a nonreplaceable memory, a small-population secondary swarm, a new velocity updating strategy, a subdimension-based regional facial feature search strategy, and a cooperation of local exploitation and global exploration search strategy to overcome both the premature convergence and local optimum problems encountered by conventional PSO.
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/
The proposed facial emotion recognition system consists of three steps: 1) feature extraction; 2) feature optimization; and 3) emotion recognition. Fig. 1 illustrates the system architecture. First of all, we use modified local binary patterns (LBPs), i.e., horizontal and vertical neighborhood comparison LBP, to extract the initial facial representation. Then, the proposed mGA-embedded PSO algorithm is used to identify the most discriminative and significant features for differentiating distinct facial expressions. Diverse classifiers (e.g., single and ensemble models) are applied to recognize seven emotions: 1) happiness; 2) sadness; 3) anger; 4) fear; 5) surprise; 6) disgust; and 7) neutral. The system is evaluated with two facial expression databases, i.e., the extended Cohn Kanade (CK+) [9] and MMI [10]. State-of-the-art PSO variants, conventional PSO, and classical genetic algorithm (GA) are compared with the proposed mGA-embedded PSO algorithm in feature optimization. The empirical results indicate that the proposed system outperforms state-of-the-art optimization methods and other related facial expression recognition research reported in the literature by a significant margin. The main contributions of this paper are summarized as follows.
1) A modified LBP operator that conducts horizontal and vertical neighborhood pixel comparison is proposed, in order to overcome the drawbacks of original LBP by retrieving the missing contrast information embedded in the neighborhood to generate the initial discriminative facial representation.
2) A novel mGA-embedded PSO algorithm is proposed for feature optimization, in order to mitigate the premature convergence and local optimum problems of conventional PSO. It provides great flexibility to allow the feature selection process to not only separate facial features into specific areas for in-depth local search but also combine facial features for overall global search.
3) The proposed algorithm includes a new velocity updating strategy by employing the personal average experience to generate the individual best, pbest, and Gaussian mutation to produce the global best, gbest, in order to increase swarm diversity.
4) The proposed algorithm also applies the diversity maintenance strategy of mGA to keep the original swarm in a nonreplaceable memory [11], which remains intact during the lifetime of the algorithm, in order to reduce the probability of premature convergence.
5) In order to speed up evolution for convergence, the small population size concept of mGA is used to generate a secondary swarm with five particles. The secondary swarm consists of the swarm leader and four follower particles from the nonreplaceable memory with the lowest or highest correlation with the leader to increase local exploitation and global exploration. These local and global search mechanisms work in a collaborative manner to guide the search toward global optima. A subdimension-based search strategy is also conducted, in order to identify optimal features for each facial region.
6) Our proposed system is evaluated with the CK+ and MMI databases. It significantly outperforms state-of-the-art LBP and PSO variants, and other facial expression recognition methods reported in the literature.
This paper is organized as follows. We discuss the related work in Section II. Section III introduces the proposed LBP variant for feature extraction and the mGA-embedded PSO algorithm for feature optimization. A comprehensive evaluation study is presented in Section IV. The conclusions and suggestions for future work are presented in Section V.

II. RELATED WORK
In this section, we discuss state-of-the-art research on texture extraction, PSO-based feature optimization and facial expression recognition.

A. Feature Extraction Techniques
A number of LBP variants are available to increase its robustness and discriminative power. As an example, dominant LBP (DLBP) is able to retrieve the most frequently occurring patterns of LBP to improve its texture descriptive capability. According to [12], uniform patterns in LBP can lead to a loss of information with respect to complex shapes despite their effectiveness in capturing fundamental patterns in an input image. Therefore, instead of purely using uniform patterns, DLBP calculates the occurrence frequencies of all the patterns extracted by LBP. These patterns are subsequently ranked based on their occurrence frequencies to enable the extraction of dominating patterns in texture images.
Completed LBP (CLBP) [13] employs three key components, i.e., CLBP-center, CLBP-sign, and CLBP-magnitude, to extract the image's local gray level and the sign and magnitude features of local difference, respectively. The final CLBP histogram is formed by fusing these three components. In comparison with LBP, which only considers the sign component, CLBP takes the magnitude component and the intensity of the central pixel into account to gain additional discriminative power. It achieves higher texture classification accuracy than other state-of-the-art LBP algorithms. Center-symmetric LBP (CS-LBP) [14] aims to solve the lengthy histogram problem of LBP. In order to produce more compact binary patterns, CS-LBP purely employs the center-symmetric pairs of pixels for comparison. Therefore, compared with LBP, it enables a significant reduction in dimensionality while capturing better gradient information.
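The center-symmetric comparison described above can be illustrated with a minimal sketch for a single 3 × 3 patch. The function name `cs_lbp_code`, the clockwise neighbor ordering, the bit ordering, and the `threshold` parameter are assumptions made for illustration and are not taken from [14].

```python
import numpy as np

def cs_lbp_code(patch, threshold=0.0):
    """Return a 4-bit center-symmetric code for one 3x3 patch (illustrative only)."""
    patch = np.asarray(patch, dtype=float)
    # Eight neighbours in clockwise order, starting at the top-left corner (assumed order).
    n = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
         patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    code = 0
    for i in range(4):
        # Compare each pixel with the one diametrically opposite; the centre is never used.
        if n[i] - n[i + 4] > threshold:
            code |= 1 << i
    return code
```

With only four comparisons per pixel, the code takes one of 16 values rather than the 256 values of an eight-neighbor LBP code, which is the dimensionality reduction referred to above.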
Local derivative pattern (LDP) [15] is a high-order local pattern descriptor, which encodes directional pattern features based on local derivative variations. In comparison with LBP (as a nondirectional first-order local pattern operator), LDP encodes more detailed discriminative information by calculating higher-order directional derivatives. It effectively extracts spatial relationships in a local region. LBP, on the other hand, only defines the relationships between the central point and its neighbors. In LDP, the first-order derivatives from four different directions, i.e., 0°, 45°, 90°, and 135°, are calculated. A set of 16 spatial relationship templates is defined for derivative direction comparisons, with each template assigned a value of "0" or "1" based on whether it is a "monotonically increasing/decreasing" or a "turning point" pattern. The four first-order derivatives are then concatenated to form the second-order LDP. The nth-order LDP, therefore, encodes the (n−1)th-order derivative direction variations. Higher-order LDP possesses superior capabilities in providing detailed discriminative features, but at the cost of an increasing level of noise.
Another novel texture descriptor, local phase quantization (LPQ) [16], deals with image blurring based on the quantized phase of the discrete Fourier transform computed in local neighborhoods. The LPQ operator is tolerant to centrally symmetric blur, including motion, out-of-focus, and atmospheric turbulence blur. It is developed based on the blur invariance characteristics of the Fourier phase spectrum. In LPQ, four Fourier coefficients are used to sample the phase component of the frequency at four discrete points for each individual pixel position. The resulting vector is then further processed by separating each value into its real and imaginary parts to generate an 8-D vector. Decorrelation is also conducted using a whitening transform to ensure statistical independence of the samples. A simple scalar quantizer is subsequently used to obtain the 8-bit binary code for each pixel position, which represents blur-insensitive Fourier phase information at that pixel location. These codes are then converted into a histogram for image classification. Overall, LPQ is superior to LBP and Gabor filter bank-based methods in dealing with image blurring.

B. PSO Variants and Feature Selection Techniques
There are many PSO variants in the literature to overcome the local optimum problem of conventional PSO [17]. Mahmoodabadi et al. [18] proposed a PSO variant known as high exploration PSO (HEPSO). In HEPSO, PSO is integrated with a multicrossover mechanism of the GA and the food source finding operator of bee colony optimization for updating the particle velocity and position, respectively. Evaluated with well-known benchmark functions, HEPSO has shown superiority over other PSO variants. Li et al. [19] proposed another hybrid PSO algorithm with the integration of fuzzy reasoning and a weighted particle to guide the swarm. The weighted particle is used to adjust the search direction, whereas other parameters such as the attraction factor and inertia weight controlled by fuzzy reasoning are used to adjust local exploitation and global exploration to guide the search. The proposed model was tested with ten benchmark functions, and was further applied to nonlinear neural network (NN)-based modeling. Jordehi [20] proposed an enhanced leader PSO model known as ELPSO. ELPSO employs Gaussian, Cauchy, opposition-based, and differential evolution (DE)-based mutation to increase the diversity of the swarm leader.
PSO variants have also been extensively used for feature selection. Zhang et al. [21] extended the conventional bare bones PSO (BPSO) to feature selection problems with binary variables. In this variant, known as binary BPSO, a reinforced memory strategy is used to update pbest of each particle to retain swarm diversity, whereas a uniform combination technique is applied to increase the local and global search capabilities of the algorithm. In binary BPSO, the influence of the uniform combination is strengthened as the number of stagnated iterations of the algorithm increases. Wang et al. [22] proposed a parameter-free Gaussian bare-bones DE algorithm (GBDE). GBDE employs a Gaussian distribution as the mutation strategy and a self-adaptive scheme for adjusting the crossover probability. GBDE has been further enhanced by integrating it with DE/best/1 (another mutation strategy) to achieve a fast convergence rate. The enhanced model outperforms several DE variants and bare-bones algorithms. Chuang et al. [23] proposed chaotic binary PSO (CBPSO) for feature selection. It combines two chaotic maps, i.e., logistic and tent maps, with BPSO to determine the inertia weight, in order to overcome the local optima problem. The results indicate that CBPSO in combination with a tent map is able to produce the best performance.
Xue et al. [8] proposed two PSO-based multiobjective feature selection algorithms, i.e., nondominated sorting PSO (NSPSO) and crowding, mutation, and dominance PSO (CMDPSO), to generate a Pareto front of nondominated solutions. NSPSO integrates the concept of nondominated sorting with PSO, while CMDPSO embeds PSO with the strategies of crowding, mutation, and dominance. Both algorithms apply a crowding distance to the nondominated solutions for maintaining the selected gbest diversity for each particle. Specifically, CMDPSO employs an external archive to store the nondominated solutions and a binary tournament selection to generate gbest for each particle based on the crowding distance. It also uses the mutation operation to diversify the search. Evaluated with 12 datasets, CMDPSO outperforms NSPSO and other multiobjective algorithms, including nondominated sorting GA II (NSGAII).

C. Face and Facial Emotion Recognition
Krisshna et al. [24] developed a face recognition system with a method called threshold-based binary PSO feature selection (ThBPSO). ThBPSO conducts multiple runs of conventional BPSO and stores the gbest identified from each run. Then, a threshold is used to identify the importance of each dimension of the global best solutions. A feature is selected and considered as important if the total number of selections of this feature in the past runs is more than the predefined threshold. The system was tested with seven benchmark datasets, and showed superior performance over other state-of-the-art methods. Liu et al. [25] proposed a deep learning architecture, i.e., action units inspired deep networks (AUDNs), for learning facial expression features. AUDN consists of three sequential processes: 1) a convolutional layer and a max-pooling layer to learn the micro-action-pattern (MAP) representation; 2) feature grouping to integrate correlated MAPs to produce mid-level semantics; and 3) a multilayer learning process to construct subnetworks for higher-level representations.
Zavaschi et al. [26] proposed a novel facial expression recognition system with the integration of ensemble classifiers trained on both Gabor and LBP features. A set of 73 base support vector machine (SVM) classifiers was generated by varying the parameter settings of Gabor filters and LBP. NSGAII was used to identify the optimal ensemble structures, with a fitness function that focused on the minimization of both the error rate and the number of selected base classifiers in the ensemble. Diao et al. [27] proposed an adaptive ensemble reduction technique by applying the heuristic harmony search (HS) algorithm. HS identified an optimal ensemble size while preserving or increasing ensemble diversity and classification accuracy.
Zeng et al. [28] proposed a one-class classification system using kernel whitening and support vector data description to distinguish spontaneous emotional expressions from outlier nonemotional expressions. Meng and Bianchi-Berthouze [3] developed a multistage framework to explore continuous emotion recognition from naturalistic facial and vocal expressions, where temporal relationships between consecutive levels of a given affective dimension were modeled using a hidden Markov model (HMM). In terms of automatic multimodal emotion recognition, Zeng et al. [29] conducted spontaneous emotion detection from audio-visual modalities using AdaBoost multistream HMM. Soleymani et al. [30] performed continuous emotion recognition from electroencephalogram (EEG) signals and facial expressions. The power spectral density from EEG signals and facial landmarks were employed to represent multimodal emotional inputs. Diverse regression models such as recurrent NNs and continuous conditional random fields were used for emotion regression of the valence dimension.
Eleftheriadis et al. [31] proposed a discriminative shared Gaussian process latent variable model for multiview and view-invariant classification of facial expression. A discriminative manifold was derived based on learning of multiple views of a facial expression. Emotion classification was conducted using both expression manifold and view-invariant or multiview information. Their work compared favorably with other related state-of-the-art developments. Happy and Routray [32] proposed a facial expression recognition system with the consideration of texture features of selected salient facial patches. Active facial patches associated with emotional expressions were initially extracted, which were then further analyzed to obtain discriminative salient facial features for distinguishing between each pair of emotion classes. A facial landmark detection technique to enable more accurate localization of facial patches with lower computational cost was also proposed. The system employed the one-against-one classification method for emotion recognition.

A. Facial Feature Extraction Using the Proposed LBP
In this paper, in order to improve the discriminative abilities of LBP, we propose horizontal and vertical neighborhood pixel comparison LBP (hvnLBP). It is integrated with the Gabor filter for producing the discriminative facial representation.
There are four steps in the feature extraction process: 1) preprocessing for illumination changes and noise invariance; 2) face detection; 3) Gabor magnitude image generation; and 4) the proposed hvnLBP-based textural description. First of all, we apply histogram equalization and a bilateral filter to compensate for illumination variations and reduce noise in the input image, respectively. We then use a Haar-cascade face detector to detect faces. A 2-D Gabor filter is also applied to produce magnitude images. Finally, the proposed hvnLBP operator is used to generate the textural description of facial images.
As a well-known texture descriptor, LBP [33] employs a circular neighborhood for feature extraction. This original LBP operator performs a comparison purely between the central pixel and the eight surrounding neighborhood pixels, and is therefore likely to lose the contrast information among the neighborhood pixels. To solve this problem, we propose hvnLBP to capture the missing contrast information among the neighborhood pixels. Instead of comparing with the central pixel as in original LBP, hvnLBP employs horizontal and vertical neighborhood pixels for direct comparison to produce the resulting textural descriptions. As an example, we employ $P = \{p_0, p_1, p_2, p_3, p_4, p_5, p_6, p_7\}$ to represent the eight neighborhood pixels in LBP, as shown in Fig. 2. In either vertical or horizontal comparison, the values of the vertical or horizontal neighboring pixels are compared with one another. A 1 is assigned to the pixel with the highest value and a 0 is assigned to the remaining pixels. This horizontal and vertical comparison process can be conducted in any order, i.e., horizontal comparison followed by vertical comparison, or vice versa. Moreover, in both vertical and horizontal comparisons, we do not include the center pixel for comparison.
Referring to Fig. 2, as an example, for horizontal comparison, we first compare the pixel sets of $\{p_0, p_1, p_2\}$, $\{p_7, p_3\}$, and $\{p_6, p_5, p_4\}$. Subsequently, we conduct the vertical comparison with the pixel sets of $\{p_0, p_7, p_6\}$, $\{p_1, p_5\}$, and $\{p_2, p_3, p_4\}$. If a pixel has conflicting outputs in the horizontal and vertical comparisons (e.g., the highest value in the horizontal comparison but not in the vertical comparison, or vice versa), then the highest value (i.e., 1) is used as the final output, since the pixel is regarded as important and contains valuable contrast information in the dimension that generates the highest value. The proposed $\mathrm{hvnLBP}_{p,r}$ operator is defined mathematically in (1), where $p$ is the number of neighborhood pixels and $r$ is the radius. $l_i$ represents the $i$th neighborhood of pixel $l$, while $S$ denotes the comparison operation defined in (2), where $l_j$, $l_k$, and $l_m$ represent the neighborhood pixels in a row or column. Note that $l_k$ is removed if it is the center pixel. An example output of the proposed $\mathrm{hvnLBP}_{p,r}$ operator is provided in Fig. 2, where $p = 8$ and $r = 1$. In this paper, we use a window size of 75 × 75 pixels to represent a detected face image. Therefore, by applying the proposed hvnLBP operator, we obtain 25 × 25 (i.e., 625) subregions, with the size of each subregion being 3 × 3.
Overall, in comparison with the original LBP operator, the experimental results indicate that hvnLBP is more capable of capturing discriminative contrast information such as corners and edges among neighborhoods to inform subsequent PSO-based feature selection and facial expression analysis.
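The horizontal and vertical comparison can be sketched for one 3 × 3 subregion as follows. This is a minimal illustration under stated assumptions rather than the reference implementation: the clockwise placement of $p_0, \ldots, p_7$ is inferred from the row and column sets listed in the text, while the handling of ties (every pixel equal to the maximum receives a 1) and the bit ordering of the final code ($p_0$ as the least significant bit) are assumptions. The names `hvn_lbp_bits` and `hvn_lbp_code` are hypothetical.

```python
# Clockwise neighbour positions p0..p7 in a 3x3 patch (row, col); the centre (1, 1) is excluded.
POSITIONS = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
# Index sets compared horizontally and vertically, as listed in the text.
ROWS = [(0, 1, 2), (7, 3), (6, 5, 4)]
COLS = [(0, 7, 6), (1, 5), (2, 3, 4)]

def hvn_lbp_bits(patch):
    """Return the eight output bits for p0..p7 of one 3x3 patch."""
    p = [patch[r][c] for r, c in POSITIONS]
    bits = [0] * 8
    for group in ROWS + COLS:
        best = max(p[i] for i in group)
        for i in group:
            # A pixel keeps a 1 if it wins in either direction (conflicts resolve to 1).
            # Assumption: tied maxima all receive a 1.
            if p[i] == best:
                bits[i] = 1
    return bits

def hvn_lbp_code(patch):
    """Pack the bits into an integer code, with p0 as the least significant bit (assumed)."""
    return sum(bit << i for i, bit in enumerate(hvn_lbp_bits(patch)))
```

For the patch `[[5, 2, 1], [4, 0, 9], [3, 8, 7]]`, the row winners are $p_0$, $p_3$, and $p_5$, and the column winners are the same three pixels, so the bits are `[1, 0, 0, 1, 0, 1, 0, 0]`.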

B. Proposed PSO Algorithm for Feature Optimization
To identify the discriminative characteristics of each expression, we propose a PSO variant embedded with the concept of mGA for feature optimization, called the mGA-embedded PSO algorithm. This proposed PSO algorithm mitigates the premature convergence problem of conventional PSO, and shows superior capabilities of discriminative feature selection. The proposed mGA-embedded PSO algorithm employs personal average experience and Gaussian mutation for velocity updating. Furthermore, it integrates the diversity maintenance strategy of mGA to keep the original swarm in a nonreplaceable memory, which remains intact during the lifecycle of the algorithm to increase swarm diversity. Inherited from the concept of mGA, a secondary swarm with a small population size of five particles is employed. The swarm comprises a leader and four follower particles with the highest or lowest correlation to the leader from the nonreplaceable memory to increase local and global search capabilities and avoid premature convergence. Moreover, the algorithm separates facial features into specific areas for in-depth local subdimension-based search. Overall, the local exploitation and global exploration search strategies of the algorithm work cooperatively to lead the search process to the global optima. Algorithm 1 illustrates the pseudo code of the proposed mGA-embedded PSO algorithm, while Fig. 3 shows the flowchart of the algorithm.
1) Update of pbest and gbest: In conventional PSO, each solution is represented as a particle in the swarm. Particles move in the search space by following the swarm leader in order to find the optimal solutions. Each particle has a position in the search space represented as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$, and a velocity represented as $v_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$, with $D$ denoting the dimensionality of the search space. Each particle has a memory of its best experience whose position is represented as pbest. The swarm leader represents the best experience of the overall swarm, whose position is represented as gbest. The position, $x_{id}^{t+1}$, and velocity, $v_{id}^{t+1}$, of each particle are updated using the following equations [34]:
$$x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1} \tag{3}$$
$$v_{id}^{t+1} = w\,v_{id}^{t} + c_1 r_1 \left(p_{id} - x_{id}^{t}\right) + c_2 r_2 \left(p_{gd} - x_{id}^{t}\right) \tag{4}$$
where $t$ and $d$ indicate the $t$th iteration and $d$th dimension in the search space, respectively. An inertia weight, $w$, is used to weight the influence of the previous velocity. Note that $r_1$ and $r_2$ represent random values within the range of [0, 1], whereas $c_1$ and $c_2$ are the acceleration constants.
Furthermore, $p_{id}$ and $p_{gd}$ indicate elements of pbest and gbest in the $d$th dimension. In this paper, we modify the velocity updating formula (4) by introducing the averaging search strategy for computing $p_{id}$ and Gaussian mutation for computing $p_{gd}$. Specifically, the averaging search strategy takes the personal average experience into account, instead of the conventional personal best experience. The average experience is obtained by averaging the positions found from previous iterations of each individual particle for generating pbest. This enables the algorithm to better examine the regions of the search space in between these positions, which increases local exploitation. Furthermore, instead of using the position of the global best experience directly, a Gaussian distribution operation is applied to the swarm leader to generate gbest. This mutation technique enables the generation of offspring further away from its parent to increase global exploration. Therefore, the revised velocity updating strategy is better able to sustain search diversity. The updated formulas are provided as follows:
$$v_{id}^{t+1} = w\,v_{id}^{t} + c_1 r_1 \left(p'_{id} - x_{id}^{t}\right) + c_2 r_2 \left(p'_{gd} - x_{id}^{t}\right) \tag{5}$$
$$p'_{id} = \frac{1}{t}\sum_{k=1}^{t} x_{id}^{k} \tag{6}$$
$$p'_{gd} = p_{gd} + \left(x_{\max}^{d} - x_{\min}^{d}\right)\varphi(o, h) \tag{7}$$
where $p'_{id}$ and $p'_{gd}$ represent the updated pbest and gbest in the $d$th dimension using personal average experience and Gaussian distribution, respectively, as defined in (6) and (7). Moreover, in (7), $\varphi(o, h)$ indicates the Gaussian distribution, with $o$ as the mean of the distribution and $h$ as the standard deviation, which decreases linearly during the execution. Note that $x_{\max}^{d}$ and $x_{\min}^{d}$ indicate the upper and lower bounds of the decision vector in the $d$th dimension, respectively, $d = 1, 2, \ldots, D$.
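The revised update can be sketched as follows, as a minimal illustration under stated assumptions rather than the authors' implementation. The averaged pbest is taken as the mean of the positions a particle has visited so far, and the mutated gbest adds Gaussian noise scaled by the search range. The zero mean of the Gaussian, the per-dimension random factors, and the symmetric velocity clipping are assumptions; the function names are hypothetical. The default values of `w`, `c1`, `c2`, and `v_max` follow the settings reported in this paper.

```python
import numpy as np

def averaged_pbest(history):
    """Personal average experience: mean of the visited positions, shape (t, D) -> (D,)."""
    return np.mean(np.asarray(history, dtype=float), axis=0)

def mutated_gbest(gbest, x_min, x_max, h, rng):
    """Gaussian mutation of the leader; h is the standard deviation (mean assumed to be 0)."""
    noise = rng.normal(0.0, h, size=np.shape(gbest))
    return np.asarray(gbest, dtype=float) + (x_max - x_min) * noise

def update_particle(x, v, p_avg, g_mut, rng, w=0.78, c1=1.2, c2=1.2, v_max=0.6):
    """One velocity and position update using the averaged pbest and mutated gbest."""
    r1 = rng.random(x.shape)  # assumption: one random factor per dimension
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (p_avg - x) + c2 * r2 * (g_mut - x)
    v_new = np.clip(v_new, -v_max, v_max)  # assumption: symmetric clipping at v_max
    return x + v_new, v_new
```

A caller would decrease `h` linearly over the iterations, as described above, so that the mutation explores widely at first and narrows as the search proceeds.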
As indicated in Algorithm 1, we first initialize the original swarm with 30 particles. The modified PSO operation with the proposed velocity updating formula is applied to the initial swarm. It iterates ten times at the beginning of the algorithm to find the best leader. We use a small number of iterations (i.e., 10) for this initial PSO search to accelerate convergence and allow benefits from subsequent search strategies to take place. This mainly aims to find the best balance between computational costs and performance. The following setting (obtained from experimental trials) is applied to this modified PSO operation, i.e., maximum velocity = 0.6, inertia weight = 0.78, population size = 30, acceleration constants $c_1 = c_2 = 1.2$, and maximum generations = 500. Moreover, (8) is used to define the fitness evaluation for each particle, $C$, which consists of two criteria, i.e., classification performance and the number of selected features. Since we apply the proposed PSO algorithm to each emotion category separately, in an attempt to identify the discriminative features for each distinct expression, the classification accuracy score in (8) indicates the accuracy of each individual expression, rather than the combined accuracy across all emotion categories. This helps avoid bias toward specific emotion categories during optimization (see the related discussion in Section IV)
$$\mathrm{fitness}(C) = w_a \cdot \mathrm{acc}_C + w_f \cdot \left(\mathrm{number\_feature}_C\right)^{-1} \tag{8}$$
where $w_a$ and $w_f$ are two predefined weights for classification accuracy and the number of selected features, respectively, with $w_a = 1 - w_f$. These two parameters indicate the relative importance of classification performance and the number of selected features. In this paper, since the classification performance is considered to be more important than the number of selected features, $w_a$ assumes a higher value than $w_f$, i.e., $w_a = 0.9$ and $w_f = 0.1$.
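The two-criterion fitness evaluation can be illustrated with a short sketch. It assumes that the feature-count criterion enters as the reciprocal of the number of selected features, so that smaller subsets score higher; this form, the function name `fitness`, and the error raised for an empty subset are assumptions made for illustration. The weights default to the values stated above.

```python
def fitness(accuracy, n_selected, w_a=0.9, w_f=0.1):
    """Weighted sum of per-expression accuracy and the inverse feature count."""
    if n_selected <= 0:
        # Assumption: an empty feature subset is treated as invalid.
        raise ValueError("at least one feature must be selected")
    return w_a * accuracy + w_f * (1.0 / n_selected)
```

Under these weights the accuracy term dominates: a particle with accuracy 0.8 and 10 selected features scores about 0.73, and reducing the subset size only breaks ties between particles of similar accuracy.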
2) Construction of Secondary Swarm Embedded With the Concept of mGA: Besides the velocity updating mechanism, the proposed PSO algorithm integrates the concepts of mGA and a secondary swarm, as well as the cooperation of local exploitation and global exploration search strategies to balance between convergence speed and swarm diversity. In summary, the proposed algorithm employs the diversity maintenance strategy of mGA using a nonreplaceable memory. This nonreplaceable memory comprises the initialized swarm to sustain search diversity. Motivated by the small population size concept of mGA, a secondary swarm with five particles comprising the swarm leader and four follower particles from the nonreplaceable memory with the highest or lowest correlation with the leader is constructed to increase local exploitation and global exploration. A subdimension-based search in the secondary swarm is also conducted, in order to identify the discriminative regional facial features. Moreover, the local exploitation and global exploration search strategies of the secondary swarm work in a collaborative manner to avoid stagnation and overcome premature convergence. The details of these strategies are as follows.
mGA is a small-population GA with a reinitialization mechanism. It was initially proposed by Goldberg [35], whose theories suggested that a small population was sufficient to achieve convergence regardless of the chromosome length. mGA usually employs a population of 3-6 chromosomes and shows great capability of solving nonlinear optimization problems [36]. Instead of using the mutation operation as in classical GA, mGA employs a restart strategy to maintain genetic diversity in the population.
The mGA model is proven to be more capable of avoiding premature convergence and reaching the optimal search region than the classical GA [37].Because of its impressive performance and fast convergence speed, mGA has been widely used to deal with single-objective and multiobjective optimization problems [38].Furthermore, Coello and Pulido [11] proposed a multiobjective mGA with two memories, i.e., population memory and external memory.The population memory consists of replaceable and nonreplaceable aspects.The nonreplaceable fragment of the memory remains intact during the entire lifetime of the algorithm, in order to bring sufficient diversity to the algorithm, whereas the replaceable portion of the memory is used for conventional evolution where the solutions are kept updated in the subsequent evolutionary cycles.The multiobjective mGA shows efficient search This paper borrows the multiobjective mGA concept with the replaceable and nonreplaceable memories to update the swarm leader (replaceable portion) and preserve diversity of the initialized swarm (nonreplaceable portion), respectively.After initializing the swarm with 30 randomly generated particles at the beginning of the algorithm (see Algorithm 1), this original swarm is stored in the nonreplaceable memory, which remains intact during the lifetime of the algorithm, in order to reward swarm diversity when stagnation occurs.To balance between swarm diversity and convergence speed, a secondary swarm embedded with the small population concept of mGA is constructed.It has a typical population size of five, and consists of a swarm leader and four follower particles from the nonreplaceable memory.As illustrated in Algorithm 1, the followers are chosen based on two types of correlation relationships with the leader: 1) the lowest and 2) the highest correlations.Particles with the lowest correlation provide higher variations in the swarm to enable global exploration whereas particles with the highest correlation 
bring more similarity to the swarm, where local exploitation can be observed. Moreover, we define the correlation relationship between particles using (9) and (10) [39]. Since the features extracted using hvnLBP are in binary format and can be converted into histograms easily, we use the histogram correlation comparison method, as shown in (9) and (10) [39], to identify particles with the highest/lowest correlation to the leader

$$\mathrm{corr}(H_1, H_2) = \frac{\sum_{I} \left(H_1(I) - \bar{H}_1\right)\left(H_2(I) - \bar{H}_2\right)}{\sqrt{\sum_{I} \left(H_1(I) - \bar{H}_1\right)^2 \sum_{I} \left(H_2(I) - \bar{H}_2\right)^2}} \tag{9}$$

where

$$\bar{H}_k = \frac{1}{N} \sum_{J} H_k(J) \tag{10}$$

where corr indicates the correlation relationship between two particles, with $H_1$ and $H_2$ representing the histograms for the swarm leader and a follower particle, respectively. $\bar{H}_k$ indicates the mean of the histogram for the kth particle (k = 1, 2), whereas N represents the number of histogram bins and I indicates the intensity range present in the histogram. Equation (9) produces an output in the range of [0, 1], with 0 and 1 representing the lowest and highest correlations, respectively.

As shown in Algorithm 1 and Fig. 3, after the swarm leader is identified by the previous modified PSO process, the four particles from the nonreplaceable memory with the highest correlation with the leader are first recruited to the secondary swarm. The follower particles are extracted from the nonreplaceable memory, instead of the main swarm, to avoid diversity loss, as the particles in the main swarm tend to converge and become identical after ten iterations. Moreover, these follower particles with the highest correlation with the leader provide a certain degree of position proximity in the secondary swarm, therefore enabling local exploitation of the search space. Subsequently, we divide each particle in the secondary swarm into five feature subsections, with each subsection representing one facial region, to enable an in-depth local search to identify its discriminative features. This in-depth local optimal facial feature search is discussed in detail in Section III-B2a. This section-based local facial feature search reveals a new swarm leader whose fitness value is compared with that of the previous leader, in order to elect a new leader for the next iteration.
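The follower recruitment described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that (9) is the standard normalized histogram correlation implied by the surrounding definitions, and the function names `hist_correlation` and `select_followers`, the `mode` argument, and the handling of flat histograms are hypothetical choices made for the example.

```python
import math


def hist_correlation(h1, h2):
    """Correlation between two equal-length histograms (assumed form of (9))."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    n = len(h1)
    mean1 = sum(h1) / n  # histogram mean, as in (10)
    mean2 = sum(h2) / n
    num = sum((a - mean1) * (b - mean2) for a, b in zip(h1, h2))
    den = math.sqrt(
        sum((a - mean1) ** 2 for a in h1) * sum((b - mean2) ** 2 for b in h2)
    )
    # Assumption: a flat histogram (zero variance) is treated as uncorrelated.
    return num / den if den > 0 else 0.0


def select_followers(leader_hist, memory_hists, k=4, mode="highest"):
    """Indices of the k memory particles most or least correlated with the leader."""
    if mode not in ("highest", "lowest"):
        raise ValueError("mode must be 'highest' or 'lowest'")
    scored = [(hist_correlation(leader_hist, h), i) for i, h in enumerate(memory_hists)]
    # Highest correlation first for local exploitation, lowest first for exploration.
    scored.sort(key=lambda s: s[0], reverse=(mode == "highest"))
    return [i for _, i in scored[:k]]
```

Calling `select_followers(leader_hist, memory_hists, mode="highest")` would return the four followers used for local exploitation, and `mode="lowest"` the four used for global exploration, with `memory_hists` standing in for the histograms of the 30 particles kept in the nonreplaceable memory.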
After employing particles with the highest correlation with the leader as followers to conduct an in-depth local optimal facial feature search, the secondary swarm recruits a new set of four particles with the lowest correlation with the leader from the nonreplaceable memory to replace the existing follower particles. Since this new set of low-correlation follower particles, recruited from the original swarm, injects high variation into the secondary swarm, it boosts swarm diversity significantly to increase global exploration and avoid premature convergence. Subsequently, the newly updated, diversified secondary swarm is also used to conduct a local facial feature search (see Section III-B2a) to identify a new swarm leader.
In this way, particles with the highest or lowest correlation with the swarm leader are recruited alternately from the nonreplaceable memory into the secondary swarm to increase local exploitation and global exploration. Moreover, when local exploitation in the subdimension search using particles with the highest correlation with the leader stagnates, our PSO algorithm employs follower particles with the lowest correlation with the leader from the nonreplaceable memory to increase swarm diversity and drive the search out of the local optimum trap. On the other hand, when global exploration in the subdimension search using particles with the lowest correlation with the leader fails to generate a fitter leader, it recruits follower particles with the highest correlation with the leader from the nonreplaceable memory to avoid stagnation and enable local exploitation. Therefore, the local and global search mechanisms embedded in the secondary swarm work cooperatively to mitigate premature convergence and lead the search toward the global optimum.
a) In-depth local optimal feature search: As discussed earlier, after particles with the highest or lowest correlation with the leader are recruited into the secondary swarm, we divide each particle in the secondary swarm into five feature sections, with each section consisting of the subset of dimensions that represents a specific facial region (e.g., eye, eyebrow, nose, mouth, and cheek). For each facial region, we apply the above modified PSO operation with the new velocity updating formula defined in Section III-B to conduct an in-depth local search and to identify its optimal discriminative features. These optimal local solutions are then concatenated to generate a new swarm leader, which is used to replace the previous leader if it has a better fitness value.
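The split, search, and concatenate cycle described above can be sketched as follows. This is only an illustration under stated assumptions: the mapping of dimensions to facial regions is not given here, so `equal_boundaries` uses a hypothetical layout of contiguous, near-equal slices; `section_search` is a placeholder for the modified PSO run on one region; and a larger fitness value is assumed to be better.

```python
def equal_boundaries(dim, n_sections=5):
    """Hypothetical layout: split `dim` features into contiguous, near-equal slices."""
    base, extra = divmod(dim, n_sections)
    bounds, start = [], 0
    for i in range(n_sections):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def regional_search(swarm, leader, boundaries, section_search, fitness):
    """Search each facial-region slice separately, then rebuild a candidate leader."""
    candidate = []
    for start, end in boundaries:
        # Slices of every secondary-swarm particle for this facial region.
        slices = [particle[start:end] for particle in swarm]
        # `section_search` stands in for the modified PSO run on this region.
        candidate.extend(section_search(slices))
    # Assumption: larger fitness is better; the previous leader is kept on ties.
    return candidate if fitness(candidate) > fitness(leader) else leader
```

Keeping the per-region search behind a callable makes the sketch independent of the velocity update, which is defined elsewhere in the paper.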
The overall optimization process of our algorithm iterates until: 1) the number of generations reaches 500 and 2) the fitness value does not show obvious improvement during the last 50 generations. The proposed PSO-based feature selection is conducted for each emotion category separately to identify discriminative features for each expression. The optimal feature subset generated by our PSO algorithm for each expression is shown in Fig. 4, with a detailed analysis provided in Section IV. Empirical results indicate that our algorithm outperforms other PSO variants and conventional methods significantly in terms of the search toward the global optimum and discriminative feature selection.
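A stopping check built from the two criteria above might look like the sketch below. Several details are assumptions rather than statements from the paper: the text does not quantify "obvious improvement", so `tol` is a hypothetical threshold; the sketch stops as soon as either criterion holds, which is one possible reading; and fitness is assumed to be maximized.

```python
def should_stop(best_fitness_history, max_generations=500, window=50, tol=1e-4):
    """Return True when the run should end, given the best fitness per generation."""
    generation = len(best_fitness_history)
    # Criterion 1: the generation budget has been used up.
    if generation >= max_generations:
        return True
    # Criterion 2: no obvious gain over the last `window` generations.
    if generation > window:
        gain = best_fitness_history[-1] - best_fitness_history[-1 - window]
        return gain <= tol  # `tol` is a hypothetical threshold
    return False
```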

C. Emotion Recognition
In this paper, we conduct a study of seven-class facial emotion recognition using the features automatically generated by the mGA-embedded PSO. An NN with backpropagation, a multiclass SVM [40], and ensemble classifiers are used for classification. The detailed settings of the classifiers are as follows. A trial-and-error method is used to identify the optimal NN structure, whereas a grid-search method is applied to find the optimal parameters of the multiclass SVM with the RBF kernel. After several trials, the NN is equipped with one input layer with 25-40 nodes indicating the optimized features obtained from the proposed PSO algorithm, one hidden layer, and one output layer with seven nodes representing the seven expressions. For the grid search of optimal settings for the multiclass SVM with the RBF kernel, we use exponentially growing sequences and search the ranges of $[2^{-5}, 2^{15}]$, $[2^{-10}, 2^{5}]$, and $[2^{-8}, 2^{-1}]$, respectively, for the soft-margin constant, C, the kernel parameter, gamma (γ), and the epsilon (ε) in the loss function, since the combination of these three parameters plays a very important role in the SVM's performance. We also employ tenfold cross validation to identify the best combination of these parameters to avoid overfitting. The optimal setting identified in the training stage is then applied to the subsequent experiments in the test stage.
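The search space described above can be enumerated as in the following sketch. The exponent step of 1 is an assumption (the text states only that the sequences grow exponentially), and `power_of_two_grid`, `parameter_grid`, and `select_best` are hypothetical helper names. The scoring function passed to `select_best` is expected to return the tenfold cross-validation accuracy for a candidate setting; it is left to the caller because the training code is outside the scope of this example.

```python
from itertools import product


def power_of_two_grid(low_exp, high_exp, step=1):
    """Exponentially growing sequence 2**low_exp, ..., 2**high_exp (step is assumed)."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1, step)]


C_GRID = power_of_two_grid(-5, 15)        # soft-margin constant C
GAMMA_GRID = power_of_two_grid(-10, 5)    # RBF kernel parameter gamma
EPSILON_GRID = power_of_two_grid(-8, -1)  # epsilon in the loss function


def parameter_grid():
    """Yield every (C, gamma, epsilon) combination as a dictionary."""
    for c, gamma, eps in product(C_GRID, GAMMA_GRID, EPSILON_GRID):
        yield {"C": c, "gamma": gamma, "epsilon": eps}


def select_best(candidates, cv_score):
    """Pick the candidate with the highest caller-supplied cross-validation score."""
    return max(candidates, key=cv_score)
```

With a step of 1 this gives 21 × 16 × 8 = 2688 candidate settings; a coarser step would reduce the cost of the search at the expense of resolution.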
Besides these single-model classifiers, we also employ ensemble classifiers for expression recognition in order to improve accuracy. We use weighted majority voting for the construction of ensembles because of its impressive performance and its suitability for the small datasets (<1000) used in this paper. We construct two ensembles, with NN and multiclass SVM as the base model, respectively. Each of the NN-based and SVM-based ensembles uses three base models. The optimal settings identified earlier for NN and SVM are applied for building each base model.
The ensemble classifiers are constructed using an AdaBoost process so that the three base models within each ensemble classifier are complementary to each other [5], [41]. The training process of each ensemble classifier focuses on misclassified instances. As an example, the weights of the instances misclassified by the first base model are increased so that they are more likely to be selected for training the second base model. The same applies to the construction of the third base model, which employs the instances misclassified by the second base model for training. Therefore, each ensemble classifier is constructed with a number of base models that are complementary to each other [5], [41]. Weighted majority voting is applied to combine the outputs from the three base models to generate the final output for each ensemble. The empirical results indicate that the constructed ensembles outperform NN/SVM-based emotion recognition for both within- and cross-database evaluations.
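The two mechanisms described above, reweighting misclassified instances and combining base-model outputs by weighted majority voting, can be illustrated with the short sketch below. It is deliberately simplified: the constant `factor` is a hypothetical stand-in for the error-dependent update used in AdaBoost, and the per-model voting weights are assumed to be supplied by the training procedure.

```python
from collections import defaultdict


def reweight(weights, misclassified, factor=2.0):
    """Increase the weights of misclassified instances, then renormalize to sum to 1."""
    raw = [w * factor if wrong else w for w, wrong in zip(weights, misclassified)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_majority_vote(predictions, model_weights):
    """Return the label with the largest total weight across the base models."""
    tally = defaultdict(float)
    for label, weight in zip(predictions, model_weights):
        tally[label] += weight
    return max(tally, key=tally.get)
```

For example, three base models predicting happy, sad, and sad with weights 0.6, 0.25, and 0.25 yield happy, because the single confident model outweighs the other two combined.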

IV. EVALUATION
In this paper, both CK+ and MMI are employed for evaluation. A set of 250 images from CK+ is used for training while 175 images extracted from CK+ and MMI, respectively, are employed for testing.

A. Comparison of Feature Extraction Techniques
First of all, a series of experiments is conducted to compare the proposed hvnLBP operator with other state-of-the-art texture descriptors including CLBP, DLBP, CS-LBP, LDP, and LPQ. The Gabor filter is integrated with each texture descriptor algorithm for feature extraction. Low-level raw features extracted by each descriptor are directly used for emotion classification without any feature optimization. All algorithms achieve their best performance when ensemble classifiers are applied. Table I shows the evaluation results of all descriptors integrated with ensembles, with each ensemble trained with features extracted by the corresponding texture descriptor. Both second- and third-order LDPs are implemented. The results of the second-order LDP are presented in Table I, since it achieves the best performance.
Built upon the LBP methodology, DLBP and CLBP rely on the comparison between the center point and its neighbors but ignore the differences among neighborhood pixels themselves. Therefore, they show limitations in identifying different

LPQ with decorrelation is implemented in our experiment. LPQ shows great robustness to blurred images by employing local phase information calculated using a short-term Fourier transform for each pixel position. However, it has higher computational complexity, and is expensive for online applications in comparison with hvnLBP. In addition, the window size is one of the important parameters in LPQ. A smaller window is able to capture detailed texture information, but other unimportant patterns caused by illumination changes and noise factors are extracted as well. On the contrary, a larger window sometimes is not able to extract sufficient discriminative information, therefore decreasing the performance for sharp images [16].
Among all the compared descriptors, the second-order LDP, which extracts more detailed high-order local pattern information, achieves the best accuracy rate. However, the empirical results indicate that sometimes it also extracts overdetailed patterns which contain more noise in comparison with hvnLBP. Moreover, the second-order LDP also generates high-dimensional features with a high computational cost, which makes it less suitable for real-time applications. Another limitation of using LDP is the requirement of identifying the optimal order of LDP that is suitable for a specific database, although the third-order LDP outperformed all the other order LDPs in [15] for face recognition tasks.
In comparison with the abovementioned comparable methods, the proposed hvnLBP operator effectively extracts spatial

B. Comparison of Feature Selection Techniques
To evaluate the proposed mGA-embedded PSO algorithm for feature selection, we have implemented state-of-the-art methods for comparison, i.e., ELPSO [20], a PSO variant for multimodal function optimization (MFOPSO) [42], binary BPSO (BBPSO) [21], ThBPSO [24], HEPSO [18], conventional PSO, and classical GA. The features extracted by hvnLBP are further processed by each feature optimization algorithm for dimensionality reduction. NN, SVM, and NN-based and SVM-based ensembles are applied to recognize seven emotions using automatically generated features based on each feature optimization technique.
We have also conducted the cross-database evaluation with a training set of 250 images from CK+ and a test set of 175 images from MMI. Table III summarizes the average accuracy rates for all the selected models integrated with different classifiers over 30 runs for the cross-database evaluation. The best performances are yielded by the SVM-based ensemble for all feature selection methods. The proposed PSO algorithm extracts the smallest number of features, achieves an average accuracy rate of 94.66% for seven emotions, and outperforms seven other methods by 6.35% (BBPSO), 6.57% (MFOPSO), 7.21% (ELPSO), 9.49% (HEPSO), 12.28% (ThBPSO), 17.89% (PSO), and 18.35% (GA), respectively. In Fig. 5, the boxplot diagrams clearly demonstrate the distribution of the classification results over 30 runs of all the feature selection methods in combination with the SVM-based ensemble for the cross-database evaluation.
As can be seen in Fig. 5, the results of all 30 runs of the proposed PSO algorithm outperform those of all other state-of-the-art PSO variants, conventional PSO, and classical GA by a significant margin. For example, all the results of 30 runs of our algorithm except for one outlier (with the lower whisker at 91.29%) are higher than the maximum results of all the following methods, i.e., 91.14% for MFOPSO, 90.57% for ELPSO, 88% for HEPSO, 86.57% for ThBPSO, 79.57% for PSO, and 80% for classical GA. Furthermore, at least 75% of the results of our algorithm (with the first quartile of 93.71%) are higher than the maximum result, i.e., 92.57% from BBPSO. Among all the selected state-of-the-art PSO variants, BBPSO, MFOPSO, and ELPSO achieve comparatively better performances than HEPSO and ThBPSO, i.e., with at least 25% of the results of these three PSO variants higher than the maximum result (88%) of HEPSO and at least 75% of the results of these three PSO variants higher than the maximum result (86.57%) of ThBPSO. In comparison with these three best PSO variants, i.e., BBPSO, MFOPSO, and ELPSO, the median value of our algorithm (94.71%) is higher than the median scores of BBPSO (88.29%), MFOPSO (88.29%), and ELPSO (87.43%) by 6.42%, 6.42%, and 7.28%, respectively. Besides outperforming these three best PSO variants, all the results of our algorithm are within a smaller variation range of [91.29%, 97.86%], as compared with those from BBPSO having a larger variation of [85.57%, 92.57%]. Moreover, the lowest result of our PSO algorithm (i.e., the lower whisker at 91.29%) outperforms the maximum results of HEPSO (88%), ThBPSO (86.57%), classical GA (80%), and PSO (79.57%) by 3.29%, 4.72%, 11.29%, and 11.72%, respectively.

Furthermore, the average classification results of each expression over the 30 runs for each optimization method with the SVM-based ensemble classifier for the cross-database evaluation are depicted in Fig. 6(a), whereas Fig. 6(b)-(h) shows detailed boxplot diagrams for the distribution of the classification results over 30 runs for each emotion category. As indicated in Fig. 6(a)-(h), the proposed PSO algorithm achieves superior performance and outperforms all the other compared methods for each emotion significantly. With respect to the fear and sadness emotion categories, 75% of the classification results of our model are higher than the maximum results of all seven methods, whereas at least 50% of the results of our algorithm are also higher than the maximum results of all other methods for the anger, happiness, surprise, and neutral emotion classes. Meanwhile, for the disgust emotion, the results of our algorithm over 30 runs indicate the overall smallest variation of [89%, 97%], as compared with the larger variations of the other results, e.g., [78%, 97%] for BBPSO, MFOPSO, and ELPSO, respectively. The proposed diversity maintenance strategies of our PSO algorithm contribute to its superior performance over other state-of-the-art and conventional methods.
An analysis pertaining to the theoretical contribution of the proposed algorithm is as follows. We compare our PSO algorithm with the three advanced PSO variants, i.e., BBPSO, MFOPSO, and ELPSO, theoretically. BBPSO [21] employs a reinforced memory strategy for updating pbest for each particle and a uniform combination technique, which uses a random number to replace subdimensions of each particle with the corresponding elements of a randomly selected $pbest_k$ from a set of stored pbests, to avoid stagnation. It applies the uniform combination more frequently as the number of stagnant iterations increases. However, since the uniform combination operation is only applied to the subelements of swarm particles and simulates the effects of the crossover and mutation operations of the GA, the generated offspring could be significantly similar (i.e., with a high correlation) to the parent particles. Therefore, its search strategy focuses more on local exploitation. In contrast, our PSO variant applies follower particles which have the highest or lowest correlation with the leader to diversify the search and increase both local and global search capabilities, in an attempt to avoid stagnation. Therefore, it shows superior performance to BBPSO.
MFOPSO [42] divides the original swarm into several subswarms to increase search diversity. It is capable of dealing with multimodal function optimization. However, when the search fails to generate fitter leaders in the subswarms, MFOPSO does not include any diversity maintenance or jump-out mutation strategy to diversify the search in the subswarms and thereby avoid premature convergence.
The same explanation applies to ELPSO [20]. It employs Gaussian, Cauchy, opposition-based, and DE-based mutation strategies to increase the exploration capability of the swarm leader. However, ELPSO only attempts to improve the leader when stagnation occurs, and no improvement strategy is applied to the follower particles to retain population diversity. In comparison with MFOPSO and ELPSO, our PSO algorithm utilizes the diversity maintenance mechanism of mGA and keeps a nonreplaceable memory to maintain swarm diversity. It not only applies Gaussian mutation to the swarm leader to enable long jumps in the primary swarm but also employs particles with the highest or lowest correlation with the swarm leader from the nonreplaceable memory to retain population diversity and increase local exploitation and global exploration. Most importantly, these local and global search strategies of the secondary swarm work collaboratively to lead the search toward the global optimum. Therefore, it outperforms MFOPSO and ELPSO significantly in terms of

A comparison between our proposed PSO algorithm and other recent state-of-the-art facial expression recognition methods has been conducted. Tables IV and V show the comparison among different methods using the CK+ and MMI databases, respectively. As shown in Table IV, for the evaluation using CK+, Neoh et al. [41], who proposed both direct similarity and Pareto-based optimization for facial feature selection, achieve the best performance. The Pareto-based feature selection emphasizes both intraclass and interclass variations and achieves the highest accuracy rate. However, although related strategies are adopted in their fitness functions to prevent information loss, inspection of their results indicates that the algorithms produce a comparatively small subset of 13-39 features and, sometimes, could overlook certain important features pertaining to certain emotion categories (e.g., widened eyes for surprise, mouth stretch for fear, etc.) in comparison with our proposed algorithm. As illustrated in Fig. 4, the feature subregions extracted by our PSO algorithm indicate the most significant texture distortions around the eyes, eyebrows, and the mouth associated with each distinct expression. The key facial muscular actions defined in the facial action coding system (FACS) [47] associated with each expression can be clearly seen in the optimized features revealed by our algorithm. For example, for anger, significant features indicating brow lowerer, eyelid tightener, and lip tightener are produced by our PSO algorithm, whereas the subregions indicating the significance of lip corner puller and cheek raising are revealed for the happy expression. The feature distribution pertaining to sadness clearly indicates the implication of the inner brow raiser and lip corner depressor, whereas eyebrow raiser, widened eyes, and mouth open are demonstrated in the selected subregions for surprise, etc. Overall, the features identified by our PSO algorithm represent the characteristics of each emotion significantly and map closely to the action units given in FACS.
We also conduct the cross-database evaluation to further assess the scalability of the proposed PSO algorithm using the MMI database. Table V shows a comparison with other related methods. Fang et al. [45] employed MMI for both training and testing, whereas other methods including this paper used CK+ for training and MMI for testing. Results indicate our algorithm shows great scalability and extracts the most discriminative features of each expression for the cross-domain evaluation. It outperforms all related methods by a significant margin of approximately 20%-37%.

V. CONCLUSION
In this paper, we have proposed a facial expression recognition system with hvnLBP-based feature extraction, mGA-embedded PSO-based feature optimization, and diverse classifier-based expression recognition. The proposed hvnLBP operator performs horizontal and vertical neighborhood pixel comparison to retrieve the initial discriminative facial features. It outperforms state-of-the-art LBP variants, LPQ, and conventional LBP significantly for texture classification. Moreover, a new PSO algorithm, i.e., mGA-embedded PSO, has been proposed to mitigate the premature convergence problem of conventional PSO in terms of feature optimization. The mGA-embedded PSO algorithm incorporates personal average experience and Gaussian mutation for velocity updating, and employs the diversity maintenance strategy of mGA by keeping the original swarm in a nonreplaceable memory, which remains intact during the lifecycle of the algorithm to increase swarm diversity. Furthermore, it also maintains a secondary swarm with a small population size of five to host the swarm leader and four follower particles with the highest/lowest correlation with the leader from the nonreplaceable memory to increase local and global search capabilities. The algorithm subsequently separates facial features into specific areas for an in-depth local subdimension-based search. Overall, the local exploitation and global exploration search mechanisms of the algorithm work cooperatively to guide the search toward the global optimal solutions. The empirical results indicate that our PSO algorithm outperforms other state-of-the-art PSO variants and conventional PSO and GA for optimal feature selection significantly. Integrated with the SVM-based ensemble, our algorithm achieves the best average accuracy of 100% over 30 runs for the within-database (CK+) evaluation and 94.66% accuracy for the cross-domain (MMI) evaluation. On an average of 30 runs, it outperforms seven optimization algorithms by 2.6% (BBPSO), 2.7% (MFOPSO), 4.7% (ELPSO), 5.6% (HEPSO), 7.4% (ThBPSO), 14.7% (PSO), and 20.2% (GA), respectively, for the within-domain evaluation using CK+, and by 6.35% (BBPSO), 6.57% (MFOPSO), 7.21% (ELPSO), 9.49% (HEPSO), 12.28% (ThBPSO), 17.89% (PSO), and 18.35% (GA), respectively, for the cross-domain evaluation using MMI. The empirical results also indicate that our proposed PSO algorithm outperforms other related facial expression recognition methods reported in the literature by a significant margin.
We have identified the following directions for further improvements. Diverse search strategies such as the firefly algorithm and cuckoo search can be explored for search diversity of the overall swarm and for subdimension exploration. Multiobjective evolutionary algorithms can also be explored to further equip the current algorithm to deal with real-world challenging optimization problems containing multiple criteria. Motivated by Zavaschi et al. [26] and Diao et al. [27], ensemble construction using base models trained on diverse features provided by LBP variants and LPQ will be explored to further improve performance. We also aim to integrate the proposed PSO algorithm into a humanoid robot to enable it to deal with challenging real-world spontaneous human behavior interpretation and robot interaction tasks.

Fig. 2. Example output of the proposed hvnLBP operator in comparison with that of the original LBP.

Fig. 5. Boxplot diagram for the distribution of average recognition results for each optimization algorithm + SVM-based ensemble over 30 runs for cross-database evaluation.

Fig. 6. (a) Overall comparison of our system with other methods. (b)-(h) Boxplot diagrams for the distribution of classification results for each emotion category for each optimization algorithm + SVM-based ensemble over 30 runs for cross-database evaluation.

Algorithm 1. Pseudo-Code of mGA-Embedded PSO

TABLE I
COMPARISON BETWEEN THE hvnLBP OPERATOR AND OTHER TEXTURE DESCRIPTORS

local structures embedded in the neighboring pixels. CS-LBP employs center-symmetric pixel pairs for comparison, in order to extract local discriminative information. However, it overlooks other local differences among horizontal and vertical pixels. An example that demonstrates the difference among the proposed hvnLBP operator, LBP, and CS-LBP is provided, as follows. Given two patterns (50, 80, 85, 70, 50, 45, 55, 53, center-60) and (100, 230, 240, 230, 100, 50, 120, 160, center-200), although the local structures of both patterns are different, LBP generates the same binary code, 01110000, for both patterns. CS-LBP produces 11111000 for both patterns too. However, hvnLBP is able to generate two distinctive binary codes for these patterns, i.e., 01110010 for the former and 01110011 for the latter, indicating the two different local structures.
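The LBP part of this example can be checked with a few lines of Python. The sketch assumes that the eight neighbors are visited in the order in which they are listed in the text and that a neighbor contributes a 1 when it is greater than or equal to the center; the function name `lbp_code` is illustrative only. The CS-LBP and hvnLBP codes are not reproduced here because their comparison rules are defined elsewhere in the paper.

```python
def lbp_code(neighbors, center):
    """Original LBP: threshold each neighbor against the center pixel."""
    return "".join("1" if pixel >= center else "0" for pixel in neighbors)


pattern_a = ([50, 80, 85, 70, 50, 45, 55, 53], 60)
pattern_b = ([100, 230, 240, 230, 100, 50, 120, 160], 200)

print(lbp_code(*pattern_a))  # 01110000
print(lbp_code(*pattern_b))  # 01110000
```

Both patterns collapse to the same code because LBP records only whether each neighbor exceeds the center, not how the neighbors relate to one another.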

TABLE II
AVERAGE CLASSIFICATION PERFORMANCE USING THE SELECTED OPTIMIZATION ALGORITHMS INTEGRATED WITH DIVERSE CLASSIFIERS OVER 30 RUNS, RESPECTIVELY, WITHIN DATABASE EVALUATION

relationships in a local region by conducting multiple direct horizontal and vertical neighborhood comparisons with an efficient computational cost. From the empirical study, it shows superior capabilities of preserving distinctiveness and differentiating different local structures embedded in the neighboring pixels for low contrast images.