A Grey Wolf Optimizer Feature Selection method and its Effect on the Performance of Document Classification Problem

Optimization methods


ABSTRACT
Optimization methods are among the most actively developed areas of Artificial Intelligence (AI). The success of Particle Swarm Optimization (PSO) and Genetic Algorithms (GA) has encouraged researchers to develop other methods that achieve better performance and respond better to modern needs. Grey Wolf Optimization (GWO) and Krill Herd (KH) are two such methods that have shown great success in different applications in the last few years. In this paper, we present a comparative study of different optimization methods, including KH and GWO, for solving the problem of document feature selection for classification. These methods are used to model feature selection as a typical optimization problem. Because of the complexity and non-linearity of this kind of problem, advanced techniques are needed to judge which feature subset is optimal for enhancing the classification of text documents. The test results showed the superiority of GWO over its counterparts under the specified evaluation measures.

INTRODUCTION
Optimization is one of the most effective approaches to solving real-world problems, and its use is no longer limited to academic and scientific research. For instance, optimization methods have been used in medicine, business, education, industry and other fields. Evolutionary Algorithms such as Genetic Algorithms (GAs), Genetic Programming (GP) and Evolutionary Strategies were among the first methods developed in this area. Over time, more methods have been invented to keep up with changing needs and to solve increasingly complicated problems. Grey Wolf Optimization is one of those promising methods [1]. Recently, several effective feature selection techniques have been developed in the literature and applied to English-language text categorization.
Grey Wolf Optimization (GWO) is one of the recently invented optimization techniques. It suggests that grey wolf members are capable of successfully reproducing three more members than when hunting in a pack, while two grey wolf members (a female and a male) hold extra space and management authority over the other members of the group [6]. GWO is a biologically inspired method that mimics the hunting process of a pack of grey wolves in the forest [1] [2].
The krill herd algorithm (KHA) is a metaheuristic search algorithm based on simulating the herding behavior of krill individuals using a Lagrangian model. The algorithm was developed by Gandomi and Alavi (2012), and preliminary studies illustrated its potential for solving numerous complex engineering optimization problems [2].
Text labeling and classification is one of the essential computational tasks in machine learning applications because of the growing amount of text documents available in digital form. In this process, feature selection (FS) is a challenging phase because thousands of possible feature sets must be considered in text classification. Text feature selection is the process of performing dimensionality reduction and analyzing a large amount of natural language text in the data mining discipline. It aims to detect useful patterns and trends in the text. Many methodologies have been developed for this task, such as dimensionality reduction, which includes feature selection and feature extraction [3][4].
The main challenge of text mining is handling the increasing volume of data and the hidden relationships within it. In data mining terms, preprocessing is the first step that should be considered before conducting any post-processing methods [5]. Because of the high dimensionality of data such as text, it becomes necessary to treat dimensionality reduction as a data mining pre-processing step in order to filter out unwanted information. The existing classification methods still face many challenges because of the rapidly increasing amount of data [6] [7]. Thus, it is essential to continue improving the approaches and techniques that deal with high-dimensional data such as textual data.
As Figure 1 shows, feature selection is conducted as an iterative process. Figure 2 shows the original and the reduced feature spaces.
In this paper, a selected number of optimization methods are used to enhance classification systems via feature selection. As wrapper methods, these optimization methods can iteratively select the optimal subset of features. They can therefore improve the subsequent text mining steps, since they deal with cases where imprecise and uncertain data representations exist. The aim of this paper is to present a method that reduces the superfluous text affecting any data mining or machine learning process, with the following objective: to develop a text feature selection method that reduces the original feature set to a smaller feature space while avoiding entrapment in local optima.
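The wrapper idea can be illustrated with a minimal Python sketch of how a candidate feature subset might be scored. The binary mask encoding, the nearest-centroid classifier, and the names `apply_mask`, `nearest_centroid_accuracy` and `subset_fitness` are illustrative assumptions, not the classifier or the fitness function used in this paper.

```python
from math import dist


def apply_mask(row, mask):
    """Keep only the features whose mask bit is 1."""
    return [value for value, keep in zip(row, mask) if keep]


def nearest_centroid_accuracy(train_X, train_y, test_X, test_y):
    """Accuracy of a nearest-centroid classifier (assumed stand-in classifier)."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        # Column-wise mean of all training rows that carry this label.
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    correct = 0
    for x, y in zip(test_X, test_y):
        predicted = min(centroids, key=lambda label: dist(x, centroids[label]))
        correct += predicted == y
    return correct / len(test_y)


def subset_fitness(mask, train_X, train_y, test_X, test_y):
    """Wrapper fitness: classification accuracy on the selected features only."""
    if not any(mask):
        return 0.0  # an empty subset cannot classify anything
    reduced_train = [apply_mask(row, mask) for row in train_X]
    reduced_test = [apply_mask(row, mask) for row in test_X]
    return nearest_centroid_accuracy(reduced_train, train_y, reduced_test, test_y)
```

An optimizer such as GWO or KH would call `subset_fitness` once per candidate mask and keep the masks with the highest score. In practice the score is often combined with a penalty on the number of selected features, which is not shown here.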
The problem of missing class labels of text features is also handled in this research. The paper is organized as follows: Section 2 explains the algorithms, Section 3 explains the proposed method, and Sections 4 and 5 present the test results and the conclusions.

2. THE PROPOSED METHODS

Krill Herd Optimization
The Krill Herd (KH) algorithm is based on the movement and foraging behavior of krill individuals (Figure 3). The objective function of KH is calculated by measuring the minimum distance of each krill from the food source while also taking the density of the entire herd into account. The position of a krill is determined mainly by three factors: the movement induced by other krill individuals, which aims to keep the swarm as dense as possible; the foraging activity, which depends on the distance from the food; and the physical diffusion of the krill [10][11].
A Lagrangian model can be used to generalize this behavior to an n-dimensional search space. Predation can remove krill from the herd, which reduces the density of the swarm and disturbs its path to the food source. This is considered the initialization of the swarm. According to the three factors, the Lagrangian model is given in equation (1):
$$\frac{dX_i}{dt} = N_i + F_i + D_i \tag{1}$$
where $X_i$ is the position of the ith krill, $N_i$ is the motion induced by other krill individuals, $F_i$ is the foraging activity, and $D_i$ is the physical diffusion of the ith krill individual.
Krill individuals strive to keep the swarm dense and to move as one unit because of their mutual effects. The swarm direction $\alpha_i$ is estimated from the local swarm density, a target swarm density, and a repulsive swarm density. The motion induced by other krill individuals, $N_i$, is updated with equation (2):
$$N_i^{new} = N^{max}\,\alpha_i + \omega\,N_i^{old} \tag{2}$$
$$\alpha_i = \alpha_i^{local} + \alpha_i^{target}$$
where $N^{max}$ is the maximum induced speed, $\omega$ is the inertia weight in the range [0, 1], and $N_i^{old}$ is the previous induced motion. $\alpha_i^{local}$ is the effect of the surrounding krill individuals on the motion of a particular krill, while $\alpha_i^{target}$ is the effect of the krill with the best fitness value.
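The effect of the surrounding krill, $\alpha_i^{local}$, is obtained from the following equations:
$$\alpha_i^{local} = \sum_{j=1}^{NN} \hat{k}_{ij}\,\hat{X}_{ij}$$
$$\hat{X}_{ij} = \frac{X_j - X_i}{\lVert X_j - X_i \rVert + \varepsilon}$$
$$\hat{k}_{ij} = \frac{k_i - k_j}{k^{worst} - k^{best}}$$
where $NN$ is the number of neighbouring krill, $k^{worst}$ and $k^{best}$ are the worst and best fitness scores in the swarm, $k_i$ and $k_j$ are the fitness scores of the ith and jth krill, and $X_i$ and $X_j$ are the locations of the ith and jth krill. Lastly, $\varepsilon$ is a small positive value added to the denominator to prevent singularities.
The target effect, which captures the influence of the krill with the best fitness value, is calculated as follows:
$$\alpha_i^{target} = C^{best}\,\hat{k}_{i,best}\,\hat{X}_{i,best}$$
where $C^{best}$ is the effective coefficient of the krill with the best fitness value with respect to the ith krill. $C^{best}$ is calculated as follows:
$$C^{best} = 2\left(rand + \frac{I}{I_{max}}\right) \tag{8}$$
where $rand$ is a random number between zero and one, $I$ is the iteration counter, and $I_{max}$ is the maximum number of iterations.
$\hat{k}_{i,best}$ is computed in the same way as $\hat{k}_{ij}$, except that the fitness value of the jth krill individual, $k_j$, is replaced by the best fitness value.
$\hat{X}_{i,best}$ is computed in the same way as $\hat{X}_{ij}$, except that $X_j$ is replaced by $X_{best}$, the position of the krill with the best fitness value.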
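The induced motion described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: fitness is minimized, every other krill is treated as a neighbour (no sensing distance is applied), and the function names and the default values `n_max=0.01` and `omega=0.5` are hypothetical choices rather than the settings used in this paper.

```python
import math
import random

EPS = 1e-12  # small positive value that prevents division by zero


def unit_direction(x_i, x_j):
    """Normalized direction from krill i towards krill j."""
    norm = math.dist(x_i, x_j)
    return [(b - a) / (norm + EPS) for a, b in zip(x_i, x_j)]


def normalized_fitness(k_i, k_j, k_worst, k_best):
    """Fitness difference scaled by the fitness range of the swarm."""
    if k_worst == k_best:
        return 0.0  # assumption: no attraction when all krill are equally fit
    return (k_i - k_j) / (k_worst - k_best)


def induced_motion(i, positions, fitness, n_old, iteration, max_iter,
                   n_max=0.01, omega=0.5, rng=random):
    """New induced motion N_i of krill i (assumes fitness is minimized)."""
    k_best, k_worst = min(fitness), max(fitness)
    best = fitness.index(k_best)

    # Local effect: sum over all other krill (assumed to be neighbours).
    alpha_local = [0.0] * len(positions[i])
    for j, x_j in enumerate(positions):
        if j == i:
            continue
        k_hat = normalized_fitness(fitness[i], fitness[j], k_worst, k_best)
        x_hat = unit_direction(positions[i], x_j)
        alpha_local = [a + k_hat * x for a, x in zip(alpha_local, x_hat)]

    # Target effect: attraction towards the best krill, scaled by C_best.
    c_best = 2.0 * (rng.random() + iteration / max_iter)
    k_hat_best = normalized_fitness(fitness[i], k_best, k_worst, k_best)
    x_hat_best = unit_direction(positions[i], positions[best])
    alpha_target = [c_best * k_hat_best * x for x in x_hat_best]

    alpha = [l + t for l, t in zip(alpha_local, alpha_target)]
    return [n_max * a + omega * n for a, n in zip(alpha, n_old)]
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which is convenient when comparing optimizers under the same conditions.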

Foraging Motion (Fi)
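This factor is computed in terms of the food location and the previous experience of where food was located. The foraging motion is calculated as follows:
$$F_i = V_f\,\beta_i + \omega_f\,F_i^{old}, \qquad \beta_i = \beta_i^{food} + \beta_i^{best} \tag{9}$$
where $V_f$ is the foraging speed, $\omega_f$ is the inertia weight of the foraging motion, and $F_i^{old}$ is the previous foraging motion. $\beta_i^{food}$ is the food attraction, which draws the krill towards the global optimum, while $\beta_i^{best}$ is the effect of the current best krill individual.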

Physical Diffusion
Physical diffusion is a random process that is computed as a function of the maximum diffusion speed and a random directional vector, as follows:
$$D_i = D^{max}\,\delta \tag{10}$$
where $D^{max}$ is the maximum diffusion speed and $\delta$ is a random directional vector whose values lie in the range [-1, 1].

The Motion Process of KH
Given $N_i$, $F_i$ and $D_i$, the krill positions over the time interval $\Delta t$ are calculated as shown in the following equation:
$$X_i(t + \Delta t) = X_i(t) + \Delta t\,\frac{dX_i}{dt} \tag{11}$$
After the positions of the krill individuals are updated, the reproduction mechanisms, namely crossover and mutation, are applied. The krill distribution can be visualized as shown in Figure 4 [11].
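The diffusion term and the position update can be sketched in Python as below. The function names and the default `d_max=0.005` are illustrative assumptions, and the crossover and mutation operators are left out.

```python
import random


def physical_diffusion(dim, d_max=0.005, rng=random):
    """Random diffusion: d_max times a random vector with entries in [-1, 1]."""
    return [d_max * rng.uniform(-1.0, 1.0) for _ in range(dim)]


def update_position(x, n_i, f_i, d_i, delta_t):
    """Move one krill for a time step delta_t using dX/dt = N + F + D."""
    return [xk + delta_t * (n + f + d)
            for xk, n, f, d in zip(x, n_i, f_i, d_i)]
```

For example, `update_position([1.0, 2.0], [0.1, 0.0], [0.2, 0.1], [0.0, -0.1], 0.5)` moves the first coordinate by half of the combined velocity 0.3 and leaves the second coordinate essentially unchanged, because its foraging and diffusion terms cancel.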

Grey Wolf Optimization
The social hunting behavior of grey wolves can be represented formally by mathematical equations, in which the best wolves guide the search for the optimal solution [14][15][16][17]. The ω wolves follow the other, dominant wolf members during the hunt. The major steps of the hunt are listed below:
1. Approach the prey.
2. Encircle and harass the prey until the stopping condition is achieved.
3. Attack the prey.

Prey Encirclement
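To hunt, grey wolves chase and encircle the prey. Mathematically, this behavior is modeled as follows:
$$\vec{D} = \lvert \vec{C} \cdot \vec{X}_p(t) - \vec{X}(t) \rvert$$
$$\vec{X}(t+1) = \vec{X}_p(t) - \vec{A} \cdot \vec{D}$$
where $t$ is the current iteration, $\vec{X}_p$ is the position of the prey, $\vec{X}$ is the position of a grey wolf, and $\vec{A}$ and $\vec{C}$ are coefficient vectors that are represented as:
$$\vec{A} = 2\vec{a} \cdot \vec{r}_1 - \vec{a}, \qquad \vec{C} = 2\vec{r}_2$$
where $\vec{r}_1$ and $\vec{r}_2$ are random vectors in the interval [0, 1], and the components of $\vec{a}$ decrease linearly from 2 to 0 over the course of the iterations.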
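The encirclement step can be sketched in Python as follows. The sketch assumes the standard formulation of the coefficient vectors, in which one vector is computed from the control parameter and a random number and the other is twice a random number, with fresh random numbers drawn per dimension. The names `linear_a` and `encircle` are hypothetical.

```python
import random


def linear_a(iteration, max_iter):
    """Control parameter that decreases linearly from 2 to 0."""
    return 2.0 * (1.0 - iteration / max_iter)


def encircle(x, x_prey, a, rng=random):
    """Move a wolf at position x relative to the prey position x_prey."""
    new_x = []
    for xk, pk in zip(x, x_prey):
        r1, r2 = rng.random(), rng.random()
        coeff_a = 2.0 * a * r1 - a        # lies in [-a, a]
        coeff_c = 2.0 * r2                # lies in [0, 2]
        distance = abs(coeff_c * pk - xk)
        new_x.append(pk - coeff_a * distance)
    return new_x
```

When the control parameter reaches zero the first coefficient vanishes and the wolf lands exactly on the prey position, which corresponds to the attack phase. Larger values allow the wolf to move away from the prey and support exploration.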

The Hunt Process
The hunt can be represented mathematically as shown in Equation (16). Once the location of the prey is determined, the grey wolves encircle it for the hunt. The hunt is guided by the α, β and δ wolves. In the search space as a whole there is no information about the location of the prey; it is merely assumed that the α, β and δ wolves have sufficient knowledge of it. Thus, the best candidate solutions obtained so far are retained and the other candidates are discarded, which means that the ω wolves keep updating their positions on the basis of the best solutions.
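A minimal Python sketch of the hunt is given below, assuming the standard scheme in which each wolf computes one candidate position per leader and moves to their average. The function names, the population size, the bounds, and the sphere objective in the usage note are illustrative assumptions and are not the feature selection setup of this paper.

```python
import random


def guided_position(x, leader, a, rng):
    """Candidate position of a wolf relative to one leader (alpha, beta or delta)."""
    out = []
    for xk, lk in zip(x, leader):
        coeff_a = 2.0 * a * rng.random() - a
        coeff_c = 2.0 * rng.random()
        out.append(lk - coeff_a * abs(coeff_c * lk - xk))
    return out


def hunt_step(x, alpha, beta, delta, a, rng=random):
    """New position: average of the candidates suggested by the three leaders."""
    x1 = guided_position(x, alpha, a, rng)
    x2 = guided_position(x, beta, a, rng)
    x3 = guided_position(x, delta, a, rng)
    return [(p + q + r) / 3.0 for p, q, r in zip(x1, x2, x3)]


def gwo_minimize(objective, dim, n_wolves=12, max_iter=60, bound=5.0, seed=0):
    """Small GWO loop that minimizes `objective` inside [-bound, bound]^dim."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(-bound, bound) for _ in range(dim)]
              for _ in range(n_wolves)]
    best_x, best_f = None, float("inf")
    for it in range(max_iter):
        ranked = sorted(wolves, key=objective)
        alpha, beta, delta = ranked[0], ranked[1], ranked[2]
        if objective(alpha) < best_f:
            best_x, best_f = list(alpha), objective(alpha)
        a = 2.0 * (1.0 - it / max_iter)  # decreases linearly from 2 towards 0
        # Every wolf is updated from the three leaders and clamped to the bounds.
        wolves = [[max(-bound, min(bound, v))
                   for v in hunt_step(w, alpha, beta, delta, a, rng)]
                  for w in wolves]
    final = min(wolves, key=objective)
    if objective(final) < best_f:
        best_x, best_f = list(final), objective(final)
    return best_x, best_f
```

A call such as `gwo_minimize(lambda x: sum(v * v for v in x), dim=3)` returns the best position found and its objective value. For feature selection the continuous positions would additionally have to be mapped to a binary feature mask, a step that is not shown here.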
$$\vec{X}(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3} \tag{16}$$
where $\vec{X}_1$, $\vec{X}_2$ and $\vec{X}_3$ are the positions estimated from the α, β and δ wolves, respectively. Exploitation and exploration are achieved during the search for the prey and the attack on it. The random component of this process helps to avoid getting stuck in local minima.

TEST RESULTS
The details of the datasets are shown in Table (1). The experimental tests were first carried out on the features produced by the different methods. For the comparison, the following settings are used: the complete feature set (denoted ALL), feature selection using GWO, KH, the harmony search based method (FS-HS-TC), the Feature-Selection-Genetic-Algorithm-Document-Clustering method (FS-GA-TC), the modified Differential Evolution method (DE), differential evolution with simulated annealing (DE-SA), and PSO. Under the external evaluation measures, the test results showed that our method outperformed the other state-of-the-art methods. This observation comes from using an intelligent classification method together with our proposed feature selection method. GWO showed superiority over the other compared methods, which used traditional classification methods.
These datasets were chosen because they are consistent, challenging in the variety of their subjects, and different from one another in the number of classes. On the other hand, the datasets are limited in size. Extending the proposed method to deal with big data is highly recommended; the present number of datasets is used in our experiments to show how robustly the system works.

CONCLUSION
In this paper we introduced the use of Grey Wolf Optimization (GWO) for document classification, using the subset of features selected by this method. The performance of the proposed GWO method was compared with that of other optimization algorithms applied to the same document classification task. The experiments tested each of those methods on the same data under the same conditions. The test results given in Table 2 reflect the superiority of Grey Wolf Optimization over the other methods under the external evaluation measures of classification.