A COMPARATIVE STUDY OF THE ENSEMBLE AND BASE CLASSIFIERS PERFORMANCE IN MALAY TEXT CATEGORIZATION

Automatic text categorization (ATC) has attracted the attention of the research community over the last decade because it frees organizations from the need to organize documents manually. Ensemble techniques, which combine the results of a number of individually trained base classifiers, often achieve better classification performance than the base classifiers alone. This paper compares the effectiveness of ensemble methods with that of base classifiers for Malay text classification. Two feature selection methods, the Gini Index (GI) and Chi-square, are applied together with the ensemble methods, with the aim of integrating the base classifier algorithms into a more accurate classification procedure. Two types of ensemble methods, namely the voting combination and the meta-classifier combination, are evaluated. A wide range of comparative experiments is conducted on a classified Malay dataset. The experiments reveal that the meta-classifier ensemble framework performed better than the best individual classifiers on the tested datasets.


INTRODUCTION
Text Categorization is defined as the task of deciding whether a given piece of text belongs to one of a set of prescribed categories. As a significant stage in a Natural Language Processing system, it is useful for indexing and later retrieving texts. Moreover, Text Categorization is advantageous for content analysis and many other roles (Lewis & Gale 1994). At the same time, the rapid growth of electronic documents and information libraries available on the Internet poses a crucial problem for data mining and, therefore, for Machine Learning (ML) (Mitchell 1999).
The idea of Text Categorization is to assign a document to one or more categories, depending on its contents. In this regard, the automatic text categorization process comprises a set of tasks universally recognized by the research community (Abdullah et al 2005). These tasks include feature design, in which corpus processing, extraction of relevant information, feature selection and feature weighting are performed. They also include training, in which a machine learning classifier is trained using a set of labelled documents. The last task is testing, in which the classifier accuracy is evaluated using a set of pre-labelled documents (i.e. the test set) that were not used in the training phase.
The key idea behind combining individual classifiers is that every individual classifier has its own strengths and weaknesses. A combination can therefore benefit from the strengths of the individual classifiers while compensating for their weaknesses. Combined methods are accordingly becoming more popular, as they make it possible to overcome the weaknesses of single supervised approaches. Multiple distinct classifiers can also be composed by selecting the best classifier to be used in different situations or contexts (Srinivas et al. 2009). In this paper, combined classifiers are investigated along with the performance of different methods for classifier combination.
Most of the work in this area has been carried out for English text and other well-studied languages. To date, very few works have been carried out for Malay, which differs morphologically and syntactically from other languages, owing to the lack of resources for Malay Text Classification (MTC). Consequently, the need to construct resources and tools for MTC is growing. This motivates us to apply methods appropriate for Malay text, with its distinct morphology, in order to achieve the best results.
The paper is organized as follows: Section 1 gives a brief introduction and Section 2 describes the methodology used. Section 3 reviews related work in Text Categorization and describes the key techniques and approaches. Section 4 presents the experimental setup and discusses the experimental results. Section 5 concludes the study, focusing on its findings. Finally, Section 6 recommends future directions in this subject matter.

METHODOLOGY
The methodology used in the present paper for Malay Text Classification is shown in Figure 1. First, pre-processing tasks were used to eliminate incomplete and inconsistent data, preparing the data for the subsequent data mining steps. Secondly, feature selection methods were applied to identify discriminative terms for training and classification. Thirdly, the k-NN, NB, and N-gram classifiers were applied to Malay ATC; these methods are used because of their simplicity, effectiveness and accuracy. Fourthly, a combination algorithm is used to select the best result from the three results obtained from the three single classifiers, k-NN, NB, and N-gram, on Malay text. Finally, an evaluation method is used to measure the correctness of our findings.

PRE-PROCESSING
In order to evaluate the classification algorithms, several experiments were conducted. The performance of these classification algorithms was measured on the Malay corpus used in the study of Alshalabi et al. (2013). The corpus is divided into six categories, namely Business, Crime, History, Health, Religion and Sports.
Before indexing, all of the documents, including the training sets and test sets, were passed through the pre-processing phase. This phase is beneficial because it reduces the index size, raises accuracy and streamlines the categorization activities. However, not all words in a document are equally meaningful; some words carry more meaning than others. Therefore, it is crucial to pre-process the text in the dataset collection to identify the proper words to be employed as features. Each distinct word that appears in a document is defined as a feature. In the pre-processing step, useful text operations such as stop-word removal and noise removal can be performed (Baeza-Yates & Ribeiro-Neto 1999). These operations are described in detail in the following paragraphs.
Case folding is the phase of converting uppercase letters in the document to lowercase and then eliminating every character other than the letters "a" to "z"; such characters are treated as delimiters.
Tokenizing is the phase of splitting a sentence into words. Once the input string has been split, it is simpler to handle: each word is delimited by the spaces that separate it from its neighbours, which makes any later word-level processing, such as stemming, easier. Noise removal, on the other hand, is the process of refining words by removing special characters, numbers, and symbols that add noise to the training dataset.
Stop words: the Malay language has a large number of stop words (i.e. words with little content-bearing value). Highly frequent words in a document collection that are considered noise in the text (e.g. pronouns, prepositions, conjunctions, etc.) are called stop words. Malay words such as apabila, bagi, dalam, para, and untuk are considered stop words, as shown by Ahmad (1995). Stop words are removed because they do not convey any important information; removing them reduces the size of the text representation and improves classification performance, while the words that are more relevant to each document are retained.
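The pre-processing operations described above (case folding, noise removal, tokenizing and stop-word removal) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `preprocess` is hypothetical, and the stop-word set contains only the five example words named in the text, whereas a real Malay stop list would be much longer.

```python
import re

# Only the five stop words named in the text; a full Malay stop list is an
# assumption left to the reader (e.g. the list of Ahmad 1995).
STOP_WORDS = {"apabila", "bagi", "dalam", "para", "untuk"}

def preprocess(text, stop_words=STOP_WORDS):
    """Case-fold, strip non-letter noise, tokenize, and drop stop words."""
    folded = text.lower()                      # case folding
    cleaned = re.sub(r"[^a-z]+", " ", folded)  # anything outside a-z acts as a delimiter
    tokens = cleaned.split()                   # tokenizing on whitespace
    return [t for t in tokens if t not in stop_words]

print(preprocess("Harga saham naik 5% dalam pasaran, bagi para pelabur!"))
```

The digits, the percent sign and the punctuation are discarded by the noise-removal step, and the stop words dalam, bagi and para are filtered out afterwards.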

FEATURE SELECTION (FS)
Feature Selection (FS) is one of the most crucial tasks for improving the performance of text classification, because it selects the most predictive features. In other words, FS improves text classification in terms of learning speed and effectiveness and also reduces the number of data dimensions. Moreover, FS removes irrelevant, redundant, and noisy data (Sebastiani 2002). This section explains the two feature selection methods used in our Malay ATC, the Gini Index (GI) and Chi-square.
Gini Index (GI): A novel GI algorithm based on Gini index theory was introduced by Shang et al (2007), who constructed a new measure function for the Gini index. They consider the conditional probability of feature t and combine it with the posterior probability in a single measure function, in order to reduce the effect of unbalanced classes. The main idea of their Gini index is, first, to remove the case in which feature words do not appear and, second, to introduce between-class concentration and within-class dispersion into the traditional information gain feature selection method.
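As an illustration of how a Gini-index score combines the conditional and posterior probabilities of a term, the sketch below uses the form commonly attributed to Shang et al. (2007), Gini(t) = Σ_i P(t|C_i)² P(C_i|t)². That exact form is an assumption here, since the measure function is not reproduced in this text, and the function name and the document-frequency inputs are hypothetical.

```python
def gini_index(doc_freq, class_sizes):
    """Gini-index score of one term (assumed form: sum_i P(t|Ci)^2 * P(Ci|t)^2).

    doc_freq[c]    -- number of training documents in class c containing the term
    class_sizes[c] -- number of training documents in class c
    """
    total = sum(doc_freq.values())
    if total == 0:
        return 0.0  # the term never occurs, so it carries no information
    score = 0.0
    for c, n_c in class_sizes.items():
        p_t_given_c = doc_freq.get(c, 0) / n_c    # conditional probability P(t|Ci)
        p_c_given_t = doc_freq.get(c, 0) / total  # posterior probability P(Ci|t)
        score += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return score

# A term concentrated in one class scores higher than a term spread evenly.
sizes = {"sports": 10, "crime": 10}
print(gini_index({"sports": 10, "crime": 0}, sizes))
print(gini_index({"sports": 5, "crime": 5}, sizes))
```

Terms would then be ranked by this score and the top-ranked ones kept as features.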
Chi-square (CS) measures the lack of independence between a term t and a category c (Rogati & Yang 2002; Yang & Pedersen 1997). It is calculated according to Equations (2) and (3), where:
A is the number of documents that contain the term t and also belong to category c;
B is the number of documents that contain the term t but do not belong to category c;
C is the number of documents that do not contain the term t but belong to category c;
D is the number of documents that neither contain the term t nor belong to category c;
N is the number of training documents (Thabtah et al. 2009).
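A small sketch of the computation from the four counts follows. It assumes the standard chi-square statistic of Yang and Pedersen (1997), N(AD - CB)² / ((A+C)(B+D)(A+B)(C+D)), with N = A + B + C + D because every training document falls into exactly one of the four cells. The function name and the sample counts are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for one category from the contingency counts.

    a: docs with the term, in the category      b: docs with the term, outside it
    c: docs without the term, in the category   d: docs without the term, outside it
    """
    n = a + b + c + d                      # number of training documents
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0                         # degenerate table: no evidence either way
    return n * (a * d - c * b) ** 2 / denom

print(chi_square(40, 10, 10, 40))  # term strongly associated with the category
print(chi_square(25, 25, 25, 25))  # term independent of the category
```

A score of zero means the term and the category are independent; larger scores indicate a stronger association, so terms are ranked by score and the top ones are retained.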

CLASSIFICATION METHODS
As mentioned earlier, three single classifiers (k-NN, NB and N-gram) and two classifier combination methods (simple voting and stacking) are used for Malay Text Classification, owing to their simplicity, effectiveness and accuracy. They are described as follows.
The k-NN is a well-known example-based classifier. It is one of the most popular classification techniques due to its simplicity and accuracy. The k-NN is also known as a lazy learner, since it delays the decision on how to generalize beyond the training data until each new query instance is encountered. In order to categorize a document, the k-NN classifier ranks the document's neighbours among the training documents and uses the class labels of the k most similar neighbours. Given a test document d, the system finds the K nearest neighbours among the training documents, and the similarity score of each nearest neighbour to the test document is used as the weight of that neighbour's class. The weighted sum in k-NN classification is written as follows:
$$\mathrm{score}(d, c_i) = \sum_{d_j \in KNN(d)} \mathrm{sim}(d, d_j)\,\delta(d_j, c_i) \quad (4)$$
where KNN(d) indicates the set of K nearest neighbours of a document d. If d_j belongs to c_i, then δ(d_j, c_i) equals 1; otherwise it equals 0. The test document d is assigned to the class that has the highest resulting weighted sum. In order to compute sim(d, d_j), the Euclidean distance is used, representing the usual manner in which humans think of distance in the real world (He et al. 2000):
$$\mathrm{dist}(d, d_j) = \sqrt{\sum_{k} (w_{k} - w_{jk})^2} \quad (5)$$
where w_k and w_jk are the weights of feature k in d and d_j, respectively.
The NB algorithm is widely used for document classification. It is a probability-based classifier: a probability value is calculated for each class under the assumption that the features are independent. NB is often used in text categorization tasks based on Bayes' formula:
$$P(C_i \mid d) = \frac{P(C_i)\,P(d \mid C_i)}{P(d)}$$
To explain how the Naïve Bayes model works, two classes are assumed: the first class is economic and the second is not economic. Four training documents are given, three from the economic class and one from outside it, as shown in Table 1. Given a test document, the multinomial parameters needed to classify it are the priors, p(economic) = 3/4 and p(not economic) = 1/4, and the conditional probabilities of its terms. The denominators of the conditional probabilities are (8+6) and (3+6), because the total length of the documents in class economic is 8, the total length of the documents not in class economic is 3, and the vocabulary consists of six distinct terms.
An N-gram is a contiguous sequence of n characters or n words taken from a longer portion of text (Mohan et al, 2010). This paper uses the character-level N-gram classifier.
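Returning to the Naïve Bayes worked example above, the sketch below reproduces its arithmetic with add-one smoothing. The training documents are invented for illustration, since Table 1 is not reproduced here; they are chosen only so that the stated quantities hold (three economic documents with 8 tokens in total, one non-economic document with 3 tokens, and a vocabulary of six terms). The test document d5 and all function names are likewise hypothetical.

```python
from collections import Counter

def train_nb(docs):
    """Estimate priors and per-class term counts from (tokens, label) pairs."""
    class_docs = Counter(label for _, label in docs)
    term_counts = {label: Counter() for label in class_docs}
    for tokens, label in docs:
        term_counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    return priors, term_counts, vocab

def score_nb(tokens, priors, term_counts, vocab):
    """Unnormalized posterior of each class, using add-one smoothing."""
    scores = {}
    for c, prior in priors.items():
        class_len = sum(term_counts[c].values())  # 8 or 3 in this example
        p = prior
        for t in tokens:
            # denominator is class length + vocabulary size, e.g. (8+6) or (3+6)
            p *= (term_counts[c][t] + 1) / (class_len + len(vocab))
        scores[c] = p
    return scores

# Hypothetical stand-in for Table 1.
TRAIN = [
    (["bank", "saham", "bank"], "economic"),
    (["bank", "bank", "pasaran"], "economic"),
    (["bank", "ekonomi"], "economic"),
    (["bola", "gol", "bank"], "not economic"),
]
D5 = ["bank", "bank", "bank", "bola", "gol"]  # hypothetical test document

priors, term_counts, vocab = train_nb(TRAIN)
scores = score_nb(D5, priors, term_counts, vocab)
predicted = max(scores, key=scores.get)
print(priors, scores, predicted)
```

For longer documents the product of many small probabilities underflows, so a practical implementation would sum log-probabilities instead; the plain product is kept here to mirror the worked example.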
In the N-gram training process, the N-gram profile needs to be generated. To generate it, the text is split into tokens consisting of letters only, the N-grams of these tokens are counted, and the most frequent N-grams are kept. This gives the N-gram profile for the document. To classify a document, it first goes through the text pre-processing phase, and its N-gram profile is then generated as described above (Ogada, 2016).
The N-gram profile of each document is then compared with the profiles of all documents in the training classes (class profiles) in terms of similarity. Specifically, the cosine similarity is used to measure the similarity between two documents, the training document D_i and the test document D_j:
$$\cos(D_i, D_j) = \frac{D_i \cdot D_j}{\lVert D_i \rVert\,\lVert D_j \rVert}$$
In this stage, an ensemble (classifier combination) approach is applied to select a result based on the outputs of the three classifiers. The selection algorithm is the main task in this methodology: it determines the accuracy of the combined classifiers by choosing the best answer out of a set of three answers. The selection algorithms used in our Malay TC are listed below.
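The profile generation just described, together with the cosine comparison the paper uses between profiles, can be sketched as follows. The choices n = 3 and top_k = 300, and the padding of each token with an underscore to mark word boundaries, are assumptions made for illustration and are not taken from the paper; the function names are hypothetical.

```python
import math
import re
from collections import Counter

def ngram_profile(text, n=3, top_k=300):
    """Character N-gram profile: counts of the top_k most frequent N-grams."""
    tokens = re.findall(r"[a-z]+", text.lower())  # tokens consisting of letters only
    grams = Counter()
    for tok in tokens:
        padded = f"_{tok}_"                        # assumed word-boundary padding
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return dict(grams.most_common(top_k))          # keep the most frequent N-grams

def cosine(p, q):
    """Cosine similarity between two profiles treated as sparse count vectors."""
    dot = sum(v * q.get(g, 0) for g, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

train_profile = ngram_profile("harga saham naik")
print(cosine(train_profile, ngram_profile("harga saham turun")))  # shares many trigrams
print(cosine(train_profile, ngram_profile("bola sepak")))         # shares none
```

A test document would be assigned to the class whose profile gives the highest cosine similarity.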

MAJORITY (SIMPLE VOTING)
In the simple voting mechanism, each base classifier has a single vote (Srinivas et al. 2009). For each test document, this vote is given to the class label returned by the base classifier. After all base classifiers have voted, the class label with the maximum number of votes, i.e. the class chosen by the largest number of classifiers, is selected as the class label for that document. If all classifiers disagree, the algorithm chooses the result of the classifier with the highest accuracy. To explain how the voting algorithm works on TC, suppose that we have three classifiers, as in our case (classifier 1 (S1), classifier 2 (S2) and classifier 3 (S3)), and two classes, economic and not economic. Let us assume that we want to assign test document x to either class economic or class not economic. As shown in Table 2, the final decision depends on the majority: if two or more classifiers agree that the document is economic, then the final decision is economic, and if two or more classifiers agree that the document is not economic, then the final decision is not economic.
The stacking combination consists of two phases. In the first phase, a set of base-level classifiers is generated. In the second phase, a meta-level classifier is learnt that combines the outputs of the base-level classifiers (Xia et al, 2011). When a meta-classifier is used for combination, the class labels output by all participating classifiers are used as features for meta-learning (Koprinska et al, 2007; Mitchell, 1997). In our case, to combine the outputs of the three classifiers (Naïve Bayes, k-NN and N-gram), we use Naïve Bayes as the meta-classifier. Formula (12) gives the NB meta-classifier, given the outputs of the three classifiers, R_1, R_2, R_3:
$$P(C_i \mid R_1, R_2, R_3) = \frac{P(C_i)\,\prod_{k=1}^{3} P(R_k \mid C_i)}{P(R_1, R_2, R_3)} \quad (12)$$
where P(C_i | R_1, R_2, R_3) is the posterior probability of class C_i given the outputs of the three classifiers R_1, R_2, R_3, and P(C_i) is the probability of class C_i.
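Both combination schemes described above are small enough to sketch directly. The code below is an illustration under stated assumptions rather than the authors' implementation: the function names, the classifier names, the accuracy values and the toy meta-training data are hypothetical, and the add-one smoothing in the meta-classifier is an assumption, since the paper does not say how the conditional probabilities P(R_k | C_i) are estimated.

```python
from collections import Counter, defaultdict

def majority_vote(predictions, accuracies):
    """Simple voting; when no label wins outright, trust the most accurate classifier."""
    ranked = Counter(predictions.values()).most_common()
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    most_accurate = max(predictions, key=lambda name: accuracies[name])
    return predictions[most_accurate]

def train_meta_nb(base_outputs, labels):
    """Count how often base classifier k outputs r when the true class is c."""
    class_counts = Counter(labels)
    cond = defaultdict(Counter)
    for outputs, c in zip(base_outputs, labels):
        for k, r in enumerate(outputs):
            cond[(k, c)][r] += 1
    return class_counts, cond

def predict_meta_nb(outputs, class_counts, cond):
    """Pick the class maximizing P(C) * prod_k P(R_k | C), add-one smoothed."""
    n = sum(class_counts.values())
    n_labels = len(class_counts)
    best, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / n
        for k, r in enumerate(outputs):
            score *= (cond[(k, c)][r] + 1) / (n_c + n_labels)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical validation accuracies and base-classifier outputs (NB, k-NN, N-gram).
acc = {"nb": 0.94, "knn": 0.90, "ngram": 0.88}
print(majority_vote({"nb": "economic", "knn": "economic", "ngram": "not economic"}, acc))
outs = [("e", "e", "n"), ("e", "e", "e"), ("n", "n", "e"), ("n", "n", "n")]
counts, cond = train_meta_nb(outs, ["e", "e", "n", "n"])
print(predict_meta_nb(("e", "e", "n"), counts, cond))
```

In the toy data the third base classifier is uninformative, so the meta-classifier learns to follow the first two; this is the kind of reliability information that simple voting cannot use.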

EVALUATION
All algorithms are evaluated using 5-fold cross-validation. To measure the performance of these classification methods, we use the macro-averaged F1 (Macro-F1) measure. This measure combines Recall and Precision in the following way:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Here, True Positive (TP) is the set of documents that are correctly assigned to the given category. False Positive (FP) is the set of documents that are incorrectly assigned to the category. False Negative (FN) is the set of documents that are incorrectly not assigned to the category. True Negative (TN), finally, is the set of documents that are correctly not assigned to the category. Macro-F1 is the average of the per-category F1 values. To explain how the evaluation of the classification algorithms works, let us assume that we have a set of documents with the matches between human and computer document assignments given in Table 3.

RELATED WORK
Feature ranking and selection are essential parts of text classification, and many methods and approaches have been investigated and applied to feature selection for text classification. Most FS methods can be classified into two groups: information theory ranking methods, such as chi-square and mutual information, and information retrieval ranking methods, such as document frequency and odds ratio (Ghareb et al. 2014). For example, Yang and Pedersen (1997) and Thabtah (2007) evaluated five feature selection methods, namely DF, chi-square (χ2), term strength (TS), information gain (IG), and mutual information (MI), with k-NN. The Naïve Bayesian classifier was used by Chen et al. (2009), who pointed out that these methods perform better than other feature selection approaches when tested on English and Chinese text collections.
Chiang et al. (2008) modified TF-IDF and utilized it with ARM and category priority to construct their classifiers. Following Alshalabi et al. (2013), this study relies on the NB, N-gram and k-NN classifiers with two feature selection methods, Chi-square and GI, in order to enhance Malay TC. The first experiment examines the overall performance of the NB, N-gram and k-NN classifiers when the two feature selection methods are applied to reduce the dimension of the feature space on Malay TC. In Sanwaliya et al. (2010), the NB and k-NN classifiers were considered as single classifiers and their accuracy was investigated using the Reuters 21578 corpus. These classifiers were then combined according to a proposed method (NB-KNN), and the results showed that accuracy is significantly improved. In Nejat et al. (2012), the NB, SVM, and DT classifiers were considered as single classifiers and their accuracy was investigated using corpus data. These classifiers were then combined using meta-classifier methods (Boosting, Voting, and Bagging). The comparison between base and ensemble classifiers in terms of best accuracy showed that the NB ensemble classifier is a better alternative, although their classification accuracies turn out to be equal. In Srinivas et al. (2009), classifier combination methods and concept-based dimensionality reduction techniques are used for robust and scalable text classification.
The experimental evaluation in that work confirms the hypothesis that combination-based meta-classifiers give better accuracy than individual classifiers on a popular textual dataset, the Reuters 21578 news dataset. Text classification methods were first proposed in the 1950s, when word frequency was used to classify documents automatically. Applications of machine learning techniques help reduce the manual effort required for analysis, and the accuracy of the systems has also improved through the use of these techniques. Several text mining software packages are now available on the market. In addition, many machine learning methods have been proposed for text categorization in previous years, including N-gram (Suzuki et al. 2012; Farhoodi et al. 2011), Naïve Bayes (Mccallum & Nigam 1998; Fan et al. 2001), and k-nearest neighbor (Hua & Sun 2001). Zhang et al. (2015) studied character-level convolutional networks for text classification. They compared a large number of traditional and deep learning models using several large-scale datasets. The analysis showed that the character-level ConvNet is an effective method. Additionally, how the models compare depends on many factors, such as dataset size, whether the texts are curated, and the choice of alphabet. Johnson and Zhang (2016) viewed the model as a special case of a general framework that jointly trains a linear model with a non-linear feature generator consisting of 'text region embedding + pooling'. In their study, the authors explored a more sophisticated region embedding method using Long Short-Term Memory (LSTM), which can embed text regions of variable and possibly large sizes. Joachims (1998) introduced support vector machines for TC; the study provides both theoretical and empirical evidence that SVMs are well suited to TC.

EXPERIMENTAL EVALUATION
This section divides the experiments into two groups. The first group concerns the single (individual) classifiers, and the second concerns the combined classifiers. Accordingly, two kinds of experiments are carried out.

SINGLE CLASSIFIERS RESULTS
The first set of experiments shows the performance of the individual base classifiers. In order to test the efficiency of the three classifiers, k-NN, NB and N-gram, with the two feature reduction methods on Malay text categorization, the methods are evaluated individually with features selected from the feature space at different sizes: 100, 200, 300, 400, 500 and 600. The results are presented in terms of the macro-averaged F-measure, where the averaged values are calculated across the 5-fold cross-validation experiments. The effect of each feature selection method, Chi-square and GI, on classifier performance is also examined. The results (see Table 5) are displayed with features ranked in descending order. In Table 3, the best performance, 94.66, is obtained by the NB classifier when 300 features are selected using Chi-square. The best result with the k-NN classifier, 90.72, is achieved when 300 features selected by the GI method are used, and the highest performance with the N-gram classifier is likewise obtained with 300 features selected by GI. When the classifiers are compared, the NB algorithm achieves higher performance than the k-NN and N-gram algorithms. For k-NN and N-gram, the highest performance is obtained when feature selection is performed with GI. These observations indicate that the k-NN and NB classifiers are both suitable for Malay text categorization.
In order to examine the overall performance per document category, all of the parameters for the three classifiers, k-NN, NB and N-gram, are fixed according to their best results in Table 3. The experimental results with k-NN, NB and N-gram for Malay are shown in Fig. 2. As seen in Fig. 2, NB achieves the best results in the Sports, Business, Crime, and History domains, and its own best results are in the Sports and Business domains.
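The per-category and macro-averaged F-measures reported in this section follow directly from the TP, FP and FN counts of each category. The sketch below shows that computation on a handful of invented labels; the function name and the sample data are hypothetical, and in the paper's setting the value would additionally be averaged over the five cross-validation folds.

```python
def macro_f1(true_labels, predicted_labels):
    """Return (macro-averaged F1, per-class F1) for two parallel label lists."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    per_class = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)  # correctly assigned to c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # incorrectly assigned to c
        fn = sum(1 for t, p in pairs if t == c and p != c)  # incorrectly not assigned to c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        per_class[c] = 2 * precision * recall / total if total else 0.0
    return sum(per_class.values()) / len(classes), per_class

truth = ["sports", "sports", "crime", "crime"]
guess = ["sports", "crime", "crime", "crime"]
macro, per_class = macro_f1(truth, guess)
print(per_class, macro)
```

Because every category contributes equally to the average, Macro-F1 does not let a large category mask poor performance on a small one.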

CLASSIFIER COMBINATION RESULTS
The second set of experiments combines the outputs of the three classifiers in order to examine classifier combination. This methodology determines the accuracy of the combined classifiers by choosing the best answer given a set of three answers. There are two types of classifier combination, namely Voting Combination and Stacking Combination. These experiments show the following. The best performance of Voting Combination, 95.84%, is achieved when 500 features are selected by the GI method. In contrast, the worst performance of Voting Combination, 92.14%, occurs when 100 features are selected by the Chi-square method. In Table 4, the best performance of Stacking Combination, 94.39%, is achieved when 300 features are selected by the GI method, while the lowest, 91.23%, occurs when 400 features are selected by the Chi-square method. Higher performance is thus obtained when feature selection is performed with GI, although the results of the two methods are close. The results achieved by the Stacking Combination algorithm are better than those scored by the individual classifiers. However, the Voting Combination achieves better results than the Stacking Combination.

RECOMMENDATIONS AND FUTURE WORK
In this study, experiments with a small dataset showed significant results. For future work, however, we plan to expand the size of the corpus and are preparing to collect a larger dataset. This will provide wider opportunities to carry out further experiments and improvements on the feature selection phase. The reason is that many problems can arise when proposing a method for Malay text classification with satisfactory accuracy. For instance, the wide range of Malay vocabulary creates challenges such as poor text representation and difficulty in identifying important words. This problem arises from misleading words, which should be removed at the beginning of processing.

FIGURE 1. Illustration of the methodology of Malay Text Classification

Since P(c = economic | d5) > P(c = not economic | d5), the document d5 is assigned to class economic.

FIGURE 2. The performance (F-measure) on each class of single classifiers

FIGURE 3. The performance (F-measure) on each class of Meta classifier

In Bayes' formula, P(C_i | d) is the posterior probability of class C_i given a new document d, and P(C_i) is the probability of class C_i, which can be calculated by:
$$P(C_i) = \frac{N_i}{N}$$
where N_i is the number of documents assigned to class C_i and N is the total number of training documents. P(d | C_i) is the probability of a document d given a class C_i, and P(d) is the probability of document d. Because of the independence assumption of NB, the probability of document d given class C_i can be calculated by:
$$P(d \mid C_i) = \prod_{j} P(f_j \mid C_i), \qquad P(f_j \mid C_i) = \frac{1 + n_{ij}}{l + \sum_{k} n_{ik}}$$
where n_ij is the total number of documents that contain feature f_j and belong to class C_i, and l indicates the total number of distinct features in all training documents that belong to class C_i. NB calculates the posterior probability for each class and then assigns document d to the class with the highest posterior probability.

TABLE 1. Data for parameter estimation in NB classifier examples

TABLE 3. The performance (Macro-F1) of Meta classifier (feature selection methods vs. features sizes)

The experimental results of the Voting and Stacking classifiers for Malay Text Categorization are shown in Fig. 3. The NB achieves the best result in the Sports, History, Crime, and Business domains, whereas the Voting achieves its best result in the Sports and History domains.