Arabic Text Categorization Using Support Vector Machine, Naïve Bayes and Neural Network

Text classification is an important area of information retrieval. Text classification techniques are used to assign documents to a set of predefined categories. Many techniques and methods exist for classifying data, and much of the published research addresses English text classification; comparatively little addresses Arabic text classification. This paper examines three well-known classification techniques, applies them to an Arabic data set, and presents a comparative study of the three. The study uses a fixed number of documents for every category in both the training and testing phases. The results show that the Support Vector Machine gives the best results.


INTRODUCTION
The number of documents available online has increased with the rapid growth of the Internet and is now effectively countless. Typical text documents include newspapers, research articles, technical reports, blogs and journal papers. This huge number of documents could be useful and valuable [1], but it also requires effective retrieval methods. Text Classification (TC) is an important and active research area in information retrieval. TC aims to assign documents to a set of predefined categories based on their content [2,3]. Online textual documents and their categories are huge, numerous and diverse. TC is a data mining method for obtaining valuable information from large amounts of data [4]. It has been used in many applications, such as email filtering (spam or legitimate) [5,6], news monitoring and searching for useful information on the web [7]. TC methods extract valuable information from unstructured textual resources and then use this information to classify each document into a set of predefined categories [2].
Text mining is not an easy process, because of the massive amount of information available in text documents and its diversity. In addition, text mining involves deriving linguistic features from the text. Because of the rapid growth of web data, algorithms that can handle text classification and improve its efficiency are still in high demand [8]. Most existing systems are designed to handle the English language: many studies address English text classification and many algorithms are designed for English documents, whereas few researchers address Arabic text classification. Arabic is one of the most widely spoken languages. It is the main language of the Arab world and a secondary language in many other countries [9]. Arabic has a rich morphology and a complex orthography [10]. Creating a text classification system for Arabic is not an easy task because of the nature and difficulty of the language. This research uses Support Vector Machine (SVM), Naïve Bayesian (NB) and Multilayer Perceptron Neural Network (MLP-NN) classifiers to categorize an Arabic data set.
Adel Hamdan Mohammad, Tariq Alwada'n and Omar Al-Momani
The rest of this paper is organized as follows: Section 2 describes the Arabic language, Section 3 the document pre-processing steps, Section 4 term selection and weighting, Section 5 SVM, Section 6 Naïve Bayesian, Section 7 the Multilayer Perceptron Neural Network, Section 8 related studies, Section 9 the data set used in the experiments and Section 10 the experiments and results; finally, Section 11 presents our conclusions.

ARABIC LANGUAGE
Arabic is one of the six official languages of the United Nations. It is the official language of 25 countries and is spoken by over 250 million people. The Arabic alphabet consists of 28 letters plus hamza (ء), which some Arabic linguists consider a letter. Arabic is written from right to left. Its letters take different shapes depending on their position in a word (beginning, middle or end). These varying shapes, together with diacritics, make parsing a difficult task [11]. The majority of Arabic words have a root, and over 80% of Arabic words can be mapped to a 3-letter root. Fortunately, Arabic has a built-in filtering mechanism: different words can be mapped to their common root. Representing words by their roots is very important and reduces the number of distinct words [12].

3. DOCUMENT PRE-PROCESSING
Document pre-processing is essential for reducing the complexity of documents and putting them into a form that is easy to handle [13]. It includes several steps. First, document cleaning: the document is cleaned of unnecessary material such as stop words and tags. Second, document representation: the document is usually transformed into the vector space model (VSM). Third, dimension reduction: the most useful and valuable features for classification are selected. The most common selection methods are chi-square, document frequency, information gain and mutual information [14,15,16,17].
Text documents are usually represented as vectors of term weights (word features). One major problem in classification, for both Arabic and English, is the high dimensionality of text documents. Data pre-processing reduces the complexity of text documents and makes data handling and representation easier; without it, several major problems are encountered. Document pre-processing can be divided into Feature Extraction (FE) and Feature Selection (FS) [18].
Feature Extraction (FE) is the first step of text pre-processing; it includes cleaning the text documents and stemming [19]. FS is used to eliminate text and to present it in a clear format. FS is very important and receives much attention because it is effective and efficient and saves storage space [20].
Feature Selection (FS) is applied after Feature Extraction to construct the vector space. The main goal of FS is to select suitable features from the original document: FS keeps the words with the highest scores according to a set of predefined measures [21,22]. Representing all document words by their roots is critical and can reduce the number of distinct words [12]. Root extraction (stemming) for Arabic falls into two types [23,24,25,26]. The first obtains the root by removing prefixes, infixes and suffixes. The second obtains the root based on the weights of the letters embedded within the text [27].
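The cleaning and affix-removal steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the prefix and suffix lists, and the function names `light_stem` and `preprocess` are hypothetical and deliberately tiny, and stripping affixes in this way only approximates a stem rather than extracting a true root.

```python
import re

# Hypothetical, deliberately tiny lists chosen for illustration only;
# a real system would use a curated stop-word list and affix inventory.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]  # longer prefixes first
SUFFIXES = ["ات", "ون", "ين", "ها", "ة"]

# Arabic diacritics (U+064B..U+0652) and the tatweel stretching mark (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Runs of Arabic letters; everything else (tags, digits, Latin) is dropped.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def light_stem(word):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def preprocess(text):
    """Remove diacritics, tokenize, drop stop words, then stem each token."""
    text = DIACRITICS.sub("", text)
    tokens = ARABIC_WORD.findall(text)
    return [light_stem(t) for t in tokens if t not in STOP_WORDS]
```

Calling `preprocess` on a raw document yields the list of stemmed tokens from which a term vector can later be built. The three-letter minimum is one simple guard against over-stemming short words.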

TERM SELECTION AND WEIGHTING
Document classification naturally involves thousands of features, so term selection techniques are used to reduce this high dimensionality. Term selection is usually based on a term evaluation function. Many term evaluation functions are used for English text categorization, such as chi-square, information gain and the document frequency threshold [3,14,15,16,17].
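As a minimal sketch of this section's two steps, the Python code below scores terms with the chi-square statistic and then weights the terms of one document with TF-IDF. It assumes the common 2x2 contingency form of chi-square and the plain variant idf = log(N / df); the paper does not state which variants were used, and the function names and toy English tokens are hypothetical.

```python
import math
from collections import Counter


def chi_square(term, category, docs):
    """Chi-square statistic of a term for a category.

    docs is a list of (tokens, label) pairs. The counts a, b, c, d are the
    cells of the 2x2 table (term present or absent) x (in category or not).
    """
    n = len(docs)
    a = sum(1 for toks, lab in docs if term in toks and lab == category)
    b = sum(1 for toks, lab in docs if term in toks and lab != category)
    c = sum(1 for toks, lab in docs if term not in toks and lab == category)
    d = n - a - b - c
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_terms(docs, k):
    """Keep the k terms with the highest chi-square score over all categories."""
    vocab = {t for toks, _ in docs for t in toks}
    labels = {lab for _, lab in docs}
    score = {t: max(chi_square(t, c, docs) for c in labels) for t in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-score[t], t))[:k]


def tfidf(tokens, docs):
    """TF-IDF weights for one document, using idf = log(N / df)."""
    n = len(docs)
    weights = {}
    for term, count in Counter(tokens).items():
        df = sum(1 for toks, _ in docs if term in toks)
        if df:  # terms unseen in the collection get no weight
            weights[term] = count * math.log(n / df)
    return weights
```

In practice the selected terms define the dimensions of the vector space, and each document is mapped to its TF-IDF weights over those dimensions only.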
After the most important terms are selected, each document is represented as a weighted vector based on the words found in the text. Many weighting techniques exist, such as Term Frequency (TF), Inverse Document Frequency (IDF), Term Frequency Inverse Document Frequency (TFIDF) and normalized TFIDF weighting [3,14,15,16,17].

5. SUPPORT VECTOR MACHINE
Support vector machines (SVMs) are among the best-known machine learning algorithms. SVMs are binary classifiers initially proposed by Vladimir Vapnik [28]. They are supervised learning models used to analyse and categorize text data, and are typically used for text categorization and regression analysis. SVMs were introduced to text classification and categorization by Thorsten Joachims in 1998 [29]. The SVM training algorithm builds a model that assigns new documents to a set of predefined categories. SVMs can be used as linear and non-linear classifiers [29,30].
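As a rough illustration of how a linear SVM scores a document once training is done, the sketch below evaluates the discriminant f(x) = w·x + b, the predicted label, and the slack of each example; these quantities are formalised in the following subsection. The weight vector, the bias and the trade-off constant `C` are assumptions made up for illustration. `C` in particular does not appear in the paper and is only the conventional way to weigh misclassifications against the norm of w.

```python
def discriminant(w, x, b):
    """Linear discriminant f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, x, b):
    """The sign of f(x) gives the side of the hyperplane, i.e. the class."""
    return 1 if discriminant(w, x, b) >= 0 else -1


def slack(w, x, y, b):
    """Smallest non-negative slack satisfying y * f(x) >= 1 - slack."""
    return max(0.0, 1.0 - y * discriminant(w, x, b))


def soft_margin_objective(w, b, data, C=1.0):
    """0.5 * ||w||^2 + C * (sum of slacks); C is an assumed trade-off constant."""
    norm_sq = sum(wi * wi for wi in w)
    return 0.5 * norm_sq + C * sum(slack(w, x, y, b) for x, y in data)
```

A training algorithm would search for the w and b that minimise `soft_margin_objective`; this sketch only evaluates the objective for given values.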

SUPPORT VECTOR MACHINE (THE SEPARABLE CASE, NON-LINEAR CLASSIFIER)
Consider the data shown in Figure 1, which can be linearly separated. Defining a linear classifier requires the dot product between two vectors, defined as $w^T x = \sum_i w_i x_i$. A linear classifier is based on a linear discriminant function of the form $f(x) = w^T x + b$. The vector $w$ is known as the weight vector, and $b$ is called the bias. First consider the case $b = 0$. The set of points $x$ such that $w^T x = 0$ are all the points perpendicular to $w$ that go through the origin: a line in two dimensions, a plane in three dimensions and, more generally, a hyperplane [10,29]. The bias $b$ translates the hyperplane away from the origin. The hyperplane $\{x : f(x) = w^T x + b = 0\}$ divides the space into two, and the sign of the discriminant function $f(x)$ denotes the side of the hyperplane a point is on. The boundary between the regions classified as positive and negative is called the decision boundary of the classifier.
In many cases the data is not linearly separable, but linear SVMs can be generalised to such problems. Doing so requires an objective function that trades off misclassifications against minimizing $\|\tilde{w}\|^2$ to find an optimal compromise. We therefore introduce a non-negative "slack" variable $\xi_i \ge 0$ for each training example $x_i$ with label $y_i$, where $w_0$ plays the role of the bias $b$:
$$\tilde{w} \cdot x_i + w_0 \ge 1 - \xi_i \quad \text{for } y_i = +1$$
$$\tilde{w} \cdot x_i + w_0 \le -1 + \xi_i \quad \text{for } y_i = -1$$
More details about the SVM equations can be found in [29,30,31,32,33].

RELATED STUDIES
Fouzi Harrag (2010) [39] presents a neural-network-based model for classifying Arabic texts. The authors propose using singular value decomposition (SVD). The experiments were carried out on an in-house corpus of 5734 tokens (14 major categories); after the pre-processing step, 1065 words remained. The results show that an MLP NN model using SVD captures the non-linear relationships between the input document vectors and the document categories better than the basic MLP NN model. Al-Harbi (2008) [43] uses SVM and C5.0 to classify Arabic text documents from seven different Arabic corpora using a recognized statistical technique. The C5.0 classifier generally gives better accuracy.

Figure 1: Linear classifier.

6. NAÏVE BAYESIAN
Naïve Bayes classifiers are simple probabilistic classifiers based on applying Bayes' theorem. Naïve Bayes is a simple and powerful method for text classification. When a Naïve Bayesian classifier is used for a text classification problem, we use the equation
$$P(\text{class} \mid \text{document}) = \frac{P(\text{class})\, P(\text{document} \mid \text{class})}{P(\text{document})}$$
where $P(\text{class} \mid \text{document})$ is the probability that a given document D belongs to a given class C; $P(\text{document})$ is the probability of the document; $P(\text{class})$ is the probability of a class (or category), computed as the number of documents in the category divided by the number of documents in all categories; and $P(\text{document} \mid \text{class})$ is the probability of the document given the class [34,35,36].
Naïve Bayes assumes that the words of a document are conditionally independent given the class. Under this independence assumption we can rewrite the likelihood as
$$P(\text{document} \mid \text{class}) = \prod_i P(\text{word}_i \mid \text{class})$$
and, because $P(\text{document})$ is the same for every class,
$$P(\text{class} \mid \text{document}) \propto P(\text{class}) \prod_i P(\text{word}_i \mid \text{class})$$
where $P(\text{word}_i \mid \text{class})$ is the probability that the i-th word of a given document occurs in a document from class C. It is computed as
$$P(\text{word}_i \mid \text{class}) = \frac{T_{ct} + \lambda}{N_c + \lambda V}$$
where $T_{ct}$ is the number of times the word occurs in category C, $N_c$ is the number of words in category C, $V$ is the size of the vocabulary table, and $\lambda$ is a positive constant, usually 1 or 0.5, used to avoid zero probabilities [35].
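The formulas above translate almost directly into code. The Python sketch below is a minimal illustration rather than the authors' implementation. It sums log-probabilities instead of multiplying probabilities, which avoids numerical underflow on long documents, and it skips words that never occurred in training; both are implementation choices not stated in the paper. The function names and the toy English tokens are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, lam=1.0):
    """Collect class priors and per-class word counts from (tokens, label) pairs."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    prior = {c: n / len(docs) for c, n in class_docs.items()}
    return {"prior": prior, "counts": word_counts, "vocab": vocab, "lam": lam}


def log_posterior(model, tokens, c):
    """log P(class) + sum_i log P(word_i | class), up to the constant log P(document)."""
    counts = model["counts"][c]
    n_c = sum(counts.values())   # N_c: number of words in category c
    v = len(model["vocab"])      # V: vocabulary size
    lam = model["lam"]           # lambda: smoothing constant
    score = math.log(model["prior"][c])
    for t in tokens:
        if t in model["vocab"]:  # assumption: out-of-vocabulary words are ignored
            score += math.log((counts[t] + lam) / (n_c + lam * v))
    return score


def classify(model, tokens):
    """Return the class with the highest (unnormalised) posterior."""
    return max(sorted(model["prior"]), key=lambda c: log_posterior(model, tokens, c))
```

With `lam=1.0` this is the usual add-one (Laplace) smoothing; passing `lam=0.5` gives the other value mentioned in the text.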

Figure 3: Multilayer Perceptron Neural Network Model
The authors of this paper use a three-layer feedforward neural network with a hyperbolic tangent (tanh) activation function in the hidden layer, followed by a linear output layer. Number of input layers used in this paper is

Table 4: MLP-NN Results

11. CONCLUSIONS AND FUTURE WORKS
This paper discusses the problem of text classification in general and Arabic text classification in particular. The authors apply three important text classification algorithms, SVM, NB and MLP-NN, in experiments and tests on an in-house data set. The importance of this study lies in applying these three algorithms to a large Arabic data set and then comparing them. The average measures obtained indicate that the SVM algorithm outperformed NB and MLP-NN. The average precision of SVM, NB and MLP-NN using 600 input layers is 0.778, 0.754 and 0.717, respectively. These small differences lead the authors to think that feature extraction and selection are critical and affect the results in many ways. The authors also think that MLP-NN could give better results if feature extraction and selection in the pre-processing steps were enhanced, since 600 input layers give promising results. In future work, the authors will apply different feature extraction and selection methods with MLP-NN and make a comparative study of feature extraction and selection.