A study and evaluation of classifiers for anti-spam systems

The volume of e-mails has been increasing in recent years. However, since 2005, at least half of these e-mails have been made up of spam. This massive traffic of unwanted messages causes losses to users, such as the excessive and unnecessary use of the bandwidth of their networks, loss of productivity, exposure of inappropriate content to inappropriate audiences etc. This paper proposes the study and the application of machine learning models to the classification of e-mails in existing anti-spam systems and, in particular, in the new anti-spam system Open-MaLBAS. After carrying out many experiments on different data sets, it was possible both to prove the feasibility of the proposal and to develop a powerful combination of techniques, methods, and models that can be successfully applied to the classification of e-mails in anti-spam systems.

source anti-spam system Open-MaLBAS [9], developed by the authors of this paper. Open-MaLBAS is available on GitHub [10].
The most relevant contributions of the paper are:
• the study and comparison of feature selection methods and the use of a dimensionality reduction method;
• the testing and validation of the eight ML models on different e-mail databases;
• the comparison of the experimental results with those obtained by a current and well-known commercial anti-spam system;
• the possibility of including the eight models in the classification module of the Open-MaLBAS anti-spam system.
The paper is structured as follows. Section II reviews papers on e-mail classification, covering the models and methods used and the results achieved. Section III discusses the pre-processing of e-mails. The feature selection methods and the dimensionality reduction method used in this paper are described in Sections IV and V, respectively. Sections VI, VII and VIII detail the ML models, e-mail databases and evaluation metrics used, respectively. Sections IX and X present the proposed methodology and the experiments performed, as well as the analysis of and conclusions about the results. Finally, Section XI summarizes the e-mail classification problem and the proposed approach to solving it. In addition, it makes a qualitative analysis of the results and presents proposals for future work.

II. RELATED WORK
The study and development of anti-spam systems are recurrent subjects in papers in the area of computer engineering. Researchers have approached the topic in many different ways, proposing solutions that range from statistical techniques [11] and artificial neural networks [12] to the analysis of the reputation of senders [7] and the detection of recurrent patterns and writing styles in spam [13]. In this section, research papers involving e-mail classification are presented.
Drucker et al. [14] compared a Support Vector Machine (SVM) with three other models (boosted decision trees, Repeated Incremental Pruning to Produce Error Reduction (RIPPER) 2, and Rocchio 3) for e-mail classification. The experiments were conducted on two e-mail databases from the American telecommunications company AT&T. The e-mails were processed by several feature selection methods, and empty words 4 were removed. In addition to the good results (precision around 98%), the authors highlighted other important results obtained by SVM, such as better training times when using a binary representation of the features of e-mails, as well as its ability to handle a large number of e-mail features well.
2 Classification model based on rules induced from a training set. It is known for handling unbalanced and noisy data sets well.
3 Classification model used in information retrieval systems. It makes use of the relevance judgments assigned to documents by users of the system.
4 Empty words are words that are filtered before or after natural language processing. They usually refer to the most common words in a language.
Using the Naïve Bayes (NB) probabilistic model proposed by Sahami et al. [15] combined with linguistic lemmatisation techniques 5 and removal of empty words, Androutsopoulos et al. [11] obtained classification accuracy close to 99% in most cases on the e-mails of the Ling Spam database [16], with a relatively low computational cost. In the paper, the authors mention that the use of linguistic techniques significantly improved the classification accuracy and introduced new possibilities in the analysis of e-mails.
Meyer and Whateley [17] proposed a statistical model called chi-squared (χ²) combining for e-mail classification. The model performs two χ² tests to determine the probabilities that a given e-mail is spam and ham, respectively. These probabilities are combined and scaled to provide, for each e-mail, an overall spam score between 0 and 1. Five e-mail databases (four from SpamBayes [18] and one from Spam Assassin [19]) were processed with n-gram and tiling-based feature selection methods, and used to train the model. The number of e-mails requiring manual classification after training was just over 1%, which indicates the efficacy of the model.
Carpinteiro et al. [12] proposed a pre-processing of e-mails to simplify them. Different feature selection methods were applied to the subject and body of the e-mails. A Multilayer Perceptron (MLP) neural network was used as a classifier model. The experiments showed classification accuracy over 99% on the Spam Assassin database [19].
Zhang et al. [7] proposed a reputation-based e-mail classification system, IPGroupRep. The reputation is provided by a server that stores Internet Protocol (IP) addresses and their respective scores. Scores are made up of sending histories of e-mail users, data from other anti-spam systems and from recipients. The authors used a database composed of almost three million e-mails from a university e-mail server. The experiments showed that the proposed system performed as well as the existing technique Distributed Checksum Clearinghouses (DCC) [20], and outperformed others such as Gossip Optimization for Selective Spam Prevention (GOSSiP) [21] and RepuScore [22], reaching rates above 95% of accuracy, precision and recall.
Pérez-Díaz et al. [23] used the rough set model [24] for e-mail classification. They also used the MFD (Most Frequent Decision), LNO (Largest Number of Objects), and LTS (Largest Total Strength) heuristics to make decisions about indeterminate e-mail classifications. The authors used binary or frequency representations of the features of the e-mails from the Spam Assassin database [19]. They compared the classification results obtained with those of the AdaBoost, Flexible Bayes (FB), Naïve Bayes and SVM models. The rough set model obtained F1 scores of 98%, surpassing those of the other models. The authors highlighted the importance of periodically regenerating the rule set of the model. In addition, they highlighted the long training time of the model and suggested the adoption of methods to reduce the e-mail feature space.
Barigou et al. [25] proposed a Cellular Automaton combined with K-Nearest Neighbors algorithm (CA-KNN) in order to reduce the amount of similar e-mails selected during the e-mail classification process. The use of CA-KNN produced a reduction in memory usage and an increase in performance of the model when compared to traditional KNN. The model showed an accuracy of more than 98% on the e-mails from the Ling Spam database [16], outperforming existing models such as NB [11] [26], stacked classifiers [27], open-source filters [28], and Topic-based Vector Space Model (eTVSM) [29].
Kaya and Ertuǧrul [30] proposed a new feature selection method, the shifted one-dimensional local binary pattern (shifted-1D-LBP). In experiments, the method was applied to the e-mails of the Ling Spam [16], Spam Assassin [19] and TREC 2006 [31] databases. Then, six ML models (Fisher Linear Discriminant Analysis (FLDA), NB, BayesNet (BN), Functional Tree (FT), Random Tree (RT), and Random Forest (RF)) were trained on the e-mails of the databases. The performance of the models in e-mail classification was evaluated under several metrics, such as precision, recall and F1 score. The results were promising, reaching approximately 92%, 93% and 95% on the Ling Spam, Spam Assassin and TREC databases, respectively.
Shams and Mercer [13] proposed a new method that consists in using stylometry attributes 6 to train e-mail classifying models. Examples of these attributes include the number of spelling and grammatical errors, indicators of ease of reading (Gunning fog index, Simple Measure of Gobbledygook (SMOG), Flesch Reading Ease Score (FRES), Forcast, and Flesch-Kincaid readability), quantities of simple words (with up to two syllables) and complex words (with three or more syllables), and average size of e-mails and words. The method was tested with the NB, RF, SVM, Bagging, and Adaboost.M1 classifying models on the CSDMC2010 [32], Spam Assassin [19], Ling Spam [16], and Enron-Spam [33] e-mail databases. The Bagging and Adaboost.M1 models achieved the best results. The average classification accuracies ranged from approximately 92% to 95%. In addition, the authors concluded that the method is relevant in detecting spam on personalized e-mail databases (i.e., those in which the collection of e-mails is not random), but limited on non-personalized e-mail databases, owing to the multiplicity of e-mail writing patterns in these databases.
Yang et al. [34] proposed the Anti-Spam Filter algorithm based on One-Class Information Bottleneck (SFOC-IB) model. SFOC-IB is suitable for training with small training sets. According to the authors, the frequent change of content in e-mails reduces the availability of large training sets with up-to-date content. SFOC-IB extracts highly significant samples from training sets in order to build clusters. The clusters are used to classify the e-mails into the ham and spam classes through a similarity function, the Jensen-Shannon divergence [35]. The SFOC-IB, SVM, NB, and AdaBoost models were evaluated on the e-mails from the Ling Spam [16], Spambase [36], PU3 [37] and TREC 2007 [38] databases. SFOC-IB presented accuracy, recall and F1 score results comparable to those presented by the AdaBoost, NB, and SVM models when trained with large training sets. With small training sets, however, SFOC-IB had less deterioration in its performance.
Tyagi [39] proposed the Stacked Denoising Autoencoder (SDAE) model, based on a deep neural network [40], for e-mail classification. She used the Term Frequency-Inverse Document Frequency (TF-IDF) feature selection method to select the most relevant features of the e-mails from the PU1, PU2, PU3, PUA [37] and Enron-Spam [33] databases. SDAE was compared to three other models: Deep Belief Network (DBN), Dense Multi Layer Perceptron (Dense-MLP), and SVM. It outperformed the other three models, achieving accuracy, precision, recall and F1 score around 95%.
Kumaresan et al. [41] proposed a Hybrid-Kernel Support Vector Machine (HKSVM) model 7 for e-mail classification. They used two combined databases, Ling Spam [16] and Spam Archive [42] [43], to evaluate the model. The combined databases are composed of e-mails containing text and images. The textual characteristics of the e-mails were selected by the Term Frequency (TF) 8 method, and the visual ones by the correlogram 9 and wavelet moment 10 methods. After selection, the feature space was reduced by a modified version of the Cuckoo search algorithm [44], with heuristics given by Lévy flights 11 [45]. Finally, the proposed model was experimentally compared to SVM models. It achieved a classification accuracy above 97%, surpassing the 94% obtained by other SVM models.
Douzi et al. [46] proposed a new e-mail representation, based on the Paragraph Vector-Distributed Memory (PV-DM) and TF-IDF methods. In this representation, both contextual characteristics (i.e., present in several other e-mails) and specific characteristics (i.e., present only in a particular e-mail) of each e-mail are considered. The authors used a double representation (the one they proposed and the traditional Bag-of-Words (BoW) [47]) for each e-mail from the Ling Spam [16] and Enron-Spam [33] databases. The Logistic Regression, KNN, and SVM models were trained and evaluated on the e-mails with double representation. They presented F1 scores ranging from approximately 92% to 98% in the classification of e-mails from the two databases.
As summarized in Table 1, the reviewed papers proposed several approaches and models for e-mail classification. The models were evaluated on several e-mail databases and obtained an accuracy greater than 90% in the classification. Some papers made use of feature selection methods, such as those presented in Section IV. However, none of the papers addressed the dimensionality reduction of the feature space of e-mails, as presented in Section V. Thus, the proposal presented in this paper differs from those presented in the reviewed papers.
7 Model that uses a combination of two or more kernels, such as linear, polynomial and quadratic kernels.
8 Denotes the number of occurrences of a term (e.g., a word) in a particular document (e.g., an e-mail).
9 Graph that presents autocorrelations in a time series.
10 Technique used to measure the local regularity of a signal.
11 Succession of random steps whose lengths follow a heavy-tailed probability distribution (e.g., Pareto, Lévy, Cauchy, Burr, and Student's t distributions).

III. PRE-PROCESSING
The pre-processing of the body and subject of e-mails aims to increase the accuracy of the ML models in their classifications. In this paper, two types of filters were used: plain text and HTML (HyperText Markup Language).

A. PLAIN TEXT FILTER
The plain text filter has two functions. First, it makes the body text and subject text of e-mails uniform. For example, it converts uppercase characters into lowercase and removes accents from words. Second, it replaces parts of the text with special tags. For example, numbers are replaced by the !_NUMBER tag, values with currency symbols are replaced by the !_MONETARY tag, and words with fewer than 4 or more than 19 characters are replaced by the !_SMALL_WORD and !_BIG_WORD tags, respectively.
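The behaviour of the plain text filter can be illustrated with a minimal sketch. It assumes whitespace tokenization, simple regular expressions for numbers and monetary values, and a fixed precedence among the tags; none of these details is specified in this paper, and the name `plain_text_filter` is hypothetical.

```python
import re
import unicodedata

# Hypothetical patterns: the paper does not say how numbers or monetary
# values are detected, so these expressions are assumptions.
MONETARY = re.compile(r"^[$€£]\d[\d.,]*$|^\d[\d.,]*[$€£]$")
NUMBER = re.compile(r"^\d[\d.,]*$")


def plain_text_filter(text):
    """Lowercase, strip accents, and replace some tokens with special tags."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = []
    for word in text.split():
        if MONETARY.match(word):
            tokens.append("!_MONETARY")
        elif NUMBER.match(word):
            tokens.append("!_NUMBER")
        elif len(word) < 4:
            tokens.append("!_SMALL_WORD")
        elif len(word) > 19:
            tokens.append("!_BIG_WORD")
        else:
            tokens.append(word)
    return tokens
```

In this sketch, monetary values and numbers are tested before word length, so that a short number such as "2" becomes !_NUMBER rather than !_SMALL_WORD; that ordering is a design choice of the example.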

B. HTML FILTER
The HTML filter, as expected, processes the HTML tags of the body and subject of e-mails. The processing is done at three levels, according to the relevance of the information contained in the tag and/or its attributes. The three levels of processing are described below:
• Tags with information typically focused on document description are entirely removed. For example, the tag "<title>Lipsum</title>" is entirely removed;
• Tags partially relevant to the classification of e-mails have their attributes removed and are replaced by a corresponding special tag. For example, the tag "<p id="par">text</p>" has its attribute "id" removed and is replaced by the special tag "!_IN_P text";
• Tags totally relevant to the classification of e-mails have only the parameters (i.e., values) of their attributes removed and are replaced by a corresponding special tag. For example, the tag "<form action="script.php">contents</form>" has the parameter "script.php" of its attribute "action" removed and is replaced by the special tag "!_IN_FORM action contents".
After pre-processing, each e-mail is therefore represented by a set of tokens. Each token is either a word or a special tag of the e-mail body or subject.
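The three processing levels can be sketched with the standard `html.parser` module. The tag sets below contain only the three examples given in the text and are otherwise hypothetical, as are the names `HtmlFilter` and `html_filter`; how tags outside the three levels are handled is also an assumption.

```python
from html.parser import HTMLParser

# Hypothetical tag sets: the paper only gives <title>, <p> and <form> as examples.
REMOVED = {"title"}   # level 1: tag and contents entirely removed
PARTIAL = {"p"}       # level 2: attributes removed, special tag emitted
FULL = {"form"}       # level 3: attribute names kept, attribute values removed


class HtmlFilter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tokens = []
        self._skip_depth = 0  # > 0 while inside an entirely removed tag

    def handle_starttag(self, tag, attrs):
        if tag in REMOVED:
            self._skip_depth += 1
        elif tag in PARTIAL:
            self.tokens.append("!_IN_" + tag.upper())
        elif tag in FULL:
            self.tokens.append("!_IN_" + tag.upper())
            self.tokens.extend(name for name, _ in attrs)
        # Assumption: any other tag is dropped, but its text is kept.

    def handle_endtag(self, tag):
        if tag in REMOVED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.tokens.extend(data.split())


def html_filter(html):
    """Return the token list produced by the three-level HTML processing."""
    parser = HtmlFilter()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

For instance, `html_filter('<p id="par">text</p>')` yields the tokens `!_IN_P` and `text`, matching the second example in the list.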

IV. FEATURE SELECTION
It is necessary to represent the pre-processed e-mails so that they can be classified by the ML models. This representation, which consists in a set of features (i.e., tokens: words and special tags) of the e-mail, can be simplified by using feature selection methods [48].
Three feature selection methods (chi-square statistics, frequency distribution, and mutual information) were used in this paper. They were chosen for four main reasons. First, they are very well known. The literature contains many papers that report their use, and practically everyone who works in the ML field knows them. Second, they are easy to implement. Third, they have a low computational cost, which is advantageous when using them in anti-spam systems. Finally, in the experiments carried out, they presented very good results. The three methods are described next.

A. CHI-SQUARE STATISTICS (CHI2)
The χ² statistic (CHI2) measures the dependency between a feature t and a particular class c. It is defined by Equation (1), in which:
• n is the total number of e-mails in the set;
• p_c(t) is the conditional probability of the class c for e-mails that contain the feature t;
• P_c is the global fraction of e-mails that belong to the class c;
• F(t) is the global fraction of e-mails that contain the feature t.
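Equation (1) is not reproduced in this text. The sketch below therefore assumes a closed form that is common in the text-classification literature and uses exactly the four symbols listed above: n·F(t)²·(p_c(t) − P_c)² / (F(t)·(1 − F(t))·P_c·(1 − P_c)). The second function computes the standard χ² statistic of a 2×2 contingency table, to which the assumed form is algebraically equivalent. Both function names are hypothetical.

```python
def chi2(n, p_c_t, P_c, F_t):
    """Chi-square statistic between feature t and class c (assumed closed form).

    n      total number of e-mails
    p_c_t  fraction of e-mails containing t that belong to class c
    P_c    global fraction of e-mails in class c
    F_t    global fraction of e-mails containing t
    """
    numerator = n * F_t ** 2 * (p_c_t - P_c) ** 2
    denominator = F_t * (1 - F_t) * P_c * (1 - P_c)
    return numerator / denominator


def chi2_from_counts(a, b, c, d):
    """Same statistic from a 2x2 contingency table of e-mail counts.

    a: class c, contains t        b: other classes, contains t
    c: class c, does not contain t    d: other classes, does not contain t
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

With a = 30, b = 10, c = 20 and d = 40, the table gives n = 100, F(t) = 0.4, P_c = 0.5 and p_c(t) = 0.75, and both functions return the same value.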

B. FREQUENCY DISTRIBUTION (FD)
Frequency Distribution (FD) is a statistical method that measures the frequency with which a feature t occurs in a particular class c. It is defined by Equation (2), in which:
• the numerator is the number of occurrences of t in e-mails of the class c;
• the denominator is the sum of the number of occurrences of all features in e-mails of the class c.
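The ratio described by the numerator and denominator above can be computed as in the following sketch, which assumes that each e-mail is given as a list of tokens with one class label per e-mail (the function name is hypothetical).

```python
from collections import Counter


def frequency_distribution(emails, labels, target_class):
    """FD(t, c): occurrences of t in class-c e-mails divided by the
    occurrences of all features in class-c e-mails."""
    counts = Counter()
    for tokens, label in zip(emails, labels):
        if label == target_class:
            counts.update(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: count / total for token, count in counts.items()}
```

The tokens with the highest FD values for a class are then natural candidates for selection, for example the 1,024 highest-scoring ones.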

C. MUTUAL INFORMATION (MI)
Mutual Information (MI) is derived from information theory [49]. It measures the amount of information that a feature contributes in relation to a given class. The mutual information MI(t, c) between the feature t and the class c is based on the level of co-occurrence between the class c and the feature t. It is defined by Equation (3), in which:
• F(t)·P_c is the expected co-occurrence between the class c and the feature t, based on mutual independence (the actual co-occurrence can be much higher or lower than this expected value, depending on the level of correlation between the class c and the feature t).
Clearly, the feature t is positively correlated to the class c when MI(t, c) > 0, and negatively correlated when MI(t, c) < 0.
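Equation (3) is not reproduced in this text. A minimal sketch, assuming the usual pointwise form in which the actual co-occurrence F(t)·p_c(t) is compared with the expected co-occurrence F(t)·P_c on a logarithmic scale, is given below. The symbol p_c(t) is taken from the CHI2 definitions, and the function name is hypothetical.

```python
import math


def mutual_information(p_c_t, P_c, F_t):
    """Pointwise mutual information between feature t and class c (assumed form).

    Requires p_c_t > 0, P_c > 0 and F_t > 0.
    """
    observed = F_t * p_c_t  # actual co-occurrence of t and c
    expected = F_t * P_c    # co-occurrence expected under independence
    return math.log(observed / expected)
```

The sign behaves as stated in the text: the value is positive when the feature occurs in the class more often than independence would predict, negative when it occurs less often, and zero when the two co-occurrences coincide.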

V. DIMENSIONALITY REDUCTION
The e-mail classification problem usually presents feature spaces with high dimensionality, even after the most relevant features have been selected through feature selection methods. The high dimensionality of the feature spaces can make it difficult to train certain ML models.
Thus, it is necessary to reduce the dimensionality of the feature space, so that the training algorithms of the models do not spend impractical amounts of time and computational resources. In addition, it is desired that the reduction of the dimensionality of the original feature space to another space preserves the greatest possible amount of relevant information, so as not to impair the generalization and classification capabilities of the models.
Several methods have been proposed to explore the feature space and find a relevant subset of features according to some evaluation metric. In this paper, the method employed, Multi-Objective Evolutionary Feature Selection (MOEFS) [50], consists in a multi-objective evolutionary search that explores the feature space in order to generate the candidate subsets while two objectives are optimized simultaneously:
• Maximization of the "merit" metric, given in terms of the correlation between features and classes of the problem and the intercorrelation between the features of the candidate subsets [51];
• Minimization of the cardinality (i.e., number of features) of the candidate subsets.
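The interplay of the two objectives can be sketched as follows. The merit formula used here is the one from correlation-based feature selection, k·r_cf / sqrt(k + k·(k − 1)·r_ff); treating it as the merit of [51] is an assumption, since the formula is not given in this text. The evolutionary search itself is omitted: the sketch only scores candidate subsets and keeps the non-dominated ones. All names are hypothetical.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a subset of k features (assumed correlation-based form)."""
    return (k * mean_feature_class_corr) / math.sqrt(
        k + k * (k - 1) * mean_feature_feature_corr
    )


def dominates(a, b):
    """True if candidate a = (merit, cardinality) Pareto-dominates b:
    merit is maximized, cardinality is minimized."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [a for a in candidates if not any(dominates(b, a) for b in candidates)]
```

A candidate with lower merit and more features than another is discarded, whereas candidates that trade merit against size all remain on the front.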
Class balancing is carried out after the dimensionality reduction. Its purpose is to ensure that there is the same number of samples of ham and spam e-mails in the set, as the imbalance can cause the ML model to emphasize one class more than another, obtaining biased results.
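The text does not state how the balancing is performed. One simple possibility, shown here purely as an assumption, is random undersampling of the larger class until both classes have the same number of samples (the function name and the `seed` parameter are hypothetical).

```python
import random


def balance_classes(samples, labels, seed=0):
    """Randomly undersample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    size = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        for sample in rng.sample(group, size):
            balanced.append((sample, label))
    rng.shuffle(balanced)  # avoid long runs of a single class
    return balanced
```

Undersampling discards data from the larger class; oversampling the smaller class would be an alternative with the opposite trade-off.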
NB and AODE are probabilistic models. Probabilistic models are models capable of predicting, given an input pattern, a probability distribution across a set of classes, rather than just predicting the most likely class to which the pattern belongs.
SLP and RBF are artificial neural models. Artificial neural models are models inspired by the biological neural system.
REPT and AB-M1 are decision tree models. Decision trees are commonly used in operations research to identify the strategies most likely to achieve an objective.
Finally, L-SVM and NL-SVM are support vector machines. Support vector machines are models that define a mathematically optimal separability boundary between classes.

A. NAÏVE BAYES
Naïve Bayes (NB) is a model known both for its theoretical simplicity and ease of implementation, and for its effectiveness. The model learns the probability that any object (e.g., e-mail) with certain features belongs to a certain class or category. It is called Bayesian because it is based on Bayes' theorem, and naïve because it supposes that the occurrence of a particular feature is independent of the occurrence and/or influence of the other features. The classifier model consists in the function that, among the existing classes {C_1, ..., C_f, ..., C_K}, assigns to the object (e.g., e-mail) the class Ĉ = C_f, for some f, as described in Equation (4).
Figure 1 illustrates the NB model. In it, the class probability P(c) depends directly on the features x_1, x_2, ..., x_n, and no dependence among the features is considered.
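Equation (4) is not reproduced in this text. The sketch below assumes the standard decision rule, which picks the class that maximizes the prior P(C_f) multiplied by the product of the per-feature likelihoods. It works on token counts, uses log-probabilities to avoid numerical underflow, and applies add-one (Laplace) smoothing; the smoothing and all names are assumptions of the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(emails, labels):
    """Collect class counts, per-class token counts, and the vocabulary."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(emails, labels):
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def classify_nb(model, tokens):
    """Return the class with the highest (log) posterior score."""
    class_counts, token_counts, vocabulary = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        denominator = sum(token_counts[c].values()) + len(vocabulary)
        score = math.log(n_c / total)  # log prior
        for t in tokens:
            if t in vocabulary:  # tokens unseen during training are ignored
                score += math.log((token_counts[c][t] + 1) / denominator)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Each token contributes to the score independently of the others, which is precisely the naïve independence assumption described above.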

B. AVERAGED ONE-DEPENDENCE ESTIMATORS
Averaged One-Dependence Estimators (AODE) is an extension of NB that introduces the notion of x-dependence estimators, whereby the probability of the value of each feature is conditioned by the class and by a predefined number x of other features. In this paper, AODE with x = 1 was used, which makes it a "less naïve" classifier model than NB [54]. Figure 2 illustrates the AODE model. In it, P(c) is conditioned on the features that, in turn, take into account the joint probabilities in relation to a single other feature, thus characterizing a 1-dependence estimator.

C. SINGLE LAYER PERCEPTRON
Single Layer Perceptron (SLP) is a model that consists in a single artificial neuron with all its inputs connected directly to its output (Figure 3) [55]. If the linear combination of its inputs exceeds a predetermined threshold, it produces an excitatory potential at its output. Thus, if an output is produced, the SLP classifies the e-mail, for example, as spam. Otherwise, the e-mail is classified as ham.
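The threshold rule can be written in a few lines. The training step shown afterwards is the classic perceptron learning rule, included as an assumption because the training procedure is not described in this section; the names and the learning rate are hypothetical.

```python
def slp_classify(weights, threshold, features):
    """Spam if the linear combination of the inputs exceeds the threshold."""
    activation = sum(w * x for w, x in zip(weights, features))
    return "spam" if activation > threshold else "ham"


def perceptron_update(weights, threshold, features, target, rate=0.1):
    """One step of the classic perceptron rule (target: 1 = spam, 0 = ham)."""
    activation = sum(w * x for w, x in zip(weights, features))
    output = 1 if activation > threshold else 0
    error = target - output
    new_weights = [w + rate * error * x for w, x in zip(weights, features)]
    # Lowering the threshold has the same effect as raising a bias term.
    return new_weights, threshold - rate * error
```

When the output already matches the target, the error is zero and the weights and threshold are left unchanged.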

D. RADIAL BASIS FUNCTION NETWORK
Radial Basis Function Network (RBF) is a model that consists in a network of artificial neurons. The activations of the RBF artificial neurons start at the neurons of the input layer, then run through the neurons with Gaussian activation functions g_1^(1), ..., g_n1^(1) of the single hidden layer and, finally, reach the neurons with linear activation functions g_1^(2), ..., g_n2^(2) of the output layer [57]. Figure 4 illustrates the architecture of the RBF.
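A forward pass through such a network can be sketched as below. The example assumes Gaussian hidden units of the form exp(−‖x − c‖² / (2σ²)) with a single shared width σ; the centers, width, weights and biases are hypothetical parameters that would normally be obtained by training.

```python
import math


def rbf_forward(x, centers, width, output_weights, output_bias):
    """Input layer -> Gaussian hidden layer -> linear output layer."""
    hidden = [
        math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2 * width ** 2))
        for c in centers
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(output_weights, output_bias)
    ]
```

Each hidden unit responds most strongly when the input coincides with its center, and its response decays as the input moves away from it.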

E. REDUCED-ERROR PRUNING TREE
Reduced-Error Pruning Tree (REPT) is a model that consists in a decision tree whose simplification process (i.e., the process of removing subtrees that make its structure unnecessarily complex and that reduce its ability to generalize) is based on the Reduced-Error Pruning method, proposed by Quinlan [59].

F. BOOSTED REDUCED-ERROR PRUNING TREE
Boosted Reduced-Error Pruning Tree (AB-M1) is a model that consists in a decision tree that makes use of a boosting method to improve its performance. Generically, this method repeatedly runs a "weak" learning algorithm (i.e., one that produces slightly better responses than random guesses) over different distributions of training data [60]. It then combines the classifiers h_1, ..., h_n produced by the algorithm with their respective weights α_1, ..., α_n to form a single composite classifier H.
AdaBoost [61] is a boosting method that has two versions [62]: AdaBoost.M1, focused on binary classification problems, and AdaBoost.M2, focused on multiclass classification. The first version, illustrated in Figure 5, was used in this paper. The result of executing this method is a binary value, built from the combination of the outputs of the models taken into account. This binary value indicates the class resulting from the classification. The REPT model, described in the previous subsection, was used as a "weak" classifier.
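The boosting loop can be sketched as follows, using the usual AdaBoost.M1 formulation: the weights of correctly classified samples are multiplied by β = ε / (1 − ε), and each classifier votes with weight α = ln(1/β). The weak learner is passed in as a function; in this paper it would be a REPT tree, which is not implemented here. All names are hypothetical.

```python
import math


def adaboost_m1(samples, labels, weak_learner, rounds):
    """Return a list of (alpha, classifier) pairs.

    weak_learner(samples, labels, weights) must return a function h(x) -> label.
    """
    n = len(samples)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        h = weak_learner(samples, labels, weights)
        error = sum(w for w, x, y in zip(weights, samples, labels) if h(x) != y)
        if error == 0:
            ensemble.append((1.0, h))  # perfect classifier; arbitrary positive weight
            break
        if error >= 0.5:
            break  # no better than guessing: stop boosting
        beta = error / (1 - error)
        ensemble.append((math.log(1 / beta), h))
        # Shrink the weights of correctly classified samples, then renormalize.
        weights = [w * (beta if h(x) == y else 1.0)
                   for w, x, y in zip(weights, samples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Composite classifier H: weighted vote of the weak classifiers."""
    if not ensemble:
        raise ValueError("empty ensemble")
    votes = {}
    for alpha, h in ensemble:
        label = h(x)
        votes[label] = votes.get(label, 0.0) + alpha
    return max(votes, key=votes.get)
```

Because misclassified samples keep their weight while the others shrink, each new weak classifier is pushed to concentrate on the samples that the previous ones got wrong.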
Thus, by comparing the results of the REPT model -REPT tree, without boosting -with those of the AB-M1 model -REPT tree, with AdaBoost.M1 boosting -it was

G. LINEAR SUPPORT VECTOR MACHINE
Linear Support Vector Machine (L-SVM) is a linear model that searches, in the original feature space, for a maximum margin hyperplane (i.e., a hyperplane capable of separating samples of different classes with the greatest possible distance) [64].
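Once the hyperplane has been found, classification reduces to checking on which side of it a sample lies. The sketch below assumes a trained weight vector w and bias b (training is omitted), and the mapping of the positive side to the spam class is an arbitrary choice of the example. For a hyperplane in canonical form, the margin width is 2 / ‖w‖.

```python
import math


def svm_decision(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_classify(w, b, x):
    """Assumption: the positive side of the hyperplane is the spam class."""
    return "spam" if svm_decision(w, b, x) >= 0 else "ham"


def margin_width(w):
    """Distance between the two margin hyperplanes w.x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))
```

Maximizing the margin is therefore equivalent to minimizing the norm of w, which is the optimization problem solved during training.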

H. NON-LINEAR SUPPORT VECTOR MACHINE
Non-Linear Support Vector Machine (NL-SVM) is a nonlinear model proposed by Boser et al. [65]. Its approach consists in applying a kernel function [66] to the maximum margin hyperplanes, transforming the input space into a linearly separable feature space of equal or greater dimensionality. Table 2 presents the most used types of kernel functions. In the experiments of this paper, the NL-SVM used a polynomial kernel function (P-SVM).
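The effect of a polynomial kernel can be illustrated with a short sketch. It assumes the common form k(x, y) = (γ·x·y + c₀)^d; the parameter names `degree`, `gamma` and `coef0` and their default values are hypothetical and are not taken from Table 2. The explicit map is given only for two-dimensional inputs with d = 2, γ = 1 and c₀ = 1, to show that the kernel equals an inner product in a higher-dimensional feature space.

```python
import math


def polynomial_kernel(x, y, degree=2, gamma=1.0, coef0=1.0):
    """k(x, y) = (gamma * <x, y> + coef0) ** degree."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (gamma * dot + coef0) ** degree


def explicit_map_2d(x):
    """Feature map whose inner product equals (<x, y> + 1) ** 2 for 2-D inputs."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]


def kernel_decision(support_vectors, coefficients, bias, x, kernel=polynomial_kernel):
    """Decision value sum_i coef_i * k(sv_i, x) + bias, with coef_i = alpha_i * y_i."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coefficients)) + bias
```

The kernel evaluates the six-dimensional inner product without ever building the mapped vectors, which is what keeps non-linear SVMs tractable when the feature space is large.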

VII. E-MAIL DATABASES
The experiments used three public databases (Ling Spam, Spam Assassin, and TREC) and two private ones (UNIFEI and UNIFEI-δ0), all composed of real e-mails. The databases are described below.

A. LING SPAM
The Ling Spam database [16] was compiled and made available in the public domain by Androutsopoulos et al. [11].
The database e-mails were obtained from several sources and pre-processed, in order to remove some information (e.g., attachments and HTML tags) deemed irrelevant or private. The database has four versions: bare, lemm, lemm_stop, and stop. Bare is the most original version. In the lemm version, the words in the e-mails are lemmatized. In the lemm_stop version, in addition to lemmatizing the words, the empty words were also removed. In the stop version, only the empty words have been removed. Each version has 2,893 e-mails, containing 2,412 (83.4%) hams and 481 (16.6%) spams. In the experiments, the four versions were joined, forming a single database composed of 11,572 e-mails.

B. SPAM ASSASSIN
The Spam Assassin database [19] comprises e-mails from various sources. It is divided into five parts: spam, spam_2, easy_ham, easy_ham_2, and hard_ham. The spam part contains 500 spam e-mails. The spam_2 part consists of a later addition of 1,397 spam e-mails to the database. The easy_ham and easy_ham_2 parts contain ham e-mails that are easily identified as ham; they contain 2,500 and 1,400 e-mails, respectively. The hard_ham part contains 250 e-mails that are ham but are hard to identify as ham. In the experiments, the five parts were joined, forming a single database composed of 6,047 e-mails, of which 4,150 (68.6%) are ham and 1,897 (31.4%) are spam.

C. TREC
The TREC Spam Track database [69], hereafter referred to as TREC, is composed of e-mails from various sources.

D. UNIFEI
The UNIFEI database is composed of e-mails collected, during the second semester of 2016, by the Research Group in Systems and Computer Engineering (GPESC) of the Federal University of Itajubá (UNIFEI). The database represents the reality of the university, containing e-mails received by professors, technical and administrative personnel, and students. The university imposed conditions for the collection of e-mails in order to preserve the confidentiality of the information contained therein.
The database contains 862,227 e-mails, of which 353,151 (41%) are ham and 509,076 (59%) are spam. The classification of the database e-mails into the ham and spam classes was performed by the commercial anti-spam system CanIt-PRO 9.2.4 [71].
CanIt-PRO remained in use at the Federal University of Itajubá (UNIFEI), Brazil, until July 2019. Since August 2019, the university network services, including the e-mail service, have been provided through the Google G-Suite platform.
The classification carried out by CanIt-PRO was the subject of suspicion. There was a strong suspicion that identical e-mails could be classified into different classes. Thus, to check whether or not the classification of e-mails was consistent, five steps were taken:
1) Each e-mail was represented, through the pre-processing described in Section III, by a set of tokens;
2) The 1,024 most relevant tokens for the classification of the e-mails were selected using the FD feature selection method (Section IV). Then, each e-mail was represented as a multidimensional vector in ℝ^1024, in which each dimension represents a token selected by the FD method;
3) Each group of identical vectors was placed in a separate set;
4) It was verified whether CanIt-PRO had assigned the same class to all vectors in each set, since they were all identical. Whenever this was not the case, a brand-new ham/spam class was assigned to all vectors in the set;
5) The original 1024-dimensional vectors (ℝ^1024) were projected onto two dimensions (ℝ^2) by means of the t-SNE method [72].
Figure 6 displays the e-mails of the UNIFEI database in the bidimensional space. In the figure, ham e-mails are shown as blue "+" marks, spam e-mails as red "x" marks, and ham/spam e-mails as black "*" marks. It should be noted that the x and y axes of the figure have no meaning, as the t-SNE method takes into account only the distances between the points x ∈ ℝ^1024 of the database and the probability distributions between these distances [73].
Because e-mails were represented by vectors in ℝ^1024, it is very likely that equal vectors represent equal e-mails. Hence, since the number of black marks in Figure 6 is substantial, it follows that the UNIFEI database is highly inconsistent.
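Steps 3 and 4 of the consistency check amount to grouping identical vectors and looking for groups that carry more than one label. A minimal sketch, with hypothetical function names and with "ham/spam" as the name of the new class, is given below.

```python
from collections import defaultdict


def find_inconsistent_groups(vectors, labels):
    """Return the set of vectors that were assigned more than one class."""
    seen = defaultdict(set)
    for vector, label in zip(vectors, labels):
        seen[tuple(vector)].add(label)
    return {vector for vector, classes in seen.items() if len(classes) > 1}


def relabel(vectors, labels, mixed_label="ham/spam"):
    """Assign the new mixed class to every vector of an inconsistent group."""
    mixed = find_inconsistent_groups(vectors, labels)
    return [mixed_label if tuple(v) in mixed else label
            for v, label in zip(vectors, labels)]
```

Hashing each vector as a tuple makes the grouping linear in the number of e-mails, which matters for a database of this size.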

E. UNIFEI-δ0
The consistency-generating tool of the Open-MaLBAS anti-spam system [9] was employed to correct the inconsistency of the UNIFEI database. The tool uses two integer parameters, δ ∈ ℕ and n ∈ ℕ*, that are defined beforehand. The first parameter, δ, denotes the level of divergence both among e-mails and among vectors, since the e-mails are represented by tokens (Section III) which, in turn, are represented by vector coordinates. For example, if δ is set to zero, only e-mails with identical tokens, or only vectors with identical values in their coordinates, are deemed to be identical. If δ is set to one or two, e-mails that differ by at most one or two tokens, or vectors that differ in at most one or two coordinate values, respectively, are deemed to be identical. The second parameter, n, denotes the number of dimensions of the vectors.
The consistency-generating tool performs the four steps below:
1) It checks, by examining their tokens, which e-mails are identical to each other up to the level of divergence δ (from now on, δ-divergence e-mails) and places each group of δ-divergence e-mails in a separate set;
2) It identifies the predominant class (i.e., the one with the highest number of e-mails) of each set, and then attributes the label of that class to every e-mail in the set. For example, if a set of δ-divergence e-mails comprises 35 ham e-mails and 65 spam e-mails, the spam label is attributed to all 100 e-mails in the set;
3) It changes the representation of the e-mails from tokens to n-dimensional vectors;
4) It executes the first and second steps again, but now on the vectors produced in the third step. It is worth noticing that this fourth step may change the class of the vectors, i.e., e-mails may indirectly change their classes again.
The UNIFEI-δ0 database was derived from the UNIFEI database by running the consistency-generating tool with δ = 0. From Figure 7, it is possible to verify that the consistency of the UNIFEI-δ0 database is higher than that of the UNIFEI database. The UNIFEI-δ0 database contains 862,227 e-mails, of which 353,910 (41%) are ham and 508,317 (59%) are spam.
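For δ = 0, the first two steps reduce to grouping exactly equal items and relabeling each group with its majority class. The sketch below is not the Open-MaLBAS implementation; the function name, the `key` parameter and the tie-breaking behaviour (the class met first wins) are assumptions.

```python
from collections import Counter, defaultdict


def make_consistent(items, labels, key=tuple):
    """Majority relabeling for delta = 0: items are grouped by exact equality.

    key converts an item (token list or vector) into a hashable grouping key.
    """
    groups = defaultdict(list)
    for index, item in enumerate(items):
        groups[key(item)].append(index)
    new_labels = list(labels)
    for indices in groups.values():
        majority = Counter(labels[i] for i in indices).most_common(1)[0][0]
        for i in indices:
            new_labels[i] = majority
    return new_labels
```

The four steps then correspond to two passes: `make_consistent(token_lists, labels)` first, followed by `make_consistent(vectors, relabeled)` on the n-dimensional vectors, which is where e-mails may change class a second time. Supporting δ > 0 would require an approximate grouping step that this sketch does not attempt.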

VIII. PERFORMANCE METRICS
Classifying models of anti-spam systems should avoid false positives and false negatives. A false positive is a ham e-mail incorrectly classified as spam. In turn, a false negative is a spam e-mail incorrectly classified as ham.
The precision and recall metrics measure how well a model avoids false positives and false negatives, respectively. The $F_1$ score combines precision and recall in order to assess a classification model in terms of the number of false positives and false negatives it produces. The area under the Receiver Operating Characteristic curve (AUC-ROC) is also a concise way to assess the classification ability of binary classifiers. To describe the metrics in terms of their equations, the following variables are needed:
• $N_{HAM}$: total number of ham e-mails in the test set;
• $N_{SPAM}$: total number of spam e-mails in the test set;
• $n_{H \to H}$: number of ham e-mails correctly classified as ham;
• $n_{H \to S}$: number of ham e-mails incorrectly classified as spam;
• $n_{S \to S}$: number of spam e-mails correctly classified as spam;
• $n_{S \to H}$: number of spam e-mails incorrectly classified as ham.

A. PRECISION
The precision metric is calculated to indicate the accuracy in the classification of both ham e-mails (Equation (5)) and spam e-mails (Equation (6)). The general precision is given by Equation (7). (General is abbreviated as GEN in Equations (7), (10), (13), and (16).)
$$P_{SPAM} = \frac{n_{S \to S}}{n_{S \to S} + n_{H \to S}} \qquad (6)$$

B. RECALL
The recall metric is also calculated both to indicate the recall in the classification of ham e-mails (Equation (8)) and spam e-mails (Equation (9)). The general recall is given by Equation (10).
$$R_{HAM} = \frac{n_{H \to H}}{n_{H \to H} + n_{H \to S}} \qquad (8)$$

C. $F_1$ SCORE
The $F_1$ score is the harmonic mean of the values obtained by the precision and recall metrics. Thus, its best and worst values are 1 and 0, respectively. The $F_1$ score is calculated to indicate the score in the classification of both ham e-mails (Equation (11)) and spam e-mails (Equation (12)). The general score is given by Equation (13).
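As an illustration, the per-class metrics can be computed directly from the four counts defined at the start of this section. The sketch below follows Equations (6) and (8) as given; the ham precision and spam recall are written by symmetry, the $F_1$ score uses the standard harmonic-mean formula, and the general (GEN) values are assumed to be class-size-weighted averages in the same form as Equation (16). These last points, as well as the function names, are assumptions rather than the paper's stated equations.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return safe_div(2 * precision * recall, precision + recall)


def spam_filter_metrics(n_hh, n_hs, n_ss, n_sh):
    """Compute precision, recall and F1 from the confusion counts.

    n_hh: ham classified as ham     n_hs: ham classified as spam
    n_ss: spam classified as spam   n_sh: spam classified as ham
    """
    n_ham, n_spam = n_hh + n_hs, n_ss + n_sh
    m = {
        "P_HAM": safe_div(n_hh, n_hh + n_sh),   # by symmetry with Equation (6)
        "P_SPAM": safe_div(n_ss, n_ss + n_hs),  # Equation (6)
        "R_HAM": safe_div(n_hh, n_hh + n_hs),   # Equation (8)
        "R_SPAM": safe_div(n_ss, n_ss + n_sh),  # by symmetry with Equation (8)
    }
    m["F1_HAM"] = f1(m["P_HAM"], m["R_HAM"])
    m["F1_SPAM"] = f1(m["P_SPAM"], m["R_SPAM"])
    # Assumption: GEN values are weighted by class size, as in Equation (16).
    for name in ("P", "R", "F1"):
        weighted = n_ham * m[f"{name}_HAM"] + n_spam * m[f"{name}_SPAM"]
        m[f"{name}_GEN"] = safe_div(weighted, n_ham + n_spam)
    return m


metrics = spam_filter_metrics(n_hh=90, n_hs=10, n_ss=80, n_sh=20)
for key in sorted(metrics):
    print(f"{key}: {metrics[key]:.4f}")
```

The counts in the usage example are made up; they are not results from the experiments.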

D. AREA UNDER RECEIVER OPERATING CHARACTERISTIC CURVE (AUC-ROC)
A Receiver Operating Characteristic (ROC) curve is a graphical plot used to show the classification ability of binary classifiers. It is constructed by plotting the true positive rate (TPR), also known as sensitivity or recall and given by Equation (10), against the false positive rate (FPR), also known as fall-out or false alarm ratio and given by Equation (16). In Equation (16), $FPR_{HAM}$ and $FPR_{SPAM}$ are given by Equations (14) and (15), respectively.
$$FPR_{SPAM} = \frac{n_{H \to S}}{n_{H \to S} + n_{H \to H}} \qquad (15)$$
$$FPR_{GEN} = \frac{N_{HAM} \cdot FPR_{HAM} + N_{SPAM} \cdot FPR_{SPAM}}{N_{HAM} + N_{SPAM}} \qquad (16)$$
As shown in Figure 8, classifiers that produce curves closer to the top-left corner perform better. As a baseline, a random classifier is expected to produce points lying along the diagonal (i.e., $FPR = TPR$).
A succinct way to evaluate classifiers using the ROC curve is to calculate the area under the curve, which can be done by trapezoidal approximation. As can be observed from Figure 8, the AUC value lies between 0.5 and 1 (i.e., 50% to 100%), where the lower value denotes a classifier no better than random guessing and the higher value denotes an excellent classifier.
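The trapezoidal approximation can be sketched in a few lines of Python. The example below builds the ROC points by sweeping a decision threshold over classifier scores and then sums the trapezoid areas. It assumes that spam is the positive class, that higher scores mean "more likely spam", and that both classes are present; the function names and the sample scores are illustrative only.

```python
def roc_points(scores, is_positive):
    """Return the (FPR, TPR) points of the ROC curve, starting at (0, 0)."""
    n_pos = sum(1 for flag in is_positive if flag)
    n_neg = len(is_positive) - n_pos
    ranked = sorted(zip(scores, is_positive), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for i, (score, flag) in enumerate(ranked):
        if flag:
            true_pos += 1
        else:
            false_pos += 1
        # Emit one point per distinct score so that ties share a threshold.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((false_pos / n_neg, true_pos / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as sorted (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [True, True, False, True, False, False]
print(round(auc_trapezoid(roc_points(scores, labels)), 4))
```

For the made-up scores above, eight of the nine (spam, ham) pairs are ranked correctly, so the area is 8/9.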

E. TRAINING TIME
The training time metric measures the time needed to train the model on the training set. The metric is given by Equation (17), in which $t_{T_i}$ and $t_{T_f}$ mark, respectively, the start and end times (HH:MM:SS) of the training.

F. CLASSIFICATION TIME
The classification time metric measures the time the model takes to classify all e-mails in the test set. The metric is given by Equation (18), in which $t_{C_i}$ and $t_{C_f}$ mark, respectively, the start and end times (HH:MM:SS) of the classification.
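Both time metrics amount to recording a start time and an end time around a call and, presumably, taking their difference. A minimal Python sketch follows, assuming a monotonic clock is used and that times are reported as HH:MM:SS.mmm; the helper names `timed` and `format_hms` are hypothetical and not part of any system described here.

```python
import time


def timed(function, *args, **kwargs):
    """Call `function` and return (result, elapsed seconds)."""
    start = time.perf_counter()           # start time
    result = function(*args, **kwargs)
    elapsed = time.perf_counter() - start  # end time minus start time
    return result, elapsed


def format_hms(seconds):
    """Format a duration in seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, remainder = divmod(millis, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, ms = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


# Stand-in workload; a real run would time model training or classification.
total, elapsed = timed(sum, range(1_000_000))
print(total, format_hms(elapsed))
```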

IX. METHODOLOGY
In order to evaluate the performance of the different ML models (Section VI) as well as the feature selection methods (Section IV) and dimensionality reduction of the feature space (Section V), a methodology was developed to carry out the experiments. The steps of the methodology are specified by the diagram blocks in Figure 9. The steps are described below.

A. PRE-PROCESSING
In the pre-processing step, described in detail in Section III, the subject text and body text of each e-mail are standardized (e.g., letters are converted to lowercase and accents are removed from words) and each piece of information relevant to its classification is converted into a specific tag (e.g., attachments and figures are replaced by their corresponding specific tags). The information relevant to the classification of e-mails is based on techniques used by spammers$^{13}$, described by Cournane and Hunt [75]. At the end of pre-processing, each e-mail is represented by a set of tokens. Each token is either a word or a specific tag in the subject or body of the e-mail.
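The standardization described above can be sketched with the Python standard library. The example lowercases the text, strips accents through Unicode decomposition, replaces one kind of relevant information with a tag, and returns the set of tokens. The choice of URLs as the tagged information, the tag name `tag_url`, and the tokenization pattern are assumptions for illustration; the actual tags and rules are those of Section III.

```python
import re
import unicodedata

URL_TAG = "tag_url"  # hypothetical tag name, not taken from the paper


def strip_accents(text):
    """Remove combining accent marks, e.g. 'promoção' -> 'promocao'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(subject, body):
    """Return the set of tokens (words and tags) that represents an e-mail."""
    text = strip_accents(f"{subject} {body}".lower())
    # Replace each URL by a specific tag (illustrative choice of information).
    text = re.sub(r"https?://\S+", f" {URL_TAG} ", text)
    return set(re.findall(r"[a-z0-9_]+", text))


print(sorted(preprocess("Promoção Especial", "Visite http://example.com AGORA")))
```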

B. FEATURE SELECTION
In the feature selection step, described in detail in Section IV, the most relevant tokens for the classification of e-mails are selected by one of the three feature selection methods. Then, each e-mail is represented as a multidimensional vector in $\mathbb{R}^n$, in which each dimension represents a selected token. The multidimensional vectors are normalized.
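To make the vector representation concrete, the sketch below maps a token set onto a fixed list of selected tokens and normalizes the result. The binary presence encoding and the L2 normalization are assumptions; the text above states only that each dimension represents a selected token and that the vectors are normalized.

```python
import math


def vectorize(tokens, selected_tokens):
    """Map a token set to a unit-length vector over the selected tokens.

    Assumptions: binary presence values and L2 normalization.
    """
    vector = [1.0 if token in tokens else 0.0 for token in selected_tokens]
    norm = math.sqrt(sum(value * value for value in vector))
    if norm == 0:
        return vector  # null vector: no selected token occurs in the e-mail
    return [value / norm for value in vector]


selected = ["free", "meeting", "offer", "winner"]  # hypothetical selection
print(vectorize({"free", "winner", "hello"}, selected))
print(vectorize({"hello"}, selected))
```

The second call shows how an e-mail that contains none of the selected tokens becomes a null vector.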

C. DIMENSIONALITY REDUCTION
In the dimensionality reduction step, described in detail in Section V, the MOEFS method is used to reduce the dimensionality of the e-mail feature space, that is, to reduce the dimensionality of the vectors that represent the e-mails.

D. CLASS BALANCING
In this step, the classes of e-mails are balanced so that each class contributes equally to the training of the model. To this end, e-mails belonging to the class with the fewest e-mails are randomly replicated. At the end of the step, both classes (ham and spam) have the same number of e-mails.
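Random replication of the minority class can be sketched as follows. The function name, the fixed seed, and sampling with replacement are assumptions made for the example.

```python
import random


def balance_by_replication(vectors, labels, seed=0):
    """Randomly replicate minority-class items until all classes are equal."""
    rng = random.Random(seed)
    by_class = {}
    for vector, label in zip(vectors, labels):
        by_class.setdefault(label, []).append(vector)

    target = max(len(items) for items in by_class.values())
    balanced_vectors, balanced_labels = [], []
    for label, items in by_class.items():
        # Draw replicas (with replacement) only for the smaller class.
        replicas = [rng.choice(items) for _ in range(target - len(items))]
        for vector in items + replicas:
            balanced_vectors.append(vector)
            balanced_labels.append(label)
    return balanced_vectors, balanced_labels


vectors = [[1, 0]] * 3 + [[0, 1]] * 5
labels = ["ham"] * 3 + ["spam"] * 5
_, new_labels = balance_by_replication(vectors, labels)
print(new_labels.count("ham"), new_labels.count("spam"))
```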

E. EXPERIMENTS
In this step, the e-mail database is shuffled and divided in half, preserving the balance of classes. The first half is used as the training set and the second half as the test set for the ML models. The null vectors (i.e., $x = [x_1, x_2, \ldots, x_n] = [0, 0, \ldots, 0]$) are removed from the training set. The null vectors of the test set, on the other hand, are kept.
Each ML model is trained and tested ten times. Thus, the results obtained by each model are described in terms of the average and the 95% confidence interval (C = 0.95) [76] over the ten training sessions and the ten tests. In addition, the classification results are presented in the form of the metrics described in Section VIII.
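The experimental protocol can be sketched as below: a class-preserving split in half, removal of null vectors from the training half only, and a mean with a 95% confidence interval over ten runs. The interval here is an assumed Student-t interval with 9 degrees of freedom (critical value 2.262); the method actually used is the one in [76], and the function names are illustrative.

```python
import math
import random
import statistics

T_CRIT_95_DF9 = 2.262  # two-sided 95% Student-t critical value, 9 d.o.f.


def stratified_half_split(vectors, labels, rng):
    """Shuffle and split in half per class; drop null vectors from training."""
    train, test = [], []
    for cls in sorted(set(labels)):
        items = [(v, y) for v, y in zip(vectors, labels) if y == cls]
        rng.shuffle(items)
        half = len(items) // 2
        train += items[:half]
        test += items[half:]
    # Null vectors are removed from the training set only.
    train = [(v, y) for v, y in train if any(v)]
    return train, test


def mean_and_ci95(values):
    """Return (mean, half-width of the 95% interval) for exactly ten runs."""
    if len(values) != 10:
        raise ValueError("the critical value above assumes ten runs")
    mean = statistics.mean(values)
    half_width = T_CRIT_95_DF9 * statistics.stdev(values) / math.sqrt(10)
    return mean, half_width


rng = random.Random(42)
vectors = [[1, 0]] * 4 + [[0, 0], [0, 0], [0, 1], [0, 1]]
labels = ["ham"] * 4 + ["spam"] * 4
train, test = stratified_half_split(vectors, labels, rng)
print(len(train), len(test))
```

Two results can then be compared through their intervals, each expressed as mean ± half-width.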

F. EVALUATION OF RESULTS
In this step, the results obtained by the ML models are evaluated in terms of the $F_1$ score, AUC-ROC, training time, classification time, influence of the space dimensionality value, feature selection methods, dimensionality reduction, and computer suitability.

X. EXPERIMENTS AND RESULTS
All experiments were carried out on a single computer. The computer had an Intel Core(TM) i5-4570, 3.2-3.6 GHz, 6 MB cache processor, 32 GB DDR3-1333 RAM and ran the Linux Mint 18.3 Sylvia operating system.
The experiments followed the steps of the methodology described in Section IX. First, the pre-processing step was performed on the e-mails of each database. Then, the remaining steps were performed. The results of the execution of these steps are presented below.

A. STEP: FEATURE SELECTION
The set of tokens that represent all e-mails in a database is much larger than the set of most relevant tokens selected by the feature selection methods. Thus, the execution of the feature selection step generates null vectors. Table 3 shows, for each feature selection method, the percentages of null vectors generated in each e-mail database. From the results presented in the table, it can be seen that the CHI2 and FD methods produce, respectively, the highest and lowest percentages of null vectors in all e-mail databases. Likewise, it can be verified that the number of null vectors generated by each method decreases as the number of features that represent the e-mails increases.

B. STEP: DIMENSIONALITY REDUCTION
In this step, the MOEFS method is executed to reduce the dimensionality of the e-mail feature space from $N_F$ to $N'_F$. In other words, the method is executed to reduce the dimensionality of the vectors that represent the e-mails.
Tables 4, 5, 6, 7 and 8 present, for each feature selection method, the reduction of the dimensionality of the vectors of the databases Ling Spam, Spam Assassin, TREC, UNIFEI and UNIFEI-δ0, respectively. In the tables, $N_F$ is the original dimension of the vectors, $N'_F$ is the reduced dimension of the vectors and Reduction (%) is the percentage of reduction.
From the results presented in the tables, it can be seen that:
• The CHI2 method was the one that provided the highest percentage of dimensionality reduction on the Ling Spam and Spam Assassin databases. In turn, the FD method provided the highest percentage of reduction on the TREC, UNIFEI and UNIFEI-δ0 databases;
• The highest percentage of dimensionality reduction was 98.6% and occurred with the CHI2 method on the Spam Assassin database. In turn, the lowest percentages of reduction occurred on the UNIFEI and UNIFEI-δ0 databases. In these two databases, there were occurrences of reduction percentages of 0% (i.e., there was no reduction in the number of features);
• The highest and lowest percentages of dimensionality reduction provided by the FD method were 95.2% and 25%, respectively, both on the TREC database;
• The highest percentage of dimensionality reduction provided by the MI method was 91.8% on the Spam Assassin database. The lowest percentage of reduction provided by the method was 25% on the Ling Spam database.

C. STEP: EXPERIMENTS
As described in Section IX-E, each ML model is trained and evaluated ten times on each e-mail database. Then, the mean and confidence interval of the $F_1$ score, AUC-ROC, training time and classification time metrics (Section VIII) are calculated. Tables 9, 10, 11, 12 and 13 present the best results, ordered by $F_1$ score, obtained by each of the eight ML models on each e-mail database. In the tables, times are given in hours, minutes, seconds and milliseconds (HH:MM:SS.mmm), FS is the feature selection method, and $N_F$ and $N'_F$ are, respectively, the number of features before and after the dimensionality reduction. Table 14 shows the number of occurrences of each feature selection method in the results of Tables 9 to 13.
Similarly, Table 15 shows the number of occurrences of each original quantity of features N F in the results of the Tables 9 to 13.

D. STEP: EVALUATION OF RESULTS
The results are evaluated according to the four metrics ($F_1$ score, AUC-ROC, training time, classification time) as well as according to the quality of the feature selection method, dimensionality reduction, and computer suitability. In the evaluation, results whose confidence intervals overlap are considered equivalent.
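The equivalence rule can be expressed as a one-line check on two results, each given as a mean and the half-width of its confidence interval. Treating intervals that merely touch as overlapping is an assumption, and the numbers in the usage example are made up rather than taken from the result tables.

```python
def intervals_overlap(mean_a, half_a, mean_b, half_b):
    """True when [mean_a - half_a, mean_a + half_a] and
    [mean_b - half_b, mean_b + half_b] share at least one point."""
    return abs(mean_a - mean_b) <= half_a + half_b


# Hypothetical F1 scores: the first pair is equivalent, the second is not.
print(intervals_overlap(0.981, 0.002, 0.984, 0.002))
print(intervals_overlap(0.950, 0.001, 0.980, 0.002))
```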

Feature selection and dimensionality
• The MI feature selection method obtained the highest number of occurrences (62.5%) in the results of Tables 9 to 13. Therefore, it is the method that most often allowed the models to obtain their best results;
• The FD method obtained the second highest number of occurrences (37.5%) in the results of Tables 9 to 13;
• The CHI2 method has no occurrences (0%) in the results of Tables 9 to 13. Therefore, it never allowed a model to obtain its best result;
• The original dimensions of 1024, 128 and 512 features present, respectively, the largest (37.5%), the second largest (22.5%) and the third largest (20%) number of occurrences in the results of Tables 9 to 13. The other dimensionalities add up to the remaining 20% of occurrences;
• The use of the MOEFS dimensionality reduction method (Section V) allowed a significant reduction in the training and classification times of the ML models.

Computer suitability
The worst training time was obtained by the P-SVM model on the UNIFEI database. This time, however, is not critical, since the training of ML models is always carried out offline. In turn, the worst classification time (approximately four minutes) was obtained by the RBF model on the UNIFEI database. This time is not acceptable. However, all other ML models studied, whose classification times are on the order of seconds, can be used as classification models for anti-spam systems. Thus, the resources (CPU, RAM) of the computer used in the experiments were adequate.

XI. CONCLUSION
The large amount of spam e-mails circulating on the internet requires the development of anti-spam systems with a high degree of accuracy in the classification of e-mails. The main objective of this paper was to evaluate machine learning models in the classification of e-mails. Such models can be incorporated into existing anti-spam systems and, in particular, into the open-source anti-spam system Open-MaLBAS [9], developed by the authors of this paper. Open-MaLBAS is available on GitHub [10].
The models were trained and tested on three public databases (Ling Spam, Spam Assassin, TREC) and two private ones (UNIFEI, UNIFEI-δ0), all made up of real e-mails. The experimental results indicate that machine learning models, combined with feature selection methods and the MOEFS dimensionality reduction method, can be successfully applied to the classification of e-mails into the ham and spam classes.
Three directions for future work are proposed. The first is to test other feature selection methods (e.g., Information Gain (IG) and Term Strength (TS) [48]) and dimensionality reduction methods (e.g., Principal Component Analysis (PCA) [77] and Autoencoders [78]). The second is to test deep learning models [79], [80]. The third is to incorporate the models evaluated in this paper into the Classification Module of the Open-MaLBAS anti-spam system.