A Comparative Study with RapidMiner and WEKA Tools over some Classification Techniques for SMS Spam

SMS Spamming is a serious attack that abuses the SMS service by spreading advertisements in bulk. Unwanted SMS messages that contain advertisements disturb users and violate the privacy of mobile users. To overcome these issues, many studies have proposed detecting SMS Spam using data mining tools. This paper presents a comparative study of five machine learning techniques, namely Naïve Bayes, K-NN (K-Nearest Neighbour), Decision Tree, Random Forest and Decision Stump, to compare the accuracy obtained with RapidMiner and WEKA on the SMS Spam dataset from the UCI Machine Learning repository.


Introduction
There are several services on a mobile device that a spammer can use to launch attacks, such as Email, Mobile Browser, Short Messaging Service (SMS), Voice Call and other mobile applications. In recent years, social engineering attacks such as Spam have affected the security and privacy of mobile phone users. These attacks abuse the use of the mobile phone [1]. Attackers try to trick the mobile phone user into responding so that they can profit from the scam.
Among these services, SMS is widely used all over the world, and the cost of sending a message is considered cheaper than a call. Many SMS websites provide services to send a single message or messages in bulk for free to telephone numbers worldwide. This service can be used not only for personal communication but also for business marketing. Advertising a product by SMS can help sellers improve their business profits. However, message recipients might feel uncomfortable when receiving unwanted advertisement messages from an unknown sender. Such an unwanted message is known as SMS Spam [2]. The mobile device is also at risk because of Spam attacks. These attacks keep growing [3] and have become a serious problem in many countries [1]. Several mechanisms have been proposed to detect these attacks on websites, in email and in SMS. Despite this, attack frequency is still increasing. Xu et al. [4] claim that SMS Spamming is now a serious attack that abuses the SMS service by spreading advertisements in bulk. Sending unwanted SMS messages such as advertisements can disturb the user [5] and violates the privacy of the mobile user [6].
Several studies have been conducted on detecting SMS Spam using machine learning tools such as RapidMiner and WEKA. This paper observes the SMS Spam detection results on the dataset from the UCI Machine Learning repository [7] using several classification techniques in WEKA and RapidMiner.

Literature Review
RapidMiner and WEKA are among the well-known data mining tools and are widely used by researchers in SMS Spam detection. These tools have been used for tokenization, lemmatization, feature selection and classification.

Tokenization
Tokenization is the process of extracting words from a message [8]. It is frequently applied in attack detection frameworks and is one of the required processes for handling text messages [9]. Several studies have already applied this process in SMS research, including Shirani-Mehr [9], who applied tokenization based on alphabets, the number of dollar signs ($), the number of numeric strings and the total number of characters in the message. Najadat et al. [10] tokenized SMS using the WEKA tool to convert the text into individual words. Charninda et al. [11] applied the tokenizing process to split the SMS text into words.
Tiago Almeida, José María Gómez Hidalgo, & Silva [12] used the same process to extract text and amalgamate it into words in order to calculate the word frequency in each of the Ham and Spam classes and to identify keywords for each class. In addition, Chen and Kan [13] tokenized English and Chinese SMS. Al-Talib and Hassan [14] developed a program to classify SMS by tokenizing its text in order to distinguish the individual verbs, nouns and adjectives by colour. Verbs were highlighted in green, nouns in blue and adjectives in an orchid colour.
Taufiq Nuruzzaman et al. [15] removed non-letter words during tokenization to reduce the number of unnecessary words in SMS. Mahmoud and Mahfouz [16] also applied tokenization but added stop word removal to remove unwanted words. Maaske Treurniet et al. [17] used tokenizer tools to perform basic tokenizing of SMS messages in Dutch. Delany, Buckley, and Greene [18] did not tokenize alphanumeric values and punctuation. However, domain names and email addresses were included.
Almeida et al. [19] produced two sets of tokens. The first consists of words that start with a printable character followed by any number of alphanumeric characters, excluding dots, commas and colons from the middle of the pattern. The second consists of any sequence of characters separated by blanks, tabs, returns, dots, commas, colons and dashes. Walkowska [20] tokenized all SMS text into words, transformed the words into root words and then manually removed unwanted words. Hidalgo et al. [21] applied the Information Gain (IG) technique in the tokenization process to select suitable words with a value above zero (0). Finally, Jue [22] utilised tokenization in an SMS efficiency study where the process was applied to all characters including punctuation marks, symbols and numbers.
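The tokenization choices surveyed above can be illustrated with a minimal Python sketch. It is not the procedure of any cited study: the regular expression, the function names and the sample message are assumptions made for illustration. It splits a message into lowercase word tokens and counts the surface features listed for Shirani-Mehr [9] (dollar signs, numeric strings and total characters).

```python
import re

# Assumed token pattern: runs of letters, optionally with one internal apostrophe.
WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(message):
    """Lowercase an SMS and return its word tokens (digits and symbols are dropped)."""
    return WORD_PATTERN.findall(message.lower())


def surface_features(message):
    """Count simple non-word features of the raw message."""
    return {
        "dollar_signs": message.count("$"),
        "numeric_strings": len(re.findall(r"\d+", message)),
        "characters": len(message),
    }


# Hypothetical sample message, not taken from the UCI dataset.
sample = "WINNER!! Claim your $1000 prize now, call 09061701461"
print(tokenize(sample))
print(surface_features(sample))
```

Whether digits, punctuation or e-mail addresses are kept as tokens is a design decision, which is exactly where the surveyed studies differ.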

Lemmatization
Lemmatization is a process to group identical words and count them. This module is also applied in most attack detection frameworks. Karami & Zhou [23] applied this process in their study and calculated word frequency as one feature. Charninda et al. [11] grouped special characters, punctuation and uppercase letters. Ahmed, Guan and Chung [24] lemmatized single words and also combined two and three words as single features.
Tiago Almeida, José María Gómez Hidalgo and Silva [12] calculated the frequency of each word in each of the Ham and Spam classes. Additionally, Chen and Kan [13] grouped and sorted the words in descending order to compare the common words between two SMS corpora. Meanwhile, Al-Talib and Hassan [14] grouped the words according to several classes such as occasions, greetings, friendships and sales.
Uysal et al. [25] lemmatized all words in SMS and applied the CHI2 technique to sort the words based on their occurrence frequency. Yadav et al. [26] also grouped each word in both English and Hindi SMSs based on the Ham and Spam classes. Nuruzzaman et al. [27] also grouped individual letters from words and categorised them based on the Ham and Spam classes. Almeida et al. [19] identified two sets of tokens and calculated the occurrences of each token.
According to the above studies, the lemmatization process aims to identify similar and unnecessary words. Identifying these recurrences makes it possible to remove redundancy and improve detection accuracy for SMS. However, this process can also reduce detection accuracy if the user keeps only the root or basic words while removing other characters.
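The grouping and counting of words per class described in this section can be sketched in a few lines of Python. The function name and the three-message corpus are hypothetical, and whitespace splitting stands in for whichever tokenizer a study actually used.

```python
from collections import Counter


def class_word_counts(labelled_messages):
    """Return {label: Counter of word frequencies} for (label, text) pairs."""
    counts = {}
    for label, text in labelled_messages:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


# Hypothetical toy corpus used only for illustration.
toy_corpus = [
    ("spam", "free prize call now"),
    ("spam", "call now to claim free entry"),
    ("ham", "call me when you are free"),
]

counts = class_word_counts(toy_corpus)
print(counts["spam"].most_common(3))
print(counts["ham"]["free"])
```

Sorting each counter in descending order gives the kind of per-class keyword list that several of the studies above compare between Ham and Spam.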

Feature Selection
Feature extraction in this study is the process of extracting words from the dataset. Studies based on the content of SMS messages were previously conducted by Joe and Shim [28], Hidalgo et al. [21], Jie et al. [29], Yadav et al. [26], and Uysal et al. [30]. Eshmawi and Nair [31], however, proposed features for SMS Spam detection based on domain knowledge.
According to Basnet et al. [32], most studies utilise all features and then eliminate those that do not contribute positively to the model. The benefit of feature selection is that it eliminates unnecessary features so that SMS Phishing attacks can be classified and detected more concisely and accurately.
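One common way to decide which features contribute to the model is Information Gain, the measure mentioned earlier in connection with Hidalgo et al. [21]. The sketch below is a generic textbook formulation, not the configuration used in WEKA or RapidMiner. Each message is assumed to be represented as a set of words, and all names and data are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * math.log2(p)
    return result


def information_gain(documents, labels, word):
    """Information gain of the binary feature 'word occurs in the message'."""
    with_word = [lab for doc, lab in zip(documents, labels) if word in doc]
    without_word = [lab for doc, lab in zip(documents, labels) if word not in doc]
    remainder = 0.0
    for subset in (with_word, without_word):
        if subset:  # skip an empty split to avoid dividing by zero
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: each message is a set of words.
docs = [{"free", "prize"}, {"free", "call"}, {"call", "me"}, {"see", "you"}]
labels = ["spam", "spam", "ham", "ham"]

ig_free = information_gain(docs, labels, "free")
ig_call = information_gain(docs, labels, "call")
print(ig_free, ig_call)
```

In this toy data the word "free" separates the two classes perfectly while "call" carries no information, so a threshold of zero would keep the former and discard the latter.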

Classification
A plethora of classification techniques have been implemented for Spam filtering, and an equal number, if not more, have been proposed to classify each corpus. The Naive Bayes classifier uses estimator classes, and the numeric estimator precision values are chosen based on analysis of the training data [33]. Several classification techniques have been applied to SMS Spam detection. Several studies, including Ahmed, Guan and Chung [24] and Raziki (2015), utilised the Naive Bayes technique. The Naive Bayes technique was proposed by Sahami, Dumais, Heckerman, and Horvitz [36]. It is one of the more effective techniques used to classify text documents [16]. Naive Bayes is applied because it is one of the simplest probabilistic techniques and predicts the class well [24], [27]. Taufiq Nuruzzaman et al. [15] integrated Bayesian filtering to develop a simple tool to filter SMS Spam. Hidalgo et al. [34] used an extended version of Naive Bayes filtering that filters emails and was adapted to filter SMS Spam. K. Yadav et al. [26] developed a mobile-based SMS filtering system called SMSAssassin using Bayesian learning and a form of sender-blacklisting mechanism. Additionally, K. Yadav et al. [26] conducted experiments on Bayesian and SVM classifiers. Uysal et al. [25] also applied the Naive Bayes classifier in their framework.
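To make the Naive Bayes idea concrete, the following is a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is a generic formulation under stated assumptions, not the implementation shipped with WEKA or RapidMiner. The function names and training messages are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(labelled_messages):
    """Collect per-class message counts, per-class word counts and the vocabulary."""
    class_docs = Counter()
    word_counts = {}
    vocabulary = set()
    for label, text in labelled_messages:
        words = text.lower().split()
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return class_docs, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log posterior for the message."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training messages used only for illustration.
model = train_naive_bayes([
    ("spam", "free prize call now"),
    ("spam", "win free cash now"),
    ("ham", "are you free tonight"),
    ("ham", "call me later please"),
])
print(predict(model, "free cash prize"))
print(predict(model, "call me later"))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the smoothing keeps a single unseen word from forcing a class probability to zero.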
The K-nearest neighbour algorithm, known as K-NN, is a classifier technique used to classify similar objects. The result of K-NN depends on the value of "K". If K is small, the classification is based only on the nearest objects. If K is large, a larger neighbourhood of objects is taken into account. K-NN selects the objects that are most similar to the item being classified with respect to the attributes used. In SMS attack detection, several studies have applied this technique in their detection frameworks, such as Liu and Wang [35], Najadat et al. [10], Shirani-Mehr [9], Ho et al. [37] and Duan et al. [38].
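The role of "K" can be seen in a small sketch. The text does not state which distance measure the cited studies or tools use, so Jaccard distance over word sets is an assumption chosen here for simplicity. The function names and messages are hypothetical.

```python
from collections import Counter


def jaccard_distance(a, b):
    """1 minus (shared words / all words); 0.0 means identical word sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def knn_predict(training, message, k=3):
    """Label a message by majority vote among its k nearest training messages."""
    query = set(message.lower().split())
    neighbours = sorted(
        training,
        key=lambda item: jaccard_distance(set(item[1].lower().split()), query),
    )[:k]
    votes = Counter(label for label, _ in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training messages used only for illustration.
training = [
    ("spam", "free prize call now"),
    ("spam", "win free cash now"),
    ("spam", "claim your free prize"),
    ("ham", "are you free tonight"),
    ("ham", "call me later please"),
    ("ham", "see you at lunch"),
]
print(knn_predict(training, "free cash prize now", k=3))
print(knn_predict(training, "see you later", k=3))
```

With k=1 only the single closest message decides the label, while a larger k lets more neighbours vote, which smooths out noise at the cost of blurring small clusters.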
A decision tree is a model that supports decision making by predicting an outcome. Najadat et al. [10] conducted several experiments to detect SMS Spam, and one of the techniques used was the decision tree. Additionally, decision tree classification is a fast technique but suffers from a fragmentation problem [39].
Random Forest is suitable for larger datasets and works on randomly selected attributes of the dataset [40]. This classification technique can also generate decision trees based on different samples. Several experiments have been conducted by Al Moubayed et al. [41] and Najadat et al. [10], and one of the techniques used was Random Forest.
The Decision Stump is a type of decision tree that classifies the data based on a single feature [42]. This technique is also called a one (1) rule because it detects items using a single feature.
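A decision stump over word-presence features can be sketched as follows. This simplified version only considers rules of the form "the word is present, so the message is Spam" and picks the word that classifies the most training messages correctly. The stump learners in the tools may use a different splitting criterion, and the names and data here are hypothetical.

```python
def train_stump(documents, labels, positive="spam", negative="ham"):
    """Return the single word whose presence rule gets the most labels right."""
    vocabulary = sorted(set().union(*documents))
    best_word, best_correct = None, -1
    for word in vocabulary:
        correct = sum(
            (positive if word in doc else negative) == label
            for doc, label in zip(documents, labels)
        )
        if correct > best_correct:
            best_word, best_correct = word, correct
    return best_word


def stump_predict(word, document, positive="spam", negative="ham"):
    """Apply the one-feature rule learned by train_stump."""
    return positive if word in document else negative


# Hypothetical toy data: each message is a set of words.
docs = [{"free", "prize"}, {"free", "call"}, {"call", "me"}, {"see", "you"}]
labels = ["spam", "spam", "ham", "ham"]

split_word = train_stump(docs, labels)
print(split_word)
print(stump_predict(split_word, {"free", "entry"}))
```

Because the whole model is one test on one feature, a stump is fast to train but can only be as accurate as the best single word allows.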
The experiments observe the classification results of Naïve Bayes, KNN, Decision Tree, Random Forest and Decision Stump using RapidMiner and WEKA.

Methodology
According to Creswell [43], methodology is a strategy or plan of action that generates outcomes.

Phase 2: Experiment
The experiments use the dataset from the UCI Machine Learning repository [7]. This dataset is run separately in two data mining tools, WEKA and RapidMiner. Each tool runs five (5) classification techniques: Naïve Bayes, KNN, Decision Tree, Random Forest and Decision Stump.

Phase 3: Result Analysis
The performance of these experiments is evaluated based on Accuracy (A), which measures the proportion of correctly classified results. Two (2) data mining tools are involved in these experiments so that the Accuracy of the tools can be compared. The aim is to investigate the classification accuracy of several techniques on the same dataset. Several researchers such as Gonçalves and Graczyk et al. [45], Barros [46], and Zainal et al. [47] have examined the accuracy results of these two tools and found that the accuracy is not the same. However, the results are still acceptable when comparing both data mining tools, RapidMiner and WEKA, on both SMS datasets.
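Accuracy as used here is the number of correctly classified messages divided by the total number of messages. A minimal sketch, with hypothetical labels rather than results from the experiments:

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual class labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels for five messages; not output from WEKA or RapidMiner.
actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "spam", "ham"]
print(accuracy(actual, predicted))
```

Computing the same measure on the predictions exported from each tool gives directly comparable figures, provided both tools are evaluated on the same messages.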

Conclusion
The SMS classification experiments were carried out using the RapidMiner and WEKA tools. RapidMiner shows higher Accuracy results compared to WEKA. However, the WEKA Accuracy results are still acceptable, exceeding 80% on average. One major challenge in this experiment is that RapidMiner can process large data, whereas WEKA has limited memory and can suddenly stop processing the data. Some classification techniques exist in both tools, but others differ between them. RapidMiner and WEKA each offer several classification techniques. Thus, before applying classification techniques, researchers are reminded to identify clearly which techniques they need and whether the desired techniques exist in these tools.