Correlation Impact by Random Forest Towards Prediction of Phishing Website

Phishing is a form of online identity theft that lures unsuspecting victims to a phishing website that asks for personal information such as financial credentials or online access details. The most obvious phishing attacks use spam emails to lure victims into visiting a phishing website. A less obvious attack is when a phisher spoofs a legitimate site and brings victims to a visually similar phishing website. To address this issue, this paper focuses on the effectiveness of the Random Forest algorithm in predicting whether a website is a phishing or a legitimate website. The prediction model is developed using a classification methodology, and the results reveal that the prediction accuracy of Random Forest is only 66.66%. This is potentially due to high correlation among the features used in training, since the individual trees in the forest can protect each other from their individual errors only when they are weakly correlated. In future work, this project hopes to investigate feature selection on the same dataset, because a phishing website prediction model requires features that have at least some predictive power.


Introduction
The Internet has long empowered us in many daily social and financial activities, to the point that much of our personal information is scattered across webpages that belong to different organizations. This makes us vulnerable to various types of web threats that can lead to identity theft, loss of personal information, loss of financial wealth or damage to brand reputation due to increasing fraud cases. In 2004 alone, more than 57 million users in the United States were targeted by phishing scams, and nearly 2 million of them became victims of the attack [1]. Phishing is considered a form of web threat in which the web pages of a legitimate organization are targeted by phishers seeking secret information such as user names, passwords and financial IDs. The attack manipulates users into believing that the phishing website is a legitimate website [2]. An attack typically starts with an email that appears to come from a legitimate company and asks the victim to update or verify their information by visiting a link in the email. Phishers now use more than one technique to create websites that trick and entice users, but all of them rely on a common set of features when designing a phishing website. Predicting whether a website is a phishing website or a legitimate website is made possible by a data mining approach. Data mining is one of the powerful techniques that can help researchers focus on the most important information in their data. Data mining is divided into several methods such as classification, anomaly detection, clustering, regression or summarization [3]. Each technique has its own advantages and methods for making predictions on large datasets. One approach to countering a phishing attack would be at the email level, since most current phishing attacks use spam emails to trick victims into visiting a phishing website.
A less obvious attack is when a phisher spoofs a legitimate site and brings victims to a visually similar phishing website. This is the main reason why detection of phishing websites is necessary to protect users from potential loss and harm.
The literature has shown a number of works on phishing website prediction using data mining approaches such as structural web mining (focusing on web links) [4], clustering [5,6] or associative classification [7,8]. This paper focuses on building a prediction model for phishing websites using Random Forest that is able to classify a website as a phishing or a legitimate website based on various features. The remainder of this paper proceeds as follows. Section 2 presents the materials and methods for building the phishing website model, including the algorithm and dataset along with the evaluation metrics. Section 3 presents the results and finally Section 4 concludes with some directions for future work.

Material and Method
According to [9], the most popular applied data mining technique is classification, whereby a set of pre-determined classes is employed in building the classification model. This model will, in turn, be able to classify a large dataset. Classification tasks are commonly organized as either binary or multiclass problems. The phishing website prediction model proposed in this paper is based on a classification methodology with the Random Forest algorithm. The experiment is a binary prediction with two classes, legitimate vs. phishing. It is performed using the hold-out validation method with 80% training and 20% testing data. Figure 1 shows the classification methodology in Azure Machine Learning Studio (https://studio.azureml.net/).
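As an illustration only, the 80/20 hold-out split can be sketched in plain Python as below. The experiment itself was run in Azure Machine Learning Studio, so the function name holdout_split, the fixed seed, and the simple unstratified shuffle are assumptions of this sketch rather than the settings actually used.

```python
import random


def holdout_split(rows, labels, train_fraction=0.8, seed=42):
    """Shuffle the instances once and cut them into train and test parts.

    Assumption: a plain random (unstratified) split; the paper does not
    state whether the split was stratified or which seed was used.
    """
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")

    # Shuffle indices rather than the data so rows and labels stay aligned.
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)

    # The first train_fraction of the shuffled indices form the training set.
    cut = int(round(train_fraction * len(indices)))
    train_idx, test_idx = indices[:cut], indices[cut:]

    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test
```

With the 11,055 instances of the dataset, this split would give 8,844 training and 2,211 testing instances.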

Dataset
The dataset used in this project originates from UCI and is available at https://www.kaggle.com/akashkr/phishing-website-dataset/version/2. The data is intended to be used for experimental verification throughout the project. Figure 2 shows an excerpt of the dataset.

Figure 2. Excerpt of Dataset
There are 32 attributes and 11,055 instances, covering various important features such as having IP Address, Request URL and Abnormal URL. Due to the limited space in this paper, only five features are described in detail as follows.
if Favicon loaded from external domain then
    phishing
else
    legitimate
end if
Other than the five features described, more features are available in the dataset such as HTTPS Token, Popup Window, DNS Record, Page Rank, and Google Index. A phishing website usually adds the "HTTPS" token to the domain part of the URL, has no DNS record for the domain, has a Page Rank < 0.2, or is not indexed by Google due to its short lifespan.
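The rules above can be read as small feature extractors. The following is a minimal sketch of three of them, not the procedure used to build the dataset. The function names, the -1/1 label encoding, and the exact-hostname comparison (which treats a subdomain as an external domain) are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Assumed encoding for this sketch: -1 marks phishing, 1 marks legitimate.
PHISHING, LEGITIMATE = -1, 1


def favicon_feature(page_url, favicon_url):
    """Favicon rule: an icon loaded from an external domain flags phishing."""
    page_host = urlparse(page_url).hostname
    icon_host = urlparse(favicon_url).hostname
    # A relative favicon path has no hostname and counts as same-domain.
    if icon_host is None or icon_host == page_host:
        return LEGITIMATE
    return PHISHING


def https_token_feature(url):
    """HTTPS Token rule: the token 'https' inside the domain flags phishing."""
    host = urlparse(url).hostname or ""
    return PHISHING if "https" in host else LEGITIMATE


def page_rank_feature(page_rank):
    """Page Rank rule: a rank below 0.2 flags phishing."""
    return PHISHING if page_rank < 0.2 else LEGITIMATE
```

The DNS Record and Google Index rules are left out because they require network lookups.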

Algorithms
Random Forest is chosen as the data mining algorithm of this study. This algorithm is an ensemble learning technique that combines decision trees rather than relying on the prediction of a single decision tree [10]. Every tree in the forest assigns a class to an unknown sample (test sample), and the class with the maximum number of votes is assigned to that sample. The Random Forest algorithm is defined as an ensemble of n classifiers h_1(x), h_2(x), ..., h_n(x), where x denotes the input attributes. The pseudocode for Random Forest is shown in Algorithm 1 [11].

Algorithm 1 Pseudocode for Random Forest
Set k = 1
for each sample i in bootstrapped dataset do
    Grow an unpruned classification tree h_i(x) for sample i
    for each node in the classification tree h_i(x) do
        Randomly sample k of the predictor variables
        Choose the best split from among those variables
    end for
end for
Predict new data by combining the predictions of the trees
Calculate summary statistics and variable importance
One thing to note is that Random Forest is a classification algorithm that aggregates a number of decision trees, with the idea that the prediction by a group of trees is better than the prediction of a single tree. The key to the success of the Random Forest algorithm is the low correlation between the individual decision tree models, because uncorrelated trees protect each other from their errors. This means that the individual decision tree models are trained on different bootstrap samples of the training set as well as on different features when building the model.
International Conference on Technology, Engineering and Sciences (ICTES) 2020, IOP Conf. Series: Materials Science and Engineering 917 (2020) 012043, IOP Publishing, doi:10.1088/1757-899X/917/1/012043
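The steps of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative sketch and not the Azure Machine Learning Studio implementation used in the experiment. The function names, the Gini impurity split criterion, and the default of k as the square root of the number of features are assumptions. The summary statistics and variable importance step is omitted.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return the (feature, threshold) that lowers weighted Gini, or None."""
    best, best_score, n = None, gini(labels), len(rows)
    for f in features:
        # Skipping the largest value keeps both sides of the split non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, k, rng):
    """Grow an unpruned tree; every node looks at k randomly chosen features."""
    if len(set(labels)) == 1:
        return labels[0]  # pure leaf
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), k))
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # majority-class leaf
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            grow_tree([rows[i] for i in left], [labels[i] for i in left], k, rng),
            grow_tree([rows[i] for i in right], [labels[i] for i in right], k, rng))


def predict_tree(node, row):
    """Follow the splits down to a leaf and return its class label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def random_forest(rows, labels, n_trees=25, k=None, seed=0):
    """Train n_trees trees, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    k = k or max(1, int(len(rows[0]) ** 0.5))  # assumed default for k
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], k, rng))
    return forest


def predict_forest(forest, row):
    """Combine the trees by majority vote."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]
```

The bootstrap sample per tree and the random feature subset per node are the two sources of randomness that keep the trees weakly correlated.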

Evaluation Metrics
The evaluation metrics used in measuring the performance of the Random Forest algorithm in predicting a phishing website are accuracy, precision, and recall, which are defined in terms of the following outcomes.
• Positive (P): A phishing website is detected.
• Negative (N): A legitimate website is detected.
• True Positive (TP): The website is a phishing website and is predicted to be a phishing website.
• False Negative (FN): The website is a phishing website but is predicted to be a legitimate website.
• True Negative (TN): The website is a legitimate website and is predicted to be a legitimate website.
• False Positive (FP): The website is a legitimate website but is predicted to be a phishing website.
Table 1 shows the evaluation metrics produced after modeling phishing website detection.
Figure 3 shows the Receiver Operating Characteristic (ROC) curve that illustrates the diagnostic ability of Random Forest. An ROC space is defined by the False Positive Rate on the x axis and the True Positive Rate on the y axis. The space shows the relative trade-off between true positives (a phishing website successfully detected as a phishing website) and false positives (a legitimate website detected as a phishing website). In this figure, the diagonal divides the ROC space. Points above the diagonal line represent classification results that are better than random, while points below the line represent results that are worse than random. The closer the curve is to the upper left corner, the better the prediction. However, the distance from the diagonal line (random result) already shows the predictive power of Random Forest. The details of the prediction results are shown in Table 2. Based on the results, we can see that Random Forest achieves an accuracy of only 65.6%.
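For reference, the standard definitions of accuracy, precision, recall, and the two ROC axes can be computed from the TP, FP, TN and FN counts as sketched below. The function names and the string labels "phishing" and "legitimate" are assumptions chosen for illustration, and returning 0.0 for an empty denominator is a convention of this sketch.

```python
def confusion_counts(y_true, y_pred, positive="phishing"):
    """Count TP, FP, TN and FN, treating `positive` as the phishing class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # phishing predicted as phishing
        elif truth == positive:
            fn += 1  # phishing predicted as legitimate
        elif pred == positive:
            fp += 1  # legitimate predicted as phishing
        else:
            tn += 1  # legitimate predicted as legitimate
    return tp, fp, tn, fn


def evaluate(tp, fp, tn, fn):
    """Return accuracy, precision, recall and false positive rate."""
    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),  # also the True Positive Rate (ROC y axis)
        "false_positive_rate": ratio(fp, fp + tn),  # ROC x axis
    }
```

Each decision threshold of a classifier gives one (false_positive_rate, recall) pair, and the ROC curve is traced by sweeping that threshold.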

Conclusion
Prediction of phishing websites is essential in order to protect Internet users from becoming phishing victims as well as organizations from losing their brand and customers. This paper presented a phishing website model using Random Forest, a data mining classification algorithm that is able to automatically classify phishing pages. This project has proposed a phishing detection model and evaluated it in terms of accuracy. The accuracy of 67% indicates that the feature space should be explored further in future work so that the selected features have more predictive power.