PhiBoost - A Novel Phishing Detection Model Using Adaptive Boosting Approach

Cyberattacks increase every day and use a variety of strategies. One of the most common is phishing, in which the attacker collects sensitive and confidential information by pretending to be a trusted party. Several traditional anti-phishing strategies have been introduced, such as blacklisting, heuristic search and visual similarity. Most of these traditional methods have a high false rate and take a long time to detect a phishing website. Newer models use machine learning techniques, which improve detection accuracy. Machine learning techniques require a large amount of data, in the form of features collected from different websites. These collected features are classified into four categories. This paper introduces a novel detection model that uses feature selection to pick the features most highly correlated with the class label. The feature selection phase employs the independent significance features library from MATLAB and a heat map from Python to find the highly correlated features. The proposed model then uses an adaptive boosting approach, which combines multiple classifiers, to increase the model's accuracy. The proposed model achieves a predictive accuracy of approximately 99%.


INTRODUCTION
Phishing is against the law. It uses social engineering and technical tricks to steal Internet users' private identity data and financial account credentials. Social engineering schemes prey on unwary victims by fooling them into believing that they are dealing with a trusted, legitimate party, using deceptive e-mail addresses and e-mail messages [1]-[2].
Disasters have always been a good opportunity for criminals to launch cyberattacks. Phishers have run campaigns that take advantage of hurricanes, recessions and other challenging times, promoting fake charitable giving opportunities and nonexistent services or products. One of the most recent world catastrophes is the COVID-19 pandemic of 2020. The Anti-Phishing Working Group (APWG) classifies four cybercriminal methods that represent more complicated scenarios to lure their victims [3]-[4].
Several studies have examined phishing attacks and their consequences for customer trust in e-commerce and online services [5]. Phishing attackers create a website that poses as a trusted website to collect valuable and sensitive information from Internet users. At the same time, different anti-phishing software models for phishing detection have been introduced. Phishing detection strategies are classified into seven categories [6], as follows:
1. User education: this category depends on educated Internet users to distinguish between a legitimate and a phishing website [7].
2. Blacklist creation: this strategy maintains a centralized list of phishing websites and compares a URL with the list to determine whether the URL is legitimate [8].
3. Heuristic blacklist methods: in this strategy, the system identifies the signature of the phishing URL and blacklists it for future use by intrusion detection systems [9].
4. Visual similarity: these techniques use website features (page source code, images, textual content, text formatting, HTML tags, CSS, website logo) to measure the similarity between websites. The system then compares the new website with previously visited ones and decides whether it is a legitimate or a phishing website [10].
5. Search engine-based techniques: in this mode, the system uses a search engine to extract the website features and then checks the website's legitimacy. However, the search engine does not give precise output for non-English search queries [11].
6. Supervised machine learning: the detection system applies supervised machine learning models to phishing datasets with predefined features [12].
7. Deep learning techniques: these techniques include Gated Recurrent Unit (GRU) networks and Convolutional Neural Networks (CNN). With these techniques, the system automatically extracts the features from the generic URL, the file directory, etc. [13].
Table 1 shows a summary of phishing detection strategies and their main drawbacks. For supervised machine learning detection, the achieved performance depends on the feature selection and the classification algorithms.
The rest of the paper is organized as follows. Section 2 presents a review of related work. Section 3 describes the proposed methodology. Section 4 reports the experimental results. Section 5 concludes the paper.

LITERATURE REVIEW
Many research papers have worked intensively on website security; some addressed routing security [14], while others dealt with intrusion detection, intrusion prevention and smart grid security [15].
Pawan Prakash et al. proposed two methods to identify phishing websites. The first method used five heuristics to enumerate combinations of known phishing websites in order to discover new phishing websites. The second method used matching algorithms to discover new phishing websites [16].
Samuel Marchal et al. analyzed and evaluated website URLs and extracted features from them. Based on several queries to the Google and Yahoo search engines, the authors determined the keywords for each website. The keywords, together with the extracted features, were then used in a machine learning classification algorithm to identify the phishing websites in a real dataset [17]. In [18], the authors introduced models that use machine learning and data mining algorithms to detect website phishing.
The authors in [19] used an artificial neural network to spot phishing websites. The proposed network used 17 input neurons for 17 characteristics, one hidden layer and two output neurons to decide whether or not the website is phishing. The dataset was divided into 80 percent for training and 20 percent for testing. The suggested model achieved 92.48 percent accuracy.
The authors in [20] introduced a machine learning model called PILFER. This model requires the age of the URL's domain. In addition, ten features are extracted and a Random Forest (RF) model is used to identify phishing. This model correctly identified 96 percent of phishing e-mails. Classification models are also used to identify phishing from labeled datasets. Different classification methods use different features, such as URL-based and text-based features.
A feature selection framework called Hybrid Ensemble Feature Selection (HEFS), which identifies phishing websites using machine learning algorithms, is presented in [21]. A cumulative distribution function gradient technique is used to extract the primary feature set. The second set of features is then extracted using a method called data perturbation ensemble. Random Forest (RF), an ensemble learner, is subsequently used to identify phishing websites. The results indicated that HEFS identified phishing features with a precision of up to 94.6 percent.
In 0, the authors selected the most suitable features for identifying website phishing and proposed two detection techniques based on machine learning algorithms: the AdaBoost classifier and the LightGBM classifier. When combined, they form a hybrid classifier. These two algorithms proved to be effective and efficient in improving on the accuracy of single classifiers in detecting web phishing attacks.
In 0, the authors investigated whether a definitive set of features for detecting phishing webpages can be agreed upon. Using three standard datasets, the authors applied Fuzzy Rough Set (FRS) theory to select the most significant features for detecting phishing webpages. The chosen features were then fed into three standard classifiers to detect phishing. With Random Forest classification, the maximum accuracy obtained with FRS feature selection was 95%. FRS was also applied across the three datasets to derive nine universal features for detecting phishing. With these universal features, the accuracy was about 93%, which is comparable to the FRS performance, with a difference of only 2%.
The authors of 0 proposed three ensemble learning models based on the Forest Penalizing Attributes (Forest PA) algorithm. The algorithm exploits the strength of all attributes in a given dataset, using a weight assignment and weight increment strategy to build highly effective decision trees. The experimental results showed highly efficient meta-learners with an accuracy of 96.26%.

MOTIVATION AND MAIN CONTRIBUTION
Each phishing attack has its own salient features; however, phishing attacks also exhibit similarities and patterns. Machine learning methods can therefore be used to detect these shared patterns and recognize phishing websites 00.
In this paper, a detection model is introduced that uses feature selection to pick the features of phishing websites that are correlated with the class label. The independent significance features library from MATLAB and a heat map from Python are employed in the feature selection to find the associated features of phishing websites. The proposed detection model consists of multiple classifiers combined by an adaptive boosting technique to increase the model's accuracy.
The AdaBoost classifier is selected as an efficient technique for detecting website phishing because it is flexible and straightforward, yet has a high generalization performance 0-0. Because it is built from several weak classifiers, it is flexible and straightforward to implement. It also does not rely on large sets of features, some of which may be unnecessary, but treats each class's attributes separately 00. Moreover, the AdaBoost classifier achieves high accuracy because it corrects the errors of the weak classifiers, and it needs far fewer settings than other robust classifiers 0-0.

PRELIMINARIES
This section provides a brief description of the phishing dataset used in the experimental comparison, as well as background on feature selection and on the classification model used in this study.

Feature Selection
Feature selection chooses a subset of features that work well together. It aims to minimize the time needed to build the machine learning model while producing high accuracy. The selection process keeps features that have a low correlation with each other but a high correlation with the class label [28]. The remaining features, which are highly correlated with features already kept, are dropped.
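The selection rule described above can be illustrated with a minimal sketch. It is not the procedure used in this paper (which relies on MATLAB and a Python heat map); it is a plain-Python approximation under stated assumptions: the helper names `pearson` and `select_features`, the two thresholds, and the toy columns `ssl_copy` and `noise` are all hypothetical, and the feature values are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(var_x * var_y)


def select_features(columns, label, min_label_corr=0.3, max_pair_corr=0.8):
    """Keep features correlated with the label but not with each other.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    relevance = {name: abs(pearson(values, label))
                 for name, values in columns.items()}
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    selected = []
    for name in ranked:
        if relevance[name] < min_label_corr:
            break  # every remaining feature is even less relevant
        redundant = any(
            abs(pearson(columns[name], columns[kept])) > max_pair_corr
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected


# Toy data: 1 = legitimate, -1 = phishing (made-up values).
label = [1, 1, 1, 1, -1, -1, -1, -1]
columns = {
    "SSLfinal_State": [1, 1, 1, 0, -1, -1, -1, 1],
    "ssl_copy": [1, 1, 1, 0, -1, -1, -1, 1],    # hypothetical duplicate column
    "web_traffic": [1, 1, -1, 1, -1, 1, -1, -1],
    "noise": [1, -1, 1, -1, 1, -1, 1, -1],      # hypothetical irrelevant column
}
selected = select_features(columns, label)
print(selected)
```

In this toy run the duplicate column is dropped as redundant and the uncorrelated column is dropped as irrelevant, which mirrors the two criteria stated in the text.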

Adaptive Boosting
AdaBoost (adaptive boosting) is most commonly applied with decision trees as weak learners on binary classification problems. It is usually used for discrete datasets, so it is more closely associated with classification than with regression. The AdaBoost algorithm updates the sample weights to minimize the error, which in turn minimizes the misclassification rate. Freund, Schapire and Abe 0 developed the AdaBoost algorithm to increase the efficiency of binary classifiers. AdaBoost is an ensemble learning method that learns from the mistakes of weak classifiers and combines them into a strong one. It first trains a weak learner on the original training data. The data is then re-weighted according to that learner's performance for the next round of weak learner training, so that the training samples predicted least accurately in the preceding round receive more attention in the following round. Finally, the weak learners are combined with different weights to create a strong learner 0-0.
Figure 1 shows the flow diagram of the system that recognizes the URL. The proposed system reads the URL from the dataset and decomposes it into multidimensional features according to the dataset components. The model's detection accuracy is improved by selecting the most correlated features and eliminating the irrelevant ones. The filtered data is split into a training set and a testing set. The machine learning model is built by applying an adaptive boosting classifier to the training set, which creates the adaptive boost knowledge base. The testing set is then used as the input to the detection model to evaluate it.
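The re-weighting loop described above can be sketched in a few lines. This is a minimal illustration of discrete AdaBoost with one-feature decision stumps as weak learners, written in plain Python; it is not the Weka configuration used in the experiments. The function names, the number of rounds and the toy dataset (three binary features whose majority vote gives the label) are assumptions made for the example.

```python
from math import exp, log


def stump_predict(row, feature, threshold, polarity):
    """Weak learner: a one-feature threshold test returning +1 or -1."""
    return polarity if row[feature] > threshold else -polarity


def best_stump(X, y, weights):
    """Return the stump with the lowest weighted error on the training data."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                error = sum(
                    w for row, target, w in zip(X, y, weights)
                    if stump_predict(row, feature, threshold, polarity) != target
                )
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best


def adaboost_fit(X, y, rounds=3):
    """Train `rounds` stumps, re-weighting the samples after each round."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, feature, threshold, polarity = best_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * log((1 - error) / error)     # vote weight of this stump
        ensemble.append((alpha, feature, threshold, polarity))
        # Misclassified samples gain weight, correctly classified ones lose it.
        weights = [
            w * exp(-alpha * target * stump_predict(row, feature, threshold, polarity))
            for row, target, w in zip(X, y, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, row):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(row, f, t, p) for alpha, f, t, p in ensemble)
    return 1 if score >= 0 else -1


def accuracy(ensemble, X, y):
    hits = sum(adaboost_predict(ensemble, row) == target for row, target in zip(X, y))
    return hits / len(X)


# Toy data: the label is the majority vote of three binary features.
X = [(a, b, c) for a in (1, -1) for b in (1, -1) for c in (1, -1)]
y = [1 if sum(row) > 0 else -1 for row in X]

single = adaboost_fit(X, y, rounds=1)
boosted = adaboost_fit(X, y, rounds=3)
print(accuracy(single, X, y), accuracy(boosted, X, y))
```

On this toy data no single stump can classify every sample, while three boosted stumps fit the training set. The figures are training accuracy on made-up data and say nothing about the results reported in this paper.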

PROPOSED MODEL
The proposed model is implemented using Weka 3.6, Python and MATLAB. Table 4 lists the experimental parameters: the evaluator, the search algorithm, the batch size, the classifier, the number of iterations and the weight threshold.

DISCUSSION OF RESULTS
The proposed model classifies the features into four categories based on the correlation between the features and the class label (phishing or legitimate).
The feature selection process outputs nine features: having_IP_Address, having_Sub_Domain, SSLfinal_State, web_traffic, Google_Index, Request_URL, URL_of_Anchor, Links_in_tags and SFH. In the next feature selection phase, the MATLAB procedure for the independent significance features test (IndFeat()) is invoked. Figure 2 shows the Python heat map of the output of the independent significance features test.
Four popular statistical measures were used to determine the efficiency of the proposed model. Table 5 lists these performance measures and their effects on the model performance. Accuracy is the ratio of correctly predicted observations to the total number of observations. Precision is the ratio of correctly predicted positive observations to all observations predicted as positive. Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. The F-measure is the harmonic mean of precision and recall.
Table 6 shows the experiments conducted with different percentage splits. The minimum accuracy achieved by the proposed model is 97.7%, with an F-measure of 97.5%, when the model is trained on 50% of the dataset. The best performance is obtained when the training percentage is 70%, where both accuracy and F-measure are approximately 99%. Figure 3 shows the efficiency of the PhiBoost model in terms of precision and accuracy for different training and testing percentages, which were explored to avoid overfitting. PhiBoost achieved its minimum accuracy when the training set is 50% of the dataset, and its performance improves when the training set is 70%. In Table 7, the proposed model is compared with different machine learning detection models.
Table 7. Accuracy of the compared detection models.
Feed-forward NN: 97.40%
[24] Logistic regression classifier: 98.40%
[25] Naïve Bayesian classifier: 90%
[26] HNB and J48: 96.25%
[27] Multilayer perceptron neural network: 98.5%
PhiBoost model: 98.9%
As demonstrated by the results obtained, the proposed model enhances the accuracy of the detection system. In [27], the authors introduced a phishing detection model that uses feature selection and combination as a pre-processing step for the dataset. After that, they employed a multilayer perceptron neural network as the classifier. In our proposed work, we tried to optimize the accuracy by minimizing the number of selected features and using the adaptive boosting classifier.
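The four performance measures used in this section follow directly from the confusion-matrix counts. The sketch below shows the standard definitions in plain Python. The function name `evaluate`, the choice of 1 as the positive label and the example predictions are assumptions made for illustration and are not taken from the experiments.

```python
def evaluate(actual, predicted, positive=1):
    """Compute accuracy, precision, recall and F-measure for binary labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # true positive
        elif p == positive:
            fp += 1  # false positive
        elif a == positive:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up example: 10 observations, one false negative and one false positive.
actual = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
predicted = [1, 1, 1, -1, 1, -1, -1, -1, -1, -1]
scores = evaluate(actual, predicted)
print(scores)
```

Because the F-measure is the harmonic mean of precision and recall, it stays low whenever either of the two is low, which is why it is reported alongside accuracy for each percentage split.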

CONCLUSION
This paper aims to introduce an effective solution to the threat of phishing in our modern community. To that end, this research proposed combining feature selection and adaptive boosting into an efficient model for detecting phishing websites. The study explored the best splitting rate of the dataset for training the machine learning model, which was 70%. The results show high accuracy and a high F-measure, with high predictive capability as well as low false-positive and false-negative rates. The proposed model minimizes the time needed to build the training model by picking the most correlated features, and it achieves a predictive accuracy of approximately 99%. In future work, the system's capability will be investigated by testing it in a real-time environment.