Look Before You Leap: Detecting Phishing Web Pages by Exploiting Raw URL And HTML Characteristics

Phishing websites distribute unsolicited content and are frequently used to commit email and internet fraud; detecting them before any user information is submitted is critical. Several efforts have been made to detect these phishing websites in recent years. Most existing approaches use hand-crafted lexical and statistical features from a website’s textual content to train classification models to detect phishing web pages. However, these phishing detection approaches have a few challenges, including 1) the tediousness of extracting hand-crafted features, which require specialized domain knowledge to determine which features are useful for a particular platform; and 2) the difficulties encountered by models built on hand-crafted features to capture the semantic patterns in words and characters in URL and HTML content. To address these challenges, this paper proposes WebPhish, an end-to-end deep neural network trained using embedded raw URLs and HTML content to detect website phishing attacks. First, the proposed model automatically employs an embedding technique to extract the corresponding characters into homologous dense vectors. Then, the concatenation layer merges the URL and HTML embedding matrices. Following that, Convolutional layers are used to model its semantic dependencies. Extensive experiments were conducted with real-world phishing data, which yielded an accuracy of 98.1%, showing that WebPhish outperforms baseline detection approaches in identifying phishing pages.


INTRODUCTION
Recently, phishing has become a go-to type of attack for cybercriminals because it is cost-effective and requires little technical knowledge [1]. A phishing attack is launched primarily through spam email. Links embedded within these emails often lead to phishing web pages. In April 2020 alone, Gmail intercepted over 100 million spam emails daily, 18 million of which were COVID-19 pandemic phishing attacks. The scale of this cyber-attack has necessitated the attention of industrial and academic experts.
Several techniques have been proposed to combat phishing attacks on desktop [2][3][4] and mobile web pages [5]. Most phishing detection schemes are categorized by the underlying technology used to identify the phishing website. The techniques are mainly classified as search engine based [6], [7], [8], statistical machine learning based [4], phishing blacklist and whitelist based [9], [10], and visual similarity based methods [6], [11]. Research has shown that search engine based techniques are fast and less computationally intensive. Despite their speed, search engine based techniques depend on third-party applications. They are also prone to high false positive rates (FPR).
Consequently, the use of machine learning methods has become popular. This popularity stems from their independence from external systems and their ability to detect zero-day phishing attacks, which reduces FPR. Most existing machine learning-based phishing detection methods, such as the works proposed by [12] and [13], are built using manually engineered features from the prominent components of a web page, such as the URL, HTML, and network modules. The features fed into the machine learning classification model can be discrete or binary variables. Although these techniques have proven successful, they have some limitations. Manual feature engineering can be tedious. It requires specialized domain knowledge to establish which features will be useful for a specific platform. Also, models built on manual features have difficulty accommodating new data. Therefore, they cannot detect phishing web pages with updated content and structure. This challenge necessitates regular upgrading of the feature set.
Motivated by the above challenges, we propose WebPhish, an end-to-end deep neural network that takes advantage of both the URL and HTML content in their raw form to detect a phishing attack. We employ an embedding technique to automatically extract features by mapping the corresponding characters into homologous dense vectors. Subsequently, the concatenation layer merges the URL and HTML embedding matrices. Then, a Deep Neural Network (DNN), specifically a Convolutional Neural Network (CNN), is used to model their semantic dependencies.
The main contributions of this work are as follows:
• Different from existing methods, our proposed model, WebPhish, is to the best of our knowledge the first to use a concatenation of the raw content of the HTML and URL to determine the maliciousness of a web page using deep neural networks. Automatic feature selection is applied while the Convolutional Neural Networks learn semantic dependencies in the input features' temporal process.
• Using a robust automated feature selection technique, the proposed work reduces the difficulties faced by existing systems based on manually engineered features, such as their lack of flexibility to accommodate new data and the need for specialized domain knowledge.
• Furthermore, as there are a limited number of characters, the use of character-level features in the proposed model enables the embedding feature vectors to generalize to new textual data. This technique helps WebPhish detect zero-day phishing attacks.
• We conduct extensive analysis on a real-world dataset of more than 50,000 URLs and HTML documents collected over two months. The distribution of the instances in the corpus reflects the ratio of phishing and legitimate web pages obtainable on the Internet. This approach ensures that our evaluation metrics and results are extendable to existing real-life systems.
• Experimental results show that the proposed model significantly outperforms state-of-the-art methods, demonstrating the validity of our approach.
We organized the remainder of the paper as follows: the next section provides an overview of related works on techniques for detecting phishing on web pages. Section 3 provides an in-depth description of our proposed model. Section 4 elaborates on the dataset collection and evaluation metrics used to analyze WebPhish. The detailed results of the evaluations of our proposed model are in Section 5. Finally, we conclude our paper in Section 6.

RELATED WORKS
This section reviews the most common technologies used for phishing detection, specifically list-based methods, statistical machine learning based on manually designed features, and automatic feature extraction using deep neural networks.

Phishing Detection Using List Based Methods and Manual Feature Engineering
The list-based methods reviewed in this section use a whitelist of legitimate websites and a blacklist of unverified websites to detect phishing. The blacklist is obtained through user reviews or from third parties that use one of the other phishing detection mechanisms to identify phishing URLs. In contrast, the machine learning methods described below extract features of malicious and legitimate websites from text, image, or URL-specific content. A group of algorithms and specified thresholds are then applied to these features to determine the maliciousness of the web page.
[14] used natural language processing techniques to analyze the semantic meaning of a given sentence to detect a social engineering attack. Their approach, named SEAHound, analyses a given document for signs of phishing attacks such as urgency in the tone of the message, a malicious link, or a generic greeting. [15] proposed a phishing detection technique that extracts 212 features from the URL, web page content, and registered domain name components of a web page. A Gradient Boosting classifier trained on the extracted features determines a given web page's legitimacy. Likewise, [16] proposed an online phishing detection system, PEDS, composed of a reinforcement learning agent trained on 50 features extracted from the URL, HTML content, and email body and header. The proposed model can mitigate the problem of a limited dataset using an updated offline database. [17] proposed a machine learning-based phishing detection approach that extracts client-side features from the URL and HTML content of a web page. Their approach yielded 99.09 percent accuracy with a random forest classifier on a dataset of 2,141 phishing and legitimate web pages.
The authors of [18] proposed a whitelist-based method that relies on a centralized architecture. The method further examines the hyperlinks in the HTML source code for null links, empty hyperlinks, and external links to determine the web page's maliciousness. Also, Google provides a Safe Browsing application that allows the browser to verify URLs against a list of suspicious domains, which is regularly updated by Google [19].
Although the list-based methods tend to keep the FPR low, a significant shortcoming is that the lists are not exhaustive and fail to detect zero-day attacks.

Phishing Detection Using Automatic Feature Selection on Deep Neural Networks
The techniques reviewed in this section take raw URL or HTML content as input and apply the extracted features to a deep neural network to determine a web page's legitimacy. [20] designed a model that receives the raw URL as input, transforms it into a one-hot encoded vector, and applies LSTM units to determine if the URL is phishing. The model yielded an accuracy of 98.7 percent on a corpus of 2 million phishing and legitimate URLs. In contrast, [21] transformed the raw URLs into word embeddings and then applied convolutional filters. [22] proposed a model named URLNet, built with a concatenation of convolutional neural networks applied to character and word embedding matrices generated from the input URL. Also, [23] proposed HTMLPhish, a deep neural network model that takes only raw HTML content as input and uses both character and word embedding techniques. The concatenation of these two embeddings represents the features of each HTML document. Subsequently, convolutional layers were applied to model the semantic dependencies. HTMLPhish yielded an accuracy of 94 percent on a dataset of over 50,000 instances.
Despite the similarity between the existing techniques discussed in this section and our proposed model, WebPhish, there are still some significant differences and contributions. Current approaches use either only the URL or only the HTML of a web page as input to the network. However, WebPhish exploits the benefits of using both URL and HTML in their raw form while maintaining strong performance and reasonable computational costs, even on an imbalanced dataset.

THE PROPOSED MODEL
In this section, we elaborate on the architecture of our proposed deep neural network model, WebPhish. Deep learning techniques have been successful in many Natural Language Processing (NLP) tasks, for example, document classification [24] and machine translation [25]. Recurrent neural networks (e.g., LSTM [26]) are applied extensively because of their ability to exhibit temporal behaviour and capture sequential data. However, CNN is best suited for text classification and sentiment analysis, as CNN learns to recognize patterns across space [27].
We define the problem of detecting phishing web pages using their URL and HTML content as a binary classification task with two classes: legitimate or phishing. Given a dataset of $R$ web pages $\{(u_1, h_1, y_1), \ldots, (u_R, h_R, y_R)\}$, where $u_r$ and $h_r$ for $r = 1, \ldots, R$ represent the URL and HTML content of the $r$-th web page, $y_r \in \{0, 1\}$ is its label: $y_r = 1$ corresponds to a phishing web page, while $y_r = 0$ corresponds to a legitimate one.

The Deep Neural Network Model
As detailed in Figure 2a, WebPhish is a deep neural network comprising the following layers: 1. Input layer 2. Embedding layer 3. CNN layer 4. Fully Connected (FC) layer 5. Sigmoid layer. We employ CNN kernels to learn the temporal relations in the input features for the web page classification. We also apply an Embedding layer to extract useful features from the HTML content. At the same time, the FC layers serve as an additional layer to extract other relevant characteristics. Finally, the sigmoid layer outputs the results of the deep neural network model.
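The layer sequence above can be illustrated with a minimal NumPy forward pass. This is only a sketch under stated assumptions: the embedding dimension (16) and sequence lengths (180 and 2000) are the values reported for the model, but the vocabulary size used for HTML, the filter count, kernel size, FC width, the use of global max pooling, and concatenation along the sequence axis are hypothetical choices made for illustration, not the configuration in Table 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Values reported for the model: embedding dimension and sequence lengths.
D, LEN_URL, LEN_HTML = 16, 180, 2000
# Hypothetical values, chosen only to keep the sketch small.
VOCAB_URL, VOCAB_HTML = 75, 500
N_FILTERS, KERNEL, FC_UNITS = 8, 5, 10

params = {
    "E_url": rng.normal(scale=0.1, size=(VOCAB_URL, D)),
    "E_html": rng.normal(scale=0.1, size=(VOCAB_HTML, D)),
    "W_conv": rng.normal(scale=0.1, size=(KERNEL, D, N_FILTERS)),
    "b_conv": np.zeros(N_FILTERS),
    "W_fc": rng.normal(scale=0.1, size=(N_FILTERS, FC_UNITS)),
    "b_fc": np.zeros(FC_UNITS),
    "w_out": rng.normal(scale=0.1, size=FC_UNITS),
    "b_out": 0.0,
}


def conv1d_relu(x, w, b):
    """Valid 1-D convolution over the sequence axis, followed by ReLU.

    x: (length, d), w: (kernel, d, filters), b: (filters,)
    """
    k = w.shape[0]
    # windows has shape (length - k + 1, d, k)
    windows = np.lib.stride_tricks.sliding_window_view(x, k, axis=0)
    out = np.einsum("ldk,kdf->lf", windows, w) + b
    return np.maximum(out, 0.0)


def forward(url_ids, html_ids, p):
    """Embedding -> concatenation -> CNN -> max pooling -> FC -> sigmoid."""
    url_emb = p["E_url"][url_ids]                      # (LEN_URL, D)
    html_emb = p["E_html"][html_ids]                   # (LEN_HTML, D)
    merged = np.concatenate([url_emb, html_emb], axis=0)
    features = conv1d_relu(merged, p["W_conv"], p["b_conv"])
    pooled = features.max(axis=0)                      # (N_FILTERS,)
    hidden = np.maximum(pooled @ p["W_fc"] + p["b_fc"], 0.0)
    logit = hidden @ p["w_out"] + p["b_out"]
    return 1.0 / (1.0 + np.exp(-logit))                # phishing probability


url_ids = rng.integers(0, VOCAB_URL, size=LEN_URL)
html_ids = rng.integers(0, VOCAB_HTML, size=LEN_HTML)
prob = forward(url_ids, html_ids, params)
```

With untrained random weights the output is close to 0.5; the point of the sketch is only the data flow between layers.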
Table 3 shows the configuration of the layers of the proposed deep neural network model. The output dimension of the embedding layer, the kernel size and filters in the CNN layer, and the number of units in the FC layers are detailed. Taking raw URL and HTML content as input, we conduct tokenization on the input data and segment the strings into character tokens.
An index is then associated with each token from a finite dictionary $M$. We determined $M$ by counting the number of unique characters, including punctuation marks, in the URL and HTML corpora. We obtained $M_u = 75$ unique characters for the URL corpus and $M_h = 380514$ unique characters for the HTML corpus.
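The character tokenization and indexing step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and reserving index 0 for padding and unknown characters, as well as padding every sequence to a fixed length, are assumptions of this sketch.

```python
def build_char_index(corpus):
    """Map every unique character in the corpus to an integer index.

    Index 0 is reserved for padding and unknown characters (an assumption).
    """
    chars = sorted(set("".join(corpus)))
    return {ch: i for i, ch in enumerate(chars, start=1)}


def encode(text, char_index, max_len):
    """Turn text into a fixed-length list of character indices.

    The text is truncated to max_len and right-padded with 0.
    """
    ids = [char_index.get(ch, 0) for ch in text[:max_len]]
    return ids + [0] * (max_len - len(ids))


# Hypothetical example URLs (prefixes already removed).
urls = ["example.com/login", "paypal.example.net/account"]
index = build_char_index(urls)
encoded = encode(urls[0], index, max_len=20)
```

The size of the dictionary equals the number of unique characters in the corpus, which is how a count such as the 75 URL characters reported above would be obtained.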
3.1.2 Embedding Layer. The index associated with each character through the finite dictionaries $M_u$ (URL corpus) and $M_h$ (HTML corpus) carries little useful information on its own, so the character embedding matrix maps each index to a feature vector. Specifically, for each input, the raw data is processed into character embedding matrices made up of character-level feature representations. The embedding matrices are randomly initialised and gradually modified during training by backpropagation, which structures them into a vector space that is relevant to the phishing detection model and is exploited by the convolutional layers. Therefore, $u \rightarrow \mathbb{R}^{L_u \times d}$ and $h \rightarrow \mathbb{R}^{L_h \times d}$, where $L_u$ and $L_h$ are the sequence lengths of each URL and HTML instance, respectively, and $d$ is the dimension of the embedding. We experimentally selected $L_u = 180$ and $L_h = 2000$, with $d = 16$. Figure 1 shows the process in the embedding layer of the proposed model. We chose character embeddings for our model instead of word embeddings because of some inherent challenges with word embedding techniques. For instance, word embedding techniques cannot extrapolate their learning to unfamiliar words, and the number of unique words depends on the given dataset. The character embedding technique handles these limitations efficiently because there is a finite number of characters and punctuation marks available. This attribute enables the character embedding technique to extract patterns from unfamiliar words.
3.1.7 Optimization. We trained WebPhish using Adam, a method for stochastic gradient optimization proposed by [28]. Adam is a blend of two conventional optimization methods: AdaGrad [29], the adaptive gradient algorithm, and RMSProp [30], which adds a decay term. Adam computes individual adaptive learning rates for the different network parameters from estimates of the first and second moments of the gradients. Research shows that Adam performs as well as or better than some other optimization methods [31], regardless of the hyperparameter settings. Since our deep learning model WebPhish is a binary classification network, we used binary cross-entropy to monitor its performance.
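For readers unfamiliar with the optimizer, a single Adam update and the binary cross-entropy loss can be written in a few lines of NumPy. This is a generic sketch of the published Adam rule, not the training code used for WebPhish; the default learning rate mirrors the 0.0015 reported in the experimental setup, while the remaining defaults are the standard values from the Adam paper.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=0.0015, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))


w = np.array([1.0, -2.0])
g = np.array([0.5, -3.0])
w_new, m, v = adam_step(w, g, np.zeros(2), np.zeros(2), t=1)
loss = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.5, 0.5]))
```

On the first step the bias-corrected moments reduce to the gradient and its square, so each parameter moves by roughly the learning rate in the direction opposite to its gradient sign.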

WebPhish Variants
Alongside our proposed end-to-end framework combining character-level embeddings of both the URL and HTML content, we also derived three variants, namely: 1. WebPhish-LSTM, 2. WebPhish-URL, and 3. WebPhish-HTML. Their architectures are detailed in Figures 2b, 3a, and 3b. WebPhish-URL and WebPhish-HTML are CNN models trained on only URL and only HTML content, respectively. The character embedding matrix from the Embedding layer is passed to the CNN and Max-Pooling layers, whose output is subsequently passed into the FC layers, and the results are output through the Sigmoid layer. On the other hand, WebPhish-LSTM is a deep learning model that uses recurrent neural networks, specifically Long Short-Term Memory (LSTM), to learn the representation of the URL features and HTML content. The character embedding matrix from the Embedding layer is passed to the LSTM layer, whose results are concatenated and passed into the FC layers. Classification results are output through the Sigmoid layer.

EVALUATION OF WEBPHISH
This section elaborates on the experiments conducted to investigate the efficiency of our proposed phishing detection method. Table 1 shows the dataset used for each experiment. Given below is an outline of each experiment:
• Experiment 1 verifies the effectiveness of our proposed method in detecting phishing on the D1 dataset (Section 5.1) and compares its performance with state-of-the-art methods (Section 5.4).
• Experiment 2 is a longitudinal study to demonstrate the temporal resistance of our proposed phishing detection approach (Section 5.2). This experiment illustrates how our model performs when detecting a phishing attack on a freshly collected dataset (D2 dataset).
• Experiment 3 shows the influence of the embedding, CNN, and FC layers on the ability of our proposed model to detect phishing on web pages (Section 5.3).
• Experiment 4 (Section 5.6) demonstrates the application of our DNN model on the US airline dataset to classify customer reviews according to their sentiments. This experiment shows that our proposed model can perform well on other textual datasets.

WebPhish Dataset
We collected real-world datasets from Alexa.com for the legitimate web pages and phishtank.com for the phishing web pages to train our models. Using the Beautiful Soup [32] … Also, to ensure our model's deployability to real-world applications, our dataset reflects the ratio of phishing to legitimate web pages obtainable on the Internet (≈ 10/100) [33,34]. In summary, our corpus contained 47,000 legitimate URL and HTML documents and 4,700 phishing URL and HTML documents, as shown in Table 2.
Note: Our dataset contains web pages written in different languages. Therefore, our model is not limited to detecting only English web pages. Also, we removed URL prefixes such as http://, https://, and www to prevent skewed results on different URL datasets and to reduce FPR. We manually sanitized our corpus to remove duplicates and web pages pointing to empty content.
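The prefix removal described in the note can be sketched with a single regular expression. The function name and the exact pattern are assumptions made for illustration; the original preprocessing script is not shown in the paper.

```python
import re

# Optional scheme followed by an optional "www." at the start of the URL.
_PREFIX = re.compile(r"^(?:https?://)?(?:www\.)?", re.IGNORECASE)


def strip_url_prefix(url):
    """Remove a leading http://, https://, and/or www. from a URL."""
    return _PREFIX.sub("", url.strip(), count=1)


cleaned = [strip_url_prefix(u) for u in (
    "https://www.example.com/a",
    "http://example.com",
    "www.example.com",
    "example.com/www.page",
)]
```

Only a prefix at the very start is removed, so a "www." that appears later in the path is left untouched.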

WebPhish Experimental Setup
A suitable combination of hyperparameters was needed to tune WebPhish and its variants. We conducted a grid search to select the best combination of the number of CNN layers (ranging from 1 to 3) and the number of FC layers (from 1 to 3) for our models. Table 10 and Table 9 show the accuracy and training time obtained on WebPhish-Full when the number of convolutional layers and the number of units in the FC layers were varied. Also, using the grid search, we determined the best optimization algorithm for the models (varying between RMSProp and Adam) within a range of learning rates (from 0.0001 to 0.1). Table 3 details the parameters that gave the best performance on our dataset, bearing in mind the unavoidable hardware limitations. We implemented all WebPhish variants in Python 3.5 on a TensorFlow 1.2.1 backend. We set the batch size for training and testing the model to 20. The Adam optimizer [28], with a learning rate of 0.0015, was used to update the network weights. At the same time, we used binary cross-entropy to monitor the performance of the model. The early stopping technique [35] was adopted to prevent overfitting on the training data. We conducted all WebPhish and baseline experiments in a Google Colaboratory environment with 12GB GDDR5 VRAM.
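The grid search over layer counts, optimizers, and learning rates can be sketched as an exhaustive loop over configurations. This is an illustrative outline only: the `toy_evaluate` function is a hypothetical stand-in for training a model and returning its validation accuracy, and the four learning-rate values are assumed sample points within the stated range.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Evaluate every configuration in the grid and return the best one."""
    keys = list(grid)
    best_cfg, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


grid = {
    "cnn_layers": [1, 2, 3],
    "fc_layers": [1, 2, 3],
    "optimizer": ["rmsprop", "adam"],
    # Assumed sample points between 0.0001 and 0.1.
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
}


def toy_evaluate(cfg):
    """Hypothetical placeholder for 'train the model, return validation accuracy'."""
    return -abs(cfg["learning_rate"] - 0.001) - 0.01 * cfg["cnn_layers"]


best_cfg, best_score = grid_search(toy_evaluate, grid)
```

In practice each call to the evaluation function is a full training run, so the 72 configurations in this grid are the dominant cost.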

Evaluation Metrics
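We evaluated the performance of WebPhish using the true negative rate, $TNR = \frac{TN}{TN + FP} \times 100$, and the false positive rate, $FPR = \frac{FP}{FP + TN} \times 100$, where $TN$ and $FP$ are the numbers of true negatives and false positives. These metrics measure the percentage of correctly and incorrectly classified legitimate URLs out of the total number of legitimate URLs. Finally, the accuracy of WebPhish was determined using Equation 1:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \qquad (1)$$

where $TP$ and $FN$ are the numbers of true positives and false negatives.
We also used the receiver operating characteristic (ROC) curve and the Area Under the Curve (AUC) in our evaluation. The ROC curve is a probability curve, while the AUC depicts how well the model can distinguish between the two classes: legitimate or phishing. The higher the AUC value, the better the performance of the model. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), where $TPR = \frac{TP}{TP + FN}$ and $FPR = \frac{FP}{FP + TN}$.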
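The metrics used in the evaluation follow from the four confusion-matrix counts. The sketch below uses the standard definitions, with phishing (label 1) as the positive class; the function names and the toy labels are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with label 1 (phishing) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall (TPR), FPR, and F-1 as fractions in [0, 1]."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "f1": f1,
    }


# Toy example: 3 phishing and 5 legitimate pages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Multiplying any of these fractions by 100 gives the percentage form used in the text.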

RESULTS
To document the performance of WebPhish and its variants on our corpus, we split the dataset into 80 percent for training, 10 percent for validation, and 10 percent for testing. Also, taking cognizance of our dataset's imbalanced nature, we ensured we manually shuffled our datasets before training.
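A shuffled 80/10/10 split can be sketched as below. The function name, the fixed seed, and the use of integer indices in place of real samples are assumptions for illustration; the example size of 51,700 matches the 47,000 legitimate plus 4,700 phishing instances in the corpus.

```python
import random


def split_dataset(samples, train_pct=80, val_pct=10, seed=42):
    """Shuffle the samples and split them into train/validation/test parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(51700))
```

A stratified split, which keeps the phishing-to-legitimate ratio identical in every part, would be a natural refinement for an imbalanced corpus, although the text only states that the data was shuffled.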

Experiment 1: Overall Result
In Figure 4a and Figure 4b, we show the ROC curves of WebPhish and its variants. As established in Table 4, WebPhish and its variants, state-of-the-art comparative models, and baseline models were trained and tested on the D1 dataset. The WebPhish-Full model outperformed all variants and baselines in every metric measured, with a precision of 99 percent on the D1 dataset. Amongst the WebPhish variants, WebPhish-HTML performed worst, with an accuracy of 96 percent on the D1 dataset. This outcome arises because phishing web pages, especially those hosted on compromised websites, are known to systematically copy the legitimate web page source code in order to blend in effortlessly.
In Table 7, the classification report details how the WebPhish variants performed for each class on the D1 dataset. For the legitimate class, the WebPhish-Full classifier accurately predicted 2394 of the 2410 legitimate instances, with an accuracy of 98 percent and an F-1 score of 98 percent. On the other hand, WebPhish-HTML classified 2359 of the legitimate cases correctly, with a precision of 97 percent and an F-1 score of 96 percent. Further analysis of the classification reports in Table 7 and Table 8 shows that WebPhish-Full, along with the other models, did not perform as well when classifying the phishing instances. For example, in Table 7, 26 out of 230 phishing instances were predicted to be legitimate. We can attribute this result to the imbalanced nature of our corpus. Even though the distribution of the cases in the corpus reflects the ratio of phishing and legitimate web pages obtainable on the Internet, further work is needed to improve the classifiers' ability to predict a higher percentage of the phishing class correctly. We will address this in our future work.
In general, WebPhish-Full significantly outperforms the other three variants, WebPhish-URL, WebPhish-HTML, and WebPhish-LSTM. WebPhish-Full yielded an average of over 98 percent across its precision, F-1 score, and recall metrics on the D1 dataset. WebPhish-Full takes advantage of the other variants' strengths and produces consistently better results while capturing local and temporal patterns in the data. Furthermore, the precision, recall, and F-1 score of WebPhish are well balanced, as their values are similar. This result indicates that WebPhish can accurately detect phishing web pages when deployed in the wild.

Experiment 2: Temporal Resilience of the WebPhish Model
The techniques for implementing a phishing web page are continuously evolving due to emerging technologies for designing phishing web pages. Evaluating resilience to this evolution is paramount for a phishing web page detection technique.
In this paper, we applied a longitudinal study [37] by evaluating the accuracy of WebPhish-Full using freshly collected data. This study enabled us to infer a maximum retraining period within which the system's accuracy does not degrade. For a security supplier deploying WebPhish-Full in the wild, the retraining period can provide an approximate cost of maintenance.
Using the evaluation metrics detailed above, we compared the accuracy of the WebPhish variants and baseline models on the training data D1 with their accuracy when applied to the test data D2 without retraining. From the results in Table 5, the accuracy of the models dropped by a few percentage points, specifically a minimal 4 percent for the WebPhish-Full model. This result could be due to phishing content structures that were not present in the training set. The outcome of our longitudinal study demonstrates the readiness of WebPhish-Full for real-world deployment. WebPhish-Full remains temporally robust and does not need retraining for at least two months.
Furthermore, to evaluate the model's performance when it is retrained with unseen data (D2 dataset), we measured the performance of WebPhish on the D2 dataset, using the evaluation metrics detailed above, when it was initialized with the parameters learned from training on the D1 dataset. We removed the sigmoid layer from the transferred parameters and replaced it with a new one. The new sigmoid layer was trained from scratch using backpropagation with data from the D2 dataset.
From Table 6, it is clear that the performance of WebPhish and its variants improved by at least 1 percent each, with reduced training and testing time, compared with their performance on the D1 dataset. With this improved performance, our model will remain efficient in detecting phishing attacks that imitate new target websites and use new features in their phishing architecture.
Note: Given that WebPhish-Full outperformed the other configurations, we use WebPhish-Full as the default setting. For the rest of this section, we will use WebPhish to indicate WebPhish trained with both URL and HTML as input unless otherwise stated.

Experiment 3: Influence of DNN Layers on Phishing Detection Accuracy
Many similar deep learning based phishing web page detection models employ common structures. However, in our proposed model, we configured a variable number of FC layers and CNN layers. Subsequently, we examine the effect of the FC layers and CNN layers on the accuracy of the proposed model, WebPhish.
We show the impact of the FC layers in Table 9. Intuitively, we expect that more FC layers will increase the accuracy of the model. Our analysis found that a configuration of 3 FC layers (2 FC layers and 1 Sigmoid layer) gave the best performance for our task on the D1 dataset. The proposed model achieved an accuracy of 0.979, 0.980, and 0.982 with 1, 2, and 3 FC layers, respectively.
Table 10 shows the effect of the number of CNN layers in the model. We found that 1 CNN layer gave the best balance, with a training time of 240 seconds and an accuracy of 0.982. Using 2 and 3 CNN layers, WebPhish achieves an accuracy of 0.983 and 0.984 with a training time of 241 seconds and 244 seconds, respectively.
Furthermore, we demonstrate the importance of the Embedding layer in our DNN model. We did this by checking the performance of WebPhish-Full on the D1 dataset when the embedding layer is replaced with manually engineered features, drawn from URL and HTML characteristics, as input to the CNN and fully connected layers. The manual features are listed in Table 12. Although the training time with manual features is shorter than with embedding features, we can see a 4 percent drop across all metrics in Table 11. This result highlights the importance of the character embedding matrix when analyzing textual content in the URL and HTML. It also demonstrates that the tedious manual feature engineering process could overlook some salient characteristics that differentiate a phishing web page from a legitimate one.

Comparison Study of WebPhish with State-of-the-art Techniques
We compared WebPhish-Full with the technique and efficiency of the state-of-the-art models in [20], [23], and [21]. [21] is a deep neural network with multiple layers of CNNs that takes as input word tokens from a URL to determine the maliciousness of the associated web page. On the other hand, [20] takes as input the character sequence of a URL. It then models its sequential dependencies using Long Short-Term Memory (LSTM) neural networks to classify a URL as phishing or benign. [23] takes as input both the character sequence and word sequence of the HTML content of a web page and uses CNN layers to learn its semantic dependencies.
Note: [20] and [21] were applied to only the URLs in our dataset, as the original models were built for phishing URL detection only, while [23] was applied to the HTML contents. Tables 4, 5, and 6 show the precision, recall, and F-1 score of WebPhish against the state-of-the-art models for the D1 and D2 datasets. The ROC curves of the state-of-the-art techniques are shown in Figures 5a and 5b. WebPhish outperforms all state-of-the-art techniques (with at least 1 percent improvement in accuracy) in all categories and metrics.
The advantage of using both raw URL and HTML content as input is evident in the performance of WebPhish when compared with [20] and [21], which use only the URL component. [21] performs worst amongst the deep neural networks because, although CNNs capture part of the temporal structure, they cannot capture the long-term sequential dependencies in the text features.

Comparison Study of WebPhish with Traditional Feature Engineering
We compared traditional machine learning classifiers that take linguistic and statistical analysis of URLs and HTML code as input with our DNN model, WebPhish. We investigate the effect of deep neural networks in improving phishing web page detection using raw URL and HTML content compared with simpler baseline models trained on manually engineered features. We used three machine learning models: logistic regression, kernel SVM, and a random forest classifier. We chose these models because these traditional classifiers are commonly used in sequence detection systems [38] and are therefore relevant baselines to compare with WebPhish. We used 31 features, detailed in Table 12, culled from [5], [39], [40], [34].

URL Features.
For the manual features extracted from the URL, research has shown that phishing web page developers frequently exploit an Internet user's familiarity with a website [41] by adding terms to the URL that may trick a user into thinking that the malicious website is the real website. Terms widely used to access genuine websites, such as admin and account, are particularly vulnerable to imitation. The creator of a phishing website would therefore intuitively use ambiguous terms at the start of the URL, so the presence of those terms in the URL is regarded as a feature. Many malicious domain names are the IP addresses of hosting systems [42], [43]. We counted the combination of numbers in a URL and the percentage of numbers in the hostname as features. Also, phishers create several subdomains to include tricky terms, for example, PayPal as a subdomain. This could make phishing URLs longer [43]. Therefore, we included the URL length, whether the URL includes a subdomain, the number of subdomains, and the number of dots as features. Furthermore, the counts of some punctuation marks, such as semicolons, hyphens, and underscores, are included in our URL feature set.
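Several of the URL features described above can be computed with plain string operations. The sketch below is illustrative rather than the feature extractor used for Table 12: the list of suspicious terms is a hypothetical example built around the terms mentioned in the text, and the subdomain count naively assumes a two-label registered domain.

```python
# Hypothetical term list; only admin, account, and PayPal are named in the text.
SUSPICIOUS_TERMS = ("admin", "account", "login", "paypal")


def url_features(url):
    """Compute a few lexical features from a URL without its scheme prefix."""
    host, _, _path = url.partition("/")
    labels = host.split(".")
    digits_in_host = sum(ch.isdigit() for ch in host)
    lowered = url.lower()
    # Naive assumption: the last two labels form the registered domain.
    num_subdomains = max(len(labels) - 2, 0)
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "pct_digits_in_host": digits_in_host / len(host) if host else 0.0,
        "has_subdomain": num_subdomains > 0,
        "num_subdomains": num_subdomains,
        "num_hyphens": url.count("-"),
        "num_underscores": url.count("_"),
        "num_semicolons": url.count(";"),
        "num_suspicious_terms": sum(term in lowered for term in SUSPICIOUS_TERMS),
    }


example = "paypal.secure-login.example.com/account/update1"
features = url_features(example)
```

A production extractor would typically consult a public suffix list to separate the registered domain from its subdomains, which this sketch deliberately omits.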
5.5.2 HTML Features. For the HTML feature set, variables such as the number of white spaces, the presence of internal and external links, and the number and presence of images were extracted because of their relevance in differentiating between a phishing and a legitimate web page. Once the features are collected, a binary classifier is trained on them. We empirically set the number of trees to 70 for the random forest classifier, the penalty for the logistic regression to L1, and the radial basis function (RBF) kernel setting of the non-linear SVM to 50.0. Tables 4, 5, and 6 show the precision, recall, and F1 score of WebPhish against the traditional machine learning classifiers for the D1 and D2 datasets. The ROC curves of the traditional machine learning classifiers are shown in Figure 6. WebPhish outperforms all state-of-the-art techniques (with at least 2 percent improvement in accuracy) in all categories and metrics. The output of the traditional machine learning classifiers shows the limitation of manually engineered features and highlights the importance of the temporal robustness of our proposed method. The random forest classifier yielded better results than the logistic regression and SVM classifiers. Given that the random forest classifier outperformed the other models, we analyzed which features were most informative to its classification results. The three most important features are the URL's length, the number of digits in the URL, and the number of misleading words in the URL. This is not surprising, since attackers will try to deceive users by employing suspicious words known to the victims. We also observed that phishing URLs tend to have a higher ratio between the length of the path and the length of the hostname. Finally, we measured the training and evaluation times of WebPhish and the compared state-of-the-art models. WebPhish needs 120 seconds to train, while the logistic regression model's training time was less than a minute. Nevertheless, once trained, the WebPhish model can conduct phishing detection on one URL and its HTML content within 194s.
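As an illustration of the HTML feature set, the sketch below counts a handful of the listed variables using only the Python standard library. It is a simplified example under stated assumptions rather than the extraction pipeline used in the experiments: the names `html_features` and `_TagCounter` are hypothetical, and a link is treated as internal when it is relative or points to the same host as the page.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _TagCounter(HTMLParser):
    """Counts the tags and links needed for a few HTML features."""

    def __init__(self, page_host: str):
        super().__init__()
        self.page_host = page_host
        self.counts = {"img": 0, "iframe": 0, "script": 0,
                       "internal_links": 0, "external_links": 0}

    def handle_starttag(self, tag, attrs):
        if tag in ("img", "iframe", "script"):
            self.counts[tag] += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if not href:
                return  # anchors without a target are ignored
            host = urlparse(href).hostname
            # Assumption: relative and same-host links are "internal".
            if host is None or host == self.page_host:
                self.counts["internal_links"] += 1
            else:
                self.counts["external_links"] += 1


def html_features(html: str, page_url: str) -> dict:
    """Compute a few hand-crafted features for one HTML document."""
    parser = _TagCounter(urlparse(page_url).hostname or "")
    parser.feed(html)
    c = parser.counts
    whitespace = sum(ch.isspace() for ch in html)
    return {
        "num_images": c["img"],
        "has_images": c["img"] > 0,
        "num_iframes": c["iframe"],
        "has_iframes": c["iframe"] > 0,
        "num_scripts": c["script"],
        "num_internal_links": c["internal_links"],
        "num_external_links": c["external_links"],
        "pct_whitespace": 100.0 * whitespace / len(html) if html else 0.0,
    }
```

The resulting dictionaries could then be turned into numeric vectors and passed to any binary classifier, such as the random forest, logistic regression, or SVM models compared above.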

Experiment 4: Application on Sentiment classification
To demonstrate our model's versatility by applying it to other textual datasets, we evaluated its performance on the publicly available US airline dataset. We collected the US airline dataset from the Kaggle databases, where it was published by CrowdFlower. This data includes 14,640 tweets. These tweets belong to six major U.S. airlines: American Airlines, United Airlines, US Airways, Southwest Airlines, Delta Airlines, and Virgin Airlines. Sentiment classification techniques can help researchers and decision-makers in airline companies better understand customers' feelings, opinions, and satisfaction.
We took as input into WebPhish the actual text tweeted by the customers and the associated airline sentiment confidence. The airline sentiment confidence is a numeric feature representing the confidence level of classifying the tweet into the neutral, positive, or negative class.
5.6.1 Implementation. Using the evaluation metrics detailed above in Section 4.3, we applied the WebPhish model to the US airline dataset to evaluate the model's ability to differentiate between positive and negative emotions in text.
Note that, as the US airline dataset has been publicly available since 2015, previous studies exist on the classification of its sentiments. We compared the performance of WebPhish on the US airline dataset with the following studies. In [44], the authors applied a voting classifier (VC) to classify tweets from the dataset mentioned above according to their emotions. The VC is based on logistic regression (LR) and a stochastic gradient descent classifier (SGDC) and uses a soft voting mechanism to make the final prediction. Also, using TF-IDF as a feature extraction mechanism, the authors implemented a phrase-level analysis on seven classification algorithms (Decision Tree, Random Forest, SVM, Gaussian Naive Bayes, AdaBoost, Logistic Regression, and Gradient Boosting) in addition to the Voting Classifier. We also compared WebPhish with the system proposed in [45]. Using Doc2Vec embeddings on the US airline dataset, the authors of [45] explored the use of a bi-directional gated recurrent unit (GRU) network for sentiment analysis of Twitter data directed at US airlines. They fed into the bi-directional GRU network a word vector trained through a skip-gram model and initialized with the existing GloVe model [48].
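For readers unfamiliar with soft voting, the sketch below shows the general idea: the per-class probabilities of the base classifiers are averaged and the class with the highest mean probability is chosen. This is a generic illustration, not the implementation from [44]; the function name `soft_vote` and the probability values in the usage note are hypothetical.

```python
def soft_vote(prob_lists):
    """Average per-class probabilities across classifiers and pick the argmax.

    prob_lists: one list of class probabilities per classifier,
                all of the same length.
    Returns (winning class index, averaged probabilities).
    """
    if not prob_lists:
        raise ValueError("at least one classifier output is required")
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    if any(len(p) != n_classes for p in prob_lists):
        raise ValueError("all classifiers must output the same number of classes")
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    winner = max(range(n_classes), key=avg.__getitem__)
    return winner, avg
```

For example, with hypothetical outputs `[0.55, 0.35, 0.10]` from one classifier and `[0.20, 0.60, 0.20]` from another, the two models disagree on the most likely class, but the averaged probabilities favour the second class, which soft voting therefore selects.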
We also compared our model with DICET, proposed in [46]. DICET is an automated text pre-processor whose output is fed into a bidirectional LSTM with attention to perform Twitter sentiment analysis. In addition, [47] proposed a sentiment analysis model that extracts relevant features from the US airline dataset using a Hadoop cluster. The obtained features are fed into a deep RNN to perform the classification into two classes: positive and negative reviews.
The studies in [46] and [47] removed the neutral class from the US airline dataset during experimentation, thereby reducing the dataset to 11,541 tweets. On the other hand, the works in [44] and [45] experimented with the full corpus of the US airline dataset of 14,640 instances. Consequently, to ensure fairness, WebPhish was trained and evaluated on both the full and the abridged dataset.
Tables 13 and 14 detail the results of WebPhish and other state-of-the-art techniques on the US airline dataset. WebPhish shows a significant improvement over current state-of-the-art methods trained on the US airline Twitter dataset.
WebPhish achieves an accuracy of 94 percent, which is higher than that of the other models on the US airline dataset. We can therefore conclude that our proposed model is a robust solution that also applies to text classification fields beyond social engineering. The core reasons behind the flexibility of WebPhish may include: (i) the concatenation of the embeddings of raw textual content, which ensures its extendibility to unseen data; and (ii) the processing of language nuances by identifying strong connections within the text.

Visualization of the WebPhish Layers
This section visualizes the embedded feature information of the URL and HTML instances extracted from the WebPhish model and its variants trained on the D2 dataset. Here we chose the feature sequences obtained after the HTML and URL embedding layers were concatenated (the Concatenation layer in Figure 2a), which yields a 16-dimensional vector. For the baseline features, we extract the URL features used in WebPhish-URL and the HTML features used in WebPhish-HTML. We then apply t-SNE [49] to the extracted feature vectors to reduce their dimension and plot them in a 2-dimensional embedding space: the concatenation of the HTML and URLs for our proposed model, and only the URLs or only the HTML for the baseline models.
The concatenated embedded HTML and URLs are shown in Figure 7a, the HTML-only embedding in Figure 7b, and the URL-only embedding in Figure 7c. As shown in Figure 7a, for WebPhish-Full, the phishing and legitimate web pages are separated into two groups. The right part of the plot contains most of the legitimate websites, while the phishing websites are located on the left side. Very few phishing instances overlap with legitimate instances. In the WebPhish variants, the separation between phishing and legitimate instances is not as clear. Unlike WebPhish-Full, the WebPhish-URL and WebPhish-HTML embeddings contain several distinct data points spread across the plot, making it difficult to establish clusters.

CONCLUSION
In this paper, we proposed a web page phishing detection technique. We adopted a DNN, specifically Convolutional Neural Networks, to capture the semantic dependencies in raw URL and HTML content while employing character embedding techniques to enable automatic feature extraction. Evaluation results based on real-world phishing and legitimate web page content demonstrate the effectiveness of our proposed model. In future work, we plan to implement our model as a browser extension. This will enable WebPhish to determine the maliciousness of a website in real time.

3.1.3 The CNN Layer. The Convolutional layers follow the Character Embedding layer. Using the URL matrices (for all URLs $u_i$, $\forall i = 1, \dots, N$) and the HTML matrices (for all HTML documents $h_i$, $\forall i = 1, \dots, N$) as training data, we can now add convolutional layers. We applied Convolutional filters $W \in \mathbb{R}^{k \times d}$, where $n = 8$, $k = 8$ and $d = 16$. In our model, the convolution layer is immediately followed by a Max-Pooling layer, whose output is passed to the FC layers.
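The convolution and pooling step can be sketched in plain Python as follows. This is an illustrative example only, under several assumptions not confirmed by the text: that the value 8 given for $n$ is the number of filters, that the other value of 8 is the filter width in characters, that $d = 16$ is the embedding dimension, and that pooling takes the maximum over all positions. The function name `conv1d_maxpool` is hypothetical.

```python
import random


def conv1d_maxpool(seq, filters, biases):
    """Slide each filter over the sequence and keep its maximum response.

    seq:     L x d list of embedded characters
    filters: n filters, each a k x d list of weights
    biases:  n bias values
    Returns a list of n pooled activations (one per filter).
    """
    k = len(filters[0])
    if len(seq) < k:
        raise ValueError("sequence is shorter than the filter width")
    pooled = []
    for filt, bias in zip(filters, biases):
        best = float("-inf")
        for start in range(len(seq) - k + 1):
            window = seq[start:start + k]
            # Dot product between the filter and the current window.
            act = bias + sum(
                w * x
                for f_row, x_row in zip(filt, window)
                for w, x in zip(f_row, x_row)
            )
            best = max(best, act)  # max pooling over all positions (assumed)
        pooled.append(best)
    return pooled


# Example with the sizes quoted in the text (interpretation assumed).
rng = random.Random(0)
n, k, d, length = 8, 8, 16, 20
sequence = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(length)]
weights = [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
           for _ in range(n)]
features = conv1d_maxpool(sequence, weights, [0.0] * n)
```

Here `features` holds one pooled value per filter, which is the kind of fixed-length vector that can be handed to fully connected layers regardless of the input length.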

3.1.5 The FC Layer. The FC layers in our model provide it with an added tier for learning more complex representations. The two FC layers analyze the sequences concatenated from the CNN and Max-Pooling layers, applying a ReLU activation in each FC layer.
3.1.6 The Sigmoid Layer. The Sigmoid layer, which is the final layer in our model, uses the Sigmoid activation to output the result from the model. This last layer, which comes after the FC layers, squashes the output of the model into the range 0 to 1 according to the expression $\sigma(z) = \frac{1}{1 + e^{-z}}$, giving the probability of the two classes, legitimate or phishing, where $z = W x_t + b$. $W$ and $b$ are model parameters, and $x_t$ is the input at time step $t$.
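A small numerical sketch of the sigmoid output may help. It assumes a single output unit with a weight vector and a scalar bias, a decision threshold of 0.5, and that the positive class is phishing; the threshold, the class assignment, and the function names are assumptions for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function, mapping any real z into [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(x, w, b, threshold=0.5):
    """Return the probability and label for one input vector.

    Assumption: probabilities at or above the threshold mean "phishing".
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    return p, ("phishing" if p >= threshold else "legitimate")
```

With the hypothetical values `x = [1.0, 2.0]`, `w = [1.0, 1.0]` and `b = -1.0`, the pre-activation is 2 and the output probability is roughly 0.88, so the page would be labelled phishing under the assumed threshold.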

Figure 2: Overall Architecture of WebPhish-Full and WebPhish-LSTM.The character embedding matrix in the Embedding layer is applied to Convolutional and the LSTM layer, respectively.

Figure 3: Configuration of WebPhish-URL and WebPhish-HTML which are CNN Models trained on only URL and HTML Content, respectively.

$TPR = \frac{TP}{TP + FN} \times 100$ and $FNR = \frac{FN}{TP + FN} \times 100$, where TP, FP, TN, and FN represent the numbers of True Positives, False Positives, True Negatives, and False Negatives, respectively. TPR measures the percentage of accurately predicted phishing URLs out of the total number of phishing URLs. Similarly, FNR calculates the percentage of incorrectly predicted phishing URLs out of the total number of phishing URLs. The TNR and FPR metrics were calculated using $TNR = \frac{TN}{TN + FP} \times 100$ and $FPR = \frac{FP}{TN + FP} \times 100$.
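The four rates can be computed directly from confusion-matrix counts. The sketch below uses the standard definitions of TPR, FNR, TNR, and FPR expressed as percentages; the function name `detection_rates` and the counts in the usage note are hypothetical.

```python
def detection_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return TPR, FNR, TNR and FPR as percentages.

    tp/fn are counted over phishing samples, tn/fp over legitimate samples.
    """
    positives = tp + fn  # all phishing samples
    negatives = tn + fp  # all legitimate samples
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must contain at least one sample")
    return {
        "TPR": 100.0 * tp / positives,
        "FNR": 100.0 * fn / positives,
        "TNR": 100.0 * tn / negatives,
        "FPR": 100.0 * fp / negatives,
    }
```

For instance, with hypothetical counts of 90 true positives, 10 false negatives, 95 true negatives, and 5 false positives, the function returns a TPR of 90, an FNR of 10, a TNR of 95, and an FPR of 5. By construction, TPR and FNR sum to 100, as do TNR and FPR.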

Figure 6: ROC Curves of Machine Learning Models on Manual Features

Figure 7: Visualisation of feature embedding of sampled URLs and HTML using WebPhish-full features, WebPhish-URL and WebPhish-HTML.The data points are colour-coded by the sample classes: Legitimate (Blue) and Phishing (Orange).
HTML source code from the final landing page. The phishing URLs were gathered by continuously monitoring phishtank.com from 11 November 2018 to 18 November 2018 for the D1 dataset and from 10 January 2019 to 17 January 2019 for the D2 dataset, while we drew the legitimate URLs from Alexa.com's top domains. Phishtank.com

Table 2: Web Page Documents Used to Evaluate WebPhish

Table 4: Result of WebPhish and Baseline models on the D1 Dataset

Table 5: Result of WebPhish and Baseline models on the D2 Dataset without retraining

Table 6: Result of WebPhish and Baseline models on the D2 Dataset with retraining

Table 7: Confusion Matrix of WebPhish and Baseline Models on the D1 dataset

Table 8: Confusion Matrix of WebPhish and Baseline Models on the D1 dataset

Table 9: The Impact of The FC Layers

Table 10: The Impact of The Convolutional layers

Table 11: The Impact of The Embedding Layer

Table 12: Extracted Manual Features from Both URL And HTML Contents on Web Pages

URL Features
1. Number of misleading words in the URL such as login and bank
2. Number of forward slashes and question marks
3. Number of digits
4. Number of dots
5. Number of hyphens and underscores
6. Number of equal signs and ampersands
7. Number of two-letter subdomains
8. Number of semicolons
9. Number of subdomains
10. Presence of subdomain
11. % of digits in the hostname
12. Length of URL

HTML Features
1. Presence of JavaScript
2. Presence of NoScript
3. Presence of internal JavaScript
4. Presence of external JavaScript
5. Presence of embedded JavaScript
6. Number of JavaScript
7. Number of NoScript
8. Number of internal JavaScript
9. Number of external JavaScript
10. Number of embedded JavaScript
11. Presence of internal links
12. Presence of external links
13. Presence of images
14. Presence of iframes
15. Number of images
16. Number of internal links
17. Number of external links
18. Number of iframes
19. Percentage of white spaces in the HTML content

Table 13: Result of WebPhish and State-of-the-art models on the US Airline Dataset Classifying 3 classes: Positive, Negative and Neutral

Table 14: Result of WebPhish and State-of-the-art models on the US Airline Dataset Classifying 2 classes: Positive and Negative