
Expert Systems with Applications

Volume 39, Issue 15, 1 November 2012, Pages 11861-11869

Using automated individual white-list to protect web digital identities

https://doi.org/10.1016/j.eswa.2012.02.020

Abstract

The theft of web digital identities, e.g., through phishing and pharming, can result in severe losses to users and vendors, and may even hold users back from using online services, especially e-business services. In this paper, we propose an approach, referred to as automated individual white-list (AIWL), to protect users’ web digital identities. AIWL leverages a Naïve Bayesian classifier to automatically maintain an individual white-list for each user. If the user tries to submit his or her account information to a web site that does not match the white-list, AIWL alerts the user to the possible attack. Furthermore, AIWL keeps track of the features of login pages (e.g., IP addresses, document object model (DOM) paths of input widgets) in the individual white-list. By checking the legitimacy of these features, AIWL can efficiently defend users against hard attacks, especially pharming, and even dynamic pharming. Our experimental results and user studies show that AIWL is an efficient tool for protecting web digital identities.

Highlights

  • Protect web identities by using Automated Individual White-List.
  • White-list based anti-phishing.
  • White-list based anti-pharming.

Introduction

Web digital identities (Chen, Wu, Shen, & Ji, 2011), in the form of pairs of usernames and passwords, are a commonly used mechanism to authenticate individuals wishing to carry out transactions across the World Wide Web (Web for short). Applications that rely on such mechanisms include webmail, on-line banking, and social networking services (SNSs). It is thus not surprising that a variety of attacks aiming at stealing users’ web digital identities are perpetrated. Among these attacks, phishing is the most widespread. Phishing employs social engineering to trick a user into revealing his or her web digital identity to a fraudulent web site. The open source model of web pages makes it easy for attackers to create an exact replica of a legitimate site. Because such a replica can be created easily at little cost and looks very convincing to users, many such fraudulent web sites continuously appear (Fette et al., 2007; Zhang, Egelman, et al., 2007). As a result, phishing not only poses a severe threat to users’ web digital identities, but also erodes the fundamental premise of activities and business on the Web.

Users are not usually skillful enough to defend themselves against the theft of web digital identities, especially phishing attacks (Dodge et al., 2007; He et al., 2011; Sheng et al., 2007), because fraudulent web sites generally have appearances similar to the genuine ones. Moreover, the URLs of fraudulent web sites are forged so as to look very similar, and sometimes even identical, to those of the legitimate sites. It is thus difficult even for a careful user to detect fraudulent web sites.

Because of the potentially severe damage resulting from phishing attacks, anti-phishing techniques and tools represent a very active research area in web security. Many approaches and tools have been developed to address the problem of phishing (Aburrous et al., 2010; eBay Toolbar’s Account Guard, 2011; CallingID, 2011; Chen et al., 2009; Dodge et al., 2007; EarthLink Tool, 2011; GeoTrust, 2011; Google, 2011; He et al., 2011; NetCraft, 2011; SpoofGuard, 2011). There are four main topics in anti-phishing research (Zhang, Hong, & Cranor, 2007): understanding why people fall for phishing attacks; methods for educating people so that they do not fall for phishing attacks; user interfaces that help individuals make better judgments about trustworthy email and legitimate web sites; and automated tools for detecting phishing.

Among the four topics, designing automated tools for detecting phishing is today the focus of intense research. Approaches to the design of these tools can be categorized into four types: blacklist, white-list, heuristic, and hybrid.

  • Blacklist approach: In the blacklist approach, all web sites recognized as fraudulent are collected in a list, referred to as a blacklist. Since web sites are added to the blacklist after verification, users can be sure of the illegitimacy of the web sites that trigger warnings. However, it takes a great deal of resources and time to maintain the blacklist. Furthermore, since fraudulent sites continuously emerge, it is hard to keep the blacklist up to date.

  • White-list approach: Unlike the blacklist approach, the white-list approach maintains a list containing all legitimate web sites. Any web site that does not appear in the list is recognized as a potentially malicious web site. Thus the white-list approach requires listing all legitimate web sites in the world and keeping the white-list up to date.

    Current white-list tools usually use a global white-list in which all legitimate web sites are required to be included. But it is obviously impossible for the administrator of the white-list to cover the information of all legitimate web sites on the Internet. Thus, when such a tool raises an alert, users cannot be sure whether the current web site is illegitimate or is a legitimate one whose information has not yet been added to the white-list.

  • Heuristic approach: The heuristic approach, adopted by the majority of anti-phishing tools, leverages the characteristics of a web site to decide its legitimacy. In a heuristic approach, web sites that have high similarity to, or a tight relationship with, legitimate web sites but are not the original ones are recognized as fraudulent. The similarity or relationship of a web site with the legitimate ones is computed based on information collected about the legitimate web sites, referred to as a feature library (Chen et al., 2009).

  • Hybrid approach: A hybrid approach combines the above approaches, such as a global white-list with heuristic approaches (Xiang & Hong, 2009), or a heuristic approach with a blacklist (eBay Toolbar’s Account Guard, 2011), to recognize phishing pages.
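The two list-based approaches above reduce to a membership test, but with opposite semantics: a blacklist hit proves illegitimacy while a miss proves nothing, and a white-list hit proves legitimacy while a miss is only a suspicion, since the list may be incomplete. The following sketch illustrates this asymmetry; the host names and verdict strings are invented for illustration, not taken from any tool described here.

```python
# Illustrative lists; real tools populate these from verified reports.
BLACKLIST = {"evil-bank.example.com"}                  # verified fraudulent sites
WHITELIST = {"bank.example.com", "mail.example.com"}   # verified legitimate sites

def blacklist_verdict(host: str) -> str:
    # A blacklist can confirm illegitimacy, but an absent host is merely unknown.
    return "fraudulent" if host in BLACKLIST else "unknown"

def whitelist_verdict(host: str) -> str:
    # A white-list can confirm legitimacy; anything else is only *potentially*
    # malicious, because a global white-list can never be complete.
    return "legitimate" if host in WHITELIST else "potentially malicious"
```

This asymmetry is exactly why a global white-list produces alerts that users cannot interpret with confidence, and why the paper argues for a per-user list instead.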

Several experiments carried out by Zhang, Egelman, et al. (2007) have shown that the current automated tools are not effective in protecting users’ digital identities.

This paper, therefore, proposes an approach, referred to as Automated Individual White-List (AIWL), to protect users’ web digital identities. Although a global white-list approach is impractical, we argue that an individual white-list approach is practical, because it records only the legitimate web sites familiar to a user rather than all the legitimate web sites in the world. The study of Florencio and Herley (2007) and our experiments in Section 4.3 show that a user logs into only a limited and stable number of web sites. AIWL therefore takes advantage of this observation to build an individual white-list that efficiently defends users against the theft of web digital identities.

The main contributions of AIWL are as follows:

  • AIWL is a tool that employs an individual white-list, automatically maintained by a Naïve Bayesian classifier, to protect users’ web digital identities. In AIWL, any web site that does not match the individual white-list is classified as fraudulent, and AIWL alerts the user who is trying to submit his or her account information to such a web site. Compared with the traditional blacklist approach and the global white-list approach, this individual white-list approach is more practical.

  • AIWL offers an effective solution for defending users against pharming attacks, including dynamic pharming (Karlof, Tygar, Wagner, & Shankar, 2007). AIWL keeps track of the features of login pages (e.g., IP addresses, Document Object Model (DOM) paths of input widgets) in the individual white-list to detect these attacks. AIWL recognizes pharming by checking the IP addresses of web sites. In addition, AIWL effectively defends users against dynamic pharming by checking the DOM paths of the input widgets in the web page. Because a dynamic pharming attack embeds a legitimate login web page into the phishing site, the DOM paths are modified, and AIWL can thus detect the attack based on this modification.
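The DOM-path check above can be made concrete with a short sketch: a root-to-node path such as html/body/form/input identifies where each input widget sits in the page tree, and wrapping a legitimate login form inside an attacker's page changes these paths. The parser below, built on Python's standard html.parser, is our own minimal illustration of the idea, not the paper's implementation; the path format is an assumption.

```python
from html.parser import HTMLParser

class InputPathCollector(HTMLParser):
    """Collect root-to-node DOM paths of <input> widgets, e.g. 'html/body/form/input'."""
    VOID = {"input", "img", "br", "hr", "meta", "link"}  # tags with no closing tag

    def __init__(self):
        super().__init__()
        self.stack = []   # open ancestor tags
        self.paths = []   # collected paths of input widgets

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            self.paths.append("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def input_dom_paths(html: str) -> list:
    collector = InputPathCollector()
    collector.feed(html)
    return collector.paths

def looks_like_dynamic_pharming(stored_paths, current_paths) -> bool:
    # Embedding a legitimate login page inside an attacker-controlled page
    # changes the root-to-input paths, so a mismatch flags a possible attack.
    return stored_paths != current_paths
```

For example, a form reached at html/body/form/input on the genuine page would appear at html/body/div/form/input once wrapped in an extra container, and the mismatch triggers the alert.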

The rest of the paper is organized as follows: Section 2 introduces some background knowledge needed for the discussion in the paper; Section 3 describes AIWL in detail; Section 4 reports experimental results and user studies concerning the efficiency of AIWL; Section 5 analyzes some important issues in AIWL and discusses its limitations; Section 6 introduces related work; and Section 7 outlines the conclusions and our future work.

Section snippets

Phishing and pharming

A phishing attack (APWG, 2011, Fette et al., 2007) usually involves sending a user a fake e-mail claiming to be from a legitimate web site, leading the user to a fraudulent web site which looks very similar to the legitimate one, and tricking the user into exposing his or her web digital identity. Once the user submits his or her account information to such a fraudulent web site, the attackers are able to impersonate the victim and steal victim’s personal information, such as financial

Construct an individual white-list

To construct an individual white-list for a user, the legitimate web sites familiar to the user should be identified. In AIWL, we assume that the web sites where an individual user has successfully accessed the anticipated services after submitting his or her account information are familiar legitimate web sites for the user. The reason is that the aim of malicious web sites is to steal users’ web digital identities. The malicious web sites would not provide the same services as the legitimate
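The construction and checking logic described above can be sketched as a small data model: an entry is added only when a login is judged successful, and later visits are checked against the stored URL, IP address, and DOM paths. The class name, field names, and warning strings below are our own illustrative choices, assuming the features the paper tracks (URL, IP, DOM paths of input widgets); this is a sketch, not AIWL's actual code.

```python
class IndividualWhiteList:
    """Per-user white-list mapping a login URL to its recorded features."""

    def __init__(self):
        self.entries = {}  # url -> {"ip": str, "dom_paths": [str, ...]}

    def record_login(self, url, ip, dom_paths, login_successful):
        # Only logins the classifier judged successful make a site "familiar".
        if login_successful:
            self.entries[url] = {"ip": ip, "dom_paths": list(dom_paths)}

    def check(self, url, ip, dom_paths):
        entry = self.entries.get(url)
        if entry is None:
            return "warn: site not in individual white-list"
        if entry["ip"] != ip:
            return "warn: IP mismatch (possible pharming)"
        if entry["dom_paths"] != list(dom_paths):
            return "warn: DOM path mismatch (possible dynamic pharming)"
        return "ok"
```

Note that an unknown site yields a warning rather than a definite verdict, matching the white-list semantics discussed in the introduction.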

Constructing the Naïve Bayesian classifier

The Naïve Bayesian classifier was constructed to enable AIWL to recognize a successful login process. We simulated login processes for 34 web sites. 18 of the 34 web sites are phishing web sites from PhishTank.com (PhishTank, 2011); the other 16 are legitimate web sites. For every legitimate web site, both the successful login process and the failed one were simulated. We simulated failed login processes by purposely using wrong passwords. Thus, there are altogether 50 login processes
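A Naïve Bayesian classifier over login features of this kind can be sketched as follows: count, per class, how often each binary feature takes each value, then classify by the highest smoothed log-posterior. The feature names and training samples below are invented for illustration (loosely echoing the paper's features); the paper's actual training set consisted of the 50 simulated login processes described above.

```python
import math
from collections import defaultdict

def train_nb(samples):
    """Train a Bernoulli Naive Bayes model.

    samples: list of (features, label), where features is a dict
    mapping a feature name to 0 or 1.
    """
    # counts[label][feature] = [count of value 0, count of value 1]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    priors = defaultdict(int)
    for feats, label in samples:
        priors[label] += 1
        for name, value in feats.items():
            counts[label][name][value] += 1
    return counts, priors, len(samples)

def classify(model, feats):
    counts, priors, n = model
    best_label, best_logp = None, -math.inf
    for label in priors:
        logp = math.log(priors[label] / n)
        for name, value in feats.items():
            n0, n1 = counts[label][name]
            # Laplace smoothing so unseen feature values never zero out a class.
            logp += math.log(((n1 if value else n0) + 1) / (n0 + n1 + 2))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

With a few training samples where successful logins have the page in the browser history and no remaining password field, the classifier separates the two classes as expected.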

Efficiency in identifying login processes

AIWL uses Inbrowserhistory, HasNopasswordField, Numberoflink, HasNoUsername, and Opertime as the features to identify successful login processes. With these features, AIWL classifies login processes with a 100% true positive rate and a 0% false positive rate. That is, all login processes that AIWL recognizes as successful are actually successful, and all login processes that AIWL recognizes as failed are actually failed. This perfect result is

Related work

The problem of protecting from the theft attacks of web digital identities, especially phishing attacks, has been widely investigated from several different perspectives and several approaches exist.

First, the user’s own security awareness is a very important factor in ensuring a safe and secure e-business environment. Therefore, the Anti-Phishing Working Group and other financial organizations have gathered a large amount of materials giving suggestions and guidelines to users in order to

Conclusion and future work

This paper proposes an approach, called Automated Individual White-List (AIWL), to protect users’ web digital identities. AIWL is effective in detecting the theft of web digital identities by maintaining an automated individual white-list of all web sites familiar to the user, together with the LUI information of these web sites. AIWL uses a Naïve Bayesian classifier to automatically build the individual white-list for the user. As shown by our experiments, AIWL recognizes a successful

Acknowledgements

This paper is partly supported by “211-Project Sponsorship Projects for Young Professors at Fudan”, the 863 project (Grant No: 2011AA100701), and the Key Lab of Information Network Security, Ministry of Public Security (Grant No: C11601).

References (40)

  • Dhamija, R., & Tygar, J. D. (2005). The battle against phishing: Dynamic security skins. In Proceedings of the 2005...
  • Dodge, R. C., et al. (2007). Phishing for user security awareness. Computers & Security.
  • Domingos, P., & Pazzani, M. (1996). Beyond independence: Conditions for the optimality of the simple Bayesian...
  • Duda, R. O., et al. (1973). Bayes decision theory.
  • EarthLink Tool (2011)....
  • eBay (2007). Spoof email tutorial....
  • eBay Toolbar’s Account Guard (2011)....
  • Fette, I., Sadeh, N., & Tomasic, A. (2007). Learning to detect phishing emails. In Proceeding of international world...
  • Florencio, D., & Herley, C. (2005). Stopping a phishing attack, even when the victims ignore warnings, Microsoft...
  • Florencio, D., & Herley, C. (2007). A large-scale study of web password habits. In Proceeding of international world...