An Analysis of the Most Used Machine Learning Algorithms for Online Fraud Detection

Illegal activities involving online financial transactions have become increasingly complex and borderless, resulting in huge financial losses for both customers and organizations. Many techniques have been proposed for fraud prevention and detection in the online environment. Although all of these techniques share the goal of identifying and combating fraudulent online transactions, each comes with its own characteristics, advantages and disadvantages. In this context, this paper reviews existing research on fraud detection with the aim of identifying the algorithms used and analyzing each of them against certain criteria. To analyze the research studies in the field of fraud detection, the systematic quantitative literature review methodology was applied. A hierarchical typology is built from the machine-learning algorithms most frequently cited in scientific articles and their characteristics. Our paper thus highlights, in a new way, the most suitable techniques for detecting fraud by combining three selection criteria: accuracy, coverage and costs.


Introduction
Over the last decades, dependence on e-commerce and online payments has grown steadily. As information technology advances, illegal attempts in online transactions have increased worldwide, and as a result most organizations and people are suffering substantial financial losses [1]. In the literature [2], fraud is defined as "the abuse of a profit organization system without necessarily leading to direct legal consequences". Online bank fraud is continuously evolving and is difficult to analyze and detect because fraudulent behavior is dynamic, spread across different customer profiles and dispersed in very large and dynamic datasets. In response, complex decision-making systems based on algorithms and analytical technologies have been developed. These systems can learn from previous experience and build patterns that proactively detect potentially fraudulent transactions.
Drawing on a number of important research studies from the last few years, this paper aims to provide a review of up-to-date techniques for fraud detection based on three main criteria:
• The algorithm should achieve high accuracy while processing large volumes of transaction data => high accuracy
• The algorithm should help to obtain high fraud coverage combined with a low false positive rate => high coverage
• The algorithm should be useful for both organizations and individual users in terms of cost and time efficiency => cost
The paper is structured as follows. The first part offers background on the machine-learning algorithms used in fraud detection, highlighting the chosen criteria. The second part presents the research methodology and the classification of the various techniques used in fraud detection based on the defined criteria. Finally, the paper presents the research results and conclusions.

Background
Online banking fraud has become a serious issue in financial crime management for all banking institutions. It is becoming ever more challenging and leads to massive losses, due to the emergence and evolution of complex and innovative forms of online banking fraud, such as phishing scams, malware infection and ghost websites. The detection of online banking fraud needs to be instant because it is very difficult to recover the loss if fraud goes undiscovered during the detection period. Most customers rarely check their online banking history and are therefore not able to discover and report fraudulent transactions immediately after they occur. This makes the possibility of loss recovery very low. In this context, online banking detection systems are expected to have high accuracy, a high detection rate and a low false positive rate, so that they generate a small, manageable number of alerts in a complex online banking business. These characteristics greatly challenge the existing fraud detection techniques developed for protecting credit card transactions, e-commerce, insurance, retail, telecommunication, computer intrusion, etc. These existing methods demonstrate poor performance in efficiency and/or accuracy when directly applied to online banking fraud detection [3]. For instance, credit card fraud detection often focuses on discovering particular behavior patterns of a specific customer or group, but fraud-related online banking transactions are very dynamic and appear very similar to genuine customer behavior. Some intrusion detection methods perform well in a dynamic computing environment, but they require a large amount of training data with complete attack logs as evidence. However, there is no obvious evidence to show whether an online banking transaction is fraudulent. As stated in the work of Wei et al. (2013) [1], the essence of online fraud reflects the abuse of interaction between resources in three worlds:
• the abuse of the fraudster's intelligence in the social world,
• the abuse of web technology and Internet banking resources in the cyber world,
• the abuse of trading tools and resources in the physical world.
In the same work we find that most online fraud detection tasks have the following characteristics and challenges:
• The data set is large and highly imbalanced: for example, in a very large data set of more than 300 000 transactions in one day, only 5 cases of fraud were present, which results in the task of detecting very rare fraud dispersed among a massive number of genuine transactions.
• Fraud detection needs to be real time: the interval between a customer making a payment and the payment being transferred to its destination account is usually very short, so to prevent instant money loss a fraud detection alert should be generated as quickly as possible. This requires a high level of efficiency in detecting fraud in large and imbalanced data.
• The fraud behavior is dynamic: with the everyday advances in information technology, fraudsters continually advance their techniques to defeat online banking defenses.
• The customer behavior patterns are diverse: in this context, fraudsters tend to simulate genuine customer behavior. They also change their behavior frequently to keep up with advances in fraud detection. All of this makes it difficult to characterize fraud and even more difficult to distinguish it from genuine behavior.
• The online banking system is fixed: every customer accesses the same banking system, which can provide good references for characterizing common genuine behavior sequences and for identifying suspicious activity in online banking.
The above characteristics make the detection of fraud very challenging, which is why many machine-learning techniques have been developed to address this problem [5].
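The imbalance described in the first characteristic also explains why accuracy alone can mislead. The following is a minimal sketch, not taken from the reviewed work: it uses the figures quoted in the text (5 frauds, with 300 000 taken as the lower bound of "more than 300 000" transactions) and a hypothetical baseline classifier, `always_genuine_baseline`, that never raises an alert.

```python
# Figures from the example in the text; 300 000 is used as a lower bound.
TOTAL_TRANSACTIONS = 300_000
FRAUD_CASES = 5


def always_genuine_baseline(total, frauds):
    """Confusion counts for a hypothetical classifier that never flags fraud."""
    tp = 0               # no fraud is ever detected
    fn = frauds          # every fraud is missed
    fp = 0               # no genuine transaction is ever flagged
    tn = total - frauds  # every genuine transaction is passed
    return tp, tn, fp, fn


tp, tn, fp, fn = always_genuine_baseline(TOTAL_TRANSACTIONS, FRAUD_CASES)

accuracy = (tp + tn) / (tp + tn + fp + fn)
fraud_coverage = tp / (tp + fn)  # share of frauds detected (true positive rate)

print(f"accuracy       = {accuracy:.5%}")
print(f"fraud coverage = {fraud_coverage:.0%}")
```

Under these assumptions the baseline reaches an accuracy above 99.99% while detecting none of the frauds, which is one reason a coverage criterion is considered alongside accuracy.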
Seeja and Masoumeh (2014) [6] proposed a credit card fraud detection model for highly imbalanced and anonymous data sets. Frequent item set mining was used to handle the class imbalance problem, thereby finding legal and illegal transaction patterns for each customer. A matching algorithm was then used to determine whether the pattern of an incoming transaction was genuine or fraudulent. The evaluation of this model confirmed that it is possible to detect fraudulent transactions and also to improve classification on imbalanced data. Duman and Ozcelik (2011) proposed a novel combination of the genetic algorithm and the scatter search algorithm to detect credit card fraud in a large Turkish bank. With this combination, the authors were able to improve on the bank's existing fraud detection strategy, reporting a 200% gain in coverage [41]. Krenker et al. (2009) [7] proposed a model for real-time fraud detection based on bidirectional neural networks. In their study, they used a large data set of cell phone transactions provided by a credit card company. The results confirmed that the proposed model outperforms rule-based algorithms in terms of false positive rate. Also with regard to false positive rate, Bhusari V. et al. (2011) [11] used a Hidden Markov Model (HMM) to detect credit card fraud during transactions. Their experiment confirmed that the HMM helps to obtain high fraud coverage combined with a low false positive rate, which makes it a valuable solution for detecting fraudulent credit card transactions [11]. Delio Panaro et al. (2015) [12] proposed a two-layer statistical classifier for detecting fraud in sensitive, highly skewed and massive data sets. The algorithm was motivated by the need to analyze a data set of about fifteen million real-world online banking transactions, spanning 2011 to 2013, with the aim of distinguishing frauds from legitimate operations.
Results confirmed that the algorithm is particularly effective in detecting anomalies, achieving high true positive rates and reasonably low false positive rates. Several other studies [71][72][73] have likewise developed classifiers aimed at high coverage, using techniques based on Naïve Bayes, boosting, neural networks and ensemble learning.
In a study by Mishra et al. (2014) [8], credit card fraud detection was analyzed through three classification models on two data sets. The approaches were compared according to their accuracy and elapsed time, and the comparison included a decision tree for fraud detection and a multilayer perceptron network. Azeem Ush Shan et al. (2014) [9] used the Simulated Annealing algorithm to train neural networks for the detection of credit card fraud in a real-time scenario. The proposed technique was useful for individual users and also for organizations in terms of cost and time efficiency. Also in the context of cost efficiency, Sahin et al. (2013) [10] proposed a new cost-sensitive decision tree approach to fraud detection that minimizes the total misclassification cost. An analysis of the literature published so far makes it apparent that most articles focus on fraud detection in the context of high accuracy while processing large volumes of transaction data, cost and time efficiency, and high fraud coverage combined with a low false positive rate. This is why our research focuses mainly on these three criteria.
In terms of accuracy, SVM, DT, Naïve Bayes and KNN achieved high and medium rates both on data sets of more than 1 million transactions or data records and on data sets ranging from thousands to several hundred thousand transactions or data records [4, 17, 20, 21, 31, 35, 40, 44]. In terms of costs, these were high for all the algorithms applied to huge volumes of data (more than 1 million transactions or data records).
Table 3. Data sets used in the specialized scientific articles
Only mentioned the use of data sets | 8 | [9], [11], [16], [18], [36], [52], [56], [57]
Machine-learning techniques make intensive use of mathematical statistics, as well as knowledge and results from fields such as artificial intelligence, mathematics, psychology, neurobiology and information technology. Depending on the type of learning, machine-learning algorithms can be organized into two main categories:
• supervised learning algorithms, which learn from labeled data and are mainly used for accurate classification and prediction (Artificial Immune System, Artificial Neural Network, Bayesian network, Support Vector Machine, Decision Tree, Logistic regression, Naïve Bayes, Random forest, Fuzzy logic based system, and K-nearest neighbor)
• unsupervised learning algorithms, which cluster unlabeled data with similar attributes and usually achieve lower accuracy than supervised learning algorithms (Hidden Markov Model, Self-organizing map, Genetic algorithm, K-means, DBSCAN, Expert system, Gradient Descent, and Scatter search).
To sum up, the analysis of a number of relevant articles retrieved mainly from Data Science was meant to identify a set of machine-learning techniques that present similar properties and meet the defined criteria. For this, we grouped the machine-learning algorithms into supervised and unsupervised techniques and by the volume of data on which they were applied. The conclusion of this process was that, in detecting online bank fraud, the supervised learning algorithms are the most frequently used techniques and offer high accuracy and high coverage, with the disadvantage of higher costs than the unsupervised ones.

Research results
In this paper we presented a comparative study of the 14 algorithms most frequently cited in scientific articles regarding online fraudulent transactions (artificial immune system, artificial neural network, Bayesian network, DBSCAN, decision tree, fuzzy logic based system, genetic algorithm, Hidden Markov Model, k-nearest neighbour, logistic regression, Naïve Bayes, random forest, self-organizing map, support vector machine), based on their usage frequency and on the following criteria:
• The algorithm should achieve high accuracy while processing large volumes of transaction data => high accuracy
• The algorithm should help to obtain high fraud coverage combined with a low false positive rate => high coverage
• The algorithm should be beneficial for both organizations and individual users in terms of cost and time efficiency => cost
The classification of the algorithms shown in Table 4 was based on the numbers of positive and negative instances that a classifier predicts correctly and incorrectly. The positive and negative instances that a classifier predicts correctly are called true positives (TP) and true negatives (TN). The incorrectly classified instances are called false positives (FP) and false negatives (FN). From these counts derive the True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate, Precision and Accuracy, which help in classifying the techniques. The correct use of these metrics was also a consideration in selecting the articles for review and in choosing the three criteria, as it is important to have algorithms that can deal with these issues efficiently. As we did not have access to the data sets needed to apply the metrics, in future work we will use these formulas on our own data set and verify whether the resulting set of machine-learning techniques presents the same accuracy as stated in the reviewed literature.
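The six metrics named above have standard textbook definitions in terms of TP, TN, FP and FN. The sketch below computes them under the assumption that these standard definitions are the ones intended. The `ConfusionCounts` type, the `rates` function and the example counts are hypothetical and are not results from any reviewed article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # frauds correctly flagged
    tn: int  # genuine transactions correctly passed
    fp: int  # genuine transactions wrongly flagged
    fn: int  # frauds missed


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def rates(c):
    """Standard confusion-matrix metrics for a binary fraud classifier."""
    return {
        "true_positive_rate": _ratio(c.tp, c.tp + c.fn),   # fraud coverage
        "true_negative_rate": _ratio(c.tn, c.tn + c.fp),
        "false_positive_rate": _ratio(c.fp, c.fp + c.tn),
        "false_negative_rate": _ratio(c.fn, c.fn + c.tp),
        "precision": _ratio(c.tp, c.tp + c.fp),
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn),
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = ConfusionCounts(tp=40, tn=9900, fp=50, fn=10)
metrics = rates(example)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this illustrative case the accuracy (0.994) is much higher than the precision (about 0.444), which shows why the coverage criterion considers the true positive rate and the false positive rate separately rather than relying on accuracy alone.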
Based on the analysis of the reviewed articles and the above metrics, we present a comparison of the most frequently applied fraud detection techniques in the online environment, according to their usage frequency and the defined criteria of accuracy, coverage and costs, where 1 means low, 2 means medium and 3 means high. Table 4 illustrates that the supervised learning algorithms are used more frequently than the unsupervised ones, a finding that is also supported by the most used public platform for data science competitions [74]. One limitation of our research was the lack of access to the data sets used in the reviewed articles, which prevented us from determining the characteristics of the techniques directly. Finally, we consider that our paper highlights, in a new way, the most suitable techniques for detecting fraud by combining three selection criteria: accuracy, coverage and costs.

Conclusions and future direction
Based on our study, it can be stated that credit card fraud in the online environment has received the most attention in the literature, although a number of significant problems have not been addressed closely by researchers, such as online intellectual property theft, pagejacking, fake money orders and wire-transfer fraud. Our classification criteria were chosen based on the most common difficulties encountered by credit card fraud detection techniques. The classification of the algorithms showed that the best results in terms of accuracy and coverage were achieved by the supervised learning techniques: support vector machine, artificial neural network and decision tree. These three algorithms were also the most frequently cited in the reviewed articles, which is consistent with the strong results reported for them. The classification was made with the intention of designing an efficient and trustworthy fraud detection system that should also take into account other variables such as risk ranges, bank and customer behavior, geographic areas and so on. As far as the research direction is concerned, it is desirable to investigate possible improvements to the algorithms in order to extend their applicability to other types of online fraudulent transactions with high accuracy, high coverage and low costs. The research will focus mainly on the hybridization of the most used machine-learning techniques, to help improve the efficiency of the fraud detection process in other important areas, and on validating this on our own data set.