Credit Card Fraud Detection with Unsupervised Algorithms

According to international credit card organizations such as VISA, credit card fraud keeps growing, both in number of cases and in amount. To address the problem, an anti-fraud project has been developed using a combination of two unsupervised algorithms: Principal Component Analysis and the SIMPLEKMEANS algorithm. To improve model accuracy, the geographic positions of the transaction and of the client are added to the traditionally studied data, as nearly everybody is now connected through a smartphone and this trend is expected to grow in the near future. The proposed model gives good results on a purpose-built test database: it reproduces the expected results and classifies possible frauds.


I. INTRODUCTION
Credit card payment is nowadays a very common process for most financial transactions. In parallel, however, the number of fraudulent operations has been increasing [1] and demands active surveillance to reduce its impact on the economy, all the more so as internationalization and the extreme simplicity of transactions make security standards harder to apply. In France, for instance, between November 1st, 2013 and April 30th, 2014, transactions worth 532.2 billion euros were made with 68.4 million cards, a total amount of card payments 4.4% higher than in 2012 [2]. Meanwhile, the total fraud amount reached 469.9 million euros during the same period, which represents a 4.3% rise. Research has long been carried out in this domain to find a solution to the fraud problem [3]-[15]. Typically, fraudulent operations represent a small fraction of all transactions, which leads to skewed distributions; the data sets are also noisy because of errors from the collecting devices. Another difficulty stems from data overlapping: legitimate operations may look fraudulent and vice versa. Finally, fraud techniques change over time, so a detection system has to be adaptive to remain efficient.
For these various reasons it is difficult to design a very effective fraud filter, and the usual approach is to take advantage of machine learning systems that recognize fraudulent features in real life after adequate training, which mainly consists in optimizing a cost function measuring the distance of legitimate observed real data to fraudulent ones once a convenient metric has been set. In [16], supervised probabilistic Bayes and Bayesian Network algorithms have been used on the following variables: operation code, response transaction code, transaction date (YYYYMMDD), hour, minute, second, transaction amount and other Boolean values. Results are obtained with an error rate between 0.47% and 0.92%. The decision tree method and different Support Vector Machines (SVM) with polynomial and sigmoid functions have been compared in [17], with the conclusion that SVM generates over-fitting and is less efficient than decision trees. Artificial neural networks and Bayesian belief networks have been used in [18,19], and it has been observed that an error in selecting the set of detection variables could block the system because of the imbalance between legal and fraudulent transactions. Most current approaches rely on relatively heavy numerical processing, which makes them harder to improve and obscures a full understanding of how they work.
Manuscript received July 30, 2015; revised September 8, 2015.
A different approach is followed in the present study, where the intention is to reach more understandable results and at the same time to make them simpler to obtain. To that end, it is proposed to use two very well defined unsupervised algorithms, Principal Component Analysis (PCA) and the SIMPLEKMEANS (SKM) algorithm, both fully transparent in their operation, and to discuss their reliability when applied to the fraud detection problem. The knowledge discovery process from a database requires five steps, as shown in Fig. 1. First of all, an Extract-Transform-Load (ETL) stage is necessary. This stage consists in extracting data from different sources (databases, files, applications, ...), transforming them and regrouping them in a single database. Afterwards, this database has to be cleaned by detecting and correcting corrupted or missing data. Subsequently, to reduce the number of operations and get a faster system, only the most relevant attributes of the data are considered; this is the attribute selection step. As a third stage, data have to be transformed, by building new attributes or changing their format, to make later manipulations easier. The previously selected data mining algorithms can then be applied to these data. To finalize the process, all the results obtained by the system have to be analyzed and interpreted [20].

II. DATA GENERATION
To test the efficiency of the Credit Card Fraud Detection System, data are obviously required. However, collecting real data from different banks is usually unsuccessful, because such data relate to sensitive financial transactions that are kept confidential for elementary privacy reasons. Randomized synthetic data therefore have to be generated for the purpose. These data are created so that data mining methods can be applied directly, without any cleaning or preprocessing. As a large range of data types has to be handled (coordinates, IBAN, dates, times, ...), the data are generated by different algorithms, either a simple one (a random value is put in each field) or a more complex one (fields are linked to one another in order to simulate, for example, several transactions for the same person). Java and Java EE have been used to benefit from a web interface and an easier implementation.
The generator allows the user to create a structure with the desired fields. As shown in Fig. 2, several fields are displayed, for which the user can define different limits according to his or her needs. All choices can be modified on user request, except during data generation. After the number of desired entities has been chosen, the generator calls the Generable class, which produces randomized data. A CSV file is then created, containing all the data generated according to the user's choices. This file can subsequently be used by the fraud detector program without any further modification.
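As an illustration of the linked-field generation described above, the following minimal Python sketch produces a few transactions per account and serializes them to CSV text. It is not the authors' Java generator: the field names, the value ranges and the functions `generate_transactions` and `to_csv` are assumptions chosen only for the example.

```python
import csv
import io
import random

# Hypothetical field names; in the paper the structure is defined by the user.
FIELDS = ["account_id", "amount", "hour", "latitude", "longitude"]


def generate_transactions(n_accounts, per_account, seed=0):
    """Return a list of dict rows; rows of one account share a home position."""
    rng = random.Random(seed)
    rows = []
    for account_id in range(1, n_accounts + 1):
        # Linked fields: each simulated client gets one fixed home position
        # (the bounding box below is an arbitrary assumption).
        home_lat = rng.uniform(42.0, 51.0)
        home_lon = rng.uniform(-5.0, 8.0)
        for _ in range(per_account):
            rows.append({
                "account_id": account_id,
                "amount": round(rng.uniform(1.0, 500.0), 2),
                "hour": rng.randint(0, 23),
                # Transactions are drawn close to the client's home position.
                "latitude": round(home_lat + rng.gauss(0.0, 0.05), 5),
                "longitude": round(home_lon + rng.gauss(0.0, 0.05), 5),
            })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = generate_transactions(n_accounts=2, per_account=3)
csv_text = to_csv(rows)
print(csv_text)
```

Seeding the random generator keeps a generated test base reproducible, which is convenient when the same CSV file has to be replayed through a detector several times.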

III. PCA AND SKM ALGORITHM
PCA is a powerful tool which, with only a few calculations, gives a wide view of the relationships among the different characteristics of credit card transactions. Its flexibility is shown by the fact that it can be applied to very large data sets, independently of content and size, an essential point for this problem. Afterwards, the SKM algorithm makes the identification of fraudulent or legal transactions easier and faster. In other words, the following matrix is built: $T$ represents all the transactions of a bank account, and each transaction $T_j = \{t_{j1}, t_{j2}, \ldots, t_{jp}\}$ is described by $p$ characteristics. $T$ contains both legal transactions $TL$ and fraudulent transactions $TF$, $T = \{TL, TF\}$, and the problem is to detect exactly and only the latter by applying successive filters $\{\Phi_k\}$. The best filter sets are those which achieve this with a minimum number of the most transparent operations (the choice of the best filtering set is made difficult by the interaction between operations belonging to different successive filters). A test of full filter efficiency is the distance $\delta = \left| \prod_k \{\Phi_k\}\, T - TF \right|$, measured with an adapted metric when tested on a representative base set $\mathcal{T}$ of possible transactions $T$. As indicated above, other important elements in the choice of the filtering set $\{\Phi_k\}$ are calculation simplicity and operation transparency. Here the filters are $\Phi_1 = \mathrm{PCA}$ and $\Phi_2 = \mathrm{SKM}$, and after their application two sets are obtained: $QL$ (transactions classified as legal) and $QF$ (transactions classified as fraudulent).
PCA is a data analysis method which transforms correlated variables into uncorrelated ones. In the present case, this method aims at representing transactions described by different attributes (transaction amount, date, ...) in a subspace smaller than the initial one, while losing as little information as possible.
For each bank account, the matrix $X$ representing the $n$ transactions with their $p$ respective attributes is built. After centering each value, the variance-covariance matrix $\Sigma = \frac{1}{N} X^{T} X$ is constructed, where $N = n$ is the number of transactions. The new axes are chosen to minimize the error representing the difference between each value and its respective estimation. The solutions of this error minimization are the eigenvectors of the matrix $\Sigma$, obtained by application of the Gram-Schmidt method. After the respective eigenvalues have been deduced, the dimension $d$ of the new space is chosen following the cumulated variance percentage technique. The new space is built from the first $d$ eigenvectors, related to the $d$ highest eigenvalues. The last step consists in projecting each transaction in this new space.
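The PCA steps above (centering, covariance matrix, eigenvectors, choice of the dimension by cumulated variance, projection) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the eigenvectors are obtained here by power iteration with deflation rather than by the Gram-Schmidt-based procedure mentioned in the text, and the 95% variance threshold, the sample values and all function names are assumptions.

```python
import math


def center(rows):
    """Subtract the column mean from every value."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def covariance(x):
    """Variance-covariance matrix (1/n) X^T X of centered data."""
    n, p = len(x), len(x[0])
    return [[sum(r[a] * r[b] for r in x) / n for b in range(p)] for a in range(p)]


def top_eigenpair(m, iterations=500):
    """Dominant eigenpair of a symmetric positive semi-definite matrix."""
    p = len(m)
    # Non-uniform start vector, to reduce the risk of starting orthogonal
    # to the dominant eigenvector.
    v = [1.0 / (i + 1) for i in range(p)]
    scale = math.sqrt(sum(c * c for c in v))
    v = [c / scale for c in v]
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:  # no variance left in the matrix
            break
        v = [c / norm for c in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v.
    value = sum(v[i] * sum(m[i][j] * v[j] for j in range(p)) for i in range(p))
    return value, v


def pca(rows, variance_kept=0.95):
    """Project rows on the first d axes reaching the cumulated variance target."""
    x = center(rows)
    m = covariance(x)
    p = len(m)
    total = sum(m[i][i] for i in range(p))  # trace = total variance
    if total == 0.0:
        raise ValueError("all transactions are identical")
    axes, explained = [], 0.0
    while len(axes) < p and (not axes or explained / total < variance_kept):
        value, v = top_eigenpair(m)
        axes.append(v)
        explained += value
        # Deflation: remove the component just found before the next search.
        m = [[m[i][j] - value * v[i] * v[j] for j in range(p)] for i in range(p)]
    projected = [[sum(r[j] * a[j] for j in range(p)) for a in axes] for r in x]
    return projected, axes, explained / total


# Toy account: the second attribute is a multiple of the first, the third is constant.
sample = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
projected, axes, ratio = pca(sample)
print(len(axes), ratio)
```

In this toy account all the variance lies along a single direction, so one axis is enough to reach the threshold and each transaction is reduced to one coordinate.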

After PCA, the SIMPLEKMEANS unsupervised classification scheme [21] has been applied to classify the transactions. This algorithm consists in randomly picking k initial points (cluster centers), assigning each point to the closest cluster, re-evaluating the center of each cluster and reassigning the points to their closest cluster, see Fig. 3. This cycle is repeated until the different sets become stable.
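The cycle described above can be sketched in a few lines of Python. This is a generic k-means loop written for illustration under stated assumptions (squared Euclidean distance, a fixed random seed, a hypothetical `k_means` function and made-up sample points); it is not the SIMPLEKMEANS implementation used by the authors.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k=2, seed=0, max_iterations=100):
    """Return (cluster index of each point, cluster centers)."""
    rng = random.Random(seed)
    # Randomly pick k initial points as cluster centers.
    centers = [list(c) for c in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iterations):
        # Assign each point to the closest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:  # the sets are stable: stop
            break
        assignment = new_assignment
        # Re-evaluate the center of each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Two well separated groups of projected transactions (made-up values).
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
assignment, centers = k_means(points, k=2)
print(assignment)
```

The cluster indices themselves carry no meaning: deciding which of the two clusters is the fraudulent one requires an additional rule that this sketch leaves out.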

IV. RESULTS
The model has been applied to manually built data covering five bank accounts. The first one contains 8 transactions, of which 2 are fraudulent and 6 legal. In the second bank account, there are 2 legal transactions and 1 fraudulent one. The third bank account contains 3 legal transactions. The fourth bank account contains 20 transactions, 15% of which are fraudulent, and finally the last bank account contains 15 transactions with 33.33% of fraud. The classification results are reported in Table I (a diagonal matrix is obtained when every transaction is correctly classified). An error has been detected in the third bank account, where a legal transaction has been considered as fraudulent. This result can be explained by the fact that the number of clusters in the SKM algorithm is fixed to 2, so that all transactions are forced to belong to one of these two clusters, even in indeterminate cases. The problem would also appear in the other extreme case of 100% fraudulent transactions. Nevertheless, even with a first plain iteration $\Phi = \Phi_1 \Phi_2$ of the PCA filter $\Phi_1$ and the SKM filter $\Phi_2$, the results of the proposed model are attractive. With the basic undifferentiated Euclidean metric, the measurement error is $\delta = 1/(707)^{1/2}$, a figure which can be significantly reduced by iterating $\Phi_2$ several times without much numerical effort; see Fig. 4, which exhibits the remarkable reliability of the proposed filter for fraud detection above some critical percentage. This suggests a reductive step-by-step procedure that eliminates as many legal transactions as possible, so as to end up within the absolute reliability interval, as will be discussed elsewhere.
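The composition of the test base can be tallied with a short calculation. The sketch below only restates the figures given in the text; the rounding of the two percentages to whole transactions and the assumption that exactly one transaction (in the third account) is misclassified are ours, so the resulting overall rate is indicative rather than a figure reported by the authors.

```python
# (number of transactions, number of fraudulent ones) for each bank account.
accounts = {
    1: (8, 2),
    2: (3, 1),                          # 2 legal + 1 fraudulent
    3: (3, 0),
    4: (20, round(20 * 15 / 100)),      # 15% of 20 transactions
    5: (15, round(15 * 33.33 / 100)),   # 33.33% of 15 transactions
}

total = sum(n for n, _ in accounts.values())
fraudulent = sum(f for _, f in accounts.values())

# Assumption: a single legal transaction of account 3 is flagged as fraudulent.
misclassified = 1
accuracy = (total - misclassified) / total

print(total, fraudulent, round(100 * accuracy, 2))  # prints: 49 11 97.96
```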

Figure 2. Model used in the program, showing the fields and associated methods.

TABLE I. RESULTS FOR 5 DIFFERENT BANK ACCOUNTS
According to different tests, the proposed model gives good results. Transactions of bank accounts No. 1, 2, 4 and 5 have been correctly classified, with 100% precision, see Table I.