Improving transaction safety via anti-fraud protection based on blockchain

Financial enterprises generate profits based on economic development. However, a healthy market is difficult to achieve because these enterprises are susceptible to the parasitic credit card fraud that accompanies economic growth, unless an effective anti-counterfeiting technology is developed to alleviate the issue. To solve the problem, we propose a gradient-boosting decision tree based anti-fraud protection with blockchain technology, referred to as GBDT-APBT, which treats the anti-fraud transaction model as an accumulation of weak classifiers and builds a classifier to judge whether a transaction is fraudulent. Each user's private data is used for offline training at the local blockchain node, the trained model is then uploaded directly to the cloud, and the final consensus model is obtained by voting. By incorporating blockchain technology, GBDT-APBT offers decentralisation, openness, autonomy, anonymity, and immutability, which allows it to satisfy the demand for an effective and beneficial anti-counterfeiting system that detects fraudulent information with high performance. Experiments show that, compared with other methods, GBDT-APBT offers a promising approach to the security of credit card transactions in terms of detection accuracy.


Introduction
The emergence and development of big data is propelled by the boom in Internet technologies, marked by the convenience of collecting, storing, transmitting and processing data, and by the monetisation of user information. Investing in information safety has become a prerequisite for enterprises as a result of frequent breaches of centralised data by those driven by tempting profits. Blockchain technology (Zheng et al., 2018), with its security, reliability, and tamper resistance, frees up a large amount of user information. Machine learning is easy to apply in a central database because the data has been collected and uploaded to that database before the model is built (Kumar, n.d.). However, the traditional machine learning method cannot guarantee data security, because the leakage risk is high and personal data can be easily copied and spread (Saia & Carta, 2019). How to learn from transaction information while ensuring data security in a distributed system supported by blockchain is the focus of this paper (Balagolla et al., 2021).
Blockchain technology, due to its security, reliability, and tamper resistance, allows a large amount of private data to be freed up. It is easy to apply machine learning in a central database (Bottou et al., 2018), since the data has been collected and uploaded to the central database before building the model. However, the traditional machine learning method (Bottou et al., 2018) cannot guarantee the data security of users, and the risk of leakage from a centralised database (Hitaj et al., 2017) is too high. In addition, because personal data differs from other kinds of property, it can be easily copied and spread. How to learn from transaction information while ensuring data security in a distributed system supported by blockchain is the focus of this paper (Kasa et al., 2019). Rani et al. (2022) combined blockchain and the k-means algorithm to train models. Mayhew and Chen (2019) explored the application of blockchain in the security field.
This paper proposes an anti-fraud protection model based on blockchain with a gradient-boosting decision tree (GBDT-APBT). Unlike classical machine learning methods, it does not need a centralised database to provide the complete data. Instead, it carries out online learning in a point-to-point network, which guarantees real-time updating of the model, and it combines the transmission of the model in the network with the gradient-boosting decision tree. Recording model parameters in blocks prevents them from being tampered with maliciously and enables real-time viewing and tracking of the model parameters.
The rapid expansion of the credit card business has generated a large amount of data, and it is necessary to manage and mine this data effectively. Blockchain technology is used to analyse customers' basic information and transaction behaviour, obtain customer behaviour patterns, predict customer behaviour, and identify and prevent fraudulent transactions, which helps to provide better personalised services for customers. The contributions of this paper can be summarised as follows. First, we propose a classification method based on blockchain technology called GBDT-APBT. Second, we are the first to combine the decentralised blockchain and the gradient-boosting decision tree to detect credit card fraud. Third, we verify the effectiveness of the proposed method by comparing it with other machine learning methods.
The remainder of the paper is structured as follows. Section 2 introduces the preliminary knowledge of blockchain and reviews related work. In Section 3, we illustrate our method in detail. Experiments and analysis are presented in Section 4. Finally, a conclusion is given in Section 5.

The principle of blockchain
The application of blockchain originated from digital currency, and a blockchain is a distributed data structure. Blockchains can be classified into three types: public chain, private chain, and consortium chain. Behaviour on the public chain is public, and the chain is not controlled or owned by anyone. For industries and applications that need confidentiality and do not need the openness and transparency of the public chain, the consortium chain, which is limited to alliance members, is adopted. The private chain is a closed LAN with some blockchain technologies added. Fraudsters cannot get access to the private chain or the consortium chain (Shafay et al., 2022). Blockchain implements a consensus protocol and cryptographic algorithms between nodes. Because of its decentralisation (Liu et al., 2018), tamper resistance, and encryption algorithms, blockchain has attracted more and more attention in various fields. For example, IBM blockchain provides distributed financial services, reducing transaction time from hours to seconds. At the same time, its distributed and tamper-resistant computing ability enhances financial security. There is also a lot of work that focuses on medical insurance fraud (Mahapatra et al., 2022; Zhang et al., 2022). Blockchain technology changes the way data is processed and stored in a revolutionary way.
The theoretical basis of blockchain technology has a long history. In 1991, Haber and Stornetta (Nguyen & Kim, 2018) proposed a way to record the timestamps of digital files securely. When users send files, the system provides them with a timestamp service (Gupta et al., 2020). After receiving a file, the server signs it using the current time and a pointer to the previous file, and generates a certificate containing the signature information. If the data in a file is tampered with, the pointer to that file is automatically invalidated, which ensures that the file system as a whole cannot be changed undetected. Later, they put forward a more efficient scheme in which files are connected through a tree structure to form "blocks" (Saberi et al., 2019), and the blocks are linked to form a chain, which greatly reduces the time required to find a specific file. This is the prototype of the blockchain data structure. A linked list built with hash pointers is called a blockchain, as shown in Figure 1. A hash pointer (Ding et al., 2019) is a pointer to the storage location of the data together with the hash value of that data under a timestamp. Because the hash function is collision resistant, the hash value of modified data will not match the previous hash value, so the hash pointer not only points to the data storage location but can also be used to check whether the data has been tampered with.
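The hash-pointer idea described above can be illustrated with a minimal Python sketch. The block layout (`timestamp`, `data`, `prev_hash`), the use of SHA-256, and the helper names are assumptions chosen for illustration; they are not taken from the paper.

```python
import hashlib
import json


def block_hash(block):
    """Hash a block's contents (timestamp, data, previous hash) with SHA-256."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, data, timestamp):
    """Append a block whose hash pointer covers the previous block."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"timestamp": timestamp, "data": data, "prev_hash": prev_hash})


def is_valid(chain):
    """Check that every stored hash pointer still matches the block before it."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
for t, record in enumerate(["file A", "file B", "file C"]):
    append_block(chain, record, timestamp=t)

untouched_ok = is_valid(chain)         # chain exactly as it was built
chain[0]["data"] = "file A (edited)"   # tamper with an early block
tampered_ok = is_valid(chain)          # pointer stored in block 1 no longer matches
```

In this sketch, editing any earlier block changes its hash, so the pointer stored in the following block stops matching and the check fails.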
Previous work has used blockchain algorithms to detect credit card fraud. Rani et al. (2022) use a combination of blockchain and the Simulated Annealing k-Means Algorithm to good effect. Mayhew and Chen (2019) explore whether blockchain can solve the security problem of credit card commerce. However, these methods have to download the data locally. Our work exploits the decentralised nature of the blockchain to protect the privacy of data such as credit card transactions without the need to download the data locally, and uses the GBDT-APBT algorithm to achieve the best results compared to other methods.

Data structure of blockchain
Data storage in the blockchain uses the Merkle tree structure. The binary Merkle tree is the simplest form of Merkle tree (J. Xu et al., 2018), and it is built from the bottom up. Initially, all transaction data is stored in data blocks and the data in these blocks is hashed. The hash value of each data block is then stored in the corresponding leaf node. The hash values of two adjacent leaf nodes are concatenated and hashed, and the result becomes the parent node of that pair of leaves. The same operation is repeated level by level until a single top node is left: the Merkle root. Finally, the Merkle root hash produced by the Merkle tree algorithm is stored in the block header as the summary of the transaction list. Figure 2 illustrates the binary Merkle tree structure (J. Xu et al., 2018). Modifying the data block of any tree node changes the Merkle root hash, so the modification can be detected effectively.
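The bottom-up construction described above can be sketched in a few lines of Python. The choice of SHA-256, the hex-string concatenation, and the rule of duplicating the last hash when a level has an odd number of nodes are assumptions made for this illustration; the paper does not specify them.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(transactions):
    """Build a binary Merkle tree bottom-up and return the root hash."""
    if not transactions:
        raise ValueError("at least one transaction is required")
    # Leaf level: one hash per transaction data block.
    level = [sha256_hex(tx.encode("utf-8")) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last hash on odd levels
        # Parent level: hash of the concatenated hashes of each adjacent pair.
        level = [
            sha256_hex((level[i] + level[i + 1]).encode("utf-8"))
            for i in range(0, len(level), 2)
        ]
    return level[0]


txs = ["tx1", "tx2", "tx3", "tx4"]
root = merkle_root(txs)
tampered_root = merkle_root(["tx1", "tx2", "tx3-modified", "tx4"])
```

Comparing `root` with `tampered_root` shows the detection property: changing a single transaction changes the root stored in the block header.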

Classification of samples with asymmetric distribution
The historical statistical characteristics of credit card transaction samples (Eom et al., 2019) show that fraudulent transactions are far fewer than normal transactions; that is, the sample data has a skewed distribution. General classification methods are efficient when the class samples are uniformly distributed. If sample data with an asymmetric distribution is used for training directly (Ammour et al., 2018), the resulting classification and prediction model will be inaccurate, and it will be difficult to correctly identify the target data, which is small in amount. Therefore, the efficiency of the classification model (Li & Wang, 2016) can be improved by first processing the sample data with asymmetric distribution. The same problem needs to be addressed in other knowledge discovery tasks for bank retail customer classification, such as credit classification analysis and customer churn analysis.
Classification technology in statistical learning plays a key role in daily life, and machine learning methods for classification and prediction have attracted extensive attention. Owing to its own shortcomings, classification technology achieves good classification and recognition ability only on data sets with a symmetrical data distribution. In practical applications, such as credit card fraud identification, customer churn prediction and loan classification management, the data set is often concentrated on non-target data, which makes the established classification model unable to correctly classify and predict the small amount of target data. For example, in the telecom industry, where only 2% of all customers are lost every month, a model that predicts that no customer will be lost achieves an accuracy as high as 98%. However, such a model cannot identify the target customers; that is, its prediction accuracy for lost customers is poor. Although the overall classification accuracy is high, the model is far from the expected prediction purpose and deviates from the research goal.
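The telecom example can be checked with a short calculation. The customer count of 1,000 is a hypothetical figure chosen only to make the 2% churn rate concrete.

```python
# Hypothetical month: 1,000 customers, 2% of whom are lost (label 1).
labels = [1] * 20 + [0] * 980

# A trivial model that predicts "not lost" for every customer.
predictions = [0] * len(labels)

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.2%}, recall on lost customers = {recall:.2%}")
```

The trivial model reaches 98% accuracy while identifying none of the lost customers, which is why accuracy alone is a misleading measure on skewed data.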
Researchers have devoted a lot of effort to the problem of asymmetric data distribution. To obtain a data set with symmetric data distribution, there are three main strategies: reducing the majority, increasing the minority (Fernández et al., 2017), and the multi-expert classifier (Castrillón-Santana et al., 2017). For convenience of description, the data in the data set is divided into two categories: the majority category (MJC), the data category with a large amount of data, and the minority category (MNC), the data category with sparse data.
The strategy of reducing MJC samples cannot make full use of the information provided by the original data. Adding MNC samples may introduce incorrect information, which reduces the efficiency of the classifier. The multi-expert classifier (Pan et al., 2017) uses random segmentation of the data set to adjust the asymmetry of the original data set while making full use of all the information in the original data. In this method, the MJC samples in the original data set are randomly divided into several subsets according to an appropriate ratio of MJC samples to MNC samples. The MNC samples are then added to each of these subsets to form multiple subsets with symmetric distribution, and multiple classifiers are constructed based on these subsets. Once formed, these training subsets are data sets that can be used independently. A heterogeneous classification pattern construction method (Boot et al., 2016) can be used to construct a classifier according to the characteristics of each subset. When the category of a data item is to be predicted, multiple sub-classifiers (Maillo et al., 2017) each predict its category. Based on the classification results of the sub-classifiers, the final classification of the data is determined by an expert meeting. The simplest expert meeting is the voting method (J. Xu et al., 2018), which takes the opinion of the majority of experts as the final decision.
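The voting method can be sketched as follows. The three threshold rules standing in for sub-classifiers and the tie-breaking rule are hypothetical and serve only to show how the "expert meeting" combines individual predictions.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most sub-classifiers (ties go to the smaller label)."""
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)


# Hypothetical sub-classifiers: each is a threshold rule on the transaction amount.
sub_classifiers = [
    lambda amount: int(amount > 900),
    lambda amount: int(amount > 1200),
    lambda amount: int(amount > 1500),
]


def ensemble_predict(amount):
    """Let every sub-classifier predict, then take the majority decision."""
    return majority_vote([clf(amount) for clf in sub_classifiers])
```

For an amount of 1300, two of the three rules predict 1, so the ensemble predicts 1; for an amount of 1000, only one rule does, so the ensemble predicts 0.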

Anti-fraud transaction protection based on blockchain
It is of great significance to build a system that ensures the security of user information and protects the enterprise from fraud. To break with the traditional centralised database architecture, we introduce blockchain technology to build a point-to-point distributed network architecture, in which user data is no longer uploaded to a centralised database. This section mainly studies the model for identifying bank credit card fraud. Two problems need to be solved: the sample distribution of fraudulent and non-fraudulent transaction data is asymmetric, and the efficiency of a single classifier is weaker than that of a combined classifier. We propose a gradient-boosting decision tree expert classification model based on blockchain technology, so as to better identify whether a sample is a fraudulent transaction and to further improve the classification effect, providing a better model for banks to identify fraudulent transactions. Figure 3 shows the structure of the whole model.

Algorithm
For anti-fraud transactions, it is generally necessary to establish a binary classification model with strong generalisation ability to determine whether a transaction is fraudulent. We innovatively combine the gradient-boosting decision tree with the blockchain. The significance of placing model parameters on the blockchain is that private user data does not need to be uploaded to the cloud when training the consensus model. Specifically, each user's private data is used for offline training at the local blockchain node, after which the model parameters are uploaded directly to the cloud consensus model. Logically there is no need to threaten the model to prove it, and physically the possibility of leaking private user data is eliminated. We regard the anti-fraud transaction model as the sum of weak classifier models:

$$F(x; w) = \sum_{m} w_m \phi_m(x) \quad (1)$$

where w is the weight and φ is the set of weak classifiers. The equation can be regarded as a linear combination of basis functions.
The residual elimination of the gradient-boosting decision tree and the decentralised structure of the blockchain ensure the real-time identification of fraudulent transactions. In our training process, each blockchain node trains its model locally on a subset of the corresponding dataset, and the nodes finally vote to arrive at the consensus model. In real-life terms, private data need not be uploaded to the cloud for aggregation: each user's private data is used for local training, and finally the best model parameters are selected and uploaded to the consensus model. To eliminate the residual in the anti-fraud transaction model, we construct a new model along the gradient direction of residual reduction. In gradient boosting, the purpose of each new model is to decrease the residual of the previous model along the gradient direction, which is significantly different from traditional boosting, which reweights correctly and incorrectly classified samples. Let F(x; P) be the classification function and P the parameter set; the additive function is then extended to the following equation:

$$F(x; P) = \sum_{m=1}^{M} \beta_m h(x; a_m) \quad (2)$$

where h(x; a) is the base function, a single parameterised function of the input variable x, and a = {a_1, a_2, ..., a_m} represents the partition variables in the classifier. We solve for the parameters {β_m, a_m} of Equation (2) by optimising the loss function (Equation (3)). The base function is taken to be a regression tree model; that is, GBDT is applied to the fraud transaction model. Therefore, the additive model for fraud prevention can be expressed as follows:

$$h(x; \{b_j, R_j\}_{1}^{J}) = \sum_{j=1}^{J} b_j \, l(x \in R_j) \quad (4)$$

where l(·) is the indicator function, which is 1 if x is in the region R_j and 0 otherwise, and b_j is the value on the corresponding node of the tree model.
Therefore, the update equation of the whole model is given by Equation (5), where γ represents the loss function given by Equation (6). In these equations, β denotes the degree of confidence and b corresponds to the value at a node of the tree model. The GBDT-APBT algorithm is shown in Algorithm 1. The algorithm takes the initial constant value as input and obtains the classification model by training decision trees iteratively.
Algorithm 1 Anti-fraud protection model based on blockchain technology with GBDT
Input: initial constant value to minimise the loss function γ; transaction information sample set (x_0, y_0), (x_1, y_1), (x_2, y_2), ..., (x_N, y_N); number of iterations T
Output: Transaction information model of classification tree
...
end for
6: Fit a classification tree to the targets r_it giving terminal regions R_jt, j = 1, 2, ..., J_t
7: for j = 1, 2, ..., J_t do
8: ...
end for
10: ...

In the GBDT classifier for fraud protection, the classifier model is first initialised by estimating the constant value that minimises the loss function, which gives a tree with only one root node. The negative gradient of the loss function under the current model is then calculated as the residual estimate (line 5 of Algorithm 1). Next, the leaf node regions are estimated to fit the approximate residuals, and the value of each leaf node region is estimated by a line search that minimises the loss function (lines 7-11 of Algorithm 1). Finally, all information samples are used to generate the final classification tree model.
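The training loop described above follows the general shape of gradient boosting for binary classification. The sketch below is a textbook version written in plain Python, not a reproduction of Algorithm 1: the logistic loss, the depth-one trees (stumps), the single Newton step used in place of the line search, the learning rate, and the toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Find the (feature, threshold) split that best fits the residuals (least squares)."""
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[f] <= thr]
            right = [r for row, r in zip(X, residuals) if row[f] > thr]
            mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - mean_l) ** 2 for r in left)
            sse += sum((r - mean_r) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr)
    return best[1], best[2]


def fit_gbdt(X, y, n_rounds=20, learning_rate=0.5):
    """Gradient boosting with log loss; labels are 0 (normal) or 1 (fraud).

    Assumes both classes are present and at least one feature takes two values.
    """
    p = sum(y) / len(y)
    f0 = math.log(p / (1.0 - p))            # constant model minimising the log loss
    scores = [f0] * len(X)
    trees = []
    for _ in range(n_rounds):
        probs = [sigmoid(s) for s in scores]
        residuals = [yi - pi for yi, pi in zip(y, probs)]   # negative gradient
        feature, thr = fit_stump(X, residuals)
        left_idx = [i for i, row in enumerate(X) if row[feature] <= thr]
        right_idx = [i for i, row in enumerate(X) if row[feature] > thr]
        leaf_values = []
        for idx in (left_idx, right_idx):
            num = sum(residuals[i] for i in idx)
            den = sum(probs[i] * (1.0 - probs[i]) for i in idx)
            leaf_values.append(num / den if den > 0 else 0.0)  # one Newton step per leaf
        trees.append((feature, thr, leaf_values[0], leaf_values[1]))
        scores = [
            s + learning_rate * (leaf_values[0] if row[feature] <= thr else leaf_values[1])
            for s, row in zip(scores, X)
        ]
    return f0, trees, learning_rate


def predict_proba(model, row):
    """Sum the initial constant and all tree outputs, then map to a probability."""
    f0, trees, lr = model
    score = f0
    for feature, thr, left_val, right_val in trees:
        score += lr * (left_val if row[feature] <= thr else right_val)
    return sigmoid(score)


# Toy data (hypothetical): [amount, flag] with fraud concentrated at large amounts.
X = [[20, 1], [35, 0], [50, 1], [60, 0], [900, 1], [1200, 0], [40, 0], [1500, 1]]
y = [0, 0, 0, 0, 1, 1, 0, 1]
model = fit_gbdt(X, y)
```

Calling `predict_proba(model, [1000, 0])` returns a probability above 0.5 on this toy data, and `predict_proba(model, [30, 1])` returns one below 0.5.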
In this paper, the tree in GBDT is adjusted for the classification of anti-fraud transaction information. After network initialisation, the bank issues the model training request. After a user agrees to join the training network, the initial model f_0 is synchronised to each client, and the model is updated with the user's data (x_i, y_i). The client computes the minimum of the GBDT loss function L(y, f(x)) locally as in Equation (7), where y denotes the label, x the input and f(x) the output of the trained model. The transfer process of the whole model is shown in Figure 4. Algorithm 2 shows the process of model updating. In each loop, each user node updates its model with reference to Algorithm 1. This is also shown in Figure 5.

Algorithm 2 Updating of GBDT-APBT
Input: number of user nodes M; transaction information sample set (x_0, y_0), (x_1, y_1), (x_2, y_2), ..., (x_N, y_N); number of iterations T
Output: The final model f_n
1: Initialize GBDT tree model f_0 with initial constant value to minimize loss function
2: for every transaction information n = 0, 1, 2, ..., N do
...

We assume that there is an up-to-date model f_{t−1} in the network, which a network node can update at time t. The trained model parameters f_t are written into the block with timestamp t. Under the update mechanism we designed, network nodes a, b, c, d, e, ... have the right to check the model parameters in the previous block. Different nodes can use their own data to obtain their own f_t. Each node packages and uploads its own model parameters to the data pool, and each node votes on the models of the other nodes. A node uses the 0-1 loss function shown in Equation (8) to vote for other nodes, where y is the label, x is the input and f_t^b(x) denotes the output of the trained model of node b at time t. If the prediction is correct, node a votes for the updated model of node b; otherwise, it does not vote. Suppose the voting results are as shown in Table 1.
The nodes vote on each other's models and calculate the accuracy of each model. As can be seen from Table 1, the accuracy of node d is the highest (100%), so all nodes reach a consensus that node d obtains the right to update the model f_{t−1} at time t and to write the updated model parameters f_t into the data block with timestamp t. It is worth noting that node d does not participate in subsequent model updates after it has updated the model, so that the data of each node is used for training only once.
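The voting round can be sketched as below. The sketch uses the standard 0-1 loss (0 for a correct prediction, 1 for an incorrect one). The assumption that each node holds a single private check sample, the threshold models, the sample values, and the alphabetical tie-breaking are all hypothetical; they do not reproduce Table 1.

```python
def zero_one_loss(y_true, y_pred):
    """0-1 loss: 0 for a correct prediction, 1 for a wrong one."""
    return 0 if y_true == y_pred else 1


def run_vote(models, check_samples):
    """Each node votes for every other node's model that classifies its sample correctly.

    models:        {node_name: callable mapping features -> predicted label}
    check_samples: {node_name: (features, label)} held privately by that node
    Returns (winner, vote_share), where vote_share[node] is votes / possible votes.
    """
    votes = {name: 0 for name in models}
    for voter, (features, label) in check_samples.items():
        for candidate, model in models.items():
            if candidate == voter:
                continue  # nodes vote only on other nodes' models
            if zero_one_loss(label, model(features)) == 0:
                votes[candidate] += 1
    possible = len(models) - 1
    vote_share = {name: count / possible for name, count in votes.items()}
    # Ties are broken alphabetically (an assumption of this sketch).
    winner = max(sorted(vote_share), key=vote_share.get)
    return winner, vote_share


# Hypothetical candidate models and private check samples for four nodes.
models = {
    "a": lambda x: int(x["amount"] > 2000),
    "b": lambda x: int(x["amount"] > 50),
    "c": lambda x: 0,
    "d": lambda x: int(x["amount"] > 500),
}
check_samples = {
    "a": ({"amount": 30}, 0),
    "b": ({"amount": 1500}, 1),
    "c": ({"amount": 800}, 1),
    "d": ({"amount": 120}, 0),
}
winner, vote_share = run_vote(models, check_samples)
```

With these inputs node d receives a vote from every other node, so it wins the round. In the scheme described above, the winning node would then write its parameters into the block and be excluded from later rounds.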

Experiment and analysis
The credit card fraud data set was downloaded from Kaggle and consists of nearly 300,000 credit card transactions recorded by Deutsche Bank. It is used for training the anti-fraud model and is 143 MB in size. The records fall into two categories, where 0 indicates no credit card fraud and 1 indicates fraud. Each record has 31 fields. The first is the time elapsed between the transaction and the first transaction in the data set. The 2nd to 29th are processed features; because of the need for confidentiality, the original information does not appear. The 30th is the transaction amount. The last indicates whether the transaction is fraudulent. One example record is: 406, −2.31, 1.95, −1.60, 3.997, −0.52, −1.42, −2.53, 1.39, −2.77, −2.77, 3.20, −2.89, −0.59, −4.28, 0.38, −1.14, −2.83, −0.01, 0.41, 0.12, 0.51, −0.03, −0.46, 0.32, 0.04, 0.17, 0.26, −0.14, 0, 1.
Figure 6(a,b) shows the distribution of data categories, and Figure 7(a,b) shows the distribution of transaction amounts.
There are 284,807 credit card transactions in the data set, of which 284,315 are non-fraud records, accounting for 99.83% of the total; only 492 are fraud records, accounting for 0.17% of the total. Generally, skewed records call for up-sampling or down-sampling. However, because of our design scheme, there is no longer a centralised database that stores the complete data, so the skew of the records cannot be predicted in advance and online sample preprocessing cannot be achieved. We used 70% of the data set, that is, 199,020 records, as training data, and the remaining 30%, that is, 85,295 records, as test data. We use random partitioning of the dataset to adjust for the asymmetry of the original dataset. We randomly partition the MJC samples in the credit card fraud dataset into M subsets, similarly partition the MNC samples into multiple subsets at random, and finally merge the subsets of MJC samples with the subsets of MNC samples to obtain M symmetrically distributed subsets. Multiple classifiers are then constructed based on these subsets. Each subset is trained on one blockchain node, which makes full use of all the information in the original dataset.
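The random partitioning step can be sketched as follows. The function name, the round-robin split after shuffling, the fixed seed, and the optional `replicate_minority` switch (which copies every MNC sample into each subset, as in the multi-expert scheme described earlier) are assumptions of this sketch rather than details given in the paper.

```python
import random


def make_node_subsets(majority, minority, n_nodes, replicate_minority=False, seed=0):
    """Split MJC and MNC samples at random into one training subset per node."""
    rng = random.Random(seed)
    majority = majority[:]   # copy so the caller's lists are left untouched
    minority = minority[:]
    rng.shuffle(majority)
    rng.shuffle(minority)
    subsets = []
    for m in range(n_nodes):
        part = majority[m::n_nodes]            # disjoint slice of the MJC samples
        if replicate_minority:
            part = part + minority             # every node sees all MNC samples
        else:
            part = part + minority[m::n_nodes]  # disjoint slice of the MNC samples
        rng.shuffle(part)
        subsets.append(part)
    return subsets


# Toy stand-ins for transaction records: (label, record id).
majority = [("normal", i) for i in range(20)]
minority = [("fraud", i) for i in range(4)]
subsets = make_node_subsets(majority, minority, n_nodes=4)
replicated = make_node_subsets(majority, minority, n_nodes=4, replicate_minority=True)
```

With the default setting every sample is assigned to exactly one node, so no information from the original data is discarded. With `replicate_minority=True` each node's subset has a higher share of MNC samples at the cost of reusing them across nodes.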
We implemented a method that combines decentralised blockchain and gradient-boosting decision trees to detect credit card fraud. Our method can be used to process data and train detection models for such offline unbalanced datasets and in deployed scenarios where user privacy needs to be protected and data cannot be uploaded to the cloud. When learning from the skewed data set, we compared GBDT-APBT (anti-fraud protection based on blockchain technology with gradient-boosting decision tree), the blockchain-based GBDT model updating algorithm, with other basic machine learning methods. These methods include GBDT (gradient-boosting decision tree), Bayes (Granik & Mesyura, 2017), DT (decision tree) (Tanha et al., 2017), SVM (support vector machine) (Al-Yaseen et al., 2017), KNN (k-nearest neighbours) (Maillo et al., 2017), XGBoost (eXtreme Gradient Boosting) (Priscilla & Prabha, 2020), RF (Random Forest), CNN (Convolutional Neural Network) (Vardhani et al., 2019) and DNN (Deep Neural Networks) (Lebichot et al., 2019). Figure 8 shows the confusion matrices of the predictions of the 10 methods on the test set. Figure 9 shows the PR curves of the predictions of the 10 methods on the test set.
Table 2 compares the predicted results of the 10 schemes on the test data set. The performance of the schemes on the skewed data set differs considerably.
As shown in Table 3, we compare our method with the methods proposed by Pang and Shen (2019), Pang et al. (2018), Ruff et al. (2018), H. Xu et al. (2022) and Pang et al. (2015) on the AUC-ROC and AUC-PR metrics. The experimental results demonstrate that our proposed method is the best on both metrics.
As can be seen from the data in the table, the recall of different models on the test set varies greatly. The recall of GBDT-APBT on the training set is 93.24%, which is better than that of the other methods, meaning our method has a higher probability of predicting negative samples.
To verify the validity of our proposed model, we also conducted comparison experiments on the Heart Failure Prediction Dataset and the Default of Credit Card Clients Dataset downloaded from KAGGLE, respectively.The Heart Failure Prediction Dataset is a dataset of cardiovascular diseases (CVDs) where all samples can be classified into two categories, with samples with CVDs labelled as 1 and samples without diseases labelled as 0. The entire dataset has 918 observations, with 11 features per entry.We used a 5-fold cross-validation approach for the experiments.
The Default of Credit Card Clients Dataset contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. All samples in the dataset are likewise divided into two categories; each sample has 25 features, and the entire dataset has 30,000 samples. We randomly selected 70% of the data as the training set and the remaining 30% as the test set. The label for customers who paid on time was 1 and the label for customers who defaulted was 0. Tables 4 and 5 show the test results on the Heart Failure Prediction Dataset and the Default of Credit Card Clients Dataset, where precision is the proportion of predicted positives that are truly positive, recall is the proportion of actual positives that are correctly identified, and F1-score is the harmonic mean of precision and recall. We attribute the results to the adoption of GBDT, which uses an ensemble algorithm based on decision trees.
As the experimental results in Tables 4 and 5 show, our method remains highly competitive on both datasets. It achieved the best Precision, Recall, and F1-score, which demonstrates the effectiveness of our blockchain-based approach.
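For reference, the three metrics reported in Tables 4 and 5 can be computed as in the following sketch, which uses the standard definitions. The label vectors are made-up values for illustration and are not results from the paper.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1-score for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels: 4 positives, of which 3 are found, plus 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here there are 3 true positives, 1 false positive and 1 false negative, so precision, recall and F1-score all equal 0.75.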

Conclusion
In the protection against credit card fraud, there is currently a serious asymmetry in the distribution of data set samples. Because there is no centralised database that stores the complete data, the skew of sample categories cannot be predicted in advance, and online sample preprocessing cannot be achieved. This paper addresses this problem by proposing a classification method based on blockchain technology called GBDT-APBT. Unlike other methods, we are the first to combine the decentralised blockchain and the gradient-boosting decision tree to detect credit card fraud. With blockchain, we can therefore train and update the model in real time without preprocessing the data. Blockchain technology enables a select group of participants to share data, which is held by the corresponding blockchain cloud services. Data is transferred over network routes, and each node obtains the latest data by sending data requests to neighbouring nodes. Blockchain technology spreads the large risk of centralisation across the user nodes. Users utilise their own data to train and update the model, and private data is not spread in the network. The model makes full use of the hash algorithm, digital signature, timestamp and other cryptographic tools in blockchain technology, which reduces the possibility of brute-force cracking. Finally, the efficiency of the model is verified by the credit card fraud identification experiment. As for limitations, GBDT is not well suited to high-dimensional sparse data, so our method may perform poorly in such fields. In future work, we will therefore explore combining a neural network with GBDT for classification, using the neural network to handle the high-dimensional sparse input data. In addition, we did not consider the imbalance of the dataset, and we will also try data augmentation for data preprocessing before training.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 2. The structure of the binary Merkle tree.

Figure 3. An overview of GBDT-APBT, in which multiple experts vote, followed by a consensus model that determines whether a transaction is fraudulent.
Algorithm 2 Updating of GBDT-APBT (continued)
...
for every user node m = 1, 2, ..., M do
11: Using the result of the 0-1 loss function to determine whether to vote for the model f
...
for every user node m = 1, 2, ..., M do
...
Node m packs and signs the model parameters and uploads them to the blockchain network
18: Node m exits the next model update
19: end for
20: return The final model f_n

Figure 6. The distribution of data categories.

Figure 7. The distribution of transaction amount.

Figure 8. Confusion matrix of different algorithm models.

Figure 9. Precision-Recall curve of different algorithm models.

Table 1. Results of the vote.

Table 2. The result of predictions.

Table 3. The result of predictions.

Table 5. Default of credit card clients dataset.