Dandelion optimization based feature selection with machine learning for digital transaction fraud detection

Abstract: Digital transactions relying on credit cards have been steadily increasing in recent years due to their convenience. Due to the tremendous growth of e-services (e.g., mobile payments, e-commerce, and e-finance) and the promotion of credit cards, fraudulent transaction counts are rapidly increasing. Machine learning (ML) is crucial in investigating customer data for detecting and preventing fraud. However, the presence of irrelevant and redundant features in most real-time credit card data degrades the performance of ML techniques. The purpose of the feature selection (FS) approach is to detect the most prominent attributes required for developing an effective ML approach, so that classification performance is improved and computational complexity is decreased. Therefore, this study presents an evolutionary computing with fuzzy autoencoder based data analytics for credit card fraud detection (ECFAE-CCFD) technique. The purpose of the ECFAE-CCFD technique is to recognize the presence of credit card fraud (CCF) in real time. To achieve this, the ECFAE-CCFD technique performs data normalization in the earlier stage. For selecting features, the ECFAE-CCFD technique applies the dandelion optimization-based feature selection (DO-FS) technique. Moreover, the fuzzy autoencoder (FAE) approach can be exploited for the recognition and classification of CCF.


Introduction
Information technology developments have strongly influenced the financial industry, resulting in the extensive adoption of electronic commerce (e-commerce) platforms [1]. The main problem related to advanced e-commerce is the growing number of credit card fraud (CCF) cases. In recent years, the growth in CCF has placed a great burden on financial organizations. CCF happens in all businesses, ranging from home appliances to the banking and automotive sectors [2]. With the widespread application of credit card fraud detection (CCFD) techniques, users can prevent fraud and be protected from various categories of cybercriminals. Automatic fraud detection increases online security and protects users from cybercriminals [3]. Thus, it is important to design accurate automatic fraud detection approaches for credit card transactions [4]. Several techniques have been designed for identifying fraudulent credit card transactions. The increased CCF rate is related to the growth of e-commerce and the popularity of online transactions. Therefore, CCFD is essential for financial organizations to prevent losses [5].
The machine learning (ML) method has been extensively used for detecting CCF [6]. Vast databases now exist because of the arrival of the Internet of Things (IoT) and big data. Due to the size of these databases, many of their features may be redundant or unrelated to the response variable [7]. Such features can increase the complexity of the ML model and result in over-fitting. To address the high dimensionality problem, a dimensionality reduction technique like feature selection (FS) is required for obtaining useful insights and making accurate predictions [8]. FS methods aim to detect the most significant features required to design a high-performance ML technique, ensuring decreased computational complexity and enhanced classification performance by removing redundant and irrelevant features. FS techniques are categorized into three types: embedded, filter, and wrapper. The internal functioning and configuration of the different FS approaches make them suitable for various applications. Filter techniques use feature ranking to determine the useful features. Features that achieve scores above a given threshold are chosen, and those below the threshold are rejected [9]. Subsequently, the selected key features are supplied as input to the learning method. Filter techniques differ from embedded and wrapper techniques because they do not rely on the classifier and are therefore independent of classification bias [10,11].
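As a minimal sketch of the filter idea described above, the following Python snippet scores each feature by its absolute Pearson correlation with the class label and keeps the features whose score exceeds a threshold. The scoring function, the threshold of 0.5, and the toy columns (`amount`, `hour`, `const`) are illustrative assumptions and are not taken from the cited filter methods.

```python
from statistics import mean, pstdev


def pearson_abs(xs, ys):
    """Absolute Pearson correlation between one feature column and the label."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:  # a constant column carries no ranking signal
        return 0.0
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return abs(cov / (sx * sy))


def filter_select(columns, labels, threshold):
    """Rank features independently of any classifier and keep those above the threshold."""
    scores = {name: pearson_abs(col, labels) for name, col in columns.items()}
    kept = [name for name, score in scores.items() if score > threshold]
    return kept, scores


# Hypothetical toy data: three feature columns and a binary fraud label.
columns = {
    "amount": [10.0, 12.0, 11.0, 95.0, 99.0, 97.0],
    "hour": [3.0, 15.0, 9.0, 4.0, 16.0, 10.0],
    "const": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}
labels = [0, 0, 0, 1, 1, 1]
kept, scores = filter_select(columns, labels, threshold=0.5)
```

Because the scores are computed without training a classifier, the ranking can be reused with any downstream learning method, which is the property that separates filter techniques from wrapper and embedded ones.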
This study presents an evolutionary computing with fuzzy autoencoder based data analytics for credit card fraud detection (ECFAE-CCFD) technique. The ECFAE-CCFD technique performs data normalization in the earlier stage. For selecting features, the ECFAE-CCFD technique applies the dandelion optimization-based feature selection (DO-FS) technique. The global searching abilities of the dandelion optimization (DO) algorithm can efficiently explore the feature space and recognize the highly relevant features, resulting in better model performance. Moreover, the fuzzy autoencoder (FAE) technique can be implemented for the recognition and classification of CCF. Autoencoders, particularly FAE, are known for their proficiency in capturing non-linear relationships within data and extracting relevant features. In CCF, where patterns can be complex and non-linear, FAE can provide an effective data representation and improve the classification performance. Last, an improved billiard optimization algorithm (IBOA) can be utilized for the optimum selection of the parameters of the FAE algorithm, increasing the classification accuracy. The IBOA's strategy of escaping local optima can stop the model from getting stuck in suboptimal solutions, which helps to improve overall performance. The use of IBOA is motivated by its ability to search competently for optimal parameter values, which is important in improving the performance of the FAE model. The simulation outcomes of the ECFAE-CCFD model are examined on the benchmark open-access database. In short, the contributions of the study are as follows.
• Introduces the ECFAE-CCFD method, providing an innovative and comprehensive technique for the detection of CCF.
• Applies DO-FS, which leverages evolutionary computing to select the most relevant features for fraud detection, potentially reducing computational complexity and enhancing model performance.
• Employs the FAE method for the detection and classification of CCF, which adds a layer of sophistication to the fraud detection model.
• Uses an IBOA for optimum selection of parameters within the FAE algorithm, further increasing the accuracy and generalizability of the model.
• Combines the DO-FS approach and IBOA-based parameter tuning within the FAE framework for CCFD, a combination that was not found in the reviewed literature.

Related works
Raghavan and El Gayar [12] aim to benchmark multiple ML techniques, such as support vector machine (SVM), k-nearest neighbors (KNN), and random forest (RF), as well as deep learning (DL) techniques such as restricted Boltzmann machine (RBM), autoencoders, convolutional neural network (CNN), and deep belief network (DBN). Datasets such as the European (EU), Australian, and German datasets were used. In [13], current advancements in ML techniques and deep reinforcement learning (DRL) were exploited for CCF detection methods, which distinguish non-fraud and fraud classes. The Adaptive Synthetic Sampling (ADASYN) and Synthetic Minority Oversampling Technique (SMOTE) were the two resampling approaches leveraged for rebalancing the imbalanced CCF data. To establish CCF detection, ML techniques were applied to this balanced database. Then, DRL was used to create a detection system based on the imbalanced CCF database. Through practical experiments, the authors assessed the reliability of ML approaches combined with the above-mentioned resampling methods and of DRL approaches for the detection of CCF. Alharbi et al. [14] used the Kaggle dataset to design a DL-based method that addresses the text data problem. The images were given to a CNN structure with class weights derived through the inverse frequency approach to address the class imbalance problem. ML and DL methods were implemented to verify the validity and robustness of the presented system. Sanober et al. [15] introduce a new structure that incorporates Spark with a DL method. This study applies various ML approaches, such as decision tree (DT), RF, SVM, logistic regression (LR), and KNN, to detect fraud. Also, a comparative analysis was done using different parameters. Nguyen et al. [16] offer user separation, where the authors split users into new and old users before applying a deep neural network (DNN) and CatBoost to each category. Also, various methods to boost detection accuracy, like feature engineering, handling heavily imbalanced datasets, and feature transformation, were presented in detail. Almhaithawi et al. [17] addressed fraud detection as a common issue in the secure banking research domain because of its significance in decreasing the losses of e-transaction companies and banks. This work includes implementing common classification techniques like LR and RF, along with modern classifiers such as CatBoost (CB) and XGBoost (XG), testing the effect of an unbalanced dataset by comparing their outcomes with and without balancing, and then concentrating on the savings measure for testing the effect of cost-sensitive wrapping with Bayes minimum risk (BMR).
Taha and Malebary [18] present an intelligent method to detect CCF transactions utilizing an optimized light gradient boosting machine (OLightGBM). A Bayesian-based hyperparameter optimization method is integrated to tune the parameters of LightGBM. In [19], the shuffled shepherd political optimizer-based deep residual network (SSPO-based DRN) technique was presented for CCFD. In [20], the authors leveraged the XGBoost method as an effective approach for predicting fraud. Since the number of fraudulent transactions is much smaller than that of legitimate transactions, the authors proposed to offset this bias by leveraging sampling approaches like oversampling, undersampling, and their combination. Karthik et al. [21] examined a new model for CCFD that integrates ensemble-learning approaches like bagging and boosting. This method combines the main features of both by creating a hybrid of the bagging and boosting ensemble methods.

The proposed model
In this article, we have proposed the ECFAE-CCFD methodology. The major purpose of the ECFAE-CCFD method is to detect the presence of CCF in real time. To achieve this, the ECFAE-CCFD technique comprises data normalization, DO-FS-based feature subset selection, FAE classification, and IBOA-based parameter tuning. The overall working process of the ECFAE-CCFD technique is depicted in Figure 1.
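The normalization scheme is not specified in the text. As one common possibility, the sketch below applies min-max scaling so that every feature lies in [0, 1] before feature selection; the function name `min_max_normalize` and the toy rows are assumptions made for illustration.

```python
def min_max_normalize(rows):
    """Scale each column of a list-of-rows table to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical transactions: [amount, number of items].
rows = [[100.0, 1.0], [250.0, 3.0], [400.0, 2.0]]
normalized = min_max_normalize(rows)
```

In a real-time setting the column minima and maxima would be estimated on the training data only and then reused for incoming transactions, so that test records do not influence the scaling.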

Feature selection using DO-FS technique
The DO-FS method was executed to select the optimum features. The DO algorithm is inspired by the way mature dandelion seeds find the optimum place for reproduction [22]. The flight behaviors of dandelion seeds are also crucial in biological evolution. The long-distance flight comprises three stages: rising, descending, and landing. The mathematical model of DO is discussed as follows.
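The single-objective DO technique with $D$ parameters is formulated below:

$$\min_{X} f(X), \quad LB \le X \le UB, \tag{1}$$

where $f(\cdot)$ signifies the objective function and $LB, UB \in \mathbb{R}^{D}$ indicate the lower and upper boundaries of the parameter vector $X \in \mathbb{R}^{D}$. Like other optimization techniques, DO comprises the following phases to resolve optimization problems. DO randomly produces the candidate solutions using Eq (2):

$$X_i = LB + rand \times (UB - LB), \quad i = 1, 2, \ldots, N, \tag{2}$$

where $N$ and $D$ are correspondingly set to the population size and dimension parameter, $rand$ is a uniformly distributed random number in $[0, 1]$, and $UB = (ub_1, ub_2, \cdots, ub_D)$ and $LB = (lb_1, lb_2, \cdots, lb_D)$ are the upper and lower limits of the seed location.
$f(X_i)$ represents the fitness value of the $i$th seed in the population, and the seed with the smallest fitness is considered the best position for transmitting dandelion seeds, denoted $X_{elite}$:

$$f_{best} = \min_{i} f(X_i), \quad X_{elite} = X\big(\mathrm{find}(f_{best} = f(X_i))\big),$$

where $\mathrm{find}(\cdot)$ returns the index at which the two values are equal. Different wind speeds and weather conditions define the rising height of the dandelion seeds; hence, the weather is categorized into sunny and rainy.
Case 1. During sunny days, the wind speed can be regarded as following a lognormal distribution, meaning that the seeds have a higher probability of travelling to a distant area. Thus, dandelion seeds emphasize exploration on sunny days. This process can be mathematically modelled as follows:

$$X_{t+1} = X_t + \alpha \, v_x \, v_y \, \ln Y \, (X_s - X_t),$$

$$X_s = rand(1, D) \times (UB - LB) + LB,$$

$$\alpha = rand() \times \left( \frac{1}{T^2} t^2 - \frac{2}{T} t + 1 \right),$$

$$r = \frac{1}{e^{\theta}}, \quad v_x = r \cos\theta, \quad v_y = r \sin\theta,$$

where $X_t$ represents the seed location at the $t$th iteration, $X_s$ is the randomly chosen location of the dandelion seed at iteration $t$, $T$ denotes the maximal number of iterations, $\ln Y$ indicates a lognormal distribution with $\mu = 0$ and $\sigma^2 = 1$, $\alpha$ is the adaptive parameter, $v_x$ and $v_y$ indicate the lift component coefficients of the dandelion seed, and $\theta$ denotes a random angle within $[-\pi, \pi]$.
Case 2. During rainy days, the seeds cannot rise well with the wind, so DO emphasizes local neighborhood exploitation:

$$X_{t+1} = X_t \times k,$$

where $k$ denotes the local adaptive parameter, computed from the current iteration $t$ and the maximal number of iterations $T$:

$$k = 1 - rand() \times q, \quad q = \frac{1}{T^2 - 2T + 1} t^2 - \frac{2}{T^2 - 2T + 1} t + 1 + \frac{1}{T^2 - 2T + 1},$$

where $rand()$ is a uniformly distributed random number in $[0, 1]$.
Dandelion seeds also emphasize global exploration in the descending stage. The mean position of the population reflects the stability of the descent and helps the population travel toward the preferred region for reproduction:

$$X_{t+1} = X_t - \alpha \, \beta_t \, (X_{mean\_t} - \alpha \, \beta_t \, X_t), \quad X_{mean\_t} = \frac{1}{N} \sum_{i=1}^{N} X_i,$$

where $X_{mean\_t}$ denotes the mean position of the DO population at the $t$th iteration, and $\beta_t$ refers to Brownian motion.
Dandelion seeds focus on local neighborhood exploitation in the landing stage. Based on the rising and descending stages, each seed randomly chooses its landing site. The information around the current elite seed is utilized for local exploitation to approach the global optimum:

$$X_{t+1} = X_{elite} + levy(\lambda) \times \alpha \times (X_{elite} - X_t \times \delta),$$

where $X_{elite}$ stands for the best position of the seeds at iteration $t$, $T$ describes the maximal number of iterations, $levy(\lambda)$ represents the Lévy flight function with $\lambda = 1.5$, and $\delta$ increases linearly within $[0, 2]$.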
In the presented DO-FS method, the fitness function (FF) balances the number of selected features (to be minimized) against the classification accuracy obtained with the selected features (to be maximized), as follows:

$$Fitness = \alpha \, \gamma_R(D) + \beta \, \frac{|R|}{|C|},$$

where $\gamma_R(D)$ denotes the classification error rate, $\alpha$ and $\beta$ are two parameters weighting the importance of classification quality and subset length, with $\alpha \in [0, 1]$ and $\beta = 1 - \alpha$, $|R|$ indicates the cardinality of the chosen subset, and $|C|$ indicates the overall count of features in the database.
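To make the interplay between the three DO stages and the FS fitness concrete, the following is a simplified sketch rather than the authors' implementation. It assumes a search space of $[0, 1]^D$, binarizes a seed position with a 0.5 threshold, draws the Lévy step with Mantegna's method, samples the lognormal term directly, and replaces the classifier with a hypothetical `toy_error` function; the weight `alpha=0.9` is likewise an assumed value.

```python
import math
import random


def levy_step(rng, beta=1.5, scale=0.01):
    """One Levy-flight step via Mantegna's method (an assumed variant)."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1) or 1e-12  # avoid division by zero
    return scale * u / abs(v) ** (1 / beta)


def fs_fitness(error_rate, n_selected, n_total, alpha=0.9):
    """alpha * error + (1 - alpha) * |R| / |C|; smaller is better."""
    return alpha * error_rate + (1 - alpha) * n_selected / n_total


def to_mask(position, threshold=0.5):
    """Binarize a continuous seed position into a feature mask (assumed rule)."""
    return [1 if p > threshold else 0 for p in position]


def toy_error(mask, informative=(0, 2)):
    """Hypothetical stand-in for a classifier's error rate on a feature subset."""
    missed = sum(1 for j in informative if not mask[j])
    return 0.05 + 0.30 * missed


def objective(position):
    mask = to_mask(position)
    if not any(mask):  # an empty subset cannot be classified
        return 1.0
    return fs_fitness(toy_error(mask), sum(mask), len(mask))


def dandelion_optimize(f, dim, pop=20, T=60, seed=0):
    """Simplified DO loop over [0, 1]^dim: rising, descending, landing (T > 1)."""
    rng = random.Random(seed)
    clip = lambda x: [min(max(v, 0.0), 1.0) for v in x]
    X = [[rng.random() for _ in range(dim)] for _ in range(pop)]
    elite = min(X, key=f)[:]
    f_best = f(elite)
    for t in range(1, T + 1):
        alpha = rng.random() * (t * t / T ** 2 - 2 * t / T + 1)
        if rng.gauss(0, 1) < 1.5:  # "sunny": exploration toward a random location
            for i in range(pop):
                theta = rng.uniform(-math.pi, math.pi)
                r = 1 / math.exp(theta)
                lift = (r * math.cos(theta)) * (r * math.sin(theta))  # v_x * v_y
                ln_y = rng.lognormvariate(0, 1)
                X[i] = [x + alpha * lift * ln_y * (rng.random() - x) for x in X[i]]
        else:  # "rainy": local exploitation by shrinking with k
            q = (t * t - 2 * t + 1) / (T * T - 2 * T + 1) + 1
            for i in range(pop):
                k = 1 - rng.random() * q
                X[i] = [x * k for x in X[i]]
        X = [clip(row) for row in X]
        mean = [sum(col) / pop for col in zip(*X)]
        for i in range(pop):  # descending stage with Brownian samples
            new = []
            for x, m in zip(X[i], mean):
                b = rng.gauss(0, 1)
                new.append(x - alpha * b * (m - alpha * b * x))
            X[i] = clip(new)
        delta = 2 * t / T  # increases linearly up to 2
        for i in range(pop):  # landing stage around the elite seed
            X[i] = clip([e + levy_step(rng) * alpha * (e - x * delta)
                         for e, x in zip(elite, X[i])])
            fi = f(X[i])
            if fi < f_best:
                elite, f_best = X[i][:], fi
    return elite, f_best


elite, best = dandelion_optimize(objective, dim=5)
selected_mask = to_mask(elite)
```

In an actual wrapper setting, `toy_error` would be replaced by the validation error of the classifier trained on the masked features, which is the expensive part of each fitness evaluation.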

Data classification using FAE model
The FAE model is used for the classification of CCF. The autoencoder (AE) was initially introduced in the 1980s for dimensionality reduction and consists of an encoder and a decoder [23]. In the encoder, the input layer $X = \{x_1, x_2, \cdots, x_n\} \in \mathbb{R}^{m \times n}$ is mapped by a dimensionality reduction procedure to the hidden state $H = \{h_1, h_2, \cdots, h_n\} \in \mathbb{R}^{d \times n}$ with weight matrix $W \in \mathbb{R}^{d \times m}$ and bias vector $b_1 \in \mathbb{R}^{d \times 1}$. In the decoder, the hidden layer (HL) reconstructs the input layer with $W^{T} \in \mathbb{R}^{m \times d}$ and $b_2 \in \mathbb{R}^{m \times 1}$ by minimizing the loss function as follows:

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2.$$

The values of every node in the output layer are evaluated as follows:

$$\hat{x}_i = f(W^{T} h_i + b_2),$$

where $f(\cdot)$ refers to a non-linear activation function, generally the logistic sigmoid function $f(z) = 1/(1 + \exp(-z))$. The HL values are attained using the activation function $f(\cdot)$ and bias $b_1$:

$$h_i = f(W x_i + b_1).$$

The parameters $\theta = \{W, b_1, b_2\}$ are adjusted to minimize the loss function. In particular, the HL is considered an effective outcome of dimensionality reduction once the output reconstructs the input data. However, the conventional AE only minimizes the reconstruction error in an undirected manner, which provides weak supervision. Therefore, the classical AE is considered an unsupervised model. Figure 2 displays the structure of the AE.
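The encoder, decoder, and reconstruction loss described above can be sketched in a few lines of plain Python. This is an illustrative forward pass only: the tied decoder weights ($W^{T}$), the mean squared reconstruction error, the dimensions `m = 4` and `d = 2`, and the random initial weights are assumptions, and no training step is shown.

```python
import math
import random


def sigmoid(z):
    """Logistic sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def encode(x, W, b1):
    """Hidden state h = f(W x + b1); W has one row per hidden unit."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(W, b1)]


def decode(h, W, b2):
    """Reconstruction x_hat = f(W^T h + b2), reusing W (tied weights, an assumption)."""
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(h))) + b2[i])
            for i in range(len(b2))]


def reconstruction_loss(batch, W, b1, b2):
    """Mean squared reconstruction error over a batch of input vectors."""
    total = 0.0
    for x in batch:
        x_hat = decode(encode(x, W, b1), W, b2)
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(batch)


rng = random.Random(0)
m, d = 4, 2  # input and hidden dimensions (hypothetical)
W = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(d)]
b1, b2 = [0.0] * d, [0.0] * m
batch = [[0.1, 0.9, 0.4, 0.7], [0.8, 0.2, 0.5, 0.3]]
h = encode(batch[0], W, b1)
loss = reconstruction_loss(batch, W, b1, b2)
```

Training would adjust `W`, `b1`, and `b2` to reduce `loss`; once the reconstruction is accurate, `h` serves as the reduced representation of the input.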
The FAE objective combines the reconstruction loss with a clustering-oriented loss, where $\theta = \{W, b_1, b_2\}$ is the parameter set of the model and a trade-off parameter regulates the relative impact of the reconstruction loss and the clustering-oriented loss. The FAE is trained in a self-supervised way and enhances the discriminability of the learned features by introducing the clustering-oriented loss, which is optimized with a fuzzy optimizer. In the training process, the HL features of each block are forced to cluster towards the block center, leading to features with the best separability. The block center of the HL features is updated in every iteration, and the discrimination of the learned features is improved by the similarity of the instances within the block.
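The exact form of the clustering-oriented loss is not recoverable from the text above, so the following sketch only illustrates the general idea under stated assumptions: hidden features are pulled toward centers with fuzzy c-means style memberships, and the result is added to the reconstruction loss with a trade-off weight. The names `fuzzy_memberships`, `clustering_loss`, `fae_objective`, `lam`, and `fuzzifier` are all hypothetical.

```python
def fuzzy_memberships(h, centers, fuzzifier=2.0):
    """Fuzzy c-means style memberships of one hidden vector to each center (fuzzifier > 1)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(h, c)) for c in centers]
    if any(v == 0.0 for v in d2):  # h lies exactly on one or more centers
        hits = [1.0 if v == 0.0 else 0.0 for v in d2]
        return [v / sum(hits) for v in hits]
    power = 1.0 / (fuzzifier - 1.0)
    inv = [(1.0 / v) ** power for v in d2]
    return [v / sum(inv) for v in inv]


def clustering_loss(hidden, centers, fuzzifier=2.0):
    """Membership-weighted squared distance of hidden features to the centers."""
    total = 0.0
    for h in hidden:
        u = fuzzy_memberships(h, centers, fuzzifier)
        for uj, c in zip(u, centers):
            total += (uj ** fuzzifier) * sum((a - b) ** 2 for a, b in zip(h, c))
    return total / len(hidden)


def fae_objective(rec_loss, hidden, centers, lam=0.1):
    """Reconstruction loss plus lam times the clustering-oriented loss."""
    return rec_loss + lam * clustering_loss(hidden, centers)


# Hypothetical hidden features and two centers (e.g., legitimate vs. fraud).
hidden = [[0.1, 0.1], [0.9, 0.9]]
centers = [[0.0, 0.0], [1.0, 1.0]]
u = fuzzy_memberships(hidden[0], centers)
total = fae_objective(0.2, hidden, centers, lam=0.1)
```

A larger `lam` puts more weight on compact, well-separated hidden features at the expense of reconstruction quality, which is the trade-off the paragraph above describes.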

Hyperparameter selection using IBOA
The IBOA is exploited for the optimal selection of the hyperparameter values. Like other metaheuristics, the billiard optimization algorithm (BOA) has various disadvantages, such as occasional instability and premature convergence [24]. Therefore, an improved version of BOA is introduced to resolve these drawbacks. To improve the efficacy of IBOA, a Lévy flight mechanism is employed, which balances exploration and exploitation:

$$Le(w) \approx w^{-1-\tau},$$

$$w = \frac{A}{|B|^{1/\tau}},$$

$$\sigma^2 = \left\{ \frac{\Gamma(1+\tau)}{\tau \, \Gamma\big((1+\tau)/2\big)} \cdot \frac{\sin(\pi\tau/2)}{2^{(\tau-1)/2}} \right\}^{2/\tau},$$

where $w$ describes the step size, $\Gamma$ denotes the Gamma function, $\tau$ represents the Lévy index between zero and two, $A, B \sim N(0, \sigma^2)$, and the value of $\tau$ is assumed to be 3/2. In the upgraded position rule for the ordinary balls, rand() denotes the random location vector, and two arbitrary variables are drawn from $[0, 2]$ and $[0, 1]$, respectively. The pseudocode of IBOA is shown in Algorithm 1.
The classifier outcome of the ECFAE-CCFD method on the German credit database is exhibited in Figure 3. The confusion matrices obtained by the ECFAE-CCFD model with a 70:30 split between the training phase (TRPH) and the testing phase (TSPH) are illustrated in Figure 3(a) and (b). These findings indicate that the ECFAE-CCFD algorithm appropriately detects and categorizes the two classes. The precision-recall (PR) outcome of the ECFAE-CCFD approach is depicted in Figure 3(c). The simulation values show that the ECFAE-CCFD algorithm attains high PR values on both classes. Moreover, the ROC examination of the ECFAE-CCFD methodology is demonstrated in Figure 3(d). The outcomes show that the ECFAE-CCFD method performs well, with high ROC values on both classes. With 70% of TRPH, the ECFAE-CCFD approach achieves average values of 94.30%, 94.30%, 94.30%, 94.30%, and 94.62% on the five reported measures. Similarly, with 30% of TSPH, the ECFAE-CCFD method achieves average values of 94.49%, 94.49%, 94.49%, 94.49%, and 95.72% on the same measures.
The training and testing accuracy of the ECFAE-CCFD algorithm on the German credit dataset are described in Figure 4. The training accuracy is calculated by evaluating the ECFAE-CCFD algorithm on the training (TR) dataset, while the testing accuracy is measured on the testing (TS) dataset. The experimental outcome shows that both accuracies increase with the epoch count. So, the efficiency of the ECFAE-CCFD approach improves on the TR and TS datasets with a higher epoch count.
The classifier outcome of the ECFAE-CCFD method on the credit fraud detection dataset is shown in Figure 6. The confusion matrices achieved by the ECFAE-CCFD system with a 70:30 TRPH/TSPH split are depicted in Figure 6(a) and (b). The findings show that the ECFAE-CCFD technique precisely recognizes and categorizes the two classes. Next, the PR examination of the ECFAE-CCFD algorithm is shown in Figure 6(c). The simulation values show that the ECFAE-CCFD algorithm has high PR outcomes on both classes. Finally, the ROC curve of the ECFAE-CCFD methodology is represented in Figure 6(d).
Table 3 reports the experimental validation of the ECFAE-CCFD model under the 70:30 split of the credit fraud detection database. The simulation outcomes of the ECFAE-CCFD approach cover the good and bad samples. Based on 70% of TRPH, the ECFAE-CCFD system attains average values of 96.83%, 96.83%, 96.83%, 96.83%, and 96.82% on the five reported measures. Afterward, on 30% of TSPH, the ECFAE-CCFD technique attains average values of 96.58%, 96.58%, 96.58%, 96.58%, and 96.65% on the same measures.
The training and testing accuracy of the ECFAE-CCFD technique on the credit fraud detection dataset are depicted in Figure 7. The training accuracy is calculated by evaluating the ECFAE-CCFD system on the TR dataset, while the testing accuracy is measured on the testing dataset. The experimental outcome shows that both accuracies increase with the epoch count. Therefore, the effectiveness of the ECFAE-CCFD method improves on the TR and TS datasets with the maximum epoch count.
In Table 4, a comprehensive comparison of the ECFAE-CCFD model is made with recent models [25]. Figure 9 presents a brief analysis of the results of the ECFAE-CCFD technique with respect to three measures. On the first measure, the ECFAE-CCFD technique is reported at 96.83%, whereas the AdaBoost, LR, RF, SVM, ELM, IG-ELM, GAW, and ML-HFSICCFD techniques are reported at 80.23%, 80.24%, 83.23%, 90.70%, 80.71%, 79.29%, 89.80%, and 95.97%, respectively. On the second measure, the ECFAE-CCFD method is reported at 87%, whereas the AdaBoost, LR, RF, SVM, ELM, IG-ELM, GAW, and ML-HFSICCFD approaches are reported at 64%, 84%, 65%, 73%, 89%, 90%, 94%, and 96.83%, respectively. On the third measure, the ECFAE-CCFD approach is reported at 88%, whereas the AdaBoost, LR, RF, SVM, ELM, IG-ELM, GAW, and ML-HFSICCFD systems are reported at 72.30%, 87%, 71.60%, 79.30%, 89.20%, 90.90%, 95.20%, and 96.82%, respectively. Figure 10 presents a brief analysis of the results of the ECFAE-CCFD method in terms of two further measures. On the first of these, the ECFAE-CCFD system is reported at 87%, whereas the AdaBoost, LR, RF, SVM, ELM, IG-ELM, GAW, and ML-HFSICCFD algorithms are reported at 62.50%, 82.90%, 62.60%, 71%, 87.40%, 89.90%, 94.50%, and 96.83%, respectively. On the second, the ECFAE-CCFD system is reported at 89%, whereas the AdaBoost, LR, RF, SVM, ELM, IG-ELM, GAW, and ML-HFSICCFD algorithms are reported at 83.70%, 91.40%, 81.90%, 88.50%, 91.10%, 92%, 96.10%, and 96.83%, respectively. These outcomes highlight the efficacy of the ECFAE-CCFD method compared with other systems.
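For readers who wish to reproduce measures of this kind from a confusion matrix, the following sketch assumes the standard definitions of accuracy, precision, recall, and F-score for a binary fraud/non-fraud problem. The counts are hypothetical and do not correspond to any figure or table in this study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix measures for the positive (fraud) class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical counts: 90 frauds caught, 10 missed, 10 false alarms, 90 correct rejections.
metrics = binary_metrics(tp=90, fp=10, fn=10, tn=90)
```

Averaging these per-class values over the fraud and non-fraud classes gives the kind of class-averaged figures discussed in this section; on heavily imbalanced data, precision and recall are usually more informative than accuracy alone.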

Conclusions
In this manuscript, we have presented the ECFAE-CCFD method. The major purpose of the ECFAE-CCFD model is to detect the presence of CCF in real time. To accomplish this, the ECFAE-CCFD technique comprises data normalization, DO-FS-based feature subset selection, FAE classification, and IBOA-based parameter tuning. The ECFAE-CCFD method exploits the DO-FS technique for effective selection of the features. Meanwhile, the FAE approach is exploited for the recognition and classification of CCF. Finally, the IBOA is applied for the optimum selection of the parameters of the FAE algorithm, increasing the classification accuracy. The simulation outcomes of the ECFAE-CCFD method were examined on a benchmark open-access database. The obtained values display the promising performance of the ECFAE-CCFD system in terms of various measures.
The study could benefit from a more comprehensive examination of the applicability of the ECFAE-CCFD technique in practical scenarios. Specifically, insights into the adaptability of the method to diverse financial ecosystems, different scales of credit card transaction datasets, and the computational resources needed for real-time implementation could improve its relevance to real-world deployment. Furthermore, considering challenges such as data privacy regulations and integration with existing financial systems would provide a more detailed understanding of the feasibility and potential hurdles of the method in an actual operational context.
While the proposed method illustrates considerable developments in the field of automated fraud detection, it is crucial to consider its wider impact, especially in terms of ethical considerations. Automated fraud detection systems, such as ECFAE-CCFD, raise concerns regarding bias, privacy, and transparency. Ethical implications might emerge from the wide usage of personal financial information and the potential for false positives impacting individuals. Transparency in the algorithm's decision-making process is vital to building trust, and this study could benefit from discussing how ECFAE-CCFD contributes to or addresses this ethical consideration. Furthermore, attention should be given to potential bias in the training data that might inadvertently perpetuate discriminatory outcomes. Since an automated fraud detection system plays a major role in financial transactions, an ethical discussion surrounding the deployment of ECFAE-CCFD must emphasize the need for fair and responsible practices, ensuring that the benefits of improved fraud detection are balanced with ethical considerations to protect user trust and privacy.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
The ECFAE-CCFD model resulted in promising performance with enhanced ROC results on two classes, as shown in Figure 6(d).

Figure 10. Comparative analysis of the ECFAE-CCFD algorithm with other methods.

Table 1. Description of two datasets.

Table 2. Classifier analysis of ECFAE-CCFD algorithm on German credit dataset.

Table 3. Classifier outcome of ECFAE-CCFD algorithm on credit fraud detection dataset.