Machine Learning Models and Applications for Early Detection

From the various perspectives of machine learning (ML) and the multiple models used in this discipline, there is an approach aimed at training models for the early detection (ED) of anomalies. The early detection of anomalies is crucial in multiple areas of knowledge, since identifying and classifying them allows for early decision making and provides a better response to mitigate the negative effects caused by late detection in any system. This article presents a literature review to examine which machine learning models (MLMs) operate with a focus on ED in a multidisciplinary manner and, specifically, how these models work in the field of fraud detection. A variety of models were found, including Logistic Regression (LR), Support Vector Machines (SVMs), decision trees (DTs), Random Forests (RFs), the naive Bayes classifier (NB), K-Nearest Neighbors (KNN), artificial neural networks (ANNs), and Extreme Gradient Boosting (XGB), among others. It was identified that MLMs operate either as isolated models or as combined models, categorized in this article as Single Base Models (SBMs) and Stacking Ensemble Models (SEMs), respectively. MLMs for ED in multiple areas achieved accuracies greater than 80% under SBM implementation and greater than 90% under SEM implementation. In fraud detection, accuracies greater than 90% were reported by the authors. The article concludes that MLMs for ED in multiple applications, including fraud, offer a viable way to identify and classify anomalies robustly, with a high degree of accuracy and precision. MLMs for ED in fraud are useful as they can quickly process large amounts of data to detect and classify suspicious transactions or activities, helping to prevent financial losses.


Introduction
Machine learning (ML) has become a discipline that automates repetitive and complex tasks through its algorithms, thereby increasing operational efficiency across various organizations. ML analyzes large amounts of data to identify patterns and trends, aiming to improve decision making in different contexts [1][2][3][4][5][6][7]. This discipline of artificial intelligence trains models based on data analysis to make automatic predictions, allowing the models to deduce the correct labels based on the learning acquired from historical data.
Although ML has advantages over classical methods used in multiple areas, each ML model is unique. Each model is trained with data of different characteristics that must be identified to make correct predictions with a high degree of accuracy and precision (e.g., numerical, alphanumeric, and discrete data). Additionally, ML training takes a long time due to the large volumes of data that the models require [3].
Another associated disadvantage is the interpretability of the data [8], which limits the understanding of the model needed to make high-quality predictions. There is also an ongoing need for high-quality data, which must be sufficiently reliable and must effectively express the problem statement. In addition to this, a sufficient volume of data is needed.
Given the importance of ML-based ED, this article aims to conduct a systematic literature review to identify the most commonly used ML models (MLMs) for ML-based ED in the aforementioned areas. As a second objective, this article aims to identify new methodologies for using these MLMs to improve classification or prediction in ED. This literature review is motivated by the exploration of MLMs currently used in ED for fraud detection. The intention of this article is to identify how ML-based ED models have impacted the field of fraud detection by analyzing their advantages and disadvantages, and to present a discussion on the benefits that the fraud domain can obtain as part of Fiscal Surveillance and Control.
This review aims to provide academics and professionals with guidance in their work, facilitating the quick identification of current algorithms and methodologies used in ML for the application of ED in the referenced areas and in the field of fraud. The main contributions of this article are as follows:
• The presentation of the most used MLMs for ED.
• The division of MLMs for ED into two main categories: Single Base Model (SBM) and Stacking Ensemble Model (SEM).
• The identification of SBM or SEM in ED for fraud.
• A discussion on how ML-based ED can improve processes in fraud.
The article is structured in the following sections: Section 2 describes the research article selection process for a systematic literature review on MLMs for ED. Section 3 gives an overview of data balancing and model validation metrics currently used in machine learning. Section 4 gives an overview of the machine learning models found and their performance in multiple applications, and specifically in fraud detection. Section 5 discusses the performance of the machine learning models found for early detection in multiple areas and the importance of using these models in early fraud detection. Finally, conclusions are presented.

Article Selection Process
A systematic literature review is a research approach that examines information and findings regarding a research topic [31]. This approach aims to locate the largest possible number of relevant studies on the subject of study and, through referenced or proprietary methodologies, determine what can be confidently asserted from these studies [32,33]. This section provides an overview of the literature to help understand the MLMs for ED used in the present review.
In this article, the process of searching and selecting articles consists of two stages, aiming to answer the following two questions:
• RQ1: Which MLMs are currently used in the literature for early detection in multiple areas?
• RQ2: How have these MLMs for ED been implemented in the context of fraud?
In stage 1, based on RQ1 and RQ2, Scopus was selected as the search engine, and the following keywords were used for the search equation: "machine learning model", "data analytics" or "data analysis", "early detection", and, finally, "fraud detection". These words were searched in the titles, abstracts, and keywords of the articles according to the use of the Scopus search engine. These five keywords formed the first search equation within the time frame between 2017 and 2025. As a result of this search, a total of 61 scientific articles were obtained, included in electronic databases such as Springer Link, Elsevier, IEEE, MDPI, and Taylor and Francis.
Figure 1 presents an analysis of the occurrence of words in the selected articles. The co-occurrence analysis of the words in the articles was performed on the information exported from the Scopus database, which includes citation information (author(s), document title, year, source title, etc.), bibliographical information (abbreviated source title, affiliations, etc.), abstract and keywords (abstract, author keywords, and indexed keywords), funding details (acronym, etc.), and other information (conference information and included references). The color map in Figure 1 illustrates the frequency of recurring concepts found in the literature. It is inferred, then, that models such as Logistic Regression (LR), Support Vector Machines (SVMs), decision trees (DTs), data analytics (DA), and Random Forests (RFs) are base models used in ED or early diagnosis.
Having reviewed the information from this preliminary search, it was identified in stage 2 that, from a methodological standpoint, stacking ensembles have been implemented in ML in recent years with the aim of improving the prediction of base models. This allowed the initial search equation to be extended by including the keyword "Stacking Ensemble" with a focus on ED. As a result, a total of 62 articles related to ML-based ED were obtained. Additionally, a search matrix was synthesized to identify the contribution of each article, the base MLMs (49 articles), the ensemble MLMs (12 articles), their application areas, and one literature review article associated with the topic. Adding "Stacking Ensemble" provided only one additional article, completing the search for ensemble models. As the review advanced, the above keywords accurately covered multiple articles on ensemble models, ensuring that the information on models under this methodology was mapped as thoroughly as possible.
Finally, the necessary information was extracted regarding applications and MLMs applied to ED in multiple areas (49 articles), as shown in the SBM and SEM tables, and particularly in the fraud detection table (12 articles).

MLM Data Balancing and Performance Metrics
In the literature review consulted, various procedures were evident for conducting proper validations of MLMs, such as data balancing and the application of performance metrics. Validating MLMs is crucial to ensure their reliability and effectiveness in decision making. Validation allows for evaluating the predictive capacity of models, identifying potential issues like overfitting, and ensuring that the results obtained are generalizable and consistent with new data [34]. In model validation, a model's performance, accuracy, and capability to handle different scenarios can be verified, which helps ensure that it is useful and reliable in real-world applications [17].

Data Balancing
Data balancing in MLMs addresses the problem of class imbalance. One class may have significantly more instances than another in a dataset. When classes are imbalanced, models tend to favor the class with more instances, which can lead to poor performance in predicting the class with fewer instances.
Using data balancing techniques in the preprocessing of information to train an ML model improves the model's ability to learn patterns from all classes equitably, resulting in more accurate, precise, and generalizable predictions.The purpose of balancing is to achieve an equilibrium where the detection of both minority and majority classes is of interest.
Data balancing can be achieved through techniques such as oversampling (duplicating instances of the minority class), undersampling (removing instances of the majority class), or more advanced methods like SMOTE (Synthetic Minority Over-sampling Technique) [35]. These techniques help improve the predictive capability of models by ensuring that all classes are treated equitably during training.
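As an illustration of the simplest of these techniques, the following sketch implements random oversampling in plain Python; SMOTE, by contrast, synthesizes new minority-class points rather than duplicating existing ones. The dataset and function name here are purely illustrative, not taken from any reviewed study.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class instances at random until every class
    matches the size of the majority class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(v) for v in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        picked = list(xs)
        while len(picked) < target:
            picked.append(rng.choice(xs))  # sample with replacement
        out_x.extend(picked)
        out_y.extend([y] * target)
    return out_x, out_y

# Toy imbalanced dataset: class 1 is the minority (1 of 5 instances).
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
Xb, yb = random_oversample(X, y)
print(Counter(yb))  # both classes now have 4 instances each
```

Undersampling would do the inverse, trimming each class down to the minority-class size at the cost of discarding information.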

Performance Metrics
Performance metrics in ML are measures used to evaluate the performance and effectiveness of machine learning models in data prediction and classification. In the literature review consulted, metrics such as accuracy, precision, recall (sensitivity), specificity, F1-score, AU-ROC (area under the receiver operating characteristic curve), AU-PRC (area under the precision-recall curve), the MCC (Matthews correlation coefficient), and the confusion matrix [34] are used. These metrics provide information about the predictive capability, accuracy, and overall effectiveness of MLMs. All the mentioned metrics are related to the confusion matrix. The confusion matrix (Table 1) is a tool that allows visualizing the performance of a classification model by showing the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) that the model has produced on a test dataset [22]. Table 1 presents the confusion matrix for a binary problem. The confusion matrix is not limited to binary problems; in a multiclass classification context, it is expanded to include all classes present in the problem, so its dimension will be N, corresponding to the number of classes. Table 2 presents the descriptions of the evaluation metrics.
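The metrics in Table 2 can all be derived directly from the four confusion-matrix counts. The following minimal sketch (the counts are illustrative, not taken from any reviewed study) makes the relationships explicit:

```python
def metrics_from_confusion(tp, tn, fp, fn):
    """Compute common evaluation metrics from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# Hypothetical test-set counts for a binary classifier.
acc, p, r, spec, f1 = metrics_from_confusion(tp=80, tn=90, fp=10, fn=20)
print(f"accuracy={acc:.2f} precision={p:.2f} recall={r:.2f} specificity={spec:.2f}")
```

Accuracy alone can be misleading on imbalanced data, which is why the reviewed studies typically report precision, recall, and F1-score alongside it.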

Machine Learning Models for ED and Applications
According to the literature review, two methodologies for applying MLMs in multiple applications were identified. These methodologies describe the use of MLMs from the following perspectives.
Single Base Models (SBMs). These base models serve as individual classifiers or regressors that make predictions based on input data. Among the most common base models are LR (Logistic Regression), SVM (Support Vector Machine), DT (decision tree), RF (Random Forest), NB (naive Bayes), K-Nearest Neighbors (KNN), and neural networks (NNs). When using a single model, the choice depends on the characteristics, distribution, and properties of the datasets [16]. SBMs are also used as a basis for ensemble methods and more complex stacking models [36]. Table 3 presents a brief description of SBMs.
Stacking Ensemble Models (SEMs). These involve combining multiple base models to improve the predictive performance of SBMs. These models use a two-level stacking approach; base models make predictions at the first level, and a meta-learner combines these predictions at the second level [13,16]. The purpose of this ensemble methodology is to combine two or more models, each with its strengths and weaknesses, to construct a more robust model. Stacking ensemble models have proven to be promising in various applications by offering advantages such as improved accuracy, reduced overfitting, and enhanced performance compared to individual models [37]. SEMs use boosting, bagging, and stacking schemes [30]. Each SEM operates within its own domain space, showing varying levels of performance based on the aggregated selection of base models and the distribution, nonlinearity, and class imbalance present in the dataset. Some of the most popular boosting algorithms are AdaBoost (Adaptive Boosting), Gradient Boosting, XGBoost 2.0.1, and LightGBM 4.4.0 [13]. These algorithms differ in how they adjust weights and combine weak models to form the final model, but they follow the general scheme of boosting. An example of bagging is RF, where multiple decision trees are trained on training datasets generated by bootstrap sampling (sampling with replacement), and predictions from individual trees are averaged to produce the final prediction.
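The two-level stacking scheme described above can be sketched as follows, assuming scikit-learn is available. The base models (RF, DT, KNN) and the LR meta-learner mirror combinations commonly reported in the reviewed literature; the synthetic dataset and all hyperparameter values are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary dataset standing in for an anomaly-detection task.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Level 1: heterogeneous base models; level 2: LR as the meta-learner.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold base predictions feed the meta-learner, limiting leakage
)
stack.fit(X_tr, y_tr)
print(f"stacked accuracy: {stack.score(X_te, y_te):.2f}")
```

The `cv` parameter matters here: training the meta-learner on out-of-fold predictions, rather than on predictions over the same data the base models saw, is what keeps the second level from simply memorizing base-model overfitting.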
Within the literature review, different types of SBMs and SEMs were found. Under the search parameters in the synthesis matrix of the literature review, different configurations of these models for ED were obtained in areas such as medicine (37 articles), fraud detection (11 articles), agronomy (2 articles), energy efficiency (2 articles), industrial processes (2 articles), education (1 article), and telecommunications (1 article). Table 4 presents the multidisciplinary MLM-SBMs for early detection in multiple areas, excluding articles related to fraud detection, which are analyzed later. Table 5 presents the multidisciplinary MLM-SEMs for early detection.

Model Description
LR [38] A statistical model used to analyze the relationship between a dependent variable (binary outcome) and one or more independent variables. It is commonly used for binary classification tasks where the outcome variable is categorical with two possible outcomes. Logistic regression estimates the probability that a given input belongs to a specific category by fitting the data to a logistic function, which transforms the outcome into an interval between 0 and 1.

SVM [39]
A supervised machine learning algorithm used for classification and regression tasks. SVM works by finding the optimal hyperplane that best separates data points into different classes in a high-dimensional space. Its goal is to maximize the margin between the classes, making it effective for both linear and nonlinear classification problems. SVM can handle high-dimensional data and is known for its ability to generalize well to unseen data.
DT [40] A machine learning algorithm used for classification and regression tasks. It is a tree-shaped model where internal nodes represent features, branches represent decisions based on those features, and leaf nodes represent the outcome or decision. The algorithm recursively splits the data based on the most significant feature at each node, aiming to create homogeneous subsets. Decision trees are easy to interpret and visualize, making them valuable for understanding the decision-making process in a model. They can handle both numerical and categorical data, making them versatile for various types of datasets.
RF [41] A machine learning algorithm composed of multiple decision trees. Each tree is built using bootstrapping and random feature selection to create an ensemble of uncorrelated trees, resulting in more accurate predictions than individual trees. The algorithm leverages the concept of collective knowledge, where the forest of decision trees works together to make predictions, and the final prediction is based on the majority vote of the trees.

NB [42]
A probabilistic classifier based on the application of Bayes' theorem. It assumes that the presence of a particular feature in a class is not related to the presence of any other feature. Despite their simplicity, naive Bayes classifiers are known for their efficiency and effectiveness in various classification tasks, especially in text classification and spam filtering.

KNN [43]
A machine learning algorithm used for classification and regression tasks. In KNN, the class or value of a data point is determined by the majority class or the mean value of its nearest neighbors in the feature space. The algorithm calculates the distance between data points and classifies them based on the majority class of the nearest k data points.
ANN [44,45] A computational model inspired by the structure and functioning of the neural networks in the human brain. ANNs consist of interconnected nodes, known as artificial neurons, that process information and learn patterns from data. These networks are used in machine learning and deep learning to solve complex problems such as pattern recognition, classification, and regression, among others.
XGBoost [46,47] A model that uses gradient boosting to optimize the loss function and handle complex patterns in data. XGBoost is widely used for classification, regression, and ranking tasks due to its speed, accuracy, and ability to handle large datasets efficiently. It uses decision trees as base models and trains them sequentially. XGBoost is in some cases considered a base model grounded in DT.
According to Table 4, the most used SBMs for early detection, based on the reviewed literature, are Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN). These models are applied in the medical field, achieving average performances above 80% according to the metrics reported by various authors. In other fields, RF also stands out as a frequently used model, with performance levels similarly approaching 80%.
On the other hand, in the SEMs presented in Table 5, RF is identified as the most commonly used base model, followed by DT; then LR, SVM, and KNN with the same frequency; followed by ANN and XGB, also at the same level. NB is the least used base model for early detection in the literature studied. LR is identified as the most used meta-learner, due to its simplicity in computation and inference for decision making. It is also identified that the SEM methodology achieves performances in some cases exceeding 90%.
According to Table 5, SEMs yield superior results compared with SBMs. It is important to note that the SEM approaches found are implemented in the medical field, which is a critical area for decision making, as a misdiagnosis can have severe repercussions in various health contexts. This justifies why most of the literature on SBMs and SEMs consulted focuses on the field of medicine, in comparison with other areas for early detection.
Hybrid models combining characteristics of SBMs and SEMs were also found. The selection of base, ensemble, or hybrid models will depend on the working context and the characteristics of the data.

Machine Learning Models for ED in Fraud Detection
MLMs are used in fraud detection to analyze patterns and behaviors in data in order to identify fraudulent activities. These models are important for early fraud detection as they can quickly process large amounts of data to detect and classify suspicious transactions or activities, helping to prevent financial losses.
ML-based models for ED offer several advantages over traditional fraud detection methods. MLMs have the ability to adapt and improve over time as more data are analyzed, thereby increasing their accuracy in fraud detection [69]. They can also analyze complex and diverse data sources, allowing them to detect sophisticated and evolving fraud schemes that may go unnoticed by traditional rule-based systems or human intervention [70]. Table 6 presents the MLMs found in the literature for ED in fraud detection. (Additional models appearing in Table 6 include the Gaussian Process, Isolation Forest, DenseStream, Boosting Trees, Lasso Regression, Decision Support System, Generalized Linear Model, Linear Regression, Gradient Boosted Tree, CatBoost, Classification and Regression Tree, and Spectral Clustering.)
Only the use of SBMs was found for fraud detection. RF persists as the most frequently used model, followed by KNN; LR, SVM, and XGB are used at the same level for fraud detection. RF continues to be the most used and reliable model for fraud detection according to Table 6, with two authors reporting better performance with this model. Although RF is based on bagging and XGB utilizes boosting, both are considered base models grounded in DT. Table 6 reflects that in most cases, each MLM achieved a high number of accurate fraud detections in real fraud cases, with authors reporting accuracy metrics exceeding 90%.
All the MLMs found in Tables 4-6 were evaluated using the metrics reported in Table 2. Each achieved positive performance metrics in the respective contexts where they were implemented.

Discussion
ML algorithms are increasingly used in various fields due to their ability to adapt to new data and identify hidden patterns, enabling decision making with a higher degree of reliability. Although most models found in the literature work as standalone base models, the use and experimentation with ensemble methodologies to improve the performance of base models is becoming increasingly common.
The information in Tables 4 and 5 shows that base models are used diversely across multiple areas, unlike SEMs. Specifically, Table 5 reports that SEMs work exclusively in the medical field. This is justified because SEMs, by gathering the various decisions from SBMs, allow for a more robust acquisition of data variability and a better fit to the data. Consequently, the identification of the problem is more accurate and precise, whether in the context of regression or classification. The unification of these decisions constitutes a more solid knowledge base that serves as input to another model for final decision making. This is particularly important in the medical field, where decision making is critical for diagnosing a person, requiring a minimal margin of error.
Within the models in Table 6 for early fraud detection, all models found in the literature are SBMs. Although the authors do not consider SEMs for early fraud detection, the SBMs used achieved significant performance, with good adaptability to emerging patterns, good training times, and good adaptation to data for fraud detection [1,70,71,73-75].
An important aspect in achieving satisfactory results in the training, validation, and testing of SEMs and SBMs in any context is the associated data engineering analysis [77]. This refers to the effective selection of the data characteristics that are supplied as information to the SBM. Additionally, data balancing techniques, such as the SMOTE method [74], are analyzed to balance majority and minority classes [69].
For SEMs, this aspect is not considered critical, since the training, validation, and testing patterns are the responses of the SBMs. However, in this case, cross-validation processes must be ensured to avoid overfitting issues, model selection bias, and errors in variance estimation. The use of SEMs must strictly consider cross-validation methods such as K-Fold, Hold-Out, Leave-One-Out, Leave-P-Out, Monte Carlo, Stratified K-Fold, Repeated K-Fold, Time Series Cross-Validation, and Nested Cross-Validation [25,52]. These methods provide a more reliable estimation of model performance on unseen data, reducing the risk of overfitting and enabling better hyperparameter tuning and model evaluation.
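The index bookkeeping behind K-Fold cross-validation can be sketched in a few lines of plain Python. The function name here is illustrative; libraries such as scikit-learn provide production implementations of this and the other schemes listed above.

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for K-Fold cross-validation:
    the data are shuffled once, split into k folds, and each fold is
    held out exactly once as the test set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold_size, rem = divmod(n_samples, k)
    start = 0
    for i in range(k):
        stop = start + fold_size + (1 if i < rem else 0)
        test = idx[start:stop]
        train = idx[:start] + idx[stop:]
        yield train, test
        start = stop

folds = list(kfold_indices(10, k=5))
print(len(folds))  # 5 train/test splits; each sample is held out exactly once
```

Stratified K-Fold additionally preserves the class proportions in every fold, which matters for the imbalanced datasets typical of fraud detection.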
The advantages of machine learning models in early anomaly detection include their ability to process large amounts of data quickly, identify suspicious patterns, and adapt and improve over time.
However, some disadvantages may include the lack of available labeled data and the complexity of identifying fraudulent operations. A brief description of some advantages, disadvantages, and areas for improvement for the models found in the literature for early anomaly detection is presented below. The use of MLMs in ED for fraud detection, whether as SBMs or SEMs, will continue to be the subject of study in areas such as corporate security, surveillance, and fiscal and financial control, due to their ability to process large amounts of data rapidly and adapt to new information over time.
A determining factor in MLM training for fraud ED is the limitation in data collection for model training. Data collection for fraud ED is constrained by the availability of labeled data due to the lack of digitalization of this information. Without digitized data, the process of consolidating a labeled historical dataset is slow and costly, which restricts the applicability of MLMs in fraud ED, especially in cases such as tax fraud detection [29]. The lack of digitized data can hinder the effectiveness and accuracy of fraud ED processes, as manual data handling is time-consuming and error-prone. Moreover, labeling transactions as fraudulent or non-fraudulent can be complex, since the fraudulent nature of a transaction is difficult to assert definitively. This demands careful use of labeled examples and verification by expert personnel to identify such specific fraud, which is also subject to ethical considerations [69].
It is important that, in the field of fraud detection with ED, strategies for data balancing be considered based on the available information, in order to reduce intrinsic bias that may include human manipulation or unwanted value judgments when labeling data.
Implementing SEMs as MLMs for fraud ED offers several advantages over base models and traditional methods. Some of the advantages that this type of model can offer include:

• Improved prediction performance by maximizing fraud detection through effective identification of patterns and anomalies in the data. Using features based on SBM responses allows the analysis of behaviors that may not have been explored in traditional methods, enabling a more comprehensive analysis of fraud indicators. Additionally, the adaptability and robustness of SEMs enable them to adjust to the strengths and weaknesses of multiple baseline models, improving overall detection performance and robustness in identifying fraudulent activities [16].
• Combining the predictive power of various models enables the identification of fraudulent behaviors at an early stage with greater accuracy, which allows for timely intervention and prevention of fraudulent activities [20].
• SEMs provide a more reliable balance between precision and interpretability, making them operationally viable for fraud detection tasks, due to the adoption of the features and operating dynamics of SBMs.

Conclusions
The literature review under the search equation allowed for the consolidation of two ways of using MLMs for ED: Single Base Models (SBMs) and Stacking Ensemble Models (SEMs).The implementation of SBMs was identified in different areas, whereas SEMs were implemented only in the field of medicine due to the high precision required in this area.
The implementation of SEMs can favor and strengthen conventional fraud detection efficiently, improving prediction performance by leveraging the integration of features from different base models. An SEM enables better adaptability to data and robust decision making. Additionally, it provides more accurate early detections in scenarios with high data variability, reduces issues such as overfitting, and helps handle biases in the data.
Both SBMs and SEMs have proven to be efficient in early detection across multiple areas, particularly in fraud, with accuracies in some cases exceeding 90%.For the use of MLMs, it will always be relevant to perform data engineering processes to select appropriate features for model training, and to pay special attention to data balancing to achieve adequate results in predictions.
From the analyzed information, it can be inferred that a challenging task in the field of fraud detection is the consolidation of reliable databases for training MLMs for ED, as well as the adoption of new models and cutting-edge methodologies such as deep learning.
Future research lines may be oriented toward further advancements in techniques and technologies such as deep learning models used as SBMs and within SEMs to enhance the accuracy, efficiency, and scalability of fraud detection systems. Other potential research lines include enhanced data enrichment (the quality and quantity of data); advanced machine learning algorithms in fraud detection (convolutional neural networks, ANNs, and Long Short-Term Memory networks); real-time fraud detection; and blockchain technology for secure and transparent transaction verification, enhancing fraud detection capabilities through immutable and decentralized data storage.
Another future line of research is to focus the analysis of machine learning models on specific areas of application. As found here, far more studies are available in the field of medicine than in any other area, so narrowing the search can provide a selective scope that may be valuable to researchers in a given field. In addition, future work should aim to gather a larger number of articles from areas other than medicine, to enable a more balanced comparison of these types of models and methodologies.

Figure 1. Analysis of the occurrence of words in the literature review.
Accuracy: proportion of correct predictions out of the total predictions made by the model. ACC = (TP + TN) / (TP + TN + FP + FN).
Precision: proportion of true positives (TP) over the sum of true positives and false positives (FP). P = TP / (TP + FP).
Recall (Sensitivity): proportion of true positives over the sum of true positives and false negatives (FN). R = TP / (TP + FN).
Specificity: proportion of true negatives (TN) over the sum of true negatives and false positives (FP). S = TN / (TN + FP).
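The four metrics follow directly from the confusion-matrix counts, as in this short sketch (the TP/TN/FP/FN values used are illustrative placeholders, not results from the reviewed studies):

```python
# Evaluation metrics computed from confusion-matrix counts.

def metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and specificity as a dict."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts: 90 true positives, 85 true negatives,
# 15 false positives, 10 false negatives.
m = metrics(tp=90, tn=85, fp=15, fn=10)
print(m)
# accuracy = 175/200 = 0.875, precision = 90/105 ≈ 0.857,
# recall = 90/100 = 0.9, specificity = 85/100 = 0.85
```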

Table 1. Confusion matrix for a binary problem.

Table 2. Descriptions of the evaluation metrics.

Table 3. Description of SBM and XGBoost of SEM.

Table 5. Multidisciplinary MLM-SEM for early detection.

Table 6. MLM for early detection in fraud.

• Logistic Regression. Advantages: simple and easy to interpret; efficient at detecting linear patterns in data and useful in situations where relationships are simpler and more direct. Disadvantages: not effective at detecting anomalies in highly imbalanced datasets or datasets with complex characteristics. Improvement directions: combining it with other supervised or unsupervised learning techniques can improve its performance, especially where relationships are more complex.
• Support Vector Machine. Advantages: effective at identifying complex, nonlinear patterns in data and able to handle high-dimensional datasets. Disadvantages: requires longer training time than other algorithms and can be sensitive to the choice of hyperparameters, which affects its performance. Improvement directions: parameter optimization techniques and data balancing can improve the model's performance.
• Decision tree. Advantages: easy to interpret and visualize, which makes it easier to understand how anomaly-detection decisions are made; handles mixed data effectively, including categorical and numerical data. Disadvantages: tends to overfit and is highly sensitive to small changes in the input data. Improvement directions: incorporating regularization techniques can mitigate overfitting and improve generalization in anomaly detection; combining multiple decision trees into an ensemble, such as Random Forest or Gradient Boosting, can enhance model accuracy and robustness.
• Random Forest. Advantages: handles imbalanced datasets effectively and reduces the tendency to overfit. Disadvantages: its complexity can make it difficult to interpret how decisions are made, and it sometimes requires careful hyperparameter tuning. Improvement directions: perform comprehensive hyperparameter optimization to improve the model's detection capability, and combine it in an ensemble with other models to significantly enhance decision making.
• Naive Bayes. Advantages: computationally efficient; can handle large datasets and manage high-dimensional datasets effectively. Disadvantages: highly sensitive to noisy or outlier data. Improvement directions: use more advanced versions of naive Bayes, such as Kernel naive Bayes or Multinomial naive Bayes, which can improve detection capability.
• K-nearest neighbors. Advantages: simple, easy to implement, and able to identify nonlinear patterns in the data. Disadvantages: sensitive to noisy and outlier data. Improvement directions: benefits from parameter optimization, such as tuning the number of neighbors; weighting methods that give more weight to closer neighbors can reduce the impact of outlier data in anomaly detection.
• Artificial neural network. Advantages: can learn complex, nonlinear patterns in the data and handle both structured and unstructured data effectively. Disadvantages: requires large amounts of data, and the complexity and training time of ANNs can be significant. Improvement directions: apply regularization techniques and hyperparameter optimization to improve generalization and performance; ensemble learning approaches that combine multiple neural networks can enhance the model's accuracy and robustness.
• XGBoost. Advantages: high performance and accuracy; handles imbalanced datasets effectively. Disadvantages: may require careful hyperparameter tuning, and the model's complexity and training time can be significant. Improvement directions: combining it with other models in an ensemble can lead to significant improvements in anomaly detection.
• Stacking Ensemble. Advantages: combines multiple models to improve accuracy in early anomaly detection; the combination of models allows for greater robustness and generalization. Disadvantages: implementation can be more complex than a single model, and combining multiple models carries a risk of overfitting. Improvement directions: perform comprehensive optimization of the model combinations in the Stacking Ensemble to enhance early anomaly detection, and integrate regularization techniques to mitigate the risk of overfitting.
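Several of the improvement directions above point to hyperparameter optimization. A minimal grid-search sketch over a single hyperparameter (a decision threshold) shows the pattern; the anomaly scores, labels, and grid below are hypothetical, and a real workflow would evaluate each candidate with cross-validation rather than a single split:

```python
# Toy grid search over a score threshold, scored by F1 on labeled data.
# Scores, labels, and the grid are illustrative assumptions.

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def evaluate(threshold, scored_samples):
    """Count TP/FP/FN for a threshold over (score, label) pairs, return F1."""
    tp = fp = fn = 0
    for score, label in scored_samples:
        pred = 1 if score >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
    return f1_score(tp, fp, fn)

# Hypothetical anomaly scores with ground-truth labels (1 = fraud).
samples = [(0.9, 1), (0.8, 1), (0.7, 0), (0.4, 1), (0.3, 0), (0.1, 0)]
grid = [0.2, 0.5, 0.8]
best = max(grid, key=lambda t: evaluate(t, samples))
print(best, evaluate(best, samples))  # 0.8 0.8
```

The same loop structure scales to multi-parameter grids (e.g., tree depth and number of estimators for a Random Forest), which is what the "comprehensive hyperparameter optimization" direction refers to.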