Systematic Review of Financial Distress Identification using Artificial Intelligence Methods

ABSTRACT The study presents a systematic review of 232 studies on various aspects of the use of artificial intelligence methods for the identification of financial distress (such as bankruptcy or insolvency). We follow the guidelines of the PRISMA methodology for performing systematic reviews. The study discusses bankruptcy-related financial datasets, data imbalance, feature dimensionality reduction in financial datasets, financial distress prediction, data pre-processing issues, non-financial indicators, frequently used machine-learning methods, performance evaluation metrics, and other related issues of machine-learning-based workflows. The study findings reveal the necessity of data balancing and dimensionality reduction techniques in data pre-processing, and they allow researchers to identify new research directions that have not been analyzed yet.


Introduction
Predicting the possibility of bankruptcy is considered one of the key issues of current economic and financial research. The growing importance of corporate bankruptcy prediction as a research subject has been confirmed in recent years by the appearance of various thorough reviews in the literature with the goal of summarizing the important findings of previously published studies (Chen, Ribeiro, and Chen 2016; Matenda et al. 2021). Analysing financial distress and its various forms (insolvency, bankruptcy, etc.) is important because of its essential role in society and the economy (Aljawazneh et al. 2021; Zelenkov and Volodarskiy 2021), the energy sector (Ayodele et al. 2019), and social security (Okewu et al. 2019). The prediction of "company survival" is a challenging task due to the large number of factors that have to be considered. Relationships (obvious and hidden) between these factors make the task even more difficult (Mora García et al. 2008). Financial distress prediction has become an essential key indicator for decision-makers, such as financial market players, fund managers, stockholders, employees, etc.
The remaining parts of this article are organized as follows. Section 2 presents the details of the methodological procedure used for the systematic review. Section 3 presents the analysis of the articles identified from the search with the keywords "Bankruptcy" or "Financial distress," which yielded 335 articles. The section continues with the analysis of specific problems related to the use of AI methods for bankruptcy prediction, e.g. the curse of dimensionality, class imbalance, anomalies, etc. The limitations of this study are discussed in Section 4, while the conclusions are presented in Section 5.

Procedure of the Systematic Review Based on the PRISMA Methodology
The main aim of this study is to identify the context of "Financial distress" and its usage of machine-learning methods, including additional related aspects such as imbalance, dimensionality, etc. Therefore, the systematic review technique is applied in this study. We gathered relevant studies based on a search query using databases such as Science Direct, Springer, IEEE, Google Scholar, etc., following the guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) (Page et al. 2021).

Inclusion and Exclusion Criteria
The topics "Financial distress" and "Bankruptcy" are widely analyzed in the literature; therefore, this article seeks to extend and supplement existing systematic reviews (Bhatore, Mohan, and Reddy 2020; Chen, Ribeiro, and Chen 2016; Matenda et al. 2021; Shi and Li 2019) with more relevant context. The inclusion criteria are as follows: (1) studies found with the keywords "Bankruptcy" or "Financial distress;" (2) studies from 2017 until February 2022; (3) studies published in English; (4) the full text is accessible.
The selection of the exclusion criteria is related to the creation of a financial distress barometer for SMEs. The exclusion criteria are: (1) studies with no company/enterprise bankruptcy or financial distress data; (2) studies with no indication of the data set used; (3) macroeconomic studies; (4) studies analyzing the financial sector: banks, insurance, etc.; (5) research with only one class analyzed; (6) traditional Altman method implementations without any new variables or method comparison. Additional inclusion criteria were selected due to the need for a wider analysis in the context of sentiment analysis, dimensionality reduction, imbalance, and outliers. This wider analysis seeks to create knowledge of the variety of methods, their taxonomy, and their use cases in the "Financial distress" context. The additional inclusion criteria are reflected in the following research questions:

RQ1: What is the difference between Financial Distress, Insolvency, and Bankruptcy?

RQ2: What indicators are used as financial distress predictors, and what is their suitability for SMEs?

RQ3: What data sources are used?

RQ4: What data normalization techniques are used?

RQ5: What non-financial indicators for financial distress prediction are included?

RQ6: What machine-learning models are used?

RQ7: What additional techniques are important for machine-learning algorithms, and which of them are used in the financial distress context?

RQ8: What performance metrics are used for the evaluation of machine-learning algorithms in the financial distress context?

Conceptual Analysis of Domain Terms: Financial Distress, Insolvency, and Bankruptcy
The concept of financial distress in the scientific literature is often related to bankruptcy, insolvency, probability of default, and failure patterns. The common definition of financial distress is a condition of a firm that has difficulties fulfilling its financial obligations (Farooq, Jibran Qamar, and Haque 2018; Yazdanfar and Öhman 2020). In the scientific literature there are different use cases (interpretation views) of the same financial distress definition. For example, one point of view is that financial distress = bankruptcy, and the second point of view is that financial distress ≠ bankruptcy. A common view is that financial distress differs from bankruptcy: it is a distressed situation of the company that leads to two possible states: 1) a recovery state, in which the company becomes healthy again; 2) a bankruptcy state, involving reorganization or liquidation of the organization. Bankruptcy is the legal status of the company when creditors take legal action because the company cannot repay its debt (du Jardin 2018; Farooq, Jibran Qamar, and Haque 2018; Salehi and Davoudi Pour 2016; Veganzones and Severin 2020). In the context of bankruptcy, the words failure and default can be used as synonyms (Letizia and Lillo 2019; Salehi and Davoudi Pour 2016), contrary to the word insolvency. Insolvency is the middle stage between financial distress and bankruptcy. The main difference between financial distress and insolvency is that in the first case companies have difficulties paying, whereas in insolvency the expression used is being unable to pay. After insolvency, if legal action is taken, the insolvent firm is declared bankrupt (Farooq, Jibran Qamar, and Haque 2018). Two types of bankruptcy are identified (Lukason and Vissak 2017): (1) A gradual failure ("chronic"): the financial situation of the firm declines incrementally for some years before its bankruptcy. (2) An acute failure ("sudden"): the financial situation of the firm rapidly collapses, and sudden bankruptcy occurs; in the financial statements declared a year before bankruptcy, there are no indicators of the company's possible failure.
Most researchers separate the financial distress and bankruptcy concepts. The main difference is that financial distress leads to bankruptcy, meaning that firms can either recover or become bankrupt. Therefore, there are two classes in bankruptcy, while in the financial distress subject the researchers choose not only the number of classes but also the method by which the classes will be identified. The choice of the class identification method is based on the quantity of the available data and the types of firms (public, non-public).

Data Sources
Researchers commonly choose publicly available data, e.g. public companies or open data sets. This is evident from the distribution of the data sources from the analyzed articles, which is presented in Table 2. The Private and Other sections are similar in that both summarize different data sources. Nevertheless, the Private section combines data sources to which researchers have been given access from private sources or specific banks, while the Other section covers data sources with a higher probability of being accessible to other researchers, e.g. Orbis, Retriever, Thomson Reuters, etc. The Other section summarizes data sources used by the researchers that are mentioned in ≤ 2 of the analyzed studies. If a private database is chosen, researchers tend to analyze public companies and combine several different databases (Compustat, LoPucki's Bankruptcy Research, New Generation Research, Center for Research in Security Prices, etc.). Eventually, the concepts of bankruptcy and financial distress are important for diverse parties: shareholders, investors, creditors, partners, etc. Furthermore, the beginning of bankruptcy prediction is considered to be the late sixties, with Beaver's (1966) and Altman's (1968) works (Joshi, Ramesh, and Tahsildar 2018; Shen et al. 2020). These authors concentrated on financial indicators as the main holders of the firm's historical information. The main idea of Beaver's (1966) research is "finding the optimal cutoff point" between healthy and bankrupt firms. The author established that it is relevant to use windows of no more than five years before bankruptcy due to differences between the ratio class distributions (Beaver 1966). In addition, Altman (1968) created the z-score model, which is still used today. As research on this topic expanded, it was understood that the distribution of financial ratios differs from a normal distribution and that the covariance matrices of the groups are not equal, which led to the use of logistic regression in analyses (Wagner 2008). Furthermore, additional features have been added to research on the financial distress or bankruptcy topic due to technical improvements (software, algorithm functions) and data availability (Chollet 2018). These reasons may lead to better model accuracy and the creation of a more universal model.
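As a point of reference for the early models described above, a minimal sketch of the original Altman (1968) Z-score is given below; the coefficients and zone thresholds follow the commonly cited 1968 formulation, and the function signature and input values are purely illustrative, not taken from the reviewed studies.

```python
# Hedged sketch: the classic Altman (1968) Z-score for public manufacturing
# firms. Coefficients and cut-off zones are the commonly cited 1968 values;
# variable names and inputs are illustrative.
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales, total_assets, total_liabilities):
    x1 = working_capital / total_assets            # liquidity
    x2 = retained_earnings / total_assets          # cumulative profitability
    x3 = ebit / total_assets                       # operating efficiency
    x4 = market_value_equity / total_liabilities   # leverage
    x5 = sales / total_assets                      # asset turnover
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5

# Commonly cited interpretation: Z < 1.81 distress zone, 1.81-2.99 gray zone,
# Z > 2.99 safe zone.
z = altman_z(1.2e6, 2.5e6, 0.9e6, 4.0e6, 7.5e6, 10.0e6, 5.0e6)
print(round(z, 2))
```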
Research on the financial distress and bankruptcy topic is developing in the context of big data, where new features (indicators) are added for greater model accuracy. This leads to a higher-dimensional space, which creates the need for data dimensionality reduction before a machine-learning algorithm is used.

Dealing with High Data Dimensionality
The curse of dimensionality implies that a machine-learning algorithm's cost grows exponentially with the number of dimensions (Kuo and Sloan 2005). The data become sparser when a new feature (dimension) is added, which makes it harder to achieve better model accuracy. There is a common belief that more data is better than less (Altman and Krzywinski 2018). The data pre-processing step is one of the most important tasks of data analytics, leading not only to simplified model design but also to the creation of more efficient models. The main problems caused by dimensionality are:
• Data sparsity: e.g. if Euclidean distance is used as a similarity measure, the appearance of a new feature leads to greater dissimilarity due to the increasing distance between classes (points) (Altman and Krzywinski 2018);
• Multiple testing (also known as the signal-to-noise ratio problem): e.g. the ability to detect patterns decreases in the presence of inadequate features (Millstein et al. 2020);
• Multicollinearity: e.g. if the number of samples is smaller than the number of features, this situation leads to redundant features (linear algebra) (Altman and Krzywinski 2018);
• Overfitting (lower model interpretability).
Dimensionality reduction methods can be separated into feature selection and feature extraction (Figure 3). Regardless of the method, it is important to perform data preparation techniques such as:
• Feature cleaning: remove features with a high missing-value ratio or low variance, or select only one feature from two highly correlated features;
• Feature normalization or standardization.
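A minimal sketch of these preparation steps is shown below, assuming scikit-learn and pandas and a purely numeric ratio table; the missing-value, variance, and correlation thresholds are illustrative choices, not values reported in the reviewed studies.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

def clean_and_scale(df: pd.DataFrame) -> pd.DataFrame:
    # 1) drop features with a high missing-value ratio, impute the rest
    df = df.loc[:, df.isna().mean() < 0.5]
    df = df.fillna(df.median(numeric_only=True))
    # 2) drop near-constant (low-variance) features
    vt = VarianceThreshold(threshold=0.01).fit(df)
    df = df[df.columns[vt.get_support()]]
    # 3) keep only one of each pair of highly correlated features
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    df = df.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])
    # 4) standardize to zero mean and unit variance
    return pd.DataFrame(StandardScaler().fit_transform(df),
                        columns=df.columns, index=df.index)
```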
The feature selection approach is used to find a narrow subset of appropriate features from the initial wide range (Al-Tashi et al. 2020). This approach consists of three steps: method selection, evaluation, and stopping criteria.
The feature selection approach can be divided into filter, wrapper, embedded, and hybrid models. The first researcher of the bankruptcy concept, Beaver (1966), used the filter selection technique to separate healthy and bankrupt firms by comparing the ratio distributions of the different classes. In current practice, the t-test, Cohen's D, χ², F-score, information gain ratio, Correlation Feature Selection (CFS), ReliefF, etc. are used. The filter method assumes that features are independent of each other. The main advantages of this method are that it is fast, scalable, simple in design, easier for other researchers to understand, and works independently of the classifier (Li, Li, and Liu 2017). The last advantage can become a disadvantage if an interaction with the classifier could lead to better model performance or could save costs.
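The following sketch illustrates filter-style selection with two of the scores named above (the ANOVA F-score and χ²) using scikit-learn; the data set is a synthetic stand-in for a table of financial ratios, and the choice of k = 10 features is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, chi2
from sklearn.preprocessing import MinMaxScaler

# synthetic imbalanced stand-in for a financial-ratio data set
X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)

# ANOVA F-score: ranks each feature independently of any classifier
f_selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# chi-square requires non-negative inputs, so ratios are rescaled to [0, 1]
X_pos = MinMaxScaler().fit_transform(X)
chi2_selector = SelectKBest(score_func=chi2, k=10).fit(X_pos, y)

print("F-score picks:", f_selector.get_support(indices=True))
print("chi2 picks:   ", chi2_selector.get_support(indices=True))
```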
The wrapper method uses a classifier to evaluate the features (Al-Tashi et al. 2020). This method analyzes features using forward or backward techniques: forward selection begins with a one-feature subset and adds a new feature; if accuracy improves, the feature is kept in the analysis, otherwise it is removed. Backward elimination begins with the full feature set and removes features one by one; if accuracy was better before removal, the feature is returned to the subset. Both algorithms continue until all candidate subsets have been analyzed. For feature evaluation, the accuracy rate or classification error is often used (Cai et al. 2018). Comparing filter and wrapper feature selection methods on classification tasks, the wrapper method tends to have better performance results (Al-Tashi et al. 2020) but needs much more computational power and time (Cai et al. 2018). The main advantages of wrapper methods are simplicity, interaction with the classifier, and modeling of feature dependencies; the disadvantages are the risk of overfitting, a smaller variety of usable classifiers, and intensive computation (Li, Li, and Liu 2017). The embedded feature selection method differs from the others in that feature selection and classification are integrated into a single process, so feature selection becomes part of the classifier, e.g. Random Forest, LightGBM, XGBoost, LASSO, and others. This method is less computationally demanding than the wrapper method (Li, Li, and Liu 2017). The hybrid approach combines the filter and wrapper methods, first using the filter method for primary feature selection and then applying the wrapper method; this combination balances the accuracy rate and the computational intensity.
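Below is a hedged sketch of a wrapper (forward sequential selection) and two embedded selectors (L1-penalized logistic regression and random-forest importances) in scikit-learn; the synthetic data, regularization strength, and number of selected features are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=25, n_informative=6,
                           random_state=0)

# Wrapper: forward selection scored by cross-validated accuracy
wrapper = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=8,
    direction="forward", scoring="accuracy", cv=5).fit(X, y)

# Embedded: L1 (LASSO-style) penalty drives uninformative coefficients to zero
lasso = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)).fit(X, y)

# Embedded: random-forest impurity-based feature importances
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

print("wrapper keeps:", wrapper.get_support(indices=True))
print("lasso keeps:  ", lasso.get_support(indices=True))
print("top RF features:", rf.feature_importances_.argsort()[::-1][:8])
```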
Feature extraction transforms high-dimensional data into a new lower-dimensional space that retains the maximum information from the initial data set (Ayesha, Hanif, and Talib 2020). This approach is used not only for mapping but also for class visualization in 2-dimensional or 3-dimensional space, in which the essential data is visualized (Ye, Ji, and Sun 2013). This mapping approach can be divided into linear and non-linear methods. Linear methods attempt to reduce dimensionality by applying linear functions that form a new lower-dimensional feature set (Ayesha, Hanif, and Talib 2020), e.g. Principal component analysis (PCA), Linear discriminant analysis (LDA), Canonical correlation analysis (CCA), Singular Value Decomposition (SVD), Independent component analysis (ICA), Locality Preserving Projections (LPP), Neighborhood preserving embedding (NPE), Robust subspace learning (RSL), Latent semantic analysis (LSA) (for text), Projection Pursuit (PP), etc. Every technique is oriented toward extracting particular information: for example, PCA extracts global information, LPP extracts local information, and LDA merges the class information into the feature set, which means that other information in the data is lost (Wang, Liu, and Pu 2019). Nonlinear feature extraction methods achieve greater performance results than linear ones because real-world data are more likely to be nonlinear than linear (van der Maaten, Postma, and Herik 2007). Nonlinear methods include auto-encoders, Kernel principal component analysis (KPCA), Multidimensional Scaling (MDS), Isomap, Locally linear embedding (LLE), Self-Organizing Map (SOM), Learning vector quantization (LVQ), t-distributed Stochastic Neighbor Embedding (t-SNE) (Ayesha, Hanif, and Talib 2020), etc.
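The sketch below contrasts a linear mapping (PCA) with a non-linear 2-D embedding (t-SNE) in scikit-learn, again on synthetic stand-in data; the 95% explained-variance target and the perplexity value are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=30, n_informative=10,
                           random_state=0)
X_std = StandardScaler().fit_transform(X)

# PCA: keep enough components to explain ~95% of the variance (linear mapping)
pca = PCA(n_components=0.95).fit(X_std)
X_pca = pca.transform(X_std)
print("components kept:", pca.n_components_,
      "explained variance:", round(pca.explained_variance_ratio_.sum(), 3))

# t-SNE: non-linear projection to 2-D, typically used for class visualization
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_std)
print("t-SNE embedding shape:", X_2d.shape)
```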
Financial distress and bankruptcy research tends to expand the set of analyzed indicators for greater model accuracy and the discovery of new important patterns. This leads to dimensionality issues: data sparsity, multiple testing, multicollinearity, and overfitting, which are solved by feature selection or extraction approaches. In the context of bankruptcy or financial distress, the following feature extraction techniques are used: 1) linear: PCA (Mokrišová 2021); 2) nonlinear: t-SNE (Zoričák et al. 2020), SOM (Mora García et al. 2008), and autoencoders. Instead of feature extraction, researchers also use feature selection because of the achievable knowledge of feature importance. Frequently used feature selection techniques are: 1) filter: CFS (Séverin and Veganzones 2021), ReliefF (Kou et al. 2021), χ² (Azayite and Achchab 2018; Kou et al. 2021), and gain ratio (Kou et al. 2021).

Dealing with Bias and Imbalance
Class imbalance occurs when the number of instances of one class is much greater than that of the other, which is common in real data set analysis (Liu, Zhou, and Liu 2019). In the context of financial distress or bankruptcy, the number of financially successful firms is higher than the number of distressed ones, which can be expressed as a proportion ranging from 100:1 to 1000:1. A firm's activity sector influences a higher or lower probability of default, e.g. financial distress is more common in the manufacturing industry than in transportation. The most prevalent solution is to add more instances of the minority class. The class imbalance problem is generally related to: (1) lack of minority data (feature patterns cannot be found due to the limited number of minority class examples); (2) overlapping or class separability (class examples are mixed with each other in the feature space); (3) small disjuncts (intrusions of small groups from the minority class into the majority class feature space) (Fernández et al. 2018).
These problems lead to difficulties in creating an effective machine-learning classification model. In addition, researchers dealing with imbalanced data sets consider the accuracy metric an inappropriate evaluation measure due to the dominating class effect, e.g. a classifier can achieve 99% accuracy without correctly classifying rare examples (Weng and Poon 2008). This measure is replaced with Precision, Recall, F-score, the area under the ROC curve (AUC), G-mean, or balanced accuracy metrics (Fernández et al. 2018; Kotsiantis, Kanellopoulos, and Pintelas 2005; Weng and Poon 2008). Class imbalance reduction methods are designed for binary classification problems; however, not all of them can be used for multiclass imbalance problems. To use them, researchers apply One-vs-One (OVO) and One-vs-All (OVA) strategy schemes, in which the multiclass imbalance problem is converted into a binary one (Fernández et al. 2018). Methods dealing with the class imbalance problem can be separated into data level, algorithm level, and hybrid approaches (Figure 4). The data level approach is directly related to changes in the data set; it rebalances the data so that the class distribution becomes more even. On the other hand, the algorithm level approach modifies the classifier, biasing it toward prioritizing the learning of minority classes (Fernández et al. 2018; Kotsiantis, Kanellopoulos, and Pintelas 2005). The combination of these two approaches forms the hybrid methodology, which makes changes to both the data and the classifier to solve a specific problem. The main advantage of the data-level approach is the creation of an independent process, kept separate from the sampling and classifier training process.
From a data modification perspective, there are three possibilities: 1) reducing the majority class, 2) increasing the minority class, or 3) a hybrid combining majority class reduction with minority class increase. Hence, the data level approach consists of undersampling, oversampling, and hybrid methods. From the undersampling and oversampling perspective, the simplest way is a random reduction of majority class instances (RUS, random undersampling) or a random increase of minority class instances (ROS, random oversampling) (Liu, Zhou, and Liu 2019). These random methods are not very efficient due to information loss or overfitting (Kotsiantis, Kanellopoulos, and Pintelas 2005). For this reason, new undersampling models are being developed that use clustering for instance identification in the feature space and majority class instance elimination due to redundancy, distance from the decision border, etc., such as Tomek Links (TL), Undersampling Based on Clustering (SBC), Class Purity Maximization (CPM), Condensed Nearest Neighbor Rule (US-CNN), One-Sided Selection (OSS), Ant Colony Optimization Sampling (ACOSampling), etc. (Fernández et al. 2018). The Synthetic Minority Over-sampling Technique (SMOTE) is the most often used oversampling method; it generates synthetic instances by interpolation, depending on the required class balance. The SMOTE interpolation technique uses k-nearest neighbor logic: close examples in the feature space are selected, forming a line segment on which new synthetic instances are created (Ashraf and Ahmed 2020).
The main advantage of the SMOTE technique is the improvement of the classifier's generalization capacity, which has led scientists to create more than 85 different SMOTE extensions: Borderline-SMOTE, ADASYN, Safe-Level-SMOTE, DBSMOTE, ROSE, MWMOTE, MDO, etc. (Fernández et al. 2018). In addition, the highest classifier performance can be achieved by combining undersampling and oversampling techniques, i.e. the Hybrid methods I group: SMOTE + Tomek Link, SMOTE + ENN, AHC, SPIDER, SMOTE-RSB (Fernández et al. 2018). These hybrid methods retain the weaknesses of both components: the possibility of important information loss and overfitting.
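A minimal resampling sketch with the imbalanced-learn library is given below, covering random undersampling, SMOTE, and the hybrid SMOTE + Tomek links; the 95:5 class ratio is a synthetic stand-in for a bankruptcy data set, not data from the reviewed studies.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=2100, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# random undersampling of the majority class (RUS)
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)
# SMOTE oversampling: synthetic minority instances via k-NN interpolation
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
# hybrid: SMOTE followed by Tomek-link cleaning
X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)

print("undersampled:", Counter(y_rus))
print("SMOTE:       ", Counter(y_sm))
print("SMOTE+Tomek: ", Counter(y_st))
```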
The algorithm level approach can be divided into threshold, one-class classifier, cost-sensitive, and ensemble-of-classifiers methods. The threshold method, also known as the "decision threshold" or "discrimination threshold" method, improves classifier label prediction by moving the default threshold (0.5 probability) up or down for better identification of a specific class (Zhou and Liu 2006). For example, a financial institution can save costs if the threshold for good credits is 0.8, which means it saves 2 out of 10 cases for creditworthiness tests. The main idea of this method is to know the boundary that leads to the identification of one prior class label (Chen et al. 2006). If the classes are highly overlapping, such a threshold boundary cannot be achieved. Another algorithm level method is the one-class classifier, also known as recognition-based learning; this method uses examples of only one specific class in the training set. It is used when there are small disjuncts or noisy instances in the data and can be divided into three types: 1) learning from the minority class; 2) learning from the majority class; 3) combining outputs after learning with both approaches (Fernández et al. 2018). Applying the method leads to a decrease in the specificity metric; one way of dealing with this issue is the one-class classifier combination approach (Fernández et al. 2018). Since a one-class classifier is trained only on instances of one class, other instances are treated as outliers; for this reason, this method is also used as one of the outlier detection approaches.
The cost-sensitive method uses a cost matrix to create unequal misclassification costs between the classes (Kotsiantis, Kanellopoulos, and Pintelas 2005); it is a penalty treatment for the classifier. In the literature there are two different views on where the cost-sensitive method belongs: for some authors it is a direct branch of the class imbalance approach (Fernández et al. 2018; Sisodia and Verma 2018), while for others it is a subclass of the algorithm level approach (Kotsiantis, Kanellopoulos, and Pintelas 2005; Liu, Zhou, and Liu 2019; Wang et al. 2020). The cost-sensitive method depends on the selected cost; incorrect cost selection leads to impaired classifier results (Fernández et al. 2018). The main idea of the Hybrid methods II group is to combine the outputs of different classifiers to create a more accurate final decision, and it can involve combinations of different algorithm level methods (Zhou and Liu 2006). The main difference from the third class imbalance type, the ensembles-of-classifiers approach, is that Hybrid methods II do not mix the outputs of data level and algorithm level approaches. The ensembles-of-classifiers approach allows researchers not only to use a combination of data level and algorithm level approaches but also to build their own ensembles of learning classifiers.
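The sketch below illustrates two of the algorithm-level options discussed above, cost-sensitive learning via class weights and decision-threshold moving, with scikit-learn; the 9:1 cost ratio and the 0.3 threshold are illustrative assumptions rather than values from the reviewed studies.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Cost-sensitive: misclassifying the minority (distressed) class costs more
clf = LogisticRegression(class_weight={0: 1, 1: 9},
                         max_iter=1000).fit(X_tr, y_tr)

# Threshold moving: lower the default 0.5 cut-off to catch more minority cases
proba = clf.predict_proba(X_te)[:, 1]
for threshold in (0.5, 0.3):
    y_hat = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"minority recall={recall_score(y_te, y_hat):.2f}")
```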
In the financial distress and bankruptcy context, class imbalance reduction techniques are used inconsistently. Some authors (Inam et al. 2019; Kanojia and Gupta 2022; Perboli and Arabnezhad 2021) do not use any class imbalance technique; instead, they choose majority instances depending on the number of minority instances. For example, Huang and Yen (2019) have 32 financially distressed firms and select 32 non-distressed firms from the same industry. Conversely, other authors adopt class imbalance reduction methods. The most commonly used is SMOTE (Aljawazneh et al. 2021; Angenent, Barata, and Takes 2020; Choi, Son, and Kim 2018; Faris et al. 2020; Jiang et al. 2021; Kim, Cho, and Ryu 2021; Letizia and Lillo 2019; Roumani, Nwankpa, and Tanniru 2020; Sisodia and Verma 2018; Sun et al. 2021; Vellamcheti and Singh 2020; Zelenkov and Volodarskiy 2021; Zhou 2013), followed by oversampling techniques other than SMOTE (Aljawazneh et al. 2021; Sisodia and Verma 2018; Smiti and Soui 2020; Zelenkov and Volodarskiy 2021; Zhou 2013), undersampling (Angenent, Barata, and Takes 2020; Le et al. 2019; Sisodia and Verma 2018; Vellamcheti and Singh 2020; Zelenkov and Volodarskiy 2021; Zhou 2013), ensemble classifier approaches (Aljawazneh et al. 2021; Roumani, Nwankpa, and Tanniru 2020; Shen et al. 2020; Sun et al. 2020; UlagaPriya and Pushpa 2021; Wang et al. 2020), Hybrid I (Aljawazneh et al. 2021; Le et al. 2019), cost-sensitive (Angenent, Barata, and Takes 2020; Chang 2019; Ren, Lu, and Yang 2021), and threshold methods. It is noted that the authors (Aljawazneh et al. 2021; Angenent, Barata, and Takes 2020; Sisodia and Verma 2018; Vellamcheti and Singh 2020; Zelenkov and Volodarskiy 2021; Zhou 2013) who use techniques other than SMOTE perform a comparative analysis of these techniques and compare them with SMOTE. Thus, SMOTE is one of the common data preparation steps. Some authors have noticed that the efficiency of the classifier decreases as the class imbalance increases, especially if one class accounts for less than 20% of the instances. This conclusion was based only on the application of the data level approach in the research methodology, and the maximum performance of the classifier was achieved after implementing SMOTE techniques. When a one-class classifier technique is used, the class imbalance problem can be viewed as anomaly or outlier detection, which is also applied to the concepts of financial distress and bankruptcy (Zoričák et al. 2020).

Looking from Outliers' Perspective
An anomaly is understood as a strong outlier, which is significantly dissimilar to the other data instances; by contrast, a weak outlier is identified as noise in the data (Aggarwal 2017). It is important to understand that outliers exist in approximately every real data set due to: malicious activity, changes in the environment, system behavior, fraudulent behavior, human error, instrument error, setup error, sampling errors, data-entry error, or simply natural deviations in populations (Chandola, Banerjee, and Kumar 2009; Hodge and Austin 2004; Wang, Bah, and Hammad 2019). Authors use different terminologies, such as outlier detection, novelty detection, anomaly detection, noise detection, deviation detection, or exception mining (Hodge and Austin 2004), which all lead to the same outlier identification problem. The first step in solving this issue is a precise description of normality, but finding the boundary between normality and non-normality is often fuzzy due to the instances (data points) that lie near the boundary and can be treated as normal or vice versa (Chandola, Banerjee, and Kumar 2009). We identify three types of anomalies/outliers: (1) A point anomaly or Type I outlier occurs when the applied technique compares an individual instance with the rest of the data (Ahmed, Naser Mahmood, and Hu 2016; Chandola, Banerjee, and Kumar 2009). For example, if a person spends three times more than they used to.
(2) A context anomaly or Type II outlier occurs when there is structure in the data, for example the seasonality of spending during the Christmas period. This type requires two sets of attributes: 1) contextual attributes (location, time, etc.); 2) behavioral attributes (non-contextual characteristics of the instance, such as the time interval between purchases) (Bhuyan, Bhattacharyya, and Kalita 2014; Chandola, Banerjee, and Kumar 2009). (3) A collective anomaly or Type III outlier occurs when a sequence of events is analyzed in which a separate event is not an anomaly, but the collection of similar events behaves anomalously, for example a sequence of transactions (Aggarwal 2017; Ahmed, Naser Mahmood, and Hu 2016; Chalapathy and Chawla 2019).
We categorize the anomaly detection methods into six approaches according to literature reviews (Aggarwal 2017; Ahmed, Naser Mahmood, and Hu 2016; Bhuyan, Bhattacharyya, and Kalita 2014; Chandola, Banerjee, and Kumar 2009; Hodge and Austin 2004; Wang, Bah, and Hammad 2019) in the context of anomaly/outlier detection (Figure 5):
(1) The statistical-based approach is the first group of algorithms used for outlier detection (Hodge and Austin 2004), and it is split into parametric and non-parametric methods. The fundamental idea of this approach is the identification of a new instance's dependence on the distribution model (Wang, Bah, and Hammad 2019), e.g. instances are declared anomalies if they have a low probability of being generated from the learned model (Bhuyan, Bhattacharyya, and Kalita 2014). Parametric models use hypothesis testing: if the hypothesis is rejected, the instance is declared an anomaly; χ², Grubbs' test, etc. are used for hypothesis testing. Assuming that the data are generated from a Gaussian distribution (Bernoulli if categorical, etc.), the maximum likelihood function can be used as well, where a threshold on the distance from the mean is applied for anomaly identification (Aggarwal 2017; Chandola, Banerjee, and Kumar 2009).
(2) The distance- and density-based approach, otherwise known as the nearest-neighbor-based approach due to the regular application of the k-NN technique. The main idea of this approach is that normal data instances occur in denser or nearby neighborhoods, while anomalies are more distant and may form their own local dense groups (Aggarwal 2017; Chandola, Banerjee, and Kumar 2009; Wang, Bah, and Hammad 2019). The main advantages of this method are that: a) it can be used during unsupervised learning, b) it is easily scalable in multidimensional space, c) it is computationally efficient, d) it does not make a prior assumption about the data distribution (Chandola, Banerjee, and Kumar 2009; Wang, Bah, and Hammad 2019). This method is sensitive to parameter settings, including the identification of the k neighbors. It also relies on the analyzed data: if the data are scattered or do not contain enough similar normal instances, this leads to a high false-positive rate (Chandola, Banerjee, and Kumar 2009). The method's performance decreases due to the curse of dimensionality, and it is not suitable for data streams (Wang, Bah, and Hammad 2019).
(4) The clustering-based approach can be used during unsupervised learning, hence pre-labeled class instances are not required (Ahmed, Naser Mahmood, and Hu 2016). The main assumption of the method is that normal data instances belong to a cluster, while anomalies do not, or form their own smaller cluster (Aggarwal 2017; Chandola, Banerjee, and Kumar 2009). Zhang (2013) distinguishes clustering-based outlier detection algorithms into major categories such as partitioning clustering methods, hierarchical clustering methods, density-based clustering methods, and grid-based clustering methods. The main advantages of using clustering are: a) stable performance, b) no prior knowledge about the data distribution is needed, c) adaptability to different data types and data structures, d) incremental clustering (supervised) methods are effective for fast response generation (Bhuyan, Bhattacharyya, and Kalita 2014; Chandola, Banerjee, and Kumar 2009).
Some cluster categories have additional advantages; for example, partitioning clustering is relatively simple and scalable, while hierarchical methods "maintain a good performance on data sets containing non-isotropic clusters and also produce multiple nested partitions that give users the option to choose different portions according to their similarity level" (Wang, Bah, and Hammad 2019). The main disadvantages of the clustering-based approach are: a) dependence on the proper selection of a clustering algorithm that can capture the structure of normal instances, b) high sensitivity to initial parameters, e.g. clustering is optimized for a prior number of clusters rather than for anomaly detection; hence, identifying the proper number of clusters for normal instances and anomalies is challenging (Bhuyan, Bhattacharyya, and Kalita 2014; Chandola, Banerjee, and Kumar 2009; Wang, Bah, and Hammad 2019).
(5) The main assumption of the information-theoretic approach is that anomalies in the data set cause irregularities in the information content (Chandola, Banerjee, and Kumar 2009). The information content is analyzed using different information-theoretic measurements, e.g. entropy, relative entropy, conditional entropy, information gain, and information cost (Ahmed, Naser Mahmood, and Hu 2016; Chandola, Banerjee, and Kumar 2009). The main advantages of this approach are that it can be used during unsupervised learning and does not make a prior assumption about the data distribution. The main weaknesses are: a) performance depends on the choice of the information-theoretic measurement; b) its applicability to data sets is limited, as in most cases it is used for sequential or spatial data; c) computation time and power requirements grow exponentially for more complex data sets; d) it is difficult to connect the information-theoretic measurement output with an anomaly score or label (Aggarwal 2017; Chandola, Banerjee, and Kumar 2009).
(6) The combination-based approach, otherwise known as the ensemble-based approach, whose main idea is to use the results of several machine-learning algorithms and combine them using weighted-voting or majority-voting techniques (Bhuyan, Bhattacharyya, and Kalita 2014; Wang, Bah, and Hammad 2019). Aggarwal (2017) distinguishes ensemble-based outlier detection algorithms into two categories: sequential and independent ensembles. The main idea of sequential ensembles is that the sequentially applied algorithms depend on the data, while independent ensembles combine the voting outputs of different algorithms. The main advantages of this method are: a) more efficient performance, b) more stable prediction results, and c) applicability to high-dimensional data and streaming data. It is difficult to: a) obtain real-time performance, b) select the classifiers in the ensemble, and c) interpret a result obtained during unsupervised learning (interpretability could otherwise support robust decision-making) (Bhuyan, Bhattacharyya, and Kalita 2014; Wang, Bah, and Hammad 2019).
Anomaly detection techniques generate output of one of two types:
• Labels: the technique assigns a label to each instance, e.g. normal instance or outlier. It generally uses a threshold to convert a probability score into binary labels (Aggarwal 2017; Chandola, Banerjee, and Kumar 2009).
• Scores: the technique uses the direct algorithm outputs as a probability score of being an outlier, by which instances are ranked (Aggarwal 2017; Chandola, Banerjee, and Kumar 2009).
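A short sketch of both output types is given below using Isolation Forest (score and label outputs) and the density-based Local Outlier Factor from scikit-learn; the data with roughly 2% injected outliers is synthetic and only illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(980, 5)),   # "healthy" firms
               rng.normal(6, 1, size=(20, 5))])   # injected anomalies

iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
scores = iso.decision_function(X)   # score output: lower = more anomalous
labels = iso.predict(X)             # label output: -1 = outlier, 1 = normal

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
lof_labels = lof.fit_predict(X)     # density-based labels, same -1/1 convention

print("IForest flagged:", int((labels == -1).sum()),
      "LOF flagged:", int((lof_labels == -1).sum()))
```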

Machine-Learning Methods
Machine-learning is the study of computer algorithms that have the capability to learn and improve automatically through experience (Helm et al. 2020; Huang and Wang 2019). Machine-learning techniques can be classified into four main groups.
(1) Supervised machine-learning methods are based on the useful information in labeled data (Liu and Lang 2019). This is called a task-driven approach because it uses a sample of input-output pairs to learn how to convert an input to an output (van Engelen and Hoos 2020). Depending on the provided data, the task can be regression (continuous data) or classification (discrete data) (Sarker 2021).
(2) Unsupervised machine-learning methods do not have any provided output, and their main task is to map similar inputs to the same class (van Engelen and Hoos 2020). This data-driven approach is widely used for feature extraction, clustering, association rule detection, density estimation, anomaly detection, etc. (Sarker 2021).

(3) Semi-supervised machine-learning methods combine supervised and unsupervised methods and seek to improve performance in one of these two tasks by using data that is commonly linked with the other. For example, clustering can benefit from knowing the provided output of some data points; further, classification can then be extended to additional data points without any output (van Engelen and Hoos 2020). They are often used in text classification, fraud detection (Akande et al. 2021; Awotunde et al. 2022), money laundering prevention, data labeling, etc. (4) Reinforcement machine-learning methods are based on long-term reward maximization, which is obtained by imitating human behavior when taking actions in an environment (learning from rewards or penalties).
In the financial distress and bankruptcy context, the authors most often apply supervised machine-learning methods; especially popular are Logistic regression, Artificial neural networks (ANN), and Support vector machines (SVM). Logistic regression is the most popular method for several reasons: 1) it is one of the first methods applied in the bankruptcy context (Altman's z-score model is based on LR); 2) it is popular in social science for the evaluation of the analyzed variables; 3) it is one of the main methods used in efficiency comparisons with other machine-learning methods. However, the performance results of the LR method are lower compared with other machine-learning methods (SVM, XGBoost, ANN, Random Forest, etc.). For this reason, authors extend their research by implementing new methodologies for "Financial distress" classification and by solving other machine-learning issues such as dimensionality reduction, imbalance, etc. A few unsupervised machine-learning methods are applied: One-class SVM, Isolation Forest (IF), Least-Squares Anomaly Detection (LSAD), K-means, and, from the deep learning group, the Auto-encoder. A more detailed description of each method is given in Table 3.
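A hedged sketch of the typical comparison setup, logistic regression as a baseline against a tree ensemble scored by cross-validated AUC, is shown below; the synthetic imbalanced data stands in for a financial-ratio data set and the models are deliberately left untuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

for name, model in [("Logistic regression", LogisticRegression(max_iter=1000)),
                    ("Random forest", RandomForestClassifier(n_estimators=300,
                                                             random_state=0))]:
    # 5-fold cross-validated AUC, the comparison metric most studies report
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f}")
```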

Performance Metrics
The effectiveness of the methods is evaluated by comparing the performance results of different methods. In this study, we are interested in evaluating performance metrics suitable for labels. Most of them are based on the confusion matrix (Altman 1968). In the case of class imbalance, the minority class is presented as negative (Fernández et al. 2018). In the case of financial distress, the classes would be presented as follows: positive: Non-Financial Distress | Non-Bankrupt, and negative: Financial Distress | Bankrupt.
Based on the confusion matrix, many performance measures can be constructed. For a more accurate evaluation of the methods, researchers use three to five measures. Figure 6(a) shows that 30.9% of the studies used one evaluation metric. The authors use other evaluation metrics, such as R² or Log-loss, in regression analyses; therefore, it is common to use 3-5 evaluation metrics. The most common evaluation metrics are accuracy and the area under the ROC curve (AUC), followed by recall, specificity, and type I error (Figure 6(b)). Comparing the use of AUC and ACC, it is observed that more recent studies tend to choose the AUC metric. AUC calculation requires identifying the ROC curve. The ROC (receiver operating characteristic) curve is a graphical evaluation method (Fernández et al. 2018) for binary classification problems, also described as a two-dimensional coordinate system, in which Sensitivity (TPR) is plotted on the Y-axis and 1-Specificity (FPR) on the X-axis (Zhou 2013). There is a point in the ROC space for each potential threshold value, conditional on the values of FPR and TPR for that threshold (Fernández et al. 2018). Linear interpolation is used to construct the curve. AUC is an average performance metric that helps analysts compare and contrast different models (García, Marqués, and Sánchez 2019).
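The following sketch computes the measures named above (confusion matrix, accuracy, AUC, recall, specificity, type I error) with scikit-learn; the labels and scores are made up for illustration, and, unlike the convention stated earlier, the distressed class is coded as 1 here for brevity.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])   # 1 = distressed/bankrupt
y_score = np.array([.1, .2, .15, .3, .4, .35, .7, .6, .8, .9])
y_pred = (y_score >= 0.5).astype(int)                # default 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)       # recall / TPR
specificity = tn / (tn + fp)       # TNR
type_i_error = fp / (fp + tn)      # FPR = 1 - specificity

print("accuracy:", accuracy_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))
print("recall:", sensitivity, "specificity:", specificity,
      "type I error:", type_i_error)
```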
The Kolmogorov-Smirnov statistic (K-S), Matthews correlation coefficient (MCC), H-measure, and Brier score (BS) appear in only a small number of studies (from 1 to 6); the application of these methods in the financial distress context was found only in 2021-2022 studies.

Table 3. Machine-learning methods applied in the context of financial distress (original columns: No., Method, Definition, Method modification, Source).

Supervised machine-learning methods:
1. Aalen's Additive Regression (AAR): one of the survival analysis models (an alternative to the Cox model). The model can detect changes in coefficients at each distinct survival time due to the linear function used in the hazard rate estimation.
2. Accelerated Failure Time models (AFT): an alternative to proportional hazards models. AFT assumes that a covariate effect corresponds to some constant that accelerates or slows down the progression of a disease. Modification: Weibull Accelerated Failure Time model (WAF).
3. Artificial neural network (ANN), Multilayer perceptron (MP|MLP): the purpose of an artificial neural network (multilayer perceptron) is to imitate the operation of biological neural systems through its neuron connectivity architecture. The neurons are arranged in fully connected layers: input, output, and one or more hidden layers. ANN provides a framework for modeling nonlinear functional mappings between sets of input and output variables. An ANN with several hidden layers is also called a Deep neural network (DNN). Modification: ANN-Backpropagation. Sources: (Sun et al. 2017; Séverin and Veganzones 2021; Tsai et al. 2021; Veganzones, Séverin, and Chlibi 2021; Vellamcheti and Singh 2020; Ye 2021; Zhou 2013).
4. Bagging (Bootstrap aggregation): an ensemble learning technique for reducing variance in a noisy data set. Bagging is the process of selecting random samples of data from the training set with replacement, i.e. individual data points may be chosen many times. In contrast to boosting, the weak learners are trained in parallel. Modification: Bagging CART. Sources: (Barboza, Kimura, and Altman 2017; Chen et al. 2021; du Jardin 2017, 2021a, 2021b, 2021c; Faris et al. 2020; Gnip and Drotár 2019; Keya et al. 2021; Liang et al. 2018; Qian et al. 2022; Roumani, Nwankpa, and Tanniru 2020; Shen et al. 2020; Sisodia and Verma 2018; Sun et al. 2017; Wang et al. 2020; Zelenkov and Volodarskiy 2021; Zhao et al. 2022).
7. Data Envelopment Analysis (DEA): a non-parametric approach that empirically quantifies the relative efficiency between multiple similar inputs and outputs. Modification: DEA additive model.
• Decision tree (C4.5): consists of a root, internal/decision nodes, and leaf nodes. The model has to satisfy the following criteria: 1) the tree must have a single root, which has no entrance and from which the division into branches and leaves begins; 2) each internal/decision node has only one input; 3) there is a unique path from each leaf node to the root. The splitting of internal/decision nodes is made by the information gain ratio (C4.5), Gini coefficient (CART), χ²-based test (CHAID), etc. Decision tree pruning techniques are used to deal with overfitting. The decision tree is a nonparametric method, which means that it does not learn parameters by which to evaluate attributes; the system remembers the main properties of the data, so even a small change in the data can lead to the formation of a new tree.
• Linear regression (LR): a linear approach for modelling the relationship between dependent and independent variables. The use of simple or multiple linear regression depends on the number of independent variables used in the model (one or more than one, respectively). Modifications: Multiple regression, Multivariate regression.
• Random Forest (RF): a type of ensemble estimator that creates multiple decision trees using various samples from the original data set. Each tree in RF is generated from a bootstrap sample of the data, and a random sample of predictors is inspected at each split. The classification is determined by majority voting of the decision trees.
18. Support vector machine (SVM): uses a high-dimensional feature space for class separation by identifying the optimal separating hyperplane (decision boundary). SVM is applied to linearly separable (by a linear hyperplane) and non-linearly separable data. The SVM approach uses quadratic programming to find a unique solution under the idea of structural risk reduction, which tries to lower the bounds on misclassification errors by generating an optimal separating hyperplane in a high-dimensional feature space.
19. TOPSIS: the Technique for Order of Preference by Similarity to Ideal Solution provides ranking scores of alternatives with respect to the positive and negative ideal solutions and is based on geometric distance calculation.
• VIKOR: a multi-criteria method created for ranking a variety of options. VIKOR is designed to choose the compromise solution that is closest to the ideal.
25. Long Short-Term Memory (LSTM): an advanced form or extension of the RNN. LSTM processes information sequentially, but a memory cell remembers and forgets information. Three multiplicative units regulate the flow of information in each memory cell: the input gate, output gate, and forget gate. The three gates control the flow of information into and out of the cell, and the cell recalls values across arbitrary timestamps. Modifications: Bi-LSTM; Dependency Sensitive RNSA (Abdi et al. 2019). Sources: (Abdi et al. 2019; Aljawazneh et al. 2021; Da et al. 2022; Kim, Cho, and Ryu 2021).
26. Recurrent Neural Network: an agent learns from interactions with its environment in discrete time steps to update its mapping between the observed state and the probability of selecting possible actions in a reinforcement learning process. Q-learning, as a typical reinforcement learning approach, imitates human behavior by taking actions in the environment to maximize long-term rewards. Sources: (Abdi et al. 2019; Kim, Cho, and Ryu 2021).
27. Recursive deep learning: an RNN has feedback connections, which can be between hidden units or from the output to the hidden units. It is able to process sequential inputs by having a recurrent hidden state whose activation at each step depends on that of the previous steps (Abdi et al. 2019).

Unsupervised machine-learning methods:
1. K-means: aggregates data points based on certain similarities (nearest mean, cluster centers, or centroids) into k clusters. Modification: Fuzzy C-means (Chou, Hsieh, and Qiu 2017). Source: (Chou, Hsieh, and Qiu 2017).
2. Isolation Forest (IF): the anomaly scores generated from the path length of each observed example, from the root node to the node in which the example is isolated, are used to create the final model. Sources: (Gnip and Drotár 2019; Zoričák et al. 2020).
3. Least-Squares Anomaly Detection (LSAD): based on the idea of the One-Class SVM, which uses a hypersphere to encompass all the instances; however, LSAD applies an extended form of least-squares probabilistic classification as the loss function. Source: (Zoričák et al. 2020).
4. One-Class SVM (OCSVM): aims to separate the majority (training) class instances from outlier instances, which are points lying outside the captured characteristics of the training instances.
5. Auto-encoder: a neural network type where the input and the output are the same. The autoencoder consists of an input layer, one or more hidden layers (encoding layers), and an output layer (decoding layer). The autoencoder's primary goal is to lower the dimensionality of the input data by reducing noise. Modification: Stacked Auto-encoder (Soui et al. 2020).

Discussion
The scope of this study is limited to the period from 2017 to February 2022, which helps to identify under-explored research fields in the financial distress and bankruptcy context. The separation between "Financial Distress" and "Bankruptcy" can be regarded as a "gray" area due to the unclear "Financial distress" class indicator. For this reason, comparing the authors' results is problematic, because the main class indicator differs. In contrast, the concept "Bankruptcy" has a common understanding as a class indicator, but it is too late an indication for decision-makers. Analyzing the data sources used, it is noticeable that researchers choose open data sources (stock exchanges, the Polish data set). Unfortunately, the authors often do not provide the data pre-processing steps or present them succinctly, e.g. "data were pre-processed," "data were normalized," "five feature selection methods were used," etc. This limits the ability to identify common data pre-processing steps in the financial distress and bankruptcy context; therefore, further analysis in this context is needed. The "Financial distress" topic is challenging for artificial intelligence experts due to existing issues with high data dimensionality, imbalance, sentiment analysis, and outliers. The authors analyze these problems separately by combining the most appropriate methodology for their analyzed data. However, at least the dimensionality and imbalance problems have to be addressed in each study in order to obtain comparable results.
This study seeks to bring knowledge and key insights to further researchers by filling the gaps discussed in previous literature reviews. However, further work can be developed in the directions named by other authors that were not discussed in this article: dynamic model applications (Chen, Ribeiro, and Chen 2016; Matenda et al. 2021); switching from binary classification (Chen, Ribeiro, and Chen 2016); tools suitable for data sets from different domains (Bhatore, Mohan, and Reddy 2020); integration of users' knowledge into black-box models (Chen, Ribeiro, and Chen 2016).
Finally, one of the main limitations of this literature analysis is the lack of a dynamic view of the methods used, for example, a timeline showing which methods are now at their peak and which are less applicable. Another interesting direction for a literature review would be the analysis of ensembles. The studies have shown that the authors design their own ensembles of classifiers. In a further review of the financial distress scope, it would be interesting to know which methods or their modifications usually fall into voting ensembles.

Conclusion
The main aim of this study was to identify the context of "Financial distress" and its usage of machine-learning methods, including additional related aspects such as imbalance, dimensionality, etc. This study analyzed 232 articles, most of which are Financial distress and Bankruptcy research from the period 2017 to February 2022, using the guidelines of the PRISMA methodology.
Our main findings are as follows: (1) Researchers commonly choose publicly available data, e.g. public companies or open data sets such as the Polish, Spanish, or Japanese ones. Consequently, the results of the studies are difficult to compare due to the different analysis periods and the inclusion of new data (indicators, sources, etc.) in the analysis.
(2) Data pre-processing steps in the financial distress and bankruptcy context are often omitted or presented succinctly. Information on data normalization was provided in about 14% of the analyzed studies. The commonly used functions are normalization and Z-score. (3) The authors used 27 supervised and 5 unsupervised methods, of which 8 belong to the deep learning subgroup. The most popular method remains Logistic regression for the following reasons: 1) it is one of the first methods applied in the bankruptcy context (Altman's z-score model is based on LR); 2) it is popular in social science for the evaluation of the analyzed variables; 3) it is one of the main methods used in efficiency comparisons with other machine-learning methods. Other commonly used algorithms are ANN, SVM, Decision tree, Random forest, Boosting (AdaBoost, XGBoost), etc. (4) The most popular data pre-processing steps are dimensionality reduction and data balancing, which are becoming essential.
However, each of these topics contains under-explored research fields for future development. (5) Lastly, we analyzed evaluation performance metrics suitable for labels.
For a better evaluation of the methods, researchers use three to five metrics. The most common evaluation metrics are accuracy and AUC, followed by recall, specificity, type I error, etc. The application of the K-S, MCC, H-measure, and BS metrics in the financial distress context was detected only in 2021-2022 studies.

Disclosure statement
No potential conflict of interest was reported by the author(s).