1 Introduction

Although corruption and asset misappropriation tend to occur at greater frequency, financial statement fraud is reported to be the most costly internal (occupational) fraud, with a median loss of 800,000 USD at the global level [1]. Note that this value is based on the estimate of the gross amount of the financial statement misstatement. Financial statement fraud is caused by intentional omission or misstatement of material information in a company’s financial report, including net worth (net income) under- and overstatements using timing differences, improper disclosures, fictitious or understated revenues, and so on. Fraud incidents have an adverse impact on the value of the companies involved, often resulting in bankruptcy filings.

Early detection of financial statement fraud is therefore of paramount importance for companies’ stakeholders. Recently, researchers have shown an increased interest in automatic detection systems using computational intelligence methods [2]. Despite their good prediction performance, the earlier models suffer from several limitations. A major problem with those intelligent systems is the lack of interpretability, which prevents their adoption in the financial industry. Moreover, far too little attention has been paid to modelling the uncertainty inherently present in the knowledge of financial analysts and auditors. Indeed, earlier research has shown that expert forecasts significantly outperform traditional statistical models regarding consistency and prediction accuracy [3]. When combined with data analytics methods, expert knowledge was utilized to detect the most serious financial statement frauds [4] and improve the effectiveness of financial audits [5]. The aim of this study is to propose an interpretable automatic detection system for financial statement fraud. This system incorporates two main components: (1) a feature selection component that reduces the search space of fuzzy rules and thus the number of conditions in the antecedents of the rules, and (2) fuzzy rule-based systems optimized by evolutionary and non-evolutionary algorithms to yield a competitive prediction performance and a highly interpretable rule base at the same time. This paper attempts to show that these competing, often contradictory, objectives can be met in detecting financial statement fraud.

The rest of this paper is organized in the following way. Section 2 gives a brief overview of the recent advances in the detection of financial statement fraud. Section 3 outlines the research methodology, including data description and methods used. Section 4 presents the results of experiments performed on a dataset of U.S. companies. Section 5 concludes the paper and identifies possible future directions.

2 Financial Statement Fraud Detection - A Literature Review

Initial research on financial statement fraud detection concentrated on traditional computational methods, such as logistic regression and shallow neural networks (see [2] for a comprehensive review). For example, Ravisankar et al. [6] compared a wide range of statistical and data mining methods on a dataset of 202 Chinese companies. The results showed that probabilistic neural networks performed best, while the performance of evolutionary algorithms was significantly improved when applying feature selection first.

Detecting financial statement fraud is usually approached as a two-class classification task, categorizing companies as fraudulent or non-fraudulent. However, a multi-class approach was used by [7] to further discriminate between intentional and unintentional financial statement fraud.

Recently, text mining approaches have been used to increase the prediction performance of traditional financial indicators [8,9,10]. It was demonstrated that companies’ communication with their stakeholders may contain a higher level of uncertainty and misleading statements [8]. Negative sentiment present in the managerial statements was considered another important indicator of financial fraud [11, 12].

Fuzzy rule-based systems remain a neglected area of research in this field. In a related study, a fuzzy rule-based system designed by domain experts was proposed to assist auditors in detecting managerial frauds [13]. However, this system was not empirically verified on real-world data and the generated rule base was derived from the subjective judgment of the authors only. To overcome this limitation, a group of experts was created to perform a multi-criteria decision making task using the Analytic Hierarchy Process in [14]. Again, this approach was based on expert knowledge only, without utilizing available company data. Alden et al. [15] employed evolutionary algorithms to learn a rule base and demonstrated that such a system is more accurate than a traditional logistic regression model. Tang et al. [16] extracted crisp rules from the data using a C4.5 decision tree algorithm and incorporated them into a financial statement detection ontology. However, none of those studies has considered the interpretability of the rule-based systems as a key objective.

3 Research Methodology

3.1 Dataset

Our dataset comprised 622 companies, out of which 311 were identified as fraudulent by the U.S. SEC (Securities and Exchange Commission) between the years 2005 and 2015. The matched sample of non-fraudulent companies was obtained based on industry classification and market capitalization. Companies’ annual reports were used as the source of input attributes. Financial indicators were calculated from the reports, and linguistic attributes were obtained using textual analysis. Note that this dataset was previously used in a comparative study of a wide range of machine learning algorithms [12], so we could easily verify the classification performance of the proposed fuzzy rule-based system. As details on the input attributes and their descriptive statistics can be found in [12], here we provide only a brief introduction to the attributes.

Financial indicators included 32 attributes categorized into nine subsets: (1) firm size (assets and revenues), (2) company reputation (investors’ and insiders’ shares), (3) profitability (net income, net margin, profitability ratios), (4) activity (growth in working capital, activity ratios), (5) asset structure (fixed capital), (6) business situation (growth in revenue), (7) liquidity (working capital), (8) leverage (debt ratio), and (9) market value (earnings per share, stock price to earnings, price to earnings, price to book value, reinvestment rate and price to revenue). This selection was based on the theoretical and empirical evidence provided in earlier studies [2, 4, 17]. Poor financial performance is considered an important incentive for employee engagement in financial statement fraud. Growth in revenue and earnings have been detected as particularly important indicators of future financial statement frauds [12].

The selection of linguistic indicators was based on the theoretical assumption that fraudsters use uncertain and negative words more frequently in their communication with the company’s other stakeholders [8, 10, 11]. Therefore, the proportions of those word categories were also used as additional input attributes in this study. More precisely, we calculated the raw frequencies of word categories developed in [10] specifically for the financial domain. The management discussion section of a company’s annual report was used as the source of the communication and all word counts were normalized by the length of the document.

3.2 Feature Selection

It is well known that high-dimensional datasets lead to exponential growth in the number of fuzzy rules. Therefore, feature selection is considered an important task that reduces the search space of fuzzy rules. Moreover, previous systems for financial statement fraud detection have been equipped with a feature selection component because a reduction in data dimensionality may lead to an increase in the accuracy of fraud detection [2]. In addition, the interpretability of a fuzzy rule-based system can be substantially increased by reducing the number of antecedents in the rules.

Here, we used a steady-state genetic algorithm (GA) for wrapper feature selection [18]. This algorithm was designed to select the best feature subsets for fuzzy rule-based systems. The best solution is obtained by using a heuristic algorithm that maximizes the accuracy of a specific learning algorithm. To provide a computationally effective solution, the chosen approach applies the k-nearest neighbour (k-NN) algorithm because it is reportedly highly sensitive to irrelevant features and requires no learning time [18]. The feature selection method is performed in two steps. First, class separability is examined to find the optimal number of features. For this purpose, the Las Vegas filter algorithm is used, with an inconsistency measure serving as the relevance evaluation of features. Thus, the minimum number of inconsistencies is obtained and the candidate subset is used in the next stage. Second, the subset size determines the chromosome length of the GA in the wrapper genetic feature selection. Integer coding is used in the GA feature selection, where each gene in the chromosome represents an attribute. The accuracy of the k-NN algorithm is employed as the fitness function. To avoid overfitting, random resampling of the training data is repeated five times and the performance is measured on the five test classification results. More diversity in the GA population is achieved by adding more than two new chromosomes in each GA generation, while maintaining the advantages of the steady-state reproduction scheme. To keep a balance between exploitation and exploration in the GA, the partially complementary crossover operator and the two-point crossover with repair operator are employed, respectively. Finally, more diversity in the population is introduced by the uniform mutation operator.
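The fitness evaluation described above can be illustrated with a minimal sketch. It is not the implementation from [18]: the function names (`knn_predict`, `subset_fitness`), the 70/30 split ratio and the toy data are assumptions introduced here for illustration. Only the use of k-NN accuracy as the fitness, the integer-coded chromosome and the five random resamplings are taken from the description.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, features, k=3):
    """Classify x by majority vote of its k nearest training instances,
    measuring Euclidean distance only over the selected feature indices."""
    dists = []
    for row, label in zip(train_X, train_y):
        d = math.sqrt(sum((row[f] - x[f]) ** 2 for f in features))
        dists.append((d, label))
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def subset_fitness(X, y, chromosome, k=3, repeats=5, train_share=0.7, seed=0):
    """Fitness of an integer-coded chromosome (a list of feature indices):
    mean k-NN test accuracy over `repeats` random train/test resamplings.
    The 70/30 split is an assumption made for this sketch."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(indices)
        cut = int(train_share * len(indices))
        train, test = indices[:cut], indices[cut:]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        hits = sum(
            knn_predict(train_X, train_y, X[i], chromosome, k) == y[i]
            for i in test
        )
        accuracies.append(hits / len(test))
    return sum(accuracies) / len(accuracies)


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
_rng = random.Random(42)
X = [[0.0 + 0.1 * _rng.random(), _rng.random()] for _ in range(20)] + \
    [[1.0 + 0.1 * _rng.random(), _rng.random()] for _ in range(20)]
y = [0] * 20 + [1] * 20

fitness_relevant = subset_fitness(X, y, chromosome=[0])
fitness_noise = subset_fitness(X, y, chromosome=[1])
```

On this toy data, the chromosome containing the informative feature receives a higher fitness than the one containing only noise, which is the signal the GA exploits when it evolves feature subsets.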

3.3 Fuzzy Rule-Based Systems

Fuzzy rule-based systems include a fuzzification process, an inference system and a defuzzification process that converts the resulting fuzzy sets into the class associated with the input instance [19]. The inference system comprises a data base and a rule base. The data base contains the set of linguistic terms and the corresponding membership functions that define the semantics of the linguistic terms. The rule base contains a set of if-then rules that can be defined as follows.

Let us assume an n-dimensional problem with m training instances classified into M classes. Then, the k-th rule \(R_{k}\) in the fuzzy rule base can be defined as follows:

$$\begin{aligned} \begin{array}{l} {R}_{k}: \text {if}\ {x}_{1}\ \text {is}\ {A}_{1,k}\ \text {and}\ {x}_{2}\ \text {is}\ {A}_{2,k}\ \text {and}\ \dots \text {and}\ {x}_{i}\ \text {is}\ {A}_{i,k}\ \text {and}\ \dots \text {and}\ {x}_{n}\ \text {is}\ {A}_{n,k}\\ \qquad \text {then}\ \text {class}\ {c}_{k}\ \text {with}\ {CF}_{k}, \end{array} \end{aligned}$$
(1)

where \(A_{1,k}, \dots , A_{n,k}\) are antecedent fuzzy sets, \(c_k\) is the consequent class (one of the M classes), \(R_{k}\) denotes the k-th rule, \(k=1, 2, \dots , N\) and \(CF_k\) is the grade of certainty of the k-th rule.
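To make the structure of rule (1) concrete, the following minimal sketch evaluates a small rule base on one input instance. Several choices are assumptions made for illustration and are not prescribed by the text: inputs scaled to [0, 1], five uniformly distributed triangular membership functions, the minimum t-norm for the conjunction, and single-winner reasoning based on the product of the matching degree and \(CF_k\). The names `FuzzyRule`, `triangular_membership` and `classify` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


def triangular_membership(x: float, label: int, n_labels: int = 5) -> float:
    """Membership of x (scaled to [0, 1]) in the label-th triangular fuzzy set
    of a uniform partition; the peaks lie at label / (n_labels - 1)."""
    width = 1.0 / (n_labels - 1)
    peak = label * width
    return max(0.0, 1.0 - abs(x - peak) / width)


@dataclass
class FuzzyRule:
    antecedents: List[int]   # A_{i,k}: one label index per attribute x_i
    consequent: int          # c_k
    certainty: float         # CF_k

    def matching_degree(self, x: Sequence[float]) -> float:
        """Conjunction of the antecedents using the minimum t-norm
        (an assumption of this sketch)."""
        return min(
            triangular_membership(x_i, label)
            for x_i, label in zip(x, self.antecedents)
        )


def classify(rules: List[FuzzyRule], x: Sequence[float]) -> int:
    """Single-winner reasoning: the rule with the largest
    matching degree * CF_k decides the class."""
    winner = max(rules, key=lambda r: r.matching_degree(x) * r.certainty)
    return winner.consequent


# Hypothetical two-rule base over two attributes.
rules = [
    FuzzyRule(antecedents=[0, 4], consequent=0, certainty=0.9),
    FuzzyRule(antecedents=[4, 0], consequent=1, certainty=0.8),
]
predicted = classify(rules, [0.1, 0.9])
```

For the instance (0.1, 0.9), only the first rule has a non-zero matching degree, so its consequent class is returned.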

In our comparative study, we used four types of fuzzy rule-based systems: (1) evolutionary fuzzy rule-based classifiers using Michigan learning (the population size is given and each coded rule is represented by an individual in the evolutionary algorithm), (2) evolutionary fuzzy rule-based classifiers using iterative learning (the rule base is generated gradually), (3) an evolutionary interval-valued fuzzy rule-based classifier (IVTURS), and (4) non-evolutionary fuzzy rule-based classifiers. The algorithms are briefly introduced as follows.

(1) Evolutionary fuzzy rule-based classifiers using Michigan learning:

  • The genetic cooperative-competitive learning (GCCL) algorithm [20] employs a GA to optimize the rule base while the data base is fixed. Thus, a computationally effective classifier with interpretable rule base can be obtained.

  • Genetic programming-based learning of COmpact and ACcurate fuzzy rule-based system for High-dimensional problems (COACH) [21] aims to obtain a compact rule base. Rule confidence and support are combined in the fitness function and a global fitness score is used for the whole population that considers accuracy and the numbers of attributes, antecedents and rules at the same time. The mechanism of token competition is used to maintain diversity in the population of individuals.

(2) Evolutionary fuzzy rule-based classifiers using iterative learning:

  • The genetics-based machine learning (GBML) algorithm [22] combines the Michigan learning approach with the Pittsburgh approach (in which each rule base is handled as an individual). First, rule bases are generated in the Pittsburgh style. Then, with a prespecified probability, a single iteration of the Michigan approach (that is, rule generation and replacement of the worst rules) is applied to each generated rule base. Finally, the best rule set is added to the current population to form the next population. This hybrid approach was more effective than GBML performed using the Michigan or Pittsburgh approach separately because it combines the high search ability of the Michigan approach with the direct optimization ability of the Pittsburgh approach.

  • Structural learning algorithm on vague environment (SLAVE) [23] uses a GA to generate rules iteratively. Each rule is then penalized by eliminating from the training data all those instances that are covered by the previous rule base. The process of generating rules ends when all the instances are eliminated and, therefore, it is not required to set the number of rules a priori.

  • Steady-state GA for extracting fuzzy classification rules (SGERD) [24] uses a nonrandom selection strategy to retain only the best rules in the rule base. Rule confidence and support, as well as accuracy, are considered in the fitness function of the steady-state GA.

  • New SLAVE (NSVL) [25] was proposed to enhance the effectiveness of SLAVE. Unlike SLAVE, NSVL obtains a complete rule (the antecedent and consequent) in each iteration, thus reducing the required learning time. In each iteration, the GA selects the best antecedent for the fixed consequent (class).

(3) Evolutionary interval-valued fuzzy rule-based classifier:

  • IVTURS [26] is a generalization of fuzzy rule-based systems, which applies interval-valued fuzzy sets (defined by lower and upper bounds) instead of fuzzy sets in the antecedents of rules. Thus, an additional level of uncertainty can be modelled in the rules. IVTURS is performed in three steps. First, a fuzzy association rule-based classification algorithm is used to obtain an initial population of rules. Then, the rule weights are combined with interval matching degrees between the input instances and antecedents. Finally, an evolutionary algorithm is used to tune the rule base and data base with classification accuracy as the fitness function.

(4) Non-evolutionary fuzzy rule-based classifiers:

  • The weighted fuzzy (WF) classifier [27] generates the rule base by considering both the weights and the compatibility of training instances. An incremental learning algorithm is used in the WF classifier for the generated rule base to adjust \(CF_k\) in order to maximize the classification accuracy.

  • The rule weight (RW) classifier [28] generates the rule base based on an association between the feature space and the space of the classes. Membership degrees between a training instance and the fuzzy partitions are calculated using a conjunction operator and the matching fuzzy region is assigned to the instance. Thus, the antecedents of fuzzy rules are generated and the instance’s class is used as the consequent. The maximum confidence for the antecedent is used as the rule weight.

  • FURIA [29] is a fuzzy extension of the RIPPER algorithm. First, a modified RIPPER is performed by gradually adding antecedents so that the highest possible accuracy of the rule is achieved. The rule set is learned for each class using a one-vs-rest decomposition to avoid systematic bias in favor of one class. Then, the rule base is pruned to minimize its description length. Maximum support bound is used as the criterion for the fuzzification of antecedents (the antecedent with the largest rule purity is chosen). The rules are obtained by replacing crisp intervals from RIPPER with fuzzy intervals (trapezoidal membership functions).

3.4 Performance Evaluation

To evaluate the classification performance of the fuzzy rule-based systems, we used two measures, namely classification accuracy and misclassification cost (hereinafter referred to as cost). The cost is a particularly important measure in financial statement fraud detection task because false negative classification (fraudulent companies classified as non-fraudulent) is associated with higher cost compared with false positive classification (non-fraudulent companies classified as fraudulent). Here, we use the cost ratio estimate of 1:2 based on the ratio between audit fees and loss incurred by financial statement frauds [12]. In other words, the cost can be calculated as:

$$\begin{aligned} \text {cost} = FPR + 2 \times FNR, \end{aligned}$$
(2)

where FPR is false positive rate and FNR denotes false negative rate.
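As a minimal sketch of Eq. (2), the cost can be computed from the entries of a confusion matrix, treating the fraudulent class as positive. The function name and the example counts are hypothetical and are not results from this study.

```python
def misclassification_cost(tp: int, fp: int, tn: int, fn: int,
                           fn_weight: float = 2.0) -> float:
    """cost = FPR + fn_weight * FNR, with the fraudulent class as positive."""
    fpr = fp / (fp + tn)   # non-fraudulent companies flagged as fraudulent
    fnr = fn / (fn + tp)   # fraudulent companies classified as non-fraudulent
    return fpr + fn_weight * fnr


# Hypothetical confusion matrix for one balanced test fold (not from the paper).
cost = misclassification_cost(tp=27, fp=5, tn=26, fn=4)
```

Under this definition, the cost ranges from 0 (no errors) to 3 (every instance misclassified), and a missed fraud contributes twice as much as a false alarm.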

To evaluate the interpretability of the fuzzy rule-based systems, two measures at the rule base level were used, the number of conditions in the antecedents and the number of rules. Note that there is an inverse relation between the accuracy and interpretability measures and that the optimal trade-off between them depends on the needs of the user [30].
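The two rule-base-level measures are simple counts, as the following sketch illustrates. The rule representation (a dictionary of antecedent conditions plus a class label), the linguistic labels and the function name are assumptions made for illustration. Because the text does not state whether conditions are reported in total or per rule, the sketch returns both.

```python
from typing import Dict, List, Tuple

# A rule is (antecedent conditions, consequent class); attributes that do not
# appear in the dictionary are treated as "don't care".
Rule = Tuple[Dict[str, str], int]


def interpretability(rule_base: List[Rule]) -> Dict[str, float]:
    """Count the rules and the antecedent conditions of a rule base."""
    n_rules = len(rule_base)
    n_conditions = sum(len(antecedent) for antecedent, _ in rule_base)
    return {
        "rules": n_rules,
        "conditions_total": n_conditions,
        "conditions_per_rule": n_conditions / n_rules if n_rules else 0.0,
    }


# Hypothetical rule base (labels and rules are illustrative only).
rule_base = [
    ({"InsidersShares": "low", "NegativeSentiment": "high"}, 1),
    ({"InsidersShares": "high"}, 0),
    ({"GrowthInEPS": "medium", "PEG": "high", "InsidersShares": "low"}, 0),
]
summary = interpretability(rule_base)
```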

4 Experimental Results

First, 10-fold cross-validation was applied to the dataset to avoid overfitting. For all the evaluation measures, we report average values from the 10 experiments together with standard deviations. To test the results statistically, we performed the Wilcoxon signed-rank test.
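The evaluation protocol can be sketched as follows. This is an illustrative outline rather than the procedure used in the experiments: the helper names, the random (non-stratified) fold assignment, the use of the sample standard deviation and the majority-class placeholder classifier are all assumptions. The statistical test is not included in the sketch.

```python
import random
import statistics
from typing import Callable, List, Tuple

Split = Tuple[List[int], List[int]]  # (training indices, test indices)


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Split]:
    """Shuffle the indices 0..n-1 and partition them into k disjoint test folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits


def cross_validate(n: int, evaluate: Callable[[List[int], List[int]], float],
                   k: int = 10) -> Tuple[float, float]:
    """Return the mean and sample standard deviation of the k fold scores."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder labels with the class balance of the dataset (311 of 622 fraudulent).
labels = [1] * 311 + [0] * 311


def majority_baseline_accuracy(train: List[int], test: List[int]) -> float:
    """Placeholder classifier: predict the majority class of the training fold."""
    positives = sum(labels[i] for i in train)
    majority = 1 if 2 * positives >= len(train) else 0
    return sum(labels[i] == majority for i in test) / len(test)


mean_acc, std_acc = cross_validate(len(labels), majority_baseline_accuracy)
```

Any classifier can be plugged in through the `evaluate` callback, which receives the training and test indices of one fold and returns a score such as accuracy or cost.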

As noted above, we performed the wrapper feature selection first to obtain the best feature subset for the fuzzy rule-based systems. In our experiments, we followed the recommendations in [18] and set the learning parameters of the steady-state GA for feature selection as follows: k = 3 in the k-NN algorithm, GA with 100 individuals in population and 5,000 generations. The number of attributes to be selected was increased in a stepwise fashion in order to maximize k-NN accuracy. In Table 1, we present the attributes selected in at least 30% of the 10 experiments.

Table 1. The most frequently selected features.

As expected, financial ratios prevailed in the selected feature subsets. This finding also corroborates those presented in earlier research [2, 4, 12, 17]. On average, 5.6 ± 1.2 features were selected using the steady-state GA method, indicating a substantial reduction rate of 82.5% in the search space of fuzzy rules.

The feature subsets were further applied to predict financial statement fraud using the fuzzy rule-based systems. All experiments were conducted in the KEEL software environment and the settings of the algorithms are presented in Table 2. Note that we fixed the granularity of fuzzy partitions to 5 linguistic labels (membership functions) for all the algorithms to obtain a fair comparison.

Table 3 presents the prediction performance of the fuzzy rule-based systems in terms of accuracy and cost. The obtained results clearly show that FURIA performed best regarding both performance measures, achieving an average accuracy of 86.8% with low cost. This suggests that FURIA performed well on both classes, fraudulent and non-fraudulent, and that it was capable of predicting fraudulent companies correctly. GBML, IVTURS and COACH performed relatively well concerning accuracy, while SLAVE and SGERD provided a relatively low cost, indicating their good performance on the class of fraudulent companies. The Wilcoxon signed-rank test confirmed that FURIA and GBML statistically outperformed the remaining methods. Overall, the results suggest that the evolutionary fuzzy rule-based systems using Michigan learning were not effective, whereas the gradual generation of the rule base as applied in iterative learning is a more effective strategy.

Table 2. Settings of fuzzy rule-based systems.

To compare the prediction performance with state-of-the-art models, we used the models that performed best for the same dataset in [12]. More specifically, we used Bayesian belief networks (BBN), Decision table/Naïve Bayes (DTNB) and Random forest (RF) for comparative purposes. Note that other models, such as Support vector machines and several ensemble methods, were outperformed in [12]. We obtained accuracies of 89.8%, 87.8% and 90.4% for BBN, DTNB and RF, respectively. Only RF performed statistically better than FURIA at \(P<0.05\). The respective cost obtained ranged between 0.340 and 0.406 for the RF and DTNB model, with statistically insignificant differences to FURIA. These results are consistent with the view that RF is a state-of-the-art method in this domain [31]. Note that these results must be compared with those in [12] with caution, given that we used a different feature selection method. Nevertheless, from these results we can deduce that the fuzzy rule-based systems performed relatively well compared with their machine learning competitors.

The number of conditions in the antecedents of the rules remained low across all methods, with GCCL as the best performer (Table 4). This can be attributed to the feature selection component performed in the first step. More complex rules were generated only by the COACH and SLAVE methods. Regarding the number of rules, SGERD provided the best interpretability at the rule base level (and statistically outperformed the other methods), with fewer than three rules on average. In contrast, WF failed to achieve reasonable interpretability. The remaining methods performed well concerning the number of rules (below 33 rules), suggesting that even the most accurate methods provided highly interpretable prediction models at the rule base level. For example, the FURIA rules with the highest grade of certainty for both classes were as follows:

  • (InsidersShares <= 0.232(-> 0.407)) and (GrowthInEPS <= 0.485(-> 0.486)) and (ExpGrowthInRevenue >= 0.319(-> 0.314)) and (PEG >= 0.021(-> 0.019)) => FRAUD = 0 (CF = 0.91);

  • (InsidersShares <= 0.158(-> 0.5)) and (NegativeSentiment >= 0.473(-> 0.465)) => FRAUD = 1 (CF = 0.77).
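To illustrate how such a rule is read, the sketch below evaluates the FRAUD = 1 rule on a few inputs. It rests on two assumptions that should be checked against [29]: that the notation `x <= a(-> b)` denotes a one-sided trapezoidal fuzzy interval with full membership up to `a` that falls linearly to zero at `b` (and the mirror image for `>=`), and that the antecedents are combined by the product. The function names and the input values are hypothetical.

```python
def at_most(x: float, core: float, support: float) -> float:
    """Membership for 'x <= core(-> support)': 1 up to core,
    falling linearly to 0 at support (core < support)."""
    if x <= core:
        return 1.0
    if x >= support:
        return 0.0
    return (support - x) / (support - core)


def at_least(x: float, core: float, support: float) -> float:
    """Membership for 'x >= core(-> support)': 1 from core upwards,
    falling linearly to 0 at support (support < core)."""
    if x >= core:
        return 1.0
    if x <= support:
        return 0.0
    return (x - support) / (core - support)


def fraud_rule_degree(insiders_shares: float, negative_sentiment: float) -> float:
    """Degree to which the FRAUD = 1 rule fires; the antecedents are combined
    by the product, which is an assumption of this sketch."""
    return (at_most(insiders_shares, 0.158, 0.5)
            * at_least(negative_sentiment, 0.473, 0.465))


# Hypothetical attribute values (not taken from the dataset).
full_match = fraud_rule_degree(0.10, 0.60)      # both conditions fully satisfied
partial_match = fraud_rule_degree(0.329, 0.60)  # InsidersShares in the fuzzy zone
no_match = fraud_rule_degree(0.60, 0.60)        # InsidersShares outside the support
```

The fuzzy boundaries mean that a company slightly above the crisp threshold on InsidersShares still partially activates the rule instead of being excluded outright.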

Although expert forecasts were present in the set of attributes, note that domain experts were not involved in the design of the rule bases. In other words, the rules were generated automatically using membership functions distributed uniformly over the universe of discourse.

Table 3. Prediction performance of fuzzy rule-based systems.
Table 4. Interpretability measures of fuzzy rule-based systems.

5 Conclusion

Returning to the hypothesis posed at the beginning of this paper, it is now possible to state that the proposed fuzzy rule-based system with evolutionary feature selection yields both competitive accuracy and desirable interpretability. One of the most significant findings to emerge from this paper is that relatively accurate fraud detection can be achieved using only a few fuzzy rules with several conditions in their antecedents. Therefore, the proposed system aspires to become an effective decision support system for auditors and financial analysts.

Finally, several important limitations need to be considered. First, our findings might not be transferable to other countries due to the differences in reporting and auditing. Additional empirical evidence for different countries is therefore recommended in future research. Another limitation might be the fixed granularity at the fuzzy partition level. Here we attempted to provide sufficient granularity, while retaining interpretability of fuzzy partitions at the recommended level [30]. However, different levels of granularity should be examined to provide empirical evidence. It is also suggested that the ensembles of fuzzy rule-based systems are investigated in future studies because ensemble methods have shown promising results for crisp rule-based systems in the related literature [31]. Further investigation into semantic interpretability of the fuzzy rule-based systems is also strongly recommended.