Predicting financial statement manipulation in South Africa: A comparison of the Beneish and Dechow models

Abstract
Recently, South Africa has suffered from several large financial statement frauds. To assist stakeholders in identifying fraud, this study investigated the ability of the Beneish M-score and the Dechow et al. F-score to identify fraud in South Africa. The study also explored similarities in earnings management characteristics between false positives and fraudulent companies. Finally, the study re-estimated the models’ coefficients based on current South African data to determine if this improved their predictive capabilities. The study used a sample of 23 manipulated and 2 320 non-manipulated observations from 2006 to 2018 and found that both scores showed low sensitivity and precision. The false positives showed similar or higher levels of earnings management compared to the manipulators. Re-estimating the coefficients reduced the M-scores’ sensitivity by, on average, 6.52% but improved precision by, on average, 4.21%. Conversely, re-estimation increased the F-scores’ sensitivity by, on average, 58.70% but increased the type II error by, on average, 48.09%. These findings suggest that either the M- and F-scores are unsuitable in the South African context or that regulators have failed to identify manipulators adequately. Therefore, investors and other stakeholders should use caution when applying these models in South Africa.


Introduction
Globally, financial statement fraud accounts for ten per cent of occupational frauds (Association of Certified Fraud Examiners, 2020). While this is the least common of the three major fraud categories (asset misappropriation, corruption and financial statement fraud), it is the costliest, resulting in a median loss of United States (US) $954 000 in 2020 (Association of Certified Fraud Examiners, 2020). Concerningly, an increase in financial statement fraud is anticipated in the post-COVID-19 pandemic period (Association of Certified Fraud Examiners, 2021). Financial statement fraud undermines the quality of financial data utilised to make economic decisions. Poor economic decisions lead to financial loss for the stakeholders and may have negative consequences for an economy due to an inefficient allocation of resources (Pududu & De Villiers, 2016).
South Africa (SA) is no stranger to financial statement fraud, with the Steinhoff and Tongaat-Hulett scandals being two of the largest frauds in recent years. The Steinhoff scandal broke in December 2017 with the resignation of then-CEO Markus Jooste and the commencement of an investigation into accounting irregularities, including overstating revenue and hiding losses in off-balance-sheet companies (Hlobo et al., 2022; Rossouw & Styan, 2019). These revelations resulted in the share price declining from ZAR45.65 at the start of trading on 6 December 2017 to ZAR17.61 by the close of the day (Van Der Linde, 2022). By the close of trading on 8 December 2017, the share price had declined to ZAR6.00 and continued to descend (Rossouw & Styan, 2019). A few months later, in 2018, fraud at Tongaat-Hulett was revealed. The company's financial results had been overstated by approximately ZAR4.5 billion through the overstatement of revenue and assets and the understatement of expenses (Hlobo et al., 2022; Muzata & Marozva, 2022).
In addition to the frauds mentioned above, the PricewaterhouseCoopers (2020) Global Economic Crime and Fraud survey reported that SA had the third-highest occurrence of economic crime in the world, after India and China. The survey revealed that 60% of SA companies had been affected by fraud or economic crime between 2009 and 2020, compared to 47% of companies globally. The survey indicated that the percentage of companies experiencing accounting and financial statement fraud in SA had increased from 22% in 2018 to 34% in 2020.
Notwithstanding the prevalence of fraud in SA and the related economic costs, companies' responses to fraud prevention and detection have been ineffective. In Sub-Saharan Africa (a region that includes SA), an external audit is the most common anti-fraud control, despite only being responsible for the initial detection of 4% of frauds (Association of Certified Fraud Examiners, 2020). Concerningly, several financial statement frauds are committed with the auditor's knowledge (Mongwe & Malan, 2020). Only 58% of companies in SA reported having performed an investigation of their most severe fraud, and 59% of such frauds were never reported to the board of directors, 66% were not reported to the appropriate regulator, and 72% were never disclosed to the auditors (PricewaterhouseCoopers, 2020).
Given the country's high levels of fraud, SA provides a unique environment to study the detection of financial statement fraud. The country has the third-largest economy in Africa and is an emerging economy (World Bank, 2020), and is characterised by a small stock exchange, an insider economy, concentrated ownership and weak legal enforcement. These factors increase the risk of fraud (Pududu & De Villiers, 2016). Although investors in a high-risk country should be able to better detect manipulated financial statements, SA investors struggle to do so (Rabin, 2016). Despite these negative characteristics, SA has, until 2017, consistently ranked highly in the World Economic Forum's (2017) Global Competitiveness Report in terms of strong investor rights, the strength of auditing and financial reporting standards, protection of minority shareholders, efficiency of corporate boards and firm ethical behaviour. Following the revelations around the Steinhoff scandal, SA's rankings in the Global Competitiveness Reports declined markedly post-2017.
Academic research on fraud detection in SA is limited. An early study by Koornhof and Du Plessis (2000) considered red flags as an early warning system to identify potential fraud. A series of articles used machine learning models to identify qualified audit opinions (see Moepya et al., 2016; Moepya, Akhoury, et al., 2014). Finally, Rabin (2016) used earnings discontinuities to identify companies engaging in earnings management, a precursor to financial statement fraud (Mishra & Malhotra, 2016).
Given SA's prevalence of fraud, the inability of investors to detect misrepresented financial statements, and the limited academic literature on fraud detection, the purpose of this study was three-fold. The first objective used a sample of 23 manipulated observations and 2 320 non-manipulated observations from 2006 to 2018 to determine the usefulness of two popular financial statement fraud detection models (namely the Beneish (1999) M-score and the Dechow et al. (2011) F-score) in correctly detecting cases of fraud in SA. Several recent academic studies in African countries have used these models as proxies for financial statement fraud risk (see, for example, Mavengere (2015), Nyakarimi (2022), Nyakarimi et al. (2020), Okiro and Otiso (2021), and Onyebuchi and Nkem (2021)). However, few studies have thoroughly tested the models' prediction abilities in contexts outside the United States and, specifically, in Africa. Consequently, Rad et al. (2021) call on researchers to test the accuracy of fraud detection models to determine their effectiveness in the contexts in which they are applied. This is particularly relevant given that both models were developed in the US using pre-2005 data. As South Africa is considered an emerging economy and uses International Financial Reporting Standards (IFRS), it provides a very different context to the US, a developed country that uses US GAAP.
The second objective was to investigate the nature of the false positives produced by these models to determine whether they have similar earnings management characteristics to fraudulent companies. Concerningly, prediction models tend to generate many false positives (Beneish & Vorst, 2021; Walker, 2020). Consequently, Dechow et al. (2011) call for further research into the characteristics of false positives, but research in this area is limited. The final objective was to re-estimate the coefficients of the two models based on SA data to increase the models' predictive ability in SA. This addresses initial concerns about the differences between the US (where the models were developed) and SA contexts, as well as the later period under consideration.
This study contributes to the existing body of knowledge by showing that the M- and F-scores perform poorly in correctly identifying manipulating companies in SA. African studies have incorrectly relied on the earlier good performance of these models in non-African contexts (such as the US, Europe and Asia) without testing the validity of the models in an African context. Of further concern is that recent studies in the US and China have shown a decline in the performance of these prediction models (see, for example, Beneish and Vorst (2021) and Lu and Zhao (2021)), highlighting the need to test these models thoroughly in different contexts before relying on them. The study further contributes to understanding the nature of false positives generated by the models. In this study, false positives were shown to have similar or higher levels of accruals-based earnings management compared to the manipulator sample, highlighting that the models may not be picking up fraud but rather aggressive accounting practices. Finally, the study provides evidence that re-estimation may not improve the models' performance. Re-estimation of the M-score coefficients using publicly available South African data reduces the models' ability to identify manipulators correctly by, on average, 6.52%. Conversely, re-estimating the coefficients for the F-score improves the scores' ability to classify manipulators correctly by, on average, 58.70%, but the number of false positives is substantially increased.
The remainder of the article is organised as follows: section 2 presents the literature review and hypothesis development, section 3 details the methodology applied in the study, the results are presented and discussed in sections 4 and 5 and, finally, section 6 concludes.

Defining financial statement fraud
Financial statement fraud is defined as an intentional misstatement of financial statements to gain some benefit (Association of Certified Fraud Examiners, 2020). It is essential to distinguish between financial statement fraud and earnings management. While both relate to intentional misstatement for economic gain, financial statement fraud occurs outside acceptable accounting standards, while earnings management occurs within such standards (Albizri et al., 2019).

Financial statement fraud detection models
Financial statement fraud detection models incorporate financial ratios and other elements, such as textual analysis, which contain proxies for the fraud risk factors identified in the theoretical literature. Models have been developed using various methods, including simple financial ratios, statistical methods (such as logit and probit models) and advanced machine learning methods (such as artificial neural networks and support vector machines). While neural networks are the most widely used method in the academic literature, they are complex, lack transparency and are less interpretable (Mongwe & Malan, 2020). As a result, they are not suitable for widespread use in emerging markets such as SA. In addition, these advanced methods do not necessarily deliver predictive power superior to that of the F-score or a simple screen of sales growth (Walker, 2020). Mongwe and Malan (2020) claim that there is no overall best method, with performance often depending on the data set used.
For these reasons, this study used the M- and F-scores. Both models are widely used in the literature and require only information directly obtainable from the company financial statements to estimate. They can thus serve as suitable screening tools (Skousen & Twedt, 2009), particularly in emerging economies where there is increased information asymmetry and a lack of comprehensive databases compared to advanced economies. In addition, the F-score is considered the standard in financial statement fraud prediction (Walker, 2020).

Beneish (1999) M-score
The M-score was developed by Beneish (1999) using probit estimation with data from 1982 until 1992. US Securities and Exchange Commission (SEC) enforcement actions and news reports were used to identify 74 non-financial US companies that manipulated their earnings, matched to 2 332 non-manipulators by industry and year. The financial statement elements used to predict manipulation were based on signals identified in the academic and practitioner literature. The unweighted model, as estimated by Beneish (1999), is as follows:

$$M = -4.840 + 0.920\,\mathrm{DSRI} + 0.528\,\mathrm{GMI} + 0.404\,\mathrm{AQI} + 0.892\,\mathrm{SGI} + 0.115\,\mathrm{DEPI} - 0.172\,\mathrm{SGAI} + 4.679\,\mathrm{TATA} - 0.327\,\mathrm{LVGI} \tag{1}$$

where DSRI refers to the days' sales in receivables index, GMI refers to the gross margin index, AQI denotes the asset quality index, SGI denotes the sales growth index, DEPI is the depreciation index, SGAI is the sales, general and administrative expenses index, TATA refers to the total accruals to total assets, and finally, LVGI is the leverage index (Beneish, 1999). The detailed variable calculations are presented in Appendix 1.
Beneish (1999) then determined three cut-off points that minimised the expected cost of misclassification (ECM) at different relative costs of type I and II errors. These cut-off points were −1.49, −1.78 and −1.89, representing relative costs of 10:1, 20:1 and 40:1, respectively, where a score greater than the cut-off indicates that the company is classified as a manipulator.
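The eight-variable score and the cut-off rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the coefficients are those commonly reported for the unweighted Beneish (1999) model and should be verified against the original paper, and the two firm-years (`steady`, `aggressive`) are hypothetical values chosen only to show the mechanics.

```python
# Coefficients of the unweighted eight-variable model as commonly reported
# from Beneish (1999); an assumption to verify against the original paper.
INTERCEPT = -4.840
COEFFICIENTS = {
    "DSRI": 0.920,
    "GMI": 0.528,
    "AQI": 0.404,
    "SGI": 0.892,
    "DEPI": 0.115,
    "SGAI": -0.172,
    "TATA": 4.679,
    "LVGI": -0.327,
}

# Cut-offs from the text, keyed by the relative cost of type I to type II errors.
CUTOFFS = {"10:1": -1.49, "20:1": -1.78, "40:1": -1.89}


def m_score(variables):
    """Return the M-score for a dict holding all eight variables."""
    return INTERCEPT + sum(c * variables[name] for name, c in COEFFICIENTS.items())


def classify(score):
    """Flag an observation as a manipulator under each cut-off (score > cut-off)."""
    return {cost: score > cutoff for cost, cutoff in CUTOFFS.items()}


# Hypothetical firm-years (illustrative values, not taken from the study's data).
# Every index equal to 1 and zero accruals means "no change from the prior year".
steady = {"DSRI": 1.0, "GMI": 1.0, "AQI": 1.0, "SGI": 1.0,
          "DEPI": 1.0, "SGAI": 1.0, "TATA": 0.0, "LVGI": 1.0}
aggressive = dict(steady, DSRI=1.5, SGI=1.6, TATA=0.10)

steady_score = m_score(steady)
aggressive_score = m_score(aggressive)
print(round(steady_score, 3), classify(steady_score))
print(round(aggressive_score, 3), classify(aggressive_score))
```

Under these assumed coefficients the unchanged firm-year scores about −2.48 and falls below all three cut-offs, while faster receivables growth, sales growth and higher accruals push the second firm-year above all of them.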
Not all the variables in the M-score are equally important (Paolone & Magazzino, 2015). As a result, a simplified five-variable model was also developed in the literature as follows (Nyakarimi, 2022): Where the variables maintain their meaning from the original model. However, as the full M-score has not yet been thoroughly tested in the South African context, and in line with the majority of academic literature (see, for example, Aghghaleh et al. (2016), Beneish et al. (2013), Beneish and Vorst (2021), Cecchini et al. (2010), Jones et al. (2008), Kamal et al. (2016), Price et al. (2011) and Tarjo and Herawati (2015)), this study uses the original M-score, inclusive of all eight variables.

Dechow et al. (2011) F-score
Dechow et al. (2011) also recognised the usefulness of financial information beyond accruals to detect financial statement fraud. Unlike prior models, however, they aimed to allow a user to calculate the F-score for an individual company and simplify the assessment of whether it was misstating its financial statements. To achieve this, they did not include any indices as their variables or perform any form of matching between manipulating and non-manipulating firms. Using a total of 2 190 accounting violations identified by the US SEC from May 1982 to June 2005, they developed three models using logistic regression to detect manipulation. Model 1 contained financial statement variables only as follows:
Model 2 introduced off-balance sheet and non-financial variables as follows: Where ΔEMP and LEASE represent the abnormal change in employees and the existence of operating leases, respectively (Dechow et al., 2011).
The first model offers two advantages. First, it contains most of the predictive power. Second, it is the least restrictive model, as the required information may be accessed from financial statements (Price et al., 2011;Skousen & Twedt, 2009). This second benefit is particularly relevant for emerging economies. Thus, given the importance of this second benefit for the current study's context, as well as in line with the majority of the prior literature (see, for example, Aghghaleh et al. (2016), Chakrabarty et al. (2022), Price et al. (2011) and Walker (2020)), Model 1 of the F-score is used in this study.
Following the calculation of the predicted value, it is then converted to a probability using the logistic function as follows:

$$\text{Probability} = \frac{e^{\text{Predicted value}}}{1 + e^{\text{Predicted value}}}$$

Finally, the F-score is calculated by dividing the probability by the "unconditional expectation of misstatement" (UEM). The UEM is the proportion of misstated firms to total firms (Dechow et al., 2011:60). Companies that obtain an F-score above one are considered an above-normal risk, whilst companies scoring above 2.45 have a high risk of manipulation (Dechow et al., 2011).
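The conversion from predicted value to F-score can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the band labels are hypothetical, the predicted value is taken as an input rather than computed from the Model 1 variables, and the UEM of 0.0037 is the value this study attributes to Dechow et al. (2011).

```python
import math

US_UEM = 0.0037  # UEM the study attributes to Dechow et al. (2011)


def probability_from_predicted_value(predicted_value):
    """Logistic transformation of the logit model's predicted value."""
    return math.exp(predicted_value) / (1.0 + math.exp(predicted_value))


def f_score(predicted_value, uem=US_UEM):
    """F-score: predicted probability scaled by the unconditional expectation."""
    return probability_from_predicted_value(predicted_value) / uem


def risk_band(score):
    """Bands described in the text: above 1 and above 2.45."""
    if score > 2.45:
        return "high risk"
    if score > 1.0:
        return "above-normal risk"
    return "normal or below-normal risk"


# A predicted value equal to the logit of the UEM yields an F-score of one,
# i.e. the same probability as a randomly drawn firm-year.
baseline_value = math.log(US_UEM / (1.0 - US_UEM))
baseline_score = f_score(baseline_value)

# Hypothetical predicted value for illustration only.
example_score = f_score(-4.5)
print(round(baseline_score, 3), risk_band(baseline_score))
print(round(example_score, 3), risk_band(example_score))
```

Scaling by the UEM makes the score read as a multiple of the baseline misstatement rate, which is why a value of one marks the boundary between normal and above-normal risk.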

Comparative performance literature
Numerous studies have investigated the ability of the M- and F-scores to detect financial statement fraud. In his original study, Beneish (1999) determined that the M-score could correctly detect 76% of manipulating firms and 82.5% of non-manipulating companies in the estimation sample. The model only identified 56.1% of manipulators in the holdout sample, although the correct classification of non-manipulating companies rose to 90.9%. Several later studies also found positive results for the model. In the US, using a maximum of 142 manipulated and 72 815 non-manipulated observations from 1988 to 2001, Jones et al. (2008) found that the model was significantly positively associated with both the occurrence of fraud and the magnitude of the fraud. Using a later sample of 43 534 US observations over the period 1993 until 2010, Beneish et al. (2013) showed that the M-score could identify 71% of manipulators. In Asia, Tarjo and Herawati (2015) used a matched sample (based on assets and industry) of 35 manipulators and 35 non-manipulators from 2001 to 2014. They found that 77.1% of the manipulators and 80% of the non-manipulators were correctly classified. In Malaysia, Kamal et al. (2016) tested the M-score's ability to identify 17 manipulated companies from 1993 to 2014. They reported an 82% accuracy when using a −2.22 cut-off, a 76% accuracy for a −1.89 cut-off and a 71% accuracy for the −1.78 cut-off.
Regarding the F-score, in their original study, Dechow et al. (2011) identified that Model 1 correctly classified 68.6% of manipulating companies and 63.7% of non-manipulators in the estimation sample and 73.8% of manipulating companies and 61.7% of non-manipulating companies in the holdout sample. A subsequent study in the US from 1991 until 2008 by Chakrabarty et al. (2022) used a sample of 853 manipulators and 119 967 non-manipulators. They found that the F-score correctly identified 68.5% of manipulators and 57.5% of non-manipulators.
Based on the above results and the detective power of the M- and F-scores, recent African literature has relied on these models as proxies for fraud (see, for example, Mavengere (2015), Nyakarimi (2022), Nyakarimi et al. (2020), Okiro and Otiso (2021), and Onyebuchi and Nkem (2021)). However, these studies ignore that these models have not been tested in the African context, where they may not be applicable due to the different context from the US and the later period (Lu & Zhao, 2021). Further, more recent studies have found that the models, particularly the M-score, are less able to correctly predict manipulation in recent periods. For example, Beneish and Vorst (2021) used a sample of 768 manipulated observations and 136 144 non-manipulated observations from 1979 to 2016 in the US. They found that the M-score only identified 23.18% of manipulators. Likewise, Lu and Zhao (2021) randomly selected 40% of a sample of 190 manipulators and 9 693 non-manipulators for Chinese listed firms. They found that the M-score could only detect 29.63% of the fraud sample.
Thus, given the mixed findings and the seeming decline in the models' performance, there is a need to test whether the M- and F-scores are relevant in the SA context before being able to rely on the models as proxies for fraud risk. Consequently, the following hypothesis is drawn:

H1:
The M- and F-scores can detect financial statement fraud in SA.
Several studies have compared the performance of the M- and F-scores on a homogeneous sample. These studies have demonstrated that, while both models can correctly identify manipulating companies, the F-score is a more robust model with greater predictive accuracy. Cecchini et al. (2010) used US data from 1991 to 2003. Using 149 fraudulent observations matched to 3 389 non-fraudulent observations (based on industry and year), they found that the M-score correctly classified 54.2% of fraudulent and 45.5% of non-fraudulent observations. Using 57 fraudulent and 1 244 non-fraudulent observations, the F-score outperformed the M-score by correctly identifying 70.0% of fraudulent and 84.9% of non-fraudulent observations. Price et al. (2011) also studied US companies. They used a total sample of 57 185 observations from 1994 until 2008, including 866 SEC enforcement actions, 542 accounting irregularities and 948 lawsuits. Their results showed that the F-score outperformed the M-score. In a Malaysian context, Aghghaleh et al. (2016) used a one-for-one matched sample (based on industry and year) of 82 fraudulent observations from 2001 to 2014. They found that the F-score identified a higher proportion of fraudulent observations than the M-score (73.17% compared to 69.51%) with a lower type II error (26.83% compared to 30.49%).
Based on these studies, the F-score seems to have greater detecting power than the M-score. Therefore, the following hypothesis is drawn:

H2:
The F-score outperforms the M-score in detecting financial statement fraud in SA.

Earnings management characteristics of false positives
A fundamental problem with financial statement fraud detection models is the high occurrence of type II errors (false positives) generated (Beneish & Vorst, 2021). This problem is particularly prevalent when detecting a rare event such as financial statement fraud (Walker, 2020). Given the inherent unobservability of financial statement fraud and the resource constraints regulators face when investigating such fraud, an avenue for further research is identifying characteristics of the false positives (Dechow et al., 2011).
Multiple studies revealed that companies that commit fraud have previously engaged more aggressively in earnings management (Dechow et al., 1996; Marinakis, 2011; Perols & Lougee, 2011). As extensive earnings management eventually reverses or reduces manipulation flexibility, managers may resort to fraud to maintain appearances (Perols & Lougee, 2011). For this reason, earnings management is considered a precursor to accounting fraud (Mishra & Malhotra, 2016). Therefore, it is expected that companies identified as false positives by the M- and F-scores would display earnings management characteristics more in accordance with the manipulator sample. Hypothesis three is thus:

H3:
The false positive samples generated by the M- and F-scores display earnings management characteristics consistent with the manipulator sample.

M-score, F-score and model drift
The M- and F-scores were developed in the US using pre-2005 data. These models are static; the world, however, changes. Thus, using these models on more recent data in a different country may reveal model deterioration (Lu & Zhao, 2021). This is due to either concept drift (where the output characteristics change) or data drift (where the input characteristics change) (Ackerman et al., 2019; Webb et al., 2016).
Several studies have updated the M- and F-scores in different ways. First, some studies (such as Cecchini et al. (2010) and Marinakis (2011)) re-estimated the coefficients using US data from 1991 to 2003 and UK data from 1994 to 2007, respectively. Next, other studies (such as Hung et al. (2017) and Putra and Dinarjito (2021), who studied 614 Vietnamese observations from 2014 to 2016 and 81 Indonesian companies from 2012 to 2018, respectively) first identified variables within the scores which could differentiate between manipulators and non-manipulators. Variables that were unable to differentiate were omitted from the models before re-estimating the coefficients. The last group of studies (such as Chakrabarty et al. (2022), Hung et al. (2017), Lu and Zhao (2021) and Marinakis (2011)) added variables in an attempt to improve the models. While most of these studies do not report a direct comparison between the predictive ability of the original and revised models, Chakrabarty et al. (2022) noted that, for the estimation and holdout samples, the model's ability to correctly detect manipulators increases by 3.6% and 3%, respectively, after the inclusion of additional variables and re-estimation.
The following research hypothesis is, therefore, developed:

H4:
Updating the coefficients of the M- and F-scores will increase the ability of the two models to identify manipulators and decrease misclassification errors in SA.

Population, sample and data collection
The population for this study is all 330 non-financial companies listed on the main board of the Johannesburg Stock Exchange (JSE) from 2006 until 2018. Financial companies are excluded, because the M-score was developed on non-financial firms (Kukreja et al., 2020) and financial firms have different regulatory and other requirements which may influence the outcome of the calculations (Orazalin & Akhmetzhanov, 2018). The 2006 year represents the first available enforcement action by the Financial Sector Conduct Authority (FSCA). Ending the sample in 2018 allows regulators sufficient time to investigate suspected irregularities. Walker (2020) notes that the mean and median time between the fraud and the SEC issuing an enforcement action is four years in the US. Based on 1 243 SEC enforcement actions, Karpoff et al. (2017) found the median period from the violation until the first enforcement action was 2.41 years. Finally, Bao et al. (2020) allowed for a two-year gap. In SA, studies have used other measures, such as qualified audit opinions (Moepya, 2017), small losses (Pududu & De Villiers, 2016) and earnings distribution discontinuities (Rabin, 2016), rather than enforcement actions to proxy for financial statement manipulation. Consequently, there is a lack of data on how long the Financial Reporting Investigation Panel (FRIP) and FSCA take to issue an enforcement action or equivalent. Thus, this study allowed for a three-and-a-half-year gap for regulators to identify violations (2019 until mid-2022).
Based on the above, an unbalanced panel of 330 firms across 13 years, representing 2 775 firm-year observations, was arrived at. The financial data were collected from the Standard and Poor's Capital IQ and Bloomberg databases. The "as originally reported" data were used to avoid the risk of abnormalities being removed when the data were restated.
In arriving at the final sample, 26 firm-year observations in which the company listed after year-end but before the release of the annual report were removed. Further, 52 firm-years in which a company's year-end changed were removed together with the year immediately after the change in year-end (for a total of 104 firm-years). This was due to the length of the periods not being comparable. Next, five firm-years were removed because the financial statements were reported in a currency experiencing hyperinflation.
A total of 272 observations with missing data that prevented the calculation of either the M- or F-scores were removed from the sample. Only using observations for which both models can be calculated increases the power of the statistical tests (Price et al., 2011). Mongwe and Malan (2020) also found that 94% of studies surveyed on fraud detection either do not deal with missing data or simply delete the affected observation. While this approach results in data loss, it avoids imputing data that may not exist (Mongwe & Malan, 2020).
Finally, 25 companies with only one firm-year observation were removed from the sample. This process resulted in a final unbalanced panel of 274 companies representing 2 343 firm-years, summarised in Table 1 below.
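The sample attrition described above can be checked with simple arithmetic. The sketch below only restates the counts given in the text; the variable names are illustrative.

```python
# Firm-year counts taken from the text; labels are paraphrased.
initial_firm_years = 2775
exclusions = [
    ("listed after year-end but before release of the annual report", 26),
    ("change of year-end, including the following year", 104),
    ("reporting currency experiencing hyperinflation", 5),
    ("missing data for the M- or F-score", 272),
    ("companies with a single firm-year observation", 25),
]

remaining = initial_firm_years
for reason, removed in exclusions:
    remaining -= removed
    print(f"{reason}: -{removed} -> {remaining}")

final_firm_years = remaining
manipulated = 23
non_manipulated = final_firm_years - manipulated
manipulated_share = manipulated / final_firm_years
print(final_firm_years, non_manipulated, f"{manipulated_share:.2%}")
```

The steps reconcile to the 2 343 firm-years reported, of which 2 320 are non-manipulated and 23 (0.98%) are manipulated.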

Identifying earnings manipulators
In SA, a complete list of firms that have manipulated their earnings is not readily available. Further, unlike advanced economies, the oversight bodies are not considered sophisticated and do not examine IFRS compliance on a sufficiently regular basis (Rabin, 2016). As such, a list of instances when companies engaged in manipulation was compiled as described below.
Investigations by regulators (such as the SEC in the US) are the most common proxy for financial statement fraud (Mongwe & Malan, 2020). SA has two regulatory bodies that monitor listed company financial statements: the FSCA (formerly the Financial Services Board) and the FRIP (formerly the GAAP Monitoring Panel). The FSCA is responsible for regulation and supervision within the SA financial markets and addresses issues around market abuses, including prohibited trading practices, insider trading and false and misleading reporting. As this study focused on financial statement fraud, only those enforcement actions relating to section 76 of the Securities Services Act no. 36 of 2004 (pre-2013) and section 81 of the Financial Markets Act no. 19 of 2012 (post-2013) were used. FSCA enforcement actions were obtained from the FSCA website.
The JSE tasks the FRIP to investigate instances of non-compliance with IFRS. Unlike the FSCA, the FRIP does not publish a list of investigations and their outcomes. However, following the investigation, the JSE may instruct companies guilty of non-compliance to publish or reissue any necessary information and make a public announcement via the Securities Exchange News Service (SENS) (Watson & Rossouw, 2012). Following Watson and Rossouw (2012), the IRESS database was searched to identify SENS announcements which included the words "GAAP Monitoring Panel", "GMP", "Financial Reporting Investigation Panel" and "FRIP". Each FSCA enforcement action and FRIP restatement identified was then examined to determine whether it involved an IFRS violation and the year(s) to which that violation relates.
Finally, similar to Moepya (2017), companies that had a qualified audit opinion during the period were included in the manipulator sample. However, not all qualifications relate to fraud (Jones et al., 2008). Thus, unlike Moepya (2017), this study excluded the emphasis of matter opinions and qualifications that did not relate to IFRS violations (i.e. going concern issues). Thus, only qualifications related to IFRS violations and disclaimers of opinion, where the auditor cannot draw an opinion, formed part of the manipulator sample.
Thus, only companies found guilty of fraud or a violation by the FSCA or FRIP, or having received a qualified audit report due to fraud or violation, were included in the manipulator sample. All other non-financial companies listed on the JSE between 1 January 2006 and 31 December 2018 formed part of the non-manipulating sample (i.e. these companies had not been found guilty of fraud or a violation, nor had they received a relevant qualified audit opinion). Table 2 discloses a sample of 23 manipulated firm-year observations (9 unique companies) representing 0.98% of the total observations. This provides a smaller absolute number of manipulated observations compared to the original studies. Proportionally, however, this sample does compare favourably to the original studies, particularly that of Dechow et al. (2011).

[Table 2. Recoverable row labels: listed after year-end but before the release of AFS; change of year-end; other anomalous situations; missing data for M- or F-score. Source: Researchers' own construction, as well as data obtained from Beneish (1999) and Dechow et al. (2011).]

Calculation of the M- and F-scores
The M- and F-scores were estimated as described in Equations (1) and (3) above. As justified under sections 2.2.1 and 2.2.2, the original eight-variable M-score and Model 1 for the F-score were used. Following Beneish (1999) and Dechow et al. (2011), all variables used in calculating the M- and F-scores were winsorized at the first and ninety-ninth percentiles.
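Winsorizing at the first and ninety-ninth percentiles can be sketched as below. This is an illustration under assumptions: the percentile definition (linear interpolation between order statistics) is one of several in common use and may differ from the statistical package used in the study, and the sample values are hypothetical.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of an ascending list."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def winsorize(values, lower_q=1.0, upper_q=99.0):
    """Replace values beyond the chosen percentiles with the percentile values."""
    ordered = sorted(values)
    low = percentile(ordered, lower_q)
    high = percentile(ordered, upper_q)
    return [min(max(v, low), high) for v in values]


# Hypothetical variable with one extreme value in each tail.
sample = [float(v) for v in range(1, 100)] + [-500.0, 10_000.0]
clipped = winsorize(sample)
print(min(sample), max(sample))    # extremes before winsorizing
print(min(clipped), max(clipped))  # extremes pulled in to the percentile bounds
```

Unlike trimming, winsorizing keeps every observation, which matters when the manipulated observations are as scarce as they are in this sample.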
For the M-score, in the original study, Beneish (1999) used a balance sheet approach to determine total accruals (refer to TATA_BS in Appendix 1). However, more recent studies (such as Beneish et al. (2013) and Beneish and Vorst (2021)) have used an income statement approach to determine total accruals (refer to TATA_IS in Appendix 1). This change was driven by new disclosure requirements in financial reporting standards (Beneish et al., 2013). This study presents both methods separately, referred to as the M-score (BS) and M-score (IS). In addition, all three cut-off points (i.e. −1.49, −1.78 and −1.89) were used to predict whether an observation was manipulated.
For the F-score, Dechow et al. (2011) determined the UEM to be 0.0037 based on their sample of US companies. However, it is unclear in the literature whether the UEM should be updated for country-specific risk, particularly given SA's higher risk of economic crime (PricewaterhouseCoopers, 2020). As a result, this study used the original US UEM of 0.0037 and a recalculated UEM specific to the SA sample of 0.0098 (23/2343).
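The effect of the UEM choice can be illustrated with a short sketch. It assumes, following the construction in Dechow et al. (2011), that the F-score is the model's predicted probability divided by the UEM. The function name and the example probability are hypothetical and are not taken from the study's data.

```python
def f_score(predicted_probability: float, uem: float) -> float:
    """Scale a predicted misstatement probability by the unconditional expectation (UEM)."""
    if not 0.0 < uem < 1.0:
        raise ValueError("UEM must lie strictly between 0 and 1")
    return predicted_probability / uem


# UEM values discussed in the text: the original US value and the SA sample value.
UEM_US = 0.0037
UEM_SA = 23 / 2343  # manipulated observations / total observations

# Hypothetical predicted probability for a single firm-year (illustrative only).
p = 0.006

flag_us = f_score(p, UEM_US) > 1.0  # above-average risk under the US UEM
flag_sa = f_score(p, UEM_SA) > 1.0  # above-average risk under the SA UEM
```

Because the SA-specific UEM is larger, the same predicted probability produces a lower F-score, so fewer observations exceed a cut-off of one.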

Testing the detective power of the M- and F-scores
Following the estimation of the M- and F-scores, various classification performance metrics and the area under the receiver operating characteristic curve (AUC) were used to test the detective power of the models in SA. Mongwe and Malan (2020) identify the common classification performance metrics in the literature, including:

$$\text{Sensitivity} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}} \tag{8}$$

$$\text{Precision} = \frac{\text{True positive}}{\text{True positive} + \text{False positive}} \tag{10}$$

While classification accuracy was the most commonly used measure in the prior literature, it is not appropriate due to the scarcity of financial statement fraud cases (Mongwe & Malan, 2020). Instead, sensitivity and precision are superior in such situations (Moepya, 2017). For this study, the accuracy, sensitivity, precision and F-measure are presented for a clearer picture of classification performance.
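A minimal sketch of these metrics, computed from confusion-matrix counts, is given below. The function name is hypothetical. The F-measure is assumed here to be the harmonic mean of precision and sensitivity (the usual F1 definition), and ratios with an empty denominator are treated as zero by convention.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity, precision and F-measure from confusion-matrix counts."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # The harmonic mean is undefined when both components are zero.
    if precision + sensitivity > 0:
        f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f_measure = None
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "precision": precision,
        "f_measure": f_measure,
    }


# A naive rule that labels every observation as a non-manipulator,
# applied to the sample sizes reported in this study (23 and 2 320).
baseline = classification_metrics(tp=0, fp=0, tn=2320, fn=23)
```

The naive rule reaches an accuracy of about 99% while detecting no manipulators, which shows why accuracy alone is uninformative for such an imbalanced sample.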
The final measure of model performance is the AUC. This measure provides a single statistic based on the sensitivity and specificity of the model. A higher AUC statistic represents better model performance, with an AUC of one representing perfect prediction and an AUC of 0.5 representing a random guess.
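One way to see what the AUC measures is through its equivalence with the Mann-Whitney statistic: it is the probability that a randomly chosen manipulator receives a higher score than a randomly chosen non-manipulator, with ties counted as one half. The sketch below uses that pairwise definition. The function name and the scores are hypothetical.

```python
def auc(scores_manipulators, scores_non_manipulators):
    """AUC as the share of (manipulator, non-manipulator) pairs ranked correctly."""
    wins = 0.0
    for m in scores_manipulators:
        for n in scores_non_manipulators:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(scores_manipulators) * len(scores_non_manipulators))


# Hypothetical scores (illustrative only).
perfect = auc([3.0, 4.0], [1.0, 2.0])   # every manipulator outranks every non-manipulator
random_like = auc([1.0, 2.0], [1.0, 2.0])  # identical score distributions
```

The pairwise loop is quadratic in the sample size, which is acceptable for a sample of this scale but would be replaced by a rank-based computation for larger data.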

Investigating the earnings management characteristics of false positives
This research focused on accruals-based earnings management (AEM) and companies that meet or just beat prior-year earnings to investigate the earnings management characteristics of false-positive observations. AEM was measured using the cross-sectional modified Jones model. This model is considered one of the most powerful accruals-based models and is widely used throughout the earnings management literature (Mishra & Malhotra, 2016; Rabin, 2016). The model was estimated as set out in Equation (12), where $NDA_{i,t}$ represents the estimated non-discretionary accruals for company i in year t, $A_{i,t-1}$ represents total assets for company i in year t-1, $\Delta REV_{i,t}$ represents the change in revenue for company i between years t and t-1, $\Delta REC_{i,t}$ is the change in net receivables for company i between years t and t-1, and $PPE_{i,t}$ is the gross property, plant and equipment for company i in year t (Dechow et al., 1995). The residual from Equation (12) represents the discretionary accrual element. A Wilcoxon rank sum test was used to identify any statistically significant difference in the means of the discretionary accruals between the manipulator and non-manipulator samples.
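For reference, a commonly used cross-sectional statement of the modified Jones model is given below. It is a standard form from the literature and is assumed, not confirmed, to correspond to Equation (12). The symbols $TA_{i,t}$ (total accruals), $\alpha_1$ to $\alpha_3$ (coefficients estimated per industry-year) and $\varepsilon_{i,t}$ (the residual) are introduced here for illustration and are not defined in the text above.

```latex
\frac{TA_{i,t}}{A_{i,t-1}}
  = \alpha_1 \frac{1}{A_{i,t-1}}
  + \alpha_2 \frac{\Delta REV_{i,t} - \Delta REC_{i,t}}{A_{i,t-1}}
  + \alpha_3 \frac{PPE_{i,t}}{A_{i,t-1}}
  + \varepsilon_{i,t}
```

Under this form, the fitted value is the non-discretionary accrual estimate (scaled by lagged total assets) and the residual $\varepsilon_{i,t}$ is the discretionary accrual.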
Earnings per share (EPS) was used to identify companies that meet or just beat the prior year's earnings, defined as the change in EPS falling between zero and a small positive number. For robustness, three measures of a small positive number were used; namely, a one, two or three cents change in EPS (Lo et al., 2017). A Pearson Chi-squared test was used to identify any statistically significant difference between the two samples' proportions of meet or just beat prior year EPS.
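The threshold definition and the test of proportions can be sketched as follows. The function names and the example counts are hypothetical. The bounds are assumed to be inclusive, EPS is held in whole cents to avoid rounding issues, and the statistic is the uncorrected Pearson Chi-squared for a 2x2 table (whether a continuity correction was applied in the study is not stated).

```python
def meets_or_just_beats(eps_current_cents: int, eps_prior_cents: int, threshold_cents: int) -> bool:
    """True when the change in EPS lies between zero and the small positive threshold."""
    change = eps_current_cents - eps_prior_cents
    return 0 <= change <= threshold_cents


def pearson_chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Uncorrected Pearson Chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts: rows are two samples, columns are
# "meets or just beats" versus "does not".
statistic = pearson_chi2_2x2(20, 0, 0, 20)

# With one degree of freedom, the 5% critical value is approximately 3.841.
significant_at_5_percent = statistic > 3.841
```

Running the flag with thresholds of 1, 2 and 3 cents reproduces the three robustness definitions described above.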

Re-estimating the coefficients of the M- and F-Scores for the SA context
The coefficients of the M- and F-scores were re-estimated by applying the same variables and methodologies (i.e. probit and logit estimation, respectively) originally used by Beneish (1999) and Dechow et al. (2011) to the current SA data. To determine the appropriate cut-offs for the M-score, following Beneish (1999), the ECM was minimised at the cost-error ratios of 10:1, 20:1, 30:1 and 40:1. The ECM was calculated as:

$$ECM = P(M) \, P_{I} \, C_{I} + (1 - P(M)) \, P_{II} \, C_{II}$$

where $P(M)$ represents the prior probability of encountering earnings manipulators (calculated as 0.0098), $P_{I}$ and $P_{II}$ represent the probability of type I and type II errors, respectively, and $C_{I}$ and $C_{II}$ represent the relative costs of type I and type II errors, respectively (Beneish, 1999). For the F-score, the UEM of 0.0037 and 0.0098 were used with the cut-off of one representing above-average risk observations.
Classification performance and the AUC were used to compare the detective power of the original models with that of the re-estimated models. For the classification performance metrics, the number of manipulator companies was considered too small to keep a holdout sample.
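The cut-off search can be sketched as below. This is an illustration under stated assumptions: the ECM is taken to be P(M)·P_I·C_I + (1 − P(M))·P_II·C_II as in Beneish (1999), a type I error is a manipulator classified as a non-manipulator, an observation is flagged when its score exceeds the cut-off, and the candidate cut-offs are the observed scores. The function names and the scores are hypothetical.

```python
def ecm(p_m, p_type1, p_type2, cost_type1, cost_type2):
    """Expected cost of misclassification for given error rates and relative costs."""
    return p_m * p_type1 * cost_type1 + (1 - p_m) * p_type2 * cost_type2


def best_cutoff(scores_manip, scores_non, p_m, cost_ratio):
    """Return (cut-off, ECM) minimising the ECM; cost_ratio is C_I relative to C_II = 1."""
    best = None
    for cutoff in sorted(set(scores_manip) | set(scores_non)):
        # Type I error rate: manipulators that are NOT flagged (score <= cut-off).
        p1 = sum(s <= cutoff for s in scores_manip) / len(scores_manip)
        # Type II error rate: non-manipulators that ARE flagged (score > cut-off).
        p2 = sum(s > cutoff for s in scores_non) / len(scores_non)
        value = ecm(p_m, p1, p2, cost_ratio, 1.0)
        if best is None or value < best[1]:
            best = (cutoff, value)
    return best


# Hypothetical scores (illustrative only).
manip = [-1.0, -1.5, -2.5]
non_manip = [-3.0, -2.9, -2.8, -2.6, -2.0]

cutoff_20, _ = best_cutoff(manip, non_manip, p_m=0.0098, cost_ratio=20)
cutoff_100, _ = best_cutoff(manip, non_manip, p_m=0.0098, cost_ratio=100)
```

In this toy sample, raising the relative cost of missing a manipulator moves the optimal cut-off lower, so more observations are flagged. Different cost ratios can also share the same optimum when no candidate cut-off offers a better trade-off.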
When determining the AUC, k-fold cross-validation with ten folds was used. Determining the out-of-sample prediction error is essential to avoid hindsight bias when developing predictive models. K-fold cross-validation is considered superior to bootstrapping procedures, which overlap the training and test samples. This overlap underestimates the prediction error (Witten et al., 2011). Following Larcker and Zakolyukina (2012) and Moepya (2017), ten folds were used. The AUCs of the ten iterations were then averaged to determine the overall AUC.
Table 3 below presents descriptive statistics on the breakdown of manipulated and non-manipulated observations across industries. Although basic materials, consumer services and industrials are the three largest sectors in the SA economy, they only account for a combined total of five (21.74%) manipulated observations. Instead, consumer goods, a medium-sized sector, accounts for sixteen (69.57%) of the manipulated observations. This is due to the companies involved in SA's recent major frauds (i.e. Steinhoff and Tongaat-Hulett) being classified in this sector. SA's three smallest sectors (healthcare, oil and gas, and telecommunications) have no manipulated observations.⁶
Table 3 also presents the average size and return on assets for the manipulated and non-manipulated samples. On average, manipulated observations tend to be smaller and show a lower return on assets. This lower average performance may have provided an incentive for the companies to engage in manipulative practices.

Distribution of variables underlying the M- and F-scores
Table 4 Panel A presents the distribution of the variables underlying the M-score for manipulators and non-manipulators for the current sample compared to those obtained by Beneish (1999). Unlike in the original study, where a significant difference in mean between manipulators and non-manipulators was found for five of the eight variables, in the current sample, a significant difference was only found for one variable (TATA_BS). This finding is also contrary to Marinakis (2011) and Tarjo and Herawati (2015), who found that four of the eight variables could be used to detect manipulation.
Similarly, Panel B shows the distribution of variables underlying the F-score and the comparison to the original study by Dechow et al. (2011). Unlike the original study, for which six of the seven variables showed a significant difference between manipulators and non-manipulators, only the AISS variable showed a significant difference in the current sample. This finding is contrary to Bertomeu et al. (2021), who found that the variables included in the F-score are influential in detecting manipulation. However, it does align with Deniswara et al. (2022), Hung et al. (2017) and Putra and Dinarjito (2021), who found that the variables underlying the F-score had limited, if any, ability to distinguish between manipulating and non-manipulating companies in Indonesia.
The lack of statistically significant differences in the distributions of the underlying variables indicates that these variables appear unable to differentiate between manipulating and non-manipulating firms in the current SA sample.

Detective power of the M- and F-scores
The classification performance of the M- and F-scores in SA at various cut-offs and UEMs is summarised in Table 5. The accuracy (i.e. correct classification of both manipulators and non-manipulators) of the M-scores across all cut-offs is high, comparable to the original study. This high accuracy is also in accordance with studies by Aghghaleh et al. (2016), Beneish and Vorst (2021) and Tarjo and Herawati (2015), who report accuracies of 73.17%, 82.59% and 78.57%, respectively.⁷ For the F-score, the SA-specific UEM of 0.0098 yields the highest accuracy of all the models. However, the original UEM of 0.0037 produces the lowest accuracy of all the models, which is reasonably in line with the original study results as well as subsequent results of Aghghaleh et al. (2016), Beneish and Vorst (2021) and Chakrabarty et al. (2022), who report accuracy levels of 76.22%, 60.71% and 57.60%, respectively.⁷ However, the high accuracy across all models benefits from the imbalance between manipulators and non-manipulators. As such, it is primarily driven by the correct classification of the non-manipulator sample (true negatives).
For sensitivity, which measures the scores' ability to classify manipulating firms correctly, the M-score performs poorly: the best variation of the score can identify only 26.09% of manipulators. At all cut-off levels, the results of the current study are substantially worse than the original study, as well as studies by Aghghaleh et al. (2016), Beneish et al. (2013) and Tarjo and Herawati (2015), who reported sensitivity of 69.51%, 71.00% and 77.10%, respectively. However, these results align with Beneish and Vorst (2021) and Lu and Zhao (2021), who found that the M-score could only correctly predict 23.18% and 29.63% of manipulators, respectively. For the F-score, sensitivity is also low, with the UEM of 0.0037 achieving the highest sensitivity of 52.17%, which is worse than the original study. The performance of the F-score is also worse than subsequent studies by Aghghaleh et al. (2016), Beneish and Vorst (2021) and Chakrabarty et al. (2022) (for the in-sample test), who reported sensitivities of 73.17%, 64.71% and 68.50%, respectively. In their out-of-sample test, however, Chakrabarty et al. (2022) reported a sensitivity of 54.61%, which is more in line with the current study.
In terms of precision (i.e. the ability to classify only true manipulators as manipulators) and the F-measure (a metric which combines sensitivity and precision), the M-score's performance in the SA sample is poor compared to the original study, as well as to Aghghaleh et al. (2016), who achieved 75.00% and 72.15%,⁷ respectively, and Tarjo and Herawati (2015), who achieved 79.41% and 78.26%,⁷ respectively. However, the M-score's precision and F-measure are similar to the results achieved by Beneish and Vorst (2021) of 0.76% and 1.48%,⁷ respectively. Surprisingly, the F-score (UEM = 0.0037) achieves higher precision and F-measure than the original study, despite being worse at correctly classifying manipulators. Further, the precision and F-measure of the F-score (UEM = 0.0037) are in line with other studies by Beneish and Vorst (2021) and Chakrabarty et al. (2022), who report a precision of 0.92% and 1.13%,⁷ respectively, and an F-measure of 1.81% and 2.22%,⁷ respectively.
Considering the performance across scores, the M-score (BS) outperforms the M-score (IS) across all metrics for equivalent cut-offs (except for sensitivity and the type I error at the highest cut-off of −1.49, which are equal). By comparison, the F-score (UEM = 0.0098) has the highest accuracy across all scores, while the F-score (UEM = 0.0037) has the lowest accuracy. Despite this low accuracy, the F-score (UEM = 0.0037) is the best-performing score in terms of sensitivity. In addition, it is only outperformed in terms of precision and the F-measure by the M-score (BS) at the lowest cut-off point (−1.89).
Notes to Table 5: … Beneish (1999) for the M-score and Dechow et al. (2011) for the F-score. No original performance measures were presented for the M-score based on the income statement approach or for the F-score based on the UEM of 0.0098, as these were not included in the original study.
2 These performance measures were not presented in the original studies by Beneish (1999) and Dechow et al. (2011), but they have been recalculated based on the data presented in these studies.
3 The results for the area under the receiver operating characteristic curve were based on the underlying M- or F-score rather than a specific cut-off point or unconditional expectation of misstatement. The original studies did not include a calculation of the AUC, nor was it possible to recalculate based on the data presented in the studies.
4 Due to both precision and sensitivity being equal to zero, it was impossible to compute the F-measure.
The AUC reflects that both the M-score (BS) and F-score outperform a random guess, while the M-score (IS) does not. The AUC for the M-score (BS) of 0.5936 is substantially below Price et al. (2011), who report an AUC of 0.7324, but more in line with the AUC of 0.5770 reported by Beneish and Vorst (2021). The AUC for the F-score of 0.6067 is below that achieved in studies by Beneish and Vorst (2021), Chakrabarty et al. (2022), Price et al. (2011) and Walker (2020) of 0.6730, 0.6670, 0.7238 and 0.6600 respectively. While the F-score slightly outperforms the M-score based on this metric, Price et al. (2011) caution against such an interpretation as the AUC does not distinguish well between two "good" models.
Despite the high overall accuracy of the models, their ability to correctly predict manipulators is low, as shown by the poor sensitivity, precision and type I error metrics. Based on this, hypothesis 1, that the M- and F-scores can detect manipulation in SA, is not supported. Further, while the F-score does outperform the M-score on some metrics, it underperforms on other metrics, depending on the cut-off points used. Thus, there is insufficient evidence to support hypothesis 2, that the F-score outperforms the M-score in the SA context.

Earnings management characteristics of the false positives
Given the inability of the M- and F-scores to identify manipulators in SA, it is helpful to consider the earnings management characteristics of the false positives to understand better what the models are identifying. Table 6 summarises these results. Panel A compares the false positives to the manipulator sample, whereas Panel B compares the false positives to the true negatives.
For the M-score (BS), the false positive samples do not display levels of discretionary and absolute discretionary accruals similar to those of the manipulator sample. Rather, at all three cut-offs the false positives display higher levels of discretionary and absolute discretionary accruals. The F-score (UEM = 0.0098) shows similar results. However, for the M-score (IS) and the F-score (UEM = 0.0037), there is no statistically significant difference between the discretionary accruals and absolute discretionary accruals for the false positive and manipulator samples. For all scores, the discretionary and absolute discretionary accruals of the false positive samples are significantly different from the true negative samples. This indicates that the false positive samples have similar or higher levels of AEM compared to the manipulator sample. It also shows that the false positive samples do not share the same level of AEM as the true negative sample.
For all scores, the false positive samples do not display a significantly different proportion of observations that meet or just beat prior year EPS at any level (1, 2 or 3 cents) compared to the manipulators. However, for the M-score models, the false positive samples reveal a higher proportion of observations just beating the prior year's EPS by 1 cent compared to the true negative samples. At the 2 and 3-cent levels, there is no difference between the manipulators, true negatives or false positives for the M-score. For the F-score (UEM = 0.0037), there is a lower proportion of false positives, which just beat the prior year's EPS by 2 and 3 cents compared to the true negatives.
Thus, the evidence presented indicates that the false positives, as determined by the M-score (IS) and F-score (UEM = 0.0037), share similar AEM characteristics with the manipulators, while the false positives, as determined by the M-score (BS) and F-score (UEM = 0.0098), show higher levels of AEM compared to manipulators. When considering earnings thresholds, the M-score (both BS and IS) false positives display similar proportions of meeting or just beating prior year EPS by 1 cent to the manipulators. For the F-score, the proportion of false positives that meet or just beat prior year EPS by 1 cent does not differ from that of either the manipulators or the true negatives. As a result, hypothesis 3 is partially supported as the false positive samples appear to have similar or higher levels of AEM than the manipulator sample and share similar meet, or just beat, EPS characteristics, but only at the 1 cent level for the M-score.

Re-estimating the coefficients of the M-Score and F-Score
Due to the poor performance of the original M- and F-scores in detecting manipulation in SA, the models were re-estimated to determine the coefficients that apply in SA, as shown in Table 7. Except for the constant terms and the AISS term in the F-score, none of the variable coefficients were statistically significant. This closely mirrors Table 4, where only TATA_BS and AISS showed a significant difference between manipulators and non-manipulators. These results again revealed the inability of the underlying variables to distinguish between manipulators and non-manipulators in SA.
It should also be noted that the models as a whole display little explanatory power. All revised models have insignificant LR Chi² statistics and low pseudo R² statistics. This contradicts Marinakis (2011), who used UK data to report a revised M-score model with a statistically significant Chi² at the 1% level and a pseudo R² of 0.318.
Following the estimation of the models in Table 7, the M-score cut-offs that minimised the ECM were determined at relative costs of 10:1, 20:1, 30:1 and 40:1. Like Beneish (1999), the ECM at the relative costs of 20:1 and 30:1 were the same for both the balance sheet and income statement versions, resulting in the same cut-off. These cut-offs were determined as −1.7910 (10:1), −1.9653 (20:1) and −2.0407 (40:1) for the M-score (BS) and −1.4539 (10:1), −1.9735 (20:1) and −2.1641 (40:1) for the M-score (IS).
Table 8 presents the classification performance for the revised M- and F-score models based on the estimation sample. Comparing the re-estimated models' classification performance to the original models' performance produced mixed results. For the M-score, the revised models performed better than the original models in this sample for accuracy, precision, the F-measure and the type II error. In contrast, the original models performed better in terms of sensitivity and the type I error. Thus, re-estimating the M-score coefficients reduced sensitivity but improved precision. By comparison, for the F-score, the revised scores performed better than the original scores for sensitivity and the type I error. In contrast, the original scores were superior in terms of accuracy and the type II error. The precision and F-measure were comparable and produced mixed results depending on the selected UEM. Thus, re-estimating the F-score improved sensitivity but at the cost of a higher type II error. This trade-off between sensitivity and precision in fraud detection models is also identified by Beneish and Vorst (2021). It should be noted, however, that this comparison for the re-estimated models is based on the estimation sample and may suffer from hindsight bias. Further out-of-sample testing is required to validate these findings, but could not be performed on these data due to the small sample of manipulators.
The AUC results were more robust as they were based on k-fold cross-validation using ten folds. Here, the revised models were consistently outperformed by the original models.
Unfortunately, comparable studies such as Cecchini et al. (2010) and Marinakis (2011), who also re-estimated the coefficients of the M- and F-scores, did not provide comparative results between the original and the re-estimated coefficients. However, studies which added variables to the models before re-estimation have shown improved performance across all metrics. For the M-score, Marinakis's (2011) revised model outperformed his re-estimated model for accuracy, sensitivity and precision in both the estimation and holdout samples for all relative cost levels. Likewise, Chakrabarty et al.'s (2022) revised F-score outperformed the original F-score based on the same metrics as well as the AUC, which increased from 0.6670 to 0.7271.
Given the mixed results presented above, hypothesis 4 is partially accepted. The re-estimated M-score failed to improve the identification of manipulators but did reduce misclassification errors. On the other hand, the re-estimated F-score improved the identification of manipulators but failed to reduce misclassification errors.

Additional tests
Following Beneish (1999), the manipulator sample was matched to the non-manipulators based on industry and year as an additional test. Regarding the classification performance of the original M- and F-scores, the scores' accuracy, precision and F-measure were marginally superior compared to the unmatched results. The matched AUC was also marginally better than the unmatched AUC. The sensitivity, however, remained unchanged. Regarding the earnings management characteristics of the false positives, the results using the matched data revealed no differences compared to using the unmatched data. Finally, regarding the re-estimated M- and F-scores, the comparative performance of the matched data for the M-score was mixed. The matched data results were marginally worse than the unmatched data for accuracy. Precision metrics were generally slightly superior for the matched data, but this depended on the relative cost ratio. Sensitivity was unchanged across the matched and unmatched data. Overall, the conclusions drawn remained unchanged, given the additional testing based on the matched data.
Further, Kukreja et al. (2020) argue that the M-score cannot detect every type of misstatement. The same may be true of the F-score. Consequently, the classification sensitivity⁸ was recalculated based on the separate categories of manipulators (i.e. FSCA enforcement action, FRIP restatement and relevant qualified audit opinion) based on the original versions of the M- and F-scores. The results are presented in Table 9 below.
Regarding FSCA enforcement actions, the scores performed worse for this category. The M-score (BS and IS) could only correctly classify 7.14% of such actions when the broader cut-offs of −1.78 and −1.89 were selected. The narrower cut-off of −1.49 was unable to classify any enforcement action correctly. Likewise, the F-score (UEM = 0.0098) could also not correctly classify any enforcement action. However, the F-score (UEM = 0.0037) identified 50% of such enforcement actions. The scores appeared to perform better with regard to classification sensitivity for FRIP restatements and qualified audit opinions. The M-score (BS) correctly classified 60% of FRIP restatements when using the more lenient −1.89 cut-off. However, the cut-off of −1.49 failed to classify any FRIP restatement correctly. The M-score (IS) performed worse than the M-score (BS) at the broadest cut-off of −1.89 by only correctly identifying 40% of FRIP restatements. However, it performed better at the more stringent −1.49 cut-off as it identified 20% of FRIP restatements. Again, the F-score (UEM = 0.0098) could not identify any FRIP restatements, while the F-score (UEM = 0.0037) performed the best of all the scores and correctly classified 80% of the FRIP restatements. Finally, regarding the qualified audit opinions, the M-score (BS and IS) performed moderately at the broadest cut-off of −1.89, identifying 50% of qualified opinions. At the most stringent cut-off (−1.49), the M-score (BS) outperformed the M-score (IS) but was still only able to identify 25% of qualified opinions. The F-score performed worst of the scores in correctly classifying qualified opinions, only correctly identifying 25% when using the UEM of 0.0037.
These results show that the F-score (UEM = 0.0037) outperformed both M-score models for FSCA enforcement actions and FRIP restatements. However, the M-score outperformed the F-score when identifying qualified audit opinions. Caution, however, should be applied when relying on this set of additional results. Firstly, the sensitivity was based on very few observations, particularly for FRIP restatements and qualified audit opinions. Secondly, only the classification sensitivity is provided. As the false positives would have changed only slightly, the scores would continue to perform poorly in terms of precision, the F-measure and the type I error.
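The small denominators behind these percentages can be made explicit with a short reconciliation. The category sizes below (14, 5 and 4) are not stated in the text. They are inferred from the reported percentages (7.14% is approximately 1/14, the FRIP figures are multiples of 20%, the qualified-opinion figures are multiples of 25%) and from the total of 23 manipulated observations, assuming the three categories do not overlap.

```python
# Inferred category sizes (an assumption, see the lead-in): 14 + 5 + 4 = 23.
sizes = {"FSCA": 14, "FRIP": 5, "Qualified": 4}

# Sensitivities (%) by category as reported in the text.
reported = {
    "M-score (BS), cut-off -1.89": {"FSCA": 7.14, "FRIP": 60.0, "Qualified": 50.0},
    "F-score (UEM = 0.0037)": {"FSCA": 50.0, "FRIP": 80.0, "Qualified": 25.0},
}


def implied_overall(sensitivity_by_category):
    """Convert category percentages to counts and return (hits, overall sensitivity in %)."""
    hits = sum(round(sensitivity_by_category[c] / 100 * sizes[c]) for c in sizes)
    return hits, round(100 * hits / sum(sizes.values()), 2)


overall = {name: implied_overall(values) for name, values in reported.items()}
```

Under these inferred sizes, the category results add up to the overall sensitivities of 26.09% and 52.17% reported earlier, and a single additional correct classification moves the FRIP sensitivity by 20 percentage points and the qualified-opinion sensitivity by 25.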

Summary of results
This study tested four hypotheses. The first hypothesis, of whether the M- and F-scores could detect financial statement manipulation in SA, was not supported. The second hypothesis was that, based on the findings of prior studies, the F-score would outperform the M-score. Given the inability of both models to successfully detect manipulation in SA, this hypothesis was also not supported. Partial support was found for the third hypothesis, which expected the false positive sample to share similar earnings management characteristics with the manipulator sample. Here, the study found that the false positives tended to have similar or higher levels of discretionary accruals in comparison to the manipulators. Finally, the fourth hypothesis expected that updating the coefficients of the M- and F-scores would improve the models' ability to identify manipulators in SA. This hypothesis was partially supported as, for the M-score, misclassifications were reduced, although the ability to identify manipulators worsened. The opposite occurred with the F-score as correctly identifying manipulators improved, but misclassifications increased substantially. In summary, the findings failed to support hypotheses 1 and 2, while partial support was found for hypotheses 3 and 4.

Discussion
The classification performance and AUC results reveal that both the M- and F-scores appear ineffective in accurately identifying cases of manipulation in the SA context. This is consistent with more recent studies such as Beneish and Vorst (2021), Comporek (2020) and Lu and Zhao (2021), who also found limited ability of the models to detect fraud.
One possible explanation is that the models are inappropriate in the SA context. This could be a result of the underlying variables being unable to distinguish between manipulators and non-manipulators, as seen in section 4.2. This explanation is consistent with Lu and Zhao's (2021) argument that the M-score does not work in the Chinese context, given the period it was developed and the different reporting contexts. Such an argument also applies to SA as the period under consideration is predominantly post the 2008 financial crisis. Also, SA is an emerging market and uses IFRS rather than US GAAP. Further supporting the above explanation are the earnings management characteristics. The false positive sample has either similar or higher levels of AEM than the manipulator sample. In addition, the false positive sample displays higher levels of AEM than the true negative sample. This presents evidence that the M- and F-scores may identify firms with high AEM levels. However, Enomoto et al. (2015) claim that SA companies are less likely to manage earnings through AEM and more likely to manage them through real earnings management. Also, the proportion of the false positive sample that meets or just beats the prior year's EPS differs from that of the true negative sample. However, Pududu and De Villiers (2016) contend that SA may focus on thresholds other than earnings. Thus, models that distinguish between manipulators and non-manipulators based on AEM and earnings thresholds may be inappropriate in SA.
The final support for the M- and F-scores being inappropriate in SA is that the models cannot identify the type of manipulation that occurs in SA. Kukreja et al. (2020) note that different models have different limitations. In particular, the M-score is unable to detect every form of manipulation. This is evident in the SA context from the additional tests where the models show different abilities to detect FSCA enforcement actions, FRIP restatements and qualified opinions. In particular, the M-score appears to struggle with FSCA enforcement actions, while the F-score has the worst performance for qualified audit opinions.
An alternative explanation could be that SA regulators are unable to identify manipulators. In SA, 59% of companies experiencing fraud do not report the fraud to the board, 66% do not report fraud to an appropriate regulator, and 72% do not report to the external auditor (PricewaterhouseCoopers, 2020). This culture of not reporting fraud, together with regulators lacking appropriate resources, lower legal enforcement associated with emerging economies and SA investors being unable to detect earnings management (Rabin, 2016), makes it difficult for regulators to investigate fraudulent activities and take appropriate action. As a result, the models may identify valid manipulators that regulators have not yet identified. Both explanations provide reasons why re-estimating the coefficients of the original models would be insufficient to improve the ability of the models to detect manipulation without substantially increasing the extent of false positives.

Conclusion
This study investigated the ability of two popular fraud detection models (the Beneish (1999) M-score and the Dechow et al. (2011) F-score) to correctly identify manipulating companies in the SA context. Based on a sample of 23 manipulators and 2 320 non-manipulators from 2006 to 2018, the study found that both models showed limited ability to classify manipulators correctly. Further investigation into the earnings management characteristics of the false positive sample revealed that the models might be categorising companies based on AEM and earnings thresholds. While extensive earnings management is associated with financial statement fraud, it is not a definite indication that such fraud is occurring. Finally, updating the coefficients of the two models improved some aspects of detection, but at the cost of others. For example, re-estimating the M-score coefficients generally improved precision but at the expense of sensitivity. Conversely, re-estimating the F-score improved sensitivity but at the cost of an increased type II error. These results indicate that either the models are not appropriate in the SA context or that SA regulators cannot identify manipulators due to a lack of reporting of fraudulent activities, a lack of resources and weak legal enforcement.
This study makes several contributions. First, the study investigates the ability of two popular models in fraud detection to identify manipulators in the SA context accurately. The results indicate that stakeholders should apply caution in using such models to predict fraudulent financial reporting, given their inability to accurately classify manipulators without generating many false positives. Additionally, regulators should allocate more resources to identify and combat fraudulent financial reporting. Second, the study provides a caution to other academics. Researchers need to report on a wide range of performance metrics so that users understand what the model does well compared to what it does not do well. In addition, academics, particularly in African contexts, are cautioned against indiscriminately using these models as proxies for fraud risk without extensively testing them in the local context. Third, the study contributes to the academic literature by investigating the earnings management characteristics of false positives generated by the models, showing that the models tend to differentiate companies with high levels of earnings management rather than companies which commit fraud. Finally, the study contributes to the development of fraud detection models by showing that re-estimating the model coefficients is likely insufficient to improve the models' performance, particularly if the underlying variables appear incapable of distinguishing between manipulators and non-manipulators. Instead, the focus should be placed on incorporating new variables that better distinguish between manipulators and non-manipulators, especially as the global economy changes and new reporting conventions and standards are developed.
This research has some limitations that provide avenues for future research. The study investigated the fraud detection ability of only two popular models (which only incorporate information directly obtainable from the financial statements) in an SA-specific context. Consequently, the results may not be generalisable to other countries, even in Africa. Future researchers should test the models' performance in their country's context and employ more sophisticated models (which include non-financial information such as the modified M-score by Lu and Zhao (2021) and models 2 and 3 of the F-score) and compare their performance. A second limitation is that this study considered only AEM and earnings thresholds when investigating the earnings management characteristics of the false positives. Given that companies may use different types of earnings management to achieve the same goals, future studies may consider investigating the real earnings management characteristics of the false positives as well as identifying other thresholds that may be more applicable in SA. A third limitation of this study is that it only updated the original model coefficients for SA. The study did not attempt to add additional explanatory variables or remove insignificant variables from the models. Subsequent studies should attempt to identify new variables that are superior in discriminating between manipulators and non-manipulators and include such variables when revising such models.