A Multi-industry Default Prediction Model using Logistic Regression and Decision Tree

The accurate prediction of corporate bankruptcy for firms in different industries is of great concern to investors and creditors, because it can reduce creditors' risk and yield considerable savings for the economy. Financial statements vary between industries. Economic intuition therefore suggests that industry effects should be an important component of bankruptcy prediction. This study attempts to detail the characteristics of each industry using sector indicators. The results show a significant relationship between the probability of default and sector indicators. The results of this study may improve the performance of default prediction models and reduce the costs of risk management.


INTRODUCTION
Prediction of corporate default is of great concern to investors, creditors, borrowing firms and governments. During the last two decades, the world has experienced a large number of financial crises, including those in the emerging market economies of Latin America and Asia during 1994-1998 and the crisis in the USA caused by sub-prime mortgages in 2008. These financial crises were not confined to individual economies, but directly or indirectly affected almost all the countries of the world. As a result, many voices have called for an overhaul of existing default warning systems so that default problems can be detected or prevented in real time. Any organization may face default because of the competition and uncertainty that are increasingly observed in the business environment. An improvement in the accuracy of default likelihood assessment leads to enormous future savings for the credit industry. Accurate default prediction therefore brings benefits such as lower credit analysis costs, a higher debt collection rate and better monitoring.
A review of the literature on the subject confirms that a handful of studies have been conducted in the last four decades. Despite these studies, the recent credit crisis indicated that there are still areas that need researchers' attention. Moreover, the emergence of regulatory changes such as the Basel III accord and the need for more precise and comprehensive risk management procedures justify further research in the area of credit risk modelling and banking supervision. Requirements like these push companies, especially banks and insurance companies, to have a very robust and transparent risk management system.
Since the study of Fitz (1932), default prediction has been a challenging issue in corporate finance. A large number of default prediction models have been developed, owing to the growing availability of data and the improvement of econometric methods during the 1980s and 1990s. Most of this research has been strongly influenced by a small number of early studies of US companies (Beaver, 1966; Altman, 1968; Ohlson, 1980; Zavgren, 1985). Earlier, most studies on default risk in the United States focused on firm-specific indicators as predictors of firm default (Courtis, 1978; Deakin, 1972; Jackendoff, 1962; Merwin, 1942; Meyer and Pifer, 1970; Smith and Winakor, 1935). Although the majority of studies used firm-specific variables, some researchers tried other indicators that affect default prediction, such as the interest rate, stock index return and GDP. Because of the relationship between general economic conditions and bankruptcy rates, some attempts have been made to predict default based on macroeconomic variables.
Over the years, a large strand of research on default prediction has remained restricted to firm-level factors. Surveys of the literature show that not much attention has been paid to industry effects. Yet there are reasons why industry effects matter for default prediction. It is plausible that the probability of default differs for firms in different industries because levels of competition differ across industries. Different industries may also have different accounting principles, implying that the probability of default can vary for firms in different industries with otherwise identical balance sheets. Despite the importance of external environmental factors, little attention has so far been paid to sector and industry factors. Recent developments in the default prediction literature have highlighted the importance of industry effects. In this regard, Lang and Stulz (1992) argued that sectors have a distinctive nature and need to be explored intensively. Accordingly, related research stresses the need to examine the effect of industry behaviour on firm default and supports the importance of industry effects on default (Opler and Titman, 1994).
These studies on default prediction employ dummy variables to control for the industry effect on default. We attempt to detail the characteristics of each industry, following the approach of Kayo and Kimura (2011), who identify the industry characteristics that influence leverage. According to Gianneth (2003), firms in sectors with highly volatile returns are more likely to default because of temporary illiquidity; longer-term debt can help reduce inefficiencies in these sectors. In order to capture a more realistic effect of sector or industry on default prediction, this study employs the munificence, dynamism and firm concentration of an industry.
This study contributes to the default prediction literature by outlining a procedure that banks can use to assess the likelihood of borrower default. Rather than focusing only on financial measures, which may be backward looking, this study investigates three industry factors, namely munificence, dynamism and HHIndex, as part of a mechanism for identifying potentially distressed firms. The study thus intends to fill this gap by comparing different methods, namely logistic regression and decision trees.

Variable selection:
In default prediction, researchers' main concern is to construct a prediction model that characterizes the association between default and financial ratios, and then to deploy the model to identify firms at high risk of default in the future. A large number of characteristics are usually incorporated, so the training data are not enough to cover the decision space, a problem known as the curse of dimensionality. Feature selection addresses this problem by excluding unimportant, redundant and correlated features in order to increase the accuracy and simplicity of the classification model, reduce the computational effort and make the model easier to use. The representative features for default prediction can be presented as follows:
Profitability: Profit before interest and tax/Total assets.
Size: Natural logarithm of sales.
Tangibility: Ratio of fixed assets to total assets.
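The three firm-level features listed above can be computed directly from financial statement items. The following is a minimal sketch; the function name `firm_features`, its argument names and the example figures are hypothetical and are not taken from the study's dataset.

```python
import math


def firm_features(ebit, total_assets, sales, fixed_assets):
    """Compute the three firm-level features defined in the text.

    ebit is profit before interest and tax. All argument names are
    illustrative assumptions, not names used in the original study.
    """
    if total_assets <= 0 or sales <= 0:
        raise ValueError("total_assets and sales must be positive")
    return {
        # Profitability: profit before interest and tax / total assets
        "profitability": ebit / total_assets,
        # Size: natural logarithm of sales
        "size": math.log(sales),
        # Tangibility: fixed assets / total assets
        "tangibility": fixed_assets / total_assets,
    }


# Hypothetical firm, figures in millions
example = firm_features(ebit=12.0, total_assets=200.0, sales=150.0, fixed_assets=80.0)
```

Guarding against non-positive sales and assets keeps the logarithm and the ratios well defined; how such firms should be treated in practice is a data-cleaning decision the text does not specify.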
In order to capture a more realistic effect of sector or industry on corporate default prediction, this study employed three industry-level variables: munificence, dynamism and the Herfindahl-Hirschman Index (HHIndex). The first two variables are derived from the model of Dess and Beard (1984), known as the multidimensional model of the environment. This model has so far been used in the context of corporate strategy. More recently, Kayo and Kimura (2011) analysed the effects of such industry-specific properties on firm leverage.

Munificence:
Munificence is the capacity of an environment to sustain growth (Dess and Beard, 1984). Sectors and industries operating in an environment with high munificence tend to have more opportunities than industries with low munificence (Almazan and Molina, 2005).
Dynamism: Generally, environmental dynamism describes the rate and instability of change in a firm's external environment (Dess and Beard, 1984; Simerly and Li, 2000).
HHIndex: On the basis of industry concentration, industries can be divided into high- and low-concentration industries. The level of industry concentration affects firm leverage differently.
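The text does not state how the three industry indicators are computed, so the following sketch rests on assumptions. The HHIndex is taken in its standard form, the sum of squared market shares. Munificence and dynamism are assumed to follow a regression-based operationalization often used with the Dess and Beard (1984) framework: industry sales are regressed on time, munificence is the slope divided by mean sales, and dynamism is the standard error of the slope divided by mean sales. All function names and figures are hypothetical.

```python
import math


def hh_index(firm_sales):
    """Herfindahl-Hirschman Index: sum of squared market shares (0 to 1)."""
    total = sum(firm_sales)
    return sum((s / total) ** 2 for s in firm_sales)


def sales_trend(industry_sales):
    """OLS of industry sales on a time index 0..n-1.

    Returns (slope, standard error of slope, mean sales).
    """
    n = len(industry_sales)
    if n < 3:
        raise ValueError("need at least three periods")
    t = range(n)
    t_mean = (n - 1) / 2
    y_mean = sum(industry_sales) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, industry_sales))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    sse = sum((yi - (intercept + slope * ti)) ** 2 for ti, yi in zip(t, industry_sales))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, std_err, y_mean


def munificence(industry_sales):
    """Assumed definition: sales growth slope scaled by mean sales."""
    slope, _, mean_sales = sales_trend(industry_sales)
    return slope / mean_sales


def dynamism(industry_sales):
    """Assumed definition: standard error of the slope scaled by mean sales."""
    _, std_err, mean_sales = sales_trend(industry_sales)
    return std_err / mean_sales
```

Under these assumed definitions, an industry whose sales grow steadily has high munificence and low dynamism, while erratic sales around the same trend raise dynamism.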

METHODOLOGY
Logistic regression: Logistic regression is a type of regression method (Allison, 2001; Hosmer and Lemeshow, 2000) in which the dependent variable is discrete or categorical, for instance default (1) and non-default (0). Logistic regression examines the effect of multiple independent variables in order to predict which category of the dependent variable an observation belongs to. According to Morris (1997), Martin (1977) was the first researcher to use the logistic technique in the context of corporate default. He employed this technique to examine failures in the U.S. banking sector. Subsequently, Ohlson (1980) applied logistic regression more generally to a sample of 105 bankrupt firms and 2,000 non-bankrupt companies. His model did not discriminate between failed and non-failed companies as well as the Multiple Discriminant Analysis (MDA) models reported in previous studies. According to Dimitras et al. (1996), logistic regression ranks second, after MDA, among default prediction models. This method creates a score for every observation based on the weights of its independent variables. This score represents the likelihood of membership in the target category. For instance, in a default prediction model, Probability(Default | X_1, …, X_n) is the probability of default, X_i (i = 1, …, n) are independent variables such as firm-specific variables, and β_1 to β_n are coefficients estimated by the model. This model can be interpreted as the probability of default given a firm's characteristics. The model is estimated by maximizing the likelihood function: the weights are chosen to maximize the probability of default for the firms known to have failed and the probability of non-default for non-failed firms. Then, using a cut-off point, a firm is classified as failed or non-failed. Logistic regression is also able to test the significance of individual variables in the model (Allison, 2001; Hosmer and Lemeshow, 2000).
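The scoring, likelihood and cut-off steps described above can be sketched as follows. The sketch assumes the standard logistic form, Probability(Default | X) = 1 / (1 + exp(-(β_0 + β_1 X_1 + … + β_n X_n))), where the intercept β_0 is an assumption not named in the text. The function names, the 0.5 cut-off and the toy data are likewise hypothetical.

```python
import math


def default_probability(x, intercept, coefs):
    """Logistic score: probability of default given features x."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    # Numerically stable evaluation of 1 / (1 + exp(-z))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def classify(prob, cutoff=0.5):
    """Label a firm as default (1) or non-default (0) using a cut-off point."""
    return 1 if prob >= cutoff else 0


def log_likelihood(X, y, intercept, coefs):
    """Log-likelihood that maximum likelihood estimation seeks to maximize."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = default_probability(xi, intercept, coefs)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


# Toy data: one feature, one defaulted firm and one healthy firm
X_toy = [[1.0], [-1.0]]
y_toy = [1, 0]
ll_informative = log_likelihood(X_toy, y_toy, 0.0, [2.0])
ll_null = log_likelihood(X_toy, y_toy, 0.0, [0.0])
```

Coefficients that assign a high default probability to the failed firm and a low one to the healthy firm give a higher log-likelihood than uninformative coefficients, which is the sense in which the weights are chosen.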
Decision tree: Decision trees are among the most popular and powerful techniques for classification and prediction. The main reason for their popularity is their simplicity and transparency, and consequently their relative advantage in terms of interpretability. The decision tree is a non-parametric, inductive technique that is able to learn from examples through a process of generalization. Frydman et al. (1985) were the first to employ decision trees to forecast default. Subsequently, other researchers applied this technique to predict default and bankruptcy (Carter and Catlett, 1987; Gepp et al., 2010; Messier and Hansen, 1988; Pompe and Feelders, 1997). Decision trees allocate data to predefined classification groups. For instance, in business default prediction, this technique assigns each firm to a failed or non-failed group. Generally, decision trees are binary trees that consist of a set of branches (paths from the root to leaf nodes), leaf nodes (object classes) and nodes (decision rules), which classify objects according to their attributes (Dimitras et al., 1996). The decision tree therefore takes the form of a top-down tree structure, which divides the data to generate leaves. Within this structure, one target class is central, and each record flows through the tree along a path determined by a series of tests until it reaches a terminal node (Quinlan, 1986).
There are two types of decision tree models: regression trees, used when the response variable is continuous, and classification trees, used when the response variable is discrete or categorical. Various algorithms exist for building decision trees, of which the most popular are C4.5 and CART. The main advantage of decision trees is that, as a non-parametric method, they impose no strict statistical requirements on the dataset, such as normality. The simplicity of the model has also made this technique popular and easy to use for classification.
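To illustrate how a classification tree chooses a decision rule at a node, the sketch below searches for the threshold on one variable that minimizes the weighted Gini impurity of the two child nodes, the criterion commonly associated with CART. It is a simplified illustration under that assumption and not the procedure used in this study; the function names and the toy values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (1 = default)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Find the threshold on one variable that minimizes weighted Gini.

    Returns (threshold, weighted_gini); threshold is None when no split
    improves on the impurity of the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy node: profitability values and default flags for five firms
threshold, impurity = best_split([-0.2, -0.1, 0.05, 0.1, 0.2], [1, 1, 0, 0, 0])
```

A full tree repeats this search over every candidate variable at every node and stops according to a pruning or minimum-size rule, which this sketch omits.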

Data description:
The dataset was used to classify a set of firms into those that would default on loan payments and those that would not. It consists of 285 observations of Malaysian companies during 2007-2012 from four sectors: trading and services, manufacturing (consumer products and industrial products) and construction and property. Of the 147 cases for training, 67 belong to the default class under the requirements of PN4, PN17 and Amended PN17, and the other 201 to the non-default class. Following an extensive review of the existing literature on corporate default models, the financial ratios most commonly examined in previous studies were identified. The variable selection procedure should be largely based on existing theory. The field of default prediction, however, suffers from a lack of agreement as to which variables should be used. The first step in this empirical search for the best model is therefore correlation analysis. If high correlation is detected, the most commonly used and best performing ratios in the literature are prioritized. The final choice of variables entering the models is then made by looking at the significance of the ratios.
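The correlation screening step described above can be sketched as follows. The sketch assumes Pearson correlation and a cut-off of 0.8 for "high" correlation; neither is stated in the text. The priority order stands in for the ranking of ratios by how commonly they are used in the literature. All names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)  # undefined for constant columns


def screen_correlated(columns, priority, threshold=0.8):
    """Keep variables in priority order, dropping any variable that is
    highly correlated with a variable already kept."""
    kept = []
    for name in priority:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical ratios for four firms; "roa" nearly duplicates "profitability"
columns = {
    "profitability": [0.10, 0.20, 0.30, 0.40],
    "roa": [0.11, 0.19, 0.32, 0.41],
    "tangibility": [0.50, 0.10, 0.40, 0.20],
}
selected = screen_correlated(columns, ["profitability", "roa", "tangibility"])
```

Because higher-priority ratios are examined first, a redundant ratio is the one removed whenever a highly correlated pair is found.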
The components of the financial ratios estimated from the data are explained below, and Table 1 shows the summary statistics of the selected variables for default and non-default firms. The most significant variables under the two methods were identified. These variables were selected from the significant indicators as those that could best discriminate default firms from non-default firms.

Logistic regression to model default prediction:
As shown in Table 2, five independent variables made a statistically significant contribution to the model: size, profitability, liquidity, munificence and HHIndex. This is based on the Wald test, which shows the contribution of each predictor or independent variable to the model. Its interpretation is similar to that of the F or t values used for significance testing of regression coefficients (Hair et al., 2006). Variables that contribute significantly to the model should have a significance value of less than 0.05 (Pallant, 2007). A significant result indicates a predictor that is reliably associated with the outcome (Tabachnick and Fidell, 2007). The B coefficient of each significant determinant is shown in Table 2.
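The Wald test for a single coefficient can be sketched as follows, assuming the usual form in which the squared ratio of the coefficient B to its standard error is compared with a chi-square distribution with one degree of freedom. The coefficient and standard error in the example are hypothetical and are not values from Table 2.

```python
import math


def wald_test(coef, std_err):
    """Wald statistic and p-value for one logistic regression coefficient.

    The statistic (B / SE)^2 is referred to a chi-square distribution
    with one degree of freedom, whose upper tail equals erfc(sqrt(W / 2)).
    """
    wald = (coef / std_err) ** 2
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value


# Hypothetical coefficient and standard error
wald, p_value = wald_test(coef=-1.20, std_err=0.45)
is_significant = p_value < 0.05  # the 0.05 threshold cited in the text
```

With one degree of freedom the test is equivalent to a two-sided z test on B / SE, which is why its reading resembles that of a t value.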
Therefore, based on Table 2, the equation for the different sectors using financial ratios and sector indicators is:
The findings show that sector variables, in combination with financial ratios, can be used to predict corporate default across different industries in Malaysia.

Decision tree to model default prediction:
The tree diagram is a graphic representation of the tree model. This tree diagram shows that:
• Using the CHAID method, profitability is the best predictor of a firm's default.
• For the low profitability category, the next best predictor is liquidity. Of the firms in this category, only 7.3% have defaulted on loans.
• For the high profitability category, the model includes one more predictor: dynamism. About 15% of those firms with a dynamism value greater than 1.1 have defaulted on loans (Fig. 1).
The decision tree model shows an accuracy of about 85.83% and a mean absolute error of about 0.16 (
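The two performance measures quoted above can be computed as in the sketch below. It assumes that accuracy is the share of firms whose predicted class matches the observed class, and that the mean absolute error is taken between the predicted default probability and the observed 0/1 outcome; the text does not state which definitions were used. The data are hypothetical.

```python
def accuracy(actual, predicted_class):
    """Share of observations whose predicted class equals the actual class."""
    hits = sum(1 for a, p in zip(actual, predicted_class) if a == p)
    return hits / len(actual)


def mean_absolute_error(actual, predicted_prob):
    """Mean absolute gap between predicted probability and 0/1 outcome."""
    return sum(abs(a - p) for a, p in zip(actual, predicted_prob)) / len(actual)


# Hypothetical outcomes (1 = default) and predicted default probabilities
actual = [1, 0, 0, 1]
probs = [0.9, 0.2, 0.6, 0.7]
classes = [1 if p >= 0.5 else 0 for p in probs]  # assumed 0.5 cut-off

acc = accuracy(actual, classes)
mae = mean_absolute_error(actual, probs)
```

Accuracy depends on the chosen cut-off, whereas the mean absolute error also reflects how confident the predicted probabilities are.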

DISCUSSION AND CONCLUSION
Default prediction plays an important role in the prevention of corporate default, which is why the accuracy of default prediction models is of wide concern to researchers. Appropriate identification of firms approaching default is undeniably required. A large volume of published studies describes the role of firm-specific factors in default prediction models, and over the past 40 years the use of firm-specific variables in these models has been the subject of many studies. This research implies that a significant relationship exists between default and firm-specific variables. The results of this study support that literature. According to the results, financial ratios such as liquidity and profitability significantly affect the probability of default.
The main part of the default prediction literature across developed and developing economies has focused on firm-specific and macroeconomic indicators. However, a number of studies have highlighted the importance of industry for default prediction. These studies employ dummy variables to control for the industry effect on default. We attempt to detail the characteristics of each industry, following the approach of Kayo and Kimura (2011), who identify the industry characteristics that influence leverage. The results show a significant relationship between default and the industry indicators munificence and HHIndex. Compared with developing countries, the business environment in developed markets is more competitive; munificence therefore tends to be insignificant in developed countries. In developing countries, the nature of every industry is different and every industry is subject to a different level of competitiveness. It is therefore plausible to find a significant relationship between the probability of default and munificence. The results of this study may improve the performance of default prediction models and reduce the costs of risk management. However, this study is limited in that the data were collected only from Malaysian listed companies; future studies could extend the investigation to real-world data sets from other countries.

Table 1: Summary statistics for selected variables

Table 3: Detailed accuracy by class