Estimating credit and profit scoring of a Brazilian credit union with logistic regression and machine-learning techniques

Purpose – Although credit unions are nonprofit organizations, their objectives depend on the efficient management of their resources and credit risk aligned with the principles of the cooperative doctrine. This paper aims to propose the combined use of credit scoring and profit scoring to increase the effectiveness of the loan-granting process in credit unions.
Design/methodology/approach – The sample is composed of data on personal loan transactions of a Brazilian credit union.
Findings – The analysis reveals that the use of statistical methods significantly improves the predictability of default when compared to the use of subjective techniques, and that the random forests model is superior to logit and ordinary least squares (OLS) regression in estimating credit scoring and profit scoring. The study also illustrates how both analyses can be used jointly for more effective decision-making.
Originality/value – Replacing subjective analysis with objective credit analysis using deterministic models will benefit Brazilian credit unions. The credit decision will be based on the input variables and on clear criteria, making the decision-making process impartial. The joint use of credit scoring and profit scoring allows granting credit to the clients with the highest potential to pay their debt obligations and, at the same time, ensuring that the transaction profitability meets the goals of the organization: to be sustainable and to provide loans and investment opportunities at attractive rates to members.
Keywords – Credit unions
Paper type – Research paper


Introduction
Credit unions use the financial resources raised via members to finance the loans of the same members. Mutual help with accessible conditions increases the financial returns of the community associated with the union over time (Polônio, 1999).
At the end of each fiscal year, the union returns the surplus to its members, either by cash distributions proportional to members' contributions and uses or by reinvesting it in the union, as stated in the bylaws (Geriz, 2010).
Although profit is not the main objective of these organizations, they have to remain competitive to survive. Cooperatives need to implement liquidity and solvency controls, to search for economies of scale and to manage their financial assets efficiently (Silva Filho, 2002). Silva Filho (2002) advocates that credit unions should use management tools to measure performance and achieve goals to increase the effectiveness of the decision-making process.
The organizational structure of credit unions may create agency problems in the process of approving loans (Lima, Araújo, & Amaral, 2008). When a member applies for a loan in his or her institution, the granting decision depends on the judgment of the credit analyst who is designated by the credit union administration to assess the creditworthiness of its members. The fact that the members of a credit union are also its owners makes the judgment of the credit analyst vulnerable to the influence of applicants. Some members may seek benefits that diverge from the best interest of the cooperative, as, for instance, insolvent debtors wanting to renegotiate bad loans.
The lack of objectivity in credit policies and the absence of reasonable internal controls are obstacles that hinder the sustainable growth of a credit union (Giarola, Santos, & Ferreira, 2009). By comparison, major Brazilian financial institutions have been investing substantial amounts of capital in technology for credit systems in retail operations. They use sophisticated statistical methods to open accounts, approve credit, determine limits, extend loans and perform collection actions, making internal processes more efficient and keeping the credit portfolio well balanced with the expected levels of risk and return.
In this context, credit risk modeling and loan profitability forecasting are useful resources to improve the performance of credit unions.
Credit scoring models are widely used in the financial industry to measure credit risk. The model estimates default probability (DP) based on customers' past behavioral and demographic characteristics (Lewis, 1992).
Profit scoring models are used to predict the profitability of a client or a transaction. This approach is based on the concept that nondefaulting borrowers may not generate sufficient revenue to offset the costs of maintaining their accounts, whereas defaulting borrowers may be profitable if they actively engage in credit transactions and honor most of their commitments (Sanchez-Barrios, Andreeva, & Ansell, 2016). The methodology addresses decisions regarding the selection of the desired risk combined with a return level and the formulation of strategies to acquire and retain profitable clients (Sanchez-Barrios et al., 2016). Lima et al. (2008) argue that the agency conflict in credit unions can be mitigated by the adoption of efficient internal controls and clear governance rules.
We propose the use of a credit-scoring model combined with a profit-scoring model for loan-granting decisions, determining an acceptable credit risk without excessive loss of profitability. The combined use of these approaches may improve the credit concession process, by not only lending to clients with the potential to pay off debt but also considering the expected profitability of the operation. It may reduce the agency problems and may improve the efficiency of credit unions, increasing the long-term viability of the organization. We expect that a process based strictly on objective and quantitative analyses provides loans suited for its members without favoring minor groups at the expense of the whole.
We contribute to the literature by comparing logistic and machine-learning approaches to subjective methods, and by combining credit scoring with profit scoring in a credit union context. As credit unions are private, it is difficult to have access to their data, and we are not aware of a similar analysis in Brazil. We had access to data on loan transactions granted in 2015 and 2016 for only one credit union, but as the agency problems and granting processes are similar, our main findings will probably apply to most cooperatives. Our results indicate that logistic models increase the Gini coefficient by 5.8 times when compared to subjective analyses and the area under the ROC curve (AUROC) by around 62 per cent. The random forest increases the Gini coefficient further by 4 per cent and the AUROC by 2 per cent. We conclude that statistical models improve the efficiency of credit granting significantly.

Theoretical basis
Modern cooperatives emerged in 1844 in Rochdale, England, during the Industrial Revolution (Pinheiro, 2008). A group of weavers founded a cooperative based on ethical and behavioral principles, which have become the basis of contemporary cooperativism. Friedrich Wilhelm Raiffeisen founded the pioneering institution that served as a model for credit union activities in Germany in 1847. The German rural cooperatives had the following characteristics: unlimited and joint liability of members, that is, the member is accountable for debts contracted by the organization by pledging his or her private assets as collateral; members' votes have equal weights, regardless of their equity stake; and nondistribution of profits, surpluses or dividends. In 1856, Herman Schulze created the first urban credit union of Germany. The cooperatives founded by Herman Schulze differ from those of Raiffeisen because they return dividends to members proportionally to their equity stake. The cooperatives that follow these rules are now known as cooperatives with a Schulze-Delitzsch model.
The first credit union in Brazil was Sociedade Cooperativa Caixa de Economia e Empréstimos de Nova Petrópolis, founded in Rio Grande do Sul in 1902 (Soares & Melo Sobrinho, 2008). It was a Raiffeisen-type cooperative, and it remains operational until today. After this pioneering initiative, other credit unions were created to serve rural communities.
Credit unions are financial institutions regulated by the Central Bank of Brazil through Resolution No. 4434 of 2015, issued by the National Monetary Council. The liability of the partners may be limited or unlimited, as determined by the bylaws (Geriz, 2010).
In Brazil, credit unions have grown significantly in recent years along with their representativeness in the banking sector. Table I illustrates the growth of cooperatives in terms of assets and equity compared with consolidated banking data.
The compound annual growth rate (CAGR) of credit unions' aggregate figures (net worth, assets, deposits and credit portfolio) between 2011 and 2016 was in the range of 18 to 22 per cent. This exceeded the growth rate of other financial institutions. The share of credit unions in the consolidated banking sector net worth jumped from 4.1 per cent in 2011 to 6.6 per cent in 2016.
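For reference, the compound annual growth rate over the period can be written as follows, assuming five annual compounding periods between the 2011 and 2016 figures. The symbols $V_{2011}$ and $V_{2016}$ are illustrative and do not appear in the original:

```latex
\mathrm{CAGR} = \left(\frac{V_{2016}}{V_{2011}}\right)^{1/5} - 1
\quad\Longleftrightarrow\quad
\frac{V_{2016}}{V_{2011}} = (1 + \mathrm{CAGR})^{5}
% 18 per cent: 1.18^5 \approx 2.29
% 22 per cent: 1.22^5 \approx 2.70
```

Under this assumption, growth rates of 18 to 22 per cent correspond to aggregates multiplying by roughly 2.3 to 2.7 over the five years.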
Despite not being the dominant model in the banking sector, not-for-profit financial institutions, like credit unions, play an important role in many countries (Canning, Jefferson, & Spencer, 2003). However, the literature on credit unions is scarce when compared to that on other financial institutions (Cuevas & Fischer, 2006).
Credit unions offer lower interest rates to their members and return part of the cash surplus (the difference between revenues and operating expenses) to them if approved by the general assembly (Giarola et al., 2009). Those are advantages in comparison to other credit institutions. However, credit unions have to retain an adequate level of cash surplus to make the necessary investments in operating assets and risk management to offer lower interest rates and retain clients with higher credit risk in a highly competitive sector dominated by large institutions.
Giarola et al. (2009) explain that credit union financial resources are mostly generated by deposits and the acquisition of equity by members, which are then transferred as loans to other members. Therefore, poor management affects all members negatively. Lima et al. (2008) argue that the most important advantage of credit unions is the access to financial services to their members, even during credit rationing in the conventional financial market. The reduced tax burden is an important competitive advantage that allows cooperatives to have lower costs than other financial institutions in the retail sector.
The agency problem is a risk factor for the long-term sustainability of credit unions (Cuevas & Fischer, 2006). The agency theory postulates that the main interest of shareholders is profit maximization (Cornforth, 2004). Characteristics of the corporate control market, such as pressure from major shareholders, takeover threats and board monitoring, align managers' interest with this goal. In cooperatives, the situation is different: they are established to serve the interests of their members; therefore, profit is a way to achieve a purpose and not a purpose itself.
Agency conflict has particular characteristics in credit unions. First, cooperative owners are also clients. Conflicts of interests between lenders and borrowers occur due to heterogeneity in customer preference for profitable applications or loans with attractive rates. Second, decisions made at a general assembly, including election of the administrative staff, are based on the vote of each member, with no distinction by individual equity stake. Finally, the members elected to the board are usually less technically qualified than experts working in the financial industry (Lima et al., 2008).
The more homogeneous the long-term concerns of the members, the lower the agency costs (Souza, 2017), and the vote will express the desire of the majority. Conversely, in highly heterogeneous credit unions, decisions will greatly differ among members. Therefore, shareholder organizations, whose profit maximization objective is better defined and achievable, tend to present more advantages (Hart & Moore, 1998). The identification of variables related to the credit quality of members allows statistical and objective control of lending processes, increasing scalability and impartiality in decision-making, thereby reducing agency costs for the cooperative.
Credit unions operate on behalf of all members, and they are not motivated by profit (Taylor, 1971; Spencer, 1996). Members will most likely play both roles: lenders and borrowers. As long as the difference between the borrowers' and lenders' rates is reduced, credit unions have incentives to expand. The difference between borrowers' and lenders' rates covers the long-run average costs of the unions. The combination of profit and credit scoring models may make the credit granting process more efficient, thus reducing the average long-term cost of the union. This can result in a lower difference between borrowers' and lenders' rates and, consequently, in a higher incentive for the credit union to expand.

Methodology
Credit and profit scoring models
Fisher (1936) introduced discriminant analysis, and Durand (1941) pioneered the proposal of a credit-scoring model. Since the second half of the twentieth century, credit-scoring models have become a benchmark for the financial industry.
Profit-scoring modeling has emerged as part of a credit granting decision-making tool. Borrowers are grouped according to profitability ratios instead of default probabilities and credit losses (Sanchez-Barrios et al., 2016).
The increase in computational processing capacity enabled financial institutions to use more efficient models based on machine-learning techniques (Yap, Ong, & Husain, 2011). These techniques may have higher accuracy and robustness in analyzing nonlinear relationships than traditional techniques.

Random forests
Random forest (Breiman, 2001) is a machine-learning technique that combines decision trees (Breiman, Friedman, & Stone, 1984). It may be used for data classification (categorical dependent variable) and regression (continuous dependent variable) problems, and it is an evolution of the bootstrap aggregating (bagging) algorithm (Breiman, 1996).
The classification and regression trees, proposed by Breiman et al. (1984), seek the creation of homogeneous subsets by successive binary partitioning of data, until meeting predefined quality (or purity) criteria. The values or categories of the dependent variable are predicted based on the end nodes of the decision tree. Figure 1 exemplifies a decision tree.
Figure 1 shows the process of binary data partitioning into classes. From the root node, the variable that enables a better separation of groups according to a quality criterion is selected, yielding two new nodes. The process continues recursively, until meeting the stopping criteria.
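The partitioning step can be illustrated with a minimal Python sketch. It assumes Gini impurity as the quality criterion, which is a common choice for classification trees but is not stated in the text; note that Gini impurity is a node-purity measure and is distinct from the Gini coefficient used later to evaluate the models. The function names and the toy data are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Scan midpoints between distinct values; return (threshold, weighted impurity)."""
    n = len(labels)
    best_threshold, best_impurity = None, gini_impurity(labels)
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini_impurity(left)
                    + len(right) * gini_impurity(right)) / n
        if weighted < best_impurity:
            best_threshold, best_impurity = threshold, weighted
    return best_threshold, best_impurity

# Hypothetical toy data: installment-to-income ratio and payer class.
debt_ratio = [0.10, 0.15, 0.20, 0.45, 0.50, 0.60]
payer = ["good", "good", "good", "bad", "bad", "bad"]
threshold, impurity = best_split(debt_ratio, payer)
print(threshold, impurity)
```

In this toy case a single threshold separates the two classes perfectly, so the weighted impurity of the split is zero. A full tree would apply the same search recursively to each child node and across all candidate variables.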
In the bagging method, a large number of subsamples are drawn from the original sample with replacement, and a classification or regression tree is generated for each subsample. The prediction is calculated as the mean (for regression) or the majority vote (for classification) of the responses of the trees generated with the subsamples. This technique generates more precise and stable results because the effects of noise and outliers are attenuated with the various samplings.
A key limitation of the bagging technique is the possibility of generating very similar trees, which in turn increases the prediction error rate of the model because the independent variables are always the same.
The random forests method adds random variable selection to the selection of subsamples performed in the bagging technique. Having M predictor variables in the data set, m < M variables are randomly selected as split candidates in the construction of the individual trees; in Breiman's (2001) formulation, this draw is repeated at each node. The m value is kept constant during the model learning process. This feature enables reduction of prediction errors due to multicollinearity problems, for example.
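The two sources of randomness described above, bootstrap subsamples and a random subset of predictors, can be sketched in a few lines of Python. This is an illustration only: the values of `n_obs`, `M`, `m` and the number of draws are hypothetical, the tree-fitting step is omitted, and the helper names are not taken from any library.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bagging subsample)."""
    return [rng.randrange(n) for _ in range(n)]

def random_feature_subset(num_features, m, rng):
    """Draw m distinct predictor indices out of num_features.
    A full implementation would repeat this draw at every split."""
    return sorted(rng.sample(range(num_features), m))

def majority_vote(predictions):
    """Combine the class predictions of the individual trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
n_obs, M, m = 10, 10, 3  # hypothetical sizes; m is held fixed throughout

# One (rows, predictors) plan per tree; fitting the tree itself is omitted.
plans = [(bootstrap_indices(n_obs, rng), random_feature_subset(M, m, rng))
         for _ in range(5)]

# Hypothetical class predictions from five trees for one applicant.
votes = ["good", "bad", "good", "good", "bad"]
decision = majority_vote(votes)
print(decision)
```

For a regression forest, the majority vote would be replaced by the mean of the individual tree predictions.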
We estimated the credit-scoring models with logistic regression to verify whether the accuracy of the subjective analysis increases when including demographic and behavioral variables. Subsequently, we compared the results with those obtained from the random forest algorithm. The comparisons are based on the Kolmogorov-Smirnov (KS) statistic, the Gini coefficient and the AUROC.
Figure 1. Representation of classification trees. Source: Silverio (2015)

Data description
We obtained our data from a Brazilian credit union. Behavioral and demographic variables of loans were observed in a 24-month period (from January 2015 to December 2016), totaling 2,012 observations. We considered personal loan transactions only. The credit union did not register rejected applications. Each database record is a unique credit transaction that contains information about the characteristics of the loan, the borrower and the development of the outstanding balance. The last set includes the payment history, overdue payments and end-of-month financial statements.
We classify a client as a "bad" payer if he or she does not pay off their debt commitments within 90 days after the due date; and "good" otherwise.
We used the internal rate of return (IRR) on the loans as a profitability measure in profit scoring models, as proposed by Serrano-Cinca and Gutiérrez-Nieto (2016). The IRR calculation parameters are the cash flows of payments received and the costs involved in each contracted operation.
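As an illustration of the profitability measure, the IRR of a loan is the periodic rate at which the net present value of its cash flows is zero. The Python sketch below finds it by bisection. It assumes equally spaced cash flows with the disbursement as a negative first entry, and it ignores the operating costs mentioned above; the function names, the bracket and the example loans are hypothetical.

```python
def npv(rate, cash_flows):
    """Net present value of equally spaced cash flows at a periodic rate."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))

def irr(cash_flows, low=-0.99, high=10.0, tol=1e-10, max_iter=200):
    """Periodic IRR by bisection; needs a sign change of NPV on [low, high]."""
    f_low, f_high = npv(low, cash_flows), npv(high, cash_flows)
    if f_low * f_high > 0:
        raise ValueError("NPV does not change sign on the bracket")
    for _ in range(max_iter):
        mid = (low + high) / 2
        f_mid = npv(mid, cash_flows)
        if abs(f_mid) < tol or (high - low) / 2 < tol:
            return mid
        if f_low * f_mid < 0:
            high = mid
        else:
            low, f_low = mid, f_mid
    return (low + high) / 2

# Hypothetical loan: 1,000 disbursed, 12 monthly installments of 100.
performing = [-1000.0] + [100.0] * 12
# Same loan, but the borrower stops paying after three installments.
defaulted = [-1000.0] + [100.0] * 3

print(irr(performing))  # positive monthly rate
print(irr(defaulted))   # negative rate: the loan destroys value
```

The second example shows why default events populate the left tail of the IRR distribution: when most of the principal is never recovered, the rate that zeroes the net present value is strongly negative.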
We used the following variables:
Type: type of credit operation chosen by the applicant, such as personal loan, loan for acquiring goods and payroll loan;
Rating: the credit rating subjectively assigned by the credit union to the loan transaction based on the client's payment history records and information provided by credit bureaus such as Credit Protection Service and Serasa-Experian;
Income: the borrower's proven monthly net income in Brazilian reais;
Collateral: information about available goods or property that can be used as collateral;
Reltime: time elapsed between the date the checking account was opened and the date the loan transaction was settled, in years;
Interest: nominal interest rate of the transaction;
Value: value of the loan requested by the borrower in Brazilian reais;
Debt: debt to disposable income ratio, that is, the ratio between the value of the loan installment and the monthly income of the borrower;
Evhist: number of months since the last occurrence of a late payment (a delay of more than 30 days after the credit obligation due date) of a specific client in the past two years. If there are no observed events for a given customer, this variable is zero; and
Term: term in days corresponding to the period between the transaction date and the contractually agreed loan settlement date.
The credit-scoring dependent variable Y receives value 0 (zero) if the client is "bad" and value 1 (one) if the client is "good." Table II shows the statistics of variable Y in the sample. The dependent variable of the profit score model is the IRR on the transactions.

Figure 2 shows the histogram and statistics of the IRR-dependent variable. We observed that the distribution is asymmetric. The default events markedly shift the left tail of the distribution because these events are associated with significant losses. The distribution of the positive IRR values is concentrated in the range of rates of return from 5 to 15 per cent.
Table III shows the descriptive analysis of the categorized independent variables. In each line, for each possible value of the categorical variables, one may find information regarding the number of bad or good clients in the sample and descriptive statistics for IRR. We applied chi-square homogeneity tests to assess whether the ratios of good and bad payers differ between the categories of the independent variables, and the mean comparison test (analysis of variance, ANOVA) between the different possible values of the independent variables regarding the IRR. All tests, except for the COLLATERAL variable, identified significant differences at p < 0.01. We did not find a significant difference between "good" and "bad" in the COLLATERAL variable according to the F-ANOVA test (p > 0.10). This indicates that there is no significant contribution from COLLATERAL to the profit scoring models.
Table IV contains the descriptive analysis of the continuous independent variables. In each line, we present, for each variable, some descriptive statistics, the observed means for bad and good payers and the correlation between the variable and IRR. We used the t-test to compare the means of the independent variables between good and bad payers. Although the normality assumption of the t-test is not met, the large sample size enables us to carry out the tests. We found significant correlations between IRR and all continuous variables, except for the variables VALUE, DEBT and TERM.
Variables VALUE and TERM are nonsignificant in all statistical tests. The results of the t-test improved after we applied the transformations of the variables outlined in Table V.
The value of the variance inflation factor (VIF) for the lnVALUE variable was 4.40, suggesting multicollinearity problems. When we excluded the maximum value, the VIF was 1.37, indicating the absence of strong multicollinearity among the variables.
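The VIF of a regressor follows from the R² obtained when that regressor is regressed on the remaining ones, so the two reported values can be translated into shares of explained variance. The notation $R_j^2$ is introduced here for illustration and is not in the original:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\quad\Longrightarrow\quad
R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j}
% VIF = 4.40: R_j^2 = 1 - 1/4.40 \approx 0.77
% VIF = 1.37: R_j^2 = 1 - 1/1.37 \approx 0.27
```

Read this way, a VIF of 4.40 means that about 77 per cent of the variance of lnVALUE is explained by the other regressors, against about 27 per cent at a VIF of 1.37.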
The general specification of the logistic and linear models used in the paper is given, respectively, by expressions (1) and (2):
$$\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji} \qquad (1)$$
where $p_i$ is the probability that client $i$ is good, $k$ is the number of independent variables in the model, $x_{ji}$ is the value of independent variable $j$ for client $i$ and $\beta_0$ and $\beta_j$ are parameters, $i = 1, \ldots, n$, $j = 1, \ldots, k$:
$$\mathrm{IRR}_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji} + \varepsilon_i \qquad (2)$$
where $\mathrm{IRR}_i$ is the value of IRR for client $i$, $\varepsilon_i$ is the error term, and the other terms are as previously described.
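As a small illustration of how a fitted logistic specification is turned into a score, the Python sketch below maps a linear predictor to the probability of being a good payer. Because the dependent variable is coded 1 for "good", the default probability is the complement of that value. The intercept, coefficients and feature values are hypothetical and are not estimates from the paper.

```python
import math

def good_probability(intercept, coefficients, features):
    """Inverse logit of the linear predictor: P(client is good)."""
    eta = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical parameters and one applicant's (already transformed) inputs.
intercept = 0.5
coefficients = [0.8, -1.2]
features = [1.0, 0.25]

p_good = good_probability(intercept, coefficients, features)
default_probability = 1.0 - p_good  # DP under the coding Y = 1 for "good"
print(round(p_good, 4), round(default_probability, 4))
```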

Results and discussion
We used the R programming language (R Core Team, 2000) and the randomForest package (Svetnik, Liaw, & Tong, 2003) to fit the models.

Credit scoring models
We developed three logistic regression models. Model 1 (complete model) contains all independent variables selected for the study. Model 2 excludes the RATING variable. Model 3 contains only the RATING variable. Model 3 allows us to evaluate the performance of the current criterion used by the cooperative to decide if a credit should be provided; Model 2 uses all the information available, except the information provided by RATING, and Model 1 uses all the information available.
Table VI shows the results of the analyses. We observed that the complete Model 1 performs better than Model 3 and slightly better than, albeit very similar to, Model 2. Model 2 performs better than Model 3, which suggests that including other variables is more beneficial to the risk classification than using the RATING variable alone (subjective credit risk classification).
The variables COLLATERAL, lnDEBT, INCOME (greater than R$6,000) and TYPE (other loans) are not statistically significant. The analysis of the signs of the TYPE variable shows that the product "Vehicle Purchase" has a higher credit risk than personal loan and unsecured loan products.
The model with the random forests technique created 500 classification trees of depth 3. Table VII outlines the quality indicators (AUROC, KS and Gini) associated with the logistic regression and random forest algorithm models (1). We observed that the random forest modeling performed better than the logistic regression according to the three indicators, in agreement with Jones et al. (2015), Lessmann et al. (2015), Malekipirbazari and Aksakalli (2015) and Namvar et al. (2018).
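The three quality indicators can be computed directly from scores and observed classes. The Python sketch below assumes that the score is the predicted probability of being "good" and that labels are coded 1 for "good" and 0 for "bad". It uses the usual credit-scoring relation Gini = 2 × AUROC − 1; that the paper's Gini coefficient follows this definition is an assumption. Function names and data are hypothetical.

```python
def auroc(scores, labels):
    """Share of (good, bad) pairs in which the good client has the higher
    score; ties count one half."""
    goods = [s for s, y in zip(scores, labels) if y == 1]
    bads = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if g > b else 0.5 if g == b else 0.0
               for g in goods for b in bads)
    return wins / (len(goods) * len(bads))

def gini(scores, labels):
    """Gini coefficient derived from the AUROC."""
    return 2.0 * auroc(scores, labels) - 1.0

def ks_statistic(scores, labels):
    """Largest gap between the cumulative score distributions of bad and
    good clients."""
    goods = [s for s, y in zip(scores, labels) if y == 1]
    bads = [s for s, y in zip(scores, labels) if y == 0]
    gap = 0.0
    for t in sorted(set(scores)):
        cdf_good = sum(1 for s in goods if s <= t) / len(goods)
        cdf_bad = sum(1 for s in bads if s <= t) / len(bads)
        gap = max(gap, abs(cdf_bad - cdf_good))
    return gap

# Hypothetical scores (probability of "good") and observed classes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(auroc(scores, labels), gini(scores, labels), ks_statistic(scores, labels))
```

The pairwise AUROC is quadratic in the sample size, which is acceptable for a sample of about two thousand loans but would be replaced by a rank-based computation on larger portfolios.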
The sample provided by the cooperative was not large enough to be partitioned into a development sample and a validation sample. Therefore, to evaluate the performance of the techniques with out-of-sample data, we used the k-fold cross-validation method (Kohavi, 1995) to assess the generalizability of the models. In this method, we randomly partitioned the data into 15 mutually exclusive groups (k = 15). By excluding a partition, we generated a model from the data in the remaining partitions, which was subsequently assessed with the partition removed. The cross-validation findings were very similar to those obtained when using the model developed with all database observations, indicating that the random forests technique is better than the logistic regression technique. The minimum value of the KS statistic for the logistic model was 0.66, and the maximum was 0.70; for the random forests, the minimum value was 0.90, and the maximum value was 0.92. The minimum value of the Gini coefficient for the logistic regression model was 0.86, and the maximum was 0.88; the minimum value for the random forests was 0.90, and the maximum was 0.93.
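The partitioning used in k-fold cross-validation can be sketched as follows in Python. The sample size and number of folds match those reported in the text; the seed, the function name and the way folds are dealt out are assumptions, and the model-fitting step is left as a comment.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and deal them into k mutually exclusive folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

n_loans, k = 2012, 15
folds = k_fold_indices(n_loans, k)

fold_sizes = []
for held_out in folds:
    held = set(held_out)
    training = [i for i in range(n_loans) if i not in held]
    # A model would be fitted on `training` and evaluated on `held_out` here.
    fold_sizes.append((len(training), len(held_out)))

print(fold_sizes[0], fold_sizes[-1])
```

With 2,012 observations and 15 folds, each model is estimated on roughly 93 per cent of the data and assessed on the remaining 134 or 135 loans.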

Profit score model
We compared the profit score model developed herein by using the random forest technique to the model estimated by the ordinary least squares method with robust heteroscedasticity-consistent standard errors.
To attenuate problems due to the lack of linearity of the dependent variable and to improve the performance of the ordinary least squares regression model, we added quadratic variables to the set of continuous variables. Table VIII summarizes these variables.
For the profit scoring models, we used the same categorized variables used in the credit-scoring model.
Table IX contains the results of the regression analyses.

The results show that the variables RELTIME, INTEREST, DEBT and their respective quadratic terms RELTIME_QUAD, INTEREST_QUAD and DEBT_QUAD are statistically significant, which suggests a nonlinear relationship between these variables and the IRR. Considering the amplitude observed in the sample (see minimum and maximum values in Table IV) of RELTIME and DEBT, the tests confirm a positive association between these variables and IRR. There is an inverted U-shaped relationship between IRR and INTEREST, with maximum IRR value of 60 per cent.
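The turning point implied by a quadratic term follows from setting the derivative to zero. In the sketch below, $\beta_1$ and $\beta_2$ stand for the coefficients of INTEREST and INTEREST_QUAD; this notation is an assumption for illustration, and the fitted values are those of Table IX, which are not reproduced here:

```latex
\mathrm{IRR} = \dots + \beta_1\,\mathrm{INTEREST} + \beta_2\,\mathrm{INTEREST}^2 + \dots
\qquad
\frac{\partial\,\mathrm{IRR}}{\partial\,\mathrm{INTEREST}}
  = \beta_1 + 2\beta_2\,\mathrm{INTEREST} = 0
\;\Longrightarrow\;
\mathrm{INTEREST}^{*} = -\frac{\beta_1}{2\beta_2}
```

The turning point is a maximum, that is, the relationship is an inverted U, when $\beta_2 < 0$.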
The variables RATING, COLLATERAL and INCOME (greater than R$6,000) are not statistically significant, as expected from the descriptive analysis results. Few observations (229 cases) have the INCOME variable (greater than R$6,000) equal to 1, which may have contributed to the nonsignificant p value. The results for the RATING variable suggest that subjective credit scoring does not significantly affect the IRR of transactions. The variable TERM has a different sign in the profit-scoring model than it does in the credit-scoring model. This suggests that the transaction profitability increases with maturity.
The random forest technique used the same variables as the regression models. Table X shows the performance indicators for both methodologies. The mean squared error (MSE) of the random forests model (MSE = 183.29) is considerably lower than the MSE of the linear regression model (MSE = 427.61). This is consistent with the comparison using the R² statistic. Similar to the credit-scoring models, we used the k-fold cross-validation method (Kohavi, 1995) to assess the generalizability of the profit-scoring models. We observed MSE values ranging from 409.68 to 439.32 for the linear regression and values ranging from 175.70 to 202.41 for the random forest. In turn, the R² statistic ranged from 0.27 to 0.30 for the linear regression model and from 0.67 to 0.71 for the random forest model.
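The two performance indicators can be computed from observed and predicted IRR values as in the Python sketch below. The function names and the toy numbers are hypothetical and are unrelated to the sample.

```python
def mse(actual, predicted):
    """Mean squared error of the predictions."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Share of the variance of `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed IRR values (per cent) and two sets of predictions.
observed = [8.0, 6.0, -20.0, 10.0]
close_fit = [7.0, 7.0, -18.0, 9.0]
mean_only = [1.0, 1.0, 1.0, 1.0]  # predicting the sample mean everywhere

print(mse(observed, close_fit), r_squared(observed, close_fit))
print(mse(observed, mean_only), r_squared(observed, mean_only))
```

Because R² equals one minus the ratio of the MSE to the variance of the dependent variable, the two indicators rank models identically on the same data. A model that always predicts the sample mean has an R² of zero.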

Joint analysis of credit scoring and profit scoring
Joint analysis of credit scoring and profit scoring may improve decision-making by identifying the worst credit score that can be accepted without losing profitability of the loan portfolio and keeping the risk of losses due to default at an acceptable level.
Figure 3 shows the mean, median and 1, 5, 95 and 99 per cent percentiles of the expected IRR as a function of the predicted DP. This is an exploratory analysis due to limitations of the database. We observed that only 5 per cent of the clients classified with a 0.2 DP have an IRR less than 0 per cent and that 1 per cent of these clients have a predicted IRR less than -20 per cent. The mean and median portfolio varies little, with a slight increase in mean return: 6.7 per cent IRR for a DP of 0, and 7 per cent IRR for a DP of approximately 0.13, subsequently falling slightly and then more sharply for DPs greater than 0.7.
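One way to operationalize the joint analysis is to tabulate the expected profitability of the loans that would be accepted under different DP cut-offs and to pick the loosest cut-off that still meets a target return. The Python sketch below does this with cumulative cut-offs, which differs from the band-by-band view of Figure 3; the function names, cut-offs, target and data are all hypothetical.

```python
from statistics import mean, median

def portfolio_by_cutoff(predicted_dp, predicted_irr, cutoffs):
    """Summarize the predicted IRR of loans accepted at each DP cut-off."""
    rows = []
    for cutoff in cutoffs:
        accepted = [irr for dp, irr in zip(predicted_dp, predicted_irr)
                    if dp <= cutoff]
        if accepted:
            rows.append({"cutoff": cutoff, "n": len(accepted),
                         "mean_irr": mean(accepted),
                         "median_irr": median(accepted)})
    return rows

def loosest_cutoff(rows, min_mean_irr):
    """Highest cut-off whose accepted portfolio still meets the target."""
    eligible = [row["cutoff"] for row in rows if row["mean_irr"] >= min_mean_irr]
    return max(eligible) if eligible else None

# Hypothetical model outputs: default probability and IRR (per cent).
predicted_dp = [0.02, 0.05, 0.10, 0.30, 0.60, 0.90]
predicted_irr = [6.0, 7.0, 8.0, 5.0, -10.0, -40.0]

rows = portfolio_by_cutoff(predicted_dp, predicted_irr, [0.1, 0.5, 1.0])
for row in rows:
    print(row)
print(loosest_cutoff(rows, min_mean_irr=6.0))
```

In this toy portfolio, relaxing the cut-off from 0.1 to 0.5 admits one more loan at a small cost in mean return, whereas accepting every applicant turns the mean return negative.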

Conclusions
The objective of credit unions is to lend money at accessible rates to people who face credit restriction. Those people join the unions to borrow and to invest financial resources in a mutual financial help system. To sustain their operation, credit unions must be efficient in the credit granting decision, reducing the exposure to severe losses caused by default on credit obligations. Therefore, credit risk analysis is crucial to the survival of credit unions.
Subjective credit analysis is still very common in Brazilian credit unions. It brings both operational and moral risks to these cooperatives because loans may be granted to clients with high credit risks. Agency problems may arise, as the loan applicant is also a member of the cooperative; the agent that will make the credit granting decision is appointed by the board of directors, and the board of directors is elected by the members of the credit union.
Replacing subjective analysis with objective credit analysis using deterministic models will benefit credit unions. The credit decision will be based on the input variables and on clear criteria, making the decision-making process impartial. Defining a cut-off for credit quality sets the maximum acceptable level of risk for new transactions and therefore supports risk management. The choice of independent variables related to credit risk is a basic condition for the objective analysis to present satisfactory results.
According to the logistic regression model, the following variables are significant (p < 0.05) in discriminating "bad" and "good" clients: type of transaction, subjective credit risk rating, income, length of banking relationship, interest rates, history of defaults and transaction term. However, our analysis has limitations due to data constraints. The sample loan transactions were contracted and settled within a two-year period (from 2015 to 2016). Because vehicle-financing products have longer maturities, the analysis of the impact of this variable in the default prediction would demand a longer period of data collection.
The joint use of credit scoring and profit scoring allows granting credit for the clients with the highest potential to pay off debt commitments and, at the same time, to certify that the transaction profitability meets the goals of the organization: to be sustainable and to provide loans and investment opportunities at attractive rates to members.
We used logistic and random forest analysis (a machine-learning technique) to assess the credit scoring of clients. We used OLS and random forest analysis to assess the profit scoring of clients.
The values of the Gini and AUROC indicators suggest that the credit-scoring models have a much higher predictive quality than subjective analyses. This evidence supports the literature, according to which subjective credit analysis performs worse than objective credit analysis.
The random forest technique performs better than the logistic regression method for credit scoring. We expect that machine-learning techniques will improve the cooperative financial performance, as the losses avoided will not translate into additional costs for the organization.
A combination of profit scoring and credit scoring analysis enables us to assess the effect on the mean profitability of the loan portfolio and the risk of losses caused by accepting clients with worse credit quality. This is a useful tool for loan decision-making. The savings generated help to reduce funding rates by improving the supply of credit, thereby contributing to the objectives of the credit union.
For future research studies, we suggest using other popular machine-learning techniques, such as artificial neural networks and support vector machines, for modeling of credit scoring in credit unions. For profit scoring, we suggest profitability analysis using the risk-adjusted return on capital as the dependent variable. Regularization techniques, such as lasso and elastic net, can be used to eliminate coefficients that are mostly irrelevant to the explanation for the observed effects. We can use the survival analysis technique to assess whether profits or losses can be predicted in late payment events and in early settlement of loan agreements.
Figure 1. Representation of classification trees

Figure 2. Histogram of the internal rate of return
Figure 3. Descriptive analysis of the expected IRR as a function of the predicted default probability

Table I. Growth of credit unions in Brazil (R$bn). Source: Banco Central do Brasil

The random forests model has a higher explanatory power for the values observed (0.70) than does the linear regression model (0.29).