Credit scoring with a feature selection approach based deep learning

In financial risk, credit risk management is one of the most important issues in financial decision-making. Reliable credit scoring models are crucial for financial agencies to evaluate credit applications and have been widely studied in machine learning and statistics. Deep learning is a powerful classification tool that is currently an active research area and has successfully solved classification problems in many domains. Deep learning provides training stability, generalization, and scalability with big data, and it is quickly becoming the algorithm of choice for the highest predictive accuracy. Feature selection is the process of selecting a subset of relevant features, which can decrease the dimensionality, reduce the running time, and improve the accuracy of classifiers. In this study, we constructed a credit scoring model based on deep learning and feature selection to evaluate an applicant’s credit score from the applicant’s input features. Two public datasets, the Australian and German credit datasets, were used to test our method. The experimental results on these real-world data showed that the proposed method achieves a higher prediction rate than a baseline method on some datasets, and shows comparable and sometimes better performance than the feature selection methods widely used in credit scoring.


Introduction
The main purpose of credit risk analysis is to classify customers into two sets, good and bad ones [1]. Over the last decades, many classification models and algorithms have been applied to analyse credit risk, for example decision trees [2], k-nearest neighbour (K-NN), support vector machines (SVM) and neural networks [3][4][5][6][7]. One important goal in credit risk prediction is to build the best classification model for a specific dataset.
Financial data in general, and credit data in particular, usually contain irrelevant and redundant features. The redundancy and the deficiency in data can reduce the classification accuracy and lead to incorrect decisions [8][9]. In that case, a feature selection strategy is needed in order to filter out the redundant features. Feature selection is the process of selecting a subset of relevant features that is sufficient to describe the problem with high precision. Feature selection thus decreases the dimensionality of the problem and shortens the running time.
Credit scoring and internal customer rating is the process of assessing a customer's ability to meet financial obligations to a bank, such as paying interest or repaying the loan principal on the due date, or meeting other credit conditions, in order to evaluate and identify risks in the bank's credit activities. The degree of credit risk varies across individual customers and is identified through the evaluation process. It is based on the financial and non-financial data available about the customer at the time of credit scoring and customer rating.
Credit scoring is a technique that uses statistical analysis of customer data and activities to evaluate the credit risk of customers. The credit score is expressed as a number determined by the bank based on statistical analysis by credit experts, credit teams or credit bureaus. In Vietnam, some commercial banks have started to apply credit scoring to customers, but it is still in a testing phase, is not widely applied, and needs gradual improvement. For completeness, we note that all information presented in this paper comes from credit scoring experience in Australia, Germany and other countries.
Many methods have been investigated in the last decade to pursue even small improvements in credit scoring accuracy. Artificial Neural Networks (ANNs) [10][11][12][13] and Support Vector Machines (SVM) [14][15][16][17][18][19] are two soft computing methods commonly used in credit scoring modelling. Recently, other methods such as evolutionary algorithms [20], stochastic optimization techniques and support vector machines [21] have shown promising results in terms of prediction accuracy.
In this study, a new method for feature selection based on various criteria is proposed and integrated with a deep learning classifier in credit scoring tasks.
The rest of the paper is organized as follows: Section 2 presents the background of credit scoring, deep learning and feature selection. Section 3 describes the details of the proposed model. Experimental results are discussed in Section 4, while concluding remarks and future work are presented in Section 5.

Feature selection
Feature selection is a basic step in data preprocessing because it reduces the dimensionality of the data. It can be embedded in a method that needs to focus only on relevant features, such as PCA or a modeling algorithm. However, feature selection is usually a separate step in the overall data mining process.
There are two different categories of feature selection methods, i.e. the filter approach and the wrapper approach. The filter approach treats feature selection as a precursor stage to the learning algorithm. The filter model uses evaluation functions to evaluate the classification performance of subsets of features. There are many evaluation functions, such as feature importance, the Gini index, information gain and the information gain ratio. A disadvantage of this approach is that the feature selection process is independent of the performance of the learning algorithm.
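As an illustration of a filter-style evaluation function, the following minimal sketch ranks categorical features by information gain. It is not the implementation used in this paper; the function names and the toy data (`phone`, `housing`) are hypothetical and chosen only for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting the labels on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(columns, labels):
    """Return feature names sorted by decreasing information gain."""
    gains = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Toy data (hypothetical): 'housing' separates the classes, 'phone' does not.
labels = ["good", "good", "bad", "bad"]
columns = {
    "phone": ["yes", "no", "yes", "no"],
    "housing": ["own", "own", "rent", "rent"],
}
ranking = rank_features(columns, labels)
```

The ranking is computed from the data alone, without training any classifier, which is what distinguishes a filter from a wrapper.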
The wrapper approach uses a machine-learning algorithm to measure the goodness of the set of selected features. The measurement relies on the performance of the learning algorithm, such as its accuracy, recall and precision values. In methods using the wrapper model, all samples are divided into two sets, i.e. a training set and a testing set. The algorithm runs on the training set, and the learned model is then applied to the testing set to measure the prediction accuracy. The disadvantage of this approach is its high computational cost. Some researchers have proposed methods that speed up the evaluation process to decrease this cost. Common wrapper strategies are Sequential Forward Selection (SFS) and Sequential Backward Elimination (SBE). The optimal feature set is found by searching the feature space. In this space, each state represents a feature subset, and the size of the search space for n features is O(2^n), so it is impractical to search the whole space exhaustively unless n is small.
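The greedy search behind Sequential Forward Selection can be sketched as follows. This is a generic illustration, not the procedure proposed in this paper: the `evaluate` callback stands in for training and testing a classifier, and `toy_score` is a hypothetical scoring function used only to make the example runnable.

```python
def sequential_forward_selection(features, evaluate, max_features=None):
    """Greedy wrapper search: add the feature that most improves the score.

    `evaluate` is a caller-supplied function mapping a list of feature names
    to a score (for example, cross-validated accuracy of some classifier).
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    limit = max_features or len(remaining)
    while remaining and len(selected) < limit:
        # Score every candidate subset formed by adding one more feature.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no candidate improves the current subset
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical scoring function standing in for a trained classifier:
# features "a" and "c" are useful, every extra feature costs a small penalty.
def toy_score(subset):
    useful = {"a": 0.30, "c": 0.20}
    return 0.5 + sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)


subset, score = sequential_forward_selection(["a", "b", "c", "d"], toy_score)
```

For n features this search evaluates at most n(n+1)/2 subsets instead of all 2^n, at the price of possibly missing the globally optimal subset.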

Deep Learning
Deep learning (also called deep machine learning or deep structured learning) attempts to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations. There are several theoretical frameworks for Deep Learning, but this research focuses primarily on the feed-forward architecture used by H2O. The basic unit in the model is the neuron, a biologically inspired model of the human neuron. In humans, the varying strengths of the neurons' output signals travel along the synaptic junctions and are then aggregated as input for a connected neuron's activation. In the model, the weighted combination $\alpha = \sum_{i=1}^{n} w_i x_i + b$ of input signals $x_i$ with weights $w_i$ is aggregated, and then an output signal $f(\alpha)$ is transmitted by the connected neuron. The function $f$ represents the nonlinear activation function used throughout the network, and the bias $b$ represents the neuron's activation threshold. Multi-layer, feed-forward neural networks consist of many layers of interconnected neuron units, starting with an input layer to match the feature space, followed by multiple layers of nonlinearity, and ending with a linear regression or classification layer to match the output space.
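The neuron model and the layered feed-forward structure described above can be sketched in a few lines. This is only an illustration and does not reproduce H2O's implementation; the choice of tanh as the activation function and all weights below are assumptions made for the example.

```python
import math


def neuron_output(inputs, weights, bias, activation=math.tanh):
    """Compute f(alpha) where alpha = sum_i w_i * x_i + b."""
    alpha = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(alpha)


def layer_output(inputs, weight_rows, biases, activation=math.tanh):
    """Apply one fully connected layer: one neuron per row of weights."""
    return [neuron_output(inputs, row, b, activation) for row, b in zip(weight_rows, biases)]


def feed_forward(inputs, layers):
    """Propagate inputs through the hidden layers, then a linear output layer."""
    *hidden, (out_weights, out_biases) = layers
    signal = inputs
    for weight_rows, biases in hidden:
        signal = layer_output(signal, weight_rows, biases)
    return layer_output(signal, out_weights, out_biases, activation=lambda a: a)


# Hypothetical weights for a 2-input network with one hidden layer of 2 units.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer (tanh)
    ([[1.0, 1.0]], [0.0]),                    # linear output layer
]
output = feed_forward([1.0, 1.0], layers)
```

Each hidden layer applies the same nonlinearity to its weighted sums, and only the final layer is kept linear, matching the structure described in the text.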

The Proposed Method
Our method uses Deep Learning to estimate the performance, consisting of the cross-validation accuracy and the importance of each feature in the training data set. In a multi-node system, this parallelization scheme works on top of H2O's distributed setup, where the training data is distributed across the cluster and each node operates in parallel on its local data. After that, we determine the best feature set by choosing the one with the best average score plus median score and the lowest standard deviation (SD). To deal with the overfitting problem, we apply the n-fold cross-validation technique to minimize the generalization error.
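The n-fold cross-validation mentioned above can be sketched as follows. This is a generic illustration rather than the H2O routine used in the experiments; `train_and_score` is a hypothetical callback that would fit a model on the training indices and return its validation accuracy.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split sample indices into k disjoint folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, k, train_and_score, seed=0):
    """Return the validation score of each fold.

    `train_and_score(train_idx, valid_idx)` is a caller-supplied function that
    fits a model on the training indices and returns its validation accuracy.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, valid_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(train_and_score(train_idx, valid_idx))
    return scores


# Hypothetical stand-in for model training: report the validation share.
folds = k_fold_indices(1000, 5)
scores = cross_validate(1000, 5, lambda tr, va: len(va) / (len(tr) + len(va)))
```

Because every sample is used for validation exactly once, the averaged score is a less optimistic estimate of generalization than the training accuracy.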

Step 1: Train on the data with Random Forest over 20 trials; calculate and sort the median of the variable importance.
Step 2: Add each feature with the best variable importance and train on the data with Deep Learning using cross-validation.
Step 3: Calculate the score $F_i^{score}$ for each feature, where i = 1..n (n is the number of features in the current loop).
Step 4: Select the best feature using the selection rules.
Step 5: Go back to Step 1 until the desired criteria are reached.

In Step 2, we use deep learning with n-fold cross-validation to train the classifier. In the j-th cross-validation, we obtain a set $(F_j, A_j^{learn}, A_j^{validation})$ consisting of the feature importance, the learning accuracy and the validation accuracy, respectively. We use those values to compute the score criterion in Step 3.
In Step 3, we use the results from Step 1 and Step 2 to build the score criterion, which is used in Step 4. The score of the i-th feature is calculated by Eq. (1). The main part of our algorithm is Step 4. In this step, we select the best features using the following rules: the best average plus median score and the lowest standard deviation (SD).

Rule 1: select features with the best median score
Rule 2: select features with the best average score
Rule 3: select features with the lowest SD
These rules are intended to yield the highest accuracy together with the lowest standard deviation. The proposed method tends to find the smallest optimal set of features, in order to reduce the number of output features as much as possible. Machine-learning algorithms are then used to calculate the relevance of each feature. Based on the calculated relevance, we find the subset with the fewest features that still meets the objective of the problem.
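One possible reading of the three selection rules is sketched below. The text does not specify how the rules are combined, so the combination used here (rank by average plus median, break ties by the lowest SD) is an assumption, and the per-trial accuracies are hypothetical values chosen for illustration.

```python
import statistics


def summarize(scores):
    """Mean, median and sample standard deviation of accuracies over trials."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "sd": statistics.stdev(scores),
    }


def select_best(candidates):
    """Pick the candidate with the highest mean + median; break ties by lowest SD.

    `candidates` maps a feature name to its list of per-trial accuracies.
    """
    stats = {name: summarize(scores) for name, scores in candidates.items()}
    return max(
        stats,
        key=lambda name: (stats[name]["mean"] + stats[name]["median"], -stats[name]["sd"]),
    )


# Hypothetical per-trial accuracies for three candidate features.
candidates = {
    "f1": [0.70, 0.74, 0.72],
    "f2": [0.73, 0.75, 0.74],
    "f3": [0.60, 0.90, 0.72],
}
best = select_best(candidates)
```

In this toy example `f3` has the same average as `f2` but a lower median and a much larger spread, so `f2` is preferred.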

Experiment and results
Our proposed algorithm was coded in the R language (http://www.r-project.org), using the H2O Deep Learning package. This package is optimized for "in memory" processing of distributed, parallel machine learning algorithms on clusters. A "cluster" is a software construct that can be fired up on a laptop, on a server, or across the multiple nodes of a cluster of real machines, including computers that form a Hadoop cluster. We tested the proposed algorithm on several datasets, including two public datasets, the German and Australian credit approval datasets, to validate our approach. In this paper, we used Random Forest on the original dataset as the baseline method. The proposed method and the baseline method were executed on the same training and testing datasets to compare their efficiency. The experiments were repeated 20 times to test the consistency of the obtained results.
The German credit approval dataset consists of 1000 loan applications, with 700 accepted and 300 rejected. Each applicant is described by 20 attributes. Our final results were averaged over these 20 independent trials (Fig. 1).
In our experiments, we used the default value for the hidden parameter, and the number of epochs was set to 10. The average classification results are depicted in Fig. 1.
The best subset contains 19 features and its accuracy is 74.68%. Table 1 shows the performances of different classifiers over the German credit dataset. Baseline is the classifier without feature selection. Classifiers used in [22] include: Linear SVM, CART, k-NN, Naïve Bayes, MLP. Filter methods include: t-test, Linear Discriminant Analysis (LDA), Logistic Regression (LR). The wrapper methods include: Genetic Algorithms (GA) and Particle Swarm Optimization (PSO). Comparing the performances of the various methods in Table 1, we saw that the accuracy of deep learning on the subset of newly selected features increases, and the number of features has been reduced by 21%. The average accuracy is 73.4% on the original data. After applying feature selection, the average accuracy increases to 74.68%.
Moreover, relying on a parallel processing strategy, our method takes only 5286 seconds (~88 minutes) to run 20 trials with 5-fold cross-validation, while other methods must run for several hours. This result highlights the running-time efficiency of our method when filtering the redundant features.

Australian credit approval dataset
The Australian credit dataset is composed of 690 applicants, with 383 creditworthy and 307 default examples. Each instance contains eight numerical features, six categorical features, and one discriminant feature, with sensitive information transformed into symbolic data for confidentiality reasons. The average classification results are depicted in Fig. 2. Table 2 shows that the accuracy of Deep Learning increases on a subset of 7 selected features. The average accuracy is 85.82% on the original data. After applying feature selection, the average accuracy increases to 86.24%. Relying on parallel processing, our method takes only 2769 seconds (~46 minutes) to run 20 trials with 5-fold cross-validation.

Conclusion
In this paper, we focused on studying feature selection and the Deep Learning method. Feature selection involves determining the subset with the highest classifier accuracy, or seeking the smallest subset of features with acceptable accuracy. We have introduced a new feature selection approach based on feature scoring. The accuracy of the classifier using the selected features is higher than that of the baseline and comparable to, and sometimes better than, that of other methods. Fewer features allow a credit department to concentrate on collecting relevant and essential variables. The parallel processing procedure leads to a significant reduction in runtime. As a result, the workload of credit evaluation personnel can be reduced, as they do not have to take into account a large number of features during the evaluation procedure, which will also be somewhat less computationally intensive. The experimental results show that our method is effective in credit risk analysis: it makes the evaluation quicker and increases the accuracy of the classification.

Figure 1. Accuracy in case of German dataset.

Figure 2. Accuracy in case of Australian dataset.

Multi-layer neural networks can be used to accomplish Deep Learning tasks. Deep Learning architectures are models of hierarchical feature extraction, typically involving multiple levels of nonlinearity. Deep Learning models are able to learn useful representations of raw data and have exhibited high performance on complex data such as images, speech, and text. The procedure to minimize the loss function L(W,B | j) is a parallelized version of stochastic gradient descent (SGD). In standard SGD, the gradient ∇L(W,B | j) is computed via backpropagation.
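The standard, non-parallel SGD update can be illustrated on the simplest possible model, a single linear unit with squared-error loss, where the gradient has a closed form and no backpropagation through hidden layers is needed. This is a minimal sketch, not H2O's parallelized procedure; the learning rate, the number of epochs and the toy data are assumptions made for the example.

```python
import random


def sgd_linear(samples, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on squared error.

    For one training example j the loss is L(w, b | j) = 0.5 * (w*x_j + b - y_j)**2,
    so dL/dw = error * x_j and dL/db = error.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in random order each epoch
        for x, y in data:
            error = (w * x + b) - y
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
samples = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = sgd_linear(samples)
```

The parameters are updated after every single example rather than after a full pass over the data, which is the property that makes the procedure amenable to the distributed, per-node training described earlier.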

Table 1. Performances of different classifiers over the German credit dataset

Table 2. Performances of different classifiers over the Australian credit dataset