Research on Credit Card Default Prediction Based on k-Means SMOTE and BP Neural Network

To address the problem that imbalanced credit card default data from a financial institution lead to unsatisfactory prediction results, this paper proposes a prediction model based on k-means SMOTE and a BP neural network. In this model, the k-means SMOTE algorithm is first used to change the data distribution; the importance of the data features is then calculated with a random forest and used as the initial weights of the BP neural network for prediction. The model effectively alleviates the problem of sample imbalance. This paper also constructs five common machine learning models (KNN, logistic regression, SVM, random forest, and decision tree) and compares the classification performance of the six prediction models. The experimental results show that the proposed algorithm greatly improves the prediction performance of the model, raising its AUC value from 0.765 to 0.929. Moreover, when the feature importance is taken as the initial weight of the BP neural network, the prediction accuracy of the model is also slightly improved. In addition, compared with the other five prediction models, the overall prediction performance of the BP neural network is better.


Introduction
Recently, the state has vigorously promoted the economic development of large and medium-sized cities, which has not only improved people's living standards but also changed their consumption attitudes and habits. People are increasingly inclined to spend ahead of time, pledging their "credit" to the bank in order to enjoy certain things in advance. However, when consuming, people often lack rational judgment and overestimate their ability to repay bank loans on time. On the one hand, this increases the loan risk of banks; on the other hand, it increases the credit risk of consumers themselves [1]. With a large number of banks issuing credit cards, credit card defaults keep emerging. It is therefore very important for banks to effectively identify credit card users with a high risk of default. Generally speaking, customers with overdue repayments are far fewer than customers who repay on time [2,3]. Predicting whether a repayment is overdue or not is a binary classification problem in machine learning. In binary classification, the minority class is called the positive class (default), and the majority class is called the negative class (nondefault). As a result, most credit card loan data are imbalanced. In view of this situation, scholars at home and abroad have carried out a great deal of research. Khoshgoftaar et al. [4] proposed an evolutionary sampling method for imbalanced data, which uses a genetic algorithm to selectively delete majority-class samples and retain samples that carry a lot of feature information. Compared with other existing data sampling techniques, evolutionary sampling performs better and is more conducive to empirical replication. The FN undersampling method used by Zhao et al. [5] regarded the minority class as a cluster, which was divided into multiple regions.
They then calculated the distance from the negative class samples to the sample mean point in each region and retained only one sample point per region. Finally, the remaining negative class samples were used as the new negative class and, together with the original positive class samples, were used for training and analysis. Zan et al. [6] used a generative adversarial network (GAN) to synthesize minority samples and balance the data, then used AdaBoost to change the weights of the input samples, and established a prediction model based on a decision tree classifier. To a certain extent, the recognition rate on imbalanced data was improved. Hu et al. [7] used improved oversampling and undersampling techniques to address data imbalance, synthesizing new samples by assigning higher weights to adjacent minority samples through a weight vector. By undersampling the majority-class samples according to a Euclidean distance criterion while keeping the sample number constant during resampling, they found that this combined method was superior to using a single data sampling technique. Han et al. [8] used an improved version of the SMOTE algorithm, borderline-SMOTE, which also synthesizes new samples from minority samples. Whereas the original SMOTE algorithm synthesizes samples between a minority sample and its k nearest minority neighbors, borderline-SMOTE identifies the minority samples near the class boundary and synthesizes new samples only from them. Wang et al. [9] constructed a deep learning prediction model for imbalanced data. The model introduces a new loss function on top of the original neural network. This method does not need to balance the data in advance: predictive analysis can be performed directly, and the classification error on positive and negative examples is effectively reduced. Jiao et al. [10] proposed a reinforcement learning cumulative reward mechanism to improve attribute selection in classification and regression trees, so as to improve the model's prediction probability for minority samples.
The problem of class imbalance is thus mainly addressed from two perspectives. The first perspective is to balance the data by changing the number of samples. This can be done in three ways: improving the oversampling method, changing the data distribution by undersampling, or combining oversampling and undersampling. The second perspective is to improve the classifier algorithm so as to improve the prediction performance of the model, while using suitable evaluation indicators to assess the prediction results. Since undersampling loses information, oversampling is the most widely used technique, and SMOTE is its most common method. However, we have found that most improved versions of SMOTE cannot reduce the imbalance between classes and the imbalance within classes at the same time, and the applicability of improved classifiers is also limited. Therefore, this paper uses an improved version of the SMOTE algorithm with better applicability, which is combined with the k-means algorithm.
This method clusters all samples using the k-means unsupervised learning algorithm, finds the clusters with a high proportion of minority samples, and then applies SMOTE within those clusters to synthesize new samples and change the data distribution. It reduces not only the imbalance between classes but also the imbalance within classes. The balanced data are then fed to a BP neural network to predict credit card default and help the bank identify credit card risks effectively.

Basic Theory
The main idea of principal component analysis (PCA) is to transform the n-dimensional feature variables, through a rotation of the coordinate axes and a shift of the origin, into new m-dimensional features (usually m is less than n) [11]. These m features are called principal components. In essence, PCA replaces a set of correlated sample features with newly generated composite features that are uncorrelated with each other. When analyzing the data, a threshold on the cumulative variance ratio can be set in advance to determine m. The working steps of PCA are as follows. The first step is to standardize the original samples; this step is performed automatically by the data analysis software. The second step is to determine the correlation between the sample features by calculating the correlation coefficient matrix. The third step is to calculate the eigenvalues and their corresponding eigenvectors, determine the number m of principal components after dimensionality reduction, and combine the features along these eigenvectors to obtain each principal component. The fourth step is to determine the comprehensive evaluation index: calculate the information contribution rate of each eigenvalue (principal component) and weight the principal components by these rates to obtain the final evaluation value.
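The PCA steps above can be sketched as follows. This is a minimal illustration using NumPy, not the software used in the experiment; the function name `pca_by_cumulative_variance`, the 0.95 threshold, and the toy data are assumptions made for the example.

```python
import numpy as np

def pca_by_cumulative_variance(X, threshold=0.95):
    """Keep the fewest principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    # Step 1: standardize each feature to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: correlation coefficient matrix of the features.
    R = np.corrcoef(Z, rowvar=False)
    # Step 3: eigenvalues and eigenvectors, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 4: contribution rate of each component and its cumulative sum.
    ratio = eigvals / eigvals.sum()
    m = int(np.searchsorted(np.cumsum(ratio), threshold)) + 1
    m = min(m, len(ratio))
    return Z @ eigvecs[:, :m], ratio[:m]

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Toy data: four features, two of which are noisy copies of the other two.
X = np.hstack([base, base + 0.05 * rng.normal(size=(200, 2))])
components, ratio = pca_by_cumulative_variance(X, threshold=0.95)
print(components.shape, ratio.round(3))
```

Because two of the four toy features are nearly redundant, two components are enough to pass the threshold, and the resulting components are uncorrelated with each other.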

Feature Importance Calculation of Random Forest.
Random forest is a basic machine learning algorithm that is widely used in predictive analysis [12], data labeling [13], tag ranking [14], feature importance calculation [15], and other fields. The principle of the algorithm is as follows: the bootstrap method is used to randomly construct n decision trees, each decision tree is split and pruned, and the trees are finally combined into a random forest. In this paper, random forest is used to calculate feature importance, which is then used as the initial weight of the BP neural network. The basic steps are as follows. The first step is to calculate, for each decision tree, the out-of-bag error (error1) on the samples that were not selected when drawing the bootstrap sample for that tree (the out-of-bag data). The second step is to randomly add noise to the feature under evaluation in all out-of-bag samples and then calculate the error again, recorded as error2. The third step is to calculate the importance of the feature as $\frac{1}{n}\sum_{i=1}^{n}(\mathrm{error2}_i - \mathrm{error1}_i)$, where n is the number of decision trees constructed.
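The third step can be illustrated with a short calculation. The sketch below only averages the per-tree error increases; the function name and the three pairs of error rates are hypothetical values chosen for the example, not results from this experiment.

```python
def oob_feature_importance(error1, error2):
    """Mean increase in out-of-bag error over the n trees for one feature."""
    if len(error1) != len(error2) or not error1:
        raise ValueError("need one (error1, error2) pair per tree")
    n = len(error1)
    return sum(e2 - e1 for e1, e2 in zip(error1, error2)) / n

# Hypothetical out-of-bag error rates of three trees,
# before (error1) and after (error2) perturbing one feature.
error1 = [0.20, 0.25, 0.15]
error2 = [0.30, 0.25, 0.30]
importance = oob_feature_importance(error1, error2)
print(round(importance, 4))
```

A feature whose perturbation barely changes the out-of-bag error receives an importance close to zero, whereas a feature the trees rely on receives a larger value.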

BP Neural Network.
The prediction model used in this paper is the BP neural network, a feedforward neural network whose weights are updated by backpropagating the error. It is often used in bank risk analysis [16], geological disaster monitoring [17], image and handwritten digit recognition [18,19], and other fields. A BP neural network consists of three parts: the input layer, the hidden (middle) layer, and the output layer. In the model, the data samples enter the input layer, are combined with different weights, pass through the hidden layer, and finally produce the result at the output layer. Different weights and activation functions lead to very different model outputs. In this experiment, the following steps were taken. The first step is to assign and initialize the parameters. This paper takes the feature importance calculated by the random forest as the weight of each input $X_i$ and sets the same value for all the weights that connect one input variable to the hidden-layer nodes. The numbers of nodes in the input layer, hidden layer, and output layer are also determined in this step. The second step is to calculate the output of the hidden layer, $Z_j$. The third step is to calculate the output of the output layer, $Y_k$; $a_j$ and $b_k$ in the second and third steps are the biases of the hidden and output layers, respectively. The fourth step is to calculate the error E from $y_k$, the expected output value, and $Y_k$, the actual output value. The fifth step is to update the weights and biases backward, from the output layer to the input layer.
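The five steps can be sketched for a single training sample as follows. Since the exact formulas are not reproduced here, the sketch assumes a sigmoid activation in both layers, biases that are added to the weighted sums, and a squared error $E = \frac{1}{2}\sum_k (y_k - Y_k)^2$; the layer sizes, learning rate, and variable names are illustrative only.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, W, a, V, b):
    """Steps 2 and 3: hidden output Z and network output Y."""
    Z = sigmoid(x @ W + a)   # a is the hidden-layer bias
    Y = sigmoid(Z @ V + b)   # b is the output-layer bias
    return Z, Y

def train_step(x, y, W, a, V, b, lr=0.5):
    """Steps 4 and 5 for one sample; returns the error before the update."""
    Z, Y = forward(x, W, a, V, b)
    E = 0.5 * np.sum((y - Y) ** 2)
    delta_out = (Y - y) * Y * (1.0 - Y)            # gradient at the output pre-activation
    delta_hid = (delta_out @ V.T) * Z * (1.0 - Z)  # error propagated back to the hidden layer
    # In-place updates so the caller's arrays are modified.
    V -= lr * np.outer(Z, delta_out)
    b -= lr * delta_out
    W -= lr * np.outer(x, delta_hid)
    a -= lr * delta_hid
    return E

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])           # toy input with 3 features
y = np.array([1.0, 0.0])                 # expected output for 2 output nodes
W = rng.normal(scale=0.1, size=(3, 4))   # step 1: input-to-hidden weights
a = np.zeros(4)
V = rng.normal(scale=0.1, size=(4, 2))   # step 1: hidden-to-output weights
b = np.zeros(2)

errors = [train_step(x, y, W, a, V, b) for _ in range(200)]
print(errors[0], errors[-1])
```

Repeating the update drives the error on this sample down, which is the behaviour the backward update in the fifth step is meant to produce.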

k-Means SMOTE Algorithm
SMOTE is a method for synthesizing new samples to address data imbalance, proposed by Chawla et al. [20], and is widely used in various fields. SMOTE improves on random oversampling: instead of simply repeating randomly chosen original samples, it generates new artificial samples by a formula. However, SMOTE does not address the imbalance within a class and can even aggravate it to a certain extent. Therefore, to address the class imbalance of the credit card samples, this paper uses an improved SMOTE algorithm called k-means SMOTE.
This algorithm reduces the imbalance between classes on the one hand and the imbalance within classes on the other. In this experiment, we first cluster all 30,000 samples with the k-means method, then filter the clusters to keep those with a high proportion of minority samples, and finally perform SMOTE oversampling within the retained clusters. The detailed steps of the k-means SMOTE algorithm are as follows. The first step is to randomly select k points among all samples $D = \{x_1, x_2, x_3, \ldots, x_{30000}\}$ and use them as the initial cluster centers $C_1, C_2, C_3, \ldots, C_k$. The second step is to calculate the distance from each sample to each cluster center. The third step is to assign each sample to the closest cluster. The fourth step is to recalculate each cluster center as the mean of the samples assigned to it. The fifth step is to repeat the second, third, and fourth steps until the cluster centers no longer change. The sixth step is to filter out the clusters with few minority samples and select the clusters with more minority samples for synthesizing new minority samples. The seventh step is to perform SMOTE oversampling in each filtered cluster: $x_{new} = x + \mathrm{rand}(0, 1) \times (x_c - x)$, where rand(0, 1) is a random number between 0 and 1, $x_{new}$ is a newly synthesized minority class sample, x is a minority class sample in the filtered cluster, and $x_c$ is a minority class sample randomly selected from the m nearest neighbors of x within the cluster. The k-means SMOTE algorithm flow is shown in Figure 1.
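The seven steps can be sketched in NumPy as follows. This is a simplified illustration, not the implementation used in the experiment: it assumes Euclidean distance, keeps a cluster only when more than half of its samples belong to the minority class, and spreads the synthetic samples evenly over the retained clusters. The function names, the filter threshold, and the toy data are all assumptions.

```python
import numpy as np

def kmeans(X, k, rng, max_iter=100):
    """Steps 1 to 5: plain k-means; returns cluster labels and centers."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Euclidean distance from every sample to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def kmeans_smote(X, y, k, n_new, minority=1, m=3, seed=0):
    """Synthesize n_new minority samples inside minority-dominated clusters."""
    rng = np.random.default_rng(seed)
    labels, _ = kmeans(X, k, rng)
    # Step 6: keep clusters dominated by the minority class that contain
    # at least two minority samples to interpolate between.
    pools = []
    for j in range(k):
        in_cluster = labels == j
        pts = X[in_cluster & (y == minority)]
        if len(pts) >= 2 and len(pts) > in_cluster.sum() / 2:
            pools.append(pts)
    if not pools:
        raise ValueError("no cluster passed the filter")
    # Step 7: SMOTE interpolation inside the retained clusters.
    synthetic = []
    for i in range(n_new):
        pts = pools[i % len(pools)]
        x = pts[rng.integers(len(pts))]
        d = np.linalg.norm(pts - x, axis=1)
        neighbours = np.argsort(d)[1:m + 1]   # skip the zero-distance entry
        x_c = pts[rng.choice(neighbours)]
        synthetic.append(x + rng.random() * (x_c - x))
    return np.array(synthetic)

rng = np.random.default_rng(1)
maj_pts = rng.normal(loc=0.0, scale=1.0, size=(90, 2))   # toy majority class
min_pts = rng.normal(loc=10.0, scale=0.5, size=(10, 2))  # toy minority class
X = np.vstack([maj_pts, min_pts])
y = np.array([0] * 90 + [1] * 10)
new_samples = kmeans_smote(X, y, k=3, n_new=80)
print(new_samples.shape)
```

Because every synthetic point lies on a segment between two minority samples of the same cluster, the new samples stay inside the region already occupied by the minority class instead of spreading into majority regions.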

Preliminary Analysis of Data.
This paper uses data on credit card usage from the Kaggle website (https://www.kaggle.com/uciml/default-of-credit-card-clientsdataset). The sample size is 30,000, of which 6,636 samples are in the positive class (default) and 23,364 in the negative class (no default). The data set has a total of 25 variables. Since the variable ID has no relationship with the target variable, it was deleted, leaving 23 feature variables and 1 target variable. The variables are shown in Table 1. Each of these 23 features was processed accordingly. For the feature limit_bal, we draw a density plot by default type; the result is shown in Figure 2.
It can be seen from Figure 2 that when the credit limit is below approximately 150,000, the density of default is greater than that of nondefault. This suggests that defaulters are relatively more common when the credit limit is low. For the feature age, we also performed a visual analysis, as shown in Figure 3. Figure 3 shows that the probability of nondefault is higher for ages between approximately 25 and 40, which indicates that consumers in this age group are more capable of repaying credit card loans. This may be because their work and family life tend to be stable, without too much pressure. For the feature sex, we draw a stacked histogram by the target variable, as shown in Figure 4.
As shown in Figure 4, for both males and females the proportion of defaulting consumers is relatively low, which is in line with the general situation. Default data such as credit card fraud data are usually imbalanced, so the model needs to be adjusted according to the actual situation. For the feature education, we find that it has six attribute values and that the meanings of the values 5 and 6 are unknown. To avoid the "curse of dimensionality" when processing the data, we merge them into one category (unknown) and draw a stacked histogram to visualize this feature, as shown in Figure 5.
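The merging of the two unknown education values can be sketched as below. The helper name and the choice of 5 as the merged code are assumptions for illustration; any single code for the unknown category would serve the same purpose.

```python
def merge_unknown_education(values, unknown_codes=(5, 6), merged_code=5):
    """Map every code with unknown meaning to one shared 'unknown' code."""
    return [merged_code if v in unknown_codes else v for v in values]

education = [1, 2, 5, 3, 6, 2, 6]   # toy column, not real records
merged = merge_unknown_education(education)
print(merged)
```

After merging, the feature has one category fewer, so encoding it later produces one input column fewer.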
For the feature marriage, we draw the same kind of graph as for the features sex and education. The default and nondefault counts for this feature are shown in Figure 6. It can be seen from the above three figures that the sample set is imbalanced across the attribute values of the three features sex, education, and marriage. For the payment status feature series, we draw a stacked histogram for each month, and the results are shown in Figure 7.
It can be seen from Figure 7 that consumers who delay payment by one month or less rarely default. In May, August, and September, consumers who delayed payment for more than 2 months were more likely to default, which increases the loan risk of financial institutions. For the feature series BillAMT and PayAMT, we perform the corresponding analysis and draw line graphs to visualize them, as shown in Figures 8 and 9.
As shown in Figures 8 and 9, due to the imbalance of the data, the default line occupies only the front part of each plot. Figure 8 shows the bill amount, and Figure 9 shows the amount of the previous payment. Comparing the two figures, we find that the six subplots in Figure 9 fluctuate more and span a greater range than the six subplots in Figure 8. Moreover, the uncertainty of the previous payment amount makes it more difficult for banks to adjust the credit card limit.
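The degree of imbalance follows directly from the class counts reported at the start of this subsection. The short calculation below uses only those counts; the last line assumes, for illustration, that the classes would be balanced exactly one to one.

```python
n_default, n_nondefault = 6636, 23364

total = n_default + n_nondefault
minority_share = n_default / total            # fraction of default samples
imbalance_ratio = n_nondefault / n_default    # negatives per positive
n_synthetic = n_nondefault - n_default        # samples needed for a 1:1 balance

print(total, round(minority_share, 4), round(imbalance_ratio, 2), n_synthetic)
```

Roughly one sample in five is a default, that is, about 3.5 nondefault samples for every default sample.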

Data Processing and Feature Importance.
In this experiment, there are a total of 23 features and 1 target variable. After encoding and data cleaning, the 23 features become 89 input variables. This is a heavy computational load for the model and is not conducive to the prediction results. To allow comparison with other models, this paper uses PCA for dimensionality reduction, which yields 27 input variables, then uses random forest to calculate the importance of these 27 variables, and uses the results as the initial weights of the BP neural network. The calculated feature importance is shown in Table 2.
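The growth from 23 features to 89 input variables is typical of encoding categorical features as indicator columns. The sketch below assumes one-hot encoding, which is not stated explicitly in the text; the function name and the toy column are illustrative.

```python
def one_hot(column, name):
    """Expand one categorical column into 0/1 indicator columns."""
    categories = sorted(set(column))
    header = [f"{name}_{c}" for c in categories]
    rows = [[1 if value == c else 0 for c in categories] for value in column]
    return header, rows

header, rows = one_hot([1, 2, 2, 3, 1], "marriage")
print(header)
print(rows)
```

Each categorical feature contributes as many columns as it has distinct values, which is why the number of input variables grows quickly and dimensionality reduction becomes useful.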

Model Evaluation Method.
For imbalanced data, evaluation indexes designed for imbalanced data should normally be used [21]. However, because the numbers of positive and negative samples were balanced at the beginning of the experiment, we still use the common binary classification evaluation indicators: confusion matrix, recall, precision, F1-score, AUC value, and so on.
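These indicators can be computed from the true labels, the predicted labels, and the predicted scores. The sketch below uses plain Python and a toy example; the 0.5 decision threshold and the pairwise definition of AUC (the probability that a random positive is scored above a random negative) are choices made for the illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

def precision_recall_f1(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def auc(y_true, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(confusion_counts(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred))
print(auc(y_true, scores))
```

Precision, recall, and F1 depend on the chosen threshold, whereas AUC summarizes the ranking quality over all thresholds.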

BP Neural Network Prediction Model.
This paper constructs a BP neural network prediction model based on the credit card default data. The model has 27 input variables, 55 neurons in the hidden layer, and 2 output neurons; its structure is shown in Figure 10. We use the 27 features obtained by PCA dimensionality reduction as the input variables $X_1, X_2, \ldots, X_{27}$ and use the feature importance calculated by the random forest as the initial weights of the BP neural network. For example, the weight matrix W of the hidden layer, given in (8), has 27 rows and 55 columns: the 27 rows correspond to the input variables, and the 55 columns correspond to the hidden-layer neurons. In this experiment, we set every entry of a row to the importance of the corresponding feature and substitute the resulting matrix into the model for prediction. When the weights are initialized in the default way, the prediction accuracy of the model is 0.8796; when the feature importance is assigned to the weights, the prediction accuracy is 0.8811. Numerically, the accuracy in the second case is slightly higher. The model is a three-layer BP neural network: the input layer has 27 neurons, the hidden layer has 55 neurons, and the output layer has 2 neurons. The number of hidden-layer neurons is calculated using the empirical formula $n = 2 \times n_1 + 1$, where $n_1$ is the number of input-layer neurons.
Apart from the initial weights and the number of neurons of the hidden layer, which were set as described above, all other parameters keep their default values.
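The construction of the initial hidden-layer weight matrix can be sketched as follows. The importance values below are random placeholders rather than the values of Table 2, and the function name is an assumption; only the 27 by 55 shape and the rule that each row repeats one feature importance come from the text.

```python
import numpy as np

def importance_weight_matrix(importance, n_hidden):
    """Build W so that every entry in row i equals importance[i]."""
    importance = np.asarray(importance, dtype=float)
    return np.tile(importance[:, None], (1, n_hidden))

n_inputs, n_hidden = 27, 55          # layer sizes reported in the text

rng = np.random.default_rng(0)
importance = rng.random(n_inputs)    # placeholder importances
importance /= importance.sum()       # normalized to sum to 1

W = importance_weight_matrix(importance, n_hidden)
print(W.shape)
```

With this initialization all hidden neurons start with identical input weights; they can still diverge during training as long as the hidden-to-output weights are not all identical, because the backpropagated error then differs from neuron to neuron.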
Because the experimental data are imbalanced, we use the k-means SMOTE algorithm to address this problem. The parameter k of the k-means SMOTE algorithm is calculated with the empirical formula $k \approx \sqrt{N/2}$, where N is the total number of samples.
Substituting the sample size N = 30,000 into this formula gives a value of k of about 122. We use this value in the k-means SMOTE algorithm and draw the ROC curves to compare the prediction performance of the model before and after k-means SMOTE. We find that k-means SMOTE greatly improves the prediction performance of the model. The result is shown in Figure 11.
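The value of k can be checked with a short calculation. The sketch assumes the common rule of thumb $k \approx \sqrt{N/2}$, which reproduces the reported value of about 122 for N = 30,000; the function name is illustrative.

```python
import math

def rule_of_thumb_k(n_samples):
    """Rule-of-thumb number of clusters: square root of half the sample size."""
    return math.sqrt(n_samples / 2)

k = rule_of_thumb_k(30000)
print(round(k, 2), round(k))
```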
Figure 11 shows that after the samples are processed by the k-means SMOTE algorithm, the prediction performance of the model is greatly improved. The AUC value increases from 0.765 to 0.930, the ROC curve moves closer to the horizontal line at a true positive rate of 1 at the top of the plot, and the accuracy rises from 0.8252 to 0.8796.
A BP neural network with many parameters is prone to overfitting: because the model fits the training data so closely, it may learn the noise. We therefore compare the performance of the prediction model on the training set and the testing set; the results are as follows.
The table above shows that the performance indexes of the prediction model differ little between the two data sets, so we judge that the risk of overfitting in this experiment is relatively low and that the performance of the model achieves the desired results.

Comparative Analysis with Other Models.
To verify the effectiveness of the method used in this experiment, we also build five other common machine learning models for predictive analysis under the same conditions. We compare the prediction results of these models in the same setting and evaluate them with several common performance indicators. Because the confusion matrix reports the prediction results separately for each case, it is not convenient for comparing the performance of the models directly. We therefore adjust it slightly (e.g., the accuracy rate is approximately equal to the average of the accuracies on the positive and negative examples), as shown in Table 3.
It can be seen from Table 4 that the F1 values of all six models are above 0.8, indicating that all six models can effectively predict the imbalanced credit data in this paper, but the overall prediction performance of the BP neural network is slightly better. Its AUC value is the highest among the six models, while the accuracy rate is higher for SVM. However, the running time of the SVM model is too long, close to 6 minutes, so its running efficiency is very low compared with the other models. If the amount of data is very large, SVM is not a wise choice for prediction. In addition, apart from the lower AUC value of the decision tree, the AUC values of the other models do not differ greatly. This can also be seen intuitively from the ROC curves shown in Figure 12.
In Figure 12, without the numbers in Table 3, no obvious difference can be seen among the ROC curves of the first five models. The sixth plot is the ROC curve of the decision tree, which is clearly different from the previous five.
This also shows that the decision tree has the worst performance among the six prediction models.

Summary
This paper proposes a combined approach to data imbalance that uses the k-means SMOTE and BP neural network algorithms. We find that the improved SMOTE algorithm (k-means SMOTE) not only effectively alleviates the problem of data imbalance but also improves the prediction performance of the model. In addition, we find that using the feature importance calculated by the random forest as the initial weight of the hidden layer of the BP neural network slightly improves the prediction performance of the model. However, this improvement is not pronounced. On the one hand, credit card default has many influencing factors and is complicated; we cannot take all such factors into account, which may indirectly affect the calculated feature importance. On the other hand, the amount of sample data may not be sufficient, and the BP neural network model is relatively simple, so it may not capture these data well enough for predictive analysis.
In addition, with the gradual increase in the penetration rate of credit cards in our country, we offer the following suggestions for research on default risk. On the one hand, the construction of the credit indicator system should be further improved. A good credit indicator system supports better assessment of personal credit and makes it possible to build a risk prediction model with better classification performance. Specifically, methods such as the Delphi expert method, the analytic hierarchy process, and regression analysis can be used to find the most representative individual credit indicators, determine the weight of each indicator, and finally manage the evaluation system dynamically. On the other hand, risk management and control should be strengthened. Since credit card loan default involves personal moral issues, it is highly subjective and hard to control. Although major financial institutions are committed to developing the best methods for avoiding credit card loan risk, they have not been able to completely resolve the problem of credit defaults. Therefore, financial institutions should focus on controlling and avoiding risks and try their best to reduce risk losses. Following the idea of ensemble methods in machine learning, they can combine the strengths of several classifiers to develop a more versatile risk control model.

Conflicts of Interest
The authors declare that they have no conflict of interest.