A Deep Learning Approach with Feature Derivation and Selection for Overdue Repayment Forecasting

Abstract: Risk control has always been a major challenge in finance. Overdue repayment is a frequently encountered discreditable behavior in online lending. Motivated by the powerful capabilities of deep neural networks, we propose a fusion deep learning approach, namely AD-MBLSTM, based on the deep neural network (DNN), multi-layer bidirectional long short-term memory (BiLSTM) and the attention mechanism for overdue repayment behavior forecasting according to historical repayment records. Furthermore, we present a novel feature derivation and selection method for the procedure of data preprocessing. Visualization and interpretability improvement work is also implemented to explore the critical time points and causes of overdue repayment behavior. In addition, we present a new dataset originating from a practical application scenario in online lending. We evaluate our proposed framework on the dataset and compare the performance with various general machine learning models and neural network models. Comparison results and the ablation study demonstrate that our proposed model outperforms many effective general machine learning models by a large margin, and that each sub-component plays an active role.
... short-term memory (BiLSTM) encodes the sequential input, while the deep neural network (DNN) encodes selected features. The attention mechanism balances the importance of each hidden LSTM layer, and two encoded ...


Introduction
With economic development and rising consumption levels, a large number of individuals and companies encounter capital turnover problems and seek loans for consumption, capital turnover, investment, etc. The demand for financial credit services is therefore continuously increasing. Online lending is a convenient and fast form of micro-loan financing. Platforms streamline the tedious intermediate loan procedure, thus attracting increasing numbers of clients who want to resolve their financial difficulties.
While online lending gives customers and platforms an effective path to a loan transaction, various fraud and security risks emerge as well. Almost all credit businesses face problems such as long-term liabilities, loan delinquencies and overdue repayment, which pose great challenges for risk control in online lending. In traditional credit business, financial institutions can recover losses from collateral when overdue behavior occurs. In online lending, by contrast, loans are small and frequent, so handling them manually is difficult. Once misconduct such as overdue repayment occurs, assigning accountability and recovering losses become difficult.
Repayment behaviors in online lending accumulate over time to form a repayment behavior sequence. Analyzing the historical record of repayment behaviors allows potential repayment patterns to be discerned and thus the occurrence of overdue repayment to be predicted. In [1,2], the authors developed machine learning models to analyze event records, but they ignored the sequential structure of these records. Deep learning models, especially sequential neural networks, have achieved remarkable performance in event sequence-related tasks [3][4][5]. These kinds of models are well suited to handling the massive amount of online repayment behavior data. However, as only limited features can be collected in online lending, deep neural models may easily overfit. Feature derivation by manually designing new features is commonly used to enlarge the feature set, but it relies on human expertise.
In this paper, we publish a new dataset collected from a practical application scenario in online lending and propose a novel feature derivation and selection approach. Based on the dataset, we propose an overdue repayment forecasting method based on a fusion of deep learning models. Experiments demonstrate that our method outperforms various general machine learning models and neural network models. Furthermore, the visualization and interpretability improvement work in our approach reveals the critical time points and causes of overdue repayment behavior. The contributions of our research can be summarized as follows:
1. We present a new dataset that originates from a practical application scenario in online lending, which can be downloaded at https://github.com/zjersey/payment_overdue_dataset. Over one million repayment records of 85,000 anonymous borrowers are contained, and all the sensitive information is encrypted for confidentiality.
2. An improved feature derivation and selection method is proposed that can generate extensive, fully-combined new features and select an arbitrary number of the most significant features based on a scorecard model.
3. We introduce deep learning models into the domain of risk control in online lending; specifically, overdue repayment forecasting based on historical repayment behaviors. Our proposed architecture, namely AD-BLSTM, combines a deep neural network (DNN), bidirectional long short-term memory (BiLSTM) [6] and the attention mechanism [7]. DNN and BiLSTM are used to learn from the static background information and dynamic event sequence, respectively, to maximize the superiority of the two networks. The attention mechanism is introduced to weight the importance of hidden layers in LSTM and integrate them to obtain a more informative representation.
4. Experimental results demonstrate that our approach outperforms various general machine learning models and neural network models.
Interpretability improvement work is implemented based on the attention mechanism and derived features. We visualize differentiated attention weights to explore the key event time steps and analyze the feature importance of derived features to determine the causes of overdue repayment.

Related Work
Overdue repayment forecasting in this paper can be regarded as an event or behavior prediction problem. Previous works on this topic have aimed at applying predictive techniques to event sequences.

Event Prediction
The main purpose of event prediction is to predict the occurrence and condition of future events based on a sequence of past events. Prior research works have focused on probabilistic graphical models. Becker et al. [8] propose a framework of probabilistic models and additional methods such as the EM algorithm [9] to predict the future behavior of business process instances based on historical event data. This framework is composed of several probabilistic modules, such as the model transformation module and prediction module, which play different roles. Breuker et al. [10] develop predictive modeling techniques to describe business process behavior. Most similar probabilistic frameworks, such as those in [11][12][13][14], require the design of a complex module structure and are not end-to-end models, but they have good robustness and interpretability.
Machine learning methods were adopted after probabilistic graphical models [1]. Three kinds of prediction models, based on machine learning, constraint satisfaction and quality of service (QoS), are combined and compared in [2]. Machine learning methods such as the decision tree [15], support vector machine (SVM) [16], Bayes network [17] and cluster analysis [18] are comprehensively used and compared. The machine learning-based predictive procedure mainly consists of two steps, data preprocessing and model learning, which reduces the intricacy of the probabilistic graphical model and achieves better results while maintaining interpretability.
With the development of neural networks and deep learning techniques, the recurrent neural network (RNN) [19] and long short-term memory (LSTM) [6] have exhibited powerful abilities in sequence-related tasks, especially in the realm of natural language processing (NLP) [20][21][22]. Event data commonly exist in the form of sequences, and so many works have focused on deep learning-based event prediction [3][4][5][23][24]. Evermann et al. [25] propose a process prediction method based on LSTM to predict the behavior of a running process. A range of similar techniques have introduced an LSTM-based deep learning approach to predict the timestamp of future behavior [26], the continuation trajectory of running cases [27], the remaining service execution times [28] and the completion properties [29].
The point process [30][31][32][33] is a solid framework for dealing with multi-dimensional event data in the continuous time domain that treats each event as a point associated with a time stamp, location and other attributes. Previous works [34][35][36][37][38][39] have associated the point process with neural networks to process event data.

Deep Learning in Online Lending
Deep learning is characterized by end-to-end learning, which eliminates the complicated manual design process. On the one hand, the feature representation ability of neural networks is extremely powerful, and so the deep learning approaches mentioned in Section 2.1 can achieve better results than general machine learning methods and probabilistic models. On the other hand, the parameters of neurons in the network can hardly represent meaningful mathematical information about the input features, and so neural networks are poorly interpretable.
Recently, some works have introduced deep learning into the online lending domain. The authors in [40] transfer the learning algorithms of LSTM, the attention mechanism and word2vec [41], which are effective in the NLP domain, into online lending and propose a credit scoring method. The online operation record data of borrowers in online lending are regarded as a sentence with multiple words. Word2vec is applied to produce latent representation embedding for the behavior record. Instead, we divide the original behavior data into static attributes and dynamic sequences. Our proposed feature derivation and selection method is applied on the static features afterwards.
The authors in [42] develop deep learning models to predict the trading volume of the online market based on the trend of change in investor sentiment. TextCNN [43] is introduced to classify the sentiment of investor comments, and LSTM is utilized to analyze the trend of the trading volume. The prediction of the daily trading volume can be regarded as a time-series problem, while in our research, overdue repayment forecasting is an event-series problem.

Proposed System
We introduce multiple deep learning models to predict future overdue repayment behavior in online lending based on previous repayment behaviors. The structure of our proposed AD-BLSTM approach is illustrated in Figure 1. AD-BLSTM integrates DNN, LSTM and the attention mechanism for the purpose of appropriately representing different parts of the input repayment behavior record and improving prediction performance and interpretability. At the same time, we propose an improved feature derivation method that can generate extensive fully-combined new features and select an arbitrary number of the most significant features based on a scorecard model. The purpose of the task is to classify future repayment behavior into two types, overdue repayment behavior (positive) and normal repayment behavior (negative), according to the past repayment logs. Therefore, the problem is simplified into a binary classification task.
As illustrated in Figure 1, the structure of our system can be divided into five modules: feature derivation and selection, a multi-BiLSTM layer, a DNN layer, the attention mechanism and an output layer. We first divide the input data into dynamic and static features. The last repayment label serves as the target, and all the previous time-dependent features are set as dynamic features that are fed into the multi-BiLSTM layer. Static features, produced by the feature derivation and selection module, are fed into the DNN layer. We simplify the task to a binary classification task, and so the objective is to minimize the cross-entropy loss:
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\hat{y}^{(i)}\log y^{(i)} + \left(1-\hat{y}^{(i)}\right)\log\left(1-y^{(i)}\right)\right] \tag{1}$$
where $\hat{y}^{(i)}$ is the ground-truth label and $y^{(i)}$ is the predicted probability for the $i$-th sample, and $N$ is the number of training samples.

Feature Derivation and Selection
Feature derivation is essential when the input features are not abundant. It usually requires manual design and professional prior knowledge, which makes the derivation procedure ad hoc and incomplete. Derivation methods are generally based on statistical information or expert diagnosis. Statistics-based methods calculate common statistical values such as the maximum, mean and variance of part of the existing features. Expert diagnosis-based methods introduce new features through professional prior knowledge and manual inference.
Motivated by the statistics-based feature derivation methods and the problem of insufficiency, our approach improves upon the original method by categorizing features into various major categories and expanding the mathematical statistical values in each major class. Our proposed feature engineering framework is illustrated in Figure 2.

Figure 2. Overview pipeline of our proposed feature engineering method. A feature derivation table is manually designed by filling in $N_c$ major categories, each with $n_i$ sub-indexes. The feature size is enlarged to $N_d$ based on the table. Weakly influential features are filtered out during feature selection, and the feature size is reduced to $N_s$.
The purpose of feature derivation is to extend the number of input features from $N_0$ to $N_d$, where $N_0$ is the number of input features and $N_d \gg N_0$. First, we manually design $N_c$ major categories according to the content of the input features. Besides this, mathematical statistics are set as one major category. Specifically, in this research, we set financial indicators, products, mathematical statistics, periods and time conditions as our major categories. Second, we add as many subindexes as possible to each major category. For instance, the category of mathematical statistics includes subindexes such as the cumulative value, cycle proportions and variance. Finally, the combination of one subindex from each major category is extracted as a new derived feature. Each customer's repayment behavior sequence is mapped onto the derived features, and the mapped value is the corresponding feature value. A negative number is filled in as a missing-value signal when there is no exact mapping between the input data and a derived feature.
The total number of newly derived features can be calculated by Equation (2):
$$N_d = \prod_{i=1}^{N_c} n_i \tag{2}$$
where $n_i$ is the number of subindexes in the $i$-th major category. To maximize the number of derived features, $n_i$ should be relatively large when designing the major categories and subindexes. One specific instance is illustrated in Figure 3. By combining the "overdue days" subindex in the financial indicator major category, "non-payday loan" in product, "maximum" in mathematical statistics, "weekday" in time condition and "one week" in period, we can obtain a newly derived feature (Feature 1): the maximal overdue days for a non-payday loan on weekdays in the most recent week.
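The combinatorial derivation described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the major category names follow the text, but the subindex lists are shortened and mostly hypothetical, and `derive_feature_names` and `derived_feature_count` are names introduced only for this example.

```python
from itertools import product
from math import prod

# Hypothetical, shortened derivation table: one list of subindexes per major category.
# Only "overdue days", "non-payday loan", "maximum", "weekday" and "one week"
# come from the text; the remaining entries are illustrative placeholders.
DERIVATION_TABLE = {
    "financial indicator": ["overdue days", "overdue amount"],
    "product": ["non-payday loan", "payday loan"],
    "mathematical statistics": ["maximum", "mean", "variance"],
    "time condition": ["weekday", "weekend"],
    "period": ["one week", "one month"],
}


def derive_feature_names(table):
    """Return one derived feature per combination of subindexes (one per category)."""
    categories = list(table)
    return [dict(zip(categories, combo)) for combo in product(*table.values())]


def derived_feature_count(subindex_counts):
    """Product rule: N_d is the product of the subindex counts n_i."""
    return prod(subindex_counts)


features = derive_feature_names(DERIVATION_TABLE)
n_d_toy = len(features)                                # size of the toy table's output
n_d_paper = derived_feature_count([14, 7, 20, 3, 10])  # counts reported in the experiments
```

The first generated combination matches Feature 1 from the example, and the product rule applied to the subindex counts reported in the experimental settings gives 58,800 derived features.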
Our proposed derivation approach standardizes and extends common measures, and it avoids the problem of arbitrarily choosing which variables to combine. Although a neural network can learn combinatorial feature representations automatically, owing to end-to-end learning and its powerful representation ability, the experimental results in our research show that the derived features improve model performance at the cost of additional work. More importantly, each new feature is produced by combining several indicators, which helps to improve interpretability.
After feature derivation, the number of input features has expanded from $N_0$ to $N_d$. To eliminate meaningless padding values and weakly influential features, feature selection is essential. The feature selection approach is introduced in Algorithm 1. The selection pipeline can be divided into four parts: ChiMerge, weight of evidence (WOE), Pearson correlation coefficient and logistic regression (LR). After this multi-step procedure, the total number of features decreases from $N_d$ to $N_s$. The selected features are fed into the follow-up networks.
Figure 3. An example to illustrate our proposed feature derivation method. We iteratively traverse all the feature collections in the feature derivation table to enlarge the feature size. At each iteration, one sub-index is selected from each major category to form a new feature.
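Two of the four selection steps, WOE and the Pearson correlation filter, can be illustrated with the standard library alone. This is a hedged sketch, not a reproduction of Algorithm 1: the sign convention of WOE, the additive `smoothing` constant, the 0.8 correlation threshold and the greedy filtering order are all assumptions made for illustration, and the ChiMerge binning and LR steps are omitted.

```python
import math


def woe_per_bin(bin_ids, labels, smoothing=0.5):
    """Weight of evidence per bin: ln(share of negatives / share of positives).

    `smoothing` is an assumed additive constant that avoids division by zero.
    """
    bins = sorted(set(bin_ids))
    pos = {b: smoothing for b in bins}
    neg = {b: smoothing for b in bins}
    for b, y in zip(bin_ids, labels):
        if y == 1:
            pos[b] += 1
        else:
            neg[b] += 1
    total_pos, total_neg = sum(pos.values()), sum(neg.values())
    return {b: math.log((neg[b] / total_neg) / (pos[b] / total_pos)) for b in bins}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def drop_correlated(columns, threshold=0.8):
    """Greedily keep a feature only if |r| with every kept feature is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data: bin 0 is mostly negative (label 0), bin 1 is mostly positive (label 1).
woe = woe_per_bin([0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 0])
kept = drop_correlated({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.1], "c": [4, 1, 3, 2]})
```

In the toy data, feature "b" is almost a multiple of "a" and is therefore dropped, while "c" is only weakly correlated with "a" and is kept.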

DNN Layer
The deep neural network (DNN), also known as the multi-layer perceptron (MLP), is the most fundamental neural network. A DNN is built by stacking layers of perceptron neurons, each with weights and an activation function. A single perceptron cannot represent a linearly non-separable problem, not even the basic logic operation "xor". With enough perceptrons and connected layers, however, a network can approximate a very broad class of functions.
The relationship between the input $x^{(i)}$ and output $y^{(i)}$ of the $i$-th layer can be calculated by Equation (3):
$$y^{(i)} = F\left(W^{(i)} x^{(i)} + b^{(i)}\right) \tag{3}$$
where $W^{(i)}$ and $b^{(i)}$ are the weight matrix and bias of the $i$-th layer and $F$ is the activation function, usually tanh, ReLU or the sigmoid function. $y^{(i)}$ can be the output probability or the input of the next layer, $x^{(i+1)}$. We train a three-layer DNN as an encoder that maps the input features $X_s \in \mathbb{R}^{N_s}$, obtained after derivation and selection, into a vector $M \in \mathbb{R}^{p}$, where $p$ is the number of neurons in the last layer.
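The remark about "xor" can be checked with a small hand-built example. The sketch below is illustrative only: it uses a step activation and weights chosen by hand (no training), and the grid search over a single perceptron gives evidence on a small weight grid rather than a proof.

```python
def step(z):
    """Threshold activation of a classic perceptron."""
    return 1 if z > 0 else 0


def layer(inputs, weights, biases):
    """One fully connected layer: y_j = step(sum_k w_jk * x_k + b_j)."""
    return [step(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def xor_network(x1, x2):
    # Hidden layer: the first unit computes OR, the second unit computes AND.
    hidden = layer([x1, x2], weights=[[1, 1], [1, 1]], biases=[-0.5, -1.5])
    # The output unit fires when OR is true and AND is false.
    return layer(hidden, weights=[[1, -1]], biases=[-0.5])[0]


def single_perceptron_solves_xor(grid):
    """Search a small weight grid for one perceptron that reproduces XOR."""
    targets = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    for w1 in grid:
        for w2 in grid:
            for b in grid:
                if all(step(w1 * a + w2 * c + b) == t for (a, c), t in targets.items()):
                    return True
    return False


truth_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
found_single = single_perceptron_solves_xor([k / 2 for k in range(-4, 5)])
```

The two-layer network reproduces the XOR truth table, whereas no single perceptron on the searched grid does, which is consistent with XOR not being linearly separable.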

Multi-BiLSTM Layer
In neural networks such as the DNN and CNN, the inputs are treated as independent of each other, without temporal dependence, while the recurrent neural network (RNN) takes sequential information into account: the output depends not only on the current input but also on previous inputs. In other words, the RNN has the ability to memorize. The RNN models dependencies within sequence data, but the product of gradients accumulated over time steps during backpropagation causes the gradient to vanish when the sequence is long.
On the basis of the RNN, LSTM is a type of neural network with the additional ability of forgetting, which makes it suitable for sequence data, and it has achieved outstanding results in natural language processing. The shortcoming of the RNN is that only a single hidden state is updated inside the network, so the model is relatively simple. All the historical input data are memorized by the RNN without filtering, which frequently results in the long-distance dependence problem. LSTM alleviates the vanishing and exploding gradient problems by introducing several gate units with different functions. It operates selectively on the input information, which is memorized, forgotten or output to the next layer with a certain weight according to its importance. These operations are implemented by multiple computing components called "cell gates", including forget gates, input gates and output gates.
Firstly, the forget gate calculates the degree of forgetting the historical information, denoted as C, based on the input data and the hidden state of the previous moment.
$f_t$ will be multiplied by $C_{t-1}$ afterwards to forget partial information in $C_{t-1}$. The input gate updates the candidate hidden layer value and the hidden layer state. Referring to Equation (5), $i_t$ determines the proportion of the candidate hidden value that is written into the hidden value $C_t$, and the candidate hidden value $\tilde{C}_t$ is calculated by Equation (6). The hidden layer state $C_t$ is updated together with the forget gate, based on Equation (7).
The function of the output gate is to output the hidden state in a certain proportion.
Historical information does not flow into the future state entirely, while essential information is preserved and useless information is forgotten after the three cell gates.
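For reference, the three gates described in this subsection are commonly written as follows. This is the standard LSTM formulation rather than a quotation of the authors' equations: the weight matrices $W_\ast$, the biases $b_\ast$, the sigmoid $\sigma$ and the element-wise product $\odot$ are notation assumed here. The first four lines correspond, in order, to the quantities the text attributes to Equations (4) to (7); the last two lines describe the output gate.

```latex
\begin{align*}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{align*}
```

Because $f_t$ and $i_t$ lie in $(0, 1)$, the update of $C_t$ is a gated mixture of the old state and the new candidate, which is what allows essential information to be preserved and useless information to be forgotten.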
LSTM and the RNN both process sequence data in one direction, and $h_t$ is determined only by $x_t$ and $h_{t-1}$, ignoring the correlation between future events and current events. BiLSTM complements LSTM to address this limitation of using a single direction: the input sequence is simultaneously processed in reverse order to obtain another hidden state, $h_{tb}$, so that the sequence is represented in two directions, and the two hidden states are then concatenated to get the final hidden state:
$$h_t = [h_{tf}; h_{tb}]$$
Consider a sequence of overdue repayment flags of length $L$:
$$\left(x^{(1)}, x^{(2)}, \ldots, x^{(L)}\right)$$
where $x^{(t)}$ in the sequence represents whether the $t$-th repayment behavior is overdue (when $x^{(t)} = 1$) or not (when $x^{(t)} = 0$). Motivated by the structure of the pretrained language representation model ELMo [45], we train a two-layer BiLSTM network as the encoder to transform the input sequential behavior data into the matrix $H \in \mathbb{R}^{2m \times L}$:
$$H = \left[[h_{1f}; h_{1b}], [h_{2f}; h_{2b}], \ldots, [h_{Lf}; h_{Lb}]\right] \tag{10}$$
where $h_{if} \in \mathbb{R}^{m}$ and $h_{ib} \in \mathbb{R}^{m}$ are the forward and backward hidden states, respectively, at the $i$-th time step of the second LSTM layer.

Attention Layer
We propose an attention mechanism operating on the hidden state at each moment. We calculate the importance weight of each hidden state and combine the states based on these weights to obtain a combined final hidden state. In the formulas of the attention mechanism, $H$ is the output of the BiLSTM layer given by Equation (10) and $\omega \in \mathbb{R}^{2m}$ is a parameter vector learned during training. $\alpha \in \mathbb{R}^{L}$ contains the weights of the hidden states at different moments and is applied to $H$ to obtain the final output state $h^{*} \in \mathbb{R}^{2m}$.
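A minimal sketch of such an attention pooling step is given here in plain Python. The scoring function is an assumption: the code uses one common formulation that is consistent with the stated dimensions (score $\omega^{\top}\tanh(h_t)$, softmax over time steps, weighted sum of hidden states), which may differ from the authors' exact formulas. The toy values of `H` and `omega` are hypothetical.

```python
import math


def attention_pool(hidden_states, omega):
    """Combine L hidden states (each of size 2m) into one vector of size 2m.

    Assumed scoring: score_t = omega . tanh(h_t); alpha = softmax(scores);
    h_star = sum_t alpha_t * h_t.
    """
    scores = [sum(w * math.tanh(v) for w, v in zip(omega, h)) for h in hidden_states]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    dim = len(omega)
    h_star = [sum(a * h[d] for a, h in zip(alpha, hidden_states)) for d in range(dim)]
    return alpha, h_star


# Toy input stored as L = 3 rows of size 2m = 4 (the transpose of the 2m x L matrix H).
H = [[0.1, -0.2, 0.0, 0.3], [0.9, 0.8, -0.1, 0.7], [0.0, 0.1, 0.2, -0.3]]
omega = [1.0, 1.0, 0.0, 1.0]
alpha, h_star = attention_pool(H, omega)
```

The weights in `alpha` sum to one, so they can be read directly as the relative importance of each time step, which is the property used for the interpretability analysis.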

Output Layer
We concatenate the output of the multi-BiLSTM layer $h^{*}$ and the DNN layer $M$ to form a vector $X_o \in \mathbb{R}^{2m+p}$ and pass it through a fully-connected layer and softmax function to obtain the predicted probability $y_o$.

Dataset
Customers normally submit their fundamental information, such as identity information, wealth information and credit records, when they borrow money from online lending platforms. Besides, online lending platforms may record transaction details when loan behaviors are continued.
We present a new dataset in this paper, which is available to the public (https://github.com/zjersey/payment_overdue_dataset). The real-world dataset was provided by a company in Shanghai and collected from the two situations above. The practical situation is that of an online lending platform that is used by a large number of borrowers and lenders. Borrowers and lenders can engage in loan transactions with each other under the control of the platform. Our dataset is sampled from the transaction records of the platform.
Each piece of data represents a record of a repayment transaction from a borrower to a lender. Our dataset contains 1,048,575 transaction records and 85,236 borrowers are involved, with an average of 12.3 repayment behaviors for each borrower. The maximal number of repayment records for a borrower is 20. The number of borrowers with a different number of records (ranging from 1 to 20) was counted, and the proportion is illustrated in Figure 4. Since the length of the behavior sequence in our dataset is not particularly long and there are borrowers with few records, some long-sequence modeling based methods [6,19] would not perform well for our dataset.
As a borrower might have transactions with multiple lenders at the same time, the repayment behavior does not have a regular frequency in our dataset. Besides, repayment behaviors between precise borrower-to-lender matches may be small in number. Therefore, our proposed approaches focus on modeling the historical records from the borrower level instead of the borrower-to-lender level.
Each transaction record contains 64 engineering features, which can be divided into three categories based on content:
• Customer information (after data masking): borrowers' unique identification, industry, etc. Note that data masking has been applied to protect the privacy of customers.
• Dynamic repayment records: obliged repayment amount, obliged repayment time, overdue days, etc.
• Background information: expense ratio, loan type, order scenario, etc.
The most meaningful information lies in the dynamic features. Each repayment transaction record has a due date, due amount of money, actual repayment date and actual repayment amount. The transaction record is identified as an overdue repayment if the actual repayment date is later than the due date or the actual repayment amount is less than the due amount. The background and identity features provide some auxiliary information. Although there are 64 features in total, some of them are meaningless or have a strong correlation. Therefore, the workable features after feature screening are limited in number, which makes feature engineering essential and challenging.
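The overdue rule stated in this paragraph can be written as a single predicate. The sketch is illustrative: the `RepaymentRecord` class and its field names are hypothetical, since the dataset's actual column names are not given in the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RepaymentRecord:
    # Hypothetical field names; the real dataset columns may be named differently.
    due_date: date
    due_amount: float
    actual_date: date
    actual_amount: float


def is_overdue(record):
    """Overdue if repaid after the due date or repaid less than the amount due."""
    return record.actual_date > record.due_date or record.actual_amount < record.due_amount


on_time = RepaymentRecord(date(2020, 1, 10), 500.0, date(2020, 1, 9), 500.0)
late = RepaymentRecord(date(2020, 1, 10), 500.0, date(2020, 1, 12), 500.0)
underpaid = RepaymentRecord(date(2020, 1, 10), 500.0, date(2020, 1, 10), 350.0)
flags = [is_overdue(r) for r in (on_time, late, underpaid)]
```

Either condition alone is enough to mark a record as overdue, so both the late payment and the partial payment are flagged.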

Data Preprocessing and Experimental Settings
We define a record as overdue repayment behavior if the overdue days or overdue amount in the repayment record is not equal to zero. The label for overdue repayment is set as positive; otherwise, it is set as negative. To define the target of the training process, we first sort the data by unique identification and repayment timestamp. An individual whose last repayment behavior is overdue is assigned a positive target; otherwise, the target is negative. As a result, the number of positive individuals is 15,608, accounting for 18.3% of the total, while negative samples account for 81.7%.
The sequence of all the behavior labels except the last label is regarded as the dynamic features and fed into the multi-BiLSTM layer. The window length of the label sequence is the maximum of each individual's record length, and padding is added at the end of the sequence to make up the required length. We apply the feature derivation and selection approach to the records, except the last record, of each individual to obtain static features as the input of the DNN layer. In feature derivation, we design five major classes as described in Section 3.1 with 14, 7, 20, 3 and 10 subindexes, respectively; thus, the total number of newly derived features is $N_d = 14 \times 7 \times 20 \times 3 \times 10 = 58{,}800$. After feature selection, we retain 42 of the most influential features in the DNN layer.
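The construction of targets and padded label sequences can be sketched as follows. Several details are assumptions made for illustration: records are represented as `(borrower_id, timestamp, overdue_flag)` tuples, the padding value `PAD = -1` is hypothetical (the text does not state which value is used), and the window is taken as the longest history in the data.

```python
from collections import defaultdict

PAD = -1  # assumed padding value; the text does not state which value is used


def build_samples(records):
    """Turn (borrower_id, timestamp, overdue_flag) tuples into padded sequences and targets."""
    by_borrower = defaultdict(list)
    # Sort by borrower and then by time, as described in the text.
    for borrower_id, _timestamp, flag in sorted(records, key=lambda r: (r[0], r[1])):
        by_borrower[borrower_id].append(flag)
    # The window is the longest history once the last label has been removed.
    window = max(len(flags) for flags in by_borrower.values()) - 1
    samples = {}
    for borrower_id, flags in by_borrower.items():
        history, target = flags[:-1], flags[-1]
        padded = history + [PAD] * (window - len(history))  # pad at the end
        samples[borrower_id] = (padded, target)
    return samples


records = [
    ("b2", 3, 0), ("b1", 2, 1), ("b1", 1, 0), ("b2", 1, 0),
    ("b2", 2, 1), ("b1", 3, 1), ("b2", 4, 0), ("b3", 1, 1),
]
samples = build_samples(records)
```

Note that a borrower with a single record, such as "b3" here, has an empty history, so the sequence branch of the model receives only padding for that individual.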
We choose TensorFlow (https://www.tensorflow.org/), a deep learning tool based on Python, as the deep learning framework for our experiment. We randomly select 80% of the data as the training set, 10% as the validation set and 10% as the test set. The parameters of our AD-MBLSTM model are listed in Table 1.

Evaluation Indicators
The statistics of the positive and negative sample proportions described in Section 4.2 indicate that the data distribution is unbalanced with few positive samples. Supposing that every input individual is classified as negative, then we can still obtain an accuracy of 81.7%. Obviously, this accuracy value is quite high but meaningless, because no overdue behavior can be recognized by the model. As the performance of the model cannot be measured appropriately only in terms of accuracy, we supply other evaluation indicators including recall, the area under the curve (AUC) value and KS value.
The confusion matrix is a fundamental evaluation indicator in binary classification tasks and is essential for the calculation of multiple other indicators as well. The confusion matrix is a 2 × 2 table with four combinations of actual values and prediction values:

                     Predicted positive     Predicted negative
Actual positive      True positive (TP)     False negative (FN)
Actual negative      False positive (FP)    True negative (TN)

On the basis of the four combination values above, TPR, TNR, FPR and FNR are the four ratio values that represent the proportion of corresponding combination values. The formulas for TPR and FPR are as follows:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$
All of the evaluation indicators can be calculated from the combination values and ratio values mentioned above. The recall value represents the proportion of the positive samples that are predicted as the correct class, according to the following formula:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where a higher recall value indicates that more positive samples are distinguished by the model, which satisfies the actual demand of our research; i.e., to distinguish overdue repayment behaviors. Gradually changing the threshold of classification from 0 to 1 and calculating the corresponding TPR and FPR, we can plot the (FPR, TPR) pairs as points to form a curve named the receiver operating characteristic (ROC) curve. The AUC is the area under the ROC curve. A large AUC value indicates good performance of the classification model, and a perfect model has an AUC of 1.
Similarly, taking the threshold as the x-axis and the TPR and FPR as the y-axis, we can obtain a graph with two lines, named the KS curve. The KS value is the maximum distance between the TPR curve and the FPR curve in the vertical direction:
$$\mathrm{KS} = \max_{\theta}\left|\mathrm{TPR}(\theta) - \mathrm{FPR}(\theta)\right|$$
where $\theta$ is the classification threshold and a higher KS value indicates that positive and negative samples are distinguished more clearly.
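The indicators in this subsection can be computed from predicted scores with a short threshold sweep. The sketch below is a plain-Python illustration on hypothetical toy scores, not the evaluation code used in the experiments; it also reproduces the accuracy of the trivial all-negative classifier from the class proportions reported for the dataset.

```python
def rates_at(threshold, scores, labels):
    """Return (TPR, FPR) when every score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg


def roc_auc_ks(scores, labels):
    """Sweep thresholds over the distinct scores and return (AUC, KS)."""
    points = [(0.0, 0.0)]  # (FPR, TPR) for a threshold above every score
    ks = 0.0
    for threshold in sorted(set(scores), reverse=True):
        tpr, fpr = rates_at(threshold, scores, labels)
        points.append((fpr, tpr))
        ks = max(ks, abs(tpr - fpr))
    # Area under the ROC curve by the trapezoidal rule.
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return auc, ks


# Hypothetical scores for 3 positive and 5 negative samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
auc, ks = roc_auc_ks(scores, labels)
recall_at_half, fpr_at_half = rates_at(0.5, scores, labels)

# Accuracy of a classifier that labels every borrower as negative.
all_negative_accuracy = 1 - 15608 / 85236
```

On the toy data, 14 of the 15 positive-negative pairs are ranked correctly, so the AUC is 14/15, and the largest gap between TPR and FPR is 0.8. The all-negative accuracy of about 81.7% illustrates why accuracy alone is not informative on this dataset.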

Baselines
In order to comprehensively measure the performance of our proposed AD-MBLSTM model, we use baseline models, including multiple general machine learning models and basic neural networks, that have been utilized frequently in previous work for comparison. We divide the input data into dynamic and static features in AD-MBLSTM; however, this operation does not suit the baseline models, so we simply implement feature derivation and selection on the inputs and afterwards train the processed data using baselines with the same parameters.

• Logistic regression (LR) is a fundamental and commonly used classification approach based on the sigmoid function and the maximum likelihood method. The LR model does not need to assume the prior distribution of input data. Not only can the classification label be determined, but the predicted probability can also be obtained, and so the threshold can be adjusted according to demand and label distribution.
• XGBoost [46] achieved state-of-the-art performance in large numbers of machine learning tasks as soon as it was proposed. The overall idea of XGBoost is to constantly add new decision trees to improve the performance of a system. Each newly added tree fits the prediction residuals of the previous weak classifiers and thus compensates for their shortcomings.
• The factorization machine (FM) [47] interactively combines input features in pairs, which allows training to be performed on each pair of potential features. There is a similarity between FM and SVM in the formula. However, in contrast to SVM, all interactions between features are considered via factorized parameters in FM, and so FM works well even in problems with huge sparsity.
• DNN extracts derived and selected features directly and connects them to the output layer.
• In multi-BiRNN, we consider each variable of the input record as a dynamic variable and feed input data into a two-layer bidirectional RNN. We choose this structure due to its similarity to the LSTM layer in AD-BLSTM, thus facilitating comparison.
• Multi-BiLSTM has the same structure as multi-BiRNN except that the RNN is replaced with LSTM.

Result Analysis
The comparison results of AD-BLSTM with baseline methods are listed in Table 2. Our proposed AD-BLSTM model can be seen to outperform other methods in all evaluation indicators, especially in recall, AUC and KS values. AD-BLSTM achieves 0.59 recall value on the basis of an accuracy of over 0.85, indicating that 59% of overdue repayment behaviors can be predicted accurately. The AUC and KS values of AD-BLSTM suggest that it is capable of distinguishing positive and negative samples and thus represents a stronger classifier compared with other methods.
Traditional methods regard repayment records as a single input instead of a sequence, processing data into a fixed vector via feature engineering and training by general machine learning methods or neural networks, which has a similar procedure to the LR, XGBoost, FM and DNN baselines. By observing these baseline results, we can conclude that even the procedure of feature engineering has been improved by our proposed feature derivation approach, and although strong classifier algorithms such as XGBoost and FM have been utilized, the recall, AUC and KS still fluctuate to an unsatisfying level.
The training pipeline for the two recurrent network baselines, Mul-BiRNN and Mul-BiLSTM, differs substantially from that of the above four baseline methods: repayment records are regarded as a time sequence instead of a non-sequential vector. As the recurrent network can extract sequential information effectively, these two baselines perform much better than the traditional methods. Therefore, sequential modeling is essential in the context of our research.
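The indicators discussed above can be computed as in the following minimal sketch, written in plain Python. It assumes binary labels (1 for overdue), that both classes are present, and uses the common definition of KS as the maximum gap between the true positive rate and the false positive rate over all thresholds. The function names and toy data are illustrative assumptions, not our evaluation code.

```python
from typing import List, Tuple


def recall_and_accuracy(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Recall on the positive (overdue) class and overall accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return recall, correct / len(y_true)


def auc_and_ks(y_true: List[int], scores: List[float]) -> Tuple[float, float]:
    """AUC (trapezoidal area under the ROC curve) and KS = max |TPR - FPR|.

    Assumes y_true contains at least one positive and one negative label.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    auc = ks = 0.0
    i = 0
    while i < len(order):
        # Samples with tied scores share one threshold, so process them together.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if y_true[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        ks = max(ks, abs(tpr - fpr))
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return auc, ks


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 1, 0]
    probs = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
    preds = [1 if p >= 0.5 else 0 for p in probs]  # 0.5 threshold is an assumption
    print(recall_and_accuracy(labels, preds))
    print(auc_and_ks(labels, probs))
```

Recall and accuracy depend on the chosen decision threshold, whereas AUC and KS are computed from the ranking of scores alone, which is why the two groups of indicators can move in different directions.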

Ablation Study
In this subsection, we explore the effect of the attention mechanism, the feature derivation method, the sequential input features and the static input features; the results are listed in Table 3. In the table, "w/o attention" means that the hidden vector of BiLSTM at the last moment is concatenated with the DNN-extracted feature vector and connected to the output layer; "w/o derivation" means that, instead of feature derivation and selection, basic feature engineering approaches such as normalization and one-hot encoding are applied to the first record of each individual's raw data, and the result is connected directly to the DNN layer. We can conclude from Table 3 that the attention mechanism and feature derivation improve the prediction performance of AD-BLSTM, but only modestly. However, in the next subsection, we will show that the two components play an essential role in interpretability. Furthermore, in the table, "w/o sequence" removes BiLSTM and the attention layer, so that the model degenerates to the DNN baseline in Table 2, while "w/o statistics" removes the DNN layer; together, these two settings allow us to observe the separate effects of sequential behavior and static background information. The results indicate that the previous repayment label sequence strongly influences the next repayment target even without identification and background information, while the static features have far less influence. We also observe that accuracy moves in the opposite direction to the other indicators, possibly because the model focuses more on identifying overdue repayments and therefore misclassifies some negative instances.
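The difference between the full model and the "w/o attention" variant can be illustrated with the following minimal sketch in plain Python. It contrasts an attention-weighted sum of hidden states with simply taking the last hidden state, before concatenation with the static feature vector. The dot-product scoring against a `query` vector is an assumption made for illustration only; it is not the scoring function of our attention layer, and all names are hypothetical.

```python
import math
from typing import List, Optional

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden: List[Vector], query: Vector) -> Vector:
    """Weighted sum of hidden states over time.

    The weights come from dot-product scores against `query`
    (an illustrative assumption, not the scoring used in the paper).
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden]
    alpha = softmax(scores)
    dim = len(hidden[0])
    return [sum(alpha[t] * hidden[t][d] for t in range(len(hidden)))
            for d in range(dim)]


def fuse(hidden: List[Vector], static_vec: Vector,
         query: Optional[Vector] = None) -> Vector:
    """Concatenate the sequence representation with the static feature vector.

    query given -> attention pooling over all hidden states
    query None  -> "w/o attention": use only the last hidden state
    """
    if query is not None:
        seq_vec = attention_pool(hidden, query)
    else:
        seq_vec = list(hidden[-1])
    return seq_vec + list(static_vec)


if __name__ == "__main__":
    hidden_states = [[1.0, 0.0], [0.0, 1.0]]  # two time steps, toy values
    static_features = [0.5]
    print(fuse(hidden_states, static_features))              # last state only
    print(fuse(hidden_states, static_features, [0.0, 0.0]))  # uniform attention
```

With attention, information from every time step can reach the output layer, whereas the ablated variant relies on the last hidden state to summarize the whole sequence.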

Interpretability
In this subsection, we explore the causes of overdue repayment, which can enhance the credibility of the prediction results and suggest possible prevention approaches. Our interpretability work is mainly based on the attention mechanism and feature derivation.

Locating Critical Time Point via the Attention Mechanism
In the attention layer, the vector α reflects the importance weights of the LSTM hidden layers at different moments, calculated by Equation (12). Therefore, we explore the pivotal time points in a sequence of repayment actions by visualizing and analyzing the vector α, as illustrated in Figure 5. We calculate the mean α value over all individuals and visualize it in Figure 6. Except for the first time point, the importance weight increases with time. The reason for this is that each hidden representation in LSTM is calculated from the current input and the previous hidden representation, so later hidden states carry more accumulated information than earlier ones. However, the weight of the first hidden state is still distinctly larger than that of the next. We calculate the difference between adjacent hidden states in positive samples and visualize it in Figure 5. The result demonstrates that the first and last three hidden states have more prominent weights than the others. In the process of dividing dynamic sequences, padding is performed at the end of sequences whose number of behaviors is less than the time_step, and the hidden states remain unchanged in these padded moments. Therefore, the last three hidden states may all refer to the last repayment behavior. We conclude that the first and last behaviors have the most significant influence on the final result.
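The aggregation used in this analysis can be sketched as follows in plain Python. The sketch takes per-individual attention vectors α as given, averages them per time point, computes the change between adjacent time points, and ranks the time points by mean weight. Applying the adjacent difference to the attention weights is our reading for illustration; the function names and toy values are assumptions and are not taken from our dataset.

```python
from typing import List


def mean_attention(alphas: List[List[float]]) -> List[float]:
    """Average the attention weight at each time point over all individuals."""
    n, time_step = len(alphas), len(alphas[0])
    return [sum(a[t] for a in alphas) / n for t in range(time_step)]


def adjacent_differences(values: List[float]) -> List[float]:
    """Change in value between each time point and the one before it."""
    return [values[t + 1] - values[t] for t in range(len(values) - 1)]


def top_time_points(values: List[float], k: int) -> List[int]:
    """Indices of the k time points with the largest values, largest first."""
    return sorted(range(len(values)), key=lambda t: -values[t])[:k]


if __name__ == "__main__":
    # Toy attention vectors for two individuals over four time points;
    # each row sums to 1, as softmax-normalized weights would.
    alphas = [
        [0.3, 0.1, 0.2, 0.4],
        [0.5, 0.1, 0.1, 0.3],
    ]
    mean_alpha = mean_attention(alphas)
    print(mean_alpha)
    print(adjacent_differences(mean_alpha))
    print(top_time_points(mean_alpha, 2))
```

In this toy example the first and last time points receive the largest mean weights, which mirrors the qualitative pattern described above but is not evidence for it.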

Analysis of Derived Features
The derived features can be fed into machine learning models with interpretability to further explore the latent logic of overdue repayment behavior. We choose XGBoost as the classifier model and set the features and corresponding targets as inputs. After convergence, we analyze the importance of features and explore the most meaningful subindexes.
To begin with, we select some of the static background features as the input for the XGBoost model. After convergence, the importance weights of the features are illustrated in Figure 7. In the figure, product_id_0-3 stands for the four types of loan products, sub_industry_name_0-6 stands for the seven types of sub-industry, and industry_id_0-2 stands for the three types of industry: consumer finance, Internet finance and their integration. The meaning of all these features is included in the description file of our dataset. From Figure 7, we can conclude that the type of product and the type of industry have a significant impact on overdue repayment behavior. Although the amount of owed money is commonly assumed to strongly influence overdue repayment, our results show, surprisingly, that the owed money (start_money in the chart) has an impact as low as that of birthday, region and gender.
Furthermore, we analyze the importance weights of all 42 features obtained by our proposed feature derivation and selection method. The results are illustrated in Figure 8. The vertical axis denotes the order number of the features, and the meaning of the features is provided in our dataset. By summarizing the key words of the five features with the largest importance weights, we conclude that the last behavior has a greater impact than the others on whether a future repayment will be overdue, which agrees with the result of analyzing the attention weights. Additionally, whether the type of product is a very short-term cash loan matters a great deal, which corresponds to the above conclusions from analyzing the background features.

Conclusions
In this paper, we proposed a fusion deep learning model and a novel feature derivation and selection approach for overdue repayment forecasting. Our methods were evaluated on a real-world dataset that we collected and made publicly available. Multiple neural networks were combined to simultaneously encode the static background information and dynamic sequential information. Experimental results demonstrated that our model outperforms various machine learning models and neural networks. The proposed feature derivation method can generate a large number of combination features from the original low-quality features. Furthermore, we visualized attention weights and found that the first and last behaviors are critical time points in a repayment behavior sequence. By analyzing the derived features, we also drew several conclusions regarding the importance weights of features.

Conflicts of Interest:
The authors declare no conflict of interest.