Harsanyi-transformation Oriented Default Risk Prediction Based on FA-XGBoost in P2P Network Loan

In the traditional Harsanyi transformation, the virtual player "Nature" assigns each real player a type according to a probability distribution estimated from historical data. However, such historical data is sometimes difficult to obtain. This paper therefore proposes an FA-XGBoost model that predicts the probability distribution of a real player's type from a large amount of related data. To avoid the increase in computational complexity caused by the high dimensionality of the data, factor analysis is first used to reduce the data dimension, and the XGBoost algorithm is then used to learn the probability distribution of the real player's type. "Nature" finally assigns the player's type according to the predicted distribution. An empirical analysis based on real loan data from Lending Club shows that the method can effectively guide the decision-making of P2P loan enterprises.


Introduction
The Harsanyi transformation is an important method for analyzing games with incomplete information [1]; its key stage is the step in which the virtual player "Nature" selects a type for each real player. Scholars have applied it to many problems. Xiong Fei et al. analyzed an incomplete-information payoff matrix based on the degree of individual confidence, using the Harsanyi transformation to find its Bayesian Nash equilibrium [2]. Ma Lulu analyzed the adverse-selection problem before venture capital investment [3]. Zhang Pu et al. analyzed the distribution characteristics of stock volatility and its influencing factors [4]. Li Jun and Li Wei obtained a Bayesian Nash equilibrium solution to determine the type of an attacker [5].
Some scholars have studied how "Nature" assigns players' types. Niu Xiaomeng determined the probability of a football player's kicking direction from historical data [6]. Gong Yicheng used historical data to guide the random decisions of group logistics companies [7]. However, the Harsanyi transformation remains difficult to apply in practice because players' historical data are hard to obtain. In 2013, Liu Tieyan and his team first proposed the concept of "game machine learning" [8], and they eliminated the uncertainty of the game through online data and Markov chains [9][10][11].
To solve the problem of how "Nature" assigns players' types in the Harsanyi transformation, this paper proposes an XGBoost model based on factor analysis (FA-XGBoost) to predict the probability distribution.

Harsanyi transformation
An incomplete-information static game G with N players can be written as equation (1):

G = {N; T1, …, TN; A1, …, AN; p1, …, pN; u1, …, uN} (1)

For the i-th player, the type ti belongs to the type space Ti. He takes an action ai from the action space Ai with probability pi, and gains the payoff ui. The Harsanyi transformation converts the incomplete-information static game into a dynamic game with complete but imperfect information, which can then be analyzed using perfect Bayesian equilibrium.
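To make the first move of the Harsanyi transformation concrete, the following is a minimal sketch of a two-type game in which "Nature" draws the real player's type t1 with probability p_t and the other player evaluates expected payoffs against that distribution. The payoff numbers and action names ("approve"/"refuse") are illustrative assumptions, not values from the paper.

```python
import random

def natures_move(p_t, rng=random.random):
    """Nature assigns type t1 with probability p_t, else t2."""
    return "t1" if rng() < p_t else "t2"

def expected_payoff(p_t, payoff):
    """Expected payoff of each action against Nature's type distribution.

    payoff: dict mapping (type, action) -> utility for the deciding player.
    """
    return {a: p_t * payoff[("t1", a)] + (1 - p_t) * payoff[("t2", a)]
            for a in ("approve", "refuse")}

# Illustrative payoffs: approving a t1 (good) player pays 1, a t2 player costs 1.
payoffs = {("t1", "approve"): 1.0, ("t1", "refuse"): 0.0,
           ("t2", "approve"): -1.0, ("t2", "refuse"): 0.0}
ev = expected_payoff(0.8, payoffs)  # Nature draws t1 with probability 0.8
```

With these assumed payoffs, "approve" dominates whenever p_t exceeds 0.5, which is exactly the kind of threshold comparison the equilibrium analysis below formalizes.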

Factor analysis
Factor analysis extracts common factors from a number of observed indicators. Its mathematical model is shown in Formula (2) [13]:

X = AF + ε (2)

where X is the vector of observed variables, A is the factor loading matrix, F is the vector of common factors, and ε is the vector of specific factors.
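As a hedged sketch of the dimensionality-reduction step: the paper performs factor analysis in SPSS, but an analogous reduction can be done with scikit-learn's `FactorAnalysis` (a tooling assumption, not the authors' setup). The data here is synthetic Gaussian noise standing in for standardized loan variables.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))   # 200 samples, 10 observed variables

fa = FactorAnalysis(n_components=3, random_state=0)
F = fa.fit_transform(X)              # common-factor scores F (200 x 3)
A = fa.components_.T                 # estimated loading matrix A (10 x 3)
```

The reduced matrix F replaces X in downstream learning, which is what shrinks the XGBoost training cost.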

Introduction to XGBoost Algorithm
XGBoost generates weak learners by optimizing the loss function, using both its first and second derivatives, and improves performance through pre-sorting and weighted quantile sketches [14]. Let the data set be D = {(x_i, y_i)} (|D| = n, x_i ∈ R^m, y_i ∈ R), and let the CART space be F = {f(x) = w_q(x)} (q: R^m → T, w ∈ R^T), where q maps a sample to a leaf of the tree structure, T is the number of leaves, and each f corresponds to a tree structure q and leaf weights w. The XGBoost algorithm is implemented by minimizing the regularized objective function shown in equations (3) and (4):

Obj = Σ_i L(y_i, ŷ_i) + Σ_k Ω(f_k) (3)

Ω(f) = γT + (λ/2)‖w‖² (4)

where L is the loss function, Ω is the penalty term for model complexity, γ is the penalty coefficient on the number of leaves, and λ is the L2 regularization coefficient.

Evaluation indicators
For the binary classification problem, the four possible prediction outcomes are recorded in Table 1, where TP = True Positives, FN = False Negatives, FP = False Positives, and TN = True Negatives [15].

Table 1. Four cases of the binary classification problem.

                      Forecast is positive    Forecast is negative
Actually positive     TP                      FN
Actually negative     FP                      TN

The accuracy (A), precision (P), recall (R) and F1 score are defined in equations (5)-(8):

A = (TP + TN) / (TP + TN + FP + FN) (5)
P = TP / (TP + FP) (6)
R = TP / (TP + FN) (7)
F1 = 2PR / (P + R) (8)

A larger A indicates a better model. Since P and R can rarely reach their maximum values at the same time, their harmonic mean, the F1 score, is used.
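Equations (5)-(8) can be computed directly from the four confusion-matrix counts; the counts used below are made-up illustrative numbers.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

a, p, r, f1 = metrics(tp=90, fn=10, fp=5, tn=95)
```

For these counts, A = 185/200 = 0.925 and R = 90/100 = 0.9, matching equations (5) and (7).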

Harsanyi-transformation Oriented Default Risk Prediction Based on FA-XGBoost
In the Harsanyi transformation, "Nature" is supposed to know the probability distribution of players' types. However, a new player's type is usually private, so it is difficult to perform the transformation. This paper therefore proposes an FA-XGBoost model to learn the probability distribution pt. Suppose the new player has two types, t1 and t2; "Nature" then assigns the new player type t1 with probability pt and type t2 with probability 1-pt.
Suppose the perfect Bayesian equilibrium corresponds to the probability pe. By comparing pt and pe, players can determine which strategy yields higher returns and how to make decisions.

A P2P Network Loan Default Risk Game Model Converted by Harsanyi Transformation
In P2P network lending, a company has two strategies: "Approve" and "Refuse". To obtain higher returns, applicants may hide their real information, so the game is one of incomplete information. This paper constructs an incomplete-information game under certain assumptions.

In Figure 1, the top node is "Nature", which assigns non-default and default loan applicants with probabilities pg and 1-pg. The ellipse below represents the information set of the P2P loan enterprise, containing the two nodes corresponding to Nature's choice; the numbers at the terminal nodes represent the enterprise's revenue when the game reaches that terminal along the corresponding path.

Based on the expected returns of the strategies, the perfect Bayesian equilibrium probability pe can be obtained, as shown in equation (9):

pe = (r+2)/(3r+2) (9)

"Nature" assigns the loan applicant's type according to the probability distribution (pg, 1-pg). If pg = pe, the P2P loan enterprise obtains the same benefit from "Approve" and "Refuse"; if pg > pe, "Approve" yields a greater benefit; if pg < pe, "Refuse" yields a greater benefit.
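The threshold rule above can be sketched as a small decision function built on equation (9); the function and variable names are ours, and the illustrative inputs below use the paper's interest rate r = 0.1342.

```python
def equilibrium_probability(r):
    """Perfect Bayesian equilibrium threshold, equation (9): pe = (r+2)/(3r+2)."""
    return (r + 2) / (3 * r + 2)

def decide(p_g, r):
    """'Approve' if the predicted non-default probability exceeds the threshold."""
    p_e = equilibrium_probability(r)
    if p_g > p_e:
        return "Approve"
    if p_g < p_e:
        return "Refuse"
    return "Indifferent"

p_e = equilibrium_probability(0.1342)   # about 0.8883 for r = 13.42%
```

An applicant with predicted pg = 0.95 would be approved, while pg = 0.50 would be refused.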

Data overview and pre-processing
This paper uses Lending Club's 2018 transaction records, containing 495,242 samples with 144 variables each. Since this paper studies whether a loan applicant finally repays on time, the loan status is selected as the label, and the remaining 143 variables are used as attributes.
To facilitate the subsequent analysis and model training, the original data were pre-processed in four steps. After preprocessing, each sample retains 74 variables, which are used in the subsequent research.

Factor Analysis Dimensionality Reduction
Based on the preprocessed data in Section 4.1, factor analysis was performed in SPSS to reduce the dimensionality of the data; the total variance explained after reduction is shown in Table 2. As Table 2 shows, the cumulative contribution rate of the first 31 principal components reaches 85.205%, so the 74 original variables are reduced to 31 new features, named F1, F2, …, F31.
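The component count can be chosen programmatically by accumulating explained-variance ratios until a threshold is reached (the paper keeps 31 of 74 components at 85.205%). This is a hedged sketch with a made-up ratio vector; in practice the ratios would come from the factor-analysis output.

```python
import numpy as np

def n_components_for(explained_ratio, threshold=0.85):
    """Smallest k whose cumulative explained-variance ratio reaches threshold."""
    cum = np.cumsum(explained_ratio)
    return int(np.searchsorted(cum, threshold) + 1)

# Illustrative ratios: cumulative sums are 0.40, 0.65, 0.80, 0.90, 0.95, 1.00
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.05, 0.05])
k = n_components_for(ratios, threshold=0.85)
```

Here the 0.85 threshold is first crossed at the fourth component, so k = 4.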

FA-XGBoost Driven Harsanyi Transformation
To evaluate the model, the dimensionally reduced data are divided into a training set, used to fit the model, and a test set, used to assess its performance. Simple random sampling is used to split the data 7:3 between training and testing.
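The 7:3 simple random split can be sketched with scikit-learn's `train_test_split`; the feature matrix and labels below are synthetic stand-ins for the reduced Lending Club data (1000 samples, 31 factor features).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.standard_normal((1000, 31))     # 31 factor-analysis features
y = rng.integers(0, 2, size=1000)       # 1 = non-default, 0 = default

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```

Fixing `random_state` makes the split reproducible across runs.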
Through training, the XGBoost model outputs a regression value for each loan applicant in the test set, i.e., the probability distribution pt of the new player's type, from which the non-default probability pg is obtained. Based on the actual interest rates in the data, the average value of 13.42% is used as the loan interest rate r. The equilibrium probability pe can then be obtained from equation (9), as shown in equation (10):

pe = (r+2)/(3r+2) = (0.1342+2)/(3×0.1342+2) = 0.8883 (10)

The game matrix based on the FA-XGBoost driven Harsanyi transformation is shown in Table 3. From Table 3 we obtain the predicted strategy in the fifth column; comparing the fifth and sixth columns yields the confusion matrix shown in Table 4. Equations (11)-(14) show that the accuracy is 0.9994 and the F1 score is 0.9997. Therefore, the FA-XGBoost driven Harsanyi transformation model for P2P loan applicant default established in this paper can help loan enterprises decide whether to approve user loans.
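The overall pipeline (reduce dimensionality, predict the non-default probability pg, then compare it with the threshold pe from equation (10)) can be sketched end to end. This is a hedged illustration on synthetic data, with `LogisticRegression` as a dependency-light stand-in for the paper's XGBoost learner; all names and sizes here are assumptions.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LogisticRegression  # stand-in for XGBoost

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
y = (X[:, 0] > 0).astype(int)            # 1 = non-default (synthetic)

# Step 1: factor-analysis dimensionality reduction
F = FactorAnalysis(n_components=4, random_state=0).fit_transform(X)

# Step 2: learn and predict the non-default probability p_g on a held-out part
clf = LogisticRegression(max_iter=1000).fit(F[:350], y[:350])
p_g = clf.predict_proba(F[350:])[:, 1]

# Step 3: compare against the equilibrium threshold p_e from equation (10)
p_e = (0.1342 + 2) / (3 * 0.1342 + 2)    # about 0.8883
strategy = np.where(p_g > p_e, "Approve", "Refuse")
```

Swapping the stand-in classifier for `xgboost.XGBClassifier` recovers the paper's FA-XGBoost setup without changing the surrounding pipeline.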

Conclusion
This paper proposes an XGBoost model based on factor analysis to solve a difficulty encountered in practical applications of the Harsanyi transformation: helping "Nature" assign a player's type. First, factor analysis is used to reduce the dimensionality of a large amount of related data; then, the XGBoost algorithm performs regression prediction on the reduced data; finally, "Nature" assigns the player's type according to the predicted probability distribution. The test results show that the prediction accuracy reaches 0.9994. Therefore, this method can effectively help loan enterprises make decisions and avoid the risk of user default.