A Novel DBN-EFA-CFA-Based Dimensional Reduction for Credit Risk Measurement

Abstract: Affected by the Federal Reserve's interest rate hikes and the downward pressure on the domestic economy, defaults remain prominent, and the credit risk of listed companies has become a growing concern. In this paper we present a novel credit risk measurement method based on a dimensional reduction technique. The method first extracts the risk measure indexes from the basic financial data via dimensional reduction, using a deep belief network (DBN), exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) in turn. The credit risk is then measured by a systemic structural equation model (SEM) and the logistic distribution. To validate the proposed method, we employ the financial data of listed companies from Q1 2019 to Q2 2022. The empirical results show its effectiveness in statistical evaluation, assessment on testing samples and credit risk forecasting.


Introduction
In March 2014, the "11 Chaori Bond" was declared in default because ST Chaori was unable to repay its debts and interest, becoming the first bond in China's history to default [1]. Since then, defaults in China's credit markets have occurred from time to time. In 2020, owing to the external environment, the downward pressure on the domestic economy brought huge uncertainty to highly indebted companies, and the scale of credit defaults hit another record high. Annual bond defaults in 2020 exceeded 50 billion, the highest in nearly five years [2]; in the first half of 2021 alone, the scale of bond defaults was already close to 98.4 billion, substantially exceeding the level of the same period in 2020. On 17 August 2021, the Central Financial and Economic Commission emphasized the need to strengthen the financial rule of law and infrastructure, deepen the construction of the credit system, and let credit play its fundamental role in the identification, monitoring, management and disposal of financial risks [3]. In such an institutional environment, the study of corporate credit risk is of great relevance and has clear policy implications.
standardization of their conduct and their financial situation will have a direct impact on the development of the Chinese securities market and the interests of investors [4]. Research methods for the credit risk of listed companies fall mainly into traditional credit risk analysis, multivariate statistical models, and artificial intelligence-based credit risk assessment models [5]. Traditional credit risk analysis mainly includes the 5C element analysis method, the LAPP (Liquidity, Activity, Profitability, Potentialities) method, and the five-level classification method. These methods rely primarily on the experience and subjective analysis of bank experts. In practice, the determination of debtor default rates, default factors and their weights depends heavily on the creditors' own judgment, so the resulting credit risk assessments lack objectivity and scientific rigor [6,7]. Following the traditional methods, multivariate statistical models have been widely used abroad in the assessment of credit risk; they can be summarized as linear probability models, logistic models, probit models and discriminant analysis models [8]. With the advent of the Big Data era, neural network technology gained ground because it can handle non-normally distributed data and nonlinear relationships. It was introduced into corporate credit evaluation in the 1990s, breaking the inherent limitations of traditional credit risk assessment models and further promoting their development [9].
The greatest advantage of multivariate statistical models is their clear interpretability, while the strength of neural network techniques lies in the complex combination of a large number of non-linear network layers, which allows features to be extracted from the raw data at various levels of abstraction [10]. Therefore, this paper combines deep belief networks, exploratory factor analysis, confirmatory factor analysis, and a structural equation model with the logistic distribution to measure credit risk. Deep belief networks perform the first-level extraction, selecting from the financial indicators those that have a significant impact on credit risk. Exploratory factor analysis then performs the second-level extraction, and confirmatory factor analysis achieves the final extraction of the risk measure indexes. After that, the risk measure indexes, the structural equation model and the logistic distribution are used to quantify the risk measurement. In this way, the latent variables affecting credit risk and the paths of influence behind them (the quantitative structural relationship between each latent variable and credit risk) are identified, and a credit risk measurement model is obtained.

Establishment of Credit Risk Assessment Method
The data we obtained may contain redundant information and multicollinearity, so we need to extract the indexes that have a significant impact on credit risk from the original financial indicators. We use a deep belief network (DBN) to capture the deep nonlinear structure of the input data, extract the essential features it implies, and perform the first-level extraction of the original financial indicators [11]. Then, we use exploratory factor analysis (EFA) to analyze how many common factors affect the metrics and the nature of each factor, which gives the second-level extraction [12]. Finally, we analyze the data by confirmatory factor analysis (CFA) for structural validity, convergent validity, and discriminant validity to extract the metrics we need, i.e., the risk measure indexes. The systemic structural equation model (SEM) and the logistic distribution are further used to find the specific paths through which the risk measure indexes affect credit risk, and the variables hidden behind these indexes, in order to quantify the credit risk measurement, as shown in the flow chart in Fig. 1 [13].

Risk Measure Indexes Extraction
Let the financial indicators be $X_1, X_2, \ldots, X_N$. The original risk level is divided into two categories, ST (marked with "0") and non-ST (marked with "1").
In the first step, the risk measure indexes are extracted at the first level by DBN. According to the size of the weight parameters of the neurons in each layer of the network, the $n$ input neurons $X_{i_1}, X_{i_2}, \ldots, X_{i_n}$ with the largest weights are traced back from the output layer to obtain the first-level extraction of the risk measure indexes.
In the second step, the risk measure indexes are extracted at the second level by EFA. From the indexes extracted at the first level, the metrics with factor loadings greater than 0.7 and communality greater than 0.4 are retained, giving the second-level risk measure indexes $X_{t_1}, X_{t_2}, \ldots, X_{t_n}$.
In the third step, the final extraction of risk measure indexes is performed through CFA. The indexes extracted at the second level must pass the p-value and standardized loading coefficient tests in CFA. At the same time, they must satisfy average variance extracted (AVE) > 0.5 and composite reliability (CR) > 0.7, and the square root of the AVE must be greater than the maximum absolute value of the correlation coefficients between the factors. The indexes that meet all of these conditions are the final risk measure indexes $X_{s_1}, X_{s_2}, \ldots, X_{s_n}$.
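The CFA thresholds of the third step can be sketched in a few lines. The text does not state how AVE and CR are computed, so the sketch assumes the usual definitions based on standardized loadings (AVE as the mean squared loading, CR as the squared sum of loadings over that quantity plus the summed error variances). The function names and the sample loadings are hypothetical and are not taken from the paper's tables.

```python
import math

def ave(loadings):
    """Average variance extracted: mean of the squared standardized loadings (assumed definition)."""
    return sum(l * l for l in loadings) / len(loadings)

def composite_reliability(loadings):
    """Composite reliability computed from standardized loadings (assumed definition)."""
    total = sum(loadings)
    error = sum(1.0 - l * l for l in loadings)
    return total * total / (total * total + error)

def passes_cfa(loadings, correlations_with_other_factors):
    """Apply the three thresholds quoted in the text to a single factor."""
    a = ave(loadings)
    cr = composite_reliability(loadings)
    max_corr = max(abs(r) for r in correlations_with_other_factors)
    return a > 0.5 and cr > 0.7 and math.sqrt(a) > max_corr

# Hypothetical standardized loadings, for illustration only.
strong_factor = [0.95, 0.97, 0.96]
weak_factor = [0.25, 0.40]

print(passes_cfa(strong_factor, [0.07, 0.05]))
print(passes_cfa(weak_factor, [0.07]))
```

A factor that fails any one of the three conditions is dropped together with its observed variables, which is the rule applied later in the final extraction.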

Credit Risk Quantification
The common factors obtained from the CFA during the extraction of risk measure indexes are denoted $\xi_1, \xi_2, \ldots, \xi_m$ and serve as latent variables for measuring the credit risk $\eta$ of listed companies. Their corresponding risk measure indexes $X_{s_1}, X_{s_2}, \ldots, X_{s_n}$ serve as observed variables, and the credit risk measurement $X'$ is the observed variable corresponding to the credit risk $\eta$.
Based on the structural equation model, the structural model relates the credit risk $\eta$ to the latent variables $\xi_1, \xi_2, \ldots, \xi_m$, and the measurement model of credit risk relates $\eta$ to the credit risk measurement $X'$; here $l_1, l_2, \ldots, l_m, b$ are the regression coefficients and $\varepsilon_\eta$ is the error term. The measurement model for each observed and latent variable relates every risk measure index to its latent variable through a regression coefficient, with $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ as the error terms.
The risk measurement $X'$ is then expressed in terms of the risk measure indexes, where $h_1, h_2, \ldots, h_n$ are the regression coefficients and $\varepsilon$ is the error term. Finally, the credit risk level $X$ is obtained from the logistic distribution, where $X$ takes one of two classification results, ST (marked with "0") or non-ST (marked with "1").

Data and Data Pre-Processing
Considering data availability and sample representativeness, this paper selects all non-ST companies in the CSI 300 (300 companies) and all ST companies among listed companies (190 companies), totaling 490 companies, as research subjects. The data set consists of all financial information on these 490 companies in the Wind database for the period Q1 2019 to Q1 2021, totaling 200 financial metrics. Every company had missing data, including 21 financial indicators with all data missing in individual quarters and 207 companies with some financial indicators completely missing, so we filled the gaps with cubic spline interpolation and K-nearest neighbor interpolation based on generalized cosine.

First-level extraction
In our DBN, we select 55% of the data as the training samples and 45% as the testing samples, with the 200 financial indicators as the visible layer neurons, and the original risk level divided into the two categories ST and non-ST. We connect the last hidden layer to a BP neural network and use the two classification results, ST (marked with "0") and non-ST (marked with "1"), as the output layer neurons. We then determine the number of hidden layers and the number of nodes in each hidden layer by trial and error, using the accuracy on the testing sample to judge the generalization performance of each structure. The results are shown in Fig. 2.
In setting the nodes of the last hidden layer, we fix the number of neurons at 12, taking into account that there are 12 level-I risk assessment indexes. To go from 200 neurons to 12 neurons while limiting the loss of information passed through the intermediate layers, we first search coarsely with 150, 100 and 50 nodes in the preceding hidden layer. The experiment shows high accuracy around 100 nodes and low accuracy below 50 or above 150. We therefore refine the search with 120, 80, 60 and 40 nodes in that layer and select the best value. In the same way, we find the best number of nodes for the first hidden layer in the structures with three and four hidden layers. The results in Fig. 2 show that the network with three hidden layers performs better than the networks with two or four hidden layers. Because the comparison among the networks with two hidden layers shows the best recognition performance with 60 or 80 nodes in the first hidden layer, for the network with three hidden layers we fix the second and third hidden layers at 60 and 12 nodes and add a new first hidden layer with 140, 120, 100 or 80 nodes. The experimental results show that the network performs best with 120 nodes in the first hidden layer. Therefore, we finally chose a DBN with three hidden layers of 120, 60 and 12 nodes for the first-level extraction of the risk measure indexes.
Fig. 2 The accuracy of the testing sample

In the first-level extraction of risk measure indexes, we use a top-down approach. Starting from the output layer, we find the six neurons in the third hidden layer that have the most significant impact (the highest weight) on each neuron in the output layer and keep the eight neurons that occur most frequently among them. We then trace back layer by layer in the same way until we reach the key neurons in the input layer, i.e. the significant metrics that affect credit risk. Finally, we take the 20 corresponding input indicators with the highest weights; the results are shown in Table 1.
As shown in Table 1, the weight of the debt-to-long capital ratio ($X_1$) is 60.68, which is much higher than the weights of the other indicators. This is because the debt-to-long capital ratio reflects the long-term capital structure of an enterprise. The smaller the indicator, the lower the degree of debt capitalization of the enterprise, the lower the pressure of long-term debt repayment, the lower the risk of default, and the smaller its credit risk. We can also see that these 20 indicators cover the profitability, solvency, operating capacity and development capacity of the company fairly comprehensively.
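The back-tracing step can be illustrated with a small sketch. It is not the authors' implementation: the weights are random stand-ins for a trained network, influence is ranked by absolute weight, and the counts used for the intermediate layers (six picks per neuron, eight neurons kept) are assumptions, since the text gives these numbers only for the third hidden layer and the final 20 indicators. The helper names `top_k`, `key_sources` and `random_layer` are hypothetical.

```python
import random
from collections import Counter

def top_k(values, k):
    """Indices of the k largest values, largest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

def key_sources(weights, targets, per_target, keep):
    """weights[i][j] links source neuron i to target neuron j.

    For every target, pick the per_target sources with the largest
    absolute weight, then keep the `keep` most frequently picked sources.
    """
    votes = Counter()
    for j in targets:
        column = [abs(row[j]) for row in weights]
        votes.update(top_k(column, per_target))
    return [i for i, _ in votes.most_common(keep)]

def random_layer(rng, n_in, n_out):
    """Random weight matrix standing in for a trained layer."""
    return [[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_in)]

rng = random.Random(0)
sizes = [200, 120, 60, 12, 2]  # input, three hidden layers, output (sizes from the text)
layers = [random_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]

# Walk from the output layer back to the input layer.
keys = [0, 1]                                                  # the two output neurons
keys = key_sources(layers[3], keys, per_target=6, keep=8)      # third hidden layer (from the text)
keys = key_sources(layers[2], keys, per_target=6, keep=8)      # second hidden layer (assumed counts)
keys = key_sources(layers[1], keys, per_target=6, keep=8)      # first hidden layer (assumed counts)
selected = key_sources(layers[0], keys, per_target=6, keep=20) # up to 20 input indicators

print(sorted(selected))
```

With trained weights in place of `random_layer`, `selected` would hold the positions of the candidate indicators in the input layer.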

Second-level extraction
Deep belief networks narrow the scope for extracting risk measure indexes. Before formally constructing a risk measurement model, we need to find the true risk measure indexes using EFA and CFA. In the EFA, we keep the indicators with factor loadings greater than 0.7, remove the indicators that load similarly on different factors or that fall on a specific factor but have factor loadings below 0.7, and delete the indicators with communality below 0.4. This yields 11 indicators, as shown in Table 2.
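The filtering rule just described can be sketched as follows. This is an illustration under assumptions: communality is computed as the sum of squared loadings across the retained factors, "similar loading on different factors" is approximated by a gap of less than 0.2 between the two largest absolute loadings (the value of `cross_gap` is hypothetical), and the loading matrix is made up.

```python
def efa_filter(loadings, min_loading=0.7, min_communality=0.4, cross_gap=0.2):
    """Return the indicators that survive the EFA screening.

    loadings maps an indicator name to its loadings on each factor.
    """
    kept = []
    for name, row in loadings.items():
        sizes = sorted((abs(v) for v in row), reverse=True)
        communality = sum(v * v for v in row)
        # Assumed proxy for "similar loading on different factors".
        cross_loaded = len(sizes) > 1 and sizes[0] - sizes[1] < cross_gap
        if sizes[0] > min_loading and communality >= min_communality and not cross_loaded:
            kept.append(name)
    return kept

# Hypothetical rotated loadings on two factors.
example = {
    "A": [0.95, 0.02],   # clean, strong loading: kept
    "B": [0.65, 0.10],   # largest loading below 0.7: removed
    "C": [0.75, 0.72],   # loads similarly on both factors: removed
}

print(efa_filter(example))
```

Note that with communality computed this way, any indicator whose largest loading exceeds 0.7 already has communality above 0.49, so the communality threshold only matters when communalities are supplied separately by the EFA software.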
The 11 indicators obtained are then subjected to EFA again to find the latent variable corresponding to each indicator, which completes the second-level extraction of risk measure indexes. From Table 3, we can see that the degree of freedom (df) is 55, the Kaiser-Meyer-Olkin measure of sampling adequacy is KMO = 0.667 > 0.6, and the observed value of Bartlett's sphericity test statistic is 33 974.601 with a significance (Sig.) of 0.000 < 0.001. This indicates that the 11 indicators are strongly correlated and that the credit risk information they reflect overlaps to some extent, so we need EFA to find the latent variables behind these indicators.
To make the extracted latent variables, i.e., factors, more realistic, the 11 indicators were reduced in dimension according to the extraction principle of eigenvalue > 1, and the eigenvalues and cumulative contribution rates were calculated. The results are shown in Table 4: five factors conform to the extraction principle, and their cumulative contribution is 80.28%, that is, the variance explained by these five factors accounts for 80.28% of the total variance. Little information is lost by using these five factors, so they can reflect the credit risk of listed companies comprehensively.
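The eigenvalue > 1 rule and the cumulative contribution rate amount to simple arithmetic on the eigenvalues of the correlation matrix. The sketch below uses made-up eigenvalues (the values in Table 4 are not reproduced here) and a hypothetical function name `kaiser_selection`.

```python
def kaiser_selection(eigenvalues):
    """Count the factors with eigenvalue > 1 and return their share of total variance."""
    total = sum(eigenvalues)  # equals the number of standardized indicators
    kept = [ev for ev in sorted(eigenvalues, reverse=True) if ev > 1.0]
    return len(kept), sum(kept) / total

# Hypothetical eigenvalues for 11 standardized indicators (they sum to 11).
hypothetical_eigenvalues = [3.0, 2.0, 1.5, 1.2, 1.1, 0.8, 0.6, 0.4, 0.2, 0.1, 0.1]

n_factors, cumulative_share = kaiser_selection(hypothetical_eigenvalues)
print(n_factors, round(cumulative_share, 4))
```

With the paper's actual eigenvalues, the same calculation would give five factors and the reported cumulative contribution.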
Classifying the 11 indicators according to their higher loadings, it can be seen from Table 5 that the factor score formula for the first principal factor is:
$$F_1 = 0.984X_{20} + 0.979X_7 + 0.953X_6 + 0.006X_4 + 0.020X_{17} + 0.051X_9 + 0.021X_3 - 0.012X_{12} - 0.043X_{16} + 0.006X_{15} + 0.003X_8$$
$F_1$ has large loadings of 0.953, 0.979 and 0.984 on the three indicators $X_6$, $X_7$ and $X_{20}$, respectively. As all three indicators reflect the profitability of listed companies, $F_1$ is defined as profitability in this paper.
The factor score formula for the third principal factor is:
$$F_3 = \cdots + 0.205X_{17} + 0.931X_9 + 0.860X_3 + 0.028X_{12} + 0.024X_{16} + 0.049X_{15} + 0.052X_8$$
The loadings of $F_3$ on $X_9$ and $X_3$ are large: one exceeds 0.9 and the other is above 0.85. These two indicators reflect the solvency of listed companies, so $F_3$ is defined as solvency.
The factor score formula for the fourth principal factor is:
$$F_4 = 0.019X_{20} + 0.017X_7 + 0.010X_6 + 0.012X_{17} - 0.005X_9 + 0.012X_3 + 0.793X_{12} - 0.789X_{16} + 0.007X_{15} + 0.005X_8$$
$F_4$ loads heavily on $X_{12}$ and $X_{16}$. The former loading is 0.793 and reflects the profitability and operation management level of listed companies; the latter is −0.789 and reflects their short-term solvency and the objectives pursued in their operations. $F_4$ is therefore defined as the operation level.
The factor score formula for the fifth principal factor is:
$$F_5 = 0.002X_{20} + 0.002X_7 + 0.001X_6 + 0.002X_{17} + 0.017X_9 - 0.023X_3 + 0.006X_{12} + 0.004X_{16} + 0.737X_{15} - 0.732X_8$$
The loadings of $F_5$ on $X_{15}$ and $X_8$ are large, with absolute values above 0.7. These two indicators reflect the development ability of listed companies, so $F_5$ is defined as development ability. According to these factor analysis results, the latent variables corresponding to the 11 indicators are the five main factors, and the final extraction of the risk measure indexes can be discussed through them.
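The way each factor is named from its dominant loadings can be checked directly against the coefficients quoted above for $F_1$, $F_4$ and $F_5$. The sketch below simply picks the indicators whose coefficient exceeds 0.7 in absolute value; the dictionary layout and the function name `dominant_indicators` are illustrative choices.

```python
# Coefficients copied from the factor score formulas for F1, F4 and F5.
FACTOR_COEFFICIENTS = {
    "F1": {"X20": 0.984, "X7": 0.979, "X6": 0.953, "X4": 0.006, "X17": 0.020,
           "X9": 0.051, "X3": 0.021, "X12": -0.012, "X16": -0.043,
           "X15": 0.006, "X8": 0.003},
    "F4": {"X20": 0.019, "X7": 0.017, "X6": 0.010, "X17": 0.012, "X9": -0.005,
           "X3": 0.012, "X12": 0.793, "X16": -0.789, "X15": 0.007, "X8": 0.005},
    "F5": {"X20": 0.002, "X7": 0.002, "X6": 0.001, "X17": 0.002, "X9": 0.017,
           "X3": -0.023, "X12": 0.006, "X16": 0.004, "X15": 0.737, "X8": -0.732},
}

def dominant_indicators(coefficients, threshold=0.7):
    """Names of the indicators whose absolute coefficient exceeds the threshold."""
    return {name for name, c in coefficients.items() if abs(c) > threshold}

for factor, coefficients in FACTOR_COEFFICIENTS.items():
    print(factor, sorted(dominant_indicators(coefficients)))
```

The selected sets match the indicators used in the text to interpret each factor.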

Final extraction
According to Occam's Razor, the principle of "simple and effective", our model should be as simple as possible. To find indicators that portray the degree of risk more quickly and concisely, we performed a validity analysis on the data through CFA on the basis of the EFA, and then removed the inappropriate variables to achieve the final extraction of risk measure indexes.
It can be seen from Table 6 that when $X_{12}$ measures the operation level, the absolute value of its standardized loading coefficient is 0.254 < 0.6, which means that the measurement relationship is weak. When $X_{16}$ measures the operation level, the standardized loading coefficient is not significant (p = 0.378 > 0.05), indicating that there is no significant measurement relationship. The absolute values of the standardized loading coefficients of $X_8$ and $X_{15}$ are 0.52 and 0.151 respectively, both less than 0.6, indicating that the measurement relationship is weak.
AVE and CR are used for convergent validity analysis. As can be seen from Table 7, the CR corresponding to operation level is less than 0.7, and for development ability the AVE is less than 0.5 and the CR is less than 0.7. These results imply that the convergent validity of the data is poor and that operation level and development ability need to be removed before further analysis. Combining Table 6 and Table 7, it can be seen that the observed variables corresponding to operation level and development ability need to be deleted.
After deleting $X_8$, $X_{12}$, $X_{15}$ and $X_{16}$, we performed CFA again on the remaining indicators. For the measurement relationship, the absolute values of the standardized loading coefficients were all greater than 0.6 and significant, implying a good measurement relationship. In terms of AVE and CR, all three factors had AVE values greater than 0.5 and CR values higher than 0.7, implying that the data had good convergent validity. From the analysis of Table 8 for discriminant validity, the AVE square root value for profitability is 0.96, which is greater than the maximum absolute value of the inter-factor correlation coefficient, 0.07, implying good discriminant validity. Similarly, solvency and cash flow also have good discriminant validity.

H3: Solvency has a significant impact on the credit risk of listed companies, and each indicator of solvency indirectly affects the credit risk of listed companies through its impact on solvency.

Credit risk path analysis and risk measurement
In order to further verify the rationality of the model and the impact of each indicator, the SEM of credit risk for listed companies is constructed based on the above assumptions, and its impact path is shown in Fig. 3.
In Fig. 3, we can see that the exogenous variables include profitability, cash flow, and solvency, which are set as $\xi_1$, $\xi_2$, $\xi_3$. The endogenous variable is credit risk, set as $\eta$. Based on the SEM principle, the mathematical expression of the theoretical model can be obtained.
From the structural and measurement models for credit risk, it can be seen that profitability, solvency and cash flow all influence the credit risk of listed companies to different degrees. Profitability has the most significant impact, with a regression coefficient of 0.39, followed by solvency and cash flow, with regression coefficients of 0.094 and 0.061, respectively.
The metric model for profitability is:
$$\text{Profitability} = 0.3263X_7 + 0.3526X_6 + 0.3211X_{20}$$
Within profitability, the return on total assets has the most significant impact, with a path coefficient of 0.994. This indicates that the return on total assets is the key to corporate profitability. Combined with the structural model of credit risk, it is clear that hypothesis H1 holds, i.e. profitability has a significant impact on the credit risk of listed companies, and the indicators of profitability indirectly affect the credit risk of listed companies through their impact on profitability.
The metric model for cash flow is:
$$\text{Cash flow} = 0.5090X_4 + 0.4910X_{17}$$
From the perspective of cash flow, the path coefficient of cash to meet invest needs is the larger one, at 0.987. The larger this indicator, the higher the self-sufficiency rate of the enterprise's capital and the stronger the enterprise's ability to maintain production and operation. Combined with the structural model of credit risk, hypothesis H2 holds, that is, cash flow has a significant impact on the credit risk of listed companies, and each indicator of cash flow indirectly affects the credit risk of listed companies through its impact on cash flow.
In the metric model for solvency, the path coefficients for corporate free cash flow per share and net cash from operating activities/current liability are 0.697 and 1, respectively. In summary, the risk measurement combines the three metric models according to the structural model, and the risk level is then determined from the risk measurement through the logistic distribution.
The standardised coefficient estimates, standard errors, critical ratios and p-values for the main SEM pathways can be seen in Table 9. In the measurement model, the regression coefficients of the observed variables corresponding to each latent variable are all above 0.5 and all are significant at the 0.05 significance level. In the structural model, the regression coefficients between the latent variables also all pass the significance test at the 0.05 level. This indicates that the model meets the basic fit criteria and is identifiable.
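The relation between path coefficients and the weights of a metric model can be illustrated with the two solvency path coefficients quoted above. The sketch assumes that the weights are the path coefficients rescaled to sum to one; the text does not state this rule, but the two resulting values coincide with the solvency weights reported in the conclusion. The function name `normalize` is hypothetical.

```python
def normalize(path_coefficients):
    """Rescale path coefficients so that they sum to one (assumed weighting rule)."""
    total = sum(path_coefficients)
    return [value / total for value in path_coefficients]

# Solvency path coefficients quoted in the text.
solvency_paths = [0.697, 1.0]
solvency_weights = normalize(solvency_paths)

print([round(w, 4) for w in solvency_weights])  # [0.4107, 0.5893]
```

The profitability and cash flow weights quoted in the metric models also sum to one, which is consistent with the same rescaling, although only one path coefficient is quoted for each of those factors.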

Statistical evaluation
In this section, to further validate the soundness of the model, we evaluate the structural equation model based on the model fit index and the following table shows the results.
The SEM fit index test results are shown in Table 10, and the fit criteria are basically met, except for χ² and χ²/df, which are not ideal. However, these two indicators are sensitive to sample size and tend to reject even well-fitting models when the sample size is large, so the model is not revised here.

Assessment on testing samples
In this section we assess the credit risk of the data in the testing samples with the risk metric model. The testing samples account for 45% of the original sample, with a total of 1 984 samples and 13 892 data points, of which 2 039 are missing, a missing rate of 14.68%. Among them, the data of the cash to meet invest needs indicator are missing for 5 quarters, and the data of the net cash from operating activities/current liability indicator are missing for 27 companies. We interpolate these missing data as in the data pre-processing. Such incomplete data inevitably limit the quality of our evaluation. The results are shown in Table 11.
As can be seen from Table 11, 802 of the non-defaulted samples were correctly assessed, 641 of the defaulted samples were correctly assessed, and the overall prediction accuracy of the risk metric model was (802+641)/1 984 = 72.73%. The type Ⅰ error rate, i.e., the error rate of determining the non-defaulted sample as the defaulted sample, is 408/(802+408) = 33.72%. The type Ⅱ error rate is the error rate of determining the defaulted sample as the non-defaulted sample.

Finally, we forecast credit risk with data for a total of 5 quarters from Q2 2021 to Q2 2022 for 454 companies, including 300 non-ST companies and 154 ST companies, giving 2 270 samples. Of these, 47 companies have all data missing for individual indicators and 2 quarters have all data missing for the cash to meet invest needs indicator. We fill these gaps with the same interpolation methods and use the interpolated data for forecasting. The results are shown in Table 12.
The results in Table 12 show that the overall prediction accuracy of the risk metric model was (1 075+576)/2 270 = 72.73%. The type Ⅰ error rate was 432/(1 075+432) = 28.67%, and the type Ⅱ error rate was 187/(187+576) = 24.51%, which is 4.16 percentage points lower than the type Ⅰ error rate. Of the 2 270 forecast samples, 1 651 were predicted correctly and 619 incorrectly. Of the non-defaulted samples, 432 were incorrectly judged to be defaulted. Of the defaulted samples, 187 were incorrectly judged to be non-defaulted.
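The accuracy and error rates quoted for the testing samples and for the forecast can be reproduced from the four counts of each confusion matrix. In the sketch below, the number of defaulted testing samples judged non-defaulted is not quoted in the text; it is inferred as the remainder of the 1 984 testing samples, which is an assumption. The function name `rates` is hypothetical.

```python
def rates(correct_non_default, wrong_non_default, correct_default, wrong_default):
    """Return (accuracy, type I error rate, type II error rate) in percent."""
    total = correct_non_default + wrong_non_default + correct_default + wrong_default
    accuracy = (correct_non_default + correct_default) / total
    type_1 = wrong_non_default / (correct_non_default + wrong_non_default)
    type_2 = wrong_default / (correct_default + wrong_default)
    return tuple(round(100 * value, 2) for value in (accuracy, type_1, type_2))

# Testing samples: the last count is inferred from the total of 1 984 samples.
testing = rates(802, 408, 641, 1984 - 802 - 408 - 641)

# Forecast samples: all four counts are quoted in the text.
forecast = rates(1075, 432, 576, 187)

print(testing)
print(forecast)
```

Under the stated assumption, the inferred count is 133, which would put the type Ⅱ error rate on the testing samples at about 17.18%.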

Conclusion
This paper uses financial data from Q1 2019 to Q2 2022 as the research data set for the construction and evaluation of a novel credit risk measurement model. The measure indexes for the credit risk of listed companies are obtained through DBN, EFA and CFA in turn. The risk measurement is then obtained with SEM and the logistic distribution. Finally, the performance of the risk measurement is evaluated through statistical evaluation, assessment on testing samples and credit risk forecasting, resulting in the following conclusions:
1) The credit risk of listed companies is mainly influenced by profitability, cash flow and solvency. Profitability has the greatest influence with a regression coefficient of 0.39, cash flow has the least influence with a regression coefficient of 0.061, and solvency is in the middle with a regression coefficient of 0.094.
2) Within profitability, the return on assets has the highest impact weight, the annualized return on assets is the second highest, and the return on total assets is the lowest, with impact weights of 0.3526, 0.3263 and 0.3211, respectively.
3) Cash flow is mainly reflected by the net profit cash cover and the cash to meet invest needs, which have impact weights of 0.5090 and 0.4910, respectively.
4) Within solvency, the impact of corporate free cash flow per share is significant with an impact weight of 0.5893, followed by net cash from operating activities/current liability with an impact weight of 0.4107.