A deep learning approach of financial distress recognition combining text

Abstract: The financial distress of listed companies not only harms the interests of internal managers and employees but also brings considerable risks to external investors and other stakeholders. Therefore, it is crucial to construct an efficient financial distress prediction model. However, most existing studies use financial indicators or text features without contextual information to predict financial distress and fail to extract critical details disclosed in Chinese long texts. This research introduces an attention mechanism into the deep learning text classification model to handle the classification of Chinese long text sequences. We combine the financial data and the management discussion and analysis (MD&A) Chinese text data in the annual reports of 1642 listed companies in China from 2017 to 2020 in the model and compare the effects of the data on different models. The empirical results show that deep learning models outperform traditional machine learning models in financial distress prediction. Adding the attention mechanism improved the effectiveness of the deep learning models in financial distress prediction. Among the models constructed in this study, the Bi-LSTM + Attention model achieves the best performance in financial distress prediction.


Introduction
Corporate financial distress is a financial risk, indicating a high probability of corporate bankruptcy, affecting the stability of financial markets and social and economic systems and bringing adverse effects to the global economy. Therefore, financial distress prediction (FDP) has attracted significant attention from stakeholders such as corporate managers, governments and investors. FDP can provide early warning information for corporate risks, help corporate managers take risk control measures to avoid the deterioration of the situation and help investors grasp the profitability of listed companies and adjust investment strategies to reduce expected investment losses [1][2][3].
FDP is a binary classification problem. Most early studies used corporate financial indicators combined with traditional machine learning models to detect corporate financial distress. When analyzing financial indicators for financial distress classification, statistical methods are used for feature engineering and combined with machine learning models. The commonly used models are Bayesian network, support vector machine (SVM), random forest (RF), extreme gradient boosting (XGBoost), and artificial neural networks (ANN). Deep learning technology has developed rapidly in recent years with the improvement of hardware computing power. Gradually, researchers began to introduce deep learning models into FDP research, such as convolutional neural networks (CNN) and recurrent neural networks (RNN).
The financial indicators used in FDP are usually extracted from the data of the company's annual financial statements. Companies in financial distress often manipulate information disclosure, thereby preventing the negative impact of financial distress on the enterprise [4,5]. In addition, when only financial indicator data are used to detect financial distress, the information source is single and limited, and it is challenging to identify enterprise managers' attitudes and manipulation traces. With the application of deep learning technology in natural language processing (NLP) research, deep learning text classification models have been introduced into FDP research. Experts in finance and accounting have begun to use powerful computing hardware and artificial intelligence technology to extract information from text data to identify financial distress [6,7]. An enterprise's disclosure text comprises corporate financial reports, annual reports and prospectuses, of which the management discussion and analysis (MD&A) section of the company's annual report is the most studied and is the primary channel for the disclosure of business information and financial information. It expresses the company management's views on potential opportunities and challenges in development to the outside world. The disclosed text features can therefore improve the effectiveness of financial analysis [8][9][10][11]. However, the MD&A text content is long and unstructured. Determining how to deal with the long text and effectively extract the semantic features of the text is the main problem.
This study combines a deep learning text classification model with an attention mechanism to adapt to MD&A Chinese long text classification in FDP. The text sequence of the MD&A part of the annual reports of listed companies in China is usually very long, typically more than several thousand Chinese characters. In the MD&A Chinese text, the second half is mostly a numerical description, and the textual content that contains managers' views on enterprise development is in the first half. In our experiments, first, the text is preprocessed, and then the first 2000 words, a length chosen experimentally, are selected as the fixed-length MD&A Chinese text input, which covers all the keywords of the original Chinese text content. Second, we combine the attention mechanism with the gated recurrent unit (GRU) and bidirectional long short-term memory (Bi-LSTM) models to construct the GRU+Attention and Bi-LSTM+Attention models, capture time series information by using multiple hidden layer structures of GRU and LSTM and then use the attention mechanism to further summarize the key time series information in the text. Finally, the entire MD&A Chinese long text is extracted as a vector containing the critical information of the text, which is combined with the financial indicator vector to predict whether the company is facing financial distress. This paper constructs a deep learning text classification model with an added attention mechanism. Financial indicators were added for FDP research, and the effects of traditional machine learning models were compared. The experimental results show that the proposed Bi-LSTM+Attention model outperforms other deep learning and traditional machine learning models in area under the receiver operating characteristic curve (AUC-ROC) and area under the precision-recall curve (PR-AUC).
The main contributions of this study to the research field of FDP are the following: For textual data, we validate the effectiveness of MD&A textual data in FDP. For model construction, the model with attention mechanism has improved the ability to extract critical information from Chinese long text sequences, and the prediction effect of financial distress is enhanced. For the prediction model, the Bi-LSTM+Attention model has a better binary classification effect than other models in FDP. It extracts Chinese text sequence information through the hidden layer in the bidirectional LSTM layer and uses the attention mechanism to increase the weight of important information in the hidden layer. As the model recognizes the critical information of the MD&A Chinese text, the prediction effect of financial distress is improved.
The rest of the paper is organized as follows. We review related research on FDP in Section 2. We present the flow of the study and the main model structure in Section 3. We describe the selection of samples, data preprocessing process, model settings and evaluation metrics in Section 4. We present the experimental results of each model in Section 5. We conclude and summarize the main findings in Section 6.

Literature review
Financial distress occurs when a company faces financial difficulties and financial risks. Scholars have defined financial distress in different ways. Beaver et al. defined the default of preferred dividends and debt default as financial distress [12]. Deakin et al. believed that the sign of financial distress is that the enterprise is insolvent or liquidated for the benefit of creditors [13]. Carmichael et al. believed that financial distress is a form of corporate debt default or insufficient funds [14]. Zmijewski et al. defined financial distress as filing for bankruptcy [15]. Altman et al. described financial distress in their research and believed that corporate bankruptcy was the most suitable scenario for financial distress [16]. Dimitras et al. defined financial distress as a situation in which a company cannot pay suppliers and preferred shareholders, overdraws bills and is legally bankrupt [17]. Ross et al. pointed out that financial distress is when a company's operations are forced to take corrective measures due to insufficient cash flow [18].
Most scholars choose to study special treatment (ST) companies as samples of financial distress. Ding et al. used statistical methods to compare and verify the degree of correlation between ST companies and the samples of financial distress. They found that companies classified as ST had an increased probability of financial deterioration and financial distress in the next year [19]. Geng et al. found that ST companies are a good representative of financial distress, as they are more likely to go bankrupt in the future [20]. Ruan et al. used the ST labels of listed companies to indicate whether a firm is in financial distress [21]. In general, companies are marked as ST for two reasons. One is that the listed company has incurred losses for two consecutive years after being audited by the accounting company. The other is that the net earnings per share of public companies are less than the par value of their shares. Typically, public companies marked as ST face severe financial deterioration, consistent with financial distress.
In traditional FDP research, most researchers first extract features from financial indicators and then combine statistical models or machine learning models to predict the financial distress of enterprises. The financial data of enterprises are often closely related to their actual financial situation [22]. In the early research, Beaver et al. pioneered the use of financial ratios to predict the financial distress of enterprises and combined them with a univariate discriminant model for analysis. The results showed that financial characteristics are effective in FDP research [12]. Altman et al. used the multivariate discriminant statistical method to study FDP and converted the accounting information of manufacturing enterprises into financial ratios. The results showed that using accounting information to calculate financial ratios can perform well in FDP [23].
Later, scholars paid more attention to sensitive indicators and conducted combined analyses of the company's open-access quantitative data. Some scholars focus on designing models to capture better features. For example, Wang et al. injected feature selection strategies into traditional models. They constructed the FS Boosting ensemble learner method, which can automatically capture the feature diversity of samples to obtain better performance [24]. Huang et al. studied the effect of feature preprocessing of financial indicators on the prediction effect. They constructed a least absolute shrinkage and selection operator (LASSO) selection technique to screen critical financial indicators and found that fewer essential variables can achieve better prediction performance [25]. There are also studies using financial features extracted by financial experts. For example, Zhou et al. combined classifiers such as LDA to test the features of FDP proposed by financial experts. The results showed that these features positively impact the FDP of listed companies in China [26]. For financial risk forecasts in other regions, Alaminos et al. summarized ten financial variables through previous research and used logistic regression to construct a general model that could explain global bankruptcy forecasts [27]. Huang et al. calculated 16 financial ratios based on four basic financial statements of listed companies in Taiwan and compared the performances of six machine learning methods in FDP [28].
With the rapid development of artificial intelligence, more and more scholars are combining deep learning models with traditional research. Deep learning has many hidden layers for feature transformation. The layer-by-layer feature conversion converts the features of the sample in the original space to a new feature space, making classification or prediction easier. Deep learning techniques make fewer assumptions about the data [10], and they have solid learning ability, comprehensive coverage, strong adaptability and good portability. Therefore, scholars have gradually introduced deep learning algorithms into various financial market studies, such as stock market prediction [29], bank bankruptcy prediction [30] and customer credit scoring [31]. These studies not only improve the research effectiveness of the problem but also can be used to extract features of supplementary data in addition to financial indicators. FDP is a classic research direction in financial market research, and many recent studies in FDP have used deep learning models rather than statistical models. For example, Halim et al. investigated the performance of deep learning models such as RNN, LSTM and GRU in the FDP of listed companies in Malaysia and found that deep learning methods can achieve better results than traditional machine learning models [3]. Li et al. constructed a sentiment dictionary based on a deep learning framework in the financial domain. The empirical results showed that the sentiment features of annual reports extracted through the dictionary significantly impact FDP [32]. Table 1 summarizes the relevant research on FDP according to six dimensions: author, year, sample size, main models, evaluation metrics and variable types.
Text mining and big data analysis have become hot topics in academia, and the development of text analysis research has promoted the study of traditional accounting and financial issues. Du Jardin developed an improved financial fraud detection system by combining textual features from company annual reports [4]. MD&A text is one of the commonly used text types in research. Qian et al. found text features unique to MD&A, such as vocabulary size, specialized vocabulary, readability and emotional tendencies. Thus, MD&A text promotes interdisciplinary research on the link between accounting information and corporate textual disclosure [11]. In addition, many studies analyze the textual information in MD&A as a research object and use it as a supplement to corporate financial data [7,9,10,38,40].
There are several methods for text information extraction in text analysis research. Word2Vec is a commonly used model based on ANN. It encodes each word by building a bag-of-words model that maps the text to a vector containing the contextual lexical order [41]. Mai et al. used the word vector method to extract the text features of the MD&A part in their study, and the study showed that the deep learning model has superior prediction performance after adding text disclosure indicators. Moreover, combining accounting ratios and text features can further improve the prediction performance of deep learning models [10]. Xiuguo W et al. extracted the word vectors of the text features of the MD&A part through the Word2Vec method and developed an enhanced financial fraud detection system by using deep learning models such as LSTM and GRU. Their results showed that deep learning methods outperform traditional machine learning methods [38]. In other studies on FDP, Wang et al. used an ensemble learning approach by combining accounting-based features and textual disclosure information and found that a model containing real disclosed textual features is more efficacious [38]. Ruan et al. found that few scholars in FDP research use pre-trained end-to-end models to process text, so they introduced the Bidirectional Encoder Representations from Transformers (BERT) Chinese pre-training model for word embedding processing. They also added the hierarchical attention neural network (HAN) to alleviate the difficulty of extracting features from long text [21].
The structure of RNN gives it advantages in processing sequence information. RNN stores past information and current input by introducing state variables and simultaneously generates the output passed to the next node according to the past information. However, the information update and preservation of long-term and short-term dependencies of latent variables in the RNN model are unstable, and some studies have constructed the LSTM model to solve this problem [42]. Furthermore, due to the gradient vanishing and exploding problems, the effect of semantic features extracted by ordinary RNN models will gradually deteriorate with the increase of text sequence length. Studies have shown that adding an attention mechanism to the problem of super-long text classification can improve the model's ability to identify critical information [43,44]. The origin of the attention mechanism is to simulate the human brain's attention to things, and it was first used in image research [45]. The attention mechanism calculates the importance of each part by weighted summation of different parts of the output of the time series model and then captures more critical information. Later studies applied the attention mechanism to the field of NLP. Bahdanau et al. constructed a machine translation model based on the attention mechanism by adding the attention mechanism to the language codec to calculate the importance of the input and output of the translation model [46].

Methodology
The main purpose of our research is to construct a model suitable for long-text classification for FDP based on the combination of financial indicators. Typically, the long MD&A text of corporate disclosure contains several thousand characters, and the narrative structure is not uniform, which greatly complicates text feature extraction and risk information identification.

Deep learning architectures for long text
Among the commonly used word embedding methods for converting unstructured text data into structured text features, there are two main approaches. One is the word embedding method that does not consider the context, such as TF-IDF and the bag-of-words model, which takes the words in the text as a set and represents the text features by the word frequency or the number of times the words appear. This approach makes it simple to extract the subject of the text, but it ignores the order and context information of the vocabulary, and its representation of long text features is poor. The other is the word embedding method that considers the context, such as the Word2Vec word vector and deep learning word embedding models. The Word2Vec word vector model is calculated by a shallow neural network and maps the text to a vector containing the order of context words. The length of the context words involved in the calculation depends on the length of the set sliding window. The deep learning word embedding model usually refers to the vocabulary encoding layer in deep learning models, such as the word encoding layer based on the RNN model and the word encoding layer based on the Transformer. This research introduces and compares the Word2Vec word vector model and two deep learning word embedding models.
Because the MD&A text content is complex and lengthy, the dimension of the text vector after word embedding is very high, which makes text feature extraction difficult. Representing text with Word2Vec word vectors ignores the time sequence features in sentences and is less compatible with long texts. For RNN models based on time series data, such as LSTM and GRU, the gradient vanishing and exploding problems will also occur when the text is too long. When the Transformer-based BERT pre-training model is used to embed text as a vector, the BERT model limits the length of the input text sequence to no more than 512 tokens, which may cause some semantic information not to be calculated. To address these problems, studies have shown that adding an attention mechanism can enhance the critical information extraction ability of time series models when processing long sequences [47]. The attention mechanism updates the weight by calculating the information distribution in the text vector, assigning a higher weight to the vital information in the sentence, and reducing the weight of the irrelevant information so that the critical information in the sentence is summarized in the output. Therefore, this paper proposes introducing an attention mechanism layer into the deep learning text classification model by referring to the attention mechanism principle and constructing a deep learning text classification model based on the attention mechanism.

Model construction
We use the Bi-LSTM model's output as the attention mechanism's input to extract MD&A long text features. We concatenate the forward and reverse hidden layers of Bi-LSTM and use the output layer as the input of the attention layer. The attention layer calculates the importance of the different time steps of the Bi-LSTM output sequence and finally computes the attention weights and the weighted summary of the sequence.
The flowchart of our model is shown in Figure 2. The model structure is divided into five layers: input layer, word vector layer, Bi-LSTM layer, attention mechanism layer and output layer.

Input layer
The input layer is used to preprocess the input text data, which helps the model better focus on key information in MD&A text. We choose MD&A data that can represent the management concept of enterprise managers as text data and perform data cleaning and word segmentation preprocessing on MD&A data. Preprocessing makes the MD&A text shorter and more compact. Finally, we take the preprocessed MD&A Chinese text word segmentation sequence as the model input, where $x_t$ represents a single Chinese word. To display the text content more intuitively, we have chosen the MD&A Chinese text of a company with a stock code of '000007' in 2017 as an example. The MD&A Chinese text data before and after preprocessing is shown in the Appendix for reference.

Embedding layer
The word embedding layer converts preprocessed text into text vectors by encoding words into numbers. We constructed a Chinese encoding dictionary based on all preprocessed MD&A Chinese texts and set the text length to 2000 through truncation and zero padding. Finally, we encode the MD&A Chinese text passed to the input layer based on the constructed dictionary and convert the preprocessed MD&A Chinese long text into a numerical vector with a sequence length of 2000, where $e_t$ represents the encoded value of a single word.
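The dictionary encoding with truncation and zero padding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `build_vocab` and `encode`, the reserved indices `PAD_ID` and `UNK_ID`, and the frequency-based ordering of the dictionary are all assumptions.

```python
from collections import Counter

PAD_ID = 0  # assumption: index 0 is reserved for zero padding
UNK_ID = 1  # assumption: index 1 marks words missing from the dictionary


def build_vocab(tokenized_docs):
    """Map each word to an integer ID, most frequent words first."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=2)}


def encode(doc, vocab, max_len=2000):
    """Encode one tokenized document, truncating or zero-padding to max_len."""
    ids = [vocab.get(word, UNK_ID) for word in doc[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Toy corpus of already segmented documents (hypothetical content).
docs = [["公司", "收入", "增长"], ["公司", "亏损"]]
vocab = build_vocab(docs)
encoded = [encode(doc, vocab, max_len=4) for doc in docs]
```

With the default `max_len=2000`, every document is mapped to a sequence of exactly 2000 integers, which matches the fixed input length used in this study.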

Bi-LSTM layer
The Bi-LSTM layer generates text semantic vectors related to financial distress by calculating MD&A Chinese long text vectors. The advantage of LSTM is that it can alleviate gradient vanishing and exploding problems so that LSTM can perform better in long-dependent sequences. The LSTM cell is shown in Figure 3. The Bi-LSTM output is formed from the results of the forward and reverse hidden layers and the last memory cell.
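For reference, a standard LSTM cell can be written as below. This is the textbook formulation and is given here as an assumption; the weight names $W_\ast$, the biases $b_\ast$, and the use of $C_t$ for the memory cell are illustrative and may differ from the notation of Figure 3.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, e_t] + b_f\right) \\
i_t &= \sigma\left(W_i [h_{t-1}, e_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C [h_{t-1}, e_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o [h_{t-1}, e_t] + b_o\right) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
```

In a bidirectional LSTM, one such cell reads the sequence from left to right and another from right to left, and the two hidden states at each time step are concatenated, $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$.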

Attention mechanism layer
The attention mechanism layer is used to weight the text semantic vectors calculated by the Bi-LSTM model, highlighting key semantic information and thus better predicting financial distress. The purpose of the attention mechanism is to focus on the information that is more important to the target task in the input sequence information while reducing the weight of other irrelevant information and gradually filtering out non-critical information. It updates the weights by calculating the information distribution of the sequence so that the output can select some critical input information for summarization. In an RNN-based encoder-decoder, Bahdanau attention [44] treats the decoder hidden state at the previous time step as a query and the encoder hidden state at all time steps as both keys and values. The calculation method of the attention mechanism in Bi-LSTM is shown in formulas (3.7)-(3.10).
Among them, $h_i$ represents the hidden layer vector of the $i$-th time step of Bi-LSTM, which is used as the query vector. The $o_i$ represents the output vector of the $i$-th time step of Bi-LSTM as the key and value vector. The $t$ represents the length of the text feature sequence, and $u_i$ and $m_i$ represent the hidden layer and output layer vectors after activation by ReLU and tanh. The attention weight $a_i$ represents the attention weight calculated at the $i$-th time step. The $c_t$ is the attention-weighted sum of the outputs of all time steps of Bi-LSTM.
In the attention layer, the input information is the output of the forward and reverse hidden layers of Bi-LSTM, and the importance of the Bi-LSTM output at each moment is calculated in the attention mechanism. Then, the output results at each moment are weighted and summed to obtain $H_t$. Finally, the summarized text features are output.
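The weighting and summation step can be illustrated with a small numeric sketch. It shows only the generic pattern (softmax over per-step scores, then a weighted sum of the per-step outputs) and is not a reproduction of formulas (3.7)-(3.10): in the model the scores are produced by learned layers, whereas here they are made-up values, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(outputs, scores):
    """Weighted sum of per-step output vectors using softmax(scores) as weights."""
    weights = softmax(scores)
    dim = len(outputs[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, outputs)) for d in range(dim)]
    return weights, pooled


# Three time steps with 2-dimensional outputs; the scores are made-up values.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 2.0]
weights, pooled = attention_pool(outputs, scores)
```

The third time step receives the largest score and therefore dominates the pooled vector, which is the sense in which attention raises the weight of important positions and lowers the weight of the others.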

Output layer
The fully connected neural network can learn the relationship between different features through the fully connected layer and use the activation function to increase the nonlinear ability of the model. In the output layer, we constructed a fully connected neural network to fuse the extracted text semantic features and financial indicators and calculate the binary output of financial distress. We concatenate the text feature vector extracted by the attention layer with the financial indicator vector to construct a merged vector and receive the merged vector through a fully connected neural network, in which several ReLU activation functions increase nonlinearity. When the research needs to check the prediction effect of the text feature vector, the text feature vector can be directly passed to the fully connected neural network. Finally, the softmax function calculates the binary classification output of the enterprise's financial distress.
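The five layers described in this section can be put together in a compact PyTorch sketch. It is an illustration under stated assumptions rather than the authors' code: the class name `BiLSTMAttentionFDP`, the layer sizes, the dropout rate, and the single linear scoring layer used for attention are all hypothetical choices, and only the overall structure (embedding, Bi-LSTM, attention pooling, concatenation with financial indicators, fully connected layers with ReLU, softmax output) follows the text.

```python
import torch
import torch.nn as nn


class BiLSTMAttentionFDP(nn.Module):
    """Illustrative Bi-LSTM + attention classifier fused with financial indicators."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=64, n_fin=46, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden_dim, 1)  # one attention score per time step
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim + n_fin, 64),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(64, n_classes),
        )

    def forward(self, token_ids, fin_features):
        outputs, _ = self.bilstm(self.embedding(token_ids))          # (batch, seq, 2*hidden)
        scores = self.score(torch.tanh(outputs)).squeeze(-1)         # (batch, seq)
        scores = scores.masked_fill(token_ids == 0, float("-inf"))   # ignore zero padding
        weights = torch.softmax(scores, dim=1)                       # attention weights
        text_vec = (weights.unsqueeze(-1) * outputs).sum(dim=1)      # (batch, 2*hidden)
        merged = torch.cat([text_vec, fin_features], dim=1)          # text + financial vector
        return torch.softmax(self.classifier(merged), dim=1)         # class probabilities


model = BiLSTMAttentionFDP(vocab_size=1000).eval()
tokens = torch.randint(1, 1000, (4, 50))
tokens[:, 40:] = 0  # simulate zero padding at the end of each sequence
probs = model(tokens, torch.randn(4, 46))
```

The sketch returns softmax probabilities to mirror the description of the output layer. When training with `nn.CrossEntropyLoss`, the final softmax would normally be dropped, because that loss expects unnormalized logits.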

Dataset
Our experimental data comprises corporate financial data, annual report text data and financial distress labels. The main data comes from the China Stock Market and Accounting Research Database (CSMAR). Missing data were manually supplemented using data from the CNIFO repository and the official websites of the listed companies. For China's A-share stock market, the Shenzhen Stock Exchange announces every year the listed companies that have received ST. In processing the data, this study combines the label of year T with the enterprise data of year T-2 [10,48,49]. For example, Jinhua Group was warned of investment risks and became ST after the release of its 2019 annual report in 2020, so we use its ST label for 2020 and its financial and annual report data for 2018 to conduct financial distress prediction research. A dummy variable is set for the listed company's financial distress: if the listed company is marked as ST, the dummy variable of the year is 1; otherwise, it is 0. The experimental data set consists of financial indicators and MD&A texts. In sample selection, we selected 1642 listed companies in the Shenzhen A-share market from January 2017 to December 2020 as the research object, and we selected the financial data and Chinese annual report texts of these companies from January 2015 to December 2018 as the dataset. From the 1642 listed companies, we screened out 74 companies marked as ST during the four years. After removing banking, financial and utility enterprises from the sample, we obtained a dataset consisting of 296 ST enterprise annual report sample instances and 6272 regular enterprise annual report sample instances, a total of 6568 annual enterprise reports.
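The pairing of the year-T label with the year T-2 data can be sketched as follows. The function `build_samples`, the dictionary layout, and the toy records are hypothetical; only the two-year lag and the reported sample counts come from the text.

```python
def build_samples(features, st_labels, lag=2):
    """Pair each company's year T-lag data with its ST label of year T.

    features:  {(company, year): feature_record}
    st_labels: {(company, year): 1 if marked ST in that year else 0}
    """
    samples = []
    for (company, label_year), label in sorted(st_labels.items()):
        record = features.get((company, label_year - lag))
        if record is not None:  # skip labels with no matching T-lag report
            samples.append((company, label_year, record, label))
    return samples


# Hypothetical toy records; the feature values are placeholders.
features = {("A", 2018): {"roa": -0.12}, ("B", 2018): {"roa": 0.05}}
st_labels = {("A", 2020): 1, ("B", 2020): 0, ("C", 2020): 0}
samples = build_samples(features, st_labels)

# Sample counts reported in the text: 1642 companies over four label years.
total_reports = 1642 * 4
```

Company "C" is dropped because it has no 2018 record, and the reported totals are internally consistent: 296 ST instances plus 6272 regular instances equal 1642 × 4 = 6568 reports.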

Text data
As a crucial part of an enterprise's annual report, MD&A data is the main output channel for the disclosure of business and financial information, expressing the management's views on potential opportunities and challenges in development to the outside world. Structurally, MD&A information is mainly text information, a channel for voluntary disclosure by listed companies. To facilitate the vectorization of unstructured text and use it as the input of the model, it is usually necessary to further preprocess the text before building the model. Unlike English, there are no space marks between Chinese terms. Therefore, in Chinese document preprocessing, word segmentation is difficult when words are ambiguous. Referring to the Chinese text processing methods commonly used in NLP research, we use the Jieba word segmentation tool to segment MD&A Chinese text data, remove low-frequency words through word frequency statistics and then delete stop words according to the manually supplemented stop word dictionary of the Harbin Institute of Technology of China.
The MD&A text must be converted into a numerical format. We use Word2Vec text vectorization for the machine learning models and a text embedding method for the deep learning models.
Regarding Chinese text length, the MD&A part of the corporate annual report exceeds several thousand words, which makes it super-long text content and increases the difficulty of text feature extraction. The length distribution of MD&A Chinese text before and after preprocessing is shown in Figure 4. After we reviewed the text structure of MD&A, we found that critical information mainly exists in the first half of the text. We facilitate the word embedding processing of the model by controlling the text sequence to be the same length, so truncating longer texts and zero-padding shorter texts are necessary. Here, we select 2000 words as the fixed length of the text sequence according to the experimental results and compare their effects on different models. After that, for the machine learning model, we refer to the practice of W. Xiuguo et al., adopt Word2Vec processing based on word vector embedding for the text and convert the word segmentation result into a 256-dimensional vector representation to facilitate the calculation of the algorithm [38]. The word embedding model we use is the word vector model designed by the China Institute of Space Science, and its training tool is Word2Vec in Gensim, using the content of the Baidu Encyclopedia as a corpus. The deep learning model does not require much feature processing on the data. After removing stop words and low-frequency words, we input the Chinese vocabulary sequence into the word embedding layer of the model. The embedding layer converts the text into a numerical vector according to the encoded sequence by labeling the Chinese vocabulary. The parameters of the Word2Vec model are shown in Table 2.
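The filtering step after segmentation can be sketched in a few lines. The sketch assumes the documents are already segmented into word lists (for example with `jieba.lcut`); the function name `clean_tokens`, the `min_count` threshold, and the two-word stop list are hypothetical stand-ins for the frequency cutoff and the stop word dictionary used in the study.

```python
from collections import Counter


def clean_tokens(tokenized_docs, stop_words, min_count=2):
    """Drop stop words and words seen fewer than min_count times in the corpus."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return [
        [w for w in doc if w not in stop_words and counts[w] >= min_count]
        for doc in tokenized_docs
    ]


# Documents are assumed to be segmented already (for example with jieba.lcut).
docs = [["公司", "的", "收入", "增长"], ["公司", "的", "收入", "下降", "了"]]
cleaned = clean_tokens(docs, stop_words={"的", "了"}, min_count=2)
```

Word frequencies are counted over the whole corpus before filtering, so a word that is rare in one document but common overall is kept.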
Table 2. Parameter setting of Word2Vec model.
The financial indicators used in this research were downloaded from the CSMAR database. The choice of financial indicators has a significant impact on the outcome of the forecasting algorithm. In previous studies, the scenarios of financial risk predicted using financial indicators include financial fraud prediction, financial restatement prediction and FDP. These types of studies are all centered on the financial risks of enterprises. The selection of financial indicators has been repeatedly screened and verified, and the structure of financial indicators in financial risk prediction has been gradually summarized. This paper summarizes the financial indicators commonly used in previous financial distress prediction studies and integrates the conclusions of financial risk related studies [10,35,36,49]. Finally, we selected 46 typical financial indicators from eight dimensions: ratio structure, solvency, development ability, risk level, operating ability, relative value indicators, per share indicators and cash flow analysis. The financial indicators used in this study are shown in Table 3.

Model implementation
We chose PyTorch as the deep learning framework, an open-source Python machine-learning library that can utilize powerful GPUs to accelerate the computation of tensors.PyTorch was designed and open-sourced by the Facebook Artificial Intelligence Research Institute (FAIR) based on Torch.
We use the Adam (Adaptive Moment Estimation) optimizer as the gradient descent optimization algorithm in the deep learning training process. We add a decaying learning rate to the model to help it learn detailed information. To prevent the model from overfitting, we added a Dropout layer and LayerNorm normalization to the output layer of the model. The former randomly deactivates some neurons with a given probability, making the model more generalizable. The latter normalizes the activations of each sample across its features, which stabilizes training and helps mitigate overfitting. Finally, we adopt an early stopping strategy in model training.
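The decaying learning rate and early stopping mentioned above can be sketched in a framework-independent way. This is an illustrative sketch only: the exponential decay schedule, the names `decayed_lr` and `EarlyStopping`, and the values of `gamma`, `patience` and `min_delta` are assumptions, since the source does not state which schedule or stopping criterion was used.

```python
def decayed_lr(initial_lr: float, gamma: float, epoch: int) -> float:
    """Exponentially decayed learning rate: initial_lr * gamma ** epoch."""
    return initial_lr * gamma ** epoch


class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience      # hypothetical value, not from the paper
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Usage with made-up validation losses: training stops once the loss
# has failed to improve for two consecutive epochs.
stopper = EarlyStopping(patience=2)
decisions = [stopper.step(loss) for loss in [1.0, 0.8, 0.81, 0.82]]
```

In a PyTorch training loop, the same logic would typically wrap the per-epoch validation step, with the learning rate handled by a scheduler attached to the Adam optimizer.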
We randomly selected 60% of the dataset as the training set, 20% as the validation set and the remaining 20% as the test set. The parameters of Bi-LSTM and Bi-LSTM+Attention models are shown in Table 4.
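A random 60/20/20 split of this kind can be sketched as follows. The function name `split_indices`, the seed and the sample count in the usage lines are hypothetical; the source does not say whether the split was stratified by class, so this sketch uses a plain random shuffle.

```python
import random
from typing import List, Tuple


def split_indices(n_samples: int, seed: int = 42
                  ) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices and split them 60% / 20% / 20%."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # reproducible shuffle
    n_train = n_samples * 60 // 100        # integer arithmetic avoids float rounding
    n_val = n_samples * 20 // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]       # the remainder goes to the test set
    return train, val, test


# Usage with a made-up sample count.
train_idx, val_idx, test_idx = split_indices(1000)
```

Because distressed companies are a minority class, a stratified split may be preferable in practice so that each subset keeps a similar class ratio.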

Metrics
FDP research is a typical binary classification problem, and we aim to build a model with good generalization performance. This research selected five evaluation indicators (Precision, Recall, F1-score, AUC, PR-AUC) to measure the classification and prediction performance of the FDP model. We use modules in the Scikit-learn library in Python to calculate Precision, Recall, F1-score, AUC and PR-AUC metrics to evaluate and select training models.
The precision rate refers to the proportion of all enterprises predicted as distressed that are actually distressed, and the recall rate refers to the proportion of actually distressed enterprises that are predicted correctly. The formulas for Precision and Recall are shown in Eqs (4.1) and (4.2):

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4.1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{4.2}$$

Among them, true positives (TP) are distressed companies that are correctly predicted as distressed. False negatives (FN) are distressed companies that are incorrectly classified as non-distressed. True negatives (TN) are non-distressed companies that are correctly predicted as non-distressed. False positives (FP) are non-distressed companies that are incorrectly classified as distressed.
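As a worked illustration of the Precision and Recall definitions, the following sketch counts TP, FP, FN and TN from label lists and applies the two formulas. Label 1 is assumed to denote a distressed company; the helper names and the toy labels are hypothetical. In practice the same values can be obtained from Scikit-learn, as the text states.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]
                     ) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN), treating label 1 (distressed) as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); defined as 0.0 when nothing is predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); defined as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy example: 3 distressed and 5 non-distressed companies.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
p, r = precision(tp, fp), recall(tp, fn)
```

In this toy example two of the three predicted-distressed companies are truly distressed, and two of the three truly distressed companies are found, so both metrics equal 2/3.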
The F1-score is often used to measure the model's classification performance. It is the harmonic mean of the Precision and Recall of the model, as shown in Eq (4.3):

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4.3}$$

In studies using imbalanced datasets, the AUC and PR-AUC metrics indicate the extent to which the model is independent of sample proportions. The AUC value is the area under the ROC curve. The ROC curve, also known as the sensitivity curve, plots recall on the ordinate against (1-specificity) on the abscissa. AUC reflects the prediction effect: the larger the AUC, the better the model prediction performance. Since this research focuses more on positive samples, and both Precision and Recall measure the ability of the model to find positive samples, we add the PR-AUC indicator to test the model's performance on unbalanced samples. PR-AUC is the average of Precision calculated at each Recall threshold. Usually, larger PR-AUC values indicate better model performance [33]. In conclusion, the values of the above five indicators are between 0 and 1. The closer the value is to 1, the better the prediction performance of the model.
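The three remaining metrics can be illustrated with a small pure-Python sketch. It computes F1 from Precision and Recall, AUC as the fraction of (positive, negative) pairs ranked correctly (which equals the area under the ROC curve), and PR-AUC as average precision, that is, the mean of the precision values at the rank of each positive sample. The function names and toy scores are hypothetical, and the average-precision routine assumes there are no tied scores. Scikit-learn's own implementations would normally be used instead.

```python
from typing import Sequence


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k over the ranks k at which a positive sample appears."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits


# Toy example with predicted distress probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.8, 0.7, 0.1]
auc_value = roc_auc(labels, probs)
ap_value = average_precision(labels, probs)
f1_value = f1_score(0.5, 1.0)
```

The pairwise AUC loop is quadratic in the number of samples and is meant only to make the definition concrete.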

Comparison model
To demonstrate the effectiveness of the Bi-LSTM+Attention model, we selected TextCNN, GRU, Transformer and traditional machine learning models for comparative analysis.

TextCNN
CNN has the advantages of having fewer parameters, sampling efficiently and being well suited to parallel computation on GPUs. The TextCNN model is designed for text classification. TextCNN captures the local features of the text and extracts semantic segmentation information at different levels by automatically combining and filtering local features, similar to a sliding window containing multiple words [50]. The TextCNN model we constructed uses a one-dimensional convolutional layer and a max-over-time pooling layer, and contains four components: word embedding, convolution, pooling and softmax. The parameters of the TextCNN model are shown in Table 5.
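The sliding-window idea behind the convolution and pooling components can be sketched without any framework. The sketch below applies a single filter that spans `k` consecutive word vectors, followed by ReLU and max-over-time pooling. The function names, the ReLU activation and the toy embeddings are assumptions for illustration; the actual model uses learned filters of the sizes listed in Table 5.

```python
from typing import List, Sequence


def conv1d_valid(embeddings: Sequence[Sequence[float]],
                 kernel: Sequence[Sequence[float]],
                 bias: float = 0.0) -> List[float]:
    """Slide one filter over a sequence of word vectors (no padding).

    embeddings: seq_len vectors of dimension d
    kernel:     k vectors of dimension d (a window covering k words)
    Returns seq_len - k + 1 ReLU-activated feature values.
    """
    k = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - k + 1):
        total = bias
        for offset in range(k):
            word_vec = embeddings[start + offset]
            total += sum(w * x for w, x in zip(kernel[offset], word_vec))
        feature_map.append(max(0.0, total))  # ReLU (assumed activation)
    return feature_map


def max_over_time(feature_map: Sequence[float]) -> float:
    """Keep only the strongest response of the filter across all positions."""
    return max(feature_map)


# Toy example: four 2-dimensional word vectors and a filter spanning two words.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_kernel = [[1.0, 0.0], [0.0, 1.0]]
features = conv1d_valid(toy_embeddings, toy_kernel)
pooled = max_over_time(features)
```

In a full TextCNN, many such filters with several window sizes run in parallel, and their pooled values are concatenated before the softmax layer.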

Electronic Research Archive
Volume 31, Issue 8, 4683-4707

Transformer
Vaswani et al. proposed the Transformer, which achieved the best results in translation tasks by building a model based on the attention mechanism, and demonstrated that the Transformer could be generalized to learning tasks in other domains [51]. Unlike the attention mechanism constructed by Bahdanau described above, the Transformer combines multiple attention modules to form an encoder and a decoder. Since the FDP study is not a language-generating task, here we use only the encoding part of the Transformer (Encoder) for text feature extraction. The parameters of the Transformer model are shown in Table 7.
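The core operation inside each Transformer encoder layer, scaled dot-product attention, can be sketched in pure Python. This is a minimal single-head illustration that leaves out the learned projections, multi-head splitting, residual connections and feed-forward sublayers of the full encoder; the function names and toy matrices are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries: Sequence[Sequence[float]],
                                 keys: Sequence[Sequence[float]],
                                 values: Sequence[Sequence[float]]
                                 ) -> List[List[float]]:
    """softmax(Q K^T / sqrt(d_k)) V, computed row by row."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Toy example: identical keys give equal weights, so the output is the mean of V.
Q = [[1.0, 0.0]]
K = [[1.0, 1.0], [1.0, 1.0]]
V = [[0.0, 2.0], [4.0, 6.0]]
attended = scaled_dot_product_attention(Q, K, V)
```

For self-attention over an MD&A text, the queries, keys and values would all be derived from the same sequence of token representations.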

Other benchmark models
We also selected several traditional machine learning classifiers as comparison models for deep learning models. Past studies have shown that SVM performs well on complex financial data. SVM projects the original data to higher dimensions by nonlinear mapping and then builds the optimal decision hyperplane to maximize the distance between the two closest samples on either side of the plane. SVM works better when the sample size is small compared to other nonlinear methods. Inspired by the central nervous system of animals, the ANN model applies neuronal information processing and excels in self-learning and self-organization. The decision tree model is a tree structure describing instance classification, consisting of nodes and directed edges. RF is an ensemble algorithm that adds a voting mechanism to the ensemble of multiple decision trees, which can alleviate the balance error of unbalanced classification data sets. XGBoost is an algorithm implementation based on gradient boosting decision tree (GBDT). It has the advantages of high efficiency, flexibility and portability. It is often used to solve large-scale data problems in industrial fields.

Empirical results
This research compares the effects of machine learning and deep learning models combined with financial and text indicators. Our empirical results can be divided into two parts. One part is the experimental results of financial indicators and text vectors on traditional machine learning models (RF, SVM, XGBoost, and ANN). The other part is the experimental results of financial metrics and text data on deep learning models (TextCNN, Bi-LSTM, GRU, Transformer, GRU+Attention, and Bi-LSTM+Attention). To verify the improvement effect of the attention mechanism on semantic extraction of MD&A Chinese long texts, this study added ablation experiments and constructed Bi-LSTM and GRU models without the attention mechanism. We analyze these two parts in turn.

Traditional machine learning models
For the traditional machine learning models, we tested financial indicators alone and financial indicators combined with text word vector indicators. The classification results using financial indicators alone are shown in Table 8. The results show that the SVM model performs better in the Precision metric, while the XGBoost, RF and ANN models perform better in AUC and PR-AUC. The results of the machine learning models after adding text vectorization indicators are shown in Table 9.
The empirical results show that the AUC metrics of the XGBoost, RF and ANN models are still excellent after adding text vectorization indicators, increasing by 1.27%, 1.86% and 4.31%, respectively. SVM and ANN also perform well on the PR-AUC indicator, which improves by 5.83% and 13.95%, respectively, after adding the text vectorization indicators. The recall of the XGBoost model also improves, increasing by 6.67% after the text vectorization indicators are added. Overall, we found that after adding text vectorization features, all traditional machine learning models improved slightly, indicating that text data provides some incremental information for machine learning models. These results also show that, because of the high requirements for feature engineering and the limited ability of the Word2Vec method to process long text vectors, the performance of machine learning models in FDP research combining long text is limited.

Deep learning models
In the deep learning model, we tested the performance of two types of models based on text indicators before and after adding financial indicators. The classification results of the single-text indicators are shown in Table 10 and Figure 5.

In the ablation experiment, comparing the GRU+Attention and Bi-LSTM+Attention models with the GRU and Bi-LSTM models without the attention mechanism, it is evident that each indicator has improved to a certain extent. Their AUC metrics are improved by 17.87% and 25.72%, respectively, which means that the attention mechanism enhances the deep learning model's ability to capture long text information. In terms of model effect, the performance of the Bi-LSTM+Attention model is better than that of the GRU+Attention model, with an AUC 8.02% higher than the latter, indicating that the Bi-LSTM+Attention model has a better classification effect than the other models when only text indicators are used. Usually, when the AUC on the test set is greater than 0.75, the model's discriminative ability is considered good [52]. Therefore, our experimental results indicate that deep learning models based on attention mechanisms can achieve better discrimination ability, and the overall performance of the Bi-LSTM+Attention model is the best.
The results of adding both textual and financial indicators to the deep learning model are shown in Table 11 and Figure 6. The empirical results show that, on the whole, the effect of each deep learning model is improved after adding financial indicators, indicating that the addition of financial indicators also provides incremental information for the text classification model. Among the commonly used text classification models, the attention-based Transformer model achieves slightly better AUC and PR-AUC than the Text-CNN, GRU and Bi-LSTM models.
Similarly, in the ablation experiment, the AUC values of GRU and LSTM increased by 6.88% and 13.66%, respectively, after the attention mechanism was added. This shows that the introduction of the attention mechanism has improved the effectiveness of the model.
Regarding model performance, the Bi-LSTM+Attention model outperformed the GRU+Attention model, and the AUC index was 8.99% higher than the latter.It shows that the classification effect of the Bi-LSTM+Attention model is more significant than other models when text indicators are combined with financial indicators.

Conclusions
Prediction of financial distress of listed companies can provide enterprise managers with early warning of enterprise risks and help investors make decisions to avoid financial losses. In traditional research, researchers usually predict financial distress by using corporate financial data for feature extraction or combining text word vector features. With the application of deep learning in natural language processing, various deep learning-based text classification models have been adopted by FDP researchers.
This study took 1642 listed companies in China's Shenzhen A-share market from 2017 to 2020 as the research object, constructed a deep learning text classification model and used MD&A Chinese text data and financial indicators to predict the financial distress of listed companies.However, the text in the MD&A section of listed companies in China was several thousand words long, making the text analysis work extremely difficult.To cope with the long text sequence classification problem, we combined the attention mechanism with GRU and LSTM models to construct GRU+Attention and Bi-LSTM+Attention models.This study captured the time series information by using multiple hidden layer structures of GRU and LSTM and then used the attention mechanism to summarize the time series critical information in the MD&A long text.
By comparing the experimental results of traditional machine learning models and deep learning models, we reach the following conclusions:
1). We verify that the MD&A Chinese text of listed companies can provide incremental information for the FDP model. The deep learning models we constructed are more effective in identifying corporate financial distress by combining text and financial data than traditional machine learning models.
2). We found that the attention mechanism can improve the long text classification performance of the deep learning model in FDP research.
3). Compared with the ordinary deep learning model, the Bi-LSTM+Attention model we constructed has different degrees of improvement in the measurement indicators of FDP research.
Our research has several limitations that future research could address. First, the Transformer model in this research experiment may not show the best effect due to hardware limitations, and future research can try to debug it under a better memory configuration. Second, the dataset of this study uses the MD&A Chinese texts in the annual reports of listed companies as text data. Future research can test this model when the hardware is improved. Third, the deep learning model based on the attention mechanism relies on training with the Chinese text of the annual reports of listed companies in China, and its applicability to FDP for listed companies in other countries needs further research. Lastly, we built two deep learning models combined with an attention mechanism for long text classification: GRU+Attention and Bi-LSTM+Attention. In future research, we can try to combine the attention mechanism with other text classification algorithms to explore the effect of different attention-based models in FDP research.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

Figure 4. Length distribution of MD&A Chinese text before and after preprocessing.

The experimental results indicate that various deep learning models have certain effects on FDP research when only using MD&A text data, which verifies the reliability and effectiveness of MD&A Chinese long text data for FDP research. The empirical results show that, among the frequently used text classification models, the AUC and PR-AUC of the attention-based Transformer model are slightly better than those of the Text-CNN, GRU and Bi-LSTM models.

Figure 5. Results of single-text data in deep learning.

Figure 6. Results of text indicators combined with financial indicators in the deep learning model.

Table 1. Analysis of comparisons with FDP models.

Table 5. Parameter setting of TextCNN model.

GRU can be regarded as a simplified variant of LSTM, but GRU is relatively faster. GRU can capture dependencies across time steps in a sequence. It is designed to reduce the vanishing gradient problem and thus retain information on long sequences. The parameters of the GRU and GRU+Attention models are shown in Table 6.

Table 6. Parameter setting of GRU and GRU+Attention models.

Table 7. Parameter setting of Transformer model.

Table 8. Results of financial indicators in machine learning models.

Table 9. Results of financial indicators combined with text indicators in machine learning models.

Table 10. Results of single-text data in deep learning.

Table 11. Results of text indicators combined with financial indicators in the deep learning model.