A Feature Extraction Method Based on Feature Fusion and its Application in the Text-Driven Failure Diagnosis Field

As a basic task in NLP (Natural Language Processing), feature extraction directly determines the quality of text clustering and text classification. However, the commonly used TF-IDF (Term Frequency-Inverse Document Frequency) and LDA (Latent Dirichlet Allocation) text feature extraction methods have shortcomings: TF-IDF does not consider the text's context, and LDA is blind to the topic of the corpus. This study builds a feature extraction algorithm based on feature fusion and applies it in the field of failure diagnosis. A text-driven failure diagnosis model is designed to classify and automatically judge which failure mode the failure described in the text belongs to once a failure-description text is entered. To verify the effectiveness of the proposed feature extraction algorithm and failure diagnosis model, long-term accumulated failure-description text from an aircraft maintenance and support system was used to conduct an empirical study. The experimental results show that the proposed feature extraction method can effectively improve the effect of clustering, and that the proposed failure diagnosis model achieves high accuracy and low false alarm rates.


I. Introduction
In the information age, data is generated all the time, especially text data. Because of its convenience, flexibility and universality, the quantity of text data is growing exponentially in applications such as Twitter, online articles and shopping reviews. These texts usually contain a lot of useful information, but also a lot of interfering information. The text data therefore needs to be processed to mine the useful information.
For structured data, scholars usually write corresponding programs directly to mine information with the help of computers. However, unstructured data such as text cannot be recognized directly by a computer. Therefore, the computer must first understand the text before mining it for information. The recognition process involves converting text symbols into numeric symbols that computers can process. Scholars have tried many methods to resolve recognition issues, but the original rule-based methods do not solve this problem very well [1]. Inspired by neural networks, many scholars have put forward effective new technologies [2], and text feature extraction is one of them.
Since computers cannot directly recognize unstructured data, the text needs to be transformed into a structured format for processing. At present, the VSM (Vector Space Model) [3] is widely used to make the transformation, but a document may contain thousands of words, which easily leads to dimension explosion and expensive calculations.
Therefore, feature extraction is usually used for further dimension reduction. Feature extraction finds the most representative text features, so that low-dimensional feature vectors can represent the original text data [4]. TF-IDF [5] and LDA [6] are two classical models for text feature extraction and are easy to operate. However, TF-IDF only extracts keywords without considering the context. Although LDA considers the context, it tends to lose features and be ambiguous. Inspired by the work of Zhao [7] et al., who fused a χ² statistics-based feature and an LDA semantic-based feature to improve the performance of feature extraction, this paper proposes a text feature extraction method based on feature fusion, which combines the TF-IDF and LDA methods.
As NLP technology advanced, it was used in the field of text-driven failure diagnosis. Dnyanesh [8] et al. proposed a novel ontology-based text mining methodology that constructs D-matrices by automatically mining the unstructured repair verbatim data collected during failure diagnosis, and used it for failure pattern recognition. Rodrigues [9] et al. used text mining and neural networks to identify and classify aircraft failure patterns. However, most of the current text-based failure diagnosis models are supervised, which means these models apply only to labeled text data. In practice, failure-description text is often unlabeled because of high labor costs. Therefore, based on the proposed feature extraction method, this paper develops a text-driven unsupervised failure diagnosis model. The extracted feature vectors are clustered to obtain pseudolabel data, and the pseudolabel data is then used to train the classifier for failure diagnosis. This failure diagnosis model can classify and automatically judge which failure mode the failure described in the text belongs to once a failure-description text is entered.

A. Text Feature Extraction
As the basic work of text processing, text feature extraction has always been a hot research topic in NLP. So far, most of the existing feature extraction methods are based on the bag-of-words model [10], the topic model [11] and the word embedding model [12] - [13].
The bag-of-words model [10] adopts one-hot encoding to generate word vectors. Each word vector's dimension is equal to the size of the vocabulary. In this vector, only one dimension's value is 1 and the rest are 0. Such a vector composed of 0s and 1s cannot represent a word accurately, because different words have different importance to the text. Scholars usually use the TF-IDF method to assign weights to one-hot vectors. TF-IDF [5] is a widely used weighting technique and plays an important role in the field of information retrieval. TF-IDF is easy to carry out and usually performs well on short-text datasets, but performs badly on long-text or class-imbalanced datasets.
A topic model [11] is a kind of topic generation model and a three-layer Bayesian probability model. The topic model's core idea is that a document selects a topic according to a certain probability, and a topic in turn selects a word according to a certain probability. LDA [6], as a classical topic model, is widely used in text classification and text clustering tasks. Compared with TF-IDF, LDA considers the context and performs much better on long-text datasets. However, LDA, as an unsupervised model, tends to lose features, is ambiguous in the process of feature extraction and performs badly on short-text datasets.
A word embedding model is designed to solve the dimension explosion problem of the bag-of-words model when processing long-text datasets. Through neural networks [13], word co-occurrence matrices, probabilistic models and other methods, a word embedding model maps the bag-of-words model's one-hot vector to a continuous vector space with a much lower dimension, which achieves dimension reduction [12] - [13]. Word2vec is the most widely used word embedding framework; it includes two word-vector-generation models, Skip-Gram and CBOW (Continuous Bag-of-Words Model). Skip-Gram and CBOW are three-layer neural networks with different inputs. Skip-Gram inputs the current word to predict the surrounding words, while CBOW inputs the surrounding words to predict the current word. Both models therefore take the context into account. However, the two models usually perform badly on short-text datasets [14].

B. Feature Fusion
Feature fusion originates from data fusion, which was originally conducted in the military. In recent years, with the rapid development of AI, data fusion has been widely applied in intelligent medical care [15], intelligent industry [16], intelligent transportation [17] and so on. Data fusion is a framework that contains fusion modes and tools. It mainly uses different fusion modes and tools to combine different data sources, which may generate improved new data for certain application scenarios [18]. Whether the data after a fusion is effective or not mainly depends on the application scenario. In most cases, data fusion can effectively enhance the authenticity and availability of data [19], which is why data fusion is needed.
Feature fusion, as a data fusion technology, uses given feature sets to generate new fusion features [19], and is very suitable for classification tasks. Liu [20] et al. fused two groups of feature vectors into a unit vector and extracted features from the high-dimensional vector space. They proposed a serial feature fusion algorithm and applied it to face recognition. Their experiments showed that this algorithm could reach an accuracy rate of 98.5% with only 25 features. Yang [19] et al. proposed a new serial feature fusion algorithm for unstructured data, and tested it on the CENPARMI handwritten digit library, the NUST603 handwritten Chinese character library and the ORL face image library. The experimental results showed that their algorithm effectively improved the classification accuracy. Sun [21] et al. also proposed a new feature fusion algorithm based on CCA (Canonical Correlation Analysis), which performed well on small-sample datasets with high dimensions. They first extracted two groups of feature vectors with the same pattern, then established correlation criterion functions between them, and finally extracted their representative features to form effective recognition vectors.

III. Text Feature Extraction Method Based on Feature Fusion
Text feature extraction is a process of text vectorization. As the first step of text processing, it directly determines the effects of the follow-up processes. However, text feature extraction is a very complex problem, because it involves converting abstract character symbols into concrete numeric symbols while maintaining the meaning of the original text. TF-IDF and LDA are two commonly used feature extraction methods. The TF-IDF method is easy to execute and usually performs well on short-text datasets, but ignores the context. LDA considers the context and performs well on long-text datasets, but is prone to blindness when selecting topic words. Therefore, this paper proposes a text feature extraction method based on feature fusion, which combines the TF-IDF and LDA methods and is named TI-LDA.

A. TF-IDF Feature Extraction Method
The TF-IDF feature extraction method is the TF-IDF weighting of the one-hot vectors generated by a bag-of-words model. A one-hot vector is discrete and its dimension is equal to the size of the vocabulary; only one dimension has the value 1 and the rest are 0. The one-hot representation of a sentence can be obtained by adding the one-hot vectors of all the words in the sentence. However, not every word is equally important to a sentence, so weighting each word with TF-IDF is necessary. The core idea of TF-IDF is that the importance of a word is positively correlated with the frequency of its occurrence in a given text, and negatively correlated with the frequency of its occurrence in all texts of the corpus. TF-IDF is composed of TF (Term Frequency) and IDF (Inverse Document Frequency). TF refers to the frequency of a word's occurrence in a given text, and its calculation formula is:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \quad (1)$$
where $n_{i,j}$ is the number of times that word $t_i$ appears in text $d_j$ and the denominator is the total number of occurrences of all words in text $d_j$.
IDF is used to measure the frequency of a word's occurrence in all texts of the corpus, and its calculation formula is:
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \quad (2)$$
where $|D|$ is the total number of texts in the corpus and $|\{j : t_i \in d_j\}|$ is the number of texts containing $t_i$ in the corpus.
The calculation formula of TF-IDF is:
$$tfidf_{i,j} = tf_{i,j} \times idf_i \quad (3)$$
The normalized formula of TF-IDF is: (4)
The TF-IDF algorithm is simple in principle, easy to operate and efficient to calculate, which makes it suitable for short-text mining. However, it ignores the context when carrying out vectorization and easily causes dimension explosion when dealing with long text.
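The weighting described by equations (1) to (3) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `tf_idf`, the toy corpus and the use of the natural logarithm are assumptions, and the normalization step of equation (4) is deliberately left out.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document in `corpus`."""
    n_docs = len(corpus)
    # Document frequency: number of texts that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = sum(counts.values())  # total word occurrences in this text
        weights.append({
            # tf * idf, with the natural logarithm assumed for idf
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return weights

# Hypothetical toy corpus, already segmented into words.
corpus = [
    ["engine", "oil", "leak"],
    ["engine", "signal", "lost"],
    ["signal", "monitor", "fault", "signal"],
]
weights = tf_idf(corpus)
```

A word that appears in every text receives a weight of 0 under this definition, which is why smoothed variants of IDF are common in practice.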

B. LDA Feature Extraction Method
LDA is a statistical topic model that represents the topics of each document in the form of a probability distribution. LDA assumes that a document consists of several topics, and each topic consists of several words. When generating a document d with K topics, the probability of a word w being selected is:
$$p(w \mid d) = \sum_{k=1}^{K} p(w \mid t_k)\, p(t_k \mid d) \quad (5)$$
where K is the total number of topics, k is the topic index and $t_k$ stands for topic k. For example, suppose there are three topics: animals, actions and names. Different words are distributed under each topic. "cat", "dog" and "pig" belong to the topic of animals. "sitting", "running" and "standing" belong to the topic of actions. "Tony", "Jack" and "Lucy" belong to the topic of names. Suppose a sentence saying that the dog is sitting needs to be generated. The first step is to select the topics under the condition of the target semantics. The second step is to choose the words under the condition of the selected topics. In the end, the sentence "the dog is sitting" can be generated.
Therefore, the distribution of words in a given document can be obtained as shown in Fig. 1. The process of generating a document by the LDA model can be summarized as follows:
• Step 1: Sample from the Dirichlet distribution α to generate the topic distribution $\theta_i$ of document i;
• Step 2: Sample from the Multinomial distribution $\theta_i$ to obtain the topic $z_{i,j}$ of word j in document i;
• Step 3: Sample from the Dirichlet distribution β to generate the word distribution $\phi_{z_{i,j}}$ of topic $z_{i,j}$;
• Step 4: Sample from the Multinomial distribution $\phi_{z_{i,j}}$ to obtain word $\omega_{i,j}$.
Based on the above process, the following joint distribution can be obtained: (6) By integrating over $\theta_i$ and $\Phi$, and summing over $z_i$, the maximum likelihood estimation of the word distribution can be obtained as follows: (7) Finally, the parameters of an LDA model can be obtained by Gibbs sampling [22].
The LDA model can effectively extract the semantic information of the text while taking the context into account. However, in the selection of topic words, LDA has a certain blindness that easily causes ambiguity and feature loss.
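The four-step generative process above can be illustrated with a small sampling sketch that uses only the Python standard library. It covers the generative direction only, not the Gibbs sampling used for parameter estimation. The helper names, the toy vocabulary and the hyperparameter values are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, phi, vocab, rng):
    """Generate one document given per-topic word distributions `phi`."""
    # Step 1: topic distribution theta of the document ~ Dirichlet(alpha).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Step 2: topic z of the current word ~ Multinomial(theta).
        z = rng.choices(range(len(alpha)), weights=theta)[0]
        # Step 4: the word itself ~ Multinomial(phi[z]).
        words.append(rng.choices(vocab, weights=phi[z])[0])
    return theta, words

# Hypothetical setup: two topics over a six-word vocabulary.
rng = random.Random(0)
vocab = ["cat", "dog", "pig", "sitting", "running", "standing"]
alpha = [0.5, 0.5]
beta = [0.5] * len(vocab)
# Step 3: word distribution of each topic ~ Dirichlet(beta).
phi = [sample_dirichlet(beta, rng) for _ in alpha]
theta, words = generate_document(5, alpha, phi, vocab, rng)
```

Step 3 is drawn once per topic before any document is generated, which matches the usual reading of the model even though the list above presents it between steps 2 and 4.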

C. TI-LDA Feature Extraction Method
The feature fusion algorithm has been widely used in the field of AI for applications such as target tracking, pattern recognition and image understanding. In the field of pattern recognition, Jian Yang [19] et al. proposed two fusion strategies, parallel feature fusion and serial feature fusion. They also verified the robustness and practicability of the two fusion strategies through experimentation.
Suppose A and B are two different feature spaces of sample space Ω, and suppose $\alpha \in R^n$ and $\beta \in R^m$ are feature vectors of A and B, respectively. Then, parallel feature fusion can be expressed as:
$$\gamma = \alpha + i\beta \quad (8)$$
where i is the imaginary unit and γ is a complex feature vector of dimension max(n, m) in the new feature space. In the parallel feature fusion process, feature vectors α and β may have different dimensions; the feature vector with the lower dimension needs to be padded with 0 before fusion. Take $\alpha = (a_1, a_2, a_3)^T$ and $\beta = (b_1, b_2)^T$ for example. First, pad vector β with 0 to create the three-dimensional feature vector $(b_1, b_2, 0)^T$, then carry out the feature fusion according to equation (8). The final result is $\gamma = (a_1 + ib_1, a_2 + ib_2, a_3 + i0)^T$.
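The zero-padding and complex combination in the example above can be reproduced directly with Python's built-in complex numbers. The function name `parallel_fusion` and the numeric values are illustrative assumptions.

```python
def parallel_fusion(alpha, beta):
    """Fuse two real feature vectors into one complex vector alpha + i*beta."""
    size = max(len(alpha), len(beta))
    # Pad the lower-dimensional vector with zeros before fusion.
    a = list(alpha) + [0.0] * (size - len(alpha))
    b = list(beta) + [0.0] * (size - len(beta))
    return [complex(x, y) for x, y in zip(a, b)]

# alpha has three components, beta has two, as in the worked example.
gamma = parallel_fusion([1.0, 2.0, 3.0], [4.0, 5.0])
```

The fused vector keeps the larger of the two dimensions, so each component carries one value from each feature space.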
Serial feature fusion can be expressed as:
$$\gamma = \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \quad (9)$$
Compared with parallel feature fusion, serial feature fusion does not need to handle vectors with different dimensions. Again take $\alpha = (a_1, a_2, a_3)^T$ and $\beta = (b_1, b_2)^T$ for example; fuse directly and the final result is $\gamma = (a_1, a_2, a_3, b_1, b_2)^T$.
For a given text sample space Ω, suppose A and B are the feature vector spaces based on TF-IDF and LDA, and $\alpha_i \in R^n$ is a feature vector in A and $\beta_i \in R^m$ is a feature vector in B. Adopting the parallel feature fusion strategy, according to equation (8), the sample space Ω can be represented as: (10) Adopting the serial feature fusion strategy, according to equation (9), the sample space Ω can be represented as: (11)

IV. Text-Driven Failure Diagnosis Model Based on TI-LDA
To make full use of failure-description text and understand the role the TI-LDA text feature extraction method plays in the field of failure diagnosis, this paper applied TI-LDA to failure diagnosis and designed a text-driven failure diagnosis model that is suitable for small data samples. The main framework is shown in Fig. 2.
Before any further processing, the text data needs to be preprocessed. Specifically, for English text data, stop words need to be removed, while for Chinese text, word segmentation is also needed because there is no explicit separator between words. In addition, there are differences in word granularity, part of speech, polyphonic characters and so on between Chinese NLP and English NLP. Although the model is mainly for Chinese text, failure-description text usually mixes Chinese and English. For this situation, the model treats English words in the text as special Chinese characters. After the preprocessing, feature extraction is done to obtain text vectors, and the obtained feature vectors are processed by CFSFDP (Clustering by Fast Search and Find of Density Peaks) clustering to mark the pseudolabels for the failure text. The obtained pseudolabel data cannot be directly put into the classifier for training, because the class imbalance problem often exists in failure text and affects the performance of the classifier. Therefore, this paper adopts the SMOTE (Synthetic Minority Oversampling Technique) oversampling method to balance the pseudolabel data. Finally, the balanced data is put into the SVM classifier to train the failure diagnosis model.

A. Text Preprocessing
As mentioned above, different preprocessing strategies should be adopted for English text and Chinese text. English text should be processed by removing stop words directly, while Chinese text should be segmented first and then have its stop words removed. At present, the common Chinese word segmentation methods are mainly based on dictionaries, statistics and rules, and the dictionary-based method is the most effective and widely used. Common dictionary-based word segmentation systems include Jieba, the CAS (Chinese Academy of Sciences) segmentation system, Smallseg and Snailseg. Their functions are compared in Table I. It can be seen from Table I that Jieba is more powerful and more suitable for the text data used in this article, so this paper adopted Jieba for word segmentation.
There are often a large number of stop words in the text, such as emotional particles and punctuation marks, that make no contribution to the semantic expression. If the text is used directly for subsequent processing, these stop words will inevitably make the dimension of the text vector too high, increase the calculation cost, and interfere with text clustering. Therefore, these stop words need to be removed with the help of a stop word list. A simple approach is to directly use a stop word list established by a professional organization, such as the NLTK (Natural Language Toolkit) English stop word list or the Baidu Chinese stop word list. Although this method is simple, it does not achieve the best effect. To achieve the best effect, a special stop word list needs to be established in accordance with the specific situation of the text data.

B. Feature Extraction
The text feature extraction function adopts the TI-LDA method proposed in this paper. For the fusion strategy, this paper selects serial feature fusion. Comparing equations (10) and (11) shows that parallel feature fusion further reduces the dimension. However, before the fusion of text features, the preprocessing and feature extraction have already reduced the dimension of the text data, so more features would be lost with parallel feature fusion. Therefore, this paper uses serial feature fusion.
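As a minimal sketch of the serial strategy chosen here, the TF-IDF vector and the LDA topic vector of one text can simply be concatenated. The function name `serial_fusion` and the numeric values are hypothetical, and any weighting or normalization the authors may apply before fusion is not shown.

```python
def serial_fusion(alpha, beta):
    """Concatenate two feature vectors into one vector of dimension n + m."""
    return list(alpha) + list(beta)

# Hypothetical feature vectors for one failure-description text.
tfidf_vector = [0.12, 0.00, 0.47, 0.31]  # n = 4 TF-IDF weights
lda_vector = [0.70, 0.25, 0.05]          # m = 3 topic probabilities
ti_lda_vector = serial_fusion(tfidf_vector, lda_vector)
```

The fused vector has n + m components, whereas parallel fusion would keep only max(n, m), which is the dimension argument made in the paragraph above.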

C. Text Clustering Based on CFSFDP
Text clustering is a key element in the failure diagnosis model, and its main function is to mark the pseudolabels for the failure text. Therefore, choosing a clustering method suitable for the text data is very important. The current clustering algorithms can be broadly divided into partition-based methods [23], hierarchical-based methods [24] - [25], density-based methods [26], grid-based methods [27] and model-based methods. Because the failure-description text studied in this paper is a typical small data sample, this paper uses CFSFDP to do the clustering. CFSFDP is a clustering method for small data samples published in Science by Rodriguez [28]. CFSFDP assumes that the center of a cluster is surrounded by points with lower local density, and that the center is far away from other points with higher local density. Therefore, the clustering centers can be obtained by calculating the nearest distance, and the remaining points can be assigned to their categories in order of density. Suppose $p_i$ and $p_j$ are two different points of the discrete data point set $D = \{p_1, p_2, ..., p_n\}$, and define the local density $\rho_i$ of $p_i$ as the number of points in the circle with $p_i$ as the center and $d_c$ as the radius. Then, $\rho_i$ can be calculated by the following formula:
$$\rho_i = \sum_{j \ne i} \chi(d_{ij} - d_c) \quad (12)$$
where the function $\chi$ is:
$$\chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \ge 0 \end{cases} \quad (13)$$
Here, $d_{ij}$ is the distance between $p_i$ and $p_j$, and $d_c$ is the cutoff distance, which needs to be determined manually.
Define the set of points with a higher density than $p_i$ as $\{p_j : \rho_j > \rho_i\}$, and define the distance $\delta_i$ to be:
$$\delta_i = \begin{cases} \min_{j:\, \rho_j > \rho_i} d_{ij}, & \text{if some } p_j \text{ has } \rho_j > \rho_i \\ \max_{j} d_{ij}, & \text{otherwise} \end{cases} \quad (14)$$
When the data point $p_i$ has the largest local density, $\delta_i$ represents the maximum distance between $p_i$ and any other point $p_j$ in the data set; otherwise $\delta_i$ represents the minimum distance between $p_i$ and the points $p_j$ of higher density.
To comprehensively measure the local density $\rho_i$ and distance $\delta_i$, another variable $\gamma_i$ is introduced, calculated as:
$$\gamma_i = \rho_i \times \delta_i \quad (15)$$
The clustering centers can be selected according to the value of $\gamma_i$, because the clustering centers usually have a larger value of $\gamma_i$. If all values of $\gamma_i$ are arranged in descending order and plotted on a two-dimensional plane, the value of $\gamma_i$ is generally small and changes smoothly in the interval of nonclustering centers, while it is generally large in the interval of clustering centers, with an obvious jump near the critical point. Therefore, the number of clustering centers and classes can be determined based on the above characteristics.
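The quantities used to pick clustering centers (local density, distance to the nearest denser point, and their product) can be sketched as follows. This is a simplified illustration using Euclidean distance and a hard cutoff; the function name `density_peaks`, the toy points and the cutoff value are assumptions, and the assignment of the remaining points to clusters is omitted.

```python
import math

def density_peaks(points, d_c):
    """Return (rho, delta, gamma) lists for the given points and cutoff d_c."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff distance.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # No denser point exists: use the largest distance to any point.
        # Otherwise: use the smallest distance to a denser point.
        delta.append(min(higher) if higher else max(dist[i]))
    gamma = [r * d for r, d in zip(rho, delta)]
    return rho, delta, gamma

# Hypothetical data: two well-separated groups of five points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
rho, delta, gamma = density_peaks(points, d_c=1.2)
# Indices of the two largest gamma values, taken as cluster centers.
centers = sorted(range(len(points)), key=lambda i: gamma[i], reverse=True)[:2]
```

In this toy set the middle point of each group has by far the largest product of density and distance, which is the jump in the sorted values that the text uses to decide the number of centers.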

D. Oversampling Based on SMOTE
In most failure monitoring data, including failure-description text, there exists a class imbalance problem. The major class usually has more samples than the minor class. In fact, the major class occurs frequently but usually does less harm, while the minor class occurs occasionally but does great harm. This kind of unbalanced class data is a great challenge to classification. If the unbalanced data is directly used for classification, the minor-class samples will be submerged in the major-class samples, which often results in high false alarm rates for the major class and high missing alarm rates for the minor class [30].
At present, there are mainly two oversampling approaches to data balancing. One directly copies the samples of the minor class; the other artificially generates samples of the minor class according to the minor class's characteristics. The former is easy to do but easily causes overfitting, while the latter is more complex but less prone to overfitting. Because the sample size of the data used in this paper is small, this paper adopts SMOTE [31], a widely used and relatively mature oversampling method of the latter kind. The basic function of SMOTE is to add artificial minor-class samples to the new sample set by analyzing the characteristics of the minor class. The SMOTE process for solving a class imbalance problem is as follows. Define a sample set X of a minor class. For any point x in X, calculate the Euclidean distance between this point and all remaining points to obtain the k nearest points. With the oversampling multiplier set to n, randomly select n points from the k nearest points. For each selected point, written here as $\hat{x}$, add a new sample to X by random linear interpolation, as shown in the following formula:
$$x_{new} = x + rand(0,1) \times (\hat{x} - x) \quad (16)$$
where rand(0,1) is a random number between 0 and 1. The above formula can generate m samples of the minor class to achieve the purpose of balancing the data set.
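The interpolation step of SMOTE described above can be sketched with the standard library alone. The function name `smote`, the toy minority samples and the parameter values are assumptions, and the sketch requires n to be no larger than k because the n neighbours are drawn without replacement.

```python
import math
import random

def smote(minority, k, n, rng):
    """Create n synthetic samples per minority point by linear interpolation."""
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest neighbours of x among the other minority samples.
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Randomly pick n of the k neighbours (assumes n <= k).
        for x_hat in rng.sample(neighbours, n):
            gap = rng.random()  # plays the role of rand(0, 1)
            synthetic.append(
                tuple(a + gap * (b - a) for a, b in zip(x, x_hat)))
    return synthetic

# Hypothetical minority-class feature vectors.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_samples = smote(minority, k=2, n=2, rng=random.Random(42))
```

Each synthetic point lies on the segment between a minority sample and one of its neighbours, so the new samples stay inside the region already occupied by the minority class.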

E. Classification Based on SVM
The pseudolabel data after balancing needs to be put into the classifier for training, and the SVM classifier is selected in this paper. SVM, a supervised learning method suitable for data with small sample sizes, was first proposed by Vapnik et al. [32]. Due to its easy operation and high robustness, SVM has been widely used in the field of failure diagnosis, and this paper also uses SVM to do classification. SVM is grounded in statistical learning theory. It first maps the input data from the low-dimensional space to a high-dimensional space to make the problem linearly separable, then finds an optimal hyperplane in the high-dimensional space to divide the data. Therefore, the selection of the optimal hyperplane directly determines the classification effect. Because linear binary classification is the basis and prototype of SVM, the linear binary classification problem is considered first, as shown in Fig. 3. Define the sample points $(x_i, y_i)$, $i = 1, 2, ..., s$, $x_i \in R^m$, $y_i \in \{1, -1\}$. Based on the above conditions, a classification hyperplane H is constructed: (17) H divides the above sample points into two classes, and the formula is: (18) H can separate two different classes of samples, but the goal of SVM is to find the optimal hyperplane, that is, the one that maximizes the distance between the two classes of samples. Therefore, the objective function is: (19)

Artificial Intelligence and Sensor Informatics: Exploring Smart City and Construction Business Implications
The constraint is: (20) To find the optimal solution of equation (19), the Lagrange multiplier is introduced, and equation (19) can be converted into (21) where $a_i \ge 0$ and $i = 1, 2, ..., l$.
The constraint is updated to: (22) The sample points with $a_i > 0$ are referred to as support vectors. The optimal classification discriminant function is: (23) So far, the linear binary classification problem is solved. In addition, the nonlinear binary classification problem can be converted to a linear binary classification problem by kernel functions. The objective function of the nonlinear binary classification problem is: (24) The optimal classification discriminant function of the nonlinear binary classification problem is: (25) Here, $K(\cdot, \cdot)$ is the kernel. The kernel function needs to be selected manually, and the Gaussian kernel function is used in this paper: (26) where $x_c$ is the center of the kernel function, and σ is the width parameter of the kernel function.
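For illustration, a Gaussian kernel with center $x_c$ and width parameter σ can be written as below. The exact form of equation (26) is not reproduced here, so the common convention with $2\sigma^2$ in the denominator is assumed, and the function name `gaussian_kernel` is hypothetical.

```python
import math

def gaussian_kernel(x, x_c, sigma):
    """Gaussian (RBF) kernel value between point x and kernel center x_c."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_c))
    # Assumed convention: exp(-||x - x_c||^2 / (2 * sigma^2)).
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

same_point = gaussian_kernel((1.0, 2.0), (1.0, 2.0), sigma=1.0)
unit_apart = gaussian_kernel((0.0, 0.0), (1.0, 0.0), sigma=1.0)
```

The kernel value is 1 when the point coincides with the center and decays toward 0 as the distance grows, with a larger σ giving a slower decay.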
So far, the binary classification problem has been solved.For the multiclassification problem, this paper builds an SVM multiclassification framework based on a binary tree, which is shown in Fig. 4.

V. Experiments and Result Analysis
To verify the effectiveness of the proposed TI-LDA feature extraction algorithm and the text-driven failure diagnosis model, this paper used the failure-description Chinese text accumulated and recorded by an aircraft maintenance and support system to design verification experiments, and analyzed the experimental results in detail. After eliminating repeated and missing data, 1683 effective failure-description texts were obtained. Some of the failure-description texts are shown in Table II. To remove stop words more effectively, this paper designed a special stop word list according to the characteristics of the corpus and the existing stop word lists. Analysis of the text used in this paper shows that two-word stop words are the most common, followed by professional characters, letters, and three-word stop words. Some of the stop words are shown in Table III.

A. Effectiveness Verification Experiment of TI-LDA
To verify the effectiveness of TI-LDA, this paper first used TF-IDF, LDA and TI-LDA to extract the feature vectors of the preprocessed failure-description texts, then clustered the three sets of feature vectors using CFSFDP, and finally evaluated the effectiveness of the feature extraction by comparing the effects of clustering. The normalized feature vectors obtained are shown in Table IV. For the three sets of vectors, this paper used the CFSFDP method for text clustering. According to the principle of CFSFDP, the relative distance values between the vectors were first calculated, and the results are shown in Table VII, Table VIII and Table IX. Then, the values of $\gamma_i$ were calculated and drawn in descending order on a two-dimensional plane, as shown in Fig. 5, Fig. 6 and Fig. 7. Finally, the number of clustering centers and classes was determined based on the numerical variation diagram of $\gamma_i$. It is easy to see from Fig. 5, Fig. 6 and Fig. 7 that although the methods of feature extraction are different, the number of classes is the same. To evaluate the clustering effect, this paper looks at two aspects, the intraclass compactness and the interclass separability, and uses the average intraclass compactness and the average interclass separability as evaluation indicators. The smaller the average intraclass compactness value, the higher the compactness of the entire dataset; the larger the average interclass separability value, the higher the separability of the entire dataset. To reflect the clustering effect on the entire dataset, this paper defines a comprehensive evaluation indicator as the ratio of the average interclass separability to the average intraclass compactness. The larger the value of this indicator, the better the overall clustering effect. The specific results are shown in Table X. In terms of intraclass compactness, based on Table X, TI-LDA has the smallest value and therefore the highest intraclass compactness, an obvious improvement over TF-IDF. In terms of interclass separability, TF-IDF performs best because of the higher dimension of its feature vectors; it is followed by TI-LDA and LDA. For overall performance, TI-LDA scores highest and is far ahead of TF-IDF and LDA. Altogether, the TI-LDA method proposed in this paper effectively improves the clustering effect.

B. Effectiveness Verification Experiment of Text-Driven Failure Diagnosis Model
Because the TI-LDA method proposed in this paper is better than TF-IDF and LDA based on Table X, the subsequent processing of the confirmatory experiments is based on the clustering results of TI-LDA.Through CFSFDP text clustering, the text data in this paper was divided into six completely different failure types.The first failure type is a transmitter failure; the second is a signal failure, which mainly is a signal problem of different monitors; the third is the failure of the aircraft's flight parameter indicators; the fourth is a generator failure, which is mainly caused by a generator overload and a signal failure; the fifth is engine failure; and the last is the failure caused by mechanical fatigue.The details are shown in Table XI.In this paper, a number of samples of each failure type were calculated, as shown in Fig. 8.It can be seen from Fig. 8 that the text data presents an obvious class imbalance, with a large quantity gap between the major class and the minor class.In addition, the second type of failure and the sixth type of failure accounts for more than 90% of the failures, while the remaining four types accounts for a relatively small percent.In this paper, SMOTE is mainly used for data equalization to solve the class-unbalanced problem and to sample a number of each failure type after oversampling, as shown in Fig. 9.  Comparing Fig. 8 and Fig. 9, the class imbalanced problem has been significantly improved and there is no significant difference in the number of samples between different classes after oversampling.
This paper used the original data and the oversampled data to train two SVM classifiers. By comparing the classification effect of the two classifiers, the effectiveness of the proposed text-driven failure diagnosis model was verified. Usually, the classification effect of a classifier is mainly judged by the classification accuracy. However, for the class-imbalanced text data in this paper, the classification accuracy alone is not comprehensive. Therefore, this paper uses FNR (False Negative Rate), FPR (False Positive Rate), Acc (Accuracy), Recall and F1, all derived from the confusion matrix, to evaluate the classification effect. The specific results are shown in Table XII. Table XII shows that the classifier trained on the original data is very close in Acc to the classifier trained on the oversampled data, and both have high accuracy, which further reflects the validity and feasibility of the proposed failure diagnosis model from a data perspective. However, for the other classification indicators, the failure diagnosis model trained on the oversampled data shows obvious improvements in Recall and F1, which indicates that the accuracy of the classifier trained on the original data is falsely high. In the data used in this paper, the major class usually has many more samples than the minor class. Therefore, classifiers tend to assign minor-class samples to the major class, which often results in high false alarm rates for the major class and high missed alarm rates for the minor class. This can also be seen from the FNR and FPR. The FNR and FPR of the classifier trained on the oversampled data are much smaller, so it makes fewer mistakes in the classification task and effectively avoids high false alarm rates.
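The relation between these indicators and the confusion matrix can be made concrete with a short Python sketch. It uses the standard one-vs-rest definitions; how the per-class values are averaged for Table XII is not stated in this section, so no averaging is performed here, and the function name is hypothetical.

```python
def per_class_metrics(confusion):
    """Compute one-vs-rest metrics from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as j.
    Returns (accuracy, list of per-class metric dicts).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as 0.0 here.
        return num / den if den else 0.0

    results = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # class c, missed
        fp = sum(confusion[r][c] for r in range(n)) - tp  # others, called c
        tn = total - tp - fn - fp
        recall = ratio(tp, tp + fn)
        precision = ratio(tp, tp + fp)
        results.append({
            "FNR": ratio(fn, fn + tp),
            "FPR": ratio(fp, fp + tn),
            "Recall": recall,
            "F1": ratio(2 * precision * recall, precision + recall),
        })
    return accuracy, results
```

For an illustrative two-class matrix such as `[[90, 10], [8, 2]]` (made-up numbers, not data from this study), the accuracy is about 0.84, yet the minor class has a recall of only 0.2 and the major class an FPR of 0.8. This is the "falsely high accuracy" pattern that the additional indicators are meant to expose.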

VI. Conclusion and Discussion
In contrast to traditional failure diagnosis methods based on structured data, this paper proposes a text-driven failure diagnosis model using NLP technology, which fills a gap in research on failure diagnosis based on unstructured data, especially text data.
To resolve the shortcomings of the traditional TF-IDF and LDA text feature extraction methods, this paper proposes TI-LDA, a new text feature extraction method based on serial feature fusion, and uses the CFSFDP clustering method to verify the effectiveness of TI-LDA. The final experimental results show that the feature vectors extracted by TI-LDA can effectively improve intraclass compactness and interclass separability, compared with using TF-IDF or LDA alone.
In this paper, the TI-LDA method is applied to the field of failure diagnosis, and a text-driven failure diagnosis model based on TI-LDA is created by combining machine learning methods such as CFSFDP clustering, SMOTE oversampling and SVM classification. Once a failure-description text is entered, this failure diagnosis model can classify it and automatically judge which failure mode the described failure belongs to. The effectiveness verification experiment showed that the failure diagnosis model proposed in this paper has high accuracy and effectively solves the problems of falsely high accuracy and high false alarm rates caused by class imbalance.
It is worth mentioning that this text-driven failure diagnosis model is unsupervised, which means it does not need any labeled data, so it has better portability and lower labor costs. This model has very broad application prospects, especially in failure diagnosis for large and complex equipment such as aircraft and high-speed rail, and even in the medical field.
However, the model can still be improved. For example, the TI-LDA text feature extraction method fuses the two feature vectors at a ratio of 1:1, but this may not be the best fusion ratio. Our next work will therefore study how to find the fusion ratio that gives the best fusion effect. In addition, to simplify the study, this paper does not consider the interference of abnormal or unreal text data. Future work will focus on ways to identify abnormal text data so as to improve the accuracy of the model.
To verify the effectiveness of the proposed feature extraction algorithm and failure diagnosis model, a long-term accumulated failure-description text of an aircraft maintenance and support system was used as the subject of an empirical study. The main contribution of this paper is to propose a more effective feature extraction method by fusing TF-IDF and LDA, two typical feature extraction methods, and to apply it to the field of failure diagnosis by establishing an intelligent text-driven failure diagnosis model with the help of machine learning methods such as clustering and classification. The remainder of this paper is organized as follows: Section II presents an overview of text feature extraction and feature fusion. Section III introduces the basic principle of the proposed feature extraction algorithm. Section IV presents the proposed text-driven failure diagnosis model. Section V describes the experiment and discusses the experimental results. Section VI covers conclusions and discussion.
et al. Compared with the typical partition-based K-Means [29] method, CFSFDP not only handles clusters with aspherical shapes but also automatically determines the number of clusters. Compared with the typical density-based DBSCAN (Density-Based Spatial Clustering of Applications with Noise) method, CFSFDP does not need to iterate repeatedly to determine the density threshold.
… | The radio cannot transmit and the signal cannot be received. The ground wire of switch pieces K14a and K14b of the RF unit was broken. The L3 inductor of the 23MZ local oscillator was not connected, and the 16HZ frequency was off-tune. | Signal
3 | 773 | The status indicator light of "I" was on. The reading of the ground speed indicator was minimal (0021) and fixed. The low-frequency extension broke down. | Indicator
4 | 1416 | The starting generator of the No. 1 engine was overloaded and the signal light was on. The reducer bearing was lightly leaking. | Generator
5 | 1509 | Lots of oil leaked from the residual oil pipe after the engine stopped. | Engine
6 | 1679 | High pressure cannot be applied at position "Land". | Mechanical fatigue

TABLE I.
Comparison of Dictionary-Based Word Segmentation Systems

TABLE II.
Examples of Aircraft Failure-Description Text

TABLE III.
Stop Words List

TABLE IV.
Normalized Feature Vectors of TF-IDF

TABLE V.
Normalized Feature Vectors of LDA

TABLE VII.
Relative Distance Values of TF-LDA (columns: n, γ)

Fig. 5. γ_i Value Variation Diagram of TF-IDF.

TABLE X.
Comparison Table of Clustering Indicators

TABLE XI.
Clustering Results of TI-LDA

TABLE XII.
Comparison Table of Evaluation Indicators of SVM Classifier