Entity recognition of railway signal equipment fault information based on RoBERTa-wwm and deep learning integration

The operation and maintenance of railway signal systems generate a large and complex volume of fault-related text data. To address the problems of fuzzy entity boundaries and low entity recognition accuracy in the field of railway signal equipment faults, this paper proposes an entity recognition method for railway signal equipment fault information based on RoBERTa-wwm and deep learning integration. First, the model uses the RoBERTa-wwm pretrained language model to obtain word vectors for the text sequences. Second, a parallel network consisting of a BiLSTM and a CNN is constructed to obtain contextual feature information and local feature information, respectively. Third, the feature vectors output by the BiLSTM and the CNN are fused and fed into a multi-head self-attention (MHA) layer, which focuses on key feature information and mines the connections between different features. Finally, label sequences with constraint relationships are output by a CRF layer to complete the entity recognition task. Experiments are carried out on railway signal equipment fault texts from the past ten years, and the results show that the model achieves higher evaluation indexes than traditional models on this dataset, with precision, recall and F1 values of 93.25%, 92.45% and 92.85%, respectively.


Introduction
Railway signal equipment is a general term for signal display equipment, station interlocking equipment, and section blocking equipment. It is a crucial guarantee for the safety of train operation and shunting work, as well as for improving the traffic capacity of the railway [1]. With the rapid development of information technology, a large amount of unstructured fault-related text data is generated by the railway signal system during operation and maintenance. To handle faults, maintenance staff mainly rely on manual experience and expert knowledge. Owing to limited experience, poor communication and delayed fault processing, this kind of maintenance may lead to major safety hazards and cannot meet the demands of the high-speed operation of modern railways in China. Therefore, a major challenge is how to make reasonable use of the fault information generated at railway sites, mine the potential relationships within fault texts, and assist field personnel in quickly resolving the various fault phenomena occurring on site.
The knowledge graph (KG) is a technical approach that uses graphical models to describe knowledge and represent the associative relationships between entities [2]. Knowledge graphs make information resources easier to compute, understand and evaluate, enabling rapid responses and reasoning with knowledge. Knowledge graphs are primarily categorized into two types: open domain knowledge graphs and vertical domain knowledge graphs. The knowledge graph for railway signal equipment faults belongs to the vertical domain, with strong domain characteristics and strict requirements for knowledge content. It is closely related to safety and can provide auxiliary decision-making for intelligent fault diagnosis and prediction.
Named Entity Recognition (NER), as one of the significant steps in constructing knowledge graphs, uses related models to locate and classify named entities in text into predefined labeled categories [3]. Given that the majority of fault information for railway signal equipment exists as unstructured text data, it is crucial to first identify specific categories of entities through NER, which lays the groundwork for creating a knowledge graph and other associated tasks. Compared with other fields, entity recognition in the field of railway signal equipment faults is characterized by many proper names, fuzzy entity boundaries and rich entity expressions [4]. To recognize different types of entity information, this paper proposes a model based on RoBERTa-wwm and deep learning integration (referred to as the RBCMC multilayer model). The core idea is presented in the following four points: (1) After sorting the fault texts of railway signal equipment, the features of the text data are summarized to define five kinds of entity labels: fault phenomenon, fault position, fault reason, repair measure, and repair outcome. The BMEO scheme is then applied with the YEDDA [5] labeling tool to annotate each character in the fault text.
(2) To obtain a vector representation of the text's rich semantic information, the labeled fault text is processed by the RoBERTa-wwm pretrained language model. The RoBERTa-wwm model not only enhances the semantic representation by acquiring a large amount of prior knowledge from unlabeled corpora, but also obtains word-level semantic representations during training. A neural network consisting of a Bidirectional Long Short-Term Memory (BiLSTM) network and a Convolutional Neural Network (CNN) working in parallel is constructed to extract the contextual feature information and local feature information of the text, respectively.
(3) The Multi-Head Self-Attention mechanism (MHA) is used to mine the associations between different features and to extract feature vectors that incorporate information from other words. By setting the number of heads of the MHA, features are extracted from different dimensions and then concatenated to improve the model's recognition ability. (4) Experiments conducted on fault data from railway signal equipment show that the model proposed in this paper, which combines RoBERTa-wwm and deep learning, is well suited for entity recognition in the field of railway signal equipment faults. The precision, recall and F1 values achieved were 93.25%, 92.45% and 92.85%, respectively.

Related work
From the perspective of the state of the art in research algorithms, current entity recognition methods can be divided into the following three main types:

Rule and dictionary approach
This kind of approach first requires the construction of many entity extraction rules, which are generally built manually by experts with specific domain knowledge; the rules are then matched against text strings to recognize named entities [6]. Although the accuracy and recall of this method are generally high, the rule set construction cycle lengthens as the dataset grows, and it becomes increasingly difficult to adapt to emerging entity types.

Traditional machine learning based approach
This approach is based on commonly used machine learning models, including Hidden Markov Models (HMM) [7], Maximum Entropy Models (ME) [8] and Conditional Random Fields (CRF) [9]. Although this method is more effective than the first, it has the disadvantages of high requirements for text feature extraction and strong interdependence between predicted labels.

Deep learning based approach
With the significant progress of deep learning in natural language processing in recent years, deep neural networks have been successfully applied to NER tasks. At present, the neural networks used for NER mainly include CNNs, Recurrent Neural Networks (RNNs) and networks that incorporate an attention mechanism. Neural networks can automatically learn sentence features and achieve end-to-end entity recognition without complex feature engineering [10]. Huang [11] proposed various sequence labeling models based on LSTM networks, among which the BiLSTM-CRF model achieved state-of-the-art accuracy on the NER dataset. Yang [12] generated word vectors corresponding to the labeled sequences through Word2vec [13] and used the BiLSTM-CRF model to complete a railway accident fault NER task by loading knowledge from an external source, namely Wikipedia. Kong [14] constructed a multilayer CNN model that can capture both short-term and long-term contextual information and makes full use of CPU parallelism to improve efficiency compared with LSTMs. Li [15] used a parallel structure of MHA and BiLSTM neural networks to obtain the feature representation, combining a medical dictionary and a pretrained language model to fuse character and word vectors.
In recent years, the emergence of pretrained language models (PLMs) has created more possibilities for enhancing text feature representation. Devlin [16] first proposed Bidirectional Encoder Representations from Transformers (BERT) to pretrain models, generate word vectors containing positional information and incorporate contextual features into word vectors through the bidirectional transformer model. In addition, PLMs such as Generative Pre-Training (GPT) [17], Enhanced Language Representation with Informative Entities (ERNIE) [18] and A Lite BERT (ALBERT) [19] perform well in terms of feature representation. The BERT model has been widely used in NER tasks. Guo [20] proposed the BERT-BiLSTM-CRF legal case entity recognition method for the characteristics of domestic Chinese legal texts, but only locations and person names in the legal texts were labeled, lacking other elements. To enrich character vectors, Li [21] mixed multi-source word segmentation information with global vocabulary embedding information based on BERT-BiLSTM-CRF; the model performs better on crop disease and insect pest texts. Lin [22] introduced MHA to focus on key feature information, and the proposed BMBC model was able to accurately identify various types of entities in high-speed rail turnout information. Ma [23] proposed an LSTM-CRF and CNN serial sequence labeling model for NER, which obtained high evaluation indexes on the CoNLL-2003 English dataset, but the LSTM network was unable to capture textual information in both directions. However, there is no separator between words in Chinese, and BERT can only mask characters rather than whole words when pretraining on a Chinese corpus, so word-level semantic representations cannot be obtained through pretraining. To address this shortcoming of the BERT model, the RoBERTa model [24] was proposed, which uses more training data, longer training time and larger training batches; RoBERTa-wwm further combines the benefits of the Chinese whole word mask (wwm) and the RoBERTa model. To take full advantage of the pretrained encoder layers, Zhang [25] designed a dynamic weight fusion of the vectors generated by the 12 transformer layers of RoBERTa-wwm, which is used as the input to the underlying BiLSTM network.
Most of the deep learning based NER models proposed by the above scholars use a single neural network. This paper not only proposes a parallel combination of BiLSTM and CNN feature extraction networks to obtain the contextual features of the fault text, but also introduces MHA after the network to mine the associations between different features and extract feature vectors that incorporate information from other words.
Compared with traditional word vector representations, the BERT series of models performs bidirectional modeling through a deep transformer architecture, which allows the context on both sides of a word to be taken into account simultaneously and yields more complete contextual information. RoBERTa-wwm is specifically designed for Chinese data, where words lack separators and BERT cannot mask whole words during pretraining. It employs whole word masking and a dynamic masking strategy to learn richer linguistic representations, making it more appropriate for identifying fault information in Chinese railway signal equipment texts.

Corpus construction
Fault data of railway signal equipment are stored in text form by recording and summarizing the fault phenomena, cause analysis, processing and processing results, which provides a relatively complete and detailed record of the signal equipment faults that occur [26].
The goal of entity recognition for railway signal equipment fault information is to extract all kinds of entity information from fault texts and classify the different types of entities, such as fault phenomenon, fault reason, and repair measure. In this paper, entity recognition is treated as a sequence labeling task: each Chinese character in the text is labeled, and the beginning and ending items in the sentence are identified to extract named entities. This process effectively avoids the accumulation of errors caused by word segmentation and realizes the extraction and classification of entity information [27].
A given signal equipment fault text $X$ of length $n$ is denoted as $X = \{x_1, x_2, \cdots, x_n\}$, where $x_i$ represents the $i$-th character. After the RBCMC multilayer model, the label sequence $Y = f(X)$ corresponding to each text character is obtained, where $f(\cdot)$ represents the nonlinear mapping in the entity model, the length of $Y = \{y_1, y_2, \cdots, y_n\}$ is the same as that of $X$, and $y_i$ represents the label corresponding to the $i$-th character.
Analyzing the characteristics of text data about faults, this paper defines the five entity types shown in Table 1, which are fault phenomenon, fault position, fault reason, repair measure, and repair outcome.In addition to covering the whole process of fault diagnosis, these five entity categories also lay the foundation for subsequent relationship extraction tasks.
The fault text is labeled with the BMEO scheme through the YEDDA labeling tool, where B denotes the beginning of an entity, M the middle of an entity, E the end of an entity, and O a nonentity character; the position tag is connected to the defined entity type with "-". Each entity tag thus represents both the entity type and the position of the character within the entity.
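As an illustration, the following is a minimal sketch of how a character sequence might be converted into BMEO tags; the helper function and the example entity span are hypothetical and are not taken from the paper's annotation pipeline.

```python
def bmeo_tags(text, spans):
    """Assign BMEO tags to each character given (start, end, type) entity spans.

    `spans` uses half-open [start, end) character indices; this convention
    is an assumption made for illustration only.
    """
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        if end - start == 1:
            # The paper does not specify a single-character tag, so "B-" is used here as a guess.
            tags[start] = f"B-{etype}"
            continue
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end - 1):
            tags[i] = f"M-{etype}"
        tags[end - 1] = f"E-{etype}"
    return tags

# Hypothetical example: a fault phenomenon ("Phe") entity spanning characters 3-8.
text = "x" * 10
print(bmeo_tags(text, [(3, 9, "Phe")]))
# ['O', 'O', 'O', 'B-Phe', 'M-Phe', 'M-Phe', 'M-Phe', 'M-Phe', 'E-Phe', 'O']
```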

Entity recognition of railway signal equipment fault information
In this paper, a model based on RoBERTa-wwm and deep learning integration is proposed for entity recognition of railway signal equipment fault information. The overall structure of the model is shown in Figure 1; it mainly contains four layers: the RoBERTa-wwm layer, the BiLSTM-CNN layer, the MHA layer, and the CRF layer.
First, taking full account of the relational features between characters, words and sentences, the text data is fed into the RoBERTa-wwm embedding layer so that the original fault text is converted into a vector representation suitable for the subsequent CNN and BiLSTM networks. Then, to fully extract the local feature vectors and contextual feature vectors of the text, the vectors generated by the RoBERTa-wwm layer are used as inputs to the CNN and the BiLSTM. After fusing the two sets of features, the MHA layer mines the internal relationships between different features to obtain text features at different granularities, and finally the optimal labeled sequence with constraints is output by the CRF layer, as sketched below.
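The following PyTorch sketch outlines one plausible way to wire these four layers together. The layer sizes, the pretrained checkpoint name (hfl/chinese-roberta-wwm-ext) and the use of the pytorch-crf package are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel   # RoBERTa-wwm checkpoints are loaded with the BERT classes
from torchcrf import CRF             # pip install pytorch-crf (assumed decoding layer)

class RBCMC(nn.Module):
    """Sketch of the RoBERTa-wwm -> BiLSTM || CNN -> MHA -> CRF pipeline (dimensions assumed)."""
    def __init__(self, num_labels, hidden=256, heads=8):
        super().__init__()
        self.encoder = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
        d = self.encoder.config.hidden_size                        # 768 for the base model
        self.bilstm = nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
        self.cnn = nn.Conv1d(d, 2 * hidden, kernel_size=3, padding=1)
        self.mha = nn.MultiheadAttention(4 * hidden, heads, batch_first=True)
        self.fc = nn.Linear(4 * hidden, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        x = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        ctx, _ = self.bilstm(x)                                    # contextual features
        loc = self.cnn(x.transpose(1, 2)).transpose(1, 2)          # local features
        fused = torch.cat([ctx, loc], dim=-1)                      # feature fusion
        attn, _ = self.mha(fused, fused, fused,
                           key_padding_mask=~attention_mask.bool())
        emissions = self.fc(attn)
        if labels is not None:                                     # training: negative log-likelihood
            return -self.crf(emissions, labels, mask=attention_mask.bool())
        return self.crf.decode(emissions, mask=attention_mask.bool())
```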
Example fault text: The contact point of machine J is not in good contact, causing an open circuit, which leads to 31# turnout reversal with no indication. Afterwards, the maintenance personnel replaced the contact point, and the fault disappeared.

RoBERTa-wwm layer
To address the issue of a word having multiple meanings, the BERT model adds word position information, improving entity recognition accuracy [28]. The two core pretraining tasks of the BERT model, which is based on the bidirectional transformer encoder, are the Masked Language Model (MLM) and Next Sentence Prediction (NSP). In MLM, a certain proportion (15%) of the words to be predicted are first randomly replaced with the [MASK] label, and the original value of each masked word is then predicted from the other, non-masked words provided by the context. The NSP task primarily trains the model to understand the relationship between sentences.
RoBERTa-wwm is an improvement on the BERT model; its framework is consistent with BERT, and it improves accuracy by 5% to 20% over BERT [29]. RoBERTa-wwm makes three improvements on the BERT model: (1) The pretraining process uses a dynamic masking strategy, which creates a unique mask for each input sequence; the input data is generated more randomly, allowing the model to learn more semantic information. (2) The NSP task is removed, which improves the model's efficiency to some extent. (3) Byte-Pair Encoding (BPE) is used to process the text data.
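To make the difference between character-level masking and whole word masking concrete, the short sketch below masks every character of a segmented word at once; the use of the jieba segmenter and the example sentence are illustrative assumptions, not the tooling used in the paper.

```python
import random
import jieba  # a common Chinese word segmenter, used here only for illustration

def whole_word_mask(sentence, mask_ratio=0.15):
    """Mask whole words: every character of a selected word becomes [MASK]."""
    words = list(jieba.cut(sentence))
    n_to_mask = max(1, int(len(words) * mask_ratio))
    masked_idx = set(random.sample(range(len(words)), n_to_mask))
    out = []
    for i, w in enumerate(words):
        # BERT's original Chinese masking would replace single characters;
        # wwm replaces all characters of the chosen word together.
        out.extend(["[MASK]"] * len(w) if i in masked_idx else list(w))
    return out

print(whole_word_mask("道岔转换无表示"))  # hypothetical fault-text fragment
```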
The pretraining process of RoBERTa-wwm is shown in Figure 2. The input data is first processed into a specific format, in which the labels [CLS] and [SEP] represent the start and end positions of the text, respectively, and some characters in the text are randomly masked with the [MASK] label [16]. The input corresponding to the text sequence consists of the superposition of three different embedding features, namely token embedding, segment embedding, and position embedding.
RoBERTa-wwm trains word vectors using the encoder portion of the bidirectional transformer, which more comprehensively retains the semantic information of the fault text, enhances the model's ability to capture bidirectional contextual features, alleviates the problem of a word having multiple meanings, and theoretically improves the accuracy of the entity recognition model [30].
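As a hedged illustration of how such contextual character vectors might be obtained, the snippet below encodes a sentence with the publicly available hfl/chinese-roberta-wwm-ext checkpoint through the HuggingFace transformers library; the checkpoint name and the choice of the last hidden state as the character representation are assumptions, not necessarily the configuration used by the authors.

```python
import torch
from transformers import BertTokenizer, BertModel  # this checkpoint is loaded with the BERT classes

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

sentence = "道岔转换无表示"  # hypothetical fault-text fragment
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token, including the [CLS] and [SEP] markers.
char_vectors = outputs.last_hidden_state
print(char_vectors.shape)  # typically torch.Size([1, len(sentence) + 2, 768]) for Chinese characters
```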

Figure 2. The procedure of RoBERTa-wwm generating input vectors.

BiLSTM-CNN layer
This paper proposes a parallel network consisting of a BiLSTM and a CNN for feature extraction. The feature vectors produced by the RoBERTa-wwm pretrained language model in the previous layer are fed into the BiLSTM and CNN networks to extract the contextual and local features of the fault text, respectively. Subsequently, the two sets of features are fused and input into the MHA layer for further processing.

BiLSTM
Recurrent Neural Networks (RNNs), which treat text sequences as directed graphs and can capture historical dependencies through internal feedback connections [31,32], are well suited for capturing contextual information about fault text. However, traditional RNNs suffer from gradient vanishing and gradient explosion during training. Long Short-Term Memory (LSTM) networks were proposed to solve these problems.
An LSTM is a kind of RNN with a gating mechanism that can learn long-term dependencies between sequence elements and performs well in text processing. It consists of a forget gate, an input gate and an output gate, and its structure is shown in Figure 3.
First, the content to be discarded from the previous cell is decided by the forget gate, which receives the output of the previous moment $h_{t-1}$ and the input of the current moment $x_t$; the result $f_t$ of the forget gate at moment $t$ is given by formula (2), where $W_f$ denotes the weight matrix of the forget gate, which is divided into two parts: $W_{xf}$, the weight matrix applied to the input $x_t$, and $W_{hf}$, the weight matrix applied to the previous state $h_{t-1}$; $b_f$ denotes the bias term. The result $f_t$ is bounded to (0, 1) by the activation function $\sigma$. The input gate controls the information that needs to be added to the cell and is calculated as shown in formulas (3) and (4), where $W_i$ denotes the weight matrix of the input gate, $C_t$ is the cell state of the LSTM at moment $t$, and the forget gate output $f_t$ is multiplied with the previous cell state $C_{t-1}$ to achieve selective forgetting.
The output gate decides which information is used as the output of the current stage and is calculated as shown in formulas (5) and (6), where $W_o$ denotes the weight matrix of the output gate and $b_o$ denotes the bias term. Multiplying the output gate $o_t$ with $\tanh(C_t)$ yields the new output $h_t$ at the current moment, which serves as one of the inputs at the next moment. The BiLSTM layer extracts the contextual features of the text by combining the forward and backward hidden states, which gives better access to long-distance bidirectional semantic dependencies and effectively addresses dependencies between entities that are far apart in the fault text.
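For reference, the standard LSTM gate equations, which are assumed to correspond to formulas (2)-(6) referenced above (the exact numbering and notation of the original are not preserved in this text), can be written as:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \qquad (2)$$

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \qquad (3)$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \qquad (4)$$

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \qquad (5)$$

$$h_t = o_t \odot \tanh(C_t) \qquad (6)$$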

CNN
CNNs, among the most widely used algorithms in deep learning, are broadly applied to image and time series data processing; their non-fully connected, weight-sharing structure reduces the complexity of the network model and the number of weights. A CNN includes two operations, convolution and pooling, whose principles are shown in Figure 4. The specific process is as follows: first, convolution kernels of different sizes are applied to the input feature vector matrix to compute the local features of the text, as shown in formula (7).
Where $W$ is the weight parameter of the convolution kernel, $f$ is the activation function, $b$ is the bias term of the convolution kernel, and the final output of the convolution layer is shown in formula (8).
To simplify the feature representation, after obtaining the text features by convolution, a max-pooling operation is used to keep the strongest features: the maximum value of the convolution result $c$ is taken, as shown in formula (9). After the pooling operation, the feature vector not only has a reduced dimension but also preserves the most essential semantic information of the text.
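The following is a minimal sketch of this convolution-plus-max-pooling step over character vectors in PyTorch; the kernel sizes, channel counts and input dimensions are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Local feature extractor: multi-size 1-D convolutions followed by max-pooling."""
    def __init__(self, embed_dim=768, out_channels=128, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, out_channels, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):                      # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                  # Conv1d expects (batch, channels, seq_len)
        feats = [torch.relu(conv(x)) for conv in self.convs]
        # Max-pooling over the sequence keeps the strongest response of each filter.
        pooled = [f.max(dim=2).values for f in feats]
        return torch.cat(pooled, dim=1)        # (batch, out_channels * len(kernel_sizes))

x = torch.randn(4, 50, 768)                    # a batch of 4 sequences of 50 character vectors
print(TextCNN()(x).shape)                      # torch.Size([4, 384])
```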

MHA layer
By incorporating the attention mechanism, the neural network can prioritize and concentrate on the information most relevant to the current task, improving efficiency and accuracy. Considering the small size of the railway signal equipment fault corpus and the abundance of non-standardized text in it, the contextual feature vectors from the BiLSTM and the local feature vectors from the CNN are fused to obtain richer textual features, which are then input into the MHA layer for attention computation.
In the self-attention mechanism, the input matrix $X$ is first multiplied by three learned weight matrices, converting the input into a query matrix $Q$, a key matrix $K$ and a value matrix $V$. The attention weights are then computed from $Q$ and $K$ and applied to $V$ to obtain the weighted output [33]. For inputs $Q$, $K$ and $V$, the output is computed as shown in formula (10), where $Q$, $K$ and $V$ have dimensions $d_q$, $d_k$ and $d_v$, respectively.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (10)$$

The self-attention mechanism has the flaw that the model may focus excessively on its own position when encoding information from the current position, so MHA is proposed to address this issue.
MHA consists of multiple attention units working together, letting the model attend to information from different positions within different representation subspaces. Figure 5 shows the working principle of MHA: the outputs of all attention heads $head_i$ are concatenated and then linearly transformed to obtain the final feature vector $H$, as shown in formulas (11) and (12), where $W_i^Q$, $W_i^K$ and $W_i^V$ are the weight matrices of the different attention units and "Concat" denotes vector concatenation.

$$H = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, head_2, \cdots, head_h)W^O \qquad (11)$$

$$head_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \qquad (12)$$

Figure 5. Working schematic of MHA.
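As a hedged sketch, the multi-head self-attention step over the fused BiLSTM and CNN features could be expressed with PyTorch's built-in module as follows; the feature dimension and the head count of 8 (matching the later hyperparameter study) are assumptions.

```python
import torch
import torch.nn as nn

feature_dim, num_heads = 1024, 8                 # assumed fused feature size and head count
mha = nn.MultiheadAttention(feature_dim, num_heads, batch_first=True)

# fused: concatenation of BiLSTM contextual features and CNN local features,
# shape (batch, seq_len, feature_dim); the values here are random placeholders.
fused = torch.randn(4, 50, feature_dim)

# Self-attention: the fused features serve as query, key and value at once.
attn_out, attn_weights = mha(fused, fused, fused)
print(attn_out.shape)       # torch.Size([4, 50, 1024])
print(attn_weights.shape)   # torch.Size([4, 50, 50]) - averaged over heads by default
```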

CRF layer
To ensure the legitimacy of the final predicted labels, a CRF is introduced to add constraints to the label sequence. For example, the first character of an entity must carry a "B-" label or "O", and the label following "B-Phe" must be "M-Phe" or "E-Phe". With these constraints, the probability of illegal label sequences in the prediction is reduced, improving the accuracy of entity recognition. The CRF structure is shown in Figure 6. Given the observation sequence $X = \{x_1, x_2, \cdots, x_n\}$ and the state prediction sequence $y = \{y_1, y_2, \cdots, y_n\}$, the score $S(X, y)$ is calculated according to formula (13), where $P_{i, y_i}$ denotes the score of observing character $x_i$ with label $y_i$, which comes from the hidden state of the BiLSTM, and $A_{y_{i-1}, y_i}$ denotes the score of transferring from label $y_{i-1}$ to label $y_i$, which is learned as part of the model parameters during training.
The probability of the predicted sequence is obtained from formula (14) by normalizing over all possible sequence paths with the softmax function, where $Y_X$ denotes the set of all possible tag sequences for the input sequence, $\tilde{y}$ denotes a candidate predicted tag sequence, $S(X, \tilde{y})$ denotes the total score of the sentence under that tag sequence and $\exp(\cdot)$ denotes the exponential function.
In the training phase, the likelihood function of the predicted sequence is obtained by taking the logarithm of both sides, as shown in formula (15).
In the decoding stage, the sequence with the highest overall score is obtained by maximizing the likelihood function, giving the final predicted label sequence, as shown in formula (16).
Specifically, the output sequence of the preceding MHA layer is taken as input, and the CRF predicts the label sequence $y^{*} = \{y_1, y_2, \cdots, y_n\}$ with the constraint relationships and the highest probability based on the character labels before and after each position.
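For completeness, the standard linear-chain CRF scoring, normalization, log-likelihood and decoding steps, which are assumed to correspond to formulas (13)-(16) referenced above, take the following form:

$$S(X, \tilde{y}) = \sum_{i=1}^{n} P_{i, \tilde{y}_i} + \sum_{i=1}^{n} A_{\tilde{y}_{i-1}, \tilde{y}_i} \qquad (13)$$

$$p(\tilde{y} \mid X) = \frac{\exp\bigl(S(X, \tilde{y})\bigr)}{\sum_{y' \in Y_X} \exp\bigl(S(X, y')\bigr)} \qquad (14)$$

$$\log p(y \mid X) = S(X, y) - \log \sum_{y' \in Y_X} \exp\bigl(S(X, y')\bigr) \qquad (15)$$

$$y^{*} = \arg\max_{\tilde{y} \in Y_X} S(X, \tilde{y}) \qquad (16)$$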

Experimental environment
The experiments were run on a Windows 10 operating system with an Intel(R) Core(TM) i9-13900KF CPU. The programming language is Python 3.9, and Spyder is used as the integrated development environment. PyTorch, a deep learning framework developed by Facebook AI Research, was used to build the NER model.

Evaluation index
In this study, the model is evaluated on precision (P), recall (R) and F1 value for the task of recognizing entities in railway signal equipment fault information. The formulas for these three indexes are shown in (17)-(19).
Where TP represents the number of entities correctly recognized, FP represents the number of entities incorrectly recognized and FN denotes the number of entity labels not recognized.
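The standard definitions of these metrics in terms of TP, FP and FN, which are assumed to match formulas (17)-(19), are:

$$P = \frac{TP}{TP + FP} \qquad (17)$$

$$R = \frac{TP}{TP + FN} \qquad (18)$$

$$F1 = \frac{2 \times P \times R}{P + R} \qquad (19)$$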

Hyperparameter tuning
In the MHA layer, the number of heads is an essential parameter, and its value directly affects the MHA's ability to extract key features. Table 3 shows the effect of different numbers of attention heads on the model indexes. According to Table 3, the model achieves optimal performance when the number of attention heads is set to 8: compared with 2, 4 and 5 heads, the F1 score improves by 0.8%, 0.29% and 0.2%, respectively.

Model verification
Table 4 shows the recognition performance for the five entity labels under the NER model based on RoBERTa-wwm and deep learning integration proposed in this paper. From the results, the evaluation indexes for fault phenomenon, repair measure and repair outcome are high, with precisions reaching 93.08%, 92.03% and 96.45%, respectively. These three entity types have relatively uniform expressions, prominent grammatical features and clear entity boundaries. The precisions for fault reason and fault position are only 89.42% and 83.23%, respectively. This result is due to the diversity of the language used to describe fault reasons and fault positions and the blurring of the boundaries between these entities. Moreover, the correspondence between a fault phenomenon and its fault reason is very complex, which makes it difficult for the model to learn the correct expression of the cause of the fault.

Model comparison
To further validate the effectiveness of the model proposed in this paper on this dataset, it is compared against other common NER models; the results are shown in Tables 5-11. The common entity recognition models used are as follows: (1) The HMM is a directed probabilistic graphical, generative model. It treats entity labels as an unobservable sequence of hidden states and the readable raw corpus text as the observable output. (2) The CRF is a model of the conditional probability distribution of a set of output random variables given another set of input random variables; the linear chain CRF is one of the most commonly used models for sequence labeling problems. (3) The BiLSTM model, as the most basic neural network model, first takes a sentence as input, then runs two LSTMs over the sentence in opposite directions to construct a context-sensitive representation of each word, and finally predicts each entity label using the softmax function.
(4) The BiLSTM-CRF model is the most mainstream NER model. The BiLSTM layer produces the predicted score for each label, which serves as the input to the CRF. Through the transition probabilities in the CRF loss function, the model can learn various constraint rules to enhance the accuracy of the result.
(5) The BiLSTM-CNN-CRF model concatenates the feature vectors from the CNN and the BiLSTM before the CRF layer. It has been shown that this kind of parallel structure can extract more features from longer text sequences.
(6) The BiLSTM-CNN-MHA-CRF (BCMC) model adds an attention mechanism on top of the BiLSTM-CNN-CRF model to obtain the global features of the text sequence and the strength of the links between characters.
The recognition performance for the fault phenomenon label "Phe" under each model is shown in Table 5. The RBCMC multilayer model described in this study is highly effective at identifying the labels "B-Phe", "M-Phe" and "E-Phe", achieving F1 values of 88.73%, 92.41% and 90.28%, respectively. Compared with the BiLSTM-CRF model, the precision is improved by 0.16%, 9.13% and 8.91%, respectively; the recognition of the "B-Phe" label remains relatively ordinary because the first character of an entity is largely uncertain. The recognition performance for the fault position label "Pos" under each model is shown in Table 6. The recognition of this label is relatively ordinary for every model, and the F1 values of the proposed RBCMC multilayer model for the labels "B-Pos", "M-Pos" and "E-Pos" reach only 83.58%, 85.71% and 81.71%, respectively. Fault positions involve many uncertainties, leading to complex and diverse linguistic expressions.
The recognition performance for the fault reason label "Rea" under each model is shown in Table 7, and it is generally unsatisfactory. The RBCMC multilayer model proposed in this study achieves the best performance, with F1 values of only 80.88%, 84.44% and 82.5% for "B-Rea", "M-Rea" and "E-Rea", respectively. The recognition performance for the repair measure label "Mea" under each model is shown in Table 8. The performance on the various repair measure labels is relatively good, mainly because the label expressions are relatively uniform and the entity boundaries are relatively clear. The F1 values of "B-Mea", "M-Mea" and "E-Mea" are improved by 9.56%, 3.91% and 6.71% over the BiLSTM-CRF model, which indicates that the CNN and MHA play a major role in extracting text features. The recognition performance for the repair outcome label "Out" under each model is shown in Table 9. As the best-performing category among the five entity types, this label has a relatively uniform expression, mainly phrases such as "Fault disappears, equipment back to normal" and "Equipment normal, write-offs restored". The recognition performance for the other nonentity label O under each model is shown in Table 10. Since O is the most frequent of the 16 labels, it scores well under every model, and the RBCMC multilayer model proposed in this paper recognizes it best, with precision, recall and F1 reaching 97.23%, 87.95% and 92.37%, respectively.

Table 11 compares the entity recognition performance of different downstream models. From Table 11, the deep learning models outperform the traditional machine learning models in the NER test by automatically extracting the relevant features from the text. The BiLSTM-CRF model outperforms the BiLSTM model because the CRF effectively captures label dependencies and generates entity labels with constrained relationships. The comparison between BiLSTM-CRF and BiLSTM-CNN-CRF demonstrates that entity recognition performance is enhanced by extracting additional text features through the parallel operation of the BiLSTM and the CNN. Compared with BiLSTM-CNN-CRF, BiLSTM-CNN-MHA-CRF increases the three indexes by 3.49%, 2.09% and 2.81%, respectively, indicating that the MHA has obvious advantages for text feature extraction and combines features from different perspectives to enhance the model's representation. The RBCMC model further improves precision, recall and F1 over BiLSTM-CNN-MHA-CRF by 2.4%, 0.6% and 1.5%, respectively. Taken together, the RBCMC multilayer model proposed in this paper achieves the highest evaluation indexes in the task of identifying fault information entities.
To verify the effectiveness of the RoBERTa-wwm pretrained model for this fault information recognition task, standard pretrained models are compared with the RoBERTa-wwm model used in this paper; the results are shown in Table 12. From the table, the precision, recall and F1 of ERNIE, BERT, Chinese-BERT-wwm and RoBERTa-wwm are all above 80%, which shows that BERT-series pretrained language models perform well for entity recognition on this paper's dataset, with the RoBERTa-wwm model achieving the highest values on all three indexes. The difference between Chinese-BERT-wwm and BERT is only about 1% on each index, and RoBERTa-wwm improves on Chinese-BERT-wwm by about 1% on all three evaluation indexes.

Case study
To provide an initial application of the model proposed in this paper, a railway signal equipment fault information entity recognition system is constructed as the basis for a future railway signal equipment fault knowledge graph. The system can recognize fault texts beyond the test set of this paper, and a recognition test is carried out with a railway fault text as an example. The system recognition results are shown in Figure 7.

Figure 1. Overall structure of the model in this paper.

Figure 4. The principle of CNN.

Figure 6. The structure of the Conditional Random Field.

Table 1. Definitions of entity types.

Table 3. Effect of different attention heads on model metrics in the MHA layer.

Table 4. Recognition effect of the five entity labels under the RBCMC multilayer model.

Table 5. Effectiveness of different NER models in recognizing the fault phenomenon.

Table 6. Effectiveness of different NER models in recognizing the fault position.

Table 7. Effectiveness of different NER models in recognizing the fault reason.

Table 8. Effectiveness of different NER models in recognizing the repair measure.

Table 9. Effectiveness of different NER models in recognizing the repair outcome.

Table 10. Effectiveness of different NER models in recognizing the other nonentity labels.

Table 11. Performance comparison of different downstream models.

Table 12. Performance comparison of different pretrained language models.