Research on Entity Recognition in Aerospace Engine Fields Based on Conditional Random Fields

Entity recognition is an important basic tool for many natural language processing tasks such as information extraction, question answering, and syntactic analysis. Entity recognition is divided into general-field and specific-field recognition, and different fields call for different recognition methods. At present, the common entity recognition technologies are rule-based methods and statistical methods. Because data sets are scarce in the aerospace engine field, there is a lack of corresponding entity recognition research. This paper proposes a statistical entity recognition method based on conditional random fields to identify entities in this specific field. The method constructs a training set for the aerospace engine field through manual annotation, extracts entities using words, word frequency, and part-of-speech features, and evaluates them experimentally. The method identifies entities in the aerospace engine field more accurately, reaching a precision, recall, and F-value of 92.7%, 87.1%, and 89.8%, respectively.

Chinese entity recognition is more difficult than English entity recognition. Chinese entities have no obvious markers and are affected by word segmentation errors, whereas English does not require word segmentation. In addition, because entities in specific fields have many types and terms, boundary recognition is harder, which in turn increases the difficulty of entity recognition. At the same time, named entity recognition based on machine learning relies heavily on annotated data, and the lack of annotated training data in specific domains makes domain entity recognition even more difficult.
The complexity of domain entity recognition is mainly manifested in the following aspects:
(1) Domain entity recognition lacks annotated data sets.
(2) Because entity types are complex and diverse, domain words are not combined according to fixed rules, which makes domain words difficult to recognize.
(3) The composition of an entity is relatively complex, and its length is not limited. Nested entities, entity abbreviations, and mixed Chinese and English text also occur, all of which make entity recognition harder.
(4) The boundary between domain entities and general entities is not obvious. Domain entities include general entities, which makes labeling difficult.
The main technical methods of domain entity recognition are the same as those of general entity recognition. They are mainly divided into rule- and dictionary-based methods and statistics-based methods.
Rule-based entity recognition methods mostly use linguistic knowledge and the lexical and syntactic characteristics of entities in the professional domain to construct rule templates, and then match the text of the corpus against these templates. The rule- and dictionary-based method is the earliest entity recognition method. As long as suitable templates exist, it is relatively simple to implement. However, it depends on the quality of the rule templates, its portability is poor, and its cost is high.
Statistics-based methods mainly include the Hidden Markov Model (HMM), Maximum Entropy (ME) [2], Support Vector Machine (SVM), Conditional Random Fields (CRFs) [3], and other methods based on statistics and machine learning. Statistics-based methods do not require rule templates or the participation of experts, and they are more flexible. Compared with HMM, CRFs can achieve global optimization. Compared with ME, CRFs solve the problem of label bias. Despite these advantages, CRFs have a high training cost and high complexity. Therefore, it is necessary to select appropriate features to improve recognition performance and to reduce or avoid the use of inefficient features [4].
Because data sets are lacking in the engine field, domain entity recognition there faces the same difficulties. Therefore, this paper uses a conditional random fields model, a statistical method, and the CRF++ 0.58 toolkit to recognize entities in the aerospace engine field. Features such as words and word frequency are used for training, so that entities in aerospace engine data are recognized with a high accuracy rate.

Conditional Random Fields
Conditional random fields (CRFs) were proposed by Lafferty et al. [5] in 2001. A CRF is an undirected probabilistic graphical model that combines the characteristics of the hidden Markov model and the maximum entropy model. CRFs are a special case of Markov random fields and avoid the label bias problem of maximum entropy Markov models. In addition, contextual features can be taken into account, which allows better feature selection. A CRF models the conditional probability distribution of a set of output random variables given a set of input random variables. A general CRFs model is shown in Figure 1. CRFs involve three basic problems:
(1) Probability problem: Given the model parameters, the observed values, and the state values, calculate the conditional probability.
(2) Prediction problem: Given the model parameters and the observations, calculate the state sequence with the maximum conditional probability.
(3) Training problem: Given training data, train the model parameters with the goal of maximizing the conditional probability.
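The prediction problem in (2) is usually solved with the Viterbi algorithm. The following is a minimal sketch under stated assumptions, not the CRF++ implementation: the score tables `emit` and `trans` are hypothetical stand-ins for the weighted feature sums of a trained linear-chain model, and the function name `viterbi` is chosen here for illustration.

```python
from typing import Dict, List, Tuple


def viterbi(emit: List[Dict[str, float]],
            trans: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the label sequence with the highest total unnormalized score.

    emit[t][y]      : score of label y at position t (hypothetical values)
    trans[(y0, y1)] : score of moving from label y0 to label y1
    Missing entries are treated as a score of 0.0.
    """
    if not emit:
        return []
    # best[t][y] = best score of any path that ends in label y at position t
    best = [{y: emit[0].get(y, 0.0) for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emit)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + trans.get((p, y), 0.0))
            scores[y] = (best[-1][prev] + trans.get((prev, y), 0.0)
                         + emit[t].get(y, 0.0))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label to recover the whole path.
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Because the normalization factor does not depend on the label sequence, maximizing the unnormalized score is enough to find the most probable sequence.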
CRFs can be applied in many fields. In the field of image processing, Yang Yan et al. [6] proposed an image classification and recognition method based on CRFs, which achieved accurate classification and recognition of images. In the field of natural language processing, Che J et al. [7] proposed a Chinese word segmentation method based on CRFs, which achieved automatic segmentation of Chinese text. Kaixin Liu et al. [8] proposed a CRFs-based named entity recognition method for symptoms in Traditional Chinese Medicine (TCM) clinical cases, which recognized complex TCM clinical symptoms as entities. C. Janarish Saju et al. [9] proposed a CRFs-based entity recognition method for bank big data, which extended entity recognition to a new field.
This paper proposes entity recognition in the engine field based on CRFs in order to recognize specialized terms in that field. The task corresponds mainly to the prediction problem of CRFs. In other words, given a linear-chain conditional random field P(y|x) and an input observation sequence x, we compute the output sequence y with the highest conditional probability, which yields the annotation of the observation sequence. The conditional probability has the form of formulas (1) and (2). According to formula (3), the prediction problem of CRFs is the problem of maximizing the unnormalized probability to obtain the optimal label sequence. The maximum unnormalized probability can be computed with formulas (4) and (5).

$$y^{*} = \arg\max_{y} \left( w \cdot F(y, x) \right) \tag{4}$$

Among them:

$$F(y, x) = \left( f_1(y, x), f_2(y, x), \ldots, f_K(y, x) \right) \tag{5}$$

Here w is the weight vector and F(y, x) is the vector of the K feature functions evaluated over the whole sequence.

Because CRFs are widely used, many CRF tools have appeared. The experiments in this article use the CRF++ toolkit, a well-known open-source CRF tool written in C++. Common versions include CRF++ 0.53 and CRF++ 0.58; this article uses CRF++ 0.58 to identify entities in the engine field. Its most important function is the use of feature templates: CRF++ generates a set of feature functions automatically from the template, so instead of writing feature functions by hand, we only need to choose features, such as part of speech.
To train and test on the data, choose appropriate parameters and design a feature template according to the task. There are four main parameters that can be adjusted:
-a CRF-L2 or CRF-L1 This parameter selects the regularization algorithm. The default is CRF-L2. Generally speaking, L2 performs slightly better than L1, although the number of non-zero features with L1 is substantially smaller than with L2.
-c float This parameter sets the hyper-parameter of CRFs. The larger the value of C, the more closely the CRF fits the training data. This parameter adjusts the balance between overfitting and underfitting. A good value can be found by cross-validation or similar methods.
-f NUM This parameter sets the cut-off threshold for features. CRF++ uses only features that appear at least NUM times in the training data. The default value is 1. This option is useful when applying CRF++ to large-scale data, where there may be millions of features that occur only once.
-p NUM If the computer has multiple CPUs, training can be sped up through multithreading. NUM is the number of threads.
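As an illustration of how these four options fit together, the sketch below builds a `crf_learn` command line in Python. The file names (`template`, `train.data`, `model`) and the chosen option values are assumptions for the example; the paper does not report the values it used.

```python
import shutil
import subprocess
from typing import List


def build_crf_learn_cmd(template: str, train: str, model: str,
                        algorithm: str = "CRF-L2", c: float = 1.0,
                        freq: int = 1, threads: int = 1) -> List[str]:
    """Build the argument list for CRF++'s crf_learn.

    The option values are placeholders, not the settings used in the paper.
    """
    return ["crf_learn",
            "-a", algorithm,     # regularization: CRF-L2 or CRF-L1
            "-c", str(c),        # trade-off between over- and underfitting
            "-f", str(freq),     # keep features seen at least this often
            "-p", str(threads),  # number of training threads
            template, train, model]


def run_training(cmd: List[str]) -> None:
    """Run crf_learn if it is installed; otherwise raise a clear error."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("crf_learn was not found on PATH")
    subprocess.run(cmd, check=True)
```

For example, `build_crf_learn_cmd("template", "train.data", "model", c=1.5, threads=4)` yields the arguments for a four-thread L2 run; passing the result to `run_training` starts the actual training.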

Entities in the aerospace engine field
Using CRFs to identify entities in the engine field is a form of supervised learning, which requires labeled training data of a certain scale. However, the engine field lacks a domain corpus, so we need to build an engine training corpus of sufficient size. In this article, the initial text is segmented and tagged with parts of speech using the Jieba word segmentation tool; the training data set is then annotated manually, and the Jieba segmentation results are corrected. This article uses the book "Encyclopedia of World Missiles and Aerospace Engines" to study and analyze the engine-related entities it contains. The resulting data set is 1.74 MB, of which 70% is used as the training corpus and 30% as the test corpus.
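The 70%/30% division can be sketched as follows. The paper does not state whether the split was made sequentially or at random, so this example assumes a simple sequential split over sentences; the function name `split_corpus` is hypothetical.

```python
from typing import List, Sequence, Tuple


def split_corpus(sentences: Sequence[str],
                 train_ratio: float = 0.7) -> Tuple[List[str], List[str]]:
    """Split sentences into a training part and a test part.

    Assumption: the first train_ratio of the sentences form the training
    corpus and the remainder the test corpus (no shuffling).
    """
    cut = round(len(sentences) * train_ratio)
    return list(sentences[:cut]), list(sentences[cut:])
```

Splitting on sentence boundaries keeps each sentence intact, which matters because a CRF labels whole sequences rather than isolated words.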
We formulate labeling standards for the aerospace engine training data, and then obtain the training data by manual labeling on top of the Jieba word segmentation. The labeling standards are as follows:
(1) Words that directly describe or indicate an engine, such as "rocket engine" and "aspirated engine", are labeled.
(2) Words representing the structure of aerospace engine systems and related aerospace terms, such as "space vehicle", "propulsion system", and "check valve", are labeled. Although these words do not denote aerospace engines, they are aerospace-related vocabulary and are marked as entities in the training set of this article.
(3) Entities should be identified as accurately and completely as possible, for example "third-level engine".
(4) Aerospace engine vocabulary composed of a mixture of Chinese and English is labeled as a single entity, for example "YF-40 Liquid Rocket Engine".
(5) For aerospace engine vocabulary that includes a country name, the country and the engine words are labeled together as a single entity, for example "China Long March 4A (LM-4A) carrier rocket".
Analysis of the aerospace engine data shows that aerospace engine entities have the following characteristics:
(1) The entity types are highly specialized, and entities in the aerospace engine field rarely appear in other fields.
(2) The boundaries of related entities are not clearly defined, so the scope of an entity is easily blurred during labeling.
(3) The length of an entity is uncertain, ranging from two characters to more than a dozen characters. The parts of speech of the component words follow no fixed rule, Chinese and English may be mixed, and no unified characteristics can be found from the context.
(4) Term nesting occurs in aerospace engine entities.
(5) Entities in the aerospace engine field are sparse in the data, and many aerospace entities appear only once or twice in the text.

Aerospace engine entity recognition based on CRF++
Entity recognition of aerospace engines based on CRFs is carried out experimentally with the CRF++ toolkit. The main process of using CRF++ for entity recognition in the aerospace engine field is:
(1) Collect the experimental corpus and perform preprocessing operations such as word segmentation and part-of-speech tagging on the corpus.
(2) Select appropriate parameters and features to train the model.
As shown in Table 1, each text contains multiple tokens, and each line represents one token. A token consists of multiple columns of data. The first column is the word itself, the last column is the word label, and the columns in between are the selected features.
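The column layout described above can be produced with a few lines of Python. This is a sketch under assumptions: the example tokens, their part-of-speech tags, and the function names are illustrative and not taken from the paper's data. CRF++ expects whitespace-separated columns and an empty line between sentences.

```python
from typing import Iterable, List, Tuple

# (word, part of speech, label); the values used below are made-up examples.
Token = Tuple[str, str, str]


def to_crfpp_rows(sentence: Iterable[Token]) -> List[str]:
    """Format one sentence as CRF++ rows: word, POS, word length, label."""
    rows = []
    for word, pos, label in sentence:
        rows.append("\t".join([word, pos, str(len(word)), label]))
    return rows


def to_crfpp_text(sentences: Iterable[Iterable[Token]]) -> str:
    """Join sentences with an empty line between them, as CRF++ expects."""
    blocks = ["\n".join(to_crfpp_rows(s)) for s in sentences]
    return "\n\n".join(blocks) + "\n"
```

Every row must have the same number of columns, with the label last, so that the feature template can address columns by index.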
The specific process of aerospace engine entity recognition is shown in Figure 2.

Preprocessing
Preprocessing the data after collecting the experimental corpus is the foundation of model building. The plain text of the book "Encyclopedia of World Missiles and Aerospace Engines" is extracted, and spaces and empty lines are removed as preliminary preprocessing. The text is then segmented with the Jieba word segmentation tool, which yields the word itself, its part of speech, and its word length. After these basic features are obtained, the words are manually labeled with their positions within entities (lexeme tagging). Lexeme tagging captures the boundary information of each word. The commonly used lexeme tagging methods are the 2-tag, 4-tag, and 6-tag methods.
The tag set of 2-tag is {B,E}: B represents the first word of an entity, and E represents the middle or end of the entity. B and E together form an entity. 4-tag is more complex than 2-tag, but its recognition results are more accurate. The tag set of 4-tag is {B,M,E,S}: B represents the first word of the entity, M represents the middle part of the entity, E represents the tail of the entity, and S represents a single-word entity. 6-tag provides more detailed entity tagging than 4-tag. The label set of 6-tag is {B,M,E,W,O}: B represents the first word of the entity, M represents the middle part of the entity, E represents the tail of the entity, W represents a single-word entity in the field (in the aerospace engine field, a single engine word that forms an entity by itself), and O represents the other components of the sentence that are not part of an engine entity. A more complex annotation set slows down the training of the CRFs model, but it can also improve the entity recognition results. This paper uses the 6-tag label set to label entities. The definition of the 6-tag label set is shown in Table 2.
As shown in Table 1, the first column is the word itself after Jieba word segmentation, the second column is the part of speech of the word, the third column is the word length of the current word, and the fourth column is the lexeme label of the word. The three features selected in this article are as follows:
(1) The word itself. Our task is to obtain entities in the field of aerospace engines. Most entity words in the engine field are used only in this field, so the word itself carries the most information about the entity and is the most important feature.
(2) Part of speech. Chinese word formation follows certain patterns, so rules can be found from the way words and phrases are composed. Observation of entities in the field of aerospace engines shows that most entities have part-of-speech combinations such as "n+n" and "v+n", so part of speech is also a very important feature.
(3) Word length. Engine terms generally consist of more than one word. The word length can be used to help determine whether a word is part of an entity.
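The {B, M, E, W, O} labeling described in this section can be sketched as a conversion from entity spans to per-word labels. The span representation (start index inclusive, end index exclusive, counted in segmented words) and the function name `tag_tokens` are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def tag_tokens(n_tokens: int,
               entity_spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Assign B/M/E/W/O labels to a sentence of n_tokens words.

    Each span is (start, end) with end exclusive. A one-word entity gets W;
    a longer entity gets B, then M for inner words, then E. Everything
    outside an entity stays O.
    """
    labels = ["O"] * n_tokens
    for start, end in entity_spans:
        if end - start == 1:
            labels[start] = "W"
        else:
            labels[start] = "B"
            for i in range(start + 1, end - 1):
                labels[i] = "M"
            labels[end - 1] = "E"
    return labels
```

For instance, a six-word sentence with a three-word entity at the start and a single-word entity at index 4 is labeled B, M, E, O, W, O.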

Model establishment and prediction
Training a CRFs model requires a feature template, which provides a unified pattern for the feature functions in the model. The relevant feature functions are extracted from the training text according to the template, and training then yields the weight of each feature function. The definition of the template directly determines the recognition performance of the final model.
Commonly used feature templates are divided into unigram templates and bigram templates. The unigram template section starts with "# Unigram" and the bigram template section starts with "# Bigram". Compared with the unigram feature template, the bigram feature template can improve the recognition performance of the model, but when the number of features is relatively large, it greatly increases the training time of the CRFs model. Therefore, this article uses unigram feature templates for aerospace engine entity recognition.
A feature template line contains macros of the form %x[row, col] and begins with an identifier whose first letter is U or B, corresponding to the two template types. The numbers in square brackets locate the feature source: row is the row offset relative to the current position, where 0 is the current row, and col is the column index in the training file. The first column (number 0) is the word itself, and the part-of-speech and word-length features occupy the following columns. For aerospace engine entity recognition, when the selected features are the word itself, part of speech, and word length, the resulting feature template is shown in Figure 3. U00 to U09 are the feature templates for "word itself", U10 to U14 are the feature templates for "part of speech", and U15 to U17 are the feature templates for "word length".
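To make the %x[row, col] notation concrete, the sketch below expands template lines against a small token table. The template lines and tokens are hypothetical examples in CRF++ syntax, not the contents of Figure 3, and the `_PAD` placeholder for positions outside the sentence is a simplification chosen for this example.

```python
import re
from typing import List

# Matches macros such as %x[-1,0] or %x[0,2].
_MACRO = re.compile(r"%x\[\s*([+-]?\d+)\s*,\s*(\d+)\s*\]")


def expand(template: str, rows: List[List[str]], pos: int) -> str:
    """Replace every %x[row,col] macro with the value it points to.

    rows[i][j] is column j of token i; pos is the current token index.
    """
    def lookup(match: "re.Match[str]") -> str:
        r = pos + int(match.group(1))
        c = int(match.group(2))
        if 0 <= r < len(rows):
            return rows[r][c]
        return "_PAD"  # assumption: simplified out-of-range marker
    return _MACRO.sub(lookup, template)


# Hypothetical template lines: word (col 0), POS (col 1), length (col 2).
EXAMPLE_TEMPLATE = [
    "U00:%x[-1,0]",
    "U01:%x[0,0]",
    "U02:%x[-1,0]/%x[0,0]",
    "U10:%x[0,1]",
    "U15:%x[0,2]",
]
```

Each expanded string, combined with an output label, corresponds to one feature function whose weight is learned during training.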
After the feature template is specified, the CRF++ toolkit can be used to train the model on the training set and to test it.
Model training has an input and an output. The input is the preprocessed data set, and the output is a model composed of feature functions and their parameters.
Model testing uses the trained model. The test data is processed into a format that the model can read as input, and the output is the recognized aerospace engine entities.

Experimental data
In the field of Chinese word segmentation, the Second International Chinese Word Segmentation Bakeoff organized by SIGHAN is the most authoritative evaluation. This article uses the data set provided by Microsoft Research Asia for the SIGHAN Bakeoff 2005 as the general-field data set. It consists entirely of "People's Daily" content from January 1998, with a total of 1.7 million words, and is news text with a wide range of sentences and a standardized data format. The number of labeled entities in the training data is shown in Table 3. As shown in Table 3, the numbers of entities and non-entities in the People's Daily data are 1,255,055 and 1,113,336 respectively. This article proposes an entity recognition algorithm for the aerospace engine field, but the Bakeoff 2005 data set lacks aerospace engine data. Therefore, this experiment uses aerospace engine data from the "Encyclopedia of World Missiles and Aerospace Engines". The data set is 1.74 MB, of which 70% is used as the training corpus and 30% as the test corpus. The number of labeled entities in the training data is shown in Table 4. As shown in Table 4, the aerospace engine field entities comprise 42,017 words, counting both words within multi-word entities and single-word entities, and there are 161,380 words in total from the general field.


Evaluation Criteria
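The performance of a method is mainly measured by how accurately it recognizes entities. The commonly used indicators are precision (P), recall (R), and the F1 value. They are calculated as follows:

$$P = \frac{\text{number of correctly recognized entities}}{\text{number of recognized entities}} \times 100\%$$

$$R = \frac{\text{number of correctly recognized entities}}{\text{number of entities in the test set}} \times 100\%$$

$$F_1 = \frac{2 \times P \times R}{P + R}$$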
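A minimal sketch of how P, R, and F1 are obtained from entity counts, assuming the standard definitions (correct entities over recognized entities, correct entities over reference entities, and their harmonic mean). The function names are chosen for this example.

```python
from typing import Tuple


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def precision_recall_f1(n_correct: int, n_predicted: int,
                        n_gold: int) -> Tuple[float, float, float]:
    """Compute P, R and F1 from entity counts.

    n_correct   : recognized entities that match the reference annotation
    n_predicted : all entities output by the model
    n_gold      : all entities in the reference annotation
    """
    p = n_correct / n_predicted if n_predicted else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    return p, r, f1_score(p, r)
```

As a consistency check, `f1_score(0.927, 0.871)` gives about 0.898, which agrees with the F value of 89.8% reported for P = 92.7% and R = 87.1%.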

Result analysis
This article first compares the entity recognition performance of the CRF++ tool in the general field and in the aerospace engine field. The results are shown in Table 5. Table 5 shows that the recognition results of CRF++ in the general field are significantly better than those in the aerospace engine field. This is mainly due to the relatively complex entities in the aerospace engine field, the smaller training data set, and manual annotation errors.
The results of different entity recognition approaches on the aerospace engine data are shown in Table 6. The general-domain model was obtained by training the CRFs tool on the People's Daily data set and was then applied to the aerospace engine test set. Its entity recognition result is poor, because the entity types in the general domain are inconsistent with those in the aerospace engine domain. Using Jieba word segmentation performs better than the general-domain model, but the recognition result is still not ideal. Finally, the CRFs tool was trained on the aerospace engine data with the three features, and the resulting model was tested. It obtained a precision of 92.7%, which is higher than that of the other two methods.
The results of entity recognition with different features on the aerospace engine data are shown in Table 7. Table 7 shows that the model trained with all three features recognizes entities relatively accurately. Adding part-of-speech features to the word itself has a larger impact on the results, while word length has a small impact, although it also improves accuracy slightly.

Experimental analysis
Analysis of the experimental results shows that the errors are mainly concentrated in the following aspects:
(1) The recognized entity is incomplete. For example, "CIAOTRJ verification engine" is identified as "CIAOTRJ" / "verification" / "engine".
(2) The recognized entity contains more than the correct entity. Apart from cases of incorrect word segmentation, there are cases where, for example, "fuel supply pipeline" is recognized as "regulated fuel supply pipeline".
(3) Because there is no uniform standard, there are some ambiguities in the labeling. For example, according to the labeling rules, "J33-A-18A turbojet engine" is an entity, but the recognition result is "turbojet engine".
(4) Some words are not considered entities in the aerospace engine field, but they may still be recognized because of their own characteristics or because of similar context and terminology, such as "regulation system".
(5) Some recognition results are incorrect because of unavoidable errors in manual labeling.
Although errors remain, the overall recognition performance is relatively good.
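Boundary errors of types (1) and (2) can be separated automatically by comparing predicted spans with reference spans. The sketch below is an illustration only: the span representation (start inclusive, end exclusive, in words), the category names, and the function name are assumptions and were not part of the paper's analysis.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) with end exclusive


def classify_prediction(pred: Span, gold: List[Span]) -> str:
    """Label one predicted entity span relative to the reference spans."""
    if pred in gold:
        return "exact"
    for g in gold:
        if g[0] <= pred[0] and pred[1] <= g[1]:
            return "incomplete"     # only a fragment of a reference entity
        if pred[0] <= g[0] and g[1] <= pred[1]:
            return "over-extended"  # reference entity plus extra words
    for g in gold:
        if pred[0] < g[1] and g[0] < pred[1]:
            return "overlap"        # partial overlap on one side
    return "spurious"               # no reference entity at this position
```

Counting these categories over the test set shows which kind of boundary error dominates.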

Conclusion
This paper proposes an entity recognition method for the field of aerospace engines based on CRFs. It uses the well-known open-source tool CRF++ and a manually labeled data set, and constructs a model based on the word itself, part of speech, and word length. Comparative experiments demonstrate the effectiveness of CRF++ and the necessity of the three features. From the experimental point of view, this paper still has certain shortcomings. Because aerospace engine data is sparse, the differences between the experimental results of the different methods are relatively large. Errors made while manually labeling the training set also have a considerable impact on the results. In future work, we will consider how to obtain better results with less training data and how to reduce errors in manual data annotation.