Named Entity Recognition of Traditional Chinese Medicine Patents Based on BiLSTM-CRF

With the growing popularity of traditional Chinese medicine (TCM) around the world and the increasing awareness of intellectual property protection, the number of TCM patent applications is growing year by year. TCM patents contain rich medical, legal, and economic information. Effective text mining of TCM patents is of great theoretical and practical significance (e.g., for the R&D of new medicines, patent infringement litigation, and patent acquisition). Named entity recognition (NER) is a fundamental task in natural language processing and a crucial step before in-depth analysis of TCM patents. In this paper, a method combining a Bidirectional Long Short-Term Memory neural network with a Conditional Random Field (BiLSTM-CRF) is proposed to automatically recognize entities of interest (i.e., herb names, disease names, symptoms, and therapeutic effects) from the abstract texts of TCM patents. By virtue of the capabilities of deep learning methods, the semantic information in the context can be learned without feature engineering. Experiments show that the BiLSTM-CRF-based method outperforms the baseline methods it is compared with (HMM, LSTM, and BiLSTM).


Introduction
TCM has a long history and has been passed down for thousands of years. It is becoming increasingly popular all over the world for its mild medicinal properties and impressive therapeutic effects, especially for certain chronic and intractable diseases. On October 1, 2018, the World Health Organization brought TCM into the medical compendium with global influence, and, remarkably, it has made a significant contribution to the world's fight against the COVID-19 pandemic so far.
Patents are carriers of the most advanced technologies and methods and contain rich technical, economic, and legal information. With the increasing awareness of intellectual property protection, the number of patent applications is growing year by year. At present, patent texts, in parallel with scientific literature, have become an important source of knowledge in the era of the Internet, during which emerging technologies such as Web services, cloud computing, the Internet of Things, and wireless sensor networks have contributed to the protection, development, and promotion of TCM [1][2][3][4][5]. Patents should be analyzed as thoroughly as possible so that the hidden valuable information can be extracted and utilized in the R&D of new medicines, patent infringement litigation, and patent acquisition. Since the R&D of TCM is a time-consuming and laborious process, fully analyzing the information in TCM patents would largely avoid repeated medicine research, shorten the R&D cycle, and reduce R&D costs. In this sense, the analysis of TCM patents is becoming a hot research topic. Before analyzing TCM patents, an essential step is to extract important named entities (e.g., herb names, disease names, symptoms, and therapeutic effects) from the TCM patent texts. These extracted entities can serve not only as the objects of semantic extension in intelligent patent retrieval but also as the input for calculating patent text similarity.
Nowadays, the identification of these important named entities in TCM patent texts faces the following challenges: (1) Due to the long history and wide geographical distribution of herbs, factors such as dialect and erroneous records have made aliases rather common. One herb may have multiple names, and different herbs may share the same name. For example, "ganoderma lucidum" has synonyms such as "ChiZhi," "HongZhi," "MuLingZhi," "JunLingZhi," "WanNianXun," and "LingZhiCao," and both "black sesame" and "linseed" are known as "flax." Currently, there is no complete Chinese herb thesaurus. (2) Due to different writing habits, TCM doctors and researchers tend to express a disease name in language similar to classical Chinese, such as using "ShouHan" or "ShouLiang" to refer to "catch a cold." (3) Due to the lack of labeled training sets of TCM patent texts, a proper training set has to be established manually when applying machine learning methods to NER.
At present, there are few studies on TCM patent mining, particularly on the NER of TCM patents. High-quality labeling of TCM patents still relies on manual work to a large extent. Therefore, we intend to design a model to fill the technical gap in the NER of TCM patents. The main contributions of this paper are summarized as follows: (1) We apply BiLSTM and CRF to the NER of TCM patents, which can comprehensively utilize the characteristics of the contexts. (2) The possible entity boundary problem in the task of NER is discussed, and practical solutions to the problem are proposed.
The rest of the paper is organized as follows: Firstly, related work in the field of NER is presented, and then the models of CRF and BiLSTM are introduced, respectively. After that, the method proposed in this paper is introduced in detail, including the structure of the neural network and all key steps. Finally, the experiment and the analysis of experimental results are presented.

Related Work
Named entity recognition (NER) is an important task in natural language processing that aims to identify key entities in the text. It paves the way for applications such as information retrieval, relation extraction, machine translation, and question-answering systems. Usually, the performance of these applications depends directly on the accuracy of NER. To date, the research methods of NER basically fall into three categories, namely, dictionary-based, rule-based, and machine learning-based (deep learning included). Dictionary-based methods, widely used in the early stage, work by matching text against a dictionary (a list of entity names). This kind of method is simple, but the accuracy of recognition depends heavily on the accuracy of word segmentation, and it often fails on words that are not in the dictionary. In rule-based methods, extraction rules are summed up by experts after observing the characteristics of the text, and entities are extracted using regular expressions. The downside is that the accuracy of this method depends on the quality of the rules, and its portability is poor. At present, the most common methods are based on machine learning or deep learning, which learn the characteristics of entities automatically from the training set and have exhibited high generalization capability.
Lei et al. [6] investigated the effects of different types of features based on the CRF classifier and machine learning algorithms for NER in Chinese clinical text. The result showed that word segmentation based on a Chinese medical dictionary played a positive role.
Rahman et al. [7] designed a feature-oriented method based on CRF to identify diseases in biomedical literature. Although its results needed further improvement, the sentence-level and token-level features it introduced contributed to the overall performance and provided inspiration for our research.
In our previous research [8], the cotraining algorithm in weakly supervised machine learning was improved to recognize herb names in TCM patent texts. The proposed method does not require a large number of training samples and makes full use of the characteristics of the texts, including distribution and word-formation features. The deficiency of this method is that each iteration requires manual participation, and high-quality candidate words have to be selected as the seeds for the next iteration. The recall rate also needs to be improved.
In recent years, deep learning has been developing rapidly since its emergence in the field of image recognition. Different from traditional machine learning methods, deep learning is characterized by automatic and effective feature extraction. Moreover, with the emergence of the Word2Vec method, deep learning-based methods have been applied to a variety of natural language processing tasks, such as NER, entity relationship extraction, part of speech analysis, and sentiment analysis.
Wei et al. [9] proposed an approach that combines CRF and Bi-RNN with SVM and outperformed the baseline by a large margin. The major factor that hindered the performance lay in redundant or missing modifiers of the entity to be recognized. Likewise, the neural network architecture presented in Chen L's work [10] achieved good results on some corpora and outperformed the previously top-performing systems; yet, boundary errors of word segmentation caused some performance loss. Ye et al. [11] examined the challenge posed by the lack of explicit word boundaries in Chinese text and proposed an NER model based on character-word vector fusion. The presented model was shown to reduce the dependence on the accuracy of the word segmentation algorithm and to make good use of the semantic characteristics of words. To address the problem of failing to correctly identify the boundary of an entity, we ensure the integrity and accuracy of the entity by bidirectional storage of text information and the introduction of the BIO labeling method.
Based on experimental analysis, Wen et al. [12] pointed out that, compared with traditional models, deep learning-based methods can better capture the context information and thus save the laborious work of feature selection. Furthermore, adding a CRF layer on top of the neural network can bring better results. In a sentiment analysis experiment, Chen et al. [13] found that sentence classification can improve the performance of sentence-level sentiment analysis. These findings greatly benefit our research, especially in the design of the proposed model structure.
Wireless Communications and Mobile Computing
As a special recurrent neural network (RNN) in deep learning, LSTM avoids the vanishing gradient problem of RNN [14] and is especially suitable for dealing with character sequences with dependencies, as it can efficiently utilize the characteristics of the context. CRF ensures the validity of the predicted labels by applying rules it learns from the training dataset, so we argue that combining LSTM with CRF can be expected to yield promising recognition results.

Conditional Random Field
Conditional random field, proposed by Lafferty et al. in 2001, is an undirected graph model and a statistical model that works well in labeling and segmenting structured data [15]. It normalizes the sequence probability globally and can label sequences freely through feature functions, which avoids the conditional independence assumption and solves the label bias problem [16]. The conditional random field model describes the output random variable $Y$ under the condition of a given random variable $X$ and constructs a conditional probability model $P(Y \mid X)$ that satisfies the Markov property, as shown in Equation (1):

$$P(Y_v \mid X, Y_w, w \neq v) = P(Y_v \mid X, Y_w, w \sim v) \tag{1}$$

In Equation (1), $w \neq v$ represents all nodes except node $v$, $w \sim v$ represents all nodes connected with node $v$ in the undirected graph, and $Y_v$ and $Y_w$ are the random variables corresponding to nodes $v$ and $w$.
The linear chain CRF [17] is frequently used in NER tasks. Given an observation sequence $X = \{X_1, X_2, \cdots, X_T\}$ and the corresponding label sequence $Y = \{Y_1, Y_2, \cdots, Y_T\}$, the conditional probability distribution $P(Y \mid X)$ of $Y$ constitutes the conditional random field:

$$P(Y_t \mid X, Y_1, \cdots, Y_{t-1}, Y_{t+1}, \cdots, Y_T) = P(Y_t \mid X, Y_{t-1}, Y_{t+1}) \tag{2}$$

For a linear chain conditional random field, the loss function is built on the conditional probability shown in Equation (3) [18]:

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} w_k f_k(t, Y_t, Y_{t-1}, X) \right) \tag{3}$$

where $X$ represents the input sequence, $Y_t$ represents the label of the current position, $Y_{t-1}$ represents the label of the previous position, $K$ represents the number of feature functions, $T$ represents the size of $X$, $f_k(t, Y_t, Y_{t-1}, X)$ represents the $k$th feature function, $w_k$ represents the weight of the $k$th feature function, and $Z(X)$ represents the normalization factor. Finally, the scores of all candidate labeling sequences $Y'$ are summed up to obtain the normalization factor:

$$Z(X) = \sum_{Y'} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} w_k f_k(t, Y'_t, Y'_{t-1}, X) \right) \tag{4}$$

The conditional random field model uses the forward-backward algorithm to compute the conditional probabilities and feature expectations at different sequence positions, uses maximum likelihood estimation (solved with, for example, the quasi-Newton method) to fit the model parameters, and uses the Viterbi algorithm to find the optimal label sequence by dynamic programming.
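To make the role of the normalization factor concrete, the following is a minimal sketch of the forward recursion for a linear chain CRF. It assumes, for illustration only, that the weighted feature sum has already been collapsed into two hypothetical score tables, `emit` (per position and label) and `trans` (per label pair); these names and the toy numbers are not from the paper. The result is checked against a brute-force sum over all label sequences.

```python
import math
from itertools import product


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emit, trans, labels):
    """Unnormalized log-score of one label sequence (assumed score tables)."""
    score = emit[0][labels[0]]
    for t in range(1, len(labels)):
        score += trans[labels[t - 1]][labels[t]] + emit[t][labels[t]]
    return score


def log_partition(emit, trans):
    """log Z(X) computed with the forward algorithm in O(T * K^2) time."""
    num_labels = len(trans)
    alpha = list(emit[0])
    for t in range(1, len(emit)):
        alpha = [
            log_sum_exp([alpha[p] + trans[p][c] for p in range(num_labels)])
            + emit[t][c]
            for c in range(num_labels)
        ]
    return log_sum_exp(alpha)


# Toy example: T = 3 positions, 2 labels (values are illustrative only).
emit = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.5]]
trans = [[0.6, -0.4], [-1.0, 0.9]]

log_z = log_partition(emit, trans)
all_sequences = list(product(range(2), repeat=3))
brute_force = log_sum_exp([sequence_score(emit, trans, y) for y in all_sequences])
# Conditional probability of one particular label sequence.
prob = math.exp(sequence_score(emit, trans, (0, 1, 1)) - log_z)
```

The forward recursion avoids enumerating all label sequences, which is why it is preferred over the brute-force sum for real sentence lengths.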

Long Short-Term Memory Neural Network
Hochreiter and Schmidhuber [19] proposed long short-term memory (LSTM), which consists of memory units $c_t$, forget gates $f_t$, input gates $i_t$, and output gates $o_t$.
The structure detail of LSTM is shown in Figure 1. In LSTM, there are operations such as addition, multiplication, the tanh function, and the sigmoid function. A line from the input $(x_t, h_{t-1})$ to the output $h_t$ represents the state of the cell. Each gate outputs a number between 0 and 1 through the sigmoid function, where 0 means no passing at all and 1 means permitting to pass completely.
Firstly, what information to discard from the cell state will be decided. The forget gate $f_t$ judges the importance of the past memories, that is, the extent to which the past memories participate in the generation of new memories, as shown in Equation (5).
Next, what new information to store in the cell state will be determined. The input gate $i_t$ determines how important the current word is, that is, to what extent it is helpful in generating new memories, as shown in Equation (6).
At the same time, a tanh function is used to generate a new candidate memory unit $\tilde{c}_t$, as shown in Equation (7).
Next, the cell state $c_t$ is updated to obtain the memory unit of the current moment, as shown in Equation (8).
The final step is to determine the output of the model. The output gate $o_t$ is first used to determine which part of the cell's state will be output, as shown in Equation (9).
Then, the cell state at the current moment is processed through tanh, and these two signals are combined so that only a certain part is output and the final output of the hidden layer is obtained.
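For reference, the steps described above match the standard LSTM formulation, which can be written as below. This is a sketch under stated assumptions: the weight matrices $W_f, W_i, W_c, W_o$, the biases $b_f, b_i, b_c, b_o$, the concatenation $[h_{t-1}, x_t]$, and the element-wise product $\odot$ are notation assumed here, and the exact form used in the numbered equations of the paper may differ.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{forget gate} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{input gate} \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{candidate memory unit} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state update} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{output gate} \\
h_t &= o_t \odot \tanh(c_t) && \text{hidden layer output}
\end{aligned}
```

Because $f_t$, $i_t$, and $o_t$ pass through the sigmoid function $\sigma$, each of their components lies between 0 and 1 and acts as a soft switch on the quantity it multiplies.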
However, this kind of network only considers the influence of the past sequence on the present while ignoring the information that comes later, and therefore fails to achieve ideal performance. A bidirectional long short-term memory neural network (BiLSTM) model is thus introduced, which runs LSTM units in both directions and connects their outputs to capture bidirectional semantic dependencies, thereby improving the performance of the overall model.

The Proposed Model
In TCM patent texts, the description of a disease name usually has hints. For example, in the text sequence "DuiZhiLiaoLinBaJieYanJuYouXianZhuLiaoXiao (It has a significant effect on the treatment of lymphadenitis)", the word "ZhiLiao (treatment)" is often followed by a certain disease name. Therefore, when deciding whether a text contains a disease name or not, starting words play a key role in capturing the strong dependence in the context. On the one hand, CNN tends to capture static information. Given the text sequence "ZhiLiaoXiaoGuoMingXian (the therapeutic effect is notable)", when it receives the word "ZhiLiao (therapeutic)", it will mistakenly assume that the subsequent sequence "XiaoGuoMingXian (the effect is obvious)" is a disease name. On the other hand, although LSTM is good at obtaining long-term and long-range information in a sequence, its major weakness is that it cannot handle the noisy words in the rest of the sequence. For instance, the text sequence "It can safely and effectively treat cough without toxic side effects and addiction" contains a disease name, but there are noisy words both before and after it. To sum up, CNN tends to capture keywords and thus causes misjudgment, and one-way LSTM is not sensitive enough to subsequent interference, leading to misjudgment as well. So, if only CNN or one-direction LSTM is used, the aforementioned interference may not be eliminated. BiLSTM, however, can not only capture the sequential information dynamically but also make use of both the preceding and the subsequent information of the current word to ultimately obtain a strong dependency.
BiLSTM is well aware of the context information in the character sequence, and CRF helps improve the labeling accuracy at the sentence level. Combining the advantages of BiLSTM and CRF, the overall recognition accuracy will be improved. An illustrative structure of the BiLSTM-CRF model is shown in Figure 2.
The procedure of TCM entity recognition based on BiLSTM-CRF will be described in detail later in this paper. Here, the core steps are listed as follows: (1) Each character in a TCM patent text is mapped into a low-dimension dense vector by using a pretrained embedding matrix. (2) The embedding vector of each character is taken as the input of each time step of the BiLSTM Layer, and the hidden states of the forward and backward outputs are spliced to obtain a complete hidden state sequence; that is, the context features of the text are extracted. The probability distribution matrix of the sequence and labels is then calculated. (3) The CRF Layer learns the potential relationship between sequences and excludes the label sequences that do not conform to the sentence-level grammatical rules. Finally, the optimal label sequence is ready for output.
In fact, studies have revealed that character-based NER generally performs better than word-based methods [21]. For this reason, this model takes characters as the initial input so that the accumulation of errors caused by poor segmentation can be mostly avoided. A global dictionary with size $v$ is constructed, that is, a collection of $v$ distinct characters in the training set, in which each character corresponds to an identifier. A sentence containing $n$ characters is one-hot encoded as an $n \times v$-dimension matrix, denoted as $W = (w_1, w_2, \cdots, w_n)$, where $w_i$ represents the vector of the $i$th character of the sentence. A schematic diagram is shown in Figure 3.
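The construction of the character dictionary and the one-hot matrix described above can be sketched as follows. The function names and the two sample sentences are hypothetical and serve only as an illustration; characters unseen in the training set are not handled in this sketch.

```python
def build_dictionary(sentences):
    """Assign an integer identifier to every distinct character."""
    char_to_id = {}
    for sentence in sentences:
        for ch in sentence:
            if ch not in char_to_id:
                char_to_id[ch] = len(char_to_id)
    return char_to_id


def one_hot_encode(sentence, char_to_id):
    """Encode a sentence of n characters as an n x v matrix of 0/1 rows."""
    v = len(char_to_id)
    matrix = []
    for ch in sentence:
        row = [0] * v
        row[char_to_id[ch]] = 1
        matrix.append(row)
    return matrix


# Hypothetical training sentences (not taken from the paper's corpus).
training_sentences = ["金银花清热", "治疗咳嗽"]
char_to_id = build_dictionary(training_sentences)

# Encode a 3-character sentence: the result has shape n x v = 3 x 9.
W = one_hot_encode("金银花", char_to_id)
```

Each row contains a single 1, which is why this representation is sparse and grows with the dictionary size $v$.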

Embedding Layer.
TCM texts from the input layer are converted into vectors so that they can be processed numerically. The traditional one-hot representation cannot capture the semantic relationship between words, and it also suffers from the curse of dimensionality and from data sparsity. Distributed representation methods, however, can map words into fixed-length, low-dimensional, dense vectors. The semantic similarity between words can be measured based on the distance of the words in the vector space, which overcomes the shortcomings of one-hot representation [22]. Especially after Mikolov proposed the Word2Vec model in 2013, distributed representation entered the practical stage, and the application of deep learning methods in the field of NLP reached a new height. Furthermore, in order to cope with the mismatch between the scale of the training dataset of the deep neural network model and the number of parameters to be trained, high-quality pretraining results were used to initialize the parameters. We fine-tune the initial word embeddings, modifying them during backpropagation of the neural network model [23]. Each incoming identifier of a character $w_i$ from the input layer is mapped into a $d$-dimension dense vector $X_i = (x_1, x_2, \cdots, x_d)$, where $d$ is the embedding size that defines the number of features used to represent a character. The vector is obtained according to Equation (11):

$$X_i = H w_i \tag{11}$$

where $H$ denotes a $d \times v$-dimension pretrained weight matrix and $w_i$ is the one-hot vector of the $i$th character. The vector is then transmitted to the BiLSTM Layer as the input. At this point, a sentence containing $n$ characters is mapped into a dense matrix $X$ (with the shape of $n \times d$) from the initial sparse one-hot matrix $W$ (with the shape of $n \times v$).
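Multiplying the weight matrix by a one-hot vector simply selects one column, which is why embedding layers are usually implemented as table lookups. A minimal sketch, assuming a toy matrix with $d = 2$ and $v = 4$ (the values of `H` are made up for illustration):

```python
def mat_vec(H, w):
    """Multiply a d x v matrix (list of rows) by a v-dimension vector."""
    return [sum(h * x for h, x in zip(row, w)) for row in H]


# Hypothetical pretrained weight matrix with d = 2 rows and v = 4 columns.
H = [
    [0.1, 0.5, -0.3, 0.7],
    [0.2, -0.4, 0.6, 0.0],
]

# One-hot vector of the character whose identifier is 2.
w_i = [0, 0, 1, 0]

# Dense embedding obtained by the matrix product.
x_i = mat_vec(H, w_i)

# The same embedding obtained by reading column 2 directly.
column = [row[2] for row in H]
```

Both `x_i` and `column` hold the same $d$-dimension vector, so the lookup form gives the same result without materializing the sparse one-hot matrix.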

BiLSTM Layer.
This layer is used to extract sentence features. As shown in Figure 1, every character vector is fed into the forward and the backward LSTM, and the resulting hidden states $h_i^L$ and $h_i^R$ are then spliced to obtain the output hidden state $c_i = [h_i^L, h_i^R]$ of each position $i$; finally, the sentence-level hidden state sequence $C = (c_1, c_2, \cdots, c_n)$ is obtained. A fully connected layer is used to map the hidden state vector to $k$ dimensions, where $k$ is the number of labels in the label set, so as to extract features and provide a probability distribution matrix $P = (p_1, p_2, \cdots, p_n)$. Element $p_{i,j}$ represents the probability of classifying the character $w_i$ as the $j$th label. The pseudocode of the BiLSTM Layer is shown in Algorithm 1.
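The splice-and-project step can be sketched in plain Python as follows. This is an illustrative toy, not the implementation used in the experiments: the weights are random, the sizes `n`, `d`, `hidden`, and `k` are arbitrary, the helper names are hypothetical, and applying a softmax after the fully connected layer is an assumption made here so that each row of `P` sums to one.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def init(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def make_lstm(d, h, rng):
    # One (weights, bias) pair per gate: forget, input, candidate, output.
    return {g: (init(h, h + d, rng), [0.0] * h) for g in "fico"}


def lstm_step(params, x_t, h_prev, c_prev):
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(*params["f"], z)]
    i = [sigmoid(a) for a in affine(*params["i"], z)]
    g = [math.tanh(a) for a in affine(*params["c"], z)]
    o = [sigmoid(a) for a in affine(*params["o"], z)]
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    h_t = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h_t, c


def run_lstm(params, X, hidden):
    h, c, states = [0.0] * hidden, [0.0] * hidden, []
    for x_t in X:
        h, c = lstm_step(params, x_t, h, c)
        states.append(h)
    return states


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilstm_layer(X, fwd, bwd, fc_W, fc_b, hidden):
    left = run_lstm(fwd, X, hidden)                 # forward pass
    right = run_lstm(bwd, X[::-1], hidden)[::-1]    # backward pass, re-aligned
    C = [hl + hr for hl, hr in zip(left, right)]    # c_i = [h_i^L, h_i^R]
    P = [softmax(affine(fc_W, fc_b, c_i)) for c_i in C]
    return C, P


rng = random.Random(0)
n, d, hidden, k = 4, 3, 2, 5
X = init(n, d, rng)  # stands in for the embedded sentence (n x d)
fwd, bwd = make_lstm(d, hidden, rng), make_lstm(d, hidden, rng)
fc_W, fc_b = init(k, 2 * hidden, rng), [0.0] * k
C, P = bilstm_layer(X, fwd, bwd, fc_W, fc_b, hidden)
```

Reversing the backward outputs before splicing ensures that position $i$ pairs the left-context state with the right-context state of the same character.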

CRF Layer.
This layer carries out sentence-level sequence labeling to ensure the generation of the globally optimal labeling sequence. The outputs $p_i$ of the BiLSTM Layer are independent of each other, ignoring the strong dependence between a label and its preceding label $p_{i-1}$ and its subsequent label $p_{i+1}$. The CRF layer can automatically obtain some restrictive rules from the training data and conduct sentence-level adjustment, which reduces the probability of illegal sequences and improves the accuracy of label sequence prediction. An intuitive explanation is given later in this section in conjunction with the mathematical notation. Considering the need to add a start state at the beginning of a sentence and a stop state at the end, the data structure of the CRF layer is a $(k + 2) \times (k + 2)$ state transfer matrix $A$, where $A_{i,j}$ represents the transfer score from the $i$th label to the $j$th label. For a sentence $W = (w_1, w_2, \cdots, w_n)$ as the input, assuming that a predicted label sequence $y = (y_1, y_2, \cdots, y_n)$ is obtained, the score of the prediction is defined as

$$s(W, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \tag{12}$$

where $P_{i,y_i}$ is the probability of the $i$th position of the BiLSTM outputs being $y_i$, $A_{y_i,y_{i+1}}$ is the transfer probability from $y_i$ to $y_{i+1}$, and $y_0$ and $y_{n+1}$ denote the start and stop states.

Figure 2: BiLSTM-CRF model structure based on Chinese characters.

The score of a candidate sequence is jointly determined by the features $P$ extracted from the BiLSTM layer and the transfer matrix $A$. Suppose, for instance, that the BiLSTM layer outputs the sequence "B-M, B-M, I-M, O," but the score of "B-M, B-M" in the transfer matrix $A$ is quite small or even negative (in practice, the chances of this transition occurring are also slim to none); the score $s$ is then reduced, and consequently, this unreasonable prediction sequence is likely to be ruled out. For each training sample $W$, the Viterbi algorithm [24] is used to calculate the score $s(W, y)$ of all possible labeled sequences $y$, and then a softmax function is added to normalize all the scores, as shown in Equation (13):

$$P(y \mid W) = \frac{\exp(s(W, y))}{\sum_{y'} \exp(s(W, y'))} \tag{13}$$

where $y'$ ranges over all candidate label sequences. During model training, for the input sentence $W$, the loss function is based on the logarithm of the probability of the true label sequence $y$. To maximize the probability corresponding to the true label sequence, the negative of this value is minimized, and the gradient descent algorithm is used to solve for the parameters that maximize the log-likelihood function:

$$\log P(y \mid W) = s(W, y) - \log \sum_{y'} \exp(s(W, y')) \tag{14}$$

In the process of prediction, the scores $s$ corresponding to every candidate sequence $y'$ are calculated according to the trained parameters, and the Viterbi algorithm, with dynamic programming at its core, is used to calculate the optimal path. The predicted result is denoted as $Y^*$:

$$Y^* = \arg\max_{y'} s(W, y') \tag{15}$$

Algorithm 1: Pseudocode of the BiLSTM Layer.
Input: Pretrained character embedding X
Output: Probability distribution matrix P of the input sequence
Step 1: The character vectors from X are sent into the forward LSTM layer
    for i ∈ length(X) do
        send X_i to BiLSTM Layer
    end for
Step 2: The state of the cell in the current LSTM network is updated
Step 3: The character vectors from X are sent into the backward LSTM layer and the above 2 steps are repeated
Step 4: The forward and backward sequences of hidden layers are spliced to obtain a sentence-level hidden state sequence C rich in contextual information
Step 5: C is sent into a fully connected layer and the prediction matrix P is obtained
Return P

In summary, a BiLSTM-CRF model was established in this paper to identify the entities of herb names, disease names, symptoms, and therapeutic effects in TCM patent texts. Given a text sequence as input, the model can output the relevant entities in it. Combined with word vector technology, LSTM has an obvious advantage when dealing with text sequences, as it can take full advantage of long-term and long-distance information dependencies [25]. BiLSTM can provide even more comprehensive contextual information and makes it easier to learn contextual dependencies. The additional CRF layer makes up for the deficiency of the BiLSTM model by optimizing the recognition results comprehensively at the sentence level.
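How the transfer matrix can rule out a sequence such as "B-M, B-M, I-M, O" is easiest to see on a toy example. The sketch below is illustrative only: the label set is reduced to three labels plus start and stop states, all scores in `P` and `A` are invented, and the helper names are hypothetical. Picking the highest BiLSTM score at each position yields the unreasonable sequence, while Viterbi decoding with the transfer scores returns a valid one.

```python
B_M, I_M, O, START, STOP = 0, 1, 2, 3, 4
NAMES = ["B-M", "I-M", "O"]


def sequence_score(P, A, labels, start, stop):
    """Score s of one label sequence: transfer scores plus position scores."""
    path = [start] + list(labels) + [stop]
    transfer = sum(A[a][b] for a, b in zip(path, path[1:]))
    position = sum(P[i][y] for i, y in enumerate(labels))
    return transfer + position


def viterbi(P, A, k, start, stop):
    """Return the highest-scoring label sequence and its score."""
    best = [A[start][j] + P[0][j] for j in range(k)]
    back = []
    for i in range(1, len(P)):
        pointers, new_best = [], []
        for j in range(k):
            prev = max(range(k), key=lambda p: best[p] + A[p][j])
            pointers.append(prev)
            new_best.append(best[prev] + A[prev][j] + P[i][j])
        back.append(pointers)
        best = new_best
    last = max(range(k), key=lambda j: best[j] + A[j][stop])
    score = best[last] + A[last][stop]
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], score


# Invented BiLSTM scores for a 4-character sentence (rows: positions).
P = [[2.0, 0.1, 0.3], [1.2, 1.0, 0.2], [0.1, 1.5, 0.4], [0.0, 0.2, 1.8]]

# Invented transfer scores; rows are "from", columns are "to".
A = [
    [-3.0, 1.0, 0.0, 0.0, 0.0],   # from B-M: B-M -> B-M is penalized
    [0.0, 0.5, 0.5, 0.0, 0.0],    # from I-M
    [0.5, -3.0, 0.5, 0.0, 0.0],   # from O: O -> I-M is penalized
    [0.5, -3.0, 0.5, 0.0, 0.0],   # from start: start -> I-M is penalized
    [0.0, 0.0, 0.0, 0.0, 0.0],    # from stop (unused)
]

greedy = [max(range(3), key=lambda j: row[j]) for row in P]
greedy_score = sequence_score(P, A, greedy, START, STOP)
best_path, best_score = viterbi(P, A, 3, START, STOP)
decoded = [NAMES[j] for j in best_path]
```

In this toy setting the position-wise choice is "B-M, B-M, I-M, O", whereas the decoded sequence is "B-M, I-M, I-M, O", because the penalty on the "B-M, B-M" transfer outweighs the small difference in the BiLSTM scores at the second position.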

Experiment

Experimental Procedure.
Based on the BiLSTM-CRF algorithm, this paper has completed the NER experiment on TCM patent texts. The main experimental process is shown in Figure 4.
The data collected by crawlers contained a large amount of nontext data such as website tags, links, and special characters, which was not conducive to sequence labeling [26]. Through techniques such as regular expressions and character format standardization, nontext data was removed, and 1600 abstract texts were finally preserved as the experimental corpus.
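A cleaning step of this kind might look like the sketch below. The patterns, the function name, and the sample string are assumptions made for illustration; the paper does not list its actual rules. Unicode NFKC normalization is used here as one possible form of character format standardization, since it folds full-width digits and punctuation into their standard forms.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")        # website tags such as <p> or </div>
URL_RE = re.compile(r"https?://\S+")   # links
SPACE_RE = re.compile(r"\s+")          # runs of whitespace


def clean_abstract(raw):
    """Strip tags and links, normalize character forms, drop whitespace."""
    text = TAG_RE.sub("", raw)
    text = URL_RE.sub("", text)
    text = unicodedata.normalize("NFKC", text)
    return SPACE_RE.sub("", text)


# Hypothetical crawled snippet (not from the paper's corpus).
raw = "<p>本发明公开了一种治疗咳嗽的中药组合物，  由金银花等１２味药材制成。</p> http://example.com/p/1"
cleaned = clean_abstract(raw)
```

Removing all whitespace is reasonable for Chinese abstracts, where spaces carry no word boundary information, but it would need to be relaxed for mixed-language text.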
The dataset was then partitioned into three parts, i.e., training set, validation set, and test set. The proportion of these three parts was 6 : 2 : 2 and specifically, 147,788 Chinese

Sequence Labeling.
In simple terms, sequence labeling marks each element in a given sequence with a corresponding label so that the neural network can learn from it. According to the selection criteria, 1600 high-quality abstract texts were selected for this experiment. The model designed in this paper is mainly aimed at identifying four categories of entities, namely, herb names, disease names, symptoms, and therapeutic effects in TCM patent texts. Examples of each entity class are shown in Table 1.
Firstly, each entity of the four categories should be identified. Secondly, the BIO (Begin, Intermediate, Other) labeling method [27] is adopted for text labeling, where "B" means the first character of a word, "I" means a nonfirst character of the word, and "O" means a nonfocus character or punctuation. In the specific labeling process, the labeling methods shown in Table 2 are used to distinguish between entities, where the initial character of a herb name is expressed in the form of B-M (Begin-herb), and the rest of the word is expressed in the form of I-M (Intermediate or end-herb). Table 3 shows the detailed process of TCM patent sequence labeling. Taking sample No. 1 as an example, the original text is denoted as state 1 in the table. With the help of the corpus labeling tool YDEEA [28], we find the entities in the text and then label them with the corresponding marks. For example, as shown in state 2, "JinYinHua/M" indicates that "JinYinHua" is a herb name entity. The tool then automatically generates a labeled version of the text in the form shown in state 3.
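The conversion from entity-level marks to character-level BIO labels can be sketched as follows. The input format (a list of text segments with an optional category) and the function name are hypothetical and do not reproduce the output format of the labeling tool; only the herb category M is used because it is the one spelled out in the text.

```python
def bio_labels(segments):
    """Turn (characters, category) segments into per-character BIO labels.

    category is None for text that is not part of any entity.
    """
    labels = []
    for chars, category in segments:
        for position, ch in enumerate(chars):
            if category is None:
                labels.append((ch, "O"))
            elif position == 0:
                labels.append((ch, "B-" + category))
            else:
                labels.append((ch, "I-" + category))
    return labels


# Hypothetical annotated fragment: two herb names joined by "和" (and).
segments = [("金银花", "M"), ("和", None), ("甘草", "M")]
labeled = bio_labels(segments)
tags = [tag for _, tag in labeled]
```

Here `tags` is `["B-M", "I-M", "I-M", "O", "B-M", "I-M"]`, so two adjacent entities of the same category remain distinguishable through their B labels.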
Finally, the labeled characters are sent directly to the input layer for processing; that is, the Word2Vec embedding method is used to generate a 128-dimension character vector matrix for model training.

Model Training.
PyTorch is an open-source Python machine learning library developed by Facebook AI Research (FAIR). It is widely used to implement various machine learning algorithms. The BiLSTM-CRF model proposed in this paper was implemented in the PyTorch framework. Based on the training experience of relevant papers [29] and through multiple adjustments, the main parameters considered in the experiment were finally set as shown in Table 4.

Model Test.
We evaluate the experimental results by comparing the recognition results of different models, with manual labeling as the benchmark. Precision rate $P$, recall rate $R$, and F1 value are selected as evaluation indicators [30], as shown in Equations (16) to (18):

$$P = \frac{\text{correct}_{\text{num}}}{\text{actual}_{\text{num}}} \times 100\% \tag{16}$$

$$R = \frac{\text{correct}_{\text{num}}}{\text{true}_{\text{num}}} \times 100\% \tag{17}$$

$$F1 = \frac{2 \times P \times R}{P + R} \tag{18}$$

where $\text{actual}_{\text{num}}$, $\text{true}_{\text{num}}$, and $\text{correct}_{\text{num}}$ represent the actual number of entities identified in the sample, the number of true entities, and the number of entities correctly identified, respectively. An example is given in Figure 5 to illustrate the calculation of the three indicators.
In this simple case, $\text{true}_{\text{num}}$ is 2 (a herb name and a therapeutic effect), $\text{actual}_{\text{num}}$ is 1 (a herb name), and $\text{correct}_{\text{num}}$ is also 1 (a herb name). Substituting these values into Equations (16) to (18) gives values of $P$, $R$, and F1 equal to 100.00%, 50.00%, and 66.67%, respectively. According to the definition, the precision rate represents how many entities in the prediction result are true entities, while the recall rate represents how many entities in the original samples are correctly predicted. Since these two indicators reflect the recognition performance from different perspectives, their weighted harmonic mean needs to be considered for a comprehensive evaluation. The most commonly used one is the F1 value; a high F1 value indicates that the method is effective.
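The three indicators can be computed directly from the three counts. The sketch below reproduces the worked example above; the function name and the handling of zero denominators are choices made here for illustration.

```python
def evaluate(actual_num, true_num, correct_num):
    """Return precision, recall, and F1 as fractions between 0 and 1."""
    precision = correct_num / actual_num if actual_num else 0.0
    recall = correct_num / true_num if true_num else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Worked example: one entity predicted, two true entities, one correct.
p, r, f1 = evaluate(actual_num=1, true_num=2, correct_num=1)
report = "P={:.2%} R={:.2%} F1={:.2%}".format(p, r, f1)
```

For this example `report` reads `P=100.00% R=50.00% F1=66.67%`, matching the values stated in the text.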

Results and Discussion
The experimental results are the average scores of the optimal model, as shown in Table 5: the precision rate is 94.63%, the recall rate is 94.47%, and the F1 value is 94.48%. Based on the analysis of the experimental results and the labeled corpus, it can be seen that the performance of entity recognition is, in large part, determined by the number of labels and by the extent to which different categories of entities are distinguishable from one another.
The category of herb names has the largest number of labels, and it is also unlikely to be recognized as any other category; so, it makes sense that its recognition result outperforms those of the other categories. Although there are fewer training samples for therapeutic effects and disease names, their degrees of distinction are relatively high; so, their recognition results turn out favorable as well. Symptom entities are the least labeled and the most likely to be confused with entities in other categories. For example, "ZhuiJianPanTuChu (intervertebral disc herniation)" may be regarded as a symptom or as a disease name, which seriously reduces the accuracy of identification. Furthermore, the boundary of symptoms is not clearly defined, and it is common to have a symptom nested within a therapeutic effect. For example, if "JingMaiXueShuan (venous thrombosis)" in "KangJingMaiXueShuanXingCheng (to alleviate venous thrombosis)" is labeled as a symptom entity while the model recognizes the entire phrase as a therapeutic effect, a false recognition will occur. Thus,
Table 6. It can be seen that the recognition results of the BiLSTM and BiLSTM-CRF models exceeded those of HMM in precision, recall rate, and F1 value, indicating that the performance of the BiLSTM-CRF model in this experiment is superior to that of the traditional machine learning algorithm HMM. This result can be explained from at least two perspectives. On the one hand, in terms of word representation, HMM involves one-hot representation while RNN uses a kind of distributed representation; so, the latter is more efficient in the face of high dimensions. On the other hand, from the perspective of the evolution mode of the hidden state, BiLSTM, as an RNN-based model, replaces the linear state evolution of HMM with a highly nonlinear one; so, its expression ability is stronger.
When the sample size goes up, the recognition effect of the LSTM method shows an impressive boost, thanks to its full consideration of timing characteristics [31]. Compared with the HMM model, the LSTM model uses a deeper and more complex neural network. On the basis of the ordinary neural network structure, the recurrent neural network structure is integrated, and timing characteristics are further considered. Therefore, this model is suitable for dealing with the contextual relationship in texts where the output at each time step is affected by the states of the previous time step.
In addition, from LSTM to BiLSTM, the overall F1 value increased from 93.09% to 93.66%, a gain of 0.57 percentage points. This is because LSTM only extracts one-way features of sequences, and thus lacks many features that are useful for sequence labeling, whereas BiLSTM can extract features from both the forward and backward directions of the sequence, so as to obtain the knowledge more comprehensively and thereby achieve better performance [32].
Moreover, by adding the CRF layer, the BiLSTM-CRF model obtained the optimal probability of label transfer. Compared with BiLSTM, precision, recall, and F1 value of the proposed model were improved by 0.98%, 0.61%, and 0.82%, respectively.
From the comprehensive recognition results, there is still room for further improvement of the method. The performance of the proposed model is mainly restricted by two factors: (1) The scale of training data. The current data scale cannot well support the parameters required by the model, which has a negative impact on the learning of the model. (2) Entity labeling granularity. As discussed in the previous section, the entity categories used in this paper lead to noticeable nesting between symptoms and therapeutic effects. For example, the symptom "Ke (cough)" is nested in the therapeutic effect entity "ZhiKe (to relieve a cough)". In fact, in this case, we can simply label "cough" as a symptom, and words such as "stop," "eliminate," and "relieve" only serve as clue words indicative of a symptom entity so that
In the future, with the development of transfer learning, unsupervised learning technology, and radical vector feature representation methods that are more granular than word vectors, better recognition results are expected to be achieved with smaller data sizes and fewer labels [33]. Furthermore, the combination of coarse-grained and fine-grained methods, together with entity alignment and semantic disambiguation technology, may address the problem of polysemy in TCM texts and improve the recognition performance.

Conclusions
In this paper, a BiLSTM-CRF model is constructed to obtain the bidirectional semantic features of the context and identify four types of entities, namely, herb names, disease names, symptoms, and therapeutic effects. Favorable results have been achieved, with the overall F1 value reaching 94.48%. The improvement over HMM, LSTM, and BiLSTM indicates that the model constructed in this paper is able to provide strong support for subsequent natural language processing applications and to provide a theoretical and technical reference for researchers in relevant fields. Besides, in view of the strong portability of the deep learning model, it can be applied to NER tasks in fields other than TCM patents. For any available standard dataset in a certain field, we can define a set of entity types properly, train the model, and extract the target entities for further analysis and utilization. Yet, there is still room for improvement in the task of NER in TCM texts. In future research, the following aspects shall be taken into account. On the one hand, a corpus with higher purity, richer content, and larger scale is recommended to build higher-quality character vectors, and closer attention to the standardization of dataset labeling is needed. On the other hand, for the rather ubiquitous phenomenon of nesting among entity categories in TCM texts, the granularity of entity labeling can be further optimized by referring to the solution presented in the previous section.

Data Availability
The dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.