Stacking-BERT model for Chinese medical procedure entity normalization

Medical procedure entity normalization is an important task to realize medical information sharing at the semantic level; it faces main challenges such as variety and similarity in real-world practice. Although deep learning-based methods have been successfully applied to biomedical entity normalization, they often depend on traditional context-independent word embeddings, and there is minimal research on medical entity recognition in Chinese. Regarding the entity normalization task as a sentence-pair classification task, we applied a three-step framework to normalize Chinese medical procedure terms, which consists of dataset construction, candidate concept generation and candidate concept ranking. For dataset construction, an external knowledge base and easy data augmentation skills were used to increase the diversity of training samples. For candidate concept generation, we implemented the BM25 retrieval method based on integrating synonym knowledge of SNOMED CT and the training data. For candidate concept ranking, we designed a stacking-BERT model, including the original BERT-based and Siamese-BERT ranking models, to capture the semantic information and choose the optimal mapping pairs by the stacking mechanism. In the training process, we also added adversarial training to improve the learning ability of the model on small-scale training data. Based on the clinical entity normalization task dataset of the 5th China Health Information Processing Conference, our stacking-BERT model achieved an accuracy of 93.1%, which outperformed the single BERT models and other traditional deep learning models. In conclusion, this paper presents an effective method for Chinese medical procedure entity normalization and a validation of different BERT-based models. In addition, we found that adversarial training and data augmentation can effectively improve the performance of the deep learning model on small samples, which might provide


Introduction
Mining medical text data from electronic health records (EHRs) to generate clinical evidence has been widely applied in clinical decision-making. One fundamental problem in medical text mining is entity normalization, which aims to map entity mentions to standard concepts in a given knowledge base (KB) or controlled vocabulary. Accurate entity normalization can solve the problem of consistency in the expression of entity mentions and realize information sharing at the semantic level. In China, with increasing implementation of a healthcare payment policy based on diagnosis-related groups in hospitals, a large amount of irregular writing in clinical notes needs to be manually mapped to the standard concepts of the International Classification of Diseases (ICD); as a result, the entity normalization task for diagnoses and procedures has become very important, and it requires sufficiently trained staff with a good knowledge of both medicine and coding rules. In the real world, medical entity normalization tasks are time-consuming and labor-intensive; thus, this paper mainly focuses on the Chinese medical procedure entity normalization task and describes an automated and efficient method to map clinical terms into ICD codes in Chinese.
There are three major challenges to optimizing the Chinese medical procedure entity normalization task: 1) Variety. Due to diverse writing habits, the experience of physicians and the requirements of medical institutions, there are many different non-standard expressions in Chinese, and the same concept may be linked to different entity mentions; for example, the entity mentions "Mile's" and "直肠癌根治术 (Dixon)" are both linked to the normalized concept "腹会阴直肠切除术 (abdominoperineal resection of the rectum)" in the Chinese controlled vocabulary ICD-9-CM-3. 2) Similarity. Chinese words may have similar glyphs but different semantics, such as the two procedure concepts "硬脊膜外病损切除术 (excision of epidural lesion)" and "硬脊膜下病损切除术 (excision of subdural lesion)" in the Chinese controlled vocabulary ICD-9-CM-3; their similarity interferes with the exact matching of terms. 3) Limited context information. A mention-level entity is a short text whose critical context information is limited, and the concepts in ICD have no semantic relationship information available. To solve these problems, we regarded mention-level entity normalization as a sentence-pair classification task in this study and designed a stacking-bidirectional encoder representations from transformers (BERT) fusion model to capture the semantic information of clinical entity mentions. An external KB and easy data augmentation (EDA) skills were used to increase the diversity of training samples, which provided rich term variation features to the model. In addition, we generated difficult negative samples to train the model to learn the subtle differences between concepts and added adversarial learning in the training process to improve the ability of the model to discriminate between similar samples.
The normalization task here could be referred to as entity linking in the computer science community. In the biomedical domain, many previous studies focus on the development of rule-based methods [1][2][3]. Their work relied on large, expert-curated vocabularies of standardized medical terminology for string matching-based approaches, with great success [4]. In recent years, deep learning-based systems have addressed the limitations of string matching and achieved good performance in entity normalization. In general, deep learning-based systems consist of two steps [5]: (i) candidate concept generation, to retrieve candidate concepts related to a given entity mention; and (ii) candidate concept ranking, to rank the candidate concepts and decide on the one most relevant to the given entity mention. To improve the efficiency of candidate concept generation, Vashishth et al. [6] introduced a semantic-type prediction module to alleviate the problem of the overgeneration of candidate concepts by filtering out irrelevant candidate concepts based on the predicted semantic type of a mention.
Candidate concept ranking is the key step for entity linking systems. Similarity-based methods have been proven to be effective for concept ranking. They commonly use sentence embedding as an upstream task before text classification, adopting a vector space model to represent entity mentions and candidate concepts as fixed-length vectors for semantic similarity calculations [7][8][9]. In recent years, deep representation learning models such as BERT [10] have been widely used to improve many natural language processing (NLP) tasks. In the medical domain, the BioBERT [11] and ClinicalBERT [12] language representation models, which were pre-trained on biomedical texts and clinical notes based on the BERT architecture, were introduced to advance the state-of-the-art performance on many domain-specific NLP tasks. Li et al. [13] introduced a BERT-based model named EhrBERT that was trained using 1.5 million EHRs; they proved the effectiveness of their BERT-based model on entity normalization tasks, but they treated entity normalization as a multi-class classification task over single sentences, where the number of classes depends on the vocabularies used in a corpus; the performance of this model relies on having a large amount of training data for each class, so it is not suitable for small samples. Kalyan and Sangeetha [14] proposed a medical concept normalization system based on BERT and highway layers; their experimental results show that their model outperformed all existing methods on two standard datasets. Sung et al. [15] introduced the BIOSYN system for biomedical entity representation learning, which uses synonym marginalization and dispenses with the explicit need for negative training pairs; their results show that iterative candidate selection based on the model's representations is crucial for improving the performance, together with synonym marginalization. The above studies preliminarily proved the effectiveness of BERT on clinical entity classification tasks.
In this study, we developed different sentence-pair similarity calculation models with different structures based on BERT, and stacking was performed to make full use of the advantages of BERT-based models.
Most previous studies have focused on the standardization of English entities. Up to now, there have been few studies specifically designed for Chinese clinical entity normalization. Real-world public datasets in Chinese related to health informatics are almost nonexistent, and this has been a bottleneck for the development of text mining in the Chinese medical entity normalization domain. Some researchers have developed algorithms based on manually annotated datasets. Xia et al. [16] proposed a multi-field indexing approach, which accomplishes the term normalization task by using an information retrieval algorithm with four levels of indices: word, character, pinyin and pinyin initial. Luo et al. [8] introduced a multiview convolutional neural network to address the normalization of diagnostic and procedure names simultaneously. Likewise, Zhang et al. [17] presented an unsupervised framework to normalize Chinese medical concepts by combining disease text with comorbidity. Wang et al. [18] developed and compared several entity-linking approaches to normalize disease and procedure terms in Chinese; their results showed that the BERT-based ranking method achieved the best performance on encoding both Chinese diagnosis and procedure terms.
Based on the previous studies, entity normalization was regarded as a sentence-pair classification task in this study; we designed different sentence-pair similarity calculation models with different structures based on BERT and propose a stacking-BERT fusion model to capture the semantic information of clinical entity mentions. There are three major contributions of this paper:
• We used an external KB and EDA skills to increase the diversity of training samples; the results show that EDA skills can provide more features of term variation for the model.
• We proposed a concept ranking model with different structures based on BERT; it is fused by a stacking mechanism to further improve the performance of the model. Our detailed experimental analysis on Chinese medical procedure entity normalization tasks realized remarkable improvements over existing methods.
• We added adversarial learning to the training process; the results show that adversarial learning can significantly enhance the robustness and generalization of the model.

Study design
Given the medical procedure entity set $E = \{e_1, e_2, \ldots, e_i, \ldots, e_n\}$, $e_i \in E$, which is recognized from Chinese clinical text, and a controlled vocabulary $C = \{c_1, c_2, \ldots, c_j, \ldots, c_m\}$, $c_j \in C$, which consists of a set of standard concepts, the entity normalization task of our study is to find the best corresponding concept $c^*$ for each input entity $e_i$, as shown in Eq (1), where the score is calculated by the text matching algorithm in our model:

$$c^* = \arg\max_{c_j \in C} \mathrm{score}(e_i, c_j) \quad (1)$$

Figure 1 shows the system architecture for entity normalization used in this study, which consists of three modules: 1) dataset construction: to increase the diversity of training samples by using an external KB and EDA skills; 2) candidate concept generation: to generate a list of candidate ICD concepts for a given entity, using a simple BM25 algorithm and an extended BM25 that integrates synonym knowledge of SNOMED CT and the training data; and 3) candidate concept ranking: to rank candidate ICD concepts, we propose a stacking-BERT model with different structures based on BERT, which are fused by a stacking mechanism. Detailed descriptions of these methods are given in the following sections.

Dataset
We evaluated our approach on the clinical entity normalization task dataset of the 5th China Health Information Processing Conference (CHIP2019) [19]. The dataset provides procedure entities recognized from Chinese electronic medical records, and the controlled vocabulary is "ICD-9-CM-3 Peking union medical college hospital edition 2017", which contains 9467 different procedure concepts in Chinese; each entity in the dataset is manually linked to one or more standardized concepts in the controlled vocabulary. The distribution of entities in the dataset is shown in Table 1 and the examples are shown in Figure 2. The dataset has the following problems: 1) the dataset does not give negative samples, i.e., entities paired with concepts they do not match; 2) due to the small training set, 23% of the concepts in the test set were not in the training set; and 3) one entity may link to more than one concept, and approximately 5% of entities in the dataset map to multiple concepts.

Easy data augmentation of training data
In order to make the model learn more semantic information, the construction of the training set is very important. We adopted EDA skills to generate new pairwise training data based on the CHIP2019 dataset (Figure 3).

Data cleaning.
We used regular expressions to remove useless punctuation and content from procedure entities, for example cleaning "(腹腔镜)胆囊切除术 (51.2201)" to "腹腔镜胆囊切除术 (laparoscopic cholecystectomy)". Then, English abbreviations that appear in the training set entities were extracted separately.
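The cleaning step can be illustrated with a minimal sketch. The paper does not list its regular expressions, so the three patterns below and the helper names `clean_entity` and `extract_abbreviations` are assumptions chosen only to reproduce the example above.

```python
import re

# Hypothetical patterns; the actual rules used in the study are not given.
CODE_IN_PARENS = re.compile(r"[((]\s*\d+(?:\.\d+)*\s*[))]")  # e.g. "(51.2201)"
BRACKETS = re.compile(r"[()()\[\]【】]")                      # leftover brackets
WHITESPACE = re.compile(r"\s+")
ABBREVIATION = re.compile(r"[A-Za-z]+")                        # ASCII letter runs


def clean_entity(text: str) -> str:
    """Drop bracketed numeric codes, then remaining brackets and whitespace."""
    text = CODE_IN_PARENS.sub("", text)
    text = BRACKETS.sub("", text)
    return WHITESPACE.sub("", text)


def extract_abbreviations(text: str) -> list:
    """Collect runs of ASCII letters so they can be handled separately."""
    return ABBREVIATION.findall(text)


print(clean_entity("(腹腔镜)胆囊切除术 (51.2201)"))
print(extract_abbreviations("直肠癌根治术 (Dixon)"))
```

Removing all whitespace is reasonable for Chinese terms but would merge multi-word English strings, which is one reason to extract the English parts before cleaning.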
Positive sample extension. Three methods were used to extend positive samples in our study: (i) transitive expansion based on the pairs of training data; (ii) symmetric extension based on the pairs of Step (i); and (iii) positive sample supplementation based on external clinical terminology. The Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) is a comprehensive multilingual clinical terminology used in EHRs and interoperability, and its components are concepts (codes), descriptions (terms) and relationships. Each concept has a unique concept ID, a fully specified name and multiple descriptions (including a preferred term and one or more synonyms), all of which express the same semantics of one concept. We matched all descriptions of the same concept pairwise in SNOMED CT and added all synonym pairs to the training set as positive examples.
Negative sample generation. Previous studies suggested that the construction of difficult negative samples can enhance the feature-learning ability of the model and thus improve its effectiveness. We generated negative samples for each entity with the commonly used information retrieval method BM25 introduced in Section 2.3. For each entity in the training set, the top 20 concepts were retrieved, and those other than the manually linked concept were used as negative samples.
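A minimal sketch of the three positive-sample extensions follows. The exact pairing rules are not spelled out in the text, so this assumes that transitive expansion pairs mentions sharing the same linked concept, that symmetric extension swaps the two sides of a pair, and that synonym supplementation takes all pairwise combinations of a concept's descriptions. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def transitive_pairs(pairs):
    """Pair up mentions that are linked to the same concept."""
    by_concept = defaultdict(list)
    for mention, concept in pairs:
        if mention not in by_concept[concept]:
            by_concept[concept].append(mention)
    out = []
    for mentions in by_concept.values():
        out.extend(combinations(mentions, 2))
    return out


def symmetric_pairs(pairs):
    """Swap the two texts of each pair so both orders are seen in training."""
    return [(b, a) for a, b in pairs if a != b]


def synonym_pairs(descriptions_by_concept):
    """All pairwise combinations of the descriptions of one concept."""
    out = []
    for descriptions in descriptions_by_concept.values():
        out.extend(combinations(descriptions, 2))
    return out


train = [("Mile's", "腹会阴直肠切除术"), ("直肠癌根治术", "腹会阴直肠切除术")]
print(transitive_pairs(train))
print(symmetric_pairs(train))
```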
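As a rough illustration of hard negative generation, the sketch below takes an already ranked candidate list (for example, from BM25) and keeps the top-ranked concepts that are not the annotated ones. Whether the study kept exactly 20 negatives or the top 20 minus the gold concept is not stated; this sketch assumes the latter reading is equivalent to filtering and truncating to `k`. The helper names are hypothetical.

```python
def hard_negatives(ranked_concepts, gold_concepts, k=20):
    """Top-k retrieved concepts that are NOT among the annotated ones."""
    gold = set(gold_concepts)
    negatives = [c for c in ranked_concepts if c not in gold]
    return negatives[:k]


def build_training_pairs(mention, ranked_concepts, gold_concepts, k=20):
    """Labelled <mention, concept, label> triples for sentence-pair training."""
    positives = [(mention, c, 1) for c in gold_concepts]
    negatives = [(mention, c, 0)
                 for c in hard_negatives(ranked_concepts, gold_concepts, k)]
    return positives + negatives


ranked = ["硬脊膜外病损切除术", "硬脊膜下病损切除术", "脊髓病损切除术"]
print(build_training_pairs("硬膜外病损切除", ranked, ["硬脊膜外病损切除术"], k=2))
```

Because the negatives come from the top of a lexical ranking, they share most characters with the mention, which is what makes them hard for the ranking model.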

Candidate concept generation
Due to the large size of the ICD-9-CM-3 vocabulary, using the whole vocabulary as the candidate concept set would place a great burden on the model, because most of the concepts are irrelevant to a given entity. The purpose of candidate concept generation is to ensure that, as far as possible, all potentially correct concepts are added to the candidate concept set. Common recall methods include string similarity calculations based on text features and search engine retrieval. In order to improve the efficiency of model operation and ensure the recall rate of the best corresponding concept, the candidate concept generation component consists of two steps: (1) indexing all ICD codes and their preferred concepts in Chinese by invoking the Lucene application programming interface, and (2) retrieving the top n candidate concepts from the index for a clinical entity $e$ by employing the BM25 model provided by Lucene [20].
To achieve higher recall for candidate generation, we used the Chinese characters as the basic building blocks of both indexing and retrieval without considering Chinese word segmentation. In addition to the baseline index described above, another two indexes were proposed in this section by using annotated training data and synonym terms of SNOMED CT. We established the index of SNOMED CT terms and ICD concepts by aligning the fully specified name and preferred terms in SNOMED CT with the concepts in ICD-9-CM-3 by regularization. Figure 4 shows an example of the complete candidate concept generation process.
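The character-level retrieval idea can be sketched without Lucene. The class below is a small stand-in that scores documents with the common Okapi BM25 formula over single characters; the class name `CharBM25`, the parameter values `k1=1.2` and `b=0.75`, and the idf variant are assumptions and are not taken from the study's Lucene configuration.

```python
import math
from collections import Counter


class CharBM25:
    """Okapi BM25 over single characters (no word segmentation)."""

    def __init__(self, documents, k1=1.2, b=0.75):
        self.documents = list(documents)
        self.k1, self.b = k1, b
        self.term_freqs = [Counter(doc) for doc in self.documents]
        self.doc_lens = [len(doc) for doc in self.documents]
        self.avg_len = sum(self.doc_lens) / len(self.documents)
        doc_freq = Counter()
        for tf in self.term_freqs:
            doc_freq.update(tf.keys())
        n = len(self.documents)
        # Non-negative idf variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - df + 0.5) / (df + 0.5))
                    for t, df in doc_freq.items()}

    def score(self, query, idx):
        tf, dl = self.term_freqs[idx], self.doc_lens[idx]
        total = 0.0
        for ch in set(query):
            if ch not in tf:
                continue
            f = tf[ch]
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avg_len)
            total += self.idf[ch] * f * (self.k1 + 1) / norm
        return total

    def top_n(self, query, n=20):
        scored = [(self.score(query, i), doc)
                  for i, doc in enumerate(self.documents)]
        scored.sort(key=lambda pair: -pair[0])
        return [doc for s, doc in scored[:n] if s > 0]


concepts = ["腹腔镜胆囊切除术", "硬脊膜外病损切除术",
            "硬脊膜下病损切除术", "腹会阴直肠切除术"]
index = CharBM25(concepts)
print(index.top_n("腹腔镜下胆囊切除", n=2))
```

Extending the index with synonyms, as described above, would amount to adding each synonym string as an extra document that points back to the same ICD concept.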

Candidate concept ranking
This section mainly introduces the candidate concept ranking model, stacking-BERT, developed in our study. The stacking-BERT model consists of two layers, where the first layer includes four base ranking models with the different structures introduced in Sections 2.4.1 and 2.4.2, and the final layer is a simple logistic regression model. The stacking mechanism and algorithm are introduced in Section 2.4.3.

BERT-based ranking model
As a sentence-pair classification task, using the BERT-based model shown in Figure 5(a), we treated the representation from the top layer of the transformer as the features for the normalization task. Similar to Ji et al. [21], in our BERT-based classification model, for each input entity $e$ and a candidate concept $c$, we constructed a sequence $\langle \text{[CLS]}\ e\ \text{[SEP]}\ c \rangle$ as the input of the fine-tuning procedure, where [CLS] is the special token used as the representation of the whole sequence, and [SEP] is the special token used for separating $e$ and $c$. After encoding by 12 or 24 layers of multi-head attention transformers, the final hidden state output of the special [CLS] token, $h \in \mathbb{R}^{H}$, was passed to the softmax layer to compute the probability distribution over all classes, which is described as $P = \mathrm{softmax}(h W^{\top})$, where $W \in \mathbb{R}^{2 \times H}$ is the parameter added during the fine-tuning procedure. Here, 2 means that there are only two classifier labels in our task: label 1 means that $c$ is the mapping concept for $e$, and label 0 means that $c$ is not the mapping concept. We employed the probability of label 1 as the final score of each input pair; after ranking all scores, the top-ranking candidate concept $c^*$ was found as the best mapping concept for $e$.
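The scoring step of the cross-encoder can be sketched in plain Python, leaving out the BERT encoder itself. The sketch assumes the [CLS] vector `h` is already available and shows only input construction, the two-class softmax and the ranking; the function names and the character-level tokenization are illustrative assumptions.

```python
import math


def build_pair_tokens(entity, concept):
    """Token sequence <[CLS] e [SEP] c>, one token per character here."""
    return ["[CLS]"] + list(entity) + ["[SEP]"] + list(concept)


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def match_probability(h, W):
    """P(label = 1) from the [CLS] vector h (length H) and W (2 x H)."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    return softmax(logits)[1]


def rank_candidates(scores):
    """Sort concepts by P(label = 1), best first. scores: concept -> prob."""
    return sorted(scores, key=scores.get, reverse=True)


print(build_pair_tokens("胆囊切除", "胆囊切除术"))
print(rank_candidates({"胆囊切除术": 0.93, "胆囊造口术": 0.12}))
```

In practice, BERT tokenizers usually also append a trailing [SEP] after the second segment; the sequence above follows the form given in the text.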

Siamese-BERT ranking model
The Siamese neural network architecture [22], with two towers sharing weights and a distance function at the last layer, has been effective in learning similarities in domains such as text [23] and images [24] by modeling the similarity directly based on pairs of inputs. Siamese networks lend themselves well to the semantic invariance phenomena present in entity normalization. Recently, Fakhraei et al. [25] developed a solution based on a deep Siamese neural network model (Siamese Bi-LSTM) to embed the semantic information about the entities and empirically showed the effectiveness of these embeddings in bio-entity normalization datasets. With BERT, researchers have started to input individual sentences into BERT and derive fixed-size sentence embeddings. The most commonly used approach is to average the BERT output layer (known as BERT embeddings) or use the output of the first token [CLS] [26][27][28]; however, Reimers and Gurevych's [29] work shows that these common practices yield rather poor sentence embeddings. They proposed a modification of the pretrained BERT network that uses Siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
As shown in Figure 5(b), a Siamese-BERT network was built in this study based on the work of Reimers and Gurevych [29] to generate sentence embeddings independently for the entity mention and candidate concepts; then, they were concatenated as the input of the classification function. In the training process, candidate mapping pairs with a class label, expressed as $\langle e, c, y \rangle$, were fed to the Siamese-BERT network; they were composed of mapping pairs ($y = 1$) and other non-mapping pairs ($y = 0$). The aim of training is to minimize the distance in the embedding space between positive examples and maximize the distance between negative examples. We fine-tuned BERT to update the weights and produced sentence embeddings $u$ and $v$, and as in Reimers and Gurevych's work, a pooling operation was added to the output of BERT to derive the fixed-size sentence embedding. For each candidate mapping pair $(e, c)$, we concatenated the sentence embeddings $u$ and $v$ with the element-wise difference $|u - v|$ and multiplied it with the trainable weight $W_t \in \mathbb{R}^{3n \times k}$, where $n$ is the dimension of the sentence embedding and $k$ is the number of classifier labels:

$$o = W_t^{\top} (u, v, |u - v|)$$

where $o$ is a vector of dimension $k \times 1$. Then, we computed the probability of each classifier label using the softmax function. Finally, as with the BERT-based model, we computed the probability of label 1 and found the top-ranking candidate concept $c^*$; the loss function of the network was set as the categorical softmax loss:

$$L = -\sum_{i=1}^{k} t_i \log(p_i)$$

where $p_i$ is the predicted probability that the sample belongs to the $i$-th classifier label, and $t_i$ is the target probability the network should produce. This function makes the loss smaller when the predicted probability is close to the target probability, and larger when it is far away from the target probability.

Figure 5. Structure of base ranking models.
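The classification head of the bi-encoder can be sketched as follows, again leaving out BERT itself. The sketch assumes mean pooling over token vectors (the text only says "a pooling operation"), takes `W_t` as a 3n x k nested list, and uses hypothetical function names.

```python
import math


def mean_pool(token_vectors):
    """Average token vectors into one fixed-size sentence embedding."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def pair_features(u, v):
    """Concatenate u, v and the element-wise absolute difference |u - v|."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def classify(u, v, W_t):
    """softmax(W_t^T [u; v; |u - v|]) with W_t of shape 3n x k."""
    feats = pair_features(u, v)
    k = len(W_t[0])
    logits = [sum(W_t[i][j] * feats[i] for i in range(len(feats)))
              for j in range(k)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Categorical loss: -sum_i t_i * log(p_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


u = mean_pool([[1.0, 2.0], [3.0, 4.0]])
v = [2.0, 1.0]
W_t = [[0.0, 0.0] for _ in range(6)]  # untrained weights, n = 2, k = 2
probs = classify(u, v, W_t)
print(probs, cross_entropy(probs, [0, 1]))
```

Because `u` and `v` are computed independently, concept embeddings can be cached once and reused for every mention, which is the main practical difference from the cross-encoder.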

Stacking-BERT model
Stacking is an effective ensemble learning method for classification problems; it generally uses several base classifier models to produce outputs, which are later used as features for the next stacking layer [30]. This paper presents a stacking-BERT model including two layers. Stacking models usually use several complex models for the base classifiers and a simpler combined model for the final model. Because pretrained models adopt different language models, feature representations, network structures, corpora and adjustment strategies, they learn different prior knowledge and perform differently in downstream tasks. In order to combine the characteristics of different pretrained models, we trained the two models introduced in the last section with the pretrained models BERT-base-chinese [31] and RoBERTa-large-pair [32] to generate four ranking models, i.e., the BERT-based model, RoBERTa-based model, Siamese-BERT model and Siamese-RoBERTa model. They produced the probability of label 1 for each input sentence pair; then, these probability values were used as input to the logistic regression model in the final layer. The algorithm of the stacking-BERT model is shown in Table 2. In particular, we used 5-fold cross-validation in the training process of each base ranking model.
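The stacking procedure can be sketched with toy base models. This is not the algorithm of Table 2, which is not reproduced here; it is a generic illustration that assumes out-of-fold predictions from 5-fold cross-validation are used as meta-features, that the base models are refit on all data for inference, and that a small unregularized logistic regression trained by gradient descent stands in for the sklearn model. Each `fit` callable is a hypothetical placeholder for fine-tuning one BERT-style ranker.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Split indices 0..n-1 into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def out_of_fold_predictions(fit, X, y, k=5):
    """fit(X_train, y_train) returns predict(x) -> P(label = 1)."""
    oof = [0.0] * len(X)
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            oof[i] = predict(X[i])  # predicted by a model that never saw i
    return oof


def fit_logistic_regression(features, labels, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the logistic loss."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0

    def prob(x):
        return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

    for _ in range(epochs):
        errors = [prob(x) - t for x, t in zip(features, labels)]
        grad_w = [sum(e * x[j] for e, x in zip(errors, features)) / n
                  for j in range(dim)]
        grad_b = sum(errors) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return lambda x: 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))


def fit_stacking(base_fits, X, y, k=5):
    """Layer 1: base models; layer 2: logistic regression on their outputs."""
    columns = [out_of_fold_predictions(fit, X, y, k) for fit in base_fits]
    meta_features = [list(row) for row in zip(*columns)]
    meta = fit_logistic_regression(meta_features, y)
    base_models = [fit(X, y) for fit in base_fits]  # refit on all data
    return lambda x: meta([model(x) for model in base_models])


# Toy stand-ins for the four rankers: each ignores training and maps x to a score.
X = [i / 19 for i in range(20)]
y = [1 if x > 0.5 else 0 for x in X]
toy_fits = [lambda Xt, yt: (lambda x: x), lambda Xt, yt: (lambda x: x * x)]
model = fit_stacking(toy_fits, X, y)
print(model(0.9), model(0.1))
```

Using out-of-fold predictions keeps the meta-learner from being trained on scores that the base models produced for examples they had already seen.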

Experimental settings
In this study, we built the experimental environment using the PyTorch 1.6 framework and used the Transformers library to load the pretrained models. The training set described in Section 2.2 was used to fine-tune the stacking-BERT model, wherein most model hyperparameters were the same as those saved in the pretrained model; we set the batch_size to 32 and fixed the max_sequence length at 128. In order to get the best result, we tried learning rates of 1e-5, 2e-5 and 5e-5 for each model in the training process and tuned the number of training epochs from 1 to 10; finally, we kept the best result for each model. The final hyperparameters of the four base ranking models are shown in Table 3. For the logistic regression model, we used the default parameters of sklearn [33].

The authors of [34] extended these techniques to text classification tasks and sequence models by applying perturbations to the word embeddings in a recurrent neural network; the proposed method achieved state-of-the-art results on multiple benchmark semi-supervised and purely supervised text classification tasks. Furthermore, Madry et al. [35] proposed the projected gradient descent (PGD) method to improve the construction of perturbations; their MNIST and CIFAR10 networks trained with PGD achieved good performance against a broad set of attacks.
To improve the robustness and generalization ability of the concept ranking models, we added adversarial training to the process of model training. Instead of interfering with the original input sample itself, adversarial training feeds adversarial samples to the model by adding small perturbations to the word vectors of the embedding layer. Generally, the optimization function of adversarial training can be represented as follows [35]:

$$\min_{\theta} \mathbb{E}_{(x, y) \sim D} \left[ \max_{r \in S} L(\theta, x + r, y) \right]$$

The $\max(\cdot)$ part means that we need to find a set of adversarial samples that maximize the loss in the sample space; the $\min(\cdot)$ part means that, when faced with the adversarial sample set of such a data distribution $D$, we should minimize the expected loss of the model on the adversarial sample set by updating the model parameters $\theta$, where $r$ denotes the perturbation on input $x$ and $S$ is the set of allowed perturbations.

PGD obtains adversarial examples by a multi-step variant of the fast gradient sign method (FGSM). With the initial word embedding $x_0 = x$, the perturbed data in the $t$-th step can be expressed as follows:

$$x_{t+1} = \Pi_{x + S}\left(x_t + \alpha \frac{g(x_t)}{\|g(x_t)\|_2}\right)$$

$$g(x_t) = \nabla_x L(\theta, x_t, y)$$

where $\Pi_{x+S}$, with $S = \{r \in \mathbb{R}^d : \|r\|_2 \le \epsilon\}$, denotes the projection of perturbations into the set $S$, $\alpha$ is the step size, $L$ is the loss function and $\nabla_x$ takes the partial derivatives with respect to $x$. The algorithm of PGD in the training process is described in Table 4.

Table 4. Adversarial training process for PGD.
Algorithm 1: Adversarial training process for PGD
Input: initial word embedding x of the input data, number of perturbation accumulation steps K
1. Compute the forward loss of x, then compute the grad by the backward pass, backing up the initial embedding;
2. for t in range(K): (t starts at 1)
3.     Compute the adversarial perturbation r by the grad of the embedding and add it to the current embedding, which is represented as x + r;
4.     if t is not the last step:
5.         Zero the grad, then compute the forward loss of x + r in Step 3, then compute the grad by the backward pass;
6.     else:
7.         Restore the grad of Step 1, then compute the last forward loss of x + r in Step 3, then compute the grad by the backward pass and add it to the grad of Step 1;
8. Restore the embedding to the value of Step 1;
9. Update the parameters according to the grad of Step 7.
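The perturbation update at the core of PGD can be sketched on plain vectors. This is only the attack loop (a normalized gradient step followed by projection onto an L2 ball of radius epsilon around the original embedding), not the full training procedure with gradient backup and restoration; `grad_fn`, `epsilon`, `alpha` and `steps` are assumed names and values, and a real implementation would operate on the embedding layer of the model inside the training framework.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def project(r, epsilon):
    """Project a perturbation r onto the L2 ball of radius epsilon."""
    norm = l2_norm(r)
    if norm > epsilon:
        return [x * epsilon / norm for x in r]
    return list(r)


def pgd_perturb(x0, grad_fn, epsilon=1.0, alpha=0.3, steps=3):
    """Return the adversarial embedding after `steps` PGD iterations.

    grad_fn(x) must return the gradient of the loss with respect to x.
    """
    x = list(x0)
    for _ in range(steps):
        g = grad_fn(x)
        norm = l2_norm(g)
        if norm == 0.0:
            break  # no ascent direction available
        # Gradient ascent step of size alpha along the normalized gradient.
        x = [xi + alpha * gi / norm for xi, gi in zip(x, g)]
        # Keep the accumulated perturbation inside the epsilon ball around x0.
        r = project([xi - x0i for xi, x0i in zip(x, x0)], epsilon)
        x = [x0i + ri for x0i, ri in zip(x0, r)]
    return x


# Toy loss L(x) = 0.5 * ||x||^2, whose gradient is x itself.
adv = pgd_perturb([1.0, 0.0], grad_fn=lambda x: list(x), epsilon=0.5)
print(adv)
```

In the toy run, the perturbation grows by alpha at each step until the projection caps its norm at epsilon.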

Evaluation metrics
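We evaluated the performance of different entity normalization algorithms in terms of the evaluation metrics provided by the CHIP2019 organizer [19]. For each original entity $e_i$, $i \in [1, N]$, which has been manually annotated to $m$ concepts in the test dataset, assume that the model outputs $n$ concepts for $e_i$, and let $G_i$ and $P_i$ be the sets of annotated and predicted concepts, respectively. The score of the model is calculated as

$$A_i = \frac{|P_i \cap G_i|}{\max(m, n)} \quad (9)$$

$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} A_i \quad (10)$$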
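A minimal sketch of this metric follows, assuming the per-entity score is the overlap between predicted and annotated concepts divided by the larger of the two set sizes, averaged over all entities. The function names are hypothetical.

```python
def entity_score(predicted, gold):
    """|P ∩ G| / max(|P|, |G|) for one entity."""
    p, g = set(predicted), set(gold)
    denominator = max(len(p), len(g))
    if denominator == 0:
        return 0.0
    return len(p & g) / denominator


def accuracy(predictions, golds):
    """Mean per-entity score over the test set."""
    scores = [entity_score(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)


predictions = [["腹腔镜胆囊切除术"], ["A", "B"]]
golds = [["腹腔镜胆囊切除术"], ["A"]]
print(accuracy(predictions, golds))
```

Dividing by the larger set size penalizes both missing concepts and extra predicted concepts for entities that map to several concepts.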

Comparisons with other different models
Several unsupervised and deep learning models were selected as baseline methods in this paper: • Metric_LCS [36] method. Longest common subsequence (LCS) finds the subsequences of two given sequences, which appear in the same order in the two sequences but need not be continuous; it is often used as the unsupervised method for text matching and to measure the literal similarity of strings. We used the Metric_LCS method to measure the literal similarity of entities and concepts, and then found the most similar concept as the standardized result.
• BM25 [20]. This is the most popular algorithm to calculate the query and document similarity score in the field of information indexing; we used the same method introduced in Section 2.2.3 and chose the top 1 candidate concept as the final result of this method.
• Bert-as-service [37]. The bert-as-service system uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, mapping a variable-length sentence to a fixed-length vector using the BERT model. We used the bert-as-service system to calculate the sentence vectors of all entities and concepts, and then used cosine similarity to find the best matching concept for each entity.
• CNN-ranking model [7]. It was the best deep learning-based system to date on both the ShARe/CLEF and NCBI datasets. Since the language of the data was different, we could not completely reconstruct the KBs that were used but not released in Li et al.'s work; we reimplemented the system on our data and used the same settings as described in their paper.
• Siamese Bi-LSTM model [24]. This model has significantly outperformed other models on web document retrieval tasks. Because the tasks and datasets are different, we reimplemented the system on our data and used the same settings as described in their paper.
• BIOSYN model [15]. The BIOSYN model outperformed previous state-of-the-art models on four biomedical entity normalization datasets having three different entity types (disease, chemical, adverse reaction). We used the same method of sparse representation and the same settings described in their paper. However, the BioBERT model was replaced with the BERT-base-chinese model because the BioBERT model was pretrained on an English corpus.
Table 5 shows the performance comparisons for different models. Compared with other methods, our stacking-BERT fusion clinical entity normalization system achieved the highest accuracy of 93.1% on the CHIP2019 test set. Overall, all deep learning methods achieved better results than the unsupervised methods. The BIOSYN model performed better than the other deep learning baselines. Among the three unsupervised models, the bert-as-service system performed best, as its accuracy was at least 10% higher than that of Metric_LCS. It can be seen that pretrained models based on large-scale corpora can play an important role in both supervised and unsupervised methods.
Table 5 (only partially recoverable): 62.57% (model name not recovered); bert-as-service (base-chinese) 71.33%; CNN-ranking model [9] 86.7%; Siamese Bi-LSTM [20] 85 (value truncated).

Comparisons of ensemble models
In order to verify the effectiveness of the stacking model proposed in this work, we compared it with different ensemble models. Two ensemble models named Voting-BERT-hard and Voting-BERT-soft were obtained by fusing four BERT-based classifiers with a hard voting mechanism and a soft voting mechanism, respectively [38]. From the performance shown in Table 6, we can find that 1) the ensemble model based on the stacking method performed better than the voting methods; 2) compared with the single BERT-based ranking model, multi-model fusion can achieve a better result; 3) each BERT-based ranking model achieved a good result, i.e., the accuracy of each model was above 90%, which shows that supervised models fine-tuned with domain data were significantly better than unsupervised models; 4) compared to the Siamese-BERT model with a twin-tower structure, the result of the BERT-based model was better; and 5) across the models with different structures, the pretrained model BERT-base-chinese achieved a better result than RoBERTa-large-pair, but the difference between the results was small.

Table 6 shows that, for our stacking-BERT model as well as for the other ranking models, adversarial training based on the PGD algorithm could effectively improve the performance of the model. When PGD adversarial training was added to the training process, the accuracy of the BERT-based model was even higher than that of the stacking model without PGD (92.05% vs. 91.73%).

Table 7 shows the results of our stacking-BERT model using different training data. D0 refers to the 8000 positive samples in the training data; D1, D2 and D3 respectively refer to the positive examples generated by the three methods introduced in Section 2.2.2. The results show that the negative examples generated by BM25 were much better than randomly selected ones; in the case of identical positive samples, the accuracy improved by 22.8%.
In addition, we validated the effects of different data augmentation methods through ablation experiments. By comparing the results of the four experiments, it can be seen that all three data augmentation methods played a role in improving the performance of the model. In particular, the positive samples supplemented from SNOMED CT were the most effective, as the accuracy stabilized at more than 92% when we used the supplementary positive samples.

Effect of candidate concept generation
As described in Section 2.3, we adopted a BM25 algorithm to generate candidate concepts. Figure 6 reports the number of candidates per entity and the rate of standard entity recall for the candidate sets that were constructed using two types of strategies. The line "(total)" means the recall of candidate concepts over the whole test set, while the line "(1 to 1)" only calculates the recall rate for samples in which one entity is linked to one concept. For the traditional information retrieval model BM25, to which we applied three indexes, the top 20 candidates were retrieved for each entity and a recall of 99.6% was obtained for one-to-one samples. When the number of candidate concepts was the same, the recall rate of the BM25 algorithm was higher than that of the bert-as-service system, which indicates that our method is more effective for candidate generation.
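The two recall curves can be computed with a short helper. The exact definition behind Figure 6 is not given, so this sketch assumes recall is the fraction of annotated concepts that appear among the top k candidates, with an option to restrict the computation to entities linked to exactly one concept. The function name and arguments are hypothetical.

```python
def recall_at_k(candidates, golds, k=20, one_to_one_only=False):
    """Fraction of annotated concepts found among the top-k candidates.

    candidates: one ranked candidate list per entity.
    golds: one list of annotated concepts per entity.
    """
    hits = total = 0
    for ranked, gold in zip(candidates, golds):
        if one_to_one_only and len(gold) != 1:
            continue  # the "(1 to 1)" line ignores multi-concept entities
        top = set(ranked[:k])
        hits += sum(1 for concept in gold if concept in top)
        total += len(gold)
    return hits / total if total else 0.0


candidates = [["a", "b", "c"], ["x", "y"]]
golds = [["b"], ["z", "x"]]
print(recall_at_k(candidates, golds, k=2))
print(recall_at_k(candidates, golds, k=2, one_to_one_only=True))
```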

Discussion
As the results show, the stacking-BERT model performed better than the other deep learning models. Stacking makes full use of the learning ability of the base classifiers and further improves classification without increasing the complexity of any single model or the amount of training data. Combining classification models with different structures and different pretrained models can produce better results. First, BERT can learn deeper semantic features through the transformer's multi-head attention mechanism. It is also pretrained with next sentence prediction as an objective alongside the masked language model. This design captures the relationship between sentences, which helps when the pretrained general representation is applied to text matching and similar tasks. Second, the pretrained BERT models were all trained on large-scale Chinese text corpora such as Wikipedia, so they have learned the grammatical features of Chinese words and phrases. Therefore, BERT-based models proved effective for the Chinese clinical entity normalization task.
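The difference between the voting baselines and stacking can be sketched on toy scores. In the sketch below, each base ranker outputs a probability per candidate concept; hard voting counts top-1 votes, soft voting averages probabilities, and stacking trains a meta-learner on the base outputs. The logistic-regression meta-learner, the learning rate, the number of epochs and all numbers are assumptions for illustration, not the configuration used in this work.

```python
import math
from collections import Counter


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def hard_vote(probs):
    # probs[m][c]: probability from base model m that candidate c is correct.
    votes = Counter(argmax(p) for p in probs)
    best = max(votes.values())
    return min(c for c, v in votes.items() if v == best)  # ties: lowest index


def soft_vote(probs):
    n_models = len(probs)
    mean = [sum(p[c] for p in probs) / n_models for c in range(len(probs[0]))]
    return argmax(mean)


def train_meta(features, labels, lr=0.5, epochs=500):
    # Logistic regression fitted by SGD on base-model probabilities
    # (hypothetical meta-learner; features[i] holds one pair's base outputs).
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def stack_predict(probs, w, b):
    # Score every candidate with the learned combination of base outputs.
    scores = [sum(w[m] * probs[m][c] for m in range(len(probs))) + b
              for c in range(len(probs[0]))]
    return argmax(scores)


# Three toy base models scoring three candidates for one mention.
probs = [[0.90, 0.05, 0.05],
         [0.40, 0.50, 0.10],
         [0.45, 0.50, 0.05]]

# Held-out (mention, candidate) pairs: base outputs and gold labels.
held_out_x = [[0.90, 0.4, 0.5], [0.10, 0.6, 0.5], [0.80, 0.3, 0.4],
              [0.20, 0.7, 0.6], [0.85, 0.6, 0.5], [0.15, 0.4, 0.5]]
held_out_y = [1, 0, 1, 0, 1, 0]
w, b = train_meta(held_out_x, held_out_y)
```

On these toy scores, hard voting picks candidate 1, while soft voting and the stacked meta-learner pick candidate 0, because the meta-learner learns from the held-out pairs to trust the first base model most.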
However, the differences among the four ranking models were not large, and the BERT-based models performed better than the Siamese-BERT models. The Siamese-BERT framework is not optimal for sentence-pair classification. It uses a bi-encoder that maps each sentence independently to a sentence embedding, and the classifier then takes the two embeddings and derives a label. In contrast, BERT uses a cross-encoder: both sentences are present in the input at the same time, so BERT can compare them directly when deriving the label, which gave much better classification results.
The quality of the training dataset was closely related to the results of the model. In particular, the generation of negative examples was very important. Negative examples generated by BM25 were hard samples for the model; from such hard, similar samples the model could learn finer differences, which improved its discrimination ability. Among the three data augmentation methods for positive samples, supplementing external clinical terminology from the same domain was the most effective. The transitive extension lets the model learn more similarity information, and the symmetric extension, which exchanges the positions of the two texts in a pair, changes the position encoding so that the model can observe the similarity of the two texts from different angles.
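The two pair-level augmentations discussed above can be sketched as follows. The sketch assumes that a symmetric extension swaps the two texts of a positive pair and that a transitive extension pairs two mentions that share the same standard concept; the exact definitions are given in Section 2.2.2, and the function names and toy pairs here are illustrative assumptions.

```python
from itertools import combinations


def symmetric_extension(pairs):
    # (mention, concept) -> (concept, mention): same label, swapped positions.
    return [(b, a) for a, b in pairs]


def transitive_extension(pairs):
    # Assumption: mentions that map to the same concept are positives of each other.
    by_concept = {}
    for mention, concept in pairs:
        group = by_concept.setdefault(concept, [])
        if mention not in group:
            group.append(mention)
    new_pairs = []
    for group in by_concept.values():
        new_pairs.extend(combinations(group, 2))
    return new_pairs


positives = [
    ("心脏起搏器植入术", "心脏起搏器置入术"),
    ("起搏器植入", "心脏起搏器置入术"),
    ("阑尾切除", "阑尾切除术"),
]
swapped = symmetric_extension(positives)
linked = transitive_extension(positives)
```

Here `swapped` contains the three original pairs with their texts exchanged, and `linked` contains the single new pair formed by the two mentions that share the concept "心脏起搏器置入术".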
Candidate concept generation needs to consider both the recall rate and the data scale. The BM25 retrieval method based on a triple index proposed in this paper proved to be simple and effective. However, it has a drawback: because the CHIP2019 dataset contains cases in which one entity is linked to multiple standard concepts, the candidate generation recall on the whole test set did not reach 100%, and in those cases the concept ranking model could not find the correct concept. Deep generative models will be considered in the future to improve the recall rate of candidate concept generation.
The original procedure terms in the data contained many concepts with high similarity and redundant components, which interfere with the model during training and prediction. PGD adversarial training can improve the robustness of the model to adversarial examples. However, the PGD algorithm increases training time, so it is not well suited to large-scale datasets.
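The inner loop of PGD adversarial training can be sketched on a toy model. The example below assumes a linear logistic classifier over a single embedding vector and an L2 constraint on the perturbation; the model, the radius `epsilon`, the step size `alpha` and the number of steps are assumptions for illustration, since the values used in this work are not stated in this section. Only the perturbation search is shown; in training, the loss on the perturbed embedding would also be backpropagated to update the model.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_norm(v):
    return math.sqrt(dot(v, v))


def loss(w, e, y):
    # Logistic loss of a linear model for a label y in {-1, +1}.
    return math.log1p(math.exp(-y * dot(w, e)))


def grad_wrt_input(w, e, y):
    # Gradient of the logistic loss with respect to the embedding e.
    z = y * dot(w, e)
    coeff = -y / (1.0 + math.exp(z))
    return [coeff * wi for wi in w]


def pgd_perturbation(w, e, y, epsilon=1.0, alpha=0.3, steps=3):
    delta = [0.0] * len(e)
    for _ in range(steps):
        perturbed = [ei + di for ei, di in zip(e, delta)]
        g = grad_wrt_input(w, perturbed, y)
        g_norm = l2_norm(g)
        if g_norm == 0.0:
            break
        # Ascent step along the normalized gradient to increase the loss.
        delta = [di + alpha * gi / g_norm for di, gi in zip(delta, g)]
        # Project back onto the L2 ball of radius epsilon.
        d_norm = l2_norm(delta)
        if d_norm > epsilon:
            delta = [di * epsilon / d_norm for di in delta]
    return delta


w = [1.0, -2.0, 0.5]   # toy model weights
e = [0.2, 0.1, -0.3]   # toy embedding of one input
y = 1
delta = pgd_perturbation(w, e, y)
adversarial = [ei + di for ei, di in zip(e, delta)]
```

The returned perturbation stays inside the `epsilon` ball and raises the loss on the perturbed embedding. Each PGD step needs an extra forward and backward pass, which is the source of the additional training time noted above.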
In our experiment, an entity in the test set may be linked to one or more concepts. The statistics show that our multi-model fusion system had a normalization accuracy of 96.48% for single mappings and 25.86% for multiple mappings. For the clinical entity standardization task of CHIP2019, the average score of all participating teams was 79.75%. The first-ranked team constructed a ranking system of implication scores based on BERT and applied the best fine-tuning to the quantity prediction module, achieving a final result of 94.83%; the final performance of our model was second only to that of the top team [19]. The analysis of the experimental results shows that our model needs to be improved in two respects. First, our model had poor ability to predict the number of concepts; using manual rules or a deep learning model to predict the number of concepts is a way to improve our method in the future. Second, although we handled common abbreviations in the data preprocessing stage, normalization of new entities with professional abbreviations was still not ideal. For example, our model normalized the entity mention "VVI 心脏起搏器植入术 (Cardiac pacemaker implantation)" to the concept "心脏起搏器置入术 (Cardiac pacemaker implantation)", but the correct concept is "单腔永久起搏器置入术" (single-chamber permanent pacemaker implantation). Solving this problem will depend on large medical knowledge bases (KBs).

Conclusions
A system that can automatically encode clinician-entered terms into ICD codes with high accuracy is of great importance to hospitals in China. It would not only save cost and time in clinical coding, but also improve the standardization of clinical data in China. In this paper, we proposed a stacking-BERT model for Chinese clinical entity normalization and investigated the effectiveness of different BERT models. Our experiments showed that BERT-based normalization models outperformed some similarity-based methods, and that the sentence-pair classification setup of the original BERT architecture combined with a Chinese pretrained model leads to satisfactory performance. In addition, we found that adversarial training and easy data augmentation (EDA) can effectively improve deep learning models trained on small samples. However, our study lacks in-depth mining of the characteristics of Chinese clinical entities, so we are exploring the use of HowNet Sense and Lattice Graph to calculate the similarity of clinical entities.