Semantic-enhanced graph neural network for named entity recognition in ancient Chinese books

Named entity recognition (NER) plays a crucial role in extracting and utilizing knowledge from ancient Chinese books. However, the challenges of ancient Chinese NER not only originate from linguistic features such as the use of single characters and short sentences but are also exacerbated by the scarcity of training data. Together, these factors limit the ability of deep learning models such as BERT-CRF to capture the semantic representation of ancient Chinese characters. In this paper, we explore the semantic enhancement of NER in ancient Chinese books through external knowledge. We propose a novel model based on graph neural networks that integrates two forms of external knowledge: dictionary-level and chapter-level information. Through the graph attention mechanism (GAT), this external knowledge is effectively incorporated into the model's input context. Our model is evaluated on the C-CLUE dataset, showing an improvement of 3.82% in F1 score over the baseline BAC-CRF model. It also achieves the best score among several state-of-the-art dictionary-augmented models.

Low-resource NER
The large amount of training data required by neural networks leads to degraded performance in low-resource NER domains such as biomedicine and social media. A number of recent techniques have been proposed to tackle low-resource NER, such as pre-trained embeddings 13, multi-task learning 14, multi-model approaches 15, and transfer learning 16.
In addition, numerous studies have shown that incorporating useful external knowledge, such as external dictionaries, can still achieve significant gains in NER tasks. Nie et al. 17 proposed a semantic expansion module that encodes dictionary information, assigning different weights to each word matched in the dictionary and effectively alleviating the data sparsity issue in social media texts. Diao et al. 18 enhanced the pre-trained language model BERT by adding an external encoder to handle an N-gram dictionary, thereby explicitly integrating lexical-level information. Liu et al. 19 integrated dictionary information into the lower layers of BERT, enabling BERT to learn deeper knowledge from the dictionary. This integration resulted in excellent performance in NER and various sequence labeling tasks. Nevertheless, adding dictionary information to the lower layers of pre-trained language models raises their complexity, which poses challenges for practical application. In contrast, our approach avoids embedding knowledge directly into the core of the pre-trained language model. Instead, we add a graph neural network layer, which streamlines the training process of the NER model.

Graph neural networks on NER
Numerous recent methods leveraging Graph Neural Networks (GNNs) have emerged for Named Entity Recognition (NER) [20][21][22]. Chen et al. 20 presented a Randomly Wired Graph Neural Network specifically designed for Chinese NER, pioneering a novel context encoder that automatically generates the wiring pattern of the graph network. Sui et al. 21 introduced collaborative Graph Attention Networks 23 to integrate self-matched lexical knowledge in Chinese NER, aiming to resolve ambiguities inherent in Chinese text. Wang et al. 22 proposed a Polymorphic Graph Attention Network, which uses SoftLexicon 24 to construct graphs between characters and matched words and to capture the correlations between them. Our work is closely related to these graph attention approaches: we use graph attention to integrate external knowledge efficiently into the model's input context.

NER in ancient Chinese
Existing research on NER in ancient Chinese has progressively shifted towards deep neural network models. Xu et al. 25 utilized basic neural network models such as BiLSTM-CRF and BERT to explore entity recognition within the "Fang Zhi Wu Chan" Yunnan volume corpus. Liu et al. 26 proposed a semi-supervised learning method combined with feature words for medicine NER in ancient texts of traditional Chinese medicine. Zhang et al. 27 explored NER in ancient Chinese wine texts using deep learning models, focusing in particular on BERT pre-trained models.
To enhance the representation of ancient texts, researchers have proposed several pre-trained language models based on ancient text corpora, such as SikuBERT 28, GuwenBERT (https://github.com/ethan-yt/guwenbert), and Bert-Ancient-Chinese (BAC) 9. Feng et al. 29 proposed a classical Chinese NER model based on Bert-Ancient-Chinese + Recurrent Long Short-Term Memory + Conditional Random Field (BAC+RLSTM+CRF) and tested it on the C-CLUE dataset. Ge Sijia 30 adopted SikuBERT as the pre-trained model to integrate the named entity recognition and sentence segmentation tasks. In this paper, we utilize Bert-Ancient-Chinese as the encoder and integrate external dictionary and chapter information to further enhance the representation of ancient texts.

Method
Figure 1 illustrates the main architecture of our model, which is partitioned into four layers: the input, the encoder, the GNN, and the output. In this section, we detail each component of the model and then describe the training and inference procedures.

Input
For the NER task, the input is an ancient Chinese sentence $S = \{c_1, c_2, \ldots, c_n\}$, where $c_i$ is the $i$-th character and $n$ is the length of the input sentence, and the output is the sequence of predicted entity labels $Y = \{y_1, \ldots, y_n\}$. In our model, besides the sentence S itself, we also introduce two types of knowledge as inputs: the lexical set $L_s$ matching the input S and the global chapter information P.

Acquisition of lexical sets
Integrating rich lexical information into models is a promising approach for enhancing named entity recognition in ancient Chinese. Given an input sentence S and a dictionary knowledge base, we use a full matching method to identify all potential lexical items of the sentence that appear in the dictionary, formulated as:

$$(w_1, w_2, \ldots, w_m) = \mathrm{Matching}(S), \tag{1}$$

where $w_i$ is a matching lexical item. In the example shown in Fig. 1, we extract five lexical items. The semantic information of these lexical items is incorporated into a graph neural network, thereby giving the model a deeper understanding of lexical knowledge.

Acquisition of chapter information
Ancient texts frequently feature short sentences and single-character words, which significantly increases the ambiguity of entities. It is commonly observed that an entity, when mentioned multiple times throughout a chapter, retains the same meaning. Chapter titles in ancient books reflect the themes and structure of the chapter.
To link input sentences within the same chapter, this paper traces the origin of each sentence in the Chinese Ancient Books Library (https://publish.ancientbooks.cn/docShuju/platform.jspx). In Fig. 1, take the sentence S "授宣武将军、管军千户 (Appointed as General of Martial Valor and in charge of a military unit of a thousand households)" as an example: by querying the website, we identify that it belongs to the chapter P: "元史>卷166>第53>罗璧 (Yuan History>Volume 166>Section 53>Luo Bi)". We then add this chapter information as an extension to the input sentence: S#P. Once encoded, the chapter embedding $h_p$ serves as a global node within the graph neural network, incorporating comprehensive global context.
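The two input-acquisition steps described in this section can be sketched as follows. This is only a minimal illustration under stated assumptions: "full matching" is read as enumerating every substring of at least two characters and keeping those found in the dictionary, and the function names, the `max_len` cap, and the toy lexicon are hypothetical rather than taken from the released code.

```python
def match_lexicon(sentence, lexicon, max_len=4):
    """Return every (start, end, word) whose surface form is in the lexicon.

    Assumption: all substrings of 2..max_len characters are tried, and
    overlapping matches are all kept (no disambiguation at this stage).
    """
    matches = []
    n = len(sentence)
    for start in range(n):
        for end in range(start + 2, min(n, start + max_len) + 1):
            word = sentence[start:end]
            if word in lexicon:
                matches.append((start, end, word))
    return matches


def attach_chapter(sentence, chapter, sep="#"):
    """Extend the sentence with its chapter title, written S#P in the text."""
    return f"{sentence}{sep}{chapter}"


sentence = "授宣武将军、管军千户"
toy_lexicon = {"宣武", "武将", "将军", "宣武将军", "千户"}  # hypothetical entries
matches = match_lexicon(sentence, toy_lexicon)
model_input = attach_chapter(sentence, "元史>卷166>第53>罗璧")
```

Because overlapping candidates such as "武将" are retained, the matcher deliberately over-generates; filtering is left to the attention mechanism applied later on the graph.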

Encoding
The model's input is composed of three elements: the input sentence S to be labeled, the associated prior chapter information P, and the matching lexical set $L_s$. As shown in Fig. 1, we concatenate the input sentence S with the chapter information P. The concatenated sequence S#P is then fed into a pre-trained language model (Bert-Ancient-Chinese, BAC) for encoding, treating P as the next sentence to S. Through the encoding process, the pre-trained language model incorporates the chapter information into the representation of the input sentence. The encoding is formulated as follows:

$$(h_1, h_2, \ldots, h_n, h_p) = \mathrm{BAC}(S \# P),$$

where $h_i$ is the contextualized embedding of character $c_i$ and $h_p$ is the embedding of the chapter title. Additionally, to incorporate rich external information, the matched lexical items are encoded using externally trained embeddings:

$$(e_1, e_2, \ldots, e_m) = \mathrm{Embedding}(w_1, w_2, \ldots, w_m),$$

where $m$ is the number of words in the matching lexical set. In the experiments, we utilize the vocabulary embeddings developed by Tencent (https://ai.tencent.com/ailab/nlp/en/download.html). Although these embeddings are primarily trained on modern texts, the semantics of modern words have evolved from classical literature. Integrating this evolved semantic information should therefore also enrich the semantic representation of ancient texts, which is validated in the experimental section.
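To make the "P as the next sentence to S" layout concrete, the sketch below builds a BERT-style sentence-pair input by hand. It assumes one token per character, which is a simplification (a real tokenizer may treat digits and symbols in a chapter title differently), and `build_pair_input` is a hypothetical helper rather than part of any library. How the title tokens are reduced to a single vector $h_p$ is not specified in the text, so it is not shown.

```python
def build_pair_input(sentence, chapter):
    """Lay out [CLS] S [SEP] P [SEP] with segment ids 0 for S and 1 for P."""
    tokens = ["[CLS]"] + list(sentence) + ["[SEP]"] + list(chapter) + ["[SEP]"]
    # Segment 0 covers [CLS], the sentence and the first [SEP];
    # segment 1 covers the chapter title and the final [SEP].
    segment_ids = [0] * (len(sentence) + 2) + [1] * (len(chapter) + 1)
    # Positions of the sentence characters inside the token sequence; only
    # these positions are later labeled by the CRF.
    char_positions = list(range(1, len(sentence) + 1))
    return tokens, segment_ids, char_positions


tokens, segment_ids, char_positions = build_pair_input("授宣武将军", "元史罗璧")
```

Keeping `char_positions` separate makes it easy to select the character embeddings while still letting the encoder attend to the chapter title.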

Graph construction
We construct a graph to integrate both the matching lexical items and the chapter information into the input sentence. The vertex set of the graph is made up of the Chinese characters in the input sentence, the lexical items extracted from it, and the chapter. For example, as shown in Fig. 1, the vertex set is $V = \{c_1, \ldots, c_n, w_1, \ldots, w_m, P\}$. To represent the edge set, we introduce an adjacency matrix, whose elements indicate whether pairs of vertices are adjacent in the graph. If a lexical item $i$ contains a Chinese character $j$, the edge $(i, j)$ is assigned a value of 1. Intuitively, character-lexical item pairs capture the semantic and boundary information of words in the sentence, which is helpful for named entity recognition. The chapter vertex P is connected to every character vertex; it acts as a global node and can thus infuse chapter information more effectively.
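The graph construction can be illustrated with a small NumPy sketch. The node ordering (characters, then the chapter node, then lexical items), the symmetric edges, and the self-loops are assumptions made for this example, and `build_adjacency` is a hypothetical helper; the text only states which pairs of vertices are linked.

```python
import numpy as np


def build_adjacency(n_chars, word_spans):
    """Build the adjacency matrix for characters, one global node, and words.

    Node layout (assumed): 0..n_chars-1 are characters, n_chars is the
    chapter (global) node, and n_chars+1.. are the matched lexical items.
    word_spans holds (start, end) character offsets, end exclusive.
    """
    size = n_chars + 1 + len(word_spans)
    adj = np.zeros((size, size), dtype=np.int8)
    g = n_chars
    adj[g, :n_chars] = 1          # global node links every character
    adj[:n_chars, g] = 1
    for k, (start, end) in enumerate(word_spans):
        w = n_chars + 1 + k
        adj[w, start:end] = 1     # word node links the characters it contains
        adj[start:end, w] = 1
    np.fill_diagonal(adj, 1)      # self-loops (assumption, common in GAT)
    return adj


# Ten characters and five matched spans, as in the running example.
adj = build_adjacency(10, [(1, 3), (1, 5), (2, 4), (3, 5), (8, 10)])
```

Note that word nodes and the global node are not linked to each other here, so lexical and chapter information only meet through the character nodes.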

Graph attention
The matching lexical set may include noise words, that is, words incorrectly segmented by the full-match word segmentation method. Such words do not contribute useful information and may even have a negative impact. For example, in Fig. 1, the lexical item "武将 (military general)" is noise, and integrating it would distort the semantic representation of the sentence. To alleviate the impact of noise words, we employ the Graph Attention Network (GAT) to model the graph. The initial input to the graph is the set of node features from the encoding layer, $NF^0 = \{h_1, h_2, \ldots, h_n, h_p, e_1, e_2, \ldots, e_m\}$, where $n$ is the number of character nodes of the input sentence, $m$ is the number of matching lexical items, and the total number of nodes is $N = n + m + 1$. In an $M$-layer GAT, the input of the $l$-th layer is a set of node features $NF^l = \{h_1^l, h_2^l, \ldots, h_N^l\}$ together with an adjacency matrix $A$, with $h_i^l \in \mathbb{R}^{F^l}$ and $A \in \mathbb{R}^{N \times N}$, where $F^l$ is the feature dimension at the $l$-th layer. The output of the $l$-th layer is a new set of node features, $NF^{l+1} = \{h_1^{l+1}, h_2^{l+1}, \ldots, h_N^{l+1}\}$, which serves as the input of the $(l+1)$-th layer. We employ multi-head attention to compute the node features. Specifically, the $l$-th GAT layer with $K$ independent attention heads can be formulated as:

$$h_i^{l+1} = \bigoplus_{k=1}^{K} \sigma\Big(\sum_{j \in N_i} \alpha_{ij}^{k} W^{k} h_j^{l}\Big),$$

where $\oplus$ denotes the concatenation operation, $\sigma$ is the ReLU (Rectified Linear Unit) non-linear activation function, $N_i$ is the neighborhood of node $i$ in the graph, $W^k$ is the weight matrix of the $k$-th head, and $\alpha_{ij}^k$ are the attention coefficients, which are computed by a single-layer feed-forward neural network. Note that the dimension of the output $h_i^{l+1}$ is $K \cdot F'$, where $F'$ is the output dimension of a single head. At the last layer, the final output features are computed using an averaging operation instead of concatenation:

$$h_i^{\mathrm{final}} = \sigma\Big(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N_i} \alpha_{ij}^{k} W^{k} h_j^{l}\Big),$$

where $\sigma$ is also the ReLU function. The output of all node features is denoted as $G$, where $G \in \mathbb{R}^{F_m \times (n+m+1)}$. We keep the first $n$ columns of this matrix and discard the last $m + 1$ columns, because only character representations are used to decode labels.
Finally, the input of the CRF layer is denoted as:

$$R = W_1 H + W_2\, G[:, 0{:}n],$$

where $H = (h_1, \ldots, h_n)$ collects the character representations from the encoding layer, $G[:, 0{:}n]$ denotes the first $n$ columns of $G$, and $W_1$ and $W_2$ are trainable matrices. The new representation $R$ of the sentence integrates the contextual information from the encoding layer with the semantic information of the lexical set and the chapter from the GNN layer.
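A single multi-head graph attention layer of the kind described above might look as follows in NumPy. This is a sketch, not the authors' implementation: the attention scores follow the standard GAT formulation (a LeakyReLU-scored linear map over the concatenated pair, then a softmax over each node's neighbors), which is an assumption because the text does not spell it out, and all parameter shapes and names are illustrative.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def gat_layer(h, adj, W, a, average=False):
    """One multi-head GAT layer.

    h: (N, F) node features; adj: (N, N) 0/1 adjacency with self-loops;
    W: (K, F, Fp) per-head projections; a: (K, 2*Fp) per-head attention vectors.
    Returns (N, K*Fp) with concatenated heads, or (N, Fp) if average=True.
    """
    K, _, Fp = W.shape
    heads = []
    for k in range(K):
        z = h @ W[k]                                  # (N, Fp) projected features
        src = z @ a[k, :Fp]                           # contribution of node i
        dst = z @ a[k, Fp:]                           # contribution of neighbor j
        e = leaky_relu(src[:, None] + dst[None, :])   # e[i, j] = score of j for i
        e = np.where(adj > 0, e, -np.inf)             # mask out non-neighbors
        e = e - e.max(axis=1, keepdims=True)          # stabilize the softmax
        alpha = np.exp(e)
        alpha = alpha / alpha.sum(axis=1, keepdims=True)
        heads.append(alpha @ z)                       # weighted neighbor sum
    if average:
        # Averaging variant, typically used at the last layer.
        return np.maximum(np.mean(heads, axis=0), 0.0)
    # Concatenation variant: ReLU per head, then join along the feature axis.
    return np.concatenate([np.maximum(x, 0.0) for x in heads], axis=1)
```

Every row of `adj` needs at least one non-zero entry (for example a self-loop); otherwise the masked softmax is undefined. A noise word can receive a small coefficient under this scheme because each character normalizes its attention over all of its neighbors.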

Decoding and training
A standard CRF layer is used to capture the dependencies between successive labels. Formally, we take the above $R = \{r_1, r_2, \ldots, r_n\}$ as the input to the CRF layer, and its output is the conditional probability of the gold tag sequence $y = \{l_1, l_2, \ldots, l_n\}$:

$$p(y \mid s) = \frac{\exp\Big(\sum_{i=1}^{n} \big(W_{\mathrm{CRF}}^{l_i} r_i + T_{\mathrm{CRF}}^{(l_{i-1}, l_i)}\big)\Big)}{\sum_{y'} \exp\Big(\sum_{i=1}^{n} \big(W_{\mathrm{CRF}}^{l'_i} r_i + T_{\mathrm{CRF}}^{(l'_{i-1}, l'_i)}\big)\Big)},$$

where $y'$ is a candidate label sequence with labels $l'_i$, $W_{\mathrm{CRF}}^{l_i}$ models the emission potential for the $i$-th character in the sentence, and $T_{\mathrm{CRF}}^{(l_{i-1}, l_i)}$ is the entry of the transition matrix storing the score of transferring from $l_{i-1}$ to $l_i$. The first-order Viterbi algorithm is used to find the highest-scoring label sequence during decoding. To train the model, we use the cross-entropy objective function. Given manually annotated training data $\{(s_1, y_1), (s_2, y_2), \ldots, (s_n, y_n)\}$, the loss function is defined as:

$$\mathcal{L} = -\sum_{i=1}^{n} \log p(y_i \mid s_i) + \frac{\lambda}{2} \lVert \theta \rVert^2,$$

where $\lambda$ denotes the $L_2$ regularization parameter and $\theta$ is the set of all trainable parameters.
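First-order Viterbi decoding over emission and transition scores can be sketched as below. The sketch assumes the emission scores have already been computed for every position and label, and it omits start and end transitions for brevity; the function names are illustrative.

```python
import numpy as np


def sequence_score(emissions, transitions, labels):
    """Unnormalized score of one label sequence (emission + transition terms)."""
    s = emissions[0, labels[0]]
    for i in range(1, len(labels)):
        s += transitions[labels[i - 1], labels[i]] + emissions[i, labels[i]]
    return float(s)


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence and its score.

    emissions: (n, L) score of each label at each position.
    transitions: (L, L), transitions[p, q] is the score of moving from p to q.
    """
    n, L = emissions.shape
    score = emissions[0].copy()                 # best score ending in each label
    backptr = np.zeros((n, L), dtype=int)
    for i in range(1, n):
        # cand[p, q]: best path ending in p at i-1, then moving to q at i.
        cand = score[:, None] + transitions + emissions[i][None, :]
        backptr[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):               # follow back-pointers
        best.append(int(backptr[i, best[-1]]))
    return best[::-1], float(score.max())
```

Dynamic programming keeps decoding at O(n L^2) instead of enumerating all L^n label sequences, which is what makes exact decoding practical at inference time.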

Experiments
In this section, we conduct a series of experiments to assess the effectiveness of our proposed method.

Dataset
We use the ancient Chinese NER dataset C-CLUE 29 to evaluate our proposed model. The C-CLUE dataset is drawn from the Twenty-Four Histories and was annotated through a crowdsourcing system. The initial dataset (https://github.com/jizijing/C-CLUE) contains six entity types, but two of them have too few instances, so this paper excludes them. We use the official split of the dataset for training, validation, and testing; the statistics are shown in Table 1.

Metric
Following the standard setting, we evaluate the methods using the micro-averaged F1 score and also report precision (P) and recall (R), all in percentages. The measures are computed as follows:

$$P = \frac{T_p}{T_p + F_p}, \qquad R = \frac{T_p}{T_p + F_n}, \qquad F1 = \frac{2 \times P \times R}{P + R},$$

where $T_p$ represents the number of entities correctly identified by the model, $F_p$ represents the number of entities incorrectly identified by the model, and $F_n$ represents the number of entities that the model fails to identify. All experiments were conducted 5 times with random seeds, and the average performance is used for reporting and analysis. We conducted all experiments, for both the baselines and our proposed method, on an Nvidia RTX 3090 GPU with a Linux operating system (Ubuntu 22.04) and 96 GB of system memory.
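The micro-averaged metrics can be computed by pooling entity counts over all sentences before taking ratios. The sketch below assumes each entity is represented as a `(start, end, type)` tuple and counts a prediction as correct only on an exact match of boundaries and type; this representation and the helper name are choices made for the example.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall and F1 over entity tuples.

    gold, pred: lists (one item per sentence) of sets of (start, end, type).
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # exact matches
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = [{(0, 2, "PER"), (3, 5, "LOC")}, {(1, 2, "OFI")}]
pred = [{(0, 2, "PER"), (3, 4, "LOC")}, set()]
precision, recall, f1 = micro_prf(gold, pred)
```

In this toy case one of two predictions is correct and one of three gold entities is found, so a boundary error such as `(3, 4, "LOC")` is penalized in both precision and recall.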

Hyper-parameters settings
In the training process, the hyper-parameters were tuned on the corresponding development sets. We optimized our model with stochastic gradient descent (SGD) following Cui and Zhang 31; the learning rate is set to $2 \times 10^{-5}$ and the batch size is 32. Training stopped when the results of five consecutive validations were not better than the previous best record. Table 2 shows the hyper-parameters used in our experiments.
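The stopping rule can be expressed as a small patience counter. The class below is a hypothetical helper, assuming the monitored quantity is a validation score where higher is better (for example F1) and that a tie with the best record counts as no improvement.

```python
class EarlyStopping:
    """Stop once `patience` consecutive validations fail to beat the best score."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_rounds = 0

    def step(self, score):
        """Record one validation score; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience


stopper = EarlyStopping(patience=5)
decisions = [stopper.step(s) for s in [70.0, 71.0, 70.5, 70.9, 71.0, 70.0, 69.0]]
```

In the toy run the best score is reached at the second validation, and the fifth non-improving validation after it triggers the stop.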

Baselines
To comprehensively demonstrate our model's effectiveness, we conduct experiments with two groups of baselines, divided by whether the model utilizes a dictionary; all of them are state-of-the-art NER models. The first group consists of benchmark models that do not utilize any external information, listed as follows.
• CNN-CRF 32 adopted a standard convolutional neural network (CNN) encoder on the character sequence, which captures multi-gram features, and used a CRF for decoding.
• BiLSTM-CRF 33 used bidirectional LSTMs for encoding and a CRF for decoding.
• TENER 34 utilized the Transformer encoder to model the long-range and complicated interactions within a sentence.
The second group concentrates on integrating dictionary data into the models, listed as follows.
• LR-CNN 35 incorporated lexicons into CNN-based NER, leveraging high-level semantics to identify the correct words.
• Lattice-LSTM 36 encoded both individual input characters and possible words using gated recurrent cells, which helps reduce segmentation errors.
• NFLAT 37 introduced an adapted Transformer encoder with a non-flat-lattice structure, which separates lexical fusion from the encoding of contextual features. It lowers the computational demands of processing both character-word and word-word interactions during self-attention.
Note that we reproduced all baseline models in the experiments. To maintain consistency and fairness in comparison, all baselines use the same experimental machine environment as our model. Each experiment was also conducted 5 times with random seeds, and the average performance is used for reporting and analysis. Furthermore, the static word vectors in the baselines and in our model are the Tencent Chinese vocabulary embeddings, which offer 200-dimensional representations for more than 12 million Chinese words and phrases. These embeddings are derived from extensive pre-training on large-scale corpora.

Results
Table 3 shows the performance of our method against the baselines. Firstly, models that incorporate dictionary information outperform their counterparts that do not. For example, compared with BiLSTM-CRF, which does not utilize dictionary data, Lattice-LSTM with dictionary integration improves the F1-score by nearly 4.25%. This highlights the importance of dictionary data in strengthening NER models.
Further, comparing models that use a pre-trained language model against those that do not, the pre-trained models, namely BAC-CRF and BAC-GNN-CRF, outperform models that lack pre-training, such as TENER and NFLAT. This demonstrates that pre-trained language models, which leverage large-scale text datasets, acquire more nuanced contextual information, which significantly enhances their representation capabilities. The influence of pre-trained language models trained on various ancient corpora on our model's performance is further explored in the "Discussion" section.
Lastly, our proposed model, BAC-GNN-CRF, outperforms all others on all metrics. It surpasses the BAC-CRF model by 3.82% in F1 score. This shows that the two types of external knowledge introduced by our model through the GNN, namely dictionary and chapter information, are effective.

Ablation
To investigate the contribution of each component of our model, we conducted ablation experiments on the dataset; the results are shown in Table 4. Excluding the global node (-Global Node), which represents chapter information, from the graph structure led to a marginal decrease across all metrics, with precision dropping to 70.57%, recall to 78.69%, and the F1-score to 74.41%. A further decrease was observed when chapter information was removed from both the encoder and the GNN layer, which underscores the benefit of integrating chapter embeddings as global nodes within the GNN layer to make better use of chapter information.
Removing the Graph Neural Network (GNN) layer causes a more pronounced decline in performance, with precision, recall, and F1-score falling to 67.14%, 76.12%, and 71.34%, respectively. This significant degradation highlights the pivotal role of the GNN layer in the model's architecture. The -GNN&P setting excludes both the lexical and the chapter information and is equivalent to BAC-CRF; its performance drops across all metrics compared to the full model.
The ablation studies show that each component of the BAC-GNN-CRF model contributes to its performance. In particular, the GNN layer emerges as a critical element for precise entity recognition. The global node also helps maintain high scores across all metrics, showing its importance for integrating global chapter features. Finally, the best F1-score is achieved by the complete model configuration, which validates the synergistic effect of combining dictionary and chapter information with graph attention mechanisms for this NER task.

Discussion
Table 3 has illustrated the significant role of pre-trained language models in enhancing the performance of NER in ancient Chinese books. Researchers have proposed multiple ancient Chinese pre-trained language models, such as Roberta-Classical-Chinese (https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-base-char), GuwenBert-Base (https://github.com/ethan-yt/guwenbert), SikuBert (https://huggingface.co/SIKU-BERT/sikubert), and Bert-Ancient-Chinese (https://github.com/Jihuai-wpy/bert-ancient-chinese). Since these pre-trained models use different ancient text corpora, they have different impacts on the performance of our proposed model. Table 5 presents the performance of different pre-trained language models within our framework. While Bert-Base-Chinese (https://huggingface.co/bert-base-chinese) was developed with modern text, the other models were trained on various classical texts. We can see that language models pre-trained on classical Chinese greatly improve performance compared to Bert-Base-Chinese, confirming that pre-training on classical Chinese enhances the representation of ancient texts. Furthermore, among the models pre-trained on classical texts, Bert-Ancient-Chinese (BAC), which is used by our model, achieves the best performance. This could be attributed to the inclusion of the experimental dataset within the pre-training corpora.

Comparison on entity categories
Figure 2 shows a comparative analysis of the F1 score between our model and the baseline BAC-CRF across four entity categories: PER (Person), LOC (Location), OFI (Office title), and ORG (Organization). Our model outperforms the BAC-CRF in every category. This suggests that integrating GAT into the BAC-CRF framework helps to better capture the context and semantics of texts, which is beneficial for named entity recognition. The biggest improvement is in the ORG category, with an increase of 4.21% in F1-score. This could indicate that the BAC-GNN-CRF model is particularly good at handling the complexities associated with organizational names, which may be due to more effective disambiguation of entities facilitated by the integration of an external dictionary.

Performance against entity length
We also analyze NER performance relative to the length of entities. We group entity lengths into four categories: 1, 2, 3, and 4+ (i.e., ≥ 4). Figure 3 shows the F1-scores of both the BAC-CRF and our model. For single-character entities, both models perform worse than on two-character entities, which suggests that single-character entities in ancient Chinese texts carry a higher degree of ambiguity. Both models show a significant gain in F1-score on two-character entities, which likely offer more distinct contextual cues and are therefore easier to identify. However, as entity length extends beyond two characters, performance declines for both models. This trend highlights the increasing difficulty of delineating the boundaries of longer entities.
Our model consistently outperforms the BAC-CRF across all evaluated conditions, illustrating its capability in enhancing the representation of ancient Chinese text.This advantage is attributed to the strategic incorporation of external dictionaries and the chapter-level context.

Case study
We conduct a case study. Table 6 shows four cases from the test set. We can see that "纬 (Wei)" is correctly recognized by both the BAC-CRF and the BAC-GNN-CRF models in case 1, whereas in case 2 the BAC-CRF makes a mistake while our model still labels it correctly. The reason is that the two sentences originate from the same chapter, "新五代史>卷五十七-杂传第四十五>贾纬 (New History of the Five Dynasties>Volume 57-Miscellaneous Biography 45>Jia Wei)", which is integrated into our model as a global node and helps alleviate the ambiguity of the word "纬 (Wei)".
In case 3, the BAC-CRF erroneously split "刑部郎中 (Director in the Ministry of Punishment)" into two separate entities: "刑部 (the Ministry of Punishment)" and "郎中 (doctor)". However, as "刑部郎中" is found in the external dictionary as a lexical item, our model was able to identify it correctly. This illustrates the efficacy of external dictionaries in delineating entity boundaries more accurately. In contrast, since the term "镇 (Town)" frequently appears in geographic names, both models erroneously classified "方镇 (Fang Town)" as a location entity. Resolving this misclassification requires more contextual information and external knowledge. Case 4 provides an example of a noise word, and the associated analysis is shown in Fig. 4c.
Furthermore, we present three visual analyses of the graph attention module in Fig. 4. Figure 4a displays the attention coefficients related to the global node in case 2, showing that characters within a specific entity are assigned higher weights, while other commonly occurring characters receive lower weights. Figure 4b shows the attention weights for the entity "刑部郎中" in case 3, demonstrating that the semantics of the matching words are effectively distributed across the individual characters "刑", "部", "郎", and "中". This enables our model to accurately identify the entity "刑部郎中". Figure 4c demonstrates the effect of a noise word in case 4. The noise word "命中", which could potentially disrupt entity recognition, is assigned a low weight by the attention module, enabling our model to identify the entity correctly.

Conclusion
In this paper, we introduce a semantic-enhanced graph neural network model, BAC-GNN-CRF, specifically tailored for NER in ancient Chinese books. Our approach employs a graph-based framework to integrate two distinct forms of external knowledge: dictionary-level and chapter-level information. This integration is aimed at augmenting the semantic representation of the texts and mitigating ambiguity. Additionally, we leverage Graph Attention Networks (GAT) to diminish the influence of matching noise and to incorporate the external knowledge more effectively. The experimental evaluations conducted on the C-CLUE dataset confirm the effectiveness of our model, which shows substantial improvements and achieves the best score among several state-of-the-art dictionary-augmented models. In the future, we intend to broaden the scope of our model by integrating additional forms of external knowledge and by exploring more advanced machine learning architectures to further enhance NER capabilities for ancient Chinese literature. All code and resources are released at https://github.com/qtxcm/BAC-GNN-CRF.

Table 6. Four samples output by the BAC-CRF and the BAC-GNN-CRF. Each case in the Gold row is composed of the annotated sentence with its corresponding chapter title. The outputs of both models are the predicted sentences without the chapter information.

Figure 1. Architecture of our BAC-GNN-CRF NER model, illustrated with an example. The input consists of an ancient Chinese sentence ($\ldots, c_i, \ldots$) and its corresponding chapter title (P), concatenated with SEP. The encoder layer generates contextualized embeddings ($h_{c_i}$, $h_p$) for each character and the chapter title using the "Bert-Ancient-Chinese" model. In the GNN layer, the vertex set of the graph consists of the Chinese characters ($c_i$), the matching words ($w_i$), and the chapter (P, serving as the global node). The words are extracted from a dictionary. The global node links all character nodes, while the word nodes link their corresponding character nodes. The CRF layer produces the final output tag for each character in the input sentence. Each tag consists of a prefix and an entity type (e.g., B-OFI denotes the beginning of an Office entity). The prefixes B, I, E, and O stand for Begin, Inside, End, and Outside, respectively. The right part shows the process of incorporating matching words.

Figure 2. Performance comparison across various entity categories between BAC-CRF and BAC-GNN-CRF.

Figure 4. Visual analyses of graph attention. (a) The attention coefficients related to the global node in case 2. The darker the color, the greater the weight. (b) The attention coefficients for the entity "刑部郎中" in case 3. (c) The impact of the noise word in case 4. * All values are rounded to two decimal places.

Table 1. Statistics of the C-CLUE dataset.

Table 3. Main results. Results marked with an asterisk are based on our reproduction. Bold numbers denote the best value in each column.

Table 5. The performance of different pre-trained language models.