SciDeBERTa: Learning DeBERTa for Science and Technology Documents and Fine-tuning Information Extraction Tasks

Deep learning-based language models (LMs) have transcended the gold standard (human baseline) of the SQuAD 1.1 and GLUE benchmarks in April and July 2019, respectively. As of 2022, the top five LMs on the SuperGLUE benchmark leaderboard have exceeded the gold standard. Even people with good general knowledge struggle to solve problems in specialized fields such as medicine and artificial intelligence. Just as humans learn specialized knowledge through bachelor's, master's, and doctoral courses, LMs also require a process to develop the ability to understand domain-specific knowledge. Thus, this study proposes SciDeBERTa and SciDeBERTa (CS) as pretrained LMs (PLMs) specialized in the science and technology domain. We further pretrained DeBERTa, which was trained with a general corpus, with a science and technology domain corpus. Experiments verified that SciDeBERTa (CS), continually pretrained in the computer science domain, achieved 3.53% and 2.17% higher accuracy than SciBERT and S2ORC-SciBERT, respectively, which are PLMs specialized in the science and technology domain, in the task of recognizing entity names in the SciERC dataset. In the joint entity and relation extraction (JRE) task of the SciERC dataset, SciDeBERTa (CS) achieved a 6.7% higher performance than the baseline SCIIE. On the GENIA dataset, SciDeBERTa achieved the best performance among S2ORC-SciBERT, SciBERT, BERT, DeBERTa, and SciDeBERTa (CS). Furthermore, re-initialization techniques and optimizers proposed after Adam were explored during fine-tuning to verify the language understanding of PLMs.


I. INTRODUCTION
One of the recent views on language models (LMs) is that increasing the model size and training dataset size has a positive effect on downstream task performance [1]. Consequently, the size of LMs and the amount of training data are increasing exponentially. The model sizes of Megatron-Turing NLG [2], DeepMind's Gopher [3], and ERNIE 3.0 [4] are as large as 530, 280, and 175 B, respectively. Recently, Google's PaLM [5] reached a model size of 540 B.
On the standard SuperGLUE benchmark [6] for LMs, the five models SS-MOE [7], Turing NLR v5 [8], ERNIE 3.0 [4], T5+UDG [9], and DeBERTa [10] have exceeded the human baseline as of 2022. These five models have sizes of 269, 5.4, 10, 11, and 1.5 B, respectively. As the benchmark results show, such large LMs achieve successful results in most NLP tasks in the general domain. However, their performance is limited in specialized domains.
This study aims to develop a large-capacity LM that improves natural language understanding (NLU) performance in the science and technology domain. The most intuitive method for building domain-specific LMs is pretraining with domain-specific data. We initialize the pretrained LM (PLM) with parameters trained on the general domain and then train it continually on domain-specific data. SciBERT [11] and S2ORC-SciBERT [12] are representative PLMs specialized in science and technology based on BERT [13]. We base our work on DeBERTa to consider not only the global context of the input but also the local context. In addition, DeBERTa has the smallest model size among the top five models of the SuperGLUE leaderboard, and its training data size, 78 GB, is relatively small. Nevertheless, it has excellent performance relative to its model size and training data size. Further, we analyze fine-tuning optimization techniques for NLU tasks in the science and technology domain.
The contributions of this study are as follows:
• We present SciDeBERTa and SciDeBERTa (CS), PLMs specialized in the science and technology domain. Performance is improved by continually training DeBERTa [10] with a science and technology corpus; furthermore, the field-of-study bias is relatively low because training uses S2ORC, a dataset that covers diverse science and technology fields. SciDeBERTa (CS), trained by continual learning only with computer science (CS) domain abstracts, showed good performance on the information extraction tasks of SciERC, a dataset built from AI paper abstracts. SciDeBERTa, trained by continual learning with abstracts from all S2ORC domains, showed good performance on the information extraction task of the GENIA dataset.
• The following machine learning techniques were applied to PLM-based information extraction tasks in the science and technology domain to improve performance.
  - Re-initialization suppresses the parameter overfitting problem that occurs when the task data are much smaller than the pretraining data. For the base model size consisting of 12 layers, the best performance in DeBERTa is achieved when re-initializing the 10th layer. This differs from BERT, which shows the best performance when re-initializing the 12th layer. We analyze the cause of the performance improvement from parameter re-initialization both theoretically and experimentally.
  - We analyzed the most suitable optimizer for information extraction tasks in the science and technology domain by theoretically and experimentally comparing AdamW [14], AdamP [15], and RAdam [16], the latest optimizers proposed after Adam [17]. In the information extraction task, we confirm that RAdam without a learning rate scheduler is more effective than AdamW and AdamP with the slanted triangular scheduler.

This paper is organized as follows. In Section II, we briefly review the latest trends in PLMs, domain-specific PLMs, and fine-tuning techniques for NLU tasks. Section III describes the features of DeBERTa, which is the basis of SciDeBERTa, and the training data for the scientific domain-specific PLM. Section IV analyzes the re-initialization and optimization techniques used in the fine-tuning stage to improve NLU performance. In Section V, we report the experimental results of SciDeBERTa and the fine-tuning techniques introduced in Sections III and IV; a conclusion section follows.

II. RELATED WORK
In this section, we review the previous works on domain-specific PLMs and efficient fine-tuning techniques for NLU.

A. RECENT TRENDS OF TRANSFORMER MODELS
We summarize the previous works on transformer-based [18] advanced LMs and pretraining corpora for PLMs.
There are two mainstreams of transformer-based LMs: BERT [13], which consists of encoder blocks, and GPT [19], which consists of decoder blocks. BERT is specialized for NLU tasks such as sentence and word classification, whereas GPT is specialized for natural language generation. BERT uses two pretraining tasks: masked language modeling (MLM), which predicts randomly masked tokens in input sentences, and next sentence prediction (NSP), which predicts whether the second sentence follows the first.
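As a rough illustration of the MLM objective (a minimal sketch rather than BERT's actual data pipeline; it assumes PyTorch tensors of token ids and ignores special-token handling), the following function selects 15% of the positions and applies the 80/10/10 mask/random/keep rule before the model is asked to recover the original tokens.

import torch

def mask_for_mlm(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    # Select ~15% of positions; only these contribute to the MLM loss (-100 elsewhere).
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100
    masked_ids = input_ids.clone()
    # 80% of the selected positions become [MASK].
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    masked_ids[replace] = mask_token_id
    # Half of the remaining selected positions (10% overall) become a random token;
    # the rest stay unchanged.
    rand = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~replace
    masked_ids[rand] = torch.randint(vocab_size, (int(rand.sum()),))
    return masked_ids, labels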
BERT-based PLMs have been enhanced into various models with improved pretraining tasks. SpanBERT [20] improved MLM to predict spans instead of individual tokens, which are relatively easy to predict. StructBERT [21] and ERNIE 2.0 [22] changed the pretraining task to predict the order of several sentences instead of two. RoBERTa [19] separates documents and uses the doc-sentences setting to sample inputs from only one document, improving context representation by training on consecutive sentences within the same document.

B. LANGUAGE MODELING ON SCIENTIFIC TECHNOLOGY DOCUMENTS
General LMs based on transformers solve downstream tasks using two types of datasets in each training process. First, a large amount of unlabeled text data is pretrained by self-supervised learning, and the model acquires a universal language representation. The trained knowledge is transferred by fine-tuning a PLM to the target data. The target data is labeled task data, and its size is relatively small compared to the pretraining data.
In the pretraining step, a general domain corpus is used for extracting knowledge that can be generally useful in NLP tasks. BERT has been pretrained with 13 GB of plain text consisting of 800 and 2,500 M words from BookCorpus [23] and English Wikipedia, respectively. XLNet [24] and RoBERTa [19] optimized BERT based on the observation that it was underfitted. They trained a model for a considerably longer period with a larger batch size on more data.
A PLM shows improved task performance when the gap between the corpora used in pretraining and fine-tuning is small. The BookCorpus and English Wikipedia data used to pretrain BERT contain little noise (for example, few spelling mistakes) and use a formal writing style. Therefore, a PLM trained on these data shows good performance on most NLP task benchmarks and leaderboards with similar characteristics. However, these models have difficulty achieving good performance on social media conversations, product reviews, and community posts, which are noisy and informal. This is particularly true if the target domain includes technical terms that do not belong to the general language domain, as in financial, legal, biomedical, and scientific texts. Thus, TweetBERT, FinBERT, LegalBERT, BioBERT, PubMedBERT, and SciBERT have been researched as specialized LMs that pretrain BERT with a specific domain corpus instead of a general domain corpus [11], [25]–[29].
The pretraining data of SciBERT, the representative PLM in the science and technology field, is composed of 82% biomedical domain data and 18% CS domain data with 3.2 B tokens. The S2ORC [12] dataset, which was released later, collected data in a more balanced manner in more diverse fields of science and technology. Among the models specializing in the science domains, S2ORC-SciBERT, which has been pretrained with 16.4 B tokens, shows a better performance in processing tasks in the science and technology field than SciBERT.
BioBERT [28] and PubMedBERT [29], which are state-of-the-art models for biomedical NLP tasks, were both pretrained with biomedical domain text data collected from PubMed and PubMed Central (PMC). PubMed and PMC are databases developed and maintained by the National Library of Medicine. PubMed provides citations of biomedical journals together with abstracts, and PMC archives full-text articles. BioBERT uses a continual pretraining method that additionally pretrains the general BERT with biomedical articles, on the premise that the knowledge provided by a general domain LM would still be useful in biomedicine. In practice, continual pretraining works effectively with a small amount of domain-specific data. PubMedBERT considers that PubMed and PMC provide enough unlabeled biomedical text to pretrain an LM from scratch: 33 M abstracts in PubMed and 7.6 M articles in PMC. Hence, PubMedBERT randomly initializes all parameters of BERT and performs pretraining entirely with biomedical in-domain texts. PubMedBERT pretrained from scratch showed better performance on some biomedical NLP tasks than BioBERT.

C. FINE-TUNING OF LANGUAGE MODELING
Various studies have been conducted to improve the effectiveness of fine-tuning a PLM to downstream tasks in a specialized domain. The most intuitive and widely used solution is to use an optimization technique. BERT showed better performance than conventional LMs by using BERTAdam, a modified version of Adam [13]. BERTAdam plays the role of a warm-up for the learning rate by rescaling it for each epoch in place of the bias correction of Adam [17], enabling stable training output in the early stages of fine-tuning without a separate warm-up [30]. In a following study, Xiong et al. used the pre-layer normalization (pre-LN) neural network, which places the normalization layer of the transformer block ahead of the multi-head attention layer; the pre-LN structure showed stable training output in the early steps without a separate warm-up [31], [32]. However, bias correction plays the same role as warm-up in fine-tuning, and another study showed stable results by warming up at the beginning of training and training for a long time [33]. The adaptive model initialization (Admin) clearly indicated that the cause of the initial instability of the PLM is its high dependency on residual connections [34]. Thus, additional parameters that adapt to the variance of outputs were applied to reduce residual dependency; it was concluded that determining the optimal hyperparameters is more important than a warm-up.
One of the main issues in fine-tuning is suppressing the overfitting that results from the domain gap and the difference in size between the pretraining dataset of the PLM and the fine-tuning dataset. T. Zhang et al. introduced layer re-initialization as a method to mitigate this overfitting when fine-tuning a PLM, specifically BERT, whose parameters are fitted to the pretraining data [30].
This study focuses on the following optimizers: AdamW [14], RAdam [16], and AdamP [15], which improve the most widely used Adam optimizer in terms of L2 regularization, weight decay, and learning rate scheduling. The characteristics of these three optimizers are compared, and the optimal algorithm for fine-tuning a domain-specialized PLM is verified. Additionally, a re-initialization method to suppress overfitting, designed around the characteristics of the SciDeBERTa model, is explored theoretically and experimentally.

III. SCIDEBERTA: DEBERTA AND SCIENCE AND TECHNOLOGY DATASET
We introduce SciDeBERTa, a PLM based on DeBERTa for the science domain. Figure 1 shows the training process of SciDeBERTa. We base our work on DeBERTa-base. SciDeBERTa is first initialized with DeBERTa, which has been pretrained with a general domain corpus. Next, additional pretraining is performed with a dataset of the science domain. Finally, the PLM is fine-tuned with the domain task data to resolve downstream tasks.

A. DEBERTA
The input representation of BERT and RoBERTa [19] is the sum of the token and absolute position embeddings. DeBERTa instead uses as input a vector obtained by concatenating relative position embeddings to the token embeddings. The attention weight of DeBERTa is obtained by the disentangled attention mechanism; disentangled attention is calculated using the cross attention between the queries and keys of the content (token) embeddings and the relative position embeddings. This calculation of the cross-attention score reflects the relative distance between tokens and considers the dependency among them, a local context that is difficult to capture with the standard self-attention mechanism.
Similar to BERT and RoBERTa, DeBERTa also uses absolute positions to consider the global context. Conventional LMs sum the absolute position embedding into the input representation, but DeBERTa incorporates absolute position embeddings by adding them to the input of the last n transformer layers before the softmax layer, denoted as the enhanced mask decoder.
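To make the disentangled attention computation concrete, the following is a minimal single-head sketch, not DeBERTa's actual implementation: the projection matrices, variable names, and the relative-index tensor rel_idx (a LongTensor holding the bucketed distance δ(i, j)) are our assumptions, and details such as distance bucketing and multi-head splitting are omitted.

import torch

def disentangled_attention_scores(H, P_rel, Wq_c, Wk_c, Wq_r, Wk_r, rel_idx):
    # H: (L, d) content (token) representations; P_rel: (2k, d) relative position embeddings.
    # rel_idx: (L, L) LongTensor with rel_idx[i, j] = delta(i, j), the bucketed relative distance.
    Qc, Kc = H @ Wq_c, H @ Wk_c          # content projections
    Qr, Kr = P_rel @ Wq_r, P_rel @ Wk_r  # relative-position projections
    c2c = Qc @ Kc.T                               # content-to-content term
    c2p = torch.gather(Qc @ Kr.T, 1, rel_idx)     # content-to-position: Qc_i . Kr_{delta(i,j)}
    p2c = torch.gather(Kc @ Qr.T, 1, rel_idx).T   # position-to-content: Kc_j . Qr_{delta(j,i)}
    d = H.shape[-1]
    return (c2c + c2p + p2c) / (3 * d) ** 0.5     # scaled sum of the three terms

The three terms correspond to the content-to-content, content-to-position, and position-to-content attention described above; a position-to-position term is omitted, as in the DeBERTa paper.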

B. DATASET FOR PRETRAINING SCIDEBERTA MODEL
SciBERT was pretrained using a corpus consisting of 82% biomedical domain and 18% CS domain. In contrast, S2ORC-SciBERT was pretrained using a relatively balanced distribution of data in more diverse fields of study. Even though it was trained for a more general domain, S2ORC-SciBERT showed a performance similar to or better than SciBERT in biomedical and CS domain tasks. Based on these observations, SciDeBERTa uses data from all fields of study provided by S2ORC for additional pretraining so that it can be used for text mining in the general science domain.
S2ORC provides both abstracts and full texts of scientific papers. Unlike abstracts, full texts are not standardized and contain considerable noise. Hence, if full texts are to be used for training, the model needs to be trained with a larger amount of data for a longer time to sufficiently absorb the knowledge in the texts. PubMedBERT [29] reported that, given the same training time, a model pretrained with 12 GB of abstracts showed little difference in performance from a model trained with 128 GB of abstracts and full texts. With longer training, PubMedBERT trained with full texts showed a performance improvement on several downstream tasks. Nevertheless, SciDeBERTa uses only the abstracts of S2ORC in continual learning, because performance degradation may occur without sufficient data and training time, and field coverage should also be considered.
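A minimal sketch of this continual pretraining step is shown below, assuming the Hugging Face transformers and datasets libraries; the checkpoint name, the file s2orc_abstracts.txt (one abstract per line), and the hyperparameter values are placeholders rather than the exact setup reported in Table 2.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Initialize from the general-domain DeBERTa checkpoint, then continue MLM pretraining.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-base")

# Hypothetical plain-text dump of S2ORC abstracts, one abstract per line.
abstracts = load_dataset("text", data_files={"train": "s2orc_abstracts.txt"})["train"]
abstracts = abstracts.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                          batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="scideberta-ckpt", per_device_train_batch_size=32,
                         learning_rate=1e-4, max_steps=100_000, warmup_steps=10_000, fp16=True)
Trainer(model=model, args=args, train_dataset=abstracts, data_collator=collator).train()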

IV. FINE-TUNING WITH MULTIPLE INFORMATION EXTRACTION TASKS
GLUE and SuperGLUE are representative benchmarks of language understanding tasks. However, these benchmarks consist of the general language corpus; thus, they are unsuitable for use in the evaluation of LMs trained with science and technology documents. One method for evaluating a PLM trained with science and technology data is to extract the knowledge information conveyed by the sentences. In other words, named entity recognition (NER), relation extraction (RE), and co-reference (Coref) resolution are performed as tasks to extract knowledge information.
NER can be framed as a token (span) classification problem. There are two methods for relation extraction: sentence classification that extracts relations from entire sentences (we call this method RE), and joint entity recognition and relation extraction that extracts entities and their relations together (we call this method JRE). Coref resolution identifies co-reference information among major entities, including references to entities through synonyms. In this study, we evaluated the fine-tuning performance through the NER, RE, JRE, and Coref resolution tasks of the SciERC dataset [35] and the NER task of the GENIA dataset [36].
The three tasks NER, JRE, and Coref can each be performed individually. However, fine-tuning performance is improved by executing them simultaneously. DyGIE++ [37] generates spans after receiving the embeddings of a PLM as input, performs the three types of tasks with a weighted sum of their losses, and updates the span information according to the loss. It has been experimentally proven that the performance of the NER and JRE tasks improves when the span information is updated by propagating the Coref resolution information. This study verified the improved performance by exploring re-initialization and optimizer techniques on the NER, JRE, and Coref tasks.
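The joint training described above can be sketched as shared span representations feeding three task-specific scorers whose losses are combined with target-task-dependent weights. The module below is a simplified placeholder under our own naming, not the actual DyGIE++ architecture; the default weights mirror the (NER, JRE, Coref) = (0.2, 1.0, 1.0) ratio used later when JRE is the target task.

import torch
import torch.nn as nn

class JointIEHead(nn.Module):
    # Shared span vectors feed NER, relation, and coreference scorers; the losses
    # are combined with a weighted sum determined by the target task.
    def __init__(self, span_dim, n_entity_types, n_rel_types, loss_weights=(0.2, 1.0, 1.0)):
        super().__init__()
        self.ner_scorer = nn.Linear(span_dim, n_entity_types)
        self.rel_scorer = nn.Linear(2 * span_dim, n_rel_types)
        self.coref_scorer = nn.Linear(2 * span_dim, 1)
        self.loss_weights = loss_weights

    def forward(self, spans, ner_labels, rel_pairs, rel_labels, coref_pairs, coref_labels):
        ce, bce = nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss()
        ner_loss = ce(self.ner_scorer(spans), ner_labels)
        rel_repr = torch.cat([spans[rel_pairs[:, 0]], spans[rel_pairs[:, 1]]], dim=-1)
        rel_loss = ce(self.rel_scorer(rel_repr), rel_labels)
        coref_repr = torch.cat([spans[coref_pairs[:, 0]], spans[coref_pairs[:, 1]]], dim=-1)
        coref_loss = bce(self.coref_scorer(coref_repr).squeeze(-1), coref_labels)
        w = self.loss_weights
        return w[0] * ner_loss + w[1] * rel_loss + w[2] * coref_loss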

A. UPPER-LAYER RE-INITIALIZATION TO SOLVE OVERFITTING
In Section II-C, we mentioned re-initialization of training parameters as a method to suppress overfitting. To select which layers to re-initialize, we first reviewed previous studies on the layer-wise representation characteristics of LMs.
van Aken et al., applying a probing method to each layer of BERT, revealed that the layer immediately before the last largely determines task performance [38]. Clark et al. compared and analyzed the magnitude of the attention values of the layers for specific tokens [39]. The results showed that in the lower layers, the "CLS" token, which captures the overall features of the sentence, receives large attention values; in the middle layers, tokens such as "SEP" that separate two sentences receive large attention values; and in the top layers, frequent tokens such as "." and "," receive large attention values. The higher the layer, the more the features of the detailed and frequent tokens of the sentence are expressed. These findings indirectly indicate that a change in the data can become an overfitting factor for the upper layers. Re-initializing the top layers may therefore resolve overfitting, and our study verifies this through experiments.
Re-initialization is also beneficial for eliminating the accumulated noise of residual connections. One block of a transformer-based model is generally composed of a residual connection and a normalization layer after a multi-head attention layer [19]. Radford et al. successfully achieved performance improvement through a structural change from GPT [19] to GPT-2 [40]. GPT-2 adopts a pre-LN structure (normalization layer before each sublayer of the block) while increasing the number of layers compared to GPT and composing each block around residual connections. Moreover, additional layer normalization was applied after the last self-attention block, and the initialization of residual-path weights was scaled according to the depth at which residuals accumulate [40]. Composing blocks in the pre-LN structure and re-initializing high layers has the dual effect of resolving the cumulative errors of the model and resolving the mismatch in detailed tokens caused by the data difference between the pretraining and fine-tuning datasets.
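For reference, the pre-LN ordering mentioned above can be sketched as follows; this is a generic illustration with our own layer sizes and module choices, not the GPT-2 implementation.

import torch.nn as nn

class PreLNBlock(nn.Module):
    # Pre-LN transformer block: LayerNorm precedes each sublayer, and the residual
    # branch is left unnormalized, which stabilizes early training.
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.ff(self.ln2(x))                       # residual around feed-forward
        return x

# A post-LN block (as in the original transformer and BERT) instead computes
# x = ln1(x + attn(x, x, x)) followed by x = ln2(x + ff(x)).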
Combining the results of previous studies, the overfitting layers are concluded to be the upper layers of the repeatedly stacked neural network [30]. However, in the case of DeBERTa, the same relative positional embedding is stacked from the bottom layer up to the third layer from the top, and the last two layers apply absolute positional embedding in the same way as the existing BERT. Consequently, BERT and DeBERTa have different neural network characteristics. In the DeBERTa model, the layers to re-initialize against overfitting are contextually the top three layers, of which the last two can be interpreted as new layers. The experimental results also support this. In contrast to BERT, where re-initializing the last layer has a large effect on overfitting, DeBERTa benefits from re-initialization against overfitting when the third layer from the top is included.
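A minimal sketch of this upper-layer re-initialization is given below, assuming a Hugging Face DeBERTa encoder and Xavier-normal initialization (the scheme mentioned in the conclusion); the exact modules re-initialized in the actual experiments may differ, and the helper name is ours.

import torch.nn as nn
from transformers import AutoModel

def reinit_top_layers(model, k):
    # Re-initialize the top k encoder blocks so that fine-tuning starts from fresh
    # weights there, mitigating overfitting to the pretraining data.
    for layer in model.encoder.layer[-k:]:
        for module in layer.modules():
            if isinstance(module, nn.Linear):
                nn.init.xavier_normal_(module.weight)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)
            elif isinstance(module, nn.LayerNorm):
                nn.init.ones_(module.weight)
                nn.init.zeros_(module.bias)

model = AutoModel.from_pretrained("microsoft/deberta-base")
reinit_top_layers(model, k=3)  # e.g., include the 10th layer of the 12-layer base model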

B. EXPLORING OPTIMIZERS
The Adam optimizer was developed in 2015 by combining the advantages of the adaptive gradient optimizer (Adagrad) and the root mean square propagation optimizer (RMSProp) [17]. As in Adagrad, Adam increases the step size for rarely updated parameters and reduces the step size for frequently updated parameters to balance the movement, and, as in RMSProp, it applies an exponential weighted moving average (EMA) to give more weight to recent gradient values. The Adam algorithm is shown in Algorithm 1.

Algorithm 1 Adam [17]: Adaptive Moment Estimation
Input: α (learning rate); β1, β2; θ0 (initial parameters); f(θ) (objective); ϵ
Init: m0 ← 0 (1st moment); v0 ← 0 (2nd moment)
Output: optimized parameters θt
1: for t = 1, ... do
2:   g_t ← ∇θ f_t(θ_{t−1})
3:   m_t ← β1 · m_{t−1} + (1 − β1) · g_t
4:   v_t ← β2 · v_{t−1} + (1 − β2) · g_t^2
5:   m̂_t ← m_t / (1 − β1^t)
6:   v̂_t ← v_t / (1 − β2^t)
7:   θ_t ← θ_{t−1} − α · m̂_t / (√v̂_t + ϵ)
8: end for

Lines 3 to 6 compute the first moment estimate, the second moment estimate, the first moment bias correction, and the second moment bias correction, respectively. Expanding the update in line 7 gives (1),

Δθ_t ∝ mean(g) / √(mean(g²)),   (1)

that is, a step proportional to the ratio of the average gradient to the root of the average squared gradient, which is the characteristic inherited from Adagrad; the norm of this update is bounded as in (2),

|Δθ_t| ≤ α · (1 − β1) / √(1 − β2)  if (1 − β1) > √(1 − β2),  and |Δθ_t| ≤ α otherwise.   (2)

The advantage of the Adam algorithm is this scale-invariant bounded norm: even if the gradient grows, the step size remains bounded, and a stable optimization descent is possible.
However, compared to SGD, Adam generalizes more weakly: in SGD, L2 regularization coincides with weight decay, whereas in Adam the effective weight decay becomes smaller than what L2 regularization implies. Additionally, optimizers usually add acceleration in the learning direction through momentum, but if the momentum grows large while little learning has been done, training will not reach the convergence point stably.
To improve the Adam optimizer and achieve precise regularization, AdamW applies weight decay and L2 regularization separately [14]. Furthermore, when it is combined with a learning rate scheduler, improved performance can be obtained. For stable convergence, AdamP adjusts the momentum according to the normalization state [15]; however, training is slowed by approximately 8% owing to the additional normalization computation.
The RAdam algorithm is shown in Algorithm 2 [16]. RAdam approximates the EMA of Adam with the simple moving average (SMA), which is widely used in economics, and adjusts the rectification term, which acts as an adaptive learning rate, according to the degrees of freedom (DoF) ρ.

Algorithm 2 RAdam [16]: Rectified Adam
Input: α_t (step sizes); β1, β2; θ0 (initial parameters); f(θ) (objective); ϵ
Init: m0 ← 0; v0 ← 0; ρ∞ ← 2/(1 − β2) − 1
Output: optimized parameters θt
1: for t = 1, ... do
2:   g_t ← ∇θ f_t(θ_{t−1})
3:   v_t ← β2 · v_{t−1} + (1 − β2) · g_t^2
4:   m_t ← β1 · m_{t−1} + (1 − β1) · g_t
5:   m̂_t ← m_t / (1 − β1^t)
6:   ρ_t ← ρ∞ − 2 t β2^t / (1 − β2^t)
7:   if the variance is tractable, i.e., ρ_t > 4 then
8:     v̂_t ← √(v_t / (1 − β2^t))
9:     r_t ← √( (ρ_t − 4)(ρ_t − 2)ρ∞ / ((ρ∞ − 4)(ρ∞ − 2)ρ_t) )
10:    θ_t ← θ_{t−1} − α_t · r_t · m̂_t / (v̂_t + ϵ)
11:  else
12:    θ_t ← θ_{t−1} − α_t · m̂_t
13:  end if
14: end for

Line 6 of Algorithm 2 computes the DoF ρ_t, which is the length of the approximated SMA. If the variance is tractable according to the condition in line 7, the rectification term in line 9 adjusts the variance of the learning rate according to the DoF ρ. When the variance is not tractable, i.e., ρ_t ≤ 4, the parameters are updated without the adaptive term, as shown in line 12. RAdam has the advantage of enabling stable training by replacing warm-up, whose role is variance reduction, with the variance adjustment of the rectification term.
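To make the quantities in Algorithm 2 concrete, the following sketch (our own helper, following the published RAdam formulas) computes ρ∞, ρ_t, and the rectification term r_t for a given step.

import math

def radam_rectification(t, beta2):
    # rho_inf: maximum length of the approximated SMA; rho_t: its length at step t.
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t > 4.0:  # variance is tractable: use the rectified adaptive step
        r_t = math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                        / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        return True, r_t
    return False, None  # variance intractable: fall back to the momentum-only update

With the default β2 = 0.999, the variance becomes tractable after only a few steps, which is why no separate warm-up scheduler is needed; in practice, a packaged implementation such as torch.optim.RAdam can be used directly.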
In Section V, the experiments show that all three optimizers, AdamW [14], AdamP [15], and RAdam [16], yield more stable training results than Adam [17], with similar performance among the three. RAdam showed an additional advantage of stable training without being affected by the learning rate and the learning rate scheduler, which led to the best results.

C. DATASET FOR FINE-TUNING
The tasks for the experiments are the NER, JRE, and Coref of SciERC and NER and Coref of GENIA.
SciERC is composed of annotated entity names, relations, and cross-references for the abstracts of 500 papers from 12 AI conferences, following the ACL RD-TEC 2.0 annotation scheme [41]. The entity names and relations of the SciERC dataset are not general-purpose information about people and places, but elements that capture the knowledge structure of scientific documents, as shown in Table 1. SciERC is composed of 2,867 sentences; the total number of named entities is 8,094, and the total number of relations is 6,319. In other words, it is a relatively small dataset with fewer than 10,000 tagged entity names [35].

GENIA contains 2,000 abstracts from Medline articles [36]. It consists of a total of 18,545 sentences and 436,967 words. There are 96,293 entities, which are defined in a hierarchical structure with 35 fine-grained entity categories. For the ontology of GENIA, please refer to Kim's thesis [36]. The subcategories are collapsed into five single labels (DNA, RNA, protein, cell line, and cell type) as described in Table 1.

V. EXPERIMENTS

A. EXPERIMENTAL ENVIRONMENT AND IMPLEMENTATION DETAILS
We used one node connecting eight A100 GPUs via NVLink to pretrain the LM; the GPU RAM is 80 GB each, for a total of 640 GB. For fine-tuning, we configured an Intel Core i7 machine with an NVMe SSD, 64 GB RAM, and two RTX 2080 GPUs; the GPU RAM is 24 GB each, for a total of 48 GB. Table 2 shows the hyperparameters for pretraining SciDeBERTa with S2ORC. The hyperparameters for fine-tuning are shown in Table 3. We performed multi-task learning using DyGIE++ in fine-tuning. In SciERC, the loss-weight ratios of (NER, JRE, Coref) for each target task are (1.0, 0.0, 0.0), (0.2, 1.0, 1.0), and (0.5, 0.5, 1.0), respectively. In GENIA, only the target task differs, and the ratio of NER and Coref is fixed at (1.0, 1.0).
Re-initialization, a technique for generalization, can be expected to improve performance on small datasets, where overfitting is the main risk. This is the case for SciERC, a small task with fewer than 10,000 entities. However, there is almost no performance improvement for data of a certain size, such as GENIA. Therefore, we applied re-initialization only to the experiments on the SciERC tasks.

B. PERFORMANCE COMPARISON ACCORDING TO THE PRETRAINED MODEL AND DATASET
Table 4 and Table 5 show the pretraining settings and the performance comparison results of the PLMs specialized in the science and biomedical domains; the evaluation metric is the F1 score. In the NER task of SciERC, comparing the results of the base general-domain LMs, DeBERTa, which improved on BERT, outperformed BERT even on this scientific-domain task. Among the LMs specialized in science and technology, SciDeBERTa (CS) showed the best performance: 3.53% higher than SciBERT, 2.17% higher than S2ORC-SciBERT, and 0.3% higher than SciDeBERTa. Since SciDeBERTa (CS) was trained on abstracts of academic papers in the computer science domain, it showed better performance on the SciERC dataset than SciDeBERTa, which was trained with data from all science and technology fields. In the SciERC JRE task, SciDeBERTa (CS) showed a 6.7% higher performance than the baseline SCIIE. However, in the SciERC Coref task, DeBERTa showed the best performance, 1.9% higher than SciDeBERTa (CS). In the SciERC dataset, for all three tasks of NER, JRE, and Coref, SciDeBERTa showed slightly lower performance than SciDeBERTa (CS). However, for the NER and Coref tasks of the GENIA dataset, which is built from Medline articles, SciDeBERTa showed the best performance among the compared models. In the NER task of the GENIA dataset, SciDeBERTa showed a 0.9% higher performance than SciDeBERTa (CS) and a 2.1% higher performance than BERT. In the Coref task of the GENIA dataset, SciDeBERTa showed a 0.7% higher performance than SciDeBERTa (CS) and a 2.1% higher performance than BERT. As can be seen from Table 5, the above-mentioned terms recur in the Coref task as pronouns. For this task, DeBERTa outperformed BERT by 1.6% to 4.3%, indicating good performance. In the Coref task, DeBERTa, trained as a general language model, generally performs better than SciBERT, S2ORC-SciBERT, or SciDeBERTa, which were trained as science and technology models.

FIGURE 2: NER and RE task performance change by applying re-initialization to layer(s); the PLM is SciBERT [11].

C. PERFORMANCE COMPARISON BY RE-INITIALIZATION OF LAYERS
The models used in the experiments in Figure 2 and Figure 3 are all 12-layer base models; the experiments were performed five times, and the average value was taken in consideration of the variance from random initialization. The four types of tasks, NER, RE, JRE, and Coref, are used for fine-tuning on the SciERC dataset.

Figure 2 confirms that re-initializing the last (12th) layer is most suitable for the overfitting prevention effect of the layer re-initialization method in SciBERT. This layer-wise re-initialization result is consistent with the explanation in Section IV-A that more frequent words are expressed in the higher layers than the main flow of context. However, as Figure 3 shows, in the DeBERTa base PLM of 12 layers, the overfitting prevention effect of re-initialization appears at the 10th layer. As discussed in Section IV-A, DeBERTa uses the same relative positional embedding from the lowest to the 10th layer and uses absolute positional embedding for the 11th and 12th layers, whereas in BERT all 12 layers use absolute positional embedding.

FIGURE 3: NER, JRE, and Coref task performance change by applying re-initialization to layer(s); the PLM is DeBERTa [10] and the target task is JRE. The loss-weight ratio of (NER, JRE, Coref) is (0.2, 1.0, 1.0).

Figure 2 shows changes in the average F1 score for the NER and RE tasks. As depicted in Figure 2, the average F1 performance improved by approximately 0.99% through the re-initialization of the last layer. In the same way, an improvement of approximately 1.85% in the average F1 performance can be confirmed through the re-initialization of the last two layers for the RE task.

Figure 3 shows changes in the average F1 score for the NER, JRE, and Coref tasks using DyGIE++. When using the DyGIE++ model, the target task is set to JRE. The ratios of the NER, JRE, and Coref tasks when calculating the loss are 0.2, 1.0, and 1.0, respectively. In Figure 3, the F1 score comparison across re-initialization layers is based on the JRE task. Although re-initializing other layers is slightly more effective for NER and Coref, JRE is the target task and was therefore taken as the standard. When re-initialization was applied to the 10th layer (reinit10), compared to the case without re-initialization (base), the NER, JRE, and Coref F1 scores improved by 0.7%, 1.2%, and 2.4%, respectively.

FIGURE 4: IE multi-task performance comparison according to optimizer; the PLM is DeBERTa [10].

D. COMPARISON OF INFORMATION EXTRACTION PERFORMANCE CHANGES ACCORDING TO OPTIMIZERS
We compared the performance of four optimizers, including Adam, during fine-tuning with the DyGIE++ model, which performs multi-task learning, while applying re-initialization to the 10th layer of the DeBERTa PLM. All experiments were likewise performed five times, and the average value was taken in consideration of the variance from random initialization. Figure 4 shows the performance of NER, JRE, and Coref on the SciERC dataset. The weights for multi-task learning of NER, JRE, and Coref are 0.2, 1.0, and 1.0, respectively. For the target task head, we used two layers of dimension 150 with a dropout of 0.4. Table 3 shows the other hyperparameters. Figure 4 shows that RAdam without a scheduler achieves the best performance among the optimizers. In Figure 4, Adam, AdamW, and AdamP used the slanted triangular scheduler, and after RAdam, AdamP shows good performance. However, as previously mentioned in Section IV-B, AdamP has the disadvantage of slowing training by approximately 8%. For AdamW, when a polynomial decay learning rate scheduler with a five-epoch warm-up was applied instead of the slanted triangular scheduler, the same performance as AdamP was confirmed. As described in Figure 4, the performance of RAdam is higher than that of Adam, AdamW, and AdamP in all three tasks: NER, JRE, and Coref.

VI. CONCLUSION
This study proposed SciDeBERTa through training specialized in the science domain; the base model was DeBERTa, whose performance on general LM benchmarks has been proven. It was experimentally verified that SciDeBERTa and SciDeBERTa (CS) improve performance compared to SciBERT and S2ORC-SciBERT, the conventional PLMs specialized in the science and technology domain. In particular, SciDeBERTa (CS), trained on the CS domain among the science and technology domains, showed improved performance over SciDeBERTa, trained on various science and technology domains, on the SciERC dataset. Furthermore, Xavier initialization was used to re-initialize the upper layers of the PLM, and the RAdam optimizer was applied, as methods for resolving overfitting and for optimization, to improve LM performance when fine-tuning the PLM. For future research, additional tagging of the SciERC dataset or data augmentation is necessary, considering that the current tagging of the SciERC dataset is insufficient.

VII. ACKNOWLEDGEMENT
This work was supported by the National Supercomputing Center with computing resources and technical support.
YUNA JEONG received her B.S. degree in computer engineering at Korea Polytechnic University (2012) and Ph.D. degree in computer engineering at Sungkyunkwan University (2019). She is a senior researcher in the AI Technology Research Center at the Korea Institute of Science and Technology Information (KISTI). Her main research interests include machine learning, deep learning, and natural language processing.