Hypert: hypernymy-aware BERT with Hearst pattern exploitation for hypernym discovery

Hypernym discovery is challenging because it aims to find suitable hypernyms for a given hyponym from a predefined hypernym vocabulary. Existing hypernym discovery methods used supervised learning with word embeddings from word2vec. However, word2vec embeddings are of low quality for unseen or rare noun phrases because an entire noun phrase is embedded into a single vector. Recently, prompting methods have attempted to find hypernyms using pretrained language models with masked prompts. Although language models alleviate the problem of word2vec embeddings, general-purpose language models are ineffective for capturing hypernym relationships. Considering the hypernym relationship to be a linguistic domain, we introduce Hypert, which is further pretrained using masked language modeling with Hearst pattern sentences. To the best of our knowledge, this is the first such attempt in the hypernym discovery field. We also present a fine-tuning strategy for training Hypert with special input prompts for the hypernym discovery task. The proposed method outperformed the comparison methods and achieved statistically significant results in three subtasks of hypernym discovery. Additionally, we demonstrate the effectiveness of the proposed components through an in-depth analysis. The code is available at: https://github.com/Gun1Yun/Hypert.


Introduction
Hypernymy denotes a semantic relationship characterized by a hierarchical connection between an abstract term and its subordinate instances. To illustrate, when asked to enumerate examples of a vehicle, one readily thinks of an automobile, a watercraft, and an aircraft. These entities are specific manifestations of the broader class vehicle, which makes vehicle the hypernym and the entities mentioned above its hyponyms.
The hypernym relation holds significant importance in natural language processing (NLP). This semantic connection plays a crucial role in diverse NLP tasks, including question answering, ontology construction, textual entailment, and lexicon augmentation [1][2][3]. To facilitate these tasks, a large lexical database, WordNet, was introduced, which delineates semantic relations among words. However, constructing such a resource through manual human effort is labor-intensive and time-consuming. Consequently, numerous studies have endeavored to automatically extract hypernym relationships from corpora [4][5][6][7].
Hypernym discovery entails identifying all possible hypernyms for a given query within the vocabulary [8]. Recently, hypernym discovery studies have used word embeddings to capture the meaning of and relationship between words. These studies can be further categorized into two classes: word2vec embedding methods [9][10][11][12] and prompting methods [13][14][15][16]. The word2vec embeddings rely on the distributional hypothesis to map the meaning of a word into a vector space [17]. However, each noun phrase is mapped to a single vector, and thus noun phrases that are rare in the corpus can be poorly embedded.
The prompting method presents a potential solution to alleviate this issue, employing pretrained language models (PLMs) with a subword tokenization algorithm [18]. The prompt is a sentence consisting of a query token x and a [MASK] token (e.g., "A/An x is a [MASK]") [13]. Despite the promise of PLMs for handling infrequent terms, certain limitations persist. Only one token can be predicted from a prompt with a single [MASK] token, even though the gold hypernyms can consist of multiple words. Additionally, PLMs are not inherently designed to discern hypernym relationships. Hence, additional rounds of pretraining followed by fine-tuning are often employed to tailor a PLM to the demands of domain-specific tasks [19,20].
Within the realm of hypernym discovery, the application of these processes (i.e., further pretraining and fine-tuning) remains unexplored. An inherent shortcoming of existing hypernym discovery studies is their limited efficacy in capturing the semantics and hypernymic relationships of noun phrases. Approaches grounded in word2vec are vulnerable when confronted with noun phrases because each phrase is collapsed into a single vector. Even though prompting methods offer potential relief, general-purpose PLMs are trained on broad-spectrum sentences and are therefore ill-suited to capturing hypernymic information.
To address the limitations of existing approaches, we propose Hypert, a hypernymy-aware pretrained language model for hypernym discovery that harnesses Hearst pattern sentences. The method aligns with established domain adaptation practices for PLMs. Specifically, the PLM is further trained on a corpus of sentences matching Hearst patterns, thereby heightening its sensitivity to hypernymic constructs. The Hearst pattern, though simple, rests on a solid foundation and facilitates the construction of a corpus rich in hypernym relationships [21]. More specifically, we employ the extended Hearst patterns, which embody the core concept of the original Hearst patterns. Then, the further pretrained model is fine-tuned on a dataset for the hypernym discovery task using a specialized input prompt. By strengthening the hypernym awareness of the language model through further pretraining, our approach aims to improve hypernym discovery performance. The effectiveness of the proposed method is assessed through a comparison with conventional word2vec-based methods and the prompting approach on the SemEval-2018 Task 9 dataset [8]. The experimental results show that the proposed method significantly outperforms the comparison methods. Furthermore, an in-depth analysis is conducted to evaluate the effectiveness of the individual components of the proposed method. We also present and analyze the distribution of the Hearst patterns used in the corpus. Additionally, our investigation shows that further pretraining improves on BERT [22]. To examine the robustness of our method, we compare prediction outcomes for rare words. Lastly, t-distributed stochastic neighbor embedding (tSNE) plots are presented to visualize the representation space of the classification token ([CLS]) embedding in the context of hypernymy.
The contributions of this work are summarized as follows:
• We propose a further pretraining method that enhances the hypernymy awareness of language models by denoising Hearst pattern sentences from a corpus, together with a special input prompt for fine-tuning, and achieve the best performance among the comparison methods.
• An in-depth analysis is conducted regarding the further pretraining. The results reveal that the proposed further pretraining method improves performance compared to BERT.
• The proposed method addresses the low embedding quality of noun phrases that rarely appear, which is demonstrated through a case study.

Related work
In traditional hypernym discovery and detection studies, pattern-based methods have been used to identify hypernym-hyponym pairs from a corpus. In the work of [1], the author defined lexico-syntactic patterns, called Hearst patterns, which can automatically filter hypernym-hyponym pairs from large corpora. For instance, a pattern like "y such as x" indicates that y is a hypernym of x. Because such pattern-based syntactic relation extraction is a good starting point, this seminal research has influenced many subsequent studies, which can be roughly divided into distributional similarity-based, ontological knowledge-based, and machine learning-based methods.
Several studies based on distributional similarity are inspired by the distributional inclusion hypothesis [23][24][25], which assumes that the hypernym can substitute for its hyponym. Based on this hypothesis, researchers have proposed unsupervised distributional measures with asymmetric scoring functions. In the work of [26], WeedsPrec, a precision-based similarity measure, was proposed to quantify the weighted inclusion of a narrow term x in a broad term y (i.e., x → y). Additionally, coWeeds [27], the geometric average of cosine similarity with WeedsPrec, was proposed. In the work of [28], ClarkeDE, a variant of WeedsPrec, was proposed to compute the degree of inclusion. Alternatively, invCL [27] considered the inclusion of a narrow term x in a broad term y and the exclusion of y from x using the ClarkeDE measure. However, distributional inclusion hypothesis methods require specific similarity measures to identify semantic relationships. In addition, there may be a sparsity problem for noun phrases that contain more than one word, making it challenging to identify hypernym relationships.
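The inclusion measures named above can be sketched in a few lines. The snippet below is a minimal illustration rather than the cited authors' code: it assumes each term is represented as a dictionary of non-negative context-feature weights (the toy weights for `dog` and `animal` are invented), and it uses the commonly cited forms of WeedsPrec, ClarkeDE, and invCL.

```python
import math


def weeds_prec(x, y):
    """Share of x's feature weight carried by features that also occur with y."""
    total = sum(x.values())
    shared = sum(w for f, w in x.items() if f in y)
    return shared / total if total else 0.0


def clarke_de(x, y):
    """Degree of inclusion of x in y, using the minimum weight of shared features."""
    total = sum(x.values())
    overlap = sum(min(w, y[f]) for f, w in x.items() if f in y)
    return overlap / total if total else 0.0


def inv_cl(x, y):
    """Combine inclusion of x in y with non-inclusion of y in x."""
    return math.sqrt(clarke_de(x, y) * (1.0 - clarke_de(y, x)))


# Toy feature vectors (hypothetical weights, e.g. co-occurrence scores).
dog = {"bark": 2.0, "tail": 1.0, "fur": 1.0}
animal = {"bark": 1.0, "tail": 2.0, "fur": 2.0, "wing": 3.0, "fin": 2.0}

narrow_to_broad = inv_cl(dog, animal)
broad_to_narrow = inv_cl(animal, dog)
```

On these toy vectors the score is higher in the direction dog → animal than in the reverse direction, which is the asymmetry these measures are designed to expose.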
Since the concept of word2vec embedding was introduced [17], researchers have attempted to use it as a word representation in supervised learning. In the work of [9], the authors demonstrated how to learn semantic hierarchies from word embeddings via projection learning. Projection learning achieved remarkable improvement in the Chinese hypernym detection task compared to other pattern-based and distributional methods by using learnable piecewise uniform projection matrices that map queries to various hypernym representations. In addition, RMM [11] repeatedly applies a shared-weight projection matrix to a given query with word2vec embeddings, assuming that hypernyms may come from various conceptual hierarchy levels. RMM exploits the attention mechanism [29] and residual connection [30] to capture the corresponding candidate hypernyms. Furthermore, SPON [12] uses word2vec embeddings with a simple neural network to enforce two properties of the hypernym relationship, asymmetry and transitivity, as soft constraints. However, these methods are vulnerable to rare or unseen words because they map noun phrases to a single vector. Moreover, traditional methods are highly inefficient when dealing with out-of-vocabulary terms because they either use random vectors or require retraining the entire word representation from scratch.
Ontological knowledge also aids hypernym discovery, as many distant supervision approaches rely on existing ontologies. The authors of [31] utilized BabelNet [32], a multilingual lexicalized semantic network and ontology, to extract sentences containing terms linked by hypernym relations within BabelNet. The sentences are incorporated into the training data when these terms exhibit hypernymic connections. With this training data, they built classifiers to determine whether a given sentence contains expressions indicative of hypernymic relationships. Similarly, the authors of [33] used BabelNet [32] and embedded pairs of its synsets (namely, term-hypernym pairs) into the sense embedding spaces [34]. For example, Apple and the concept company could form a term-hypernym pair. In these embedding spaces, the authors learn a hypernym transformation matrix over all term-hypernym pairs and then compute similarity values over the pairs.
The machine learning-based methods try to identify the patterns of hypernym relationships from given data. For example, HyperNET [35] used an LSTM-based network to recognize hypernym syntactic patterns by representing dependency tree paths as sequential data. Despite the many efforts to find hypernyms with syntactic patterns, pattern-based methods suffer from sparsity because hypernym-hyponym pairs that match the patterns are rare in the corpus [10,36]. Recently, matrix factorization techniques, such as singular value decomposition, have been used to mitigate the sparsity problem of pattern-based methods and showed improved results [21].
With the emergence of the transformer [37] in NLP, many researchers have recently used transformer-based PLMs, such as BERT [22], in various NLP tasks and applications. The transformer [37] is an encoder-decoder network architecture based solely on the attention mechanism, called self-attention, and does not rely on recurrence or convolutions. In addition, BERT [22] is an effective PLM that uses the encoder block of the transformer, can be fine-tuned for a wide range of NLP tasks, and has achieved state-of-the-art results on 11 benchmark datasets. Following the success of the transformer and BERT, many BERT variants have been proposed [19,20,[38][39][40]. For example, BioBERT [20] and FinBERT [19] improved their performance by further pretraining BERT with a domain-specific corpus.
To use a PLM for identifying hypernym relationships, [16] evaluated the ability of BERT through human language experiments called prompting and demonstrated proficient results in hypernym retrieval. Moreover, several studies have used human language experiments to evaluate the linguistic knowledge of PLMs [41][42][43]. For example, singular and plural prompts were used to probe the hypernymy knowledge of BERT [15]. In addition, various prompts, specifically Hearst patterns, natural sentences, and handwritten context, have been used to find hypernyms with BERT [14]. In the work of [13], the authors investigated the performance of general-domain and domain-adapted language models on financial hypernymy pair datasets using prompting with masked language models. However, prompting approaches can yield unstable identification because the performance varies depending on the prompt type. In addition, due to the format of the prompt, only one token within the vocabulary of the language model can be predicted.
Lastly, a group of studies hybridizes multiple approaches to maximize identification performance. CRIM [10] is a notable hybrid approach that combines pattern-based and projection learning methods. The supervised learning part of CRIM uses projection learning with multiple parallel projection matrices. The pattern-based part of CRIM uses Hearst patterns to assign weights to word2vec embeddings. Then, the cosine similarity is used as a score between two words. Like CRIM, [44] proposed a hybrid approach to discover hypernym relations using a pattern-based and a distributional model. Their model begins by finding seed hypernyms using extended Hearst patterns and then adds the hypernyms of the nearest neighbors.
Inspired by the effectiveness and generality of PLMs, this study aims to find hypernyms using a language model further pretrained by masked language modeling (MLM) on Hearst pattern sentences. We employ a specially formatted input sentence consisting of noun phrases and special tokens. Moreover, projection learning is adopted to capture semantic relationships between noun-phrase embeddings.

Proposed method
This section introduces the proposed hypernym discovery system with Hypert. We illustrate the architecture of the proposed hypernym discovery system in Fig.

Data preparation
This study employs the SemEval-2018 Task 9 (Hypernym Discovery) [8] dataset as the benchmark. The dataset contains five subtasks: three general-purpose subtasks for three languages (English, Spanish, and Italian) and two domain-specific subtasks in English (Medical and Music). We considered three of these subtasks for the experiments: 1A (English), 2A (Medical), and 2B (Music). Table 1 presents examples from the query-hypernym pair dataset. The queries and candidate hypernyms in the hypernym discovery task can be one-word, two-word, or three-word noun phrases. Each query can have up to 15 gold hypernyms. In addition, each query is given as a noun phrase together with its type, which is either a concept or an entity.
Table 2 lists the statistics of each subtask. Each subtask comprises a corpus, a vocabulary, and the training, validation, and testing sets. The corpus was used to train word embeddings. The vocabulary includes noun phrases that can be target hypernyms. The datasets contain query-gold hypernym pairs, where the numbers of train, valid, and test are the numbers of queries for training, validation, and testing, respectively, and the number of vocabulary represents the number of candidate hypernyms. More dataset details can be found in the work of [8].
We used extended Hearst patterns to build a hypernym-related corpus from the given corpus. In total, 47 patterns were used to extract hypernym-related sentences. Specifically, the numbers of extracted sentences for 1A, 2A, and 2B are 5M, 137K, and 153K, respectively. These amount to 4%, 4%, and 3% of the total sentences for each subtask. Although there may be more efficient patterns, the patterns exploited in this study are sufficient to build a training corpus and achieve the best performance among the comparison methods.
We used a special input prompt sentence $S_P$, given in Eq. (1), as the input sentence of the PLM. In $S_P$, Q and C are the query and candidate hypernym terms, respectively. The [CLS] token is a special token that is always the first token of every input sequence. This classification token is used as the aggregate sequence representation. The [SEP] token is also a special token, used to separate sentences. In this study, we used the [CLS] token embedding to represent the hypernym relationship and the [SEP] token to separate the query and candidate hypernym terms. In addition, in the hypernym discovery task, the query is given with its type: concept or entity. For instance, the query "fuse" is a concept, and the query "Louis Armstrong" is an entity. To provide type information for the input query, we added the [CON] and [ENT] special tokens, referring to the concept and entity types. Thus, the [Type] token of $S_P$ is [CON] when the query type is a concept and [ENT] when the query type is an entity.
Because of subword tokenization, both input terms, the query and the candidate hypernym, may be split into multiple tokens. Thus, the number of tokens per term varies for each $S_P$. The span of the subword tokens that corresponds to each term must be identified to obtain word embeddings. To achieve this, we generated masking vectors $M_Q, M_C \in \mathbb{R}^{l \times 1}$ as span information vectors, where $l$ is the length of $S_P$. Figure 2 illustrates the generated $M_Q$ and $M_C$ vectors for a given $S_P$. Both are binary vectors consisting of 0 and 1: $M_Q$ is the query token span vector, and $M_C$ is the candidate hypernym token span vector. We set each masking vector to 1 over its token span and 0 elsewhere. For instance, $M_Q$ was set to 1 for the query token span and 0 for the others.
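The prompt and span vectors described above can be sketched as follows. This is a minimal illustration under stated assumptions: the token order `[CLS] [Type] Q [SEP] C [SEP]` is a guess assembled from the description, and `toy_subword_tokenize` is a hypothetical stand-in for the BERT tokenizer that simply chunks words into four-character pieces.

```python
def toy_subword_tokenize(term, piece_len=4):
    """Hypothetical subword splitter: chunk each word into fixed-size pieces."""
    tokens = []
    for word in term.lower().split():
        pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
        tokens.append(pieces[0])
        tokens.extend("##" + piece for piece in pieces[1:])
    return tokens


def make_prompt(query, candidate, query_type):
    """Build the token list and the (start, end) spans of query and candidate."""
    type_token = "[CON]" if query_type == "concept" else "[ENT]"
    q_tokens = toy_subword_tokenize(query)
    c_tokens = toy_subword_tokenize(candidate)
    # Assumed order: [CLS] [Type] Q [SEP] C [SEP]
    tokens = ["[CLS]", type_token] + q_tokens + ["[SEP]"] + c_tokens + ["[SEP]"]
    q_span = (2, 2 + len(q_tokens))
    c_start = q_span[1] + 1  # skip the [SEP] after the query
    c_span = (c_start, c_start + len(c_tokens))
    return tokens, q_span, c_span


def make_masking_vectors(length, q_span, c_span):
    """Binary span vectors: 1 inside the term's token span, 0 elsewhere."""
    m_q = [1 if q_span[0] <= i < q_span[1] else 0 for i in range(length)]
    m_c = [1 if c_span[0] <= i < c_span[1] else 0 for i in range(length)]
    return m_q, m_c


tokens, q_span, c_span = make_prompt("Louis Armstrong", "jazz musician", "entity")
m_q, m_c = make_masking_vectors(len(tokens), q_span, c_span)
```

Keeping the spans as half-open intervals makes the masking vectors independent of how many subword pieces each term produces.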

Hypert: hypernymy-aware BERT
We introduce Hypert, a PLM for the hypernym relationship. The overall process of further pretraining is illustrated in Fig. 3. Sentence extraction (left) indicates the pattern retrieval process used to build a pretraining corpus. In addition, MLM (right) uses the extracted sentences and generates Hypert, the hypernymy-aware BERT, for each subtask.
In this study, we hypothesize that language models can learn hypernym relation knowledge from specific sentences. As with BioBERT [20], FinBERT [19], and DarkBERT [45], performance on domain-specific tasks improves significantly with further pretraining on a domain corpus. By considering the hypernym relationship to be a specific domain, the language model can be further pretrained on a hypernymy-related corpus to improve its hypernym relationship awareness. To achieve this, sentences representing a hypernym relation are required to construct a hypernymy-related domain corpus. The Hearst pattern is devised to detect hypernym-hyponym pairs from the corpus [1]. For example, if the sentence "mammal such as dog" matches "y such as x," one of the Hearst patterns, then (dog, mammal) can be extracted as a hypernym relationship. We exploited the Hearst patterns to identify sentences that contain hypernym relationships. Sentences matching the Hearst patterns were extracted to build the hypernym-related corpus. When extracting sentences, only the part of the sentence matching the pattern was extracted, not the entire sentence containing the pattern.
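The extraction step can be illustrated with a small regular-expression sketch. It is a simplification under stated assumptions: only two classic Hearst patterns are included (not the 47 extended patterns), and each noun phrase is approximated by a single word, whereas a real pipeline would rely on noun-phrase chunking.

```python
import re

# Two illustrative patterns; the named groups mark hypernym and hyponym.
PATTERNS = [
    ("NP_y such as NP_x", re.compile(r"\b(?P<hyper>\w+) such as (?P<hypo>\w+)")),
    ("NP_x and other NP_y", re.compile(r"\b(?P<hypo>\w+) and other (?P<hyper>\w+)")),
]


def extract_hearst_spans(sentence):
    """Return the matched span (not the whole sentence) and the (hyponym, hypernym) pair."""
    results = []
    for name, regex in PATTERNS:
        for match in regex.finditer(sentence):
            results.append({
                "pattern": name,
                "span": match.group(0),
                "pair": (match.group("hypo"), match.group("hyper")),
            })
    return results


hits = extract_hearst_spans("They keep mammals such as dogs at the farm.")
```

Only the `span` field would enter the pretraining corpus, mirroring the choice to keep the matching part rather than the full sentence.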
Similar to BioBERT and FinBERT, Hypert is initialized with BERT, a pretrained model consisting of transformer encoder layers, before further pretraining [22]. We also employed the BERT tokenizer and added the special tokens [CON] and [ENT]. Subword tokenization in BERT splits words into multiple subtokens defined in the vocabulary of BERT, allowing rare or unseen words to be represented with proper subtoken embeddings. However, this advantage leads to the need for $M_Q$ and $M_C$, as mentioned previously. In addition, further pretraining on a corpus related to hypernymy allows BERT to gain a multiple-perspective understanding of the critical information regarding the hypernym relationship between input query tokens and candidate tokens. While MLM effectively improves contextual hypernymy understanding, next sentence prediction is irrelevant to this task because we are not concerned with the relationship between two consecutive sentences. Therefore, the pretrained model initialized with BERT base is further trained on the corpus constructed above without the next sentence prediction objective. Additionally, Hypert is generated separately for each constructed subtask corpus. In other words, there are three Hypert models, one each for 1A, 2A, and 2B (see footnotes 2, 3, and 4, respectively).
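For readers unfamiliar with the MLM objective, the masking step can be sketched as below. The source does not spell out the masking rates, so this sketch assumes the standard BERT recipe (15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged); the function name and the toy vocabulary are illustrative only.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (inputs, labels); labels are None where no prediction is required."""
    inputs, labels = [], []
    for token in tokens:
        # Special tokens and unselected tokens pass through unchanged.
        if token in SPECIAL_TOKENS or rng.random() >= mask_prob:
            inputs.append(token)
            labels.append(None)
            continue
        labels.append(token)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs.append("[MASK]")
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))
        else:
            inputs.append(token)
    return inputs, labels


rng = random.Random(0)
sentence = ["[CLS]", "mammals", "such", "as", "dogs", "[SEP]"]
inputs, labels = mask_tokens(sentence, ["cats", "birds", "fish"], rng)
```

Because every extracted span contains a Hearst pattern, recovering a masked hyponym or hypernym forces the model to use the pattern context, which is the intuition behind pretraining on this corpus.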

Fine-tuning and prediction
We present a fine-tuning and prediction process using Hypert. The output of Hypert, $H \in \mathbb{R}^{l \times d_{model}}$, is obtained by $f(S_P)$, where $f$ is the proposed Hypert. The length of the input sentence tokens is $l$, and $d_{model}$ is 768, the dimension of the BERT base model:

$$H = f(S_P)$$

The embedding of each term is computed by averaging its token embeddings, which is obtained by multiplying $H$ by the corresponding span information vector described above and dividing by the sum of the masking vector, as follows:

$$\hat{e}_q = \frac{H^{\top} M_Q}{\sum_{i=1}^{l} M_{Q,i}}$$

and

$$\hat{e}_c = \frac{H^{\top} M_C}{\sum_{i=1}^{l} M_{C,i}}$$

where $\hat{e}_q, \hat{e}_c \in \mathbb{R}^{d_{model} \times 1}$ are the embeddings of the query and candidate hypernym. In addition, the embedding of the [CLS] token, $\hat{e}_{[CLS]} \in \mathbb{R}^{d_{model} \times 1}$, is obtained by taking the first index of the final hidden state of Hypert, $H$. An affine transformation is applied to reduce the dimensions of each embedding. Thus, $e_{[CLS]}$ can be defined as follows:

$$e_{[CLS]} = W_{[CLS]}\, \hat{e}_{[CLS]} + b_{[CLS]}$$

The query embedding $e_q$ can be given as follows:

$$e_q = W_q\, \hat{e}_q + b_q$$

and the candidate hypernym embedding is defined as follows:

$$e_c = W_c\, \hat{e}_c + b_c$$

The dimensions of the query embedding $\hat{e}_q$ and candidate hypernym embedding $\hat{e}_c$ are reduced to $d$ by $W_q, W_c \in \mathbb{R}^{d \times d_{model}}$ and $b_q, b_c \in \mathbb{R}^{d \times 1}$. The [CLS] token embedding $\hat{e}_{[CLS]}$ is used in the last layer, so its dimension is reduced to the number of projection matrices $k$ with $W_{[CLS]} \in \mathbb{R}^{k \times d_{model}}$ and $b_{[CLS]} \in \mathbb{R}^{k \times 1}$. All $W$ and $b$ are learnable weights and biases.

2 https://huggingface.co/HeroGeonil/Hypert
3 https://huggingface.co/HeroGeonil/Hypert-medical
4 https://huggingface.co/HeroGeonil/Hypert-music
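The span averaging and affine reduction described above amount to a masked mean followed by a matrix-vector product. The sketch below uses plain Python lists with tiny hand-picked numbers so the arithmetic can be checked by eye; the function names and the toy matrices are illustrative assumptions, not the released implementation.

```python
def span_mean(hidden, mask):
    """Masked mean over token rows: H^T M / sum(M), for hidden of shape l x d_model."""
    count = sum(mask)
    dim = len(hidden[0])
    return [sum(m * row[j] for row, m in zip(hidden, mask)) / count
            for j in range(dim)]


def affine(weight, bias, vec):
    """Affine map W v + b, used here to reduce the embedding dimension."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


# Toy hidden states for l = 4 tokens with d_model = 3.
hidden = [
    [1.0, 0.0, 0.0],  # [CLS]
    [2.0, 4.0, 6.0],  # query piece 1
    [4.0, 6.0, 8.0],  # query piece 2
    [0.0, 0.0, 1.0],  # candidate piece
]
m_q = [0, 1, 1, 0]
m_c = [0, 0, 0, 1]

e_hat_q = span_mean(hidden, m_q)    # averages rows 1 and 2
e_hat_c = span_mean(hidden, m_c)    # picks row 3
e_hat_cls = hidden[0]               # first position of the final hidden state

# Hypothetical 2 x 3 weight and bias reducing d_model = 3 to d = 2.
e_q = affine([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.5, -1.0], e_hat_q)
```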
Previous studies have used projection learning for the supervised approach [9][10][11]. In this study, we adopted the projection learning method, using projection matrices to capture the relationship between the query and candidate hypernym embeddings produced by Hypert. Each projection matrix was created by applying noise drawn from a normal distribution $\mathcal{N}(0, 1/d)$ to the identity matrix as follows:

$$\Phi_i = I + \epsilon_i$$

where $I$ denotes an identity matrix, and $\epsilon_i \in \mathbb{R}^{d \times d}$ represents the $i$th noise term sampled from the normal distribution. Each projection matrix $\Phi_i$ was generated by adding the individual noise $\epsilon_i$ to $I$.
Then, the query embedding $e_q$ was multiplied by $k$ square projection matrices $\Phi_i \in \mathbb{R}^{d \times d}$ to obtain the projected vectors $P_i$, where $i \in \{1, ..., k\}$. $P$ can be defined as

$$P = [\Phi_1 e_q, \Phi_2 e_q, \ldots, \Phi_k e_q]^{\top}$$

Finally, the score vector $s \in \mathbb{R}^{k \times 1}$ was computed using $P \in \mathbb{R}^{k \times d}$ and the candidate hypernym embedding $e_c$ as follows:

$$s = P e_c$$

The embedding of the [CLS] token was used for relation representation. To achieve this, $e_{[CLS]}$ and $s$ were concatenated into $F \in \mathbb{R}^{2k \times 1}$ as follows:

$$F = [e_{[CLS]}; s]$$

The input prompt includes the query and candidate hypernym; thus, the output of the proposed model is the probability of a hypernym relationship. Accordingly, the final layer is a feedforward network with a sigmoid activation function that outputs a value in [0, 1] as follows:

$$o = \sigma(W_o F + b_o)$$

where $o$ is the probability of a hypernym relationship, $W_o$ and $b_o$ represent learnable parameters, and $\sigma$ denotes the sigmoid function.
However, hypernym discovery aims to retrieve suitable hypernyms from a given predefined vocabulary. Therefore, for one query, as many input prompts are generated and scored as there are words in the predefined vocabulary. The number of outputs $o$ for each query equals the number of words in the predefined vocabulary. Each query has a maximum of 15 gold hypernyms; thus, we sorted the outputs and took the top 15 candidates. The inference of the proposed method is described in Algorithm 1. In Line 4, make_prompt is a function that generates $S_P$ for the query Q and candidate hypernym $c^{(i)}$, as demonstrated in Eq. 1, whereas the function make_masking_vectors in Line 5 produces the masking vectors $M_Q$ and $M_C$ illustrated in Fig. 2 for a given $S_P$.
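The scoring head and the top-15 retrieval step can be sketched with plain Python lists. Several details here are assumptions rather than facts from the source: the noise $\mathcal{N}(0, 1/d)$ is read as having variance $1/d$, the helper names are invented, and the output weights in the usage lines are arbitrary.

```python
import math
import random


def make_projections(k, d, rng):
    """k matrices of shape d x d: identity plus Gaussian noise (variance 1/d assumed)."""
    std = math.sqrt(1.0 / d)
    return [
        [[(1.0 if r == c else 0.0) + rng.gauss(0.0, std) for c in range(d)]
         for r in range(d)]
        for _ in range(k)
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def relation_scores(projections, e_q, e_c):
    """s_i = (Phi_i e_q) . e_c for each of the k projection matrices."""
    scores = []
    for phi in projections:
        projected = [dot(row, e_q) for row in phi]
        scores.append(dot(projected, e_c))
    return scores


def hypernym_probability(e_cls, s, w_o, b_o):
    """Concatenate [e_cls; s] and apply a sigmoid output layer."""
    features = list(e_cls) + list(s)
    logit = dot(w_o, features) + b_o
    return 1.0 / (1.0 + math.exp(-logit))


def top_candidates(probabilities, n=15):
    """Rank candidate hypernyms by predicted probability and keep the best n."""
    return sorted(probabilities, key=probabilities.get, reverse=True)[:n]


rng = random.Random(0)
projections = make_projections(k=2, d=2, rng=rng)
s = relation_scores(projections, [1.0, 2.0], [3.0, 4.0])
p = hypernym_probability([0.1, -0.2], s, [0.5, 0.5, 0.1, 0.1], 0.0)
```

Initializing each matrix near the identity means that, before training, every score is close to the plain dot product of the query and candidate embeddings, with the noise breaking the symmetry among the k projections.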

Experimental results
This section presents the performance of the proposed and conventional methods. In addition, it describes the experimental settings for the hypernym discovery dataset, evaluation measures, and employed statistical tests.

Experimental settings
The HuggingFace transformers library with PyTorch [46] was used for the implementations. The experiment was conducted using an Intel i9-10980XE, three NVIDIA GeForce RTX 3090 GPUs, and 128GB RAM. In addition, distributed training was employed using the data-parallel functionality in PyTorch. In the further pretraining of Hypert, the PLM was initialized with the BERT base model. We set the batch size, learning rate, and cosine scheduler warm-up steps to 216, 5e-5, and 500, respectively. The maximum training step was also limited to 10k for each subtask dataset.
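To make the schedule concrete, the learning rate at a given step can be sketched as below. The source only states a cosine scheduler with 500 warm-up steps, a peak rate of 5e-5, and at most 10k steps; the exact shape used here (linear warm-up from zero, then a half-cosine decay to zero) is an assumption that mirrors a common convention.

```python
import math


def lr_at(step, peak_lr=5e-5, warmup_steps=500, max_steps=10_000):
    """Learning rate at a training step: linear warm-up, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few steps of the 10k-step budget.
schedule = [(step, lr_at(step)) for step in (0, 250, 500, 5250, 10_000)]
```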
After further pretraining, the proposed pretrained model was fine-tuned on the training dataset of each subtask. In the fine-tuning model, $k$ was set to 24, and $d$ was set to 200. We set the batch size to 32 and the maximum number of epochs to 15 for training.
In addition, we used negative sampling because the dataset consists of only positive samples. For each positive sample, 50 negative samples were generated. The model from the epoch with the best validation mean average precision (MAP) score was used for testing. The loss function was set to binary cross-entropy loss and minimized using the AdamW optimizer [47]. The binary cross-entropy loss function is defined as follows:

$$L(q, c) = -\left[\, t \log y + (1 - t) \log (1 - y) \,\right]$$

where $L$, $q$, $c$, $t$, and $y$ refer to the loss value, query, candidate hypernym, label, and prediction of the proposed model for the pair $(q, c)$, respectively. The label is 0 for negative pairs and 1 for positive pairs. We conducted hold-out cross-validation for each experiment. The training, validation, and testing sets were combined and randomly re-split in the same proportions as the given split. For each subtask dataset, the experiment was repeated 10 times, yielding 10 performance values for each measure.
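The loss and the negative sampling step can be sketched as follows. How negatives are drawn is not described in the source, so uniform sampling from the vocabulary minus the query's gold hypernyms is an assumption, as are the helper names and the clipping constant `eps`.

```python
import math
import random


def bce_loss(label, prediction, eps=1e-12):
    """Binary cross-entropy for one (query, candidate) pair."""
    y = min(max(prediction, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(y) + (1 - label) * math.log(1.0 - y))


def sample_negatives(gold_hypernyms, vocabulary, n_neg, rng):
    """Draw candidates that are not gold hypernyms of the query (uniform, assumed)."""
    pool = [cand for cand in vocabulary if cand not in gold_hypernyms]
    return rng.sample(pool, min(n_neg, len(pool)))


rng = random.Random(0)
vocabulary = ["musician", "person", "artist", "city", "metal", "device", "river"]
gold = {"musician", "person", "artist"}

negatives = sample_negatives(gold, vocabulary, n_neg=3, rng=rng)
pairs = [(cand, 1) for cand in sorted(gold)] + [(cand, 0) for cand in negatives]
```

In the setting described above, `n_neg` would be 50 per positive pair; the toy vocabulary here is only large enough for three.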
The proposed method was compared to three conventional hypernym discovery methods: RMM, SPON, and prompting BERT, the last with two prompt variants. Details of each method are provided below.
• RMM [11]: This method utilizes a projection matrix with word2vec embeddings. The shared projection matrix is applied recurrently to the hyponym term embedding to obtain representations of words at higher concept levels. To obtain word2vec embeddings for RMM, we set the embedding dimension and window size to 200 and seven, respectively. The embeddings were then trained with ten negative samples for ten epochs on each given corpus. Next, to train the RMM model, we set the batch size to 32 and the number of negative samples to 50. The maximum number of training epochs was set to 1,000 with a patience of 200. Lastly, the model with the best validation MAP was selected for testing. RMM was chosen for comparison because it is a representative method based on the projection matrix.
• SPON [12]: This method creates a distance-to-satisfaction vector for a given hyponym and candidate hypernym. The output representation is subtracted from the candidate hypernym term. All the parameter settings and procedures for obtaining word2vec embeddings and training SPON are the same as the experimental settings of RMM. Again, the model with the best validation MAP was selected for testing. SPON was chosen for comparison because it effectively reflects the asymmetry and transitivity properties, which are essential for identifying hypernym relations.
• Prompting BERT (is-a) [14]: Prompting BERT generates a hypernym for a given prompt by predicting the [MASK] token. Because the original BERT is used directly, an additional fine-tuning process is unnecessary. In our experiment, we considered the prompt "A/An x is a [MASK]." because of its simplicity in identifying hypernym relationships. The prompting strategy was chosen to compare Hypert against the original pretrained language model on our task. Another reason for choosing the prompting strategy is that, in contrast to RMM and SPON, it does not rely on distributional representations such as word2vec.
• Prompting BERT (such as) [14]: Similar to the experimental settings of Prompting BERT (is-a), we considered the prompt "A [MASK] such as A/An x." (such as) because of its superior performance in identifying hypernym relationships.

We used the following evaluation measures.
• Mean average precision (MAP). The MAP is the mean over query words of the average precision (AP), where the AP of a query averages the precision at the rank of each correct hypernym retrieved from the search space. The MAP is defined as

$$\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)$$

where $Q$ and $|Q|$ refer to the given set of query words and the size of the set, respectively.
• Mean reciprocal rank (MRR). The MRR is usually used to evaluate the effectiveness of an information retrieval system [48,49]. The reciprocal rank is the reciprocal of the rank of the first relevant or correct outcome. The MRR is the average of the reciprocal rank over the given query words and is defined as

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$

where $\text{rank}_i$ refers to the rank position of the first correct hypernym of the $i$-th query.
• Precision at k (P@k). The P@k metric calculates the precision of the top-k hypernym outcomes and is defined as

$$P@k = \frac{TP}{TP + FP}$$

where TP and FP refer to the true positives and false positives among the top-k outcomes, respectively. Specifically, we set the cut-off threshold k to 1, 3, 5, or 15 in this study.
We compared each method over the iterations using the Wilcoxon signed-rank test [50] because we are interested in the superiority of the proposed method over the comparison methods. We let $d_i$ be the difference between the performance of the two methods on the $i$th iteration. The differences were ranked according to their absolute values: the smallest $|d_i|$ was assigned the first rank. In the case of ties, average ranks were assigned. We let $R^+$ be the sum of the ranks for the iterations on which the compared method outperforms the proposed method ($d_i > 0$), defined as

$$R^+ = \sum_{d_i > 0} \text{rank}(d_i)$$

and $R^-$ is the opposite, as follows:

$$R^- = \sum_{d_i < 0} \text{rank}(d_i)$$

Then, according to the critical values for the Wilcoxon test, for a significance level of $\alpha = 0.05$ and with $N = 10$, the difference between the compared methods is significant if $\min(R^+, R^-) \le 8$. In this case, the null hypothesis of equal performance is rejected.
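The evaluation measures and the rank sums of the Wilcoxon signed-rank test can be sketched as below. Two conventions are assumptions of this sketch rather than statements from the source: average precision is normalized by the number of gold hypernyms, and zero differences are ranked but counted toward neither rank sum.

```python
def average_precision(ranked, gold):
    """Mean of precision values at the ranks where a gold hypernym appears."""
    hits, total = 0, 0.0
    for rank, cand in enumerate(ranked, start=1):
        if cand in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def reciprocal_rank(ranked, gold):
    """1 / rank of the first correct hypernym, or 0 if none is retrieved."""
    for rank, cand in enumerate(ranked, start=1):
        if cand in gold:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked, gold, k):
    """Fraction of the top-k predictions that are gold hypernyms."""
    return sum(1 for cand in ranked[:k] if cand in gold) / k


def wilcoxon_rank_sums(proposed, compared):
    """Return (R+, R-), where R+ sums ranks of iterations won by the compared method."""
    diffs = [c - p for p, c in zip(proposed, compared)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while absolute differences are tied.
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus


queries = [(["a", "b", "c", "d"], {"a", "c"}), (["x", "y", "z"], {"y"})]
mean_ap = sum(average_precision(r, g) for r, g in queries) / len(queries)
mrr = sum(reciprocal_rank(r, g) for r, g in queries) / len(queries)
```

With ten iterations, the decision rule quoted above corresponds to checking `min(r_plus, r_minus) <= 8` on the returned pair.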

Comparison results
Table 3 presents the results of the experiments on the three subtask datasets. This table contains the MRR, MAP, and precision at ranks k = {1, 3, 5, 15} (P@k) of the proposed and comparison methods. The average performance of the hold-out cross-validation with the corresponding standard deviation is presented for each evaluation measure and method, and the best performance among the methods is shown in bold. As listed in Table 3, the proposed method outperforms the comparison methods on all measures across subtasks.
The MRR reflects how highly the first relevant item is ranked, suggesting that the proposed method identifies hypernyms more accurately than the other methods. For example, in the results of the 1A dataset, the MRR value of the proposed method is 38.86. Compared to RMM, which uses word2vec embeddings with projection learning, the average performance difference is 11.47. The MAP value of the proposed method is 24.17, which ranks first, and the difference in average performance from RMM, which ranks second, is 5.92. The MAP considers the precision of all related items. Therefore, the results indicate that the proposed method ranks more gold hypernyms highly than the other methods. The results of P@k also support this. The same tendencies appear in all other subtask datasets.
Table 4 presents the results of the Wilcoxon signed-rank test of the proposed method against the comparison methods for 10 iterations on the 1A dataset. The table confirms that the proposed method significantly outperforms other methods because all p-values are less than the significance level of α = 0.05, rejecting the null hypothesis. For each evaluation measure, the winning method is marked in bold, and p-values are presented in parentheses. For the 2A and 2B datasets, the Wilcoxon signed-rank test result is the same as that for 1A, which can be observed in Appendix 1.

In-depth analysis
This study introduces a further pretraining and fine-tuning process for hypernym discovery. Specifically, the pretraining phase uses MLM with extended Hearst patterns extracted from the given corpus, and the fine-tuning phase adopts projection learning with Hypert. To assess the influence of these choices, we examined several components of the proposed method. We discuss the effects of the proposed pretraining method and provide the results of each pretraining step. We also defined and evaluated two subgroups in the 1A subtask dataset to validate the robustness of Hypert against conventional methods. This study provides the pattern distribution with statistics and analyzes which patterns appeared frequently. We speculated that the proposed method could handle rare noun phrases. To support this, we present the prediction lists of the proposed method and comparison methods. Additionally, the tSNE plots of the e_[CLS] representation space are presented to analyze the effectiveness of using the [CLS] token as a hypernym relationship information vector.

The pattern analysis was conducted to determine which patterns effectively construct a hypernym-related corpus. The pattern distribution for the extracted sentences is displayed in Fig. 4. Most sentences were extracted using only about five patterns. Table 5 presents the counts for the top five patterns. The "NP_y as NP_x" pattern accounted for more than 50% of the sentences for all subtasks. The "NP_y such as NP_x" pattern accounted for more than 20%. These five patterns comprised over 98% of sentences.
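As a rough illustration of how such a pattern distribution could be tallied, the sketch below counts pattern matches over a handful of sentences. It is a simplification under stated assumptions: only the two patterns named in the text are included, noun phrases are matched by plain substring search against known (hyponym, hypernym) pairs, and the names `TEMPLATES`, `match_patterns`, and `pattern_distribution` are hypothetical. The extended Hearst pattern extraction used in the study is not reproduced here.

```python
from collections import Counter

# Pattern name -> surface template; x is the hyponym, y is the hypernym.
# Only the two patterns named in the text are listed (assumption: plain
# string templates stand in for the real pattern matcher).
TEMPLATES = {
    "NP_y as NP_x": "{y} as {x}",
    "NP_y such as NP_x": "{y} such as {x}",
}


def match_patterns(sentence, hyponym, hypernym):
    """Return the names of all templates that occur in the sentence."""
    text = sentence.lower()
    return [
        name
        for name, template in TEMPLATES.items()
        if template.format(x=hyponym.lower(), y=hypernym.lower()) in text
    ]


def pattern_distribution(sentences, pairs):
    """Share of matches per pattern over all sentences and term pairs."""
    counts = Counter()
    for sentence in sentences:
        for hyponym, hypernym in pairs:
            counts.update(match_patterns(sentence, hyponym, hypernym))
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()} if total else {}


# Toy corpus (illustrative only).
toy_sentences = [
    "People keep pets such as dogs.",
    "Vehicles such as cars are common.",
    "They classify vehicles as cars or trucks.",
]
toy_pairs = [("cars", "vehicles"), ("dogs", "pets")]
toy_distribution = pattern_distribution(toy_sentences, toy_pairs)
```

On the toy corpus, two of the three matches come from the "such as" template and one from the "as" template, so the shares are 2/3 and 1/3.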
Table 6 presents the results of using the Hypert and BERT models. The results indicate that the proposed pretraining method improves the performance on all subtask datasets across all evaluation measures. In addition, we employed the Wilcoxon signed-rank test to confirm the superiority of the pretraining method. The results of the Wilcoxon signed-rank test are provided in Table 7. Most results reject the null hypothesis of the Wilcoxon signed-rank test at a significance level of α = 0.05, except for the 2A dataset.
Although we chose 1k steps for all subtasks for fairness, we varied the number of pretraining steps from 0k (without further pretraining) to 10k to observe how the performance changes as the steps increase. Tables 8 and 9 detail the performance of the proposed further pretraining method for each 1k step.
In the 1A dataset, the MRR value of pretraining with the 1k-step model is the best through all steps. Compared to the 0k-step model, which does not use further pretraining, the average performance difference is 2.24. Moreover, the results indicate that the proposed further pretraining method improves performance compared to the 0k-step model at every 1k step. The rational choice for each subtask will now be discussed. For the 1A dataset, the model with 9k pretraining steps seems reasonable, considering the average performance and standard deviation. For the 2A dataset, despite the 8k steps showing most of the best performance, we consider the 2k steps to be the best choice because of the low standard deviation and the negligible difference in performance between them. However, the results for the 2B dataset clearly indicate a reasonable number of pretraining steps, 1k. The second choice is also evident, 6k. Note that we selected the model with 1k steps for all subtask datasets merely for fairness. A comparison of the pretraining steps considered to have the best performance is provided in Appendix 2.
Table 9
Comparison of P@3, P@5, and P@15 results by number of pretraining steps. 0k indicates the BERT model. Bold represents the best performance across the comparison models, and underline represents second place

We defined and evaluated two subgroups to assess the robustness of performance for queries that can be grouped within the 1A dataset. One is the person group, and the other is the computer-software group. The person group consists of the queries for which "person" exists in the gold hypernyms of the test set. The computer-software group consists of the queries for which "computer" or "software" exists in the test set gold hypernyms. On average, the person group had 320 queries, and the computer-software group had 61 queries. We evaluated the proposed method and comparison methods for the two subgroups. The results are shown in Table 10.
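The subgroup definition above can be sketched as a simple filter over the test-set gold hypernyms. This is a minimal sketch under an assumption: a keyword is taken to "exist" in the gold hypernyms when it appears as a whole word inside any gold hypernym, which the text does not specify. The function name `build_subgroup` and the toy entries (apart from the partial "Xpdf" gold list quoted in the text) are hypothetical.

```python
def build_subgroup(gold_by_query, keywords):
    """Select queries whose gold hypernyms contain any keyword as a word.

    Assumption: word-level matching, so "computer software" matches the
    keyword "software"; exact or substring matching would be alternatives.
    """
    wanted = {keyword.lower() for keyword in keywords}
    selected = []
    for query, golds in gold_by_query.items():
        words = {word for gold in golds for word in gold.lower().split()}
        if words & wanted:
            selected.append(query)
    return selected


# Toy test set (illustrative only).
toy_gold = {
    "Xpdf": ["code", "computer software", "software package"],
    "jane doe": ["person", "scientist"],
    "sonata": ["musical composition"],
}

person_group = build_subgroup(toy_gold, {"person"})
software_group = build_subgroup(toy_gold, {"computer", "software"})
```

Each subgroup can then be scored with the same evaluation measures as the full test set.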
For the person group, the proposed method showed the best performance among the comparison methods for all evaluation measures. The MRR value of the proposed method is 84.78, which is considerably higher than that of the other methods. For the computer-software group, the proposed method also outperforms other methods. Thus, the proposed method consistently outperformed the compared methods in the experiment on the two subgroups, which is similar to the result observed on the original 1A dataset. In detail, the performance on the person group is substantially higher than that on the computer-software group. The reason may lie in the characteristics of the person group: in most cases, the gold hypernym "person" appears first in the gold hypernyms, and the number of gold hypernyms is small. In contrast, the gold hypernyms of the computer-software group are much more varied than those of the person group, and most of them are multi-word expressions, such as ("Xpdf", "code, computer software, software package,..."), indicating that the hypernym relation is much more difficult to predict.
To assess the robustness to rare words, we compared the predictions of the proposed method with those of word2vec-based methods for rare words. The test query "open proxy server," which appears nine times in the given corpus, was used for analysis. The prediction lists of each method and the gold hypernyms are presented in Table 11.
The gold hypernyms are represented in bold with the ⋆ symbol. Because the dataset is handcrafted, there may be more hypernyms. Thus, using underlines, we annotated words relevant to the hypernym relationship according to our judgment. Each prediction list was produced by sorting the candidates by the model's predicted probabilities and taking the top 15 candidates. Hence, the earlier a word appears in the list, the more likely it is to be a hypernym. The results reveal that the proposed method adequately predicts hypernyms for rare words. In addition, its predictions, including the relevant words, are better than those of the other methods. Conversely, RMM and SPON, which are word2vec-based methods, perform poorly on low-frequency words. For example, SPON correctly predicts only one gold hypernym, which is ranked low on the list. Except for "computer program⋆," SPON predicted the wrong words. RMM did not correctly predict any gold hypernyms; words related to the query were present in its prediction list, such as "pseudonymized," "spoofing attack," and "IP address spoofing," but none of them were hypernyms. Table 12 lists the prediction list for "tempestuousness," which appeared once in the corpus. The result also suggests that the prediction list of the proposed method contains more gold hypernyms than the others for rare words.

In addition, we explored the quality of the [CLS] token embeddings. We randomly selected hypernym pairs for each subtask from the testing set. The positive S_P were created from the selected pairs. The negative S_P were generated by replacing a gold hypernym with a random candidate hypernym that is not gold. Then, each S_P was input into the proposed model to obtain e_[CLS] for each S_P. Figure 5 depicts the tSNE plots of the e_[CLS] representation space for each subtask.
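The construction of positive and negative pairs for the embedding analysis can be sketched as follows. This is a minimal sketch, assuming one negative is drawn per positive pair and that pairs are kept as (query, hypernym, label) tuples before being turned into prompts; the function name `make_pairs`, the toy vocabulary, and the fixed seed are hypothetical, and the prompt format S_P itself is not reproduced.

```python
import random


def make_pairs(query, gold, vocabulary, rng):
    """Build positive pairs from gold hypernyms and matching negatives.

    Each negative replaces a gold hypernym with a random candidate from
    the vocabulary that is not a gold hypernym of the query.
    """
    gold_set = set(gold)
    non_gold = [candidate for candidate in vocabulary if candidate not in gold_set]
    positives = [(query, hypernym, 1) for hypernym in gold]
    negatives = [(query, rng.choice(non_gold), 0) for _ in gold]
    return positives + negatives


# Toy data (illustrative only).
toy_vocabulary = ["software", "person", "animal", "computer program", "city"]
toy_pairs = make_pairs(
    query="open proxy server",
    gold=["software", "computer program"],
    vocabulary=toy_vocabulary,
    rng=random.Random(0),  # fixed seed so the sample is reproducible
)
```

The resulting tuples would then be converted into input prompts, and the [CLS] embeddings of the two label groups compared in the tSNE projection.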

Conclusions
Hypernym discovery is challenging because it must find appropriate hypernyms for a given query from a large predefined pool of candidates. In addition, because the candidates contain noun phrases, they are difficult for conventional word2vec-based methods to handle. BERT can alleviate this problem using subword tokenization. However, there have been no attempts to use BERT in hypernym discovery with its widely used training steps of domain adaptation: pretraining and fine-tuning. Therefore, this study presents procedures for adapting BERT to the task domain by modifying the pretraining and fine-tuning stages. We proposed MLM with Hearst pattern sentences as a further pretraining procedure to adapt to the hypernymy domain. The proposed method outperformed the comparison methods on all evaluation measures and subtask datasets. The Wilcoxon signed-rank test was employed to confirm the superiority of the proposed method. We also conducted an in-depth analysis to confirm the effectiveness of the proposed pretraining procedure, analyzed the distribution of the utilized Hearst patterns, and presented the effective patterns. The model with the proposed pretraining performs better than BERT without the proposed pretraining stage. In addition, the case study indicated that the proposed method is more robust to rare words than the comparison models, and the subgroup analysis indicated that it produces stable performance across subgroups. Furthermore, the tSNE plots were presented to demonstrate the representation space of the special prompt component.
Despite the effectiveness of Hypert, its computational cost for inferring hypernym relationships can be heavier than that of conventional methods such as Hearst pattern matching. Thus, when a large number of queries and candidates is considered, for example, 200,000 candidates for one query in this study, Hypert can be slower than conventional methods. For example, the proposed Hypert takes 15.52 seconds per query (s/q), whereas its counterparts RMM, SPON, and prompting BERT take 0.02, 0.02, and 0.46 s/q, respectively. In addition, the performance of Hypert may still be limited because we employed a general tokenizer instead of developing a domain-specific tokenizer for each of the general, medical, and music domains. Furthermore, a pretraining process may be required if Hypert is applied to a new domain, such as cyber security, because the proposed method is based on a general language model.
In the future, we would like to construct additional benchmark datasets for hypernym discovery because most studies on hypernym discovery report that no benchmark datasets other than the SemEval2018 Task9-Hypernym Discovery dataset are available so far [11,44]. Specifically, we would like to start by creating new datasets for Cyber Threat Intelligence (CTI) in the cyber security domain to evaluate the efficacy of Hypert on cybersecurity-oriented documents. In the field of CTI, extracting cyber threat insights from diverse data sources spanning multiple domains, including the Web, is an essential task. In addition, cyber threat information is predominantly communicated through written language in diverse CTI reports involving hypernymic relationships. For example, cyber security practitioners may seek to scrutinize CTI reports containing references to specific malware instances. In this context, hypernym discovery can be used to determine the category of a particular malware, and the proposed Hypert can be applied here. We would like to study this issue further.
Tables 13 and 14 present the results of the Wilcoxon signed-rank test for the 2A and 2B datasets against the comparison models. All p-values in the results are 1.95e-3, rejecting the null hypothesis.
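The reported p-value can be related to the test procedure with a small worked sketch. It computes the rank sums R^+ and R^- with average ranks for ties and enumerates all sign patterns to obtain an exact two-sided p-value for N = 10. The assumptions are labeled in the code: zero differences are dropped, the exact enumeration assumes no tied absolute differences, and the difference values are invented for illustration. Under these assumptions, 1.95e-3 is consistent with one method winning all 10 iterations, since only 2 of the 2^10 sign patterns are that extreme.

```python
from itertools import product


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for j in range(start, end + 1):
            ranks[order[j]] = shared
        start = end + 1
    return ranks


def signed_rank_sums(differences):
    """Return (R+, R-); zero differences are dropped (an assumption)."""
    nonzero = [d for d in differences if d != 0]
    ranks = average_ranks([abs(d) for d in nonzero])
    r_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(nonzero, ranks) if d < 0)
    return r_plus, r_minus


def exact_two_sided_p(n, statistic):
    """P(min(R+, R-) <= statistic) under H0, over all 2^n sign patterns.

    Assumes n untied nonzero differences, so the ranks are exactly 1..n.
    """
    total_rank = n * (n + 1) // 2
    hits = 0
    for signs in product((0, 1), repeat=n):
        r_plus = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if min(r_plus, total_rank - r_plus) <= statistic:
            hits += 1
    return hits / 2 ** n


# Invented differences (compared minus proposed): the proposed method
# wins every one of the 10 iterations, so all differences are negative.
toy_differences = [-1.2, -0.8, -2.5, -0.3, -1.9, -0.6, -3.1, -1.4, -0.9, -2.2]
r_plus, r_minus = signed_rank_sums(toy_differences)
p_value = exact_two_sided_p(len(toy_differences), min(r_plus, r_minus))
```

The same enumeration also reproduces the critical value used in the evaluation setup: for N = 10, a statistic of 8 gives a two-sided p-value just below 0.05, whereas 9 does not.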
1. Data preparation (left) presents the dataset and generation of the input data for the model. Model training (middle) represents the model training and inference for Hypert. Prediction (right) sorts the model output, taking the top 15 candidates.

Fig. 1 Architecture of the hypernym discovery system in this study. (Left) Data preparation used for SemEval-2018 Task 9 to generate input data. (Middle) Model training applied the Hypert of each subtask to initialize the decision model, adopting projection learning. (Right) Prediction sorts the top 15 output results

Fig. 2 Generation of M_Q and M_C for S_prompt. M_Q is 1 for query tokens and 0 for all other tokens; M_C is 1 for hypernym tokens and 0 for all other tokens

Fig. 3 Overview of the further pretraining method. (Left) Sentence extraction used extended Hearst patterns for sentence retrieval. (Right) Masked language modeling exploits each extracted subtask corpus and creates the Hypert for each subtask

Fig. 4 Distribution of patterns for each subtask

In the tSNE plots of the e_[CLS] representation space for each subtask (Fig. 5), blue indicates the e_[CLS] of positive S_P and red indicates the e_[CLS] of negative S_P. The hypernymy and nonhypernymy clusters are appropriately separated in all three plots, revealing that using the [CLS] token as a hypernym relationship information vector effectively identifies hypernym relationships.

Table 1
Examples from the SemEval2018 Task9-hypernym discovery dataset

Table 2
Dataset statistics for each subtask

Table 3
Model performance on the SemEval2018-Task9 dataset. Bold indicates the best performance across the comparison models

Table 4
Wilcoxon signed-rank test results of the proposed method against comparison methods for the 1A dataset with 10 iterations, at the significance level of α = 0.05 (p-values in parentheses)

Table 6
Comparison of evaluation measures on the Hypert and BERT models for each subtask. Bold indicates the best performance between the two models

Table 7
Wilcoxon signed-rank test results for the Hypert and BERT models with 10 iterations, at the significance level of α = 0.05 (p-values in parentheses)

Table 8
Comparison of MRR, MAP, and P@1 results by number of pretraining steps. 0k indicates the BERT model. Bold represents the best performance across the comparison models, and underline represents second place

Table 10
Model performance on two subgroups in the 1A dataset. Bold indicates the best performance across the comparison models

Table 11
Prediction results for each method for the rare noun phrase "open proxy server" (appearing nine times in the corpus). Bold with the ⋆ symbol indicates gold hypernyms, and underlines mark hypernymy-relevant words
Gold hypernyms: Software⋆, Computer program⋆, Software program⋆, Software application⋆, Proxy server⋆, Application⋆, Software package⋆, Application software⋆

Table 14
Wilcoxon signed-rank test results of the proposed method against comparison models for the 2B dataset with 10 iterations, at the significance level of α = 0.05 (p-values in parentheses)

Table 15
Comparison of evaluation measures on the best Hypert and BERT models for each subtask. Bold indicates the best performance between the two models