A Hybrid Method of Coreference Resolution in Information Security

In the field of information security, a gap exists in the study of coreference resolution of entities. A hybrid method is proposed to solve the problem of coreference resolution in information security. The work consists of two parts: the first extracts all candidates (including noun phrases, pronouns, entities, and nested phrases) from a given document and classifies them; the second is coreference resolution of the selected candidates. In the first part, a method combining rules with a deep learning model (Dictionary BiLSTM-Attention-CRF, or DBAC) is proposed to extract all candidates in the text and classify them. In the DBAC model, the domain dictionary matching mechanism is introduced, and new features of words and their contexts are obtained according to the domain dictionary. In this way, full use can be made of the entities and entity-type information contained in the domain dictionary, which can help solve the recognition problem of both rare and long entities. In the second part, candidates are divided into pronoun candidates and noun phrase candidates according to the part of speech, and the coreference resolution of pronoun candidates is solved by making rules and coreference resolution of noun phrase candidates by machine learning. Finally, a dataset is created with which to evaluate our methods using information security data. The experimental results show that the proposed model exhibits better performance than the other baseline models.

virus" often appear in the text, where "the virus" is also the term that must be extracted; thus, the text in information security includes noun phrases and nested phrases in noun phrases that need to be extracted, in addition to simple nouns, pronouns, and proper nouns.
(2) Different types of extracted candidates require different extraction methods. For example, the candidates that were extracted in [Lee, Surdeanu, and Jurafsky (2017)] included all of the common nouns, proper nouns, pronouns, and syntactic patterns, such as appositive, predicate nominative, and role appositive, whereas in information security texts syntactic patterns cannot be used to extract candidates, and a different extraction method from [Lee, Surdeanu, and Jurafsky (2017)] is required.
(3) Different features are used for CR in the general and information security fields. For example, when the entity type "name" is resolved in the general field, gender can be considered an important feature, whereas entities in information security are usually expressed in the passive voice and are gender-neutral.
(4) There is an abundance of terms, proper nouns, and abbreviations in information security text. Although abbreviations for nations or places appear in the general field, these types of abbreviations were not processed in [Lee, Surdeanu, and Jurafsky (2017)], in accordance with the OntoNotes annotation guidelines [BBN Technologies (2006)].
To meet the above-mentioned challenges, a hybrid method is proposed herein to solve the CR problem in the information security field. This study was divided into two parts: (1) extracting and classifying all candidates from the text (including noun phrases, pronouns, entities, and nested phrases), and (2) applying CR to the extracted candidates. In a previous study [Han, Yuanbo, and Tao (2019)], a BiLSTM+attention+CRF model was proposed to recognize named entities in documents, and it solved the problem of inconsistent labels for the same entity within a document, e.g., advanced persistent threat and APT. The attention mechanism was added to the BiLSTM-CRF model to focus on the relevance of a given word to all of the other words in the document: the feature representation of the word was obtained at the document level, and entity extraction and classification were then carried out. However, experiments showed that the model was slightly weak in identifying rare entities that did not appear in the training set and long entities.
Inspired by the success of [Lin, Li, and Yang (2007); Li, Savova, and Kipper (2008); Wang, Zhou, Ruan et al. (2019)] in integrating domain dictionaries into the CRF and BiLSTM-CRF models to solve the problem of rare entity recognition, an improved model based on a domain dictionary (Dictionary+BiLSTM+Attention+CRF, or DBAC) was proposed. In this model, a domain dictionary matching mechanism is introduced, and new features of words and their contexts are obtained according to the domain dictionary. In this way, full use can be made of the entities and entity-type information contained in the domain dictionary. The new feature is combined with the original word feature as the input of the BiLSTM+Attention+CRF model to solve the recognition problem of both rare and long entities. The candidates to be extracted include nominal phrases, pronouns, and nested phrases, in addition to entities. If only the DBAC model were used to extract nominal phrases and nested phrases, considerable manpower and material resources would be spent on obtaining annotations. However, noun phrases and nested phrases follow grammatical rules that can be summarized; thus, rules are combined with the DBAC model to extract and classify candidates from text. The contributions of the paper are the following.
(1) A hybrid method is proposed to solve the CR problem in the information security field.
(2) A method combining rules with a deep learning model (DBAC) is proposed to solve the problem of extracting candidates in the information security field.
(3) Rules are combined with machine learning to resolve pronouns and noun phrases.
2 Related studies
CR has been studied for a long time. In the early development of CR, rule-based methods were mainly used, including the syntax-based Hobbs algorithm [Hobbs (1978)], centering theory based on dialogue [Brennan, Friedman, and Pollard (1987)], and the syntax-based RAP algorithm [Lappin and Leass (1994)]. In the early 21st century, some scholars considered that rule-based methods perform better than machine-learning methods [Haghighi and Klein (2009)]; however, rule-based methods have clear shortcomings: they rely heavily on the user's ability to write rules by hand, because the quality of the rules directly affects performance, and they have poor flexibility and high human resources costs. Research on the use of machine learning for CR is mainly focused on training classifiers [Sukthanker, Poria, Cambria et al. (2018)], among which decision trees and random forests are the most frequently used [Soon, Ng, and Lim (2001); Aone and Bennett (1995); Lee, Surdeanu, and Jurafsky (2017)]. The use of deep-learning models in the NLP field has resulted in the gradual application of these methods to CR tasks [Wiseman, Rush, Shieber et al. (2015); Lee, He, Lewis et al. (2017); Zhang, Santos, Yasunaga et al. (2018); Wiseman, Rush, and Shieber (2016); Clark and Manning (2016)]. In [Wiseman, Rush, Shieber et al. (2015)], the first deep-learning model for CR was proposed, and pre-training was carried out on two separate subtasks (anaphoricity detection and antecedent ranking) to learn different feature representations. The model also showed that obtaining global features from entity groups could help to improve CR performance. However, the premise of the study in [Wiseman, Rush, Shieber et al. (2015)] was that the entity groups have been classified in advance, whereas for our research in information security, relevant candidates must first be extracted from the text. Therefore, in our study, the global features of the entity group in [Wiseman, Rush, Shieber et al. (2015)] were converted into global features of the document. In [Lee, He, Lewis et al. (2017)], candidate detection was combined with the CR task: first, a CNN was used to learn character features, and an LSTM was used to obtain word features; then, the feature representations of the candidates were learned using the attention mechanism, and the antecedents corresponding to the candidates were ranked by a feed-forward neural network. The deep neural network used in this model was very large and difficult to maintain.
In addition to these research applications in the general field, studies on using CR in the biological field have also been developed, mainly because the biological field also has large annotated corpora, e.g., MEDSTRACT [Pustejovsky, Castano, Sauri et al. (2002)] and MEDCo [Su, Yang, Hong et al. (2008)]. Typical applications include a hybrid method based on learning and rules [D'Souza and Ng (2012)] with an F1 value of 60.9%, which is the most advanced in the biological field.

Method
A hybrid method is proposed in this paper in which rules are combined with learning to implement a CR task. The procedure consists of two parts: (a) extracting all the candidates (including nominal phrases, pronouns, entities, and nested phrases) from the documents and classifying these candidates, and (b) performing the CR task for the candidates. The model structure is shown in Fig. 1. As shown in Fig. 1, the model is divided into two parts: candidate extraction and CR. Candidate extraction combines rules with the DBAC model to extract and classify the candidates in the text, and CR is then applied to the candidates.

Candidate extraction
In the present study, the candidate words to be extracted include nominal phrases, pronouns, entities, and nested phrases. The extraction process is divided into nominal- and nested-phrase extraction and entity extraction. The extraction of noun phrases and nested phrases adopts rules, and the extraction of entities adopts the DBAC model. The concrete architecture is shown in the candidate extraction section in Fig. 1.

Extraction of noun phrases and nested phrases
Normally, a noun phrase consists of a noun and its modifier, with the noun as the central word. There are two types of positional relations between modifiers and nouns: the prepositive attributive relation, in which the modifier is placed before the modified noun, and the postpositive attributive relation, in which the modifier is placed after the modified noun. An analysis of the corpus in information security shows that the noun phrases that require CR are usually prepositive attributive noun phrases. Therefore, we only consider the first positional relation. Generally, there are two types of prepositive attributives: determiners, which are used to limit the scope of nouns, such as "these," "three," "a," "the," and "my"; and adjectives, which express the features of a noun, such as "red," "close," "new," and "small." We can obtain nominal phrases using the following rules.
We consider that $U_1$ represents the set of articles, $U_2$ the set of possessive adjectives, $U_3$ the set of possessive nominal pronouns, $U_4$ the set of demonstrative determiners, $U_5$ the set of quantifiers, $U_6$ the set of cardinal words, $N$ the set of nouns, $NP$ the set of noun phrases, and $AD$ the set of adjectives; then, the set from which we obtain the following rules: We summarize these three rules below as follows: The term $BEL$ refers to the predicate to which verbs belong.
In addition, we must extract the nested phrases that usually exist in the extracted noun phrases. The rules for the nested phrases to be extracted are given below. If $NNP$ represents the nested phrase set, $ONP$ represents the possessive noun phrase set, and $P$ represents the preposition set; then, the following is true.
(1) Nested phrases come from possessive noun phrases. For example, the nested phrase in the phrase "its methods" is the pronoun "its," and the nested phrase in "Stuxnet's damage" is the proper noun "Stuxnet." We summarize this rule as follows:
(2) Nested phrases are nouns or prepositions in nominal phrases. For example, the noun phrase "efficiency reduction" has the nested phrase "efficiency." We summarize this rule as follows:
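The following is a minimal sketch of rule-based extraction of prepositive noun phrases and their nested phrases. It is an illustration under stated assumptions, not the rule set of this paper: the pattern "optional determiner, then adjectives, then one or more nouns" is inferred from the prose description above, and the tag names (DET, POSS, NUM, ADJ, NOUN, OTHER) and both function names are hypothetical. Tokens are assumed to arrive already tagged by some part-of-speech tagger.

```python
# Simplified, hypothetical tag set (an assumption, not the paper's notation):
# DET  = articles, demonstratives, quantifiers; POSS = possessive forms;
# NUM  = cardinal words; ADJ = adjectives; NOUN = nouns; OTHER = anything else.
PREMODIFIER_TAGS = {"DET", "POSS", "NUM"}


def extract_noun_phrases(tagged):
    """Return (start, end) spans matching [DET|POSS|NUM]? ADJ* NOUN+."""
    spans = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] in PREMODIFIER_TAGS:   # optional determiner
            j += 1
        while j < n and tagged[j][1] == "ADJ":  # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":  # one or more nouns
            k += 1
        if k > j:          # at least one noun was found: accept the span
            spans.append((i, k))
            i = k
        else:              # no noun phrase starts here
            i += 1
    return spans


def nested_phrases(tagged, span):
    """Return possessors and noun modifiers that precede the head noun."""
    start, end = span
    nested = []
    for word, tag in tagged[start:end - 1]:    # the last token is the head
        if tag == "POSS":
            nested.append(word[:-2] if word.endswith("'s") else word)
        elif tag == "NOUN":
            nested.append(word)
    return nested


tagged = [("its", "POSS"), ("methods", "NOUN"), ("reduce", "OTHER"),
          ("the", "DET"), ("new", "ADJ"), ("virus", "NOUN")]
spans = extract_noun_phrases(tagged)
```

On the toy input, the two spans cover "its methods" and "the new virus", and `nested_phrases(tagged, spans[0])` returns the possessor "its", mirroring example (1) above.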

Manuscript Format Template for Publishing in Tech Science Press
If the extracted nominal phrase contains an entity, only the entity is extracted.

Extraction
Figure 2: Example of dictionary-based feature construction
Sentence: Autoruns revealed that there are two core files Mrxcls.sys and Mrxnet.sys in the Stuxnet which was the first malicious code to damage the industry control system in the world.
1-gram: industry
2-gram: the industry, industry control
3-gram: damage the industry, industry control system
4-gram: to damage the industry, industry control system in
5-gram: code to damage the industry, industry control system in the

2. DBAC model
Similar to the BiLSTM-Attention-CRF model proposed in our previous work [Han, Yuanbo, and Tao (2019)] where $h_i$ is the input of the Attention layer, which is mainly used to calculate the correlation degree between the word $w_i$ and other words. Here, $W_a$ are the model parameters that must be trained.
A global feature representation $r_g$ at the document level can then be obtained: Next, a tanh layer is used to obtain the feature representation $h_i^{new}$, which is input to the CRF, and the process is as follows:
$$\mathrm{result} = \arg\max_{y} \mathrm{score}(D, y).$$
Here, result is the final output tag sequence (the BIO tag), and $W$ represents the model parameters.
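The n-gram windows listed in the Figure 2 example can be sketched as follows. For a target word, the sketch builds the n-gram that ends at the word and the n-gram that starts at it, for n from 1 to 5, and looks each window up in a domain dictionary. The function names, the toy dictionary entry, and the way matches are turned into features (one entity-type label per window, "O" for no match) are assumptions made for illustration; the exact feature encoding used by the DBAC model is not recoverable from the text above.

```python
def ngram_windows(tokens, i, max_n=5):
    """For token i, map n -> (n-gram ending at i, n-gram starting at i).

    A window is None when it would cross the sentence boundary.
    """
    windows = {}
    for n in range(1, max_n + 1):
        left = tokens[i - n + 1:i + 1] if i - n + 1 >= 0 else None
        right = tokens[i:i + n] if i + n <= len(tokens) else None
        windows[n] = (left, right)
    return windows


def dictionary_features(tokens, i, dictionary, max_n=5):
    """Return one label per window: the matched entity type, or "O"."""
    feats = []
    for n, (left, right) in ngram_windows(tokens, i, max_n).items():
        for gram in (left, right):
            key = " ".join(gram).lower() if gram else None
            feats.append(dictionary.get(key, "O"))
    return feats


tokens = ["code", "to", "damage", "the", "industry",
          "control", "system", "in", "the", "world"]
# Hypothetical dictionary entry; the entity type is chosen for illustration.
dictionary = {"industry control system": "product"}

windows = ngram_windows(tokens, tokens.index("industry"))
feats = dictionary_features(tokens, tokens.index("industry"), dictionary)
```

Here `windows[3]` holds "damage the industry" and "industry control system", as in the 3-gram row of the figure, and only the right-hand 3-gram matches the toy dictionary.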

CR of candidates
As there is no large-scale annotated corpus in information security for CR, a method combining rules with machine learning is proposed in this paper to carry out the CR of the candidates. The candidates for CR include pronouns and noun phrases (the entities extracted in the present study are classified as both nouns and noun phrases).
The most difficult part of the procedure is the resolution of the pronoun coreference, which is significantly related to the grammatical structure of the sentence [Sukthanker, Poria, Cambria et al. (2018)]. Therefore, this part of the study is completed using customized rules, and the noun phrase coreference is resolved using machine learning.

CR of pronouns
An analysis of the collected texts identifies two categories of pronouns that require coreference resolution: relative pronouns and personal pronouns. Since persons are not entities in information security, only third-person pronouns are resolved herein.

CR of relative pronouns
The antecedent of a relative pronoun always appears in the same sentence and is close to its anaphor. For a relative pronoun, all of the preceding noun phrases are chosen as its candidate antecedents. Then, according to the syntactic parse tree of the sentence, the path between the relative pronoun and each candidate is extracted, and the shortest path is identified. The noun phrase on the shortest path is taken as the final antecedent of the relative pronoun. An example is given in Table 1.
Table 1: Sentence: (Autoruns)1 revealed that there are (two core files)2 (Mrxcls.sys)3 and (Mrxnet.sys)4 in (the Stuxnet)5, (which)? was (the first malicious code) to damage (the industry control system) in the world.
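The shortest-path rule for relative pronouns can be sketched as below. The parse tree is represented only by child-to-parent links, and the distance between two nodes is the number of edges on the path through their lowest common ancestor. The tree is a hand-built, heavily simplified fragment written for this illustration, not the output of an actual parser, and all node and function names are hypothetical.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(a, b, parent):
    """Number of edges on the tree path between nodes a and b."""
    depth_from_a = {node: d for d, node in enumerate(path_to_root(a, parent))}
    for depth_from_b, node in enumerate(path_to_root(b, parent)):
        if node in depth_from_a:          # first shared ancestor
            return depth_from_a[node] + depth_from_b
    raise ValueError("nodes are not in the same tree")


def resolve_relative_pronoun(pronoun, candidates, parent):
    """Pick the candidate noun phrase with the shortest parse-tree path."""
    return min(candidates, key=lambda c: tree_distance(pronoun, c, parent))


# Hypothetical, simplified parse fragment (child -> parent).
parent = {
    "NP_autoruns": "S",
    "VP": "S",
    "NP_files": "VP",
    "PP": "VP",
    "NP_stuxnet_full": "PP",
    "NP_stuxnet": "NP_stuxnet_full",
    "SBAR": "NP_stuxnet_full",
    "WHNP_which": "SBAR",
}
candidates = ["NP_autoruns", "NP_files", "NP_stuxnet"]
antecedent = resolve_relative_pronoun("WHNP_which", candidates, parent)
```

In this toy tree the relative clause attaches to the noun phrase headed by "Stuxnet", so that candidate has the shortest path and is selected. With a real parser the result depends on where the parser attaches the clause.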

CR of third-person pronouns
The antecedent of a personal pronoun is most likely to be in the same or the preceding sentence. First, candidate antecedents are searched for in the same sentence; if the candidate set is empty, candidates are extracted from the previous sentence instead. Since personal pronouns must refer to entities, only security-domain entity candidates are retained. If the candidate set is not empty, the parse tree is traversed upward, starting from the node of the personal pronoun. If there is a juxtaposed structure (juxtaposed noun phrases, juxtaposed verb phrases, or juxtaposed clauses), the candidate word that is farthest away in the first sub-structure (in terms of word distance) is selected as the antecedent of the personal pronoun. Otherwise, the nearest clause or sentence in the parse tree is found, and the farthest candidate word in it is selected as the antecedent. An example is given in Table 2.
Table 2: Sentence: (Stuxnet)1 searches for (specific programs)2, accesses (industrial control systems)3, and ((its)? attack object) is the target program development tool.
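The candidate search for personal pronouns can be sketched as follows. This is a deliberately reduced illustration: it implements the "same sentence first, then previous sentence" search and the entity filter, but it replaces the parse-tree traversal with a plain "farthest candidate by word distance" choice, which is an assumption and only coincides with the full rule in simple cases such as the Table 2 sentence. The `Mention` class, its fields, and the entity flags in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    sentence: int     # index of the sentence containing the mention
    position: int     # token offset of the mention in the document
    is_entity: bool   # True if recognised as a security-domain entity


def candidate_antecedents(pronoun, mentions):
    """Entity mentions before the pronoun: same sentence, else previous one."""
    def entities_in(sentence_index):
        return [m for m in mentions
                if m.is_entity
                and m.sentence == sentence_index
                and m.position < pronoun.position]

    same_sentence = entities_in(pronoun.sentence)
    return same_sentence if same_sentence else entities_in(pronoun.sentence - 1)


def pick_farthest(pronoun, candidates):
    """Simplified selection: the candidate with the largest word distance."""
    if not candidates:
        return None
    return max(candidates, key=lambda m: pronoun.position - m.position)


mentions = [
    Mention("Stuxnet", 0, 0, True),
    Mention("specific programs", 0, 3, False),
    Mention("industrial control systems", 0, 7, True),
]
its = Mention("its", 0, 11, False)
antecedent = pick_farthest(its, candidate_antecedents(its, mentions))
```

For the Table 2 sentence the sketch selects "Stuxnet", the farthest entity candidate in the same sentence.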

CR of noun phrases
First, the features needed for machine learning are introduced. Each feature is obtained by comparing the corresponding attributes of the two items that are being resolved, as shown below.
Consistency between categories: In Section 3.1, candidates were extracted and classified. Here, the types of the two items being resolved are compared directly. This is a binary attribute: it is true if the types are consistent and false otherwise.
Consistency between alias and abbreviation: If one of the two items being resolved is an alias or abbreviation of the other, the value is true; otherwise, the value is false.
Consistency between singular and plural numbers: The forms of the verbs or related verbs after the two items to be resolved are analyzed to determine whether the singular and plural numbers are consistent; the value is true if they are consistent and false otherwise.
Distance between the two items to be resolved in the text: The number of sentences between the two items to be resolved in the text.
Name similarity: For example, the phrase "the virus" usually has the same reference as the name of a virus, whereas phrases that contain words such as "product" and "company" do not.
Appositive: A syntactic analyzer is used to determine whether one of the two items to be resolved is an appositive of the other.
Similarity of head words: In general, the head word in a noun phrase is considered to be a noun. Here, we compare the head words of the two noun phrases using cosine similarity.

Similarity of ending words: Cosine similarity is used to compare the last words of the two noun phrases.
Next, the training set is constructed. Consider that the document contains a reference chain in which each pair of directly adjacent reference items generates a positive training sample. The extraction of negative training samples is given below.
For example, if other objects $B_1$ and $B_2$ appear between $A_1$ and $A_2$, then a negative training sample can be derived as follows:
Sentence: As the world's first cyber "super destructive weapon," (Stuxnet)A1 has infected more than 45,000 networks around the world. (Computer security experts)B1 believe (the virus)A2 is (the highest level)B2 ("worm")B3 ever. (The new virus)A3 uses a variety of advanced technologies, so it is extremely stealthy and destructive.
In the example above, the reference chain (Stuxnet)A1-(the virus)A2-(the new virus)A3 can be used to generate positive training samples: (Stuxnet)A1-(the virus)A2, (the virus)A2-(the new virus)A3. Although coreference is transitive, we only consider adjacent referential relationships to reduce errors. Similarly, negative training samples can be generated as follows: (Stuxnet)A1-(Computer security experts)B1, (Computer security experts)B1-(the virus)A2, ... .
CR is a candidate classification problem. Therefore, we adopt the random forest algorithm. This algorithm is a classifier containing multiple decision trees that is easy to implement and has little computational overhead.
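The construction of training pairs described above can be sketched as follows. Positive pairs are directly adjacent mentions of the same reference chain. For negative pairs, the sketch pairs every mention lying between the two ends of a positive pair with each end; this reproduces the two negative pairs listed in the example, but the full enumeration is an assumption, and the function and variable names are hypothetical.

```python
def build_training_pairs(mentions, chain_of):
    """Build positive and negative mention pairs.

    mentions: mention ids in document order.
    chain_of: maps a mention id to its reference-chain id; mentions that
              belong to no chain are simply absent from the mapping.
    """
    last_seen = {}              # chain id -> index of its latest mention
    positives, negatives = [], []
    for j, mention in enumerate(mentions):
        chain = chain_of.get(mention)
        if chain is None:
            continue
        if chain in last_seen:
            i = last_seen[chain]
            positives.append((mentions[i], mention))
            # Every mention strictly between the pair yields two negatives.
            for between in mentions[i + 1:j]:
                negatives.append((mentions[i], between))
                negatives.append((between, mention))
        last_seen[chain] = j
    return positives, negatives


# Mentions of the example passage, in document order.
mentions = ["A1", "B1", "A2", "B2", "B3", "A3"]
chain_of = {"A1": "stuxnet", "A2": "stuxnet", "A3": "stuxnet"}
positives, negatives = build_training_pairs(mentions, chain_of)
```

For the example passage this yields the positive pairs A1-A2 and A2-A3, and the negative pairs A1-B1 and B1-A2 from the first positive pair, followed by the pairs involving B2 and B3 from the second.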
Example text in Figure 3: As the world's first cyber "super destructive weapon", (Stuxnet)1 has infected more than 45,000 networks around the world. Computer security experts believe (the virus)? is the highest level ("worm")3 ever. (The new virus)4 uses a variety of advanced technologies, so it is extremely stealthy and destructive.
Figure 3 illustrates the process of using the random forest algorithm for CR. Consider that resolving the candidate word "the virus" is the goal at this time: the algorithm first links the candidate word to all of the possible antecedents within a certain sentence scope (in general, all of the noun phrases in two consecutive sentences are chosen). Antecedents beyond this scope are not considered. The antecedent with the highest confidence is selected as the antecedent of the candidate word. The overgeneration of referential chains is controlled by setting a minimum confidence threshold $t_i$. If no confidence value is greater than $t_i$, the candidate word has no coreferential antecedent (this state may be changed during subsequent resolution). $t_i$ can be obtained by training.
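The antecedent selection step can be sketched as below. The `confidence` argument stands in for the probability estimated by the trained random forest for a (antecedent, candidate) pair; here it is replaced by a toy lookup table with made-up scores. The default threshold of 0.30 follows the minimum confidence threshold reported in the Settings section; the function name is hypothetical.

```python
def select_antecedent(candidate, antecedents, confidence, threshold=0.30):
    """Return the antecedent with the highest confidence above the threshold.

    Returns None when no antecedent exceeds the threshold, i.e. the
    candidate is left without a coreferential antecedent for now.
    """
    best, best_score = None, threshold
    for antecedent in antecedents:
        score = confidence(antecedent, candidate)
        if score > best_score:
            best, best_score = antecedent, score
    return best


# Toy confidence values, invented for illustration only.
toy_scores = {
    ("Stuxnet", "the virus"): 0.80,
    ("Computer security experts", "the virus"): 0.10,
}


def toy_confidence(antecedent, candidate):
    return toy_scores.get((antecedent, candidate), 0.0)


chosen = select_antecedent(
    "the virus", ["Stuxnet", "Computer security experts"], toy_confidence)
unresolved = select_antecedent(
    "the virus", ["Computer security experts"], toy_confidence)
```

With the toy scores, "Stuxnet" is chosen for "the virus", while restricting the antecedents to "Computer security experts" leaves the candidate unresolved.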

Data sources
Experimental data collected in our previous study [Han, Yuanbo and Tao (2019)] of texts in the information security field were used, including articles from WeLiveSecurity and Threatpost blogs, CVE (common vulnerabilities and exposures) descriptions, Microsoft security bulletins, and abstracts of journal articles in the information security field. Twenty summaries, 45 blog articles, 59 CVE descriptions, and 50 Microsoft security bulletins were extracted, resulting in a corpus of 9123 sentences. In our previous study, these texts were annotated with entity types, and these annotated corpora were used as the training data of the DBAC model. Then, 20 security reports and 20 blogs were extracted to annotate the reference chains to obtain a total of 45,932 reference chains with 7.5% positive samples. These reference chains serve as training data for machine learning.
Since there are far more negative than positive training samples, the negative sample extraction method from [Lee, Surdeanu, and Jurafsky (2017)] was adopted to reduce the training time. First, all of the positive samples in the training dataset and a randomly selected 10% of the negative samples were used to train a classifier. Then, the classifier confidence values (i.e., the estimated probabilities) of all of the negative samples were checked, and only the most ambiguous 10% of the negative training samples were retained, i.e., the negative training samples with the highest confidence values, which are the hardest to distinguish from the positive training samples. These more informative negative training samples and all of the positive training samples were used to train the final classifier.
A domain dictionary that we constructed previously [Zhang, Guo, and Li (2019)] using Wikipedia and UCO, an ontology of the information security field, was used.
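The two-stage selection of informative negative samples can be sketched as follows. The `train` argument is a hypothetical callable that fits any classifier on the given positives and negatives and returns a scoring function mapping a sample to its estimated probability of being coreferent; in the paper's setting this would be a random forest, but the sketch does not depend on a particular library. The fractions default to the 10% values stated above.

```python
import random


def informative_negatives(positives, negatives, train,
                          first_fraction=0.10, keep_fraction=0.10, seed=0):
    """Keep the negatives that a first-pass classifier finds most confusable."""
    rng = random.Random(seed)

    # Stage 1: train on all positives plus a random subset of the negatives.
    n_first = max(1, round(len(negatives) * first_fraction))
    score = train(positives, rng.sample(negatives, n_first))

    # Stage 2: score every negative and keep those with the highest
    # estimated probability of being positive.
    ranked = sorted(negatives, key=score, reverse=True)
    n_keep = max(1, round(len(negatives) * keep_fraction))
    return ranked[:n_keep]


def toy_train(positives, sampled_negatives):
    """Stand-in for classifier training: the sample value is its own score."""
    return lambda sample: sample


toy_negatives = list(range(100))
kept = informative_negatives([100, 101], toy_negatives, toy_train)
```

In the toy run the samples are integers that serve as their own scores, so the ten highest-valued negatives are kept. The final classifier would then be trained on all positives together with `kept`.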

Settings
In this study, the dimension of the feature vector is set at 300, the number of neurons in the BiLSTM at 1000, the mini-batch size (batch_size) at 64, and the maximum number of iterations at 100. The model parameters are updated using the method from [Kingma and Ba (2019)], the learning rate is set at $10^{-3}$, and $l_2$ is set at $10^{-5}$. To avoid overfitting, dropout was used. The dropout values of the BiLSTM and the attention layer were 0.3 and 0.5, respectively. The parameter setting in the random forest essentially consists of setting the parameters of a single decision tree. The minimum confidence threshold is 30%, the minimum number of leaf nodes is 5, the maximum depth is the default value, and the number of decision trees is 100. These parameters were obtained through 10-fold cross-validation on the training set. The experiment was performed on a machine with two NVIDIA GTX 1080Ti graphics processing units and 64 GB of memory, and the model was trained for approximately 1 h.

Results and analysis
First, the superiority of the proposed method for CR in information security was verified. Four baselines were used: (a) the scaffolding approach proposed in [Lee, Surdeanu, and Jurafsky (2017)]; (b) the method proposed in [Soon, Ng, and Lim (2001)] (this method is referred to by the authors' names (Wee et al.), as the method was not named in the paper); (c) the method proposed in [Zhang, Santos, Yasunaga et al. (2018)]; and (d) the method proposed in [Wiseman, Rush, and Shieber (2016)]. The baselines were applied together with the proposed model to information security data, and the experimental results are shown in Table 3. As shown in Table 3, the proposed model outperforms the other four models for information security.
An analysis of the error samples shows that the method of Wiseman et al. is mainly based on using an RNN to learn the latent global representation of each entity in the entity cluster, and then an RNN is used to resolve these entities. This model does not use a specific clustering method but assumes default entity clusters. Thus, our proposed model was used to extract the candidates from the texts, and the simple K-means clustering method was then used to cluster the candidates. However, the experimental result is not ideal. Our analysis shows that, when entities are clustered, pronouns without domain features are usually clustered together. Therefore, when the global feature representation of these groups is learned, the domain features, which undoubtedly affect the performance of the subsequent CR, are difficult to learn. The scaffolding approach and the method of Wee et al., in turn, both address texts from the general field. Most of the features that were developed for these models are for entities in the general field, such as "organization" and "person." Therefore, their CR performance is not sufficiently high for information security. Zhang et al. used a biaffine attention mechanism and optimized the loss function of the candidate extraction to conduct coreference resolution, which required a significant amount of annotated training data to train the parameters. Therefore, this model achieved excellent performance on the CoNLL-2012 dataset but performed poorly on the information security dataset with limited annotations.
Next, experiments were carried out on the influence of each single feature on the proposed model. The index numbers corresponding to the eight features are shown in Table 4.
Table 4
Feature                           Number
Consistency between categories    1
Alias and abbreviation            2
Singular and plural               3
Text distance                     4
Name similarity                   5
Appositive                        6
Similarity of head words          7
Similarity of tail words          8
The impact of a single feature on model performance is shown in Fig. 4.

As shown in Fig. 4, the appositive feature has the least influence on CR performance of all of the features, mainly because identifying appositives is relatively difficult, especially in sentences with complex grammatical structures, and the accuracy of appositives that are determined only by syntactic analysis tools is not very high. In addition, the similarity of head words and the alias and abbreviation feature have the highest impact on CR performance. An analysis showed that the main reasons for these results are as follows: (a) information security text contains many professional terms and abbreviations, e.g., advanced persistent threat (APT); and (b) for noun phrases, the head word usually determines the main meaning of the phrase.
In addition, the quality of candidate extraction also affects CR. Therefore, the experiment also verifies the performance of the proposed method for extracting candidate words (which is abbreviated as rules-based DBAC). In addition to the abovementioned three baselines, the reference models used here also include the method developed by us in our previous study [Han, Yuanbo, and Tao (2019)]. Here, we refer to it as "previous study." The information security domain entities extracted in this experiment include the four types mentioned above: "product," "vulnerability," "attacker," and "company." The experimental results are shown in Table 5. As seen in Table 5, the proposed method (a rules-based DBAC model) outperforms the other four methods for information security, mainly because the proposed method, while relying on deep learning, also analyzes the security text to summarize a set of corresponding extraction rules. This combination increases the model performance. Fig. 5 shows the entity extraction and classification results obtained using the DBAC model and our previous model (from our previous study), respectively, for the example sentence "Stuxnet searches for specific designed access industrial control systems."
Finally, the DBAC model is compared with the models in [Li, Savova, and Kipper (2008); Wang, Zhou, Ruan et al. (2019)] to demonstrate the superiority of the proposed model. Since each of the two studies combined its models in several different ways, the best model from each study was selected and named after its authors. The information security domain entities extracted in this experiment include the four types mentioned above. The dictionary used in these models is the same one constructed in our previous work [Zhang, Guo, and Li (2019)]. The comparison results are shown in Fig. 6. The analysis found that the DBAC model could identify almost all of the long entities in the texts, while the Li D model [Li, Savova, and Kipper (2008)] was weak in this respect. The feature computing method based on the domain dictionary that we propose can make full use of the entity and entity-type information in the dictionary. In addition, since the attention mechanism was added at the document level, more word features can be captured at the document level. This is the main reason why the DBAC model performs better. The superiority of the document-level feature in entity extraction was verified in our previous study [Han, Yuanbo, and Tao (2019)] and will not be repeated here. An analysis of the errors made by the rules-based DBAC model shows that the method still suffers from some problems that remain to be solved, such as missing antecedents and the extraction of unusable candidate words (i.e., candidate words with no coreference relationship).

Conclusions
In this paper, a hybrid method is proposed to solve the problem of coreference resolution (CR) in the information security field. This method addresses two problems in a CR task: (a) the extraction of all of the candidate words from a given document and the classification of those candidates, and (b) CR of the extracted candidates. A set of rules is developed according to the features of information security texts and is combined with the deep-learning model (DBAC) to solve the problem of extracting and classifying the candidate words in texts. Coreference resolution is decomposed into pronoun coreference resolution and noun-phrase coreference resolution: pronoun resolution is accomplished by rules, and the coreference resolution of noun phrases is accomplished by machine learning. The experimental results show that the proposed hybrid method is applicable to the information security field and outperforms other models that are based on the general domain. However, there are still unsolved problems in the extraction of candidates. In the future, incorporating more feature construction methods [Li, Xu, Xian, et al. (2019); Yeh (2018)] will be considered to solve these problems.