Multifeature Named Entity Recognition in Information Security Based on Adversarial Learning

In order to obtain high-quality and large-scale labelled data for information security research, we propose a new approach that combines a generative adversarial network with the BiLSTM-Attention-CRF model to obtain labelled data from crowd annotations. We use the generative adversarial network to find common features in crowd annotations and then consider them in conjunction with the domain dictionary feature and sentence dependency feature as additional features to be introduced into the BiLSTM-Attention-CRF model, which is then used to carry out named entity recognition in crowdsourcing. Finally, we create a dataset to evaluate our models using information security data. The experimental results show that our model has better performance than the other baseline models.


Introduction
Named entity recognition (NER) aims to extract various types of entities from text. This is a fundamental step in text mining and has received much attention recently, especially in medicine [1][2][3][4][5][6] and biochemistry [7][8][9][10]. In contrast, the development of NER tasks in information security has been relatively slow. In previous works, several methods have been proposed for extracting vulnerabilities and extracting information from unstructured texts [11][12][13][14]. In the past two years, NER research in the domain of information security has largely stagnated. The lack of large-scale labelled data in this field is one of the main reasons for this situation.
Snow et al. proposed a way to quickly and cost-effectively obtain large-scale labelled data using Amazon Mechanical Turk [15] and demonstrated that nonexpert annotations were relatively useful for training models [16]. We can use crowdsourcing as an effective way of obtaining large-scale labelled data at low cost within a short time. However, in a professional field, crowd annotations may be of lower quality than those of experts, and we therefore need to integrate high-quality consensus labelling from crowdsourcing annotations.
Although we can obtain high-quality annotations using the majority vote method [17], this requires a great deal of manpower, and for some sentences or entities whose meanings are rather ambiguous, it may be difficult to reach an agreement among the annotators. A generative adversarial network (GAN) has the ability to generate data. Goodfellow et al. proved theoretically that when the GAN model converges, the generated data have the same distribution as the real data [18]. Yang et al. demonstrated the usability of the GAN model for NER using Chinese crowd-sourced annotations [19]. However, the focus of that work is on using the GAN model to find the features of the trusted annotators, and its application is mainly in the general domain.
In practice, there is a significant difference between entity types used in general applications and those used in information security. Some entity categories in the information security domain are not simply nouns or nominal phrases, as traditionally defined in NER. For example, consider the text: [a Trojan] "known as 'Bicololo' was first discovered in October 2012 and specially designed to steal login credentials". In this sentence, the phrasal verb "steal login credentials" should be extracted as an entity of the consequence class of the unified cybersecurity ontology (UCO) [20]. Therefore, when we identify named entities in the information security field, we need to use certain supplementary features in addition to the traditional (word and character) features.
In this paper, we propose a new model entitled BiLSTM-Attention-CRF-Crowd to improve the quality of the crowdsourcing annotations in information security. Our goal is to combine a GAN model with the BiLSTM-Attention-CRF model. The GAN model is used to find the common features of annotations in order to integrate the best unique consensus annotations and then pass them to the BiLSTM-Attention-CRF submodel as one type of additional feature. Here, we add an attention layer between the BiLSTM and CRF layers, primarily to process long sentences appearing in the text. A neural network model using a conventional encoder-decoder structure needs to represent the information in the input sequence as a fixed-length vector, which makes it difficult to retain all the necessary information when the input sequence is long, especially when it is longer than the sequences in the training data set; we therefore add the attention layer to address this limitation. The submodel performs NER again on the crowdsourced annotations for the information security data to improve the quality of these annotations. The main contributions of our work can be summarised as follows: (I) In order to solve the problem of a lack of high-quality annotated data in the field of information security, a new model called BiLSTM-Attention-CRF-crowd is proposed to improve the quality of the crowdsourcing dataset by combining a GAN model with the BiLSTM-Attention-CRF model. (II) Due to the diversity and specificity of the entity categories in information security, using only basic features such as word and character features as input for the BiLSTM-Attention-CRF model cannot meet the requirements for NER tasks in this field. Domain dictionary features and sentence dependency features are therefore introduced and used as additional features along with the common features learned by the GAN model.
The experimental results show that these additional features have practical value for improving the performance of the model.

Related Works
. . GAN. Compared with its applications in computer vision, the GAN model is less widely used in the field of language processing because the values used in images and video are continuous, and the generator and discriminator can be directly trained using the gradient descent method; in contrast, letters and words in text are all discrete, and the gradient descent method cannot be directly applied. Zhang et al. proposed the TextGAN model, which uses several techniques to deal with discrete variables [21]. For example, it uses a smooth approximation to approximate the discrete output of the LSTM and feature matching techniques in the generator training process. Since the number of parameters in the LSTM is significantly greater than that for the CNN, the LSTM is more difficult to train, and the TextGAN discriminator (CNN) only updates once after the generator (LSTM) has been updated multiple times. Yu et al. proposed SeqGAN, which draws on the concept of reinforcement learning to deal with the discrete output problem. It regards the error in the discriminator output as the reward value in reinforcement learning and regards the training process of the generator as a decision-making process in reinforcement learning. This model is applied to speech text and music generation [22]. Li et al. and Kusner et al. applied GAN to open dialogue text generation and context-free grammar (CFG), respectively [23,24]. We mainly draw on the method of processing of discrete variables and its objective function for feature matching in [21].
. . Crowdsourcing. Dawid and Skene presented a confusion matrix for each annotator, using an expectation maximisation (EM) estimation of these matrices as parameters and the true token labels as hidden variables [25]. Dredze et al. proposed a conditional random field (CRF) model for learning from multiple annotations, but the features that the CRF can learn are limited [26]. In previous work [19,27,28], the aim was to model the differences between the annotators and to identify the more trustworthy annotators. Although this improves the performance of the model, this choice of annotation is too dependent on the credibility of an annotator.
. . NER in Information Security. In order to solve the problem of a lack of large-scale labelled data, Agerri et al. designed a projection algorithm to transfer NER annotations across languages [29]. However, in information security, there are no large-scale annotation sets in any language. Giorgi et al. combined gold-standard corpora with silver-standard corpora by using transfer learning to expand the scale of high-quality labels [30]. However, this approach was applied in the field of biology.
Few documents from the last three years can be found on the subject of NER in information security. A small number of works have focused on extracting vulnerabilities and attack information from unstructured texts in the past few years [11][12][13][14]. Bridges et al. proposed a maximum entropy model trained with an averaged perceptron to extract entities from text, but these authors extracted only the entities and did not classify the types of these entities [11]. Weerawardhana et al. extracted vulnerability information from an online vulnerability database [12]. Lal proposed a CRF to extract vulnerabilities from the text [13]. Mulwad et al. used wikitology to extract vulnerabilities and attacks from web text, although wikitology is an ontology in the general domain [14].

Feature Selection
A high-quality feature set is key to the success of NER tasks in information security. In this paper, we use word and character features as our basic feature set, and the domain dictionary and sentence dependency features as two kinds of additional features.
. . Word Features. Word features (word embedding) involve a distributed representation of a word in vector space. This can capture semantic and grammatical information about a word from an unlabeled corpus [31]. Words with similar context or semantics are closer in the word vector space, and word vectors can therefore be used for natural language processing tasks such as entity classification, entity alignment, and relation extraction. Word2vec [32] is the most commonly used tool for training word vectors. In this paper, in order to obtain high-quality word vectors, we select 94,534 vulnerability record descriptions from entries in the Common Vulnerabilities and Exposures (CVE) corpus [33] listed since 1997 for word vector training.
. . Character Features. Character features contain structural information about the name of the entity and can represent the specific composition of the entity's name, especially in the information security domain. For example, viruses such as Backdoor.Win32.Gpigeon.pd and Backdoor.Win32.Gpigeon2010.pc, which are PE viruses affecting Windows, have the same prefix; hence, when we encounter one of these words, we know from its prefix that it is the name of a PE virus for Windows. Unlike traditional manually designed character features, we can obtain the character feature vectors for words through training. First, the character vector of each character in the word is obtained by querying the character table, and then the sequence of character vectors corresponding to the word is used as input for the BiLSTM.
. . Additional Features . . . Domain Dictionary Feature. In information security, in order to better identify entities, it is not sufficient to use only word and character features; we also need to add domain knowledge, including features such as a domain dictionary. At present, there is no established dictionary for reference in information security, so we use the Internet to construct a preliminary domain dictionary. In this paper, we use Wikipedia as a corpus and the UCO in the information security domain to construct a domain dictionary. Wikipedia contains three types of page: entry pages, entry pages that have not yet been created, and list pages. Entry pages are mainly used to describe concepts. In Wikipedia, the URLs of these three types of pages follow certain rules. For example, the URL for an entry page is usually in the form http://en.wikipedia.org/wiki/ * , and the URL for a list page is usually in the form en.wikipedia.org/wiki/Category. In computer-related fields, we therefore only need to use en.wikipedia.org/wiki/Category: computing as a crawler input for concept capture. After the concepts are captured, k-means clustering is carried out. There are eight important concept classes in UCO, and k can therefore be set to eight for k-means clustering.
The resulting clustering categories are labelled by hand and then reclassified using the Mahalanobis distance. The specific algorithm is as follows: (I) For each category $C$, the mean vector $\mu_C$ and the covariance matrix $\Sigma_C$ of the concept vectors in the category are calculated. (II) If $x$ represents the vector of a concept in the subclass $C$ after clustering, then the semantic similarity is calculated by the Mahalanobis distance: $d(x, C) = \sqrt{(x - \mu_C)^{\top} \Sigma_C^{-1} (x - \mu_C)}$. The clustering result is then determined again. We set the threshold $\theta$; if $d(x, C) \geq \theta$, then the concept is considered not to belong to the current concept category. The initial value of the threshold can be obtained from the original concepts under the training category.
(III) If a concept has not been classified into a category after filtering, it can be judged based on its upper and lower concepts in Wikipedia. If its upper and lower concepts are not in the same category, then the distances between the upper and lower concepts and the class centres of their categories are calculated separately, and the category with the smaller distance is selected. Otherwise, the concept is placed in the category to which both its upper and lower concepts belong. (IV) Before classifying a new concept into a category, the algorithm returns to Step I to recalculate the mean vector and covariance matrix for the category.
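As a rough illustration of the threshold test in Step II, the following Python sketch computes the Mahalanobis distance of a concept vector to a category and checks it against a threshold. It assumes two-dimensional concept vectors so that the covariance matrix can be inverted by hand, and it uses the sample covariance; the function names, the toy vectors, and the threshold value 2.0 are hypothetical and are not taken from the paper.

```python
import math

def mean_vector(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def covariance_2d(points, mu):
    """Sample covariance matrix of 2-D points (assumption: n - 1 denominator)."""
    n = len(points)
    sxx = sum((p[0] - mu[0]) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mu[1]) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def mahalanobis_2d(x, mu, cov):
    """sqrt((x - mu)^T cov^{-1} (x - mu)) for 2-D vectors."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)

def belongs_to_category(x, members, threshold):
    """A concept stays in the category only if its distance is below the threshold."""
    mu = mean_vector(members)
    cov = covariance_2d(members, mu)
    return mahalanobis_2d(x, mu, cov) < threshold

# Hypothetical category made of four concept vectors.
members = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
near = belongs_to_category((3.0, 1.0), members, threshold=2.0)
far = belongs_to_category((5.0, 5.0), members, threshold=2.0)
```

Recomputing the mean and covariance inside `belongs_to_category` mirrors Step IV, where the category statistics are refreshed before each new concept is classified.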
. . . Sentence Dependency Feature. As discussed above, the entity types in information security are no longer simply the types defined in traditional NER tasks.
We use the example given above to analyse the annotation of the semantic role and syntactic dependencies. We use Stanford CoreNLP [34] as a syntactic analysis tool. The result of this syntactic analysis is shown in Figure 1.
The core verbs in this sentence are "discovered" and "designed", and "Bicololo" is the passive subject of these two verbs. The subject of this sentence is a worm virus from the attacker class. If we want to extract the entity of the consequence class, we should focus on the modifier that follows "designed", for which the phrase "steal login credentials" is the open clausal complement. The open clausal complement is a verb or a verb phrase which is used to add a description of the core verb. At this point, we can create the following rules for extracting consequence class entities: (I) The subject of the sentence should be an entity of the attacker type.
(II) We consider the verbs associated with each attacker entity, such as "be designed to/for", "be used to", "result", "cause", and other predicate verbs, and extract the minimum verb phrases or clauses associated with these. Here, the minimum verb phrase or clause is a phrase or clause that does not contain any nested phrase or clause of the same type.
(III) If the relationship between the minimum phrase or clause and the foregoing predicate verbs is that of a complement or modifier, the minimum phrase or clause can be considered as an entity of the consequence class.
Figure 1: Sentence dependency analysis.
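The three rules can be sketched in Python over a hand-written dependency parse. This is only an illustration under stated assumptions: the parse is given as (head index, relation, dependent index) triples rather than produced by Stanford CoreNLP, the relation labels (`nsubjpass`, `xcomp`, `dobj`, `compound`) follow the Stanford/Universal Dependencies naming, trigger verbs are matched on their surface form, and all names in the code are hypothetical.

```python
TRIGGER_VERBS = {"designed", "used", "result", "cause"}   # Rule II trigger verbs
SUBJECT_RELS = {"nsubj", "nsubjpass"}                     # subject relations
COMPLEMENT_RELS = {"xcomp", "ccomp", "advcl"}             # Rule III: complement or modifier
PHRASE_RELS = {"dobj", "compound", "amod"}                # relations kept inside the phrase

def collect_phrase(edges, root):
    """Return sorted token indices of the phrase headed by root."""
    keep = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for head, rel, dep in edges:
            if head == node and rel in PHRASE_RELS and dep not in keep:
                keep.add(dep)
                stack.append(dep)
    return sorted(keep)

def extract_consequences(tokens, edges, attackers):
    """Apply Rules I to III and return candidate consequence phrases."""
    results = []
    for head, rel, dep in edges:
        # Rule I: the subject must be an attacker entity.
        # Rule II: the governing verb must be one of the trigger verbs.
        if rel in SUBJECT_RELS and tokens[dep] in attackers and tokens[head] in TRIGGER_VERBS:
            for head2, rel2, dep2 in edges:
                # Rule III: the phrase is a complement or modifier of that verb.
                if head2 == head and rel2 in COMPLEMENT_RELS:
                    phrase = collect_phrase(edges, dep2)
                    results.append(" ".join(tokens[i] for i in phrase))
    return results

# Simplified parse of "Bicololo was designed to steal login credentials".
tokens = ["Bicololo", "was", "designed", "to", "steal", "login", "credentials"]
edges = [
    (2, "nsubjpass", 0), (2, "auxpass", 1), (2, "xcomp", 4),
    (4, "mark", 3), (4, "dobj", 6), (6, "compound", 5),
]
found = extract_consequences(tokens, edges, attackers={"Bicololo"})
```

Excluding the `mark` relation from the collected phrase drops the infinitive marker "to", so the extracted span matches the phrase discussed in the text.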

Model Design for BiLSTM-Attention-CRF-Crowd
Our model focuses on two tasks. Task 1 is the generation of crowd annotations through adversarial learning in order to integrate the optimal single-consensus annotations. In Task 2, the common features generated in Task 1 are introduced as additional features into the NER submodel. Once the discriminator can no longer distinguish the generated features from those of the expert annotations, the features are passed to the submodel in Figure 2(b); otherwise, the new features are passed back to the BiLSTM layer. Figure 2(b) shows the BiLSTM-Attention-CRF submodel for NER in the crowdsourced domain dataset. Several features, such as the dictionary, sentence dependency, and common features of the crowdsourced annotations, are input to the model to improve the quality of the crowdsourced annotations.
. . Adversarial Learning for Common Features. The structure of the GAN contains two models: a generative model and a discriminative model. The goal is to learn a generative distribution that matches the real data distribution. More specifically, the generative model generates samples from the generator distribution, and the discriminative model learns to determine whether a sample is from the generative distribution or the real data distribution. Adversarial learning is used to find the common features of the crowd annotations. The discriminative model is trained using expert annotations and crowd annotations; the crowd annotations are used as the input for the generative model, and the generated feature distribution is passed to the discriminative model, which determines the similarities and differences between the feature distributions of the crowd and expert annotations. The model is repeatedly trained until the discriminator can no longer distinguish a difference between them. At this point, the result of the model is the set of common features of the crowd annotations, which is also the set of optimal single-consensus annotations that we want to integrate. As shown in Figure 2, the generative model encodes its input with a BiLSTM layer. In order to obtain more important features, an attention layer with its own model parameters is used above the BiLSTM layer to capture a new representation of the BiLSTM output $h_t$.
. . . CNN. Following this, we add a CNN module based on the outputs of the BiLSTM-Attention model, to determine the similarities and differences between the feature distributions for the crowd and expert annotations. A convolutional operator with a window size of five is used, and then a max-pooling strategy is applied to the convolution sequence to obtain the final fixed-dimensional feature vector. The overall process can be described by the following equations: $\hat{h}_t = \tanh(W_c\,[h_{t-2}; h_{t-1}; h_t; h_{t+1}; h_{t+2}] + b_c)$ and $h = \max_{t} \hat{h}_t$ (the maximum is taken element-wise over the positions $t$), where $W_c$ and $b_c$ are the CNN model parameters, and the activation function tanh is used primarily to normalise and prevent the loss of features. In the pooling method, the maximum feature is the most important, as it effectively filters out word combinations with less information and can ensure that the extracted features are independent of the length of the input sentence.
The pooled feature $h$ is then mapped to the output $D(h) \in [0, 1]$ using a softmax function to determine whether the input feature is consistent with the feature distribution of the expert annotations.
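To make the convolution and pooling step concrete, here is a minimal pure-Python sketch of a window-five convolution with tanh activation followed by max-pooling over positions. It assumes zero-padding at the sentence boundaries and represents each filter as a flat weight list; the function name, the padding choice, and the toy inputs are hypothetical and not taken from the paper.

```python
import math

def conv_max_pool(hidden, filters, biases, window=5):
    """hidden: n vectors of size d; filters: k weight lists of size window * d.

    Returns a vector of size k, whatever the sentence length n is.
    """
    d = len(hidden[0])
    pad = window // 2
    zero = [0.0] * d
    padded = [zero] * pad + list(hidden) + [zero] * pad   # assumption: zero-padding
    conv = []
    for t in range(len(hidden)):
        # Concatenate the vectors in the window centred on position t.
        patch = [v for vec in padded[t:t + window] for v in vec]
        conv.append([
            math.tanh(sum(w * x for w, x in zip(f, patch)) + b)
            for f, b in zip(filters, biases)
        ])
    # Max-pooling over positions, separately for each filter.
    return [max(column) for column in zip(*conv)]

filters = [[1.0] * 5, [-1.0] * 5]          # two toy filters for d = 1
biases = [0.0, 0.0]
short = conv_max_pool([[0.1], [0.2], [0.3]], filters, biases)
longer = conv_max_pool([[0.1], [0.2], [0.3], [0.0], [0.0], [0.0], [0.0]], filters, biases)
```

Both calls return a vector with one entry per filter, which is the length-independence property that the pooling step is meant to provide.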
. . . Objective Function. We use the feature matching method in [21], set $S$ as the expert annotations, and use an iterative optimisation scheme consisting of two steps, one for the discriminative model and one for the generative model. In the objective of the generative model, $\hat{\Sigma}$ and $\Sigma$ represent the covariance matrices for the expert and the crowd annotation features, respectively, and $\hat{\mu}$ and $\mu$ denote the corresponding mean features; their values are empirically estimated on each mini-batch.
This objective represents the Jensen-Shannon divergence between two multivariate Gaussian distributions $\mathcal{N}(\mu, \Sigma)$ and $\mathcal{N}(\hat{\mu}, \hat{\Sigma})$. Its main purpose is to provide a stronger signal for modifying the generative model in order to make the feature distribution generated by the generative model more similar to that of the discriminative model [21].
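For orientation, the two-step scheme can be written in the form usually given for feature matching in TextGAN-style training. This is a sketch under assumptions rather than the exact equations of this paper: $C$ (the crowd annotations fed to the generator $G$), $\theta_D$, and $\theta_G$ are labels introduced here, with hatted symbols taken to denote the expert statistics and unhatted symbols the crowd statistics.

```latex
\begin{aligned}
\min_{\theta_D}\; L_D &= -\,\mathbb{E}_{s \sim S}\,\log D(s)
  \;-\; \mathbb{E}_{c \sim C}\,\log\bigl[1 - D(G(c))\bigr],\\[4pt]
\min_{\theta_G}\; L_G &= \operatorname{tr}\!\bigl(\hat{\Sigma}^{-1}\Sigma + \Sigma^{-1}\hat{\Sigma}\bigr)
  + (\hat{\mu} - \mu)^{\top}\bigl(\hat{\Sigma}^{-1} + \Sigma^{-1}\bigr)(\hat{\mu} - \mu).
\end{aligned}
```

The first step is the standard discriminator loss; the second compares only the first and second moments of the two feature distributions, which is why the means and covariances can be estimated on a mini-batch.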
In training the generative model, which contains discrete variables, the direct application of gradient estimation would fail. Thus, we draw on the method used in [21] and use a soft-argmax function when performing the inference as an approximation to the inputs of the BiLSTM in the generative model: $y_{t-1} = W_e\,\mathrm{softmax}(V h_{t-1} \circ L)$, where $\circ$ represents the element-wise product, $V$ is a weight matrix used to calculate the word distribution, and $W_e$ is a model parameter. When $L \to \infty$, this expression approximates the default input vector calculation formula for BiLSTM.
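The limiting behaviour of the soft-argmax can be checked numerically. The sketch below scales the word scores by a sharpness factor before the softmax and uses the resulting probabilities to average the embedding vectors; as the factor grows, the result approaches the embedding of the highest-scoring word. The function names, the scalar sharpness factor, and the toy embeddings are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_argmax_embedding(logits, embeddings, sharpness):
    """Probability-weighted average of embeddings with scaled logits."""
    probs = softmax([sharpness * z for z in logits])
    dim = len(embeddings[0])
    return [sum(p * emb[i] for p, emb in zip(probs, embeddings)) for i in range(dim)]

logits = [1.0, 2.0, 0.5]                             # word 1 has the highest score
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # toy embedding table
soft = soft_argmax_embedding(logits, embeddings, sharpness=1.0)
sharp = soft_argmax_embedding(logits, embeddings, sharpness=1000.0)
```

With a small factor the output is a smooth mixture of all embeddings, so gradients can flow to every word; with a large factor it is effectively the ordinary embedding lookup of the argmax word.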
. . BiLSTM-Attention-CRF SubModel. The BiLSTM-Attention-CRF submodel adds the attention mechanism to the classical BiLSTM-CRF model to allow it to pay attention to the correlation between the current entity and the other words in the sentence and to obtain the feature representation of words at the sentence level to improve the accuracy of the model labelling. The model structure is shown in Figure 2(b).
Using the word feature, the character feature, and the additional features corresponding to each word as the input to the BiLSTM, we obtain the new representation $h_t$ of the word (calculated in the same way as the BiLSTM output in the adversarial learning model above), which is used as input to the attention layer. The attention weight value $\alpha_{t,j}$ in the attention matrix is derived by comparing the current word $x_t$ with the other words $x_j$ ($j = 1, 2, \ldots, t-1, t+1, \ldots, n$) in the sentence.
The method of calculation of $\mathrm{score}(x_t, x_j)$ is shown in (2). A sentence-level vector $g_t$ is then computed as a weighted sum of each BiLSTM output $h_j$: $g_t = \sum_{j \neq t} \alpha_{t,j} h_j$. Next, we combine the sentence-level vector with the BiLSTM output of the target word as a vector $[g_t, h_t]$ to be passed to the tanh function to produce the output $z_t$ of the attention layer.
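The attention step can be sketched as follows. Because the scoring function is not reproduced in this text, the sketch assumes a plain dot product between BiLSTM outputs and omits any learned projection before the tanh; both choices, along with the function name and toy vectors, are assumptions for illustration only.

```python
import math

def attention_output(hidden, t):
    """Return (weights over the other words, tanh of [g_t, h_t]) for word t."""
    others = [j for j in range(len(hidden)) if j != t]
    # Assumption: dot-product score between the current word and each other word.
    scores = [sum(a * b for a, b in zip(hidden[t], hidden[j])) for j in others]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden[t])
    # Sentence-level vector: weighted sum of the other BiLSTM outputs.
    g = [sum(w * hidden[j][i] for w, j in zip(weights, others)) for i in range(dim)]
    # Concatenate the sentence-level vector with the target word's output.
    return weights, [math.tanh(v) for v in g + list(hidden[t])]

hidden = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy BiLSTM outputs for three words
weights, z0 = attention_output(hidden, t=0)
```

In this toy example the second word has the same representation as the first, so it receives the larger attention weight when the first word is the target.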
Finally, we use the attention output $z_t$ as the input of the upper CRF layer. Here, the CRF has two roles: the first is to calculate the score for each $z_t$ in the corresponding annotation, and the second is to use the tagging transition matrix $T$ (which defines the score of two consecutive annotations) and the Viterbi algorithm to calculate the best annotation sequence. This process is expressed as follows: $y^{*} = \arg\max_{y} \mathrm{score}(x, y; \Theta)$, where the function $\mathrm{score}(\cdot)$ is used to calculate the score of the annotation sequence $y = y_1 y_2 \ldots y_n$ of the input sentence $x$, $y^{*}$ is the final output annotation sequence (i.e., the BIO annotation), and $\Theta$ represents the model parameters.
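The decoding step can be illustrated with a small Viterbi implementation that combines per-token scores with a tag transition matrix. The sketch assumes the sequence score is the sum of emission and transition scores, which is the usual linear-chain CRF form; the tag set, the score values, and the large negative penalty for the invalid O to I transition are hypothetical.

```python
def viterbi(emissions, transitions, tags):
    """emissions[t][i]: score of tag i at position t.
    transitions[i][j]: score of tag i being followed by tag j.
    Returns the best tag sequence and its total score.
    """
    k = len(tags)
    best = [list(emissions[0])]
    back = []
    for t in range(1, len(emissions)):
        row, pointers = [], []
        for j in range(k):
            prev = max(range(k), key=lambda i: best[-1][i] + transitions[i][j])
            row.append(best[-1][prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best.append(row)
        back.append(pointers)
    last = max(range(k), key=lambda i: best[-1][i])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return [tags[i] for i in path], best[-1][last]

tags = ["B", "I", "O"]
transitions = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, -1e4, 0.0]]   # O -> I is penalised
path_a, score_a = viterbi([[2.0, 0.5, 1.0], [0.2, 1.5, 1.0], [0.1, 0.3, 2.0]], transitions, tags)
path_b, score_b = viterbi([[0.0, 0.0, 1.0], [0.0, 2.0, 1.5]], transitions, tags)
```

In the second call the greedy choice would be O followed by I, but the transition penalty makes the decoder prefer a sequence that respects the BIO constraint.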

Experimental Results and Analysis
To evaluate our model, we divided the baseline models into two groups based on their different uses. First, our model and the first group of baseline models were applied to the crowdsourcing annotations to verify the ability of our model to integrate consensus annotations in comparison with other baseline models. Secondly, our model and the second group of baseline models were applied to identify specific entities in information security to verify the ability of our model to identify specific types of entities. Finally we verified the effect of additional features on the performance of the model.
. . Data Sources. The dataset used in this experiment was mainly drawn from the field of information security, and included related blog posts (such as WeLiveSecurity and Threatpost), CVE descriptions, Microsoft security bulletins, and information security abstracts. From this corpus, 10,187 sentences were selected (consecutive paragraphs including 20 abstracts, 45 blog posts, 59 CVE descriptions and 50 Microsoft security bulletins), and each sentence was assigned to three annotators to generate crowd annotations. These three annotators were students at the authors' institution with no educational background in information security. Each annotator only needed to annotate four types of named entities in the sentence: the product, the consequence, the attacker, and the version. Two senior students taking information security courses were asked to annotate 1,000 randomly selected sentences to train the discriminative model in the GAN. From the crowd annotations, we randomly chose 7,000 sentences as a training set and used the remainder as a test set.
. . Baseline Models. The comparison models were divided into two groups for experiments.
Group 1: to learn the common features of crowd annotations, we used the following as comparison models: (I) Majority vote (MV) [17] (II) Dawid and Skene Model [25].
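For reference, the majority vote baseline can be sketched at the token level as follows. The sketch assumes that all annotators label the same tokenisation and that a tag is accepted only when more than half of the annotators agree; the tag strings and the use of `None` for unresolved tokens are hypothetical conventions.

```python
from collections import Counter

def majority_vote(annotations):
    """annotations: one tag sequence per annotator for the same sentence.

    Returns one tag per token, or None where no strict majority exists.
    """
    voted = []
    for tags in zip(*annotations):
        tag, count = Counter(tags).most_common(1)[0]
        voted.append(tag if count > len(tags) / 2 else None)
    return voted

annotator_1 = ["B-attacker", "O", "B-consequence", "I-consequence"]
annotator_2 = ["B-attacker", "O", "O", "B-consequence"]
annotator_3 = ["B-attacker", "O", "B-consequence", "O"]
consensus = majority_vote([annotator_1, annotator_2, annotator_3])
```

The last token shows the failure mode discussed in this paper: when three annotators give three different tags, voting produces no consensus label.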
Group 2: to predict the named entity sequence from unlabeled text, we used the following as comparison models: (I) BiLSTM-Attention-CRF: The model in [35] was trained directly using the crowd annotations. When we used this model, we removed the part of the model that used image features.
(II) BiLSTM-Attention-CRF-VT: This was trained on the data selected from the crowd annotations by majority vote.
(IV) CRF-MA: From the model in [26], we used the source code provided by the author.
. . Settings. There are several hyperparameters in our model. We set the dimensions of the feature vector to 300, the number of units in BiLSTM to 1000, and the mini-batch size to 64. The maximum number of epochs was set to 100. The method described in [36] with a learning rate of $10^{-3}$ was used to update the model parameters, and the $L_2$ regularisation coefficient was set to $10^{-5}$. We adopted the dropout technique to avoid overfitting. The dropout was 0.3 for BiLSTM and 0.5 for the attention layer. Our experiment was implemented on two NVIDIA GTX 1080Ti GPUs with 64 GB memory, and the model was trained for approximately one hour.
. . Evaluation of Experimental Results. The indicators used in the evaluation of the experiment were precision (P), recall (R), and the F1 value.
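The three indicators can be computed as in the sketch below. It assumes entity-level exact matching, where an entity counts as correct only if its span and type both match the gold annotation; the paper does not state its matching criterion, and the tuple layout and example entities are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """gold, predicted: collections of (start, end, entity_type) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = [(0, 1, "attacker"), (4, 7, "consequence"), (9, 10, "product")]
predicted = [(0, 1, "attacker"), (4, 6, "consequence"), (9, 10, "product"), (12, 13, "version")]
p, r, f1 = precision_recall_f1(gold, predicted)
```

Here the consequence entity is predicted with a wrong boundary, so it counts as both a false positive and a false negative under exact matching.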

. . . Performance Comparison of Integrated Crowd Annotation Model (I) Performance Comparison between Our Model and the First Group of Baseline Models. We use the accuracy rate of the correct annotation obtained from the training corpus used by each model as our evaluation criterion. Performance comparisons for various models on the test corpus are shown in Table 1.
The performance of MV was relatively poor. This is because it is difficult to reach agreement on an ambiguous entity, owing to the specialised nature of the field and the uneven skill level of the annotators. Our model achieves the best performance. In addition to obtaining the correct annotations in the training corpus, our model can also generate positive samples that are consistent with the feature distribution of the expert annotations through repeated training. Through the secondary extraction of the BiLSTM-Attention-CRF submodel, the accuracy rate of correct annotations in the training corpus is improved.
(II) Comparison of the Overall Performance of NER in the Information Security Field for Each Model. Table 2 shows that the BiLSTM-Attention-CRF model that was directly trained using unprocessed crowd annotations had the highest precision. This means that the crowd annotations are useful for training the NER model. The BiLSTM-Attention-CRF-VT model, trained on data selected from the crowd annotations using the MV method, showed the poorest performance. This model may therefore not be suitable for the information security field. The reason for this may be the complexity and specialised nature of information security text: many entities cannot be selected by voting. In addition, contextual information is lost by voting selection, which means that important feature information is lost. The BiLSTM-Attention-CRF model, which directly uses unprocessed crowdsourcing label data as training data, does not lose important context information; however, the level of noise is increased, so its overall performance is slightly lower than that of the BiLSTM-Attention-CRF-crowd model.
In Figure 3, the ordinates represent the P, R, and F1 values, respectively. The labels 1, 2, 3, and 4 on the abscissa correspond to the entity types (product, attacker, consequence, and version, respectively), and 5 represents the mean of the three indicators of the model. As can be seen from Figure 3, the model performs better on Types 1 and 4. Type 1 is the product class: although this type of entity belongs to the information security domain, public awareness of it is relatively high, meaning that the precision of the annotation is relatively high. The category corresponding to Type 4 has a relatively fixed pattern; it always appears after the product name and is usually expressed by numbers.
The performance of the model is best for entities with fixed patterns, because their features are easier to learn. For Types 2 and 3, which are relatively specific types of entity in the information security field, the precision of the crowd annotations was low. In particular, the consequence class (Type 3) places higher demands on the professional ability of the annotator, and the form of its entities is not fixed, so the performance is slightly weaker.
(II) Performance Evaluation of Specific Types of Entity with Other Models. In Figure 4, the ordinate represents the accuracy of each model, and the abscissa is the same as in Figure 3. These models take the basic features together with the domain dictionary and sentence dependency features as input. It can be seen from the figure that the performance of the BiLSTM-Attention-CRF-crowd model is better than that of the other models. However, its accuracy for each type is lower than in Figure 3, which indicates that the common features learned through adversarial learning have a significant effect on improving the accuracy of the model.
To verify the effect of the additional features, we compared two variants of our model: one using only word features and character features as inputs (i.e., BiLSTM-Attention-CRF-crowd (basic)) and another using basic features and additional features as inputs (BiLSTM-Attention-CRF-crowd (basic+attach)). The results for the extraction accuracy are shown in Figure 5. It can be seen that the extraction accuracy of the BiLSTM-Attention-CRF-crowd (basic+attach) model is significantly higher than that of the BiLSTM-Attention-CRF-crowd (basic) model, which shows that the additional features have a measurable effect on the accuracy of entity extraction. The improvement is especially clear for the consequence class, which verifies the practical value of the sentence dependency feature for extracting entities that are not nouns or noun phrases.
Taken together, the above experiments show that the BiLSTM-Attention-CRF-crowd model outperforms the other models studied in this paper.

Conclusion
In this paper, we have proposed a new model, BiLSTM-Attention-CRF-crowd, to improve the quality of crowdsourcing annotations in the information security field. The main work includes the following: (1) the common features of crowd annotations are found by the GAN model to generate the best unique consensus annotation; (2) these common features, domain dictionaries, and sentence dependencies are used as additional features to identify the entities of crowdsourcing annotations again, so as to improve the quality of crowdsourcing annotations. We evaluate our model on data sets in the field of information security, and the results show that its performance is better than that of the other baseline models mentioned in this paper. It is also verified that the proposed domain dictionary features and sentence dependency features have practical value for improving the performance of the model. However, the increase in input features will inevitably lead to an increase in the time complexity of the model. In future, we will consider further improvements to the model.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare no conflicts of interest.