A New Text Classification Model Based on Contrastive Word Embedding for Detecting Cybersecurity Intelligence in Twitter

Detecting cybersecurity intelligence (CSI) on social media such as Twitter is crucial because it allows security experts to respond to cyber threats in advance. In this paper, we devise a new text classification model based on deep learning to classify CSI-positive and -negative tweets from a collection of tweets. For this, we propose a novel word embedding model, called contrastive word embedding, which maximizes the difference between base embedding models. First, we define CSI-positive and -negative corpora, which are used for constructing the embedding models. Here, to supplement the imbalance of the tweet data sets, we additionally employ background knowledge for each tweet corpus: (1) the CVE data set for the CSI-positive corpus and (2) the Wikitext data set for the CSI-negative corpus. Second, we adopt deep learning models such as CNN or LSTM to extract adequate feature vectors from the embedding models and integrate the feature vectors into one classifier. To validate the effectiveness of the proposed model, we compare our method with two baseline classification models: (1) a model based on a single embedding model constructed with the CSI-positive corpus only and (2) another model with the CSI-negative corpus only. As a result, we show that the proposed model achieves high accuracy, i.e., an F1-score of 0.934 and an area under the curve (AUC) of 0.935, improving on the baseline models by 1.76∼6.74% in F1-score and by 1.64∼6.98% in AUC.


Introduction
Twitter is a representative social media platform where users write their opinions and share events. It is known that more than 150 million users wrote more than 500 million tweets per day as of 2019 [1]. To extract useful information from such a large number of tweets, text classification is necessary to condense the entire large-scale information into small-scale subclasses manageable for analysis. As a result, the text classification of tweets has been attempted in various domains to detect information related to a specific topic. Typical examples are the prediction of a user's geolocation by classifying tweets containing geolocation information [2], the prediction of political affiliation by classifying tweets related to politics [3], and the prediction of crime by classifying tweets based on emotion [4]. However, the classification of tweets is inherently difficult because the length of a tweet is limited, i.e., less than 280 characters, and various types of users write tweets in an informal way [5].
There have been research efforts to detect CSI in tweets automatically based on machine learning or deep learning models. It is known that the way an embedding model is constructed significantly affects the classification performance [9]. As a result, two important issues have been dealt with: (1) the definition of the corpus used for constructing the embedding model and (2) the classifiers used for training the classification model. Ritter et al. have collected tweets containing keywords related to cyber attack events and have proposed a weakly supervised learning model [10]. Le et al. have collected tweets containing CSI-related keywords written by selected Twitter accounts and have used the CVE data set, collected from the National Vulnerability Database (NVD) (https://nvd.nist.gov/vuln/full-listing), for constructing the embedding model. They have trained two anomaly detectors using centroid and one-class Support Vector Machine (SVM) models [11]. Chambers et al. have collected tweets containing the names of organizations attacked by Distributed Denial of Service (DDoS) and have performed time series analysis based on the frequency of tweets to forecast cyber threats [12]. Dionísio et al. have collected tweets containing cyber threat intelligence and have labeled them manually. They have used a deep learning model based on a convolutional neural network (CNN) to detect tweets related to cyber threats [13].
In this paper, we propose a new text classification model for classifying CSI-positive and -negative tweets by devising a novel embedding model. To construct the embedding model, we consider two kinds of corpora: (1) a CSI-positive corpus (e.g., tweets related to CSI) and (2) a CSI-negative corpus (e.g., tweets unrelated to CSI). We do not have clear criteria to define the CSI-negative corpus because its subjects are widely distributed. In contrast, we can clearly define the CSI-positive corpus using a keyword set related to CSI, such as "RCE", "CVE", or "Exploit", defined in data curated by companies specialized in cyber threat intelligence (e.g., Recorded Future), or by adopting well-defined public knowledge (e.g., the CVE data set). Due to these different characteristics, when we use only CSI-positive corpora to construct the embedding model, CSI-positive tweets are classified well, but CSI-negative tweets are not; when we use only the CSI-negative corpus, the model cannot extract distinct features for a specific class, and consequently, the overall classification performance is unsatisfactory. These characteristics are supported by the experimental results presented in this paper (see Section 6).
The contributions of the paper are summarized as follows:
1. We propose a novel word embedding model, contrastive word embedding, which maximizes the difference between base embedding models. We construct each embedding model using the CSI-positive or CSI-negative corpus, which have completely different characteristics. Here, we utilize background knowledge, i.e., a well-defined public data set for a specific class, to supplement the imbalance of each tweet corpus. Hence, we use two kinds of data sets for the CSI-positive corpora: (1) the CSI-positive tweet data set and (2) the CVE data set as the background knowledge; similarly, we use two kinds of data sets for the CSI-negative corpora: (1) the CSI-negative tweet data set and (2) the Wikitext data set as the background knowledge.

2. We devise a new text classification model based on the proposed word embedding model. We adopt deep learning models such as CNN or LSTM to extract adequate feature vectors from the embedding models and integrate the feature vectors into one classifier. To the best of our knowledge, none of the previous methods have considered both CSI-positive and -negative embedding models in one integrated classifier.

3. To validate the effectiveness of the proposed classification model, we compare it with two baseline models: (1) a model based on a single embedding model constructed with the CSI-positive corpus only and (2) another model with the CSI-negative corpus only. In the experiments, we use 70,000 tweets from each of the CSI-positive and CSI-negative corpora as the training data set and 30,000 tweets from each corpus as the testing data set. As a result, we show that the proposed model achieves high accuracy, i.e., an F1-score of 0.934 and an AUC of 0.935, improving on the baseline models by 1.76∼6.74% in F1-score and by 1.64∼6.98% in AUC.
The paper is organized as follows. In Section 2, we explain the background of tweet classification. In Section 3, we review the related work. In Section 4, we present the data sets and corpora used in this paper. In Section 5, we propose a new classification model based on contrastive word embedding. In Section 6, we present the experimental results to show the effectiveness of the proposed model. In Section 7, we discuss the characteristics of the proposed model. In Section 8, we conclude the paper.

Background
Text classification methods have been studied for various applications because they can speed up decision-making and automate the analysis of text data [14]. Over the past years, machine learning techniques (e.g., k-NN [15] and SVM [16]) have been investigated for text classification. Recently, due to the huge success of deep learning in the field of natural language processing, there have been many research efforts proposing text classifiers based on deep learning. Here, two major factors affecting the performance of a text classifier are as follows: (1) the word representation built from a corpus and (2) the deep learning model used for the classifier. In the following sections, we explain the details of each factor.

Word Embedding
Word embedding represents the words in a text as vectors in a real-valued vector space. It enables us to capture both the semantic and syntactic information of words from the vector space [17,18].
In deep learning models as well, an effective word embedding model is essential for improving the accuracy of text classification [9]. Term Frequency-Inverse Document Frequency (TF-IDF) has been widely used as a measure to represent the importance of the words in a document based on simple statistics (i.e., the frequency of each word) [19]. However, TF-IDF does not capture the semantics of the words in the document. Therefore, Mikolov et al. have proposed Word2Vec [20], which can represent the semantic similarity of words. It trains a word embedding model by considering the similarity of words within local context windows. They have also devised a method of representing the text corpus by phrases instead of words in Word2Vec [21]. GloVe is another text representation that uses global word-word co-occurrence statistics to address a limitation of Word2Vec, which does not consider global co-occurrence [22]. FastText has been proposed to overcome another limitation of Word2Vec, which cannot represent out-of-vocabulary and infrequent words [23]. To this end, FastText breaks each word into a bag of character n-grams and uses the sum of all the n-grams as the vector for the word. Li et al. have proposed a new embedding model where each element is composed of a pair of contradicting words (e.g., the pair of "unimpressive" and "impressive") [24]. The proposed model performs well in recognizing contradicting relations between sentences. Liu et al. have shown that it is more effective to construct word embeddings using a corpus related to a specific target topic than using a general corpus [25].
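To illustrate how such embeddings are trained in practice, the following is a minimal sketch using the gensim library; the toy corpus and hyperparameter values are our own illustrative assumptions, not those of the cited works.

```python
# Minimal sketch of training Word2Vec and FastText embeddings with gensim.
# The toy corpus and hyperparameters are illustrative assumptions only.
from gensim.models import Word2Vec, FastText

corpus = [
    ["new", "rce", "exploit", "released", "for", "cve-2019-0708"],
    ["patch", "your", "samba", "servers", "now"],
]

# Word2Vec: embeddings learned from co-occurrence in local context windows.
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)

# FastText: each word is a bag of character n-grams, so a vector can be
# composed even for out-of-vocabulary or infrequent words.
ft = FastText(sentences=corpus, vector_size=100, window=5, min_count=1)

print(w2v.wv["exploit"].shape)       # (100,)
print(ft.wv["cve-2020-9999"].shape)  # OOV word still gets a vector via n-grams
```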

Multiple Word Embedding
To improve on the performance of a single embedding model, there have been research efforts to utilize multiple embedding models. Bruni et al. have proposed multiple embedding models defined from different data modalities, i.e., textual models and visual models [26]. Luo et al. have shown that the performance of text similarity measurement becomes better when multiple word embeddings are used instead of a single word embedding [27]. Zhang et al. have shown that the performance of multiple word embeddings from various text representations, i.e., Word2Vec, GloVe, and Synthetic, is better than that of each single embedding [28]. Ren et al. have proposed multiple word embeddings where one is constructed from the original texts and the other from the background knowledge [29].

Background Knowledge for Embedding
The performance of deep learning models tends to degrade when only a target data set is used as the corpus for training the embedding model because, due to its imbalance, the training data set cannot encompass all the characteristics of the target class [29]. To resolve this imbalance of a target data set, methods of incorporating background knowledge, which utilize an external reliable data set to supplement the target class, have been proposed [30,31]. In particular, tweets tend to contain insufficient information for classification because they use informal expressions and have a limited text length [5]. In this paper, we also define the background knowledge for the CSI-positive and -negative data sets and show their effectiveness compared to a target data set (i.e., tweets).

Text Classification with Deep Learning
In text classification based on deep learning, once the text corpus is represented in the embedding model, its outputs are fed into a classifier based on deep learning. In the following sections, we introduce the representative deep learning models that will be used in the proposed classification model.

CNN
LeCun et al. have proposed CNN, aiming at image recognition [32]. The basic idea of CNN is to capture features of the data by moving a kernel, i.e., a convolution matrix, over regions of the image. While general neural networks do not maintain spatial information in the image, CNN maintains it by applying the kernel to each region of the image. For natural language processing (NLP), we can also apply the convolutional layer of CNN to the vector space converted from the text corpus. Because each kernel can learn the embedding over a region (i.e., one sentence in NLP) and capture the semantic and structural features in the sentence, CNN performs well in text classification. Xu et al. have adopted a CNN model for classifying news data using two kinds of embedding models: topic-based word embedding and Word2Vec [33]. In addition, Kim et al. have shown the effectiveness of CNN in short text classification on movie reviews [34]. Hu et al. have presented a new method, SVMCNN, for short text classification, which combines CNN and SVM [35]. Specifically, they have used CNN as the feature extractor for short texts and SVM as the classifier, and SVMCNN shows a better performance than CNN or SVM alone. Liu et al. have proposed an attention-gated CNN for sentence classification that generates attention weights from the features' context windows before the pooling layer [36], which shows a better performance than standard CNN models.

RNN and LSTM
Elman et al. have proposed the recurrent neural network (RNN) for processing sequential data such as voice and text [37]. The distinguishing feature of RNN, compared to general neural networks, is the introduction of the hidden state vector. The hidden state summarizes the previous input data and is updated whenever a new input comes in. Eventually, after processing all the input data, the hidden state is a summary of the entire sequence, which is quite similar to how a human being processes a sequence. Naturally, RNN has an advantage when processing sentences as a person would read them. However, as the network becomes deep, exploding and vanishing gradient problems occur, which degrade the performance. To avoid them, Long Short-Term Memory (LSTM) has been proposed [38]. To prevent the exploding and vanishing gradient problems, LSTM adds a cell state that adjusts how previous information is carried forward. LSTM has been widely used for text classification because it can learn high-level representations using deeper layers thanks to the cell state, while preserving the sequence order of representations, as provided by RNN. Wang et al. have applied LSTM to the sentiment classification of short texts on social media [39]. Ding et al. have proposed a densely connected Bi-LSTM consisting of multiple Bi-LSTM layers [40], which shows a better performance than the standard Bi-LSTM.

Tweet Classification
A number of studies have applied machine learning and deep learning techniques to tweet classification. First, for the machine learning techniques, Sriram et al. have proposed a classification method based on the Naïve Bayes classifier that additionally uses account information to improve the classification accuracy for tweets [41]. Alsmadi et al. have proposed a new term weighting scheme and have evaluated nine term weighting schemes, including the proposed one, for extracting the characteristics of tweets [42]. Then, they have applied machine learning techniques such as SVM and k-Nearest Neighbor (k-NN) to the corpus built from each term weighting. Second, for the deep learning techniques, Wang et al. have applied LSTM to classify Internet Movie Database (IMDB) reviews by sentiment [39]. They have shown that the performance of the deep learning model is better than that of machine learning methods such as Naïve Bayes and the Extreme Learning Machine (ELM). Zhou et al. have attempted to concatenate CNN and LSTM to utilize their respective strengths [43]. That is, CNN is used to extract a sequence of phrase representations, and then LSTM is used to learn the sentence representation from the output of CNN. The Graph Convolutional Network (GCN) has been proposed as one of the effective graph neural networks that can capture neighbor information in a graph representation [44]. Yao et al. have proposed a new GCN for text classification using a single large graph consisting of word nodes and document nodes. Because it can jointly learn the embeddings of words and documents, it shows a robust classification performance even with little labeled data [45]. Yang et al. have applied capsule networks [46] to text classification [47]. For this, they have devised dynamic routing strategies to alleviate the disturbance of noisy capsules.

Classification Using Background Knowledge
There have been research efforts utilizing background knowledge in training embedding models for text classification. Yang et al. have proposed a new topic model that combines lexical features obtained from two training data sets, i.e., Google Snippets and Ohsumed, with semantic features obtained from Wikitext, which is used as the background knowledge, for topic classification of short texts [48]. Qureshi et al. have constructed a large graph of categories and articles in Wikipedia, which is used as the background knowledge [49]. This graph-based background knowledge shows a better performance than Word2Vec, FastText, and GloVe. Ren et al. have proposed a new text representation model that calculates the similarity between the text data from Fudan University, which is the target data set, and a Chinese encyclopedia, which is the background knowledge [50]. Furthermore, Ren et al. have performed feature fusion and decision fusion of multiple embedding models built from the target data set and the background knowledge [29].
Based on these previous studies, we adopt background knowledge to improve the performance of classifying CSI-positive and -negative tweets. In particular, we focus on the definition of the background knowledge in building multiple word embedding models based on CSI-positive and -negative tweets. We also investigate the effects of different fusion strategies (i.e., feature fusion and decision fusion) in designing the proposed embedding model (see Section 5.5).

Classification by the Cybersecurity Intelligence
Ritter et al. have proposed weakly supervised learning to detect cybersecurity events using tweets [10]. For the weakly supervised learning, they have annotated tweets containing the keyword "DDoS", out of the tweets written on the dates when DDoS attacks occurred, as cybersecurity intelligence (CSI)-positive tweet data. Chambers et al. have proposed a framework to analyze DDoS attacks using tweets according to the following steps: (1) collection of tweets written on the day when a DDoS attack occurred, (2) training them using a basic neural network, (3) detection of attack events from the trained model, and (4) extraction of the attack topics by analyzing users' responses to the attack using an LDA-based model [12]. Zong et al. have collected tweets containing the keyword "vulnerability" and have applied logistic regression to detect the existence of threats. They have also applied logistic regression and CNN to analyze the extent of threats and users' opinions on social media about cyber threats [51].
Le Sceller et al. have proposed a new framework for discovering cybersecurity events in real time, called SONAR [52]. SONAR collects tweets containing CSI and clusters them into several groups using the first story detection algorithm [53], where each cluster is regarded as one event. It also identifies new keyword sets related to CSI by considering the co-occurrence between tweets in the GloVe embedding model.
Dionísio et al. have proposed a method to classify tweets related to CSI in real time [13]. They have used CNN as the classifier to determine if a given tweet is related to CSI and have used BiLSTM to classify the tweets as one of six entities. They have shown that the proposed method outperforms the machine learning techniques such as SVM and Multi-Layer Perceptron (MLP).
Alves et al. have proposed a Twitter streaming threat monitor that continuously updates the summary of the threats related to a target infrastructure [54]. They have collected CSI-positive tweets based on the accounts and have extracted features using TF-IDF. Then, they have used MLP and SVM as the classifier for CSI-positive tweets.
Le et al. have proposed a new machine learning-based method for detecting CSI [11]. They have used the CVE data set as the background knowledge and have adopted two novelty classifiers (i.e., centroid and one-class SVM) to detect tweets related to CSI. For extracting the features fed to the classifiers, Le et al. have used TF-IDF values obtained from the CVE data set. In contrast, in this paper, we use two representative deep learning models, i.e., CNN and LSTM, for both the feature extractor and the classifier, because a machine learning-based classifier requires a preceding feature extractor and its performance greatly depends on that feature extractor [55]. The detailed architecture of the proposed method is explained in Section 5.5. In particular, our method focuses on integrating multiple baseline learning models into one classifier. This implies that our framework can be easily extended to support other deep learning or machine learning methods, including this one, by incorporating a new baseline learning model into the framework. As presented in Section 6, the F1-score of the proposed model reaches 0.932∼0.934, which indicates quite high accuracy compared to the method proposed by Le et al. [11], where the F1-score was only 0.643, even considering the different data sets.
Table 1 summarizes the comparison of previous studies on the classification by the CSI. It indicates that they have focused on the usage of effective classifiers to improve the classification accuracy. None of them have used multiple word embedding models; only one study [11] has used the background knowledge for a single word embedding. In this paper, we focus on the design of a novel multiple word embedding, which is built in a contrastive way in terms of the relevance to CSI, and on the definition of the background knowledge for the multiple word embedding.
Table 1. Comparison of previous studies on the classification by the cybersecurity intelligence (CSI).

Study | Classification Model | Data Set | Background Knowledge | Multiple Word Embedding
Ritter et al. [10] | Weakly supervised learning | Tweets containing "DDoS" | N/A | X

Data Sets and Corpus
In this paper, we use three kinds of data sets: (1) curated data, (2) OSINT data, and (3) background knowledge. Curated data are provided by companies specialized in the field of cybersecurity, such as Recorded Future (https://www.recordedfuture.com/), Sensecy (https://www.sensecy.com/), and Surfwatch (https://www.surfwatchlabs.com/). In this paper, we obtain them from Recorded Future, and they are used to (1) identify Twitter accounts related to CSI and (2) filter tweets related to CSI from the collected tweets. For the OSINT data, we use tweets, which are publicly available and can be crawled [56]. By leveraging the curated data, we classify a collection of tweets into the CSI-positive and -negative classes. In addition, we use the CVE data set and the Wikitext data set as the background knowledge to construct the embedding models. Table 2 shows the type, source, data set name, usage, and data size of each data set.

Curated Data
We acquire a total of 1935 accounts that have written tweets in the "exploit" category analyzed by Recorded Future. The list of accounts is sorted by the number of tweets containing "exploit". Among the 1935 accounts, we define the RF account set as the top 130 accounts. In addition, we define the RF keyword set as the keywords in the "industrial term" category analyzed by Recorded Future. The RF keyword set contains a total of 639 keywords; examples include "Internet-security", "flaw", "PoC", "Exploit", "RCE", "Online Security Blog", "CVE", "Flash", "Sandworm", "Shellshock", "Neutrino", "Samba", "Stagefright", and "Bin", which clearly show their relevance to CSI.

CSI-Positive Tweet Data Set
We use the cybersecurity intelligence (CSI)-positive tweet data set for word embedding, training, and testing of the model. Figure 1 shows an example from the CSI-positive tweet data set. To collect this data set, we use a web crawling framework, Scrapy (https://scrapy.org/). We collect all tweets written by the accounts in the RF account set. Since some of them may not be related to CSI, we keep only the tweets containing at least one keyword in the RF keyword set and define them as the CSI-positive tweet data set.
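The filtering step amounts to a simple keyword match; the sketch below illustrates the idea, with a small hypothetical subset of the RF keyword set standing in for the full 639 keywords.

```python
# Sketch of filtering crawled tweets with the RF keyword set.
# The keyword list here is a small illustrative subset; the actual RF
# keyword set contains 639 keywords curated by Recorded Future.
rf_keywords = {"exploit", "rce", "cve", "flaw", "poc", "shellshock"}

def is_csi_positive(tweet_text: str) -> bool:
    """A tweet is kept if it contains at least one RF keyword."""
    tokens = tweet_text.lower().split()
    return any(tok in rf_keywords for tok in tokens)

tweets = [
    "New RCE exploit published for CVE-2019-0708",
    "Had a great coffee this morning!",
]
csi_positive = [t for t in tweets if is_csi_positive(t)]
print(csi_positive)  # keeps only the first tweet
```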

CSI-Negative Tweet Data Set
We use the CSI-negative tweet data set for word embedding, training, and testing of the model. For this data set, we collect tweets written by random users using the Twitter Streaming API (https://developer.twitter.com/en/docs/tweets/filter-realtime/overview) and define them as the CSI-negative tweet data set. Tweets written by random users are widely distributed over various topics. Thus, this is an adequate data set with contrastive characteristics against the CSI-positive tweet data set.

CVE Data Set
Common Vulnerabilities and Exposures (CVE) is a database of disclosed software flaws managed by the National Vulnerability Database (NVD) (https://nvd.nist.gov/vuln/full-listing). A CVE entry contains a brief description of a cyber threat, including information about the affected product, versions, vendor, threat type, impact, method, and inputs of an attack. It is a formal document that has been officially verified. In this paper, we use the CVE data set as the background knowledge to supplement the imbalance of the CSI-positive tweet data set.
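The paper collects the CVE data from the NVD full listing; as an alternative illustration only (an assumption on our part, not the collection method used in the paper), CVE descriptions can also be pulled from the NVD REST API, as sketched below.

```python
# Sketch of collecting CVE descriptions as a CSI-positive background corpus.
# Using the NVD REST API 2.0 here is our own illustrative assumption; the
# paper crawled the NVD full listing instead.
import requests

URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"
resp = requests.get(URL, params={"resultsPerPage": 20}, timeout=30)
resp.raise_for_status()

cve_corpus = []
for item in resp.json().get("vulnerabilities", []):
    for desc in item["cve"].get("descriptions", []):
        if desc["lang"] == "en":
            cve_corpus.append(desc["value"])

print(len(cve_corpus), "CVE descriptions collected")
```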

Wikitext Data Set
The Wikitext data set (https://creativecommons.org/licenses/by-sa/3.0/) is a collection of tokens extracted from a set of verified articles on Wikipedia; we use the Wikitext data with over 100 million tokens. One existing study has validated the quality of the Wikitext data set by showing that language modeling using it outperforms language modeling using the Penn Treebank, which is a target data set [57]. In this paper, we use the Wikitext data set as the background knowledge to supplement the imbalance of the CSI-negative tweet data set.

Basic Idea
We propose a new text classification model for detecting cybersecurity intelligence in Twitter based on a novel embedding model, contrastive word embedding. Figure 2 shows the overall framework of the proposed model based on contrastive word embedding. To construct the proposed embedding model, we define two completely different corpora according to their relevance to CSI: (1) the CSI-positive data set and (2) the CSI-negative data set. To confirm the conjecture that these two corpora have contrastive characteristics in terms of CSI, we investigate the CSI-positive ratio of each corpus using actual tweet data sets. Specifically, with the CSI-positive corpus (i.e., 50,000 tweets) and the CSI-negative corpus (i.e., 50,000 tweets), we calculate the ratio of tweets containing the CSI-related keywords (see the definition of the RF keyword set in Section 4.1) in each corpus. As a result, we observe that 39.75% of the tweets in the CSI-positive corpus, but only 4.88% of the tweets in the CSI-negative corpus, contain the CSI-related keywords. This implies that the two defined corpora are indeed contrastive in terms of CSI.

Pre-Processing of Data Sets
Because tweets usually contain noise, pre-processing of the collected raw tweets is essential. In this paper, we adopt the existing pre-processing methods commonly used in previous SNS text mining studies [13,58]. The pre-processing techniques used are as follows: (1) all characters except those used in English are eliminated, (2) uppercase letters are converted to lowercase letters, and (3) tweet-dependent properties such as RT marks, @mentions, tags, links, and punctuation are removed. For the CVE and Wikitext data sets, we also eliminate all characters except those used in English.
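These three rules can be implemented with a few regular expressions; the following is a minimal sketch of our reading of them (the exact expressions used in the paper may differ).

```python
# Sketch of the tweet pre-processing rules described above.
# The exact regular expressions are illustrative assumptions.
import re

def preprocess_tweet(text: str) -> str:
    text = re.sub(r"rt\s+@\w+|@\w+", " ", text, flags=re.IGNORECASE)  # RT marks, mentions
    text = re.sub(r"https?://\S+", " ", text)                         # links
    text = re.sub(r"#", " ", text)                                    # tag symbols
    text = text.lower()                                               # (2) lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                             # (1),(3) English letters only
    return re.sub(r"\s+", " ", text).strip()                          # collapse whitespace

print(preprocess_tweet("RT @user: New RCE exploit! https://t.co/xyz #infosec"))
# -> "new rce exploit infosec"
```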
When we construct the embedding model from a corpus, we use the top 5000 words by frequency in that corpus for every embedding model. Table 3 shows the training and testing data sets used for the proposed method. We use a total of 200,000 tweets for training and testing. Specifically, for the CSI-positive tweet data set, we use 70,000 tweets from the top 100 accounts in the RF account set for training and 30,000 tweets from the remaining 30 accounts for testing, thereby separating the testing data set from the training data set. For the CSI-negative tweet data set, among the total of 100,000 tweets collected from the Twitter Streaming API, we use 70,000 tweets for training and the remaining 30,000 tweets for testing.

The Baseline Model
In this section, we present the baseline model based on each single word embedding that will be used as the basis of the proposed classification model. Figure 3a shows the architecture of the baseline model.

Input layer:
We conduct one-hot encoding of the input tweet text and fix the length of the encoded sequence to 100. That is, if the encoded sequence is longer than 100, we use the first 100 elements; otherwise, we pad it with 0.

Word embedding:
For the baseline model, we use a single embedding layer, i.e., the CSI-positive or -negative embedding, on which we construct CNN or LSTM layers. We use four types of corpora (i.e., the CSI-positive tweet data set and CVE data set for the CSI-positive corpora; the CSI-negative tweet data set and Wikitext data set for the CSI-negative corpora). To construct each embedding model, we randomly initialize the output vectors of the model with 100 dimensions.

Feature extractor:
In the proposed model, we employ the feature fusion strategy to integrate the feature vectors extracted from multiple learning models into one classifier (see Section 5.5). For this, we need to separate the feature extractor from the classifier because the concatenation of the features occurs between the feature extractor and the classifier. We design each baseline model with the same strategy for a fair comparison. That is, we use a deep learning model (i.e., CNN or LSTM) as the feature extractor. Then, we apply the classifier to the feature vectors extracted by the feature extractor. For the CNN model, we use a Conv1D layer with 128 filters and a global max-pooling layer; for the LSTM model, we use 128 nodes in a hidden layer with tanh activation.

Binary classifier:
We design a binary classifier based on a fully connected layer with sigmoid activation. We train the model to minimize the binary cross-entropy loss.
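Putting the four parts together, the following is a minimal Keras sketch of the baseline model. The layer sizes follow the text above; the tokenizer settings, convolution kernel size, and optimizer are not specified in the paper and are our own illustrative assumptions.

```python
# Minimal Keras sketch of the baseline model (single embedding + CNN or LSTM).
# Layer sizes follow the text; kernel size, optimizer, and tokenizer settings
# are illustrative assumptions.
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE, SEQ_LEN, EMB_DIM = 5000, 100, 100

def build_baseline(feature_extractor: str = "cnn") -> models.Model:
    inp = layers.Input(shape=(SEQ_LEN,))
    # Word embedding: randomly initialized 100-dimensional vectors.
    x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inp)
    # Feature extractor: Conv1D(128) + global max pooling, or LSTM(128, tanh).
    if feature_extractor == "cnn":
        x = layers.Conv1D(128, 5, activation="relu")(x)  # kernel size 5 is an assumption
        x = layers.GlobalMaxPooling1D()(x)
    else:
        x = layers.LSTM(128, activation="tanh")(x)
    # Binary classifier: fully connected layer with sigmoid activation.
    out = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Input layer: index-encode tweets and pad/truncate to length 100.
tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(["new rce exploit for cve", "great coffee this morning"])
X = pad_sequences(tokenizer.texts_to_sequences(["new rce exploit for cve"]),
                  maxlen=SEQ_LEN, padding="post", truncating="post")
model = build_baseline("cnn")
print(model.predict(X).shape)  # (1, 1): probability of being CSI-positive
```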

Contrastive Word Embedding and Model Fusion
The main idea of the proposed model is to devise a new classification model based on contrastive word embedding that can effectively integrate multiple word embedding models into one classifier. Figure 3b shows the architecture of the proposed model compared to the baseline model. The distinguishing properties of the proposed model over the baseline model stem from two aspects: (1) the embedding model used and (2) the model fusion. In the embedding model, we use two word embedding models having contrastive characteristics in terms of CSI, i.e., both the CSI-positive and the CSI-negative embedding, for one classification model. In the model fusion, we concatenate the feature vectors extracted from the deep learning model based on each single embedding model and integrate them into one classifier. To focus on the effectiveness due to the contrastive embedding model and its fusion, we use the same base embedding models and deep learning models as in the baseline model.
In designing the model fusion, we consider two commonly used fusion methods for multiple learning models: (1) feature fusion and (2) decision fusion [59]. Feature fusion is the concatenation of the feature values extracted from multiple models, and the entire feature values are delivered to the succeeding step; decision fusion determines the final decision values considering the feature values extracted from multiple models. In this paper, we employ feature fusion because it maintains all the features extracted from the two base embedding models, and the succeeding step (i.e., the classifier) can utilize the interaction based on the difference between them. In contrast, decision fusion cannot consider the interaction between features from multiple learning models.
Figure 4 shows the concept of the feature fusion used in the proposed model. Specifically, the result of each embedding model is delivered to the deep learning model for feature extraction. Then, we concatenate the features generated from the multiple learning models. For this, we use a fully connected layer, as presented in Figure 3b, where ReLU activation is used and the unit size (i.e., the dimensionality of the output space) is 8. As a result, the feature vectors are expanded so as to consider both the CSI-positive and -negative corpora, maintaining both kinds of information in one integrated vector space. This leads to improved accuracy in classifying CSI-positive and -negative tweets.
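To make the feature-fusion architecture concrete, the following is a minimal Keras sketch of the proposed model. The layer sizes follow the text above; how each embedding matrix is constructed from its corpus, the kernel size, and the optimizer are our own illustrative assumptions.

```python
# Minimal Keras sketch of the proposed contrastive-embedding model with
# feature fusion. Each branch has its own embedding (built from the
# CSI-positive or CSI-negative corpus vocabulary); how those embeddings are
# constructed is elided here. Kernel size and optimizer are assumptions.
from tensorflow.keras import layers, models

VOCAB_SIZE, SEQ_LEN, EMB_DIM = 5000, 100, 100

def branch(inp, name):
    """One embedding + feature-extractor branch (CNN variant shown)."""
    x = layers.Embedding(VOCAB_SIZE, EMB_DIM, name=f"{name}_embedding")(inp)
    x = layers.Conv1D(128, 5, activation="relu")(x)  # kernel size is an assumption
    return layers.GlobalMaxPooling1D()(x)

# The same tweet is index-encoded once per corpus vocabulary.
inp_pos = layers.Input(shape=(SEQ_LEN,), name="csi_positive_indices")
inp_neg = layers.Input(shape=(SEQ_LEN,), name="csi_negative_indices")

# Feature fusion: concatenate the feature vectors from both branches so the
# classifier can exploit the contrast between the two embedding spaces.
fused = layers.Concatenate()([branch(inp_pos, "pos"), branch(inp_neg, "neg")])
fused = layers.Dense(8, activation="relu")(fused)   # fully connected layer, 8 units
out = layers.Dense(1, activation="sigmoid")(fused)  # binary classifier

model = models.Model([inp_pos, inp_neg], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```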

Experimental Methods and Evaluation Metrics
In the experiments, we aim to measure the classification accuracy of the baseline models and the proposed model. In the baseline model, if we choose CNN as the classifier, the methodology corresponds to one of the existing methods [13]. To measure the classification accuracy, we use four evaluation metrics: (1) precision, (2) recall, (3) F1-score, and (4) AUC. Precision and recall are defined by the four components of the confusion matrix: true positive (TP), false positive (FP), false negative (FN), and true negative (TN). TP and TN denote correct predictions; FP and FN denote wrong predictions. In our problem, TP (or FP) means that the model predicts a given tweet as CSI-positive while it is actually positive (or negative); FN (or TN) means that the model predicts a given tweet as CSI-negative while it is actually positive (or negative).
Precision represents the ratio of actual positives (i.e., TP) among the instances predicted as positive (i.e., TP + FP):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (1)$$

Recall, which indicates sensitivity, represents the ratio of correctly predicted positives (i.e., TP) among the actual positives (i.e., TP + FN):

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (2)$$

F1-score is the harmonic mean of precision and recall. It is known that, even when the data are imbalanced, the F1-score can accurately evaluate the performance of the model [60]:

$$\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (3)$$

AUC is another commonly used criterion to evaluate classification accuracy and is obtained by calculating the area under the receiver operating characteristic (ROC) curve, as shown in Figure 5 [61]. The ROC curve plots the TPR on the y-axis against the FPR on the x-axis. TPR is the same as recall in Equation (2). FPR is the proportion of negatives that are incorrectly predicted as positive:

$$\mathrm{FPR} = \frac{FP}{FP + TN} \quad (4)$$

AUC indicates how capable the model is of binary classification [62]; the higher the AUC, the higher the accuracy of the model.
We measured the accuracy of the baseline model based on a single embedding model. Table 4 shows the evaluation results of the baseline model while varying the corpus (i.e., CSI-positive tweet data set, CSI-negative tweet data set, CVE data set, and Wikitext data set) and the classifier (i.e., CNN and LSTM). In the CSI-positive embedding, when we used the CVE data set (i.e., the background knowledge) as the corpus, the model showed much better accuracy than with the CSI-positive tweet data set. This result demonstrates the necessity of using background knowledge for detecting CSI in tweets. We analyzed the result for each corpus in detail. (1) When the CSI-positive tweet data set is used as the corpus, it shows the lowest accuracy among all the corpora: an F1-score of 0.654∼0.658 and an AUC of 0.543∼0.557 as the learning models are varied. We speculate that CSI-positive tweets contain much information related to CSI but, at the same time, also contain general expressions commonly used in tweets. As a result, because CSI-positive tweets are relatively correctly classified (i.e., FN is low), the recall is high (i.e., 0.853∼0.865); however, because the ratio of actual positive tweets among the tweets predicted as positive is low (i.e., FP is high), the precision is low (i.e., 0.526∼0.535). Table 5 shows the four components (i.e., TP, FN, FP, and TN) used for calculating precision and recall to analyze the accuracy of the baseline models. Specifically, among the 60,000 testing tweets, FN is only 4046∼4403, but FP is as high as 22,204∼23,396.
(2) When the CVE data set is used as the corpus, it shows the highest accuracy among all the corpora: an F1-score of 0.904∼0.909 and an AUC of 0.909∼0.912 as the learning models are varied. In this case, as shown in Table 5, the recall is as high as in the case of the CSI-positive tweet data set; in addition, the precision is much higher than in the case of the CSI-positive tweet data set because FP becomes much lower (i.e., only 1129∼1301).
In the CSI-negative embedding, both the Wikitext data set and the CSI-negative tweet data set show comparable classification accuracy. That is, for the CSI-negative tweet data set, the F1-score was 0.875∼0.878 and the AUC was 0.874∼0.879; for the Wikitext data set, the F1-score was 0.878∼0.886 and the AUC was 0.882∼0.888. Because the CVE data set significantly outperformed the CSI-positive tweet data set, we chose the CVE data set as the only candidate for the CSI-positive corpus in constructing the embedding model; however, because the CSI-negative tweet data set and the Wikitext data set show comparable accuracy, we chose both of them for the CSI-negative corpus. Table 6 shows the evaluation results of the proposed model for the possible fusions of the CSI-positive and -negative corpora. We observe that the fusion of the CVE data set and the CSI-negative tweet data set shows better accuracy than that of the CVE data set and the Wikitext data set.
Figure 6a,b show the comparison of the accuracy between the baseline models and the proposed models using F1-score and AUC, respectively. The proposed models based on contrastive word embedding achieve a meaningful improvement over the baseline models based on a single embedding model in both F1-score and AUC. Specifically, the proposed model based on the fusion of the CVE data set and the CSI-negative tweet data set improves the accuracy of the baseline model using only the CVE data set by 2.53∼3.32% in F1-score and by 2.41∼2.86% in AUC; it improves that using only the CSI-negative tweet data set by 6.15∼6.74% in F1-score and by 6.26∼7.00% in AUC. Table 7 shows the four components (i.e., TP, FN, FP, and TN) used for calculating precision and recall to analyze the accuracy of the proposed model. By comparing Tables 5 and 7, we observe that FP and TN in the proposed model are quite similar to those in the baseline model using only the CVE data set, which shows the best accuracy among the baseline models; however, in the proposed model, TP increases and FN decreases (i.e., 1260∼1680) compared to the baseline model, improving the overall accuracy. In addition, the proposed model based on the fusion of the CVE data set and the Wikitext data set improves the accuracy of the baseline model using only the CVE data set by 1.76∼2.10% in F1-score and by 1.64∼1.76% in AUC; it improves that using only the Wikitext data set by 4.40∼5.12% in F1-score and by 4.39∼4.99% in AUC. Here again, FP and TN in the proposed model are quite similar to those in the baseline model using only the CVE data set; however, in the proposed model, TP increases and FN decreases (i.e., 1016∼1156) compared to the baseline model.
Here, we analyze the performance improvement of the proposed method compared to the best baseline model (i.e., the one using only the CVE data set). As shown in Tables 4-7, the precision of the proposed model using the fusion of the CVE data set and the CSI-negative tweet data set is similar to that of the baseline model using only the CVE data set. However, the proposed model significantly improves the recall by 4.83∼6.54% compared to the baseline model. As a result, the overall performance of the proposed method shows a meaningful improvement over the baseline model.
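For reference, the four evaluation metrics used in this section can be computed directly from model predictions; the following is a minimal scikit-learn sketch with hypothetical labels and scores.

```python
# Sketch of computing the evaluation metrics; labels and scores are hypothetical.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 1, 0, 0, 1, 0]               # 1 = CSI-positive, 0 = CSI-negative
y_score = [0.9, 0.7, 0.4, 0.1, 0.6, 0.8]  # predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP), Equation (1)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN), Equation (2)
print("F1-score: ", f1_score(y_true, y_pred))         # Equation (3)
print("AUC:      ", roc_auc_score(y_true, y_score))   # area under the ROC curve
```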

Discussion
The goal of the classification: when working on cyber threat intelligence, it is difficult to clearly determine which tweets from public SNS media are related to cyber threats. In this paper, we propose a classification model for detecting tweets related to cybersecurity, which can be determined more clearly than tweets related to cyber threats, by using curated data sets without manual effort (see Section 4). That is, to focus on showing the effectiveness of the proposed model, we define the corpus using the clearer criteria. If we can clearly define a corpus of cyber-threat-related tweets, we can also apply the proposed model to that corpus.
Fusion of learning models: to integrate multiple learning models into one classifier, we have considered two approaches: (1) model fusion, which is finally used in the proposed model, and (2) data fusion, which has been used in existing studies for other problems: measuring the semantic relevance of short texts [63] and detecting adverse drug reactions [64]. Figure 7 shows the architecture of the data fusion we considered. That is, when constructing the embedding model, we integrate both the CSI-positive and -negative corpora into one corpus. However, we have observed that data fusion is not effective for classifying CSI-positive and -negative tweets, as shown in Figure 8a,b, where the CVE data set and the CSI-negative tweet data set are used as the corpora. That is, when the two corpora are combined, the resulting model even degrades the classification accuracy compared to a model based on a single embedding model.

Conclusions
In this paper, we have devised a new text classification model based on deep learning to classify CSI-positive and -negative tweets from a collection of tweets. For this, we have proposed a novel word embedding model, called contrastive word embedding, which maximizes the difference between base embedding models. First, we have defined the CSI-positive and -negative corpora, which are used for constructing the embedding models. Here, to supplement the imbalance of the tweet data sets, we have additionally employed background knowledge for each tweet corpus: (1) the CVE data set for the CSI-positive corpus and (2) the Wikitext data set for the CSI-negative corpus. Second, we have adopted deep learning models such as CNN or LSTM to extract adequate feature vectors from the embedding models and have integrated the feature vectors into one classifier.
To validate the effectiveness of the proposed embedding model, we have compared the proposed model with two baseline classification models: (1) a model based on a single embedding model constructed with the CSI-positive corpus only and (2) another model with the CSI-negative corpus only. In the experiments, we used 70,000 tweets from each of the CSI-positive and CSI-negative corpora as the training data set and 30,000 tweets from each corpus as the testing data set. As a result, we have shown that the proposed model achieves high accuracy, i.e., an F1-score of 0.934 and an AUC of 0.935, improving on the baseline models by 1.76∼6.74% in F1-score and by 1.64∼6.98% in AUC.
In this paper, we have proposed the concept of contrastive word embedding and have applied it to classifying CSI-positive and -negative tweets. The concept of contrastive word embedding can be used more widely because it can be generalized by defining multiple base embedding models and effectively integrating them. Hence, it is applicable to improving the classification performance in other domains by defining embedding models suitable for the problem and optimizing their combination.

Conflicts of Interest:
The authors declare no conflict of interest.