EX-Action: Automatically Extracting Threat Actions from Cyber Threat Intelligence Report Based on Multimodal Learning

With the increasing complexity of network attacks, active defense based on intelligence sharing has become crucial. An important problem in intelligence analysis is the automatic extraction of threat actions from cyber threat intelligence (CTI) reports. To address this problem, we propose EX-Action, a framework for extracting threat actions from CTI reports. EX-Action finds threat actions using natural language processing (NLP) technology and identifies them with a multimodal learning algorithm. In addition, a metric is used to evaluate the information completeness of the actions extracted by EX-Action. In experiments on CTI reports consisting of sentences with complex structure, EX-Action achieves better performance than two state-of-the-art action extraction methods in terms of accuracy, recall, precision, and F1-score.


Introduction
With the increasing amount of information in modern society, advanced persistent threat (APT) attacks, as a new development in cyber security, have gradually become one of the main attack methods. APT attacks have many characteristics, such as long duration, complex attack methods, and strong concealment. The traditional defense method is passive: it mainly relies on security equipment and rule matching to generate alarms for static control. It is not suitable for protection against APT, 0-day attacks, and other new network security threats [1]. Therefore, many organizations have aimed to develop timely, relevant, and actionable CTI about emerging threats and key threat actors to enable effective cybersecurity decisions [2].
CTI is a kind of information that records current and former security threats [3]. It contains information such as the reasoning, context mechanism, observable indicators, mitigation measures, and countermeasures of attacks. It is extremely time-consuming for security practitioners to analyze and utilize multisource, unstructured CTI reports. Therefore, automatic and efficient information extraction from unstructured CTI reports has become one of the main research directions.
Information extraction aims to obtain valuable information from unstructured CTI reports. The extracted information mainly includes cybersecurity entities and their relationships. Cybersecurity entity recognition identifies named entities in the cybersecurity field, which mainly include names of persons, organizations, places, and some security terms. Entity-relationship extraction extracts the relationships between security entities in unstructured CTI reports. It mainly targets triple extraction for known entities and predefined relationships, so complex relationships between entities with contextual connections are hard to identify. An action consists of a subject, a verb, and an object. Actions not only describe the attack behaviors in the attack process but also include non-predefined entities and their contextual semantic relationships. Therefore, actions are crucial for CTI reports. The subject and the object of an action correspond to a pair of security entities, and the verb describes the semantic relationship between the entity pair. In the proposed method, the entities and the relationships between them do not need to be predefined.
At present, the extraction of actions is mainly based on semantic dependency [4], and an ontology model [5] is used to identify them. As a result, the extraction and identification of threat actions face the following challenges: (1) Threat actions cannot be accurately extracted from unstructured CTI reports by relying only on semantic dependency. (2) Relying on ontology methods to identify actions loses key threat actions that are not defined in the ontology. (3) The information content of extracted threat actions is incomplete, and it is difficult to measure.
In this study, we propose a multimodal learning approach, named EX-Action, to accurately extract and automatically identify threat actions in unstructured CTI reports. EX-Action combines mutual information with NLP technology and can extract more actions based on the syntactic structure. The three main contributions of this study are as follows: (1) We propose an action extraction framework named EX-Action. EX-Action extracts threat actions from unstructured CTI reports consisting of sentences with complex structure by syntactic rule matching, and then identifies threat actions with a multimodal learning algorithm. (2) We use an evaluation indicator named normalized mutual information (NMI) [6] to measure the difference in the information content of threat actions, which quantifies the completeness of that information content. (3) We apply EX-Action to extract 18210 actions from 243 unstructured CTI reports, and the experimental result shows that the accuracy, F1-score, and NMI of EX-Action are 79.09%, 85.58%, and 85.26%, respectively.
The rest of this study is organized as follows. Section 2 reviews related work on extracting information from CTI reports. Section 3 introduces and describes the EX-Action framework. Section 4 gives the experimental results. Section 5 discusses the proposed method. Finally, Section 6 summarizes this study.

Related Work
The fragmentation of information in the era of big data gives unstructured CTI reports the characteristics of diversification, fragmentation, and heterogeneity. To address these characteristics, Liao et al. [7] proposed an approach to automatically recover valuable attack indicators from popular technology blogs and convert them into industry-standard, machine-readable CTI reports. Qamar et al. [8] proposed the construction of a Structured Threat Information eXpression (STIX) analyzer ontology and its ontology model relationships. Their method can determine threat relevance, likelihood, and affected and exposed assets by automatically classifying network threats using formulated rules and inference. Xun et al. [9] proposed an automatic identification model of threat intelligence (TI) based on a convolutional neural network (CNN) for automatically extracting TI from various unstructured TI data sources.
These studies reduce the noise data in CTI reports by reorganizing unstructured threat report knowledge, so that cyber threat information can be identified effectively.
Reconstructing CTI knowledge with graph models is an important research topic in CTI analysis. Shu et al. [10] used a graph model to organize multisource heterogeneous threat data, which formalizes cyber threat intelligence computing into a new security paradigm. Ya et al. [11] proposed an attack entity recognition method to construct a CTI knowledge graph. Jia et al. [12] used existing machine learning technology to organize the knowledge of threat reports and construct a cybersecurity knowledge base. Du et al. [13] proposed a knowledge graph for human-readable CTI recommendation from the perspective of the attack chain. The threat intelligence knowledge graph helps security practitioners understand cyber threats in a timely and rapid manner.
Current research on CTI reports mainly covers real-time perception, dynamic sharing, and effective application. The application of CTI involves both structured and unstructured CTI reports. For structured CTI reports, Kim et al. [14] automatically generate rules without human intervention to mitigate newly discovered network security threats in real time. In response to the lack of domain knowledge analysis for existing structured CTI reports, Tappeiner et al. [15] proposed a domain recognizer based on a convolutional neural network that identifies the targeted domain of CTI and automatically generates specific CTI from social media data.
To address the inefficiency of CTI applications caused by overreliance on manual analysis by security practitioners, Zhu et al. [16] proposed an end-to-end approach for automatic feature engineering, which identifies abstract behaviors associated with malware, maps these behaviors to concrete features, and generates a characteristic semantic network. Zhu et al. [17] proposed an approach to bridge measurement data with manual analysis and train a multiclass classifier to extract IOCs and further categorize them into different stages. Ayoade et al. [18] leveraged natural language processing techniques to extract attackers' actions from threat report documents generated by different organizations and then automatically classify them into standardized tactics and techniques.
For threat action extraction from unstructured CTI reports, Husari et al. [5] proposed a method named TTPDrill, which extracts actions based on semantic dependency and an ontology database that maps actions to different attack patterns. However, TTPDrill neglects some threat actions in clause structures and parallel sentences. In addition, because it uses an ontology structure to identify threat actions, it loses threat actions that are not defined in the ontology.
Husari et al. [19] developed an approach named ActionMiner, which uses NLP technology together with information entropy and mutual information to extract low-level cyber threat actions from publicly available CTI sources. However, ActionMiner relies on syntactic analysis to extract low-level threat actions. The extracted actions lack a behavioral subject, so their information content is difficult to guarantee.
This study proposes a framework called EX-Action. It extracts actions based on the syntactic structure and rule mapping and identifies them with a multimodal learning algorithm. EX-Action identifies actions based on multiple features, which improves the accuracy of action recognition and covers actions in complex sentence structures.

Proposed Framework
In this study, we propose a framework called EX-Action. It contains four modules: data preprocessing, candidate threat action extraction, action feature extraction, and action identification. The EX-Action architecture is shown in Figure 1. First, EX-Action preprocesses the obtained CTI report. Second, candidate threat actions are extracted by a rule-based method. Then, the multimodal features of the candidate actions are calculated. Finally, EX-Action identifies actions and generates the selected actions with a weighted ensemble learning algorithm.

Data Preprocessing.
In this module, EX-Action cleans the data of CTI reports by filtering out invalid characters and sentences that do not contain threat actions. CTI reports contain cybersecurity terms, such as file paths and IP addresses, that NLP tools do not recognize. EX-Action uses regular expressions to replace and save these unrecognizable terms.
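The replace-and-save step can be illustrated with a minimal sketch. The two patterns, the placeholder format, and the function names `mask_terms` and `restore_terms` are assumptions made for illustration; the regular expressions actually used by EX-Action are not given here.

```python
import re

# Hypothetical patterns: the regular expressions used by EX-Action are not listed.
PATTERNS = [
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("PATH", re.compile(r"[A-Za-z]:\\(?:[^\\\s]+\\)*[^\\\s]+")),
]
PLACEHOLDER = re.compile(r"\b(?:IP|PATH)_\d+\b")


def mask_terms(sentence):
    """Replace unrecognizable terms with placeholders and save the originals."""
    saved = {}
    for label, pattern in PATTERNS:
        def _sub(match, label=label):
            key = f"{label}_{len(saved)}"
            saved[key] = match.group(0)
            return key
        sentence = pattern.sub(_sub, sentence)
    return sentence, saved


def restore_terms(sentence, saved):
    """Put the saved terms back in place of their placeholders."""
    return PLACEHOLDER.sub(lambda m: saved.get(m.group(0), m.group(0)), sentence)
```

The placeholder behaves like an ordinary noun token for the POS tagger, and the saved mapping lets the original term be restored in the extracted action.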

Candidate Threat Actions Extraction.
In this module, EX-Action extracts candidate threat actions from preprocessed CTI reports by a rule-based method. A CTI report consists of n sentences, which can be expressed as $T = \{S_1, \ldots, S_i, \ldots, S_n\}$, and each sentence contains several action verbs, $S_n = \{V_1, \ldots, V_i, \ldots, V_N\}$. For each verb, EX-Action extracts multiple actions based on a rule-matching strategy, denoted as $A_i = \{action_1, action_2, \ldots, action_m\}$. The extracted candidate threat actions are in the format (subject, verb, object), i.e., an action consists of three elements: subject, verb, and object. EX-Action uses the parts of speech (POS) to match the three elements of an action. The matching rules for the three elements are given in Table 1. The column "POS" represents the POS of each component, and "POS-Symbols" represents the symbol of the POS tag.
In this module, sentences are tagged by the POS tool [20]. The verb identified in the POS tagging results is taken as the start of the sliding window. Then, the subject and the object are searched for separately in the threat description sentence. The window size used to search for the subject and the object affects extraction performance. Some potential objects are far from the target verb, so a window that is too small cannot reach them. However, a window that is too large may produce many mismatched verb-object (VO) pairs, which affects the identification efficiency of EX-Action. EX-Action therefore adopts different strategies to set the sliding window size for the subject and the object. For the subject, the sliding window size is set to the number of words before the verb in the sentence, and EX-Action matches all nouns and noun combinations in the window with the verb. For the object, a dynamic window mechanism is used to set the sliding window size.
This mechanism adopts the number of words after the verb in the sentence as the sliding window size, and the sliding window stops when it encounters another verb. Figure 2 shows an example of the action extraction of EX-Action.
When searching for the subject and the object, a noun compound or a pronoun may act as the subject or the object. To ensure the integrity of the extracted action information, a multinoun compound is tokenized as a single noun and matched with the verb. The verb and the object retain the basic information content, but when a pronoun acts as the object, a lot of information might be lost. Therefore, in this module, a pronoun acting as the subject is saved, and a pronoun acting as the object is discarded.
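The window strategy described above can be sketched as follows. The sketch assumes the sentence is already POS-tagged with Penn Treebank tags; the tag sets and the function names are illustrative and do not reproduce the rules in Table 1.

```python
# Illustrative tag sets (Penn Treebank); the actual rules are those of Table 1.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
PRONOUN_TAGS = {"PRP"}


def candidates(tagged, keep_pronouns):
    """Collect noun compounds (runs of adjacent nouns) and, optionally, pronouns."""
    phrases, current = [], []
    for word, tag in tagged:
        if tag in NOUN_TAGS:
            current.append(word)
            continue
        if current:
            phrases.append(" ".join(current))
            current = []
        if keep_pronouns and tag in PRONOUN_TAGS:
            phrases.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases


def extract_candidate_actions(tagged):
    """Return (subject, verb, object) candidates for every verb in the sentence."""
    actions = []
    for i, (verb, tag) in enumerate(tagged):
        if not tag.startswith("VB"):
            continue
        subject_window = tagged[:i]           # all words before the verb
        object_window = []
        for token in tagged[i + 1:]:          # dynamic window: stop at the next verb
            if token[1].startswith("VB"):
                break
            object_window.append(token)
        for subj in candidates(subject_window, keep_pronouns=True):
            for obj in candidates(object_window, keep_pronouns=False):
                actions.append((subj, verb, obj))
    return actions
```

Pronouns are kept only on the subject side, which mirrors the rule that a pronoun acting as the object is discarded.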

Action Feature Extraction.
In this module, EX-Action extracts five types of features for each action. The feature extraction framework is shown in Figure 3. It contains similarity measurement, probability computation, mutual information value measurement, semantic dependency measurement, and distance computation. The features contain 9 values, $feature_{action-all} = \{F_1, F_2, \ldots, F_9\}$. The description of the features is given in Table 2. The details of action feature extraction are described next.

Similarity Measurement.
In this subsection, the similarity between candidate actions and a CTI report p is calculated by the TF-IDF and BM25 algorithms. The TF-IDF method is commonly used to calculate the weight of a feature item in the process of text vectorization [21]. Equation (1) is used to calculate the weight of a feature item of the action.
In equation (1), N is the total number of words in the CTI report p, and N(x) represents the number of occurrences of word x in the CTI report p. Since threat actions contain different numbers of words, the average value is used as the similarity measure of candidate actions. BM25 [22] is an upgraded version of TF-IDF. It adds a constant to TF-IDF to limit the growth of the TF value and uses the document length to evaluate the importance of candidate actions. It performs a weighted summation of the correlation scores between candidate threat actions and the CTI report p, and equation (2) is used to calculate the BM25 of the action.
In equation (2), tf represents the frequency of each word, idf represents the inverse word frequency of each word, L is the length of the text, and $k_1$, $k_2$, and $b$ are the adjustment factors.
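As an illustration of the BM25 feature, the following sketch scores the words of an action against a report. It assumes the standard Okapi BM25 form with the adjustment factors $k_1$, $k_2$, and $b$, which may differ in detail from equation (2); the smoothed idf, the default parameter values, and the function name `bm25` are assumptions.

```python
import math
from collections import Counter


def bm25(action_words, doc_words, corpus, k1=1.2, k2=100.0, b=0.75):
    """Okapi BM25 score of an action (list of words) against one document.

    corpus is a list of tokenized documents used for idf and average length.
    Default parameter values are common choices, not values from the paper.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_words)       # word frequency in the document
    qtf = Counter(action_words)   # word frequency in the action
    # Length normalization: longer documents are penalized through b.
    K = k1 * (1 - b + b * len(doc_words) / avg_len)
    score = 0.0
    for word, q in qtf.items():
        df = sum(1 for d in corpus if word in d)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # smoothed, non-negative
        score += idf * ((k1 + 1) * tf[word] / (K + tf[word])) * ((k2 + 1) * q / (k2 + q))
    return score
```

The term $(k_1 + 1)\,tf / (K + tf)$ saturates as tf grows, which is the bounded-growth behavior described above.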

Probability Computation.
In this subsection, the co-occurrence frequency of the VO pair is calculated to determine the correlation between the candidate action and the CTI report. In an action, the subject usually indicates the attacking subject or organization, the verb represents the attack action, and the object represents the operation target in the CTI report. Since attack organizations differ across attack processes, calculating the co-occurrence frequency of SVO triples would weaken the relevance between the action and the CTI report. Therefore, EX-Action takes the co-occurrence frequency of the VO pair under a fixed window as a feature. The correlation between the action and the CTI report is proportional to the frequency of the VO pair. Specifically, a window size of $m = 25$ is used to calculate the co-occurrence frequency of the VO pair in our experiment.
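A minimal sketch of the fixed-window count is given below. It assumes that a co-occurrence is counted once for each occurrence of the verb that is followed by the object within m tokens; the exact counting rule and the function name `vo_cooccurrence` are assumptions.

```python
def vo_cooccurrence(tokens, verb, obj, window=25):
    """Count verb occurrences that are followed by obj within `window` tokens."""
    count = 0
    for i, tok in enumerate(tokens):
        if tok == verb and obj in tokens[i + 1 : i + 1 + window]:
            count += 1
    return count
```

With a small window, only objects close to the verb are counted, whereas the window size of 25 used in the experiment also reaches objects in long sentences.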

Mutual Information Value Measurement.
Mutual information (MI) measures the reduction in uncertainty about one random variable given knowledge of another [23]. The MI of the VO pair and of the SVO triple are calculated to measure the information content of candidate actions. The correlation between candidate actions and the CTI report is proportional to the MI value. In equation (3), p(s, vo) is the frequency of a threat action, p(s) is the number of occurrences of its subject, and p(vo) is the co-occurrence frequency of the VO pair.
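For illustration, the MI value of an SVO triple can be sketched in its common pointwise form, $\log_2 \frac{p(s, vo)}{p(s)\,p(vo)}$. This form, the normalization of raw counts by a total count, and the function name `pointwise_mi` are assumptions and may differ from equation (3).

```python
import math


def pointwise_mi(joint_count, s_count, vo_count, total):
    """Pointwise mutual information between a subject and a VO pair.

    Counts are converted to probabilities by dividing by `total`
    (an assumed normalization); all counts must be positive.
    """
    p_joint = joint_count / total
    p_s = s_count / total
    p_vo = vo_count / total
    return math.log2(p_joint / (p_s * p_vo))
```

A value of 0 means the subject and the VO pair co-occur exactly as often as independence would predict; larger values indicate a stronger association.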

Semantic Dependence Measurement.
Some candidate actions have a high matching degree but are in fact inaccurate semantic matches. The semantic dependency feature is designed to recognize these actions. The Stanford dependency analyzer [4] is used to analyze the semantic dependency relations of each sentence. $W_1$ and $W_2$ are set as the dependency weight between the subject and the verb and the dependency weight between the verb and the object, respectively. The sum of the dependency weights ($W_1$ and $W_2$) is then taken as the semantic dependency feature. Figure 4 shows an example of Stanford semantic dependency for a sentence.

Distance Computation.
In this subsection, two distances are computed: the distance between the subject and the verb and the distance between the verb and the object. The number of words between the verb and the target word is taken as the distance. For example, if one word lies between the subject and the verb, the distance is recorded as 1.
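The two distance values can be computed directly from token positions, as in the following sketch. The function name and the use of token indices are illustrative assumptions.

```python
def distance_features(subject_idx, verb_idx, object_idx):
    """Return (distance_SV, distance_VO) as the number of words in between."""
    distance_sv = abs(verb_idx - subject_idx) - 1
    distance_vo = abs(object_idx - verb_idx) - 1
    return distance_sv, distance_vo
```

For the tokens "APT29 has used sticky keys", with subject "APT29" (index 0), verb "used" (index 2), and object "keys" (index 4), one word lies on each side of the verb, so both distances are 1.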

Action Identification.
Ensemble learning promotes weak learners to strong learners by constructing and combining multiple base learners to complete the learning task. In this module, EX-Action automatically identifies candidate actions with a parallel ensemble learning algorithm. The main process is illustrated in Algorithm 1. The time complexity of Algorithm 1 is $O(n^2)$.
In Algorithm 1, the input is the training set D, which contains candidate actions and their features. The ground truth A is the set of manually extracted actions and is used to calculate the similarity of candidate actions. Five base classification learners are used to construct a parallel ensemble classifier: decision tree (tree), random forest (forest), support vector machine (SVM), linear regression (LR), and multilayer perceptron classifier (MLPC). Let $result_i$ be the predicted value for an action generated by the i-th base classification learner. A weight $\omega_i$ is then set for each $result_i$, and the weighted results are summed to obtain the weighted score $C_t$ of each action. EX-Action identifies selected actions from the candidate threat action set by weighted voting and minimizes the loss function through the linear combination of base learners. $S_{th}$ is a predefined voting threshold; if $C_t$ is larger than $S_{th}$, the candidate action is regarded as a selected action. Finally, EX-Action calculates the similarity $S_t$ between the selected action and the ground truth. If $S_t$ is larger than the predefined similarity threshold $\theta$, the action $Action_{M_t}$ is recognized as a correct action.
In EX-Action, different weight values are set for the models according to their classification performance. The decision tree achieves the best performance in our experiment. Therefore, the decision tree is assigned the maximum weight, and the weights of the other four models are all equal.
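The weighted voting step can be sketched as follows. The weight values (2 for the decision tree, 1 for each other learner) and the function name `weighted_vote` are hypothetical; only the rule that the tree receives the largest weight and the other learners share an equal weight comes from the description above.

```python
# Hypothetical weights: the tree gets the largest weight, the others are equal.
WEIGHTS = {"tree": 2, "forest": 1, "svm": 1, "lr": 1, "mlpc": 1}


def weighted_vote(predictions, weights, vote_threshold):
    """Sum the weighted 0/1 predictions and compare the score with the threshold.

    predictions maps each base learner name to its predicted label (0 or 1).
    Returns (C_t, selected), where selected is True when C_t > vote_threshold.
    """
    score = sum(weights[name] * label for name, label in predictions.items())
    return score, score > vote_threshold
```

With these weights and a voting threshold of 4, a candidate rejected by the decision tree cannot be selected even if the four other learners accept it, which illustrates how the larger weight lets the best base learner dominate the vote.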

Experimental Dataset.
We obtained 243 security reports from ATT&CK. They contain 5136 sentences. The number of sentences regarding different techniques in the CTI reports is given in Table 3. In our experiment, 20% of the CTI reports are randomly chosen as the test data. Figure 5(a) shows the distribution of sentence lengths, and Figure 5(b) shows the frequency distribution of the test data.
It can be seen from Figure 5 that the lengths of the sentences describing threat actions are mainly distributed between 10 and 30. Therefore, the dataset in this study can be regarded as consisting of CTI reports with complex and long sentences.

Evaluation Metrics.
In this study, accuracy, recall, precision, F1-score, normalized mutual information (NMI), and the number of extracted actions (number) are used as performance metrics. The accuracy, recall, precision, and F1-score reflect the quantitative difference in threat actions between machine identification and the ground truth. They are calculated by equations (4)-(7). The number represents the number of extracted actions, and NMI measures the difference in the information content of threat actions between machine recognition and manual extraction. NMI is often used in clustering to measure the similarity of two clustering results. In this study, it is used to measure the difference in information content, and it reflects the similarity of actions between machine recognition and manual extraction. Equation (8) is used to calculate the information content difference of each action.
In equation (8), N represents the number of word nodes of the action, $C_m$ represents the number of word nodes of the machine-recognized action, $C_h$ represents the number of word nodes of the manually extracted action, and $C_{ij}$ represents the number of word nodes belonging to both types of actions. The similarity of information between machine recognition and the ground truth is proportional to the NMI value. When the NMI is equal to 1, the information content is equal.
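The four count-based metrics can be sketched from confusion counts as follows. The sketch assumes the standard definitions of accuracy, recall, precision, and F1-score, which are presumed to match equations (4)-(7); the function name and argument names are illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard accuracy, recall, precision, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}
```

Here tp counts extracted actions that match the ground truth, fp counts extracted actions that do not, fn counts ground-truth actions that were missed, and tn counts candidates that were correctly rejected.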

Results and Analysis.
This section presents our experimental results and analysis in four subsections: the feature importance ranking of EX-Action, model comparison, threshold determination, and performance comparison with existing methods. Note that the best value of each metric is shown in bold in each table.

Feature Importance Ranking of EX-Action.
This subsection shows the feature distribution of actions, the importance distribution of features, and the performance of different feature combinations. The features contain 9 values: TF-IDF ($F_1$), BM25 ($F_2$), $P_{vo}$ ($F_3$), frequency ($F_4$), $MI_{VO}$ ($F_5$), $MI_{SVO}$ ($F_6$), dependence ($F_7$), $distance_{SV}$ ($F_8$), and $distance_{VO}$ ($F_9$). The feature distributions of some actions are shown in Figure 6. It can be seen that the feature value distributions of these actions are nonlinear.
Table 4 gives the results obtained with different feature combinations. Under the same conditions, the recall of combination 1 and combination 2 reached the maximum value of 77.82%, and the number of extracted actions reached the maximum value of 1179, but their other metrics are lower than those of combination 7. Combination 7 outperforms the other combinations in terms of accuracy, precision, F1-score, and information completeness. Combination 7 is therefore more appropriate for the feature selection of threat action identification. The importance distribution of the 9 features is calculated by the Gini index, as given in Figure 7. Figure 7 shows that the distance of VO pairs has the largest effect on action recognition. The frequency and the conditional probability of VO pairs are less important than the other features.

Model Comparison.
This subsection shows the results of the different base learners, the unweighted ensemble learning model (unweighted model), and EX-Action (weighted ensemble model). The results obtained by the different base learners, the unweighted model, and EX-Action are given in Table 5.
As given in Table 5, the accuracy and F1-score of the tree are higher than those of the other base learners but lower than those of EX-Action. Therefore, in EX-Action, the weight of the tree is the largest, and the weights of the other base learners are the same. Comparing the results of the unweighted model and EX-Action, the recall of the unweighted model is 81.06%, which is higher than that of EX-Action, and its number of extracted actions is also higher. However, the accuracy, precision, F1-score, and NMI values of EX-Action are higher than those of the unweighted model.

Threshold Determination.
The voting threshold determines which actions the model recognizes, and the similarity threshold determines whether a recognized action is counted as correct, so both influence the performance of EX-Action. This subsection tests the optimal parameters for EX-Action by varying the voting threshold and the similarity threshold. The comparison of results under different voting thresholds and different similarity thresholds is shown in Figure 8.
As shown in Figure 8(a), when the similarity threshold and voting threshold are set to 0.2 and 4, respectively, accuracy, F1-score, and NMI are optimal. Besides, as shown in Figure 8(b), the similarity threshold is the degree of difference between machine recognition action and ground truth. It can be seen that the higher the similarity threshold, the higher the information content it contains and the lower the accuracy and F1-score will be.

Performance Comparison with the Existing Approaches.
In this subsection, the performance of EX-Action is compared with TTPDrill [5] and ActionMiner [19] in terms of accuracy, recall, precision, F1-score, the number of extracted actions, and NMI. As shown in Table 6, the result of EX-Action is higher than the other two methods in terms of accuracy, recall, precision, F1-score, the number of extracted actions, and NMI.
For extracting actions from CTI reports with complex structures, TTPDrill mainly relies on semantic dependency. It ignores some threat actions in complex sentence structures such as clauses. Therefore, TTPDrill extracted fewer actions and performed poorly.
TTPDrill retains the main information of the action better than ActionMiner, so its NMI is higher than that of ActionMiner. ActionMiner mainly relies on syntactic structure extraction. It obtains better accuracy and recall for low-level action extraction from complex sentences. However, it does not retain the subject of the action, so its NMI value is low. In addition, we compare examples of actions extracted from the CTI reports in our experimental dataset and from the CTI report used in the literature [19] that proposed ActionMiner. The examples of the extracted actions obtained by the three methods on the two datasets are given in Table 7. The actions extracted by the three methods on our experimental dataset are shown on the left of Table 7. It can be seen that TTPDrill works better on sentences with simple structure and obvious dependencies. Therefore, its performance is poor on our experimental dataset. ActionMiner can extract more actions than TTPDrill, but it lacks the subject of the action and performs poorly in retaining the information content of the sentence. EX-Action achieves better results in both the number of extracted actions and the retention of the information of sentences with complex structures.
For the CTI report mentioned in the literature [19], the actions extracted by the three methods are shown on the right of Table 7. It can be seen that the threat description sentences in that CTI report have a relatively simple structure. Comparing the three methods, the actions extracted by TTPDrill give a good description, but the composite components of the sentences are still not extracted. ActionMiner can extract more actions, but it lacks the subject. The actions extracted by EX-Action are more complete than those of ActionMiner, and EX-Action extracts more actions than TTPDrill.

Discussion
The unstructured CTI report records the network attack process, context mechanism, and other information. Accurately extracting and identifying threat actions from unstructured CTI reports helps security practitioners efficiently restore the attack process. In [24], Gao et al. correlated the threat actions extracted from CTI text with the actions extracted from system audit logs and constructed a threat action graph to realize efficient network threat search.

Contributions.
First, EX-Action can be used to extract actions from unstructured CTI reports with complex sentences. It uses syntactic rules to extract threat actions, which allows it to extract more actions from complex sentences. At the same time, machine learning algorithms identify actions based on their own features, which allows actions that are undefined in the ontology model to be identified. Second, EX-Action extracts actions that contain the subject, verb, and object. It thereby also provides a method to extract relations between entities that are connected by contextual semantics.

Limitations.
There are some shortcomings in this study, such as overreliance on part-of-speech and semantic analysis, which may lose some threat actions, and the failure to resolve pronoun referents.

Conclusion
This study proposes a multimodal learning approach to extract and identify threat actions. It can extract threat actions from complex semantic relationships and recognize associations between cybersecurity entities whose relationships are not predefined. The experimental result shows that EX-Action achieves a balance between accuracy and information completeness in action extraction. In future work, we will research how to avoid overreliance on part-of-speech tagging tools and try to use pronoun resolution to identify the subject and object that a pronoun refers to.

Data Availability
The unstructured CTI report data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Threat descriptions (Table 7).
The CTI report in our dataset: APT29 used sticky keys to obtain unauthenticated, privileged console access. APT3 replaces the sticky keys binary executable file for persistence. Axiom actors have been known to use the sticky keys replacement within RDP sessions to obtain persistence. Deep Panda has used the sticky keys technique to bypass the RDP login screen on remote systems during intrusions. Empire can leverage WMI debugging to remotely replace binaries like executable file, executable file, and executable file with executable file.
The CTI report mentioned in the paper that proposed ActionMiner: It creates the following file: caches_version.db. ... The Trojan creates the following registry entries. ... Next, the Trojan steals the following information from the compromised computer: keystrokes, clipboard data, screenshot based on specified keywords in the window title, network adapter information such as MAC address, IP address, adapter name, adapter, and description. The Trojan then saves the stolen information in the following location: caches_version.db.