Automated Dataset Generation System for Collaborative Research of Cyber Threat Intelligence Analysis

The objectives of cyber attacks are becoming sophisticated, and attackers conceal their identities by disguising their characteristics as those of others. Cyber Threat Intelligence (CTI) analysis is gaining attention as a way to generate meaningful knowledge for understanding an attacker's intention and, eventually, for making predictions. Developing such analysis techniques requires a high-volume, fine-quality dataset. However, the organizations that hold useful data do not release it to the research community, because they do not want to disclose the threats against them or the data assets they possess. Due to this data inaccessibility, academic research tends to be biased toward techniques for every CTI process step except analysis and production. In this paper, we propose an automated dataset generation system named CTIMiner. The system collects threat data from publicly available security reports and malware repositories and stores it in a structured format. We release the source code and the dataset to the public, including about 628,000 records from 423 security reports published from 2008 to 2017. We also present statistical features of the dataset and the techniques that can be developed using it. Moreover, we demonstrate one application example of the dataset that analyzes the correlation and characteristics of incidents. We believe our dataset will promote collaborative research on threat information analysis for CTI generation.


Introduction
Cyber Threat Intelligence (CTI) is evidence-based knowledge, including context, mechanisms, indicators, implications, and actionable advice, regarding existing or emerging threats to assets McMillan [2013]. One can utilize CTI to gain broad situational awareness, to collaborate with others in defeating the cyber threats one faces, and to prevent cyber threats by applying CTI to defense systems.
With the increase of global cyber threats, CTI is gaining more attention as a response to such threats. Many nations and organizations also try to promote the use of CTI by enacting laws that legalize and encourage collecting CTI Council on Foreign Relations [2015], by sharing it through bilateral or multilateral cooperation cam [2016] ccd [2008] eni [2004] The White House [2015], and by establishing standards Johnson and Waltermire [2014] MIT [2013]. Furthermore, over the recent decade, the number of articles related to CTI has increased about 24-fold, as can be seen in Fig. 1 1 . During the Olympic Winter Games PyeongChang 2018, a cyber attack took place targeting the server operated by the organizing committee. What makes this case noteworthy is that security researchers attributed the attack to different countries. Rosenberg [2018] and GReAT [2018] held Chinese and Russian threat actors, respectively, responsible for the cyber attack. Paul Rascagneres [2018] and Juan Andres Guerrero-Saade and Lesnewich [2018] pointed out that attribution could not be concluded from the small amount of code overlap with malware used by the Lazarus Group, a North Korean hacking group. In particular, GReAT [2018] argued that there was evidence the Russian attacker had tried to frame the North Korean hacking group as the culprit. This example shows that evidence-based, precise analysis considering all possibly related cases is of vital importance for CTI generation.
However, among the steps of the traditional intelligence process US Joint Chiefs of Staff [2013] - planning and direction, collection, processing and exploitation, analysis and production, and dissemination and integration - most technical research on CTI tends to focus on the steps other than analysis and production, which requires a real CTI dataset. Despite the many advantages of CTI analysis, such as 1) interoperability of data (machine, vendor, and organization independent), 2) compact expression of heterogeneous sources of threat information, and 3) the possibility of performing long-term and nation-wide threat analysis Kim et al. [2016], we believe the most challenging aspect of such study is the limited accessibility of data to researchers. Some web services provide functionality for searching threat data, but they do not offer a sufficient and useful set of data for research purposes. Also, most datasets consist only of specific data types, e.g., IP addresses, URLs, or hash values, and access to some datasets is strictly restricted to certain regions or nationalities.
In this paper, we propose a cyber threat dataset generation system named CTIMiner, which automatically collects data from public security reports and malware repository websites and stores it in a structured format. The generated dataset contains several types of data, such as malware analysis information consisting of file paths, mutexes, and code-signing information, as well as the types listed above. The main contributions of our work are:
• Promoting collaborative CTI analysis research by proposing the cyber threat data generation system and the public database
• Demonstrating the usage of the dataset for correlation analysis
• Suggesting the techniques to be developed to generate CTI from the dataset
It would be better to introduce the techniques to generate CTI from the dataset; however, this is beyond the scope of this paper and remains a future research concern. We believe that suggesting the techniques required for analyzing the dataset can inspire researchers and promote CTI analysis research.
The remainder of this paper is organized as follows. The intelligence process and its associations with CTI activities are presented in section 2, along with several studies related to each step. The overall system architecture of CTIMiner and the phases composing its run-time process are described in section 3. The dataset structure, the categories of data, and the statistical features are explained in section 4. After demonstrating the dataset usage and suggesting techniques to analyze it in section 5, we conclude this paper in section 7, following the source code and dataset access introduction in section 6.

Intelligence Process and Automated CTI Activities
In the field of military operations, a well-defined intelligence process has been adopted to efficiently generate intelligence from low-level data collected in the field to support decision making US Joint Chiefs of Staff [2013]. This process is intended to be followed by a human intelligence officer, but it can also be projected onto automated CTI activities. Once the operation direction is determined to fulfill the identified intelligence requirement, raw data is collected and extracted from the sensors and data sources that have the ability and functionality to obtain it. The data gathered from the various sources is combined and converted into forms - in other words, information - so that it can be analyzed efficiently. The information is passed to analysis algorithms, such as big data or machine learning based methods, which produce the intelligence that supports human analysts. Such intelligence is disseminated to others who have access to it. The shared intelligence can also be integrated into the intelligence one already possesses.
The association between the intelligence process and the automated CTI activities is illustrated in Fig. 2. In the following subsections, previous studies concerning CTI are introduced in the sequence of the intelligence process, except for the planning and direction step, since it is a strategic rather than a technical matter.

Collection
Since CTI is also a product of threat data processed through the intelligence process, low-level threat data can be collected in this step. Goel classified the types of data to be collected into unstructured data and network data Goel [2011]. The former typically consists of hacker forum postings, blogs, and websites, while the latter is generated by information security systems such as firewalls, intrusion detection systems, and honeynets. Benjamin et al. proposed a method of extracting information from hacker forums, IRC channels, and carding shops to identify threats Benjamin et al. [2015]. Also, Fachkha and Debbabi characterized the darknet and compared several methods of extracting threat information from it Fachkha and Debbabi [2016].
As a data repository for research on cyber security analysis, IMPACT U.S. Department of Homeland Security [2014], the newest version of PREDICT, provides several types of dataset, such as network flow data, IDS and firewall data, and unsolicited email data. It also provides useful tools for data analysis. However, the service is only available to the DHS-approved countries: United States, Australia, Canada, Israel, Japan, Netherlands, Singapore, and United Kingdom.

Processing and Exploitation
During the processing and exploitation step, collected raw data is converted into forms that can be readily used by intelligence analysts and other consumers. Unstructured data and heterogeneous sources of data having different structures can be stored in a unified data format in this step for further analysis.
STIX (Structured Threat Information eXpression) Barnum [2014] and OpenIOC MANDIANT [2011], proposed by MITRE and MANDIANT, are representative standards for expressing threat data. STIX in particular is widely used due to the scalability of its schema, which uses components such as CybOX (Cyber Observable eXpression), MAEC (Malware Attribute Enumeration and Characterization), and CAPEC (Common Attack Pattern Enumeration and Classification). Liao et al. proposed a method for extracting elements to construct structured data from unstructured data Liao et al. [2016]. One thing to note in this approach is that the meaning of the elements in context can also be retrieved using natural language processing techniques.

Analysis and Production
During the analysis and production step, all processed information is integrated, evaluated, analyzed, and interpreted to produce intelligence. Kornmaier and Jaouën argued that to generate operational or strategic intelligence, beyond tactical intelligence which is technical in nature, the threat data should be fused with data collected from different disciplines such as Human Based Intelligence (HUMINT), Imagery Intelligence (IMINT), Signal Intelligence (SIGINT), and Geographic Intelligence (GeoINT) Kornmaier and Jaouën [2014].
Modi et al. proposed an automated threat data fusion system that correlates data crawled from the web using a string-matching based approach Modi et al. [2016]. Similar commercial CTI services also exist, such as iDefense IntelGraph by Verisign and the web intelligence engine by Recorded Future, which allow users to navigate extensive threat data following string-matching correlation. One key feature of Recorded Future is that it can perform predictive analytics for specific future events by using information noticed ahead of time Truvé [2016]. However, the commercial services provide an indicator-centric analysis approach, which makes it hard to trace correlations between incidents.
Kim et al. proposed a general framework for efficient CTI correlation analysis by adopting a novel concept that expresses similarity between threat events in graphical structures Kim et al. [2016]. The graphical structures allow analysts to trace the specifications and transitions of related cyber incidents to infer the attacker's intention.
Using a threat report as the source of information, Qamar et al. Qamar et al. [2017] proposed an automated mechanism to determine the risk posed to a networked system by the threat analyzed in the report. For this purpose, they defined an ontology of IoCs, networks, associated risks, and the relations among them. For the risk analysis of the networked system, four parameters are defined: threat relevance, threat likelihood, total loss of affected assets, and threat reachability.

Dissemination and Integration
During the dissemination and integration step, intelligence is delivered to and used by the consumer. A guideline Johnson and Waltermire [2014] and a technical standard protocol Connolly et al. [2014] exist for sharing CTI. Also, MISP (Malware Information Sharing Platform) 2 , MANTIS (Model-based Analysis of Threat Intelligence Sources) 3 and CIF (Collective Intelligence Framework) 4 are useful open-source platforms for storing and sharing CTI.
As more participants in a community share CTI, access control issues with the shared data often arise. Zhao and White proposed an access control model that extends the group-centric Secure Information Sharing (g-SIS) model to support collaborative information sharing in a community Zhao and White [2012]. Even though such assistive technologies promote CTI sharing, social and political issues, for example the authority to operate CTI sharing policies and the trust management within a community, often make it controversial to establish collaborative CTI sharing.

Data, Information, and Intelligence
In much of the CTI-related literature, the terms data, information, and intelligence are often used interchangeably without clarification. We use them here according to the definitions in US Joint Chiefs of Staff [2013].
Data is the individual facts collected from sensors in the operational environment. Information is data gathered and processed into an intelligible form, and intelligence is the new understanding of current and past information that allows prediction of the future and informs decisions.
These definitions apply not only to the general intelligence process but also to CTI activities. In the context of the data fusion and mining process, Bass defined data as measurements and observations, information as data placed in context, indexed, and organized, and knowledge, which is equivalent to intelligence, as information that has been explained or understood Bass [2000].

CTIMiner System Architecture
We propose a cyber threat data collecting system, CTIMiner, whose system architecture is presented in Fig. 3. The CTI collecting procedure is composed of three phases. During the first phase, the system gathers threat data from publicly accessible cyber intelligence reports published by organizations and companies. It also collects additional related data from a malware repository during the second phase. Finally, all collected data is stored in the database after passing through the last phase, which generates combined information in a structured format.

Phase 1: Parsing Indicator of Compromise
This phase starts with collecting cyber intelligence reports that analyze cyber incidents and malware related to APT campaigns and groups. For this, we obtain a list of papers from APTnotes 5 , which provides publicly available articles and blog posts related to malicious attacks, activity, and software associated with vendor-defined APT groups and/or tool-sets. To maintain the usability of the dataset, we exclude from the list the periodically published threat analysis reports that integrate analysis results about different APT groups that are not interrelated. Therefore, one can assume that the data extracted in phases 1 and 2 is related to the same (or related) threat actors. We can use this property to set the ground truth of the data for analysis. This property and the dataset usability are explained in detail in sections 4 and 5, respectively.
Next, Indicators of Compromise (IoCs) are extracted from the reports using a parser. We utilize ioc parser 6 , which extracts IoCs matched by predefined regular expressions such as URLs, hosts, IP addresses, e-mail accounts, hashes (MD5, SHA1, SHA256), Common Vulnerabilities and Exposures (CVE) identifiers, registry keys, file names ending with specific extensions, and Program Database (PDB) paths. Among the obtained data, the malware hash values are passed to the second phase for further data collection, and the others to the last phase.
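Regex-based extraction of this kind can be sketched as follows. The patterns below are illustrative, in the spirit of ioc parser but not its actual rules; the real tool covers many more types (hosts, registry keys, PDB paths, and so on).

```python
import re

# Illustrative IoC patterns (assumptions, not ioc parser's exact expressions).
IOC_PATTERNS = {
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def extract_iocs(text):
    """Return a dict mapping each IoC type to the sorted set of matches."""
    return {kind: sorted(set(pat.findall(text)))
            for kind, pat in IOC_PATTERNS.items()}

# Hypothetical report excerpt for demonstration only.
report = ("The dropper (md5 d41d8cd98f00b204e9800998ecf8427e) exploited "
          "CVE-2017-0144 and beaconed to 203.0.113.7, contact ops@example.com")
iocs = extract_iocs(report)
```

In CTIMiner's flow, the `md5`/`sha256` buckets would feed phase 2, while the remaining types would go directly to phase 3.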

Phase 2: Collecting Analysis Data
Due to the functional limitations of the parser, IoCs may remain unextracted from the reports yet be recoverable from malware analysis data. Moreover, we can obtain additional data from the analysis results that is not in the contents of the reports. Notably, valuable data that cannot be expressed as regular expressions, such as mutexes, file mappings, code signatures, and other strings, can only be collected from the malware analysis results.
To collect malware analysis data, we use the malware repository service malwares.com, operated by SAINT SECURITY Inc., the first cloud-based malware analysis platform in South Korea. It possesses over 800 million malware samples and maintains a partnership with VirusTotal. If the malware analysis results are retrieved by querying the hash value, the data in the results - hashes, URLs, IP addresses, PDB paths, code signatures, file names, and other strings - is passed to the last phase; otherwise, the hash value itself is passed. We do not store malware samples in the database because of copyright concerns that could arise when the database is publicly released. For the new hash values found in the results, the analysis data is gathered through the same procedure.
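The iterative expansion described above, where newly discovered hashes are queried through the same procedure, can be sketched as a breadth-first traversal. The repository lookup here is a toy stand-in dictionary, not the real malwares.com API; `dropped_hashes` and `urls` are hypothetical field names.

```python
from collections import deque

# Toy stand-in for the malware repository (hypothetical schema): maps a hash
# to its analysis result, which may reference further sample hashes.
FAKE_REPO = {
    "aaa": {"urls": ["http://evil.example/a"], "dropped_hashes": ["bbb"]},
    "bbb": {"urls": [], "dropped_hashes": []},
}

def collect_analysis(seed_hashes, lookup):
    """Query every hash, enqueue new hashes found in analysis results, and
    separate resolved hashes from those with no analysis available."""
    resolved, unresolved, seen = {}, [], set()
    queue = deque(seed_hashes)
    while queue:
        h = queue.popleft()
        if h in seen:
            continue
        seen.add(h)
        result = lookup(h)
        if result is None:
            unresolved.append(h)  # store the bare hash when no analysis exists
            continue
        resolved[h] = result
        queue.extend(result["dropped_hashes"])  # same procedure for new hashes
    return resolved, unresolved

resolved, unresolved = collect_analysis(["aaa", "ccc"], FAKE_REPO.get)
```

The `seen` set guarantees termination even if analysis results reference each other cyclically.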

Phase 3: Data Filtering and Storing
The data collected from several sources may be redundant or noisy; such data can be filtered out in this phase. For example, some files are automatically generated by the operating system when the malware is executed, regardless of the intent of the malware creator. We merge repetitive data and remove noisy data in this phase. What needs to be considered for noise removal is the trade-off between false positives and false negatives. The filtered data is stored in the MISP server, which provides an API to manage and export data in various structured formats.
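A minimal sketch of this filtering step might look as follows. The allowlist of OS-generated file names and the tuple representation are assumptions for illustration, not CTIMiner's exact rules.

```python
# Hypothetical allowlist of files the OS creates regardless of malware intent.
BENIGN_VALUES = {"desktop.ini", "thumbs.db", "ntuser.dat"}

def filter_attributes(attributes):
    """attributes: list of (type, value) pairs. Returns de-duplicated pairs
    with known-benign values removed, preserving first-seen order."""
    seen, kept = set(), []
    for attr_type, value in attributes:
        key = (attr_type, value.lower())
        if key in seen or value.lower() in BENIGN_VALUES:
            continue  # drop duplicates and OS-generated noise
        seen.add(key)
        kept.append((attr_type, value))
    return kept

raw = [("filename", "payload.exe"), ("filename", "Desktop.ini"),
       ("filename", "payload.exe"), ("ip", "203.0.113.7")]
clean = filter_attributes(raw)
```

A fixed allowlist errs on the false-negative side; looser heuristics would remove more noise at the cost of false positives, which is exactly the trade-off noted above.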
Optionally, we categorize the data types composing the dataset and analyze their statistical characteristics in this phase; the results are presented in the next section.

System Processing Results of Phase 1 & 2
We ran this system on the 423 collected APT reports published from 2008 to 2017, and the numerical processing results are in Table 1. Among the 10,391 malware hashes extracted from the reports, we obtained analysis results for 71.5% of them from the malware repository. In the analysis information, we found 406 new malware hashes that were not contained in the APT reports and added their analysis information to the dataset. The value of including the malware analysis data in addition to the IoCs extracted from the reports is explained in the statistical analysis of the dataset in section 4.2. The dataset is composed of several sets of events, and Fig. 4 shows the relationships within one set of events. One set of events is composed of two types of events: one report event and several malware events. A report event includes the data extracted in the first phase explained in section 3, which parsed textual IoCs from the APT reports. Whenever malware hashes are detected and their analysis data can be obtained in phase 2, malware events are created. These malware events and the report event from which the malware hashes originated can be grouped under the title of the report.
The data schema of an event is presented in Fig. 5, and one short example of a set of events is in Fig. 6. Since all malware events originating from one report include the same report file name, this can be used as the ground truth for correlation analysis of the data. Also, compilation dates of malware and publication dates of reports can be useful for temporal analysis of the dataset. A sample application of the dataset for correlation analysis using these dataset characteristics is demonstrated in section 5.
The types of attributes stored in the dataset are IP address, URL, e-mail address, date and time, vulnerability (CVE), file name, PDB path, digital code-sign serial number, and other string data such as the author and title of a document. The amounts of data in the report and malware events are in Table 2. Using the source code we publicly released, one can create one's own dataset composed of the attribute types of interest.
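The event-set structure and its ground-truth property can be illustrated with a hypothetical example. The field names below are illustrative, not the exact MISP schema: one report event plus the malware events spawned from its hashes, all carrying the originating report's file name.

```python
# Hypothetical shape of one event set (illustrative field names).
event_set = {
    "report_event": {
        "report": "lazarus_false_flag_2017.pdf",   # made-up file name
        "published": "2017-02-20",
        "attributes": [("md5", "d41d8cd98f00b204e9800998ecf8427e"),
                       ("url", "http://evil.example/a")],
    },
    "malware_events": [
        {"report": "lazarus_false_flag_2017.pdf",  # same name: ground truth
         "compiled": "2016-11-03",
         "attributes": [("mutex", "Global\\srv_x"), ("ip", "203.0.113.7")]},
    ],
}

def ground_truth_group(event_set):
    """All events carrying the same report file name form one labeled group."""
    names = {event_set["report_event"]["report"]}
    names.update(e["report"] for e in event_set["malware_events"])
    return names

labels = ground_truth_group(event_set)
```

A correlation algorithm evaluated on the dataset can use the single shared label per event set as its ground truth, and the `published`/`compiled` dates for temporal analysis.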

Data Categories and Statistics
We observed that the data collected from the reports and the malware analysis information are related to common cyber campaigns or threat actors, and can be categorized as shown in Fig. 7.
The characteristics of each category are as follows.
1. The data that can only be extracted by the parser belongs in this category. The quality and quantity of this type of data highly depend on the contents of the reports and the functionality of the parser.
2. The malware analysis data that is contained in reports but cannot be extracted by the parser belongs in this category. The volume of this type of data shows how much the malware analysis data can compensate for the limitations of the parser. Also, the indicator for this category can be used to compare the quality of the analysis results of several malware repositories.
3. The data found only in the malware analysis results belongs in this category. It consists of various types of data, including code signatures, IP addresses, and other string information, that are valuable for identifying the incidents.

Dataset Application
As mentioned above, the objective of generating our dataset is to promote academic research related to CTI analysis. We propose three research topics applying the dataset and demonstrate one dataset application example in this section. It would be better if novel analysis techniques were also proposed, but that is out of the scope of this paper. The provided application example is the correlation analysis result of the dataset automatically generated by MISP.

Noise Removal
As explained in section 4, the dataset includes several types of noise, which hinder further data analysis and cause erroneous results. The dataset contains noise because of malfunctions of the data extraction methods and the inclusion of less meaningful data. An effective noise removal technique should consider the contextual necessity of data across the whole dataset or across the sets of events. For example, data contained in several sets of related events that otherwise show little similarity to each other is noise with high probability, since it correlates event sets that are in fact dissimilar.
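This heuristic can be sketched with Jaccard similarity. The sketch below is an assumption about how such a check might work, not a method from this paper: a value shared by many event sets that have little else in common is flagged as probable noise.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets of attribute values."""
    return len(a & b) / len(a | b) if a | b else 0.0

def likely_noise(value, event_sets, threshold=0.1):
    """Flag `value` as probable noise if the event sets containing it are
    pairwise dissimilar once the value itself is ignored."""
    holders = [s - {value} for s in event_sets if value in s]
    if len(holders) < 2:
        return False  # appearing in a single event set is not evidence of noise
    sims = [jaccard(a, b) for a, b in combinations(holders, 2)]
    return max(sims) < threshold

# Hypothetical event sets: a common system DLL appears everywhere, while the
# sets otherwise share nothing.
sets = [{"kernel32.dll", "203.0.113.7", "payload.exe"},
        {"kernel32.dll", "198.51.100.9", "evil.doc"},
        {"kernel32.dll", "192.0.2.5", "drop.vbs"}]
noisy = likely_noise("kernel32.dll", sets)
```

The threshold controls the false-positive/false-negative trade-off noted in section 3: raising it removes more noise but risks discarding genuine shared indicators.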

Correlation Analysis
The usefulness of the dataset comes from finding the underlying relations of the data. Without correlations, the dataset itself is nothing but a large amount of scattered data that can only be used for searching for the existence of individual items.
Since an event in the dataset is composed of several pieces of threat data, the correlations between events are determined by analyzing the relations of the threat data composing them. The string-matching based method that many commercial cyber intelligence services provide would be one way to find relations between events. However, this simple method has several limitations. If two events contain attacker names such as 'Bart Simpson' and 'B. Simpson', simple string matching will not find the relation between the events. Similarly, if the events include the URLs 'bartsimpson.com' and 'bsimpson.net', the relation will not be discovered. String-similarity analysis and heuristics can be adopted to overcome such limitations. Moreover, probabilistic approaches can extend this to event-wise analysis that considers the relations between sets of data in the events.
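The 'Bart Simpson' / 'B. Simpson' case can be handled with a similarity ratio instead of exact matching. A minimal sketch using the standard library (the 0.6 threshold is an arbitrary illustration, not a recommended value):

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.6):
    """True if two strings are similar enough under a ratio threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Exact matching misses the relation; similarity matching catches it.
exact = "bart simpson" == "b. simpson"
fuzzy = similar("Bart Simpson", "B. Simpson")
```

In practice the threshold would have to be tuned per attribute type, since short strings like domain names produce noisier ratios than person names.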

Temporal Analysis
Understanding the history of cyber campaigns by adversaries is crucial not only for defending against current incidents and inferring the underlying intents, but also for drawing the direction of adversarial activities from the big picture. Furthermore, the Tactics, Techniques, and Procedures (TTPs) identified from the campaigns by temporal analysis can characterize the behavior of the adversarial groups. These characteristics can therefore be used as features for correlation analysis of sequences of events.

Dataset Application Example
The proposed dataset can be used for the correlation analysis of cyber incidents. The cyber threat actor group whose correlations we retrieve as the example is the Lazarus group, which is suspected of being responsible for many major cyber campaigns, including:
• The Sony Pictures Entertainment attack (2014)
• Bank heists including the Bangladesh Bank (2016)
• The worldwide WannaCry ransomware distribution (2017)
We conducted a correlation analysis of the dataset collected by CTIMiner with the help of the MISP correlation graph demonstrated in Fig. 8. The security report that is the starting point of the correlation analysis is 'Lazarus' False Flag Malware' Shevchenko [2017], marked as a . As mentioned in the report, the Lazarus group was involved in the Polish bank heist, for which the corresponding report is BadCyber [2017], marked as b . The data-wise correlation of incidents can be found in Fig. 8. The data in 1 , which is extracted from the reports and from malware analysis results, correlates a and b , and the other data in 2 links a to c , another report from BAE Systems regarding the Lazarus group. Therefore, through a , b and c can be correlated.
Although this paper does not intend to propose CTI analysis techniques, by applying the previously proposed dataset applications, we can draw practical lessons from this example about how the dataset can be used for CTI generation. At a basic level, a CTI analysis algorithm is able to find the connections among data extracted from the same APT report. More advanced algorithms can correlate reports that analyze the same attributes and campaigns. To generate actionable intelligence, a CTI analysis algorithm should eventually aim to find the patterns of attacks, enabling prediction of attackers' intents and preparation against similar attacks.
Kim et al. proposed an event-centric correlation analysis approach to assist in generating such CTI. They suggested a novel concept and a construction algorithm that express similarity between threat events and temporal characteristics in graphical structures Kim et al. [2016]. To use our CTI dataset for such advanced analysis, further research is required.

Source code and Dataset Access
The source code of the CTIMiner system and the generated dataset described in this paper are available to the public. They are accessible at our GitHub repository 7 . Using the source code, security reports, and MISP, one can generate a dataset composed of the data types that he/she is interested in.

Conclusion
As cyber threats become prevalent and the volume of collectible data increases rapidly, techniques for each intelligence process step are being actively developed. However, compared to the other steps, only limited studies have been undertaken for the analysis and production step, which requires a real CTI dataset. We pointed out that dataset unavailability is the main reason suppressing the vitalization of this research despite much interest. To address the problem, we proposed the CTIMiner system, which generates a dataset consisting of the data contained in security reports, supplemented with malware analysis data related to the reports. After categorizing the types of data collected by the system, we provided the statistical features of the dataset. To show the usability and applicability of the dataset, we proposed several research topics that can be conducted using the dataset and demonstrated the correlation analysis result for an event in the dataset.
Our future research direction is to develop and enhance the proposed analysis techniques using the dataset on top of the CTI correlation analysis framework Kim et al. [2016]. By releasing this dataset to the public, we believe we can promote threat information analysis research to generate CTI.