Article

Detecting Reconnaissance and Discovery Tactics from the MITRE ATT&CK Framework in Zeek Conn Logs Using Spark’s Machine Learning in the Big Data Framework

1 Department of Computer Science, University of West Florida, Pensacola, FL 32514, USA
2 Department of Mathematics and Statistics, University of West Florida, Pensacola, FL 32514, USA
* Author to whom correspondence should be addressed.
Sensors 2022, 22(20), 7999; https://doi.org/10.3390/s22207999
Submission received: 7 September 2022 / Revised: 17 October 2022 / Accepted: 18 October 2022 / Published: 20 October 2022
(This article belongs to the Section Intelligent Sensors)

Abstract

As computer networks and the massive amount of communication taking place on them grow, the damage that network intrusions can do grows in tandem. An effective and scalable intrusion detection system (IDS) is needed to address the potential damage that accompanies this growth. A great deal of contemporary research on near real-time IDS focuses on applying machine learning classifiers to labeled network intrusion datasets, but these datasets must stay current with respect to the network intrusions they represent. This paper focuses on a newly created dataset, UWF-ZeekData22, built from Zeek's Connection Logs collected using the Security Onion 2 network security monitor and labelled using MITRE ATT&CK framework TTPs. Due to the volume of data, Spark, in the big data framework, was used to run many of the well-known classifiers (naïve Bayes, random forest, decision tree, support vector classifier, gradient boosted trees, and logistic regression) to classify the reconnaissance and discovery tactics in this dataset. In addition to the performance of these classifiers in Spark, scalability and response time were also analyzed.

1. Introduction and Background

Over the past decade, Internet of Things (IoT) traffic has increased exponentially. As more devices transfer data across networks in sectors such as healthcare, agriculture, and logistics, network traffic is expected to keep growing rapidly, and it is predicted that by 2023, there will be at least 43 billion devices [1]. Hence, being able to monitor and recognize malicious activity and cyberattacks has become imperative. In this work, Zeek, an open-source network-monitoring tool that provides the raw network data needed to tackle today’s toughest networking challenges in enterprise, cloud, and industrial computing environments, was used to collect data [2]. This new, modern dataset, UWF-ZeekData22 [3], was labeled using the MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) framework, a globally accessible knowledge base of adversary tactics and techniques used to accomplish specific objectives [4]. This work, specifically using Zeek’s Connection (Conn) Log files from the UWF-ZeekData22 dataset [3], tries to identify connections that lead to two adversary tactics: reconnaissance (TA0043) and discovery (TA0007). Users of the reconnaissance tactic gather information about vulnerabilities, which can be used for future attacks [5], and users of the discovery tactic try to better understand the internal network [6]. Zeek’s Conn log files track the protocols and associated information such as IP addresses, durations, transferred (two-way) bytes, states, packets, and tunnel information. In short, Zeek’s Conn files provide all the data regarding the connection between two points [7].
Due to the volume of data involved, the Hadoop Distributed File System (HDFS) was used to store the data. HDFS is a scalable, fault-tolerant system that allows processing of big data by making it available across a cluster of computers [8]. An open-source framework for big data analytics that sits on top of the Hadoop framework, Apache Spark, was used for machine learning (ML). Spark’s machine learning algorithms, specifically decision tree (DT), gradient boosted trees (GBT), logistic regression (LR), naïve Bayes (NB), random forest (RF), and support vector machines (SVM), were used for binary classification in the big data framework. New binning methods were introduced to bin the raw network data, and feature reduction was performed using information gain. In addition to statistical metrics and execution timings obtained from the machine learning algorithms, an analysis was also performed to determine the ideal Spark parameter configurations for classifying a raw network dataset in the big data framework.
The uniqueness of this paper can be highlighted as follows: (i) this analysis is performed on a modern, one-of-a-kind, newly created Zeek MITRE ATT&CK framework labelled dataset, UWF-ZeekData22, which is available at datasets.uwf.edu (accessed on 20 August 2022) [3]; (ii) this is the first work to date that analyzes Zeek network connections that lead to the classification of the reconnaissance and discovery tactics as defined by the MITRE ATT&CK framework; and (iii) new binning methods are presented to bin a raw network dataset.
The rest of this paper is divided into the following sections. Section 2 presents the related works; Section 3 briefly explains how this new dataset was generated and the characteristics of this dataset; Section 4 explains the pre-processing in detail; Section 5 presents the machine learning algorithms and the Spark parameters used; Section 6 presents and discusses the results obtained; Section 7 presents the conclusions, and Section 8 presents future work.

2. Related Works

Analysis of network data for the purpose of supporting an anomaly-based IDS has been a topic of interest for some time [9,10,11,12,13,14,15,16,17]. The first widely studied network dataset was the KDD99Cup dataset, analyzed in [10,12,15]. Reference [12] used a decision tree classifier to support anomaly-based intrusion detection in network data. They found that a particle swarm optimization algorithm can be used to prune trees and significantly improve model performance. Reference [15] showed that SVM, using principal component analysis for feature reduction, is effective for anomaly detection in a fog computing environment. Reference [10] used multiple learning algorithms (logistic regression, SVM, random forest, gradient boosting tree, and naïve Bayes) to test the performance differences that occur as the number of features are changed. They found that the accuracy of these learners is minimally affected by reducing the number of features, but training and testing time can be significantly reduced.
The next big dataset, NSL-KDD, is a smaller subset of the KDD99Cup dataset, with a significant number of duplicate records removed. It was analyzed by [10,13]. Reference [10] used both NSL-KDD and KDD99Cup and found that the larger, more redundant KDD99Cup dataset produced slower models as compared to the models developed using the smaller NSL-KDD dataset. Reference [13] combined different decision tree models and found that models perform best when combined with a sum-rule scheme.
The UNSW-NB15 dataset was created in 2015 by the University of New South Wales and features nine different attack types along with significantly more modern network traffic as compared to the KDD datasets. Reference [9] analyzed this dataset using the SVM, naïve Bayes, decision tree, and random forest classifiers on the Apache Spark framework and found that the random forest classifier performed the best in terms of both accuracy and execution time. Reference [18] analyzed this dataset on the Apache Spark framework using binary and multiclass random forest classifiers with principal component analysis and information gain as feature selection methods. They found that PCA made model training slower, and in some cases, accuracy was the same or worse than models trained without PCA. They also found that the binary classifier produces more accurate models than the multiclass classifier. Reference [19] analyzed UNSW-NB15 using both binary and multiclass random forest algorithms using information gain and principal component analysis as feature reduction methods and Apache Spark as the framework. They found that binary random forest outperforms multiclass random forest in terms of accuracy and that information gain outperforms principal component analysis in terms of both accuracy and speed.
Reference [17] analyzed CICIDS2017, a dataset that addresses the issues with traffic diversity and volume, lack of metadata, outdated attack payloads, and anonymized packets found in other datasets. Reference [17] implemented k-nearest neighbor, random forest, decision tree, gradient boosting tree, multilayer perceptron, naïve Bayes, and quadratic discriminant analysis classifiers and reported their performance, stressing the importance of generating and analyzing updated IDS datasets to keep up with current network traffic and network attacks.
CSE-CIC-IDS2018, an update to CICIDS2017, followed a set of guidelines to create a systematic update that keeps the IDS dataset current with network trends. It was analyzed by [16], who applied gradient boosting tree, decision tree, random forest, naïve Bayes, and logistic regression classifiers to this dataset to determine the impact of feature selection, and specifically of the Destination_Port feature, on the performance of the machine learning algorithms. Reference [16] found that the Destination_Port feature was generally useful in terms of model accuracy and should be included when analyzing IDS data. They also found that feature selection algorithms produce models that perform as well as or better than models created without feature selection while consuming fewer computing resources.
In addition to large, general-purpose IDS datasets, some studies have produced their own data tailored to their specific use cases. Reference [11] created and analyzed datasets of varying sizes labeled SYNDOS10K-SYNDOS2MIL, which were generated by making SYN DoS attacks on a network consisting of three PCs and nine IoT devices. They analyzed these datasets with logistic regression, decision tree, random forest, gradient boosting tree, and linear SVM and found that random forest performed best in terms of both accuracy and execution time on this type of data.
Other studies, such as [14], used data that are not made publicly available. They applied a multilayer SVM to their data to support a block lowest common ancestor algorithm to cluster network intrusion features. They found that this approach produces better clustering detection times when compared to similar methods.
There are very few works to date on the MITRE ATT&CK framework with respect to identifying attacks using machine learning. Reference [20] used two hidden layer feed-forward neural networks to impute missing values in a 2018/2019 ENISA dataset [21] based on ATT&CK descriptions of the intrusions. They found that this is a valid approach for filling out intrusion datasets with missing values. Reference [22] used hierarchical clustering on a small (270 records) intrusion dataset from MITRE to predict likely future attacks based on other recent attacks. They found that 75% of the ATT&CK techniques can be highly predicted based on earlier attacks by the same advanced persistent threat. Thus, though there are some works on the MITRE ATT&CK framework, there are no works that map a large, raw network dataset with attacks labeled according to the ATT&CK framework, which is then analyzed using machine learning.

3. Data

The Zeek Conn Log MITRE ATT&CK framework labeled dataset, UWF-ZeekData22, available at [3], generated using the Cyber Range at The University of West Florida (UWF), was used for this analysis. This dataset has 9,280,869 attack records and 9,281,599 benign records.
Zeek’s Conn log file tracks the protocols and associated information, such as IP addresses, durations, two-way bytes, states, packets, and tunnel information. In short, the Conn log files provide all the data regarding the connection between two points. The full list of the attributes (and attribute types) of the Conn log file is presented in Figure 1. The “mitre_attack” attribute was added to label the data as per the MITRE ATT&CK framework. A description of the attributes in the Zeek Conn Logs file is presented in Appendix A.

3.1. Data Breakdown by Tactic and Technique

Table 1 presents a breakdown of the data in the Zeek Conn logs by the MITRE ATT&CK Tactic. There were 9,278,722 instances of reconnaissance activity, 2086 instances of discovery activity, and very few instances of other adversary tactics in this dataset; hence, only reconnaissance and discovery were used for this analysis. Table 2 presents a breakdown of the Zeek Conn logs by the MITRE ATT&CK Technique, which is a more granular view of a tactic (i.e., “T1046 Network Service Discovery” is a technique, which is a member of the discovery tactic).

3.2. The Reconnaissance Tactic

The reconnaissance tactic covers network attacks that are carried out with the goal of gathering information to plan future attacks [23]. This dataset has 9,278,722 reconnaissance records, all of which are T1595 active scanning attacks in the MITRE ATT&CK nomenclature [23], which could include scanning IP blocks, vulnerability scanning, or wordlist scanning.

3.3. The Discovery Tactic

The discovery tactic covers attacks that are meant to learn specifics of network infrastructure [5]. This dataset has 2086 discovery records, all of which are T1046 network service discovery attacks. In the T1046 technique, adversaries attempt to get a listing of services running on remote hosts and local network infrastructure. This is done through port scans or vulnerability scans [5].

4. Preprocessing

Preprocessing is broken down into two parts. The first part presents a way to preprocess raw network data using binning, and the second part uses information gain for feature selection.

4.1. Preprocessing Using Binning

The Zeek Conn logs contain continuous valued attributes, nominal attributes, IP addresses, and port numbers. For processing in Spark’s machine learning environment, preprocessing was required. The following sections cover how each type of column in the dataset (i.e., continuous value, nominal, IP number, and port) was preprocessed and binned.

4.1.1. Binning Continuous Valued Columns

To establish the set of bin ranges for the continuous valued columns in this dataset, the trimmed mean and standard deviation were calculated for each column. To calculate the trimmed mean, first, null values as well as extreme outliers were removed. Several columns in this dataset (for example, duration) exhibit skewed distributions, often forming long tails to the right. To address this skewness, a 10% trim on each end of the data was used to generate the bins. This resulted in 80% of the data being used for the mean and standard deviation calculations, with the 10% lowest-ranking and 10% highest-ranking values removed. The binning can be outlined with the following edges:
  • edge0 = float('-inf')
  • edge1 = mean_val - 2 * stddev_val
  • edge2 = mean_val - stddev_val
  • edge3 = mean_val
  • edge4 = mean_val + stddev_val
  • edge5 = mean_val + 2 * stddev_val
  • edge6 = float('inf')
  • edges = [edge0, edge1, edge2, edge3, edge4, edge5, edge6]
The bin ranges were used in conjunction with the PySpark ML Bucketizer function to convert the respective columns into a limited range of integer values. The resulting bins and distribution for the duration column, using the above binning outline, are presented in Table 3 and Figure 2.
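As an illustration, a minimal PySpark sketch of this step is given below, assuming a DataFrame df already loaded with the Conn log records; the use of approxQuantile for the 10% trim and all variable names are illustrative rather than the paper’s exact code.

from pyspark.sql import functions as F
from pyspark.ml.feature import Bucketizer

col = "duration"
nonnull = df.where(F.col(col).isNotNull())

# 10% trim on each tail: keep the middle 80% of values for the
# mean and standard deviation calculations.
low, high = nonnull.approxQuantile(col, [0.10, 0.90], 0.001)
trimmed = nonnull.where((F.col(col) >= low) & (F.col(col) <= high))
row = trimmed.agg(F.mean(col).alias("m"), F.stddev(col).alias("s")).first()
mean_val, stddev_val = row["m"], row["s"]

# Edges at -inf, mean - 2s, mean - s, mean, mean + s, mean + 2s, +inf.
edges = [float("-inf"),
         mean_val - 2 * stddev_val, mean_val - stddev_val, mean_val,
         mean_val + stddev_val, mean_val + 2 * stddev_val, float("inf")]

# handleInvalid="keep" routes null/NaN values into an extra bucket of their own.
bucketizer = Bucketizer(splits=edges, inputCol=col,
                        outputCol=col + "_bin", handleInvalid="keep")
binned = bucketizer.transform(df)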
In the resulting distribution for duration, bin1 and bin2 end in a negative value at their high end and contain zero values. This was caused by the skew of the dataset: two standard deviations to the left of the dataset’s mean falls below 0, even though duration is never negative. The contrast with a normal distribution, for which standard-deviation-based binning is ideal, can be seen in Figure 3, generated using the maximum value of the trimmed dataset, which clearly shows how heavily the values are concentrated at the low end.

Using the Moving-Mean to Bin Continuous Valued Columns

To maintain the desired number of bins in spite of the skewness of the data and to avoid redundant bin ranges, a moving-mean logic was inserted during the establishment of the edges (where the minimum value of an attribute never drops below 0):
  • if min_val >= 0:
  •   while mean_val - 2 * stddev_val < 0:
  •     mean_val += stddev_val
This moving-mean logic would insert itself in the binning method such that the bins established through a set number of standard deviations within the mean value of the trimmed dataset would shift rightwards with the moving-mean value until the first two bins no longer encompassed a redundant range of data. Table 4 presents the distribution with this process performed on the duration column.
As can be seen in Figure 4, the moving-mean generated two additional bins, taking range away from the previously established bin6. Where bin6 previously covered duration values more than three standard deviations from the true mean of the trimmed dataset, after the emergence of the two additional bins it covers values more than five standard deviations from the true mean.
The prominence of bin6 in both sets of results, as well as the obvious positive skew of the data, is at odds with the assumption of a normal distribution. This, however, must be accepted as part of the level of abstraction inherent in any binning methodology.
Table 5 shows the bin counts for each attribute when this binning method is applied to the continuous attributes in the Zeek Conn file. The moving-mean bins were also calculated. An additional bin was generated for null values with a bin designation of 0.

4.1.2. Binning Nominal Valued Columns

Nominal values within this dataset are columns containing non-numeric data. They may contain names of things, categories, states, or sequences. They often contain non-numeric characters, although that is not always the case. One of the difficulties in using them in algorithms is that such data can contain many unique values whose naming conventions do not necessarily indicate an intrinsic value difference. For this analysis, non-numeric values were converted into numbers. Spark provides a convenient means of performing such conversions through the StringIndexer method from MLlib [6], Apache Spark’s scalable machine learning library. This method was set up to keep all invalid or empty values, converting those as well to separate integer values for binning purposes. It was applied to the columns presented in Table 6, with the resulting bin counts.
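A minimal sketch of this conversion for one nominal column is shown below; the column names are illustrative, and handleInvalid="keep" gives invalid or empty values their own index, as described above.

from pyspark.ml.feature import StringIndexer

# Index a nominal column such as conn_state into integer codes for binning.
indexer = StringIndexer(inputCol="conn_state", outputCol="conn_state_bin",
                        handleInvalid="keep")  # keep invalid/empty values
indexed = indexer.fit(df).transform(df)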
Though the IP addresses and port numbers would also fall under this category, they were handled as explained in the next two sections.

4.1.3. Binning IP Address Columns

For binning the IP addresses of traffic source and destination, the commonly recognized network classes A, B, C, D, and E were used, each of which pertains to a specific range of the first octet in the IP address [24]. These octet ranges correlate to different default subnet masks, with ascending classes reserving more bits for the network address and fewer for the host address. Class A is best suited for serving very large networks, while Class C would normally be assigned to very small networks. Classes D and E are normally relegated to experimental use cases, such as multicasting, research, and development [24].
Below are the established octet ranges (inclusive) for each classification [24]:
  • Class A: First octet value 0–126;
  • Class B: First octet value 128–191;
  • Class C: First octet value 192–223;
  • Class D: First octet value 224–239;
  • Class E: First octet value 240–254.
Using these ranges, the data in these attributes were binned into seven categories, as shown in Table 7: one bin per class, an “Other” bin to capture values that exceed these boundaries, and a bin for null or non-applicable values. Table 7 illustrates this classification method applied to the dest_ip column of the Zeek Conn data. This column represents the IP address of a packet’s destination on the network and totaled 312 unique addresses within the scope of the sample dataset.
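A sketch of this classification as a PySpark user-defined function is given below; the function and column names are illustrative, not the paper’s code.

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def ip_class_bin(ip):
    # Map an IPv4 address to its class-based bin; 0 = null/non-applicable.
    if ip is None:
        return 0
    try:
        octet = int(ip.split(".")[0])
    except (ValueError, IndexError):
        return 0
    if 0 <= octet <= 126:
        return 1  # Class A
    if 128 <= octet <= 191:
        return 2  # Class B
    if 192 <= octet <= 223:
        return 3  # Class C
    if 224 <= octet <= 239:
        return 4  # Class D
    if 240 <= octet <= 254:
        return 5  # Class E
    return 6      # Other (e.g., the loopback octet 127)

ip_bin_udf = F.udf(ip_class_bin, IntegerType())
df = df.withColumn("dest_ip_bin", ip_bin_udf(F.col("dest_ip")))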
Based on Table 7, Class A and Class B IP addresses were the most plentiful in these data. This is to be expected given that Classes A, B, and C are considered the most commonly occurring IP types in most network traffic [24]. The remaining bins represent a small fraction of the total dataset.
Applied to both of the IP columns in the Zeek Conn data, Table 8 shows the changes in unique values before and after binning.

4.1.4. Binning Port Numbers

Port numbers can range widely in value, with the Internet Assigned Numbers Authority (IANA) administering ports 0 through 65,535 [25]. This spectrum can be divided into three ranges covering well-known ports, registered ports, and dynamic/private ports. Well-known ports range from 0 to 1023 and are protected, with operating systems restricting access to these ports to only processes with appropriate privileges. Registered ports range from 1024 to 49,151, and their use should only be with IANA registration. All other ports, up to 65,535, can be used more freely for all manner of purposes.
Based on this classification system, Table 9 shows the bin ranges that were generated to represent the three port ranges, with a bin value of 0 representing null values. With the bin ranges applied in Table 9, the results for dest_port and src_port are as shown in Table 10.
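These three ranges map directly onto PySpark Bucketizer splits; a sketch is given below. Note that Bucketizer’s bucket indices are 0-based and therefore differ from the bin labels in Table 9, handleInvalid="keep" routes nulls to an extra bucket, and the column names are illustrative.

from pyspark.ml.feature import Bucketizer

# Well-known [0, 1024), registered [1024, 49152), dynamic/private [49152, 65536).
port_splits = [0.0, 1024.0, 49152.0, 65536.0]
port_bucketizer = Bucketizer(splits=port_splits, inputCol="dest_port",
                             outputCol="dest_port_bin", handleInvalid="keep")
df = port_bucketizer.transform(df)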
The difference between the two port types (src_port and dest_port) within this dataset is also more clearly defined, with most of the source network traffic originating from registered and dynamic ports and the destination ports largely made up of the well-known ports, as can be observed in Table 11.

4.2. Information Gain

After binning, information gain was used to assess the relevance of the 18 features. Information gain is the difference between the entropy of the class and the expected entropy of the class after splitting on a selected feature, with entropy measuring the extent of randomness in the dataset [27]. It is an assessment of the usefulness of a feature in classification.
The following calculations [28] were performed on each feature to produce information gain values for ranking purposes:
Gain(A) = Info(D) − InfoA(D)
where
Info(D) = −Σi=1..m pi log2(pi)
InfoA(D) = Σj=1..v (|Dj|/|D|) × Info(Dj)
Given that
  • Info(D) is the average amount of information needed to identify the class label of a tuple in D;
  • InfoA(D) is the expected information required to classify a tuple from D based on partitioning from A;
  • pi is the nonzero probability that an arbitrary tuple belongs to a class;
  • |Dj|/|D| is the weight of the partition.
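For concreteness, a small pure-Python sketch of this calculation for one binned feature is given below; this is illustrative code, not the implementation used in the study.

import math
from collections import Counter

def entropy(labels):
    # Info(D) = -sum(pi * log2(pi)) over the class probabilities.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    # Gain(A) = Info(D) - InfoA(D) for one feature column A.
    n = len(labels)
    partitions = {}
    for v, y in zip(feature_values, labels):
        partitions.setdefault(v, []).append(y)
    info_a = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - info_a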
The information gain values for the features (attributes) in the Zeek Conn logs are presented in Table 12.
The higher an attribute appears on the list, the more relevant it is in the classification process. Based on these results, the top 6, top 9, top 12, and all 18 attributes were used to run the machine learning algorithms.

5. Experimentation

Figure 5 presents the experimentation process. First, after binning the raw network data, information gain was calculated. Then, the two tactics, reconnaissance and discovery, were separated into their own dataframes for binary classification. The ratio of attack to benign data was maintained at 30%:70% in each dataframe. The dataset was split 70:30 for training and testing the machine learning classifiers. For each machine learning algorithm, four sets of attributes were tested: 6, 9, 12, and 18.
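A sketch of the feature assembly and split for one attribute set is shown below, assuming a DataFrame recon_df of binned reconnaissance and benign records with a numeric label column; the names and seed are illustrative.

from pyspark.ml.feature import VectorAssembler

# Top six attributes by information gain (Table 12).
top6 = ["history", "protocol", "service", "orig_bytes", "dest_ip", "orig_pkts"]
assembler = VectorAssembler(inputCols=[c + "_bin" for c in top6],
                            outputCol="features")
data = assembler.transform(recon_df).select("features", "label")

# 70:30 split for training and testing.
train, test = data.randomSplit([0.7, 0.3], seed=42)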

5.1. Computing Environment

The University of West Florida’s Hadoop cluster was used. This cluster consists of six Dell PowerEdge R730XD servers and three Dell PowerEdge R730 servers, each of which has dual Intel Xeon E5-2650v3 CPUs with 10 cores and 20 threads per CPU (40 logical cores per server) and 128 GB of DDR4 RDIMM RAM. The cluster has six worker nodes that execute the machine learning algorithms. Spark 3.2.1 and Hadoop 3.3.1 were used for the environment. Spark provides a rich set of APIs that were used to implement and run the various machine learning algorithms [29].

5.2. Machine Learning Algorithms Used

Six different classifiers were compared: logistic regression, naïve Bayes, random forest, gradient boosted tree, decision tree, and support vector machines. Spark’s machine learning libraries were used to run the various classifiers, and BinaryClassificationMetrics was used to generate statistical results such as accuracy, precision, recall, F-measure, area under the curve, and false-positive rate.
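As an illustration of this workflow, the sketch below trains one classifier on the split from the earlier sketch and derives metrics with Spark’s RDD-based metrics classes; the exact evaluation code used in the study is not given in the paper.

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.mllib.evaluation import BinaryClassificationMetrics, MulticlassMetrics

model = DecisionTreeClassifier(featuresCol="features", labelCol="label").fit(train)
predictions = model.transform(test)

# The RDD-based metrics classes expect (prediction, label) pairs of floats.
pred_and_label = (predictions.select("prediction", "label")
                  .rdd.map(lambda r: (float(r[0]), float(r[1]))))
binary = BinaryClassificationMetrics(pred_and_label)
multi = MulticlassMetrics(pred_and_label)
print("AUROC:   ", binary.areaUnderROC)
print("Accuracy:", multi.accuracy)
print("FPR:     ", multi.falsePositiveRate(0.0))  # benign (negative) label = 0.0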

5.2.1. Logistic Regression

Logistic regression (LR) applies a linear discriminant rule and is a widely used machine learning algorithm for the classification of data [11]. Logistic regression considers each feature by associating a specific weight with it to generate a probability of being classified to a specific class. Larger weights represent more variation in a feature and have a larger impact on the algorithm. Binary responses can be predicted using binary logistic regression. Spark 3.2.1’s pyspark.ml.classification.LogisticRegression implementation was used in this study, and the parameters are specified in Table 13.
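A minimal instantiation with the non-default values from Table 13 (maxIter = 10, regParam = 0.3, elasticNetParam = 0.8), reusing the earlier train split, would look as follows:

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="label",
                        maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train)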

5.2.2. Naïve Bayes

Naïve Bayes (NB) is a popular machine learning algorithm that uses the Bayes rule of classification and assumes independence of features [10]. The model is easy to build and intuitive [10], and studies have shown that it works with high accuracy for smaller datasets. Spark 3.2.1’s pyspark.ml.classification.NaiveBayes implementation was used in this study, and all parameters used are default values, as shown in Table 13.

5.2.3. Random Forest

Random forest (RF) is an ensemble method; multiple decision trees are constructed, with each tree being built from a limited number of the available features. Classification is performed by polling the collection of trees and taking a majority vote. Spark 3.2.1’s pyspark.ml.classification.RandomForestClassifier was used in this study, with the default settings shown in Table 13.

5.2.4. Gradient Boosted Tree

Gradient boosted tree (GBT) is another class of ensemble methods with many popular implementations. Gradient boosted tree is based on constructing multiple decision trees in sequence, with each decision tree being made in such a way as to minimize the classification error rate of the previous tree(s). Spark 3.2.1’s pyspark.ml.classification.GBTClassifier implementation, which implements stochastic gradient boosting [30], was used in this study, and the default parameters were used, as shown in Table 13.

5.2.5. Decision Tree

A decision tree (DT) is a set of nodes that partitions the data, each node returning a binary decision (yes or no) given the node’s condition. Based on the decision made, the corresponding child node is given the input vector. This continues until a leaf node is reached and the final decision is made. The Spark implementation supports decision trees for binary and multiclass classification and for regression. Spark’s decision tree is built using information gain through a greedy algorithm that performs recursive binary partitioning.
Spark 3.2.1’s pyspark.ml.classification.DecisionTreeClassifier implementation was used in this study. Most default parameters were used, as presented in Table 13, except for maxBins and maxDepth. Due to the binning methods used, none of the features used in the training and test phases contained more than 60 unique values. maxBins was set to 100 in case any of the nominal attributes had significantly more unique values in larger datasets. To further account for this possibility, the potential depth of the resulting classifier tree was increased from 5 to 100, at the expense of performance, to ensure a large degree of granularity.

5.2.6. Support Vector Machines

Support vector machines (SVM) is a supervised machine learning algorithm used for regression and classification. The goal of the SVM algorithm is to find an optimal hyperplane (also known as a classification vector) in a multi-dimensional space, which allows for the classification of data points relative to the hyperplane. Linear SVM in Spark machine learning supports binary classification. Internally, it optimizes the hinge loss using the OWLQN optimizer. Spark 3.2.1’s pyspark.ml.classification.LinearSVC implementation was used in this study, and the default settings were used, as shown in Table 13.

6. Results

The results are presented in two parts. The first part presents the results for testing the various Spark parameters in the big data environment, and the second part presents the results of the machine learning classifiers run with the best-performing Spark parameters (determined in the first part).

6.1. Testing Spark’s Configuration Parameters

Spark’s configuration parameters, as shown in Table 14, were declared during the creation of the Spark session and dictated the resources allocated to the algorithms.
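The sketch below shows how such parameters are declared when the session is created; the values are illustrative of the ranges tested in Tables 15 and 16, not the exact chosen configuration.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("zeek-conn-classification")
         .config("spark.executor.instances", "6")
         .config("spark.executor.cores", "16")   # 6 x 16 = 96 total cores
         .config("spark.executor.memory", "16g")
         .config("spark.driver.cores", "4")
         .config("spark.driver.memory", "8g")
         .config("spark.sql.shuffle.partitions", "6")
         .getOrCreate())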

6.1.1. Performance Results with Various Executor Parameters

The decision tree classifier, with 18 attributes, was used to analyze the executor parameters. Table 15 shows the effect of executor count, executor core count, total executor cores, executor memory, and total executor memory on binning time and training time in seconds. Table 16 shows how additional parameters such as driver memory, driver cores, and shuffle partitions affect total time.
For the rest of the figures in this section, the binning time, training time, and testing time were combined into a single value called “process time”.
Figure 6 presents process times versus the memory per executor. Executor memory is the amount of memory (in GB) assigned to each executor for use with its assigned cores. From Figure 6, there does not appear to be a strong correlation between total executor memory and processing time. While memory is a necessary component for utilizing the executors, actual performance improvements can be more accurately attributed to other parameters. Fine tuning could therefore involve reducing the amount of unnecessary memory assigned to executors, increasing available resources on the server.
There was a more noticeable effect on performance, albeit with diminishing returns, with the number of executors instantiated and the number of cores assigned to each of them. Multiplying these two values, we arrive at total executor cores. As can be seen in Figure 7, increasing the number of cores available from 10 to 50 had a significant impact on performance, with the aforementioned diminishing returns becoming readily apparent beyond that count and especially after a total core count of 100.
Figure 8 sheds light on the optimal number of executor instances given the total number of cores assigned to the process. The smaller the circle, the less processing time (that is, binning time + training time + testing time) it took to generate the results. The total number of executor cores had a high impact on improving the processing time. This observation is borne out here as well, with higher values of total cores showing smaller circles.
An interesting observation can be made along the horizontal line representing 96 total cores (approx. 100). Along this axis are test results for executor counts at 3, 6, 12, and 24. All other parameters being equal, the best results at this core count range were shown with six executor instances. While the best performance in Figure 8 (based on the smallest circle) is given by the lone results at a total core count of 196 (approx. 200), the 96 total core count option with six executors was regarded as the best and hence used to test the overall performance of the machine learning algorithms.
The shuffle partitions parameter was also tested. This is displayed with executor count and performance in Figure 9 and Figure 10. Trends at both 6 and 12 executors show that using fewer shuffle partitions improves performance, with the fastest performance occurring when the number of shuffle partitions is the same as the total number of executors. Values for the shuffle partitions attribute lower than the total number of executors allocated were not tested, as this would leave executors beyond the number of shuffle partitions idle.
We elected to use the parameters detailed in Test ID 19 from Table 16. This is not the fastest performance, but it was less than 3% slower than our fastest settings while using half of the total resources. Table 17 and Table 18 show these Spark configuration parameter settings.

6.1.2. Effect on Training Time

As shown in Figure 11, the number of attributes used does not seem to strongly correlate to model training time. The differences in training time between discovery and reconnaissance can be attributed to the size difference between the training sets; the discovery training set contains around 6000 records, while the reconnaissance training set contains around 20,000,000 records.

6.2. Machine Learning Performance

This section presents the results of binary classification performed on the Zeek Conn Log data using the six machine learning algorithms (decision tree, gradient boosting tree, logistic regression, naïve Bayes, random forest, and support vector machine) using the best parameters determined from the previous section (as shown in Table 17 and Table 18).

6.2.1. Evaluation Metrics

Results were collected for accuracy, precision, recall, false-positive rate (FPR), F-measure, area under operating characteristics curve (AUROC), and training/testing times in seconds. Next, these metrics are defined.
Accuracy: Accuracy is the number of correct classifications (i.e., true positives and negatives) divided by the total number of classifications [32].
Accuracy = [True Positives + True Negatives]/
[True Positives + False Positives + True Negatives + False Negatives]
Precision: Precision, also known as confidence, is the proportion of predicted positive cases that are correctly labeled as positive [33]. Precision by label considers only one class and measures the number of times a specific label was predicted correctly, normalized by the number of times that label appears in the output.
Precision = Positive Predictive Value =
[True Positives]/[True Positives + False Positives]
Recall: Intuitively, recall is the ability of the classifier to find all the positive samples, i.e., the true-positive rate. Recall is also known as sensitivity and is the proportion of real positive (RP) cases that are correctly predicted as positive (TP) [33].
All Real Positives = [True Positives + False Negatives]
All Real Negatives = [True Negatives + False Positives]
Recall = True Positive Rate = [True Positives]/[All Real Positives]
False-positive rate (FPR): This is the proportion of negative labels that are predicted to be positive.
False Positive Rate (FPR) = [False Positives]/[All Real Negatives]
F-Measure: The F-measure is defined as the harmonic mean of a prediction’s precision and recall metrics. It is another overall measure of the test’s accuracy [34].
F-Measure = 2 × [Precision × Recall]/[Precision + Recall]
AUROC: A receiver operating characteristics (ROC) graph is a technique for visualizing, organizing, and selecting classifiers based on their performance. ROC graphs depict the relative tradeoffs between the true and false positives. They are two-dimensional graphs upon which classifiers are plotted, with the y-axis being the true-positive rate and the x-axis the false-positive rate (FPR) [35].
The area under ROC (AUROC) is the percentage of the area under the resulting line relative to the entire area of the graph [36].
Recall, sensitivity, and TPR denote the same measure.
Training and testing time: The time it took for an algorithm to complete its training and testing processes (in seconds) is recorded in these two columns, respectively. All other factors being equal, an algorithm is considered to have performed better than another if it completed its calculations more quickly.
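The formulas above reduce to a few lines of code; a small sketch computing them from confusion-matrix counts is given below.

def classification_metrics(tp, fp, tn, fn):
    # Metrics as defined above, from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # true-positive rate / sensitivity
    fpr = fp / (fp + tn)                    # false-positive rate
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, fpr, f_measure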

6.2.2. Machine Learning Classifier Results

Binary classification was performed using each of the six machine learning classifiers for 6, 9, 12, and 18 attributes.

Machine Learning Classifier Results for Reconnaissance

Machine learning classifier results for the reconnaissance tactic for 6, 9, 12, and 18 attributes are presented in Table 19.
From Table 19, for the reconnaissance tactic, it can be noted that, in terms of accuracy, decision tree, gradient boosting tree, and random forest had the highest averages for all sets of attributes. Naïve Bayes had the lowest accuracy, and logistic regression was only slightly higher than naïve Bayes. In terms of recall, gradient boosting tree and random forest had the highest recall, followed by decision tree. Naïve Bayes, support vector machine, and logistic regression had lower recall rates. In terms of false-positive rates, it can be noted that decision tree had the lowest false-positive rates, and support vector machine and gradient boosting tree had the second lowest false-positive rates. Naïve Bayes had the highest false-positive rates. In terms of training time, random forest had the lowest training time, followed by decision tree, for all attribute combinations. Gradient boosting tree had the highest training times.

Machine Learning Classifier Results for Discovery

Machine learning classifier results for the discovery tactic for 6, 9, 12, and 18 attributes are presented in Table 20.
From Table 20, for the discovery tactic, in terms of accuracy, it can be noted that decision tree, gradient boosting tree, and random forest had higher accuracy for all sets of attributes. In terms of recall, gradient boosting tree, naïve Bayes, and logistic regression had higher recall for all sets of attributes, with decision tree close behind. Support vector machine performed poorly in terms of recall. In terms of false-positive rates, decision tree, gradient boosting tree, and random forest all had lower false-positive rates for all combinations of attributes, although decision tree appeared to perform the best.

6.2.3. Overall Results for Reconnaissance and Discovery

Figure 12 shows that, for reconnaissance, on average, 6 attributes performed slightly worse than 9, 12, and 18 attributes. However, a close look at Table 19 shows that this is because the SVM results for six attributes are not consistent with the rest of the results; hence, Figure 13 shows the averages for reconnaissance without SVM (note that SVM was removed only from the six-attribute list). Without SVM’s six-attribute results, all the other results are higher for six attributes.
From Figure 11, we can also note that the training time was lower, on average, for six attributes. Therefore, for reconnaissance, the top six attributes (history, protocol, service, orig_bytes, dest_ip, and orig_pkts) would be enough to classify the reconnaissance tactic with any of the classifiers except SVM.
Figure 14 shows that, for discovery, on average, 18 attributes performed the lowest, and though 6 and 9 attributes performed closely, 9 attributes had slightly higher averages. However, since the top six attributes (history, protocol, service, orig_bytes, dest_ip, and orig_pkts) had results so close to the top nine attributes, the case could be made to recommend six attributes given that, in terms of training time as shown in Figure 11, six attributes performed better than nine attributes.

7. Conclusions

The objective of this paper was to see whether the reconnaissance and discovery tactics, labelled using the MITRE ATT&CK framework, could be identified from the Zeek Conn logs of a newly created dataset, UWF-ZeekData22 [3]. In addition to the performance of the machine learning classifiers in Spark, scalability and response time were also analyzed.
In terms of optimizing classifier performance on the Spark cluster, we found that more total cores provided to the Spark application make machine learning algorithms run faster but with diminishing returns. It can be noted that classifiers run fastest when the number of shuffle partitions is the same as the total number of executors. There was no significant correlation between runtimes and the total amount of memory allocated (though allocating too little memory can cause executors to crash).
Machine learning results indicate that the tree-based methods (decision tree, gradient boosting tree, and random forest) performed better on most metrics than the other three algorithms in classifying this dataset for both the reconnaissance and discovery tactics. These three algorithms all showed 99%+ accuracy for both attack tactics, with similarly higher scores in precision, recall, F-measure, and AUROC. Of the tree-based methods, gradient boosting tree and random forest performed a little better than decision tree in terms of recall for both tactics, but in terms of the false-positive rate, decision tree had the lowest false-positive rates for both reconnaissance and discovery (in fact, it was at 0% for discovery). Gradient boosting tree and random forest also performed well in terms of false-positive rate for discovery, but for reconnaissance, gradient boosting tree performed a little better than random forest. These results also suggest that the binning methods used in this study were effective in the classification process.
With respect to the training times of the tree-based classifiers, random forest performed the best for the reconnaissance tactic, followed by decision tree, and for the discovery tactic, decision tree performed the best.
With respect to the number of attributes to be used, the top six attributes from information gain (history, protocol, service, orig_bytes, dest_ip, and orig_pkts) can be considered enough to provide the best classification results for both the reconnaissance and discovery tactics for all the machine learning classifiers (except SVM with respect to reconnaissance), and the training time of the top six attributes was also lower.

8. Future Work

This work analyzes Zeek network connections and classifies them based on the reconnaissance and discovery tactics as defined by the MITRE ATT&CK framework. Although we only labelled the attacks based on tactics, we will extend this work to reflect the attack chain and techniques with corresponding IDs from the ATT&CK framework. Using the ATT&CK framework tactics, techniques, and procedures, we intend to analyze more attack traffic and label them with appropriate attack techniques. This will eventually provide a rich dataset for the research community to conduct more in-depth and relevant research on attack classification and pattern recognition using the MITRE ATT&CK mapping.

Author Contributions

This work was conceptualized by S.B. (Sikha Bagui), D.M., S.B. (Subhash Bagui) and T.G.; methodology was mainly done by S.B. (Sikha Bagui), D.M., S.B. (Subhash Bagui), T.G., T.M., E.P., N.K. and R.P.; software was done by T.M., E.P., N.K. and R.P.; validation was done by S.B. (Sikha Bagui), D.M., S.B. (Subhash Bagui), T.G., T.M., E.P., N.K. and R.P.; formal analysis was done by T.M., E.P. and N.K.; investigation was done by S.B. (Sikha Bagui), D.M., S.B. (Subhash Bagui), T.G., T.M., E.P., N.K. and R.P.; data curation was done by S.B. (Sikha Bagui), T.M., E.P., N.K. and R.P.; writing—original draft preparation was done by T.M., E.P. and N.K.; writing—review and editing was done by S.B. (Sikha Bagui), D.M., S.B. (Subhash Bagui) and T.G.; visualization was done by T.M., E.P. and N.K.; supervision was done by S.B. (Sikha Bagui), D.M., S.B. (Subhash Bagui) and T.G.; project administration was done by S.B. (Sikha Bagui) and D.M.; funding acquisition was done by S.B. (Sikha Bagui), D.M. and S.B. (Subhash Bagui). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Centers of Academic Excellence in Cybersecurity, 2021 NCAE-C-002: Cyber Research Innovation Grant Program, Grant Number: H98230-21-1-0170.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available at datasets.uwf.edu (accessed on 20 August 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Attributes of Zeek Conn Logs Dataset.
Attribute Name | Description of Attribute
ts | Time of first packet
uid | Unique identifier of connection
id.orig_h | IP address of packet sender
id.orig_p | Outgoing port number
id.resp_h | IP address of packet receiver
id.resp_p | Incoming port number
proto | Transport layer protocol of connection
service | Identification of application protocol being sent over connection
duration | How long connection lasted
orig_bytes | Number of payload bytes originator sent
resp_bytes | Number of payload bytes responder sent
conn_state | Possible connection states
local_orig | If connection originated locally, value is T
local_resp | If connection is responded to locally, value is T
missed_bytes | Number of bytes missed in content gaps, representative of packet loss
history | Records the state history of connections as a string of letters
orig_pkts | Number of packets originator sent. Set if: zeek:id:use_conn_size_analyzer = T
orig_ip_bytes | Number of IP level bytes originator sent. Set if: zeek:id:use_conn_size_analyzer = T
resp_pkts | Number of packets responder sent. Set if: zeek:id:use_conn_size_analyzer = T
resp_ip_bytes | Number of IP level bytes responder sent
community_id |
id | Connection’s 4-tuple of endpoint addresses/ports
tunnel_parents | If connection was over tunnel, indicates *uid* values for encapsulating parent connection(s) used over lifetime of inner connection

References

  1. Huong, T.T.; Bac, T.P.; Long, D.M.; Thang, B.D.; Binh, N.T.; Luong, T.D.; Phuc, T.K. LocKedge: Low-Complexity Cyberattack Detection in IoT Edge Computing. IEEE Access 2021, 9, 29696–29710. [Google Scholar] [CrossRef]
  2. Zeek: About. The Zeek Project. 2020. Available online: https://zeek.org/about/ (accessed on 15 February 2022).
  3. University of West Florida. 2022. Available online: https://datasets.uwf.edu/ (accessed on 20 August 2022).
  4. Trellix. What Is the MITRE ATT&CK Framework?|Get the 101 Guide. 2022. Available online: https://www.trellix.com/en-us/security-awareness/cybersecurity/what-is-mitre-attack-framework.html (accessed on 15 February 2022).
  5. MITRE ATT&CK. Reconnaissance, Tactic TA0043—Enterprise| MITRE ATT&CK®. 2022. Available online: https://attack.mitre.org/tactics/TA0043/ (accessed on 16 February 2022).
  6. MITRE ATT&CK. Discovery, Tactic TA0007—Enterprise| MITRE ATT&CK®. 2022. Available online: https://attack.mitre.org/tactics/TA0007/ (accessed on 16 February 2022).
  7. The Zeek Project. Base/Protocols/Conn/Main.Zeek—Book of Zeek (v5.0.0). Zeek. 2022. Available online: https://docs.zeek.org/en/v5.0.0/scripts/base/protocols/conn/main.zeek.html (accessed on 16 February 2022).
  8. Kala Karun, A.; Chitharanjan, K. A review on hadoop; HDFS infrastructure extensions. In Proceedings of the 2013 IEEE Conference on Information and Communication Technologies, Thuckalay, India, 11–12 April 2013. [Google Scholar] [CrossRef]
  9. Belouch, M.; El Hadaj, S.; Idhammad, M. Performance evaluation of intrusion detection based on machine learning using Apache Spark. Procedia Comput. Sci. 2018, 127, 1–6. [Google Scholar] [CrossRef]
  10. Gupta, G.P.; Kulariya, M. A Framework for Fast and Efficient Cyber Security Network Intrusion Detection Using Apache Spark. Procedia Comput. Sci. 2016, 93, 824–831. [Google Scholar] [CrossRef] [Green Version]
  11. Morfino, V.; Rampone, S. Towards Near-Real-Time Intrusion Detection for IoT Devices using Supervised Learning and Apache Spark. Electronics 2020, 9, 444. [Google Scholar] [CrossRef] [Green Version]
  12. Malik, A.J.; Khan, F.A. A hybrid technique using binary particle swarm optimization and decision tree pruning for network intrusion detection. Clust. Comput. 2018, 21, 667–680. [Google Scholar] [CrossRef]
  13. Kevric, J.; Jukic, S.; Subasi, A. An effective combining classifier approach using tree algorithms for network intrusion detection. Neural Comput. Appl. 2017, 28, 1051–1058. [Google Scholar] [CrossRef]
  14. Zhang, J.; Sun, J.; He, H. Clustering Detection Method of Network Intrusion Feature Based on Support Vector Machine and LCA Block Algorithm. Wirel. Pers. Commun. 2021, 134, 1–15. [Google Scholar] [CrossRef]
  15. Du, R.; Li, Y.; Liang, X.; Tian, J. Support Vector Machine Intrusion Detection Scheme Based on Cloud-Fog Collaboration. Mob. Netw. Appl. 2022, 27, 431–440. [Google Scholar] [CrossRef]
  16. Leevy, J.L.; Hancock, J.; Zuech, R.; Khoshgoftaar, T.M. Detecting cybersecurity attacks across different network features and learners. J. Big Data 2021, 8, 38. [Google Scholar] [CrossRef]
  17. Sharafaldin, I.; Habibi Lashkari, A.; Ghorbani, A.A. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. In Proceedings of the 4th International Conference on Information Systems Security and Privacy, Madeira, Portugal, 22–24 January 2018. [Google Scholar] [CrossRef]
  18. Microsoft. Address Classes. Microsoft Docs. 2012. Available online: https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/cc940018(v=technet.10)?redirectedfrom=MSDN (accessed on 8 April 2022).
  19. Bagui, S.; Simonds, J.; Plenkers, R.; Bennett, T.A.; Bagui, S. Classifying UNSW-NB15 network traffic in the Big Data Framework using random forest in Spark. Int. J. Big Data Intell. Appl. 2021, 2, 39–61. [Google Scholar] [CrossRef]
  20. Rostami, S.; Kleszcz, A.; Dimanov, D.; Katos, V. A machine learning approach to dataset imputation for software vulnerabilities. Commun. Comput. Inf. Sci. 2020, 1284, 25–36. [Google Scholar] [CrossRef]
  21. European Union Agency for Cybersecurity. State of Vulnerabilities 2018/2019—Analysis of Events in the Life of Vulnerabilities. Available online: https://www.enisa.europa.eu/publications/technical-reports-on-cybersecurity-situation-the-state-of-cyber-security-vulnerabilities (accessed on 9 July 2022).
  22. Al-Shaer, R.; Spring, J.M.; Christou, E. Learning the associations of MITRE ATT & CK Adversarial Techniques. In Proceedings of the 2020 IEEE Conference on Communications and Network Security (CNS), Avignon, France, 29 June–1 July 2020. [Google Scholar] [CrossRef]
  23. Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874. Available online: http://people.inf.elte.hu/kiss/11dwhdm/roc.pdf (accessed on 22 August 2022). [CrossRef]
  24. Apache Spark. StringIndexer—PySpark 3.3.0 Documentation. 2022. Available online: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html (accessed on 9 March 2022).
  25. Microsoft. TCP/IP Addressing and Subnetting—Windows Client|Microsoft Docs. 2022. Available online: https://docs.microsoft.com/en-us/troubleshoot/windows-client/networking/tcpip-addressing-and-subnetting (accessed on 12 May 2022).
  26. Internet Assigned Numbers Authority. Service Name and Transport Protocol Port Number Registry. 2022. Available online: https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml (accessed on 12 May 2022).
  27. Apache Spark. Configuration—Spark 3.3.0 Documentation. 2022. Available online: https://spark.apache.org/docs/latest/configuration.html (accessed on 9 March 2022).
  28. Oracle. Securing Ports. 2022. Available online: https://docs.oracle.com/cd/E89228_03/otn/pdf/install/html_edmsc/output/chapter_6.htm (accessed on 9 May 2022).
  29. Salloum, S.; Dautov, R.; Chen, X.; Peng, P.X.; Huang, J.Z. Big data analytics on Apache Spark. Int. J. Data Sci. Anal. 2016, 1, 145–164. [Google Scholar] [CrossRef] [Green Version]
  30. Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002, 38, 367–378. [Google Scholar] [CrossRef]
  31. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Morgan Kaufmann: Waltham, MA, USA, 2011. [Google Scholar]
  32. Accuracy Score. Scikit-Learn. 2022. Available online: https://scikit-learn.org/stable/ (accessed on 6 August 2022).
  33. Available online: https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score (accessed on 15 July 2022).
  34. Sasaki, Y. The Truth of the F-Measure. 2007. Available online: https://www.toyota-ti.ac.jp/Lab/Denshi/COIN/people/yutaka.sasaki/F-measure-YS-26Oct07.pdf (accessed on 16 August 2022).
  35. Powers, D.M.W. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63. [Google Scholar]
  36. Available online: https://web.archive.org/web/20191114213255/https://www.flinders.edu.au/science_engineering/fms/School-CSEM/publications/tech_reps-research_artfcts/TRRA_2007.pdf (accessed on 16 August 2022).
Figure 1. Schema of Conn Log File.
Figure 2. Initial value distribution for duration across bins.
Figure 3. Normal distribution of duration with improved standard deviation.
Figure 4. Value distribution for duration across bins after moving-mean logic.
Figure 5. Experimental flow.
Figure 6. Process Times vs. Memory per Executor.
Figure 7. Process Times vs. Total Number of Executor Cores.
Figure 8. Processing times by number of executors and total number of cores across all executors.
Figure 9. Processing times as a function of executor count and shuffle partitions.
Figure 10. Processing times as a function of executor count and shuffle partitions, Time ≤ 200 s.
Figure 11. Number of attributes vs. training times for all algorithms.
Figure 12. Reconnaissance: Averages for All Algorithms by Number of Features.
Figure 13. Reconnaissance: Averages for All Algorithms by Number of Features (SVM’s six attributes excluded).
Figure 14. Discovery: Averages for All Algorithms by Number of Features Used.
Table 1. Conn log breakdown by MITRE ATT&CK tactic.
Label_Tactic | Count
Discovery | 2086
Lateral movement | 4
Privilege escalation | 13
Reconnaissance | 9,278,722
Persistence | 1
Initial access | 1
Exfiltration | 7
Defense evasion | 1
Resource development | 3
Credential access | 31
Benign data | 9,281,599
Table 2. Conn log breakdown by MITRE ATT&CK technique.
MITRE ATT&CK Technique | Count
T1595 | 9,278,722
T1078 | 1
T1048 | 7
T1068 | 12
T1046 | 2086
T1588 | 3
T1021 | 4
T1552 | 31
Table 3. Distribution for Duration.
Bin Range | Bin Value | Count
Bin1 | 1 | -
Bin2 | 2 | -
Bin3 | 3 | 43,548
Bin4 | 4 | 15,458
Bin5 | 5 | 7656
Bin6 | 6 | 11,208
Table 4. Distribution for duration with moving-mean logic.
Bin RangeBin ValueCount
Bin1143,548
Bin2215,458
Bin337656
Bin443382
Bin551506
Bin666320
Table 5. Number of bins for continuous attributes before and after moving-mean.
Attribute | Bins Used (Nulls Included) | Moving-Mean Bins (Nulls Included)
duration | 5 | 7
orig_bytes | 5 | 7
orig_pkts | 5 | 5
orig_ip_bytes | 6 | 6
resp_bytes | 2 | 2
resp_pkts | 4 | 6
resp_ip_bytes | 4 | 6
missed_bytes | 1 | 1
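For concreteness, the generic step behind Tables 3–5 can be sketched with PySpark’s Bucketizer. The split points below are purely illustrative, and the moving-mean rebalancing that produces the Table 4 counts is specific to this work, so only the equal-width binning is shown; with handleInvalid='keep', null durations receive their own bucket, consistent with the “nulls included” bookkeeping in Table 5.

```python
# Illustrative sketch: equal-width binning of the continuous `duration`
# attribute with PySpark's Bucketizer. Split points are hypothetical,
# not the bin edges actually used in the paper.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

spark = SparkSession.builder.appName("conn-binning").getOrCreate()

df = spark.createDataFrame(
    [(0.002,), (0.15,), (1.7,), (12.0,), (95.0,), (None,)],
    ["duration"],
)

# Six buckets bounded by -inf/+inf guards; nulls/NaNs get an extra bucket.
splits = [float("-inf"), 0.01, 0.1, 1.0, 10.0, 100.0, float("inf")]
bucketizer = Bucketizer(
    splits=splits,
    inputCol="duration",
    outputCol="duration_bin",
    handleInvalid="keep",  # invalid values go to their own bucket
)
binned = bucketizer.transform(df)
binned.groupBy("duration_bin").count().orderBy("duration_bin").show()
```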
Table 6. Number of bins for integer attributes in Zeek Conn data.
Attribute | Counts of Integer Bins
proto | 3
conn_state | 10
local_orig | 2
local_resp | 2
history | 59
service | 10
Table 7. Binning illustrated for dest_ip attribute in dataset.
Classification | Dest_Ip Bin | Count
Class A | 1 | 33,039
Class B | 2 | 46,063
Class C | 3 | 363
Class D | 4 | 57
Class E | 5 | 0
Other | 6 | 5
Null/Non-applicable | 0 | 569
Table 8. Number of unique values before and after binning for dest_ip and src_ip attributes.
Attribute | Unique Values before Binning | Unique Values after Binning
dest_ip_zeek | 312 | 6
src_ip_zeek | 39 | 3
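A minimal sketch of the IP binning in Tables 7 and 8, assuming the standard classful first-octet ranges (A: 0–127, B: 128–191, C: 192–223, D: 224–239, E: 240–255); the helper name and the handling of unparsable addresses are ours, not the authors’:

```python
# Hypothetical sketch of the dest_ip binning in Table 7: map an IPv4
# address string to a classful-network bin (1=A ... 5=E, 6=Other, 0=Null).
from pyspark.sql import functions as F

def ip_class_bin(ip_col):
    """Return a Spark Column mapping an IPv4 string to its class bin."""
    first_octet = F.split(ip_col, r"\.").getItem(0).cast("int")
    return (
        F.when(ip_col.isNull(), 0)          # Null/Non-applicable
         .when(first_octet <= 127, 1)       # Class A
         .when(first_octet <= 191, 2)       # Class B
         .when(first_octet <= 223, 3)       # Class C
         .when(first_octet <= 239, 4)       # Class D
         .when(first_octet <= 255, 5)       # Class E
         .otherwise(6)                      # Other / unparsable
    )

# df = df.withColumn("dest_ip_bin", ip_class_bin(F.col("dest_ip_zeek")))
```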
Table 9. Binning shown for port number attributes.
Classification | Range | Bin Value
Well-known ports | 0–1023 | 1
Registered ports | 1024–49,151 | 2
Dynamic/Private ports | 49,152–65,535 | 3
Null values | Null | 0
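Table 9’s ranges translate directly into a conditional column expression; a sketch (the helper name is ours):

```python
# Sketch of the port binning in Table 9, applied to src_port/dest_port.
from pyspark.sql import functions as F

def port_bin(port_col):
    """Map a port number to the bin values defined in Table 9."""
    return (
        F.when(port_col.isNull(), 0)    # Null values
         .when(port_col <= 1023, 1)     # Well-known ports
         .when(port_col <= 49151, 2)    # Registered ports
         .when(port_col <= 65535, 3)    # Dynamic/Private ports
         .otherwise(4)                  # Other (out of range)
    )

# df = (df.withColumn("src_port_bin", port_bin(F.col("src_port")))
#         .withColumn("dest_port_bin", port_bin(F.col("dest_port"))))
```

Per Tables 10 and 11, bin 4 never occurs in the data; it is retained only as a guard for out-of-range values.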
Table 10. Number of unique values before and after binning dest_port and src_port.
Bin Value | Classification | Src_Port Count | Dest_Port Count
0 | Null | 0 | 0
1 | Well-known ports | 980 | 49,003
2 | Registered ports | 45,417 | 28,957
3 | Dynamic/Private ports | 33,699 | 2136
4 | Other | 0 | 0
Table 11. Differences between bin values for src_port vs. dest_port [26].
Bin Value | Classification | Src_Port Count | Dest_Port Count
0 | Null | 0 | 0
1 | Well-known ports | 980 | 49,003
2 | Registered ports | 45,417 | 28,957
3 | Dynamic/Private ports | 33,699 | 2136
4 | Other | 0 | 0
Table 12. Information gain for features in Zeek Conn dataset.
Attribute No. | Attribute | Info_Gain
1 | history | 0.827
2 | protocol | 0.77
3 | service | 0.726
4 | orig_bytes | 0.724
5 | dest_ip | 0.674
6 | orig_pkts | 0.655
7 | orig_ip_bytes | 0.572
8 | local_resp | 0.524
9 | dest_port | 0.486
10 | duration | 0.386
11 | conn_state | 0.166
12 | resp_pkts | 0.085
13 | resp_ip_bytes | 0.065
14 | src_port | 0.008
15 | resp_bytes | 0.008
16 | src_ip | 0.007
17 | local_orig | 0.002
18 | missed_bytes | 0
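The ranking in Table 12 follows the usual information-gain definition, IG(label; attribute) = H(label) − H(label | attribute), computed over the binned features. A self-contained sketch of that computation (the label column name is assumed; this is not the authors’ exact script):

```python
# Illustrative computation of information gain for one binned attribute
# against the tactic label; a sketch, not the authors' code.
import math
from pyspark.sql import functions as F

def entropy(counts):
    """Shannon entropy (bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(df, attr, label="label_tactic"):
    # H(label)
    label_counts = [r["count"] for r in df.groupBy(label).count().collect()]
    h_label = entropy(label_counts)

    # H(label | attr): entropy per attribute bin, weighted by bin size
    rows = df.groupBy(attr, label).count().collect()
    total = sum(r["count"] for r in rows)
    by_bin = {}
    for r in rows:
        by_bin.setdefault(r[attr], []).append(r["count"])
    h_cond = sum(
        (sum(counts) / total) * entropy(counts) for counts in by_bin.values()
    )
    return h_label - h_cond

# ig = information_gain(binned_df, "history")  # ~0.827 per Table 12
```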
Table 13. Parameters specified for machine learning algorithms implemented in PySpark.
Machine Learning Algorithm | Parameters Used
Logistic Regression | featuresCol: str = 'features', labelCol: str = 'label', predictionCol: str = 'prediction', maxIter: int = 10, regParam: float = 0.3, elasticNetParam: float = 0.8, tol: float = 1e-06, fitIntercept: bool = True, threshold: float = 0.5, thresholds: Optional[List[float]] = None, probabilityCol: str = 'probability', rawPredictionCol: str = 'rawPrediction', standardization: bool = True, weightCol: Optional[str] = None, aggregationDepth: int = 2, family: str = 'auto', lowerBoundsOnCoefficients: Optional[pyspark.ml.linalg.Matrix] = None, upperBoundsOnCoefficients: Optional[pyspark.ml.linalg.Matrix] = None, lowerBoundsOnIntercepts: Optional[pyspark.ml.linalg.Vector] = None, upperBoundsOnIntercepts: Optional[pyspark.ml.linalg.Vector] = None, maxBlockSizeInMB: float = 0.0
Naïve Bayes | featuresCol: str = 'features', labelCol: str = 'label', predictionCol: str = 'prediction', probabilityCol: str = 'probability', rawPredictionCol: str = 'rawPrediction', smoothing: float = 1.0, modelType: str = 'multinomial', thresholds: Optional[List[float]] = None, weightCol: Optional[str] = None
Random Forest | featuresCol: str = 'features', labelCol: str = 'label', predictionCol: str = 'prediction', probabilityCol: str = 'probability', rawPredictionCol: str = 'rawPrediction', maxDepth: int = 5, maxBins: int = 32, minInstancesPerNode: int = 1, minInfoGain: float = 0.0, maxMemoryInMB: int = 256, cacheNodeIds: bool = False, checkpointInterval: int = 10, impurity: str = 'gini', numTrees: int = 20, featureSubsetStrategy: str = 'auto', seed: Optional[int] = None, subsamplingRate: float = 1.0, leafCol: str = '', minWeightFractionPerNode: float = 0.0, weightCol: Optional[str] = None, bootstrap: Optional[bool] = True
Gradient Boosted Decision Tree | featuresCol: str = 'features', labelCol: str = 'label', predictionCol: str = 'prediction', maxDepth: int = 5, maxBins: int = 32, minInstancesPerNode: int = 1, minInfoGain: float = 0.0, maxMemoryInMB: int = 256, cacheNodeIds: bool = False, checkpointInterval: int = 10, lossType: str = 'logistic', maxIter: int = 20, stepSize: float = 0.1, seed: Optional[int] = None, subsamplingRate: float = 1.0, impurity: str = 'variance', featureSubsetStrategy: str = 'all', validationTol: float = 0.01, validationIndicatorCol: Optional[str] = None, leafCol: str = '', minWeightFractionPerNode: float = 0.0, weightCol: Optional[str] = None
Decision Tree | featuresCol: str = 'features', labelCol: str = 'label_bin', predictionCol: str = 'prediction', probabilityCol: str = 'probability', rawPredictionCol: str = 'rawPrediction', maxDepth: int = 30, maxBins: int = 100, minInstancesPerNode: int = 1, minInfoGain: float = 0.0, maxMemoryInMB: int = 256, cacheNodeIds: bool = False, checkpointInterval: int = 10, impurity: str = 'gini', seed: Optional[int] = None, weightCol: Optional[str] = None, leafCol: str = '', minWeightFractionPerNode: float = 0.0
SVM | featuresCol: str = 'features', labelCol: str = 'label_bin', predictionCol: str = 'prediction', maxIter: int = 100, regParam: float = 0.0, tol: float = 1e-06, rawPredictionCol: str = 'rawPrediction', fitIntercept: bool = True, standardization: bool = True, threshold: float = 0.0, weightCol: Optional[str] = None, aggregationDepth: int = 2, maxBlockSizeInMB: float = 0.0
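The parameter strings in Table 13 are PySpark MLlib constructor signatures, so instantiating a classifier is direct. A minimal sketch for two of them, using the non-default values listed above:

```python
# Sketch: two of the Table 13 classifiers instantiated in PySpark MLlib.
from pyspark.ml.classification import DecisionTreeClassifier, LinearSVC

dt = DecisionTreeClassifier(
    featuresCol="features",
    labelCol="label_bin",
    maxDepth=30,    # Table 13: deeper than the MLlib default of 5
    maxBins=100,    # Table 13: more bins than the default of 32
    impurity="gini",
)

svm = LinearSVC(  # PySpark's linear support vector classifier
    featuresCol="features",
    labelCol="label_bin",
    maxIter=100,
    regParam=0.0,
    tol=1e-06,
)

# model = dt.fit(train_df)
# predictions = model.transform(test_df)
```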
Table 14. Spark configuration parameters [31].
Configuration | Spark Property | Description
Driver Cores | spark.driver.cores | Number of cores used for the driver process in cluster mode.
Driver Memory | spark.driver.memory | Amount of memory used for the driver process, i.e., where the Spark Context is initialized.
Executor Cores | spark.executor.cores | Number of cores used on each executor.
Executor Memory | spark.executor.memory | Amount of memory used per executor process.
Executor Instances | spark.executor.instances, spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.maxExecutors | Initial number of executors run if dynamic allocation is enabled, with upper and lower bounds on the number of executors.
Shuffle Partitions | spark.sql.shuffle.partitions | Default number of partitions used when shuffling data for joins or aggregations.
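These properties can be supplied on the spark-submit command line or when the session is created. A sketch using one of the mid-range allocations from Tables 15 and 16 (12 executors with 8 cores and 10 GB each); the values are illustrative, not a recommendation:

```python
# Sketch: setting the Table 14 properties when building a SparkSession.
# Note: in client mode the spark.driver.* settings must be passed to
# spark-submit instead, since the driver JVM is already running here.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("zeek-conn-classification")
    .config("spark.driver.cores", "2")
    .config("spark.driver.memory", "10g")
    .config("spark.executor.instances", "12")
    .config("spark.executor.cores", "8")
    .config("spark.executor.memory", "10g")
    .config("spark.sql.shuffle.partitions", "24")
    .getOrCreate()
)
```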
Table 15. Effect of Spark’s configuration parameters on binning and training time.
Test ID | Executor Count | Executor Core Count | Total Executor Cores | Executor Memory (GB) | Total Executor Memory (GB) | Binning Time (seconds) | Training Time (seconds)
1 | 5 | 2 | 10 | 5 | 25 | 292.33 | 62.8
2 | 5 | 2 | 10 | 10 | 50 | 283.65 | 62.6
3 | 5 | 2 | 10 | 20 | 100 | 275.24 | 62.1
4 | 5 | 4 | 20 | 5 | 25 | 238.91 | 44.6
5 | 5 | 4 | 20 | 10 | 50 | 233.12 | 43.0
6 | 5 | 4 | 20 | 20 | 100 | 233.13 | 44.1
7 | 10 | 2 | 20 | 5 | 50 | 239.08 | 44.0
8 | 10 | 2 | 20 | 10 | 100 | 231.92 | 44.2
9 | 10 | 4 | 40 | 5 | 50 | 185.81 | 25.1
10 | 10 | 4 | 40 | 10 | 100 | 174.71 | 25.0
11 | 20 | 2 | 40 | 5 | 100 | 188.41 | 25.9
12 | 20 | 4 | 80 | 5 | 100 | 165.07 | 20.8
13 | 10 | 8 | 80 | 10 | 100 | 155.17 | 20.3
14 | 12 | 8 | 96 | 10 | 120 | 154.47 | 18.4
16 | 12 | 8 | 96 | 10 | 120 | 152.29 | 19.1
17 | 12 | 8 | 96 | 10 | 120 | 145.76 | 18.5
18 | 24 | 4 | 96 | 5 | 120 | 163.60 | 19.8
19 | 6 | 16 | 96 | 20 | 120 | 141.41 | 20.5
20 | 12 | 8 | 96 | 10 | 120 | 150.84 | 19.3
21 | 3 | 32 | 96 | 40 | 120 | 139.23 | 29.3
22 | 1 | 16 | 16 | 20 | 20 | 179.31 | 63.8
23 | 2 | 16 | 32 | 20 | 40 | 172.14 | 37.9
24 | 6 | 32 | 192 | 40 | 240 | 132.27 | 23.0
25 | 3 | 16 | 48 | 20 | 60 | 154.35 | 29.3
Table 16. Effect of additional Spark configuration parameters on total time.
Test ID | Executor Count | Cores per Executor | Memory per Executor (GB) | Total Exec. Cores | Total Executor Memory (GB) | Total Time (Seconds) | Shuffle Partitions | Driver Cores | Driver Memory (GB)
1 | 5 | 2 | 5 | 10 | 25 | 355.24 | 200 | 2 | 10
2 | 5 | 2 | 10 | 10 | 50 | 346.30 | 200 | 2 | 10
3 | 5 | 2 | 20 | 10 | 100 | 337.45 | 200 | 2 | 10
4 | 5 | 4 | 5 | 20 | 25 | 283.60 | 200 | 2 | 10
5 | 5 | 4 | 10 | 20 | 50 | 276.21 | 200 | 2 | 10
6 | 5 | 4 | 20 | 20 | 100 | 277.28 | 200 | 2 | 10
7 | 10 | 2 | 5 | 20 | 50 | 283.18 | 200 | 2 | 10
8 | 10 | 2 | 10 | 20 | 100 | 276.20 | 200 | 2 | 10
9 | 10 | 4 | 5 | 40 | 50 | 210.99 | 200 | 2 | 10
10 | 10 | 4 | 10 | 40 | 100 | 199.78 | 200 | 2 | 10
11 | 20 | 2 | 5 | 40 | 100 | 214.43 | 200 | 2 | 10
12 | 20 | 4 | 5 | 80 | 100 | 186.02 | 200 | 2 | 10
13 | 10 | 8 | 10 | 80 | 100 | 175.59 | 200 | 2 | 10
14 | 12 | 8 | 10 | 96 | 120 | 172.91 | 200 | 2 | 10
16 | 12 | 8 | 10 | 96 | 120 | 171.49 | 72 | 2 | 10
17 | 12 | 8 | 10 | 96 | 120 | 164.35 | 12 | 2 | 10
18 | 24 | 4 | 5 | 96 | 120 | 183.51 | 24 | 2 | 10
19 | 6 | 16 | 20 | 96 | 120 | 162.02 | 6 | 2 | 10
20 | 12 | 8 | 10 | 96 | 120 | 170.18 | 24 | 2 | 10
21 | 3 | 32 | 40 | 96 | 120 | 168.58 | 3 | 2 | 10
22 | 1 | 16 | 20 | 16 | 20 | 243.16 | 1 | 2 | 10
23 | 2 | 16 | 20 | 32 | 40 | 210.14 | 2 | 2 | 10
24 | 6 | 32 | 40 | 192 | 240 | 155.39 | 6 | 2 | 10
25 | 3 | 16 | 20 | 48 | 60 | 183.75 | 3 | 2 | 10
26 | 10 | 8 | 10 | 80 | 100 | 178.59 | 200 | 2 | 10
27 | 6 | 32 | 40 | 192 | 240 | 156.8 | 6 | 2 | 10
28 | 6 | 32 | 40 | 192 | 240 | 161.84 | 12 | 2 | 10
29 | 6 | 32 | 40 | 192 | 240 | 159.12 | 24 | 2 | 10
30 | 6 | 32 | 40 | 192 | 240 | 159.31 | 48 | 2 | 10
31 | 6 | 32 | 40 | 192 | 240 | 159.91 | 96 | 2 | 10
32 | 6 | 32 | 40 | 192 | 240 | 161.13 | 192 | 2 | 10
33 | 6 | 32 | 40 | 192 | 240 | 161.6 | 384 | 2 | 10
34 | 6 | 32 | 40 | 192 | 240 | 159.2 | 6 | 4 | 10
35 | 6 | 32 | 40 | 192 | 240 | 157.79 | 6 | 4 | 20
36 | 6 | 32 | 40 | 192 | 240 | 158.2 | 6 | 4 | 30
Table 17. Spark’s optimum configuration settings.
Config Setting | Value
Driver Cores | 2
Driver Memory | 10 g
Executor Instances | 6
Executor Cores | 16
Executor Memory | 20 g
Shuffle Partitions | 6
Table 18. Spark’s optimum resource settings and total allocation.
Resources | Total Allocation
Driver Cores | 2
Driver Memory | 10 g
Executor Total Cores | 96
Executor Total Memory | 120 g
Table 19. Reconnaissance: Performance of the various machine learning classifiers.
ML Algo. | Attr. | Accuracy | Precision | Recall | F-Measure | AUROC | FPR | Training (s) | Testing (s)
DT | 6 | 99.30% | 99.09% | 98.58% | 98.84% | 99.10% | 0.39% | 27.933 | 0.087
DT | 9 | 99.31% | 99.10% | 98.60% | 98.85% | 99.11% | 0.39% | 28.878 | 0.088
DT | 12 | 99.35% | 99.20% | 98.65% | 98.92% | 99.15% | 0.34% | 29.75 | 0.086
DT | 18 | 99.40% | 99.69% | 98.30% | 98.99% | 99.08% | 0.13% | 28.365 | 0.071
GBT | 6 | 99.26% | 99.39% | 99.56% | 99.48% | 99.07% | 1.42% | 80.639 | 0.077
GBT | 9 | 99.29% | 99.39% | 99.60% | 99.50% | 99.09% | 1.42% | 80.178 | 0.076
GBT | 12 | 99.30% | 99.38% | 99.62% | 99.50% | 99.08% | 1.46% | 79.599 | 0.075
GBT | 18 | 99.37% | 99.23% | 99.88% | 99.55% | 99.03% | 1.81% | 59.147 | 0.087
LR | 6 | 96.52% | 94.02% | 94.38% | 94.20% | 95.91% | 2.57% | 22.1 | 0.057
LR | 9 | 96.52% | 94.02% | 94.38% | 94.20% | 95.91% | 2.57% | 22.265 | 0.051
LR | 12 | 96.52% | 94.02% | 94.38% | 94.20% | 95.91% | 2.57% | 22.372 | 0.051
LR | 18 | 96.52% | 94.02% | 94.38% | 94.20% | 95.91% | 2.57% | 23.375 | 0.052
NB | 6 | 95.84% | 92.11% | 94.19% | 93.14% | 95.37% | 3.46% | 15.634 | 0.053
NB | 9 | 95.85% | 92.11% | 94.22% | 93.15% | 95.38% | 3.46% | 16.078 | 0.091
NB | 12 | 95.85% | 92.11% | 94.21% | 93.15% | 95.38% | 3.46% | 15.7 | 0.062
NB | 18 | 95.86% | 92.12% | 94.27% | 93.18% | 95.41% | 3.46% | 15.234 | 0.056
RF | 6 | 99.19% | 98.95% | 99.90% | 99.42% | 98.72% | 2.47% | 56.257 | 0.048
RF | 9 | 98.11% | 97.39% | 99.98% | 98.67% | 96.86% | 6.26% | 56.276 | 0.075
RF | 12 | 99.19% | 98.92% | 99.94% | 99.43% | 98.70% | 2.55% | 56.473 | 0.052
RF | 18 | 99.22% | 98.94% | 99.96% | 99.45% | 98.73% | 2.51% | 47.286 | 0.054
SVM | 6 | 70.01% | 0.00% | 0.00% | 0.00% | 50.00% | 0.00% | 39.053 | 0.031
SVM | 9 | 96.87% | 95.23% | 94.28% | 94.75% | 96.13% | 2.02% | 68.317 | 0.036
SVM | 12 | 97.36% | 97.08% | 94.02% | 95.53% | 96.41% | 1.21% | 64.397 | 0.036
SVM | 18 | 97.93% | 99.04% | 94.00% | 96.45% | 96.80% | 0.39% | 66.216 | 0.036
Table 20. Discovery: Performance of the various machine learning classifiers.
ML Algo. | Attr. | Accuracy | Precision | Recall | F-Measure | AUROC | FPR | Training (s) | Testing (s)
DT | 6 | 99.81% | 100.00% | 99.33% | 99.67% | 99.67% | 0.00% | 3.798 | 0.057
DT | 9 | 99.77% | 100.00% | 99.17% | 99.58% | 99.58% | 0.00% | 3.812 | 0.04
DT | 12 | 99.91% | 100.00% | 99.67% | 99.83% | 99.83% | 0.00% | 4.091 | 0.08
DT | 18 | 99.95% | 100.00% | 99.83% | 99.92% | 99.92% | 0.00% | 4.152 | 0.047
GBT | 6 | 99.86% | 99.80% | 100.00% | 99.90% | 99.76% | 0.49% | 20.359 | 0.062
GBT | 9 | 99.86% | 99.80% | 100.00% | 99.90% | 99.76% | 0.49% | 20.617 | 0.062
GBT | 12 | 99.91% | 99.87% | 100.00% | 99.93% | 99.84% | 0.32% | 20.802 | 0.071
GBT | 18 | 99.86% | 99.87% | 99.93% | 99.90% | 99.80% | 0.32% | 11.871 | 0.066
LR | 6 | 97.89% | 93.02% | 100.00% | 96.39% | 98.53% | 2.94% | 4.997 | 0.04
LR | 9 | 97.89% | 93.02% | 100.00% | 96.39% | 98.53% | 2.94% | 4.749 | 0.042
LR | 12 | 97.89% | 93.02% | 100.00% | 96.39% | 98.53% | 2.94% | 4.746 | 0.041
LR | 18 | 97.89% | 93.02% | 100.00% | 96.39% | 98.53% | 2.94% | 4.81 | 0.045
NB | 6 | 97.28% | 91.19% | 100.00% | 95.39% | 98.10% | 3.79% | 2.975 | 0.041
NB | 9 | 97.28% | 91.19% | 100.00% | 95.39% | 98.10% | 3.79% | 3.14 | 0.039
NB | 12 | 94.55% | 83.80% | 100.00% | 91.19% | 96.21% | 7.59% | 3.144 | 0.038
NB | 18 | 94.55% | 83.80% | 100.00% | 91.19% | 96.21% | 7.59% | 3.03 | 0.038
RF | 6 | 99.62% | 99.47% | 100.00% | 99.73% | 99.35% | 1.30% | 13.171 | 0.062
RF | 9 | 99.86% | 99.80% | 100.00% | 99.90% | 99.76% | 0.49% | 12.802 | 0.037
RF | 12 | 99.86% | 99.80% | 100.00% | 99.90% | 99.76% | 0.49% | 12.689 | 0.039
RF | 18 | 99.91% | 99.87% | 100.00% | 99.93% | 99.84% | 0.32% | 4.218 | 0.04
SVM | 6 | 97.09% | 91.51% | 98.83% | 95.03% | 97.62% | 3.60% | 21.834 | 0.03
SVM | 9 | 99.48% | 100.00% | 98.17% | 99.07% | 99.08% | 0.00% | 27.668 | 0.027
SVM | 12 | 65.29% | 31.85% | 20.33% | 24.82% | 51.63% | 17.07% | 27.61 | 0.03
SVM | 18 | 19.96% | 0.00% | 0.00% | 0.00% | 13.90% | 72.20% | 30.897 | 0.036
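All of the reported columns derive from the binary confusion matrix, except AUROC, which Spark’s evaluator supplies. A sketch of the computation, assuming the column names set in Table 13 ('label_bin', 'prediction', 'rawPrediction'); this is illustrative, not the authors’ evaluation script:

```python
# Sketch: deriving the Table 19/20 metrics from a model's predictions.
from pyspark.sql import functions as F
from pyspark.ml.evaluation import BinaryClassificationEvaluator

def binary_metrics(pred_df, label="label_bin"):
    # Confusion-matrix cells in a single aggregation pass.
    agg = pred_df.agg(
        F.sum(F.when((F.col(label) == 1) & (F.col("prediction") == 1), 1).otherwise(0)).alias("tp"),
        F.sum(F.when((F.col(label) == 0) & (F.col("prediction") == 1), 1).otherwise(0)).alias("fp"),
        F.sum(F.when((F.col(label) == 0) & (F.col("prediction") == 0), 1).otherwise(0)).alias("tn"),
        F.sum(F.when((F.col(label) == 1) & (F.col("prediction") == 0), 1).otherwise(0)).alias("fn"),
    ).first()
    tp, fp, tn, fn = agg["tp"], agg["fp"], agg["tn"], agg["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": (2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0),
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "auroc": BinaryClassificationEvaluator(
            labelCol=label,
            rawPredictionCol="rawPrediction",
            metricName="areaUnderROC",
        ).evaluate(pred_df),
    }
```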
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
