Extended Isolation Forest for Intrusion Detection in Zeek Data

Abstract: The novelty of this paper lies in determining and using hyperparameters to improve the Extended Isolation Forest (EIF) algorithm, a relatively new algorithm, to detect malicious activities in network traffic. The EIF algorithm is a variation of the Isolation Forest algorithm, known for its efficacy in detecting anomalies in high-dimensional data. Our research assesses the performance of the EIF model on a newly created dataset composed of Zeek Connection Logs, UWF-ZeekDataFall22. To handle the enormous volume of data involved in this research, the Hadoop Distributed File System (HDFS) is employed for efficient and fault-tolerant storage, and the Apache Spark framework, a powerful open-source Big Data analytics platform, is utilized for the machine learning (ML) tasks. The best results for the EIF algorithm came from extension level 0, with an accuracy of 82.3% for the Resource Development tactic, 82.21% for the Reconnaissance tactic, and 78.3% for the Discovery tactic.


Introduction
Over the past decade, the rapid growth of Internet of Things (IoT) devices has led to an exponential increase in network traffic. As the number of connected devices continues to rise across diverse sectors such as healthcare, agriculture, logistics, and more, the volume of data being transferred across networks is expected to surge. To address the mounting challenges posed by the escalating scale of IoT data and the prevalence of cyber threats, effective monitoring and detection of malicious activities have become critical.
This research leverages Zeek, an open-source network-monitoring tool renowned for its ability to provide comprehensive raw network data, to collect a modern, unique dataset, UWF-ZeekDataFall22 [1]. This dataset has been meticulously labeled using the MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) framework [2], a globally accessible knowledge base that characterizes adversary tactics and techniques used to achieve specific objectives.
Among the various adversary tactics, this work specifically focuses on detecting three critical tactics: Reconnaissance (TA0043) [3], Discovery (TA0007) [4], and Resource Development (TA0042) [5]. The Reconnaissance tactic gathers information on vulnerabilities that can be exploited in future attacks. The Discovery tactic aims to gain deeper insights into the internal network structure. The Resource Development tactic focuses on acquiring the various tools and methods used to mount attacks.
Zeek's Connection (Conn) Log files are instrumental in tracking and recording vital information about network connections, including IP addresses, durations, transferred bytes, states, packets, and tunnel information. By analyzing these Conn log files, we aim to identify connections that exhibit patterns associated with Resource Development, Reconnaissance, and Discovery tactics, which are indicative of potential cyber threats.
To handle the enormous volume of data involved in this research, the Hadoop Distributed File System (HDFS) [6] is employed for efficient and fault-tolerant storage. The Apache Spark framework, a powerful open-source Big Data analytics platform [7], is utilized for machine learning (ML) tasks. The novelty of this paper is in determining and using hyperparameters to improve the Extended Isolation Forest (EIF) algorithm, a relatively new algorithm, to detect malicious activities in network traffic. The EIF algorithm is a variation of the Isolation Forest algorithm, known for its efficacy in detecting anomalies in high-dimensional data. Our research assesses the performance of the EIF model on the UWF-ZeekDataFall22 dataset [1].
The rest of the paper is organized as follows. Section 2 presents the related works; Section 3 presents the background information, that is, a description of the anomaly as well as the anomaly score, and the Isolation Forest and Extended Isolation Forest algorithms; Section 4 describes the dataset, which is relatively new; Section 5 presents the methodology; Section 6 presents the results; Section 7 presents the conclusion; Section 8 presents the future work.

Related Works
Research on the Extended Isolation Forest (EIF) algorithm is relatively new but has shown promise in various domains. The Isolation Forest algorithm has demonstrated effectiveness in detecting anomalies in high-dimensional datasets. Liu et al. (2008) [8] proposed the Isolation Forest approach for intrusion detection, showing its advantages in handling large-scale data with high-dimensional features. Chen et al. (2019) [9] extended the Isolation Forest to address the challenge of detecting time series anomalies, achieving remarkable results in identifying abnormal patterns in temporal data. Sharma et al. (2022) [10] proposed an extension to the Isolation Forest algorithm in the form of the Extended Isolation Forest (EIF) for detecting advanced persistent threats (APTs) in enterprise networks. By leveraging the power of EIF, the authors achieved remarkable accuracy in identifying complex attack patterns in large-scale network data.
Li et al. (2017) [11] presented the Extended Isolation Forest (EIF) method as an enhancement to the traditional Isolation Forest algorithm. The EIF algorithm demonstrated robustness and scalability in detecting intrusions in computer networks, making it a promising candidate for real-time network security applications.
Fan et al. (2021) [12] proposed an improved version of the Isolation Forest algorithm coupled with the Self-Organizing Map (SOM) clustering technique for anomaly detection in network security. The study showcased the effectiveness of the enhanced Isolation Forest in accurately identifying network intrusions.
Zhou et al. (2019) [13] applied the Extended Isolation Forest algorithm to detect intrusions in industrial control systems. The study demonstrated the efficacy of EIF in detecting anomalous behavior and potential cyber threats in critical industrial networks. Thangaraj et al. (2020) [14] proposed an enhanced version of the Extended Isolation Forest tailored for intrusion detection in software-defined networks (SDNs). This study highlighted the effectiveness of the enhanced EIF model in accurately detecting intrusions in SDN environments.
Huang et al. (2019) [15] developed an efficient Random Forest Extended Isolation (RF-EIF) algorithm for anomaly detection. By combining the power of the Random Forest ensemble with the EIF method, the authors achieved improved accuracy in detecting anomalies in diverse network datasets.
Recent advancements have continued to push the boundaries of anomaly detection using variations of the Isolation Forest. Liu et al. (2024) [16] introduced the Layered Isolation Forest, a multi-level subspace algorithm designed to improve the original Isolation Forest's ability to handle local outliers and enhance anomaly detection performance. This method maintains the efficiency of the original algorithm while achieving superior performance metrics on both synthetic and real-world datasets.
Similarly, Wu et al. (2024) [17] demonstrated the application of the Extended Isolation Forest in the fault diagnosis of avionics equipment. Their study utilized a combination of feature selection and EIF to detect and categorize faults in electronic modules, highlighting the practical engineering value and effectiveness of EIF in real-world applications.

Gaps in the Literature
Despite these advancements, a specific gap remains in the current literature.
Optimization of Isolation Level: While many studies have demonstrated the efficacy of EIF in various contexts, there is a lack of comprehensive research focusing on the optimization of isolation levels within the algorithm. Isolation levels are crucial, as they directly influence the algorithm's ability to accurately detect anomalies. Our research addresses this gap by systematically exploring and determining the optimal isolation levels for the EIF algorithm, thus improving its performance in detecting malicious activities in network traffic. By addressing this gap, our research contributes to the enhancement of the EIF algorithm, making it a more robust tool for intrusion detection and anomaly detection in network traffic.

What Is an Anomaly
An anomaly or outlier refers to any data point or observation that notably differs from the rest of the data. Anomaly detection plays a crucial role and finds practical applications across different fields, such as identifying fraudulent bank transactions, detecting network intrusions, spotting sudden fluctuations in sales, and detecting changes in customer behavior, among others [18,19]. Numerous methods have been devised to identify anomalies in data. We focus on the implementation of Isolation Forests, an unsupervised anomaly detection technique [8].

Anomaly Score
The output of the Isolation Forest is a set of anomaly scores. When a point travels through a tree, the length of its path can be an indication of its uniqueness. If it goes deeper, the point is not that unique; if the path is shorter, that data point may be an anomaly [8]. In Figure 1, the red path may be anomalous, while the blue path would be a normal path.
When the point is run through multiple trees in the forest, the combined length of its path can give an anomaly score. If the score is closer to 1, the point is an anomaly, but if it is less than 0.5, it is a normal point [20]. On the other hand, if all points cluster around 0.5, that dataset may not have distinct anomalous points [20].
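As a concrete sketch of this scoring rule, following the formulas in Liu et al. [8] (the function names here are our own), the score is s(x, ψ) = 2^(−E[h(x)]/c(ψ)), where E[h(x)] is the average path length and c(ψ) is a normalization constant:

```python
import math

EULER_GAMMA = 0.5772156649015329

def c_factor(psi: int) -> float:
    # c(psi): average path length of an unsuccessful BST search over psi
    # points, used to normalize path lengths (Liu et al., 2008).
    if psi <= 1:
        return 0.0
    harmonic = math.log(psi - 1) + EULER_GAMMA  # approximation of H(psi - 1)
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi

def anomaly_score(avg_path_length: float, psi: int) -> float:
    # s(x, psi) = 2^(-E[h(x)] / c(psi)); near 1 => anomaly, below 0.5 => normal.
    return 2.0 ** (-avg_path_length / c_factor(psi))
```

An average path length equal to c(ψ) yields a score of exactly 0.5, the boundary between normal and anomalous behavior described above.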

Isolation Forest
Isolation Forests (IFs) are unsupervised models used to detect anomalies in data. They are similar to Random Forests and use Decision Trees to identify unusual data points. One advantage of IF is that it does not rely on building a profile for the data, making it computationally efficient [8].
However, IF has a bias in how the trees are branched, which can lead to uneven anomaly scores. This inconsistency can cause false positive results and suggest patterns that do not actually exist in the data [20].
In Figure 2a, we have a normally distributed 2D dataset. A data point close to (0, 0) should be nominal, and the anomaly score should increase radially away from this point. From the score map in Figure 2b, we see that there are rectangular regions along the x and y axes where anomaly scores are lower, and the score does not increase equally in a circular way as we would expect [20].
Figure 3 has a dataset with two clusters, and the score map creates ghost clusters alongside the real ones. Similarly, in a dataset with a sinusoidal structure (Figure 4), the score map completely fails to capture the hills and valleys in the data distribution.


Algorithm
The hyperparameters of the IF model are t, the number of trees, and ψ, the subsampling size. The algorithm is split into two stages. The first is the training stage, where the forest is created. The second is the evaluation stage, which puts a given point into each tree and provides the average path length of the point, as shown in Figure 5 [20].

Training Stage
As shown in Algorithm 1, the training stage performs sub-sampling and builds an ensemble of isolation trees. Each tree's height is limited by its ceiling, which is approximately the average height of a binary search tree (BST) for the size of the given data. The algorithm for training is separated into two functions. Recursion is used in Algorithm 2 for building the isolation trees. The output of the training stage is an Isolation Forest prepared for the scoring of each given point [20].
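The training stage described above can be sketched as follows. This is our own simplified illustration (list-of-lists data, dictionary tree nodes), not the paper's actual implementation:

```python
import math
import random

def build_itree(X, height, height_limit):
    # Leaf: the height ceiling is reached or the node cannot be split further.
    if height >= height_limit or len(X) <= 1:
        return {"size": len(X)}
    q = random.randrange(len(X[0]))                # random feature
    lo = min(x[q] for x in X)
    hi = max(x[q] for x in X)
    if lo == hi:                                   # feature is constant here
        return {"size": len(X)}
    p = random.uniform(lo, hi)                     # random split value
    return {
        "q": q,
        "p": p,
        "left": build_itree([x for x in X if x[q] < p], height + 1, height_limit),
        "right": build_itree([x for x in X if x[q] >= p], height + 1, height_limit),
    }

def build_forest(X, n_trees=100, psi=256):
    # Height ceiling ~ average height of a BST over psi points.
    height_limit = math.ceil(math.log2(max(psi, 2)))
    return [
        build_itree(random.sample(X, min(psi, len(X))), 0, height_limit)
        for _ in range(n_trees)
    ]
```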

Evaluation Stage
The output of the evaluation stage is the path length of a given point. The average path length over the Isolation Forest is computed and handed over to the anomaly score formula. Algorithm 3 is used to estimate the path length where IF is not able to isolate the points [20]. The complexity of the IF algorithm is the same for both stages: O(tψ log ψ) [20].
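A minimal sketch of the evaluation stage (our own illustration), assuming each internal node stores a split feature q and split value p, and each leaf stores its size; the c(size) adjustment estimates the remaining depth where IF could not isolate the points:

```python
import math

def c_factor(size: int) -> float:
    # Expected remaining path length for an unresolved group of `size` points.
    if size <= 1:
        return 0.0
    return 2.0 * (math.log(size - 1) + 0.5772156649) - 2.0 * (size - 1) / size

def path_length(x, node, depth=0):
    # Leaf: add c(size) to account for the subtree IF did not build.
    if "size" in node:
        return depth + c_factor(node["size"])
    branch = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, branch, depth + 1)

def average_path_length(x, forest):
    # E[h(x)], the value handed to the anomaly score formula.
    return sum(path_length(x, tree) for tree in forest) / len(forest)
```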

Extended Isolation Forest
To overcome the biases and limitations of the traditional Isolation Forest, the Extended Isolation Forest (EIF) algorithm has been developed. The EIF algorithm introduces modifications that allow for random slopes in the branch cuts, making the scoring more reliable and reducing the impact of artifacts [21].
The EIF algorithm is particularly significant in the context of network traffic analysis for several reasons:
1. Improved Detection of Complex Anomalies: EIF can better isolate outliers in high-dimensional data, which is common in network traffic. This is because the random slopes in the cuts allow the algorithm to adapt to complex structures in the data, making it more effective at identifying subtle anomalies that might be missed by traditional methods [21].
2. Reduction of False Positives: By mitigating the bias inherent in the branching process of traditional Isolation Forests, EIF reduces the occurrence of false positives. This is crucial in network traffic analysis, where high false positive rates can lead to unnecessary alerts and increased workload for security analysts [21].
3. Scalability and Efficiency: Like the traditional Isolation Forest, EIF is computationally efficient and scalable. This makes it suitable for real-time intrusion detection systems that need to process large volumes of network traffic quickly [21].
By addressing these challenges, EIF enhances the reliability and accuracy of anomaly detection in network traffic, providing a more robust tool for cybersecurity applications [21].

Branching in Extended Isolation Forest
There is no fundamental reason to keep the branch cuts parallel to the axes. So, instead of picking a feature and value at every branching point, the Extended Isolation Forest picks a random slope and intercept for the branch cut [21].
Suppose we have an N-dimensional dataset. For the random slope requirement, we can choose random numbers for each coordinate of a normal vector over the N-sphere. For the random intercept, we can pick a random number from a uniform distribution over the range of values present. So, the algorithm transforms into the following two tasks, shown in Figure 6 [21]:
Picking n⃗: draw a random number for each coordinate of n⃗ from a normal distribution N(0, 1).
Picking p⃗: draw from a uniform distribution over the range of values present at each branching point.
Once these two pieces of information are determined, the branching criterion for a given point x⃗ is (x⃗ − p⃗) · n⃗ ≤ 0. If the condition is satisfied, the data point x⃗ is passed to the left branch; otherwise, it moves down to the right branch [21].
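The two tasks can be sketched as follows. This is a simplified illustration of the branching step (not the authors' implementation), using the test (x⃗ − p⃗) · n⃗ ≤ 0 to send points to the left branch:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cut(X: np.ndarray):
    # Slope: one draw per coordinate of the normal vector n from N(0, 1).
    n = rng.standard_normal(X.shape[1])
    # Intercept: a point p drawn uniformly over the range of the data.
    p = rng.uniform(X.min(axis=0), X.max(axis=0))
    return n, p

def branch(X: np.ndarray, n: np.ndarray, p: np.ndarray):
    # (x - p) . n <= 0 sends a point to the left branch, otherwise right.
    mask = (X - p) @ n <= 0
    return X[mask], X[~mask]
```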

Extension Levels
The algorithm easily adapts to higher dimensions. In this scenario, the branch cuts are no longer straight lines; instead, they become (N − 1)-dimensional hyperplanes [21].
For an N-dimensional dataset, we can consider N levels of extension. As we increase the extension levels, the algorithm's bias in producing a non-uniform score map is reduced. The lowest level of extension in the Extended Isolation Forest coincides with the standard Isolation Forest [21].
Having multiple extension levels can be beneficial when the dynamic range of the data in different dimensions varies significantly. Reducing the extension level helps in selecting more appropriate split hyperplanes and reduces the computational overhead. For example, if we have three-dimensional data with a much smaller range in two dimensions compared to the third (essentially distributed along a line), using the standard Isolation Forest might yield the most optimal result [21].
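One way to realize a given extension level, following the description in [21], is to zero out randomly chosen coordinates of the normal vector. This is our own simplified sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

def constrained_normal(dim: int, extension_level: int) -> np.ndarray:
    # extension_level = dim - 1 gives the fully extended cut; level 0
    # leaves a single nonzero coordinate, i.e. the axis-parallel cut of
    # the standard Isolation Forest.
    n = rng.standard_normal(dim)
    n_zeroed = dim - extension_level - 1
    if n_zeroed > 0:
        idx = rng.choice(dim, size=n_zeroed, replace=False)
        n[idx] = 0.0
    return n
```

For the 18-attribute dataset used here, extension level 0 leaves one nonzero coordinate per cut, while level 17 leaves all 18 nonzero.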

Our Extended Isolation Forest Algorithm
The adjustments made to our Extended Isolation Forest algorithm are explained below.

Training Stage
The forest is created from trees, as shown in Algorithm 1. In Algorithm 2, the two lines that pick a random feature and a random value for that feature are updated with lines 4 and 5. In addition, the test condition is changed to reflect the inequality. Line 6 is a new addition that allows the extension level to change. With these changes, the algorithm can be used as either the standard Isolation Forest or as the Extended Isolation Forest with any desired extension level [21].

Evaluation Stage
In Algorithm 3, the changes are made accordingly. The normal and intercept points from each tree are used with the appropriate test condition to set off the recursion for figuring out the path length [21].
By using the Extended Isolation Forest, we can achieve more accurate anomaly detection and better interpret the results for complex data distributions. This makes it a valuable tool for various applications, including fraud detection, network intrusion detection, and more [21].

1. Reconnaissance (TA0043): This initial phase involves gathering information about potential targets through activities like OSINT, vulnerability scanning, and probing for weaknesses. Analyzing reconnaissance data provides early warning signs of cyber threats and helps fortify defenses.
2. Resource Development (TA0042): In this stage, attackers acquire the tools, techniques, and infrastructure required for the attack, such as custom malware and command-and-control (C2) infrastructure. Analyzing resource development data reveals the types of tools and methods used by attackers, aiding in identifying potential threat vectors.
3. Discovery (TA0007): After the initial compromise, attackers explore the target environment to understand its layout and locate sensitive information. Activities in this stage include system enumeration, scanning for network shares, and probing for vulnerable services. Analyzing discovery data detects unauthorized access attempts and lateral movement within the network.

Data Description
Table 1 shows the number of instances of each tactic available in the UWF-ZeekDataFall22 dataset.
This work only uses the data from the Resource Development, Reconnaissance, and Discovery tactics, since there are not enough data for the rest of the tactics. The "none" tactic indicates benign data. Since binary classification is being performed, datasets were created for each of the three tactics, and 70% benign data was combined with 30% tactic data. The following is the distribution of the datasets used for this experiment: Resource

Preprocessing
To effectively implement the Extended Isolation Forest, the dataset needed to be preprocessed. Since the dataset contains columns with different types of values, such as continuous, nominal, IP address, port number, etc., the first preprocessing step taken was binning, in line with [22]. After the attributes were binned, an information gain [23] technique was applied to rank the features according to their importance. Table 2 lists the features according to their importance score, received using the information gain calculations.
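A sketch of these two preprocessing steps (binning, then information-gain ranking); the column names below are hypothetical placeholders, not the actual UWF-ZeekDataFall22 schema:

```python
import numpy as np
import pandas as pd

def information_gain(feature: pd.Series, target: pd.Series) -> float:
    # IG(T, a) = H(T) - sum_v (|T_v| / |T|) * H(T_v)
    def entropy(s: pd.Series) -> float:
        p = s.value_counts(normalize=True)
        return float(-(p * np.log2(p)).sum())
    h_total = entropy(target)
    h_cond = sum(
        len(group) / len(target) * entropy(group)
        for _, group in target.groupby(feature)
    )
    return h_total - h_cond

# Hypothetical columns -- the real Zeek Conn Log schema differs.
df = pd.DataFrame({
    "duration": np.random.default_rng(1).exponential(1.0, 500),
    "label": np.random.default_rng(2).integers(0, 2, 500),
})
# Equal-frequency binning of a continuous attribute, in line with [22].
df["duration_bin"] = pd.qcut(df["duration"], q=10, labels=False, duplicates="drop")
ranking = {"duration_bin": information_gain(df["duration_bin"], df["label"])}
```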

Libraries
Python's sklearn library was used for the standard Isolation Forest implementation. GridSearchCV from sklearn was used for the parameter tuning.
There was no library for the Extended IF implementation, as it is a relatively new technique still in its experimental stage. We used a GitHub repository provided by the original authors [21] for obtaining the anomaly scores. However, this implementation lacked any further hyperparameter tuning.
We also experimented with another library, from H2O, for the EIF implementation. This version provides more control over some of the parameters.


Hyperparameter Tuning
A grid search was performed to find the best number of trees and subsample size for the standard IF implementation. The best values obtained were n_estimators = 100 and max_samples = 256. For the Extended IF implementation, the number of trees and sample sizes were varied manually, since there were no libraries and the original implementation lacked this functionality. The values tested were n_trees = [120, 200, 500, 700, 1000] and sample_size = [64, 128, 256, 512] for extension level 0. The best results came from n_trees = 1000 and sample_size = 256.
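The grid search over these values can be sketched as follows. Since sklearn's IsolationForest exposes no built-in supervised score, this illustrative version loops over a ParameterGrid and scores predictions against the labels; it is not necessarily the exact procedure used in the paper:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid

param_grid = {"n_estimators": [100, 200], "max_samples": [64, 256]}

def grid_search_if(X, y):
    # IsolationForest.predict() returns -1 for anomalies and 1 for inliers,
    # so each parameter combination is scored against the labels directly.
    best_params, best_f1 = None, -1.0
    for params in ParameterGrid(param_grid):
        clf = IsolationForest(random_state=0, **params).fit(X)
        pred = (clf.predict(X) == -1).astype(int)
        score = f1_score(y, pred, zero_division=0)
        if score > best_f1:
            best_params, best_f1 = params, score
    return best_params, best_f1
```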

Varying Extension Level
The Extended Isolation Forest allows for changing the level of extension. Since there are 18 attributes in the dataset, the level can be incremented up to 17. First, the algorithm was run with the extension level set to 0, which is the same as the standard Isolation Forest. Then, the algorithm was run again, incrementing the level by 1 each time, and the confusion matrices were generated for each run. So, for each tactic, the Isolation Forest algorithm was run once, and the Extended Isolation Forest was run 18 times (ext 0-17). After the successful completion of each level, we found that, in most cases, the results remained the same across the different extension levels.
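The sweep over extension levels can be sketched as follows; here score_fn is a placeholder for any EIF scorer (e.g. one built from the authors' repository), and one confusion matrix is produced per level:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def sweep_extension_levels(score_fn, X, y, n_features=18, threshold=0.5):
    # score_fn(X, ext) -> anomaly scores in [0, 1]; any EIF implementation
    # can be plugged in here.  One confusion matrix per extension level.
    results = {}
    for ext in range(n_features):          # extension levels 0 .. n_features-1
        scores = score_fn(X, ext)
        pred = (scores >= threshold).astype(int)
        results[ext] = confusion_matrix(y, pred, labels=[0, 1])
    return results
```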

Results
This section presents the performance metrics used to assess the results, followed by the results themselves.

Performance Metrics
The following performance metrics were used to assess the results of the Isolation Forest as well as the Extended Isolation Forest: accuracy, precision, recall, F1 score, and specificity.
However, it is important to note that accuracy might not be the best choice when dealing with imbalanced datasets where one class is significantly more prevalent than the other.In such cases, a high accuracy score can be misleading, as the model might be performing well on the majority class while performing poorly on the minority class.

Precision (Positive Predictive Value)
Precision is a metric that focuses on the accuracy of positive predictions made by the model. It quantifies the proportion of correctly predicted positive instances out of all instances that the model predicts as positive. The formula for precision involves dividing the number of true positives by the sum of true positives and false positives.

Precision = True Positives / (True Positives + False Positives)
Precision is particularly useful when the cost of false positives is high, meaning that false alarms are costly or undesirable. In medical diagnosis, for example, precision is crucial to minimize misdiagnoses.
Table 4 shows that the best scores for Reconnaissance were also attained by EIF-0, though the EIF-1 and EIF-2 results were close to those of EIF-0. The best results are highlighted in green. Table 5 shows that the best scores for Discovery were attained by EIF-0 and EIF-1, highlighted in green.
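All of the metrics listed above follow from the same four confusion-matrix counts; a minimal sketch:

```python
def metrics_from_confusion(tp: int, fp: int, tn: int, fn: int) -> dict:
    # Precision = TP / (TP + FP); the other metrics used in this paper
    # derive from the same confusion-matrix counts.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "specificity": specificity}
```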

Conclusions
Experimentation was carried out with all possible extension levels, that is, up to 17. The best results across all metrics came from extension level 0. This is in fact the standard Isolation Forest, with no slopes in the branch cuts in any dimension. However, the standard Isolation Forest library does not perform as well as this implementation: even though the algorithm at extension level 0 is the same, the Extended Isolation Forest implementation performs better than the standard one.
The reason behind the lower scores for the higher extension levels can be the varied ranges of the dimensions in the dataset. The features we have for the data are spread over various ranges of values, with no correlation to each other. Multiple levels of extension can be useful where the dynamic range of the data in various dimensions is very different. In such cases, reducing the extension level can help in the more appropriate selection of split hyperplanes and in reducing the computational overhead. As an extreme case, if we had three-dimensional data, but the range in two of the dimensions was much smaller compared to the third (essentially data distributed along a line), the standard Isolation Forest method would probably yield the most optimal result.

Future Work
This work presented an Extended Isolation Forest implementation for detecting anomalies in network data. The implementation builds on the original algorithm, but the results do not reflect what was expected from the higher extension levels. The next step will be to build an implementation of the EIF algorithm using other techniques described in the related works section and compare the results.

Figure 3
Figure 3 shows a dataset with two clusters, and the score map creates ghost clusters alongside the real ones. Similarly, in a dataset with a sinusoidal structure (Figure 4), the score map completely fails to capture the hills and valleys in the data distribution.

Figure 4.
Dataset with sinusoidal structure: (a) sinusoidal data points with Gaussian noise; (b) anomaly score map.

3.3.1. Algorithm
The hyperparameters of the IF model are as follows: t = number of trees; ψ = subsampling size. The algorithm is split into two stages. The first is the training stage, where the forest is created. The second is the evaluation stage, which puts a given point into each tree and provides the average path length of the point, as shown in Figure 5 [20].
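The two stages can be sketched in plain Python. This is a minimal single-machine illustration, not the Spark implementation used in this work; the helper names, the depth limit, and the hyperparameter values below are illustrative assumptions:

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n points,
    used to normalize scores and to adjust for unsplit leaf nodes."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(X, depth=0, max_depth=10):
    """Training stage: recursively split on a random attribute and value."""
    if depth >= max_depth or len(X) <= 1:
        return {"size": len(X)}
    q = random.randrange(len(X[0]))            # random split attribute
    lo = min(x[q] for x in X)
    hi = max(x[q] for x in X)
    if lo == hi:
        return {"size": len(X)}
    p = random.uniform(lo, hi)                 # random split value
    return {
        "q": q, "p": p,
        "l": build_tree([x for x in X if x[q] < p], depth + 1, max_depth),
        "r": build_tree([x for x in X if x[q] >= p], depth + 1, max_depth),
    }

def path_length(x, node, depth=0):
    """Evaluation stage: depth at which x lands in a leaf, plus c(leaf size)."""
    if "size" in node:
        return depth + c(node["size"])
    child = node["l"] if x[node["q"]] < node["p"] else node["r"]
    return path_length(x, child, depth + 1)

def anomaly_score(x, forest, psi):
    """s(x) = 2^(-E[h(x)] / c(psi)); values near 1 flag anomalies."""
    e_h = sum(path_length(x, tree) for tree in forest) / len(forest)
    return 2.0 ** (-e_h / c(psi))

# Train t trees on subsamples of size psi, then score an outlier vs. an inlier
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(256)]
t, psi = 100, 64
forest = [build_tree(random.sample(X, psi)) for _ in range(t)]
print(anomaly_score([8.0, 8.0], forest, psi) > anomaly_score([0.0, 0.0], forest, psi))  # True
```

A point far from the data is isolated after few splits, so its average path length is short and its score is high.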


Figure 6.
Branching implementation in Extended Isolation Forest. Picking a random slope, →n: draw a random number for each coordinate of →n from a normal distribution N(0, 1). Picking a random intercept, →p: draw from a uniform distribution over the range of values present at each branching point.
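A minimal NumPy sketch of this branching step, assuming the convention from the original EIF formulation that an extension level of e in d dimensions zeroes d − e − 1 coordinates of →n (so level 0 recovers the axis-parallel cuts of the standard Isolation Forest); the function name and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def random_cut(X, extension_level):
    """Draw one random hyperplane cut, following the EIF branching rule.

    n (slope): each coordinate drawn from a standard normal N(0, 1).
    p (intercept): drawn uniformly over the range of values at this node.
    For extension level e in d dimensions, d - e - 1 coordinates of n are
    zeroed, so level 0 recovers the axis-parallel cuts of standard IF.
    """
    d = X.shape[1]
    n = rng.normal(0.0, 1.0, size=d)
    zero_idx = rng.choice(d, size=d - extension_level - 1, replace=False)
    n[zero_idx] = 0.0
    p = rng.uniform(X.min(axis=0), X.max(axis=0))
    go_left = (X - p) @ n <= 0          # branching criterion for each point
    return n, p, go_left

X = rng.normal(size=(100, 3))
n0, _, _ = random_cut(X, extension_level=0)  # standard IF: one nonzero slope
n2, _, _ = random_cut(X, extension_level=2)  # fully extended in 3-D
print(np.count_nonzero(n0), np.count_nonzero(n2))  # 1 3
```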

Figure 7
Figure 7 presents the flowchart of the methodology used for this work.
6.1.1. Accuracy
Accuracy is a commonly used metric to assess the overall performance of a predictive model. It measures the proportion of correct predictions made by the model among all predictions. Accuracy takes into account both positive and negative classes and provides a comprehensive view of the model's correctness.

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

6.1.3. Recall (Sensitivity, True Positive Rate (TPR))
Recall, also known as sensitivity or the true positive rate, emphasizes the model's ability to correctly identify positive instances. It calculates the proportion of true positives out of all actual positive instances:

Recall = True Positives / (True Positives + False Negatives)
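These two formulas can be sketched as simple helper functions; the names and confusion-matrix counts below are illustrative, not taken from this paper's results:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN): the fraction of actual positives identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Illustrative confusion-matrix counts (not taken from the paper's results)
print(accuracy(tp=80, tn=90, fp=10, fn=20))  # 0.85
print(recall(tp=80, fn=20))  # 0.8
```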

Table 4 .
Results matrix for Reconnaissance.

Table 5 .
Results matrix for Discovery.