NLP Methods in Host-based Intrusion Detection Systems: A Systematic Review and Future Directions

A Host-based Intrusion Detection System (HIDS) is an effective last line of defense against cyber security attacks after perimeter defenses (e.g., Network-based Intrusion Detection Systems and firewalls) have failed or been bypassed. HIDS is widely adopted in industry: it ranks among the top two most used security tools in the Security Operation Centers (SOC) of organizations. Although effective and efficient HIDS is highly desirable for industrial organizations, increasingly complex attack patterns pose several challenges that degrade HIDS performance (e.g., a high false alert rate that creates alert fatigue for SOC staff). Since Natural Language Processing (NLP) methods are better suited to identifying complex attack patterns, an increasing number of HIDS leverage advances in NLP, which have shown effective and efficient performance in precisely detecting low-footprint and zero-day attacks and in predicting the next steps of attackers. This active research trend demands a synthesized and comprehensive body of knowledge on NLP-based HIDS. Thus, we conducted a systematic review of the literature on the end-to-end pipeline of NLP-based HIDS development. For this pipeline, we identify, taxonomically categorize, and systematically compare the state of the art in NLP methods used in HIDS, the attacks detected by these methods, and the datasets and evaluation metrics used to evaluate NLP-based HIDS. We highlight the relevant prevalent practices, considerations, advantages, and limitations to support HIDS developers. We also outline future research directions for NLP-based HIDS development.


INTRODUCTION
To protect an organization by minimizing cyber attacks and thwarting new threats, intrusion detection is the prevalent security practice, which is performed by a system or software called an Intrusion Detection System (IDS) [19]. An IDS monitors host (e.g., server) data such as system calls (SC) or network data such as network traffic for detecting malicious activity and generates alerts upon detection of any such suspicious activity. Since the occurrence of malicious activity increased 358% from 2019 to 2020 [30], IDS has become an integral system to be deployed by organizations to gain insights into the potential malicious activities occurring within their Information Technology (IT) Infrastructures.
Malicious breaches are mostly (70% in 2020 [92]) carried out by external actors (i.e., intruders) rather than internal actors (i.e., insiders) of an organization. These intruders can be individuals, organized groups, or even nation-state (i.e., state-sponsored) actors; breaches by nation-state actors are the most expensive ($4.43 million on average in 2020 [44]). Malicious breaches affect every industry (e.g., healthcare, financial services, and government organizations), costing $3.86 million per breach on average [44]. The impacts of such malicious activities are numerous, such as threatening the sustainability of organizations, damaging organizational reputation, and posing threats to national security. Hence, detecting such malicious activities using IDS is of utmost importance to ensure a rapid response that minimizes damage and eases recovery and risk mitigation.

PRELIMINARIES
In this section, we present an overview of HIDS and compare our SLR with the existing relevant literature reviews. Figure 1 shows the main modules and a general overview of HIDS. A HIDS typically monitors the events occurring in an organization's host infrastructure. First, the data collection module monitors and collects audit data, log files, or SC traces, which capture valuable information about the applications running in a system. System or application log files and audit data are usually available in textual format. Similarly, SC traces can be treated as text documents in which each SC is an individual word. Researchers therefore adopt NLP techniques for HIDS. This research trend motivated us to systematically review the HIDS studies that consider textual data sources and adopt NLP techniques.
For the data collection module we focus on the data sources (e.g., system log, audit data), dataset types (i.e., real, simulated), and their availability (i.e., private, public) (Section 4.3) used in the reviewed studies.
Next, the feature engineering module extracts discriminative features, which represent the normal or attack behavior of the input data. Thus, we present the NLP-based features, their types, feature extraction, and selection methods (Section 4.1.1) adopted to generate reliable features, which accurately represent the behavior of system activities.
The intrusion detection module applies the generated features to diverse detection techniques. Thus, we focus on the detection techniques (e.g., language modeling, semantic ontology), learning types (e.g., supervised, unsupervised), and the classifier structure (e.g., ensemble, sequential) adopted (Section 4.1.2) to investigate the host data for intrusions.
Signature-based or Misuse Detection (MD) uses a library of known attack signatures and flags system behaviors matching those signatures as intrusions. MD achieves low False Alarm Rates (FAR) but is unable to detect zero-day attacks [60]. In contrast, anomaly detection (AD), or behavior-based detection, builds a model of the normal behavior of system activities and flags deviations from normal patterns as anomalies; it can detect zero-day attacks but has a high FAR [60]. Hybrid methods combining MD and AD can yield systems with high Detection Rates (DR) for known attacks and low FAR for unknown attacks. We included all MD, AD, and hybrid HIDS in our review.
Next, the alert generation and evaluation module generates alerts based on the attack detection outcomes and shares them with the security experts of the organization. We focus on the attacks detected by the reviewed studies, along with their categories and impacts (Section 4.2). We also present the metrics (e.g., FAR, DR) used for the evaluation of HIDS in the reviewed literature (Section 4.4).
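To make the two metrics named above concrete, the following minimal sketch computes DR and FAR from confusion counts. It assumes the commonly used definitions DR = TP / (TP + FN) and FAR = FP / (FP + TN); individual reviewed studies may define them differently, and the function name and the counts are hypothetical.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return DR and FAR from confusion counts (assumed standard definitions)."""
    # DR: share of actual attacks that raised an alert.
    dr = tp / (tp + fn) if (tp + fn) else 0.0
    # FAR: share of normal events that raised an alert.
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return {"DR": dr, "FAR": far}


# Hypothetical counts: 100 attack traces and 100 normal traces.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
print(metrics)
```

With these made-up counts, 90 of 100 attacks are detected and 5 of 100 normal traces are falsely flagged, which shows the trade-off between DR and FAR described for MD and AD.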

Comparison to Existing Literature Reviews
IDS is a highly active research domain, and the existing literature has been reviewed from diverse aspects owing to the significance of intrusion detection in protecting an organization from cyber attacks. To ensure the unique and novel contribution of our SLR, we extensively analyzed the related reviews and compare them with our SLR as follows. Most existing reviews focus mainly on 'NIDS' even though their titles refer to the generalized terms 'IDS' or 'cyber-security', while our SLR focuses on 'HIDS'. Thus, these existing reviews [5,8,41,51,57,62,97] have only 1 to 2 papers, and other existing reviews on IDS [6,7,10,13,17,18,39,50,60,81] have 0 papers, in common with our SLR of 99 papers. Next, let us consider the reviews focusing on HIDS [15,18,48,63] (comparative analysis in Table 1). These reviews differ from our SLR in terms of objective, included papers, and results. None of these existing HIDS reviews focuses on NLP. The semantic and contextual analysis ability of NLP methods helps HIDS detect unknown and adversarial attacks with lower FAR and higher accuracy [S5, S40]. Hence, we focus on NLP to identify NLP-based features (e.g., word embedding), detection techniques (e.g., seq2seq language modeling), and future directions.
Table 1 (✓ represents yes, × represents no, - represents not available; the last value in each row is the number of papers in common with our SLR):
[48] × × × (13) no DL × × × × × 2
Liu et al. [63] × ✓ (-) × (20) no DL × × ✓ (6) ✓ (8) × 9
Bridges et al. [15] × ✓ (2) × (22) no DL × ✓ (4) ✓ (22) × × 10
Bukac et al. [18] × × × × × × × × × 0
A decade-old review [18] (2011) focused on standalone HIDS and did not consider hybrid or collaborative HIDS, while our SLR does not confine its scope to standalone HIDS. The authors discussed network traffic, process behavior, file integrity, and security of HIDS against tampering without focusing on any ML/DL/NLP approaches.
Another review [48] examines existing non-domain-specific anomaly detection techniques and how they can be adapted from other domains to the HIDS domain, whereas our SLR discusses signature, anomaly, and hybrid HIDS. That review does not focus on features, attacks, datasets, or evaluation metrics, which are discussed in our SLR. Review [15] categorized the host data sources (e.g., system logs, Windows registry) to discuss the existing literature from a data source perspective.
That review does not focus on the application of NLP techniques, feature extraction methods (e.g., manual, automated), DL approaches, classifier structure, or metrics, whereas our SLR analyses all these HIDS aspects. Most of the papers it included were published before 2010; it also prioritized covering the use of host data sources, including NIDS.
In contrast, we conduct our SLR over the 2010-2021 time range, as the adoption of NLP techniques in HIDS research gained momentum in this decade [S40, S43, S56]. Review [63] focused on SC-based HIDS and their application to embedded systems; in contrast, we exclude (Section 3.2) HIDS for any specific application area (e.g., embedded systems, IoT).
In summary, none of the above-mentioned existing HIDS reviews focuses on NLP application, feature extraction methods, DL-based approaches, or classifier structure, as shown in Table 1. We identified, categorized, and analyzed features (4 types), detection techniques (74 techniques), attacks (125 attacks in 12 categories), datasets (36), and evaluation metrics (22) used in NLP-based HIDS. Our SLR presents a comprehensive overview, with the generalization and concreteness of our findings supported by 74 detection techniques, compared to the existing reviews, which include 13 [48], 20 [63], and 22 [15] detection techniques. Notably, none of them covered DL-based techniques (e.g., language modeling, word embedding), while we identified 22 studies in recent years using DL approaches with significantly better performance for real-life applications handling a huge volume of data. Considering the included papers, our SLR notably differs from these existing HIDS reviews, which have only 2 [48], 9 [63], 10 [15], and 0 [18] papers in common with it.

RESEARCH METHODOLOGY
To gain insight into NLP-based HIDS, we conducted a Systematic Literature Review (SLR). SLR is a widely adopted research approach in Evidence-Based Software Engineering (EBSE) [54], as it evaluates and interprets a research topic using a reliable, rigorous, and auditable methodology [53]. We followed the SLR guideline provided by Kitchenham et al. [53]. To guide our analysis, we aimed to answer four research questions (RQs), shown in Table 2 with their corresponding motivations. Our review protocol steps are presented in the following subsections (3.1-3.4).
Table 2. Research questions and their motivations.
• RQ1.1 (feature engineering techniques, Section 4.1.1). Motivation: To present the features and methods employed in the feature engineering step of HIDS and to clarify the role of NLP in this step.
• RQ1.2: What detection techniques have been used by the NLP-based HIDS? Motivation: To analyse the adopted detection techniques, type of learning, and classifier structure used to model the HIDS and present the prevalent practices to help practitioners adopt NLP-based HIDS.
• RQ2: What type of attacks are detected using NLP techniques in HIDS? Motivation: To identify what type of attacks can be detected by NLP-based HIDS.
• RQ3: What are the datasets/data sources used in HIDS research? Motivation: To provide insight to practitioners and researchers on the HIDS datasets, including types of data source, data generation method, and availability.
• RQ4: What are the evaluation metrics used in HIDS literature? Motivation: To identify the evaluation metrics used to evaluate the NLP-based HIDS.

Search Strategy
We formulated our search strategy to retrieve the maximum number of relevant studies based on the guideline provided by Kitchenham et al. [53]. The search strategy includes the following steps:
3.1.1 Search Method. We used the automated database search method [53] to retrieve the relevant studies from digital search engines and databases. We used Scopus, the largest academic literature database, which indexes over 5,000 publishers worldwide, including the relevant sources (e.g., Elsevier, Springer) [28,59]. We complemented Scopus with IEEE Xplore and ACM Digital Library, which are the most frequently used academic digital libraries [61].
We further complemented the automated search by extracting more relevant studies using snowballing [95].
3.1.2 Search String. We used the guidelines of study [53] to develop a comprehensive search string. Considering the key terms "host intrusion detection" and "NLP", we created several pilot search strings composed of synonyms and related terms. For the first term, we considered its varied representations (e.g., host based intrusion detection, host IDS) along with the terms related to both anomaly and misuse detection. We excluded the term 'HIDS' as it is also used to represent 'high-dimensional and sparse (HiDS)', which returned irrelevant papers. Regarding the second term, we noticed that including 'NLP' is not useful, as papers on HIDS do not usually use this term even though they apply a wide variety of NLP techniques (e.g., n-grams, word embeddings, language modeling). After executing a series of pilot searches on the titles, abstracts, and keywords of the papers in the databases and checking that the papers known to us were included, we designed the following finalized search string:
"host intrusion detection" OR "host based intrusion detection" OR "host anomaly detection" OR "host based anomaly detection" OR "host based anomaly intrusion detection" OR "host based ids" OR "host ids" OR "host based misuse intrusion detection" OR ("signature based intrusion detection" AND "host")
Table 3 shows the inclusion-exclusion criteria in line with our SLR aim and RQs. These are used for selecting the relevant studies and excluding out-of-scope papers retrieved from the data sources. We developed quality assessment criteria adopted and adjusted from a few studies [37,83]. Table 4 presents the assessment questions (AQs). We graded the reviewed studies on each quality assessment criterion using a three-tier ("Yes"=1, "Partially"=0.5, or "No"=0) scale and present the percentage of papers in each tier in Table 4. The assessment score of a paper is calculated by adding the scores of the answers to the 6 AQs.
To assure the reliability of our review's findings, we only include papers of acceptable quality, that is, those with an assessment score more than 3.00 (50% of the perfect score).
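The scoring rule described above can be sketched as follows. This is only an illustration of the arithmetic (three-tier scale, 6 AQs, inclusion when the score exceeds 3.00); the function names and the example answers are hypothetical and do not correspond to any reviewed paper.

```python
# Three-tier scale from the quality assessment description.
SCALE = {"Yes": 1.0, "Partially": 0.5, "No": 0.0}
NUM_AQS = 6        # number of assessment questions
THRESHOLD = 3.0    # 50% of the perfect score of 6


def assessment_score(answers):
    """Sum the per-question scores for one paper."""
    if len(answers) != NUM_AQS:
        raise ValueError(f"expected {NUM_AQS} answers, got {len(answers)}")
    return sum(SCALE[answer] for answer in answers)


def is_acceptable(answers):
    """A paper is kept only if its score is strictly greater than the threshold."""
    return assessment_score(answers) > THRESHOLD


# Hypothetical answer sheets for two papers.
paper_a = ["Yes", "Yes", "Partially", "Partially", "No", "Yes"]   # score 4.0
paper_b = ["Yes", "Yes", "Yes", "No", "No", "No"]                 # score 3.0
print(assessment_score(paper_a), is_acceptable(paper_a))
print(assessment_score(paper_b), is_acceptable(paper_b))
```

Note that a paper scoring exactly 3.00 is excluded under this reading, because the criterion requires a score of more than 3.00.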

Selection of the Primary Studies
Fig 2 (a) shows the number of studies retrieved in each of the following phases of the study selection process. We performed Database Search, Duplication Removal, Title-Abstract-based Selection, Full-text-based Selection, and Quality Assessment against the inclusion-exclusion criteria (Table 3) and quality assessment criteria (Table 4). Further, snowballing is recommended because a search string may not retrieve obscurely phrased studies, and the selected digital libraries may not exhaustively include all peer-reviewed papers [95]. We used forward and backward snowballing following the guideline of study [95] by scanning the citations and references of the selected papers, respectively.
For snowballing, we followed the same selection process, including title-abstract-based selection, full-text-based selection, and quality assessment. In total, 99 papers were selected for our SLR, as listed in Appendix 8.1, each with a unique identifier (S#).
Data Synthesis: We analyzed the context data items (i.e., D1-D5) using descriptive statistics as shown in Section 3.5.
We analyzed the RQ-relevant data items using thematic analysis, following the guidelines of study [14], with the following steps:
• Familiarizing with data: reading and examining our extracted data.
• Generating initial codes: capturing features, detection techniques, attacks, datasets, and evaluation metrics for NLP-based HIDS.
• Searching for themes: generating potential themes for each data item by merging the corresponding initial codes based on their similarities.
• Reviewing and mapping themes: iteratively reviewing all the codes and themes to revise their allocations if needed. We aggregated the feature and detection technique themes to categorize the NLP-based HIDS solutions. We also mapped the themes of a data item to the themes of other data items; for example, we mapped the datasets to NLP-based HIDS solutions and attack categories.
To finalize the RQs' answers, the synthesized results for each RQ were reviewed, and any disagreements were discussed by the authors in daily Slack channel discussions and weekly meetings.

Studies Distribution
Demographic information of the papers (e.g., types, venue) is considered helpful for novice researchers [82]. Fig. 2 (b) presents the consistent upward trend in the number of papers on NLP-based HIDS in this decade, driven by the ever-growing threat landscape. Almost half of the total papers (48/99) and 61.22% of the journal papers (30/49) were published between 2018 and 2021, indicating the rapidly growing research attention and maturity in adopting NLP in HIDS.
The reviewed papers were primarily published in the following research areas: Cyber-security (19), Network Communications (11), Data Science and AI (11), and Software Engineering (5). The diversity of publication areas shows the interest of researchers with different research backgrounds in HIDS research.

RESULTS
This section presents the results of analyzing and synthesizing the extracted data to answer our RQs. To recommend the prevalent practices, we reflect on the findings of our RQs as shown in Fig. 3, which presents the taxonomy of the literature on NLP-based HIDS. For conciseness, we provide the mapping of studies to the corresponding categorization of the RQs in our online appendix [4]. NLP plays a key role in both main components of HIDS (i.e., feature engineering and detection engine). To answer RQ1 (NLP techniques), we analysed and categorized the NLP-based feature engineering (RQ 1.1) and intrusion detection techniques (RQ 1.2). To give a comprehensive and distinct categorization of the NLP-based HIDS solutions, we combined the two main NLP-based components of a HIDS, i.e., the feature engineering and detection techniques adopted by the reviewed studies. Each RQ (i.e., RQ1: NLP techniques, RQ2: attacks, RQ3: datasets, and RQ4: evaluation metrics), with its corresponding categorizations and findings, is discussed in the following subsections, respectively. Table 13 in Appendix 8.2 lists the acronyms frequently used in this article.

RQ1: NLP Techniques
This section discusses different NLP techniques adopted by the reviewed HIDS to perform feature engineering (RQ 1.1) and intrusion detection (RQ 1.2) for host-based intrusions. Feature engineering module in HIDS is widely assisted with various NLP techniques that effectively extract meaningful representations of the input data to be processed by the detection techniques. Similarly, the use of NLP techniques (e.g., language modeling, text classification, semantic ontology) is prevalent to detect intrusions in the intrusion detection module. To highlight the effective HIDS methods, we only cover the NLP techniques proposed by the reviewed studies and disregard the baseline approaches (i.e., methods used to compare and prove the effectiveness of the proposed method). We present the categorization of the NLP features and detection techniques and aggregate them to provide a comprehensive and novel categorization of the NLP-based HIDS solutions. These categorizations are based on analysing the extracted data from the reviewed papers with our minimal interpretations.
4.1.1 RQ 1.1: Feature Engineering Techniques. To answer RQ 1.1, we discuss the feature types, adopted NLP features, feature extraction techniques, and feature selection methods used by the reviewed studies. The reviewed studies use content-based and statistical features. Content-based features further include contextual and attribute-based features. This categorization is adapted and modified from Sabir et al. [80]. Table 5 presents the feature types with their corresponding strengths and limitations. The distribution of the feature types is shown in Fig 4 (b), where the reviewed studies used either a single feature type or combined multiple feature types to prepare the input to a HIDS model. We refer to combined feature types as hybrid features. The study mapping with the feature types is available in our online appendix [4] A.2. Feature extraction techniques used in these studies are manual, automated, or semi-automated (both), each with its own advantages and disadvantages, which are presented in Table 5.
i. Contextual Features are the most commonly used features in our review and are typically extracted from SC traces using an automated process. Contextual features denote the hidden context, semantics, or sequential dependencies of data. In our review, 49 studies used only contextual features, and 17 studies combined contextual features with other feature types to obtain feature vectors. Contextual features preserve the sequential ordering information of the SC within the trace, which provides important insight into process behavior and significantly improves the ability of HIDS to accurately classify process behavior as normal or malicious.
The n-gram representation preserves the sequence information of the SC, and the value of N can be chosen based on the context [S78]. Thus, the n-gram/sliding window of SC is the most frequently used feature for the analysis of SC, as shown in Fig 4 (c). Sliding windows or n-grams were used to generate features (SC subsequences) [..., S25, S26, S39, S88], to predict the next SC sequence or model the SC semantics in order to analyze sequences at the sentence level with a seq-to-seq language model [S5, S63, S64], and in deep learning models [S11, S58, S93]. Two studies [S19, S77] used an n-gram language model, which can extract features from SC traces more flexibly than the semantic feature extraction (i.e., phrases) adopted in study [S40]. A larger n-gram size is preferred since it preserves more SC sequence information to resist mimicry attacks [S62]. Although the sliding window/n-gram preserves the sequential information of SC traces, the representation vectors are long, and as N increases the model requires more storage space and processing time [S2]. As an alternative, a few studies [S19, S44, S60, S97] used only the most frequent n-grams to reduce the feature space, at the expense of losing some relevant information. A few studies [S8, S10] generated contextual features with convolutional filters acting as sliding windows of different sizes, where the input was given as a one-hot encoding, which produces sparse and high-dimensional vectors. This encourages researchers to use a neural network language model (i.e., word embedding), which follows the sequential representation method to capture semantic relations and contextual information from SC with reduced-size vectors and has better generalization ability than n-grams [S78, S82]. A few studies [S15, S67, S73, S78, S82] generated contextual features using Keras embedding.
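As a minimal sketch of the sliding window/n-gram extraction discussed above, the snippet below slides a window of size n over a single SC trace and counts the resulting subsequences. The trace contents and the function name are hypothetical, and the reviewed studies differ in how they encode and vectorize the n-grams.

```python
from collections import Counter


def sc_ngrams(trace, n):
    """Slide a window of size n over an SC trace and return the n-grams in order."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


# Hypothetical SC trace of one process.
trace = ["open", "read", "mmap", "read", "mmap", "close"]

trigrams = sc_ngrams(trace, 3)
trigram_counts = Counter(trigrams)   # frequency of each SC subsequence

for gram, count in trigram_counts.items():
    print(gram, count)

# Upper bound on the feature space for a vocabulary of size V is V ** n,
# which illustrates why larger n costs more storage and processing time.
vocab_size = len(set(trace))
print(vocab_size ** 3)
```

A trace shorter than n yields no n-grams, so very short traces need separate handling (e.g., padding), which this sketch does not attempt.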
Also, semantics-preserving Word2vec, GloVe, and fastText embeddings were used by a few studies [S68, S74], where the fastText model can capture additional information about contextual relationships between SC by considering subword information.
Contextual features preserving sequential order, e.g., sequence of system manipulation [S21], event stream [S47], and SC sequences are efficiently used for detecting anomalies. Utilizing the SC sequential information, a few studies [S28, S29, S79, S95] modeled the SC using HMM and state transition models [S24, S45, S50] holding the context-sensitive transitions of SC for detecting intrusions.
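To illustrate how context-sensitive SC transitions can be modeled, the sketch below fits a first-order Markov chain over observed SC on normal traces and scores a trace by its average transition log-probability. This is a deliberate simplification: it is not an HMM (there are no hidden states) and it is not the method of any specific reviewed study. The traces, the vocabulary size, and the add-alpha smoothing are assumptions made for the example.

```python
import math
from collections import defaultdict


def fit_transitions(normal_traces):
    """Count SC-to-SC transitions observed in normal traces."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in normal_traces:
        for prev, nxt in zip(trace, trace[1:]):
            counts[prev][nxt] += 1
    return counts


def avg_log_likelihood(trace, counts, vocab_size, alpha=1.0):
    """Average per-transition log-probability with add-alpha smoothing."""
    total, steps = 0.0, 0
    for prev, nxt in zip(trace, trace[1:]):
        row = counts.get(prev, {})
        row_total = sum(row.values())
        prob = (row.get(nxt, 0) + alpha) / (row_total + alpha * vocab_size)
        total += math.log(prob)
        steps += 1
    return total / steps if steps else 0.0


# Hypothetical normal training traces and an assumed vocabulary of 5 SC.
normal = [["open", "read", "write", "close"],
          ["open", "read", "read", "close"]]
model = fit_transitions(normal)

seen_score = avg_log_likelihood(["open", "read", "write", "close"], model, vocab_size=5)
odd_score = avg_log_likelihood(["open", "execve", "write", "open"], model, vocab_size=5)
print(seen_score, odd_score)   # the unfamiliar trace receives the lower score
```

A detector built on this idea would flag traces whose score falls below a threshold chosen on held-out normal data; choosing that threshold is outside the scope of this sketch.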
iii. Attribute Features include the textual attributes of different data sources (e.g., audit log file, web queries), which were extracted manually by 15 studies and used with other feature types (i.e., statistical and contextual). Seven studies [S23, S31, S51, S56, S89, S90, and S92] extracted diverse SC arguments and other related attributes (e.g., path name or file name, return values, and file modes) to detect different attacks; these attributes are a promising resource for identifying the anomalous behavior of a process. Similarly, qualitative attributes of file system directories [S30, S48, S99], web query attributes [S53], and web server log file attributes [S18] were extracted to detect intrusions in file systems and web logs.
ii. Statistical Features rely on the distribution of data and are calculated using diverse computations. Learning a model based on statistical features is faster because of their lower computational complexity compared to contextual features.
Only statistical features were used by 22 studies and 17 studies combined statistical features with other feature types.
However, models based on statistical features normally suffer from low detection accuracy due to the loss of valuable sequential information [S41]. One study [S7] performed different statistical computations (e.g., minimum, maximum, standard deviation, variance, most frequent SC, second most frequent SC, median, skewness, harmonic mean, kurtosis) to extract useful features from SC traces. Bag-of-words (BOW) was used in one study [S80]. A few studies [S14, S32, S33] used the min, max, least, and most repeated data sequences to detect low- and high-footprint anomalies in Linux environments.
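The kind of statistical summary listed above can be sketched with the Python standard library. The example assumes that each SC is encoded as a positive integer identifier (a hypothetical encoding) and uses population moments; it does not reproduce the exact computations of study [S7].

```python
import statistics
from collections import Counter


def statistical_features(trace):
    """Summarize a trace of positive integer SC identifiers (assumed encoding)."""
    n = len(trace)
    mean = statistics.fmean(trace)
    std = statistics.pstdev(trace)  # population standard deviation
    if std > 0:
        skewness = sum((x - mean) ** 3 for x in trace) / (n * std ** 3)
        kurtosis = sum((x - mean) ** 4 for x in trace) / (n * std ** 4)  # non-excess
    else:
        # A constant trace has no spread; both moments are reported as 0 here.
        skewness = kurtosis = 0.0
    ranked = [sc for sc, _ in Counter(trace).most_common(2)]
    return {
        "min": min(trace),
        "max": max(trace),
        "median": statistics.median(trace),
        "std": std,
        "variance": statistics.pvariance(trace),
        "harmonic_mean": statistics.harmonic_mean(trace),
        "skewness": skewness,
        "kurtosis": kurtosis,
        "most_frequent": ranked[0],
        "second_frequent": ranked[1] if len(ranked) > 1 else None,
    }


# Hypothetical trace of SC identifiers.
features = statistical_features([3, 3, 3, 5, 5, 11, 4, 6])
print(features)
```

The resulting vector has a fixed length regardless of trace length, which is why such features are cheap to learn from, but the order of the SC is discarded, in line with the accuracy limitation noted above.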
Term frequency-inverse document frequency (TF-IDF) or TF vectorized features [S3, S6] are the second most frequently used features for the analysis of SC, as shown in Fig 4 (c). Term frequency (TF), i.e., the frequency of each SC in a trace, was used as a feature in a few studies [S36, S37, S41, S42]; it does not capture the sequence information of the SC [S78].
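As a small worked example of TF and TF-IDF over SC traces, the sketch below treats each trace as a document and each SC as a term. It assumes the plain variant tf = count / trace length and idf = log(N / df); libraries and the reviewed studies may use smoothed or normalized variants, and the traces are hypothetical.

```python
import math
from collections import Counter


def tf_idf(traces):
    """Return one {SC: weight} dictionary per non-empty trace."""
    n_docs = len(traces)
    doc_freq = Counter()
    for trace in traces:
        doc_freq.update(set(trace))        # count each SC once per trace
    vectors = []
    for trace in traces:
        term_freq = Counter(trace)
        vectors.append({
            sc: (count / len(trace)) * math.log(n_docs / doc_freq[sc])
            for sc, count in term_freq.items()
        })
    return vectors


# Hypothetical traces; "open" and "close" occur in every trace.
traces = [["open", "read", "read", "close"],
          ["open", "write", "close"],
          ["open", "execve", "close"]]
vectors = tf_idf(traces)
print(vectors[0])
```

SC that occur in every trace (here "open" and "close") receive a weight of zero, while the rarer "read" keeps a positive weight; as noted above, neither TF nor TF-IDF retains the order of the SC.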
Feature selection methods: Feature selection mitigates the weakness of a high-dimensional feature space, yet only 22/99 studies explicitly mentioned the feature selection method they adopted. The most common choice (7/22 studies) [S19, S41, S44, S60, S62, S65, S71] was frequency-based feature selection, which requires a manual threshold, such as considering only n-grams with a frequency greater than a pre-defined threshold. One study [S1] used a filter approach based on statistics of n-gram frequency sequences.
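Frequency-based feature selection with a manual threshold can be sketched as follows. The threshold value, the traces, and the function name are hypothetical; the sketch keeps only n-grams whose corpus frequency is strictly greater than the threshold.

```python
from collections import Counter


def select_frequent_ngrams(traces, n, threshold):
    """Keep the n-grams whose total frequency exceeds a manually set threshold."""
    counts = Counter()
    for trace in traces:
        counts.update(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))
    return {gram for gram, count in counts.items() if count > threshold}


# Hypothetical SC traces.
traces = [["open", "read", "close"],
          ["open", "read", "write", "close"],
          ["open", "read", "close"]]

selected = select_frequent_ngrams(traces, n=2, threshold=1)
print(sorted(selected))
```

In this toy corpus the bigrams ("read", "write") and ("write", "close") occur only once and are dropped, which shrinks the feature space but also discards information that may be relevant for rare behaviors.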
Apart from these classifier-independent feature selection methods, classifier-specific methods were also used in the reviewed studies. For classifier-specific methods, the studies used rough set theory [S59] and two variants of rough set feature reduction (crisp and fuzzy), Bayesian theory [S34], and Random Forest (RF) [S75] to remove redundant and irrelevant features. For faster reduction of feature vector dimensionality, Principal Component Analysis (PCA) was used in a few studies [S36, S52]. In contrast, the truncated singular value decomposition (SVD) method was used in one study [S6], as it is computationally efficient compared to PCA and conventional SVD. Feature selection using unsupervised learning-based clustering was applied in several studies [S23, S31, S42, S48, S99] to find the hidden structure in HIDS data such as SC arguments, contextual information, and domain-level knowledge, which can provide better representation and reliability of the data [24]. A newer (i.e., 2017) NLP-based approach is the attention mechanism, which automatically learns relevant features and emphasizes certain factors in DL techniques; it was used by several of the reviewed studies [S64, S65, S93, S94] for effective intrusion detection.

[Fig 5 labels: Sequence-based (12), Ontology-based (4), Model/Language-based (11), Ensemble (12), Rule-based (27), Decision Tree/C4.5/C5 (8), PART (1), RIPPER (2), OneR (1), ZeroR (1), Rough Set Classification (RSC) (2), Random Forest (RF) (6), Isolation Forest (IF) (3), ExtraTree(s)Classifier (1), AdaBoost (2), Bagging Classifier (1), GradientBoostingClassifier (1), XGBoost (3), Instance-based]
... sources such as textual logs or SC traces that can be regarded as natural text. For example, intrusion detection using SC traces, which contain the sequences of SC made by applications, resembles the domain of text classification [S60]. HIDS adopts methods proven successful in NLP tasks for intrusion detection. We have categorized the detection techniques into 3 main categories: Traditional Machine Learning (ML), Deep Learning (DL), and Rule-based, including 9, 3, and 3 sub-categories, respectively. This categorization is shown in Fig 5, and the detailed study mapping with the subcategories is available in our online appendix [4] A.3. Table 6 highlights the advantages and disadvantages of these detection categories. Since different types of HIDS used diverse learning types, classifier structures, and attack detection classes while adopting these detection techniques, we discuss these aspects as follows:
Learning type and HIDS type: To cover each type of HIDS, we considered all the HIDS papers using NLP techniques regardless of the HIDS type (i.e., misuse, anomaly, and hybrid) (discussed in Section 2.1). Different learning types (i.e., supervised, unsupervised, semi-supervised, hybrid) are adopted for detection in the reviewed studies. The description of each learning type with its corresponding strengths and limitations is presented in Table 6. Fig 6.a shows the mapping distribution of the detection sub-categories with corresponding learning types and HIDS types. The choice of a learning type depends heavily on the availability of labeled data.
An unsupervised approach finds patterns/partitions in unlabeled data, a semi-supervised approach requires a portion of the data to be labeled, and supervised learning requires completely labeled data. In the HIDS area, normal data is highly available (generated by the normal execution of programs within a system) but malicious data is insufficient (it requires simulating the increasing number of attack types). Thus, semi-supervised anomaly detection (used by 52.5% of the studies), which trains with only normal samples and flags deviations from the learned model as anomalous, is prevalent in the reviewed studies, compared to supervised (32.3%) and unsupervised learning (8.1%). However, semi-supervised methods lead to high FAR as they classify unseen normal behavior as attacks, so there is a need for sophisticated methods to mitigate false alarms.
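The semi-supervised setting described above, and the false alarm problem it creates, can be illustrated with a toy normal-profile detector. The sketch stores the set of SC n-grams seen in normal training traces and scores a new trace by the fraction of its n-grams that are unseen. This lookup-based detector is chosen for brevity and is an assumption of the example, not the method of a particular reviewed study; the traces and the threshold are hypothetical.

```python
def windows(trace, n):
    """All length-n sliding windows of an SC trace."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def fit_normal_profile(normal_traces, n):
    """Learn the set of n-grams observed in normal traces only."""
    profile = set()
    for trace in normal_traces:
        profile.update(windows(trace, n))
    return profile


def anomaly_score(trace, profile, n):
    """Fraction of the trace's n-grams that never occurred in normal training data."""
    grams = windows(trace, n)
    if not grams:
        return 0.0
    return sum(1 for gram in grams if gram not in profile) / len(grams)


N, THRESHOLD = 2, 0.5   # hypothetical window size and alert threshold
profile = fit_normal_profile([["open", "read", "close"],
                              ["open", "read", "read", "close"]], N)

seen_normal = anomaly_score(["open", "read", "close"], profile, N)
attack = anomaly_score(["open", "execve", "socket", "close"], profile, N)
unseen_normal = anomaly_score(["open", "write", "close"], profile, N)

for name, score in [("seen normal", seen_normal), ("attack", attack),
                    ("unseen normal", unseen_normal)]:
    print(name, score, "ALERT" if score > THRESHOLD else "ok")
```

The benign but previously unseen trace is flagged just like the attack, which is exactly the source of high FAR in semi-supervised anomaly detection: anything absent from the normal training data looks anomalous.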
Classifier structure type and attack detection classes: Different classifier structures, namely base (53.5% of the studies), multi (25.3%), sequential (11.1%), and ensemble (10.1%), were used for detection techniques in the reviewed studies. Table 6 presents their descriptions with corresponding strengths and limitations. Most of the studies (91.8%) only detected anomalies without indicating a specific type of attack (i.e., referred to as attack detection), which amounts to binary classification. However, several studies (8.1%) performed either multi-class or both binary and multi-class classification. Multi-class classification (i.e., referred to as attack classification) is done to detect the specific attack type, such as Adduser, Hydra-FTP, or Java-Meterpreter [S6]. Fig. 6.b shows the mapping distribution of the detection sub-categories with corresponding classifier structure type and attack detection (binary) or classification (multi).

Detection techniques categorization:
Here we present the categorization of the detection techniques in terms of learning type, HIDS type, classifier structure type, features, and attack detection classes.

Supervised
Uses labelled data to train a learning model on normal and attack data.
Strengths:
• Stable performance and an effective way to detect known attacks.
Weaknesses:
• Requires labeled training data that is costly and time-consuming to provide.
• Difficult to detect unknown attacks.

Semi-supervised
Trains only on normal samples, with no anomalies in the given training data set.
Strengths:
• Only the normal class's labelled data is required.
Weaknesses:
• Suffers from high false alarm rate.

Unsupervised
Without any prior knowledge, utilizes statistical models to detect anomalies.
Strengths:
• Does not require labeled training data.
• Lower computational complexity.

(1) Support Vector Machines (SVM) and their variants are the most popular algorithms (24/99), e.g., SVM (10) and One-Class SVM (OCSVM) (12), used for binary and one-class classification with supervised (9) and semi-supervised (11) learning. For example, OCSVM using SC with their frequencies [S37] obtained performance close to that of OCSVM applied to a semantic model [S40]. OCSVM was also used with variable-length n-gram features [S61]. Two studies [S5, S63] proposed a sequential framework consisting of sequence prediction followed by one-class classification, where Isolation Forest (IF) obtained lower FAR but OCSVM achieved higher DR. Another study [S85] proposed sequence-based and frequency-based approaches, where frequency-based OCSVM outperformed KNN and the instance-based methods proposed in 2 studies [S36, S41], and had lower FAR than K-Furthest Neighbors (KFN).
Support Vector Data Description (SVDD) [89] was adopted by a study [S57] to detect attack types more precisely. A study [S86] proposed a fuzzy support vector algorithm based on SVDD to handle the sensitivity to noise and outliers in SC sequences, which is more robust than SVM and fuzzy SVM. Furthermore, Sequential Minimal Optimization (SMO) was used by a study [S62], leading to faster training of SC-based anomaly detection based on a modified vector representation.
Gaussian Mixture Models (GMM) (3/99) using frequency and binary features extracted from file system-specific audit records produced lower FAR than SVM [S30]. This base method was extended, with lower FAR, by a sequential framework including K-Means and one-class GMM [S48], and by a combination of GMM-based outlier detection, K-Means-based clustering, and KNN-based outlier filtering that generates alerts using a sliding window [S99].
SC2 proposed in study [S45] classified arbitrarily long SC sequences using NB, with class conditional probabilities derived from Markov chain modeling of sequences. The upgraded algorithm, SC2.2 proposed in study [S70], addressed two challenges: the issue of zero transitional probabilities, and vanishing probabilities. SC2.2 was later employed by a few studies [S23, S31] to classify traces rewritten with SC clusters and study [S21] used ensemble including first-order Markov-Bayes models and increased the Markov chain's order up to third-degree to reduce its FAR.
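The two challenges named above can be illustrated with a minimal sketch. It is not the SC2 or SC2.2 algorithm: it assumes a first-order Markov chain per class, uses additive (Laplace) smoothing as one common remedy for zero transition probabilities, and sums log-probabilities as one common remedy for vanishing probabilities on long traces. The class name, the smoothing constant `alpha`, and the example labels are hypothetical.

```python
import math
from collections import defaultdict

class MarkovChainClassifier:
    """One first-order Markov chain per class; predict the most likely class."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha   # additive smoothing constant (assumption)
        self.counts = {}     # class -> previous SC -> next SC -> count
        self.priors = {}     # class -> number of training traces
        self.vocab = set()

    def fit(self, traces, labels):
        for trace, label in zip(traces, labels):
            table = self.counts.setdefault(
                label, defaultdict(lambda: defaultdict(int)))
            self.priors[label] = self.priors.get(label, 0) + 1
            self.vocab.update(trace)
            for prev, nxt in zip(trace, trace[1:]):
                table[prev][nxt] += 1
        return self

    def log_likelihood(self, trace, label):
        table = self.counts[label]
        v = len(self.vocab) + 1  # +1 reserves mass for unseen system calls
        score = math.log(self.priors[label] / sum(self.priors.values()))
        for prev, nxt in zip(trace, trace[1:]):
            row = table.get(prev, {})
            num = row.get(nxt, 0) + self.alpha           # never zero
            den = sum(row.values()) + self.alpha * v
            score += math.log(num / den)                 # sum of logs, no underflow
        return score

    def predict(self, trace):
        return max(self.counts, key=lambda c: self.log_likelihood(trace, c))

clf = MarkovChainClassifier().fit(
    [["open", "read", "close", "open", "read", "close"],
     ["open", "read", "read", "close"],
     ["open", "exec", "write", "exec", "write"],
     ["exec", "write", "exec", "exec"]],
    ["normal", "normal", "attack", "attack"])
```

Working in log space keeps the score finite for arbitrarily long SC sequences, whereas a direct product of transition probabilities would round to zero.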
(3) Statistical model includes HMM (13/99), which is used to model SC traces [S28] and to classify kernel module sequences in combination with a rule-based approach [S25]. A sequential design including DT to classify web log files and HMMs to model the normal behavior [S53] outperformed DT, LR, and SVM in terms of accuracy and FAR.
Six studies combined multiple anomaly detectors using an ensemble classifier structure, where Iterative Boolean Combination (IBC) was used to combine multiple HMMs with different parameters [S22]. IBC was also used to combine heterogeneous classifiers [S46], and was improved in study [S39] for incremental learning (i.e., learning new data and combining the newly trained HMMs with the previously trained HMMs). Since combining all detectors may be inefficient due to the redundancy in their outputs, Pruned Boolean Combination (PBC) selected a subset of diverse and accurate detectors to combine homogeneous detectors [S95] and heterogeneous detectors (e.g., OCSVM) [S96]. Later, PBC was improved in study [S79] by a weighted pruning approach.
To reduce the long training time required, study [S44] used an improved HMM that replaces frequent common SC sequences with unique n-grams; however, this reduced accuracy compared to the standard HMM due to the loss of system behaviour information.
HMMs were also adopted to predict the most likely system call during an attack, where using 7-12 hidden states resulted in better prediction accuracy [S11].
Conditional Random Fields (CRF) using SC n-grams of order 6-8 for misuse detection [S35] outperformed SVM and NB. Clustered Markov Networks (CMN), which include KMC-based clustering and Markov network-based cluster modeling [S42], performed similarly to CMN with Outlying Subspace (CMN-OS), outperformed Clustered Label Propagation (CLP), Label Propagation (LP) [12], MN, and k-means outlier detection, and were less sensitive to noise than the others.
(4) Instance-based models make predictions based on the similarity (e.g., Euclidean [S36, S41], Mahalanobis [S36], or Wagner-Fischer [S24] distance) of a data point to its neighbors in the training set. Brute-force KNN achieved the highest accuracy compared to KNN based on BallTree and KDTree, SVM, and LR in study [S2]. KNN-based misuse detection [S52] showed high performance for both attack detection and classification. To detect low- and high-footprint attacks, KNN based on a zero-watermarking algorithm [S33] outperformed SVM in terms of DR and FAR; it was extended by study [S32] using normalization-based feature retrieval. A study [S85] used KFN for frequency-based AD, which outperformed KNN and the methods of 2 other studies [S36, S41] with higher accuracy and lower FAR. KNN was also used in multi [S3], sequential [S99], and ensemble [S87] classifier structures (the latter including K-centers, KNN, and other ML models combined by majority voting), which obtained higher performance than the base classifiers.
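As a minimal sketch of instance-based prediction, the following brute-force KNN compares a query vector against every training vector using Euclidean distance and takes a majority vote among the k nearest. The feature vectors, labels, and the choice k=3 are hypothetical and are not taken from any reviewed study.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Brute-force KNN: rank all training points by Euclidean distance."""
    ranked = sorted(
        (math.dist(v, query), y) for v, y in zip(train_vectors, train_labels))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-dimensional feature vectors (e.g., two SC frequencies).
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["normal", "normal", "normal", "attack", "attack", "attack"]
```

The brute-force variant needs no index structure but scans the whole training set for every query, so its cost grows linearly with the number of stored traces.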
(5) Ensemble includes tree-based algorithms, e.g., RF (6/99) and IF (3/99). Due to its explainability, RF was used for behavioral AD [S75] and for AD based on a sequence prediction model [S64, S94], where CNN and RF performed the best out of 9 different classifiers employed by study [S94]. Because RF has lower computational costs for real-time detection in applications that invoke a large number of SC, RF using TF-IDF of SC n-grams on traces with more than 500 SC outperformed DT, KNN, MLP, MNB, and SVM [S3]. For AD using statistical feature extraction, IF obtained the best AUC compared to LOF, OCSVM, and KNN for most of the datasets and cross-platform applications [S1]. 3-gram frequency-based ensemble classifiers (AdaBoost, Bagging, Gradient-Boosting, XGBoost, RF, and ExtraTree) outperformed other models such as NB-based methods in terms of F1-score and AUC [S65]. AdaBoost using contextual information of system manipulations outperformed SC2.2 and the second-order Markov model (MM) in terms of FAR, but the memory complexity of the third-order MM is better than that of AdaBoost [S21]. eXtreme Gradient Boosting (XGBoost), which is faster than AdaBoost, was employed by a few studies [S65, S89, S90]: study [S90] used attributes collected from the content of SC traces, and study [S89] enhanced it by leveraging a dynamic tracing buffer for training data selection.
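The TF-IDF representation of SC n-grams mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the feature pipeline of any reviewed study: each trace is treated as a document, each SC n-gram as a term, term frequency is the relative n-gram count, and the inverse document frequency uses one common smoothed variant, ln((1 + N) / (1 + df)) + 1. The function names and the default n=3 are hypothetical.

```python
import math
from collections import Counter

def ngrams(trace, n):
    """All contiguous n-grams of a trace, as tuples."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

def tfidf_vectors(traces, n=3):
    """Return (vocabulary, one TF-IDF vector per trace)."""
    docs = [Counter(ngrams(t, n)) for t in traces]
    vocab = sorted({g for d in docs for g in d})
    n_docs = len(docs)
    # Document frequency: number of traces containing each n-gram.
    df = Counter(g for d in docs for g in d)
    idf = {g: math.log((1 + n_docs) / (1 + df[g])) + 1 for g in vocab}
    vectors = []
    for d in docs:
        total = sum(d.values()) or 1
        vectors.append([d[g] / total * idf[g] for g in vocab])
    return vocab, vectors

vocab, vectors = tfidf_vectors([[1, 2, 3, 4], [1, 2, 3, 5], [6, 7, 8, 9]])
```

N-grams shared by many traces receive a lower weight than n-grams that occur in few traces, so the resulting vectors emphasize the less common call patterns before they are passed to a classifier such as RF.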
KMC using low-dimensional SC frequency vectors obtained by PCA detected most types of attack more effectively than KNN in a study [S36]. Modified KMC based on harmony search [S72] outperformed fuzzy c-means and K-means in terms of accuracy and FAR. In a study [S84], 2-means clustering using the Damerau-Levenshtein distance (a text similarity measure) and a classification tree based on normal activity patterns were used for comparison against current SC sequences. This study [S84] was extended by a study [S24] to investigate the effectiveness of fuzzy clustering for AD, which achieved a high F-Measure. Regular expressions [S23] and the Levenshtein distance [S31] were used to create clusters that enhanced the AD classification performance. Similarly, KMC was used by several studies [S42, S48, S99] in a sequential framework to produce a higher-level feature space useful for improving the accuracy.
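To make the edit-distance measures above concrete, the following minimal sketch computes the optimal string alignment distance, a restricted variant of the Damerau-Levenshtein distance in which each substring is edited at most once. Whether the reviewed studies used this restricted variant or the unrestricted one is not stated here, so the choice is an assumption; the function name is hypothetical. Sequences may be strings or lists of SC names.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # swap of two adjacent elements counts as one edit
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For example, two traces that differ only by swapping two adjacent system calls are at distance 1 under this measure, whereas the plain Levenshtein distance would count two substitutions.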
(7) Neural Networks (NN) are used only for AD in the reviewed studies. An NN obtained the best results, compared to SVM and DT, in classifying SC traces represented by TF-IDF-vectorized, dimensionality-reduced n-gram feature vectors in a study [S6]. Since ELM [42] is a single-hidden-layer feedforward NN with promising performance and faster training than traditional methods, it achieved the best results in several studies [S27, S32, S40, S97]. In [S40, S58], real-time detection is challenging due to the requirement of complex weighted frequencies of trace phrases. A study [S14] extended study [S32] by adding anomalies generated with a GAN to the dataset for an MLP classifier, which significantly outperformed the model based on the imbalanced dataset.
(8) ML-based Rule system includes ML methods that automatically identify, learn, or evolve rules for modeling a decision engine. The Decision Tree (DT) and its extensions, C4.5 and C5, were employed by 8/99 studies. Supervised C4.5 on the frequent n-gram vectors of SC outperformed NB, SVM, and MLP in a study [S19]. Semi-supervised C5 outperformed OCSVM for AD in another study [S92]. A study [S62] used several algorithms (e.g., NB, KNN, and SVM) including rule-based algorithms such as C4.5, Zero Rule, One Rule, and Repeated Incremental Pruning to Produce Error Reduction (RIPPER), where the lower space complexity of C4.5 made it more appealing for both attack detection and classification. This study [S62] was outperformed in a study [S60] by frequent n-gram-based voting ensemble design, while the base learners included NB, SVM, DT, RF, and PART. Rough Set Classification (RSC) was used as base classifier for AD in 2 studies [S7, S59], where RSC outperformed DT, SVM, KNN, and NB in terms of accuracy [S7].  [S65] were also adopted for AD by the reviewed studies.
(ii) DL-based Techniques: The adoption of DL for NLP-based HIDS started in 2017. DL has gained popularity over time, with promising results compared to ML, because ML fails to detect intrusions accurately on real-time, large-scale data owing to its simplicity and its reliance on mostly manually extracted features [S82]. For predicting the next SC sequence or detecting attacks, sequence-to-sequence language modeling, Deep Neural Networks (DNN), and ensemble NN have been used.
(1) Deep Neural Network (DNN) including Deep Multi-Layer Perceptron (MLP) is used as base in a study [S78] outperforming traditional ML (e.g., RF, DT, and SVM), as multi in a study [S65], and as ensemble on the outputs of STIDE, text classification, and graph-based method in another study [S98]. A study [S11] modeled system's normal behaviour using LSTM and used Youden's index for AD. Considering SC logs as text, a study [S68] applied NLP (one-hot vector, word2vec, GloVe) to transform the SC sequences and combined them with 3 inputs (SC, SC mapped kernel modules, and their combination) using LSTM. Though one-hot encoding showed the best performance, other methods can also be helpful for complex IDS and the use of only kernel modules is discouraged due to the information loss during the mapping process. This study was extended in study [S74] by applying fasttext-based LSTM.
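Youden's index, mentioned above for AD, selects the decision threshold that maximizes J = TPR - FPR (equivalently, sensitivity + specificity - 1). The sketch below is a generic illustration and does not reproduce how study [S11] applied it: it assumes each trace has an anomaly score where higher means more anomalous, that label 1 denotes an attack, and that a trace is flagged when its score is at or above the threshold. The function name and sample data are hypothetical.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = TPR - FPR; label 1 = attack."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes are required")
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):  # every observed score is a candidate
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

threshold, j = youden_threshold([0.1, 0.2, 0.3, 0.8, 0.9, 0.4],
                                [0, 0, 0, 1, 1, 1])
```

Because J weights the detection rate and the false alarm rate equally, the selected threshold may need adjusting when false alarms are considerably more costly than missed attacks.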
A study [S93] extracted features using a sequence-based model (RNN), a non-sequence-based model (AE), and an advanced NLP-based attention mechanism to improve detection interpretability, and outperformed OCSVM and LSTM.
A study [S17] introduced CNN-based HIDS in 2018, considering the similarities of SC traces with the NLP representation. CNN outperformed RNNs (LSTM, GRU) in a study [S10], and SVM (of [S2]) in another study [S71]. A study [S8] used NLP-based character-level processing of SC and a temporal CNN including causal convolution (to extract sequential features) and dilated convolution, which outperformed OCSVM [S37], (KNN, NB, SVM, K-means, Zero-R, One-R, RIPPER, DT) [S62], ELM, and LSTM. Moreover, word-embedding-based semantic feature extraction reported in a study [S82] improved CNN performance over traditional ML (e.g., LR, SVM, KNN, and NB). A study [S80] in 2021 proposed an
(2) Seq-to-seq language models are adopted by considering SC sequences as instances of the language for communication between users (or programs) and the system, where SC and SC sequences correspond to words and sentences in natural languages. Since RNNs deal with long sequential problems, seq2seq language modeling was introduced in HIDS by semantically modeling SC for real-time detection. RNN-AE was used to detect zero-day attacks in a study [S94].
A Variational Encoder-Decoder (VED) with two RNNs (LSTM, GRU) was used for AD in a study [S63], which was extended in another study [S5] using a pre-processing step resembling NLP-based Question-Answering that considers every two successive sequences as a pair (source sequence (question), target sequence (answer)). Here, the applied RNN-VED captured the semantic meaning of SC to predict the next SC from the given context. Further, RNN-AE was used to predict SC traces in a study [S64] using GRU due to its simpler structure and fewer parameters compared to LSTM. Besides, the CuDNNLSTM language model, which is supported only on GPUs, was adopted in [S73], achieving 10 times faster training than LSTM.
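The question-answer style pre-processing can be sketched as follows. How the study segments a trace is not specified here, so this sketch assumes non-overlapping chunks of a fixed length and discards a trailing remainder shorter than that length; the function name and the chunk length are hypothetical.

```python
def successive_pairs(trace, length):
    """Split a trace into fixed-length chunks and pair each chunk with the
    next one as (source sequence, target sequence)."""
    chunks = [trace[i:i + length]
              for i in range(0, len(trace) - length + 1, length)]
    return list(zip(chunks, chunks[1:]))

pairs = successive_pairs([1, 2, 3, 4, 5, 6, 7], 2)
```

Each pair can then serve as one training example for an encoder-decoder model, with the source chunk as input and the following chunk as the sequence to predict.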
(3) Ensemble neural network reported in a few studies [S4, S15, S67] ensembled CNN and RNN models, where CNN extracted SC relation features and RNN captured SC sequences context by extracting long-distance sequence dependency.
A study [S4]
(1) Sequence-based techniques (12/99) mostly deal with SC sequences: they build a model of normal patterns and detect attacks by sequence matching. STIDE [36] was used by studies to investigate the effect of kernel modules in reducing FAR and execution time [S25], to be integrated with other techniques as an AD tool [S26], to implement an ensemble HIDS [S46, S96, S98], and as the baseline for comparison [S76, S87]. A study [S85] modified STIDE by defining a similarity score based on the matching hits of test SC sequences against all normal sequences, which outperformed several studies using frequency-based [S36, S41], semantic [S40], frequency and sequence-based [S37], and statistical [S33] feature-based techniques. Though study [S40] had lower FAR than STIDE [S85], it required a long semantic feature extraction time, making STIDE [S85] more appealing for real-time detection. A study [S16] used a similarity approach that considers the minimal number of subsequences required to build a complete covering of a given sequence, which outperformed other text similarity measures (e.g., Levenshtein's distance, Longest Common Subsequence/Substring) in terms of accuracy and execution time. In a study [S66], detection uses a similarity measure based on sequential association patterns mined from the process-SC database of the normal temporal transaction set. A study [S56] integrated control-flows, data-flows, and their inter-dependency, and identified any violation of the semantic flow specifications as an attack occurrence, which is more efficient than approaches focusing only on control-flows or data-flows.
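Sequence matching in the spirit of STIDE can be sketched as follows. This is a simplified illustration rather than the original STIDE implementation or the modified version of [S85]: it stores every fixed-length window of the normal traces in a set and scores a test trace by the fraction of its windows that are absent from that set. The window length of 6 and the function names are assumptions.

```python
def build_normal_db(normal_traces, window=6):
    """Collect every contiguous window of SC observed in normal traces."""
    db = set()
    for trace in normal_traces:
        for i in range(len(trace) - window + 1):
            db.add(tuple(trace[i:i + window]))
    return db

def mismatch_rate(trace, db, window=6):
    """Fraction of the trace's windows that never occurred in normal data."""
    windows = [tuple(trace[i:i + window])
               for i in range(len(trace) - window + 1)]
    if not windows:
        return 0.0  # trace shorter than the window: nothing to compare
    misses = sum(1 for w in windows if w not in db)
    return misses / len(windows)

db = build_normal_db([[1, 2, 3, 4, 1, 2, 3, 4]], window=3)
```

A trace would then be reported as anomalous when its mismatch rate exceeds a chosen threshold; set lookups keep the per-window cost constant, which suits real-time use.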
(2) Semantic ontology, which integrates heterogeneous data sources and collects security knowledge of assets, vulnerabilities, attacks, and their relationships, was used to implement HIDS in several studies [S13, S29, S43, S55].
The ontology helps to link and infer means and consequences of attacks whose signatures are not yet available [S43].
Study [S13] first proposed an IDS ontology by collecting malware information from Symantec's website to notify users of any relevant threats. A study [S55] employed a Knowledge-Based Temporal Abstraction framework for temporal abstractions of security domain knowledge to detect early malware temporal patterns. Another study [S29] used an off-the-shelf Named-Entity Recognizer (NER) based on cybersecurity data [49]. The NER was used to extend the existing Unified Cybersecurity Ontology [88] (which consists of attack patterns, indicators, etc.) to reason over the multi-sensor data and predict security events. A study [S43] used the OpenCalais [3] information extraction tool to take inputs from different data streams and reasoned on the knowledge asserted into the ontology to infer the attack possibility.
(3) Model/language-based category includes approaches that use states of the data stream, tree, or language-based rules to model the training data for detecting intrusions. For example, using Kernel State Modeling (KSM) a study [S38] defined 8 states (e.g., file system state, kernel state, memory management state) and represented SC by their corresponding states to detect anomalies. A study [S50] used State Transition Table ( runtime behaviors) without deteriorating its high precision and time efficiency. Another study [S77] used tree of SC n-grams to model normal profile using occurrence frequency grouping that obtained the best results by n-grams of order 3, which needs to achieve reasonable time and space complexity.
Hybrid HIDS outperformed single IDS, as reported in a few studies [S20, S47]. For hybrid HIDS, study [S20] used both misuse detection (ARIMA time series modeling of the host log) and anomaly detection (Apriori generating association rules based on the frequency of host log attributes). Another study [S47] used Linear Temporal Language-based rules for misuse detection and genetic process mining to model normal processes for anomaly detection. Normal behavior is modeled using a grammar-based compression algorithm in a study [S54]. Another study [S81] used an SC frequency matrix of lookahead pairs for AD based on a pre-defined threshold. To identify more attacks, a study [S9] used association rule mining for unknown attacks together with a Brute-Force pattern matching algorithm. Another study [S51] pruned the rules that generate false alarms and enriched the rules that increase the coverage of attacks.

Content-ML (20)
Content-based features used for ML-based HIDS.
S18, S19, S21, S22, S24, S28, S34, S35, S39, S44, S45, S60, S62, S70, S72, S79, S84, S88, S95, S97

Content-Rule (12)
Content-based features used in a rule-based HIDS.

Categorization of the reviewed NLP-based HIDS Solutions.
We developed a detailed categorization of HIDS approaches highlighting the applications of NLP. To build this categorization, we combined the two main NLP-based components of a HIDS, i.e., the feature engineering and the detection techniques adopted by the reviewed studies. For feature engineering, we consider two main categories: statistical and content-based (content-based includes contextual and attribute-based features). Taking the combination of 2 feature types with 3 detection techniques, we identified 6 categories that separate the reviewed studies into distinct groups. Table 7 shows the categories along with their descriptions and mapped studies. The categorization also includes a hybrid category for studies that use either hybrid features (both statistical and content-based) or hybrid detection techniques. It is the most dominant category (37/99), including studies on ML+Rule-based (7), ML+DL-based (10), and DL+Rule-based (1) detection techniques; the other papers in this category used hybrid features. Fig. 7 shows the distribution of the identified categories over the years. As Fig. 7 suggests, hybrid and ML-based approaches were adopted throughout the whole period. Rule-based approaches were popular in the earlier years, but manual rule definition and the increasing amount of security data with real-time detection requirements have made DL and hybrid approaches more popular choices in recent years.

RQ2: Attack Categories
Our motivation for studying and categorizing security attacks is to contextualize the attacks that the reviewed NLP-based HIDS aim to detect. To answer RQ2, we present the types of attacks that are detected using NLP techniques in HIDS by the reviewed papers, where (i) the HIDS is directly evaluated using these attacks or (ii) the HIDS is evaluated against a public dataset that includes particular attacks. 125 instances were reported as attacks in the reviewed studies, which we categorized into 12 attack categories. Table 8 shows the attack categories and highlights the impacts of the attacks on the security requirements, which provides a deeper understanding of the security attacks and their targets. The attack categorization was adapted and adjusted from study [51]. To contextualize the NLP solutions in relation to the detected attacks, we mapped the attack categories to the NLP-based HIDS solution categories, as shown in Fig. 7 (b). Later, in Fig. 8 (b), we map the attack categories to the datasets used in NLP-based HIDS (Section 4.3: RQ3), showing which attack categories those datasets cover.

Impacts on Security Requirements:
We consider the following most significant security requirements that a security framework needs to provide and that can be impacted by the reviewed attacks in NLP-based HIDS:
• Confidentiality ensures data security by preventing information from being disclosed to unauthorised individuals, entities, or processes.
• Data integrity protects data from unauthorised modifications throughout its life cycle, ensuring its accuracy, trustworthiness, and validity.
• Availability ensures the availability of information or services for legitimate users upon demand.
• Authentication is the process of recognizing a legitimate user identity to be verified before data access.

Categorization of Attacks detected in NLP-based HIDS:
Here, we briefly discuss the attack categories in terms of included attacks, impacts on the above security requirements, and mapped NLP-based HIDS solutions.
(1) Arbitrary Code Execution (ACE) is the most reported attack type in the reviewed studies (76/99). It involves an attacker gaining control by injecting their own code through the exploitation of a vulnerability. It can affect any of the security requirements, depending on the target of the arbitrary code or commands executed on a target machine or process [84].
(3) User to Root (U2R) attack enables intruders to gain a system's root access starting with access to a normal user account (achieved by password sniffing, dictionary attack, or social engineering) [90]. It bypasses the authentication and threatens the data integrity by removing security policy specified files from the victim hosts. Examples include Adduser, sunsendmailcp, and secret attacks, where an authorized user removes the special files (ntfsdos, sqlattack) [90].
(4) Brute force attack employs a trial-and-error process that generates a huge number of guesses and validates them to collect data (e.g., account password, personal ID number), which damages data confidentiality and can circumvent authentication by acquiring an authorized user's login information. An example is 'Password Guessing' [S47], in which an intruder guesses a user password more than three times. Except for the statistics-rule solution category, all the other solution approaches have been adopted to detect brute force attacks.
(6) Backdoor secures remote access to a machine, or to plaintext in a cryptographic system, which can be used to access sensitive data (e.g., passwords), alter or remove information on hard drives, or transfer data across autoschediastic networks. It impacts confidentiality, integrity, and authentication. For example, a study [S49] proposed NORT, which detected a family of backdoors called Win32.Hydraq. All approaches except Content-Rule were used to detect backdoors in the reviewed studies.
(7) Worm is a type of malware, which replicates itself to spread to uninfected computers throughout the network without human intervention [84]. Worms can damage services availability by consuming bandwidth/storage space, and affect data integrity by corrupting or modifying files on a targeted computer. Various studies [S49, S55] have proposed HIDS, which are able to detect worms (e.g., NetSky.y, Zhelatin.uq, and Mytob.x). These worms can over-write other executables and try to exploit OS components. 2/7 studies adopted hybrid solution approach, while each of the other solution approaches was adopted by a single distinct study for each category.
(8) Trojan is a program that appears appealing and legitimate but contains anomalous code, which can affect any of the security requirements. Examples include Trojan-PSW (steals user account information), AdWare (displays advertising popups), WebToolbar (installs in-browser content without users' consent), Trojan-Spy (keylogging, monitoring processes), and Trojan-Ransom (prevents users' access to demand payment). Except for the approaches using statistical features (statistic-ML and statistic-rule), all the other approaches were adopted to detect trojans.
(9) Probe attacks automatically scan a system or network to gather records of private systems or a DNS server in order to find legitimate IP addresses (ipsweep, mscan, lsdomain), host OS types (mscan, queso), active ports (mscan, portsweep), and known vulnerabilities (satan) [90], which compromises data confidentiality. Only statistic-ML and hybrid approaches were adopted to detect probe attacks.
(10) Data theft threatens data confidentiality by stealing information stored on corporate databases, devices, and servers. For example, two studies [S75, S26] reported the T1003 attack, which steals user credentials stored in lsass memory, and a PHP 5.3.5 vulnerability (CVE 2011-1153), which steals context-sensitive data from the memory of the process, respectively. The reviewed studies did not use rule-based or Content-ML approaches to detect data theft.
(11) Virus is a code that attaches itself through any infected files and self-replicates when the programme is run [84].  [S47]. While all the NLP-based HIDS solution categories were used to detect this attack category, the top-3 prevalent solutions used are hybrid (13), Content-ML (12), and Content-Rule (8).

RQ3: Datasets
We identified 36 different datasets employed in the reviewed NLP-based HIDS. We discuss these datasets with respect to types of data sources, data generation, availability, attack, and solution. The primary resource of data used in HIDS research is log files as they contain information about usage patterns, activities, and operations within an OS, application, or server. Three main types of log data sources are typically used by the reviewed studies which are SC, audit data, and system log. Datasets have been generated by collecting these data sources either in real (i.e., organization/production environment) or simulated (either synthetic data or controlled environment (e.g., testbed, emulation environment, lab)) environments. We have adapted and modified the dataset categorization from study [78]. Table 9 describes the types of data sources and datasets along with their strengths and weaknesses. Besides, based on dataset availability, the reviewed studies can be categorized as public and private/customized. Public datasets in the HIDS domain are usually outdated, lack sufficient labeled data, or do not cover diverse attack types. To overcome these limitations, some studies were motivated to explore methods for generating new customized datasets, which are usually kept private.  Table 10 presents a list of currently accessible public datasets used in the primary studies with their description, strengths, and weaknesses.
4.3.1 SC datasets. Since analyzing system logs and audit data requires managing a huge amount of data to extract useful information, SC analysis is the most popular (78/99) approach for developing HIDS. SC is a primary artifact of the OS kernel, with no filtering, interpretation, or processing applied that could obfuscate events [S40]. The reviewed studies used public (67/99), private (7/99), and both (4) datasets to implement or evaluate the target HIDS.
SC-based public datasets comprise 7 datasets, including the recent (2018) real Attack-caused Windows OS SC Traces Dataset (AWSCTD) [22]. It includes real-life malware information from VirusShare [1] and VirusTotal [2], with worm, Trojan, backdoor, and misc attacks. It is used by 2 studies [S4, S10], which adopted the DL and hybrid approaches, respectively. The UNM dataset [31] (2004) includes U2R, R2L, DoS, and misc attacks. Despite the remarkable usage of the UNM dataset (24/99) in HIDS research, it is marked as outdated and limited in scope [27]. Firefox DS includes normal traces for Firefox 3.5, obtained by executing 7 different testing frameworks, and anomalous traces obtained by launching contemporary attacks. It was used by 2/99 studies, which adopted the hybrid approach to detect the included ACE, DoS, and misc attacks. Further, the Florida Tech and University of Tennessee at Knoxville (FIT-UTK) macro execution traces comprise 36 normal and 2 anomalous traces that correspond to DoS and ACE attacks; they were used by a study [S51] that adopted the hybrid approach.
Australian Defence Force Academy Linux Dataset (ADFA-LD) [27] is the most used (49/99) dataset published in 2013, which includes U2R, R2L, ACE, and brute force attacks. All the approaches except the statistics-rule were adopted to detect intrusion in ADFA-LD dataset by the reviewed studies. While ADFA-LD dataset was obtained from Ubuntu OS, Australian Defence Force Academy Windows Dataset (ADFA-WD) was obtained from a Windows XP SP2 host.
ADFA-LD and ADFA-WD (6/99) were intended to represent modern attack structure and methodology to replace the older datasets DARPA and UNM. Only ML and hybrid approaches are used to detect attacks in ADFA-WD dataset.

Table 9. Type of NLP-based HIDS data sources and datasets, their description with advantages and disadvantages

Type of Data Sources

System Log
Description: Logs generated by the operating system.
Strengths:
• Includes valuable information for IDS (e.g., warnings, errors, system failures).
Weaknesses:
• Requires managing a huge amount of data to identify and extract useful information.

Audit Data
Description: Log files produced by individual applications.
Strengths:
• Keeps useful data about the sequence of events of a program (e.g., successful/failed authentication, SC, user command logs).
Weaknesses:
• Requires managing a large amount of data, identifying and extracting useful information.

SC
Description: SC sequence invoked by a process that properly reflects the behavior of an individual program.
Strengths:
• Primary artifact of the OS kernel with no filtering, interpretation, nor processing applied that can obfuscate events [S40].
Weaknesses:
• Loss of some relevant information about occurred events.

Type of Datasets

Real
Description: Datasets including real data captured from a real organization/production environment. Both data and environment are real.
Strengths:
• Provides the true distribution of data.
Weaknesses:
• Imbalanced datasets with an insufficient number of malicious activities.
• Covers limited attack types.

Simulated
Description: Datasets including either synthetic data (e.g., artificially generated data) or data captured within a test bed or emulated controlled environment.
Strengths:
• Able to reproduce balanced datasets.
• Able to generate rare misuse events.
• Useful for attacks for which real data is not available.
Weaknesses:
• Tool specific.
• May not depict the real distribution of data.
• Not a representation of a real heterogeneous environment.
SC-based private datasets includes a real dataset [S42] that collected SC snapshots from a corporate network and malware from a security company, where the statistic-ML approach was adopted. which allow monitoring the execution of a program and read SC traces on user or kernel space. Ptrace helps to trace Linux SC used in a study [S50] to produce data from 3 Linux self-contained programs (gzip, cat, and ps). Besides, Strace open-source application utilizes ptrace to provide statistics about a trace in text format. Linux Trace Toolkit Next Generation (LTTng) tracer saves traces in Common Trace Format (CTF) [32]. In a study [S10], AWSCTD was appended with SC using drstrace, which traces SC for Windows OS. Study [S59] modeled the process of sending and receiving e-mails. Two studies [S22, S39] generated data using Conditional Relative Entropy (CRE) which divides conditional entropy by the source's maximal entropy to provide an irregularity index to the generated data.
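The irregularity index described above, conditional entropy divided by the source's maximal entropy, can be sketched as follows. This only evaluates the index on a given sequence and does not reproduce the data generator of [S22, S39]: it assumes a first-order dependency (each symbol conditioned on its predecessor), empirical probability estimates, and a maximal entropy of log2 of the alphabet size. The function name is hypothetical.

```python
import math
from collections import Counter

def conditional_relative_entropy(seq):
    """Empirical H(next | previous) divided by log2(alphabet size)."""
    alphabet = set(seq)
    if len(alphabet) < 2 or len(seq) < 2:
        return 0.0  # a single-symbol source has no irregularity
    pair_counts = Counter(zip(seq, seq[1:]))
    prev_counts = Counter(seq[:-1])
    n_pairs = len(seq) - 1
    h_cond = 0.0
    for (prev, _nxt), c in pair_counts.items():
        p_joint = c / n_pairs           # P(previous, next)
        p_cond = c / prev_counts[prev]  # P(next | previous)
        h_cond -= p_joint * math.log2(p_cond)
    return h_cond / math.log2(len(alphabet))
```

Under these assumptions, the index is 0 for a perfectly regular sequence, in which each symbol determines its successor, and approaches 1 as the successor becomes uniformly random.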
Audit data-based public datasets include 4 real datasets used by 4 studies [S30, S48, S83, S99], covering U2R, backdoor, data theft, and misc attacks; all 4 studies adopted the hybrid approach. The vergina, www_ee, and thmmy datasets were used by a few studies [S30, S48, S99] to develop file system-based HIDS methods using Basic Security Module (BSM) audit records. The data were collected from 3 real-life web servers: vergina, www_ee, and thmmy.

ADFA-LD [27] 2013
Alternative to older datasets (DARPA, UNM), collected under Ubuntu OS running services and simulating attacks.
• Relatively up-to-date and representative of contemporary attacks
• Lacks SC arguments, return values, or other metadata

ADFA-WD [27] 2013
Collected on a Windows host; a total of 12 known vulnerabilities were exploited to simulate different attack types.
• Relatively up-to-date and representative of contemporary attacks
• Lacks SC arguments, return values, or other metadata
• The number of vulnerabilities used to create malicious activity was inadequate [22].

CANALI-WD [20] 2012
Includes program execution traces observed both in a synthetic environment and on real-world machines with actual users and under normal operating conditions.
• Presents a large collection of anomalous traces compared to previously published SC datasets.
• Not biased towards particular runtime environments or usage patterns.
• Lacks some useful information such as SC arguments, timestamp, etc.

UNM [31] 2004
Includes Synthetic Sendmail UNM, Synthetic Sendmail CERT, live lpr UNM, and live lpr MIT datasets. Synthetic traces were collected in production environments by running a prepared script.
• Widely used as benchmark
• Includes programs of varied size and complexity, and different kinds of intrusions (buffer overflows, symbolic link attacks, and Trojan programs)
• Outdated
• Lacks SC arguments or other metadata
• Extremely limited in scope and does not represent a full sampling of the OS; focuses on single processes (process IDs, SC IDs)

AWSCTD [22]
• New extended dataset for Windows
• Includes parameters (SC arguments, return value) for in-depth training
• Includes varied malware types
• Lacks some useful information such as SC arguments, timestamp, etc.
• Used in a few studies

Audit data-based datasets

DARPA/KDD [29] 1998/99
Includes Basic Security Module (BSM) data file with SC-based audit data produced in a victim's machine for host-level audit.
• First standard corpora for evaluation of NIDS and widely used as benchmark
• Includes arguments and return values
• Very obsolete, unable to accommodate the latest trend in attacks
• Focus on NIDS, lacks the information required to train HIDS-suitable methods

NGIDS-DS [40] 2017
Obtained from an Ubuntu 14.04 host that is equipped with an auditing mechanism and includes 99 host log files.
• Loss of parameters and more accurate timestamps
• Used in a few studies

However, the links to the datasets presented in the relevant study [S30] are no longer accessible. Besides, the Purdue Unix Shell (PUS) dataset used by a study [S83] includes 8 users' log records collected over two years.
Though DARPA (1998/1999) includes DoS, R2L, U2R, and Probe attacks, it is considered obsolete as it is unable to accommodate the latest trend in attacks [70]. The next-generation IDS dataset (NGIDS-DS) is relatively new (2017) and includes diverse attacks and thread information, but both parameters and more accurate timestamps are missing.
Audit data-based private datasets: The real datasets include the log audit project of an IT security company [S53] and data gathered from real-life malware information sharing, threat intelligence platforms, etc. [S13, S29]. For detecting intrusions in the audit data-based private datasets, only Content-Rule and hybrid approaches were used. The 5 simulated datasets include 5/12 attack categories. Studies collected data from a local standalone cluster installed on a local machine [S9], a synthetic log related to an information system [S47], a simulated environment including 7 computers and a server simulating the Internet [S55], and programs running on a server together with attacks exploiting their vulnerabilities using a customized loadable Linux kernel module (LKM) [S56]. A study [S57] generated artificial data in a two-dimensional data space corrupted by Gaussian noise.
The hybrid (real+simulated) dataset in [S75] used both lab and MITRE's production enterprise environment to create labeled data including data theft and ACE attacks and adopted a hybrid solution approach.

System logs datasets.
Only 4/99 studies used system logs for HIDS, adopting the Statistic-Rule [S20], Content-ML [S18], Content-Rule [S43], and hybrid [S91] approaches. These datasets are private and include ACE and misc attacks. The real datasets include the collection of host logs in a production office environment [S20] and Windows Security logs from two different organizations [S91]. The simulated datasets include Enhanced Custom Log (ECL) files used in a study [S18] and, in another study [S43], attacks simulated in a controlled environment on a local network.

RQ4: Evaluation Metrics
We present 22 evaluation metrics that have been used to evaluate the reviewed NLP-based HIDS. The reviewed studies that perform an intermediary step (e.g., sequence prediction, clustering) before the intrusion detection to enhance the detection capabilities evaluate both steps with distinct evaluation metrics. Thus, our 22 identified metrics are categorized under two categories: performance of intrusion detection and performance of intermediary tasks.
4.4.1 Performance of Intrusion Detection: 14 metrics were used for evaluating HIDS performance in the reviewed studies. We categorized them as detection performance and computation performance. Table 11 presents the metrics of each category with their description, mathematical representation, and the mapping to the NLP-based HIDS solution categories. The study mapping with the metrics is available in our online appendix [4] A.5.
Detection Performance: It includes 12 evaluation metrics that the reviewed HIDS studies used to validate the detection result, i.e., whether an activity is benign or malicious, or which attack type it belongs to. As shown in Table 11, the mathematical representation of each of these evaluation metrics involves True Positive (TP), False Positive (FP), True Negative (TN), or False Negative (FN) counts [46]. Since HIDS usually intends to maximize the attack DR and minimize the FAR, these are the top-2 most used metrics. In the HIDS domain, DR, recall, and true positive rate (TPR) refer to the same quantity, as do FAR and false positive rate (FPR) [63]. The third and fourth most used metrics are ROC-AUC (45) and accuracy (25). Compared to classification accuracy, the AUC based on the ROC curve (ROC-AUC) provides a more robust measure for evaluating HIDS derived from imbalanced datasets, as accuracy can be dominated by the majority benign class in HIDS. However, if the dataset is highly imbalanced, the shape of the ROC curve can be misleading [46]. Thus, we recommend the use of MCC, which is considered one of the best-balanced measures [46]. Unfortunately, only 1 study [S4] under the content-DL category used MCC, while it was expected to be used by the statistic-ML and content-ML categories as well. The fifth most used metric is precision (21/99) and the sixth most used metric is F-measure (19/99), which combines both precision and recall (i.e., DR). It is to be noted that the fewest metrics were used by the rule-based HIDS categories (i.e., statistic-rule, content-rule), as they did not use metrics such as TNR, confusion matrix, classification error, F-measure, or MCC.

Metrics of intermediary tasks (table rows): metric (number of studies), description, studies.
TF-IDF (2). The TF-IDF value rises in direct proportion to the frequency of a word in a document, but is offset by the number of documents in the corpus that include the word. S64, S94
Cosine Similarity (2). Cosine of the angle between two vectors to determine if they are in the same direction. S64, S94
Clustering
Dunn index (2). Ratio of minimum inter-cluster distance to maximum intra-cluster distance. S24, S84
Partition Index (1). Ratio of the sum of clusters' compactness and separation. S24
Rand index (1). The algorithm's percentage of correct decisions. S84
Davies-Bouldin index (1). Errors caused by representing data vectors with cluster centroids and clusters' distance. S84
C-index (1). A cluster similarity measure. S84
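To make the confusion-matrix definitions of the detection metrics concrete, the following is a minimal sketch that computes DR, FAR, accuracy, precision, F-measure, and MCC from TP, FP, TN, and FN counts. The function name and the example counts are hypothetical and are not taken from any reviewed study; they only illustrate why accuracy can look strong on an imbalanced dataset while MCC does not.

```python
import math


def detection_metrics(tp, fp, tn, fn):
    """Compute common HIDS detection metrics from confusion-matrix counts."""

    def safe_div(num, den):
        # Return 0.0 when a metric is undefined (zero denominator).
        return num / den if den else 0.0

    dr = safe_div(tp, tp + fn)           # detection rate = recall = TPR
    far = safe_div(fp, fp + tn)          # false alarm rate = FPR
    precision = safe_div(tp, tp + fp)
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    f_measure = safe_div(2 * precision * dr, precision + dr)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)
    return {
        "DR": dr,
        "FAR": far,
        "precision": precision,
        "accuracy": accuracy,
        "F-measure": f_measure,
        "MCC": mcc,
    }


# Hypothetical imbalanced test set: 990 benign traces and 10 attack traces.
# A detector that labels everything as benign still reaches 99% accuracy.
always_benign = detection_metrics(tp=0, fp=0, tn=990, fn=10)

# A detector that finds 8 of the 10 attacks at the cost of 20 false alarms.
useful_detector = detection_metrics(tp=8, fp=20, tn=970, fn=2)

for name, result in [("always benign", always_benign), ("useful", useful_detector)]:
    print(name, {k: round(v, 3) for k, v in result.items()})
```

In this illustration the trivial detector has the higher accuracy but a DR and MCC of zero, whereas the second detector has slightly lower accuracy but a clearly positive MCC, which is the behavior that motivates reporting MCC on imbalanced data.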
Computation performance: The computation performance of HIDS is measured using time and resource utilization.
To evaluate the required time the reviewed studies (23/99) reported training time, testing time, or execution time.
Resource utilization refers to the storage or resource usage (e.g., the size of the stored model such as learned rules, or the memory required for training a model). Evaluating HIDS in terms of time and resource utilization is important for demonstrating their suitability for real-life applications. For example, given their complex structure, DL methods require more time and memory even though they provide high detection accuracy. Unfortunately, only 2 of the reviewed studies [S4, S58] under the content-DL category reported time, while none from this category reported resource utilization, which calls into question the applicability of the proposed HIDS in the industrial setting.

Performance of Intermediary Tasks:
Some of the reviewed studies performed an intermediary task before performing the detection, which is categorized as sequence prediction and clustering. As the performance of HIDS strongly relies on the performance of such intermediary tasks, various metrics have been used to evaluate these intermediary tasks as shown in Table 12 with other relevant details.
Metrics of sequence prediction: There were 8 studies [S5, S15, S63, S67, S71, S73, S82, S94] that performed next SC sequence prediction to monitor a system state and predict attack behavior. Using the predicted sequence as supplementary information alongside the invoked sequence for the detection classifier significantly improved the performance of HIDS [S63, S94].
[...] inter-cluster distance, for which the Dunn index gives stable clustering quality evaluations. However, the Davies-Bouldin index is recommended as it shows prominent performance compared to the Dunn index and C-index [9].
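As an illustration of the Dunn index described above (ratio of the minimum inter-cluster distance to the maximum intra-cluster distance), the sketch below computes it for a small, made-up set of two-dimensional points. It assumes Euclidean distance, the closest pair of points as the inter-cluster distance, and the cluster diameter as the intra-cluster distance; other variants of the index exist, and the reviewed studies may have used a different one.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Dunn index for a list of clusters, each a list of coordinate tuples.

    Assumptions of this sketch: Euclidean distance, single-linkage
    inter-cluster distance, and cluster diameter as intra-cluster distance.
    """
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    min_inter = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    max_intra = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    # All clusters are single points: separation is unbounded by convention.
    return min_inter / max_intra if max_intra else float("inf")


# Hypothetical 2-D feature vectors, grouped in two different ways.
compact = [[(0, 0), (0, 1), (1, 0)], [(5, 5), (5, 6), (6, 5)]]
mixed = [[(0, 0), (5, 5), (0, 1)], [(1, 0), (5, 6), (6, 5)]]

good_score = dunn_index(compact)
poor_score = dunn_index(mixed)
print(round(good_score, 3), round(poor_score, 3))
```

A higher value indicates compact, well-separated clusters, so the first grouping scores far higher than the second.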

OPEN RESEARCH CHALLENGES AND FUTURE DIRECTIONS

HIDS data availability
In this section, we present the main issues with the datasets used by the reviewed studies and their potential solutions.
i. Updated and complete public data needs to be used to evaluate the NLP-based HIDS. Firstly, the datasets should be kept up-to-date: with the changing threat landscape, applications, OS, and their SC change rapidly over time, making outdated datasets useless for evaluating modern HIDS [38,75]. Yet most of the public datasets are outdated (e.g., the top-3 most used datasets in our review, ADFA-LD, UNM, and DARPA, are almost a decade old). Thus, the concept drift issue [S83] and modern attack scenarios should be handled by releasing new versions of the existing datasets.
Secondly, although most current IT systems are multi-threaded, the available public datasets are not complete as they lack thread information and textual metadata, which are promising indicators for HIDS [65,72]. For example, ADFA-LD lacks metadata (e.g., parameters, return values) and NGIDS-DS lacks both parameters and more accurate timestamps. As an HIDS that considers only the temporal order of SC can be susceptible to mimicry attacks, this weakness can be avoided by combining other system artifacts [S61] (e.g., SC arguments, function calls, and other user-space information). To address these issues, Grimmer et al. [38] developed the Leipzig Intrusion Detection-Data Set (LID-DS), which is the first HIDS dataset containing SC and their timestamps, thread ids, and diverse metadata of several recent, multi-process, multi-threaded scenarios.
Though it has been publicly available since 2019, it is used by none of the recent primary studies. Hence, we suggest using the newly available, more complete HIDS datasets rather than the decade-old benchmark datasets.
To mitigate the above-mentioned issues of the public datasets, 23 studies used their own customized datasets which are often kept private due to the intellectual property issues, competitive advantage, or time and money spent [98]. It is recommended to make the datasets publicly available to support the validity and future research in this domain.
ii. NLP-based data augmentation is recommended to improve the HIDS performance. A prediction model's bias towards the majority benign class in the commonly used imbalanced datasets significantly affects the performance of the frequently used supervised learning-based HIDS (32/99). Only 4 studies attempted to balance the datasets and demonstrated the effectiveness of doing so using diverse methods such as oversampling (SMOTE [S60]) and generating malicious samples (GAN [S71], cycle-GAN [S14], and the NLP-based SeqGAN and Seq2Seq [S65]). An NLP-based research trend that has been gaining popularity since 2019 is the use of various text augmentation approaches (e.g., swapping/deleting words, inserting contextually similar words, and word insertion/replacement based on word occurrence statistics) [64,73,94]. While Qiu et al. [77], Kobayashi et al. [55], and Fadaee et al. [35] present the effectiveness of text augmentation in different classifiers, an NLP language model, and a machine translation model, respectively, there is a huge gap in the HIDS domain in investigating the applicability of these augmentation techniques for improving HIDS performance.
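The swap/delete style of text augmentation mentioned above can be sketched as follows, treating an SC trace as a sentence whose words are system call names. This is only an illustrative assumption about how such techniques might be transferred to HIDS data: the function names, the example trace, and the parameters are hypothetical, and whether a perturbed trace keeps its benign or malicious label is exactly the open question raised above.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of the trace with n_swaps random position swaps."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens, p, rng):
    """Drop each token with probability p, keeping at least one token."""
    tokens = list(tokens)
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


# Hypothetical SC trace of a minority-class (attack) sample.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
trace = ["open", "read", "mmap", "read", "write", "close"]

swapped = random_swap(trace, n_swaps=2, rng=rng)
deleted = random_delete(trace, p=0.3, rng=rng)
augmented_samples = [swapped, deleted]
print(augmented_samples)
```

Because the order of system calls carries meaning, such perturbations would need to be validated (e.g., by checking that a trained detector or a domain expert still assigns the original label) before the augmented samples are added to a training set.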
iii. Use of diverse data will enable the NLP-based HIDS to be platform-independent, improve detection capability, and demonstrate the scalability of the HIDS in real life. We observed that most studies used only a single dataset (e.g., ADFA-LD (34/99), UNM (15/99), and DARPA (3/99)), while only 24/99 studies used multiple datasets for the training and evaluation of HIDS. Firstly, the use of datasets from different OS will make the HIDS platform-independent (e.g., ADFA-LD and NGIDS-DS are Linux-based and ADFA-WD is from Windows OS), and the use of multiple datasets, as in some studies (e.g., [S64, S72, S84, S92]), will help to demonstrate the generalizability of the proposed NLP-based HIDS solution. NLP-based approaches are recommended as an NLP-based semantic approach provides high-level portability between diverse OS [S40]. Secondly, the use of a large dataset is recommended to represent the real-world scenario and to ensure scalability without compromising accuracy. However, as seen in the reviewed studies (e.g., [S15, S71, S78, S88]), increasing the size of the dataset may increase the training and prediction time of the applied ML/DL model, making it unrealistic to use in the industrial setting. Thirdly, to improve the HIDS detection efficiency, as in some studies (e.g., [S30, S48, S69, S82, S99]), we suggest combining heterogeneous data sources (e.g., SC, network events, and data obtained from a real attack environment (War Games)).

Efficient feature engineering
25 studies used semi-supervised learning, which uses only benign data for training; the resulting lack of attack data makes it more challenging to select discriminative features. Tedious manual feature extraction and a lack of focus on feature selection are the two main issues, which we discuss along with their potential solutions.
i. NLP-based automated feature extraction is recommended for reliability and adaptability as real-life HIDS deals with an enormous amount of continuous data with rapidly changing threat landscape. So, manually extracted features may become outdated due to feature drift [11], and can be easily evaded by attackers. Still, 41/99 studies used manual feature extraction process and only a few studies adopted NLP-based automated feature extraction (e.g., CNN and LSTM based contextual features) and use of multi-level feature extraction (e.g., CNN-LSTM) from SC sequences. However, there is still a demand for taking advantage of recent more sophisticated NLP techniques in HIDS. For example, use of more feature learning layers using VDCNN [26], use of contextual word embedding techniques (e.g., Bidirectional Encoder Representations from Transformers (BERT) [33]), deep contextual embedding (e.g., Embeddings from Language Models (ELMo) [76]), and many more advanced NLP techniques are yet to be explored in HIDS domain. The use of the advanced NLP techniques presents a huge opportunity for future research.
ii. Automated feature selection is highly recommended for reducing dimensionality, computation time, overfitting, improving ML model interpretability, and prediction accuracy [16]. Only 22/99 studies explicitly mentioned their feature selection method, where the most used methods (e.g., freq-based, PCA, filter) require manual threshold selection [33].
However, the attention mechanism, which was adopted in DL models by only 4/99 studies, can mitigate this limitation. Thus, we emphasize the use of the attention mechanism combined with automated feature extraction to discover more prominent and relevant features from HIDS data.
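The idea that attention replaces a manually chosen feature threshold with learned soft weights can be sketched as a single attention-pooling step over the token embeddings of an SC sequence. This is a minimal illustration rather than the architecture of any reviewed study: the embeddings and the query vector below are made-up numbers, and in a real DL model both would be learned during training.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(embeddings, query):
    """Scaled dot-product attention of one query over token embeddings.

    Returns the attention weights (one per token) and the weighted sum
    of the embeddings (the context vector passed to a classifier).
    """
    dim = len(query)
    scores = [
        sum(e_d * q_d for e_d, q_d in zip(e, query)) / math.sqrt(dim)
        for e in embeddings
    ]
    weights = softmax(scores)
    context = [
        sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)
    ]
    return weights, context


# Hypothetical 2-dimensional embeddings for three system calls in a trace.
embeddings = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
query = [1.0, 1.0]  # stands in for a learned query vector

weights, context = attention_pool(embeddings, query)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The weights sum to one and concentrate on the token whose embedding aligns best with the query, so no hard cut-off on feature relevance has to be selected by hand; inspecting the weights also gives a rough indication of which system calls influenced the prediction.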

Real-world application and evaluation of HIDS
We propose the following research directions as the real-life applicability of HIDS is still in question.
i. Advanced NLP-based detection techniques that have not been explored in the HIDS domain (e.g., Hierarchical Deep Learning for Text (HDLTex) [56] and text classification based on Google's Universal Sentence Encoder (USE) [23]) create a large potential research horizon for improving the accuracy and reducing the FAR of HIDS. Though TCN is an outstanding alternative to recurrent architectures, it has been used in only 1 study [S8]. Besides, NN with more hidden layers usually provide better performance [26], which opens up a research direction to explore deeper versions of CNN and RNN; 3 studies considered this as future work. A study [S40] showed that the use of an NLP-based semantic method inherently makes the HIDS resilient to mimicry attacks. Given the recent prevalence of adversarial attacks targeting models in the cyber security domain [79], we highly recommend considering the adversarial resilience of HIDS by adopting methods from other research areas such as randomness, data sanitization, adversarial training, and semantic methods.
iv. Enhancing HIDS model interpretability improves the transparency of attack predictions, which helps practitioners to efficiently debug and refine the model or data for improved performance [25]. Only a few studies (e.g., [S16, S75, S93]) presented the important features and the reasoning behind the model's prediction result. HIDS research can draw on the interpretable NIDS research area [67,68,93]. Thus, the applicability of HIDS interpretability in terms of (1) model-based interpretability (creating a model that is interpretable by nature) and (2) post hoc interpretability (applying an interpretability approach after training a black box model) [71] needs to be explored. Besides, NLP approaches such as locating significant n-grams in a sentence based on the intermediate outputs of a CNN can provide interpretability [43].
v. Real-time detection is a must for HIDS as a business may fall victim to attacks every 40 seconds [34]. However, only 22/99 studies mentioned the time metric for the HIDS evaluation. While the reviewed studies gained high accuracy by applying DL approaches, they incurred a long training time (e.g., study [S63] using LSTM+VED required 6h 45min, and study [S5] using LSTM and GRU required 50.03min and 43.53min, respectively). Firstly, CNN approaches are more time-efficient, as RNN-based (LSTM/GRU) methods encode the input tokens sequentially and the operations in RNN-based network structures cannot be parallelized, which results in low efficiency [52]. Secondly, hardware support such as GPU can be used to improve the HIDS time-efficiency [85,87]. Thirdly, as HIDS deployed in the industry have to deal with a huge volume of data (e.g., 1 trillion security events are generated by HP per day [21]), it is highly recommended to use big data technologies (e.g., Spark, Kafka, and Hadoop) for highly scalable security data processing.
vi. Integration and orchestration of HIDS in SOC is significant, as a Security Operation Center (SOC) of an organization uses hundreds of security tools [47] including HIDS. Even the use of multiple HIDS to collect data from different sources is a common practice in SOC [19]. Hence, HIDS should be able to collect and process data from diverse tools (e.g., MISP (Malware Information Sharing Platform) [69] shares cyber security indicators that can be integrated with HIDS). Further, the never-ending flow of alerts generated by HIDS tools needs to be correlated by a Security Information and Event Management (SIEM) system or a Security Orchestration, Automation, and Response (SOAR) platform.
For example, the SIEM tool Splunk [86] enables searching, analyzing, and visualizing the alerts gathered from the tools (e.g., HIDS) of the SOC. Hence, the data interpretability and interoperability of the HIDS tools is a key requirement for ensuring the integration and orchestration [45] of HIDS in a SOC, which has not been a focus of the reviewed studies.
Recently, only 4/99 studies [S13, S29, S43, S55] utilized semantic ontology to integrate IDS/IPS sensor information (e.g., HIDS, NIDS), sensor data streams, malware data (Symantec's website), web text-data, domain expert knowledge, and cyber-threat intelligence, and used semantic reasoning over the ontology for detecting intrusions. Thus, to mitigate the gap between academia and industry, HIDS research should focus on the existing tools and techniques available in the industry and incorporate them to evaluate and integrate the proposed HIDS in a SOC scenario. This leads to the following future research directions: comparative evaluation of proposed HIDS with the existing open-source HIDS tools (e.g., WAZUH/OSSEC [74]), incorporating the available updated threat intelligence data sources for detection, and ensuring that the HIDS alerts/output are interpretable by the SIEM/SOAR tools for smooth integration and orchestration of proposed HIDS in real SOC scenarios. Semantic ontology and reasoning can be further explored to integrate and orchestrate [45] HIDS with other security tools, which will validate the use of the proposed HIDS in a real SOC scenario.

THREATS TO VALIDITY
We carefully followed the guidelines provided in study [53] to design and conduct our SLR. We adopted suitable steps to mitigate the effects of identified threats to validity of this SLR as presented below:
Search Strategy: Missing some relevant studies is a common threat of an SLR. To minimize this effect, we used Scopus (the most comprehensive search engine with the largest indexing system [28,59]) and complemented it with the two most frequently used digital libraries IEEE Xplore and ACM Digital Library [61]. Moreover, we ran a series of pilot searches to find a search string to ensure the retrieval of the relevant papers already known to us. Besides, both forward and backward snowballing techniques were conducted to find other relevant papers overlooked by the search string.
Selection process: The study selection process may be affected by the authors' subjective judgment. To mitigate this threat, we performed a multi-step process (Section 3.3) with clearly specified inclusion-exclusion criteria to select the relevant studies. We also defined specific quality assessment criteria to exclude low-quality papers. At each step of the selection process, we discussed and resolved ambiguities to minimize the selection bias.
Data Extraction and Synthesis: Results and findings may be influenced by human error and author bias in data extraction, data analysis, and data interpretation. To address this issue, a data extraction form was created and iteratively improved to collect sufficient and consistent information required to answer the RQs. Besides, all the data-extraction activities, data synthesis, and interpretation of our quantitative and qualitative analysis were cross-checked by the authors. All disagreements were resolved through discussion.

CONCLUSION
This paper presents an SLR aimed at systematically and rigorously selecting and analyzing the existing literature on NLP-based HIDS. The findings are expected to form an evidence-based body of knowledge of taxonomic analysis of the features, detection techniques, attacks, datasets, and evaluation metrics used in NLP-based HIDS for practical use in the industrial setting. We synthesized 99 papers of the last decade on NLP-based HIDS. We aggregated the features and detection techniques to categorize NLP-based HIDS solutions into 6 categories. We mapped these NLP-based HIDS solutions to the attacks, datasets, and evaluation metrics to help practitioners select a suitable type for a specific application. We discussed the role of NLP in HIDS, the impact of attacks that are detected by HIDS, the efficiency and effectiveness of the proposed approaches. We also presented comparative analysis among our proposed categorizations of the used features, datasets, and evaluation metrics.
Our review aims to help researchers by providing an overview of this burgeoning research landscape. The increasing number of studies on NLP-based HIDS shows significantly growing attention among the research community, as 48 papers were published in the last 3 and a half years. Yet our review identified crucial open issues and proposed a roadmap of future work to help researchers mitigate those issues, as discussed in Section 5. The implications for the researchers are as follows:
(1) We recommend that researchers focus on real-time accurate HIDS leveraging advanced NLP techniques (e.g., text augmentation, NLP-based low-shot learning), adopting incremental learning, enhancing HIDS model interpretability with robustness to adversarial attacks using NLP, and semantic integration of HIDS in SOC.
(2) We encourage the researchers to explore the application of contextual and deep contextual embedding for adaptability to a huge amount of continuous data since NLP-based contextual features were dominantly used.
(3) Our findings advocate the demand for further research in DL-based HIDS as the recent trend of adopting DL approaches showed significantly better performance compared to the prevalently used traditional ML approaches.
(4) We recommend the researchers to further focus on time and resource-effective use of ensemble classifier structure as it improves the detection performance compared to the dominantly used base classifier structures.
(5) We encourage the researchers to make their dataset publicly available as 19 studies used private datasets including 6 real industrial datasets, where the inaccessibility to the dataset hinders advanced research.
(6) We encourage the use of MCC, which is a suitable metric for the imbalanced datasets that are mostly used in HIDS.
(7) Our comprehensive taxonomy is expected to help researchers to frame future research in this domain.
Our findings are expected to help practitioners to potentially utilize the NLP-based techniques as we highlight the existing prevalent practices and considerations in NLP-based HIDS. The implications for the practitioners are as follows:
(1) We recommend that resource-constrained small and medium enterprises adopt the most used traditional ML approaches, whereas resource-rich large corporations can investigate the applicability of complex DL approaches, as DL can cope with large-scale data.
(2) We encourage the practitioners to publicly share the recent attacks' details encountered in a real industrial setting. Sharing the attack signature will help to store the up-to-date signature for signature-based detection and will help to evaluate the anomaly-based models' capability to detect new attacks. The most reported attack types in the reviewed studies were ACE, R2L, and U2R.
(3) For securing critical infrastructure, we suggest the practitioners to adopt the big data frameworks and sophisticated hardware support such as HPC and GPU for real-time detection by HIDS.
(4) Our findings guide practitioners by presenting the prevalent practices, such as the dominant use of the semi-supervised learning approach due to the lack of balanced datasets, the use of multi-class classification to detect an attack type, and the use of the anomaly-based approach to detect unknown attacks (e.g., zero-day attacks). We recommend that practitioners use a hybrid approach to gain the benefits of both signature-based and anomaly-based approaches.
(5) We recommend the practitioners to train and validate the HIDS model with their industry-specific data before the deployment of the target HIDS, as our review found that 75 studies used public datasets, which are mostly outdated and lack sufficient and diverse attack instances.
(6) Practitioners are encouraged to analyze the trade-off between accuracy and required time while choosing an HIDS model to achieve the maximum detection accuracy with a lower false alarm rate at minimum processing time.