Neural Network-Based Voting System with High Capacity and Low Computation for Intrusion Detection in SIEM/IDS Systems

Integrating intelligence into intrusion detection tools has received much attention in recent years. The goal is to improve the detection capability of SIEM and IDS systems in order to cope with the increasing number of attacks that use sophisticated and complex methods to infiltrate systems. Current SIEM and IDS systems involve many processes that work together to collect, analyze, detect, and send notifications of failures in real time. Event normalization, for example, requires significant processing power to handle network events. Adding heavy deep learning models therefore demands additional resources from the SIEM or IDS tool. This paper presents a majority system based on a reliability approach that combines simple feedforward neural networks, as weak learners, and produces high detection capability with low computational resources. The experimental results show that the approach yields a classification model with high accuracy and that its performance is superior to that of complex resource-intensive deep learning models.


Introduction
It is no secret that Internet access has become an indispensable part of life. In fact, most businesses and government institutions operate online. However, in addition to the important benefits and services offered daily by computer networks, they also raise network security issues as many unscrupulous cyberattackers are also active on the Web, waiting to hit vulnerable systems. The integration of cybersecurity tools and threat detection has become increasingly important to prevent downtime. Security devices such as Security Information and Event Management, or SIEM [1], and Intrusion Detection Systems, or IDS [2], have become a core part of monitoring and defending networks and hosts against intrusions.
Unfortunately, this has become quite difficult as attacks are evolving rapidly in terms of complexity and sophistication, especially attacks with signatures that are not recorded in public databases (0-day attacks) and those that target specific systems and vulnerabilities. Such attacks can go unnoticed by most organizations' defense mechanisms and infiltrate the target network. Indeed, in the 2018 data breach investigations, we see that 68% of breaches last year took months or longer to be discovered [3], although these breaches happen within a few minutes or even seconds.
Under these constraints, researchers and security experts try to provide intelligence, adaptation, and pattern recognition for SIEM and IDS systems. In particular, they use machine learning models to improve the efficiency and accuracy of these systems by providing historical data to these models. This gives the algorithm or the model more "experience," which can, in turn, be used to make better decisions or predictions. For this reason, machine learning techniques are often preferred over traditional rule-based algorithms and even human operators [4], and they are widely used in multiple fields and industries [5]. The problem is that machine learning models are particularly demanding in terms of the computational resources needed to train and calibrate them. In addition, SIEM/IDS systems run other resource-intensive processes, such as collecting and normalizing events, alongside other advanced analytics modules. This makes the use of machine learning in such systems more challenging. This research presents a new approach to create a machine learning model, by combining small submodels, with low computation and high detection rate.

Motivation: Why Combining Models Is Interesting?
There is an interesting area in the field of machine learning called "ensembling," where researchers try to increase model capacity without a proportional increase in computation. This is due to the fact that some models, especially neural networks, need significant computational resources to train when facing large datasets or complex classification problems. In fact, some researchers have even started investigating other approaches that trade classifier accuracy for speed and memory usage [6]. Others focused instead on externalizing some of the heavy processes in SIEM systems in order to leave resources available for correlation and advanced data analysis techniques [7].
Model ensembling is a very powerful technique to increase the accuracy of models on a variety of machine learning tasks. It consists of several base models combined together in order to produce one optimal model that will best predict the desired outcome. Some ensembling methods require the base models to be pretrained and only combine their predictions on the test set; other methods require multiple rounds of training on different chunks of the train set. This paper examines the first category, especially Majority Vote Systems, where the prediction of the ensemble represents the majority prediction of the base models. The benefits of majority voting can be seen in the following example.
Suppose a SIEM/IDS system uses three neural-based binary classifiers, each with 70% classification accuracy and with independent errors. For an observed event, there are four possible scenarios:
(i) Scenario 1: all three classifiers are correct:
$$0.7 \times 0.7 \times 0.7 = 0.343. \tag{1}$$
This value can be interpreted as the probability that the three classifiers predict the same (correct) class.
(ii) Scenario 2: two classifiers are correct and one is wrong:
$$3 \times (0.7 \times 0.7 \times 0.3) = 0.441. \tag{2}$$
This value can be interpreted as the probability that exactly two classifiers predict the same (correct) class. This means that the majority vote corrects an error in ≈44% of cases.
(iii) Scenario 3: two classifiers are wrong and one is correct:
$$3 \times (0.3 \times 0.3 \times 0.7) = 0.189. \tag{3}$$
(iv) Scenario 4: all three classifiers are wrong:
$$0.3 \times 0.3 \times 0.3 = 0.027. \tag{4}$$
So, by adding the outputs from the first and second scenarios (0.343 + 0.441 = 0.784), we see that the output of the majority vote ensemble is correct with a probability of ≈78%.
The combination of models or learners can sometimes improve the overall accuracy; that is, the prediction accuracy of the ensemble model is greater than that of each base model. But that is true only when two conditions are met:
(a) The base models are weak learners. This means that they are classifiers that are only slightly correlated with the target label (they can often label examples better than random guessing). They also have the distinction of being computationally simple, unlike strong classifiers that are arbitrarily well correlated with the actual classification.
(b) The weak learners should be uncorrelated in their predictions. This is really important because if all learners predict the same class (even with different probabilities), then the ensemble model will always predict that same class. So, a lower correlation between ensemble model members is needed in order to increase the error-correcting capability.
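The four scenarios above are the terms of a binomial distribution, so the arithmetic can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and it assumes the classifiers err independently, which is the idealized case that condition (b) aims for.

```python
from math import comb


def scenario_probability(p: float, n: int, k: int) -> float:
    """Probability that exactly k of n independent classifiers are correct,
    when each one is correct with probability p (independence is assumed)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def majority_vote_accuracy(p: float, n: int = 3) -> float:
    """Probability that a strict majority of the n classifiers is correct."""
    smallest_majority = n // 2 + 1
    return sum(scenario_probability(p, n, k) for k in range(smallest_majority, n + 1))


# Scenarios 1 to 4 of the three-classifier example (3, 2, 1, 0 correct voters).
scenarios = [scenario_probability(0.7, 3, k) for k in (3, 2, 1, 0)]
ensemble_accuracy = majority_vote_accuracy(0.7)
```

Under these assumptions the ensemble accuracy is 0.784, against 0.7 for each individual classifier; correlated errors would reduce this gain.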

Contributions.
This paper presents a machine learning model that can be integrated with SIEM/IDS systems for intrusion detection. Our model treats security events from different angles using different weak models and classifies the nature of each event according to the majority of predictions. This article presents the following main findings:
(i) A simple and effective intrusion detection model is developed using a majority system based on reliability. This model can be easily integrated in current SIEM and IDS systems as it was developed following best practices in machine learning.
(ii) An investigation of some of the techniques used to improve the overall accuracy of the whole model using weak learners is presented.
(iii) A novel approach is used to create weak learners using only artificial neural networks.
(iv) The proposed model proved to be of high capacity without a proportional increase in the resources used. The model has also shown promise when compared to other current complex models.
The remainder of this paper is structured as follows: Section 2 examines the related work on intrusion detection using neural networks in SIEM and IDS systems. In Section 3, we present the components of our majority model based on reliability. The model combines the outputs of weak neural networks using a novel approach. In Section 4, we implement our approach and compare it with the most relevant related work. Finally, we conclude the paper in Section 5.

Related Work
There has been considerable effort to apply neural networks in SIEM/IDS systems. These applications can be classified into two groups: intrusion detection and user behavior analysis. The goal is to create a neural network-based system that automatically learns complex normal behavior (or a normal security event) and at the same time knows what suspicious activity (or an attack signature) looks like.
Generally speaking, before developing a machine learning model, the data should go through certain preprocessing steps before being fed to the model. The authors in [8] applied different preprocessing and discretization techniques to the NSL-KDD dataset and found that these steps have a big effect on the execution of the classification algorithms. Also, we argue that each preprocessing step has a specific purpose and interpretation and cannot be used out of context if the final goal is to have a machine learning model that operates well in production and not just on the dataset used. In this context, we divide past literature on the applications of neural networks in intrusion detection or in user behavior analysis into two categories based on the preprocessing steps used, specifically, depending on how the authors transform the nominal or categorical features of the dataset. We distinguish works that use one-hot encoding as a preprocessing step from others that implement other techniques, such as integer encoding, to transform nominal features, because this choice affects the relationship between variables in the data. This will be discussed in later sections.
(1) One-Hot Encoding (Used Preprocessing Step)
The authors in [9] investigated the impact of the cost and the training function (optimizer) on the accuracy of neural network classifiers within SIEM/IDS systems. This work evaluates 37 feedforward neural networks, where each model uses different cost and training functions. The best model, combined with Support Vector Machines (SVM), achieves an accuracy of 81.8% on the well-known NSL-KDD dataset [20]. In [10], the authors used a deep learning approach by implementing a two-stage classifier: a sparse autoencoder and softmax regression (or SMR) based NIDS. The sparse autoencoder is used for feature learning and outputs relevant features to the softmax regression for classification using the NSL-KDD dataset. For 2-class classification, the authors achieved a total classification accuracy of 78.06% (first stage, 88.39% accuracy rate, and SMR achieved 78.06%). Another work that used a similar approach was done by the authors in [11]. This work used a variant of sparse autoencoders for unsupervised feature learning, and a logistic classifier was then utilized for classification on the NSL-KDD dataset. The authors achieved an accuracy of 87.2%. In another effort to propose a suitable intrusion detection model using a deep learning approach, the authors in [12] proposed RNN-IDS, an intrusion detection system using recurrent neural networks that can improve the accuracy of current intrusion detection systems. The model achieved 81.29% accuracy using 80 hidden nodes in binary classification. An effective network intrusion detection system, using an architecture called Channel Boosted and Residual learning based deep Convolutional Neural Network (CBR-CNN), is developed using one-class classification in [13]. The model uses a Reconstructed Feature Space that is built using only normal traffic. The anomalies in the train set are generated far away in the reconstructed feature space. Next, Channel Boosting is used to improve classifier performance by increasing diversity in the classifier's input feature space. Finally, a CNN classifier is used to classify events. The model achieved an accuracy of 89.41%. The authors showed that these results are better than other classification algorithms such as J48, Naive Bayes, random forest, and multilayer perceptron. This work was implemented using the NSL-KDD dataset. The authors in [14] proposed a deep neural network to develop an effective IDS that classifies and predicts attacks in an Internet of Medical Things (IoMT) environment. In this work, the authors also argue that it is recommended to use one-hot encoding on the categorical characteristics in order to avoid creating numerical values that can be misunderstood by the algorithm due to ordering issues. After preprocessing, optimizing, and tuning of the deep neural network by hyperparameter selection methods, the model was tested on a Kaggle intrusion detection dataset, where it showed a 15% increase in accuracy and a 32% reduction in the time required for training and classification.
(2) Other Preprocessing Steps
The authors in [15] used neural networks as an event classifier within SIEM systems. The classifier, called CONTEXTUAL, is used with another subsystem called GENIAL, based on genetic programming, to improve the correlation engine of SIEM systems. The neural network-based subsystem classifies all the events collected by the SIEM system, and GENIAL generates new correlation rules according to the neural network classification. In their paper, the authors discussed the performance of their correlation engine but did not provide information about the parameters of the neural network. The authors in [16] proposed a user behavior classifier based on neural networks to detect malicious activities that use valid credentials and standard administrative tools to evade detection. In their work, they implemented several feedforward and recurrent neural networks with different hidden layers and different numbers of epochs in order to identify a suitable model for user behavior analysis. According to the authors, the best feedforward neural network achieved 98% accuracy and the best recurrent neural network achieved 97% accuracy. The authors in [17] proposed an anomaly network intrusion detector constructed on the KDD dataset. The model was built by studying 48 IDSs (neural networks) and by studying the most important parameters that, according to the authors, are input features, normalization function, number of hidden nodes, and the activation function. At the end of the study, the authors selected the two best models and implemented them as a network intrusion detection system. On a different dataset, the authors in [18] proposed an anomaly intrusion detector using a hybrid principal component analysis (PCA)-firefly based machine learning model to classify intrusions and attacks. The hybrid PCA and firefly algorithm was used to reduce dimensionality and training time. This helps the classifier, XGBoost in this case, to work better on the reduced dataset, which has better features and lower complexity. Machine learning was also applied in cybercrime classification. For instance, the work in [19] proposed a framework combining multiple models: Naive Bayes is used for classification, k-means is used for clustering, and TF-IDF vectorization is used for feature extraction. The goal was to support analytics regarding the identification, detection, and classification of integrated cybercrime offenses. The literature survey is summarized in Table 1.
From this literature review, regardless of the preprocessing steps used, we can make the following remarks:
(a) Compared to other machine learning techniques, neural networks have the important advantage of being flexible and can be adapted to particular use cases of intrusion detection in general. But this leads directly to the question of tuning to determine the optimal hyperparameters of a system based on a neural network.
(b) In order to tune an intrusion detection model, most researchers train many different candidate networks, select only one (the best), and discard the rest. This raises three issues: First, all of the effort and resources dedicated to training the remaining networks are wasted. Second, the selected model that had the best performance on the validation set might not be the one with the best performance on test data or new unseen data. Third, sometimes even unselected models can outperform the best one when labeling rarely observed events.
In this work, instead of working with a single model, we will use three models that will process security events from different angles and classify them according to the decision made by the majority.

Majority System Design Based on Reliability
In this section, in order to explain the different modules of our proposed system, we use a chronological methodology to justify the final model shown in Figure 1. The proposed system has three major components that work together to achieve a reliable event classifier:
(i) Data preparation module: this module applies different preprocessing techniques to the dataset before feeding it to the proposed model. Our goal is to propose a model that can be easily integrated into a SIEM/IDS system. In this context, we have followed best practices in this phase, since we noticed that some preprocessing techniques were applied out of context in prior work.
(ii) Weak neural networks module: this module uses feedforward neural networks as weak learners in our framework. We will detail later the structure of the neural networks used and how we managed to create weak learners using NNs.
(iii) Ensemble module: this module contains the ensemble model that combines the predictions of the weak learners in such a way that the majority vote is produced. This module also considers the reliability of the base learners.
More details on these modules will be given below. For the ensemble module, we will also present other models that were implemented during the experiment because they will play a role in the justification of the choice of a majority function based on a reliability system as an ensemble module.

Data Preparation Module.
In this study, we used the NSL-KDD (Network Security Layer-Knowledge Discovery Database) dataset as an input. It is a refined version of its predecessor, the KDD'99 dataset. This database was collected from the 1998 DARPA Intrusion Detection Evaluation Program that was prepared and managed by MIT Lincoln Labs to survey and evaluate research in intrusion detection.

Dataset.
NSL-KDD is a dataset proposed by Tavallaee after criticizing the inherent problems of KDD'99 in [21]. Because KDD'99 contains many duplicate records, a machine learning model is likely to be biased toward high-frequency attacks, which can distort the evaluation results on the test set and prevent the model from learning infrequent records, which are usually more damaging to networks. As a result, NSL-KDD comes with the following improvements:
(i) It does not include redundant and duplicate records in the train and test sets. Thus, the classifiers will not be biased towards more frequent records, and the performances of the learners are not biased by the methods that have better detection rates on the frequent records.
(ii) The number of records in the train and test sets is reasonable, which allows for affordable experiments on the complete set without the need to randomly select a small portion. As a result, the evaluation results of different research studies will be consistent and comparable.
More information on the improvements can be found in [21].
NSL-KDD is a collection of downloadable files available to researchers. They are listed in Table 2.

Data Preprocessing.
In each record of the NSL-KDD dataset, there are 41 different features of the stream and a label to indicate whether the record is an attack or a normal flow. Each feature can have a categorical/nominal value, such as the protocol used in the connection (Protocol_type = TCP, UDP, or ICMP), or a numeric value, such as the duration of the connection. In this subsection, we detail the steps we followed to prepare the dataset:

Table 1: Summary of the literature survey.

Reference | Model | Dataset | Accuracy (%)
El Hajji et al. [9] | NN with best cost and training function | NSL-KDD | 81.8
Javaid et al. [10] | Sparse autoencoder with SMR | NSL-KDD | 78.06
Gurung et al. [11] | Sparse autoencoder with LR | NSL-KDD | 87.2
Yin et al. [12] | RNN based IDS | NSL-KDD | 81.29
Chouhan et al. [13] | CBR-CNN based IDS | NSL-KDD | 89.41
Maddikunta et al. [14] | DNN with PCA-GWO | Kaggle dataset | 99.9
Suarez-Tangil et al. [15] | NN and GP | - | -
Ussath et al. [16] | RNN and NN | - | 89
Chiba et al. [17] | NN | KDD | 99.62
Bhattacharya et al. [18] | PCA-firefly based XGBoost | Kaggle dataset | 99.9
Gadekallu et al. [19] | Naive Bayes | Kaggle and CERT-In repositories | 99.9

(a) Input Features
The first important phase in constructing an effective intrusion detection model is feature selection. It is the process of selecting a subset of relevant features that will allow learning models to perform faster and more efficiently in order to improve the classification accuracy. For the NSL-KDD dataset, a number of researchers have studied the impact of some features on the accuracy of classification. For instance, the authors in [22] showed that the attributes 9, 20, 15, 17, 19, 21, and 40 have minimal or no role in the detection of attacks. The authors in [23] mentioned that features 7, 8, 11, and 14 are not useful, since they contain almost all zero values in the dataset. In this context, these features were removed from the train and test sets in this study to retain only 29 features in total.

(b) Categorical Encoding
Categorical encoding refers to the process of assigning numeric values to nominal (nonnumeric) features in order to facilitate the processing task.
Since we are using only neural networks in our model, it is necessary to convert the textual data of the dataset into numerical values. Generally speaking, there are two approaches to encode categorical variables:
(1) One-hot (binary) encoding: it is a binary representation of nominal features where the categorical value is removed and a new binary variable is added for each unique nominal value.
(2) Integer encoding: it refers to the process of coding a categorical variable using integers such as 1, 2, and 3.
The main difference between the two approaches lies in the existence of an order relationship between the values of the nominal feature in question. Integer encoding only makes sense if the categorical variable has an order; for example, if we are studying a dataset that follows the Syslog format, then the ordinal feature Severity_Level will be encoded with numeric values from 0 to 7 according to the predefined levels of the log format (from Emergency to Debug). On the other hand, because one-hot encoding implies an assumption of independence between the records of the dataset, it is used on nominal attributes with no order.
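The two encodings can be contrasted with a short, dependency-free sketch. The helper names below are hypothetical and are not taken from the paper's implementation; the Syslog severity names follow the standard 0 to 7 ordering, and the protocol values are the three Protocol_type categories mentioned earlier.

```python
def integer_encode(values, ordered_categories):
    """Map each value to its rank in ordered_categories (suits ordinal features)."""
    rank = {category: i for i, category in enumerate(ordered_categories)}
    return [rank[value] for value in values]


def one_hot_encode(values, categories):
    """Map each value to a binary vector with a single 1 (suits unordered features)."""
    position = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        encoded.append(row)
    return encoded


# Ordinal feature: Syslog severity, where the order carries meaning.
SEVERITIES = ["Emergency", "Alert", "Critical", "Error",
              "Warning", "Notice", "Informational", "Debug"]
severity_codes = integer_encode(["Debug", "Emergency", "Error"], SEVERITIES)

# Nominal feature: protocol type, where no order exists.
PROTOCOLS = ["tcp", "udp", "icmp"]
protocol_vectors = one_hot_encode(["udp", "tcp", "icmp"], PROTOCOLS)
```

With integer encoding, "icmp" would be numerically twice as far from "tcp" as "udp" is, a relationship that has no meaning for protocols; the one-hot vectors keep all three categories equidistant.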
All the nominal features of the NSL-KDD dataset are independent and do not follow any order. For this reason, we used one-hot encoding to transform the nominal features. For example, Protocol_Type is encoded into 3 binary variables (tcp: (1, 0, 0); udp: (0, 1, 0); icmp: (0, 0, 1)). By applying this transformation to all the categorical features in the dataset, each connection record in NSL-KDD will be represented by 110 coordinates instead of 29. As a final remark, many researchers who worked on NSL-KDD (or the KDD dataset) used numerical encoding to represent the nominal features of the dataset. In that case, the model will assume a natural ordering between categories, which may lead to poor performance or unexpected results (predictions halfway between categories) [24]. With this in mind, we do not include papers that use this preprocessing step in our comparison.
(c) Data Normalization
The goal of this step is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. This is useful only when features have different ranges. Data normalization has the advantage of speeding up some machine learning algorithms. Generally, the two most effective attribute normalization methods for NSL-KDD preprocessing are mean range [0, 1] (min-max normalization) and statistical normalization (Z-score normalization), as mentioned by the authors in [25].
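As a rough illustration of the two normalization methods named above, the following sketch applies each of them to a single numeric column. The function names and the sample values are hypothetical, and the handling of constant columns (mapping them to zeros) is an assumption made here to avoid division by zero.

```python
from statistics import mean, pstdev


def min_max_normalize(column):
    """Rescale a numeric column linearly into the range [0, 1]."""
    lowest, highest = min(column), max(column)
    if highest == lowest:  # constant column: assumed to map to zeros
        return [0.0 for _ in column]
    return [(x - lowest) / (highest - lowest) for x in column]


def z_score_normalize(column):
    """Center a numeric column on its mean and scale by its standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:  # constant column: assumed to map to zeros
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical connection durations (in seconds) with a wide range.
durations = [0, 5, 10]
scaled = min_max_normalize(durations)
standardized = z_score_normalize(durations)
```

In a train/test setting, the minimum, maximum, mean, and standard deviation would normally be computed on the train set and reused unchanged on the test set.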

Pushing the Limits of Feedforward Neural Networks.
The goal of our study is to propose a voting system that uses neural networks to classify events based on majority decision. This suggests, from the preliminaries section, that the three neural networks should be weak learners with uncorrelated predictions. In this context, initial studies and tests were performed in order to create weak neural networks that differ by using the following: (1) Neural networks are simply strong learners that try their best to approximate a function or minimize the error to achieve the desired output, regardless of the dimension space or the parameters/constraints used. (2) This is a consequence of the first explanation: even when the parameters of each model are varied, the predictions of the trained models are still correlated. As discussed before, even with different prediction probabilities, all three models predict the same class (for most observations). Note that, in these initial experiments, we even varied the input features for each model, using a number of feature selection algorithms for each model. Our initial hypothesis was that the predictions of the base learners would be uncorrelated if they were trained on different dimension spaces. Although this is not completely false, it is hard to achieve when the base models are strong learners such as neural networks.
This explains why neural networks are not used as weak learners in model ensembling in many studies. However, in the documentation of JMP (John's Macintosh Project), which is a suite of computer programs for statistical analysis developed by the JMP business unit of SAS Institute [26], it is mentioned that neural networks can be used as base learners if the base model is a single-layer model with 1 to 2 nodes. Otherwise, the benefit of faster fitting can be lost if a large number of models are specified. For the NSL-KDD, using 1 or 2 nodes did not create a weak learner but rather an unstable one. So, in order to create a weak learner using neural networks, we propose the following approach depicted in Figure 2.
The input features are divided into three subsets; each neural network uses one subset during training and drops the rest. During training, the first neural network uses the first 30 features (0, 1, . . ., 29) and drops the rest, the second neural network uses the second 30 features (30, 31, . . ., 59) and drops the rest, and the last one drops the first 60 features and uses the features 60, 61, . . ., 89. This approach is similar to the dropout regularization technique but with the following differences:
(i) In dropout regularization, nodes are dropped randomly at each epoch. Using our approach, the learner drops the same input nodes in every epoch. We used this modified version of dropout regularization to train the neural networks on different subsets of features and then test the trained model on the whole set of features (the original dimension space, using 90 features).
(ii) The goal of dropout regularization is to improve the generalization of the model, whereas the goal here is to improve the specialization of each model.
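The fixed feature split described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the helper names are hypothetical, and "dropping" an input is modeled here by setting it to zero, which is an assumption about how the fixed dropout is realized.

```python
def feature_subsets(n_features=90, n_learners=3):
    """Split feature indices into contiguous, equally sized blocks, one per learner."""
    block = n_features // n_learners
    return [range(i * block, (i + 1) * block) for i in range(n_learners)]


def mask_features(record, kept_indices):
    """Return a copy of record where every feature outside kept_indices is zeroed.

    Unlike standard dropout, the same inputs are dropped at every epoch."""
    kept = set(kept_indices)
    return [value if i in kept else 0.0 for i, value in enumerate(record)]


subsets = feature_subsets()           # indices 0-29, 30-59, 60-89
record = [1.0] * 90                   # hypothetical preprocessed event
training_view = mask_features(record, subsets[1])  # what learner 2 sees in training
test_view = record                    # at test time, all 90 features are presented
```

Each learner therefore keeps the full 90-dimensional input layer, but only one block of 30 inputs carries signal while it is being trained.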
It could be argued that if each neural network were simply trained on different features, then the predictions of the models would differ, which could improve the performance of the ensemble model. However, the initial results revealed that this is not the case: even if each model is trained on a different subset of features, the test set is also mapped through the same transformation (i.e., it has the same features used during training). This detail is important for strong learners such as neural networks.

Ensemble Module.
In our initial tests, we used a simple averaging function so that we could focus on an approach to create weak learners based on neural networks that differ from each other. With that in place, we can now look more closely at the different techniques to combine the predictions of the pretrained neural networks. In this research, we have studied the following ensemble techniques:

(a) Hard Voting
(b) Mixture of Experts
(c) Majority Function
(d) Weighted Average
(e) Majority System Based on Reliability
Each technique was implemented separately in order to choose the best one for our study. Details of each ensemble technique are discussed as follows:

(a) Hard Voting
This is the simplest case of majority voting. Classes or labels are predicted via plurality (majority) voting of the classifiers. Note that here we consider the predicted classes and not the probability predictions of the classifiers.
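A minimal sketch of hard voting for three binary classifiers is shown below. The function name, the label convention (1 for "attack", 0 for "normal"), and the 0.5 threshold used to turn probabilities into labels are assumptions made for illustration.

```python
from collections import Counter


def hard_vote(probabilities, threshold=0.5):
    """Return the class predicted by most classifiers.

    Each probability is first turned into a label (1 = attack, 0 = normal);
    only the labels are counted, so the confidence of each voter is ignored."""
    labels = [1 if p > threshold else 0 for p in probabilities]
    return Counter(labels).most_common(1)[0][0]


attack_vote = hard_vote([0.9, 0.6, 0.1])   # two voters say "attack"
normal_vote = hard_vote([0.4, 0.2, 0.99])  # two voters say "normal"
```

Note that the second example is labeled "normal" even though one voter is almost certain of an attack, which is precisely the information that hard voting discards.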

Security and Communication Networks
(b) Mixture of Experts Layer
In this approach, each model, called an expert, is made to focus on predicting the right label for the cases where it is already doing better than the other experts, which causes specialization. This concept also aims to create specialized base neural networks, except that here it uses a manager function (another neural network) to do so. This manager monitors the experts' predictions to improve the overall accuracy and performance of the whole model, as shown in Figure 3. In this context, the authors in [27] proposed the Mixture of Experts (MoE) layer, which consists of a number of experts (namely, four), each a simple feedforward neural network, and a trainable gating network, which selects a sparse combination of the experts to process each input. All parts of the network are trained jointly by back-propagation. The MoE layer was also implemented in this study and was modified to use our 3 specialized weak neural networks, instead of using feedforward neural networks with the same architecture.
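The way a gating network combines experts can be sketched without any training code. In the sketch below the gate scores are passed in directly, whereas in a real MoE layer they would be produced by the trainable gating network from the input; the function names and the choice of keeping the top 2 experts are assumptions made for illustration.

```python
import math


def sparse_gate(gate_scores, k=2):
    """Keep the k highest gate scores and apply a softmax over them only.

    Experts that are not selected receive a weight of exactly 0."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    selected = ranked[:k]
    peak = max(gate_scores[i] for i in selected)  # subtracted for numerical stability
    exponentials = {i: math.exp(gate_scores[i] - peak) for i in selected}
    total = sum(exponentials.values())
    return [exponentials[i] / total if i in exponentials else 0.0
            for i in range(len(gate_scores))]


def moe_output(expert_probabilities, gate_scores, k=2):
    """Combine the experts' attack probabilities using the sparse gate weights."""
    weights = sparse_gate(gate_scores, k)
    return sum(w * p for w, p in zip(weights, expert_probabilities))


weights = sparse_gate([2.0, 1.0, -1.0], k=2)
combined = moe_output([0.9, 0.1, 0.5], gate_scores=[0.0, 0.0, -5.0], k=2)
```

Because the gate weights depend on the input, different experts can dominate for different kinds of events, which is the specialization effect described above.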

(c) Majority Function
Each neural network processes an event using its defined architecture and parameters and produces a probability p. This probability represents the classification accuracy of a neural network for the processed event. The objective of this phase is to use a majority voting system to classify security events. Since each neural network uses a different subset of input features, each model will produce a different classification accuracy p_i. If the majority of the neural networks classify an event as an attack, then the final decision should represent the majority vote. For this, we propose the following steps:
(1) Order the results (p_1, p_2, p_3) of each neural network in ascending order (r_1, r_2, r_3).
(2) Take the weighting function w(r).
(3) Calculate the weighted average:
$$f(r_1, r_2, r_3) = \frac{1 - w(r_2)}{2}\, r_1 + 0.5\, r_2 + \frac{w(r_2)}{2}\, r_3. \tag{6}$$
Varying the input features and the parameters of each model is a key factor, as it guarantees the absence of a relationship and equivalence between the probabilities produced by each neural network. After ordering the results of the neural networks, r_2 becomes the deciding value between labeling an event "attack" or "normal": if r_2 > 0.5, the final output is labeled "attack"; if r_2 < 0.5, the event is considered "normal."
(i) In our proposed function, we have moved the decision boundary from 0.5 to 0.6 to have a clear decision on the processed event. For instance, if r_2 > 0.6, then there is a clear consensus of two algorithms, which is codified in w(r) as a clear "attack" by two algorithms. The same reasoning holds for r_2 < 0.4 (two algorithms labeled the event as "normal").
(ii) In the gray area of w(r), where 0.4 ≤ r_2 ≤ 0.6, the formula gives a weight to both extreme answers (r_1 and r_3), which takes into account the fact that we do not really have a consensus. As r_2 changes from 0.4 to 0.6, the weight shifts gradually from r_1 to r_3, making sure the function f is continuous.
One issue we have noticed when using our formula is that the function becomes noncontinuous when r_2 crosses 0.5. For example, if r_1 = 0, r_2 = 0.49, and r_3 = 1, applying the average of the consensus votes ("normal") would result in (r_1 + r_2)/2 = 0.245. If r_2 changes slightly to r_2 = 0.51, the consensus vote changes to "attack," so the average of the consensus votes would be (r_2 + r_3)/2 = 0.755, which is a big change.
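The ordering, weighting, and averaging steps can be sketched as follows. The exact definition of w(r) is not reproduced in the text above, so the sketch assumes a piecewise-linear weight that is 0 below 0.4, 1 above 0.6, and rises linearly in between; this assumption matches the described properties but remains a guess. The function names are hypothetical, and `consensus_average` is included only to reproduce the jump around 0.5 for comparison.

```python
def w(r, low=0.4, high=0.6):
    """Assumed weight: 0 for r <= low, 1 for r >= high, linear in the gray area."""
    if r <= low:
        return 0.0
    if r >= high:
        return 1.0
    return (r - low) / (high - low)


def majority_function(p1, p2, p3):
    """Weighted average of equation (6), applied to the sorted probabilities."""
    r1, r2, r3 = sorted((p1, p2, p3))
    weight = w(r2)
    return (1 - weight) / 2 * r1 + 0.5 * r2 + weight / 2 * r3


def consensus_average(p1, p2, p3):
    """Plain average of the two agreeing votes, with a hard boundary at 0.5."""
    r1, r2, r3 = sorted((p1, p2, p3))
    return (r2 + r3) / 2 if r2 > 0.5 else (r1 + r2) / 2


# Behavior on both sides of r_2 = 0.5 for r_1 = 0 and r_3 = 1.
jump = (consensus_average(0.0, 0.49, 1.0), consensus_average(0.0, 0.51, 1.0))
smooth = (majority_function(0.0, 0.49, 1.0), majority_function(0.0, 0.51, 1.0))
```

Under this assumed w(r), the plain consensus average moves from 0.245 to 0.755 as r_2 crosses 0.5, while the weighted average only moves from 0.47 to 0.53.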
To solve this problem, we decided to design a kind of generalization of equation (2) by adding a fully connected final layer with softmax activation. The new designed function will have the following expression:

where α, β, and c are trainable parameters. The architecture of this ensemble module is depicted in Figure 4. The main difference between this architecture and that illustrated previously in Figure 1 is that here the neural networks are trained twice. The first time is when the 3 NNs are pretrained independently until an acceptable classification accuracy is reached. The second time is when we build a larger neural network with the 3 NNs in parallel and add a softmax layer; we then train only this last layer so that the parameters α, β, and c are configured to always follow the rule of the majority.
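A rough sketch of such a trainable final layer is given below. Several points are assumptions rather than the authors' design: the layer is modeled as a linear combination of the three voter outputs plus a bias passed through a sigmoid (which is equivalent to a two-class softmax), it is written in plain Python rather than PyTorch to stay self-contained, and it is fitted on synthetic confident votes labeled with the hard majority rule. All names are hypothetical.

```python
import math
from itertools import product


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_weighted_layer(votes, labels, epochs=2000, lr=1.0):
    """Fit the three trainable weights and a bias by batch gradient descent on log loss."""
    weights = [0.0, 0.0, 0.0]
    bias = 0.0
    n = len(votes)
    for _ in range(epochs):
        grad_w = [0.0, 0.0, 0.0]
        grad_b = 0.0
        for p, y in zip(votes, labels):
            z = sum(wi * pi for wi, pi in zip(weights, p)) + bias
            error = sigmoid(z) - y  # derivative of the log loss w.r.t. z
            for i in range(3):
                grad_w[i] += error * p[i]
            grad_b += error
        weights = [wi - lr * g / n for wi, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, p):
    """Probability of 'attack' given the three voter outputs p."""
    return sigmoid(sum(wi * pi for wi, pi in zip(weights, p)) + bias)


# Synthetic, confident voter outputs (assumption), labeled by hard majority.
votes = list(product((0.1, 0.9), repeat=3))
labels = [1 if sum(v > 0.5 for v in p) >= 2 else 0 for p in votes]
weights, bias = train_weighted_layer(votes, labels)
```

Only the last layer is fitted here; the voter outputs are treated as fixed inputs, which mirrors the second training round described above.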

(e) Majority System Based on Reliability
Another solution to the issue of the Majority Function (M.F) is to have one reliable voter decide the class of an observed event when $r_2$ is close to 0.5. This is done using a reliability system, as depicted in Figure 5, which will be used interchangeably with the Majority Function.

For this, we propose the following steps with a modified version of the Majority Function:
(1) Order the results $(p_1, p_2, p_3)$ of each neural network in ascending order $(r_1, r_2, r_3)$.
(2) Take the function
(3) Calculate the weighted average:
(i) By following the same reasoning as before, the final decision of the Majority Function is clear when $r_2 > 0.6$ (or $r_2 < 0.4$).
(ii) If $r_2$ is near 0.5, the most reliable voter, which correctly labeled most of the events during training, will classify events in the test phase.
This approach proved to be the best and was used in the final model. The details of operation and reasoning of the majority function based on reliability are presented in the implementation section.

Experimentation and Discussion
In order to analyze our proposed approach, we implemented the whole model in Python using the PyTorch library. Details of the hardware and software used during the implementation are given as follows:

Experimentation.
We used the files KDDTrain+.csv and KDDTest+.csv for training and testing, respectively. The sample sets were randomized in order to avoid any possible bias in the presentation order of the sample patterns to the ANNs. 20% of the training set was used for validation. Details of the architecture of each specialized neural network are shown in Table 3. Note that we have tried to vary most of the parameters studied by authors interested in intrusion detectors based on neural networks. The variation of these parameters also makes the predictions of the different learners differ. However, since we combine these predictions, we have not modified some parameters such as the activation function, so that the predictions are in the same range of values. Figure 6 shows the performance of the three base learners. Each neural network has been trained for 30 epochs. Note that although we used a validation set, we did not apply any regularization techniques such as early stopping, because our goal is not to improve the generalization of the models but to create specialized models. Thus, we might see some overfitting for some base learners (e.g., NN3).
We remind the reader that our approach to creating weak neural networks follows the same training/testing procedure used in dropout regularization:
To evaluate the performance of the weak neural networks and overall modules, we used 4 evaluation metrics that take the following parameters into account:
(i) True positive (TP) represents the correct classification of an intrusion
(ii) False positive (FP) is the incorrect classification of a normal user taken as an attack
(iii) True negative (TN) represents a correct classification of a normal activity
(iv) False negative (FN) is an instance where the intruder is incorrectly classified as a normal activity
Accuracy is the most intuitive performance measure and it is simply the ratio of correctly predicted observations to the total observations:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations:

$$\text{Precision} = \frac{TP}{TP + FP}.$$

Recall (sensitivity) measures the proportion of the positive values that are correctly classified:

$$\text{Recall} = \frac{TP}{TP + FN}.$$

F-score is the harmonic mean of precision and recall. Therefore, this score takes both false positives and false negatives into account:

$$\text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$

Table 4 shows the classification accuracy of the base learners. Next, each ensemble model was implemented and compared; results are summarized in Table 5.
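For illustration, the four metrics can be computed from the confusion counts as in the following sketch. The function name and the example counts are hypothetical and are not taken from the experiments reported here.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Illustrative counts only (not results from this study).
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```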
According to Table 5, the worst performance of the ensemble models is the Hard Voting approach. This is because it considers less information compared to other approaches (the binary output of the voter).
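The loss of information in hard voting can be illustrated with a small sketch. The function names are hypothetical, and the mean probability is shown only to make the discarded information visible; it is not one of the ensemble methods evaluated here.

```python
def hard_vote(probabilities, threshold=0.5):
    """Return 1 ("attack") if most voters exceed the threshold, else 0 ("normal")."""
    attack_votes = sum(p > threshold for p in probabilities)
    return int(attack_votes > len(probabilities) / 2)


def mean_probability(probabilities):
    """Average of the raw voter outputs, which hard voting never looks at."""
    return sum(probabilities) / len(probabilities)


# Two barely positive voters outvote one very confident "normal" voter.
weak_case = (0.51, 0.52, 0.01)
# Two very confident "attack" voters and one borderline voter.
strong_case = (0.99, 0.98, 0.49)
```

Both cases receive the same hard vote ("attack"), although the raw probabilities behind them are very different.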
The weighted layer, on the other hand, showed a moderate improvement in all performance metrics, but the results were not significant enough to be considered. Also, the weighted layer required a second round of training to calibrate the trainable parameters α, β, and c, which does not meet the purpose of this study. The most interesting ensemble approaches for us were the majority function and the MoE layer because they both improved the classification accuracy of the voters without proportionally using more resources for training. Although we initially expected that the MoE layer would significantly improve the results because the whole model was designed for this specific purpose, this was not the case, and we explain this as follows: the MoE layer uses the Gater in order to train only a tiny part of the network (namely, 3 experts) given one training example, whereas, usually, we would have to propagate the input through the whole network (in this case all experts). Each expert therefore sees not all samples but just a subset and specializes (becomes an expert) on those. In other words, the learners in the MoE layer become experts after the training. In addition, our approach uses learners trained only on a subset of features, so they become experts on these features. So, using learners that are already experts on a subset of features to recognize a certain type of observation did not help the overall system, which is the reason why we decided to move on and try to enhance the Majority Function instead.
One issue we mentioned in the Majority Function, which the weighted layer did not solve, is how the decision changes when one of the voters is in the gray area (near 0.5) and the other voters clearly choose different classes (one voter votes >0.6 and the vote of the second model is <0.4). So, for this issue, instead of trying to find optimal weights α, β, and c for the combination $\alpha p_1 + \beta p_2 + c\, p_3$, we propose to select one voter that proved to be reliable during training in the gray decision area and have that voter decide the nature of processed events on behalf of the other voters, hence the reliability system illustrated in Figure 5.

Operation of the Majority Function Based on Reliability.
The goal of the reliability system is to help the ensemble model (Majority Function) overcome the gray area that the voters reach. Specifically, when two voters are decisive and vote for different classes and one voter is indecisive (gives a probability near 0.5), the reliability system is used to choose the most reliable voter of the two decisive voters. The goal is to have the reliable voter decide the nature of the event in gray areas. The review module is launched using training data. The voters (base learners) are trained and the ensemble is created using the Majority Function proposed before. Next, the review module is executed to check if the final prediction (on training data) mislabeled an event and the issue mentioned before is present. In this case, the reliability algorithm is launched to reward voters when they correctly classified an event. The reward points are different; the highest point is awarded to the voter who correctly predicts an event when the others do not.
In the testing phase, the Majority Function is used only when there is a clear consensus of two voters (at least two votes have probabilities >0.6 or <0.4). If one vote is in the gray area, the reliability system is executed, and the decision is made by the voter that proved most reliable during training.
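The operation described above can be sketched as follows. This is only an illustration under stated assumptions: the weight function w is assumed to be piecewise linear between 0.4 and 0.6, the reward values (2 points for a voter that is correct alone, 1 point otherwise) are placeholders, and all function names are hypothetical.

```python
LOW, HIGH = 0.4, 0.6  # bounds of the gray area


def w(r):
    """Assumed piecewise-linear weight function."""
    return 0.0 if r < LOW else 1.0 if r > HIGH else (r - LOW) / (HIGH - LOW)


def majority_function(votes):
    r1, r2, r3 = sorted(votes)
    weight = w(r2)
    return (1 - weight) / 2 * r1 + 0.5 * r2 + weight / 2 * r3


def has_consensus(votes):
    """True if at least two voters are clearly above 0.6 or clearly below 0.4."""
    return sum(p > HIGH for p in votes) >= 2 or sum(p < LOW for p in votes) >= 2


def review(train_votes, train_labels, solo_reward=2, shared_reward=1):
    """Review module: reward voters on mislabeled training events without consensus."""
    scores = [0, 0, 0]
    for votes, y in zip(train_votes, train_labels):
        predicted = int(majority_function(votes) > 0.5)
        if has_consensus(votes) or predicted == y:
            continue  # the Majority Function handled this event
        correct = [int(p > 0.5) == y for p in votes]
        # Assumed reward scheme: more points when a voter is right alone.
        reward = solo_reward if sum(correct) == 1 else shared_reward
        for i, ok in enumerate(correct):
            if ok:
                scores[i] += reward
    return scores


def classify(votes, scores):
    """Test phase: Majority Function on consensus, most reliable voter otherwise."""
    if has_consensus(votes):
        return int(majority_function(votes) > 0.5)
    decisive = [i for i, p in enumerate(votes) if p > HIGH or p < LOW]
    candidates = decisive or list(range(len(votes)))
    best = max(candidates, key=lambda i: scores[i])
    return int(votes[best] > 0.5)


# Toy training votes (voter 1, voter 2, voter 3) with true labels.
train_votes = [(0.05, 0.45, 0.95), (0.10, 0.45, 0.90), (0.90, 0.80, 0.10)]
train_labels = [1, 1, 1]
scores = review(train_votes, train_labels)
```

In this toy example only the third voter is rewarded, so it decides the class whenever the other decisive voter disagrees with it and the remaining vote lies in the gray area.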
By adding this reliability module to the Majority Function, we got significant improvements in the predictions of the ensemble model. Table 6 shows the results of the Majority Function based on reliability.
The formal time complexity of both the review module and the reliability algorithm is $O(N)$, where N is the number of training examples. Since the review module calls the reliability algorithm, the overall complexity is $O(N^2)$.
As for our weak neural networks, since the models are trained in parallel, we will only consider the model with the most complex architecture, which is NN3. So, the time complexity of NN3 which has the architecture 90-25-1 is

Discussion.
The final step is to present a qualitative discussion of the relevance of the proposed model by comparing our approach to the most relevant related work presented so far. Note that, as previously mentioned, we exclude works for which one-hot encoding was not used as a preprocessing step for the reasons discussed before. In addition, the scope of this comparison only focuses on works based on neural networks. Table 7 summarizes our findings during this qualitative analysis, mainly focusing on examining the model capacity and the computational cost of the approaches. Model capacity in our study is the ability to generalize well with test data. So, in our case, the higher the classification accuracy of the model, the higher the capacity. The computational cost criterion, as the name suggests, is the amount of resources that should be allocated to train a model. Thus, deep neural networks with a large number of nodes and layers require significant processing power to train and tune. The authors in [9] used a neural network with an optimal parameter focusing mainly on the cost function and the optimizer algorithm. The model has a small architecture with one hidden layer and 25 hidden neurons, so training this model is not heavy on resources, which decreases the computation cost factor. The accuracy, however, is not that high.
The model proposed in [10] is a two-stage classifier, so it requires two training cycles before testing the model. The authors also mentioned that they used 10-fold cross-validation on the training data in some cases, which makes training longer. The final classification accuracy (including the output from the sparse autoencoder) is around 78%, which makes the capacity of the model very low. The model in [11] has a high capacity compared to the previous work; it provides a classification accuracy of 87% but with a low precision. The model, however, uses many layers of feature learning, an intermediate classifier, feature stacking, and finally a deep net. So, training such a model is computationally heavy. The use of recurrent neural networks in [12] for intrusion detection gave a moderate classification accuracy (81.29%). Compared to feedforward neural networks, RNNs are difficult to train because they require memory-bandwidth-bound computation. The authors also used a large number of hidden nodes (80) considering the modest performance of the model. Finally, the highest evaluation metrics were found in [13] and in this research. The main difference is simplicity: in our approach the results are obtained using weak feedforward neural learners with a small number of parameters, while their approach is based on conventional neural network that involves channel boosting, autoencoders, and stacked autoencoders and many intermediate functions. Also, tuning small neural networks is not computationally complex because the largest model uses only 25 hidden neurons.
We also want to mention that most of these works were compared with other techniques according to different criteria, from which we cite [28][29][30]. Note that, in terms of classification accuracy rate, our work also outperformed these models.
Finally, some authors followed the same preprocessing steps we applied on the NSL-KDD dataset but used different machine learning models such as SVMs [31], unsupervised learning [32], and reinforcement learning [33]. These models were not included in the comparison because the authors have used a sampling process during training and testing, while we used the full dataset. Consider the following:
(i) The authors in [31] proposed an attack detection system using SVMs which enables early discovery of the APT attack. To develop the model, they used one-hot encoding for preprocessing and PCA for dimensionality reduction; next they compared the classification accuracy of four classifiers: Support Vector Machine, Naive Bayes classification, the decision tree, and neural networks. The authors claim that the SVM achieved 97.22% accuracy, but the model was trained and tested on two different subsets from the original training set (KDDTrain+) and not on the test set (KDDTest+).
(ii) In [32], the authors used an unsupervised learning algorithm (an autoencoder) and achieved 91.70% classification accuracy. To preprocess the NSL-KDD dataset, they used one-hot encoding on the categorical features and then applied PCA for dimensionality reduction. The authors mentioned that they used 85% of the dataset for training and 15% for testing, which means that they did not work on the complete dataset as we did in this work.
(iii) For the work in [33], the authors used deep reinforcement learning on two datasets: NSL-KDD and AWID. For each dataset, they compared four reinforcement learning algorithms: Deep Q-Network (DQN), Double Deep Q-Network (DDQN), Policy Gradient (PG), and Actor-Critic (AC). The best results are obtained for the DDQN algorithm. The authors also used a sampling function on the datasets during the preparation, but, in order to compare the results with other works, they ran the tests on the full datasets. For the NSL-KDD dataset, they achieved 89% accuracy, 91% F-score, 89% precision, and 93% recall. We can see that the metrics are really close to the ones we found with simpler weak neural networks.

Conclusion
SIEM and IDS systems are tools that process a huge amount of data on a daily basis. This involves collecting, normalizing, and detecting attacks and sending notifications in real time. This research takes into account the processing power that is dedicated to the operation of SIEM/IDS systems and proposes a new machine learning model with high detection capability and low computation resources.
Our model is based on a reliability system that processes events using different simple feedforward neural networks and predicts the output using majority functions. We have also proposed a new way to create weak learners using neural networks by making them "experts" on small subsets of the input feature space. Our model achieved an accuracy of 89%, which is superior to many works and competes with deep learning approaches.
In a future study, we will evaluate the proposed system in a real network. We will also increase the number of base learners and evaluate the performance on multiclassification.

Data Availability
The datasets used to support the findings of this study are available at https://www.unb.ca/cic/datasets/nsl.html.