Machine Learning in Network Anomaly Detection: A Survey

Anomalies can be threats to the network, whether or not they have occurred before. Detecting malicious access and protecting networks against it remains challenging even though it has been studied for a long time. As networks evolve through new technologies and the fast growth of connected devices, network attacks are becoming more versatile as well. Compared with traditional detection approaches, machine learning is a novel and flexible method for detecting intrusions in the network, and it is applicable to any network structure. In this paper, we introduce the challenges of anomaly detection in the traditional network, as well as in the next generation network, and review the implementation of machine learning in anomaly detection under different network contexts. The procedure of each machine learning type is explained, and its methodology and advantages are presented. A comparison of different machine learning models is also summarised.


I. INTRODUCTION
Network security has become increasingly critical, from the traditional computer network and cellular network to the next generation software defined network (SDN) and Internet of Things (IoT). The rapidly growing network brings efficiency and convenience to our lives, as well as a demand for high quality of service. Even though network use cases are getting more complex and a network device needs to process more data, users expect quicker responses and show a lower tolerance for service interruption. Firewalls, deep packet inspection (DPI) systems and intrusion detection systems (IDS) are the typical methods for anomaly detection; however, the cost of deploying these countermeasures and the complexity of the system have to be considered [1], [2]. Security issues arise along with the evolution of the network: the diversity of network services and applications gives hackers more opportunities to compromise the network than ever before. In particular, the working procedure in the next generation network is quite different from that in the legacy network, so current anomaly detection methods need to be upgraded to adapt to the changes in these networks. Due to the large number of connected devices and high-speed broadband, anomaly detection has to process big data over a complex network structure and react promptly. This has become one of the biggest challenges in protecting networks [3]-[5].
Machine learning (ML), as an analytical tool based on statistics, has been widely discussed and deployed in various areas. Its capability to make decisions after study and analysis relieves people from processing a huge amount of data. Furthermore, its response to abnormal behaviours is usually much quicker than that of human beings, which is an advantage in early detection. ML can create diverse models with various algorithms, and the way of working with these models also differs considerably. Even when the same model is run to detect the same type of attack, the outcome varies depending on the features that ML is asked to consider [6], [7]. In fact, the most difficult step in using ML is data preparation, because the output of ML relies heavily on the data from which algorithms learn to distinguish normal operations from anomalous behaviours. Thus, in this paper, we introduce ML algorithms and discuss the implementation of ML models in anomaly detection under different network contexts.
The contributions of this article are:
• It presents a comprehensive survey on the ML types.
The rest of this article is organised as follows. Related work is presented in Section II. Section III introduces four ML types and their procedures in anomaly detection. Then Section IV reviews the challenges of anomaly detection under various network contexts. A detailed survey and comparison of existing solutions are given in Section V. Finally, Section VI concludes this article.

II. RELATED WORK
Machine learning has been applied to security in various types of networks. Buczak et al. [8] and Hodo et al. [9] focused on the intrusion detection system using supervised and unsupervised learning in the cyber network. Da Costa et al. [10] surveyed the intrusion detection using ML applications under the context of IoT. Ucci et al. [11] and Gibert et al. [14] researched malware detection and classification in the Windows system with supervised and unsupervised learning. Tahsien et al. [12] and Hussain et al. [13] described the potential threats in the IoT and presented ML applications to solve these issues. Nassif et al. [15] studied the threats in the cloud network, and the way to secure cloud network using supervised learning.
Although ML techniques have been deployed in diverse domains to address security issues, there is no comprehensive study of anomaly detection using the four types of ML models under different network environments. Hence, in this paper, we provide a survey of ML techniques which are deployed in various kinds of networks for anomaly detection. The comparison of our paper and existing survey papers is summarised in Table 1.

III. BACKGROUND OF MACHINE LEARNING (ML) TECHNIQUES
As a subset of Artificial Intelligence (AI), ML is a powerful tool that can be used for network anomaly detection via the scientific study of traffic samples, although the procedure is quite different under each ML category. Even when the same ML model is run on two identical datasets, the performance may vary with the way an ML algorithm is used, such as the features chosen from the dataset or the weight defined for each feature. More features do not always mean better results; instead, they could lead to overfitting in the model [16]. Thus, it is worth reviewing and comparing current solutions so as to better understand and build a model with the available ML techniques and the data in hand. ML can be classified into four categories as shown in Figure 1: (i) supervised learning (SL), (ii) unsupervised learning (UL), (iii) semi-supervised learning (SSL), and (iv) reinforcement learning (RL).
(i) Supervised Learning (SL) learns from existing labelled datasets, called training sets, and the predicted output can be evaluated by comparing it with the known labels. Past experience is used as a reference to make a decision, and a high-quality training set is always essential to build a well-performing model. However, a satisfying result is not guaranteed by the dataset alone; the training method is another key factor in building a trustworthy predictor. In SL, a predictive model is first created through training, after which it is able to predict either discrete or continuous outputs. Before prediction, the performance of an SL model, such as its accuracy, is usually validated to show its reliability. SL can be divided into classification and regression techniques [17]. The classification technique classifies input data into discrete categories: it calculates the probability of a test sample falling under each category, and the one with the most votes wins [18]. This probability is the likelihood of a sample belonging to a class. Typical applications include medical imaging and credit scoring. The regression technique predicts continuous responses, usually quantities, from the input variables, for example, changes in temperature or fluctuations in power demand [19]. Typical applications include electricity load forecasting and algorithmic trading. To evaluate these two techniques, a classification model can be checked by its percentage of correct predictions, while a regression model can be assessed by the root-mean-square error because, as the output is continuous, a deviation between the prediction and the real value is acceptable.
(ii) Unsupervised Learning (UL) finds hidden patterns or intrinsic structures in data in order to group them. It has input data but no expected output variables. Unlike SL, there is neither a labelled sample nor a training process, which is to say that it works on its own and its performance can hardly be evaluated. Although some researchers use existing labelled data in the UL model to verify its outcome, this cannot be achieved in a real implementation, and sometimes experts have to analyse the result manually to run an external evaluation. UL is mainly used for clustering and dimensionality reduction. In the clustering problem, it uses clustering techniques so that one sample may belong to one cluster only or to multiple clusters, while in dimensionality reduction, UL identifies the correlated features in the dataset, so that redundant information can be removed to reduce the noise. Typical applications include market research and object recognition [20].
(iii) Semi-supervised Learning (SSL) combines both labelled and unlabelled data to build the classifier, which suits scenarios that have a paucity of labelled data. It employs the training process mentioned for supervised learning to prepare a predictor with limited labelled data, and this predictor classifies unlabelled samples. Each pseudo-labelled sample is then given a confidence value to tell the administrator whether the prediction is assured. The confident samples join the new training set to update the classifier until all the data have labels. As unlabelled data is in effect tagged randomly in the prediction, assumptions, such as the smoothness and cluster assumptions, have to be made prior to training on unlabelled instances [21], [22].
(iv) Reinforcement Learning (RL) uses states, actions, and rewards to judge whether the machine has made a good decision. The algorithm used in RL is called an agent, and the object in which the agent works is called the environment. At first, the environment sends the current state to the agent, and the agent chooses an action in response to that state, so that it enters a new state based on the action. Then, the environment sends this new state and a reward to the agent. This loop keeps running until the agent receives a terminal state.
Through the rewards given by the environment, the agent develops an optimal policy to achieve the maximum long-term reward [23].
To evaluate the performance of ML models, there are four main metrics: accuracy, precision, recall and F-measure. The procedure of running ML in anomaly detection is summarised as follows.

A. SL IN THE ANOMALY DETECTION
In general, the process of anomaly detection using SL is shown in Figure 2. It includes data preparation, algorithm selection, model training, evaluation, model improvement and prediction. Data preparation, from data collection to annotation, is the most important and time-consuming step. The collected data is not ready for use in SL: duplicated data must be removed, and features must be extracted and converted to a format that can be understood by the SL algorithm. In addition, a class label is added to each sample to prepare a group of labelled data. This data group is further split into a training set and a validation set. Once the SL algorithm is chosen, a predictor is trained on the training set and then evaluated on the validation set. Parameters of the SL algorithm can be adjusted according to the evaluation result to reach the best outcome. In the end, the trained model is able to predict samples in real time [24].
Rather than using all the features in the dataset, selecting only the key features for training and prediction is a better option, because it filters out features that are not strongly related to the output and facilitates a better understanding of the model [25]. Sometimes the predictive performance improves, and even when it degrades, the degradation is very limited. Additionally, this saves system resources and training time. Techniques to select features are described below:
1) The wrapper approach searches for essential features by evaluating the output using the predictor itself. The entire feature group is rearranged into several subsets, and the subset with the lowest estimated error is considered to contain the features most related to the prediction [26]. Genetic algorithm (GA) [27] and recursive feature elimination [28] can be applied in anomaly detection applications.
2) The filter approach assesses feature importance via the characteristics of the dataset, such as correlation, and the predictor is ignored in this method [29]. Typical algorithms include Fisher score [30] and correlation-based feature selection [31].
3) The embedded method is a trade-off between the previous two methods, because the computational cost of the wrapper method is high, while the features selected by the filter method may not be optimal. Thus, the embedded method picks features in filter mode and validates the performance in wrapper mode [32], [33]. Lasso is a typical algorithm that can be employed in anomaly detection [34].
Apart from the measures above, ensembles are also used to gain a more stable and reliable model. Two typical ensemble types are bagging and boosting. The bagging method trains classifiers independently and votes with equal weight, which reduces variance in the model [35], while the boosting method trains each new model based on the previous model, which reduces bias in the model [36].
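The filter approach above can be illustrated with a minimal sketch that ranks features by the absolute Pearson correlation between each feature column and the label, without consulting any predictor. This is a simplification and not the correlation-based feature selection algorithm of [31]; the feature names, toy flows and labels are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information about the label
    return cov / math.sqrt(var_x * var_y)

def rank_features(rows, labels, names):
    """Rank features by |correlation| with the label, strongest first."""
    scores = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores[name] = abs(pearson(column, labels))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy flows: [packets per second, mean packet size, TTL]
rows = [[10, 500, 64], [12, 480, 64], [900, 60, 64], [950, 64, 64]]
labels = [0, 0, 1, 1]  # assumed encoding: 0 = normal, 1 = attack
ranking = rank_features(rows, labels, ["pps", "pkt_size", "ttl"])
```

In this toy data the constant TTL column scores zero and would be the first candidate for removal, whereas a wrapper approach would instead retrain the predictor on each candidate subset.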

B. UL IN THE ANOMALY DETECTION
As no training with labelled data is performed in UL for anomaly detection, finding outliers in the data is based on the assumption that abnormal behaviours rarely occur. The procedure is given in Figure 3. Similar to SL, data are collected and adapted to a form that can be understood by the UL algorithm, but no label is required. Compared with SL, UL is able to process computationally complex cases, because it is data-driven and can handle unknown scenarios. Depending on the features of the anomaly and the principle of the UL algorithm, a specific attack is more likely to be detected by certain UL models, which is to say that selecting a suitable algorithm is also necessary for anomaly detection [37]. Moreover, feature extraction and normalisation are usually performed on the data before they are sent to the UL clustering models. Numerical data, such as IP addresses and numbers of bytes, are always preferred in the test, because they are valuable information in a cluster [38], [39].
In a real implementation, the accuracy of a UL clustering model is hard to evaluate, so the outcome might be untrustworthy. However, with labelled data, the performance of UL in anomaly detection has been shown to be satisfactory [40]. UL has advantages over SL especially when facing unknown attacks. Since SL models rely on the training data, unknown attacks might slip through the net due to the lack of related records, and UL models could step in to detect the issue.

C. SSL IN THE ANOMALY DETECTION
The procedure of using SSL is illustrated in Figure 4. Apart from the preparation of labelled and unlabelled data, SSL first trains a model with labelled data. These labelled data could be in multiple classes, which means the training set has samples of all the attack types, or in one class, i.e. normal samples, which is to say the predictor is trained on normal traffic only and needs to classify anomalous traffic. As a trade-off between SL and UL, SSL is more applicable in the real world, because it offers a relatively reliable prediction with a small amount of labelled data. SSL relieves the lack of labelled data and ensures the model has adequate training before implementation; however, incorrect classification of the unlabelled data could mislead the model into false predictions [41].
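The SSL procedure of training on labelled data, pseudo-labelling unlabelled samples and keeping only the confident ones can be sketched as a self-training loop. The sketch below assumes a nearest-centroid classifier and a distance-based confidence score, both chosen here purely for illustration; the threshold, function names and toy data are hypothetical and do not come from any surveyed work.

```python
import math

def centroids(samples, labels):
    """Mean feature vector per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_with_confidence(x, cents):
    """Return (label, confidence); confidence approaches 1 when x is far
    closer to one centroid than to the others (an assumed heuristic)."""
    dists = {y: math.dist(x, c) for y, c in cents.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    conf = 1.0 if total == 0 else 1.0 - dists[label] / total
    return label, conf

def self_train(labelled, labels, unlabelled, threshold=0.8, max_rounds=10):
    """Grow the training set with confident pseudo-labelled samples."""
    samples, tags, pool = list(labelled), list(labels), list(unlabelled)
    for _ in range(max_rounds):
        cents = centroids(samples, tags)
        confident, rest = [], []
        for x in pool:
            y, conf = predict_with_confidence(x, cents)
            (confident if conf >= threshold else rest).append((x, y))
        if not confident:
            break  # nothing confident enough to add in this round
        for x, y in confident:
            samples.append(x)
            tags.append(y)
        pool = [x for x, _ in rest]
    return centroids(samples, tags), pool

# Hypothetical two-feature samples
model, leftover = self_train(
    [[0.0, 0.0], [10.0, 10.0]], ["normal", "attack"],
    [[1.0, 1.0], [9.0, 9.0], [5.0, 5.0]],
)
```

The sample that lies halfway between the two classes never reaches the confidence threshold and stays unlabelled, which reflects the risk noted above: lowering the threshold would label it, possibly incorrectly, and feed that error back into the classifier.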

D. RL IN THE ANOMALY DETECTION
RL is a mistake-driven learning method, which is depicted in Figure 5, and this learning style is quite similar to the way human beings learn. One of the challenges in leveraging RL for anomaly detection is the definition of the RL parameters: state, action and reward. Although RL does not require data labelling, there are too many features in the network that can become the state in RL, as well as the reward after each operation. These parameters determine the performance of RL in identifying malicious behaviours in the network. Unlike in other ML types, errors can be corrected in RL: the machine "realises" its mistake from rewards or long-term returns, and then avoids these actions under the specific environment. Since RL learns through interaction with the network for anomaly detection, a large amount of data and computing resources are required to achieve an ideal solution.
The difference of using the four ML types for anomaly detection in general is summarised in Table 2.

SL
False prediction rate increases with unfamiliar data. Labelled data is rare and the cost of labelling is high in the real world. Computational cost is high during training, especially with large data size. Unable to handle complex tasks.

UL
Categorise unlabelled data from features given.
Decision can be made without labelled data.
Able to detect novel threats if their features are quite different from the normal ones. Getting unlabelled data is much easier. Can handle complex scenarios. Quicker response in classification than SL.
Unable to know the performance due to the absence of labelled data. It may be a costly affair to analyse the output when identifying the threat type in a complicated scenario.

SSL
Initialise supervised learning with a small group of labelled data, and then classify unlabelled data accordingly.
Expand training set with high confident unlabelled data. The final model is trained by labelled and pseudo labelled data.
Obtain more confidence from labelled data than in the UL. Labelling limited size of data is acceptable in the real world.
Employment of incorrect predicted unlabelled data could mislead the classifier to make wrong decisions.

RL
Use trial and error method to try all the possible state-action pairs so as to find the strategy with a best long-term return.
Use the concept of reward to judge a response to the environment. Emphasise the final outcome rather than a single instant output.
Applicable to complicated real world problems that require the best results after a series of operations rather than a single action.
Resource consumption is high, because it is going to try all the state-action pairs.
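The interaction between agent and environment through states, actions and rewards can be sketched with tabular Q-learning, one common RL algorithm. The environment below is a hypothetical four-state chain rather than a network, and the learning rate, discount factor and exploration rate are illustrative assumptions.

```python
import random

N_STATES, TERMINAL = 4, 3      # toy chain: states 0..3, state 3 is terminal
ACTIONS = (0, 1)               # 0 = move left, 1 = move right

def step(state, action):
    """Environment: return (next_state, reward) for the chosen action."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == TERMINAL else 0.0)

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn state-action values by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != TERMINAL:                      # loop until terminal state
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)          # explore
            else:                                     # exploit, ties broken at random
                action = max(ACTIONS, key=lambda a: (q[state][a], rng.random()))
            nxt, reward = step(state, action)
            target = reward + gamma * max(q[nxt])     # long-term return estimate
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q

q_table = train()
# Greedy policy for the non-terminal states
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(TERMINAL)]
```

After training, the greedy policy moves right in every state, since the discounted long-term return of heading towards the terminal state exceeds that of moving away from it. In an anomaly detection setting, the difficulty discussed above lies in replacing this toy state, action and reward with meaningful network quantities.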

IV. ANOMALY DETECTION CHALLENGES IN VARIOUS NETWORKS
Before we explain ML for anomaly detection, the challenges under different network contexts, including the computer network, cellular network, SDN, IoT and cloud network, are discussed. Although cyber security has been researched for years, there are still open issues and challenges in different types of networks. In the traditional computer network, the intrusion detection system (IDS) is a typical countermeasure deployed to protect the network; especially in large-scale networks, it is a mature system against threats. However, there are still some challenges for the IDS when protecting the traditional network. The three key factors in evaluating an IDS are accuracy, completeness and performance. Accuracy and completeness are hard to measure, and most evaluations are done with contrived datasets, which can hardly be unbiased and comprehensive. The complexity of the IDS also increases in order to cover more attack scenarios. Furthermore, since new attacks are introduced and existing attacks change their methods, it is not an easy task to update the profiles in the IDS after an unknown attack has been detected, or to update the IDS itself to adapt to the change in attack methods [2], [42].
Unlike in legacy computer networks, devices in the cellular/wireless network are usually wirelessly connected and mobile, so secure access authentication is essential to reduce the probability of threats, such as DoS/DDoS attacks. Moreover, the large number of applications running on the cell phone provides a big opportunity for malware. Due to the diversity of services on the cell phone, the network is getting increasingly complex. Although the network can mostly work around an issue to maintain the normal connection, anomaly detection still requires a high human workload, because the analysis and localisation of the root cause are time-consuming, even for experienced engineers [43].
Neither SDN nor IoT networks have been widely deployed yet, so there is not much experience with these two networks. However, some challenges can still be inferred from their features and network architecture. For SDN, the centralised control brings scalability issues, especially against flooding attacks [44]. The open source platform allows various detection methods to be implemented in the network; meanwhile, how to avoid conflicts between these methods is a potential challenge. For IoT, end devices usually lack security features for the sake of energy efficiency, so the placement of the anomaly detection system needs to be considered. Moreover, the detection range is also challenging: as existing solutions only target specific attacks, they need to be combined to cover the entire network [45].
IDS is also employed in the cloud network for anomaly detection, and since a cloud network consists of multiple components, the IDS needs to be configured in each component. Thus, the position of the IDS is a challenge, and the workload of configuration is heavy [46].
In simple terms, anomaly detection in the network is usually achieved through condition monitoring, in which the network state is determined by comparing a measurement with its maximum and minimum boundaries. To summarise, the key challenges in the existing anomaly detection solutions are the complexity of the system and the adaptability to the diversity of attacks. An ML model is able to address these issues because it can be updated and improved through learning; however, its performance still needs to be evaluated in each network.
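For contrast with the learned models discussed later, the condition monitoring mentioned above reduces to a fixed boundary check. The following minimal sketch assumes link utilisation in percent as the measurement; the boundary values and function name are hypothetical.

```python
def network_state(measurement, lower, upper):
    """Classify one measurement against its minimum and maximum boundaries."""
    return "normal" if lower <= measurement <= upper else "anomalous"

# Hypothetical boundaries for link utilisation in percent
states = [network_state(v, lower=5.0, upper=80.0) for v in (42.0, 93.5, 1.2)]
```

Such boundaries are static, so they must be retuned by hand whenever traffic patterns or attack methods change, which is the adaptability problem that motivates the use of ML.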

V. ML FOR ANOMALY DETECTION USE CASES
In this section, we describe the ML applications in anomaly detection under different network domains.

A. TRADITIONAL COMPUTER NETWORK
Although countermeasures against attacks in the cyber network have been researched for many years, both these solutions and hackers are getting more sophisticated, and the network scenario is becoming more complex as well. ML offers quicker and more automatic responses to changes in the network, as well as to anomalous behaviours. Hamamoto et al. [47] group network flows by time intervals and extract key features, such as bits per second and source IP addresses. Numeric data can be used in the model directly, while for nominal values, entropy is calculated to represent the distribution of a specific value within the time interval. GA learns the behavioural pattern of traffic and predicts the network behaviour. Based on the output from GA, fuzzy logic (FL) evaluates whether the traffic flow is abnormal in a time interval. Normally, labelled data and unlabelled data are processed separately in SSL. Gu et al. [48] cluster normal and abnormal data from a small number of tagged samples using the K-Means algorithm, where the density of each data point within a specific radius is computed to find the cluster centre. With unlabelled data, the centre of each cluster is updated until convergence. To detect anomalies in the network, the distance between the data feature and each cluster centre is calculated, and the data is classified into the cluster with the shortest distance.
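The final classification step described for Gu et al. [48], assigning a sample to the cluster with the shortest distance, can be sketched as follows. The sketch assumes Euclidean distance and already converged cluster centres; the centre coordinates and feature values are hypothetical, and the density-based initialisation is not shown.

```python
import math

def nearest_cluster(sample, centres):
    """Return the label of the cluster centre closest to the sample."""
    return min(centres, key=lambda label: math.dist(sample, centres[label]))

# Hypothetical cluster centres over two normalised features
centres = {"normal": [0.2, 0.1], "abnormal": [0.9, 0.8]}
label = nearest_cluster([0.85, 0.7], centres)
```

Because distances are compared across features, the features are assumed to be normalised to a common scale beforehand.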
Sometimes the state or environment of an RL model can be difficult to describe, so defining it with the aid of other ML algorithms becomes a feasible solution. Alauthman et al. [49] use the output of a neural network (NN) as the host state in an RL model for botnet detection. This state contains two sub-states, which are the probabilities of being malicious and of being legitimate. The highest expected reward is then extracted depending on which probability is higher. A better NN policy replaces the old one, and new behaviours that obtain a higher reward join the training set. Hence, the RL agent keeps improving to create a superior detector. The RL model is also used to enhance the detection performance of other models. Smadi et al. [50] train an NN model to outline the email filter system, and an RL model is employed during the training of the NN model. In each training round, neurons in the current NN model are updated, and a reward is given based on the output. An NN model with a higher reward always replaces the old model until the preset number of training rounds is reached or the termination condition is met. Xu et al. [51] employ RL to adjust anomaly detection modules, which contain a variety of detection algorithms, to find the optimal strategy. Before adjustment, the implementation of a strategy is set as the target state, and the parameter adjustment is defined as an action in the RL model. Actions are taken iteratively until the state of the anomaly detection module hits the target, or the predefined number of iterations is reached. After multiple attempts using various anomaly detection strategies, the one with the largest cumulative reward is the optimal strategy. To determine the reward in RL, Sethi et al. [52] adopt an IDS to grade the output from RL agents. Multiple agents are deployed in routers under the same network context, and each agent has several classifiers to predict the RL state. The state vector, which consists of the output from the classifiers along with the feature vector, is fed to the deep Q-network to obtain a Q-value. The action function decides whether it is an attack or not by comparing the Q-value with a threshold. A positive reward is received if the classification matches the actual result given by the IDS, and a negative reward otherwise.
Besides detecting abnormal behaviours, ML can also be used in network management to avoid potential threats. Jin et al. [53] employ RL to find the best scheduling policy to manage intranet traffic with security taken into consideration. Each user has a reputation value to indicate how trustworthy the user's traffic is. The state in the RL model is represented by the available bandwidth of links and the flows that are waiting to be scheduled. Actions are given per flow in the proposed model, and each action consists of the bandwidth allocation to this flow. The performance of the scheduler is rewarded based on the utilisation of links, length of queue, latency and the user trust level. This RL model considers security, performance and user requirements in the network when defending against threats from inside. The ML applications in the traditional computer networks are summarised in Table 3.

Can automatically adapt to any change in the network. The SL model is improved through the interaction with the RL model.

Xu et al. [51]
RL: Q-learning
General network anomalies
Use the RL model to adjust parameters in the anomaly detection models, and find the optimal model that has the highest reward.
The optimal configuration of an anomaly detection model is found through the RL model.

Various attacks, such as DDoS attacks
Use ensemble method to obtain multiple predictions, and the final decision is made through these outputs.
Predictive performance is improved.

B. CELLULAR/INTERNET SERVICE PROVIDER NETWORK/WIRELESS NETWORK
Compared with the computer network, cellular/wireless networks are more oriented towards wireless connections. Because of the transmission medium, devices and links are more vulnerable to attacks than those using wired connections. In the cellular network, latency must be much lower than in computer networks because of voice services, and ML is able to provide a detection approach with low delay. Malicious mobile applications usually generate benign traffic that far outweighs the anomalous traffic, so imbalanced data becomes a problem in data analytics, because there is not enough information to clearly indicate the abnormal behaviour. However, ML can overcome this challenge. Chen et al. [54] identify malicious behaviours in the cellular network so as to detect malicious applications. Based on the destination IP address and domain name in the packet, most SL algorithms achieve excellent accuracy in identifying malware on the cell phone. Otoum et al. [55] deploy an RL model in the wireless sensor network to detect anomalies. The cluster head is elected based on factors such as the connectivity of a node and its signal strength. Then the cluster head collects sensed data and redirects them to the RL model for analysis. The RL model decides whether a sensor is behaving abnormally, and the reward is given accordingly. On the one hand, as supervised learning relies on a labelled training set, its response to anomalies might be slow due to the lack of abnormal samples. Hence, Dromard et al. [56] use grid incremental clustering algorithms in UL to rapidly detect any abnormal state in the network, so that real-time detection is achievable. The entire dataset is partitioned into cells, and each cell contains a subset of the original dataset. Then, dense cells that share a common face are grouped to form a cluster, which reduces the complexity compared with handling the whole dataset. When new data come into the network, the update only happens in the previous feature space partition, so the computation finishes quickly. On the other hand, annotating a large-scale dataset is a big challenge in SL, so Al et al. [57] propose an automatic labelling algorithm for applying anomaly detection in the cellular network. This algorithm considers two factors to classify a sample: the range of the key performance indicator (KPI) value and the time series profile. A threshold is defined for the KPI value to determine whether it is normal, while for the time series profile the mean value and standard deviation are considered. Only when both of these factors are abnormal is the sample categorised as an anomaly.
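The two-factor labelling rule described for [57] can be sketched as below. The source only states that a KPI threshold and the mean and standard deviation of the time series profile are considered; the k-sigma deviation test, the parameter names and the sample values are assumptions made for illustration.

```python
import statistics

def label_sample(value, history, kpi_min, kpi_max, k=3.0):
    """Label a KPI sample as 'anomaly' only if both checks flag it."""
    # Factor 1: is the KPI value outside its permitted range?
    out_of_range = not (kpi_min <= value <= kpi_max)
    # Factor 2: does the value deviate from the time series profile?
    # (assumed rule: more than k standard deviations from the mean)
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    off_profile = abs(value - mean) > k * std
    return "anomaly" if out_of_range and off_profile else "normal"

# Hypothetical KPI history, e.g. cell load in percent
history = [50, 52, 48, 51, 49]
labels = [label_sample(v, history, kpi_min=0, kpi_max=90) for v in (95, 58, 50)]
```

Requiring both factors to be abnormal keeps the second sample labelled as normal even though it deviates from the profile, which trades some sensitivity for fewer false alarms in the generated labels.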
Without labelled data, UL is also a good option. Dey et al. [58] filter man-in-the-middle (MitM) attacks through profiles and features of incoming traffic. The operating system and coarse location of a client are utilised to determine whether the request is suspicious. An unsupervised clustering algorithm based on inter-packet delay then further inspects the traffic. Hoang et al. [59] propose a simple method to detect eavesdropping attacks using UL with one-class labelled data, in which only normal samples are known. An area that contains normal data is first defined by a one-class support vector machine (OCSVM). Then, unlabelled data are divided into two groups via a K-Means model. Data that sit within the predefined area are labelled as normal, and data outside the area are labelled as abnormal.
Although UL can work alone to analyse problems, combining it with other ML techniques could result in a superior output [62]. Qu et al. [60] combine the Mean Shift Clustering Algorithm (MSCA) and SVM to detect unknown attacks in the wireless sensor network. MSCA distinguishes attacks through abnormal features, and SVM is employed to maximise the margin between normal and attack features, so that the classification error is minimised.
The ensemble method is another approach to improving predictive performance. A training set is divided into several small subsets, and one or multiple SL algorithms generate several classifiers by training on these subsets. The final prediction is given by combining the outputs from the classifiers, i.e. the one with the most votes wins [63], [64]. Vanerio et al. [61] use a supervised learning model, called Super Learner, to enhance anomaly detection with the ensemble learning approach. Super Learner is able to find the best combination of a group of basic prediction algorithms. In the evaluation over a semi-synthetic dataset [65] that records traffic in the cellular network, the results are better than those of a single prediction model. Existing solutions are summarised in Table 4.
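The voting step of the ensemble method, in which the class with the most votes wins, can be sketched as follows. The three threshold rules stand in for trained classifiers and are hypothetical, as are the feature names; training on subsets is omitted, and this is plain majority voting rather than the Super Learner of [61].

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(classifiers, sample):
    """Collect one prediction per classifier and combine them by voting."""
    return majority_vote([clf(sample) for clf in classifiers])

# Hypothetical stand-ins for classifiers trained on different subsets
classifiers = [
    lambda s: "attack" if s["pps"] > 500 else "normal",
    lambda s: "attack" if s["syn_ratio"] > 0.8 else "normal",
    lambda s: "attack" if s["bytes"] < 100 else "normal",
]
sample = {"pps": 900, "syn_ratio": 0.9, "bytes": 400}
decision = ensemble_predict(classifiers, sample)
```

Here two of the three classifiers flag the sample, so the ensemble reports an attack even though one classifier disagrees.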

C. SDN
The programmability of SDN makes the implementation of ML simpler than in other networks. ML security applications can be developed and deployed in the SDN directly without any licence or compatibility concern. Furthermore, data collection via SDN controllers is much easier than in the traditional network due to the centralised management. Sebbar et al. [66] deploy an SL model in the southbound interface (SBI) to detect MitM attacks, aiming to disconnect a node if it is anomalous. The state of a network node, the time to live and the response time are used as indicators of suspicious requests, and the data collected via the SBI are labelled normal or abnormal according to these conditions. The labelled data are then sent to the random forest (RF) algorithm to train a classifier, which allows or drops new connection requests. Khamaiseh et al. [67] explore the time window of traffic for early detection using SL in SDN. As a small time window means a short duration before the SL model makes a decision, it could be a double-edged sword, leading to either an early detection or a worse accuracy, because the SL predictor may not have sufficient information to make the correct prediction. The centralised control of SDN allows the controller to periodically collect statistics from switches, so that the SL model can obtain up-to-date information to judge whether a request is malicious. The predictor is trained offline with existing datasets, and then any new request is sent to the predictor for inspection. Based on the output, flow entries to forward or drop packets are inserted into the forwarding devices [68], [69].
Since SDN decouples the control and data planes that are combined in legacy networks, the two planes and the link between them no longer sit in the same hardware. Thus, attacks against the control plane, the data plane and the link between them must be considered separately. Santos et al. [70] compare the performance of SL models for DDoS attack detection in SDN, where the targets of DDoS attacks fall into three categories: the controller, the flow table and the bandwidth between the switch and the controller. The tests show that the most important features for correct prediction are the IP source port, and the number of packets and bytes in flows.
Sometimes the outcome of the ML model is not accurate enough, and adding an extra step, such as entropy measurement, to double check the data can improve the performance of anomaly detection. Dehkordi et al. [71] propose to collect network statistics at regular intervals and calculate the entropy to see whether it is normal; if the entropy value is under a predefined threshold, a SL classifier is invoked to determine whether an attack is under way. Song et al. [72] introduce three subsystems running over the controller to predict threats in the SDN, which are used for data processing, classifier creation and decision making. As the training is based on past experience, data processing filters key information and discards irrelevant data so as to provide useful clues to the RF classifier. Based on the prediction from the RF classifier, normal requests are processed and abnormal data are blocked. Furthermore, when the decision is ambiguous, the model uses entropy to measure the ambiguity and make the final decision.
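The entropy gate placed in front of a classifier can be sketched as follows. This is only an illustration of the idea: the choice of destination addresses as the monitored statistic, the threshold value and the stand-in `dominant_destination_classifier` are assumptions, not details taken from [71].

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def dominant_destination_classifier(dst_ips):
    """Hypothetical stand-in for a trained SL classifier.

    Flags an attack when one destination receives more than 80% of packets.
    """
    top_count = Counter(dst_ips).most_common(1)[0][1]
    return "attack" if top_count / len(dst_ips) > 0.8 else "normal"

def inspect_window(dst_ips, threshold, classifier):
    """Invoke the classifier only when the window entropy is under the threshold."""
    if shannon_entropy(dst_ips) >= threshold:
        return "normal"
    return classifier(dst_ips)

# Two made-up measurement windows of destination addresses.
balanced = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
flood = ["10.0.0.5"] * 9 + ["10.0.0.6"]

print(inspect_window(balanced, 1.0, dominant_destination_classifier))
print(inspect_window(flood, 1.0, dominant_destination_classifier))
```

The gate keeps the classifier idle during ordinary windows, so the more expensive SL step only runs when the traffic distribution has already become suspiciously concentrated.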
According to the user requirement, RL can be applied to trigger specific operations against intrusions. Since DDoS attacks are mitigated by dropping malicious traffic, Simpson et al. [73] introduce a probability value in the agent to instruct the switch to drop relevant packets. Two agent modes, instant and guarded, are proposed and validated in single-agent and multi-agent scenarios. In the instant mode, an agent directly chooses the drop probability to partially discard current traffic flows, and it allows at least 10% of flows to pass through. This mode has no interest in the future state. In the guarded mode, traffic can be completely blocked, and the future state may cause the state-action values to be updated. In the evaluation, the instant mode performs well in the multi-agent scenario, in which several agents work separately to protect the traffic towards the same server, while the guarded mode performs better in the single-agent scenario, in which one agent controls all the flows going to the server. Sampaio et al. [74] deploy RL in the SDN to achieve load balancing. They use the SDN controller to monitor the load on each link; any link with over 80% load is regarded as being in a high status and triggers the RL agent to modify the route. After the route update, a positive reward is given if no link is in a high status, otherwise the reroute action receives a negative reward. This model can also be adopted to redirect malicious traffic. Han et al. [75] divide a network into nodes and links, each of which has only two states. A node is either normal or compromised, while a link is either turned on or off. The combination of node and link states reflects the current network state. Depending on the state, an action can switch nodes or links on or off, or do nothing. Since the objective is to protect critical servers, the reward is characterised by the availability of the server, the cost of mitigation and the number of reachable network resources. Assuming the attacker is aware of the RL agent and able to falsify the reward, adversarial training is found capable of alleviating the impact during RL training. The ML applications in the SDN are summarised in Table 5.
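The reward rule described for load balancing in [74] can be written down in a few lines. The sketch below is an illustration under stated assumptions: the reward magnitudes of +1 and -1, the state and action names, and the use of a tabular Q-learning update are not specified in the text above and are chosen here only to make the example concrete.

```python
HIGH_LOAD = 0.8  # a link with over 80% load is in a high status

def reroute_reward(link_loads):
    """Positive reward if no link is in a high status after rerouting."""
    no_high_link = all(load <= HIGH_LOAD for load in link_loads.values())
    return 1.0 if no_high_link else -1.0  # magnitudes are assumptions

def q_update(q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    """One tabular Q-learning step (assumed update rule)."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]

# Hypothetical link loads observed by the controller after a reroute.
q_table = {}
actions = ["keep_route", "reroute"]
loads_after = {"s1-s2": 0.55, "s2-s3": 0.70}

reward = reroute_reward(loads_after)
q_update(q_table, "high", "reroute", reward, "ok", actions)
print(reward, q_table)
```

Redirecting malicious traffic would reuse the same loop with a different trigger and reward definition.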

D. IOT
With the large number of connected devices in IoT, processing big data and handling unknown attacks have become major problems in the security domain. Since more portable devices with very limited security features join the network than ever before, hackers have more opportunities to launch flooding attacks, because user behaviour is more arbitrary in the IoT context, and compromising an IoT device is much easier than hacking a firewall-equipped computer.
ML, especially UL, shows its capability to detect unknown attacks with low computational resources. Both feature selection and dimensionality reduction aim to reduce the number of features; the difference is that feature selection leaves the original features unchanged. Nõmm et al. [76] explore botnet attack detection with a small number of features using UL models. Rather than running a common model for all the IoT devices, they propose a separate model for each device. Several main features are first chosen by feature selection methods, such as entropy and variance, then these data are sent to the UL models for clustering, which achieves an acceptable accuracy. To achieve a higher detection accuracy, Liu et al. [77] propose to cluster data into three groups rather than two based on the suspicion level, and only the highly dubious group is treated as malicious. A node collects periodic probe messages from a trusted source node via multiple paths, and the contribution of each node to a path is calculated. These contribution metrics are used as features in the clustering. Karimipure et al. [78] use a similar concept to partition the smart grid network into sub-systems, in which data are processed in parallel, and the behaviour in each sub-system is learnt as the reference for anomaly detection.
Since UL is capable of clustering data into groups without any labelled training data, it is also applied with other algorithms to detect anomalies in the IoT network. Ahmed et al. [79] utilise the iForest algorithm to determine whether a measurement sample in the smart grid network is compromised. To categorise samples, principal component analysis (PCA) is first invoked to reduce the data to a smaller dimension. Then, the iForest sets up binary search trees to isolate each sample. As the UL model splits all the samples into groups, the samples that are easy to isolate are more likely to be abnormal, because compromised samples are usually few and their features differ from those of normal samples.
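The isolation idea can be made concrete with a small sketch. It is a simplification rather than the method of [79]: the PCA step is omitted, a fresh random tree is grown for every query instead of sharing one trained forest, and the data points and function names are made up for illustration.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=20):
    """Count the random splits needed to separate `point` from `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only features that still vary can be used to split the data.
    candidates = [f for f in range(len(point))
                  if min(r[f] for r in data) != max(r[f] for r in data)]
    if not candidates:
        return depth
    feature = rng.choice(candidates)
    low = min(r[feature] for r in data)
    high = max(r[feature] for r in data)
    split = rng.uniform(low, high)
    # Keep only the samples on the same side of the split as `point`.
    if point[feature] < split:
        same_side = [r for r in data if r[feature] < split]
    else:
        same_side = [r for r in data if r[feature] >= split]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def average_depth(point, data, trees=200, seed=0):
    """Average isolation depth over several random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(trees)) / trees

# A 5 x 5 grid of normal measurements plus one distant, compromised sample.
inliers = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
outlier = (5.0, 5.0)
data = inliers + [outlier]

outlier_depth = average_depth(outlier, data)
inlier_depth = average_depth(inliers[12], data)
print(outlier_depth, inlier_depth)
```

The distant sample is usually cut off by the first split, so its average depth is much shorter than that of a sample in the middle of the grid, which is the property the anomaly score relies on.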
The hybrid of UL and SL can also improve the efficiency of detection. Ali et al. [80] train auto-encoders (AE) via UL to extract features from the unlabelled dataset. Next, these features are merged according to their weights. Finally, the combined features are processed in a supervised manner to create a detection model. It is worth noting that the training of an AE here differs from training in SL: because an AE does not use labelled data, its objective is to minimise the reconstruction error. The reconstruction error is defined as the difference between the original data and the reconstructed data, and this error is used to derive a threshold for classifying the data. Bhatia et al. [81] also demonstrate an AE based classifier, trained only on normal traffic, to detect DDoS attacks.
Although a single predictor is able to distinguish attack data from normal data, as well as to classify the attack type through training, dedicating a SL model to each specific job is also feasible. Anthi et al. [82] propose a three-layer ML model: layer 1 profiles and learns the normal behaviour of each device, layer 2 performs anomaly detection and layer 3 classifies the attack. Each layer has a SL model to make predictions, which means a specific type of attack is identified after being inspected three times.
Due to the significance of annotating untagged data in SSL, Rathore et al. [83] propose a two-tier verification approach to classify unlabelled data. They use the extreme learning machine (ELM) algorithm to train the model with labelled data, and send both labelled and unlabelled data to semi-supervised fuzzy C-means to filter out the unlabelled data with high confidence. These high-confidence unclassified data are then examined again using the model trained by ELM, and only those that still have a high confidence are allowed to join the labelled data group. This process repeats until all the data are classified. Voting is also a popular way to gain high confidence when tagging unlabelled data. Li et al. [84] adopt the disagreement-based principle in the tri-training [85] method to classify unlabelled data. When two learners agree on the classification of a sample but the third learner disagrees, the third learner is taught by the other two learners on this sample. Ravi et al. [86] split the labelled dataset into normal and various attack classes, and samples from each class are picked in a stochastic way. The Euclidean distances from an unlabelled sample to these samples are calculated to find the minimum value, which determines the class of the unknown sample. This classification repeats multiple times, and a sample is labelled only after more than half of the decisions point to the same cluster.
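The repeated nearest-sample vote described at the end of the paragraph above can be sketched as follows. The number of rounds, the number of reference samples drawn per class, the toy data and the function names are assumptions for illustration and are not taken from [86].

```python
import math
import random
from collections import Counter

def nearest_class(sample, labelled, rng, per_class=2):
    """Draw random reference samples per class and return the closest class."""
    best_label, best_dist = None, math.inf
    for label, points in labelled.items():
        for ref in rng.sample(points, min(per_class, len(points))):
            dist = math.dist(sample, ref)  # Euclidean distance
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label

def vote_label(sample, labelled, rounds=9, seed=0):
    """Label a sample only if more than half of the rounds agree."""
    rng = random.Random(seed)
    votes = Counter(nearest_class(sample, labelled, rng) for _ in range(rounds))
    label, count = votes.most_common(1)[0]
    return label if count > rounds / 2 else None

# Made-up labelled data with one normal class and one attack class.
labelled = {
    "normal": [(0, 0), (0, 1), (1, 0), (1, 1)],
    "dos": [(10, 10), (10, 11), (11, 10), (11, 11)],
}

print(vote_label((0.5, 0.5), labelled))
print(vote_label((10.5, 10.5), labelled))
```

A sample that falls between classes can receive split votes, in which case `vote_label` returns `None` and the sample stays unlabelled.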
Apart from detecting anomalies directly, RL can also be applied to improve existing solutions. Gu et al. [87] use RL to adjust the attack detection threshold in an entropy-based framework, which improves the detection rate and decreases the false alarm rate. IoT related anomaly detection methods using ML are summarised in Table 6.

E. CLOUD/FOG/EDGE NETWORK
Filters and rules are popular anomaly detection measures in the legacy network; however, they have not shown decent results in security investigation in the cloud/fog/edge networks. ML or the combination of ML and rules has produced satisfactory results when deployed as IDS in the cloud [88]. Kim et al. [89] design a hybrid ML model in the cloud environment to detect and classify network threats. Key features are first selected via the RF algorithm, then unlabelled data are clustered by these key features using UL models; at this point the clusters are still unnamed. In order to know what attack a cluster represents, a threat label is added to each sample for naming clusters later. The threat label is defined by the value of some features in the labelled data, and the cluster name is given by the distribution of threat labels in each cluster. Thus, UL and SL models are employed for anomaly detection and classification, respectively. Aljamal et al. [90] and Baek et al. [91] also employ UL models for clustering and SL models for training and detection. The new clusters are labelled based on the assumption that normal data are in the large and dense clusters while anomalies belong to small or sparse clusters. Thus, a threshold function is defined to judge whether a cluster is small or large, as well as its density. Data in the small and sparse clusters are then labelled as abnormal, and other data are tagged as normal. SL models are then trained on these labelled data and employed to detect anomalies in the network.
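The size and density rule for naming clusters can be sketched as below. The sketch follows the stated assumption that only large and dense clusters are normal; the density measure (mean distance to the centroid), the two thresholds and the toy clusters are assumptions for illustration and are not the threshold functions of [90] or [91].

```python
import math

def centroid(points):
    """Per-feature mean of a cluster."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def mean_spread(points):
    """Mean Euclidean distance to the centroid (smaller means denser)."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

def label_clusters(clusters, min_size, max_spread):
    """Tag large and dense clusters as normal, all others as abnormal."""
    labels = {}
    for name, points in clusters.items():
        large = len(points) >= min_size
        dense = mean_spread(points) <= max_spread
        labels[name] = "normal" if large and dense else "abnormal"
    return labels

# Made-up clusters, as if produced by a UL model.
clusters = {
    "c0": [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)],        # large, dense
    "c1": [(9, 9), (9.1, 9)],                                  # small
    "c2": [(0, 20), (20, 0), (20, 20), (0, 0), (10, 10)],      # sparse
}

labels = label_clusters(clusters, min_size=4, max_spread=2.0)
print(labels)
```

The resulting labels would then serve as the training targets for the SL model.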
Table 6: IoT related anomaly detection methods using ML.

| Papers | ML Models | Anomaly | Methodology | Advantages |
|---|---|---|---|---|
| Nõmm et al. [76] | UL: iForest and OCSVM | Botnet attacks | Use UL models with fewer than 10 selected features to detect attacks. | Reduced feature set consumes fewer resources during processing, while the detection rate is still reasonable. |
| Liu et al. [77] | UL | Packet modification attacks | Use an UL model to cluster nodes into three groups, and remove only the highly suspicious nodes. | Higher accuracy than clustering into two groups. |
| Karimipure et al. [78] | UL: Dynamic Bayesian Networks | False data injection attacks | The interaction in the subsystems is changed by attacks, as well as the pattern in the UL model. | Efficient computation and can detect unobservable attacks. |
| Ahmed et al. [79] | UL: iForest and PCA | Covert data integrity assaults | Use an UL model to create multiple forests, and the measurement with the shortest average path length is the attack. | Computational complexity is low and detection time is short. |
| Ali et al. [80] | UL: AE; SL: Linear SVM | DDoS attacks | Extract and merge key features in a weighted fashion, then use a SL model to detect DDoS attacks. | Prediction accuracy is improved after feature learning. |
| Bhatia et al. [81] | UL: AE | DDoS attacks | Train an UL model with normal traffic, and use it to detect attacks. | More effective against new and unknown attacks compared with some SL models. |
| Anthi et al. [82] | SL: DT | 12 kinds of attacks, such as DoS and MitM | Three-tier inspection using SL models to classify the attacks. | Can predict the type of attack. |
| Rathore et al. [83] | SSL: ELM and semi-supervised fuzzy C-means | | Pseudo label unknown data by checking it twice, and classify it when both outputs have a high confidence. | Better labelling approach, and fast real time detection. |
| Li et al. [84] | SSL: tri-training | | Train three classifiers with labelled data, and classify unlabelled data by majority voting. | Higher detection rate and lower error rate compared with some SL models. |
| Ravi et al. [86] | SSL | Data deluge attacks | Use the shortest Euclidean distance from the unknown data to the known clusters to annotate unlabelled data. | Higher accuracy than some SL models. |
| Gu et al. [87] | RL: Q-learning | High-rate and low-rate IoT attacks, such as ARP spoofing | Use a RL model to adjust the detection threshold so as to improve the performance. | Adapts well to the IoT environment. |

Salman et al. [92] categorise attack types with a step-wise model, which is an improvement on the traditional single-type model. A single-type model is trained on a specific type of attack together with normal traffic, so network traffic has to go through all the single-type models for attack categorisation. The step-wise model instead separates normal and anomalous traffic first, and then SL models recognise attack types using anomalous data only. The step-wise model puts several attack types in the same group, and once the group is determined, the specific attack type is further detected. Chkirbene et al. [93], [94] split a time period into several slots and use a SL model to predict the network state within each time slot. The most frequent decision within the time period is chosen as the prediction. Moreover, a weight value is involved in the prediction phase to improve the accuracy; it is the bias in the prediction of each category. The final decision is made from the number of predictions under the same category multiplied by its weight value. In other words, a category with a high weight is more likely to influence the result. Priyadarshini et al. [95] use SDN controllers in the fog network to block DDoS attacks before they enter the cloud network. A classifier model is trained in advance and runs on the SDN controller to determine the state of network traffic, and flow entries are generated by the controller and inserted into the switches in the network to forward legitimate packets and drop malicious packets. Apart from traffic analysis, SL can also be deployed to classify cloud users via their credentials, and access control is applied to the cloud network according to the user [96].
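The weighted vote over time slots reduces to a few lines of Python. The category names and weight values below are made up for illustration; how the weights are obtained in [93], [94] is not covered by this sketch.

```python
from collections import Counter

def weighted_decision(slot_predictions, weights):
    """Pick the category whose prediction count times weight is largest."""
    counts = Counter(slot_predictions)
    scores = {cat: n * weights.get(cat, 1.0) for cat, n in counts.items()}
    return max(scores, key=scores.get)

# One prediction per time slot within the period (hypothetical values).
slots = ["normal", "normal", "normal", "dos", "dos"]

unweighted = weighted_decision(slots, {"normal": 1.0, "dos": 1.0})
weighted = weighted_decision(slots, {"normal": 1.0, "dos": 2.0})
print(unweighted, weighted)
```

With equal weights the most frequent category wins, whereas doubling the weight of the attack category lets two attack predictions outweigh three normal ones.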
To obtain the best ML model for a specific training set whilst avoiding overfitting, more than one model can be trained and compared to find the best one [97]. Xu et al. [98] first randomly choose samples from the labelled dataset to create several subsets, and use the bagging method to train multiple models from these subsets. Then another dataset, which consists of unreliable anomalous and unlabelled data, is employed to train another model. Various sampling ratios are also tested to find the optimal ratio through evaluation metrics. Finally, all the models are assessed on the entire dataset, including both labelled and unlabelled data, and the one with the lowest learning error or the highest balanced accuracy is the best fitted model. SSL is also applied in the cloud and fog network due to the lack of labelled data. Xu et al. [99] introduce a fog enabled infrastructure and a fog assisted AI engine to deploy SSL models in the fog network. The infrastructure creates multiple virtual machines, and each machine hosts a partitioned subset of the original dataset. These partitioned subsets are then uploaded to the AI engine to train detection models. SSL models are applied to the same subset to find the optimal learning model based on the accuracy of detection. Gao et al. [100] employ the ensemble method to train a NN classifier with labelled data, and this classifier then predicts all the unlabelled data. The prediction is processed through fuzziness evaluation to extract valuable information, after which these pseudo-tagged data enter an ensemble system to double check the classification before being accepted into the training set. The implementation of ML in the cloud/fog/edge network is summarised in Table 7.

F. EXPERIMENTAL DATASET
Apart from the aforementioned networks, some ideas have been proposed in general, not targeting a specific network type, and these methods have been evaluated on public datasets. Hosseini et al. [102] detect DDoS attacks by inspecting packets twice in the network. Classifiers are trained offline on existing datasets; meanwhile, essential features are extracted for the detection phase. To complete the transmission to the server, a packet is examined on both the client side and the network side. On the client side, the state of the packet is predicted from its essential features and a divergence test. As long as the packet is not considered an attack, it is sent to the network proxy for further inspection; otherwise it is dropped. In the network proxy, an attack profile database contains all the known attack patterns, and any packet that matches a profile is discarded. Even if the attack is new to the database, it can still be detected via the trained classifiers, and its characteristics are recorded in the database for future detection. Gu et al. [103] propose a two-layer hierarchical ensemble model to detect anomalies. They first split the original dataset into heterogeneous training sets using the fuzzy C-means clustering algorithm. Then several base classifiers are trained on these subsets. Their outputs are aggregated in a nonlinear manner and fed to an upper layer classifier to train a final model. The decision whether the traffic is an intrusion is made by the final model.

Table 7 (recovered rows): The implementation of ML in the cloud/fog/edge network.

| Papers | ML Models | Anomaly | Methodology | Advantages |
|---|---|---|---|---|
| Kim et al. [89] | SL: RF; UL | Various network attacks | Feature selection with a SL model and use an UL model to cluster unlabelled data. Define a cluster by the distribution of threat labels in the cluster. | Only needs a small amount of labelled data. |
| | | | Combine both SL and UL models, and use labelled data to correct the prediction of unlabelled data. | The ability to recognise new traffic patterns is enhanced, and the detection accuracy is increased. |
When labelled data are unavailable, an AE can be used to capture the non-linear correlations among the data features, and to find a latent representation that is insensitive to the variance of the data in order to determine whether an anomaly has occurred. Nicolau et al. [104] introduce new regularisers to the AEs to push normal data into a small area centred on the origin, so that abnormal data are easy to identify, as they lie far away from the origin. Choi et al. [105] prepare three training sets with different ratios of abnormal to normal data, and use each of them to train four AE models. Each AE model produces key features of the training set, and these key features are employed to reconstruct the original dataset. If the reconstruction error is less than the threshold, the data are normal; otherwise, they are abnormal.
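The reconstruction-error decision rule can be sketched without a neural network. In the example below, a trivial stand-in that always reconstructs the mean of the normal training data takes the place of a trained AE, and the threshold is set halfway between the largest normal and the smallest anomalous error on a small held-out set. Both choices, as well as all data values and function names, are assumptions for illustration only and do not reproduce [104] or [105].

```python
import math

def fit_mean_reconstructor(normal_train):
    """Stand-in for a trained AE: always reconstructs the mean normal sample."""
    n = len(normal_train)
    mean = tuple(sum(row[i] for row in normal_train) / n
                 for i in range(len(normal_train[0])))
    return lambda x: mean

def reconstruction_error(x, reconstruct):
    """Euclidean distance between a sample and its reconstruction."""
    return math.dist(x, reconstruct(x))

def pick_threshold(normal_val, anomalous_val, reconstruct):
    """Midpoint between the worst normal and best anomalous error (assumed rule)."""
    worst_normal = max(reconstruction_error(x, reconstruct) for x in normal_val)
    best_anomalous = min(reconstruction_error(x, reconstruct) for x in anomalous_val)
    return (worst_normal + best_anomalous) / 2

def classify(x, reconstruct, threshold):
    """Normal if the reconstruction error is below the threshold."""
    error = reconstruction_error(x, reconstruct)
    return "normal" if error < threshold else "abnormal"

# Made-up two-feature samples.
normal_train = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2)]
normal_val = [(0.9, 1.1), (1.1, 1.0)]
anomalous_val = [(4.0, 4.5), (6.0, 5.0)]

reconstruct = fit_mean_reconstructor(normal_train)
threshold = pick_threshold(normal_val, anomalous_val, reconstruct)
print(classify((1.1, 0.9), reconstruct, threshold))
print(classify((5.0, 5.0), reconstruct, threshold))
```

Replacing `fit_mean_reconstructor` with a real AE leaves the thresholding logic unchanged, since only the `reconstruct` callable differs.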
When the amount of labelled data is small, how unlabelled data are annotated from the known ones has a deep impact on the performance of SSL models. Ashfaq et al. [106] introduce fuzziness to categorise unlabelled data into three groups, high, mid and low, with a model trained by NN. This model is initially created from labelled data and gives each unlabelled sample a fuzziness value. The data with a high or low fuzziness value join the existing labelled data group, and this new group is used to train an updated model to classify the test data. The mid fuzziness group remains ambiguous according to the classifier, so it is not added to the labelled data, which reduces the risk of misclassification. Moreover, when processing unlabelled data, Idhammad et al. [107] run four algorithms to reduce irrelevance and noise in the normal data and so increase the accuracy of DDoS attack detection. The entropy of the Flow Size Distribution (FSD) within a time window is calculated and compared with a threshold, and an abnormal entropy triggers the traffic data in that specific time window to be divided into three groups by a co-clustering algorithm. Based on the assumption that attack traffic far exceeds normal traffic during DDoS attacks, the group with a lower information gain ratio contains normal traffic only, and the other two groups contain anomalous traffic. After these unsupervised steps, the two data groups with malicious traffic are sent to the extra-tree algorithm for the SL steps. With labelled normal data only, abnormal states can still be recognised through SSL. Zavrak et al. [108] propose an AE based model whose training uses only normal data; after the model is trained, a validation dataset comprising half normal and half anomalous data is sent to the model to create a threshold for anomaly detection. Since a test sample is rebuilt by the trained AE, the anomaly threshold is defined on the difference between the reconstructed and the original input data. If the difference is larger than the threshold, the sample is labelled as abnormal; otherwise it is normal. Al-Jarrah et al. [109] first randomly divide the whole dataset into multiple clusters, so that a cluster may contain labelled data only, unlabelled data only, or a mix of both. A cluster that has untagged data only finds the nearest labelled data to form a new mixed cluster, which contains both labelled and unlabelled data. Tri-training is employed to process mixed clusters: it creates three classifiers from the original dataset, and tags unlabelled data as long as two or three classifiers agree on the labelling. For a fully labelled cluster, the proposed model builds a binary classifier if the cluster contains multiple classes of data; otherwise it labels the cluster with its single class. The ML applications validated via public datasets are summarised in Table 8.

VI. CONCLUSION
Machine learning is proving itself in multiple fields, among which anomaly detection is a feasible application that attracts a lot of attention. Whatever the network scenario, numerous ML models are available to choose from. Hence, we present a comprehensive review of ML in network anomaly detection. From SL to RL, each category processes data in a different style, which leads to a large gap in the outcome. A supervised, unsupervised or semi-supervised learning model is picked based on the dataset at hand, and the proportion of labelled data is a key factor in selecting a model. By contrast, RL takes a totally different approach: it allows the model to try all the state-action pairs so as to identify the best solution. In addition to model selection, data quality is the most vital part of anomaly detection, as it is directly linked to the prediction performance. In the future, we would like to further explore the application of deep learning techniques in next generation networks, such as SDN and IoT.
DR AKRAM AL-HOURANI (PhD, BEng, MBA, SMIEE, CPEng) is a Senior Lecturer and the Program Manager for the Master of Engineering (Telecommunication and Networks) at the School of Engineering, RMIT University, Melbourne, Australia. Dr Al-Hourani completed his Ph.D. degree in 2016 at RMIT University. He has published more than 55 journal articles and conference proceedings, including 3 book chapters. In 2020, Dr Al-Hourani won the IEEE Sensors Council Paper Award for his contribution to hand-gesture recognition using neural networks. He has extensive industry/government engagement as a chief investigator in multiple research projects related to the Internet-of-Things (IoT), Smart Cities, and Satellite/Wireless Communications. As a Lead Chief Investigator, he oversaw the design and deployment of the largest open IoT network in Australia, the "Northern Melbourne Smart Cities Network", in collaboration with 5 local governments; this project won the 2020 "IoT Awards", the official awards program of IoT Alliance Australia. Prior to his academic career, between 2006 and 2013, Dr Al-Hourani worked extensively in the ICT industry sector as an R&D engineer, a radio network planning engineer and then an ICT program manager for several projects spanning different technologies, including mobile network deployment, satellite networks, and railway ICT systems. Dr Al-Hourani is serving as an Associate Editor of Frontiers in Space Technologies and Frontiers in Communications and Networks, and as a Guest Editor of the special issue "Satellite Communication" in MDPI Remote Sensing. His current research interests include UAV communication systems, automotive and mmWave radars, energy efficiency in wireless networks, and the Internet-of-Things over satellite.
DR KARINA GOMEZ CHAVEZ received her engineering degree in Electronic and Telecommunication Engineering from the National Polytechnic School, Ecuador, in 2006. In 2006, she received her Master degree in Wireless Systems and Related Technologies from the Turin Polytechnic, Italy. In 2007, she joined the Communication and Location Technologies Area at FIAT Research Centre. In 2008, she joined the Future Networks Area at Create-Net, working on several National, European and Industrial projects. In 2013, she obtained her PhD degree in Telecommunications from the University of Trento, Italy. In July 2015, she became a lecturer at the School of Engineering at RMIT University, where her role is to coordinate several networking courses and supervise several PhD and Master students. Currently, she is a project manager at Milano Teleport. Her current research activity mainly focuses on Energy Efficiency Networks, 4G/5G Mobile Networks Architecture and Network Protocols, Internet of Things (IoT) Technologies, Software Defined Networking (SDN), Network Functions Virtualization (NFV), Network Security, Multi-layer Resources Management and Orchestration and Emergency Communications. She has several patents and has published her research in important journals and conferences.
PROF BENJAMIN RUBINSTEIN actively researches topics in machine learning, security & privacy, and databases, such as adversarial learning, differential privacy and record linkage. Prior to joining the University of Melbourne in 2013, he enjoyed four years in industry research labs including Microsoft Research Silicon Valley and IBM Research Australia, and received the PhD (Computer Science) from UC Berkeley in 2010. He has been part of teams that have: analysed privacy at the Australian Bureau of Statistics, a major bank, and Transport for NSW; studied the robustness of translation systems to data poisoning attacks with Facebook; helped identify and plug side-channel attacks against the Firefox browser; deanonymised Victorian Myki transport data and an unprecedented Australian Medicare data release; developed scalable Bayesian approaches to record linkage tested by U.S. Census; and shipped production systems for entity resolution in Bing and the Xbox360. Since joining Melbourne in 2013, he has been awarded $4.68m in competitive funding ($2.26m as lead). He also co-leads the CATCH Joint MURI-AUSMURI which has been funded over AUD $8m to convene a team of 16 experts across 7 universities for fundamental discovery in robust human-AI teams for cybersecurity.