Mitigating Adversarial Reconnaissance in IoT Anomaly Detection Systems: A Moving Target Defense Approach based on Reinforcement Learning

The machine learning (ML) community has extensively studied adversarial threats on learning-based systems, emphasizing the need to address the potential compromise of anomaly-based intrusion detection systems (IDS) through adversarial attacks. On the other hand, investigating the use of moving target defense (MTD) mechanisms in Internet of Things (IoT) networks is ongoing research, with considerable potential to equip IoT devices and networks with the ability to fend off cyber attacks despite their computational deficiencies. In this paper, we propose a game-theoretic model of MTD to render the configuration and deployment of anomaly-based IDS more dynamic through diversification of feature training in order to minimize successful reconnaissance on ML-based IDS. We then solve the MTD problem using a reinforcement learning method to generate the optimal shifting policy within the network without a prior network transition model. The state-of-the-art ToN-IoT dataset is investigated for feasibility to implement the feature-based MTD approach. The overall performance of the proposed MTD-based IDS is compared to a conventional IDS by analyzing the accuracy curve for varying attacker success rates. Our approach has proven effective in increasing the resilience of the IDS against adversarial learning.


Introduction
The Internet of Things (IoT) has become an intrinsic technology in various automated industries and large-scale smart city and government services, including wearable health devices and autonomous transportation [1]. A major concern about IoT systems is security, which has become more difficult to implement in embedded IoT devices that have limited computational resources to run firewalls and advanced cyber defense algorithms [2]. IoT devices and networks are known to be increasingly vulnerable to security attacks on data integrity and service availability [3,4]. With increased connectivity, IoT devices can become compromised and used as zombies. Hackers can control these devices remotely and exploit them for illegal purposes such as carrying out large-scale distributed denial of service (DDoS) attacks, e.g., via a Command and Control (C&C) server [5].
At the IoT network level, intrusion detection systems (IDS) constitute one of the essential defense tools that monitor the entire IoT network traffic [6]. As most IoT devices are deficient in computational and memory resources, most IoT networks rely on network-based IDS to provide security for the collective nodes within the network [7]. These IDS can either be signature-based or anomaly-based [8]. However, adversaries continue to devise new ways of orchestrating stealthy, evasive attacks that could render signature-based detection defunct. This is why there has been growing research interest in building anomaly-based IDS to detect unknown attacks. These IDS often rely on machine learning (ML) algorithms to detect malicious behavior [9][10][11]. Nonetheless, this reliance adds to the IoT network's threat surface due to the presence of adversarial learning attacks.
Conventional IDS in IoT networks are as susceptible to attacks as the nodes within these networks. Little attention has been paid to protecting IDS, but the damage caused by a cyber attack that compromises an IDS could be far more detrimental than that caused by compromising nodes within the network, since the IDS's mission is to protect the network. An adversarial attack stealthily injects into an ML model an input that is purposely designed to cause the model to make a mistake in its predictions, despite resembling a valid input when observed [12]. Adversarial attacks may belong to three categories. In white-box attacks, the adversary possesses knowledge of the training and testing datasets of a given IDS and of the ML model and miscellaneous techniques employed by the IDS, among other things. In gray-box attacks, the adversary only has partial knowledge of the network. Lastly, in black-box attacks, the adversary has no knowledge of the network and launches the attack blindly and arbitrarily.
In this paper, we aim to proactively protect anomaly-based IDS in IoT from potential adversarial learning attacks using a novel Moving Target Defense (MTD) approach. Our goal is to prevent or mitigate the impact of successful reconnaissance within the system, where stealthy adversaries probe the ML-based IDS to collect useful information about its design and operation in order to eventually craft subverting attack patterns. Although researchers using MTD have made important strides within the community [13], no work has yet considered implementing MTD on IoT gateways involving IDS components. The contributions of this paper are described as follows:
• A novel, stochastic game-theoretic MTD model based on architecture decentralization and learning diversification to protect ML-based IDS in IoT networks against adversarial threats. The model decomposes the target system into a number of logical sub-components and applies MTD to mitigate the impact of the adversary's reconnaissance over time by expanding the exploration surface in terms of features.
• A reinforcement learning (RL) solution to autonomously optimize MTD timing and operations within the network in an adaptive fashion without prior knowledge of the adversarial behavior and environment model. The proposed RL algorithm learns the optimal defense policy via trial and error in order to avoid unnecessary configuration shifts.
• An extensive experimental validation of our proposed threat prevention solution using the state-of-the-art, real-world ToN-IoT dataset [14].
Our results are promising and demonstrate the effectiveness of the MTD-enabled IDS in reducing the impact of adversarial reconnaissance while maintaining high detection accuracy.
The remainder of the paper is structured as follows. Section 2 provides background information and discusses the related work. Section 3 describes the proposed MTD model. Section 4 presents our RL-based solution to the MTD timing problem. Section 5 describes our experiments and results. Finally, Section 6 concludes the paper.

Adversarial Learning Threats to IDS
The Fast Gradient Sign Method (FGSM) is an adversarial learning attack that exploits the gradients on which gradient descent in neural networks relies: instead of iteratively updating the weights to minimize the training error, it perturbs the input in the direction of the sign of the loss gradient in order to compromise a given deep learning (DL) model [15]. This results in data points that look indistinguishable from the original ones but cause serious misclassification when passed through a DL model. FGSM is one of the most common adversarial learning attacks in the literature and is mostly used in image misclassification. The Barrage of Natural Transforms (BaRT) randomly sets up the classifier to be vulnerable to a number of transforms based on the criteria which the ML model uses in its prediction. Some transforms could be noise injection, FFT perturbation, and color precision reduction (for image classification) [16]. In a Jacobian-based saliency map attack (JSMA), by analyzing the Jacobian matrix of outputs with respect to inputs, one is able to deduce how the output probabilities behave given a slight modification of an input feature. In [17], JSMA was implemented against a multilayer perceptron model using the CIDS and TRAbID datasets for network traffic classification. All these adversarial threats can render the IoT network highly vulnerable to compromise by subverting the performance of the IDS.
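To make the FGSM idea concrete, the following is a minimal sketch against a logistic-regression scorer rather than a deep model. The weights `w`, bias `b`, sample `x`, and step size `eps` are hypothetical toy values chosen for illustration and are not taken from the paper or from the ToN-IoT dataset; the step size is deliberately large so the effect is visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(x, w, b):
    """Probability that sample x is malicious under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm_perturb(x, y, w, b, eps):
    """Single-step FGSM: move x by eps along the sign of the loss gradient.

    For binary cross-entropy on a logistic model, the gradient of the loss
    with respect to the input is (p - y) * w.
    """
    p = score(x, w, b)
    grad = [(p - y) * wi for wi in w]
    # (g > 0) - (g < 0) evaluates to the sign of g as -1, 0, or +1.
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

# Hypothetical toy model and a sample labelled malicious (y = 1).
w, b = [2.0, -1.0, 0.5], -0.5
x, y = [1.0, 0.2, 0.4], 1

x_adv = fgsm_perturb(x, y, w, b, eps=0.6)
clean_score = score(x, w, b)      # above 0.5: flagged as malicious
adv_score = score(x_adv, w, b)    # below 0.5: evades the toy detector
```

Each feature moves by at most `eps`, yet the toy detector's decision flips, which is the behavior the paragraph above describes.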
Generally, the defense approaches against an adversarial learning attack on IDS can be grouped into reactive and proactive [18]:
• Reactive: Involves carrying out patches on a given network, based on the kind of adversarial attacks experienced. This could be done in an iterative manner to enhance the robustness of the IDS.
• Proactive: Involves altering the underlying architecture or learning procedure of the IDS, e.g., by adding more layers, training the detection model in real time with adversarial attack samples, or increasing the sensitivity of loss/activation functions. The corresponding robustness condition is f(x + δ) = f(x), where δ is the input perturbation injected by the adversary to impact the prediction accuracy of the IDS.
To defend against FGSM, gradient masking (which naturally transforms a threat model from a white/gray box into a black box) is used to mask the gradient of the model's output with respect to its input [19]. Adversarial training could also be used, where the IDS is trained with some adversarial examples to make it robust to adversarial attacks, but this can sometimes lead to label leakage and over-fitting, as the adversarial examples generated during the training phase may not be present during the predictive/testing phase [20]. The authors in [21] came up with a novel model for adversarial training that involves feature scattering in a given latent space. They generate the feature-scattering adversarial examples in an unsupervised manner as a deliberate attempt to address possible label leakage. A novel generative adversarial network (GAN)-based adversarial defense method called Cowboy was proposed by [22]. This approach both detects and defends against adversarial attacks by using both the discriminator and generator of a typical GAN trained on the same dataset. The method is inspired by the hypothesis that adversarial samples ought to lie outside the data manifold learned by a GAN.

Moving Target Defense in IoT Systems
In contrast to reactive security approaches, where a cyber attack may cause damage to the system before counteraction is initiated, MTD is a cyber defense paradigm that proposes to proactively make systems and networks dynamic to increase the difficulty for an attacker to be successful with exploits in the first place. With MTD, there is a full acknowledgment that systems are always going to be vulnerable to zero-day threats regardless of how many times the attack surface is shrunk through patching, because attackers have an unfair asymmetrical advantage on static systems over time. Hence, the reconnaissance activities of the attacker on a static network would put the attacker in a favorable position to launch a successful attack [23]. It is also easier to create backdoors in traditional systems if system parameters are static. Furthermore, resource-constrained devices and networks may not be able to add complex security set-ups and hence are most vulnerable to attacks. So far, MTD research in IoT networks has typically focused on the nodes and is oblivious to the vulnerabilities of the IDS gateways through which all traffic converges and diverges, which can also be exploited by resourceful adversaries in order to penetrate the network.
MTD proposes the movement of system parameters at certain periods so that the state of the network at T_0 is different from the state of the network at some arbitrary period T_0 + δt, thereby making it difficult for the attacker to do any proper reconnaissance. The goal is not to reactively reduce the attack surface with traditional countermeasures like patches, but to proactively keep moving the attack surface to make a successful attack very difficult, if not practically impossible. The key design questions to consider while setting up an MTD-based network are: what to move, how to move it, and when to move it. With regard to what parameters to move/change, the moving parameter (MP) could be the data (e.g., formats), software, network (e.g., IP addresses, port numbers), platform (e.g., OS, firmware), runtime environment (e.g., RAM address space), or even hardware (e.g., routines in enterprise switch brands). With regard to how to move, the MP can be made to move from one configuration to another via randomized shuffling or using a predefined optimization algorithm to create diversification. Finally, the timing problem when implementing MTD involves determining the trigger to initiate the move in the network, which could be based on a specific time or event, or a combination of both.
In a given network, a typical use-case could be random re-assignment of IP addresses and port numbers. This can thwart the reconnaissance activities of the attacker. For example, using scanning tools like nmap will yield different results for each scan and will therefore not provide useful knowledge to the attacker. Also, MTD in IoT networks may involve constantly changing the communication protocols between nodes and the gateway (e.g., WiFi, Bluetooth, Zigbee, etc.). Such diversification would make it difficult for an adversary to complete her exploit, as each protocol is completely different from the others, with no correlation whatsoever.
Because of the uncertainty created by changing configurations on the network, it is possible to quantitatively describe the degree of uncertainty of a given network with a chosen MP by evaluating the number of states the MP is capable of taking on and the probability that it takes on a certain state. This can be modeled using Shannon's entropy [24]. For a uniform probability distribution, Shannon's entropy will directly depend on the number of states available for a given MP [24]. This theoretical inclination further buttresses the following point: instead of reducing the attack surface, MTD rather enlarges the "exploration surface" domain for attackers, and then moves the attack surface as a sub-domain within the exploration domain, thereby making it difficult for attackers to orchestrate attacks. That said, even though more states of an MP translate to higher uncertainty and hence greater difficulty for the attacker, it could be much more difficult and costlier to have numerous states, depending on the MP in question. For example, it is easier for the defender to deploy multiple states of IP addresses compared to multiple firmware. This also means that, for an attacker, it may be easier to break through 254 different states of a node's IP address than 5 different states of a node's firmware. Hence there ought to be some qualitative representation of weighing the cost of states of a given MP, for both the attacker and the defender.
MTD is a relatively new area of research that is rapidly gaining momentum, especially for low-resource networks like IoT [25]. The authors in [26] implement an MTD model that involves shuffling the proxies through which clients connect to access system resources. The authors in [27] introduce a novel MTD against network reconnaissance. It is a software-defined technique called the Sniffer Reflector. This MTD architecture is set up to prevent successful network probing by the attacker by providing forged responses to network scans. The authors in [28] propose a novel MTD framework employing IoT-enabled data replication to replicate sensory and control signals in cyber-physical systems. This framework combines two layers of uncertainty, hence reducing an arbitrary attacker's ability to learn about the IoT network over time. It also reduces the impact of false data injection attacks on a given system model.
An MTD system typically requires a continuous adaptation of system configurations to be able to effectively hinder attacks, which can cause some overhead on the resources of a given system. On the other hand, limiting adaptations in a bid to reduce the overhead could make it much easier for attackers to successfully execute attacks. Therefore, determining the right time to make adaptations with the objective of minimizing long-term costs highlights the importance of the MTD timing problem. For instance, the authors in [29] propose a cost-aware MTD model to make smart and optimized adaptations by analyzing the tradeoff between reducing system overhead and increasing the resilience of the system to attacks. Also, the authors in [30] propose a game-theoretic MTD model to address the timing problem when proactively facing DoS attacks. The model provided a guided framework to ensure that the defending system moved at an optimal time to yield the most resilience for the least cost on network performance.
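The dependence of Shannon's entropy on the number of MP states can be illustrated with a short sketch. It assumes entropy measured in bits (base-2 logarithm) and reuses the 254 IP-address states and 5 firmware states from the example above; the skewed distribution is a made-up illustration.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def uniform_entropy(n_states):
    """Entropy when the MP takes each of n_states with equal probability."""
    return shannon_entropy([1.0 / n_states] * n_states)

ip_entropy = uniform_entropy(254)       # equals log2(254) for a uniform MP
firmware_entropy = uniform_entropy(5)   # equals log2(5)

# A non-uniform choice over the same 5 firmware states is more predictable,
# so its entropy is lower than in the uniform case.
skewed_entropy = shannon_entropy([0.6, 0.1, 0.1, 0.1, 0.1])
```

Entropy alone ranks the IP-address MP as more uncertain than the firmware MP; it does not capture the per-state cost to the attacker or defender, which is the weighting the paragraph above argues is still needed.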

The Proposed MTD-enabled IDS
Our MTD model against adversarial learning in IoT anomaly detection systems is mainly based on a novel feature shuffling mechanism that can be incorporated into the training phase of the ML-enabled IDS to shield it from reconnaissance. To provide the research community with a solid theoretical foundation that could become the basis for future works on MTD (e.g., using different MP), the designed shuffling mechanism fundamentally relies on a stochastic game-theoretic formulation between the IDS and the attacker, which we describe in detail in this section.

System Model
The proposed IDS architecture is based on logical decentralization and aggregation. This was inspired by the analysis of the ToN-IoT network dataset [14]. First, the 45 features of the ToN-IoT dataset were reduced to 15 prime features by a dimensionality reduction step that selects the features contributing the most variance in the dataset. Features with minuscule variance contributions were eliminated. This is done to reduce the overhead of applying our MTD approach. In our architecture, the typical IDS is logically split into i decentralized IDS components. Each IDS component is trained with a unique combination of n prominent features, and every combination is different for each IDS component. Each instance of traffic that goes through the IDS architecture is transmitted in parallel to all IDS components to be classified as either normal or malicious traffic based on the feature combinations that each IDS component was trained with. The classification outcomes are then aggregated and a common classification result is chosen by virtue of a simple majority rule. Features are subsequently reshuffled at the trigger of the next shuffle iteration, which is dependent on the action taken by the IDS as part of the stochastic game.
The attacker is assumed to have unlimited resources for reconnaissance, allowing her to constantly bombard IDS components with data instances as part of her adversarial probing goals (e.g., finding out what combination of features is being used by a given logical component). If one or more components within the architecture start to record detection rates α that are relatively lower than those of other components, this implies that they may have been compromised. When half or more IDS components have been deduced to be compromised, this triggers the IDS architecture to shuffle features and re-train its components. Training of the IDS is performed online when a shuffle is triggered. The IDS components need to collaborate with each other in order to classify a given traffic instance. Therefore, they ought to be trained with the same instances for meaningful results. The only difference is that each IDS component's classifier interprets a given traffic instance based on the feature combination assigned to it, but the traffic instances must be the same for all. The model may adopt a strict policy to ensure that the accuracy of the MTD-based IDS always exceeds a certain threshold. Hence, the combination of features could be constrained by the total accuracy criterion set. For example, if it is desired that after training and testing, the accuracy obtained from aggregating the results of each IDS component should not be less than 97%, then this restricts the number of possible (valid) combinations. After reshuffling and training are complete, the IDS components are replaced in real time with the newly trained ones.
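The decentralization and aggregation steps described above can be sketched as follows. The feature names `f0`..`f14`, the choice of 5 components with 5 features each, and the helper names are hypothetical; the classifiers themselves are omitted and only the feature assignment and majority vote are shown.

```python
import itertools
import random

def assign_feature_subsets(pool, n_components, n_features, rng):
    """Give each IDS component a distinct combination of features.

    Sampling without replacement from all combinations guarantees that no
    two components receive the same feature combination.
    """
    combos = list(itertools.combinations(pool, n_features))
    return rng.sample(combos, n_components)

def majority_vote(votes):
    """Aggregate per-component labels (1 = malicious, 0 = normal)."""
    return int(sum(votes) > len(votes) / 2)

rng = random.Random(42)                      # seeded for repeatability
feature_pool = [f"f{k}" for k in range(15)]  # the 15 retained features
subsets = assign_feature_subsets(feature_pool, n_components=5,
                                 n_features=5, rng=rng)

# Hypothetical votes from the five components for one traffic instance.
verdict_malicious = majority_vote([1, 1, 0, 0, 1])
verdict_normal = majority_vote([0, 0, 1, 0, 1])
```

A reshuffle in this sketch is simply another call to `assign_feature_subsets` followed by retraining each component on its new subset.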
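One possible way to turn the relative detection rates into a shuffle trigger is sketched below. The comparison against the best-performing component and the `margin` value are assumptions made for illustration; the text only states that compromised components show relatively lower detection rates and that half or more compromised components trigger a shuffle.

```python
def flag_compromised(detection_rates, margin=0.05):
    """Flag components whose detection rate trails the best one by > margin.

    `margin` is a hypothetical tolerance; the paper does not specify one.
    """
    best = max(detection_rates)
    return [int(best - rate > margin) for rate in detection_rates]

def should_shuffle(flags):
    """Trigger a shuffle when half or more of the components are flagged."""
    return sum(flags) >= len(flags) / 2

# Hypothetical per-component detection rates (alpha) for two situations.
healthy_flags = flag_compromised([0.98, 0.97, 0.96, 0.97, 0.80])
attacked_flags = flag_compromised([0.98, 0.97, 0.80, 0.78, 0.75])

shuffle_when_healthy = should_shuffle(healthy_flags)
shuffle_when_attacked = should_shuffle(attacked_flags)
```

In the first situation only one of five components is flagged, so no shuffle is triggered; in the second, three of five are flagged and the architecture would reshuffle and retrain.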

Game Characterization
Game theory has been extensively used to evaluate situations where individuals have conflicting objectives. A game may be defined as a strategic interaction between two or more entities (players), which act in such a manner as to maximize their wins and minimize their losses (whether cooperatively or competitively) [31]. Pay-offs or utilities are the quantifiable motivations that players get for executing corresponding actions. The characteristics of the proposed stochastic game can be described as follows.
• States: Let S_v be the set of all possible states in the game. The i IDS components constitute a state. They take on binary values: "1" to indicate that the IDS component has been compromised, and "0" to indicate that the IDS component has not been compromised yet. The transition from one state to another, or choosing to remain in the same state, can be influenced by several factors: one or more IDS components within a conglomerate being supposedly compromised (by inferring from the relative detection rate), the defender just trying to maximize their rewards, etc.
• Actions: Let A_A be the set of all possible actions available to the attacker, and A_D be the set of all possible actions available to the defender. Naturally, the number of valid actions heavily depends on the operational goals of the game, for example, allowing only feature combinations that result in a certain accuracy percentage of the overall IDS. In our model, the defender agent chooses between shuffling the IDS configurations or not. The attacker chooses between probing the IDS parameters or not.
• Transition probabilities: Let P(s′ | s, a_A, a_D) represent the transition probability from state s to state s′ given the actions of the attacker (a_A) and defender (a_D). This captures the stochastic nature of the game, where the next state depends on the current state and the actions taken.
• Payoff functions: Let R_A(s, a_A, a_D) represent the payoff received by the attacker when in state s and taking action a_A, given the defender's action a_D. Let R_D(s, a_A, a_D) represent the payoff received by the defender when in state s and taking action a_D, given the attacker's action a_A.
• Strategies: Let π_A(s) be the attacker's strategy, which determines the action a_A to be taken in state s. Let π_D(s) be the defender's strategy, which determines the action a_D to be taken in state s.
• Value functions: Let V_A(s) represent the expected cumulative payoff for the attacker starting from state s, considering the attacker's strategy π_A and the defender's strategy π_D. Let V_D(s) represent the expected cumulative payoff for the defender starting from state s, considering the attacker's strategy π_A and the defender's strategy π_D.
Figure 1 depicts a scenario in our dynamic game, which is considered a semi-perfect information game. This is because the attacker is aware of when there is a change in state within the system. As compromising just one IDS component is not sufficient to launch an evasion attack (by virtue of the majority rule principle), the attacker is able to tell that her attack was impactful when the malicious payload goes through undetected (false negatives increase significantly because at a completely compromised state, the IDS is ineffective at detecting malicious traffic). After the termination of an episode due to the game being over, the IDS architecture is triggered to change the state by executing the "shuffle" action, and the attacker knows this because the false negatives would have tremendously dropped after retraining. The defender, on the other hand, is aware of the possible move of the attacker by virtue of the fact that there is a significant disparity in the relative detection rate among the IDS components.
The stochastic zero-sum game model follows the formulation in [32], in which the players' value functions are recursively defined, considering the expected cumulative payoffs and the transition probabilities.
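For reference, a standard way to write such a recursion for fixed strategies, using the symbols defined in the game characterization, is given below. The discount factor $\gamma \in [0, 1)$ is an assumption introduced here for illustration, and this is a textbook form rather than necessarily the exact expression used in [32].

```latex
\begin{aligned}
V_D(s) &= R_D(s, a_A, a_D) + \gamma \sum_{s'} P(s' \mid s, a_A, a_D)\, V_D(s'), \\
V_A(s) &= R_A(s, a_A, a_D) + \gamma \sum_{s'} P(s' \mid s, a_A, a_D)\, V_A(s'), \\
&\text{with } a_A = \pi_A(s), \quad a_D = \pi_D(s), \quad
R_A(s, a_A, a_D) = -R_D(s, a_A, a_D) \ \text{(zero-sum)}.
\end{aligned}
```

Under the zero-sum condition the two value functions satisfy $V_A(s) = -V_D(s)$, so it suffices to track the defender's value.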
Any stochastic game-theoretic model can be further expanded to include reinforcement learning as a potential solution. This is because this ML paradigm is based on the trial-and-error approach, and so an agent and its environment can be modeled as playing repetitive games of trial-and-error until an optimal solution is found. Suppose we model our IDS architecture so that a given state provides information about how many IDS components have been compromised, and so that the transition to the next state depends both on the attacker's success rate in compromising IDS components, assuming a perpetual attacker (an attacker constantly probing), and on the defending system's choice of action (whether to stay in a state or shuffle). Then the perpetual attacker's impact can be incorporated as part of the environment that the defending system (agent) has to learn from.
The majority rule criterion for the IDS architecture to maintain high classification accuracy implies that if more than half of the IDS components are compromised, then the game is over for the defender, and the episode terminates. Hence, there would exist different ways that a game/episode could terminate, and reinforcement learning is used to ascertain the optimal policy for the defender based on the testbed, as well as sub-optimal strategies that could approach the optimal strategy.

Threat Model
As is the case with most ML-based systems, anomaly-based IDS are prone to adversarial attacks. The kind of adversarial attacks that the attacker is able to launch depends on the information that she is privy to about the system. Our MTD model defends against gray-box adversarial attacks and assumes that: 1) the attacker has knowledge of the entire feature space that the IDS architecture uses to train and test traffic (datasets); 2) the attacker has knowledge of the IDS architecture, including its decentralization-aggregation approach; and finally 3) she is cognizant of the number of prominent features split among the IDS components. Nonetheless, the attacker is oblivious to the inner workings of the feature shuffling mechanism the IDS uses for the dissemination of features among the logical IDS components. Specifically, the attacker is unaware of the exact feature combination per IDS component at any given time. If the attacker figures this combination out, she is able to inject successful adversarial examples into that IDS component. Also, if the attacker had knowledge of the feature shuffling mechanism, her information about the entire system would be complete and hence this would have been a white-box attack model. Therefore, on the spectrum of gray-box attack scenarios, this is the worst-case scenario from the defender's perspective. This is particularly important because in assuming that the attacker has enough knowledge to compromise a given network, we are able to address most of the loopholes within a network for worst-case eventualities, which rarely happen but are very possible.
Typically, with unrestricted access to the feature space of the network traffic of the IDS architecture, the two categories of adversarial injection attacks that could be launched are data poisoning and evasion attacks. The former is implemented during the training phase of the IDS, while the latter is usually executed in the testing phase of the IDS (i.e., traffic transmission in real time). There has been significant progress in protecting systems against data poisoning attacks [33]. Hence, we only focus on evasion attacks due to their potential stealth. An evasion attack typically involves manipulating a given instance of traffic so that it is misclassified by the IDS. Ideally, because the attacker has knowledge of what features the IDS uses for classification, this attack should be easy to execute over time by constantly probing the IDS components. However, the dynamic shuffling of unique combinations of features among IDS components makes it difficult for the attacker to know what combination of features is used for a given IDS component at a given time t, rendering the information gathered by the attacker of little use. The goal of MTD in any case is to stretch the time t that an attacker requires to compromise a network as far as practically possible, bearing in mind the cost involved in terms of system resources and performance degradation [34]. Hence, allowing both the defender and the omnipotent attacker to play the game of features provides insights on how long the game can go on before the entire IDS is compromised.

Uncertainty Analysis of our MTD Approach
The feature space diversification of the theoretic model highlights the different possible feature combinations of the IDS components. For i components of a given IDS architecture, each component is represented by a bit. Assuming 5 IDS components, the IDS architecture is represented by [00000] bits. The flip of a bit from 0 to 1 denotes the compromise of an IDS component. The termination of a single iteration of the game is triggered when more than half the IDS components (i/2) have 1 as their bit values. It is important to re-emphasize that the objective of MTD is to create as much uncertainty in the system as possible, so as to make it difficult for an arbitrary attacker to successfully complete the reconnaissance phase and launch an attack over time.
Let X = i/2. The diversification of the proposed IDS architecture can be measured by the sum of combinations of all possible bits that can take the value 0 before termination at X = i/2. This is bounded by:

On the other hand, the diversification based on feature combinations (MP) is given by:

$$C^{n}_{f} = \binom{n}{f},$$

where f denotes the number of unique features per IDS component. The total diversification of the system is thus given by the product of the two expressions.

Entropy is a measure of the randomness or uncertainty within the system. The correlation between entropy and uncertainty implies that the greater the entropy within the system, the higher the uncertainty and difficulty for the attacker to successfully gather reconnaissance and launch an attack [23,24]. The entropy H(X) is given by [35]:

$$H(X) = -\sum_{x} p(x) \log p(x).$$

Based on the diversification expressions above, the maximum probability that the theory stipulates for which the IDS architecture will not be compromised is given by:

The 1/2 probability of the IDS components represents a uniform probability of a bit either being 0 or 1. The 1/n probability of the feature set represents a uniform probability of training a given IDS component from an n feature space. The entropy then is given as:

Based on these theoretical expressions, the following can be deduced:
1) The measure of uncertainty can be increased by increasing the feature space n from which IDS components can be trained;
2) The measure of uncertainty can be increased by reducing the number of feature combinations f that each IDS component is trained with, given a feature-space pool n (this would have an impact on the accuracy of each IDS component because the fewer features one trains an IDS with, the less information it has to properly classify a given data instance); and
3) The measure of uncertainty can be increased by increasing the number of IDS components i in a given IDS architecture (but this will incur more computational overhead).
In all cases, it is preferred to use entropy as the ultimate measure of uncertainty because it provides a tighter and more systematic way of looking at which parameters matter the most when it comes to increasing the uncertainty within the system [23].
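The quantities in this analysis can be explored numerically with the sketch below. It rests on assumptions that are not confirmed by the text: the architecture diversification is read as the sum of "i choose k" for k from 0 up to the integer part of i/2, the feature diversification as "n choose f", and the entropy as the base-2 logarithm of their product, i.e., a uniform distribution over all configurations. The function names are hypothetical.

```python
import math

def architecture_diversification(i):
    """Assumed reading: sum of C(i, k) for k = 0 .. floor(i / 2)."""
    x = i // 2
    return sum(math.comb(i, k) for k in range(x + 1))

def feature_diversification(n, f):
    """Number of ways to pick f features from a pool of n."""
    return math.comb(n, f)

def total_entropy_bits(i, n, f):
    """Entropy in bits, assuming all configurations are equally likely."""
    total = architecture_diversification(i) * feature_diversification(n, f)
    return math.log2(total)

arch_5 = architecture_diversification(5)   # 1 + 5 + 10 = 16
feat_15_5 = feature_diversification(15, 5) # 3003

base = total_entropy_bits(i=5, n=15, f=5)
larger_pool = total_entropy_bits(i=5, n=16, f=5)      # deduction 1
more_components = total_entropy_bits(i=7, n=15, f=5)  # deduction 3
```

Under these assumptions, enlarging the feature pool or adding IDS components both raise the entropy, in line with deductions 1 and 3; the sketch makes no claim about the exact expressions used in the paper.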

Reinforcement Learning for Solving the MTD Timing Problem
In RL, the agent undertakes a sequence of actions that yield feedback signals from the environment in the form of rewards or punishments. The agent learns over time based on the actions taken and the rewards or punishments received, and the environment evolves based on the agent's actions. The environment state transitions can be seen as stochastic sequences or Markov decision processes [36]. The agent's ultimate task is to find an optimal policy that yields the most rewards after experiencing several episodes of the RL game [37].
In our MTD model, we propose a new reward function for the agent which depends on the number of compromised components $i_c$ and the number of shuffled components $i_s = i - i_c$, where $i$ is the total number of IDS components. It also depends on the action taken ("Shuffle" or "Stay") and on whether the action is taken when the number of compromised IDS components $i_c$ is below or above the threshold $i/2$. We set up our reward function according to Algorithm 1, where the parameters a, b, c, and d are used by the system administrator to tune the reward estimation according to the importance of actions.
Algorithm 1: The reward function of the agent (closing branches)
  ...
  return (reward - cost);
end
else if agent_action == 0 and i_c > i/2 then
  cost ← d * i_c;
  return (reward - cost);
end
Based on this model, the defender has two actions: Stay or Shuffle. The "Stay" action indicates a decision by the defender to remain in the same state based on predefined objectives of the model, for example a reward threshold or the number of compromised components. The "Shuffle" action indicates a decision by the defender to change state. A shuffle therefore involves different feature combinations, which in turn affect metrics such as the overall performance of the IDS architecture. Depending on whether and when the agent triggers a shuffle, an episodic game either ends quickly or goes on for as long as possible; the latter requires the agent to shuffle before at least half of the IDS components within the architecture are compromised. However, choosing to shuffle continuously incurs computational and memory costs on the IDS architecture, because the IDS components are retrained with different feature combinations for every trigger move. Ultimately, the agent would want to learn about the environment over time, for a given number of episodes and the total corresponding rewards yielded by the actions taken, and to find the optimal policy, or the closest sub-optimal policy, so as to determine when to initiate a move.
The model is set up so that the attacker constantly attempts to launch a successful evasion attack. This would naturally consist of sub-actions of feature combinations. For the sake of simplicity of the model, the attacker's action is to constantly attempt to evade, and it is incorporated into the environment that the defense agent has to learn. Given an arbitrary pool of prominent features $f_p$ to shuffle from, if each IDS component is shuffled with $f$ features, then the number of combinations possible for each IDS component from the pool is given by $C^{f_p}_{f}$. Representing each IDS component's combination with bits requires $\log_2\left(C^{f_p}_{f}\right)$ bits. This relation indicates that each IDS component's feature combination is represented as an x-bit binary number, which is extremely important in the implementation phase, where it is much easier and more convenient to abstract feature combinations as binary representations. Hence, an evasion attack is deemed successful if, for any arbitrary $\log_2\left(C^{f_{split}}_{f}\right)$-bit feature combination of each IDS component, the attacker's feature combination matches.
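The binary abstraction of feature combinations can be illustrated with a minimal sketch. It assumes a hypothetical pool of six placeholder feature names and three features per component, and it rounds the bit count up to a whole number of bits; none of these values or helper names come from the original implementation.

```python
from itertools import combinations
from math import ceil, comb, log2


def combination_count(pool_size, n_features):
    """Number of ways to pick n_features out of a pool of pool_size."""
    return comb(pool_size, n_features)


def bits_needed(pool_size, n_features):
    """Whole bits needed to index every combination (assumption: round up)."""
    return ceil(log2(comb(pool_size, n_features)))


def encode_combination(pool, chosen):
    """Encode a chosen feature subset as a fixed-width binary string.

    The code is the subset's index in the lexicographic enumeration of
    all same-sized subsets of the pool (an illustrative encoding).
    """
    width = max(1, bits_needed(len(pool), len(chosen)))
    ordered = list(combinations(pool, len(chosen)))
    # Sort the chosen features into pool order so the tuple can be looked up.
    index = ordered.index(tuple(sorted(chosen, key=pool.index)))
    return format(index, f"0{width}b")


def evasion_succeeds(defender_code, attacker_code):
    """The attack counts as successful only if the attacker's code matches."""
    return defender_code == attacker_code


# Hypothetical pool of six placeholder feature names.
pool = ["f0", "f1", "f2", "f3", "f4", "f5"]
defender = encode_combination(pool, ["f3", "f4", "f5"])
attacker = encode_combination(pool, ["f0", "f1", "f2"])
print(combination_count(6, 3), bits_needed(6, 3), defender, attacker)
```

With six features and three per component there are 20 combinations, so five bits suffice to label each one; a larger pool makes a matching guess by the attacker correspondingly less likely.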
The Q-learning algorithm adopted to solve the MTD timing problem is shown in Algorithm 2. Given a policy function π(s), which indicates a set of "action-paths" to follow as a function of the present state s, V π (s) denotes the expected utility at a given state and U(R, γ) is the sum of discounted rewards (where γ is the discount factor and is equal to 1 if the reward function is not affected by the circumstances of the future) [38]. In the RL formulation, Q(s, a) is referred to as the Q-value at a chance node, which is the quality of a state-action pair. Essentially, the optimal value at any state s is the Q-value maximized over the set of actions. The optimal policy π(s, a) is the state-action combinational subset that maximizes the quality function (Q-value), as seen in Algorithm 2. The Bellman optimality equations can be solved directly only when there is a working model of the environment. This is where model-free RL solutions come in handy. We leverage the Q-learning technique, which keeps track of which state-action pair in any given policy yields the most rewards. At each step k, the value of a state (or the quality Q of a state-action pair) is re-evaluated until k = K, where K is the total number of steps. Q-learning is an off-policy RL technique, which implies that the quality of the next state-action pair is explorative and will capture even sub-optimal Q-values in an attempt to find the optimal policy, i.e., the argmax that maximizes the Q-value for the next state. Its off-policy quality makes it possible to keep track of almost all episodes that lead to termination. There is a tunable parameter ϵ that adds some randomness for exploring sub-optimal policies. In Algorithm 2, the Q-table is updated iteratively, and the agent's choice of action gains optimality over time.
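For illustration, the tabular Q-learning step with ϵ-greedy exploration can be sketched as follows. This is the standard textbook update rule, not a transcription of Algorithm 2; the learning rate `alpha`, the default values, and the function names are assumptions made for the example.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability epsilon explore a random action, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """One off-policy Q-learning step on a Q-table stored as a list of rows.

    Q[s][a] moves toward reward + gamma * max_a' Q[s_next][a'], regardless
    of which action the behaviour policy actually takes next.
    """
    best_next = 0.0 if terminal else max(Q[s_next])
    Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a])
    return Q[s][a]


# Two states, two actions (index 0 and 1); the values are illustrative only.
Q = [[0.0, 0.0], [1.0, 2.0]]
new_value = q_update(Q, s=0, a=1, reward=1.0, s_next=1, alpha=0.5, gamma=0.9)
greedy_action = epsilon_greedy(Q[1], epsilon=0.0)
print(new_value, greedy_action)
```

Because the target uses the maximum over next-state actions rather than the action actually taken, the update is off-policy, which matches the property described above.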

Experimental Setup
We validate our MTD solution using the state-of-the-art IoT security dataset called ToN-IoT, which was generated by simulating MQTT-based traffic [39] in an MQTT machine-to-machine network architecture of 12 sensors, a broker, a camera, and an attacker. Five scenarios of data records were retrieved: normal operation, aggressive scan, UDP scan, Sparta SSH brute-force attack, and MQTT brute-force attack. Three abstraction levels of features are also employed in the composition of this dataset (after extraction from raw pcap files): packet features, unidirectional flow features, and bidirectional flow features. The MTD parameter n was chosen to be 3 based on the analysis performed on the ToN-IoT dataset: for an IDS component to be trained on a combination of prominent features with enough variance for decent classification of traffic instances, it ought to be trained with at least three features.
The anomaly-based IDS relies primarily on an ML classifier. In our experiment, we set up a random forest classifier consisting of ten decision trees as our ML model for the IDS. The random forest is a meta-estimator that fits a number of decision trees on the given dataset and averages their classifications for a more accurate and robust result. Our experimental MTD-enabled IDS consists of 5 different random forest classifiers (logical IDS components). After feature selection was conducted on the ToN-IoT dataset, the chosen features were recorded in a mutable list (called the pool of features). During the training phase of the IDS components, each was randomly trained on three unique combinations of features from that feature pool.
To compromise a given IDS component, we use the CleverHans library for our adversarial learning attack implementation. It is an open-source Python library for conducting adversarial attacks on ML models. Adversarial examples were generated by feeding the test traffic through the cleverhans.torch.attacks.noise module, which injected adversarial noise into the ToN-IoT test data samples and led to the misclassification of data instances.
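The assignment of feature combinations to the five logical IDS components can be sketched with the standard library alone. The sketch reads the setup as one distinct three-feature combination per component, uses placeholder feature names rather than actual ToN-IoT columns, and aggregates component outputs by majority vote; the aggregation rule and all helper names are assumptions, since the text does not specify them. In a full pipeline each subset would select the columns on which a ten-tree random forest is fitted.

```python
import random
from collections import Counter
from itertools import combinations


def assign_feature_subsets(pool, n_components=5, n_features=3, seed=None):
    """Give each IDS component a distinct random feature combination."""
    rng = random.Random(seed)
    all_combos = list(combinations(pool, n_features))
    if n_components > len(all_combos):
        raise ValueError("not enough distinct combinations in the pool")
    # Sampling without replacement guarantees no two components share a subset.
    return rng.sample(all_combos, n_components)


def majority_vote(labels):
    """Aggregate per-component predictions; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


# Placeholder feature names standing in for the selected feature pool.
feature_pool = [f"feat_{k}" for k in range(8)]
subsets = assign_feature_subsets(feature_pool, seed=0)
for component_id, subset in enumerate(subsets):
    print(component_id, subset)

# Hypothetical per-component predictions for a single traffic instance.
print(majority_vote(["attack", "normal", "attack", "attack", "normal"]))
```

Re-running `assign_feature_subsets` with a fresh seed plays the role of a shuffle: every component receives a new combination and would then be retrained.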
The implementation relies on several Python libraries. The scikit-learn library was used to implement the random forest classifier model. The pandas library was used for reading the ToN-IoT dataset as a data frame into the working environment, and then for the various manipulations of the data frame. The NumPy library was used for handling data fragments as arrays, manipulating them, and writing them back to the data frame for further analysis. The OpenAI Gym library was used to provide the baseline environment for implementing the RL algorithm and incorporating the incessant adversarial attacks as part of the agent's environment. The machine used for the experiment has an Intel64 Family 6 Model 165 Stepping 2 GenuineIntel processor with 7,968 MB of RAM.
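To make the role of the environment concrete, the following toy class mimics the classic Gym-style `reset`/`step` interface without importing Gym. Every modelling choice in it is an assumption for illustration: the attacker compromises at most one component per step with a fixed probability, a shuffle restores all components at a fixed cost, staying is rewarded by the number of uncompromised components, and an episode ends once more than half of the components are compromised. It is not the environment used in the experiments.

```python
import random

STAY, SHUFFLE = 0, 1  # assumed action encoding for this sketch


class ToyMtdEnv:
    """Gym-style toy environment; the state is the compromised-component count."""

    def __init__(self, n_components=5, attack_odds=0.5, shuffle_cost=1.0, seed=None):
        self.n = n_components
        self.attack_odds = attack_odds
        self.shuffle_cost = shuffle_cost
        self.rng = random.Random(seed)
        self.compromised = 0

    def reset(self):
        """Start a new episode with every component freshly trained."""
        self.compromised = 0
        return self.compromised

    def step(self, action):
        if action == SHUFFLE:
            self.compromised = 0          # retraining restores every component
            reward = -self.shuffle_cost   # assumed retraining cost
        else:
            reward = float(self.n - self.compromised)  # assumed reward for staying
        # The attacker is part of the environment and always attempts to evade.
        if self.compromised < self.n and self.rng.random() < self.attack_odds:
            self.compromised += 1
        done = self.compromised > self.n / 2
        return self.compromised, reward, done, {}


env = ToyMtdEnv(attack_odds=1.0, seed=0)
state = env.reset()
trace = [env.step(STAY)[:3] for _ in range(3)]
print(trace)
```

With `attack_odds=1.0` an agent that never shuffles loses one component per step and the episode ends on the third step, which is the kind of early termination the learning agent has to avoid.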
The experimentation steps can be summarized as follows. First, we set up and train two separate IDS, a conventional IDS and the MTD-based IDS, each using the random forest ML model as the classifier and the ToN-IoT dataset as the traffic specimen. Next, we measure and record the accuracy on the test traffic going through the conventional IDS. Using the CleverHans adversarial framework, we inject adversarial noise into the test traffic and measure the accuracy drop, and we repeat this for increasing odds of success of the attack launch (i.e., 0.5, 0.6, 0.8, 1). For each adversarial attack attempt on the MTD-based IDS, the RL algorithm progressively learns the corresponding reward obtained and influences in real time what action the agent takes over the course of a number of episodic runs. The odds of the attacker are increased accordingly from 50% to 100%. The accuracy at each of these odds is recorded, as well as the CPU and memory usage amidst adversarial attacks and MTD deployment.
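One simple way to realize the increasing odds of success is to draw a uniform random number per attack attempt and count the attempt as successful when the draw falls below the configured odds. The sketch below does this with Python's standard `random` module and a shared seed so that both IDS variants face the same attack sequence; the thresholding rule, the seed value, and the function name are assumptions for illustration.

```python
import random

ATTACK_ODDS = [0.5, 0.6, 0.8, 1.0]  # odds of success listed in the steps above


def simulate_attack_attempts(success_odds, n_attempts, seed):
    """Return one boolean per attempt: True if the attack launch succeeds."""
    rng = random.Random(seed)
    return [rng.random() < success_odds for _ in range(n_attempts)]


SEED = 42  # hypothetical seed, shared by both IDS variants for a fair comparison
for odds in ATTACK_ODDS:
    conventional = simulate_attack_attempts(odds, 1000, SEED)
    mtd_based = simulate_attack_attempts(odds, 1000, SEED)
    same_sequence = conventional == mtd_based
    print(odds, sum(conventional) / 1000, same_sequence)
```

Sharing the seed means any accuracy difference between the two systems at a given odds value comes from the defense, not from a luckier draw of attack attempts.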

Results Analysis
As seen in Figure 2, the simulation testbed for the MTD-based IDS is set up so that the IDS defending agent goes through various scenarios in which an episode could terminate, and the associated reward is obtained after an episodic termination. The attacker's incessant adversarial attacks are incorporated as part of the environment, and the testbed also records in real time the overall accuracy of the MTD-based IDS at each episodic step. Resetting to a new episode or triggering the "Shuffle" action results in the retraining of the 5 IDS components (indexed 0 to 4). Figure 3 provides a microscopic view into what goes on within an episodic step: the combination of features used to train a specific IDS component, each IDS component's accuracy after training, and the combined IDS accuracy after training (with or without an adversarial attack).
One of the biggest concerns of having an MTD-based system is coming up with a systematic scheme to know when to initiate an adaptation in an optimized or quasi-optimized manner. Here, we leveraged RL to guide the behavior of the agent and allow it to learn in real time what actions to take and when to take them to maximize gain and minimize cost. Figure 5 shows the cumulative reward of the MTD-based IDS agent over 250 episodes. The results show that the agent's reward increased over the number of episodes, which is a strong indication that learning was taking place.
Figure 4 shows the overall accuracy comparison of the conventional and MTD-based IDS for various attack success rates. The various success rates of the attacker were simulated using the uniform probability distribution of the NumPy library (with the same seed value for both the conventional and MTD-based IDS, for objectivity and neutrality) and changing the odds incrementally. For example, as can be seen from the figure, at 50% odds of success, the attacker has an equal chance of success and failure against both the conventional and MTD-based IDS. In this case, no adversarial attack was triggered, which is why the conventional IDS has an accuracy close to 100%. Neither IDS was attacked at this success rate; the slight drop in the accuracy of the MTD-based IDS is due to the implementation of MTD (the decentralization and aggregation in the MTD architecture). As the attacker's success rate keeps increasing, however, the accuracy of the conventional IDS degrades significantly, whereas the accuracy of the MTD-based IDS degrades much less.
Figure 5 shows the progression of the agent's reward during learning. Figure 6 shows a positive correlation between the reward accrued and the overall accuracy of the MTD-based IDS. This also informs us that the MTD-enabled IDS agent was truly learning to optimize and maximize its overall accuracy. It must be clarified, though, that the system's inherent costs were not incorporated into the reward function of the agent during the RL training phase. The agent's reward was simply a function of the number of compromised IDS components and the number of shuffled IDS components, as ratios of the total number of IDS components within the system. In our future work, we intend to include the network cost within the reward function of the agent.
In Figure 7, we show the CPU usage of the MTD-based IDS in comparison to the conventional IDS for different success rates of the attacker. As can be deduced, the constraints on CPU resources increase with every increase in the adversarial attack success rate.

Conclusion
Anomaly-based IDS in IoT networks mostly rely on machine learning, and hence are vulnerable to adversarial threats. In this paper, we designed a novel MTD solution based on feature shuffling to allow the IDS to counter stealthy reconnaissance efforts and prevent successful evasion attacks. Our solution is based on a game-theoretic problem formulation between the defense system and the adversary, which we solve using reinforcement learning to orchestrate the configuration shifts in an adaptive fashion. The results show that our MTD-enabled IDS is more resilient to adversarial attacks than a conventional IDS, whose accuracy degrades more sharply as the attacker's success rate increases. We expect these findings to contribute to the MTD literature on ML-based intrusion detection. We plan to replicate the experiments in a real-world IoT testbed to achieve stronger validation on various datasets and investigate other potential moving parameters that may allow us to expand the exploration surface of the IDS.

Figure 2. The first 10 episodes of the game played by the MTD-based IDS agent, and the corresponding reward and accuracy obtained in each episode.

Figure 3. Microscopic view into the first 8 episodes (indexed by their accuracy values) showing the feature combinations each IDS component was trained with.