1 Introduction

The evolution of cyber adversary attack capabilities is on display nearly every day and has become part of the new normal for cyber defenders, stakeholders, and society at large. This paradigm is now so widely accepted that cyber defense is planned and conducted as a permanent arms race with cyber adversaries [18, 31]. One notable domain of potential evolution is the use of artificial intelligence (AI) technologies and advancements for cyber attacks [16, 19, 41, 44].

Once thought to be impossible, or at least decades away, superhuman Go play arrived in 2016, when advances in search, neural network classifiers, and computing hardware led to the AlphaGo system, which decisively outperformed the world’s best Go players [38]. Observers in many domains may therefore reasonably wonder: if a game with a branching factor of 250 can be conquered, how long until other “impossible”, complex problems are also dethroned? Within the cyber domain, one can already observe efforts and advancements in the ability to carry out autonomous cyber attacks [2, 4, 16, 21]. While the equivalent of an AlphaGo-level autonomous cyber attack system has not yet presented itself, one can objectively assert that cyber defenders no longer believe such an adversary is abstract [7, 41].

Consequently, defense against an AI-enabled, autonomous adversary will require cyber defense technology, tactics, and strategies that specifically counter the competitive advantages gained from an adversary’s use of AI. Along these lines, one promising cyber defense tactic is that of cyber deception. Cyber deception can deflect, distort, deplete, and discover cyber adversaries and attacks, and be tailored to interact or engage with adversaries to achieve precise detrimental effects [1].

In this work, we envision (and implement) a future cyber adversary whose actions and decisions are entirely controlled by an autonomous system. This autonomous system uses search techniques to drive its cyber attack operations and achieve the desired objective(s). Given such a scenario, we ask (1) whether novel cyber deceptions can be constructed and deployed in such a manner as to directly target weaknesses in automated planning and search techniques; and (2) whether an effective emulation system can be implemented to evaluate cyber deceptions and autonomous adversaries against each other at scale. To answer and explore these questions, we present and evaluate Mirage, a cyber deception and autonomous cyber adversary experimentation system.

This article extends previously published work [22] with an analysis of the robustness of our cyber deceptions against a reinforcement learning agent in simulation.

1.1 Key contributions

  • Three novel cyber deceptions purpose-built for countering autonomous cyber attacks.

  • Anansi - a Windows operating system deception service framework to deploy and actuate cyber deceptions.

  • Experimental analysis of 72 offensive cyber operation trials in emulation, varying the adversary profile, choice of planning technique, and employed deceptions.

  • Demonstration of the robustness of deception, evaluating it in a high-fidelity cyber simulation environment against reinforcement learning agents.

2 Background

2.1 Cyber deception

Cyber deception is generally described as any planned action(s) taken to mislead and/or confuse attackers, thus causing attackers to take specific actions that aid in the actual defense of the cyber system to which the deception is applied [43]. Cyber deception has long found significant application and value within computer systems and networks [12, 32]. Common cyber deception constructs include honeypots [30] and honeytokens [13], which are fake/decoy computer systems, resources, and/or data that are deployed on real computer and network systems for purposeful influence and engagement with an attacker.

For the purposes of this work, the cyber deceptions created and evaluated within Mirage can all be categorized as novel honeytoken implementations. Specifically, Mirage’s cyber deceptions are all computer file-based cyber deceptions.

2.2 Adversary emulation

Adversary emulation is a variant of the discipline of cyber red teaming that aims to emulate a known cyber threat/adversary to a much higher fidelity, with regard to known actions, behaviors, and objectives, than standard red teaming would normally account for [3]. As a sub-discipline, adversary emulation was established to address the need to evaluate whether computer systems and networks are protected against specific advanced persistent threats (APTs). In effect, it is a marriage of the discipline of cyber threat intelligence to the red teaming process. While adversary emulation is not the direct focus of our work, it nevertheless provides the tools and infrastructure for our experimentation program. Our chosen adversary emulation platform is Caldera [28].

2.3 Reinforcement learning

Reinforcement learning (RL) is a subset of machine learning that focuses on training agents, which can be computer programs or physical robots, to make decisions in dynamic environments. The central idea behind RL is that the agent learns by interacting with its surroundings through a simulated environment often called a “gym” [5]. Instead of being provided with labeled data, as in classical supervised machine learning, the agent generates its own experience data to learn from. The RL agent explores its environment using a decision-making policy to observe the current state and take the actions it decides are optimal. After each action, the agent receives a reward from the environment, which serves as feedback indicating how good or bad the action was in terms of achieving the agent’s goals. The primary objective of the RL agent is to improve its decision-making policy so as to maximize the cumulative reward it accumulates over time, learning to associate specific actions with higher rewards and thereby converging on a policy that consistently selects those actions.

There are a variety of training algorithms that can be used to derive and optimize the policy. Policy-based approaches such as Proximal Policy Optimization (PPO) attempt to learn the policy directly. In contrast, value-based approaches like Q-learning first attempt to find an optimal value function, which estimates the quality of each state-action pair, and then derive the policy from that value function [25]. Policy-based algorithms are generally less sample-efficient but are better suited for stochastic environments and continuous action spaces, whereas value-based algorithms are more sample-efficient but are limited to discrete action spaces and only optimize the policy indirectly. Actor-Critic methods combine elements of both types by having an actor learn the policy and a critic estimate the value function. Recent approaches such as Asynchronous Advantage Actor-Critic (A3C) have shown considerable gains in training speed but still have drawbacks such as instability and increased implementation complexity.
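As a concrete illustration of the two families, the standard tabular Q-learning update and the PPO clipped surrogate objective (textbook formulations, not specific to this work) can be written as:

```latex
% Value-based: tabular Q-learning update after observing reward r and next state s'
Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]

% Policy-based: PPO clipped surrogate objective, where
% r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)
% and \hat{A}_t is an advantage estimate
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\big( r_t(\theta)\,\hat{A}_t,\;
    \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big) \right]
```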

All of the above methods fall within the category of model-free algorithms, where experiences are solely derived from interactions with the environment. In contrast, model-based methods train an additional internal model of the environment used for simulating future scenarios. The policy can plan ahead and train on these model-simulated trajectories. Model-based approaches can work very well for problems with well-defined environment rules such as in physics-based simulations. Nevertheless, they are not without their drawbacks, including limitations imposed by the trained model, which can include bias from model-approximation errors, limited data, and structural assumptions about the real environment [25].

3 Related work

For the purposes of our work in Mirage, related research and development work can be categorized into the overlapping areas of cyber attack simulation and emulation environments, cyber deception deployment systems, and cyber deception research. Recent work in these areas is detailed below.

Within the past few years, due to the growth and accessibility of RL “gyms,” there have been multiple independent efforts focused on cyber attack gyms with varying levels of fidelity and scenarios of focus. CyberBattleSim is a lower-fidelity/high-level RL gym for training an attacker agent against a simulated cyber network environment [27]. CyberBattleSim focuses on a limited set of attacker actions, such as lateral movement. Walter et al. extended CyberBattleSim with the deception components of decoys, honeypots, and honeytokens in order to evaluate the optimal placement of deception components against attacker agents [42]. Similarly, NASim is another RL gym using a simulated cyber network environment to evaluate the effectiveness of deceptions, specifically attempting to answer defensive value questions such as how many honeypots to use against certain cyber attacker models [33]. CybORG is a gym focused on the training of defensive agents given specified cyber network scenarios (environments) and attackers (adversarial agents) [40]. CybORG has a higher-fidelity/lower-level action space for its simulation environment, making it closer to realistic cyber attack scenarios than CyberBattleSim. The first CybORG challenge included both simulation and emulation operations, but subsequent challenges are pure simulation. CyGIL is an RL gym built to work with an emulated environment instead of a simulated environment [24]. In this regard, CyGIL is quite novel and allows for an extremely high-fidelity/low-level action space within its RL gym. Both CyGIL and Mirage use Caldera for their attacking agent and offensive action space.

Beyond simulation and emulation environments, research has also been conducted in developing applied cyber deception systems. Al Shaer et al. developed automated cyber deception design and deployment systems, including DodgeTron [34], CHIMERA [17], and SODA [35]. These deception systems all center around the automation, deployment, and efficacy of cyber deceptions against targeted malware instances. Specifically, CHIMERA and SODA focus on the targeting and actuation of deceptions towards adversarial tactics and behaviors gleaned from previously analyzed malware samples. Additionally, at the concept and ontology level is the MITRE Engage framework which is a knowledge base for cyber deceptions and adversary engagement [29].

Lastly, there is notable work on the quantification of the effects of cyber deception on cyber adversaries. Ferguson-Walter et al. conducted multiple, significant studies on the effects of cyber deception on actual cyber attackers by way of large human research trials with penetration testers, red-teamers, and computer specialists [10, 11, 37]. These studies are entirely centered around human adversaries but do conclusively quantify the detrimental effect of cyber deception on a cyber attacker. Similarly, these effects have also been cataloged from a concept and case study perspective in [8].

4 Emulation

To quantify the effects of file-based deception against automated adversaries, we designed and built a framework for deception experimentation that runs in emulation, i.e., on virtual machines running standard commercial software. In the remainder of this section, we discuss the details of the framework and the results of our experiments.

4.1 Experimentation framework

Each trial in our experiments consists of a specific adversary and deception choice. These are deployed together on a small Windows domain running in an Amazon Web Services environment. These components are shown in Fig. 1, and described below.

Fig. 1 The operational architecture of the Mirage emulation system

4.1.1 Adversary profiles

Two adversary profiles were defined against which to evaluate the deceptions. Both target Windows systems and were implemented in PowerShell. A full list of abilities can be found in Appendix A.

Thief. A simple exfiltration adversary that contains Tactics, Techniques, and Procedures (TTPs) for creating a staging directory, performing sensitive file discovery, copying the found files to the staging directory, compressing the staged files, and finally exfiltrating the zipped archive back to the Caldera server.

BlackSun. A ransomware adversary that contains a diverse set of TTPs aimed at both finding and encrypting sensitive files, as well as covering its tracks. This adversary was modeled on a real-world threat of the same name first observed in 2020, written in PowerShell for access to Microsoft cryptography modules [6].

Both adversaries contain a matching set of lateral movement abilities targeting a Windows environment. At a high level, these abilities discover user credentials, use those credentials to open a file share to a remote host, copy the agent binary to the remote host over the file share, and then execute the binary using Windows Management Instrumentation (WMI). This process creates an additional Caldera agent on the new host for the server to make use of when choosing actions.

4.1.2 Cyber attack planners

The core Caldera platform provides the main capabilities for matching facts in its knowledge base against the requirements for abilities in an adversary profile to determine which actions it is able to take. The platform also allows for more complex decision-making logic through Caldera modules called planners. These planners are able to overlay any decision logic over the actions currently available to the adversary profile and further decide which specific actions to take.

The following planners were selected for use in our experimentation program:

  • Batch - A simple planner that executes all available actions at each iteration. This planner is primarily used as a baseline in the experiments.

  • Look-Ahead [14] - Chooses a single action at each iteration based on the expected reward. Action-reward values are set by the user a priori. During an operation, the planner then scores abilities by the discounted values of ability sequences up to a maximum depth (see the sketch after this list).

  • Guided [23] - Constructs a directed attack graph and performs a goal-based search to find and execute actions that lie along the shortest path to the goal. At each iteration, the planner chooses the action closest to its goal.
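To make the Look-Ahead scoring concrete, the following is a minimal sketch of depth-limited, discounted scoring over ability sequences; the function names and data structures are illustrative and do not reflect Caldera’s internal planner API.

```python
# Minimal sketch of depth-limited, discounted look-ahead scoring of abilities.
# `rewards` maps ability names to user-assigned reward values (set a priori);
# `successors` maps an ability to the abilities it can enable next.
# Names and structures are illustrative, not Caldera's actual planner API.

def lookahead_score(ability, rewards, successors, depth, gamma=0.9):
    """Score an ability by its reward plus the best discounted follow-on sequence."""
    base = rewards.get(ability, 0.0)
    if depth == 0:
        return base
    future = [
        lookahead_score(nxt, rewards, successors, depth - 1, gamma)
        for nxt in successors.get(ability, [])
    ]
    return base + gamma * max(future, default=0.0)

def choose_next_ability(available, rewards, successors, max_depth=3):
    """Pick the currently available ability with the highest look-ahead score."""
    return max(available, key=lambda a: lookahead_score(a, rewards, successors, max_depth))
```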

4.1.3 Cyber deceptions and deception service

As the primary component under test, the cyber deceptions were the target of our Mirage experiments. To create novel cyber deceptions specifically tailored towards inhibiting an autonomous cyber adversary/attack, we first identified general planning techniques found across artificial intelligence search and planning algorithms and consequently focused on three of them. In effect, we make the strong assumption that any autonomous cyber attack system will use the following techniques in its implementation:

  • Attempt to reduce the state space by (1) ignoring or abstracting portions of the state space, (2) pruning the state space via heuristics and sub-goal localization, and (3) removing symmetric branches/paths.

  • Will perform online planning and decision-making; that is, will have the ability to re-plan.

  • Operations will be goal-oriented, in line with common cyber attack objectives (e.g., persistence and data theft).

Given these assumed search and planning techniques (in use by the adversary), we proceed to develop deceptions that directly exploit them to the adversary’s detriment. For example, if an adversarial autonomous cyber attack system aims to reduce the state space, the goal of our novel deception is to purposely prevent such state space reduction or even expand the state space. Table 1 details the three novel deceptions created for evaluation against an autonomous adversary.

Table 1 Descriptions of novel cyber deceptions for targeting autonomous cyber attack systems

An observer will note that these deceptions are all based around computer host file objects. This was due primarily to implementation complexity: manipulating file operations is simpler than implementing more layered cyber deceptions. However, file-based deceptions are also currently among the most commonly deployed and effective cyber deceptions on real-world systems [45].

As part of our experimentation infrastructure, we needed a deployable, dynamic deception framework that could change the known states of a planner’s environment as a form of adversary deception. As such, we created Anansi.

Anansi works by monitoring the PowerShell log files of all users that appear locally on the machine, examining each command for keywords that serve as the trigger signature for a deception to be deployed as a live reaction to adversary activity. Deception configurations and trigger signatures are kept in a JSON file placed on endpoints during deployment. The Sneaky Files and Black Hole deceptions were implemented and deployed with Anansi. For more information on how these deceptions operate, please see Table 1.
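The following is a minimal sketch of this trigger pattern, i.e., tailing a PowerShell log and matching commands against keyword signatures loaded from a JSON configuration. The file paths, configuration schema, and handler names are hypothetical; this is not Anansi’s actual implementation.

```python
# Sketch of a log-monitoring deception trigger, assuming a JSON config of the form
#   [{"trigger": "Get-ChildItem", "deception": "sneaky_files"}, ...]
# All paths, schema fields, and handler names below are hypothetical.
import json
import time

CONFIG_PATH = "C:/ProgramData/deceptions/config.json"   # hypothetical deployment path
LOG_PATH = "C:/Logs/PowerShell_transcript.txt"          # hypothetical PowerShell log

def load_signatures(path=CONFIG_PATH):
    with open(path) as fh:
        return json.load(fh)

def deploy_deception(name):
    # Placeholder for actuating a deception (e.g., creating decoy files).
    print(f"Deploying deception: {name}")

def monitor(log_path=LOG_PATH, poll_seconds=2):
    signatures = load_signatures()
    with open(log_path) as log:
        log.seek(0, 2)                      # start at the end of the log (tail behavior)
        while True:
            line = log.readline()
            if not line:
                time.sleep(poll_seconds)    # wait for new adversary activity
                continue
            for sig in signatures:
                if sig["trigger"] in line:  # keyword acts as the trigger signature
                    deploy_deception(sig["deception"])
```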

Fig. 2 Mirage experiment results for the Thief adversary (left) and BlackSun ransomware adversary (right), with hue denoting the cyber deception deployed in the cyber range. The x-axis is the percentage of time the adversary spent on planning, and the y-axis is the percentage of successful actions executed by the adversary

4.1.4 Cyber range

For Mirage’s cyber range infrastructure, we used Caldera Range, a closed-source plugin for the Caldera platform that allows Caldera to seamlessly connect to a cyber range environment, deploy agents (implants), and execute operations. We used the Amazon Web Services (AWS) cloud infrastructure backend for the tool while executing our experimentation program. The targeted network for our experiments consisted of five Windows host machines and a single Windows domain controller.

Each host in the domain was seeded with user credentials for a single neighboring host in a ring topology. Each host was also populated with six sensitive files, split between two directories. Each file was initialized with a random filename. One host began with a Caldera agent running on it to serve as a starting point for the operation.

4.1.5 Experimentation program and metrics

The suite of experiments consisted of three episodes per combination of adversary profile, cyber attack planner, and deception strategy. Combining the two adversary profiles, three cyber attack planners, and the deception strategies outlined in previous sections (including no-deception baseline runs) resulted in a total of 72 cyber operations completed over the experimentation program. The total number of episodes per combination was limited to three due to the time-consuming nature of initializing the cloud environment, performing the operation, and tearing down the environment.

Caldera produces a detailed operation report as a JSON file upon the completion of an operation. These operation reports were used to compute a set of metrics for answering analytical questions concerning the performance of the adversary against the various deception strategies. The metrics computed from the operation reports are as follows (a sketch of this computation appears after the list):

  • Total number of actions executed over the course of the experiment.

  • Number of actions that failed to complete.

  • Number of actions that were repeated multiple times in the experiment.

  • Time spent on failed actions in seconds.

  • Time spent planning choice of next actions.

  • Number of facts learned over each trial.

  • Cumulative score over all learned facts.

  • Total experiment run-time in seconds.
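As a rough illustration of how such metrics can be derived from an operation report, the sketch below parses a report JSON and tallies totals, failures, and repeats; the field names (e.g., "steps", "status", "duration") are assumptions made for illustration rather than Caldera’s exact report schema.

```python
# Sketch of computing trial metrics from an operation report JSON.
# Field names such as "steps", "ability_id", "status", and "duration"
# are assumptions for illustration, not Caldera's exact report schema.
import json
from collections import Counter

def compute_metrics(report_path):
    with open(report_path) as fh:
        report = json.load(fh)

    steps = report.get("steps", [])
    ability_counts = Counter(step["ability_id"] for step in steps)

    return {
        "total_actions": len(steps),
        "failed_actions": sum(1 for s in steps if s.get("status") != 0),
        "repeated_actions": sum(1 for c in ability_counts.values() if c > 1),
        "time_on_failed_actions_s": sum(
            s.get("duration", 0) for s in steps if s.get("status") != 0
        ),
        "facts_learned": len(report.get("facts", [])),
    }
```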

4.2 Experimentation results

The first goal of this work was to test novel cyber deceptions against autonomous cyber adversaries and evaluate the effects of the deceptions on those adversaries. The following sections detail the overall performance of the deceptions and discuss the validity of the chosen evaluation metrics.

4.2.1 Performance of cyber deceptions

Figure 2 shows the overall experimental results for the two cyber adversaries (Thief, BlackSun) and each of the novel cyber deceptions detailed in Table 1, over the two chosen metrics of successful_action_proportion and proportion_of_time_spent_on_planning. As our experimentation program had 8 metrics, 2 adversary profiles, and 3 planners, there are 150 specific data views. We chose this data view for the paper as it is an accurate representation of the performance of all deceptions across the 2 adversaries. For this data view, the best-performing deceptions are found in the bottom right and the worst-performing deceptions are in the top left.

Key observations of the experimentation program were as follows:

  • All the deceptions had a clear (negative) effect on the adversarial cyber planners, regardless of the adversary profile (i.e., Thief, BlackSun) or planner (i.e., Guided, Look-Ahead). Observing Fig. 2, the data points are nearly linearly separable and even tend to cluster.

  • For the Thief adversary, the deceptions had distinct negative effects on the Guided and Look-Ahead planners. Notably, these planners were faster, but the deceptions induced many more failed actions.

  • The File Facade deception forces the Guided and Look-Ahead planners to consider more information and thus take significantly more time when planning.

4.2.2 Cyber deception metrics

As outlined in Section 4.1.5, Mirage captured 8 metrics from the cyber operation experiments, all aimed at capturing the performance of the decision component of the autonomous cyber adversary. We (generally) conclude that these metrics served as appropriate measures of impact on the adversarial cyber planning component, as one metric (proportion_time_spent_on_planning) was directly related to planning performance and the rest were strong proxies (e.g., total_actions, failed_actions, cumulative_score_over_all_learned_facts, and total_run_time). With regard to the proxy metrics, in Mirage’s controlled cyber ranges, noise and unpredictability are reduced enough to confidently conclude that action failures or unexpected deviations are the result of Caldera’s planner responding to the deceptions, not other external factors. For example, if the Thief adversary takes additional failed actions not seen in the baseline experiment, these failed actions are not the result of chance or minute changes in the cyber range. Furthermore, our team was able to select random failed actions and manually verify their cause by using Caldera’s knowledge base and post-operational report. However, one key metric missing from this work, which is usually found in the evaluation of automated planners, is the number of states evaluated by the cyber adversarial planning system for a given action decision. Future experimentation should include mechanisms to capture such internal decision metrics.

4.2.3 Computation time and costs

For the 72 emulation cyber operations, it took 24.76 hours of compute time on six AWS t2.medium instances with each instance costing $0.0644/hour. We estimate that provisioning stand-up and tear-down of the compute resources roughly doubled compute time. The total cost (stand-up, experiment, tear-down) for all experiments was about $19 or about $0.25 per experiment. Total compute time for all experiments was nearly 50 hours or 69 minutes per experiment.

While both the computational cost and time may not be prohibitive, scaling the experiments can quickly become intractable, especially noting that our experiments used a simple flat network. This trend confirmed a key hypothesis that a higher-fidelity emulation system of deceptions and autonomous adversaries has physical limits, despite the efficiencies found in automated cyber range deployment and autonomous Caldera execution of operations. These conclusions are in line with previous works [24, 40].

5 Simulation

As noted above, the temporal limitations inherent in emulation precluded scaling experimentation across cyber deceptions, computer system environments, and autonomously enabled adversaries. To overcome this challenge, our team worked this past year on evolving Mirage from an emulation system to a high-fidelity simulation system: that is, a simulation system and framework characterized by easily definable and extendable cyber environments, comprehensive cyber attack agents and action spaces, and the ability to overlay and encode cyber deceptions into the environment.

Additionally, with the transition to a simulation environment, the Mirage system was able to expand beyond evaluating automated planning-based adversarial cyber agents to also evaluating reinforcement learning-based agents. As detailed below, for our initial experimentation with the simulation system, we opted to use reinforcement learning agents rather than automated planning-based agents.

The following section details the simulation system used by Mirage, known as CyberLayer, and the reinforcement learning gym our team integrated with CyberLayer. Additionally, we discuss how deception defenses are embedded in the simulated environment and our initial experiments with a reinforcement learning adversarial cyber agent.

5.1 Environment

5.1.1 CyberLayer

CyberLayer is a simulation environment for cyber operations based on the AI gym design pattern. It generally conforms to the pattern of many such reinforcement learning frameworks. The critical requirement was that the environment be highly representative of real-world environments. A learned policy in the CyberLayer environment necessarily needed to transfer and perform just as well in the real-world environment that the environment’s computer network model was based on. This required the environment topology to be represented through an accurate data model and underlying data structures. It also required that the action space conform to and provide all the constraints of live-fire cyber actions in actual environments. A learned policy in CyberLayer, for a given toolset (translated to a given action space) and a given environment, should perform the same in the simulation as it would in a live-fire implementation of that environment.

In order to create an environment with this kind of realism, but still at a lighter-weight level than emulation or virtualization, significant care was put into the logical evaluation of when a move was valid, the state changes that would be caused by different moves, and the ways that the action space was affected by an agent’s current position.

With the basic foundation of a computer network environment representing computer systems, file systems, and the networks that interconnect them, it was fairly straightforward to implement mechanics that interact with these systems for deception effects. For example, if an action such as “ls” lists the files in a directory, then when an agent runs “ls” under deception circumstances, the deception system can be called before the observations are returned to the agent, and the outputs of a designated deception strategy (e.g., file obfuscation or additional files) can be injected into the agent’s observations. As in real-world deception, this creates circumstances where not all of the agent’s perceived observations of the environment correspond to the ground-truth reality of what is in the environment.
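The sketch below illustrates this hook pattern: a deception strategy intercepts the results of a file-listing action and injects decoy entries before the observation reaches the agent. Class and function names such as DecoyFileDeception and apply_deceptions are illustrative, not CyberLayer’s actual API.

```python
# Sketch of injecting deception into agent observations before they are returned.
# Class and function names are illustrative, not CyberLayer's actual API.

class DecoyFileDeception:
    """Appends decoy filenames to directory-listing observations."""
    def __init__(self, decoys):
        self.decoys = decoys

    def applies_to(self, action):
        return action == "ls"

    def transform(self, observation):
        # Ground truth is untouched; only the agent's perception changes.
        return observation + self.decoys

def apply_deceptions(action, ground_truth_observation, deceptions):
    obs = list(ground_truth_observation)
    for deception in deceptions:
        if deception.applies_to(action):
            obs = deception.transform(obs)
    return obs

# Example: the agent runs "ls" and perceives decoy files alongside real ones.
real_files = ["report.docx", "notes.txt"]
perceived = apply_deceptions("ls", real_files, [DecoyFileDeception(["passwords.xlsx"])])
```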

To our knowledge, the previous works of CybORG [40] and CyGIL [24] are the closest analog to CyberLayer. Currently, CybORG is open-sourced, and CyGIL and CyberLayer are closed-source.

Relative to CybORG, CyberLayer operates at a different level of abstraction with regard to action spaces. CybORG’s action spaces consist of actions like “Discover Remote Systems,” “EscalateAction,” and “StopService,” which one could say are at the equivalent level of MITRE ATT&CK techniques. Comparatively, CyberLayer’s action space consists of direct computer system commands, TTPs, and/or exploits; for example, it includes actions such as the “net view,” “nbtstat,” and “Get-DomainComputer” commands. Additionally, CybORG was built for large-scale open contests (i.e., the CAGE Challenges) where competitors are responsible for building the corresponding agents for the specific CybORG environment and scenario. This differs from CyberLayer in that CyberLayer has both environment and agent components natively.

Relative to CyGIL, CyberLayer takes the more traditional approach of being a direct, manually implemented simulation system, versus the novel emulation-simulation feedback system that CyGIL maintains. In effect, CyberLayer does not require the “burn-in” emulation cycles that CyGIL requires; however, new action spaces and environment effects must be manually coded into CyberLayer, while in CyGIL they can be learned and translated more autonomously. Similarly, the action spaces of both CyberLayer and CyGIL include subsets of actions taken from MITRE Caldera.

Fig. 3 Data flow within Mirage’s reinforcement learning gym

5.1.2 Reinforcement learning gym

As the CyberLayer simulation system adheres to the Gymnasium (formerly OpenAI Gym) API, it can be integrated into standard reinforcement learning gyms. Our team chose the open-source reinforcement learning libraries Ray and RLlib as the gym into which to integrate the CyberLayer environment.

Ray and RLlib are two open-source libraries that are commonly used for reinforcement learning (RL) and distributed computing. They work together to provide a powerful framework for training and deploying RL agents in various environments. Ray serves as the underlying distributed computing framework that powers RLlib’s distributed training capabilities. It manages the allocation of resources, such as CPUs or GPUs, for training RL agents in parallel. RLlib leverages Ray’s task scheduling and distributed data processing features to efficiently distribute RL training workloads across available compute resources. Users can define custom RL environments and experiment configurations using RLlib’s high-level APIs. RLlib provides a variety of RL algorithms that can be easily integrated into user projects, and users can experiment with different algorithms to find the best fit for their specific problems [26].

As CyberLayer is the environment of our reinforcement learning model, it interfaces with the RLlib gym framework in two key ways: supplying observations and rewards back to the RLlib agent, and receiving actions from the RLlib agent and executing them in the CyberLayer environment.

For supplying the learning agent with the observation of the environment, Gymnasium ObservationWrappers were utilized to wrap the entire CyberLayer environment state and filter down to the desired observation space (and appropriate encoding format) for the agent to receive. CyberLayer is designed to provide access to its comprehensive environment state so that any possible observation space may be used. Thus Gymnasium ObservationWrappers could be implemented for any observation space that is desired by merely querying CyberLayer for any state information the observation space requires. Similarly, the reward policies used in the reinforcement learning model are architected to be modular and supplied to the CyberLayer environment at run-time. Reward policies also have access to the comprehensive environment state of CyberLayer, and any policy or function may be calculated over the environment state to get the reward for the current step.
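A minimal example of this pattern using the standard Gymnasium ObservationWrapper API is shown below; the state keys queried from the environment are placeholders, not CyberLayer’s actual state schema.

```python
# Minimal Gymnasium ObservationWrapper that filters a rich environment state
# down to a small, encoded observation. The state keys ("facts", "hosts")
# are placeholders, not CyberLayer's actual state schema.
import gymnasium as gym
import numpy as np

class FactCountObservation(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        # Declare the reduced observation space seen by the agent.
        self.observation_space = gym.spaces.Box(
            low=0.0, high=np.inf, shape=(2,), dtype=np.float32
        )

    def observation(self, state):
        # Reduce the comprehensive environment state to two scalar features.
        return np.array([len(state["facts"]), len(state["hosts"])], dtype=np.float32)
```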

Equivalently, CyberLayer receives actions from the reinforcement learning agent through Gymnasium ActionWrappers. ActionWrappers are used to decode the output from the agent’s model/policy as well as to execute any further preprocessing (e.g., grounding or variable replacement) required before sending the action to CyberLayer for execution. Finally, before the output (chosen action) is passed from the agent to the ActionWrappers (and then to the CyberLayer environment), it is also put through RLlib action masking. Action masking allows for the clipping (reduction) of the action space available to the agent for the current step by removing actions that are not allowed or are irrelevant in the environment. Our team found that using action masking in our initial experiments greatly improved training and/or helped identify optimal training hyperparameters more quickly.

Figure 3 shows the reinforcement learning gym and CyberLayer environment used by the Mirage system.

5.2 Agent design

The setup of the reinforcement learning game assumes that an agent’s choice of action occurs only locally to a host on a network. This means that in an operational environment, each agent would make decisions independently of the others while sharing only network-level knowledge. This allows for game termination on simple local goals, where the goal for lateral movement is to move to any other visible host and the goal for exfiltration is to remove any files from the host the agent resides on. This is in contrast to other approaches where agents must reach a specific end-state on the network [24]. It also reduces the complexity of decision-making for each individual agent on the network, given that its scope of action is strictly local.

Within both the emulation and simulation setups there is a distinction between abstract actions and grounded actions. An abstract action is the generic type, such as Copy File. The grounded action populates the action parameters with specific values, such as the path of the file to copy. The reinforcement learning algorithm selects the abstract action; other mechanisms are then used to select an available grounded action of the provided abstract action type. This two-stage approach allows the action space of the reinforcement learning algorithm to be small and of fixed size. For this experiment, the second-stage mechanism is simple random selection, as sketched below.
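The two-stage selection can be sketched as follows, where the policy picks an abstract action index and a random choice grounds it against currently known parameter values; the action names and knowledge structure are illustrative.

```python
# Sketch of two-stage action selection: the policy picks an abstract action,
# and a grounded instance is then drawn at random from the agent's knowledge.
# Action names and knowledge structure are illustrative.
import random

ABSTRACT_ACTIONS = ["scan_directory", "copy_file", "exfiltrate", "lateral_move"]

def ground_action(abstract_idx, knowledge):
    abstract = ABSTRACT_ACTIONS[abstract_idx]
    if abstract == "copy_file":
        # Grounding fills in parameters, e.g., a previously discovered file path.
        candidates = knowledge.get("file_paths", [])
        if not candidates:
            return None                     # precondition unmet; the action would fail
        return (abstract, {"path": random.choice(candidates)})
    return (abstract, {})

# Example: the policy selected abstract action 1 ("copy_file").
knowledge = {"file_paths": ["C:/Users/alice/finance.xlsx", "C:/Users/alice/notes.txt"]}
grounded = ground_action(1, knowledge)
```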

Given that there are several actions with preconditions on agent knowledge, an action mask is generated based on the current state of the agent’s knowledge base and defined constraints on actions. For example, an action to copy a file is constrained by whether the agent has discovered file paths through a scan. The mask is recomputed at each step once a simulation result has been processed and the agent’s memory updated. It is then used to clamp the outputs of the fully connected neural network to disallow the selection of constrained actions at that time step. This hastens training by leveraging known constraints to avoid unrealistic policies [20].

The observation space was a simple history of executed actions. The horizon of the action history was set to the total number of actions allowable in an episode and padded to that length with zeros. Actions are appended to the beginning of an action history list and one-hot encoded. The length of the observation vector was equal to the number of available actions (13) multiplied by the horizon of the action history (50) for a total of 650 values.
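A short sketch of this encoding, matching the stated dimensions (13 actions, a horizon of 50 steps, and a 650-element vector), is given below.

```python
# One-hot encoding of a padded action history: 13 actions x 50-step horizon = 650 values.
import numpy as np

NUM_ACTIONS, HORIZON = 13, 50

def encode_history(action_history):
    """action_history: list of executed action indices, most recent first."""
    encoded = np.zeros((HORIZON, NUM_ACTIONS), dtype=np.float32)
    for step, action_idx in enumerate(action_history[:HORIZON]):
        encoded[step, action_idx] = 1.0     # one-hot per step; unused steps stay zero-padded
    return encoded.flatten()                # shape: (650,)

assert encode_history([3, 7]).shape == (NUM_ACTIONS * HORIZON,)
```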

The reward policy consisted of four components: a small negative reward for a failed action, a small positive reward for the gain of new knowledge in the form of facts (to encourage exploration), a large constant-size positive reward for a successful lateral movement, and a large variable-size positive reward for a successful data exfiltration that scales with how many files were exfiltrated. The total attainable reward was equal to approximately one (Table 2; a sketch of this reward policy follows the table).

Table 2 Reward values
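A reward policy of this shape can be sketched as follows; the numeric constants are placeholders only (the actual values are listed in Table 2), chosen so that the maximum attainable reward is roughly one.

```python
# Sketch of the four-component reward policy. The numeric constants are
# placeholders (the actual values appear in Table 2), chosen so the maximum
# attainable reward is roughly one.
def step_reward(action_failed, new_facts, lateral_move_success,
                files_exfiltrated, total_sensitive_files):
    reward = 0.0
    if action_failed:
        reward -= 0.01                      # small penalty for a failed action
    reward += 0.005 * new_facts             # small bonus per newly learned fact
    if lateral_move_success:
        reward += 0.4                       # large constant bonus for lateral movement
    if total_sensitive_files:
        # large variable bonus scaling with the fraction of files exfiltrated
        reward += 0.5 * (files_exfiltrated / total_sensitive_files)
    return reward
```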

5.3 Experimental setup

For the purposes of this work, the training algorithm used to update the agent’s decision policy was Proximal Policy Optimization (PPO) [36]. There are alternatives including Importance Weighted Actor-Learner Architecture (IMPALA) and other Actor-Critic (AC) algorithms [9, 15]. However, due to PPO’s stability, ease of implementation, and the relative speed of CyberLayer, PPO was selected instead [25].

The training was performed on a laptop with a 2.7GHz six-core Intel Core i7 CPU and 32 GB of memory. No GPU was utilized. The training setup was defined as follows (an illustrative configuration sketch follows the list):

  • Learning Rate — 0.0001

  • KL Coefficient — 0.2

  • Number of SGD Iterations — 30

  • SGD Minibatch Size — 128

  • Training Batch Size — 4000

  • Maximum steps allowed in episode — 50

  • Six parallel workers, with five environments each
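For reference, these settings map onto a Ray RLlib (2.x) PPO configuration along the following lines; the environment name "CyberLayerEnv" is a placeholder for a registered environment, and this is an illustrative sketch rather than our exact training script.

```python
# Illustrative RLlib (Ray 2.x) PPO configuration mirroring the hyperparameters above.
# "CyberLayerEnv" is a placeholder name and assumes the environment has been
# registered with Ray/Gymnasium; the 50-step episode cap is assumed to be
# enforced by the environment itself.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="CyberLayerEnv")
    .training(
        lr=0.0001,
        kl_coeff=0.2,
        num_sgd_iter=30,
        sgd_minibatch_size=128,
        train_batch_size=4000,
    )
    .rollouts(num_rollout_workers=6, num_envs_per_worker=5)
)

algo = config.build()
for _ in range(100):          # train for a fixed number of iterations
    result = algo.train()
```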

The set of actions was designed to match the Thief adversary that was demonstrated in the emulation experiments. To confirm the validity of the actions in CyberLayer, the results of the experiments performed in the emulation environment described in Section 4.2 were used to hand-validate the environment. Implementations of actions in CyberLayer were tuned until results of games performed using an agent that replicates Caldera’s planning capabilities closely matched the results obtained from the emulation games (Table 3).

Table 3 Set of actions used in simulation reinforcement learning experimentation

An episode of the game was terminated on three conditions: successful achievement of both exfiltration and lateral movement goals, having no further actions available to the agent, or hitting the episode cap of 50 steps.

The topology of the simulation network used was designed to closely mimic the emulation network described in Section 4.1.4.

Fig. 4 Mirage simulation experiment results tracking mean reward (left) and mean episode length (right) during training, with hue denoting the type of deception applied

5.4 Experimentation results

Figure 4 shows two metrics tracked during training. These metrics demonstrate the algorithm’s ability to optimize the rewards it receives from the environment for accomplishing goals and the number of executed actions it took to do so.

Through these two metrics, the baseline training run with no deception demonstrated that the optimal policy the algorithm was able to converge on received a higher average reward while performing fewer total actions. Each applied deception had a clear effect on both metrics.

  • The Black Hole deception targets a fixed exfiltration staging directory to stop the adversary from copying content to that directory for later exfiltration. It had the effect of stopping the adversary from receiving the bulk of the available reward but also allowed the adversary to prune away that route of attack, resulting in shorter episode lengths.

  • The Sneaky Files deception targets an added honey directory to cause files discovered in that directory to have inconsistent filenames. Because of this, an agent’s action to copy a file will sometimes result in a failed action. The effect is shown to be both a lower reward from the adversary choosing to exfiltrate less content and an inability to further optimize the attack length.

A secondary result is the demonstrated ability of a simulation environment to support far larger quantities of games/episodes than the emulation system described in Section 4.1.4. With the setup described in Section 5.3, each training experiment took an average of 37 minutes and 28 seconds while executing an average of 5,489 games. This speed of execution over emulation was a key enabler of reinforcement learning and larger-scale assessment of deception against reactive autonomous adversaries.

6 Mirage evaluation

The second goal of this work was to evaluate the overall efficacy of the Mirage prototype in capturing and quantifying the deception effects on an autonomous cyber adversary. The following sections discuss the modularity and scalability of Mirage when it comes to evaluating additional novel deceptions and adversary planning algorithms on more complex cyber ranges.

6.1 Modularity of Mirage in emulation

6.1.1 Deceptions and planners

Adding deceptions and planners entails a high development effort and cost. In short, while Anansi and Caldera fully support adding new deceptions and planners, each new component requires non-trivial development and testing, as the components function on real computer and network systems, often dealing with low-level APIs. For the deceptions under test, the Anansi service does allow for standard deployment and actuation (on a Windows host) and thus some efficiency of scale. However, as the deceptions are entirely real, they still require thorough test and evaluation.

For the adversarial Caldera planners under test, we reach the same conclusion: development costs scale linearly, with a roughly constant cost for each additional planner. Each new Caldera planner must be implemented within Caldera and tested prior to any deception experimentation. Additionally, while Caldera serves as a sufficient operational platform for executing adversarial cyber attacks, a known limitation is that Caldera’s data model and architecture restrict the complexity of any implemented planner. For example, the Guided planner we implemented can only guarantee proper execution for the chosen Ransomware and Thief adversary profiles, not for arbitrary additional adversaries.

6.1.2 Ranges

The Caldera Range plugin allowed Mirage’s experimentation program to be deployed on its cyber range with notable automation and ease of repeatability. Cyber range specifications were deployed in a “click-and-shoot” manner. That is, Caldera Range allows for easy specification of desired hosts, services, and network configurations, and then automates the building of the cyber range in the AWS environment using Ansible and the AWS API. For future experiments, adding network complexity, additional host and network services, user data, etc. would be a relatively low-cost effort, although some services (e.g., an Outlook web server) would be more complex to add. Compared to creating additional deceptions and planners, the effort required to extend and enhance the cyber range is notably less.

6.1.3 Metrics

Mirage’s ability to capture metrics depends on its system sub-components. With Caldera Range as the cyber range component of Mirage, capturing range activity at all levels is straightforward. However, there are current limits to capturing internal deliberation metrics from Caldera’s planning services, specifically measurements of the state space exploration executed by the Caldera planner. This could be implemented but would have to be done for each Caldera planner under test, as such a mechanism is not built into the core Caldera platform.

6.2 Modularity of Mirage in CyberLayer simulation

6.2.1 Deceptions and agents

In contrast to Mirage’s emulation system, adding and encoding deceptions into the CyberLayer system requires significantly less effort. To add deceptions to CyberLayer, one must essentially create a set of (programming) functions that hook any observation feedback sent to the agent, and modify the perceived environment state as required. The data model’s API in the CyberLayer backend handles the maintenance of keeping a ground truth instance of the environment. It is also responsible for overlaying instances of deception on the state observed by the agent. For Mirage simulations, the data backend is also very performant as an in-memory network graph was used as the underlying data structure.

For developing and testing additional cyber agents in CyberLayer, there is also a reduction in the required effort; however, the reduction is more nuanced. With the CyberLayer simulation environment and reinforcement learning gym, adding new agents potentially requires implementing additional reward policies, observation spaces, and action mappers. However, the RL algorithms that serve as the underlying decision engine for the agent do not need to be implemented as they already exist in RLlib, and can be applied to cyber agents within CyberLayer simulation environment. This is opposed to our emulation system, where additional cyber agents required additional planners (i.e., the decision engine of the agents in that case) to be implemented outright. For our emulation experiments, observation spaces and action mappers were also abstracted away from Mirage as they are handled implicitly by Caldera’s operational API. Overall, the emulation case is more limiting, as in our empirical experience developing new planners for emulations is a higher level of effort than developing new reward policies, observation spaces, and action mappers for simulations.

6.2.2 Cyber environments

With regard to extending and manipulating the cyber environments in CyberLayer, it is naturally also less effort than doing the equivalent in Mirage’s emulation system. With CyberLayer, even for more complex actions, systems, services/applications, etc., one is still only creating a very good mock, not the actual implementation, which is always less computational effort. For example, while creating a simulated domain controller is not trivial, it never approaches the cost of creating a real domain controller for an emulated environment. Furthermore, simulated components can be tested and evaluated much more quickly and readily with programmatic tests whereas emulated components necessitate live integration testing.

6.2.3 Metrics

Lastly, as expected with any simulation system, Mirage’s use of CyberLayer and RLlib for simulations allows for extensive metrics and performance evaluation. With CyberLayer, the entire state of the cyber environment is accessible, and analytics can be inserted into any aspect of the simulation. Similarly, the reinforcement learning gym components, RLlib and more specifically Ray, come off-the-shelf with significant performance and evaluation capabilities for use in any gym training experiments.

7 Conclusion and future work

In this work, our team initially designed, prototyped, and evaluated Mirage, an experimentation system for evaluating cyber deceptions against autonomous cyber adversaries. Three novel cyber deceptions, tailored to targeting automated planning techniques, were created and evaluated against two types of adversaries (ransomware, data theft). These adversaries were executed by autonomous Caldera planners that utilized different planning techniques (future reward, forward search). To deploy these novel cyber deceptions and allow for their dynamic nature, our team also created Anansi, a Windows operating system service to actuate and control the cyber deceptions. To comprehensively emulate and test these novel cyber deceptions and adversaries, we conducted 72 live offensive cyber attacks on an AWS cyber range using Caldera’s Range plugin. As a result of these experiments, we assessed three core challenges to expanding on our work: development costs (of emulation), experimentation time (of emulation), and Caldera planner limitations.

To then improve upon Mirage’s emulation system, and address those noted challenges, specifically of development cost and experimentation time of emulation, our team proceeded to integrate Mirage into a simulation system, called CyberLayer, to evaluate cyber deceptions at machine speed. CyberLayer enabled high-fidelity simulation of offensive cyber environments where the action spaces in the simulated environment are equivalent to real-world cyber attack actions. Additionally, our team then integrated the reinforcement learning gym RLlib into CyberLayer to enable a full-scale offensive cyber gym to train reinforcement learning agents in simulated cyber environments that contained cyber deceptions.

Our first set of experiments of Mirage with the CyberLayer simulation environment involved a reinforcement learning cyber agent whose action space was that of the simple Thief cyber adversary, and where the simulated cyber environment had the same computer network architecture and hosts as our previous emulation experiments. The agent was trained with PPO, and the optimum performance of the agent in simulation environments with deception matched the overall trends observed in the previous emulation experiments. That is, as an initial efficacy test, the simulation system proved valid for testing Mirage’s cyber deceptions.

Following this work, our team plans to further develop CyberLayer and the integrated RLlib reinforcement learning gym as a means to develop more sophisticated cyber agents that can manage larger action spaces and more complex cyber environments. Specifically, further research and development are required around observation spaces, reward policies, and action mappers in order to support the training of more sophisticated agents. Additionally, as CyberLayer was only briefly introduced in this publication, our team plans follow-on publications that detail CyberLayer and its reinforcement learning gym at length.