Forensic Readiness of Industrial Control Systems Under Stealthy Attacks

Cyberattacks against Industrial Control Systems (ICS) can have harmful physical impacts. Investigating such attacks can be difficult, as evidence could be lost to physical damage. This is especially true with stealthy attacks, i.e., attacks that can evade detection. In this paper, we aim to engineer Forensic Readiness (FR) in safety-critical, geographically distributed ICS, by proactively collecting potential evidence of stealthy attacks. The collection of all data generated by an ICS at all times is infeasible due to the large volume of such data. Hence, our approach only triggers data collection when there is the possibility for a potential stealthy attack to cause damage. We determine the conditions for such an event by performing predictive, model-based, safety checks. Furthermore, we use the geographical layout of the ICS and the safety predictions to identify data that is at risk of being lost due to damage, i.e., relevant data. Finally, to reduce the control performance overhead resulting from real-time data collection, we select a subset of relevant data to collect by performing a trade-off between expected impact of the attack and the estimated cost of collection. We demonstrate these ideas using simulations of the widely-used Tennessee-Eastman Process (TEP) benchmark. We show that the proposed approach does not miss relevant data and results in a reduced control performance overhead compared to the case when all data generated by the ICS is collected. We also showcase the applicability of our approach in improving the efficiency of existing ICS forensic log analysis.


Introduction
In Industrial Control Systems (ICS), software handles safety-critical processes that often form the core of a nation's critical infrastructure, e.g., power generation and chemical processes. Unlike attacks on traditional IT systems, cyberattacks against ICS can cause physical damage [27]. Following an attack, a digital forensics investigation is usually performed to identify how an incident occurred using digital evidence [12]. However, digital evidence in ICS can be volatile, as control devices typically feature low storage resources and can be damaged by attacks [41]. ICS therefore need to be Forensic-Ready [24,38,44], i.e., capable of identifying, collecting, and preserving, in advance, data that may be used as evidence to investigate potential, known incidents [5].
The collection of all data at all times may not be feasible due to the large amount of data typically generated by an ICS. This is particularly true for large-scale, geographically-distributed ICS, i.e., those that consist of a large number of process equipment and control devices whose layout occupies a large area (e.g., chemical processes and power plants). Data stored in an ICS' process historian, while potentially useful in diagnosing anomalies, may not explain how an attack occurred [2,23]. Most of the work in ICS forensics is geared towards post-incident investigations [29] and a few approaches attempt to investigate potential attacks while the system is running, performing live forensics [39].
Previous work relies on the detection of attacks to trigger data collection. Thus, it may not be effective when an ICS is targeted by a stealthy attack, which can take advantage of sensor noise or other physical properties of the system to evade anomaly detectors [18]. Also, previous work does not focus on the collection of relevant data [32], i.e., data that can explain how an attack occurred and that can be lost to damage caused by the attack. The real-time collection of such data from control devices can incur a performance overhead, which can make it harder to meet the performance requirements for safe control operations. In this paper, we aim to engineer Forensic-Ready, safety-critical, geographically distributed ICS, which can proactively collect data relevant to a stealthy attack. To achieve this aim, we need to identify a trigger for data collection under stealthy attacks and a technique to identify relevant data, while reducing any performance overhead.
To trigger data collection, we rely on previous work on online safety monitoring under stealthy attacks [8,9,26]. Instead of forcing the detection of such attacks, online safety monitoring asks whether a potential stealthy attack can cause damage to the system given an initial physical state. While this may not reveal such attacks, it can identify conditions under which a system can be damaged and, thus, relevant data should be collected. To identify relevant data which is also at most risk of being damaged, we propose a technique that relies on the geographical layout of the system, safe Process Plant Layout (PPL) [36]. To reduce the real-time data collection overhead, we propose a decision-theoretic framework to decide whether the identified data is "worth" collecting based on a trade-off between collection cost and the expected damage that can be caused by the attack.
We show through extensive simulations on the benchmark Tennessee-Eastman Process (TEP) that our approach does not miss any relevant data, and that the collection of data enabled by our decision-theoretic framework has a limited impact on the controllers' performance (execution time). Additionally, we demonstrate a use case of our approach where the performance of Programmable Logic Controller (PLC) log analysis tools [42] can be significantly improved. The rest of the paper is organised as follows: Section 2 overviews related work; Section 3 clarifies the problem statement and summarizes the overall approach; and Section 4 describes a running example. We detail our approach in Section 5 and evaluate it in Section 6, before concluding the paper in Section 7.

Related Work
Unlike attacks on traditional IT systems, attacks on ICS may seek to cause physical damage rather than to steal or tamper with confidential data. Because such attacks affect the behaviour of physical processes, a large body of work has considered techniques from control engineering to detect attacks on ICS [18]. It has been shown that resourceful attackers can exploit noise [19] and other control theoretical properties [33] to evade anomaly detectors. The detection of these so-called stealthy attacks still faces theoretical and practical limitations, such as interference with the safety-critical operation of ICS devices [19].
Forensics in ICS faces the following challenges. The safety criticality of industrial processes and the difficulty of shutting them down for investigations lead to strict constraints on forensic tools to limit interference with their operation. The lack of documentation in most ICS devices can complicate data acquisition tasks and may lead to a loss of data, as described by van Vliet et al. [41]. In addition, the limited resources available in low-level devices (e.g., PLCs, sensors, actuators) [7,16,25] render potential evidence volatile [16,29]. Therefore, ICS forensics requires different approaches than traditional IT systems. Most of the existing work in ICS forensics attempts to adapt traditional IT forensic investigation frameworks to apply them to ICS [6,17,28]. Other work has proposed techniques to perform live forensics, i.e., investigate potential attacks while the system is running [3,39]. In addition, several approaches have focused on analysing and acquiring evidence from specific ICS devices, namely PLCs [13,35,43] and engineering workstations [21,31].
Most existing work in ICS forensics has focused on forensic investigations undertaken after an incident or an anomaly has occurred, as in the work of Taveras [39]. A limited body of work has considered proactive approaches to ICS forensics. Examples include the work by Grispos et al. [20] and Ab Rahman et al. [1], which consider Forensic Readiness (FR) requirements in the design of medical and cloud-based CPS, respectively. However, these approaches have focused on generic organisation-level guidelines for pre-planning incident response measures and cannot be applied to safety-critical ICS.
Our work considers engineering FR in safety-critical ICS faced with stealthy attacks, which can cause physical damage to an ICS and lead to the loss of potential evidence. Although live forensics techniques [39] aim to collect data proactively, the collection of such data is only triggered after an attack is detected. Therefore, these techniques may not be effective when an ICS is targeted by a stealthy attack. Our work has a similar aim as the approach proposed by Alrajeh et al. [5], which generates data collection specifications to support forensic readiness in a traditional IT system. However, to the best of our knowledge, our work is the first to address the challenges that stealthy attacks pose to ICS forensics.
Our approach is based on a physics-based Early Warning System (EWS) proposed by Azzam et al. [8,9] to trigger potential evidence preservation in ICS. The approach proposed by Azzam et al. [8,9] allows the real-time identification of conditions under which there is a possibility for a stealthy attack to cause damage. While the proactive collection of data that may represent potential evidence is suggested as a possible post-warning measure, the authors do not suggest which data should be collected. Our approach, instead, proposes a technique to identify relevant data that should be collected based on the PPL and a formal decision-theoretic framework.

Problem Statement and Overview
In this section, we clarify the problem tackled in this paper, and provide an overview of our approach.

Problem Statement
In this paper, we seek to collect data that could explain whether and how a stealthy attack occurred. We also aim to collect this data before it may be lost due to the damage caused by the attack. Our problem has three main parts (P1-P3). (P1): Since stealthy attacks can avoid detection, it is not possible to use the alarms generated by anomaly detectors to trigger proactive data collection or live forensics [39]. Thus, it is necessary to identify an alternative trigger for proactive data collection. (P2): Not all data generated by an ICS needs to be collected proactively, since only a subset of such data may be at risk of being lost due to damage. As such, there is a need for a technique to identify what data faces such a risk. (P3): ICS generally operate under strict real-time performance requirements and are expected to generate a profit; thus, data collection activities can negatively affect the performance of the plant. To reduce the overhead introduced by data collection, this activity should only be performed when the cost of collection does not exceed the potential damage that can be caused by a stealthy attack.

Overview of the Approach
Our approach to support FR in ICS presented with stealthy attacks consists of three main steps, which address the problems (P1-P3).
1. To trigger data collection (P1), we rely on predictive safety checks of whether a potential stealthy attack can indeed take the system into an unsafe state and potentially cause the loss of data.
2. To determine what data is at risk of being lost due to the potentially damaging stealthy attack (P2), i.e., relevant data, we employ the safe PPL [36] to identify areas of the plant that may be affected by the suspected breach. Within these areas, we identify devices and data originating from these devices that can represent potential evidence and that are at risk of being damaged.
3. Among the set of data that may potentially be lost, we identify the data that should be collected (P3). To achieve this aim, we propose a decision-theoretic framework which decides to collect data only if the cost of collection does not exceed the expected impact of the attack, in terms of its potential effect on the system's performance. Accordingly, we collect a subset of relevant data which results in a limited overhead on the system's real-time performance.
Figure 1 shows the three steps of our approach in the form of an automaton reflecting the real-time deployment of the proposed approach. Our main assumption in this work is that data acquisition tools, such as the one suggested by Yau et al. [42], are available to be activated whenever data should be collected from a device of an ICS. In addition, we assume that a secure server is available to store the collected data.
[Figure 1: automaton of the approach. Recoverable labels: "idle"; "Specify data at risk of getting lost (P2)"; "Decide whether to collect by trading off impact and collection cost (P3)".]

Running Example
We use a chemical plant setup based on the benchmark Tennessee-Eastman Process (TEP) as a running example to describe the proposed approach. We begin this section with a brief description of the TEP setup, and then describe our attack scenario example.

Tennessee-Eastman Setup
The TEP is a benchmark chemical process suggested by Downs and Vogel [15] and based on a real industrial plant that produces two liquid products from four gaseous reactants. The TEP has recently been used as a virtual testbed for various works in ICS security as it represents a realistic chemical processing environment. The process consists mainly of five major operating units: a reactor vessel, a product condenser, a vapour-liquid separator, a recycle compressor, and a product stripper. For the purposes of our work, we assume that the system layout is as shown in Figure 2. Each main operating unit in the TEP is housed in its own control room, which contains, aside from the process equipment, low-level control devices such as sensors/actuators, PLCs, Remote Terminal Units (RTUs) and engineering workstations. Each control room is connected over a network to a central room that houses supervisory control servers and process historians. In each control room, the relevant controllers' logic is installed on the PLCs, which are in turn connected to the engineering workstations via an RTU. The RTUs relay control data to the central control room, as is the case with Supervisory Control and Data Acquisition (SCADA) architectures. Furthermore, an anomaly detector located in the central control room monitors control data, namely sensor measurements and actuator inputs, to detect anomalies due to faults or malicious tampering.

Attack Scenario
Our example is inspired by the Stuxnet [4] and the German steel mill [27] attacks. Namely, we consider the reactor stage of the TEP (area $A_1$), which features temperature and pressure controllers that keep these operating variables at desired levels. We assume that each of these controllers is installed on a Siemens S7 PLC, which is connected to sensors and actuators using a standard industrial Profinet Ethernet network. Due to control safety constraints, communications over TCP/IP are performed in plain text without any encryption [2]. In this scenario, the attacker gains access to a network via a social engineering attack involving a fraudulent email that includes a malicious attachment. The attack then proceeds as follows. In Phase (1), the reconnaissance phase, the attacker identifies some properties of the PLC on which the temperature and pressure controllers are implemented. These properties can include the make, model, firmware, function codes, and addresses of the PLC [2]. Due to the lack of encryption, the attacker sniffs the data exchanged between the PLC and the physical system and uses it to extract more knowledge about the system, such as its noise properties, and details about the physics-based anomaly detector in use.
In phase (2), and to avoid detection, the attacker chooses to slowly drive the reactor to an unsafe state before any anomaly can be detected in the central station. To this end, they exploit the PLC function codes identified previously to modify the pressure control program installed on the PLC. This modification involves applying a slowly growing bias to pressure measurements received by the PLC, such that the controller is tricked into increasing the pressure in the reactor to unsafe levels. The attacker can hide these deviations from the anomaly detector using their knowledge of the system dynamics and measurement noise properties [30] that they established during the reconnaissance phase.
A forensic investigation into the incident will likely need to recover potential evidence from the equipment located near the reactor, namely the engineering workstation, PLC, and RTU. This, however, may not be possible, as a pressure buildup in the reactor may lead to an explosion that severely damages these devices. For example, van Vliet et al. [41] describe the difficulties of recovering data from a PLC that was damaged in a wind turbine fire. Thus, this data needs to be collected proactively.
ICS are typically equipped with a process historian, a proprietary server that records process data over long periods of time. Although data recorded in process historians is protected from damage, it is geared towards supervising the physical process rather than detecting attacks [2,6,23]. In the TEP scenario, for example, a process historian may contain pressure measurements received from the reactor over a certain period of time. These measurements may reveal potential anomalies caused by an attacker. However, a forensic investigator may not be able to infer that an attacker performed code modifications to the PLC without having access to the PLC configuration at the time. Such data is not typically stored in process historians but it can be collected using our approach.

Approach to Proactive Data Collection
We begin this section with a description of the kind of stealthy attacks that we focus on, before detailing our approach (Figure 1) to engineering FR in ICS. We then compile the steps of our approach in a real-time data collection algorithm.

Stealthy Sensor Attacks
We focus in this work on attacks that consist of modifying the behaviour of the physical process in ICS. In particular, we consider stealthy attacks that consist of slowly modifying sensor measurements while exploiting sensor noise to evade detection. We assume that the attacker, as illustrated in Section 4, has knowledge of the system's model and anomaly detection procedure. Additionally, the attacker is able to alter sensor measurements either by modifying controller code or by accessing relevant nodes in the network. Regardless of the manner by which the attacker performs their actions, our approach only considers the effect of such actions on the physical process.
Stealthy sensor attacks have been widely considered in previous work [18]. We choose to focus on them due to their practicality from the point of view of an attacker and the ease of maintaining their stealthiness as compared with attacks on actuators [40]. Furthermore, the ICS that we consider may experience long operational transients, which renders replay attacks more difficult to keep stealthy [9]. A model for stealthy sensor attacks can be found in the work of Murguia et al. [30].
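As an illustration of why a slowly growing sensor bias is hard to detect, the following minimal Python sketch simulates a toy closed loop. It is not the attack model of Murguia et al. [30] nor the TEP: the integrator plant, the proportional controller, the one-step residual detector, and all numerical values are assumptions made purely for illustration, and the detector threshold stands in for the band that sensor noise would normally occupy.

```python
from dataclasses import dataclass


@dataclass
class AttackRun:
    detected: bool         # did the residual ever exceed the detector threshold?
    max_residual: float    # largest one-step prediction residual observed
    final_pressure: float  # true (unmeasured) pressure at the end of the run


def simulate_ramp_bias(steps=100, setpoint=100.0, gain=0.5,
                       bias_step=0.2, threshold=0.5):
    """Toy integrator plant x[k+1] = x[k] + u[k] under a proportional controller.

    The attacker lowers the pressure reading by a further `bias_step` at every
    sample, so the controller keeps raising the true pressure. All values are
    hypothetical.
    """
    x = setpoint           # true pressure
    bias = 0.0             # attacker-injected measurement bias
    y = x + bias           # reading seen by the controller and the detector
    detected = False
    max_residual = 0.0
    for _ in range(steps):
        u = gain * (setpoint - y)   # controller acts on the falsified reading
        predicted_y = y + u         # detector's one-step model prediction
        x = x + u                   # true plant update
        bias -= bias_step           # bias ramps slowly downwards
        y = x + bias
        residual = abs(y - predicted_y)
        max_residual = max(max_residual, residual)
        if residual > threshold:
            detected = True
    return AttackRun(detected, max_residual, x)


run = simulate_ramp_bias()
print(run)
```

In this toy setting the residual equals the per-sample bias increment, so it stays below the threshold at every step while the true pressure drifts steadily away from the setpoint.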

Trigger for Real-time Data Collection (P1)
Real-time data collection activities can be triggered following an alarm from an existing Anomaly Detection System (ADS) [39]. However, in the case of the stealthy attacks we are concerned with, such a trigger may not be available as the resourceful attacker may evade the existing ADS and may be able to cause damage before any such alarm is raised.
To mitigate this problem, we propose instead a trigger that relies on online safety monitoring under attacks. Instead of attempting to detect stealthy attacks, this line of work seeks to proactively check in real-time for conditions under which a potential stealthy attack can cause damage to the system [8,9,26]. This can be useful to guide certain proactive measures, including the collection of potential evidence. Of the approaches proposed in the literature, we adopt the one proposed by Azzam et al. [8,9], where safety checks are used to compute a so-called suspicion metric reflecting the probability of damage taking place due to a stealthy attack.
The suspicion metric combines two physics-based preliminary indicators to warn in advance of potential damage: (i) feasibility of a stealthy attack, defined as its ability to drive the system into the unsafe operating region without getting detected; and (ii) proximity of the system to the unsafe operating region. These two indicators are then expressed in terms of a reachability problem, where reachable sets are approximated using ellipsoids computed through a Linear Matrix Inequality (LMI) formulation. The size of the reachable set and of its intersection with a predefined set of unsafe states, as well as the approximate time to reach the unsafe set, are then combined to compute the suspicion metric. Due to space limitations, we refer the reader to the authors' work [8,9] for more details on their approach.
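To make the geometric idea concrete, the sketch below checks whether an ellipsoidal over-approximation of a reachable set can cross a single linear limit. It is not the LMI computation or the suspicion metric of Azzam et al. [8,9]; it only uses the standard fact that the maximum of $a^T x$ over the ellipsoid $\{x : (x - c)^T Q^{-1} (x - c) \leq 1\}$ is $a^T c + \sqrt{a^T Q a}$. The function names, the two-state example, and all numbers are hypothetical.

```python
import math


def support(a, center, shape):
    """Largest value of a.x over the ellipsoid {x : (x - c)^T Q^-1 (x - c) <= 1}."""
    n = len(a)
    a_dot_c = sum(a[i] * center[i] for i in range(n))
    # Quadratic form a^T Q a, with the shape matrix Q given as a list of rows.
    quad = sum(a[i] * shape[i][j] * a[j] for i in range(n) for j in range(n))
    return a_dot_c + math.sqrt(quad)


def margin_to_limit(a, b, center, shape):
    """Return b - max(a.x); a negative margin means the limit a.x <= b can be crossed."""
    return b - support(a, center, shape)


# Hypothetical two-state example (pressure, temperature) with limit pressure <= 3000.
a, b = [1.0, 0.0], 3000.0
center = [2800.0, 120.0]
small_set = [[100.0 ** 2, 0.0], [0.0, 5.0 ** 2]]   # pressure can deviate by 100
large_set = [[300.0 ** 2, 0.0], [0.0, 5.0 ** 2]]   # pressure can deviate by 300
margin_small = margin_to_limit(a, b, center, small_set)
margin_large = margin_to_limit(a, b, center, large_set)
print(margin_small, margin_large)
```

A positive margin indicates that the approximated reachable set stays on the safe side of the limit, whereas a negative margin indicates that some reachable state violates it.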
The reasons for adopting the approach by Azzam et al. [8,9] are as follows: (1) The online safety monitoring algorithm is efficient and can scale well with complex systems. (2) The algorithm is tailored to systems that can be modelled as Linear Time-Invariant (LTI) which is a widely used modelling paradigm in practice, and can be used to model the TEP in our running example. (3) The probability of damage, given by the suspicion metric, can be used to estimate the impact of the suspected attack.
The approach by Azzam et al. [8,9], which is already instantiated for the TEP, can raise warnings having two different levels of criticality, based on the likelihood of damage taking place given by the suspicion metric. The first is a low-criticality warning where the system may reach an unsafe state if it is under a stealthy sensor attack, but is far from these states. The second is a high-criticality warning where the system is dangerously close to the unsafe operating region. The first level of warning suggests that there is sufficient time to collect potential evidence before a potential failure occurs, and, as such, it can be a suitable trigger for proactive data collection. The second warning level suggests the need for immediate safety measures, such as engaging a backup controller [14]. In this paper, we focus on the case where a low-criticality warning is raised and we consider it as a suitable trigger for proactive, real-time data collection activities.

Identifying Relevant Data (P2)
Of the large volume of data generated by an ICS, we would like to only collect data that is likely to be lost due to damage caused by a potential stealthy attack. We assume that this data is more likely to constitute potential evidence of such an attack, and as such is relevant [32] to our proactive data collection activities. In this paper, since we are concerned with attacks that affect the physical process, we only focus on data originating from control devices that are at the lowest level of the ICS network architecture [16,25]. Namely, we focus on PLCs and on sensor measurements. We denote by $O = \{o_1, \dots, o_m\}$ the set of the system's output variables and by $PLC = \{PLC_1, \dots, PLC_{n_{plc}}\}$ the set of PLCs in the system. Table 1 summarizes the kind of data we can collect from these devices. For each $PLC_i$, we denote by $Config[PLC_i]$ the data we can collect from $PLC_i$.
Unlike data found in process historians, the data that we propose to collect in real-time (Table 1) is assumed to be acquired using dedicated forensic tools, which can help explain how an attack may have happened [2,23,42]. Although some sensor measurements that we propose to collect can be collected by a process historian, other sensor measurements are collected using network forensic tools that include information, for instance, about source and target IP addresses. This information may be helpful in understanding how a potential attack occurred, e.g., tracing IP addresses to an infected host machine.
Table 1. Potentially relevant data (for the stealthy attack that we consider) at the control level of the ICS network architecture and ways to collect it in real-time [16,25].
[Table 1: recoverable content is limited to the column header "Source" and one collection method, "The proprietary software of the PLC [23] and PLC data acquisition tools (e.g., [42])".]
In a geographically distributed ICS, the layout of process equipment is designed by taking into account safety properties. Safety-critical process equipment, such as reactors, have an associated explosion energy, a measure of the energy that would be released should a safety failure take place within the equipment. This measure determines safe distances that separate different process equipment such that physical damage in one piece of equipment has a low probability of affecting another safety-critical physical process. These energy measures and associated distances are then used to design the layout of the process control system. The result is a partition of the land area available for the plant into different sub-areas where safety-critical equipment should be placed, as shown in Figure 2 [36]. In the process industry this design-time activity is referred to as safe Process Plant Layout (PPL) [36]. To reduce the chance of damage spreading from one area into another, protection devices can be installed [34]. Note that such layout design does not only account for damage that could result from an explosion, but also from other incidents like chemical spills and fire.
In the TEP example (Figure 2), we assume that the equipment running the safety-critical processes is geographically distributed according to a safe PPL design [36]. In this case, if damage occurs in the reactor for instance, then we assume that the damage will be restricted to the control devices located in the vicinity of the reactor (e.g., a PLC implementing the reactor's temperature control), and will not extend to devices elsewhere. In this sense, we can define the set of process sub-areas as a partition of the set of control devices $PLC$. Namely, let $A = \{A_1, A_2, \dots, A_{n_A}\} \subset 2^{PLC}$ be the set of the ICS sub-areas. In such ICS, safety constraints are usually expressed as linear combinations of output variables. Namely, let $\varphi := \sum_{i=1}^{m} a_i o_i \diamond b$ be a safety constraint, with $a_i, b \in \mathbb{R}$, $\diamond \in \{\geq, \leq\}$, and $o_i \in O$, and let $\Phi$ be the set of safety constraints in a given system. Damage in a sub-area $A_i$ can happen due to a violated safety constraint $\varphi_i \in \Phi$. We denote by $safe(\varphi_i) \in A$ the plant area or the subset of control devices that will be affected by the violation of the safety constraint $\varphi_i$, and by $var(\varphi_i)$ the subset of output variables associated with $\varphi_i$. This is normally determined by process engineers at the PPL phase [36]. Table 2 lists the main safety constraints of the TEP and associates each constraint with the plant area (Figure 2) where the corresponding safety-critical process is located. For example, constraint $\varphi_1$ sets an upper limit on the pressure inside the reactor. Since the reactor is located in area $A_1$, a violation of this constraint could lead to damaging the equipment in $A_1$; i.e., $safe(\varphi_1) = A_1$.
Recall that the proposed data collection activity is initiated by a low-criticality warning at a time instant $k$ from predictive safety checks that return a subset $\Phi_v(k) \subseteq \Phi$ of safety constraints that may be violated in the future [8,9]. Given this subset of safety constraints, we can use Table 2 to determine a set of relevant device and network data (Table 1) at time $k$: the relevant PLC data $Config[PLC]_{rel}(k)$, collected from the PLCs in the areas $safe(\varphi_i)$, and the relevant output variables $O_{rel}(k)$, given by the variables $var(\varphi_i)$, for each $\varphi_i \in \Phi_v(k)$. Consequently, the set of relevant data that we are in a position to collect at a time $k$ is $D_{rel}(k) = Config[PLC]_{rel}(k) \cup O_{rel}(k)$.
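The identification of relevant data can be sketched in a few lines of Python. The sketch assumes one reading of the two sets that is consistent with the definitions of $safe(\cdot)$ and $var(\cdot)$, namely $Config[PLC]_{rel}(k) = \{Config[PLC_j] : PLC_j \in safe(\varphi_i), \varphi_i \in \Phi_v(k)\}$ and $O_{rel}(k) = \bigcup_{\varphi_i \in \Phi_v(k)} var(\varphi_i)$; this is an assumption rather than the authors' exact formulation. The class name, the PLC identifiers, and the two constraints are hypothetical and do not reproduce Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyConstraint:
    name: str
    area: frozenset       # safe(phi): PLCs in the plant area affected by a violation
    variables: frozenset  # var(phi): output variables appearing in the constraint


def relevant_data(violated):
    """Return (PLC configurations, output variables) relevant at time k."""
    plcs = set()
    outputs = set()
    for phi in violated:
        plcs |= phi.area
        outputs |= phi.variables
    configs = {f"Config[{plc}]" for plc in plcs}
    return configs, outputs


# Hypothetical constraints: phi_1 bounds the reactor pressure in one area,
# phi_2 concerns a different area and is not predicted to be violated.
phi_1 = SafetyConstraint("phi_1", frozenset({"PLC_1", "PLC_2"}), frozenset({"o_1"}))
phi_2 = SafetyConstraint("phi_2", frozenset({"PLC_3"}), frozenset({"o_2"}))

configs, outputs = relevant_data([phi_1])   # Phi_v(k) = {phi_1}
d_rel = configs | outputs                   # D_rel(k)
print(sorted(d_rel))
```

Only data tied to the area of the potentially violated constraint ends up in `d_rel`; the configuration of `PLC_3` and the variable `o_2` are left out because `phi_2` is not in the warning.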

Deciding Which Data Should Be Collected (P3)
Real-time data collection activities are generally associated with a certain cost. Additionally, our original predictive safety checks are uncertain as they return a probability of damage, as given by the suspicion metric $s(k)$ [8] at a time instant $k$. From this observation, it is necessary to decide whether to collect each data item in $D_{rel}(k)$ based on a trade-off between the expected impact of the attack and the collection cost associated with that item. Namely, we seek to select a subset of data to collect, $D_{col}(k) \subseteq D_{rel}(k)$. To achieve this trade-off, we propose a decision-theoretic framework based on the Value of Information (VoI) analysis. Our main assumption in the development of this framework is that each of the data items identified in $D_{rel}(k)$ is equally likely to constitute potential evidence. For example, if $D_{rel}(k)$ consists of a pressure sensor measurement and a configuration of the PLC implementing the pressure controller, then we assume that it is equally likely that the attacker performed their attack by either changing the code of the PLC (Section 4) or intercepting and modifying the pressure sensor measurement. Thus, the only trade-off we are concerned with is the one between the expected impact of the attack and the cost of data collection.
VoI Representation. We assume that the ICS, represented as $\Pi(\theta)$, operates in nominal conditions under a certain performance measured by a revenue $W(\theta)$ that depends on some system parameters grouped in $\theta$. For example, the TEP's performance is usually measured by the revenue resulting from the amount of chemical product produced [15,37]. The operational cost of $\Pi(\theta)$ is denoted by $C(\theta)$, and can be estimated based on a variety of parameters, such as the amount of reactants consumed in the TEP case. In this paper, we restrict this cost to the cost of data collection, i.e., collecting the data in the set $D_{rel}$. Under a given parameter setting $\theta$, the value (profit) of $\Pi(\theta)$ is given by $V(\theta) = W(\theta) - C(\theta)$. However, due to uncertainties associated with parameters $\theta$, we must use an expected parameter estimate to compute $V(\theta)$. Consequently, we consider the expected value (profit), denoted by $E[V(\theta)]$, under expected parameters $E[\theta]$.
Suppose that we predict that a potential stealthy attack may bring the system into a new operating mode, with parameters $\theta'$ such that $E[V(\theta')] < E[V(\theta)]$, by causing damage to one or more of the plant areas $A_i$ (i.e., we are presented with the trigger described in Section 5.2). Then we can ask whether it is worthwhile to collect certain relevant data (as identified in $D_{rel}(k)$) that may allow us to reveal a breach like the one illustrated in Section 4. Namely, we may want to collect a given $\delta_i \in D_{rel}(k)$ only if its collection cost does not exceed the potential reduction in the value or the revenue of the system's operation (i.e., attack impact) if the predicted stealthy attack is successful. The change in this expected value is given by: $\Delta V(\theta, \theta') = E[V(\theta)] - E[V(\theta')] = (E[W(\theta)] - E[W(\theta')]) - (E[C(\theta)] - E[C(\theta')])$. As a simplifying assumption, we can take the cost of data collection under nominal conditions $C(\theta)$ as 0. On the one hand, if we choose not to collect any data and subsequently not attempt to prevent the damage from the breach, then $C(\theta') = 0$ and in this case $|\Delta V(\theta, \theta')| = E[W(\theta)] - E[W(\theta')] = E[|\Delta W|]$. On the other hand, if we choose to collect data that can help us identify the location of the breach and prevent damage from happening, then we can assume that $W(\theta') = W(\theta)$ and $|\Delta V(\theta, \theta')| = E[C(\theta')]$. In other words, the collection of data that may reveal a suspected breach is worthwhile only if the expected cost of collection does not exceed the expected reduction in the operating revenue (performance), $E[|\Delta W|]$, i.e., the expected impact of the attack.
Example. Consider the TEP system generating a value of 0.8 under nominal operation. Our online safety monitor predicts at a time instant $k$ that a potential breach may be able to cause damage to two out of the five plant areas in $A$ with a probability (given by the suspicion metric) of $s(k) = P(damage) = 0.4$. If this happens, then the new revenue of operation is 0.48. As such, the new expected revenue of operation is given by: $E[W] = P(damage) \cdot (W \mid damage) + (1 - P(damage)) \cdot (W \mid \neg damage) = 0.4 \times 0.48 + 0.6 \times 0.8 = 0.672$. Hence, the expected reduction in revenue of operation (i.e., attack impact) is $0.8 - 0.672 = 0.128$. This is the maximum we will "pay" to identify the location of the suspected breach, i.e., to collect each of the data items in $D_{rel}(k)$.
Cost of Data Collection. In the set $D_{rel}(k)$ that we identified in Section 5.3, the collection of each data item $\delta_i \in D_{rel}(k)$ is associated with a certain cost $C(\delta_i)$. This measure of collection cost can be estimated, for example, based on the capabilities of the existing data acquisition tools. Namely, if the collected data is to be stored in some secure server, one could set a limit on the total size of data that we can collect at a given time instant $k$. Then, we can define a collection cost $C(\delta_i)$ as the ratio of the space occupied by $\delta_i$ to the size limit for data collection at a time $k$. In a similar fashion, this cost measure can account for bandwidth limitations and data transmission delays that may arise in large-scale geographically-distributed ICS. Regardless of the exact definition of $C(\delta_i)$, we propose that this measure be a dimensionless number in the same range as the measure of revenue of operation ($W(\theta)$), so we can soundly perform the trade-off as illustrated in the previous example. Going back to the previous example, assume that the identified relevant data consists of the pressure sensor measurement $o_1$ and the ladder logic of the PLC implementing the pressure controller, $Config[PLC_1]$. Since a single sensor measurement may occupy significantly less space than a PLC ladder logic configuration, we assume that the cost of collecting $o_1$ is $C(o_1) = 0.02$ and that of $Config[PLC_1]$ is $C(Config[PLC_1]) = 0.2$. Since the maximum we will pay to identify the location of the suspected breach is 0.128, we can conclude that $o_1$ is worth collecting ($0.02 \leq 0.128$) while $Config[PLC_1]$ is not ($0.2 > 0.128$).
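The size-based cost measure described above can be sketched as follows. The storage limit and the item sizes are hypothetical and were chosen only so that the resulting costs coincide with the values 0.02 and 0.2 used in the example; the function name is likewise an assumption.

```python
def collection_cost(size_bytes, size_limit_bytes):
    """Dimensionless cost: share of the per-instant storage limit used by one item."""
    if size_limit_bytes <= 0:
        raise ValueError("size limit must be positive")
    return size_bytes / size_limit_bytes


# Hypothetical limit on the data collected at one time instant k (1 MB),
# and hypothetical sizes of a sensor measurement record and a ladder logic dump.
SIZE_LIMIT = 1_000_000
cost_o1 = collection_cost(20_000, SIZE_LIMIT)       # pressure measurement o_1
cost_config = collection_cost(200_000, SIZE_LIMIT)  # Config[PLC_1]
print(cost_o1, cost_config)
```

Because the cost is a ratio, it is dimensionless and lies in [0, 1] as long as no single item exceeds the limit, which keeps it comparable with a revenue measure normalised to the same range.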

Real-Time Proactive Data Collection
Algorithm 1 Proactive Real-time Collection of Relevant Data
Inputs: A subset of potentially violated constraints Φ v (k); the suspicion metric s(k) = P (damage)
Parameters: collection cost of potential data types C(δ i ); ICS operating value under nominal conditions (W |¬damage)
Outputs: D col (k) (a set of data to collect)

We implement the three steps of our approach for collecting relevant data in Algorithm 1. The execution of Algorithm 1 at a time instant k is triggered by a low-criticality warning from the online safety monitor [8,9], which returns the suspicion metric s(k), i.e., the probability of damage taking place, in addition to the constraints Φ v (k) that may be violated in the future. We first use the safe PPL and Table 2 to identify data that is more likely to be lost due to the predicted damage. We then perform the VoI analysis on the identified relevant data in D rel (k). First, we estimate the expected reduction in revenue (impact) from the predicted attack, as illustrated in the previous example. We use this computed impact as the maximum we will pay to collect a certain δ rel ∈ D rel (k) and subsequently make our collection decision given an associated C(δ rel ).
Note that the estimation of the revenue if damage happens, (W |damage), can be performed using the number of areas A i that are predicted to be damaged by the online safety monitor. In our numerical example, the safety monitor predicted that two out of five areas in A may be damaged. As such, we can estimate a 2/5 = 40% reduction in the original revenue under nominal operation, i.e., (W |damage) = 0.8 × (1 − 2/5) = 0.48. A better estimate can be obtained by incorporating process engineering knowledge about the contribution of each plant area to the revenue; however, such considerations are beyond the scope of the present work.
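The area-based estimate of (W |damage) can be written as a small helper. The sketch below assumes, as in the text, that every plant area contributes equally to the revenue; the function name `revenue_if_damaged` is hypothetical.

```python
def revenue_if_damaged(nominal_revenue, damaged_areas, total_areas):
    """Estimate (W | damage), assuming each plant area contributes equally."""
    if total_areas <= 0:
        raise ValueError("total_areas must be positive")
    if not 0 <= damaged_areas <= total_areas:
        raise ValueError("damaged_areas must lie between 0 and total_areas")
    return nominal_revenue * (1 - damaged_areas / total_areas)


# Running example: nominal revenue 0.8, two of five areas predicted damaged.
w_damage = revenue_if_damaged(0.8, damaged_areas=2, total_areas=5)
print(round(w_damage, 2))  # 0.48
```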

Evaluation
This section discusses our evaluation strategy, describes the testbed that we use in the evaluation, and presents the results.

Overview
Our evaluation does not include a discussion of the trigger for data collection (P1), which is based on previous work [8,9]. In this paper, we evaluate the relevance of the collected data (P2), and assess the reduction in control performance overhead obtained using our decision-theoretic framework (P3). We then demonstrate a use case of our approach using an attack scenario. This evaluation is consistent with previous work on FR [5]. While Alrajeh et al. [5] employed existing datasets to perform their evaluation, we resort to simulations of a realistic virtual testbed, because no available dataset is suitable for evaluating our work. To the best of our knowledge, datasets on ICS security (e.g., [11]) are not constructed to proactively collect data that can be relevant to identifying how a stealthy attack occurred. They are instead intended for training and testing statistical models for attack detection, and are thus not suitable for evaluating our approach. Data collected by process historians, as illustrated in Section 4, may not explain how an attack occurred. Thus, we do not compare our approach with process historians.

Testbed Description
We employed a simulation of the TEP implemented in MATLAB/Simulink by Bathelt et al. [10], based on the control architecture designed by Ricker [37]. We augmented the simulation with an implementation of the algorithm proposed by Azzam et al. [8] and with Algorithm 1. The simulation also includes blocks to simulate networked behaviour and data transmission delays according to the setup given in [8]. To simulate the process of data collection, we augment the controllers in the TEP simulation with logging capabilities. Namely, when Algorithm 1 identifies the set D col (k), the corresponding controllers generate a log in JSON format. In particular, we assume that each of the controllers designed by Ricker [37] is implemented on a PLC. As such, we simulate the collection of data from these devices by adopting a log structure that mimics the logging of PLC ladder logic configurations (e.g., [43]). Namely, each log entry consists of a timestamp, the sensor measurements received, details about the controller configuration, and the output of the controller (actuation signal). Since each controller in the architecture proposed by Ricker [37] is a discrete-time Proportional-Integral (PI) controller, the controller configuration is represented by the coefficients of the proportional and integral terms together with the sampling time. For sensor measurements, the corresponding sensors simply generate a JSON log entry that includes the timestamp, the value, and the name of the sensor measurement (Table 2). We performed all the simulations described in this paper on a machine with an Intel i7-9750H CPU clocked at 2.6 GHz with 16 GB of memory.
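To make the log structure concrete, the following Python sketch builds one controller entry and one sensor entry with the fields listed above. The field names, the function names, and the example values are assumptions made for illustration; they do not reproduce the schema used in the simulations.

```python
import json


def controller_log_entry(timestamp, controller, measurements,
                         kp, ki, sampling_time_s, output):
    """Serialise one controller log entry (hypothetical field names)."""
    entry = {
        "timestamp": timestamp,
        "controller": controller,
        "measurements": measurements,  # sensor values received by the controller
        "configuration": {             # PI gains and sampling time
            "kp": kp,
            "ki": ki,
            "sampling_time_s": sampling_time_s,
        },
        "output": output,              # actuation signal
    }
    return json.dumps(entry)


def sensor_log_entry(timestamp, name, value):
    """Serialise one sensor log entry: timestamp, measurement name, value."""
    return json.dumps({"timestamp": timestamp, "name": name, "value": value})


# Illustrative values only.
print(controller_log_entry(3.6, "pressure_controller", {"o1": 2705.0},
                           kp=0.1, ki=0.05, sampling_time_s=1.8, output=40.2))
print(sensor_log_entry(3.6, "o1", 2705.0))
```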

Relevance and Overhead Reduction
To assess the relevance of collected data and the reduction in performance overhead resulting from data collection, we run a large number of randomised simulations. In each simulation, we randomise the operating levels of the TEP (i.e., set-points for pressure, temperature, reactor level, etc.), and we choose attacked sensors randomly. For relevance, we check in each simulation whether any data that was lost to damage by the attack was missed by the set identified in step (P2) (D col (k)).
The decision-theoretic framework proposed in step (P3) already ensures, by design, that the collection of data is enabled only when the expected reduction in system performance exceeds the cost of data collection. Thus, we do not compare the revenue before and after data collection. Instead, we showcase how, as a result of the decision-theoretic framework, there is a limited effect on the system's performance. We compare the average execution time of controllers under forensic logging enabled by Algorithm 1, with the case where forensic logging is enabled at all times from all controllers. We also present the average reduction in terms of the number of log entries between the two scenarios. In addition, and as a reference, we present the average execution time of controllers in the case where no logging is enabled. Although our logging mechanism is a simplified version of data acquisition tools, the comparison of average controller execution times between the aforementioned scenarios serves to give an idea of the potential performance improvement brought by our approach with actual data acquisition tools. For each of the testing scenarios previously described, we ran a total of 1000 simulations. Results are shown in Table 3.
Discussion. The low percentage of data missed by the proposed algorithm, as shown in Table 3, demonstrates the capability of our approach to identify relevant data (P2), thus limiting the loss of potential evidence due to damage by a stealthy attack. Additionally, the use of Algorithm 1 with the proposed decision-theoretic framework (P3) reduces the controllers' average execution time by around 35% compared with the case where logging was performed from all controllers at all times. This is consistent with the 31% reduction in the size of logs due to Algorithm 1. These results show that the proposed decision-theoretic framework enables the collection of data in a way that minimises the interference with control operations.

Use Case: Supporting PLC Log-based Live Forensics
To demonstrate a potential use case of our approach, we consider tools that can perform live forensics in ICS, i.e., investigating potential attacks while the system is running. Because ICS are safety-critical and difficult to shut down, methods for live forensics [3] have been proposed so that investigations can proceed without stopping the process. One important requirement for such methods is that they minimise interference with the safety-critical control operations [16].
In the previous section, we showed how our approach reduces the number of PLC logs that need to be collected, hence reducing the effect of logging mechanisms on the controllers' execution time. In this section, we further show how our approach can improve the performance of live forensics operations, because we only consider relevant data and account for the system's performance while guiding logging activities. A class of ICS live forensics methods [42] consists of investigating PLC logs to check for any malicious modification to their code. To showcase the increase in performance of such methods, we revisit the Stuxnet-inspired attack scenario introduced in Section 4. In the second phase of the attack, the attacker modifies the function codes in the PLC controlling the reactor pressure in a way that drives the reactor slowly beyond safe operating regions. We run a simulation of such an attack by introducing malicious code to the controller handling the reactor pressure. We augment the implementation of Algorithm 1 and the logging mechanism described previously with an implementation of the technique proposed by Yau and Chow [42], adapted to the log structure that we use in our simulations. Whenever controller logs are generated, we compare the logged control output with the one expected from the originally designed control law (in [37]). If the two values differ by more than a pre-defined threshold, the code checking reports a code modification (Figure 3).
Figure 3 shows a plot of the received and actual (real) pressure measurements from the reactor under an integrity attack on the pressure sensor, along with the anomaly detector's detection metric. In addition, we show the results of the code modification checks performed on the pressure controller logs collected according to our approach, plotted against the timestamps of the collected logs.
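The output comparison at the core of this code check can be sketched as follows. The sketch assumes a textbook positional discrete-time PI law, u(k) = Kp e(k) + Ki Ts Σ e(j), which is not necessarily the exact form of the control law in [37]; the log fields, the function name `check_controller_logs`, and the numerical values are likewise hypothetical.

```python
def check_controller_logs(entries, kp, ki, ts, setpoint, threshold):
    """Flag log entries whose logged output deviates from the expected PI output.

    Each entry is a dict with hypothetical keys 'timestamp', 'measurement'
    and 'output'. Returns a list of (timestamp, modified) pairs.
    """
    results = []
    integral = 0.0
    for entry in entries:
        error = setpoint - entry["measurement"]
        integral += error * ts                 # rectangular integration
        expected = kp * error + ki * integral  # assumed positional PI law
        modified = abs(entry["output"] - expected) > threshold
        results.append((entry["timestamp"], modified))
    return results


# Illustrative logs: the second logged output does not match the PI law.
logs = [
    {"timestamp": 0, "measurement": 9.0, "output": 3.0},
    {"timestamp": 1, "measurement": 9.5, "output": 4.0},
]
print(check_controller_logs(logs, kp=2.0, ki=1.0, ts=1.0,
                            setpoint=10.0, threshold=0.5))
# [(0, False), (1, True)]
```

Because the check only needs the logged measurement, output, and controller configuration, it runs only on the entries that were actually collected, so fewer collected logs translate directly into fewer checks.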
We also analyse the performance of PLC code checking under two cases: 1) the checking is performed on logs generated according to Algorithm 1 and 2) the checking has to consider logs from all controllers at each time instant (Table 4). The results were averaged over a simulation of approximately 30 hours, equivalent to around 60000 real-time checks given the system's 1.8 second sampling time.
Discussion. Figure 3 shows that our approach enabled the collection of logs from the reactor's pressure controller, on which code checking detected the modification of the PLC code well before any alarm was raised by the existing anomaly detector, even though our approach only collected logs from the pressure controller at the start of the attack. In the case where logs are needed for the duration of an attack, Algorithm 1 can be modified so that it can be overridden to keep collecting logs for as long as the code checking reports that the code is modified. The benefit of performing PLC code checking under the proposed approach is highlighted by the results shown in Table 4. Under this approach, the average real-time execution time of code checking was reduced by more than an order of magnitude. In ICS, live forensics is used more often than post-mortem forensics [16] to avoid the high costs of shutting down the process. The performance enhancement brought about by the proposed approach has the potential to reduce the risk of interfering with the real-time performance constraints of safety-critical ICS when live forensics is performed, especially considering that devices such as PLCs feature low computational resources.
Based on the results obtained in this section, it can be expected that our approach may also improve the performance of other log-based attack detection techniques. As our approach specifies when and which data is worth collecting at a given time, it reduces the amount of logs that need to be processed by such techniques. For example, the technique presented by Hussain et al. [22] relies on converting logs from different ICS components into a Petri net representation in order to detect anomalies based on existing process knowledge. In the future, it may be worth investigating how the reduction in log sizes affects the performance of such a conversion, and whether it can help increase the precision of the anomaly detection technique. Our approach could also use machine learning to improve the selection of logs for this purpose.
Potential Limitations. The use case presented in this section has highlighted a potential limitation of our approach: data collection may be triggered for a long time even if no attack is taking place, as shown in Figure 3. However, live forensics tools may serve as a heuristic to determine whether proactively collected data is to be preserved or discarded. For instance, in our use case scenario, the PLC logs collected when the code checking showed that no attack was taking place could be discarded in order to save storage resources. Future work could look into further ways of handling the data collected by our approach in order to optimise the use of storage.
Another potential limitation of our approach is that it focuses on a specific type of attack on ICS. While the model we considered is widely studied, further work would be needed to consider a variety of other attacks. The approach presented in this paper serves as a framework to consider data collection for future investigations of other types of attacks. Namely, one can have a certain metric reflecting a "probability of harm", similar to the suspicion metric, for other attack types. Our framework for relevant data identification and selection can then be used based on the new metric.

Conclusion
In this paper, we proposed an approach to engineer Forensic-Ready safety-critical, geographically distributed ICS facing the threat of stealthy attacks that target physical processes. The approach proactively collects relevant data before it is lost to damage caused by a stealthy attack. In the absence of alarms about stealthy attacks from anomaly detectors, our approach uses an alternative trigger for data collection based on predictive safety checks. Then, using the plant layout and the constraints that are predicted to be violated, we identify data that is most likely to be lost due to damage. We also proposed a decision-theoretic framework that enables the collection of data only when the collection cost does not exceed the expected reduction in the system's performance due to the predicted attack.
Our evaluation of the proposed approach has shown that it ensures the relevance of collected data and incurs a limited control performance overhead. Furthermore, we demonstrated the advantage of such an approach in improving the efficiency of existing live forensic log analysis techniques. In future work we will incorporate process engineering knowledge to better estimate the potential impact of a predicted attack in our decision-theoretic framework. We will also evaluate our approach on a live testbed using data acquisition tools, and investigate how the approach can help improve the performance and efficiency of attack detection techniques.