Novel cyber fault prognosis and resilience control for cyber–physical systems



Introduction
Cyber-physical systems (CPSs) are systems with integrated computational, network, and physical components. With the increasing connectivity between cyberspace and physical systems, capturing the interactions between the cyber and the physical sides becomes increasingly important [1]. In particular, network imperfections and dynamics, such as limited channel capacity, traffic congestion, and malicious attacks, can degrade the performance of the control system or even destabilise it. This makes controller design more challenging and complex. In the existing literature, this issue is often oversimplified when designing CPS controllers, which could result in severe system failure. For example, hackers can remotely take control of a vehicle and cut its transmission on the highway [2]. Such automotive cyberattacks also threaten people's lives. Therefore, detection, estimation, isolation, and mitigation schemes for cyberattacks/faults have to be investigated to improve the resilience of the entire CPS.
In the past few years, many control and system researchers have pioneered the development of approaches and tools to model and control CPSs. The works in [3][4][5][6][7] addressed fault detection, isolation, and mitigation in the physical subsystem alone (e.g. actuators, sensors, and controlled components). At the same time, communication and signal processing researchers have made major breakthroughs in the monitoring, identification, and defence of cyberattacks and other security issues on the cyber side [8][9][10][11][12][13][14][15][16][17]. Such existing approaches focus on either the cyber or the physical control aspect while ignoring or oversimplifying the other. However, such decoupled designs often fail in practical CPSs. Therefore, to address the aforementioned issues, a redesign of the control and fault prognosis system that takes into account the interaction between cyberspace and physical systems is necessary.
Inspired by this motivation, we propose a novel prognosis scheme in this work. The main contributions are:
(a) A novel prognosis scheme for cyber network fault detection and prediction is proposed.
(b) The estimation of the network delay distribution is derived based on time-series analysis. The convergence of the estimation error is presented in Lemma 1.
(c) An isolation scheme is proposed to distinguish soft and hard faults based on the prediction of potential failures in the system states. Theorem 1 shows the convergence of this prediction.
(d) A decision-making scheme for resilience control triggering is developed. The simulation results in Section 6 illustrate that this scheme proactively triggers the resilience control and effectively avoids physical system failure.
The rest of the paper is organised as follows. In Section 2, a motivating example is given to illustrate the relationship between cyber condition and system behaviour. Next, the related works on fault diagnosis and prognosis are presented in Section 3. In Sections 4 and 5, the proposed prognosis scheme is described. The simulation results are shown in Section 6 and the conclusions are given in Section 7.

Motivation
In this section, the design challenge introduced by the interaction between the cyber network and the physical system is discussed in a simple scenario. This example emulates a route hijacking by an attacker who eavesdrops on the control information, which could later be used for taking over the control. The network delay and delay variation increase when the attacker secretly relays and possibly alters the communication between the controller and actuators. The network is simulated using Network Simulator 2 (NS2) with a random topology of 11 nodes. The Ad hoc On-Demand Distance Vector (AODV) routing scheme is adopted. In Fig. 1a, an attacker relays the transmission. The change of the network topology introduces sudden packet delays and delay variation into the control loop. Similar network performance could result from topology or traffic pattern changes. An optimal controller is then used to regulate a two-input four-output (2I4O) system [18].
The network topology and traffic changes also disturb the CPS. In Fig. 1b, a route hijacking attack is launched at t = 10.5 s. The sudden change of the network delay makes the states of the optimally controlled system start to oscillate. The network dynamics thus lead to an unstable CPS.
Therefore, cyber network attacks indeed affect the system performance. Cyberspace is particularly difficult to secure because of the vulnerabilities of the connections between cyber and physical systems. Of growing concern is the cyber threat to critical hardware devices. Cyber network faults could cause harm or disrupt services upon which our economy and daily lives depend. In light of the risk and potential consequences of cyber events, strengthening the risk awareness and resilience of CPSs has become an important mission.
IET Cyber-Phys. Syst., Theory Appl. This is an open access article published by the IET under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/)
To address the cyber network fault issues on the physical system side, a full knowledge of the relation between cyber condition and system performance is essential.Hence, we propose a scheme for detecting and isolating cyber and physical system faults.Also, a resilience control triggering strategy is proposed to proactively trigger the controller and accommodate the potential failures ahead of time.

Related works
In this section, the existing works on cyber security are briefly discussed first. Then, the works on fault awareness of the physical system are discussed.
The overall goals of cyber security include integrity (the trustworthiness of data or resources), availability (accessibility upon demand), and confidentiality (keeping information secret from unauthorised users). Many researchers have addressed these issues with different technologies, such as authentication schemes, access control, and other defence schemes [9][10][11][12][13][14][15][16][17]. An assumption that the adversary/attack model is fully known is often required; however, such a model is challenging to obtain. In [10], deception and denial-of-service attacks against networked control systems (NCSs) are addressed, and a countermeasure based on semi-definite programming is proposed. This work and the following literature are only valid for a specific attack model, which cannot be known a priori. A defence scheme that does not require knowledge of the attack model is needed.
In [11], false data injection attacks against static state estimators are introduced. Undetectable false data injection attacks can be designed even when the attacker has limited resources. Stealthy deception attacks against the supervisory control and data acquisition system are studied in [12]. In [13], the effect of replay attacks on a control system is studied. In [14], the effect of covert attacks against control systems is investigated: a parameterised decoupling structure alters the behaviour of the physical plant while remaining undetected by the original controller. Then, [15] proposed a general theory of event compensation as an information flow security enforcement mechanism for CPSs. Message scheduling methods were given in [16] to improve the security quality of wireless networks for mission-critical CPSs.
Relative to the above works, [17] proposed a mathematical framework for CPSs, attacks, and monitors, and gave the fundamental limitations of monitors from system-theoretic and graph-theoretic perspectives. Centralised and distributed attack detection and identification monitors were then designed. Overall, many cyberattacks can be addressed on the cyber side. However, the effects of cyberattacks/faults on the physical system behaviour are oversimplified in the above-mentioned works. Moreover, the injection time and model of the attacks/faults are difficult to learn ahead of time in practical CPSs.
Control researchers have focused on conventional fault detection techniques that have been successfully applied to industrial NCSs, and have considered network delay and packet loss in various ways. In [19], a resilient control problem is studied in which control packets transmitted over a network are corrupted by a human adversary. A receding-horizon Stackelberg control law is proposed to stabilise the control system despite the attack. However, the approach requires a priori knowledge of the attack model and type. In [3], network delays were modelled as a constant delay (time buffer), an independent random delay, and a delay with a known probability distribution governed by a Markov chain model. In [4], a networked predictive controller in the presence of random delay in both forward and feedback channels was proposed to minimise the effects of network failures. A robust H∞ control for a non-linear T-S fuzzy model system was proposed in [5] to address network delays and packet drops. However, the upper bound of the delays is assumed to be known, which is difficult to satisfy in reality. Then, [6,7] employed a state observer-based fault detection method for uncertain long time delays. Although the network delays and packet drops caused by network faults/failures were considered in the above works, assumptions such as known bounds and time-invariant distributions of delays and packet losses are always made. In addition, most of the above works aimed to detect faults of physical components (sensors, actuators, and system plant), not faults in cyberspace.
This work is motivated by the need to address cyber network fault detection, isolation, and prediction. In addition, a tolerant control scheme and its triggering strategy are proposed to stabilise the CPS despite cyber network faults and to optimise the computational overhead.

Proposed prognosis scheme
In this section, an overview of the proposed prognosis scheme is given in Section 4.1. An online kernel density estimation (KDE)-based probability density function (PDF) identifier is introduced in Section 4.2. The details of the proposed prognosis scheme are then presented in Section 5. Finally, the resilience controller is designed in Section 5.3.

Overview
In this work, the uncertainties in cyberspace, including traffic congestion, topology changes, and attacks, cause abnormal delays and packet losses on the physical system side. Monitoring such delays and packet losses online is required for the detection of cyber network faults. Moreover, an observer is needed to detect physical system faults and isolate them from cyber network faults [7].
The proposed prognosis scheme is shown in Fig. 2. It includes four main steps that are continuously repeated:
(a) Sensing of network delays. In this work, we assume a transmission control protocol (TCP)-based network, such that the delay is provided by the acknowledgement of each data packet.
(b) Delay update. The n delays in the sliding window are used for the PDF estimation at time k ($\mathrm{PDF}_k$). When a new delay is measured, the data in the sliding window are updated.
(c) Cyber network fault detection. The PDF of these n delays is obtained by using the online KDE-based PDF identifier, and the probability $P_i^k$ of each delay interval is calculated. By comparing $P_i^k$ with the preset threshold $R_{p_i}$, the PDF variation is captured and a cyber network fault is detected. Meanwhile, if the observer shows no abnormal behaviour, it can be confirmed that only a cyber network fault has occurred.
(d) Potential degradation prediction. If a cyber network fault is detected, the PDF of the new delay distribution is predicted by using time-series analysis. Then, delays following the new distribution are resampled. Finally, the prediction of the future physical system outputs is obtained. If the system states deviate out of the acceptable range, a hard fault is detected. Otherwise, it is a soft fault, which is not severe enough to trigger the resilience controller. More details about fault isolation are presented in Section 5.
Such a scheme can detect stochastic cyber failures/attacks without requiring a priori knowledge of the attack model and its injection time. Only real-time monitoring of the PDF of delays is required for the PDF and system-state prediction. Moreover, the resilience control law, tuned by the probabilities of delays, is derived accurately for the given cyber performance. Its details are introduced in Section 5.

PDF identifier
To obtain the probability information mentioned in Step (b), a PDF identifier is employed [20]. It uses kernel density estimation (KDE) to estimate the distribution of delays iteratively. The data used for the identification are updated at every sampling interval over a window of the last n packet delays. The main steps of online PDF identification are shown in Table 1 (Appendix 9.1). Here, a normal kernel smoother is selected for PDF estimation.
Such a sliding window-based PDF identifier provides the PDF profile of delays in real-time such that the variation of PDF can be captured and observed easily.
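As an illustration of the sliding-window identifier described above, the following is a minimal sketch rather than the implementation of [20]. It assumes a Gaussian (normal) kernel with a fixed, user-chosen bandwidth; the class name `SlidingWindowKDE` and its parameters are hypothetical.

```python
import math
from collections import deque


class SlidingWindowKDE:
    """Gaussian-kernel density estimate over the most recent delays (illustrative)."""

    def __init__(self, window_size, bandwidth):
        # deque(maxlen=...) discards the oldest delay when a new one arrives
        self.window = deque(maxlen=window_size)
        self.h = bandwidth  # fixed bandwidth: an assumption of this sketch

    def update(self, delay):
        """Load the newest measured delay into the sliding window."""
        self.window.append(delay)

    def pdf(self, x):
        """Evaluate the KDE at x by summing the scaled kernels of all observations."""
        n = len(self.window)
        if n == 0:
            raise ValueError("no delays observed yet")
        total = 0.0
        for xi in self.window:
            u = (x - xi) / self.h  # scaled distance from the observation to x
            total += math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
        return total / (n * self.h)
```

Calling `update` once per acknowledged packet and `pdf` on a grid of delay values gives the real-time PDF profile; the bandwidth controls how strongly a single unusual delay changes that profile.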

Cyber network fault detection and isolation
A cyber network fault is detected by monitoring the probability residual (PR). The other residuals, the modelled system output residual (MSOR) and the system performance residual (SPR), are used to isolate cyber network faults from physical component faults. Then, the prediction of the future delay distribution and of the system states is used to isolate soft and hard cyber network faults. Finally, the decision on resilience control triggering is made.

Cyber network fault detection
For cyber network fault detection, three residuals have to be monitored online:
(a) Probability residual (PR): the difference between the probability at time k and that at the previous interval k − 1, as provided by the proposed PDF identifier. The threshold on the PR of each delay interval, $R_{p_i}$, is customisable by the users to satisfy the required fault awareness capability.
(b) Modelled system output residual (MSOR): monitored through the observer to detect physical system faults and isolate them from cyber network faults.
(c) System performance residual (SPR): the threshold for each system output variable, $R_{SPR}$, is determined by the acceptable error magnitudes of the system states. This residual is used to evaluate the system performance and determine when the system should shut down.
If PR exceeds its threshold, a cyber network fault is detected.
Meanwhile, MSOR and SPR should be supervised to perform the root-cause analysis of the degradation of system performance. If a cyber network fault is detected, the type of this fault (soft or hard) should be learned before triggering the resilience controller, because not all types of cyber network fault need to be mitigated by the resilience controller. Unnecessary triggering wastes computational resources. When soft faults happen, the adverse effects on system performance can be handled by the existing controller. Therefore, there is no need to take other control actions; typically, an alarm or warning is sufficient. In contrast, hard faults potentially threaten the system performance in terms of overshoot, time-to-recover (TTR), cost of regulation, and even stability. Moreover, such faults might induce wear and tear or severe damage of the system components. Hence, detecting hard faults in time and triggering the resilience controller are vital for guaranteeing system stability. Moreover, by isolating soft and hard faults, inefficient triggering of the resilience controller is avoided and the overall computational cost is reduced.
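The PR-based detection can be sketched as follows. This is an illustrative sketch only: it assumes that the interval probabilities are obtained by integrating a Gaussian KDE over each delay interval and that the residual is taken as an absolute difference. The function names, the interval edges, and the bandwidth are hypothetical.

```python
from statistics import NormalDist


def interval_probabilities(delays, edges, bandwidth):
    """Integrate a Gaussian KDE of `delays` over each interval [edges[i], edges[i+1])."""
    std_normal = NormalDist()
    probs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Each kernel contributes its own probability mass inside the interval
        mass = sum(
            std_normal.cdf((hi - d) / bandwidth) - std_normal.cdf((lo - d) / bandwidth)
            for d in delays
        )
        probs.append(mass / len(delays))
    return probs


def probability_residuals(p_now, p_prev):
    """PR per delay interval: change in probability between k - 1 and k.

    The absolute value is an assumption of this sketch.
    """
    return [abs(a - b) for a, b in zip(p_now, p_prev)]


def cyber_fault_detected(residuals, thresholds):
    """A cyber network fault is flagged if any PR exceeds its threshold."""
    return any(r > t for r, t in zip(residuals, thresholds))
```

A lower threshold makes the detector more sensitive to small PDF variations at the price of more false detections, which is the trade-off the user tunes.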

Soft and hard cyber network fault isolation
To recognise a hard cyber network fault, we propose an approach that trends the delay distribution into the future. Next, a system-state prediction scheme evaluates the system performance for the estimated future delay distribution. In most cases, there is no need to perform this computationally expensive prediction. Hence, it is triggered based on a user-defined threshold. As shown in Fig. 2, $R_{p_i}$ is a user-defined threshold for each probability variation $P_i(t)$. If $P_i(t)$ does not exceed $R_{p_i}$, i.e. the new delay Delay(t) follows the distribution of time t − 1, there is no cyber network fault, and the distribution and system-state predictors are not activated. Otherwise, a delay distribution change is observed in the PDF identifier and the predictors are activated. The prediction includes four main steps that are repeated until the resilience controller is triggered:
Step 1: new distribution estimation;
Step 2: resampling;
Step 3: system output prediction;
Step 4: soft and hard fault isolation and resilience control triggering.
These steps are discussed in detail next.

Fig. 2 Flow chart of cyber network fault prognosis scheme

Table 1 Online PDF identification algorithm
1. Determine the data in the sliding window for time k and estimate the PDF:
a) Choose a kernel function K centred on x with a bandwidth h;
b) Each observation $x_i$ receives a specific weight proportional to the scaled distance from the observation $x_i$ to x, which is $u = (x - x_i)/h$;
c) At a given x, the estimate is found by vertically summing over the kernel shapes. The general formula for KDE is given by
$$\hat f_K(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
where the dependence of the estimate on the kernel function $K(\cdot)$ is denoted as $\hat f_K$ and n is the number of delays in the sliding window.
2. Update the sliding window with the new data for time k + 1 and go back to Step 1.

Step 1: New distribution estimation:
The expectation and standard deviation of the new distribution are estimated from the delays that induce the new distribution. Time-series analysis is utilised to estimate an autoregressive (AR) model for the expectation E and standard deviation D of the new distribution. The hypothesis of the model is given by:
$$E(k+1) = \beta_{E0} + \beta_{E1} E(k), \qquad D(k+1) = \beta_{D0} + \beta_{D1} D(k)$$
The predictions $\hat E(k+1)$ and $\hat D(k+1)$ are computed from E(k) and D(k), using the estimated coefficients $\hat\beta_{E0}$, $\hat\beta_{D0}$, $\hat\beta_{E1}$, and $\hat\beta_{D1}$, which are collected in θ(k). The one-period-ahead prediction error is:
$$e_E(k+1) = E(k+1) - \hat E(k+1), \qquad e_D(k+1) = D(k+1) - \hat D(k+1)$$
The prediction errors converge over time by minimising a quadratic cost function of $e_E(k+1)$ and $e_D(k+1)$ (4), from which the update law (5) of θ(k), L(k), and O(k) is obtained. Here L(k) and O(k) denote the estimator gain and the estimate of the error variance, respectively. Their initial values are randomly set.
Remark 1: The parameters θ(k), L(k), and O(k) are continuously updated by the training data in the sliding window at time k. We assume that the distribution at time k + 1 does not change, so that the one-step prediction for time k + 1 is valid for predicting the system behaviour.
Lemma 1: With the update law (5) and more new delays loaded into the sliding window, the objective index (4) is continuously minimised. Then, the following statements are true:
(a) the estimation errors of the expected value and standard deviation of the new delays converge;
(b) $\left( \left\| \sum_{j=1}^{n} \tilde P_j^{k+1} \right\| - \left\| \sum_{j=1}^{n} \tilde P_j^{k} \right\| \right) < 0$ holds, where $\tilde P_j^{k}$ is the estimation error of the probability of delay interval j at time k.
Remark 2: Even if the network condition is perfect, unexpected delays outside the healthy range occasionally occur over a long period. This can lead to inefficient triggering of the resilience control. Using the above time-series analysis, not only can the distribution change be tracked in real time, but the trend of the distribution change is also identified and predicted, so that such occasional events can be filtered out without triggering the resilience control.
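To make the recursive estimation concrete, the following is a minimal sketch of an estimator with a gain and an error-variance term. It assumes a first-order AR model fitted by standard recursive least squares (RLS), which is an assumption and not necessarily the update law (5). The class name, the forgetting factor, and the deterministic initial values are hypothetical (the text above sets initial values randomly). One instance would be used for E and another for D.

```python
class AR1RecursiveEstimator:
    """RLS fit of y(k+1) = beta_0 + beta_1 * y(k) (illustrative sketch)."""

    def __init__(self, forgetting=1.0, initial_variance=1e6):
        self.theta = [0.0, 0.0]  # [beta_0, beta_1], plays the role of theta(k)
        # 2x2 error-variance matrix, plays the role of O(k)
        self.O = [[initial_variance, 0.0], [0.0, initial_variance]]
        self.lam = forgetting

    def predict(self, y_now):
        """One-step-ahead prediction from the current coefficient estimates."""
        return self.theta[0] + self.theta[1] * y_now

    def update(self, y_now, y_next):
        """Update theta, the gain L(k), and O(k) from one new sample pair."""
        phi = [1.0, y_now]                      # regressor
        error = y_next - self.predict(y_now)    # one-period-ahead prediction error
        O_phi = [self.O[i][0] * phi[0] + self.O[i][1] * phi[1] for i in range(2)]
        denom = self.lam + phi[0] * O_phi[0] + phi[1] * O_phi[1]
        gain = [O_phi[0] / denom, O_phi[1] / denom]  # estimator gain L(k)
        self.theta = [self.theta[i] + gain[i] * error for i in range(2)]
        # O is symmetric, so phi^T O equals O_phi transposed
        self.O = [
            [(self.O[i][j] - gain[i] * O_phi[j]) / self.lam for j in range(2)]
            for i in range(2)
        ]
        return error
```

With a forgetting factor below 1, older samples are discounted, which may help when the distribution keeps drifting; with a factor of 1, all samples in the fit are weighted equally.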

Step 2: resampling:
Based on the future delay distribution provided by Step 1, a series of random delays that follows the new distribution is generated. The number of generated delays should be determined by the window size for PDF identification.
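A minimal sketch of the resampling step is given below. It assumes that the new distribution is normal with the predicted expectation and standard deviation, and that negative samples are clipped to zero because a delay cannot be negative; both are assumptions of this sketch, and the function name is hypothetical.

```python
import random


def resample_delays(expected_delay, std_delay, window_size, seed=None):
    """Draw `window_size` delays from N(expected_delay, std_delay^2) (assumed normal)."""
    rng = random.Random(seed)  # a seed makes the resampling reproducible
    return [
        max(0.0, rng.gauss(expected_delay, std_delay))  # clip: delays are non-negative
        for _ in range(window_size)
    ]
```

For example, `resample_delays(0.5, 0.1, window_size=30)` generates one window of delays for a predicted distribution with expectation 0.5 s and standard deviation 0.1 s.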

Step 3: system output prediction:
The resampled delays are fed to the system model, which takes into account dynamic delays and packet losses. Such a time-varying system is given by:
$$z(k+1) = A_z(k)\, z(k) + B_z(k)\, u(k)$$
where $z(k) = [x(k)^T\ u(k-1)^T\ \ldots\ u(k-d)^T]^T$ is the state variable vector; $u(k)$ is the control input; and $A_z(k)$ and $B_z(k)$ are the system dynamic matrices, given by
$$A_z(k) = \sum_{j=1}^{n} P_j(k) A_{zj}, \qquad B_z(k) = \sum_{j=1}^{n} P_j(k) B_{zj}$$
Finally, the possible system behaviour induced by the new distribution of delays is estimated and denoted as $\hat z(k)$.
Prediction convergence analysis: The convergence of the prediction error $\tilde z(k)$ is demonstrated in Theorem 1. The dynamic matrices $A_{zj}$ and $B_{zj}$ for each delay interval are deterministic and their calculation can be found in [18].
Theorem 1 (Convergence of the system-state prediction error): As the delay data keep updating the PDF identifier and $\left( \left\| \sum_{j=1}^{n} \tilde P_j^{k+1} \right\| - \left\| \sum_{j=1}^{n} \tilde P_j^{k} \right\| \right) < 0$ is satisfied, the prediction error of the system output $\| \tilde z(k) \|$ asymptotically converges to zero.
Proof: The prediction error is $\tilde z(k) = z(k) - \hat z(k)$, so the convergence of $\tilde z(k)$ can be proven by proving the convergence of the prediction errors of $A_z(k)$ and $B_z(k)$. We define the prediction error of $A_z(k)$ as $\tilde A_z(k) = A_z(k) - \hat A_z(k)$. $A_z(k)$ can be expressed as $\sum_{j=1}^{n} P_j(k) A_{zj}$, where $P_j(k)$ is the actual probability at k. Similarly, $\hat A_z(k) = \sum_{j=1}^{n} \hat P_j(k) A_{zj}$, where $\hat P_j(k)$ is the estimated probability provided by the PDF profile. The estimation error of the probability is $\tilde P_j(k) = P_j(k) - \hat P_j(k)$, so that $\tilde A_z(k) = \sum_{j=1}^{n} \tilde P_j(k) A_{zj}$. With a Lyapunov function candidate in $\tilde A_z(k)$, the condition above makes its first difference negative definite. Therefore, the prediction error of $A_z(k)$ asymptotically converges to zero. The convergence of the prediction error of $B_z(k)$ can be proven with the same procedure. Hence, $\tilde z(k)$ asymptotically converges to zero. □
Remark 3: The maximum error occurs when the first sample of the new distribution enters the sliding window. The accuracy of the PDF estimation then improves as the sliding window includes more and more samples from the new distribution after the PDF change occurs. Therefore, $\left( \left\| \sum_{j=1}^{n} \tilde P_j^{k+1} \right\| - \left\| \sum_{j=1}^{n} \tilde P_j^{k} \right\| \right) < 0$ holds.
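The system output prediction can be sketched as follows, using the probability-weighted form of the dynamic matrices that appears in the proof of Theorem 1. This is an illustrative sketch: the function names are hypothetical, the per-interval matrices are assumed to be supplied by the caller (their calculation is in [18]), and the probabilities are held fixed over the prediction horizon.

```python
def weighted_matrix(probabilities, matrices):
    """Return sum_j P_j * M_j for equally sized matrices given as nested lists."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(p * m[r][c] for p, m in zip(probabilities, matrices)) for c in range(cols)]
        for r in range(rows)
    ]


def mat_vec(matrix, vector):
    """Matrix-vector product for nested-list matrices."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def predict_states(z0, controls, probabilities, A_cases, B_cases):
    """Roll z(k+1) = A_z z(k) + B_z u(k) forward over the given control sequence."""
    A = weighted_matrix(probabilities, A_cases)
    B = weighted_matrix(probabilities, B_cases)
    z = list(z0)
    trajectory = [z]
    for u in controls:
        z = [a + b for a, b in zip(mat_vec(A, z), mat_vec(B, u))]
        trajectory.append(z)
    return trajectory
```

The returned trajectory is the predicted system behaviour under the estimated delay probabilities; it becomes more accurate as those probabilities converge.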

Step 4: soft and hard fault isolation and resilience control triggering strategy:
The acceptable error magnitude of state i is defined as $R_{z_i}$. If the predicted $z_i$ exceeds $R_{z_i}$, the fault is marked as a hard cyber network fault, and both a warning and the resilience controller are triggered. Otherwise, it is a soft fault that can be handled with the original controller operating normally. In summary, the proposed prognosis scheme can detect cyber network faults in time and isolate soft and hard faults because the dynamics of the network are continuously monitored. Accurately isolating soft and hard faults optimises the decision on resilience controller triggering as well as the allocation of computational resources. When hard faults occur, the resilience controller can be triggered in time, before adverse effects on system performance happen.
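The isolation rule can be sketched as a simple check of the predicted trajectory against the acceptable error magnitudes. This is a minimal sketch under stated assumptions: the comparison is made on the absolute value of each predicted state, and the function name and return labels are hypothetical.

```python
def classify_fault(predicted_trajectory, acceptable_magnitudes):
    """Return "hard" if any predicted state leaves its acceptable range, else "soft"."""
    for z in predicted_trajectory:
        # Comparing absolute values is an assumption of this sketch
        if any(abs(zi) > r for zi, r in zip(z, acceptable_magnitudes)):
            return "hard"  # warn and trigger the resilience controller
    return "soft"          # the original controller keeps operating
```

Only a "hard" result would trigger the resilience controller; a "soft" result would raise a warning while monitoring continues.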

Resilience control strategy
In this section, the employed resilience controller is presented for completeness. The PDF-based tuning of stochastic optimal controller (PTSOC) [20] mitigates the adverse effects induced by the uncertainties of cyberspace and adapts to the random occurrence of cyber network faults.
Remark 4: PTSOC adapts well to a time-varying distribution of delays, but leads to more computational overhead than a traditional resilience controller. Therefore, the strategy in Section 5.2.4 aims to determine an appropriate time to trigger the resilience controller without unnecessary computational overhead. Meanwhile, the proposed strategy, based on fault isolation, proactively triggers the controller rather than triggering it after a failure or damage has occurred. In this way, the system performance and stability are guaranteed.
The PTSOC control law considers the PDF of delays by optimising a weighted summation of the cost functions of different delay ranges (7). Each weight is the probability of its corresponding delay interval from the PDF identifier.
$$J = \sum_{i=1}^{n} P_i\, E\left[ \sum_{k} \left( x_i^T Q_{zi} x_i + u_i^T R_{zi} u_i \right) \right] \tag{7}$$
where i denotes the delay interval ($d_{int}\, i < d_k < d_{int}\,(i+1)$); n is the total number of delay cases; k represents the sampling interval; $P_i$ is the probability of the delay lying within $d_{int}\, i$ to $d_{int}\,(i+1)$, provided by the PDF identifier; $x_i$ is the state vector; $u_i$ is the control input vector; $Q_{zi}$ and $R_{zi} = R_i/d$ are symmetric positive semi-definite and symmetric positive definite, respectively; and $E[\cdot]$ is the expectation operator. By optimising (7), the control input is given by:
$$u(k) = -K(k)\, z(k), \qquad K(k) = \sum_{i=1}^{n_d} P_i(k) K_i$$
where K(k) is the optimal gain and u(k) is the control input; $K_i$ is the gain for delay interval i, computed from $S_{zi}(k) \ge 0$, the solution of the algebraic Riccati equation (ARE); $n_d = d_{upper}/d_{int}$, where $d_{upper}$ is the maximum delay in the sliding window; and $P_i(k)$ is the probability of delay interval i at time k.
Remark 5: The $Q_i$ and $R_i$ in the cost function should be different for each delay range, because each pair of $Q_i$ and $R_i$ should be optimal for the delays bounded in a specific range. They cannot guarantee a high level of performance for delays outside these boundaries.
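The probability-weighted tuning can be illustrated with a scalar example. This is a minimal sketch and not the PTSOC implementation of [20]: it assumes a scalar system z(k+1) = a z(k) + b u(k) for each delay interval, solves the scalar discrete-time Riccati equation by fixed-point iteration, and blends the per-interval gains with the interval probabilities. All function names and parameters are hypothetical.

```python
def scalar_riccati_gain(a, b, q, r, tol=1e-12, max_iter=10000):
    """Optimal feedback gain of a scalar discrete-time LQ problem (illustrative)."""
    s = q  # initial guess for the Riccati solution
    for _ in range(max_iter):
        s_next = q + a * a * s - (a * b * s) ** 2 / (r + b * b * s)
        converged = abs(s_next - s) < tol
        s = s_next
        if converged:
            break
    return a * b * s / (r + b * b * s)


def blended_gain(probabilities, gains):
    """Weight the per-interval gains by the probabilities from the PDF identifier."""
    return sum(p * k for p, k in zip(probabilities, gains))
```

For instance, gains computed for a short-delay and a long-delay case can be blended with the current interval probabilities, so the applied gain shifts towards the long-delay design as long delays become more likely.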
Stability Analysis [20]: Two theorems and their corresponding proofs are presented to demonstrate the stability of the proposed PTSOC, using Lyapunov-based stability analysis. Theorem 2 (Appendix 9.2) shows that the control gain estimation error asymptotically converges even if the PDF estimation has an error, provided that this error asymptotically converges to zero. Theorem 3 (Appendix 9.3) treats the irremovable bias of the PDF estimation as a bounded disturbance; in this case, uniformly ultimately bounded (UUB) stability is still guaranteed. The proofs of these theorems can be found in [20].

Simulation and discussion
In this section, the proposed prognosis scheme is evaluated by simulations in MATLAB.Section 6.1 demonstrates the convergence of the system-state prediction.In this case, the resilience controller triggering is disabled to observe the prediction performance alone.Then, both soft and hard cyber network fault scenarios are presented separately to demonstrate the cyber network fault detection and isolation performance in Sections 6.2 and 6.3.The resilience controller in Section 5.3 is applied.A conventional stochastic optimal control [18] is employed as a reference.
A continuous-time batch reactor system is taken as a case study. Its dynamics are given in [18]. The sliding window size M is 30.

System state prediction evaluation
This scenario demonstrates that the accuracy of the system-state prediction improves as more new delay measurements update the distribution estimation.
Here, the PTSOC triggering is disabled to allow continuous, uninterrupted predictions of system states. The fault is injected at 47 s. Fig. 3 shows that the prediction at 47.1 s significantly diverges from the actual system behaviour. As more new delays are loaded into the sliding window, the PDF estimation of the new distribution improves, so that the predicted system states become more accurate. The prediction at 47.5 s is more accurate than that at 47.1 s. Other results are shown in Appendix 9.4.

Soft cyber network fault
In this scenario, a network congestion fault is simulated, which occurs when a network node is relaying more data than it can handle. It usually causes a gradual increase of delays. $R_{p_i}$ and M are user-defined parameters. The simulations are repeated 50 times with different soft faults for statistical validation. On average, the proposed scheme needs only 0.42 s to detect the fault. With $R_{p_i}$ = 0.03 s, 100% of the faults are detected. The results in Fig. 4 show a single case to illustrate the performance of the proposed prognosis scheme.
During the first 50 s, the delays follow a normal distribution $N(0.3, 0.05^2)$. Then, a network congestion attack (e.g. denial-of-service) is launched at 50 s, and the delays after 50 s follow a new normal distribution $N(0.5, 0.1^2)$. Fig. 4 presents the result for the fluid level. As shown in Fig. 4a, the probability variation exceeds the threshold at 50.2 s, and a cyber network fault is detected. Simultaneously, the awareness of the cyber network fault triggers the system-state prediction shown in Fig. 4b and Appendix 9.5. Oscillations are observed, but they are small enough for the basic controller to handle. Therefore, this fault is a soft fault, and the resilience controller does not have to be triggered.
The traditional diagnosis scheme [18] usually presets a constant threshold on the network delay to capture the delay variation. When the delay exceeds this bound, the resilience controller is activated. As a result, some unnecessary triggering might occur, resulting in increased computational overhead and false reactions of the resilience controller. According to Fig. 4a, the resilience controller would be activated 12 times if the traditional fault diagnosis were applied. However, applying the resilience control is not necessary here and induces more resource waste and more wear and tear of the system hardware. In contrast, such negative consequences can be avoided by applying our proposed scheme.
Remark 6: Several cyber network faults are detected before 50 s because $R_{p_i}$ for this scenario is set at a low level. Hence, false detections occur. However, they only cause more computational overhead and have no impact on system stability. Overall, this trade-off should be considered when selecting $R_{p_i}$.
Remark 7: After the first soft fault detection, the proposed scheme should continuously supervise the cyber condition, because a soft fault may become a hard fault in the near future. Also, a warning should be issued to the human supervisor to take additional precautions (e.g. investigate the attack or update the firewall).

Hard cyber network fault
In this scenario, a man-in-the-middle (MitM) attack is simulated. The attacker secretly relays and possibly alters the communication between two parties who believe they are directly communicating with each other. The transmitted information, such as control commands and feedback measurements, can be eavesdropped and delayed. Here, the delays before 47 s follow a normal distribution $N(0.3, 0.05^2)$. Then, the attacker injects MitM attacks intermittently. As a result, the distribution of delays varies over time (Fig. 5a). The acceptable error magnitudes are set for four system states: 400 cm for the fluid level; 50 K for the inside temperature; 40 g/s for the product outlet flow rate; and 50 K for the coolant outlet temperature.
As Fig. 5b shows, the sudden change of delay at 47 s is detected at 47.1 s because the probability of delays within [0.2, 0.3] suddenly decreases. In Fig. 6a and Appendix 9.6, all the predicted system outputs exceed their acceptable range. The estimated and actual points at which the system states pass through the acceptable error are shown in Table 2. This prediction achieves at least 99.6% accuracy. It is concluded that this fault is a hard cyber network fault, and its adverse effects on the system performance are predicted. The resilience control is triggered at 47.1 s to mitigate such effects. Compared with the original SOC, the overshoots are reduced by at least 89.6% and the TTRs are shortened by 31.2%. A summary of the improvements can be found in Table 3. When applying the proposed scheme, the fault is quickly detected and the resilience controller is triggered in time, ahead of any serious degradation of system performance. Also, the overshoot of each system output and its corresponding TTR are significantly reduced. In contrast, without the proposed scheme, the fault can still be detected when the system states exceed the acceptable error magnitude at 48.5 s. The fault tolerant controller, which is a tuned PID controller, is then triggered. However, with such a late activation, it is too late to recover the system performance. In this case, the basic controller will try to apply excessive actuation (Table 3) to stabilise the system. This might lead to significant damage of the components or cause unscheduled downtime. Even worse, the system could be compelled to stop. The above simulation is repeated 50 times, and all the faults are accurately detected.

Discussion
We conducted 100 simulations, with 50 soft and 50 hard fault cases, to evaluate the isolation accuracy of the proposed scheme. All of the faults are detected. However, 58 hard faults are identified; that is, eight soft faults are incorrectly recognised as hard faults. The threshold for fault isolation is set to ensure 100% correct isolation of hard faults. These false hard fault identifications have no negative impact on system stability and performance; they only increase the computational overhead.
It is important to note that the traditional physical system fault detection, which is a model-based observer, cannot detect any abnormalities in cyberspace.The network dynamics concurrently change the mathematical model of the physical system and the model used for observer design.Such that the outputs from the observer and physical system are same.Therefore, the model-based observer can only be used for physical component fault detection, not cyber network fault.In addition, designing a traditional observer for cyber network fault detection is impossible because, in realistic CPS, cyber network fault model cannot be obtained ahead of time.
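The observer argument can be illustrated with a toy simulation. This is only a sketch under strong assumptions: a hypothetical scalar plant, a hypothetical Luenberger-type observer that is driven by the same delayed control input as the plant, and arbitrary gains. None of these values come from the paper. Under these assumptions the observer residual stays at zero for any network delay, whereas a physical actuator fault produces a non-zero residual.

```python
def max_observer_residual(delay_steps, actuator_gain=1.0, n_steps=200):
    """Largest |x - x_hat| seen over a run (all parameters are hypothetical)."""
    a, b, gain_l = 0.9, 0.5, 0.4      # scalar plant model and observer gain
    x, x_hat = 1.0, 1.0               # plant state and observer estimate
    in_flight = [0.0] * delay_steps   # control commands delayed by the network
    worst = 0.0
    for _ in range(n_steps):
        in_flight.append(-0.5 * x)    # state feedback command sent over the network
        u_applied = in_flight.pop(0)  # command that reaches the plant after the delay
        residual = x - x_hat
        worst = max(worst, abs(residual))
        x_next = a * x + actuator_gain * b * u_applied          # actual plant
        x_hat = a * x_hat + b * u_applied + gain_l * residual   # observer model
        x = x_next
    return worst

print(max_observer_residual(delay_steps=0))                     # no delay
print(max_observer_residual(delay_steps=3))                     # network delay only
print(max_observer_residual(delay_steps=0, actuator_gain=0.5))  # actuator fault
```

The delay changes the closed-loop behaviour of plant and observer in the same way, so the residual carries no information about it; only the mismatch between the physical actuator and its model is visible to the observer.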

Conclusions
The proposed novel prognosis scheme is shown to quickly detect and predict cyber network faults using PDF monitoring and estimation. Moreover, soft and hard faults are isolated to optimise the computational cost of resilience control. The convergence of the future delay distribution estimation and of the system-state prediction is theoretically proven. With the proposed resilience controller, the adverse effects caused by cyber network faults are efficiently mitigated.
The simulation results show that the proposed scheme accurately detects cyber network faults before the performance degrades beyond the acceptable range. Moreover, the PTSOC is triggered in time to mitigate the negative effects on CPS performance. The overshoot is reduced by about 90% and the TTR is shortened by about 30%. Although the accuracy of soft and hard fault isolation reaches only 84%, 100% of the hard faults are detected. The soft faults that are misclassified as hard faults only consume the resources needed to trigger the resilience controller. Nevertheless, the stability of the entire CPS is always guaranteed.

[...] $\left( \left\| \sum_{j=1}^{n} P_{(k+1)j} \right\| - \left\| \sum_{j=1}^{n} P_{kj} \right\| \right) < 0$ is satisfied, then the estimation error for the control gain $\| \tilde{K}_k \|$ asymptotically converges to zero.
Proof: First, we define the estimation error of the control gain $K$ as $\tilde{K}_k = K_k - \hat{K}_k = \sum_{j=1}^{n} P_{kj} K_j - \sum_{j=1}^{n} \hat{P}_{kj} K_j$. Since $V_{\tilde{K}_k}$ is positive definite and $\Delta V_{\tilde{K}_k}$ is negative definite provided the stated condition holds, the estimation error of the control gain asymptotically converges to zero. □

Theorem 3 and Proof (To be included in paper as the approach)
Theorem 3 (UUB stability of the regulation error): Given the initial conditions as the system state $z_0$ and system matrices $A_{z0}$ and $B_{z0}$, let $u_0(z_k)$ be an initially admissible control policy for the CPS (6). Let the control update law be given by (8) and (9). If the disturbance induced by the irremovable bias of the PDF estimation has a bound $\| d_{KDE} \|$ and $K_{\min} < 1/b_{\min}$, then the regulation error of the system states is uniformly ultimately bounded (UUB) in the mean.
Proof: Consider the following positive definite Lyapunov function candidate: $V_{z_k} = z_k^T z_k$, where $z_k$ is the state vector at step $k$. The corresponding estimated Lyapunov function is $V_{z_k}$; therefore, $\Delta V_{z_k} = V_{z_{k+1}} - V_{z_k}$. We consider $\Delta V_{z_{km}} = V_{z_{(k+1)m}} - V_{z_k}$ for each possible pair of system matrices ($A_{zkm}$ and $B_{zkm}$), where $m$ denotes one of the possible cases. If the maximum value of $\Delta V_{z_{km}}$ is negative definite, the system convergence is proven. The irremovable bias of PDF

Fig. 1 The effect of delay variation on system stability (a) Delays, (b) Tracking errors of optimal controller

[...] the cyber network fault is detected.
(b) Modelled system output residual (MSOR): It is provided by the observer and is the difference between the outputs of the modelled and actual systems: $\mathrm{MSOR} = \hat{x}(k) - x(k)$. The corresponding threshold $R_{MSOR}$ is selected to detect physical system faults. If $\mathrm{MSOR} > R_{MSOR}$, the physical system fault is detected.
(c) System performance residual (SPR): It is the difference between the actual and desired system outputs: $\mathrm{SPR} = x_d(k) - x(k)$.

IET Cyber-Phys. Syst., Theory Appl. This is an open access article published by the IET under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/)

[...] this CPS are selected as: (a) The sampling time is 100 ms; (b) The considered delay in the system model is less than 2 sampling intervals, $d = 2$; (c) $d_{int} = 0.1$ s; (d) The threshold of the probability variation $R_{pi}$ is 0.03 s, unless otherwise stated;
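The two residual checks defined above can be sketched as follows. The comparison of MSOR with its threshold follows the text; the threshold on SPR (`r_spr`), the use of the Euclidean norm for vector outputs, and all names and numerical values are assumptions made for illustration.

```python
import numpy as np

def residual_flags(x_actual, x_model, x_desired, r_msor, r_spr):
    """Compare the MSOR and SPR magnitudes with their thresholds."""
    x_actual = np.asarray(x_actual, dtype=float)
    msor = np.linalg.norm(np.asarray(x_model, dtype=float) - x_actual)
    spr = np.linalg.norm(np.asarray(x_desired, dtype=float) - x_actual)
    return {
        "msor": float(msor),
        "spr": float(spr),
        "physical_fault": bool(msor > r_msor),      # MSOR > R_MSOR
        "performance_degraded": bool(spr > r_spr),  # hypothetical SPR threshold
    }

# Example: the observer output matches the plant, but the output is far from the target.
flags = residual_flags(x_actual=[1.0, 0.0], x_model=[1.0, 0.0],
                       x_desired=[0.0, 0.0], r_msor=0.05, r_spr=0.5)
print(flags)
```

In this example the MSOR is zero, so no physical fault is flagged, while the SPR exceeds its assumed threshold.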

Fig. 3 Case A: Actual and predicted system behaviour

Fig. 4 Case B (a) Selected probability variation, (b) Predicted and actual system output

Fig. 5 Case C (a) The simulated delays, (b) Selected probability variation

Fig. 6 Case C (a) Predicted and actual system output, (b) Fault mitigation performance

Fig. 7 Case A: actual and predicted system behaviour

Fig. 8 Case B: predicted and actual system behaviour

Fig. 9 Case C: predicted and actual system behaviour

Table 3