Quickest Detection of Deception Attacks on Cyber-Physical Systems with a Parsimonious Watermarking Policy

The addition of a physical watermarking signal to the control input increases the detection probability of data deception attacks at the expense of increased control cost. In this paper, we propose a parsimonious policy to reduce the average number of watermarking events when the attack is not present, which in turn reduces the control cost. We model the system as a stochastic optimal control problem and apply the dynamic programming to minimize the average detection delay (ADD) for fixed upper bounds on false alarm rate (FAR) and increased control cost. The optimal solution results in a two threshold policy on the posterior probability of attack, which is derived from the Shiryaev statistics for sequential change detection assuming the change point is a random variable with a geometric distribution. We derive approximate expressions of ADD and FAR applying the non-linear renewal theory. The relationship between the average number of watermarking added before the attack and the increase in control cost is also derived. We design the optimal watermarking that maximizes the Kullback-Leibler divergence for a fixed increase in the control cost. Simulation studies are performed to illustrate and validate the theoretical results.

schemes. Under the passive attack detection scheme, the innovation signal from the state estimator or the observation signal is subjected to various statistical tests [13]- [15]. However, as studied in the literature, passive detection schemes generally have an unsatisfactory probability of detection in the presence of noise and uncertainties [20]. On the other hand, under the active attack detection scheme, physical watermarking signals are added to the control inputs, and various statistical tests are used to check the authenticity of the received observations. The physical watermarking scheme was first introduced in [20] to detect replay attacks by adding an iid watermarking signal to the control input and performing a χ 2 test using the innovation signal from the state estimator. The method in [20] is improved by designing an optimal watermarking signal in [17]. Instead of an iid watermarking scheme, the watermarking signal generated from a hidden Markov Model (HMM) is studied in [3]. A sequential attack detection scheme using the CUSUM statistics evaluated from the joint distribution of the added watermarking and the innovation signal is studied in [21]. Besides the innovation signal, the observation signal is also used to generate residue signals for the attack detections [1]. In [1], [3], [21], watermarking signals are added to the control inputs for all the time instants till the point of attack detection. The addition of physical watermarking to the control input increases the probability of attack detection at the expense of increased control cost [3]. The relation between the increase in the linear quadratic Gaussian control cost, ∆LQG, and the watermarking signal variance is studied in [3]. Since the attack is a less frequent event, adding the watermarking signal during the normal operation for a long time can increase the total control cost significantly [18] and unnecessarily. In the current paper, we have studied an evidence-based watermarking policy to reduce the increase of control cost before an attack, and, at the same time, achieve satisfactory detection performance.
Researchers are exploring diverse approaches to reduce the increase in the control cost due to the added watermarking and maintain satisfactory detection performance. In one approach, the authors have added the watermarking periodically to the control inputs and kept a balance between the improvement in the control cost and the increase in the detection delay [18]. Another approach is to add watermarking directly to the observations [22]- [24]. In this approach, the authenticity of the observations is first examined at the receiving end, and then the watermarking signal is filtered out before using the observations in the controller. Since the watermarking signal is filtered out, the control cost does not increase. Different kinds of watermarking signals are used in this context, such as sinusoidal [23], time-varying sinusoidal [25], random noise [24], multiplicative to the observations [22], etc. However, these methods may fail in the scenario, where the attacker hijacks the sensor node and feeds the fake data before the addition of the watermarking. In general, the physical watermarking-based methods targeting to reduce the increased control cost or more traditional always present watermarking-based methods use batch processing of data, i.e., innovation signal or observation signal. Therefore, those methods do not address the problem of the quickest attack detection. However, we know that early detection of attacks is of paramount importance for CPS to reduce the amount of damage. Therefore, in this paper, we studied the problem of the quickest detection of attacks which uses watermarking parsimoniously to reduce the loss in control performance prior to an attack. The literature on the quickest detection of a change point by sequential analysis of data dates back several decades. A brief review on the quickest change detection techniques is provided in the following paragraph.
The quickest change detection methods can be classified into two broad groups depending upon the assumption of the model of the change point [26]. In one approach, which is also called the minimax approach, the change point is modelled as deterministic but unknown. The cumulative sum (CUSUM) technique is one of such minimax approaches, which was first introduced by Lorden [27]. In the other approach, the Bayesian approach, the change point is modelled as a random variable (RV) with some prior distribution. The Bayesian change point detection technique was first introduced by Shiryaev [28]. The original Shiryaev rule was proposed for the data with different iid distributions, before and after the change point. Finding an optional detection rule for the general non-iid data is difficult [29]. In [30], an optimal detection rule is developed for homogeneous finite-state Markov chains. A slightly different approach is followed in [31], [32], where the authors proved that the Shiryaev rule, with minor modifications, is an asymptotically optimal quickest change detection rule under the conditions given in (3), (4), (21) and (23) of [32]. That means the Shiryaev procedure minimizes the average detection delay (ADD) for a fixed upper threshold on the false alarm rate (FAR) for the non-iid data provided that the threshold → ∞ and a few other conditions are satisfied. The condition (3) of [32] for the optimality is that the prior distribution of the change point must satisfy (1).
where Γ is the change point. That means the exponential rate of convergence of the prior distribution must be c ≥ 0, where c > 0 indicates the prior distribution has an exponential right tail, and c = 0 indicates the prior distribution is heavy-tailed [29]. An attacker will always try to remain stealthy for a long time because the longer time the attacker remains undetected, the more damage can be caused [18]. On the other hand, the defender should design a detection mechanism that will detect the attack as soon as possible with an acceptable FAR to reduce the amount of damage. Therefore, we have used the Bayesian approach in this paper, which minimizes the ADD, whereas the other method, the minimax approach, only minimizes the worstcase ADD (computed over all possible attack start points) [26]. Additionally, our work in this paper is inspired by two other prior works [33], [34]. The quickest intrusion detection problem is studied in [33], where only a minimal set of sensors from a sensor network is kept active at a particular time instant. The problem of quickest change detection is also studied in [34] with upper bounds on the average number of sensor data used before the change point and the FAR. In both the problem formulations, the underlying data was assumed to be iid, which is not the case for the system under study in this paper. However, similar to several other works on change-point detections [33], [34], we have also assumed the distribution of the change point, i.e., the attack start point, to be a geometric distribution with parameter ρ, which satisfies the condition given in (1).

B. Contributions
In our previous work [21], we studied in detail that the worst-case ADD decreases with the increase in ∆LQG, which denotes the increases in the LQG control cost due to the addition of watermarking for the always-present watermarking scheme. Additionally, ∆LQG is proportional to the watermarking signal power. In other words, an attack can be detected early with higher watermarking signal power, i.e., at the expense of increased control cost. Therefore, in this paper, we propose a method to reduce the average number of watermarking (ANW) events used before the attack start point, which reduces the average watermarking signal power and subsequently ∆LQG. We formulate the task at hand as a stochastic optimal control problem to minimize the ADD for fixed upper bounds on FAR and ANW and apply dynamic programming to solve it. Similar to any other detection technique, there is always a trade-off between ADD and FAR [31]. We have studied the structure of the dynamic programming solution, i.e., the solution of the Bellman equation, and found that the optimal policy is a two threshold policy with thresholds T h s and T h d , T h d ≥ T h s , on the posterior probability of attack p k under practical circumstances. In other words, if p k ≥ T h s , then we add watermarking to the (k + 1)-th control input. On the other hand, if p k ≥ T h d , we decide that the attack is present in the system and terminate the process. Our study shows that T h s primarily controls the ANW, which in turn controls the ∆LQG value, and T h d primarily controls the ADD and FAR. Asymptotically approximate analytical expressions of ADD and FAR are derived by applying non-linear renewal theory for noniid data. The derived expression of ADD indicates that the ADD reduces with the increase of the Kullback-Leibler divergence (KLD) between the post-and pre-attack distributions of the test data. Additionally, the derived analytical expression of KLD for our problem formulation provides the relationship between the KLD and the watermarking signal variance. Therefore, we use this relationship to derive the optimal watermarking signal variance, which maximizes the KLD for a given upper bound on ∆LQG. We have also obtained an expression of ∆LQG for a given ANW. We have reported a preliminary simulation study on this problem previously for a single-input single-output (SISO) system in [35]. In the current paper, we have performed a more in-depth theoretical analysis of the problem for general multi-input and multi-output (MIMO) system models. Our main contributions are as follows.
1) To the best of our knowledge, this is the first time the Bayesian approach is applied for the quickest detection of data deception attacks on NCS with a parsimonious watermarking policy to reduce the control cost. 2) We have derived asymptotically approximate analytical expressions of ADD, FAR and ∆LQG that facilitate the optimal design of the watermarking process. 3) We have optimised the watermarking signal variance to maximise KLD, which improves ADD for a fixed upper bound on the ∆LQG.
The paper is organized as follows. Section II discusses the system model and the attack strategy considered in this paper. The defence mechanism is explained in Section III. Section IV provides the analytical expressions of ADD, FAR and the relationship between the ANW and ∆LQG. It also explains the optimization framework for the watermarking signal variance. Section V presents and discusses the simulation results. Section VI concludes the paper.

C. Notations
We have used capital bold letters, e.g., A, B, etc. to specify matrices and small bold letters, e.g., x, y, etc. to specify vectors, unless specified otherwise. Some special notations are given in Table I.
Determinant of a matrix or absolute value of a scalar {·} Mean value of a quantity tr(·) Trace of a matrix Indicator function, 1 if condition is true, 0 otherwise II. SYSTEM MODEL The system model during normal operations and the model with the data deception attack are discussed in this section.

A. System model during normal operation
A schematic diagram of a standard NCS during the normal operation is shown in Fig. 1. We assume a linear time-invariant MIMO plant with the following state update and measurement equations, where x k ∈ IR n and y k ∈ IR m are the state and measurement vectors, respectively. u k ∈ IR p is the control input vector. The process and observation noise vectors are denoted as w k ∈ 4 IR n ∼ N (0, Q) and v k ∈ IR m ∼ N (0, R), respectively, with Q > 0 and R > 0. Here, A ∈ IR n×n , B ∈ IR n×p , Q ∈ IR n×n , C ∈ IR m×n , and R ∈ IR m×m . Process and observation noises are assumed to be iid and uncorrelated to each other and with the initial state vector. We also assume that the system has been operational for a very long time and is currently in steady-state. The states of the system are estimated using the Kalman estimator. The sensor measurements are available to a remote estimator/controller, possibly over a wireless link, which may be vulnerable to malicious data deception attacks. In the absence of an attack, the time update and measurement update equations are as follows,x are the Kalman predicted and filtered states, respectively. E[·] denotes the expectation operator, and I k {u 0 , u 1 , · · · u k , y 0 , y 1 , · · · y k } is the set of all information up to time k. The innovation signal γ k and the steady state Kalman filter gain K are given as, where P = E (x k −x k|k−1 )(x k −x k|k−1 ) T is the steady state error covariance. In steady-state P becomes the solution to the following algebraic Riccati equation, The estimated states are fed to a state feedback controller which is assumed to be an infinite horizon linear quadratic Gaussian (LQG) controller. The optimal control input u * k is derived by minimizing the following cost function, Here W ≥ 0 and U ≥ 0 are weight matrices. The optimal LQG control input turns out to be the following linear function of the estimated states, u * k = Lx k|k , where Here, S is the solution to the following algebraic Riccati equation,

B. Attack Model
We make the following assumptions regarding the capabilities and knowledge of an attacker: 1) the attacker can access the sensor nodes and replace the true observations with fake data; 2) the attacker has complete knowledge about the system and the controller, i.e., the attacker knows A, B, C, Q, R, and L; 3) the attacker can not access or alter the control signal. To launch a data deception attack, the attacker replaces the true observations y k by the fake data z k from k ≥ Γ. A well-studied method to achieve this is to transmit the fake observations with significantly higher power than the true measurements from the sensors. As a result, the wireless control system receiver accepts the fake measurements as legitimate while rejecting the true measurements from the sensor nodes. Such attack models are also known as sensor spoofing attacks [36], [37]. The fake observation data z k is assumed to be generated from the following linear stationary stochastic process, where w a,k ∼ N (0, Q a ) is the iid noise vector at the k-th time instant, and Q a ∈ IR m×m . A similar attack strategy is also studied in [38], where the stealthiness of the attack signal is evaluated in terms of the KLD between the distributions of the fake and true observations. The attacker's system matrix A a and the noise covariance matrix Q a should be designed in such a way so that the fake data z k mimics the statistical properties of the true measurement y k . A a mainly models the correlations between the current and past measurements, and Q a models the uncertainty. Designing A a and Q a in such a way increases the stealthiness of the attack signal. A schematic diagram of the system under the data deception attack is shown in Fig. 2. Remark 1. In general, for linear control systems, the measurement vector y k can be modelled as a stochastic process that is dependent on its past values with an additive i.i.d noise component, i.e., similar to (11). In other words, the attack model in (11) mimics the linear stationary stochastic model of y k , which makes it challenging to detect. In addition to that, the attack model in (11) can make closed-loop control systems unstable, as discussed in [39], which may cause a significant amount of damage to CPS. Since the attacker's objectives are to cause damage to the CPS and to remain stealthy in doing so, the attack model considered in this paper (11) facilitates the attacker in achieving both the objectives, thus illustrating the significance of such an attack model. Moreover, such an attack model can be used to detect replay attacks after a few modifications, as demonstrated in [40]. During the attack, i.e., for k ≥ Γ, the innovation signal will take the following form, whereasx k|k−1 andx k|k will follow the same time update and measurement update equations (4) and (5), respectively. However, the γ k in (5) for k ≥ Γ will follow (12). Note that, after the attack start point, i.e., k ≥ Γ, the defender does not change the Kalman filter. However, since the attacker replaces y k by the fake data z k from k ≥ Γ, the innovation signal γ k automatically takes the form given in (12). As studied in several works of literature, the distribution of the attack start point can be modelled as exponential distribution for continuous-time systems [41], [42]. In [41], the authors collected empirical data from intrusion experiments and found that the attack start points are approximately exponentially 5 distributed. On the other hand, the authors formalized the semantics of attack trees and used them for the probabilistic timed evaluation of attack scenarios in [42]. Additionally, the authors studied various practical systems, including the famous Stuxnet attack [4], and derived the distribution of the attack start time to be exponential. Since the exponential and geometric distributions play analogous roles in the continuous and discrete time domains, respectively [43], we have modelled the attack start point Γ to be an RV that follows a geometric distribution with parameter ρ, where 0 < ρ < 1. Here ρ is a design parameter reflecting the defender's belief of how often attacks occur. For the proposed method, ρ is a parameter that needs to be set by the defender. A vulnerability analysis of the system can decide the value of ρ, see [41], [42]. From the derived approximate analytical expressions of ADD (60), FAR (61) and ∆LQG (72), we can say that ADD and ∆LQG will not be affected much by the difference in the chosen ρ and the attacker's true ρ, as long as both ρ 1, which is a realistic assumption since attacks are rare events. On the other hand, FAR will increase if the chosen ρ is higher than the attacker's true ρ and vice versa. The defender can thus choose a suitable ρ, depending on the specification on the maximum false alarm rate. Finally, we can write the prior probability Π k P {Γ = k} in the following form [34], Here, Π 0 P {Γ ≤ 0}, i.e., Π 0 is the prior probability of the attack happening before the start of the observation. 1 {condition} is the indicator function, which takes the value 1 if the condition is true, or 0 otherwise. In general, 0 ≤ Π 0 < 1. However, for our problem formulation, we have taken Π 0 = 0. We assume that the defender does not know the exact value of Γ, but knows the prior distribution of Γ.

III. PROPOSED DETECTION STRATEGY
We perform the following hypothesis test to detect the presence of an attack, H 0 : No attack present H 1 : Attack present in the system We parsimoniously add an iid watermarking signal, given as to the optimal LQG control input, u * k , to improve the detectability of the attack, see (17). The decision of adding or not adding the watermarking and the selection of hypothesis for the k-th time instant is controlled by the optimal policy u * d . Here, the subscript d of u * d indicates that the optimal policy is derived using dynamic programming. The policy u d,k decides the values of the following two control variables s k and d k at the k-th time instant, d k = 0, Hypothesis H 0 selected, process continues. 1, Hypothesis H 1 selected, process terminated. Figure 3 illustrates the proposed watermarking and attack detection scheme with a schematic diagram of the system. The components enclosed inside the blue dotted rectangle are assumed to be located at a secure location. Figure 3: Schematic diagram of the system with the proposed watermarking scheme.

A. Problem formulation
Our objective is to find the optimal policy u * d that minimizes the ADD for fixed upper bounds on FAR and ANW. First, we introduce the formal definitions of FAR, ADD and ANW as follows. The definitions of FAR and ADD are similar to [31]. False alarm rate (FAR): FAR is defined as Here, P Π indicates the average probability measure.
where Ω is any event and P k is the probability measure when the change point Γ = k. τ is the time instant when the hypothesis H 1 is selected. Average detection delay (ADD): ADD is defined as Here, E Π denotes the expectation with respect to the probability measure P Π . Average number of watermarking events (ANW) before attack: ANW is defined as Here, s i is the same variable as given in (15). Now, we formulate the following optimization problem, where F AR th and AN W th are the user-selected thresholds. Then, the constrained optimization problem of (21) is converted into an unconstrained Lagrangian form as follows. The unconstrained Lagrangian form adopted in this paper is similar to [34], [44], except the term λ e AN W , and it reads as where λ f ≥ 0 and λ e ≥ 0 are the Lagrangian multipliers. A new state variable θ k is defined as System under attack, T e , Attack detected by hypothesis testing and process terminated. 6 Similar to [34], ADD, FAR and ANW can also be expressed in terms of the control variables, s k and d k , and the state variable θ k as follows, Using (24)- (26), the cost function of (22) can be expressed as where g k (·) is the per stage cost, expressed as Here, the first, second and third terms of (28) come from ADD, FAR and ANW, respectively. For the stochastic optimal control problem defined in (27), the state θ k is not observable to the defender. Therefore, we replace the state θ k by it's sufficient statistics p k . The sufficient statistics p k , i.e., the posterior probability of attack at k-th time instant is defined (27) is then redefined and solved using p k as discussed in details in the following Sub-section III-B.
The accessibility hypothesis discussed in [44] tells us that under this hypothesis, for every stationary deterministic policy u d ∈ U, any arbitrary state, say θ k is accessible from each starting state θ k = θ 0 . Here, U is the set of all permissible stationary deterministic policies. Under the accessibility hypothesis, the dynamic programming equation using the cost function of (27) is solvable by at least one stationary deterministic policy for each λ f ≥ 0 and λ e ≥ 0 [44].

B. Finding the optimal policy
This section discusses the solution approach taken to solve the optimization problem of (27) in the following three main steps.

1) Selection of test signals
Combining (2)-(6) and (12), we can represent the innovation signal as So, the innovation signal is dependent on the watermarking signal after the attack, see (30). On the contrary, the innovation signal is independent of the watermarking signal before the attack, see (29). It is assumed that the attacker will be replacing the true stationary observation y k with fake but stationary data z k to remain stealthy. In addition to that, as discussed in Subsection III-C, the optimal policy u * d is also a stationary one.
Therefore, the innovation signal will be stationary but with different distributions before and after the attack, as k → ∞. Also, from the properties of the Kalman filter, we know the innovation signal is iid before the attack. Additionally, the use of the joint statistics of the innovation signal and the watermarking signal increases the KLD compared to the case where the statistics of the innovation signal alone is used, see Theorem 1 and Remark 1 from [7], where the improvement in KLD has been quantified for a single-input single-output (SISO) system. These reasons motivate us to use the joint distribution of the innovation signal and the watermarking signal to generate the test statistics for attack detections.

2) Derivation of test statistics
We use the Shiryaev statistics because of its asymptotic optimality for a fixed upper bound on FAR as stated in Theorem 1 from [32]. The data is assumed to be iid before and after the change point in [32]. In contrast to [32], in our study, the innovation signal γ k is iid before the attack and non-iid after the attack. From [26], the Shiryaev statistics SR k at the k-th instant in time for our problem formulation can be written as where i is the candidate change or attack start point, and L j is the likelihood ratio. The expression for L j is given in Lemma 1. After the change point, SR k increases on average. In the original Shiryaev procedure, a change is detected once SR k crosses a predefined threshold for the first time. Lemma 1. The likelihood ratio L j used to derive the Shiryaev statistics (31) considering the joint distribution of the innovation signals ( (29) and (30)) and the watermarking signal (14), takes the following form, Here, f 1,j (·|·), f 2,j (·|·) and f 0 (·) denote the distributions for j > i, j = i and j < i, respectively, and e s,j = s j−1 e j . f 1,j (·|·), f 2,j (·|·) and f 0 (·) take the following forms, Here Proof. The proof of Lemma 1 is provided in Appendix A.
Equation (31) is same as the original Shiryaev statistics, where Π 0 is assumed to be 0, see (6.9) from [26]. However, the term L j in (31) is derived exclusively for the problem under study, where the test data is iid before the change point and non-iid with stationary distributions after the change point. Furthermore, Lemma 1 shows that the dependency of the test data γ j on the previous values of γ and e s from the time index 1 to j − 1 can be approximated as given in (33)-(39), where γ j is only dependent on the immediate past values, i.e., at time index j − 1, of z,x and e s . Remark 2. The likelihood ratios using the distributions of the innovation signal alone, say, L c,j and L d,j for j < i and j = i, respectively, can be evaluated from Lemma 1 by using e s,j−1 = 0 in (35)- (36). Therefore, we can write L c,j = L a,j | es,j−1=0 and L d, We have applied the value iteration from [45] using sufficient statistics p k to solve (27), which is an infinite horizon dynamic programming problem with a termination state. The relationship between the Shiryaev statistics SR k and the posterior probability of attack p k is given by, see (6.10) from [26], Lemma 2 provides the recursion formula of p k , which is used for the value iteration. Lemma 2. The posterior probability of attack at the k-th time instant, p k , for the test data γ k (iid (29) and non-iid (30)) and e k (14), can be updated using the following recursion formula, when the attack start point is geometrically distributed with parameter ρ, when s k−2 = 0, and Proof. Using (31), the recursion formula of the Shiryaev statistics SR k is derived first, see (45). Then using (43) in (45), the recursion equations of (44) are derived.
We use the following simplified notations to represent the recursion formula in (44). p k = φ 0 (p k−1 ), if s k−2 = 0, and p k = φ 1 (p k−1 ), otherwise. Also, the initial value of p k is taken to be 0.

3) Solution of optimization problem (27)
The expected value of the per stage cost function g k (·) in (27) is derived by taking expectations on both sides of (28), and using p k = E 1 {θ k =1} |Ψ k , see (46).
The Bellman equation for the infinite horizon cost function (27) with the termination state T e can be formulated using the sufficient statistics p k as follows, where U = {1, 2, 3} is the set of all stationary deterministic permissible policies, see Table II.
. Therefore, B 0 (p k ) and B 1 (p k ) denote the expected total costs from (k + 1)-th time instant till the termination of the process when s k = 0 and s k = 1, respectively, and d k = 0. Note that, when d k = 1, the process terminates immediately, so there will be no additional cost after the k-th time instant. In addition to that, if d k = 1 then the process will immediately terminate, and there will be no use of adding watermarking at the (k + 1)-th time instant, therefore, the combination (s k = 1, d k = 1) has been ignored. ) To satisfy the accessibility hypothesis, we have discretized the space of the sufficient statistics p k into a finite set during the numerical simulations. On the other hand, the control space of the stochastic optimization problem under study is inherently discrete and finite. Finally, the value iteration is used to solve the Bellman equation (47) and to find the optimal policy u * d .

C. Structural properties of the optimal policy
In this subsection, we will study the structure of the optimal solution found by solving the Bellman equation (47). Assumption A.1: There exist at least one stationary deterministic policy u d , for which both the constraints as given in (21) will be satisfied.
Assumption A.1 is about the feasibility of the existence of a stationary deterministic policy for the optimization problem in (21). Now the value iteration reveals the following optimal policy for selecting the control variables s k and d k values.
First, we will prove that the optimal policy is going to be a stationary deterministic policy. As stated in Lemma 3.1 from [44], the costs FAR and ANW will be monotone and nonincreasing in λ f and λ e , respectively. We can prove that by following the similar steps used to prove Lemma 3.1 in [44]. From the monotone and non-increasing properties of FAR and ANW, it can be proved that the inequality conditions in the original constrained optimization problem (21) will be satisfied 8 for finite values of λ f ≥ 0 and λ e ≥ 0 for some deterministic policy as stated in Lemma 3.3 from [44].
Finally, as discussed in [44], under assumption A.1 or the weaker condition of Lemma 3.1, the stationary deterministic optimal policy found by solving the Bellman equation (47) from the unconstrained optimization problem with the Lagrangian multipliers, λ e and λ f , will be the solution of the original constrained problem as given in (21).
Even though a formal proof is unavailable at this point, we have performed extensive numerical simulations and found that for the following properties of the optimal policy: The optimal policy is a two threshold policy, T h s and T h d , T h d ≥ T h s . Figures 4 and 5 provide the insights with thresholds of a twothreshold policy by plotting the left-hand sides (LHS) and righthand sides (RHS) of (48) and (49), respectively, for a relatively small λ e and large λ f . We observe that for (48), the LHS crosses the RHS at two points, but the second crossing happens after d k = 1, i.e., the termination of the process, which results in a two-threshold policy. We have found that for a relatively large λ e or ρ near to unity, the optimal policy may even become a one or three-threshold policy, which is similar to the findings of [34]. The following steps can be followed offline to find the two thresholds.
Step 1 : The search space of λ e and λ f is divided into N equally spaced grid points.
Step 2 : For each grid point, we perform the value iterations using (47), and store J * (p k ) where p k is also discretized in [0, 1].
Step 3 : For each grid point, ADD, FAR and ANW are evaluated from Monte-Carlo simulations by deriving the decision variables s k and d k from (48)-(49) using J * (p k ) from Step 2.
Step 4 : Select the best λ * e and λ * f combination, which gives minimum ADD and satisfies the constraints on FAR and ANW, see (21).
Step 5 : Apply numerical solvers such as the Trust-Region algorithm, the bisection method, etc., to solve the following two equations forp, see (50) Finally, the optimal policy u * d is given as Next, we briefly discuss the computational runtime complexity of the proposed policy.

D. Computational complexity
The proposed technique is an online method. At run time, we only need to evaluate p k (44) and compare it with two thresholds at each time step. For our problem formulation, most of the heavy computations, such as matrix inversion and computation

IV. DERIVATIONS OF ADD, FAR AND ∆LQG
This section derives the asymptotically approximate analytical expressions of ADD, FAR and ∆LQG for the given thresholds T h s and T h d , and a few other parameters to be defined later.

A. Approximate Expressions of ADD and FAR
Here we derive the approximate expressions of ADD and FAR, as T h d → ∞, applying non-linear renewal theory [31], [46]. First, the Shiryaev statistics SR k is converted into LSR k = log (SR k ) for the ease of asymptotic analysis. LSR k can be expressed as a summation of two variables, S k and l k , as given in the following Lemma 3. Lemma 3. The logarithm of the Shiryaev statistics, LSR k , generated from the test data, i.e., the innovation signal γ k ((29) and (30)) and the watermarking signal e k (14), under the two threshold policy T h s and T h d , can be expressed as the summation of two variables S k and l k , see (53). Here S k (54) is a ladder variable, and l k (55) is a slowly changing variable in the sense defined in [46].
Proof. The proof of Lemma 3 is provided in Appendix B.
Remark 3. The threshold T h s for p k is equivalent to the threshold T h S for LSR k . Similarly, we can define the threshold T h D as for LSR k , which is equivalent to the threshold T h d for p k . Also, as T h d → 1, T h D → ∞. Therefore, Lemma 3 enables us to apply non-linear renewal theory to derive the approximate expressions of ADD and FAR by splitting the logarithm of the Shiryaev statistics, LSR k , into a ladder variable S k and a slowly changing term l k . The definition of a slowly changing variable from [46] is also provided in Appendix B. We define the variable r to be the overshoot of the ladder variable S n d over a large threshold T h D at k = n d . Therefore, r n d S n d − T h D as T h D → ∞, and n d = inf k ≥ 1 : S k ≥ T h D . According to the non-linear renewal theory [46], the overshoot statistics of LSR k crossing a large threshold T h D can be approximated as the statistics of r n d , provided l k is slowly changing and T h D → ∞. The approximate expressions of ADD and FAR derived in this paper are stated in Theorem 1. Theorem 1. For the Shiryaev statistics given in Lemma 1 and the geometric prior distribution of the change point Γ (13), under the two threshold policy T h s and T h d , the asymptotic approximate expressions of ADD and FAR as T h D → ∞ will take the following forms, provided the conditions C1-C4 are satisfied.
C2: E 1 | Z 1 | 2 is finite. C3: l k (55) is a slowly changing variable in the sense defined in [46]. Then, and P 0 and P 1 denote the probability measures before and after the attack, respectively. E 0 and E 1 denote the expectations with respect to the probability measures P 0 and P 1 , respectively. E 1 D f e 1 , f 0 is the expected KLD between the distributions f e 1,j (·|·) and f 2,j (·|·), and f e 1,j (·|·) = f 1,j (·|·) when s j = 1 for all j. Here, the expectation is taken over the joint distribution of the innovation signal and the watermarking signal after the attack start point. Similarly, E 0 D f 0 , f e 1 is the expected KLD between f 0 and f e 1 , and the expectation is taken over the joint distribution of the innovation signal and the watermarking signal before the attack start point.
Proof. The proof of Theorem 1 is provided in Appendix C.

Remark 4. As given in Lemma 1 and Theorem 2 of [39]
, E 1 D f e 1 , f 0 will take the following form, where the covariance matrix Σ γ of the innovation signal after the attack start point is given as and E zz (0) = E z k z T k . Σ x F z and Σ x F e are the solutions to the following Lyapunov equations, Here A = (I n − KC) (A + BL), which is assumed to be strictly stable. I n is an identity matrix of size n × n. Therefore, we can derive approximate values of ADD and FAR using Theorem 1 for the given thresholds, T h d and T h s , and the system and noise parameters. The denominator of (60) does not depend on the thresholds. Also, according to the renewal theory, the statistics obtained from the overshoot r n d , i.e.,r and ξ, are not dependent on the exact values of the thresholds as long as T h d is large enough. However, from (55), 10 we can say thatl is dependent on the threshold T h s . Further approximation of the expression of ADD (60) can be directly obtained from Theorem 1 as stated in Corollary 1.1. Corollary 1.1. The approximate expression of ADD as provided in Theorem 1 can be further simplified as follows, Therefore, by ignoringr andl from (60), we get (70).
The approximate expression of ADD as provided in Corollary 1.1 does not depend on the threshold T h s . Therefore, Corollary 1.1 can also be used to find a suitable value of the threshold T h d for a given ADD. Remark 5. Finding analytical expressions forr,l, and ξ is difficult for the system under consideration. Therefore, we estimate their values by Monte-Carlo (MC) simulation. The values ofr,l, and ξ are not directly dependent on T h D as long as T h D is very large, but they depend on T h S . However, to derive the ADD using the second approximate expression as given in Corollary 1.1, we do not need the values ofr,l, and ξ, but it is less accurate compared to Theorem 1. Remark 6. In order to use the quickest detection scheme, the pre and post-change pdfs must be known. To achieve that, we need to know A a and Q a . In practice, it is highly likely that the attacker's system parameters A a and Q a may not be known a priori. In such a case, the attacker's system parameters A a and Q a can be estimated online from the received observations (true or fake) by fitting a vector autoregressive model to the observations [47]. This estimator will operate in parallel with the attack detection algorithm. Such a parameter estimation scheme will operate before and after the attack. However, before the attack, the estimates of A a and Q a will represent the healthy plant model. We have conducted some preliminary studies using a MISO system where our attack detection algorithm can perform with estimated parameters, albeit with addtional watermarking compared to the known parameter case. A detailed analysis of such a joint estimation and detection scheme is however, beyond the scope of the current manuscript, and interested readers are referred to [48]. Additionally, we also comment that under a replay attack, A a and Q a can be derived from the normal system model as discussed in [21].

B. Approximate Expression of ANW and ∆LQG
Following similar steps as in [34], the ANW can be approximated as follows, Here, t 1 T h S denotes the time interval between the time instances when LSR k starts from T h S , and then crosses the threshold T h S from above. t 2 LSR T h S , T h S denotes the time interval between the time instances when LSR k starts from LSR T h S and crosses the threshold T h S from below. t T h S is the first time LSR k crosses the threshold T h S from below. An example plot of LSR k is shown in Fig. 6 to illustrate the variables t 1 (·), t 2 (·), and t(·). E 0 [·] denotes the expectation with respect to the probability measure before the attack. Deriving analytical expressions for the expectations and the probability values in (71) is difficult. Therefore, we perform MC simulation to estimate the ANW for the given thresholds T h S and T h D . The relationship between the ANW and the increase in the control cost is given in the following theorem. For the parsimonious watermarking scheme adopted in this paper, the increase in the LQG control cost, ∆LQG, is related to ANW as where and Σ L is the solution to the following Lyapunov equation.
Proof. The proof of Theorem 2 is provided in Appendix D.
Theorem 2 shows that ∆LQG is proportional to ANW and a linear function of the watermarking signal variance Σ e . If watermarking is added at all the time instants, ρAN W will become unity, and Theorem 2 will coincide with the special case of always present watermarking as stated in Theorem 3 in [39].

C. Comparative Analysis
The proposed method is compared with the following two methods, PW-Σ e : persistent watermarking with fixed watermarking power and PW-∆LQG: persistent watermarking with fixed ∆LQG. The only difference between the proposed method and PW-Σ e is that the watermarking is always present for the latter, and the watermarking power for both the methods is Σ e . On the other hand, the only difference between the proposed method and PW-∆LQG is that the watermarking is always present for the latter, and the ∆LQG value is the same for both. The subscripts P , A and B denote the proposed method, PW-Σ e , and PW-∆LQG, respectively.

1) Comparison with PW-Σ e
Claim 1. The proposed optimal watermarking policy incurs a lesser increase in LQG cost compared to PW-Σ e .
The increase in the LQG control cost for PW-Σ e is as follows, see Theorem 3 from [39], (75) 11 By comparing the increase in the LQG control cost between the two methods, we can write where ∆LQG P denotes the increase in the LQG control cost for the proposed method. Since, (1 − ρAN W ) < 1, we can make Claim 1. Claim 2. The increase in ADD for the proposed optimal policy with respect to PW-Σ e will be small .
There will be an increase in the ADD for the proposed method. Other than the mean of the slowly changing term,l, in the ADD expression (60), the rest of the components will be the same for the proposed method and PW-Σ e . Therefore, the increase in the ADD for the proposed method will be as follows, where the subscripts A and P denote the PW-Σ e and the proposed method, respectively. o(1) notation is dropped for simplicity. Herel P is the same as given by (55) and (63), and l A will take the following form Since,l A andl P both are small quantities compared to T h D , which is assumed to be → ∞, we can make Claim 2. Claim 3. The FAR for the proposed optimal policy and PW-Σ e will almost be the same..
Since T h D ≥ T h S , the watermarking will be present for both cases when Z n crosses the threshold T h D . In other words, the statistics of the overshoot r n d will be the same for both methods. Therefore, we can make Claim 3.

2) Comparison with PW-∆LQG
Claim 4. The watermarking signal power for the proposed optimal policy, Σ eP , will be greater than or equal to the watermarking signal power of PW-∆LQG, Σ eB .
Since the increase in the LQG control cost is taken to be the same for both the methods, the watermarking signal powers Σ eB and Σ eP for the method PW-∆LQG and the proposed method, respectively, will be different. The relationships between the watermarking signal power and ∆LQG for both the methods are given as, ∆LQG = tr (HΣ eB ) = ρAN W tr (HΣ eP ) . (79) Since ρAN W ≤ 1, from (79) we can make Claim 4. Claim 5. The ADD for the proposed optimal policy will be less than or equal to the ADD for PW-∆LQG.
We use the ADD expression from Corollary 1.1 to compare the two methods. The difference in the ADD will be due to the difference in E 1 D f e 1 , f 0 B and E 1 D f e 1 , f 0 P as follows.
Here the subscripts B and P denote the method PW-∆LQG and the proposed method, respectively. By examining (66), we can say Σ γ P − Σ γ B ≥ 0. Therefore, from (80) we can write E 1 D f e 1 , f 0 P − E 1 D f e 1 , f 0 B ≥ 0, and we can further make Claim 5.

D. Optimum Σ e
Theorem 1 and Corollary 1.1 imply that the increase in KLD will reduce ADD. Therefore, we derive the optimum Σ e that will maximize KLD for a given fixed upper limit on ∆LQG, denoted as ∆LQG P for the proposed method. The optimization problem is defined as follows.
where J is a user-defined threshold. As given in Remark 4, the KLD expression for the proposed parsimonious watermarking policy is identical with the case where watermarking is always present, i.e., the method PW-Σ e . Moreover, ∆LQG P (72) is just a scaled version of ∆LQG A (75). Therefore, the condition ∆LQG P ≤ J in (81) can be replaced by ∆LQG A ≤ J A , where J A = J/(ρAN W ), without any change in the optimum Σ e value. Now, the optimization problem for the proposed method becomes identical to the optimization problem for the method PW-Σ e . According to Theorem 4 from [39], the optimum Σ e for PW-Σ e will be a rank one positive semi-definite matrix. Therefore, the optimization problem in (81) can be written as where v λ = √ σ e v e , σ e is the non-zero eigenvalue of Σ e and v e is the corresponding eigenvector. As discussed in [39], the maximization of E 1 D f e 1 , f 0 with respect to v λ is same as maximizing the following function, where Here, κ e is the solution to the Lyapunov equation Since the matrix A is assumed to be strictly stable, the Lyapunov equation of (85) will have a unique solution. As discussed in [39], the optimization problem of (83) can be solved by various methods available in the literature, such as sequential quadratic programming (SQP) [49], interior point method [50], simple gradient-based method [39], etc. Interested readers are referred to [39] for a detailed analysis, but the same has been removed from the current paper due to space constraints. 12 V. NUMERICAL RESULTS This section will illustrate and validate different aspects of the proposed methodology using, System-A: a second-order multi-input single-output (MISO) open-loop unstable system and System-B: a fourth-order MIMO open-loop stable system. Appendix E provides the required parameters for simulations associated with System A and B. Figure 7 shows the optimal decision variable u * d,k vs. p k plots for three different values of λ e and a fixed λ f for System-A. The watermarking signal variance is taken to be a diagonal matrix with equal signal power, σ 2 e . We observe that the optimal policy is a two threshold policy, which validates the theory presented in Sub-section III-C. A higher λ e means a stricter constraint on how much watermarking could be added, which gets reflected into higher T h s . On the other hand, a higher T h s means watermarking will be added for fewer samples. As discussed in Sub-section IV-C1, since the added watermarking has little effect on the FAR, the change in λ e does not affect the threshold T h d much.  Figure 8 shows the optimal decision variable u * d,k vs. p k plots for three different values of λ f and a fixed λ e for System-A. The watermarking signal variance is taken to be a diagonal matrix with equal signal power, σ 2 e . A higher λ f means a stricter constraint on how much FAR could be allowed. Therefore, the increase in λ f increases the threshold T h d . However, since T h d ≥ T h s , the change in λ f does not affect the threshold T h s . 2) Trial Run Figure-9 illustrates how the control variables s k and d k , and the sufficient statistics p k change with the time index k for a sample trial run for System-A. The watermarking signal variance is taken to be a diagonal matrix with equal signal power, σ 2 e . We have also indicated the attack start point as "change pt" in the plot. This figure provides relevant insights into how the proposed method works. We can observe that only for a very few time instances p k ≥ T h s , and watermarking have been added before the attack. Such parsimonious use of watermarking reduces the control cost before the attack. On the other hand, p k increases gradually after the attack start point and eventually crosses the threshold T h d . In other words, p k ≥ T h s and watermarking have been added almost all the time after the change point, resulting in faster detection. 3) ADD and FAR vs. σ 2 e Figure 10 shows the comparison between the plots of ADD and FAR vs. σ 2 e for two different values of λ e for System-A. Σ e is taken to be a diagonal matrix with equal signal power, i.e., σ 2 e . For each σ 2 e point, the thresholds T h s and T h d are derived using value iterations from dynamic programming. Then, the ADD and FAR are estimated by MC simulations using the derived thresholds. As discussed before, higher λ e reduces the usage of watermarking before the attack by increasing the threshold T h s . The derived approximate expression of ADD (60) reveals that the ADD does not depend on λ e or T h s directly. However, from (55) and (63), we can say that thel reduces with the reduction in watermarking, which in turn increases ADD. Sincel is a small quantity compared to T h D , the effect of the change of l is small on ADD. To summarize, lower λ e results in slightly lower ADD. Similarly, the derived approximate expression of FAR (61) reveals that the FAR does not depend on λ e or T h s also. That is why we observe very similar FAR curves for two different values of λ e in Fig. 10.  Figure 11 compares the same set of plots as in Figure 10, but for two different values of λ f and a fixed λ e for System-A. As discussed before, the increase in λ f increases T h d . From (60) and (61), we know that ADD and FAR are mainly dependent on the value of T h D . ADD increases with the increase in T h D , 13 whereas FAR reduces. To summarize, ADD increases and FAR decreases with λ f .

1) Optimal Policy
In both the figures, Fig. 10 and Fig. 11, ADD reduces with the increase in the watermarking signal power, which is primarily the result of increased KLD (65). On the other hand, FAR (61) does not reduce much with the watermarking signal power since the correlation is weak. Higher watermarking signal power increases the overshoot r n d to some extent, which in turn reduces ξ (64) slightly.  Figure 12 shows the ADD and FAR vs. σ 2 e plots for System-A, where ADD and FAR are estimated by MC simulations and also derived using Theorem 1 and Corollary 1.1. The watermarking signal variance is taken to be a diagonal matrix with equal signal power, i.e., σ 2 e . For each σ 2 e point, the thresholds T h s and T h d are derived using dynamic programming value iterations. Figure 13 shows the same set of plots as in Fig. 12 for System-B. The ADD derived using MC simulations does not reduce at the same rate as that of the approximate theoretical ADD with the increase in σ 2 e . The reason is that the derived analytical expression of ADD is asymptotically approximate. On the other hand, we have selected the parameter values for the MC simulations so that ADD remains small for the ease of simulation studies. Within the small delay window after the change point for the MC simulations, the increase in σ 2 e is not making much difference to the estimated ADD. On the other hand, the simulation study shows that ξ (64) does not change much for a small increase in σ 2 e . Therefore, from (61), we can say FAR will only be affected to a small extent due to the increase in σ 2 e . Therefore, we observe that the simulated FAR and the theoretical FAR are in close agreement in Fig. 12. We also see that the derived ADD from Theorem 1 is a better match compared to the ADD derived from Corollary 1.1.   Figure 14 shows the ∆LQG vs. σ 2 e plot, where ∆LQG is estimated by MC simulation and also derived using the theory presented in this paper for System-A using the same parameters as Fig. 12. The watermarking signal variance is taken to be a diagonal matrix with equal signal power, i.e., σ 2 e . From the derived expression of ∆LQG (72), it is evident that the control cost will increase with the increase in watermarking signal power.  Figure 15 compares the ∆LQG vs σ 2 e plot from the proposed method and PW-Σ e assuming a diagonal Σ e with equal power, σ 2 e , for System-A. For each σ 2 e point, the thresholds T h s and T h d are derived using dynamic programming value iterations for the proposed method, and the same thresholds are used for PW-Σ e for a fair comparison. From the derived expression of ∆LQG (76), we predicted that we would get a large improvement in the control cost since ρ and ANW both are small quantities. Also, the difference will increase with σ 2 e as ∆LQG A increases with Σ e . As predicted from the theory discussed in Sub-section IV-C1, we observe a large improvement in the control cost (approx. 99% reduction in ∆LQG) for the proposed method in Fig. 15, which validates our Claim 1.

5) Comparison with PW-Σ e
We have shown ADD and FAR vs. σ 2 e plots for the proposed method and PW-Σ e using the same parameters as Fig. 15 in Fig. 16. From the derived expression of ∆ADD (77), we can comment that the proposed method will take a longer time on average to detect the attack compared to PW-Σ e . The difference is due to the slowly changing terms,l A andl P . Since the magnitude of the slowly changing term usually remains small, the increase in ADD for the proposed method is also small. In Fig. 16, an average increase of 35% (approx.) in ADD is observed at the same FAR for the proposed method. On the other hand, as discussed in Sub-section IV-C1, FAR will be the same for both the methods and the same is observed in Fig. 16.
To summarize, Fig. 16 supports our Claim 2 and Claim 3.

6) Comparison with PW-∆LQG
We have shown ADD and FAR vs. ∆LQG plots derived from MC simulations for the proposed method and PW-∆LQG assuming a diagonal Σ e with equal power σ 2 e in Fig. 17 for System-A. For each ∆LQG point, the thresholds T h s and T h d are derived using dynamic programming value iterations for the proposed method, and the same thresholds are used for PW-∆LQG for a fair comparison. In general, for the proposed method, T h s decreases and T h d increases with the increase in ∆LQG or σ 2 e for fixed λ e and λ f . Since the same thresholds are used for PW-∆LQG, the ADD increases with ∆LQG in the plot. As discussed in Sub-section IV-C2, since the proposed method uses a higher watermarking signal variance at the same control cost, the KLD for the proposed method is higher compared to PW-∆LQG. Higher KLD for the proposed method results in lower ADD, and the same characteristic is observed in Fig. 17. Also, the usage of higher watermarking signal power increases the overshoot statistic to a small extent, resulting in a small decrease in FAR. To summarize, Fig. 16 supports our Claim 4 and Claim 5.

7) Optimum Σ e
As discussed in Sub-section IV-D, the optimum Σ * e reduces the KLD for a fixed upper bound on the ∆LQG, which in turn reduces the ADD. We compare the ADD for the optimum Σ * e and the diagonal Σ e in Fig. 18 for System-A. For each σ 2 e point, the thresholds T h s and T h d are derived using dynamic programming value iterations for the diagonal Σ e case, and the same thresholds are used for the optimum Σ * e case for a fair comparison. We observe an average increase of 14% (approx.) in the estimated ADD for the optimal Σ * e . For the optimal Σ * e , the watermarking signal power is mostly concentrated in one eigenvector direction, which results in higher overshoot and a lower FAR. Figure 18: Comparison between diagonal Σe and optimal Σ * e . ADD and FAR vs. ∆LQG plot for System-A. λ f = 100 and λe = 0.3.

8) Comparison with a periodic watermarking scheme
We have compared the proposed evidence-based parsimonious watermarking scheme with a periodic watermarking scheme. The periodic watermarking scheme is adopted from [18] for our problem formulation. To fairly compare both methods, we have evaluated ADD and FAR by MC simulations for the same ∆LQG values. Under both schemes, the p k has been evaluated and compared with the same T h d value for attack detections. Note that T h d values are different for different ∆LQG values. However, watermarking has been added under the proposed scheme if p k ≥ T h s . On the other hand, watermarking is added only once in a period for the other method, and the periods are determined separately for each ∆LQG value. Since the periodic watermarking scheme does use any existing evidence extracted from the set Ψ k of all available information upto the k-th time instant, the watermarking frequency remains the same before and after the attack. However, for the proposed scheme, the watermarking frequency increases significantly after the attack (approx. 50 times), which reduces ADD and FAR, see Fig. 19.

VI. CONCLUSION
In this paper, we have studied the quickest data deception attack detection problem with constraints on FAR and ANW. Such parsimonious use of watermarking helps to reduce the control cost during normal system operations and maintain a moderate detection performance. First, we have formulated the problem as a stochastic optimal control problem under a Bayesian framework. Then, we have applied dynamic programming to find the optional policy. We have studied the optimal policy structure and found the optimal policy to be 15 Figure 19: Comparison between proposed method and a periodic watermarking scheme. ADD and FAR vs. ∆LQG plot for System-A. λ f = 100 and λe = 0.3. a two threshold policy on the posterior probability of attack under a few practical assumptions. We have also derived the asymptotic approximate expressions of ADD and FAR applying non-linear renewal theory. The analytical expression of ∆LQG and its relationship with ANW is also derived. Theoretical and simulation studies reveal significant improvement in reducing ∆LQG with a relatively small increase in ADD compared to the method PW-Σ e , where watermarking is always present. The proposed method is also compared with PW-∆LQG, where both the methods have the same ∆LQG limit and found that the proposed method performs better in terms of ADD and FAR. Furthermore, we have described a technique to find the optimal watermarking signal power that maximises the KLD, which will improve the ADD.
The likelihood ratio, L a,j , of the joint dependent distributions of the innovation signal and the watermarking signal, after and before the attack, takes the following form, (86) γ j is iid before the attack. Therefore, applying the chain rule to f 1,j (·|·), (86) can be written as given in (33). Using a similar argument, (34) can be derived. To derive (35), (38) and (40), we rewrite (30) for the case, k > Γ, applying (11) as From the assumptions that (A + BL) is strictly stable and the system started at j = −∞, we can say (A + BL) j−2 → 0 as j → ∞. Therefore, (88) will take the following form as j → ∞, Using (87) and (89), we can derive the following, Using the same approach, (36), (39) and (41) can be derived from (30) for the case, k = Γ. Taking expectations on both sides of (29), we get E 0 [γ j ] = 0. (42) is derived from (6) as, APPENDIX B PROOF OF LEMMA 3 The following form of the LSR k is derived by taking logarithms on both sides of (45), and combining both the conditions in (45) using an indicator function. where (94) The threshold T h S on LSR k is the same as the threshold T h s on p k . T h S is derived directly from (43) as given in (58). We rewrite λ k by adding and subtracting k i=1 log (L a,i ) 1 {LSRi<T h S } to the right hand side of (94) as follows, where Z k is given in (56). Replacing the first λ k in (93) by (95) and dividing the terms in S k and l k , we get (53). The proof that l k is slowly changing variable is provided as follows.
The variable l k will be called slowly changing provided the following two conditions are satisfied, according to [46]: and for every > 0, there exists k * and δ > 0, such that for all k ≥ k * C2: P max 1≤i≤kδ | l k+i − l k |> < .
After the attack start point, LSR k will gradually increase on average (from condition C6 in Theorem 1), and it will first cross T h S and then T h D as k → ∞. LSR k will remain below T h S for a relatively short period of time compared to the time it takes to cross T h D , since T h D → ∞. Therefore, l 2,k and l 3,k will converge to some finite values, say L 2 and L 3 , respectively, as k → ∞. Also, exp (−λ k ) → 0 since λ k → ∞ as k → ∞ from condition C6. Therefore, l 1,k will also converge to a finite value, say L 1 , as k → ∞. Now, from (102), we can say l k will also converge to a finite value l, i.e., l ≤ L1 + L2 + L3 as k → ∞, which means l k will satisfy condition C1. We assume that at k = k 1 , LSR k crosses T h S . Therefore, for k ≥ k 1 , we can write Therefore, l k+i − l k = l 1,k+i − l 1,k for k ≥ k 1 . (107) As mentioned before, exp (−λ k ) → 0 as k → ∞, therefore, we can say P {l 1,k+i − l 1,k } → 0 for a sufficiently large k, say k * , and k * ≥ k 1 . From (107), for k ≥ k * , P {|l k+i − l k | > } = 0, which in turn will satisfy condition C2.

APPENDIX C PROOF OF THEOREM 1
First, we will show that the conditions C1-C4 are satisfied for the problem under study. Z k is a function of continuous random variables, which take uncountably infinite values, so Z k is non-arithmetic, thus satisfies the condition C1.
For condition C2, Z 1 denotes the log-likelihood ratio (56) just after the attack start point. For simplicity, we consider that the attacker is present in the system from the beginning. Now, from (56) and (33), we can write Z 1 as From (38), (40), and (42), we can say that all the elements of (108) are either finite or having Gaussian distributions with finite means and variances, which in turn ensures that E 1 | Z 1 | 2 is finite. Condition C3 is proven in Appendix B. From the expressions of Σ 0 (42) and Σ γ (67), we can say that under the practical assumptions of the plant model and attacker's system parameters 0 < E 1 D f e 1 , f 0 < ∞ from (65). In a similar way we can also show that 0 < E 1 D f 0 , f e 1 < ∞. Therefore, the condition C4 is valid for the problem under study. The following is the proof of Theorem 1.
Say, after the attack start point, at k = Γ the test statistics LSR k will cross the threshold T h D at k = τ for the first time which is equivalent to the test statistics p k crossing the threshold T h d . To derive the expression of ADD, we assume, T D = τ −Γ. After adding and subtracting T h D to (53) and rearranging the terms, it will take the following form at k = T D , According to the nonlinear renewal theory [46], the overshoot statistics of LSR T D − T h D can be approximated by the overshoot statistics of S T D , i.e., r n d = S T D − T h D , provided T h D → ∞. Moreover, the slowly changing term l k → l as k → ∞, where l is a RV [31]. Taking expectations on both sides of (109), we get Furthermore, E 1 [Z 1 ] can be approximated as E 1 D f e 1 , f 0 (65) as explained in [39]. Combining (110) and (111), and rearranging the term we get (60), where ADD = E 1 [T D ]. 17 A brief derivation of FAR is provided as follows. A detailed one can be found in [31].

APPENDIX D PROOF OF THEOREM 2
The proposed parsimonious watermarking mechanism can be assumed to be a always present watermarking scheme, where the watermarking signal is e s,k = s k−1 e k . s k−1 and e k are assumed to be uncorrelated since they are generated from two independent processes. The variance of e s,k , Σ es , is derived as where Σ e = E 0 e k e T k . Since s k = 1 or 0, E 0 s 2 k−1 = E 0 [s k ]. Therefore, Σ es takes the following form, Now, the increase in the control cost, ∆LQG, is derived using Theorem 3 from [39] for the always present watermarking signal e s,k as follows, ∆LQG = tr (HΣ es ) = ρAN W tr (HΣ e ) , [applying (115)], where H is given in (73).

APPENDIX E SYSTEM PARAMETERS
The following system parameters are used for simulation study. ρ = 0.001. System-A parameters: