Survivability and Vulnerability Analysis of Cloud RAID Systems under Disk Faults and Attacks

In this paper we model and analyze the survivability and vulnerability of a cloud RAID (Redundant Array of Independent Disks) storage system subject to disk faults and cyber-attacks. Cloud RAID survivability is concerned with the system's ability to function correctly in the presence of hazardous behaviors, including disk failures and malicious attacks. Cloud RAID invulnerability is concerned with the system's ability to function correctly while occupying a state immune to malicious attacks. A continuous-time Markov chain (CTMC)-based method is suggested to perform the disk-level survivability and invulnerability analysis. Combinatorial methods are then presented for the cloud RAID system-level analysis; they accommodate both homogeneous disks (based on binomial coefficients) and heterogeneous disks (based on multi-valued decision diagrams). A detailed case study on a cloud RAID 5 system is conducted to illustrate the application of the proposed methods. Impacts of different parameters on the disk and system survivability and invulnerability are also investigated through numerical analysis.
Keywords: Cloud storage system, Cyberattack, Disk fault, Survivability, Vulnerability.


Introduction
Survivability is concerned with the ability of a system to continue its intended operation in the presence of accidental failures or malicious attacks (Fung et al., 2005). Going beyond survivability, the invulnerability of a system is concerned with the system's ability to function correctly while occupying a state immune to malicious attacks. In addition to individual component failures caused by factors such as aging and defects, various cyber-attacks pose significant threats to modern technological systems such as the Internet of Things and cloud computing systems (Chou, 2013; Escudero et al., 2018; George and Thampi, 2018; Xing, 2020). For instance, sound waves have been used to launch DoS (denial-of-service) attacks without internet connections, causing service outages and even hardware damage (e.g., destroying electronic devices, or posing a life-threatening danger on medical devices) (Shahrad et al., 2018). The objective of this paper is to quantitatively assess the survivability and invulnerability of a cloud RAID (Redundant Array of Independent Disks) storage system at both the disk and system levels, facilitating the robust design and operation of the cloud RAID system in practice.
Reliability analysis of cloud storage systems has received significant attention from both academia and industry, but existing work has focused only on the system behavior in the event of random component failures; see, for example, Iliadis et al. (2014), Xing (2015a, 2015b), Xing (2019, 2020), Nachiappan et al. (2017), and Zhang et al. (2013). In contrast to the rich literature on quantitative reliability analysis, there exist only limited works on quantitative security analysis; see, for example, Levitin et al. (2018), Liu et al. (2019), and Xu et al. (2019). This paper advances the state of the art by performing survivability and vulnerability modeling and analysis of cloud storage systems, simultaneously considering both reliability and security attributes. We analyze possible threat scenarios through a survivability architecture. We then examine the failure and attack behaviors at the disk level and propose a continuous-time Markov chain (CTMC)-based method to assess the survivability and invulnerability of each disk in the cloud RAID system. Based on the disk-level analysis, we then present combinatorial methods to quantify the survivability and invulnerability of the entire cloud storage system. The rest of the paper is structured as follows. Section 2 presents the framework or architecture of survivability modeling. Section 3 presents an illustrative example of a cloud RAID storage system. Section 4 presents the CTMC-based method and the combinatorial methods. Section 5 presents detailed analyses of the example cloud RAID system at the disk level and investigates impacts of several attack or recovery parameters on the disk performance. Section 6 performs the system-level survivability and invulnerability analysis. Lastly, Section 7 gives conclusions and identifies directions for future work.
Figure 1 depicts a three-level survivability architecture. At the bottom level, representative hazardous events are listed.
In particular, according to the cloud vulnerability incidents investigation report (Check Point, 2020; Ko et al., 2013), the top three threats to the CIA (confidentiality, integrity, availability) principle were Insecure Interfaces & APIs (29% of all threats), Data Loss & Leakage (25%), and Hardware Failure (10%). These three threats accounted for 64% of all the cloud outages investigated in the report. Specifically, APIs are the interconnectors that provide the interface between the Internet and the Things. Since the APIs are accessible from anywhere on the Internet, malicious attackers can use them to compromise the confidentiality and integrity of the system or service (e.g., an attacker who acquires a token employed by a customer to access the service via the service API can use the same token to manipulate this customer's data) (Bamiah and Brohi, 2011). Regarding the Data Loss & Leakage threat, an attacker might steal, modify, or corrupt the data, for example through co-resident or co-location attacks (Hasan and Rahman, 2020; Xing et al., 2019). In addition, this work addresses another common form of cyberattack, the distributed DoS (DDoS) attack, which is designed to exhaust network, storage, and server resources with transient bursts, causing cloud services to crash (Wang et al., 2015). The frequency of DDoS attacks has increased more than 2.5 times over the last 3 years. The average size of DDoS attacks has correspondingly grown, approaching 1 Gbps, which is enough to take most organizations completely offline (Avital et al., 2020; Hummel, 2019). There exist other threats that are not addressed in this work, such as Account/Service Hijacking, Abuse and Nefarious Use of Cloud, and Malicious Insiders (Bamiah and Brohi, 2011).

Survivability Architecture
The bottom level also covers representative causes of hardware, software, or middleware faults, including implementation mistakes, external disturbances, and component flaws or defects. In the middle level, the effects of the hazardous causes are identified, including the effects of vulnerabilities on the CIA principle and the three types of faults. Techniques or mechanisms can be developed to mitigate or tolerate these faults, enhancing the system survivability. For example, security controls (e.g., correct firewall settings and certain Windows updates) can reduce the security risk effectively. Redundancy techniques (e.g., standby sparing, N-modular redundancy) can be implemented to achieve fault tolerance. However, because these mitigation strategies are imperfect and the interdependencies among the system components are complicated, the system may still fail, leading to the loss of the system assets or the mission despite the use of the survivability enhancement techniques.

Example Cloud RAID System Description
A cloud RAID 5 with four disks in the array is used as an illustrative example system (Liu and Xing, 2015b). As shown in Figure 2, data are divided into stripes or blocks and stored across four different disks (that may be from different providers). The parity information (Ap, Bp, Cp, Dp) is also distributed among the four disks, providing fault tolerance in the event of one disk failing or being attacked. More specifically, when one disk malfunctions, the system can restore the stripes of the failed disk from the parity stripe and the remaining data stripes, for example through the exclusive OR (XOR) operation.
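The XOR-based recovery described above can be illustrated with a minimal sketch. The block contents and the helper name `xor_blocks` are hypothetical and chosen only for illustration; a real RAID 5 controller also rotates the parity position across stripes, which is omitted here.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# One stripe on a four-disk array: three data blocks (hypothetical contents)
# plus one parity block computed as their XOR.
data = [b"AAAA1111", b"BBBB2222", b"CCCC3333"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails or is attacked: XOR-ing the two
# surviving data blocks with the parity block reproduces the lost block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

The same identity explains why a second simultaneous disk loss cannot be tolerated: with two unknown blocks, a single parity equation no longer determines either of them.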

Proposed Method
The proposed method includes a CTMC-based model that describes the complicated failure, attack, and recovery behaviors of an individual disk and evaluates the state probabilities at the disk level. The method also covers a multi-valued decision diagram (MDD)-based combinatorial model for the system-level survivability and invulnerability analysis. Figure 3 illustrates the CTMC model, i.e., the state transition diagram depicting the attack, failure, and recovery behaviors of a RAID disk. In the initial good state 0, the disk possesses both the security and reliability attributes (the data can be retrieved from the disk correctly). From the good state, the disk can transit to the degradation state 1 with rate λgd. It may then transit back to the good state with rate μdg through the disk's self-recovery mechanism, which is able to correct certain media errors. If the restoration attempt fails, the data are permanently lost and the disk transits to the failure state 3 with rate λdf. From the initial good state, due to DDoS attacks or events that bring transient surges of server visits (e.g., Black Friday online shopping), the disk drive may enter the vulnerable state 2 with rate ρgv; in this state some latency problems occur but the system still works. The disk can return to the good state 0 from the vulnerable state 2 with recovery rate rvg if contingency strategies are performed immediately. However, in state 2, if no timely remedial measures are taken or the size of the attacks increases, the disk server can go down at any time, entering the failure state 3 with rate ρvf. The inaccessibility of a disk in the failure state 3 caused by DDoS attacks can be fully restored with rate rfg (back to the good state 0) through some defensive mechanism (e.g., buying enough spare bandwidth for volumetric attacks, developing an incident response plan, hiring a DDoS mitigation service) (Mirkovic and Reiher, 2004).
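The disk can also transit directly from the initial good state 0 to the failure state 3 with rate λgf due to hardware failure or some other unrecoverable failure, or with rate ρgf due to security threats such as data tampering or deletion via insecure APIs. Based on the transitions in Figure 3, the state equations of the CTMC are given by Eq. (1):

$$
\begin{aligned}
\dot{P}_0(t) &= -(\lambda_{gd} + \rho_{gv} + \lambda_{gf} + \rho_{gf})\,P_0(t) + \mu_{dg}\,P_1(t) + r_{vg}\,P_2(t) + r_{fg}\,P_3(t) \\
\dot{P}_1(t) &= \lambda_{gd}\,P_0(t) - (\mu_{dg} + \lambda_{df})\,P_1(t) \\
\dot{P}_2(t) &= \rho_{gv}\,P_0(t) - (r_{vg} + \rho_{vf})\,P_2(t) \\
\dot{P}_3(t) &= (\lambda_{gf} + \rho_{gf})\,P_0(t) + \lambda_{df}\,P_1(t) + \rho_{vf}\,P_2(t) - r_{fg}\,P_3(t)
\end{aligned}
\tag{1}
$$

where Pj(t) denotes the probability that the disk is in state j (j=0,1,2,3) at time t, and Ṗj(t) denotes the derivative of Pj(t) with respect to t. From Eq. (1), Eqs.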
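As a minimal sketch of how the disk-level state probabilities could be evaluated numerically, the transitions described above can be collected into a generator matrix and integrated from the good state. The rate values below are hypothetical placeholders (they are not the Table 1 parameter sets), the dictionary keys are illustrative names for the rates in Figure 3, and the fixed-step Runge-Kutta integrator is just one convenient choice, not necessarily the solution method used in the paper.

```python
def generator_matrix(r):
    """Build the 4x4 generator Q; Q[i][j] is the transition rate from state i to j.
    States: 0 = good, 1 = degradation, 2 = vulnerable, 3 = failure."""
    q = [[0.0] * 4 for _ in range(4)]
    q[0][1] = r["lam_gd"]                 # good -> degradation
    q[1][0] = r["mu_dg"]                  # degradation -> good (self-recovery)
    q[1][3] = r["lam_df"]                 # degradation -> failure
    q[0][2] = r["rho_gv"]                 # good -> vulnerable (DDoS attack)
    q[2][0] = r["r_vg"]                   # vulnerable -> good (recovery)
    q[2][3] = r["rho_vf"]                 # vulnerable -> failure
    q[3][0] = r["r_fg"]                   # failure -> good (rescue)
    q[0][3] = r["lam_gf"] + r["rho_gf"]   # good -> failure (hardware or insecure API)
    for i in range(4):
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q

def derivative(p, q):
    """dP/dt = P Q, the row-vector form of the state equations."""
    return [sum(p[i] * q[i][j] for i in range(4)) for j in range(4)]

def solve(q, t_end, steps=5000):
    """Integrate from P(0) = [1, 0, 0, 0] with the classical fourth-order Runge-Kutta scheme."""
    h = t_end / steps
    p = [1.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        k1 = derivative(p, q)
        k2 = derivative([p[j] + 0.5 * h * k1[j] for j in range(4)], q)
        k3 = derivative([p[j] + 0.5 * h * k2[j] for j in range(4)], q)
        k4 = derivative([p[j] + h * k3[j] for j in range(4)], q)
        p = [p[j] + h / 6.0 * (k1[j] + 2 * k2[j] + 2 * k3[j] + k4[j]) for j in range(4)]
    return p

# Hypothetical rates per hour, for illustration only.
rates = {"lam_gd": 1e-3, "mu_dg": 0.1, "lam_df": 1e-3,
         "rho_gv": 0.05, "r_vg": 0.5, "rho_vf": 0.01,
         "r_fg": 0.2, "lam_gf": 1e-5, "rho_gf": 1e-4}
Q = generator_matrix(rates)
p = solve(Q, 54.0)                 # state probabilities at t = 54 hours
survivability = 1.0 - p[3]         # disk-level survivability, 1 - P3
invulnerability = p[0] + p[1]      # disk-level invulnerability, P0 + P1
print(p, survivability, invulnerability)
```

Re-running `solve` with different values of `rho_gv`, `r_vg`, or `r_fg` gives the kind of parameter sweep discussed in the disk-level analysis.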

Combinatorial System-Level Solution
In order to calculate the system survivability and invulnerability, each disk is modeled as a multi-state component, and the MDD model is applied to represent the system-level behavior of the cloud RAID system (Xing and Amari, 2015; Xing and Dai, 2009). Specifically, as illustrated in Figure 4, each multi-state disk k (k=1,2,3,4) is modeled as a non-sink node with four outgoing edges, representing the disk being in the good (0), degradation (1), vulnerable (2), and failure (3) states, respectively. Each edge is associated with its corresponding state probability, denoted by Pk0, Pk1, Pk2, and Pk3, respectively. The entire cloud RAID 5 system also has four states: good, degraded, vulnerable, and failed. Specifically, the entire cloud RAID system is considered to be in the failed state when at least 2 disks are in the failure state 3, which is modeled using the 2-out-of-4 MDD lattice structure in Figure 5. Sink node 1 in Figure 5 means the system is in the failed state; sink node 0 means the system is not in the failed state. The probability of the system being in the failed state, Psys=3(t), is obtained as the sum of the probabilities of all the paths from the root node 1 to sink node '1', which is given by Eq. (10).
where Pkj is the probability of disk k being in state j (k=1, 2, 3, 4, and j=0, 1, 2, 3). In the case of all four disks being identical (i.e., having the same state probabilities Pkj=Pj), the system failed-state probability can be obtained simply using binomial coefficients as in Eq. (11).
The entire cloud RAID system is considered to be in the good state 0 when at least three of the four disks are in the good state. The entire system is in the vulnerable state when at least two of the four disks are in the vulnerable state and no two disks are in the failure state at the same time (i.e., 2 disks are in the vulnerable state and the remaining 2 disks are in either the good or the degradation state; or 2 disks are in the vulnerable state, 1 disk is in the failure state, and 1 disk is in either the good or the degradation state; or 3 disks are in the vulnerable state and the remaining disk is in the good, degradation, or failure state; or all 4 disks are in the vulnerable state). Any state other than the system good, vulnerable, and failed states is considered a degraded state for the example cloud RAID 5 system.
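As a worked form of the failure criterion above (at least two of the four disks in state 3), the failed-state probability can be written out as below. Both expressions are derived directly from that criterion under the assumption that the disks behave independently; they may differ in layout from the path-sum form of Eq. (10) and from Eq. (11).

```latex
% Heterogeneous disks: complement of "no disk failed" and "exactly one disk failed"
P_{sys=3}(t) = 1 - \prod_{k=1}^{4}\bigl(1 - P_{k3}(t)\bigr)
             - \sum_{k=1}^{4} P_{k3}(t) \prod_{i \neq k}\bigl(1 - P_{i3}(t)\bigr)

% Homogeneous disks (P_{k3} = P_3): binomial form
P_{sys=3}(t) = \sum_{i=2}^{4} \binom{4}{i}\, P_3(t)^{i}\,\bigl(1 - P_3(t)\bigr)^{4-i}
```

Setting all P_{k3} equal to P_3 in the first expression gives 1 - (1 - P_3)^4 - 4 P_3 (1 - P_3)^3, which is the same quantity as the binomial sum.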
In the case of homogeneous disks, the probability of the system being in the good, vulnerable, and degraded states can be evaluated using Eqs. (12), (13), and (14) respectively.
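Based on the state probabilities evaluated using Eqs. (11)-(14), and by analogy with the disk-level definitions, the survivability of the cloud RAID 5 system is given as $S_{sys}(t) = 1 - P_{sys=3}(t)$, and the invulnerability of the system is given as $I_{sys}(t) = 1 - P_{sys=2}(t) - P_{sys=3}(t)$, where $P_{sys=2}(t)$ and $P_{sys=3}(t)$ denote the probabilities of the system being in the vulnerable and failed states, respectively.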
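The system-state rules above can be checked with a brute-force sketch that enumerates all 4^4 combinations of disk states, which also covers heterogeneous disks. The disk state probabilities used here are hypothetical, the numbering of the system degraded and vulnerable states as 1 and 2 is an assumption made by analogy with the disk states, and the closed forms are derived from the stated rules rather than copied from Eqs. (11)-(14). The disks are assumed to be independent.

```python
from itertools import product
from math import comb, isclose

GOOD, DEGRADED, VULNERABLE, FAILED = 0, 1, 2, 3

def classify(states):
    """Map a tuple of four disk states to the system state, following the rules in the text."""
    if states.count(FAILED) >= 2:
        return FAILED
    if states.count(GOOD) >= 3:
        return GOOD
    if states.count(VULNERABLE) >= 2:
        return VULNERABLE
    return DEGRADED

def system_state_probs(disk_probs):
    """disk_probs[k][j] = P_kj. Returns [P_sys=0, P_sys=1, P_sys=2, P_sys=3] by enumeration."""
    out = [0.0] * 4
    for states in product(range(4), repeat=len(disk_probs)):
        p = 1.0
        for k, j in enumerate(states):
            p *= disk_probs[k][j]
        out[classify(states)] += p
    return out

# Hypothetical homogeneous disk state probabilities [P0, P1, P2, P3] at some time t.
P0, P1, P2, P3 = 0.90, 0.03, 0.05, 0.02
sys_probs = system_state_probs([[P0, P1, P2, P3]] * 4)

# Closed forms for identical disks, derived from the same rules.
p_fail = sum(comb(4, i) * P3**i * (1 - P3)**(4 - i) for i in range(2, 5))
p_good = P0**4 + 4 * P0**3 * (1 - P0)
p_vul = (6 * P2**2 * (P0 + P1)**2 + 12 * P2**2 * P3 * (P0 + P1)
         + 4 * P2**3 * (1 - P2) + P2**4)
p_deg = 1.0 - p_good - p_vul - p_fail

s_sys = 1.0 - sys_probs[FAILED]                           # system survivability
i_sys = 1.0 - sys_probs[VULNERABLE] - sys_probs[FAILED]   # system invulnerability
print(sys_probs, s_sys, i_sys)
```

For heterogeneous disks, passing four different probability rows to `system_state_probs` gives the same quantities without needing the closed forms; an MDD evaluates the same sums more compactly by sharing sub-paths.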

Disk-Level Analysis Results and Discussions
Based on statistics and survey reports from Avital et al. (2020), Check Point (2020), and Hummel (2019), 11 sets of parameter values are designed for the transition rates in Figure 3 (Table 1), including the attack rates ρgv, ρvf, ρgf (number of attacks per hour), failure rates λgd, λdf, λgf (number of failures per hour), and recovery rates μdg, rvg, rfg (number of repairs per hour). The survivability and invulnerability also depend on the network administrator's capability of handling network attacks or the quality of the existing cyber defense mechanism. For example, a cloud provider with an incident response plan would respond quickly and effectively after a crash caused by DDoS attacks; it can restore the server in a timely manner and keep the system functioning after attacks happen. The parameter rfg models this recovery capability; its effects are investigated through parameter sets i, b, j, and k in Table 1. Figure 6 plots the state probabilities (P0, P1, P2, and P3) of a disk subject to DDoS attacks and disk faults for different values of the attack rate ρgv (sets a, c, d, and e in Table 1) at different time points (from 0 to 54 hours). Among the four sets, ρgv in set a corresponds to the disk with the highest security level, which has seldom been targeted. In contrast, ρgv in set e corresponds to the other extreme case of being in the most heavily attacked environment. ρgv in sets c and d corresponds to intermediate cases between set a and set e. As expected, the good state probability decreases with time. The good state probability under sets a and c falls very slowly as time proceeds due to the small values of ρgv, while this probability under set e drops very quickly in the first 6 hours and then remains the lowest over the considered mission time.
Due to the complicated interactions among the transition rates, the trends for the degradation state probability P1 and the vulnerable state probability P2 are non-monotonic under each considered parameter set, reaching a peak at a different pace and then dropping gradually.

Effects of DDoS Attack Rate ρgv
In particular, P2-e stays the highest all the time and reaches its peak of 0.8 at t=6 hours, while P2-a remains the lowest over the considered mission time and does not reach its own peak until t=42 hours. The turning point (i.e., the time when the peak value is reached) for P2-c is t=24 hours and for P2-d is t=12 hours. Thus, as the attack rate increases, the turning point appears earlier. Conversely, P1-e remains the lowest, reaching a peak value around t=6 hours due to the high-frequency DDoS attack, while P1-a is the largest and reaches its peak at t=42 hours. It can be observed that P1 and P2 share the same turning point under each parameter set. In addition, the values of P1 do not vary much under the different values of ρgv, implying that the rate ρgv affects the probabilities of states 0, 2, and 3 more than the probability of state 1. The failure state probability P3 shows an upward trend as time proceeds. As expected, both the survivability (1-P3) and the invulnerability (1-P2-P3=P0+P1) are the highest under set a (the smallest attack rate).

Effects of Recovery Rate rvg
The survivability and vulnerability of the system are also related to the administrator's capability to cope with attacks targeting the system, or to the defense capabilities of the system itself. This capability is modeled by the recovery rate rvg. Its effects are investigated through the analysis under four parameter sets f, b, g, and h listed in Table 1. In particular, rvg in set h models a strong recovery capability (an expert protects the disk with an effective anti-virus/anti-attack tool); rvg in set f models a weak recovery capability (an amateur user); sets b and g model intermediate cases. Figure 7 plots the disk state probabilities under parameter sets f, b, g, and h. It can be observed that the recovery rate rvg impacts the good (0) and vulnerable (2) state probabilities more significantly than those of states 1 and 3. The good state probability P0 under set f (weak recovery capability) declines more quickly within 36 hours, while P0 under sets g and h decreases only slightly and then reaches a stable level around 0.995.
P1 and P3 both show a growing trend as time proceeds, with slight differences among the four cases compared. As expected, P2 under set h (the highest recovery rate) is the lowest (close to 0) due to the effective recovery action, while P2 under set f (the smallest recovery rate) grows significantly and is the largest among the four cases compared. Due to the complicated interactions among the different transition rates, under set f the survivability (1-P3) is the largest while the invulnerability (1-P2-P3=P0+P1) is the lowest.
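If the stable level mentioned above is read as the long-run (steady-state) distribution of the disk CTMC, it can be obtained in closed form from the balance equations of states 1, 2, and 3. The sketch below assumes the transition structure of Figure 3 as described in the text; the rate values and dictionary keys are hypothetical and are not taken from Table 1.

```python
def steady_state(r):
    """Steady-state probabilities [P0, P1, P2, P3] of the four-state disk CTMC.
    Balance equations: inflow equals outflow for states 1, 2 and 3, plus normalization."""
    a1 = r["lam_gd"] / (r["mu_dg"] + r["lam_df"])          # P1 / P0
    a2 = r["rho_gv"] / (r["r_vg"] + r["rho_vf"])           # P2 / P0
    a3 = (r["lam_gf"] + r["rho_gf"]
          + r["lam_df"] * a1 + r["rho_vf"] * a2) / r["r_fg"]  # P3 / P0
    p0 = 1.0 / (1.0 + a1 + a2 + a3)
    return [p0, a1 * p0, a2 * p0, a3 * p0]

# Hypothetical rates per hour, for illustration only.
base = {"lam_gd": 1e-3, "mu_dg": 0.1, "lam_df": 1e-3,
        "rho_gv": 0.05, "r_vg": 0.5, "rho_vf": 0.01,
        "r_fg": 0.2, "lam_gf": 1e-5, "rho_gf": 1e-4}
weak = steady_state(dict(base, r_vg=0.1))     # weak recovery capability
strong = steady_state(dict(base, r_vg=5.0))   # strong recovery capability
print(weak, strong)
```

Under these placeholder rates, a larger `r_vg` lowers the long-run vulnerable-state probability and raises the long-run good-state probability, in line with the qualitative trend reported for sets f through h.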

Effects of Rescue Rate rfg
In practice, there are several DDoS mitigation solutions according to Ahmed and Kim (2017) and Osanaiye et al. (2016), including, for example, using spare bandwidth, creating a DDoS action plan, improving the security of Internet of Things devices, monitoring traffic levels, or choosing a hosting provider that offers DDoS protection as a service. In this section, we investigate the effects of different mitigation mechanisms after a crash caused by DDoS attacks, which are modeled by the parameter rfg. Figure 8 illustrates each disk state probability over the period of 0 to 24 hours under four sets i, b, j, and k with varying values of rfg. Among these four sets, set k, with the highest rescue rate, corresponds to cases where contingency strategies are performed regularly, leading to the lowest failure state probability P3 and thus the highest disk survivability and invulnerability. P3 under set i, with the lowest rescue rate, increases more significantly over time than under the other three sets and is the highest. In addition, it can be observed that rfg affects P0 and P3 more than P1 and P2 (for which only very slight differences appear among the four sets i, b, j, and k).

System-Level Analysis Results and Discussions
Based on the equations derived in Section 4.2, Figures 9, 10, and 11 plot the system survivability and invulnerability to show the effects of the parameters ρgv, rvg, and rfg, respectively. All the empirical results support the intuition that the system survivability decreases as time proceeds in all the cases. Moreover, the system invulnerability decreases as the attack rate increases, and it increases as the recovery or rescue rate increases. Figure 9 illustrates the intuitive result that the system survivability under set e (Ssys-e), with the highest attack rate, decreases more quickly than the system survivability under the other three sets a, c, and d, and remains the lowest all the time. Ssys-a, with the smallest attack rate, remains the largest during the considered mission time. The system invulnerability under set a (Isys-a) is almost flat, staying the highest (near 1) among the four cases compared due to the lowest attack rate. The system invulnerability under set e (Isys-e) is non-monotonic, beginning with a sharp drop in the first six hours, reaching a minimum value of 0.0165, and then increasing gradually. Isys-e is the lowest among the four cases compared due to the highest attack rate. It can be observed from Figure 10 that the system invulnerability under set h (Isys-h), with the highest recovery rate, remains the highest with only subtle changes over the mission time, while Isys-f under set f, with the lowest recovery rate, declines gradually from 1 to 0.958 during the considered mission time.
It can be observed from Figure 11 that the system survivability Ssys-k and invulnerability Isys-k under set k, with the highest rescue rate, are the largest, while Ssys-i and Isys-i under set i, with the lowest rescue rate, are the lowest among the four sets compared. Because the rescue rate rfg mainly affects the good state (0) and the failure state (3) of each disk, its impact on the system invulnerability is smaller than the impacts caused by changing ρgv and rvg, as shown in Figures 9 and 10.

Conclusions and Future Work
In this paper we suggest a survivability framework that enables the survivability and vulnerability modeling and analysis of cloud RAID storage systems considering both reliability and security threats. Quantitative assessment methods are then presented. Specifically, the CTMC-based method is used to analyze the disk-level survivability and invulnerability. The combinatorial binomial coefficient-based and MDD-based methods are used to analyze the system-level survivability and invulnerability in the case of homogeneous and heterogeneous disks, respectively. Impacts of different attack and recovery parameters (particularly ρgv, rvg, rfg) on the disk and system survivability and invulnerability are investigated through the numerical analysis of an example cloud RAID 5 system.
The disk-level analysis method based on CTMCs is applicable only to exponentially distributed state transition times. In the future, we are interested in investigating semi-Markov models or multiple integrals (Zeng et al., 2019) to accommodate non-exponential transition time distributions for disk state probability analysis. We are also interested in incorporating sequential attack events into the survivability and invulnerability analysis of cloud storage systems.