Balancing Reliability and Cost in Cloud-RAID Systems with Fault-Level Coverage

Based on redundancy techniques, cloud-RAIDs (Redundant Arrays of Independent Disks) offer an effective storage solution for achieving high data reliability. However, their performance can be greatly hindered by the fault-level coverage (FLC) behavior, where an uncovered disk fault may crash the entire system in spite of adequate remaining redundancy. Moreover, different choices of cloud disk providers lead to designs with different overall reliability and cost. Thus, in this paper we formulate and solve optimization problems that determine the combination of cloud disks (from different providers) maximizing the cloud-RAID system reliability or minimizing the total cost. The cloud-RAID reliability is analyzed using a combinatorial and analytical modeling method that considers the effects of the FLC behavior. Multiple case studies are performed to demonstrate the considered optimization problems and the proposed solution methodology.


Introduction
Recent advances in big data, the Internet of Things, and cyber-physical systems have led to great demands on data storage (Atat et al., 2018; Wang et al., 2018; Wang and Alexander, 2019). These demands drive the need to use cloud storage services as the backbone of those technologies. As a continuously growing paradigm, cloud storage is a model in which data are saved in a logical space while the actual disks may span several physical servers managed by different cloud service providers (Deng et al., 2010; Erl et al., 2013). As users often expect to access their data from the cloud storage system anytime and anywhere, any service interruption or failure can negatively impact the reputation and business of the cloud service provider. Thus, it is important to enhance overall data reliability through fault-tolerance techniques. Based on redundancy techniques (Jin et al., 2009; Bausch, 2014), cloud-RAIDs (Redundant Arrays of Independent/Inexpensive Disks) provide one such solution for achieving high data reliability (Fitch and Xu, 2013; Zhang et al., 2013; Liu and Xing, 2015). Their performance, however, may be greatly affected by the imperfect coverage behavior (Xing, 2005; Myers, 2010; Li and Mao, 2016), where, due to an imperfect fault recovery mechanism, a disk fault may not be adequately or timely detected, located, and isolated after its occurrence. Such an undetected or uncovered fault may propagate to other system components, causing extensive damage and even crashing the entire system. Studies in (Amari et al., 1999) showed that increasing the redundancy level does not necessarily enhance the reliability of systems subject to imperfect fault coverage monotonically; beyond a certain level, a further increase in redundancy can actually reduce the system reliability. Hence, it is crucial to address the imperfect fault coverage behavior in system reliability modeling, analysis, and optimization activities.
Some efforts have been expended in optimizing cloud storage systems. For example, an algorithmic approach (Goyal and Kant, 2014) was suggested to optimize cloud storage for improving data-access and storage performance. In Yahyaoui and Moalla (2016), the optimization of a file storage service was considered through classifying customer files based on their usage rates to save storage space in the cloud. In Liu et al. (2018), a heterogeneous cloud storage model was optimized to achieve a tradeoff between system storage and repair costs. In another study, the structure and performance of an object-based cloud storage system were optimized for processing data files with different sizes. In Fu et al. (2016), a block replica placement method was suggested for optimizing the performance of small-file access in cloud storage systems. In Cha and Kim (2018), a software-defined storage using a combination of on-premises and public cloud storage was introduced to improve the I/O performance. In Mansouri et al. (2017), an optimal offline algorithm was designed to optimize the price difference between cloud data storage and cloud network services. In Al-Abbasi and Aggarwal (2018), a framework for erasure-coded storage systems was proposed to quantify and optimize the mean latency over the choice of cloud storage servers and auxiliary bound parameters. The above-mentioned methods neither addressed the imperfect fault coverage behavior nor considered the cloud provider/disk selection problem, which determines the optimal combination of cloud disks balancing the cloud-RAID system reliability and cost. The selection problem belongs to the class of reliability allocation problems, which have been proven to be NP-hard (Chern, 1992; Todinov, 2006). Different models have been suggested for addressing imperfect fault coverage in the reliability analysis of systems in diverse applications.
Typical examples include the element-level coverage (ELC) model (also known as the single-fault model), the fault-level coverage (FLC) model (also known as the multi-fault model), and the performance-dependent coverage model (Levitin and Amari, 2008; Mandava et al., 2016; Mandava and Xing, 2017). In recent work, the cloud disk selection problem was first considered for the cloud-RAID system subject to the ELC, where the performance of the system recovery mechanism depends on the occurrence of each individual disk fault, and thus the fault coverage probability of a disk does not rely on the statuses of other disks in the system. However, in a load-sharing or work-sharing environment, it is more practical that the fault coverage probability depends on the number of disk faults that have happened within a certain recovery window, which can be modeled by the FLC. In this paper, we make advancements by formulating and solving the cloud disk selection problem for cloud-RAID storage systems undergoing the FLC. Both unconstrained and constrained optimization problems are considered. Solutions to the constrained problems provide a balance between the overall reliability and cost when configuring a cloud-RAID system. The rest of the paper is arranged as follows. Section 2 formulates the considered optimization problems. Section 3 describes examples of cloud-RAID storage systems. Section 4 presents the reliability analysis method for cloud-RAID systems undergoing the FLC. Section 5 demonstrates solutions to the considered optimization problems through multiple case studies. Section 6 concludes the work and gives directions for future research.

Formulation of Cloud Disk Selection Problems
In (1), an unconstrained optimization problem is formulated with the objective of minimizing the overall cloud-RAID system unreliability, denoted as UR(t) (i.e., maximizing the system reliability 1 - UR(t)):

minimize UR(t). (1)

In (2) and (3), two different constrained optimization problems are formulated to balance the system reliability and cost. Problem (2) minimizes the cloud-RAID system unreliability UR(t) subject to a constraint C* on the total system cost C:

minimize UR(t), subject to C <= C*. (2)

Problem (3) minimizes the total system cost C subject to a constraint UR* on the cloud-RAID system unreliability:

minimize C, subject to UR(t) <= UR*. (3)

UR(t) in (1)-(3) is evaluated using the combinatorial and analytical modeling method presented in Section 4. C is evaluated as the sum of the costs of all the disks chosen to configure the cloud-RAID system.

Example Cloud-RAID 5 Storage Systems
There are seven levels in the conventional RAID architecture employing different redundancy techniques (except RAID 0, which provides no redundancy) (Jin et al., 2011). RAID level 5, which uses a distributed single-parity code, is selected in this work to illustrate the proposed methodology and optimization. Specifically, data to be saved in RAID 5 are divided into non-overlapping blocks. These blocks are striped across the different disks that form the array (Patterson et al., 1989). The parity stripes are also distributed across those disks. RAID 5 is able to tolerate any single disk failure that can be detected and located successfully: if any disk malfunctions or becomes unavailable, the stripes on that disk can be restored using the parity stripe and the data stripes from the remaining disks (specifically, through an exclusive-OR operation if an even parity code is utilized). If the disk drives within the array have different capacities, the usable capacity of the whole array is determined by the disk with the smallest capacity. Figure 1 shows the architecture of an example cloud-RAID 5 system with three disk drives coming from different cloud service providers. Users' data (A, B, C) are divided into stripes (for example, A1 and A2 for A). Both data and parity stripes are distributed among the three disk drives. The system is essentially a 2-out-of-3: G model, meaning that the system is good (reliable) if at least two of the three disks are functioning correctly. Figure 2 shows the architecture of a larger cloud-RAID 5 with five different disks. It is a 4-out-of-5: G model, meaning that the system is good if at least four of the five disks are functioning correctly.
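Before coverage effects are introduced, a k-out-of-n: G structure such as the two above can be checked by direct state enumeration. The following is a minimal sketch (the function name is ours, not the paper's; coverage effects are deliberately ignored here and handled in Section 4):

```python
from itertools import combinations

def k_out_of_n_reliability(k, disk_reliabilities):
    """Probability that at least k of the n independent (possibly
    non-identical) disks are working; coverage effects are ignored."""
    n = len(disk_reliabilities)
    total = 0.0
    for m in range(k, n + 1):                 # number of working disks
        for working in combinations(range(n), m):
            prob = 1.0
            for d in range(n):
                p = disk_reliabilities[d]
                prob *= p if d in working else (1.0 - p)
            total += prob
    return total

# 2-out-of-3: G model with three disks of reliability 0.9 each
print(k_out_of_n_reliability(2, [0.9, 0.9, 0.9]))  # ~ 0.972
```

For the 4-out-of-5 system of Figure 2, the same call with k = 4 and five disk reliabilities applies. Note that this enumeration is exponential in n, which motivates the BDD-based method of Section 4.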

Combinatorial Method for Analyzing Cloud-RAID 5 with FLC
Under the FLC, the fault coverage probability depends on the number of disk faults happening to a particular group (disk array) within a certain recovery window τ (Levitin and Amari, 2007; Myers and Rauzy, 2008). Specifically, a set of fault coverage factors ci is evaluated for a specific disk group: the first disk fault is covered with probability c1, the second disk fault with probability c2, and so on. By definition, the coverage factor involving no disk faults, c0, is always 1. Formula (4) presents an example method of evaluating ci for a system with n identical disk drives, where i denotes the fault number and λ denotes the constant failure rate of each disk:

ci = e^(-(n-i)λτ), i = 1, 2, ..., (4)

i.e., the i-th fault is covered if none of the remaining n - i disks fails within the recovery window τ.
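As a concrete illustration, the multi-fault coverage factors can be computed as follows. This is a sketch assuming the exponential evaluation just described; the function name and parameter values are illustrative, not from the paper:

```python
import math

def coverage_factor(i, n, lam, tau):
    """FLC coverage factor c_i for a group of n identical disks with
    constant failure rate lam: the i-th fault is covered only if none
    of the remaining n - i disks fails within the recovery window tau."""
    if i == 0:
        return 1.0  # c0 = 1 by definition
    return math.exp(-(n - i) * lam * tau)

# Example: n = 3 disks, lam = 1e-4 /hr, recovery window tau = 3 hrs
c1 = coverage_factor(1, 3, 1e-4, 3.0)  # first fault: two competing disks remain
c2 = coverage_factor(2, 3, 1e-4, 3.0)  # second fault: only one disk remains
```

Note that c2 > c1 here, since fewer competing disks remain after each successive fault.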
In the case of non-identical disks with different failure rates (in general, different time-to-failure distributions), formula (4) needs to be modified to use a disk-specific reliability evaluation based on each disk's time-to-failure distribution function. Specifically, let cid denote the coverage probability associated with the i-th fault happening to disk d in the cloud-RAID system. As mentioned in Section 3, the cloud-RAID 5 system can tolerate any single disk failure if the fault is detected and located successfully. This first disk failure (i = 1) is covered with coverage probability c1d. For the example 2-out-of-3 cloud-RAID system with exponential disks, if disk 1 fails first, its coverage probability is the probability that disks 2 and 3 both survive the recovery window:

c11 = e^(-(λ2+λ3)τ); (5)

if disk 2 fails first, c12 = e^(-(λ1+λ3)τ) is obtained analogously from (5).
To consider the effects of FLC, the binary decision diagram (BDD)-based method (Xing and Amari, 2015; Xing et al., 2019) can be applied with an extension that inserts the corresponding coverage factor cid onto the relevant paths in the system BDD model. The extended BDD-based reliability analysis involves the following major steps:
1) Variable Ordering: assign indexes to input disk variables using the numerical order.
2) BDD Generation: generate the BDD model without considering the FLC by recursively applying the manipulation rules in (6), which combine two sub-BDDs into one BDD:

g ◊ h = ite(x, g1, g0) ◊ ite(y, h1, h0)
      = ite(x, g1 ◊ h1, g0 ◊ h0)   if index(x) = index(y),
      = ite(x, g1 ◊ h, g0 ◊ h)     if index(x) < index(y),
      = ite(y, g ◊ h1, g ◊ h0)     if index(x) > index(y).   (6)

g and h in (6) are Boolean functions in the if-then-else (ite) format representing the two sub-BDDs to be combined (x and y are their root nodes), and ◊ denotes a logical OR/AND operation. To apply the rules, the indexes of x and y are compared. In the first case, the indexes are identical, meaning x and y are the same variable; the operation is performed between their child nodes. In the second and third cases, x and y have different indexes, meaning they are variables of different disks; the variable with the smaller index becomes the root node of the combined BDD, and the logic operation is performed between each child node of the smaller-index node and the entire sub-BDD rooted at the larger-index node.
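The manipulation rules can be sketched in code as follows. This is an illustrative, unreduced implementation (no merging of isomorphic subtrees), with the sink nodes represented as the plain integers 0 and 1:

```python
class Node:
    """BDD node for one disk variable: 'low' is the 0-edge child (disk
    operating), 'high' is the 1-edge child (disk failed)."""
    def __init__(self, index, low, high):
        self.index, self.low, self.high = index, low, high

def apply_op(g, h, op):
    """Combine two sub-BDDs with a logical 'AND'/'OR' by comparing root
    indexes and recursing on children, following the three rule cases."""
    if isinstance(g, int) and isinstance(h, int):   # both operands are sinks
        return (g or h) if op == 'OR' else (g and h)
    if isinstance(g, int):                          # make g the non-sink operand
        g, h = h, g
    if isinstance(h, int) or g.index < h.index:
        # Smaller-index variable becomes the root; operate between each
        # child of g and the entire sub-BDD h.
        return Node(g.index, apply_op(g.low, h, op), apply_op(g.high, h, op))
    if g.index == h.index:
        # Same variable: operate between corresponding children.
        return Node(g.index, apply_op(g.low, h.low, op),
                    apply_op(g.high, h.high, op))
    return Node(h.index, apply_op(g, h.low, op), apply_op(g, h.high, op))

def evaluate(node, failed):
    """Follow the 1-edge when a node's disk is in the failed set,
    otherwise the 0-edge; returns the sink reached (1 = system failed)."""
    while not isinstance(node, int):
        node = node.high if node.index in failed else node.low
    return node

# System-failure function of the 2-out-of-3: G model: the system fails
# when at least two disks fail, (x1 AND x2) OR (x1 AND x3) OR (x2 AND x3).
x1, x2, x3 = Node(1, 0, 1), Node(2, 0, 1), Node(3, 0, 1)
f = apply_op(apply_op(apply_op(x1, x2, 'AND'),
                      apply_op(x1, x3, 'AND'), 'OR'),
             apply_op(x2, x3, 'AND'), 'OR')
```

For example, evaluate(f, {1}) reaches sink 0 (a single failure is tolerated), while evaluate(f, {1, 2}) reaches sink 1 (system failed).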
3) Coverage Factor Node Insertion: a node associated with each relevant coverage factor cid is inserted onto the relevant operational path (i.e., a path from the root node to sink node '0') in the BDD model generated in step 2, for the i-th failure happening to disk d.
For the example cloud-RAID 5 model considered in this paper, all operational paths involve either no failures or one disk failure. Thus, for paths involving a single disk failure happening to disk d, node c1d is inserted; for paths involving no disk failures, node c0 is simply inserted. For systems that can tolerate two disk failures (e.g., the cloud-RAID 6 model in Mandava and Xing, 2017), there exist operational paths involving no failures, one disk failure, and two disk failures. For a path involving two disk failures, e.g., happening to disk d1 first and then disk d2, nodes c1d1 and c2d2 are inserted onto the path.
4) BDD Evaluation: the cloud-RAID system unreliability UR(t) is evaluated as the sum of the probabilities of all disjoint paths from the root node of the BDD constructed in step 3 to sink node '1'. Similarly, the reliability of the cloud-RAID system is the sum of the probabilities of all disjoint paths from the root node to sink node '0'.
Each single-path probability is evaluated as the product, over all disks d involved on the path, of the disk reliability pd (if the 0-edge of disk d appears on the path) or the disk unreliability 1 - pd (if the 1-edge appears on the path).
The reliability pd of each single disk d is either given as an input parameter or derived from its time-to-failure distribution parameter(s). For example, if disk d has an exponential time-to-failure distribution with parameter λd, then pd = e^(-λd·t); if disk d has a Weibull time-to-failure distribution with scale parameter λd and shape parameter βd, then pd = e^(-(λd·t)^βd). Figure 3 illustrates the BDD model constructed for the 2-out-of-3 cloud-RAID 5 with FLC. Each non-sink node has two outgoing edges: the solid edge (0-edge) represents that the disk represented by the node is operating, and the dashed edge (1-edge) represents that the disk has failed.
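Putting the four steps together for the 2-out-of-3 example, the FLC-aware unreliability can be computed directly from the disjoint operational paths. The following is a minimal sketch assuming exponentially distributed disk lifetimes (the function name is ours):

```python
import math

def cloud_raid5_2of3_unreliability(lams, t, tau):
    """Unreliability of the 2-out-of-3: G cloud-RAID 5 under FLC with
    exponential disks; lams holds the three disk failure rates."""
    p = [math.exp(-lam * t) for lam in lams]   # disk reliabilities p_d
    q = [1.0 - pd for pd in p]                 # disk unreliabilities 1 - p_d
    reliability = p[0] * p[1] * p[2]           # path with no failures (c0 = 1)
    for d in range(3):
        # c1d: the first fault, on disk d, is covered only if the other
        # two disks survive the recovery window tau.
        c1d = math.exp(-sum(lams[j] for j in range(3) if j != d) * tau)
        rest = [p[j] for j in range(3) if j != d]
        reliability += c1d * q[d] * rest[0] * rest[1]
    return 1.0 - reliability
```

With τ = 0 (perfect coverage) this reduces to the plain 2-out-of-3 unreliability; increasing τ lowers each c1d, so the chance that a first fault goes uncovered grows and UR(t) increases.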
Case Studies
For the values of n (number of disks) and w (number of candidate providers) used in practical implementations of the cloud-RAID system, the brute-force approach, which searches all possible combinations to find the optimal solution to problems (1)-(3), is sufficient. For large values of n and w, heuristic approaches, e.g., the genetic algorithm (Tannous et al., 2011; Boddu and Xing, 2013; Bhunia et al., 2017), and meta-heuristic approaches (Dahiya et al., 2019) may be applied to solve those problems. Table 1 presents the parameters of the cloud disk drives used in the four case studies of the following subsections, including the constant failure rate λd (the exponential distribution is assumed) and cost Cd for disks from five providers (v1, v2, v3, v4, v5). The cost Cd is assigned based on recent prices of several top cloud service providers in the market, such as AWS, Dropbox, Google, and iCloud (Amazon, 2019; Dropbox, 2019; Google, 2019; iCloud, 2019).
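The brute-force search over provider combinations can be sketched as follows. The catalog format (provider name mapped to a (failure rate, cost) pair) and the system_unreliability callback are illustrative placeholders, not the paper's notation; in the paper, UR(t) would come from the BDD-based method of Section 4:

```python
import math
from itertools import combinations_with_replacement

def select_disks(providers, n, system_unreliability, cost_cap=None, ur_cap=None):
    """Enumerate all multisets of n disks drawn from the provider catalog.
    With no caps this solves problem (1); with cost_cap = C* it solves
    problem (2); with ur_cap = UR* it solves problem (3)."""
    best = None
    for combo in combinations_with_replacement(sorted(providers), n):
        lams = [providers[v][0] for v in combo]
        cost = sum(providers[v][1] for v in combo)
        ur = system_unreliability(lams)
        if cost_cap is not None and cost > cost_cap:
            continue                     # violates the cost constraint C*
        if ur_cap is not None:
            if ur > ur_cap:
                continue                 # violates the constraint UR*
            key = (cost, ur)             # problem (3): minimize cost
        else:
            key = (ur, cost)             # problems (1)/(2): minimize UR(t)
        if best is None or key < best[0]:
            best = (key, combo)
    return best

# Hypothetical catalog (NOT Table 1's values) and a stand-in evaluator.
catalog = {'v1': (3e-4, 2.0), 'v2': (2e-4, 3.0), 'v3': (1e-4, 7.0)}

def ur_eval(lams):
    """Stand-in unreliability evaluator for demonstration only."""
    return 1.0 - math.exp(-sum(lams) * 1000.0)

print(select_disks(catalog, 3, ur_eval, cost_cap=10.0))
```

The returned pair holds the winning (key, combination); the number of candidates grows as C(w + n - 1, n), e.g., 10 for n = 3 disks and w = 3 providers, matching Table 2.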

Case Study 1: 2-out-of-3 Cloud-RAID 5 with 3 Disk Providers
Providers v1, v2, v3 are used in this case study. Table 2 lists all ten possible combinations and the corresponding system unreliability and cost for three different sets of t and τ values. As the mission time increases (comparing columns 3 and 4) or the recovery window lengthens (comparing columns 4 and 5), the system unreliability increases. For problem (1), the optimal combination is (v3, v3, v3) for all three sets of t and τ values listed, with minimum system unreliabilities of 0.002584, 0.117226, and 0.117724, respectively. This is intuitive since, among the three available providers, v3 has the lowest failure rate (the most reliable disks). For problem (2) with C* = $10, the optimal solution is (v1, v2, v2) for all three sets of t and τ, with a cost of $9.8. For problem (3) under τ = 3 hrs, t = 1000 hrs, and UR* = 0.01, the optimal solution is (v2, v2, v2) with a system cost of $12.6 and unreliability of 0.006943.

Case Study 2: 2-out-of-3 Cloud-RAID 5 with 5 Disk Providers
Table 3 presents the optimization results for the 2-out-of-3 cloud-RAID 5 model considering five disk providers v1, v2, v3, v4, v5 for problems (1)-(3), assuming t = 1000 hrs and τ = 3 hrs.

Case Study 3: 4-out-of-5 Cloud-RAID 5 with 3 Disk Providers
Disk providers v1, v2, v3 are used in this case study. Table 4 lists all 21 possible combinations and the corresponding system unreliability and cost for three different sets of t and τ values. For problem (1), the optimal combination is (v3, v3, v3, v3, v3) for all three sets of t and τ values, with minimum system unreliabilities of 0.008277, 0.290459, and 0.291487, respectively. This is intuitive since, among the three available providers, v3 has the most reliable disks; without any cost constraint, v3 should be selected for all five disks. For problem (2) with C* = $20, the optimal solution is (v1, v2, v2, v2, v3) for all three sets of t and τ, with a system cost of $19.6. For problem (3) under τ = 3 hrs, t = 1000 hrs, and UR* = 0.07, the optimal solution is (v1, v2, v2, v2, v2) with a system cost of $18.2 and unreliability of 0.057174.

Case Study 4: 4-out-of-5 Cloud-RAID 5 with 5 Disk Providers
Table 5 presents the optimization results for the 4-out-of-5 cloud-RAID 5 model using five disk providers v1, v2, v3, v4, v5 for problems (1)-(3), assuming t = 1000 hrs and τ = 3 hrs.

Conclusion and Future Work
Disk drives from different cloud service providers are often characterized by different reliability and cost. Different combinations of the available disk choices thus lead to different overall cloud storage system reliability and cost. This paper formulates and solves three types of cloud disk provider selection problems for cloud-RAID storage systems subject to the FLC: an unconstrained problem minimizing the overall system unreliability, and two constrained problems balancing the overall system reliability and cost. The reliability of the considered cloud-RAID system is analyzed using an extended BDD-based method that places no restriction on the types of disk time-to-failure distributions. The solution methodology and proposed optimizations are demonstrated through examples and four case studies.
One direction of our future work is to extend the proposed methodology to optimize multi-state cloud-RAID storage systems subject to performance degradation and the FLC. We are also interested in studying the performance-dependent coverage model for reliability analysis and optimization of cloud-RAID systems.