A Cloud-Based Adaptive Disaster Recovery Optimization Model



Introduction to Disaster Recovery Optimization
Business continuity plans and disaster recovery plans are becoming standard requirements for any organization's IT department, sometimes as a government regulation and sometimes as a standard, for example, ISO 22301 (ISO, 2011) for business continuity and ISO 24762 for disaster recovery (ISO & IEC, 2008). This is already the case in big organizations because they realize the benefits; moreover, they have the capacity to dedicate some of their resources to this purpose. However, many small and medium businesses (SMBs) believe that the cost is too high. A study by Semantic shows that 57% of small businesses and 47% of medium businesses do not have a disaster recovery plan (Semantic, 2011). More recently, a study by Cisco and Fortune showed that about 49% of small business owners do not have a disaster recovery plan, fearing the extra costs of remote sites and operational overhead (Cisco, 2015). Fortunately, advances in cloud computing and pay-as-you-go plans have made startup costs significantly lower; thus, more small and medium organizations can afford disaster recovery plans than at any time before. However, SMBs need to plan their budgets to make sure that they spend their funds in the most effective way.

Importance of Disaster Recovery Optimization
SMBs need an adaptive model that covers their disaster recovery and business continuity needs while directing resources where they matter most and staying within a pre-assigned limited budget.
In order to design a suitable disaster recovery plan (DRP) for an SMB, a business analyst must make a cost-benefit analysis of the technology used; for example, how much does one hour of downtime cost? How much does the loss of 1 MB of data cost? How much can the business tolerate? If the business has more than one application or database, each might need a separate analysis. Moreover, the indirect costs of downtime and loss of data should also be considered, such as loss of reputation, loyalty and customer confidence. The outcome of this analysis is the right recovery time objective (RTO) and recovery point objective (RPO). These can be set for the whole system or, even better, for each independent part of the system.

Related Work
When cloud computing became an important option for organizations, cloud providers offered it in different forms, such as platform as a service (PaaS), infrastructure as a service (IaaS), or software as a service (SaaS). However, only later was disaster recovery as a service introduced and discussed by researchers such as (Wood et al. 2010), who have shown how the cloud can be an excellent alternative for disaster recovery sites. Others (Alhazmi & Malaiya, 2013) have compared co-location disaster recovery and cloud-based disaster recovery. Now disaster recovery as a service (DRaaS) is becoming a common type of cloud service.
The issue of optimizing resources and trying to find the best allocation under some constraints is not a new topic. For example, (Buyya et al. 2001) have proposed an algorithm to allocate heterogeneous resources on the global grid. They have considered a time-budget constrained model. Also, recently, (Wientraub & Cohen, 2015) have proposed a combination of IaaS, PaaS and SaaS to optimize resource allocation. Moreover, (Manikandaprabhu1 & Senthil, 2013) have discussed some dynamic provisioning plans that reduce cost by choosing between on-demand and pre-reserved resources. They have shown some promising results. However, their models are based on multiple service providers and are for generic cloud resource allocation, while our work is specifically for disaster recovery purposes and our model does not take service providers into account. (Banu & Saravanan, 2013) have proposed a two-phase algorithm to allocate resources at minimum cost and with improved quality of service in the cloud. With similar objectives, (Poobalan & Selvie, 2013) have proposed better management of resources by utilizing an optimal cloud resource provisioning algorithm (OCRP). All have considered the pre-reserved and the on-demand options provided by infrastructure as a service (IaaS) cloud providers. The pre-reserved option costs less but needs to be reserved earlier, whereas the on-demand option is more expensive but more convenient because it does not require prior reservation.

The Proposed Model
In this work, we present an adaptive cost-optimizing algorithm based on a cost or budget constraint and some other factors given by the user. The business remains responsible for setting its RPO and RTO and also for identifying potential risks and changing levels of risk over time. These changes are sometimes predictable and can be set according to alerts and external threats, such as weather, or internal threats, such as scheduled maintenance.
The optimization model will allocate more resources and increase protection during these periods; this makes the optimization model flexible.
The goal of this model is a flexible, adaptive disaster recovery optimization model that manages resources according to quality of service goals, financial limitations and internal and external threats, yet is applicable and practical and has very little management overhead. Moreover, the proposed model is vendor independent and can be applied to various cloud service providers; it is not limited to particular billing and service plans. Therefore, in its current form it is generic and can be tailored to the specific needs of an SMB with some changes to accommodate the requirements of cloud service providers (CSPs).
The following section gives the main parameters, inputs and outputs of the model and the main algorithm. Next, in section 3, the optimization model is simulated using a scenario to illustrate how the model works, how it can be applied, and what will happen in a real operational environment. Finally, in section 4 we present a conclusion, identify strengths and limitations of the model, suggest possible improvements and briefly discuss some future research directions.

The Adaptive Cloud Based Disaster Recovery Model
Here, we present the proposed model. The next section describes the parameters used in the model, section 2.2 gives the outline of the model, and section 2.3 presents the algorithm.

The Main Factors
The factors that impact the proposed system are criticality levels, risk levels, disaster recovery tiers and cost. Each factor is explained below.
The first factor is criticality level, which represents the level of importance of the data or an application. It can be assigned to a certain segment of data or to an application. This is usually based on the business need and business analysis. The analysis should set the required Recovery Time Objective (RTO) and Recovery Point Objective (RPO). For simplification, we will assume three levels of criticality (high, medium and low); segments with a higher level of criticality have higher priority when they are allocated in the cloud.
The second factor is risk level. A study by (Qurium, 2013) shows that although disaster recovery is often associated with natural disasters, nature is responsible for only about 5% of incidents (see Table 1), while the other 95% are mainly due to hardware and software failures and human errors. This can help system administrators pay closer attention to their disaster recovery system during specific periods of time, particularly during software and hardware installations, upgrades and reconfiguration. Because risk increases during these times, an optimized disaster recovery system can allocate more resources then than at other times and thus utilize resources more efficiently.
Here, we will also assume that risk changes over time and that there are three levels of risk (high, medium and low). The risk level of a disaster happening can often be determined in advance, although some natural disasters, such as earthquakes or fires, come without warning. Others, like floods and storms, come with short notice because they can be predicted by weather forecasts. Moreover, some disasters actually happen during prescheduled system upgrades and system maintenance. Those are ideal for this system because raising the risk level during these activities allocates more resources to protect the data better. Figure 1 shows a hypothetical change of risk level over time.
Figure 1. Risk level changes over a period of time
The third factor is the disaster recovery tier. There are several disaster recovery tier schemes reviewed in (Alhazmi & Malaiya, 2012). However, in the cloud, we prefer a scheme designed for the cloud; hence, we will choose Firdhous' (Firdous, 2014) classification of Disaster Recovery as a Service (DRaaS). Firdhous has proposed three levels of DRaaS: CDDRaaS, WMDRaaS and HTDRaaS. However, Firdhous does not give full descriptions for these levels; thus, we are defining these levels using (Carroll, 2013).
Disaster recovery system is fully mirrored with the original system and fully synchronized and ready to take over operation; less than 5 minutes; less than 5 minutes
The fourth factor is cost, a main constraint here, because many small businesses do not want to spend more on extra storage and recovery systems. A study conducted by Forrester has shown that cost prevents many SMBs from owning a disaster recovery system; for example, 49% worry about the cost of a remote site, while 46% are concerned about the cost of extra hardware in the recovery site. Moreover, 42% answered that the cost of implementing a DR system is the main reason they do not have one (Cisco, 2015). Hence, the cost constraint is very important. Cloud technology and DRaaS could reduce the upfront cost and allow more companies and small businesses to own disaster recovery systems.

The Cloud-Based Adaptive Disaster Recovery Optimization Model (CBADROM)
The Cloud-Based Adaptive Disaster Recovery Optimization Model (CBADROM) assumes that data can be divided into independent segments $S_1, S_2, S_3, \ldots, S_n$, and that each segment $S_i$ has a criticality level associated with it.
Based on business analysis and business need, there can be $m$ criticality levels ($C_1, C_2, \ldots, C_m$) (see Table 4), and there are also $k$ risk levels for the whole system. Moreover, the cost is strongly linked to the disaster recovery tiers and the billing policy of the disaster recovery service provider. In other words, CBADROM takes four inputs: two direct inputs, which are cost and current level of risk, and two inputs that need configuration, which are the criticality/risk/disaster recovery tier (CRT) matrix (see Table 3) and the segment criticality map (see Table 4). Figure 2 illustrates the model's main outline and how the components interact. The CRT matrix (see Table 3) suggests multiple modes of operation and resource scheduling based on risk. Here we assume the role of a business owner and build a criticality requirement table; in practice, this should be built by the business analyst to reflect the actual business needs of an organization. Table 3, below, shows an example of how a system analyst must set, for each criticality level and risk level, the desired recovery time objective (RTO) (how much time the system needs to recover) and recovery point objective (RPO) (how much data loss can be tolerated). This can be achieved by analyzing the impact of loss on the business and must take into account direct, indirect, short-term and long-term impacts. After the CRT matrix is built, segments must be given the appropriate criticality level (Table 4).
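The two configured inputs can be sketched as plain lookup tables. The following is a minimal Python sketch, not the authors' implementation: the names `Level`, `CRT`, `SEGMENT_CRITICALITY` and `tier_for` are hypothetical, and the entries are not copied from Tables 3 and 4 but inferred from the tier assignments described in the worked scenario of the application section (S1 and S5 always on HTDRaaS; S4 and S6 moving to WMDRaaS at medium risk and HTDRaaS at high risk; S2 and S3 moving to WMDRaaS only at high risk).

```python
from enum import IntEnum


class Level(IntEnum):
    """Three-level scale used for both criticality and risk."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Criticality/risk/tier (CRT) matrix: (criticality, risk) -> DRaaS tier.
# Assumption: values inferred from the worked scenario, not from Table 3.
CRT = {
    (Level.HIGH, Level.LOW): "HTDRaaS",
    (Level.HIGH, Level.MEDIUM): "HTDRaaS",
    (Level.HIGH, Level.HIGH): "HTDRaaS",
    (Level.MEDIUM, Level.LOW): "CDDRaaS",
    (Level.MEDIUM, Level.MEDIUM): "WMDRaaS",
    (Level.MEDIUM, Level.HIGH): "HTDRaaS",
    (Level.LOW, Level.LOW): "CDDRaaS",
    (Level.LOW, Level.MEDIUM): "CDDRaaS",
    (Level.LOW, Level.HIGH): "WMDRaaS",
}

# Segment criticality map: segment name -> criticality level.
# Assumption: levels inferred from the worked scenario, not from Table 4.
SEGMENT_CRITICALITY = {
    "S1": Level.HIGH,
    "S5": Level.HIGH,
    "S4": Level.MEDIUM,
    "S6": Level.MEDIUM,
    "S2": Level.LOW,
    "S3": Level.LOW,
}


def tier_for(segment: str, risk: Level) -> str:
    """Look up the tier a segment should use at the current risk level."""
    return CRT[(SEGMENT_CRITICALITY[segment], risk)]
```

Keeping the matrix keyed by (criticality, risk) means a change in the current risk level alters every segment's target tier without touching the segment map.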

The Cloud-Based Adaptive Disaster Recovery Optimization Model (CBADROM) Algorithm
The algorithm is given in Figure 3, below. The four inputs are shown in the top box, and the main part is shown in the bottom part. The system will call allocate_all() once every period of time; here, the default is one day, which can be extended or shortened as desired. Figure 4 shows the algorithm for the functions. Cost() estimates the cost of allocating a segment on a specific tier, based on its size. Allocate_all() goes through all segments, starting with the highest criticality level, allocates them according to this priority, then moves to the next lower criticality level, and so on. All of this is based on the criticality-risk-tier matrix and the segment-criticality map.
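The functions described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the `Tier`, `Segment` and `Allocator` classes, the integer cost unit, and the `stop_on_failure` flag are all hypothetical. With the default `stop_on_failure=True` the sketch follows Figure 4, which returns failure as soon as one segment cannot be allocated; with `False` it keeps going and skips segments that do not fit, which is the behavior the day-7 step of the worked scenario describes.

```python
from dataclasses import dataclass, field


@dataclass
class Tier:
    name: str
    cost_per_gigabyte: int  # assumed unit: cents per GB per planning period


@dataclass
class Segment:
    name: str
    size_gb: int
    criticality: int  # 3 = high, 2 = medium, 1 = low


def cost(segment: Segment, tier: Tier) -> int:
    """Estimate the cost of placing a segment on a tier, based on its size."""
    return segment.size_gb * tier.cost_per_gigabyte


@dataclass
class Allocator:
    budget: int   # remaining funds, same unit as Tier.cost_per_gigabyte
    crt: dict     # (criticality, risk) -> tier name
    tiers: dict   # tier name -> Tier
    placement: dict = field(default_factory=dict)

    def allocate(self, segment: Segment, tier: Tier) -> bool:
        """Deduct the cost and record the placement if the budget allows it."""
        price = cost(segment, tier)
        if self.budget - price < 0:
            return False
        self.budget -= price
        self.placement[segment.name] = tier.name
        return True

    def allocate_all(self, segments, risk, stop_on_failure: bool = True) -> int:
        """Allocate segments from highest to lowest criticality.

        Returns the number of allocated segments, or 0 on the first failure
        when stop_on_failure is True (as in Figure 4).
        """
        allocated = 0
        for level in (3, 2, 1):
            for segment in segments:
                if segment.criticality != level:
                    continue
                tier = self.tiers[self.crt[(level, risk)]]
                if self.allocate(segment, tier):
                    allocated += 1
                elif stop_on_failure:
                    return 0
        return allocated
```

A scheduler would create one `Allocator` with the remaining budget and call `allocate_all` once per planning period with the current risk level.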

Applying the Cloud-based Disaster Recovery Optimization Model
Here we examine a hypothetical scenario. As an assumption, we will consider the criticality-risk-tier matrix given by Table 3. Suppose the company decides to spend only $100 on disaster recovery. The company has a set of systems and applications; its business analyst determined that the system can be divided into six segments, as shown in Table 4. The analyst assigned a criticality level to each segment, and the size of each segment is also shown in the last column. The company has obtained a list of prices from its cloud service provider (CSP), along with the RTO/RPO specifications, all illustrated in Table 5. Using Tables 3, 4, and 5, we have produced Table 6, which shows the daily cost of the whole system for different risk levels. The company's analyst decides that risk will be medium on the third day, high on the fifth day, and low on all other days.
Figures 5 and 6 show how the model allocates the appropriate disaster recovery tier to each segment based on the criticality/risk matrix given in Table 3. Here is the scenario:
- Days 1 and 2 are low-risk days; therefore, S1 and S5 are allocated to HTDRaaS while S2, S3, S4 and S6 are allocated to CDDRaaS.
- On day 3, risk is elevated to medium. Based on the criticality/risk matrix in Table 3, only S4 and S6 need to be upgraded to WMDRaaS, while the rest stay unchanged.
- On day 4, risk is lowered to low, causing the downgrade of S4 and S6 back to CDDRaaS.
- On day 5, risk is elevated again, this time to high, causing an upgrade of S4 and S6 to HTDRaaS and of S2 and S3 to WMDRaaS.
- On day 6, risk is lowered to low, causing a downgrade of S2, S3, S4 and S6 to CDDRaaS.
- On day 7, the funds are nearly exhausted (the balance is only about $5.40), so most of the segments lose coverage. S1 is allocated at a cost of $4.50; the remaining balance is then too low, so the model has to skip S5, S4, S6 and S2 and allocates only S3, as it is the first segment that can be allocated with the available funds.
Given this scenario in Figures 5 and 6, it is clear that the company needs to allocate more funds to disaster recovery to avoid the drop in the disaster recovery service that occurred on the seventh day.This occurred due to the lack of planning of the system but can be remedied by increasing the planning from one day to a longer period of time, thus avoiding the drop of DR for all segments.
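The day-by-day budget depletion in this scenario can be checked with a short calculation. The sketch below is illustrative only: the function and variable names are hypothetical, amounts are held in integer cents to avoid rounding drift, and the whole-system daily costs are assumptions taken from figures quoted in the text ($14.10 on a low-risk day, $22.10 on a high-risk day), with the medium-risk cost of $16.10 derived from the stated four-day total of $58.40 rather than read from Table 6.

```python
# Whole-system cost per day by risk level, in cents.
# Assumption: low and high are quoted in the text; medium is derived from
# the stated four-day total: 5840 - 3 * 1410 = 1610.
DAILY_COST_CENTS = {"low": 1410, "medium": 1610, "high": 2210}

# Risk level for days 1 to 6 of the scenario.
RISK_BY_DAY = ["low", "low", "medium", "low", "high", "low"]


def balance_trace(budget_cents, risks, daily_cost):
    """Return the remaining balance after each day of one-day planning."""
    trace = []
    balance = budget_cents
    for risk in risks:
        balance -= daily_cost[risk]
        trace.append(balance)
    return trace


trace = balance_trace(10_000, RISK_BY_DAY, DAILY_COST_CENTS)
for day, balance in enumerate(trace, start=1):
    print(f"day {day}: remaining balance ${balance / 100:.2f}")
```

Under these assumptions the balance after day 4 is $41.60 and the balance after day 6 is $5.40, which is less than the $14.10 a low-risk day costs and explains the loss of coverage on day 7.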
Figure 5. The bars represent the risk, the gray shows the daily cost, and the black shows the cumulative cost
Figure 6. Segment allocation for the 8 days using 1-day planning
For the previous example, we have used day-by-day planning, and we have seen that on the seventh day most segments could not be allocated and on the eighth day all of them, even critical segments, are left without coverage. Therefore, we shall try a four-day allocation system. We expect that this way the results will change and the most critical segments will be favored over others. Let us repeat the scenario but with four-day planning (see Figure 7):
- The first four days will cost a total of $58.40 to allocate all segments. Thus, the remaining balance is $41.60.
- The estimated cost of the next four days is 22.10+14.10+14.10+14.10=$64.40, which is too high; therefore, the most critical parts will be allocated first.
- The most critical segment, S1, is allocated first; at $4.50 per day it costs 4.50*4=$18.00 for the four days, so the remaining balance is 41.60-18.00=$23.60.
- The next highly critical segment, S5, will be allocated, costing $14.40 for the four days; the remaining balance will be 23.60-14.40=$9.20.
- Then, the system will look at the next segment, S4, which costs 1.20*3+2=$5.60 and will be allocated for four days. The remaining balance will drop to 9.20-5.60=$3.60.
- Then, the system will go to S6, only to discover that the needed amount is $8.40 while the balance is enough for two days at a low tier: 3.60-3.60=$0.00.
- Here, we notice that S2 and S3 are not allocated anymore for the second through fourth days because they have low criticality status, as illustrated by Figure 7.
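The decision that triggers priority-based allocation in the steps above is a simple comparison between the estimated cost of the planning window and the remaining balance. A minimal sketch follows; the function names are hypothetical, amounts are in integer cents, and each of the three low-risk days is taken at $14.10, the value that makes the stated window total of $64.40 add up.

```python
def window_cost(daily_costs_cents):
    """Estimated cost of covering every segment for the whole window."""
    return sum(daily_costs_cents)


def window_fits(balance_cents, daily_costs_cents):
    """True if the balance covers the full window at the planned tiers."""
    return window_cost(daily_costs_cents) <= balance_cents


# Second four-day window: one high-risk day followed by three low-risk days.
next_window = [2210, 1410, 1410, 1410]
balance = 4160  # $41.60 left after the first four days

estimate = window_cost(next_window)
shortfall = max(0, estimate - balance)
print(f"estimate ${estimate / 100:.2f}, shortfall ${shortfall / 100:.2f}")
```

When `window_fits` returns false, as it does here, the model falls back to allocating segments one at a time in order of criticality until the balance runs out.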

Conclusions
Cloud computing has given the area of disaster recovery a boost, as it has significantly reduced upfront costs and thus increased interest in disaster recovery systems. Moreover, it has given disaster recovery systems more flexibility and made them dynamic and scalable. However, for small and medium businesses the issue of cost is still a concern, especially in times of slow economies. Here, we have presented a model to optimize resources within an available budget. The model needs a business strategy as input, namely the criticality-risk-tier matrix and the segment-criticality map, together with cost and current risk, in order to manage resources in real time, give more priority to critical resources and change allocation based on changing risk levels.

Figure 2. Outline of the Cloud-based Disaster Recovery Model

Cost (Segment, T)
Begin
  Return Segment.size * T.cost_per_gigabyte; // return the actual cost
End

Int Allocate_All ()
Begin
  for j ← 3 downto 1 // for criticality level 3 to 2 to 1
    for i ← 0 to num_segments - 1 // for all segments
    Begin
      if (Segment[i].Criticality == j)
      Begin
        // this will allocate segment i to the appropriate disaster recovery tier
        result = Allocate (Segment[i], RC [Risk, Segment[i].Criticality]);
        if (result == 0) return 0; // if cannot allocate, fail
      End
    End
  Return (num_segments);
End

Allocate (Segment S, Tier T)
Begin
  If (Budget - Cost (S, T) < 0)
    return 0; // if budget is too low, return fail
  Else
  Begin
    Budget ← Budget - Cost (S, T); // deduct the cost from budget
    return 1; // allocated and cost deducted from Budget
  End
End

Figure 4. Functions to be called by the main program

Table 5. Sample pricing for cloud-based disaster recovery system

Table 6. Applying the Criticality-Risk on the given example