TOWARDS BIG DATA PROCESSING IN CLOUDS: AN ONLINE COST-MINIMIZATION APPROACH

Due to the elastic and on-demand nature of its resource provisioning, cloud computing provides a cost-effective and powerful technology for processing big data. Under this paradigm, a Data Service Provider (DSP) may rent geographically distributed datacenters to process its large volumes of data. As the data are dynamically generated and resource pricing varies over time, moving the data from different geographic locations to different datacenters while provisioning adequate computation resources to process them is an essential task for a DSP seeking cost effectiveness. In this paper, a joint online approach is proposed to address this task. We formulate the problem as a joint stochastic optimization problem, which is then decoupled into two independent subproblems via the Lyapunov framework. Our method minimizes the long-term time-average cost, including computing cost, storage cost, bandwidth cost and latency cost. Theoretical analysis shows that our online algorithm produces a solution within a provable bound of the optimal solution achieved through offline computation and guarantees that the data processing can be completed within preset delays.

1. Introduction. The cloud computing paradigm offers a convenient way for users to dynamically adjust the computing resources they rent from cloud service providers (CSPs) according to demand, in a Pay-As-You-Go (PAYG) manner. Specifically, benefiting from the development of virtualization technology [3], VM (Virtual Machine) resources in the cloud can be scaled up and down to match application demands. Compared with traditional approaches, the cloud computing paradigm eliminates users' costs of purchasing and maintaining their own infrastructures.
The elastic and on-demand nature of resource provisioning attracts many users to deploy their applications, especially computation-intensive big data analyses, in the clouds. In the age of big data, data analysis is increasingly important for applications such as financial analysis, social networking web sites and astronomical telescope services. For example, Facebook-like social media sites can uncover usage patterns and hidden correlations by analyzing web site history records (e.g., click records, activity records, etc.) to facilitate their marketing decisions. We refer to this kind of organization as a Data Service Provider (DSP) in this paper. Under this paradigm, a DSP must solve two problems in the first place: 1) how to transfer large-scale data sets from various locations into the clouds; and 2) how many resources, such as computing and storage resources, should be rented in the clouds for processing?
Although much effort has been made to design computing models for fast big data analysis, such as MapReduce [6] and Spark [27], the problem of moving large-scale data to the clouds while simultaneously provisioning adequate resources in the clouds has rarely been considered in the community. Currently, for the data moving problem, practices such as copying the data onto large-scale hard drives for physical transportation [2,15], and even moving entire machines [1] to datacenters, are adopted. These methods not only incur undesirable delays but also pose security risks, given that hard drives may be damaged in transportation accidents. For the resource provisioning problem, some work has been done to cope with dynamic workloads in clouds [16,21], but these methods typically treat the data moving problem and the resource provisioning problem in isolation.
In this paper, targeting the analysis of big data from different locations with a MapReduce-like framework in the clouds, we propose an online approach that systematically addresses the data moving problem and the resource provisioning problem, with the goal of minimizing the overall cost of running big data analytics in the clouds. To achieve this goal, we first formulate the problem as a joint stochastic optimization problem and then apply the Lyapunov optimization framework. Such a stochastic approach does not require predicting future system states and makes decisions based only on the current system state [13]. Based on the drift-plus-penalty transformation, we propose an online algorithm that moves data from multiple regions to distributed datacenters in an online manner and dynamically rents a near-optimal amount of computing and storage resources to satisfy user requirements for data analysis.
The major contributions of this work are summarized as follows: • We propose a novel framework that systematically handles data moving from multiple locations to multiple datacenters and resource renting in each datacenter in a nearly optimal manner. In particular, we consider the bandwidth cost, computing cost, storage cost and delay cost as the overall cost and guarantee that the data can be processed within a desirable delay. In our framework, VMs in the cloud have different types and are priced dynamically.
• We propose an algorithm to solve the joint stochastic problem using the Lyapunov optimization framework, which makes resource renting and data moving decisions online. Moreover, the algorithm admits a distributed implementation.
• We theoretically analyze the performance of the algorithm, demonstrating that it approximates the optimal solution within provable bounds and is capable of processing the tasks within a preset delay.
The remainder of this paper is organized as follows: Section 2 summarizes related works; Section 3 describes the system modeling and the problem formulation; Section 4 gives the online algorithm for solving the problem; Section 5 analyzes the proposed algorithm; Section 6 concludes the paper.
2. Related work. Recent years have witnessed the proliferation of cloud-based services in both academia and industry. Much effort has been made to migrate applications such as cloud-based live streaming [9,21], cloud-based online gaming [18], cloud-based conferencing [7] and social media applications [24] into clouds. The majority of these studies focus on how to scale cloud resources up and down to meet user demand, or on how to migrate workflows into clouds.
Few studies have been conducted on moving large-scale data into clouds. Paper [4] studied how to transfer data to the cloud provider via the Internet and courier services. Study [5] proposed a solution to minimize the transfer latency under a budget constraint. In [11], the authors studied data streaming storage for real-time big data processing. Different from our study, these works deal with the data transfer problem in a static scenario in which the data amount is fixed, while our work considers dynamically generated data. In addition, the aforementioned studies considered a single datacenter, while our work takes multiple datacenters into account. The most relevant work is Zhang et al. [28], which proposed an online algorithm to migrate dynamically generated data from various locations to the clouds for processing. However, our work differs significantly in that we consider resource provisioning and data moving simultaneously and apply the Lyapunov framework to address the problem.
There is also a line of research on resource provisioning in clouds, where the server pool and the capacity of each server become elastic. Studies [16,12] considered elastic server capacity supported by virtualization technologies. Work [16] proposed an adaptive request allocation and service capacity scaling mechanism mainly to cope with flash crowds. Study [23] took the VM renting cost and storage cost into account when making scheduling decisions. Different from these works, which often need mechanisms to predict future workloads, our work does not rely on any future information about big data tasks since the Lyapunov optimization framework is adopted. Studies on how to schedule tasks with different objectives in clouds have also been conducted. Works [29,31] proposed efficient scheduling strategies for real-time tasks with energy minimization, while studies [31,20] developed fault-tolerant task scheduling algorithms. These works typically remain within a single datacenter.
In addition, the Lyapunov optimization technique was first proposed in [17] to address the network stability problem and was then introduced into cloud computing to deal with job admission and resource allocation problems [19,10]. Yao et al. [26] extended it from a single time scale to two time scales to achieve electricity cost reduction in geographically distributed datacenters. Recently, this approach has been used for resource management in cloud-based video services [22,25]. In our work, we utilize this approach to simultaneously address data moving from multiple locations to multiple datacenters and resource provisioning in each datacenter. To summarize, our work differs from existing works as follows. 1) We address the problems of data moving and resource provisioning systematically and design an online algorithm that can be implemented distributedly. 2) With the Lyapunov framework, our method does not rely on predictions of the future big data processing workload, which differs significantly from the assumptions made in [21,23].

3. Modeling and formulation.
3.1. System modeling. We consider the system scenario presented in Fig. 1: a DSP (e.g., a global astronomical telescope organization) manages multiple geographical data locations that continuously produce large volumes of data. The DSP deploys its data analytics application in the cloud and connects the data sources to datacenters located in multiple places. All the data are moved to the datacenters and processed in the corresponding datacenter with a distributed computing model such as the MapReduce framework. In the system, the DSP observes the state of the datacenters (e.g., VM price, datacenter load state, network state) and decides the amount of data to be moved to each datacenter and the amount of resources rented from each datacenter, with cost minimization in mind. Finally, the datacenters return the analysis results to the DSP after the data have been processed and analyzed.
Formally, consider the set D of geo-distributed datacenters with size D = |D|, indexed by d (1 ≤ d ≤ D). A set K of distinct VM types (with size K = |K|), each with a specific processing capacity v_k under different configurations of CPU and memory, is provided in each datacenter. Data are dynamically generated at R = |R| different data locations (indexed by r, 1 ≤ r ≤ R), denoted as the set R. Data from any location can be moved to any datacenter for analytics via virtual private networks (VPNs), and the data transmission bandwidth between a data generation location r ∈ R and a datacenter d ∈ D is large as well. To be realistic, we also assume that the bandwidth B_rd on a VPN link (r, d) from data location r to datacenter d is limited and constitutes the bottleneck of the system. In addition, the data generation at each location is independent, and the prices of the resources (e.g., VMs, storage) in each datacenter vary in both the spatial and the temporal domain.

Table 1. IMPORTANT NOTATIONS
  D            set of datacenters distributed over multiple regions
  R            set of data locations
  K            set of VM types
  a_r(t)       amount of the data generated from region r at t
  A_r^max      max amount of data generated from region r
  λ_r^d(t)     amount of the data allocated to d from region r at t
  n_d^{k,max}  max number of type-k VMs that datacenter d can provide
  b_r^d        price of bandwidth between location r and datacenter d
  v_k          data processing rate of a type-k VM
  ε_d          preset constant for controlling the queueing delay in H_d(t)
  l            max delay of data processing
  H_d(t)       unprocessed data in datacenter d at t
  Z_d(t)       virtual queue associated with H_d(t) to guarantee its delay
The system operates in time slots, denoted by t = 0, 1, ..., T. In every time slot, the DSP needs to decide how much data to move from each data location r to each datacenter d, and how many resources to rent from each datacenter to support its data processing and storage. Our goal is therefore to minimize the overall cost of big data analytics in clouds while guaranteeing the delay in the long run. For ease of reference, important notations are summarized in Table 1.

3.2. Problem formulation. In this subsection, we first formulate the costs incurred in the system and then define the objective of the problem mathematically.
As aforementioned, the system runs in a time-slotted fashion and data are dynamically generated over the different regions in each time slot. Let a_r(t) be the amount of data generated at the r-th region in time slot t, and A_r^max the maximum amount of data generated at location r. Since the data generated at each location can be moved to any datacenter for analytics, we denote by λ_r^d(t) the amount of data allocated to datacenter d from region r at t. Hence, we have:

    Σ_{d∈D} λ_r^d(t) = a_r(t), ∀r ∈ R, with 0 ≤ a_r(t) ≤ A_r^max.    (1)

The goal of the DSP is to minimize the overall cost incurred in the system by optimizing the amount of data allocated to each datacenter and the number of resources rented. Specifically, the following cost components are considered in this paper: bandwidth cost, latency cost, storage cost and computing cost. Each cost is defined as follows.

Usually, the bandwidth price varies over different VPN links because they often belong to different Internet service providers. Let b_r^d be the price of transferring 1 GB of data between data location r ∈ R and datacenter d ∈ D; then the bandwidth cost at t is:

    C^B(t) = Σ_{r∈R} Σ_{d∈D} b_r^d · λ_r^d(t).    (2)

Storage cost is an important factor in choosing the datacenter for data analytics, since big data applications often store large amounts of data. Let s_d represent the price of storing 1 GB of data for one time slot in datacenter d ∈ D; then the storage cost at t is:

    C^S(t) = Σ_{r∈R} Σ_{d∈D} s_d · λ_r^d(t).    (3)

Due to the variation of VM prices over time slots, the number of VMs rented from each datacenter has an important impact on the overall cost of the system as well as on the QoS of the big data application. Let n_d^k(t) be the number of type-k VMs rented from datacenter d in time slot t, and p_d^k(t) the type-k VM price in datacenter d at time slot t, which varies in both space and time. Then the computing cost is defined as:

    C^V(t) = Σ_{d∈D} Σ_{k∈K} p_d^k(t) · n_d^k(t).    (4)

The latency incurred by uploading data to the datacenters is an important performance measure to be minimized in the data moving process. Let L_r^d be the latency between data location r ∈ R and datacenter d ∈ D. These delays are determined by the respective geographic distances and can be obtained by a simple command such as Ping. As suggested in [28], we convert the latency into a monetary cost. Therefore, the latency cost is defined as:

    C^L(t) = α · Σ_{r∈R} Σ_{d∈D} L_r^d · λ_r^d(t),    (5)

where α is a weight converting latency into monetary cost. Based on the above cost formulation, the overall cost incurred in the system is:

    C(t) = C^B(t) + C^S(t) + C^V(t) + C^L(t).    (6)

Hence, the problem of minimizing the time-average cost of data moving and processing over a long-term period [0, T] can be formulated as:

    P1:  min C̄,    (7)
    where C̄ = lim_{T→∞} (1/T) Σ_{t=0}^{T−1} E{C(t)},    (8)
    s.t.  λ_r^d(t) ≥ 0, ∀r ∈ R, ∀d ∈ D,    (9)
          Σ_{d∈D} λ_r^d(t) = a_r(t), ∀r ∈ R,    (10)
          0 ≤ n_d^k(t) ≤ n_d^{k,max}, ∀d ∈ D, ∀k ∈ K.    (11)

The constraint (10) ensures that the sum of the data allocated to the datacenters in one time slot equals the total amount of data generated in that time slot.
The constraint (11) ensures that the number of VMs rented is within the capacity that a datacenter can provide. As the data generation process is unknown, the formulation above is a constrained stochastic optimization problem, and our objective is to minimize the long-term average cost by optimizing the amount of data allocated to each datacenter as well as the number of VMs rented in each datacenter. To deal with this problem, a recently developed optimization technique is adopted in this paper. The details of the solution using the Lyapunov optimization framework are presented in the next section.

4. Online algorithm design. In this section, we exploit Lyapunov optimization theory to design our online control algorithm. An outstanding feature of this method is that it does not require future information about the workload. By greedily minimizing the drift-plus-penalty term in each time slot, it can also be proved to approach a time-averaged cost that is arbitrarily close to the optimum, while still maintaining system stability.
According to the standard optimization framework in [13], we first transform problem P1 into an optimization problem of minimizing the Lyapunov drift-plus-penalty term and then design the corresponding online algorithm.

4.1. Problem transformation.
Let H_d(t) be the amount of unprocessed data in datacenter d at time slot t. Initially, H_d(0) = 0, and the queue H_d(t) then evolves as:

    H_d(t+1) = max[H_d(t) − Σ_{k∈K} n_d^k(t)·v_k, 0] + Σ_{r∈R} λ_r^d(t).    (12)

The above queue update implies that the amounts of departed data and newly arrived data are Σ_{k∈K} n_d^k(t)·v_k and Σ_{r∈R} λ_r^d(t), respectively. To guarantee that the worst-case queueing delay in queue H_d(t), ∀d ∈ D, is bounded by the maximum workload delay l, we design a related virtual queue Z_d(t) according to the ε-persistent service technique for delay bounding in [14]. The backlog of the virtual queue Z_d(t) is likewise initialized as Z_d(0) = 0 and then updated as follows:

    Z_d(t+1) = max[Z_d(t) + ε_d − Σ_{k∈K} n_d^k(t)·v_k, 0]  if H_d(t) > 0,
    Z_d(t+1) = 0  otherwise.    (13)

Let Z(t) = (Z_d(t)) and H(t) = (H_d(t)), ∀d ∈ D, denote the vectors of virtual queues and actual queues, respectively, and let Θ(t) = [Z(t), H(t)] denote the combined vector of actual and virtual queues. According to the Lyapunov framework [13], we define the Lyapunov function as:

    L(Θ(t)) = (1/2) Σ_{d∈D} [H_d(t)² + Z_d(t)²],    (14)

where L(Θ(t)) measures the total queue backlog in the system. The one-slot Lyapunov drift is:

    ∆(Θ(t)) = E{L(Θ(t+1)) − L(Θ(t)) | Θ(t)}.    (15)

In the Lyapunov optimization framework, the drift-plus-penalty is obtained by adding the cost incurred by the system to the above Lyapunov drift, namely:

    ∆(Θ(t)) + V·E{C(t) | Θ(t)},    (16)

where V is a non-negative parameter that controls the tradeoff between system stability and cost: the larger V is, the smaller the cost, and vice versa. Hence, the original problem P1 can be transformed into the following problem P2:

    P2:  min ∆(Θ(t)) + V·E{C(t) | Θ(t)}
    s.t.  (9)(10)(11).
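The queue dynamics described above can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's code; the ε-persistent update is written in the common form where the virtual queue grows by ε_d only while real data is waiting, which is an assumption consistent with the description of [14].

```python
def update_queues(H, Z, arrivals, service, eps):
    """One-slot update of the actual queue H_d and virtual queue Z_d.

    H, Z     : current backlogs of datacenter d
    arrivals : sum_r lambda_r^d(t), data moved to d this slot
    service  : sum_k n_d^k(t) * v_k, data processed this slot
    eps      : epsilon_d, persistent-service constant controlling delay
    """
    # Actual queue: serve first, then admit new arrivals.
    H_next = max(H - service, 0.0) + arrivals
    # Virtual queue: grows by eps_d whenever real data is waiting,
    # which forces the controller to keep serving backlogged queues.
    if H > 0:
        Z_next = max(Z - service + eps, 0.0)
    else:
        Z_next = 0.0
    return H_next, Z_next
```

For example, with backlog H = 10, Z = 5, arrivals of 3, service of 4 and ε_d = 2, the next backlogs are H = 9 and Z = 3.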
To solve problem P2, rather than directly minimizing the drift-plus-penalty expression (16), we minimize an upper bound on it, which does not undermine the optimality or performance of the algorithm according to [13]. Therefore, the key is to find an upper bound for problem P2. It can be proved that expression (16) is bounded as:

    ∆(Θ(t)) + V·E{C(t) | Θ(t)}
      ≤ B + V·E{C(t) | Θ(t)}
        + Σ_{d∈D} H_d(t)·E{Σ_{r∈R} λ_r^d(t) − Σ_{k∈K} n_d^k(t)·v_k | Θ(t)}
        + Σ_{d∈D} Z_d(t)·E{ε_d − Σ_{k∈K} n_d^k(t)·v_k | Θ(t)},    (19)

where

    B = (1/2) Σ_{d∈D} [(Σ_{r∈R} A_r^max)² + 2(Σ_{k∈K} n_d^{k,max}·v_k)² + ε_d²]

is a finite constant. For detailed proofs, please see Appendix A.

4.2. Online control algorithm design. Fortunately, a careful investigation of the R.H.S. of inequality (19) reveals that the optimization problem can be equivalently decoupled into two subproblems: 1) data allocation and 2) resource provisioning. The details of solving the two subproblems are presented as follows.

1) Data Allocation:
To minimize the R.H.S. of (19), by observing the relationships among the variables, the part related to data allocation can be extracted from the R.H.S. of (19) as:

    min Σ_{r∈R} Σ_{d∈D} [H_d(t) + V·(b_r^d + s_d + α·L_r^d)]·λ_r^d(t).    (20)

Furthermore, since the data generated at each location are independent, this centralized minimization can be carried out independently and distributedly. Considering the data allocation at location r at time t, we should solve the following problem:

    min Σ_{d∈D} [H_d(t) + V·(b_r^d + s_d + α·L_r^d)]·λ_r^d(t)
    s.t.  (9)(10).    (21)

In fact, the above problem is a generalized min-weight problem, where the amount of data λ_r^d(t) moved from location r to datacenter d is weighted by the queue backlog H_d(t), the bandwidth price b_r^d, the storage price s_d and the latency cost α·L_r^d. By linear programming theory (e.g., the simplex method), we obtain the following solution:

    λ_r^d(t) = a_r(t)  if d = argmin_{d'∈D} [H_{d'}(t) + V·(b_r^{d'} + s_{d'} + α·L_r^{d'})],
    λ_r^d(t) = 0       otherwise.    (22)

The solution shows that the data generated at location r tend to be moved to the datacenter with the shortest weighted workload queue and the minimal operation prices (e.g., bandwidth price, storage price, etc.) in the current time slot.
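The min-weight rule described above admits a one-line implementation per location. The following is a minimal sketch with parameter names assumed for illustration: per-datacenter backlogs, bandwidth prices, storage prices and latencies are passed in as plain lists.

```python
def allocate_data(a_r, H, b_r, s, L_r, alpha, V):
    """Move all data a_r(t) from location r to the single datacenter
    that minimizes the weight H_d(t) + V*(b_r^d + s_d + alpha*L_r^d).

    H, b_r, s, L_r : per-datacenter backlog, bandwidth price,
                     storage price, and latency to location r
    alpha, V       : latency weight and Lyapunov control parameter
    """
    D = len(H)
    weights = [H[d] + V * (b_r[d] + s[d] + alpha * L_r[d]) for d in range(D)]
    best = min(range(D), key=lambda d: weights[d])
    # lambda_r^d(t) = a_r(t) for the min-weight datacenter, 0 elsewhere
    return [a_r if d == best else 0.0 for d in range(D)]
```

For instance, a heavily backlogged but cheap datacenter can lose to a lightly loaded, slightly pricier one, since the backlog enters the weight directly while prices are scaled by V.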
2) Resource Provisioning: The remaining part of the R.H.S. of (19), related to the variables n_d^k(t), can be treated as the resource provisioning problem once the constant terms are removed. Therefore, we can obtain the optimal VM provisioning strategy by solving the following problem:

    min Σ_{d∈D} Σ_{k∈K} [V·p_d^k(t) − (H_d(t) + Z_d(t))·v_k]·n_d^k(t)
    s.t.  (11).    (23)
Since the resource provisioning problems of the datacenters are independent, similar to the data allocation problem, (23) can be solved distributedly within each datacenter. For a single datacenter d, the resource provisioning problem can be further rewritten as:

    min Σ_{k∈K} [V·p_d^k(t) − (H_d(t) + Z_d(t))·v_k]·n_d^k(t)
    s.t.  (11).    (24)
The optimal solution to the above linear problem is:

    n_d^k(t) = n_d^{k,max}  if V·p_d^k(t) < (H_d(t) + Z_d(t))·v_k,
    n_d^k(t) = 0            otherwise.    (25)

The above solution indicates that a type-k VM is preferred for renting at t when its price p_d^k(t) is small, and that VMs with large capacity v_k are also more likely to be rented.
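The threshold structure of this solution can be sketched as follows; this is an illustrative implementation under assumed parameter names, not the paper's code.

```python
def provision_vms(H, Z, prices, v, n_max, V):
    """Per-datacenter threshold rule for VM renting: rent the maximum
    n_d^{k,max} type-k VMs when V*p_d^k(t) < (H_d(t)+Z_d(t))*v_k,
    otherwise rent none of that type.

    prices[k] : p_d^k(t), current price of a type-k VM in datacenter d
    v[k]      : v_k, processing rate of a type-k VM
    n_max[k]  : n_d^{k,max}, type-k VMs available in datacenter d
    """
    backlog = H + Z  # combined actual and virtual backlog
    return [n_max[k] if V * prices[k] < backlog * v[k] else 0
            for k in range(len(prices))]
```

The rule is bang-bang because the objective (24) is linear in each n_d^k(t): a type is either worth renting at full capacity or not at all, depending on whether its price-per-capacity beats the backlog pressure.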
So far, the two complex problems of data allocation and resource provisioning have been solved efficiently using the Lyapunov framework. The simple strategies facilitate the online deployment of the algorithm in real-world systems. The details of the online algorithm are presented in Algorithm 1.
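One control slot, combining the data allocation rule, the VM provisioning rule and the queue updates, can be sketched as below. The data structures and parameter names are illustrative assumptions; this is a sketch of the per-slot control loop, not the paper's Algorithm 1 itself.

```python
def online_slot(H, Z, a, b, s, L, p, v, n_max, eps, alpha, V):
    """One slot of the online control loop for R locations, D datacenters.

    H, Z      : per-datacenter backlogs (mutated in place)
    a[r]      : data generated at location r this slot
    b[r][d]   : bandwidth price, L[r][d]: latency, s[d]: storage price
    p[d][k]   : VM price, v[k]: VM rate, n_max[d][k]: VM cap, eps[d]: epsilon_d
    """
    D, K, R = len(H), len(v), len(a)
    # 1) Data allocation: each location sends all its data to the
    #    datacenter with minimum weighted cost (min-weight rule).
    arrivals = [0.0] * D
    for r in range(R):
        best = min(range(D),
                   key=lambda d: H[d] + V * (b[r][d] + s[d] + alpha * L[r][d]))
        arrivals[best] += a[r]
    # 2) Resource provisioning: per-datacenter, per-type threshold rule.
    n = [[n_max[d][k] if V * p[d][k] < (H[d] + Z[d]) * v[k] else 0
          for k in range(K)] for d in range(D)]
    # 3) Queue updates for the next slot.
    for d in range(D):
        service = sum(n[d][k] * v[k] for k in range(K))
        new_H = max(H[d] - service, 0.0) + arrivals[d]
        Z[d] = max(Z[d] - service + eps[d], 0.0) if H[d] > 0 else 0.0
        H[d] = new_H
    return arrivals, n
```

Each of the three steps uses only the current slot's prices and queue states, which is what makes the loop deployable online without workload prediction; steps 1 and 2 also decompose per location and per datacenter, matching the distributed implementation claimed above.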
5. Performance analysis. Next, to show its merits, we theoretically analyze the performance of Algorithm 1 in terms of cost optimality, queue backlog bound, and worst-case delay of data processing.

Theorem 5.1. (Cost Optimality) Suppose the data generation rates a_r(t), ∀r ∈ R, are independent and identically distributed over time slots. For any control parameter V > 0, the algorithm achieves a time-average cost related to the optimal one as follows:
    lim sup_{T→∞} (1/T) Σ_{t=0}^{T−1} E{C(t)} ≤ C* + B/V,    (26)

where C* is the infimum of the time-average cost under the optimal control actions, representing the theoretically optimal solution to the optimization problem, and B is the constant defined in (19).
Proof. Please see Appendix B.

This theorem shows that the gap between the time-average cost obtained by the proposed algorithm and the optimal cost obtained offline is O(1/V). In particular, by tuning the control parameter V, the time-average cost can be made arbitrarily close to the optimal cost C*.

Theorem 5.2. (Queue Backlog Bound) For any d ∈ D and any time slot t, the backlogs of the virtual queue and the actual queue are bounded by:

    Z_d(t) ≤ Z_d^max = V·p_d^max / v_min + ε_d,    (27)
    H_d(t) ≤ H_d^max = V·p_d^max / v_min + Σ_{r∈R} A_r^max,    (28)

where p_d^max is the maximum price for each type of VM over time slots, and v_min is the minimal capacity among all VM types.
Proof. Please see Appendix C.
This theorem shows that the queue backlog is O(V), which means that, to keep the queue backlog small, we should choose a small V. Noticing that decreasing V causes a larger cost, as shown in (26), cost and system stability exhibit an [O(1/V), O(V)] tradeoff. In practice, given an acceptable cost, we can choose V to maximize system stability, and vice versa.

Theorem 5.3. (Worst-Case Delay) Assume that the system operates in a First-In-First-Out manner. Then the worst-case delay of the data processing in queue d is bounded by l, defined as:

    l = ⌈(H_d^max + Z_d^max) / ε_d⌉,    (29)

where ⌈x⌉ denotes the minimal integer greater than or equal to x, and H_d^max, Z_d^max are defined in (28) and (27).
Proof. Please see Appendix D.
This implies that the data arriving in queue H_d at any time slot t can be processed within l time slots, demonstrating that our algorithm is able to guarantee the QoS (Quality of Service) for the DSP. In addition, given the system parameters, the QoS for the DSP can be tuned by choosing a suitable ε_d. Moreover, with different settings of ε_d for d ∈ D, we can achieve heterogeneous QoS across datacenters.

6. Conclusion. Targeting the processing of big data from different locations in geo-distributed datacenters, we propose a systematic approach to data moving and resource provisioning with the goal of cost minimization. The model takes into consideration that the data analytics application runs in a dynamic environment (e.g., unpredictable data generation, dynamic VM pricing). Using the Lyapunov technique, we transform the original problem into two independent subproblems that can be solved efficiently online. Theoretical analysis demonstrates that the algorithm maintains the stability of the dynamic system and completes the data processing within a preset number of time slots. It remains to validate the effectiveness of the proposed algorithm via extensive experiments. Other considerations that may be further incorporated into our framework include data processing dependence between consecutive time slots and data processing migration among datacenters.