SDTP: Accelerating Wide-Area Data Analytics With Simultaneous Data Transfer and Processing

For the efficient analysis of geo-distributed datasets, cloud providers implement data-parallel jobs across geo-distributed sites (e.g., datacenters and edge clusters), which are generally interconnected by wide-area network links. However, current state-of-the-art geo-distributed data analytic methods fail to make full use of the available network and computing resources. The main reason is that such methods must wait for bottleneck sites to complete the corresponding transmission and computation in each phase. Furthermore, such methods may be impractical under dynamic network bandwidth and diverse job parallelism. To this end, we propose a Simultaneous Data Transfer and Processing (SDTP) mechanism to accelerate wide-area data analytics, with the joint consideration of network bandwidth dynamics and job parallelism. In SDTP, a site can execute its computation as soon as it obtains the required input data. As a result, the input data loading, map, shuffle, and reduce phases at each site need not wait for the completion of the previous phases at other sites. We further improve the SDTP method by offering more accurate time estimation and generalizing the mechanism to dynamic situations. The trace-driven results demonstrate that SDTP can reduce the wide-area analytic job response time by 19% to 72% compared to other methods.


INTRODUCTION
Cloud providers such as Google, Amazon, and Alibaba have deployed data centers globally to provide instant services. These services generate a large volume of data across the world [1], including transaction data, user logs, and performance logs. Mining geo-distributed data (also known as wide-area data analytics) is crucial for commercial recommendations, anomaly detection, performance upgrades, and system maintenance, among others. A distributed computing framework such as MapReduce is generally implemented to mine such massive datasets.
A dominant challenge in this computing paradigm is the heterogeneity of hardware resources among geo-distributed sites, including the computing capacity, uplink bandwidth, and downlink bandwidth. For example, the gap between the bandwidths among sites of Amazon EC2 is up to 12× [2], and the computation capacity of the largest online service provider may be up to two orders of magnitude larger than that of ordinary ones [3]. With the development of edge computing, many applications are placed at the edge. However, the edge resources are naturally heterogeneous and insufficient [4], [5]. Moreover, the data amounts among geo-distributed sites are also highly heterogeneous [3], [6], [7]. As reported in reference [8], the amounts of Skype logs in over 100 different Azure sites indicate that the largest site held 22× as much data as the smallest site. These heterogeneities significantly affect the execution of wide-area data analytics.
The job response time is a key metric in the analysis of geo-distributed data. A job usually contains multiple tasks in one stage, and its response time is dominated by the completion time of the last task [3], [7], [9]. However, the heterogeneity of the hardware resources and the diversity of the data volumes among geo-distributed sites have a serious impact on the job completion time. Thus, it is challenging to optimize this metric because multiple factors must be considered, including the WAN link bandwidth among sites [6], [7], [10], [11], the cost of the WAN links [12], [13], [14], [15], the computing resources in each site [3], [16], and the data distribution [7], [8], [17]. To this end, Iridium [7] considers the heterogeneity of the WAN link bandwidth among sites and optimizes the placement of the reduce tasks to minimize the overall response time. Flutter [12] jointly considers the heterogeneities of both the bandwidth and cost of the WAN links among the sites. In contrast, Tetrium [3] offers a novel placement strategy for the map and reduce tasks with respect to the heterogeneities of both the WAN links and computing capacity among the sites. These methods can improve the response time to a certain extent.
Observation. The state-of-the-art task scheduling strategies execute the MapReduce tasks in a strictly sequential manner, which may idle the distributed sites and lead to unnecessary waiting time. A toy example with three sites at the map stage is presented in Fig. 1, and the basic settings are shown in Fig. 1a. We assume that each task processes 100 MB of data, and the time required to process each task is 2 sec. Initially, sites 1, 2, and 3 contain 10, 40, and 50 GB of local data, respectively. They have an unequal number of computing slots, and the link bandwidth between a pair of sites is also heterogeneous. We use a triple at each site to represent the upload bandwidth, the number of computing slots, and the download bandwidth, respectively. Under this setting, the traditional method usually executes the map tasks on local sites and thus avoids transferring local data to others. As shown in Fig. 1b, the bottleneck is site 2, which needs 80 sec to complete its tasks. To reduce the response time, Tetrium [3] tries to balance the transfer and computation workloads among the sites, and migrates part of the data from those sites that cannot handle their workloads effectively. It generates the map task placement strategy illustrated in Fig. 1c. In this placement solution, sites 2 and 3 need to transmit 26 and 22 GB of data to site 1, respectively. During the transfer period, sites 2 and 3 remain idle. The sites can only begin the data computation when all transmissions have been completed. In this example, Tetrium requires 56 sec in total.
As indicated in Fig. 1d, it is possible to improve the existing work on wide-area data analytics by balancing and parallelizing the data transfer and data processing. In this solution, sites 2 and 3 transmit less data to site 1 because the transfer time may overwhelm the processing time, even if site 1 has more computing slots. Moreover, sites 2 and 3 begin to process the data at time 0, as they require no data from the other sites. Therefore, the data transmission and data processing are parallelized from 0 to 20 sec. As a result, the total running time of the map stage is 40 sec.
According to the above observation and motivation, this study presents a novel task scheduling mechanism known as Simultaneous Data Transfer and Processing (SDTP) to reduce the response time of wide-area data analytic jobs. With the joint consideration of the WAN link heterogeneity and parallel execution in each site, SDTP first models the scheduling problem as a non-linear programming problem, which is generally complicated and hard to solve. To solve it, SDTP relaxes the non-linear programming model to a linear programming model. Thereafter, SDTP reduces the overall response time by migrating part of the data from the straggler site, which has the longest response time, to the idlest site, which has the shortest response time.
In practice, the WAN bandwidth is dynamic over time. Reference [10] reports that the available bandwidth is below 25% of the maximum bandwidth between Amazon EC2 sites in some cases. However, previous work devoted to geo-distributed batch analytics assumed that the WAN bandwidth was static over the job execution period [3], [7], [16], [18]. Moreover, a complex relationship exists between the computation time and the degree of parallelism. In general, the computation time of parallel tasks decreases as the degree of parallelism increases, but beyond a certain point, assigning additional resources to the job has only a marginal impact on the performance. However, many references have assumed that the computation time decreases constantly with the increase of the degree of parallelism [3], [16]. Therefore, we propose two improved approaches, SDTP+ and SDTP++, which provide a more accurate computation time estimation of each stage within a site and can be generalized to dynamic situations.
The main contributions of this paper can be summarized as follows.
We discover that state-of-the-art task scheduling strategies cannot make full use of the available network and computing resources, which may lead to unnecessary waiting time. To minimize the job response time, we present a new non-linear programming model to characterize the geo-distributed data analytic job with joint consideration of the resource heterogeneity and the degree of parallelism, as well as the dynamic WAN links among the sites.
We make reasonable relaxations and assumptions on this model and propose a novel scheduling mechanism known as SDTP to accelerate wide-area data analytics. SDTP allows the sites to begin the data processing once they obtain the required input data.
Considering the dynamicity of the WAN and the influence of the degree of parallelism, we further present two improved scheduling mechanisms known as SDTP+ and SDTP++. The two approaches can provide more accurate time estimation and can be generalized to dynamic situations.
We conduct trace-driven experiments to evaluate the performance of SDTP, SDTP+, and SDTP++. The results demonstrate that our methods outperform existing methods and achieve a 19% to 72% reduction in the overall job response time.
The remainder of this paper is organized as follows: Section 2 presents the background and related work. Section 3 outlines the system model and formulates the task placement problem of a geo-distributed data analytic job. Our SDTP, SDTP+, and SDTP++ methods are described in Sections 4 and 5. We conduct extensive evaluations using realistic traces in Section 6, and we conclude the paper in Section 7.
Fig. 1 (caption excerpt). In Fig. 1a, sites 1, 2, and 3 contain 10, 40, and 50 GB of local data, respectively. The triple on each site represents the uplink bandwidth, the number of computing slots, and the downlink bandwidth, respectively.

BACKGROUND AND RELATED WORK
In geo-distributed data analytics, multiple sites are connected by a WAN, which restricts massive data transmission. Data processing frameworks, such as Hadoop and Spark, rely on the MapReduce model to implement their tasks on multiple geo-distributed sites in parallel. These geo-distributed sites may be highly heterogeneous in terms of the hardware capacity and data distribution. Furthermore, the WAN bandwidth is dynamic, and the running time of parallel frameworks varies with the degree of parallelism. Therefore, in this section, we introduce the heterogeneities in geo-distributed sites, the dynamicity of the WAN bandwidth, and the parallel computing properties, followed by a discussion of related work on geo-distributed data analytics.

Characteristics of Computation and Network Resources Among Distributed Sites
Heterogeneity of Resources and Data Distribution. An important characteristic and challenge in geo-distributed data analytics is that the resources among different sites are highly heterogeneous in terms of the hardware capacity [3]. Different sites are built at varying times and in varying regions, with diverse goals and budgets, thereby resulting in high heterogeneity [3], [19]. To demonstrate the heterogeneity of hardware resources, we compare the two most important hardware capacities, namely the computation and bandwidth. In particular, the computation capacity of one of the largest online service providers may be up to two orders of magnitude larger than that of ordinary ones [3]. The impending trend of edge computing increases the heterogeneity. In private clusters, the compute capacities vary significantly from just a handful of cores to hundreds of cores [20]. The link bandwidth among different sites is also extremely diverse [7], [17], [21]. According to a measurement of Amazon EC2 in 11 different regions, the bandwidth among the sites is 15× smaller than the bandwidth within a site and 60× smaller in the worst case [2]. Moreover, the amounts of data generated on different sites are heterogeneous, which has a serious impact on the job response time [22]. An analysis of Skype logs obtained from over 100 different Azure sites demonstrated that the median, 90th percentile, and maximum values were 8×, 15×, and 22× larger than those of the site with the minimum log data [8]. Therefore, the data distribution across the sites may not be constant or may even be skewed in certain cases.
Dynamicity of the WAN. The dynamicity of the WAN bandwidth across different sites also poses significant challenges to geo-distributed data analytic jobs. Researchers discovered that large variances exist across different sites, and in certain cases, the available bandwidth is below 25% of the maximum bandwidth [10]. Consequently, it is difficult to develop task placement strategies for minimizing the job response time prior to the bandwidth change.
Degree of Parallelism. In parallel computing, the amount of data processed by a job varies. Even if jobs process the same amount of data, different degrees of parallelism lead to different job response times [23]. To determine the relationship among the computation time, the size of the input data, and the degree of parallelism, we have constructed a Spark cluster based on 11 virtual machines (1 manager node and 10 worker nodes). Each virtual machine contains 8 cores and 8 GB of main memory.
We have measured the running time of two jobs that ran on Spark under BigDataBench [24], and the results are presented in Fig. 2. The running time of Job 2 exhibits little variation with the increase in the parallelism. Job 1 exhibits a significant acceleration of up to 56 parallel slots when it processes 80 GB of input. When Job 1 processes a small input of 20 GB, it requires no more than 16 parallel slots. For all jobs, assigning additional parallel tasks beyond a "sweet spot" in the curve adds only diminishing gains. Thus, we need to design a method which can calculate more accurate computation time according to the type of job, the size of the input data and the degree of parallelism.
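As a rough illustration of the "sweet spot" idea, the following Python sketch picks the smallest slot count beyond which adding slots no longer shortens the running time by a chosen relative margin. The function name `sweet_spot`, the 5% threshold, and the sample measurements are assumptions made for illustration only; the numbers are hypothetical and are not read from Fig. 2.

```python
def sweet_spot(measurements, min_gain=0.05):
    """Return the smallest slot count after which the next measured
    configuration improves the running time by less than `min_gain`.

    measurements: list of (slots, running_time) pairs.
    min_gain: relative improvement threshold (an assumed tuning knob).
    """
    points = sorted(measurements)
    for (slots, time), (_, next_time) in zip(points, points[1:]):
        relative_gain = (time - next_time) / time
        if relative_gain < min_gain:
            return slots
    # Every step still paid off, so the largest measured setting wins.
    return points[-1][0]


# Hypothetical measurements (slots, seconds); NOT taken from the paper.
scalable_job = [(8, 400.0), (16, 220.0), (24, 160.0),
                (32, 130.0), (40, 125.0), (48, 124.0)]
flat_job = [(8, 100.0), (16, 99.0), (24, 98.5)]

print(sweet_spot(scalable_job))
print(sweet_spot(flat_job))
```

A scheduler could use such a cut-off to avoid assigning slots that bring only diminishing gains, which is the motivation for estimating the computation time from the job type, input size, and degree of parallelism.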

Related Work
Geo-distributed data analytics has received substantial attention over the past several years. Numerous efforts have been made to optimize the response times of such jobs.
Geo-Distributed Data Analytics on MapReduce-Based Systems. Iridium [7] considers the heterogeneity of the WAN, and aims to minimize the response time of jobs across geo-distributed sites by optimizing the placement of the reduce tasks and the involved input data. Considering the heterogeneities of the WAN links and the WAN bandwidth cost, Flutter [12] is a new task scheduling algorithm for reducing both the response time and the network cost of big data processing jobs. Tetrium [3] jointly considers the heterogeneities of the computing and networking resources when designing the placement strategy of the map and reduce tasks. Yugong [1] proposes a novel data and job placement strategy to minimize the cross-DC bandwidth use and to reduce the query latency. Liu et al. proactively aggregate the output data of map tasks and avoid repetitive data transfers in the shuffle stages to reduce the job response time [18]. The above methods mainly decrease the job response time, while massive WAN links and computing resources remain idle during the job execution.
To minimize the average job makespan, Zheng et al. study a joint scheduling optimization mechanism by overlapping the map and shuffle phases of two jobs to form a strong pair [25]. However, it focuses on the single datacenter applications. Furthermore, these methods all ignore the dynamic nature of the available WAN bandwidth and the influence of the degree of parallelism. Decima [23] uses reinforcement learning and neural networks to learn workload-specific scheduling algorithms and sets an efficient parallelism degree for each job to minimize the average job response time. However, it only attempts to optimize the average job response time in a single cluster.
Geo-Distributed Data Analytics on Other Distributed Systems. With a focus on the geo-distributed SQL query, Clarinet [26] includes a novel WAN-aware query optimizer, which can achieve multi-query network-aware plan selection and task placement to ensure low query latency. WANalytics [27], Pixida [28], and Geode [29] attempt to reduce the bandwidth use across geo-distributed data centers and decrease the latency of SQL query requests. Lube [30] monitors geo-distributed data analytic queries in real time, and detects and mitigates potential bottlenecks (e.g., bandwidth scarcity) at runtime to reduce the query response time.
Furthermore, Gaia [2] provides a machine learning synchronization model for cross-site learning tasks. It dynamically eliminates insignificant communication between sites to accelerate the execution of machine learning jobs. Monarch [31] optimizes the iterative processing style of graph-parallel systems to execute geo-distributed graph analytics effectively. Liu et al. present a hierarchical synchronous parallel mode, which results in lower WAN bandwidth use, faster convergence, and a lower WAN cost for wide-area graph analytics [11]. G-Cut [32] optimizes the performance of graph processing jobs by minimizing the inter-DC data transfer time. To reduce the amount of data transferred and the makespan, HPS+ [33] offers a new resource allocation algorithm. However, these methods have mainly focused on wide-area machine learning, SQL analytics, astronomical applications, and certain special fields. Therefore, they are not applicable to general big data processing frameworks such as MapReduce.
Optimizing Systems for Dynamic Settings. Considering the challenge of scarce and variable WAN bandwidth, Turbo [6] adjusts the query execution plans for geo-distributed SQL queries in response to runtime resource variations across data centers. AWStream [10] automatically learns an accurate profile to model the relationship between the accuracy and bandwidth consumption of an application. Thereafter, it carefully adjusts the application data rate to match the available bandwidth, while maximizing the achievable accuracy. Besides, Magrino et al. introduce predictive treaties to predict the evolution of the system state in distributed transaction processing [34]. This method can reduce the coordination of geo-distributed applications and improve their performance. Unfortunately, these methods only focus on some specific areas and are not applicable to geo-distributed data analytics on MapReduce-based systems.
In this study, we propose a novel scheduling mechanism named SDTP to accelerate wide-area data analytics. The method attempts to make full use of the available network and computing resources to avoid unnecessary waiting time, and it can realize an effective balance between the data transfer and data processing. Moreover, with a focus on the dynamic network and diverse job parallelism, we further improve the SDTP method by offering more accurate time estimation and generalizing it to dynamic situations.

MODELING AND PROBLEM FORMULATION
In this section, we describe the execution of the wide-area data analytic jobs and formulate the optimal response time problem for such jobs, including the details of calculating the overall time consumption. Table 1 summarizes the major notations used in this paper.

Execution of Wide-Area Data Analytic Job
In this section, we describe how we place tasks in the map and reduce stages to minimize the entire response time of a wide-area data analytic job in the system. The task placement in each stage involves deciding which tasks should be placed on the site and determining the source of the task input data.
In this paper, we focus on the jobs that have exactly one map stage and one reduce stage. We formulate the task placement of the wide-area data analytic job for each stage independently. The map stage includes the input data loading and map computation phases, and the reduce stage is divided into the shuffle transfer and reduce computation phases. Fig. 3 presents an example of the execution process of a wide-area data analytic job. The 3 sites have different amounts of unprocessed local data. At the data loading phase, the sites with heavy workload transfer some raw data to those sites which have sufficient computing and bandwidth resources. After that, each site executes the map computation on its raw data and generates intermediate results. At the reduce stage, each reduce task needs to read the corresponding intermediate data generated by all map tasks. In the shuffle phase, according to the fraction of reduce tasks performed at each site, each site transfers the intermediate data to corresponding sites. Finally, each site performs the reduce computation to get the final results.
Owing to the heterogeneities of the WAN bandwidth and computing capacity among the geo-distributed sites, the transmission times for obtaining the required input data on different sites are uneven. However, in previous approaches, the sites can execute the map computation only when all of the transmissions are completed. This idles the distributed sites and leads to unnecessary waiting time. By contrast, in our method, a site can execute its task computation once it obtains the required input data, which avoids unnecessary waiting time.
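The difference between the two execution styles can be sketched in a few lines of Python. This is a minimal illustration, assuming each site is summarized by one loading time and one computation time; the function names and the sample numbers are hypothetical and do not come from the paper's figures.

```python
def barrier_response_time(load_times, compute_times):
    """Stage time when every site waits for ALL transfers to finish
    before any computation starts (the strictly sequential style)."""
    return max(load_times) + max(compute_times)


def overlapped_response_time(load_times, compute_times):
    """Stage time when each site starts computing as soon as its OWN
    input has arrived; the slowest site determines the stage time."""
    return max(l + c for l, c in zip(load_times, compute_times))


# Hypothetical per-site times in seconds (site 1 receives data,
# sites 2 and 3 only process local data).
load = [30.0, 0.0, 0.0]
compute = [10.0, 25.0, 28.0]

print(barrier_response_time(load, compute))     # 30 + 28
print(overlapped_response_time(load, compute))  # max(40, 25, 28)
```

Because the maximum of per-site sums never exceeds the sum of per-phase maxima, the overlapped stage time is never worse than the barrier-style stage time for the same placement.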

Response Time at Map Stage
At the map stage, the task placement problem involves determining the amount of data $x_i^j$ that should be transferred from site $j$ to site $i$, where $i, j \in D$, $D$ is the set of sites, and $x$ represents the set of data volumes transferred across all sites. We assume that the map stage contains the input data loading phase and the map computation phase, as shown in Fig. 3. Each site should obtain its input data from other sites during the input data loading phase. We suppose that a site can begin to execute its tasks once the data assigned to it has been collected. Let $T^i_{load}$ represent the input data loading time of site $i$. The map computation time of site $i$ is denoted by $T^i_{map}$. At this stage, the goal is to minimize the maximum response time among the sites. That is,

$$\min_{x} \max_{i \in D} \left( T^i_{load} + T^i_{map} \right). \tag{1}$$

As the bandwidth of the WAN links among the sites may be dynamic, let $B^i_{down}(t)$ represent the download bandwidth of site $i$ at time $t$, and let $B^i_{up}(t)$ represent the corresponding upload bandwidth of site $i$ at time $t$. According to $x$, we can obtain the fraction of map tasks at each site. Therefore, the total volume of data that site $i$ needs to download is $\sum_{j \in D, j \neq i} x_i^j$. We use $t^{down}_{i,s}$ and $t^{down}_{i,e}$ to denote the start and end times when site $i$ downloads all input data from other sites. Let $t^{up}_{j,s}$ and $t^{up}_{j,e}$ denote the start and end times when site $j$ uploads all data to the other sites that need to fetch data from site $j$. Then, we have

$$\int_{t^{down}_{i,s}}^{t^{down}_{i,e}} B^i_{down}(t)\,dt = \sum_{j \in D, j \neq i} x_i^j,$$

$$\int_{t^{up}_{j,s}}^{t^{up}_{j,e}} B^j_{up}(t)\,dt = \sum_{i \in D, i \neq j} x_i^j.$$

From the above two equations, we can determine that the download time of site $i$ at the map stage is $t^{down}_{i,e} - t^{down}_{i,s}$, and the upload time of another site $j$ that contains input data of site $i$ is $t^{up}_{j,e} - t^{up}_{j,s}$. Thus, the input data loading time of site $i$ is the maximum of the download time of site $i$ and the upload times of the other sites that need to transmit data to site $i$. Therefore, we have

$$T^i_{load} = \max\left( t^{down}_{i,e} - t^{down}_{i,s},\; t^{up}_{j,e} - t^{up}_{j,s} \right), \quad j \neq i.$$

The computation time of tasks at a site is determined by the total volume of data to process, the degree of parallelism of the site, and the job operation processes. Thus, in the map computation phase, we use a function $f$, which depends on the job type, to estimate the computation time of site $i$ according to the input data size $\sum_{j \in D} x_i^j$ and the number of computation slots $S_i$ on site $i$. That is,

$$T^i_{map} = f\left( \sum_{j \in D} x_i^j,\; S_i \right).$$

Moreover, there is a constraint on the data volume: the total volume of data migrated from site $i$ to other sites plus the data remaining at site $i$, namely $\sum_{j \in D} x_j^i$, must be equal to the original data size of site $i$. Using the above descriptions, we formulate the task placement problem P1 at the map stage as minimizing the objective in Eq. (1) subject to the loading time, computation time, and data volume relations above.
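To make the time-varying bandwidth relation concrete, the sketch below finds the end time of a transfer when the bandwidth is piecewise constant, that is, the earliest time at which the accumulated bandwidth equals the data volume. The piecewise-constant trace format, the function name `transfer_end_time`, and the sample values (MB and MB/s) are assumptions for illustration, not part of the original model.

```python
def transfer_end_time(volume, start, segments):
    """Earliest end time t_e such that the bandwidth accumulated over
    [start, t_e] equals `volume`.

    segments: time-ordered list of (seg_start, seg_end, bandwidth),
              an assumed piecewise-constant bandwidth trace.
    """
    if volume <= 0:
        return start
    remaining = volume
    for seg_start, seg_end, bandwidth in segments:
        if seg_end <= start:
            continue  # this segment ended before the transfer began
        begin = max(seg_start, start)
        capacity = (seg_end - begin) * bandwidth
        if bandwidth > 0 and capacity >= remaining:
            return begin + remaining / bandwidth
        remaining -= capacity
    raise ValueError("bandwidth trace ends before the transfer completes")


# Hypothetical traces: site i downloads 100 MB in total, and one
# sender j uploads 60 MB in total.
down_trace = [(0.0, 10.0, 5.0), (10.0, 20.0, 10.0)]
up_trace = [(0.0, 30.0, 3.0)]

download_time = transfer_end_time(100.0, 0.0, down_trace) - 0.0
upload_time = transfer_end_time(60.0, 0.0, up_trace) - 0.0

# Loading time of site i: the slower of its download and the sender's upload.
loading_time = max(download_time, upload_time)
print(download_time, upload_time, loading_time)
```

With several senders, the same `max` would simply range over the download time and every sender's upload time.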

Response Time at Reduce Stage
At the reduce stage, we should decide the fraction $a_i$ of reduce tasks to place on each site $i$, where $a$ denotes the set of the fractions of reduce tasks on all sites. We suppose that the reduce stage includes the shuffle phase and the reduce computation phase, as shown in Fig. 3. In this case, $T^i_{shuf}$ represents the communication time of the shuffle phase at site $i$. This is the transfer time during which site $i$ obtains its input data from other sites. The reduce computation time on site $i$ is denoted by $T^i_{red}$. At this stage, the goal is to minimize the maximum response time among the sites; that is,

$$\min_{a} \max_{i \in D} \left( T^i_{shuf} + T^i_{red} \right).$$

At the shuffle phase, the total amount of data that site $i$ needs to download is $\sum_{j \in D, j \neq i} (I^j_{shuf} \times a_i)$, where $I^j_{shuf}$ is the amount of intermediate data on site $j$. In addition, $t^{down}_{i,s}$ and $t^{down}_{i,e}$ denote the start and end times when site $i$ downloads its input data from other sites. Let $t^{up}_{j,s}$ and $t^{up}_{j,e}$ represent the start and end times when site $j$ uploads all data to the corresponding sites. Hence, we have

$$\int_{t^{down}_{i,s}}^{t^{down}_{i,e}} B^i_{down}(t)\,dt = \sum_{j \in D, j \neq i} \left( I^j_{shuf} \times a_i \right),$$

$$\int_{t^{up}_{j,s}}^{t^{up}_{j,e}} B^j_{up}(t)\,dt = \sum_{i \in D, i \neq j} \left( I^j_{shuf} \times a_i \right).$$

Similar to the calculation of the time at the map stage, the download time of site $i$ at the reduce stage is $t^{down}_{i,e} - t^{down}_{i,s}$, and the upload time of another site $j$ that contains input data of site $i$ is $t^{up}_{j,e} - t^{up}_{j,s}$. The data shuffle time of site $i$ is equal to the maximum of the download time of site $i$ ($t^{down}_{i,e} - t^{down}_{i,s}$, $a_i \neq 0$) and the upload times of the other sites ($t^{up}_{j,e} - t^{up}_{j,s}$, $j \neq i$) that contain the input data of site $i$. Then, we have

$$T^i_{shuf} = \max\left( t^{down}_{i,e} - t^{down}_{i,s},\; t^{up}_{j,e} - t^{up}_{j,s} \right), \quad j \neq i,\ a_i \neq 0.$$

Following the map stage, the intermediate data from the map tasks on site $i$ are equal to $q \times \sum_{j \in D} x_i^j$, where $q$ denotes the ratio of the intermediate data to the input data of the map stage. Moreover, $\sum_{j \in D} x_i^j$ indicates the input data of the map stage on site $i$, and it is the sum of the amount of data $\sum_{j \in D, i \neq j} x_i^j$ transferred from the other sites $j$ to site $i$ ($i \neq j$) and the remaining data $x_i^i$ on site $i$:

$$I^i_{shuf} = q \times \sum_{j \in D} x_i^j = q \times \left( \sum_{j \in D, i \neq j} x_i^j + x_i^i \right).$$

The ratio $a_i$ of the reduce tasks on each site needs to satisfy the following constraint:

$$\sum_{i \in D} a_i = 1, \quad 0 \le a_i \le 1.$$

At the reduce computation phase, the input data of the reduce computation on site $i$ is $I_{shuf} \times a_i$, where $I_{shuf}$ is the total amount of intermediate data calculated by the map tasks across all sites. We use a function $h$ to estimate the reduce computation time of site $i$, according to the job type, the input data of the reduce phase, and the number of computation slots $S_i$ on site $i$. Thus, we have

$$T^i_{red} = h\left( I_{shuf} \times a_i,\; S_i \right).$$

Using the above descriptions, we formulate the task placement problem P2 at the reduce stage as minimizing the maximum of $T^i_{shuf} + T^i_{red}$ over all sites, subject to the shuffle time, computation time, and task fraction relations above.

By means of the above formulations, we have specified the calculation of the map and reduce stage response times for a wide-area analytic job. However, according to our model, the functions $f$ and $h$ are usually complicated and non-linear. Therefore, the problems P1 and P2 are both non-linear programming problems in general. In addition, the bandwidth of the WAN links among the sites may be dynamic. Thus, it is hard to obtain the optimal solutions in polynomial time.
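The shuffle volumes implied by a choice of reduce task fractions can be sketched as follows. This is an illustrative Python helper, assuming the intermediate data of every site is split across sites in proportion to the fractions; the function name `shuffle_volumes` and the sample numbers are hypothetical.

```python
def shuffle_volumes(intermediate, fractions):
    """Per-site download and upload volumes in the shuffle phase.

    intermediate[j]: intermediate data held by site j after the map stage.
    fractions[i]:    fraction of reduce tasks placed on site i (sums to 1).
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("reduce task fractions must sum to 1")
    total = sum(intermediate)
    # Site i fetches its share a_i of every OTHER site's intermediate data.
    download = [a * (total - own) for own, a in zip(intermediate, fractions)]
    # Site j sends out everything except the share a_j that it keeps.
    upload = [own * (1.0 - a) for own, a in zip(intermediate, fractions)]
    return download, upload


# Hypothetical intermediate data (GB) and reduce task fractions.
intermediate = [10.0, 40.0, 50.0]
fractions = [0.5, 0.25, 0.25]

download, upload = shuffle_volumes(intermediate, fractions)
print(download)
print(upload)
```

The total downloaded volume equals the total uploaded volume, which is a quick sanity check on any candidate placement.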

SDTP: TASK SCHEDULING FOR WIDE-AREA DATA ANALYTICS
In this section, we investigate a special case and convert the wide-area data analytic problem into a simpler problem. First, we assume that the WAN is static and each site uses a fixed WAN bandwidth. Second, we assume that the computation time of each task is fixed and the computation time can be calculated by a simple formula.

SDTP at the Map Stage
According to [35], only 7% of jobs in a production MapReduce cluster are reduce-heavy. Therefore, reducing the running time at the map stage is particularly important for minimizing the response time of the entire job. As the WAN is static, each site has a fixed upload and download WAN bandwidth. In this section, $B^i_{down}$ and $B^i_{up}$ represent the download and upload bandwidths of site $i$, respectively.
Let $T^i_{load}$ represent the input data loading time of site $i$, which is the maximum of the download time ($T^i_{load,down}$) of site $i$ and the upload time ($T^j_{load,up}$) of the sites that need to transfer data to site $i$. Thus, the input data loading time of site $i$ can be formulated as Eq. (18). $T^i_{load,down}$ is equal to $\sum_{j \in D, i \neq j} x_i^j$ divided by $B^i_{down}$, where $\sum_{j \in D, i \neq j} x_i^j$ is the total amount of data that needs to be transferred to site $i$ from other sites. $T^j_{load,up}$ is the maximum upload time among the sites $j$ that need to transfer data to site $i$. Thus, we have

$$T^i_{load} = \max\left( T^i_{load,down},\; T^j_{load,up} \right), \qquad T^i_{load,down} = \frac{\sum_{j \in D, i \neq j} x_i^j}{B^i_{down}}. \tag{18}$$

To simplify the computation time calculation, we assume that the computation time of each task is fixed and $t_{map}$ is the computation time of a map task. When the number of tasks on one site exceeds its available compute slots, the tasks are usually executed in subsequent waves locally and cannot use the idle slots in other sites [3]. For example, suppose that site 1 has 50 slots and site 2 has 100 slots. If sites 1 and 2 both compute 100 map tasks, site 1 completes those tasks in two waves and site 2 completes them in one wave. Thus, the computing time of site 1 is twice that of site 2. To this end, the computation time on each site is equal to the number of tasks divided by the degree of parallelism and subsequently multiplied by the execution time of a single task. The map computation time of site $i$ can be formulated as Eq. (19). The goal of the map stage problem is similar to that of Eq. (1). Finally, when the WAN of each site is static and the computation time of each task is also fixed, the task placement problem of the map stage is formulated as problem P3. Due to the complexity of the problem, we turn to an approximation algorithm (Algorithm 1), which reduces the total response time significantly. Algorithm 1 iterates toward the optimal solution by adjusting the map task placement on each site.
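Under the static assumptions of this section, the per-site loading and computation times can be evaluated directly for a candidate placement. The Python sketch below is one possible reading of the description above, not the paper's code: the matrix layout `x[j][i]` (data moved from site j to site i), the use of a ceiling to count whole waves, and all sample values are assumptions.

```python
import math


def load_time(x, i, down, up):
    """Loading time of site i with static bandwidths.

    x[j][i]: data moved from site j to site i (x[i][i] stays local).
    down[i], up[j]: fixed download / upload bandwidths.
    """
    n = len(x)
    download = sum(x[j][i] for j in range(n) if j != i) / down[i]
    senders = [j for j in range(n) if j != i and x[j][i] > 0]
    # Upload time of a sender: everything it ships out over its uplink.
    upload = max(
        (sum(x[j][k] for k in range(n) if k != j) / up[j] for j in senders),
        default=0.0,
    )
    return max(download, upload)


def map_time(num_tasks, slots, task_time):
    """Computation time when tasks run in whole waves on local slots
    (the ceiling is an assumption made for this sketch)."""
    return math.ceil(num_tasks / slots) * task_time


# Wave example from the text: 100 tasks on 50 slots versus 100 slots.
print(map_time(100, 50, 2.0), map_time(100, 100, 2.0))

# Hypothetical placement for three sites (units are arbitrary).
x = [[10, 0, 0],
     [20, 20, 0],
     [10, 0, 40]]
down = [10.0, 5.0, 5.0]
up = [5.0, 4.0, 2.0]
print(load_time(x, 0, down, up))
```

Summing `load_time` and `map_time` for each site and taking the maximum gives the stage response time that a search procedure would try to lower.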
To accelerate the convergence of Algorithm 1 toward the optimal solution, we formulate problem P4 to obtain an initial input for Algorithm 1. We assume that the computation of the map tasks must wait for the completion of the data transmission. The input data loading time and map computation time of the map stage are both dominated by the bottleneck site. The objective of this problem can be transformed into Eq. (21). Specifically, the input data loading time in the map stage of this job is equal to the largest input data loading time across all sites (Eq. (22)). The map computation time in the map stage of this job is dominated by the maximum computation time across all sites (Eq. (23)). Therefore, if the data transfer and the computation cannot be performed simultaneously, the map task placement problem P4 is formulated with the objective of Eq. (21) together with Eqs. (22) and (23).

Problem P4 is a linear programming problem, and it can be solved in polynomial time by existing methods, such as the interior point algorithm, which are implemented in many solvers. With the initial input from P4 and the following three theorems, we design Algorithm 1 to achieve a map task placement with a reduced job response time. Intuitively, when a site has a higher response time than the others, we can migrate some tasks from this site to others to reduce the response time of the whole stage. Moreover, when all sites have the same response time at this stage, the current task placement is optimal and cannot be further optimized. Based on these intuitions, we formulate three theorems, which are proved as follows.

Proof (Theorem 1). Suppose that a task placement scheme exists in which the response times of all sites are not the same, and this task placement scheme has a minimized time in the map stage. An example is presented in Fig. 4a, where site 3 is the bottleneck site that has the longest response time, and site 2 has the minimum response time. After $t_1$, site 3 is still working, while site 2 is idle.
As all sites can transfer data to one another, if site 3 transfers a little data to site 2 such that the response times of sites 1 and 2 do not exceed the maximum time across the other sites, the number of tasks on site 3 will decrease. The computation time of site 3 will decrease with the decrease in the number of tasks. The transmission time of site 3 is the maximum of the download time of site 3 and the upload time of the sites that need to transfer data to site 3. As the data downloaded by site 3 is unchanged, the download time of site 3 is unchanged. Similarly, the upload time of the sites that need to transfer data to site 3 is also unchanged. Thus, the transmission time of site 3 will decrease or remain unchanged. Therefore, the response time of site 3 decreases, and the response times of sites 1 and 2 are both less than the response time of site 3. Finally, the response time of this stage decreases. This example can be easily extended to any number of sites. Similarly, we assume that the response times of the $n$ sites are not all the same, and this task placement scheme has a minimized time at this stage. The site with the longest time can either transfer part of its data to other sites or reduce the amount of data received from other sites to reduce its response time. In this process, it is required that the completion time of the other sites does not exceed the completion time of the bottleneck site. After that, the response time of this stage is reduced. Consequently, this result contradicts the assumption, and Theorem 1 is proven.

Proof (Theorem 2). In parallel data analytics, a stage is finished when all sites complete their allocated tasks. Thus, the entire response time is determined by the bottleneck site, and is equal to $t^{old}_{max}$. When some tasks are transferred from the bottleneck site to other sites, $t^{old}_{max}$ will be reduced.
Consider the simple case in which the tasks of the bottleneck site are only transferred to site $i$, whose response time is the minimum across all sites. After the task transfer, if the response time of site $i$ is less than $t^{old}_{max}$, the response time of site max is also decreased, and the gap between these two sites is narrowed, as is the gap among all sites. Let $t^{new}_{max}$ denote the maximum response time of the new task placement strategy that follows the above conditions. Based on Theorem 1, $t^{new}_{max}$ is less than $t^{old}_{max}$, which means that the entire response time decreases.

… down), and the transfer time of site 2 is $\max\big((k-l)/B^{1}_{up},\; (k-l)/B^{2}_{down}\big)$. In this case, the transfer times of the three sites are also all less than or equal to the transfer time in Fig. 5a, and the computation times of the three sites do not change. Therefore, Theorem 3 is proven.
□

Based on the above theorems, we propose Algorithm 1 to accelerate wide-area data analytics by balancing the response time of each site. First, we solve problem P4 with a classical linear programming algorithm (e.g., the ellipsoid method, the interior point method, or the simplex method) to determine the preliminary data transmission scheme (step 1). Thereafter, the algorithm calculates the response time $T^{i}_{e,m}$ of each site, which is the sum of the transfer time and the computation time of site $i$ (step 2). Subsequently, it calls the function decreaseInputData(), which attempts to adjust the data transmission scheme to reduce the map response time (step 3). The process is repeated until $r < b$ or $T^{m}_{e}$ no longer decreases, where $b$ is the expected maximum difference ratio of the response time across all sites, and $r$ is the actual difference ratio among the response times of all sites. For instance, let $t_2$ and $t_1$ denote the maximum and minimum response times, respectively. Then, the actual difference ratio is $r = (t_2 - t_1)/t_1$. In this process, the algorithm attempts to reduce the response time of site max at the map stage by decreasing its input data, as in Theorem 2. The amount of data is decreased by $g\%$ in each step (step 12), where $g$ is the adjusting step size. Specifically, if site $i$ transfers $x^{max}_{i}$ data to site max in the original scheduling scheme, site $i$ will decrease this volume to $x^{max}_{i} \times (1 - g\%)$ in this process.

Thereafter, to reduce the response time of the job at the map stage further, we design another function, equalResponseTime(). This function sorts the response times of all sites and divides the sites into two groups, $G_l$ and $G_s$, each containing the same number of sites (step 19). Next, it matches the sites of the two groups one by one and enables the matched sites to obtain the same response time (step 20). For example, we assume that sites $i$ and $j$ are matched, and site $i$ has a larger response time.
To obtain the same response time, site $i$ needs to transfer some data to site $j$. Assuming that the amount of transferred data is $y$, the following equation is solved:

Thus, site $i$ needs to transfer $y$ GB of data to site $j$. Subsequently, we verify whether the data transmission scheme depicted in Fig. 5a exists in the calculated data transmission scheme (step 21). Similarly, the process is repeated until $r < b$ or $T^{m}_{e}$ no longer decreases. Finally, the map task placement solution $x$ is returned, which achieves a reduced job response time.
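The stopping test and the per-step adjustment used by decreaseInputData() can be illustrated with a small sketch. It assumes that the response times are available as a plain list and that the volumes shipped to the bottleneck site are stored in a dictionary; the helper names `difference_ratio` and `shrink_inbound` and the sample numbers are hypothetical and are not taken from Algorithm 1.

```python
def difference_ratio(response_times):
    """Actual difference ratio r = (t_max - t_min) / t_min across sites."""
    t_max, t_min = max(response_times), min(response_times)
    return (t_max - t_min) / t_min


def shrink_inbound(inbound, g):
    """Reduce every volume sent to the bottleneck site by g percent.

    `inbound` maps a sender site to the data volume (GB) it ships to the
    bottleneck site; each volume x becomes x * (1 - g%).
    """
    return {site: volume * (1 - g / 100) for site, volume in inbound.items()}


# Illustrative values only (assumed, not from the evaluation).
times = [40.0, 50.0, 80.0]                  # response time per site (s)
r = difference_ratio(times)                 # (80 - 40) / 40
inbound = {"site1": 20.0, "site2": 10.0}    # GB sent to the bottleneck site
adjusted = shrink_inbound(inbound, g=5)     # one adjustment step with g = 5
```

In a full loop, the two helpers would be applied repeatedly until `r` falls below the expected ratio b or the stage time stops improving.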

SDTP at the Reduce Stage
At the reduce stage, the shuffle time $T^{i}_{shuf}$ of site $i$ can be formulated as Eq. (26). It is the maximum of the download time of site $i$ to obtain all input data ($T^{i}_{shuf,down}$) and the upload times of the sites that need to upload a certain ratio of data to site $i$ ($T^{j}_{shuf,up}$). The reduce computation time of site $i$ is presented in Eq. (27), where $t_{red}$ is the execution time of a reduce task. Hence, in this case, the task placement problem P5 of the reduce stage is formulated as follows:

Owing to the complexity of the above problem, we simplify it further. We also require that all reduce tasks are executed after the shuffle phase. Thus, the goal of the reduce stage is transformed into Eq. (29), the shuffle time of the job is denoted by Eq. (30), and Eq. (31) represents the reduce computation time of the job. The task placement problem P6 of the reduce stage can be formulated as follows:

This is a linear programming problem, and it can be solved in polynomial time by existing solvers.

However, the job response time of the above formulation is not sufficiently small. Owing to the limited computing capacity, the map tasks of one site cannot all be executed at the same time. Only a part of the tasks can be executed simultaneously, and the others have to wait. That is, the tasks are executed in different waves. Once the tasks of a wave are completed, the intermediate data generated by those map tasks can be transferred to the other sites, which will execute the corresponding reduce tasks using the generated intermediate data. Because the map phase is CPU intensive and the shuffle phase is I/O intensive, the map computation phase may overlap with the shuffle phase. However, the shuffle phase of a job must start later than its map phase, and it cannot finish earlier than its map stage. This is because the shuffle phase must wait for the intermediate data produced by the map phase.
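As a rough illustration of the two quantities discussed above, the following sketch computes a per-site shuffle time as the maximum of the download time and the senders' upload times, and a wave-based computation time. Eqs. (26) and (27) are not reproduced here, so the function names, argument layout, and the wave formula (number of waves times task time) are assumptions made for illustration only.

```python
import math


def shuffle_time(download_gb, down_bw, uploads):
    """Shuffle time of a site: max of its download time and the upload
    times of the sites sending to it.

    `uploads` is a list of (volume_gb, up_bw) pairs, one per sending site.
    """
    t_down = download_gb / down_bw
    t_up = max((vol / bw for vol, bw in uploads), default=0.0)
    return max(t_down, t_up)


def wave_time(num_tasks, slots, task_time):
    """Computation time when tasks run in waves of at most `slots` tasks
    (assumed model: ceil(tasks / slots) waves, each lasting `task_time`)."""
    return math.ceil(num_tasks / slots) * task_time


# Illustrative values only: bandwidths in GB/s, times in seconds.
t_shuf = shuffle_time(30.0, 2.0, [(10.0, 1.0), (20.0, 0.5)])
t_red = wave_time(25, 10, 12.0)
```

With these numbers the slowest sender (20 GB at 0.5 GB/s) dominates the shuffle time, and 25 tasks on 10 slots need three waves.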
Thus, we can formulate the job response time by overlapping the map computation and shuffle phases. $T_{shuf,load}$ represents the total time of the first three phases of the job. Once the first wave of map tasks on a site has been completed, the intermediate data generated by those map tasks can be transferred to other sites, which will execute their reduce tasks after obtaining all corresponding intermediate data.
Therefore, the start time of the shuffle phase on site $i$ ($T^{i}_{shuf,start}$) is equal to $T^{i}_{load} + t_{map}$, where $T^{i}_{load}$ can be calculated by Algorithm 1. The total time of the input data loading phase, map computation phase, and shuffle phase on site $i$ is $T^{i}_{shuf} + T^{i}_{shuf,start}$. The time of the first three phases of the entire job is the maximum value of $(T^{i}_{shuf} + T^{i}_{shuf,start})$ over $i \in D$, and it can be formulated as Eq. (33). Our objective is to minimize the job response time (Eq. (32)):

$$\min \; T_{shuf,load} + T_{red} \qquad (32)$$

Therefore, the task placement problem P7 can be formulated as follows:

$$\min \; T_{shuf,load} + T_{red} \quad \text{s.t. Constraints (18), (26), (31), (33), (34).}$$

Considering the above characteristics, we design an algorithm known as SDTP for reducing the response time of the entire job. The basic concept is depicted in Fig. 7. In contrast, the usual geo-distributed data analytics execution process is presented in Fig. 6. In that process, the job can only start a new phase when the previous phases on all sites have been completed. Thus, massive resources are idle during job processing. SDTP starts the data processing as soon as possible and attempts to make full use of the resources of the sites.
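The difference between the phase-by-phase execution and the overlapped execution can be sketched numerically. The sketch below assumes that each site is described by its loading time, full map computation time, and shuffle time; the dictionary layout, the function names, and the sample values are hypothetical. The second term inside `overlapped_time` is one reading of the requirement that the shuffle phase cannot finish before the map computation, since the exact form of constraints (33) and (34) is not reproduced here.

```python
def barrier_time(sites):
    """First three phases when every phase waits for the slowest site."""
    return (max(s["load"] for s in sites)
            + max(s["map"] for s in sites)
            + max(s["shuffle"] for s in sites))


def overlapped_time(sites, t_map):
    """First three phases when shuffle starts after the first map wave.

    Shuffle on a site starts at load + t_map; it may not finish before the
    site's map computation ends (assumed reading of the constraint).
    """
    return max(
        max(s["load"] + t_map + s["shuffle"], s["load"] + s["map"])
        for s in sites
    )


# Illustrative values only (seconds).
sites = [
    {"load": 10.0, "map": 30.0, "shuffle": 20.0},
    {"load": 25.0, "map": 20.0, "shuffle": 15.0},
]
t_barrier = barrier_time(sites)                 # 25 + 30 + 20
t_overlap = overlapped_time(sites, t_map=10.0)  # max(40, 50)
```

In this toy case the overlapped schedule finishes the first three phases in 50 s instead of 75 s, because neither site waits for the other between phases.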
Specifically, as shown in Algorithm 2, the input data loading time and map computation time at the map stage are first obtained for each site from $x$, which is calculated by Algorithm 1 (step 1). Moreover, the start time of the shuffle phase is determined according to $T^{i}_{load}$ and $t_{map}$ (step 2). Thereafter, the model of problem P7 is formulated. In problem P7, the map and shuffle phases overlap. The linear programming problem P7 is solved using a linear programming method to obtain the original reduce task placement solution (step 3). Thereafter, the algorithm obtains the difference ratio of the reduce tasks across all sites and attempts to decrease the job response time by decreasing the ratio of the reduce tasks on the bottleneck site. It calls the function getRedTaskPlace(), which attempts to migrate part of the reduce tasks from the straggler site, which has the highest response time, to the most vacant site, which has the lowest response time. The rate of each adjustment is $g\%$. That is, the ratios of reduce tasks at site $m$ and site $s$ change to $a_m - a_m \times g\%$ and $a_s + a_m \times g\%$, respectively.

Algorithm 2 (fragment)
… $T = T^{m}_{e}$;
10: Update the reduce ratio $a_m - a_m \times g\%$, $a_s + a_m \times g\%$;
11: Calculate the new response time $T^{i}_{e}$ of each site, maximum site $m$, minimum site $s$, and difference ratio $r$;
12: return $a$;
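A single adjustment of getRedTaskPlace() can be sketched as follows. The sketch assumes that the reduce ratios and the per-site response times are held in dictionaries keyed by site name; the function name `shift_reduce_ratio` and the sample values are hypothetical, and the recomputation of response times after each shift is left out.

```python
def shift_reduce_ratio(a, times, g):
    """Move g percent of the bottleneck site's reduce ratio to the site
    with the lowest response time and return the updated ratios."""
    m = max(times, key=times.get)   # bottleneck (maximum) site
    s = min(times, key=times.get)   # most vacant (minimum) site
    delta = a[m] * g / 100
    updated = dict(a)
    updated[m] -= delta             # a_m - a_m * g%
    updated[s] += delta             # a_s + a_m * g%
    return updated


# Illustrative values only.
ratios = {"A": 0.5, "B": 0.3, "C": 0.2}      # reduce task ratio per site
times = {"A": 90.0, "B": 60.0, "C": 40.0}    # response time per site (s)
new_ratios = shift_reduce_ratio(ratios, times, g=5)
```

The shift preserves the total ratio, so the placement remains a valid distribution of the reduce tasks.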
In Algorithm 1, the number of iterations of the while loop in lines 17 to 22 is a constant $k$, and the complexity of the sort in line 19 is $O(n \log n)$. Thus, the complexity of the function equalResponseTime() is $O(kn \log n)$. Problem P4 is a linear programming problem, which can be solved by standard solvers in polynomial time. Specifically, we adopt the interior point method to solve problem P4 with the time complexity $O(n^{3.5} L^{2})$ [36], where $n$ represents the number of variables, and $L$ denotes the scale of the problem. Thus, the complexity of step 1 is $O(n^{3.5} L^{2})$ and is higher than that of the function equalResponseTime(). Therefore, the overall complexity of Algorithm 1 is $O(n^{3.5} L^{2})$. In Algorithm 2, the most time-consuming process is calling Algorithm 1 for execution. Thus, the complexity of Algorithm 2 is also $O(n^{3.5} L^{2})$.

FURTHER IMPROVEMENT OF SDTP
In this section, the dynamic nature of the WAN bandwidth and the influence of the degree of parallelism are considered. The ever-changing WAN bandwidth and degree of parallelism in the parallel computing will seriously affect the job response time in wide-area data analytics. In this section, the computation time of tasks at each site is predicted by the nonlinear regression algorithm. Based on the more accurate time estimation, we formulate a suitable task scheduling scheme to optimize the job response time. This scheme can be generalized to dynamic situations.

Challenge of Accurate Time Estimation
A job with large input or large intermediate data can efficiently harness additional parallelism; in contrast, a job running on small input data or with less efficiently parallelizable operations will obtain few gains from extra parallelism. Therefore, it is necessary to formulate the relationship of the job running time to the size of the input data and degree of parallelism.
For recurring jobs, the running time and intermediate data sizes can be reasonably predicted [23], [37], [38]. In this section, we use the multiple nonlinear regression algorithm to predict the job running time. Based on these predictions, the job running time can be further reduced. As we calculate the task placement solution separately in each stage, we construct the related model for the running time in each individual stage.
At the map stage, the task placement problem P8 can be formulated as follows:

We first build the prediction models m1() and m2() to predict the map and reduce response times on each site (step 1). Next, we calculate the initial solution $x$ by solving P4 (step 2). Thereafter, we use the function decreaseInputData($T^{i}_{e,m}$, $x$) in Algorithm 1 to obtain a better transmission scheme at the map stage (step 4). In this function, we use the prediction model m1() to calculate the map computation time on each site. Subsequently, we follow Algorithm 2 from steps 2 to 5 to obtain the final task placement solution. In these steps, we use the prediction model m2() to calculate the reduce computation time on each site.

Challenge of Dynamic Network Bandwidth
In addition to its scarcity, the WAN bandwidth exhibits large variances. The dynamic WAN bandwidth significantly affects the response time of a wide-area data analytic job. For example, one site may have sufficient bandwidth and rich computing resources, and thus it can execute a large number of computation tasks of job A. During the execution of job A, the WAN bandwidth of the site decreases sharply, and the site has to spend more time transferring data to obtain the input data of job A. As a result, if the original task placement strategy is not changed, the response time of job A will increase dramatically. Thus, it is necessary to consider the dynamic nature of the WAN bandwidth, especially for batch jobs with a long response time.
Focusing on the dynamic WAN bandwidth, we design a task placement update module, which provides a bandwidth detection component to detect the bandwidth of each site at a given interval. When the variation in the WAN bandwidth exceeds $r$, the module changes the task placement solution according to the job execution state and the variation in the WAN bandwidth. When the job is in the input data loading phase, we first calculate the data amount of each site, and subsequently determine a new task placement at the map and reduce stages using Algorithm 3 (steps 5 to 7). If the job is in the map computation phase, we continue to complete the map computation phase, and thereafter calculate a new task placement at the reduce stage using the function getRedTaskPlace() in Algorithm 3 (steps 8 to 9). In this algorithm, we use the prediction model to predict the reduce computation time of the job on different sites. If the current time $t_c$ is larger than $t^{min}_{map}$, we do nothing, because the shuffle phase will already be transferring specific data to the corresponding sites, and a change in the task placement would seriously affect the reduce response time. The specific steps of the algorithm are presented in Algorithm 4.
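The phase-dependent decision of the update module can be summarized in a few lines. This is a sketch under stated assumptions: the function name `update_action`, the string labels, and the treatment of boundary cases (for example, a current time exactly equal to a phase boundary) are hypothetical choices, and the real module would call Algorithm 3 or getRedTaskPlace() instead of returning a label.

```python
def update_action(t_c, t_min_load, t_min_map, bw_change, threshold):
    """Decide how to react to a bandwidth change at current time t_c.

    t_min_load: earliest time at which a site finishes loading input data
    t_min_map:  earliest time at which a site finishes its map computation
    bw_change:  observed variation in the WAN bandwidth
    threshold:  variation above which the placement is reconsidered
    """
    if bw_change <= threshold:
        return "keep"                       # variation too small to act on
    if t_c <= t_min_load:
        return "replan_map_and_reduce"      # still loading input data
    if t_c < t_min_map:
        return "replan_reduce_only"         # map running, shuffle not started
    return "keep"                           # shuffle already moving data


# Illustrative values only (times in seconds, variation as a fraction).
early = update_action(5.0, 20.0, 60.0, bw_change=0.4, threshold=0.2)
middle = update_action(30.0, 20.0, 60.0, bw_change=0.4, threshold=0.2)
late = update_action(70.0, 20.0, 60.0, bw_change=0.4, threshold=0.2)
```

Keeping the decision separate from the placement computation makes it easy to invoke the check at every bandwidth detection interval.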
In Algorithm 4, there is no loop, and the most time-consuming process is calling Algorithm 3 for execution. Thus, the complexity of Algorithm 4 is dominated by Algorithm 3. In Algorithm 3, the function decreaseInputData() is called for execution. In decreaseInputData(), the number of iterations of the while loop in lines 8 to 13 is a constant $k$, and the nested while loop in lines 11 to 12 is iterated $p$ times, where $p$ is the number of sites. Hence, the complexity of the function decreaseInputData() is $O(kp)$, which is lower than the complexity of solving problem P4 in step 2. Therefore, the complexity of Algorithms 3 and 4 is $O(n^{3.5} L^{2})$.

PERFORMANCE EVALUATION
In this section, we discuss the comprehensive evaluations that were conducted to measure the performance of our methods using the Google dataset [39], [40] and the Alibaba dataset [41].

Experiment Settings
We construct two wide-area analysis environments with 10 and 30 geo-distributed heterogeneous sites, respectively. The resource capabilities are set according to Amazon EC2. More precisely, the bandwidth of each inter-site link ranges from 100 Mbps to 2 Gbps, and the number of slots on each site ranges from 10 to 100. Moreover, by default, we set the ratio between the intermediate data and input data $q$ to 0.5, the expected difference ratio among the response times of all sites $b$ to 0.1, and the adjusting step size $g$ to 5. The execution time of a map task $t_{map}$ ranges from 10 to 120 s, and the execution time of a reduce task $t_{red}$ ranges from 5 to 60 s.
We use realistic trace data sets from Google and Alibaba to emulate the geo-distributed data analytic jobs. The Google trace [39], [40] collects the information of machines, jobs, and tasks in a data center with 12.5k machines. The events of the machines, jobs, and tasks are all described by one or more records. Each record generally contains meta-information such as the timestamp, ID, event type, and resource request. The Alibaba trace [41] was published by the Alibaba Group in 2018. It contains records regarding 4k machines over a period of eight days. This trace includes many types of batch workloads, most of which are DAG jobs. The machines in the Google and Alibaba traces are randomly divided into 30 and 10 sites, respectively. Thus, the workload of each site is formed by the tasks distributed on the corresponding machines.
We compare our approach with the following methods in our evaluations.
In-Place: The default Spark approach, which runs tasks locally according to the input data placement and assigns tasks evenly to all sites in the shuffle phase.
Iridium: A recent method that improves the job response time by shuffle-optimized reduce task placement for geo-distributed jobs.
Tetrium: A state-of-the-art approach in recent years, which aims to optimize the placement of the input data and reduce tasks, as well as improve the job response time.

Performance of Our Approach
We first evaluate the performance of our approach, namely SDTP, by comparing it with several classical task placement approaches in terms of the average response time and average slowdown. We present the results of the reduction in the average response time and the reduction in the average slowdown compared to various approaches. The slowdown is defined as the reduction ratio of the response time of a single job compared to that of other approaches. For instance, if the response time of job A using In-Place is $t_1$ and the response time of job A using SDTP is $t_2$, the slowdown of job A compared to In-Place is $(t_1 - t_2)/t_1$. The average slowdown is the sum of the slowdowns of all jobs divided by the number of jobs. Fig. 8a presents the improvement of SDTP on the average job response time under different numbers of sites. From this figure, we can find that SDTP outperforms the other baseline methods significantly. In particular, when the number of sites is 10, our method reduces the average job response time of all job types by 72%, 70%, and 29% compared to In-Place, Iridium, and Tetrium, respectively.
Thus, our approach can effectively reduce the job response time. When the number of sites is 30, our method reduces the average job response time of all job types by 61%, 60%, and 19% compared to In-Place, Iridium, and Tetrium, respectively. The reductions in the average response time under the 10-site setting are more significant than those under the 30-site setting. This is because more sites result in more computing resources being required to process the job, and the overlapping time of the map computation and shuffle phases is reduced. Thus, the reduction in the average response time decreases with the increase in the number of sites.

Algorithm 4 (steps 6 to 10)
6: Obtain the new input data $A_i$ by $x$, $t_c$;
7: Calculate $x$, $a$ by Algorithm 3;
8: if $t_c > t^{min}_{load}$ and $t_c < t^{min}_{map}$ then
9: Get $a$ by getRedTaskPlace($T^{i}_{e,m}$, $x$, $B^{i}_{d,n}$, $B^{i}_{u,n}$, $S_i$);
10: return $x$, $a$;

Fig. 8b also presents the reduction in the slowdown compared to In-Place, Iridium, and Tetrium when the number of sites is varied. When the number of sites is 10, our approach reduces the response time of each job by 56% on average compared to the In-Place method. Thus, SDTP can reduce the job response time for most of the wide-area data analytic jobs. Similarly, with the increase in the number of sites, the reduction in the average slowdown decreases. This is also because with the increase in the number of sites, the sum of the computing resources required to process the job increases, and the overlapping time of the map computation and shuffle phases decreases. Furthermore, the reductions in the average response time are greater than the reductions in the average slowdown for the different baselines (e.g., 72% versus 56% for In-Place with 10 sites). As the average response time is dominated by long jobs, this demonstrates that SDTP is more effective for jobs with a long response time. Fig. 8c presents the CDF of the slowdown compared to the other approaches when the number of sites is 10.
It can be observed that the slowdown mainly ranges from 0.1 to 0.8. Moreover, compared with Iridium and In-Place, SDTP can reduce the response time of almost all jobs by at least 10%, and it can reduce the response time of 50% of the jobs by 10% to 70% compared to Tetrium. That is, SDTP can effectively reduce the response time of most of the wide-area data analytic jobs compared with the other approaches.
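The two metrics used above can be computed as in the following sketch. It assumes that each job is represented by a pair of response times (baseline, SDTP); the function names and the two sample jobs are hypothetical and are not taken from the traces.

```python
def slowdown(t_base, t_sdtp):
    """Per-job slowdown as defined in the text: (t1 - t2) / t1."""
    return (t_base - t_sdtp) / t_base


def average_slowdown(pairs):
    """Mean of the per-job slowdowns; every job counts equally."""
    return sum(slowdown(b, s) for b, s in pairs) / len(pairs)


def average_response_time_reduction(pairs):
    """Relative reduction of the average (equivalently total) response time."""
    total_base = sum(b for b, _ in pairs)
    total_sdtp = sum(s for _, s in pairs)
    return 1 - total_sdtp / total_base


# Two made-up jobs: a short one and a long one (seconds).
jobs = [(100.0, 80.0), (1000.0, 400.0)]
avg_sd = average_slowdown(jobs)                  # (0.2 + 0.6) / 2
avg_rt = average_response_time_reduction(jobs)   # 1 - 480 / 1100
```

The two metrics weight jobs differently: the average response time is dominated by long jobs, whereas the average slowdown weights every job equally.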
Thereafter, we evaluate the improvement in the average response time on jobs of different scales compared to In-Place, Iridium, and Tetrium. We classify all jobs as small-scale, medium-scale, or large-scale jobs according to the volume of input data that they require. If the amount of input data of a job is no greater than 60 GB, it is regarded as a small-scale job. If the amount of input data of a job is greater than 60 GB and no greater than 600 GB, it is classified as a medium-scale job. Otherwise, it is a large-scale job. In this experiment, the number of sites is set to 10. Fig. 9a presents the ratio of the average response time between the baselines and our approach on diverse scales. This figure demonstrates that the ratio increases with the increase of the job scale. That is, our approach is more effective for time-consuming jobs. This is because with the increase in the scale of the jobs, SDTP can use more idle resources to transfer data and process tasks.
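The classification rule stated above can be written down directly. The function name `job_scale` and the string labels are hypothetical; the 60 GB and 600 GB thresholds are those given in the text.

```python
def job_scale(input_gb):
    """Classify a job by the volume of its input data (in GB)."""
    if input_gb <= 60:
        return "small"
    if input_gb <= 600:
        return "medium"
    return "large"


# Boundary values fall into the smaller class ("no greater than").
scales = [job_scale(v) for v in (12, 60, 61, 600, 601)]
```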
At the map stage, time-consuming jobs usually result in a greater difference in the total time of the data transmission and map task execution across all sites. Thus, SDTP can use idle resources to balance the total time of the data transmission and map task execution across all sites, thereby saving more time. At the reduce stage, larger jobs result in longer map computation and shuffle times, and the job response time can be reduced further by overlapping the map computation and shuffle phases. Fig. 9b shows the slowdown of the job response time with the increase in the scale of the jobs. The reduction in the average slowdown ranges between 37% and 75% for large jobs. Thus, our approach can effectively reduce the job response time when the job scale is large. The reduction in the average slowdown increases with the increase in the job scale. This is because with the growth in the amount of input data, the response time of the jobs increases and the time that SDTP can optimize becomes larger.
Thereafter, we evaluate the influence of the different components on the average response time compared to In-Place, Iridium, and Tetrium. Fig. 10a presents the ratio of the average response time between the baselines and our approach with different components. When using only Algorithm 1 at the map stage, the ratios of the average response time under In-Place, Iridium, and Tetrium compared to SDTP are 318%, 290%, and 123%, respectively. When using only Algorithm 2 at the reduce stage, the reduction in the average job response time is greater than that when using only Algorithm 1. That is, Algorithm 2 can reduce the job response time more than Algorithm 1. When using Algorithms 1 and 2 simultaneously, the reduction in the average job response time is greater than that when using only one algorithm. Furthermore, the reduction in the average job response time using Algorithms 1 and 2 simultaneously is less than the sum of the reductions obtained when using Algorithms 1 and 2 separately. This is because optimizing the map stage with Algorithm 1 changes the input of the reduce stage: Algorithm 1 makes the data distribution more balanced among different sites, which leaves less room for improvement by Algorithm 2. Fig. 10b depicts the reduction in the average slowdown compared to the other approaches. It can be observed that the reduction in the average slowdown when using Algorithm 2 is slightly greater than that when using Algorithm 1. This demonstrates that Algorithm 2 is more effective in reducing the job time.
Furthermore, when using Algorithms 1 and 2 simultaneously, the reduction in the average slowdown is greater than the reduction in the average slowdown when using only one algorithm.

Impact of Varied Parameters
In this section, we quantify the impact of diverse parameters on SDTP, including the ratio of the intermediate data to the input data at the map stage $q$, the number of slots, the expected maximum difference ratio $b$, and the adjusting step size $g$.

Fig. 11a depicts the influence of $q$. The figure indicates the ratio of the response time to $T$ for different $q$ values, where $T$ is the response time when $q = 1$. It can be observed that the job response time increases with the increase in $q$. This is because a larger $q$ produces more intermediate data, and both transmitting the intermediate data in the shuffle phase and handling it at the reduce stage increase the overall response time. Fig. 11b illustrates the reduction in the average response time with different $q$ values compared to In-Place, Iridium, and Tetrium. It can be observed that, with the increase in $q$, the reduction in the average response time increases compared to Tetrium, whereas it is relatively stable compared to In-Place and Iridium. The reason is that, with the increase in $q$, the amount of intermediate data increases, and SDTP can effectively reduce the response time at the reduce stage by overlapping the map computation and shuffle phases compared to Tetrium. As the job response times of In-Place and Iridium are very long, SDTP can make full use of the idle resources to reduce the job response time, and thus the reduction in the average response time compared to In-Place and Iridium remains consistently large.

Different numbers of slots will also affect the job response time. Fig. 12a indicates that the reduction in the average job response time decreases with the increase in the number of slots. In this experiment, the number of slots on each site ranges from 100 to 1000 when the ratio of the slot number is 1.
When the ratio of the slot number is 0.1, the number of slots on each site is 0.1 times the number of slots used when the ratio is 1.
Thus, the number of slots increases with the increase in the ratio of the slot numbers. The figure indicates the ratio of the response time to T with different ratios, where T is the response time when the ratio of slot numbers is 1. It can be observed that the job response time is reduced with the increase in the number of slots. When a site has more slots, it can process the map and reduce tasks more rapidly, and thus, the time of the map and reduce computation phases is decreased.
The reduction in the average response time with different numbers of slots compared to In-Place, Iridium, and Tetrium is illustrated in Fig. 12b. It can be observed that the reduction in the average response time decreases with more slots. This is because when sites have more slots, the map and reduce computation times decrease, and thus, the time that SDTP can improve is limited, particularly at the reduce stage.
Finally, we measure the influence of the expected maximum difference ratio $b$ and the adjusting step size $g$. Fig. 13a shows the influence of $b$. We count the number of iterations and show the ratio of the response time to $T$ for different $b$ values, where $T$ is the response time when $b = 0.01$ at the map stage. With the increase of $b$, the average response time grows. Specifically, the average response time when $b = 1$ is 10% higher than that when $b$ is set to 0.01. That is, the reduction in the response time increases as $b$ declines. On the other hand, the smaller the value of $b$, the more iterations the algorithm requires, which means a longer execution time. To balance the efficiency and performance of Algorithm 1, we set $b$ to 0.1. When $b = 0.1$, the average response time is only 0.75% higher than that when $b = 0.01$. Moreover, the average number of iterations for scheduling one job when $b = 0.1$ is about 20 times smaller than that when $b = 0.01$. Fig. 13b shows the influence of the adjusting step size. It presents the ratio of the average response time to $T$ for different $g$, where $T$ is the average response time when $g = 1$.
With the increase of g, the average response time also increases. This indicates that using smaller g results in less job response time.

Impact of Parallelism
Considering the impact of parallelism in parallel computing, we first evaluate the accuracy of our prediction method on the response time at different stages. Thereafter, we modify our algorithms and other benchmarks by computing more accurate computation time with our prediction method, and then analyze the deviation of the unmodified approaches from the actual values. Finally, we evaluate the performance of SDTP+.
We measure the time of multiple queries with different data amounts and degrees of parallelism running on Spark using BigDataBench [24]. Based on the results, we use the multiple linear regression algorithm to construct the prediction model for the computation time in each stage. The results demonstrate that the $R^2$ statistics are all larger than 0.9, where $R$ is the correlation coefficient. The value of the F-statistic is larger than the critical value from the F distribution table, and the probabilities $p$ corresponding to the F-statistics are all less than 0.0001. That is, the computation time is strongly correlated with the amount of input data and the degree of parallelism, and thus the prediction model is effective.
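How such a model can be fitted and checked with the $R^2$ statistic is sketched below in a deliberately simplified form. The sketch assumes a single derived feature, the data amount divided by the degree of parallelism, and an ordinary least-squares line; the regression form actually used for m1() and m2() is not given in this section, and the sample measurements are synthetic values generated from T = 2 + 3 D/P.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = c0 + c1 * x (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1


def r_squared(xs, ys, c0, c1):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (c0 + c1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic samples: (data amount in GB, degree of parallelism, time in s).
samples = [(10, 2, 17.0), (20, 4, 17.0), (40, 4, 32.0),
           (60, 5, 38.0), (80, 4, 62.0)]
xs = [d / p for d, p, _ in samples]   # assumed feature: data per parallel unit
ys = [t for _, _, t in samples]
c0, c1 = fit_line(xs, ys)
r2 = r_squared(xs, ys, c0, c1)
```

Because the synthetic data are exactly linear in the assumed feature, the fit recovers the generating coefficients; real measurements would yield an $R^2$ below 1.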
To quantify the impact of the degree of parallelism, we analyze the absolute percentage error (APE) of different approaches. The APE is calculated as $\mathrm{APE} = |T - T_p| / T \times 100\%$. For example, $T$ is the average response time of the In-Place algorithm, and $T_p$ is the average response time when the computation time is calculated by the prediction model in the In-Place algorithm. Fig. 14a presents the APEs of the different algorithms for varying job scales. The APEs of the algorithms are all greater than 20%.
Thus, if we only use the times of the map and reduce tasks and the number of computation slots to calculate the computation time at different stages, the final results of the job response time will differ significantly from the actual job response time. Moreover, the APE of Tetrium is larger than those of the other approaches because Tetrium considers the influence of the heterogeneity of computing resources on different sites. The computation time for each site in Tetrium is equal to $V / S_i \times t$, where $V$ is the amount of data and $t$ is the execution time of a single task.
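For concreteness, the APE and the slot-based computation time quoted above can be evaluated as in the following sketch. The function names and the numeric inputs are hypothetical; only the two formulas come from the text.

```python
def ape(t, t_pred):
    """Absolute percentage error: |T - T_p| / T * 100."""
    return abs(t - t_pred) / t * 100


def slot_based_time(volume, slots, task_time):
    """Computation time V / S_i * t, as quoted for Tetrium."""
    return volume / slots * task_time


# Illustrative values only.
error = ape(100.0, 125.0)                      # response times in seconds
t_site = slot_based_time(200.0, 50, 10.0)      # 200 data units on 50 slots
```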
The reduction in the average response time on different job scales compared to the In-Place, Iridium, and Tetrium methods is illustrated in Fig. 14b. It can be observed that the improvement in the average response time of all jobs is between 7% and 14%. Furthermore, when the scale of jobs is large, the improvement in the average response time is between 20% and 26%. That is, the reduction in the average response time increases with the increase in the amount of data. Note that, due to the limitations of our experiment environment, the size of the input data used in this experiment is no more than 100 GB per site, which means that many large-scale jobs are not considered. However, as shown in Fig. 9a, the larger the input data, the greater the advantage of our algorithms over the other benchmarks. If these large-scale jobs were considered in this experiment, the advantage of our algorithms would become greater.

Dynamic Bandwidth
Fig. 15a presents the influence of the dynamic WAN links on the job response time. The changed bandwidths of all sites are randomly generated between 0.1 and 2 GB/s. We measure the reduction in the average response time of SDTP++ compared to SDTP+ when the WAN bandwidth changed on different job scales. It can be observed from Fig. 15a that there is a large reduction in the average response time. Thus, if the WAN bandwidth is varied while the task placement remains unchanged, the job response time will become very long. When the scale of the job is medium or large, the reduction in the average response time is very large compared to that of small-scale jobs. Thus, SDTP++ is more effective for time-consuming jobs.
The reduction in the average response time with different WAN bandwidths is illustrated in Fig. 15b. In this figure, the value of the abscissa $d$ is the difference in the WAN bandwidth. For example, when $d$ is 0.1, the WAN bandwidth $B_i$ of each site ranges from $B_i \times 0.9$ to $B_i \times 1.1$. It can be observed that the reduction in the average response time increases with an increase in $d$. The reason for this is that when $d$ increases, a link with sufficient bandwidth may become the bottleneck link, and the original data transmission scheme will cause a longer transmission time.
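The bandwidth variation controlled by d can be emulated as in the sketch below. It assumes that each perturbed bandwidth is drawn uniformly from the interval given in the text; the sampling distribution, the function name `perturb_bandwidths`, and the seed are assumptions, since the text only states the range.

```python
import random


def perturb_bandwidths(bandwidths, d, rng):
    """Draw each new bandwidth uniformly from [B_i * (1 - d), B_i * (1 + d)]."""
    return [rng.uniform(b * (1 - d), b * (1 + d)) for b in bandwidths]


# Illustrative values only: original bandwidths in GB/s, d = 0.1.
rng = random.Random(0)                   # fixed seed for repeatability
original = [1.0, 0.5, 2.0]
changed = perturb_bandwidths(original, 0.1, rng)
```

A larger d widens each interval, which makes it more likely that a previously fast link becomes the bottleneck under the old placement.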

CONCLUSIONS AND FUTURE WORK
Cloud service providers and research institutes deploy data centers or edge clusters globally, which generate large volumes of data across geo-distributed locations. We have proposed a novel scheduling mechanism known as SDTP for wide-area data analytics. This method attempts to balance the data transfer and task computation, and begins the tasks as early as possible. Moreover, SDTP provides more accurate time estimation and can be generalized to dynamic situations. The evaluation results demonstrate that SDTP outperforms existing state-of-the-art methods and significantly improves the job response time.
In future work, we plan to realize our method on popular big data frameworks. Doing so poses two main challenges. First, in current big data frameworks (e.g., Hadoop and Spark), the tasks of each stage are executed only after all tasks have obtained their required input data. In our approach, however, we assume that a site can begin task computation as soon as it obtains its own required input data. Thus, realizing a new task scheduling component that satisfies this requirement is challenging. Second, with the sites distributed across different regions, task failures are more likely to occur due to the unstable wide-area network, and such failures may lower the performance of our methods. Thus, coping with task failures and ensuring persistent data transmission remains an open problem.
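The difference between the two execution models in the first challenge can be made concrete with a deliberately simplified sketch. It is not the SDTP scheduler: it assumes only two phases per site (one transfer followed by one computation), per-site times that are known in advance, and hypothetical function names. Under a stage barrier, no site computes until the slowest transfer has finished; without the barrier, each site computes as soon as its own input has arrived.

```python
def barrier_completion(transfer_times, compute_times):
    """Stage-barrier model: computation starts after every transfer ends."""
    return max(transfer_times) + max(compute_times)


def per_site_completion(transfer_times, compute_times):
    """Per-site model: each site computes once its own input has arrived."""
    return max(t + c for t, c in zip(transfer_times, compute_times))


if __name__ == "__main__":
    # Hypothetical per-site times (seconds) for three sites.
    transfer_times = [4.0, 1.0, 2.0]
    compute_times = [1.0, 5.0, 2.0]
    print(barrier_completion(transfer_times, compute_times))
    print(per_site_completion(transfer_times, compute_times))
```

In this two-phase model the per-site completion time can never exceed the barrier completion time, since each site's own transfer and compute times are bounded by the respective maxima. The gap is largest when the site with the slowest transfer is not the site with the slowest computation.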
Ori Rottenstreich received the BSc degree in computer engineering and the PhD degree in electrical engineering from Technion, Haifa, Israel. He is currently an assistant professor with the Department of Computer Science and the Department of Electrical Engineering, Technion, Haifa, Israel. Previously, he was a postdoctoral research fellow with Princeton University.
Jie Wu (Fellow, IEEE) is the associate vice provost for international affairs with Temple University. He also serves as the chair and Laura H. Carnell professor with the Department of Computer and Information Sciences. Prior to joining Temple University, he was a program director with the U.S. National Science Foundation and was a distinguished professor with Florida Atlantic University. His research interests include mobile computing and wireless networks, routing protocols, cloud and green computing, network trust and security, and social network applications. He regularly publishes in scholarly journals, conference proceedings, and books. He serves on several editorial boards, including the IEEE Transactions on Service Computing and the Journal of Parallel and Distributed Computing.