An optimized resource scheduling strategy for Hadoop speculative execution based on non-cooperative game schemes

Hadoop is a well-known parallel computing system for distributed computing and large-scale data processing. “Straggling” tasks, however, have a serious impact on task allocation and scheduling in a Hadoop system. Speculative Execution (SE) is an efficient method of handling “Straggling” tasks: it monitors the real-time running status of tasks and selectively backs up “Stragglers” on another node to increase the chance of completing the entire job early. Present speculative execution strategies face challenges in the misjudgement of “Straggling” tasks and the improper selection of backup nodes, which lead to inefficient speculative execution. This paper proposes an Optimized Resource Scheduling strategy for Speculative Execution (ORSE) by introducing non-cooperative game schemes. The ORSE transforms the resource scheduling of backup tasks into a multi-party non-cooperative game problem, where the tasks are regarded as game participants, whilst the total task execution time of the entire cluster is the utility function. The most beneficial strategy can then be implemented on each computing node when the game reaches a Nash equilibrium point, which yields the final resource scheduling scheme. The strategy has been implemented in Hadoop-2.x. Experimental results show that the ORSE can maintain the efficiency of speculative execution and improve fault tolerance and computation performance under Normal Load, Busy Load and Busy Load with Skewed Data.


Introduction
In recent years, from the pace of Internet information technology to the booming trend of … but its performance is poor in heterogeneous environments because it uses task progress to determine whether a task is a "Straggler". Many researchers therefore began to optimize SE from other aspects, and several optimized strategies have been proposed [Liu, Jin, Liu et al. (2016)], such as LATE, MCP and ERUL. These strategies identify "straggler" tasks by estimating the remaining time of running tasks, but an inaccurate estimate leads to improper node allocation. At the same time, if there are multiple "Stragglers" in the cluster, the performance of speculative execution greatly affects the overall performance of the cluster, which implies that the scheduling strategy for backup tasks is very important. Based on a study of speculative execution as a whole, we propose an Optimized Resource Scheduling model for Speculative Execution based on non-cooperative game theory (ORSE). In the ORSE algorithm, the resource scheduling of backup tasks is transformed into a classic multi-party non-cooperative game problem: the game participants are the group of backup tasks, the game strategies are the nodes in the cluster, and the game's utility function is the cluster's overall task execution time. When the game reaches the Nash equilibrium, the task scheduling scheme is obtained.

Related works
Liu et al. [Liu, Jin, Liu et al. (2016)] list the three core components of a speculative execution strategy:
• Finding "straggler" tasks while they are running;
• Selecting a suitable backup node;
• Making sure that the benefit to the cluster of starting the backup task is greater than that of not starting it.
Hadoop considered these three components from the beginning of its design and implemented the original speculative execution strategy, which is called "Hadoop-Naive". Since Hadoop-Naive shows many deficiencies in heterogeneous cloud environments, Zaharia et al. first proposed the heuristic speculative execution strategy called LATE. This strategy uses the remaining execution time of a task as the priority for identifying "straggler" tasks, and also considers the choice of a proper backup node [Cheng, Rao, Guo et al. (2014)]. LATE improves on the Naive strategy to a certain extent, but many problems have been found in practice, such as errors in estimating a task's remaining time and the failure to consider the impact of real-time workload on task execution. Therefore, Zhang et al. [Zhang, Zhang, Li et al. (2016)] further proposed the heuristic strategy "ERUL", which exploits the linear relationship between system load and task remaining time. The MCP strategy [Chen, Liu and Xiao (2014)] maximizes the benefit of the cluster by establishing a maximum cluster performance model and guarantees that the backup task brings more benefit to the cluster than the original task, but the model does not consider the value of the node itself when calculating the benefit. The Ex-MCP strategy optimizes MCP by taking the node value into the benefit model [Wu, Li, Tang et al. (2014)].
In addition to strategies such as LATE, MCP and Ex-MCP, domestic and foreign researchers have also explored speculative execution from different aspects and proposed their own optimization plans. SSE (Smart Speculative Execution) is an optimization strategy based on node classification, which depends on the hardware performance of a node and the amount of computational data on it [Liu, Cai, Fu et al. (2016)]. Wang et al. [Wang, Lu, Lou et al. (2015)] proposed Partial Speculative Execution (PSE), which starts the speculative execution from a checkpoint of the original task instead of restarting the entire process, and thereby enhances Hadoop performance. Since speculative execution is a classical space-for-time trade-off, most optimization strategies ignore the storage space occupied by backup tasks. Therefore, Liu et al. proposed a speculative execution strategy based on multi-objective space-time optimization, which addresses the load balancing problem during speculative execution using an extreme learning machine and a multi-objective space-time optimization algorithm [Liu, Jin, Liu et al. (2016)]. SECDT is a speculative execution algorithm that estimates the completion time of a scheduled task with a C4.5 decision tree [Li, Yang, Lai et al. (2015)]. The ATAS strategy improves the success rate of backup tasks by reducing the reaction time and starting backup tasks quickly [Yang and Chen (2015)]. Data skew has always been one of the factors that result in "straggler" tasks; the FlexSlot strategy adaptively changes the number of slots on each compute node to mitigate the problem of data skew [Guo, Rao, Jiang et al. (2017)].
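The remaining-time heuristic that LATE relies on can be illustrated with a minimal Python sketch. It assumes the usual LATE estimate, in which the progress rate is the progress score divided by the elapsed time and the time left is the unfinished fraction divided by that rate. The function names and the sample task values are hypothetical and are not taken from Hadoop or from the experiments in this paper.

```python
def estimated_time_left(progress_score: float, elapsed_s: float) -> float:
    """LATE-style estimate: (1 - progress) / progress_rate."""
    if not 0.0 < progress_score <= 1.0:
        raise ValueError("progress_score must be in (0, 1]")
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    progress_rate = progress_score / elapsed_s  # progress per second so far
    return (1.0 - progress_score) / progress_rate


def pick_straggler(tasks: dict) -> str:
    """Return the task with the longest estimated time left.

    tasks maps a task name to (progress_score, elapsed_seconds).
    """
    return max(tasks, key=lambda name: estimated_time_left(*tasks[name]))


# Hypothetical running tasks: (progress score, seconds since start).
tasks = {"map_1": (0.9, 90.0), "map_2": (0.3, 90.0), "map_3": (0.6, 60.0)}
straggler = pick_straggler(tasks)
```

Because the estimate extrapolates from the average rate so far, a task that slowed down only recently, or one that simply has more input data, can be misranked, which is the weakness the strategies above try to address.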

Model and algorithm
Hadoop resource scheduling refers to the assignment of different tasks to different nodes by a ResourceManager (RM). Traditional resource scheduling algorithms such as FIFO and the Capacity Scheduler suffer from low cluster utilization, which results in low cluster performance. Many scholars have therefore worked on optimizing resource scheduling along various dimensions, and various optimization strategies based on game theory have also been proposed [Li (2016); Zhang and Zhou (2017)]. This chapter differs from traditional resource scheduling optimization algorithms in that the backup tasks produced by speculative execution, together with the tasks executing and waiting on the nodes, select computing nodes through a game, and the Nash equilibrium allocation scheme of that game is sought, thereby improving the utilization rate of the cluster's computing nodes. This article adopts the non-cooperative game model mainly because its mathematical theory has been proved and extended many times, especially the existence of Nash equilibrium points [Czumaj and Vöcking (2002)]. The purpose of introducing game theory in this chapter is to schedule the backup tasks generated by speculative execution without affecting the execution of normal tasks on cluster nodes. Backup tasks and original tasks can still be evenly distributed over the nodes of the cluster, thereby making greater use of cluster computing resources.
The algorithmic model for resource scheduling in speculative execution in this chapter is based on the following preconditions:
• There are several nodes in the cluster, and different tasks can select different nodes for execution;
• The ultimate goal of optimization is that the backup task can be executed at the fastest speed, and the model needs to sense the task status of the current cluster in real time;
• When the scheduling scheme is optimal, the benefit of the corresponding cluster is the largest;
• There are multiple backup tasks at the same time in the execution process.
The model is a finite game for three reasons. First, in Yarn, users can customize the number of dropped tasks, so the number of backup tasks in Hadoop is limited. Second, a backup task can only be assigned to run on one node, and the number of nodes in the cluster is set in advance, so the number of nodes to which backup tasks can be allocated is limited. Finally, when a task is assigned to a node, the node's operating efficiency and the task's data volume are fixed, so the task's running time is also determined.
There are two further concepts in the non-cooperative game: the mixed strategy (also called hybrid strategy) and the Nash equilibrium. A mixed strategy means that a task in the cluster selects each node with a certain probability. In the node scheduling of speculative backup tasks there are multiple tasks and multiple nodes, so each task has a certain probability of being assigned to a given node, in line with the mixed strategy. A Nash equilibrium is a state of the game in which no participant can obtain a higher benefit by changing its own strategy. The core of the game theory introduced in this chapter is to find a balanced node allocation scheme for backup tasks, in which no task can achieve a better benefit by changing the node it runs on; such a scheme is a Nash equilibrium solution. This model therefore satisfies the non-cooperative game model.
In the Hadoop platform, tasks are submitted by users and the number of tasks is limited. It has been explained above that the game model in this chapter is a finite non-cooperative game. Therefore, according to the existence principle of Nash equilibrium points in non-cooperative game theory, a finite non-cooperative game model must have one or more Nash equilibrium points. This rules out the situation in which the resource scheduling model in this paper has no solution, thereby ensuring that there is a corresponding Nash equilibrium solution for each resource scheduling problem.

Design of a resource scheduling algorithm model based on a non-cooperative game
From the above it can be seen that this article uses a resource scheduling model based on a non-cooperative game. The input and output of this model under Hadoop speculative execution are as follows.
Input 1: A set of backup tasks generated by speculative execution. The game focuses on multiple participants; if there is only one task, the problem is equivalent to a unilateral optimization problem. Therefore, the model needs a group of backup tasks, which corresponds to the participants $P$ in the game model. It is assumed that there are $n$ backup tasks in the model and that the amount of data for each backup task is $\sigma_n$.
Input 2: Compute nodes in the cluster. Since there are backup tasks, there must be computing nodes in the cluster on which to run them. The computing nodes correspond to the game strategies $S$. Assume that there are $m$ computing nodes in the cluster; the average rate at which a node processes tasks is used as the execution rate of the node, $v_m$.
Output: Total task execution time. The RM assigns the backup tasks to different compute nodes. Each node then needs to complete both the backup task and the original tasks already assigned to it, and the output is the resulting total execution time.
In the Hadoop platform, each task is assigned by the RM to a computing node in the cluster, and each task has a certain probability of being assigned to any given node. The probability that task $i$ is assigned to node $k$ is $p_i^k$, and the workload of task $i$ is $\sigma_i$. Assuming that the task group assigned by the RM to node $k$ is denoted as $M_k$, the execution time of the node is the ratio of the total amount of work to be processed on the node to the node's execution efficiency, as shown in formula (1).
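As a hedged illustration, one reading of formula (1) that is consistent with the definitions above is $time_k = \frac{\sum_{i \in M_k} p_i^k \sigma_i}{v_k}$, that is, the expected work on node $k$ divided by its execution rate $v_k$. The name $time_k$, the Python function name and the sample numbers below are assumptions made for this sketch, not the authors' notation or data.

```python
def expected_node_time(workloads, probs_on_k, rate_k):
    """Expected execution time of node k.

    workloads[i]  = sigma_i, the workload of task i in M_k
    probs_on_k[i] = p_i^k, the probability that task i is assigned to node k
    rate_k        = v_k, the execution rate of node k
    """
    if rate_k <= 0:
        raise ValueError("rate_k must be positive")
    if len(workloads) != len(probs_on_k):
        raise ValueError("workloads and probs_on_k must have the same length")
    # Expected amount of work that lands on node k.
    expected_load = sum(p * s for p, s in zip(probs_on_k, workloads))
    return expected_load / rate_k


# Hypothetical values: three tasks (MB of input) and a node rate in MB/s.
sigma = [128.0, 64.0, 256.0]
p_k = [1.0, 0.5, 0.25]
node_time = expected_node_time(sigma, p_k, rate_k=8.0)
```

With these values the expected load is 128 + 32 + 64 = 224 MB, so the node needs 28 seconds at 8 MB/s.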
The resource scheduling model for backup tasks in speculative execution is a hybrid strategy and can be represented by the pair $(T, N)$, where $T$ is the set of backup tasks in the cluster and $N$ is the set of compute nodes in the cluster. For each backup task $t$, the data volume is $\sigma_t$, whilst for each compute node $n$, the processing rate is $v_n$. Therefore, when the RM allocates the backup task $t$ to node $k$, the running time of node $k$ is the execution time of the backup task plus the execution time of the original task group waiting on the node, which is shown in formula (2), where $v_k$ is the processing rate of computing node $k$, $p_i^k$ is the probability that the original task $i$ is assigned to node $k$, $\sigma_i$ is the data quantity of each task in the task set to be completed on the node, and $\sigma_t$ is the workload of the backup task $t$ allocated by the RM. After the scheduler determines all the backup task allocation strategies, a single backup task cannot achieve a higher benefit by changing its own computing node. At this point, the Nash equilibrium of the non-cooperative game-based Hadoop speculative execution resource scheduling model is reached. At the Nash equilibrium point of the game, the RM cannot improve the benefit of the whole cluster by changing the scheduling strategy of any current task, so the model must satisfy the condition in formula (3), where $p_t^k$ is the probability that task $t$ is assigned to node $k$, and $time\_cost_t^k$ is the cost of task $t$ being scheduled by the RM to node $k$, as given in formula (2). $N$ is the set of compute nodes in the cluster. Therefore, the resource scheduling model for speculative backup tasks can be transformed into a classic non-cooperative game problem.
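The cost function and the equilibrium condition can be sketched in Python under stated assumptions. The sketch reads formula (2) as $time\_cost_t^k = \frac{\sigma_t + \sum_{i} p_i^k \sigma_i}{v_k}$ and restricts itself to the pure-strategy case, in which each $p_t^k$ is either 0 or 1. The expected original work $\sum_i p_i^k \sigma_i$ is passed in as `base_load`. All function names, the best-response search and the sample values are hypothetical; this is not the authors' implementation.

```python
from typing import Dict


def time_cost(task: str, node: str, assignment: Dict[str, str],
              sigma: Dict[str, float], base_load: Dict[str, float],
              rate: Dict[str, float]) -> float:
    """Completion time seen by backup task `task` if it runs on `node`."""
    # Work of the other backup tasks currently placed on the same node.
    others = sum(sigma[t] for t, k in assignment.items()
                 if k == node and t != task)
    return (sigma[task] + base_load[node] + others) / rate[node]


def is_nash(assignment, sigma, base_load, rate) -> bool:
    """True if no backup task can lower its cost by moving alone."""
    for task, node in assignment.items():
        current = time_cost(task, node, assignment, sigma, base_load, rate)
        for other in rate:
            moved = time_cost(task, other, assignment, sigma, base_load, rate)
            if moved < current - 1e-12:
                return False
    return True


def best_response(assignment, sigma, base_load, rate, max_rounds=1000):
    """Let each task in turn move to its cheapest node until nobody moves."""
    assignment = dict(assignment)  # work on a copy
    for _ in range(max_rounds):
        moved = False
        for task in assignment:
            cheapest = min(rate, key=lambda k: time_cost(
                task, k, assignment, sigma, base_load, rate))
            new_cost = time_cost(task, cheapest, assignment,
                                 sigma, base_load, rate)
            old_cost = time_cost(task, assignment[task], assignment,
                                 sigma, base_load, rate)
            if new_cost < old_cost - 1e-12:
                assignment[task] = cheapest
                moved = True
        if not moved:
            return assignment
    raise RuntimeError("best-response search did not settle")


# Hypothetical cluster: workloads in MB, rates in MB/s.
sigma = {"b1": 100.0, "b2": 60.0, "b3": 40.0}
base_load = {"n1": 200.0, "n2": 50.0, "n3": 0.0}
rate = {"n1": 10.0, "n2": 5.0, "n3": 2.0}
start = {"b1": "n1", "b2": "n1", "b3": "n1"}
equilibrium = best_response(start, sigma, base_load, rate)
```

In this example the search settles after one round of moves: `b1` goes to `n2`, `b3` goes to `n3` and `b2` stays on `n1`, after which no task can shorten its own completion time by moving alone. The `max_rounds` guard is there because the sketch does not rely on any convergence guarantee.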
The participants of the game are the set of backup tasks, the game strategies are the different computing nodes in the cluster, and the utility function of the game is the final completion time of the task. In a distributed cloud environment, the optimization goal of task scheduling is that the task is completed as soon as possible, so the game's utility function is the task's completion time. That is, if a task $t$ is scheduled to a computing node $k$, the cluster obtains a gain, and this gain is the profit of task $t$ on node $k$. The individual task profit function is shown in formula (4), while the overall cluster's profit function is shown in formula (5), where $a_{tk}$ is the game strategy, i.e. the plan in which task $t$ is dispatched to node $k$, and $d_i$ is scheduling scheme $i$. Each time the RM schedules backup tasks, the metric for a task is its completion time, whilst the metric for the scheduling strategy is the overall running time of the tasks on the cluster. These two metrics are calculated according to formulae (6) and (7).
In formula (6), $\sigma_i$ is the workload of task $i$; $N_t^k$ represents the set of original tasks that are uncompleted on node $k$ when backup task $t$ is scheduled to node $k$; $v_k$ is the execution rate of node $k$; and $\sigma_t$ is the workload of the backup task $t$ that is scheduled onto node $k$ during execution. In formula (7), $\sigma_j$ is the workload of task $j$; $v_p$ is the execution rate of node $p$; $M$ is the set of compute nodes in the cluster; and $M_p$ represents the set of tasks assigned to node $p$ in the entire cluster.
In an actual cloud environment, the cluster load may be either high (the number of nodes is smaller than the number of tasks currently to be processed) or low. In this strategy, the RM generates a set of possible execution nodes for each backup task produced during speculative execution. This set consists of the $q$ highest-benefit computing nodes for the backup task $t$, where $q$ is the minimum of the number of tasks and the number of nodes; it is called the set of possible running nodes of the backup task, as shown in formula (8), where $P_i^M$ denotes the $M$ nodes in the cluster and $N$ is the number of backup tasks that need to be scheduled. Because the number of tasks and the number of nodes are both limited, the scheduling strategy for the backup tasks is a classical finite non-cooperative game problem. Therefore, according to the definition of non-cooperative games, there must be a Nash equilibrium solution, i.e. the Nash scheme. When the sets of possible processing nodes of two or more backup tasks intersect, these tasks form a conflicting task set, named $Conflict\_Tasks$, as shown in formula (9).
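The construction of the possible-node sets and of the conflicting task set can be sketched as follows. The sketch assumes that "highest benefit" means the lowest estimated completion time of the backup task on a node, given the original work already waiting there, and that two tasks conflict when their candidate sets share at least one node. The function names and the sample values are hypothetical.

```python
from itertools import combinations


def candidate_nodes(task, sigma, base_load, rate, q):
    """Return the q nodes with the lowest estimated completion time."""
    def cost(node):
        return (sigma[task] + base_load[node]) / rate[node]
    return set(sorted(rate, key=cost)[:q])


def conflict_tasks(candidates):
    """Tasks whose candidate set intersects that of at least one other task."""
    conflicts = set()
    for a, b in combinations(candidates, 2):
        if candidates[a] & candidates[b]:
            conflicts.update((a, b))
    return conflicts


# Hypothetical cluster: workloads in MB, rates in MB/s.
sigma = {"b1": 100.0, "b2": 60.0}
base_load = {"n1": 200.0, "n2": 40.0, "n3": 0.0, "n4": 400.0}
rate = {"n1": 10.0, "n2": 5.0, "n3": 2.0, "n4": 10.0}

q = min(len(sigma), len(rate))  # minimum of task count and node count
candidates = {t: candidate_nodes(t, sigma, base_load, rate, q) for t in sigma}
conflicts = conflict_tasks(candidates)
```

Here both backup tasks prefer nodes `n1` and `n2`, so they end up in the conflicting set and their placement has to be settled by the game; tasks with disjoint candidate sets would not appear in it.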

Implementation and critical steps of the resource scheduling algorithm model
The overall flow chart of the resource scheduling algorithm model is shown in Fig. 1. The critical steps of the resource scheduling scheme for speculative execution are as follows.
• Step 1: The RM determines the number of backup tasks that need to be started based on the cluster resource operating status and the current number of "Straggler" queues in the cluster. At the same time, the RM can estimate the remaining time …
• Step 3: If the potential execution node sets of two or more tasks conflict, a $Conflict\_Tasks$ set is generated.
• Step 4: A non-cooperative game strategy is applied to $Conflict\_Tasks$ to find a Nash equilibrium solution, which may involve multiple scheduling schemes.
• Step 5: The most beneficial scheme is determined from the Nash equilibrium solution according to $time\_cost_t$ and $Cluster\_Profit$.
• Step 6: According to Step 5, the RM schedules backup tasks in the cluster. Corresponding nodes execute the tasks following the instructions of the RM.
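Steps 4 and 5 can be sketched with a brute-force search that is only practical for small conflict sets. The sketch enumerates every placement of the conflicting tasks on their candidate nodes, keeps the placements in which no task can finish earlier by moving alone, and returns the one with the lowest cluster running time. Reading the cluster metric as the sum of node running times is an assumption, as are all names and sample values; the selection rule actually used by ORSE may differ.

```python
from itertools import product


def node_times(assignment, sigma, base_load, rate):
    """Running time of every node under the given placement."""
    load = dict(base_load)
    for task, node in assignment.items():
        load[node] += sigma[task]
    return {k: load[k] / rate[k] for k in rate}


def is_nash(assignment, candidates, sigma, base_load, rate):
    """True if no task finishes earlier by moving alone to another candidate."""
    times = node_times(assignment, sigma, base_load, rate)
    for task, node in assignment.items():
        for other in candidates[task]:
            if other == node:
                continue
            # Time the task would see on `other` after moving there alone.
            moved = times[other] + sigma[task] / rate[other]
            if moved < times[node] - 1e-12:
                return False
    return True


def best_nash_schedule(tasks, candidates, sigma, base_load, rate):
    """Return (placement, total time) of the cheapest equilibrium placement."""
    best, best_total = None, float("inf")
    for choice in product(*(sorted(candidates[t]) for t in tasks)):
        assignment = dict(zip(tasks, choice))
        if not is_nash(assignment, candidates, sigma, base_load, rate):
            continue
        total = sum(node_times(assignment, sigma, base_load, rate).values())
        if total < best_total:
            best, best_total = assignment, total
    return best, best_total


# Hypothetical conflict set: two backup tasks competing for two nodes.
tasks = ["b1", "b2"]
candidates = {"b1": {"n1", "n2"}, "b2": {"n1", "n2"}}
sigma = {"b1": 100.0, "b2": 60.0}
base_load = {"n1": 200.0, "n2": 40.0}
rate = {"n1": 10.0, "n2": 5.0}
schedule, total_time = best_nash_schedule(tasks, candidates, sigma,
                                          base_load, rate)
```

In this example two of the four placements are equilibria, and the one that puts `b1` on `n1` and `b2` on `n2` has the lower summed running time. If the cluster metric were instead taken as the maximum node time, the other equilibrium would be chosen, which shows how much the result depends on the assumed metric.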

Experiments and evaluation
In order to fully analyse the performance of the ORSE, a series of comparative experiments was designed and conducted in a heterogeneous distributed environment, covering job execution time, cluster throughput, and speculative execution accuracy. The heterogeneous environment was built on servers in the lab, where eight nodes were created in the cluster, as shown in Tab. 1. A Hadoop-2.6 system was installed on them. The LATE and MCP strategies were also deployed in the cluster for comparison. Each group of experiments was run five times to ensure the accuracy of the results, and the performance comparison results were then obtained. The data sets used in this experiment are all provided by Purdue University's performance testing benchmark suite, including WordCount and Sort. The WordCount input is 50 GB, with 200 Map tasks and 16 Reduce tasks; the Sort input is 30 GB, with 200 Map tasks and 15 Reduce tasks.

Performance evaluation metrics
In this paper, two major metrics are chosen for performance evaluation: task execution time and cluster throughput.
• Task Execution Time: Task Execution Time is the completion time of a task, an important indicator of the performance of an optimized algorithm in the Hadoop system.
• Cluster Throughput: Cluster Throughput is defined as the number of jobs that the cluster runs per unit of time.
Three scenarios are designed to examine the potential performance of the ORSE strategy in a heterogeneous distributed environment: the Normal Load Scenario, the Busy Load Scenario and the Busy Load with Data Skew Scenario.
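The two metrics and the relative comparisons between strategies can be computed as in the following minimal sketch. It assumes that a statement such as "X% less execution time" means a reduction relative to the baseline strategy, and that a throughput improvement is likewise relative to the baseline. The function names and the numbers are hypothetical and are not measurements from this paper.

```python
def cluster_throughput(jobs_completed: int, elapsed_s: float) -> float:
    """Number of jobs the cluster completes per second."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return jobs_completed / elapsed_s


def relative_reduction(baseline: float, candidate: float) -> float:
    """Percentage by which `candidate` is lower than `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


def relative_gain(baseline: float, candidate: float) -> float:
    """Percentage by which `candidate` is higher than `baseline`."""
    return 100.0 * (candidate - baseline) / baseline


# Hypothetical measurements for a baseline and an optimized strategy.
time_saving = relative_reduction(baseline=400.0, candidate=300.0)   # seconds
throughput_gain = relative_gain(
    baseline=cluster_throughput(10, 3600.0),
    candidate=cluster_throughput(12, 3600.0),
)
```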

Performance of the ORSE strategy in the heterogeneous environment
Normal load scenario
In the normal load scenario, a low-load cluster was configured with sufficient resources. Using the original task initialization strategy in the Hadoop system, each file was split into several blocks of 64 MB, each acting as the input of one Map task. Data skew was avoided by setting the input file size to an integer multiple of 64 MB. The execution time of each job and the cluster throughput were measured for the WordCount and Sort datasets; the results are shown in Fig. 2 and Fig. 3, which illustrate the behaviour of the different strategies when running WordCount and Sort respectively. As shown in Fig. 2, ORSE outperforms LATE and MCP in the execution time of the WordCount job: the improvement is about 23.2% over LATE and about 6.1% over MCP. In terms of cluster throughput, ORSE improves on LATE by about 25.8% and on MCP by about 9.7%. Similar trends can be seen in the results for Sort. By design, ORSE builds on the straggler decision rule of LWR-SE, which improves prediction accuracy compared with LATE, MCP and other strategies that use the average execution rate of a task to calculate its remaining time. After the corresponding backup tasks are generated, the non-cooperative game model is used to find the possible execution nodes of each backup task group, and the node resources in the cluster are scheduled according to the Nash equilibrium scheduling scheme to ensure the execution efficiency of the backup tasks.

Busy load scenario
As mentioned earlier, game theory is introduced mainly to solve the problem of how to schedule cluster resources dynamically so as to maximize cluster resource utilization when the cluster is under high load. Therefore, in the experiment design in this chapter, three of the eight nodes in the cluster were selected to perform I/O operations such as file reading tasks, which compete for resources and simulate a high-load cluster. In this situation, the accuracy of straggler monitoring and the accuracy of backup node selection become particularly important. The experimental results are shown in Fig. 4 and Fig. 5. As can be seen from Fig. 4, for WordCount, ORSE on average takes 27.3% less task execution time than LATE, 13.4% less than MCP and 6.1% less than LWR-SE. Moreover, ORSE improves cluster throughput by 43.2% over LATE, 15.6% over MCP and 9.5% over LWR-SE. Similarly, Fig. 5 shows that ORSE gains a corresponding degree of optimization for Sort compared with MCP, LATE and LWR-SE. On average, ORSE executes jobs 31.9% faster than LATE, 18.3% faster than MCP and 9.4% faster than LWR-SE, and it improves cluster throughput by 49.1% over LATE, 21.9% over MCP and 11.1% over LWR-SE. When cluster resources are in short supply, LATE can misjudge "Stragglers" and run their backups on slow nodes. MCP optimizes performance through its cluster benefit guarantee strategy, but in Hadoop 2.x and 3.x the concept of Slot is replaced by Container: all the resources in a cluster are Container resources, which are not divided into Map and Reduce types. Therefore, the accuracy of MCP cannot be guaranteed. LWR-SE avoids some of the misjudgements of MCP; however, the efficiency of its node selection is still partially insufficient.
ORSE dynamically schedules the node computing resources of backup tasks on the basis of LWR-SE. That is, LWR-SE calculates the execution time and benefit of each backup task, and ORSE then finds the set of possible nodes for each task and reaches the Nash equilibrium, so as to ensure that the free computing resources in the cluster are used to the maximum when a set of backup tasks appears in the cluster.

Busy load with data skew scenario
In a practical cloud environment, data skew is common, especially in the Map stage, and it results in the misjudgement of Stragglers because the input data differ in size. In order to create a data skew scenario, the WordCount and Sort jobs were run on 30 GB of data in total, with 100 MB of input data in each file. Each input file was divided into two data blocks of 64 MB and 36 MB by the Hadoop self-split strategy. As in the previous Busy Load Scenario, WordCount and Sort jobs were submitted every 150 seconds, and the task execution time and cluster throughput were recorded and compared with those of LATE, MCP and LWR-SE. According to Fig. 6, the performance of the ORSE is significantly improved: it takes 46.9% less job execution time than LATE, 18.4% less than MCP and 8.9% less than LWR-SE. Moreover, the ORSE improves the cluster throughput by 47.2% over LATE, 23.1% over MCP and 11.8% over LWR-SE.
Similarly, as can be seen from Fig. 7, ORSE also achieves improvements when executing Sort jobs in a busy cluster with data skew. In terms of task execution time, the ORSE finishes jobs 37.8% faster than LATE, 21.3% faster than MCP, and 12.1% faster than LWR-SE. As for cluster throughput, the ORSE achieves an increase of 45.9% over LATE, 22.9% over MCP, and 7.3% over LWR-SE. When data skew exists in the cluster, some tasks appear to execute slowly. Their slowness is not caused by a genuinely low execution rate but by the larger volume of data they process. Therefore, these tasks are not really straggler tasks. The performance of LATE and MCP is significantly reduced when the data are skewed, because they use the remaining progress and the average rate to estimate execution time, which introduces estimation errors. The ORSE strategy proposed in this chapter improves on the accuracy of LWR-SE. Even if a task with a large amount of data is misjudged as a slow task, the RM finds the Nash equilibrium solution according to the possible processing set and the conflict set of each task. The tasks are then evenly distributed over the cluster's compute nodes, so that task execution time and cluster throughput are not affected.
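The block split that creates the skew in this scenario can be reproduced with a few lines of Python. The sketch assumes fixed-size splitting with a 64 MB block, as described above, and a hypothetical function name; it is not Hadoop's input split code, which applies further rules.

```python
def split_sizes(file_mb: int, block_mb: int = 64) -> list:
    """Sizes (in MB) of the blocks a file of `file_mb` MB is cut into."""
    if file_mb < 0 or block_mb <= 0:
        raise ValueError("file_mb must be >= 0 and block_mb must be > 0")
    full_blocks, remainder = divmod(file_mb, block_mb)
    # All full blocks first, then one smaller trailing block if needed.
    return [block_mb] * full_blocks + ([remainder] if remainder else [])


skewed = split_sizes(100)      # input used in the data skew scenario
balanced = split_sizes(128)    # an integer multiple of the block size
```

A 100 MB file yields one 64 MB block and one 36 MB block, so two Map tasks of the same job receive different amounts of input, whereas a 128 MB file yields two equal blocks and no skew.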

Conclusion
In order to address the low efficiency of the backup node allocation strategy in classical speculative execution algorithms, this paper proposes ORSE, a hybrid resource scheduling strategy for speculative execution based on a non-cooperative game, which transforms the resource scheduling problem of backup tasks into a classic non-cooperative game problem. The participants of the game are the set of backup tasks. The game strategies are the computing nodes in the cluster. The utility function of the game is the final completion time of the task. The experimental results show that ORSE performs better than the LATE, MCP, and LWR-SE strategies, makes greater use of the cluster's computing resources, and improves cluster throughput and the efficiency of speculative execution.