Research on Hadoop Job Scheduling Algorithms Based on Dynamic Fusion

An adaptive genetic algorithm and an improved ant colony algorithm are combined to solve the Hadoop job scheduling problem. First, the global search ability of the adaptive genetic algorithm is used to generate the list of resources allocated to the tasks. When the search speed of the genetic algorithm gradually decreases, the optimal time to fuse the adaptive genetic algorithm with the ant colony algorithm is determined dynamically. The initial pheromone distribution of the ant colony algorithm is generated from the best solution found by the adaptive genetic algorithm. The target node selection strategy of the ant colony algorithm is improved to take into account each node's success rate in completing tasks, which speeds up the search for the optimal solution. Simulation results show that, compared with the genetic algorithm and the ant colony algorithm alone, the hybrid algorithm takes less time, and its advantage becomes more pronounced as the number of tasks grows.


Introduction
Hadoop [1] is an open-source big data processing framework from the Apache Foundation. Its core consists of the HDFS (distributed file system) and MapReduce (parallel computing model) modules [2]. HDFS stores large-scale data sets efficiently. MapReduce divides the work to be processed by an application into several small pieces, making it easy for developers to write distributed applications. Together, the two allow users to write distributed programs at the top level without having to know the details of the lower levels, and to make full use of the cluster for distributed, high-speed storage and computing. The combination of HDFS and MapReduce is what makes Hadoop powerful [3].
Job scheduling is an important factor affecting the performance of the Hadoop platform. Many researchers have adopted particle swarm optimization (PSO) [4], ant colony optimization (ACO) [5], and genetic algorithms (GA) [6][7] to solve the Hadoop cluster job scheduling problem. To address the slow convergence and redundant operations of the genetic algorithm, reference [8] adopted a dual objective function and an optimal-solution retention strategy to improve job scheduling efficiency. Reference [9] applies an adaptive job scheduling scheme to the ant colony algorithm and exploits the positive feedback of the ant colony algorithm to improve the efficiency of Hadoop cluster job scheduling. Reference [10] first classifies tasks using resource signatures and then iterates over configuration schemes using a genetic algorithm, which significantly improves cluster utilization. The Hadoop job scheduling algorithm based on a hybrid genetic algorithm adopted in this paper does not require the administrator to preset values, which enhances dynamic adaptability and improves the overall performance of the platform.

Hadoop job scheduling problem model
The Hadoop job scheduling problem is as follows: $n$ tasks are to be executed on $k$ task-tracker nodes, each task can be assigned to only one node, and the optimal schedule of these $n$ tasks on the $k$ nodes is to be found. To describe the problem model, the following conventions are made: $P_i$ is a subtask, $T_j$ is a task node, and $ETG_{ij}$ is the expected execution time of subtask $P_i$ on node $T_j$. The expected execution times of the tasks on the nodes then form an $n \times k$ matrix:

$$ETG = \begin{pmatrix} ETG_{11} & \cdots & ETG_{1k} \\ \vdots & \ddots & \vdots \\ ETG_{n1} & \cdots & ETG_{nk} \end{pmatrix} \tag{1}$$

If $m$ tasks are assigned to the $j$-th task-tracker node for execution, then the time $time(j)$ at which node $j$ completes all of its tasks is:

$$time(j) = \sum_{i=1}^{m} ETG_{ij} \tag{2}$$

where the sum runs over the $m$ tasks assigned to node $j$. After all tasks have been allocated to the resource nodes for execution, the execution time used is the maximum value of $time(j)$. Therefore, the total execution time is:

$$TotalTime = \max_{1 \le j \le k} time(j) \tag{3}$$
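The model above can be illustrated with a short Python sketch. It is only an illustration under assumptions: the names `node_times`, `total_time`, `etg`, and `assignment` are hypothetical, nodes and tasks are indexed from 0, and the execution times are made-up values.

```python
from typing import List, Sequence


def node_times(etg: Sequence[Sequence[float]],
               assignment: Sequence[int],
               k: int) -> List[float]:
    """Return time(j) for each of the k nodes.

    etg[i][j] is the expected execution time of task i on node j.
    assignment[i] is the index of the node that task i is assigned to.
    """
    times = [0.0] * k
    for task, node in enumerate(assignment):
        # Each task contributes its expected time only to its own node.
        times[node] += etg[task][node]
    return times


def total_time(etg: Sequence[Sequence[float]],
               assignment: Sequence[int],
               k: int) -> float:
    """TotalTime is the largest per-node completion time."""
    return max(node_times(etg, assignment, k))


# Made-up example: 3 tasks, 2 nodes.
etg = [
    [2.0, 4.0],
    [3.0, 1.0],
    [5.0, 6.0],
]
assignment = [0, 1, 0]  # tasks 0 and 2 on node 0, task 1 on node 1

print(node_times(etg, assignment, 2))  # [7.0, 1.0]
print(total_time(etg, assignment, 2))  # 7.0
```

In this example node 0 is the bottleneck, so moving a task off node 0 is the only way to reduce the total execution time.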

Basic idea of algorithm
The basic idea of this algorithm is to use a dynamic fusion strategy to combine the Adaptive Genetic Algorithm (AGA) with the Ant Colony Optimization algorithm (ACO) into a hybrid optimization algorithm (AGA-ACO). AGA and ACO are combined dynamically as follows: before the best fusion time ta, AGA is used to produce the pheromone distribution; after ta, ACO is used to generate the best solution. The dynamic strategy is as follows: (1) set the minimum generation Genemin and the maximum generation Genemax; (2) during evolution, calculate the evolution rate of the offspring in each generation and set the minimum evolution rate of the offspring, Genemin-ratio; (3) if the evolution rate of the offspring is less than Genemin-ratio for several consecutive generations, the optimization speed of the GA is low, so the GA is terminated and the ant colony algorithm is run.
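The switching rule can be sketched in Python as follows. This is a minimal sketch under assumptions, not the paper's implementation: the evolution rate is assumed to be the relative improvement of the best TotalTime between consecutive generations, `gene_die` is an assumed name for the number of consecutive slow generations, and all times are assumed to be positive.

```python
from typing import Sequence


def should_switch_to_aco(best_history: Sequence[float],
                         gene_min: int,
                         gene_max: int,
                         gene_min_ratio: float,
                         gene_die: int) -> bool:
    """Decide whether to stop the GA and start the ant colony algorithm.

    best_history[g] is the best TotalTime found up to generation g
    (lower is better, assumed positive).
    """
    generation = len(best_history)
    if generation >= gene_max:
        return True  # hard upper limit on GA generations
    if generation < gene_min or generation <= gene_die:
        return False  # too early to judge the evolution rate
    # Assumed evolution rate: relative improvement per generation,
    # checked over the last gene_die generation-to-generation steps.
    recent = best_history[-(gene_die + 1):]
    rates = [(prev - cur) / prev for prev, cur in zip(recent, recent[1:])]
    return all(rate < gene_min_ratio for rate in rates)


slowing = [100.0, 90.0, 80.0, 79.9, 79.85, 79.84]
improving = [100.0, 90.0, 80.0, 70.0, 60.0, 50.0]
print(should_switch_to_aco(slowing, 3, 50, 0.01, 3))    # True
print(should_switch_to_aco(improving, 3, 50, 0.01, 3))  # False
```

Checking several consecutive generations rather than a single one avoids switching on one unlucky generation while the GA is still making progress.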

Adaptive genetic algorithm
3.2.1. Genetic coding and fitness function. A real-number coding scheme is used in this paper: the allocation of resources is represented by a list of real numbers, and the length of the chromosome equals the number of tasks. For example, with 10 tasks to be executed on four task-tracker nodes, a chromosome has length 10. TotalTime is the objective function: the smaller TotalTime is, the greater the fitness of the chromosome.
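A possible encoding and fitness function, sketched in Python under assumptions: each gene is taken to be the 0-based index of the node assigned to the corresponding task, and the fitness is assumed to be the reciprocal of TotalTime, which is one simple way to make smaller times yield larger fitness. The function names and the uniform execution times are hypothetical.

```python
import random
from typing import List, Sequence


def random_chromosome(num_tasks: int, num_nodes: int,
                      rng: random.Random) -> List[int]:
    """One gene per task; each gene is the index of the assigned node."""
    return [rng.randrange(num_nodes) for _ in range(num_tasks)]


def chromosome_total_time(chromosome: Sequence[int],
                          etg: Sequence[Sequence[float]],
                          num_nodes: int) -> float:
    """TotalTime of the schedule encoded by the chromosome."""
    times = [0.0] * num_nodes
    for task, node in enumerate(chromosome):
        times[node] += etg[task][node]
    return max(times)


def fitness(chromosome: Sequence[int],
            etg: Sequence[Sequence[float]],
            num_nodes: int) -> float:
    """Assumed fitness: reciprocal of TotalTime (times must be positive)."""
    return 1.0 / chromosome_total_time(chromosome, etg, num_nodes)


# 10 tasks on 4 nodes, every task taking 1.0 time unit on every node.
etg = [[1.0] * 4 for _ in range(10)]
all_on_one_node = [0] * 10
balanced = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]

print(fitness(all_on_one_node, etg, 4))  # 0.1
print(fitness(balanced, etg, 4) > fitness(all_on_one_node, etg, 4))  # True
print(len(random_chromosome(10, 4, random.Random(0))))  # 10
```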

3.2.2. Adaptive genetic strategy. The adaptive genetic algorithm adjusts the crossover probability and the mutation probability as follows. The original crossover probability is $p_c$ and the adjusted crossover probability is $P_c$, where $f_{c1}$ is the fitness value of the best chromosome involved in crossover before the crossover operation and $f_{c2}$ is the fitness value of the best chromosome obtained after crossover.
The original mutation probability is $p_m$ and the adjusted mutation probability is $P_m$.
By adjusting the weighting factors, the selection can be focused on certain types of nodes. For CPU-intensive tasks, for example, the weight of the pheromone corresponding to CPU processing capacity can be increased appropriately, so that nodes with strong CPU processing capacity are selected more easily.
The estimated execution time of tasks on each node obtained by genetic algorithm is converted into the pheromone

Pheromone update.
If the expected execution time of a task on node i is within the specified range, the node is considered a valid node. Ants are affected by pheromone concentration as they travel, and the pheromone update is determined by whether the node is a valid node and by the information brought back by the backward ants. When a task is assigned to the node at time $t_1$, the pheromone value of the node is updated, where $\tau_i(t_1)$ is the pheromone level of the node and the remaining coefficient is a regulatory factor. At time $t_2$, when the task assigned to the node either executes successfully and is released or fails, the pheromone is updated again.
10. Initialize the pheromone on the nodes according to equations (7) and (8).
11. Set the parameters and end conditions of the ant colony algorithm.
12. Before searching with the ant colony algorithm, first check the task list. If the task list is empty, place the ants on the nodes at random; otherwise, place m forward ants at the recommended node, select the next node j according to formula (11), and put j into $tabu_k$.
13. If node j is a valid node, a backward ant is generated at that node, and the pheromone is updated according to equations (9) and (10). If the optimal solution is found successfully, the node is added to the task list.
14. Update the success rate of each node and the global pheromone according to the backward ant information.
15. If the end condition of the ant colony algorithm is met, the algorithm ends; otherwise, go to step 12.
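The node selection in step 12 can be illustrated with the following Python sketch. Formula (11) is not reproduced here, so the weighting below is an assumed stand-in: each candidate node is weighted by its pheromone raised to `alpha` times its task success rate raised to `beta`, and nodes in the tabu list are skipped. The function name and the parameters `alpha` and `beta` are hypothetical.

```python
import random
from typing import Sequence, Set


def select_next_node(pheromone: Sequence[float],
                     success_rate: Sequence[float],
                     tabu: Set[int],
                     rng: random.Random,
                     alpha: float = 1.0,
                     beta: float = 1.0) -> int:
    """Pick the next node by roulette-wheel selection.

    Assumed weight of node j: pheromone[j]**alpha * success_rate[j]**beta.
    Nodes already in the tabu list are never selected.
    """
    candidates = [j for j in range(len(pheromone)) if j not in tabu]
    if not candidates:
        raise ValueError("all nodes are in the tabu list")
    weights = [(pheromone[j] ** alpha) * (success_rate[j] ** beta)
               for j in candidates]
    # random.Random.choices raises ValueError if every weight is zero.
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
tabu = {0}  # node 0 has already been visited by this ant
node = select_next_node([1.0, 2.0, 4.0], [0.9, 0.5, 0.8], tabu, rng)
tabu.add(node)
print(node in (1, 2))  # True
```

Multiplying by the success rate means that a node with a high pheromone level but a history of failed tasks is chosen less often than the pheromone alone would suggest.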

Combining the adaptive genetic algorithm with the ant colony algorithm
1. Set the time for dynamic fusion: (1) the maximum generation Genemax is reached; or (2) the evolution rate of the offspring stays below Genemin-ratio for Genedie consecutive generations, where Genedie is a predefined value.
2. Set the initial pheromone of the ant colony algorithm: the initial pheromone distribution is taken from the result of the GA. The pheromone on the path (i, j) is defined as follows:
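One simple way to seed the pheromone from a GA result is sketched below in Python. It is an assumed scheme for illustration rather than the definition referred to above: every (task, node) pair starts from a constant `tau_base`, and the pairs used by the best GA chromosome receive an extra `tau_bonus`. Both parameter names and their values are hypothetical.

```python
from typing import List, Sequence


def initial_pheromone(best_chromosome: Sequence[int],
                      num_nodes: int,
                      tau_base: float = 1.0,
                      tau_bonus: float = 2.0) -> List[List[float]]:
    """Build a task-by-node pheromone table from the best GA solution.

    best_chromosome[i] is the node index that the GA assigned to task i.
    """
    pheromone = [[tau_base] * num_nodes for _ in best_chromosome]
    for task, node in enumerate(best_chromosome):
        # Reinforce the assignment chosen by the GA.
        pheromone[task][node] += tau_bonus
    return pheromone


for row in initial_pheromone([0, 2, 1], num_nodes=3):
    print(row)
# [3.0, 1.0, 1.0]
# [1.0, 1.0, 3.0]
# [1.0, 3.0, 1.0]
```

Keeping a nonzero base value on every pair lets the ants still explore assignments that the GA did not choose.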

Analysis of experimental results
The simulated environment consisted of 20 resource nodes, and task sets of 30 to 100 tasks were used for testing. Simulation experiments were carried out on the genetic algorithm (GA), ACO, and AGA-ACO. To test the effectiveness of the algorithm, the control parameters of GA, ACO, and AGA-ACO were set to the same values, and the running time and number of iterations of the algorithms were compared. The minimum task set is shown in Table 1.
To compare the execution time and cost of AGA, ACO, and AGA-ACO on larger task sets, a larger task set was set up as shown in Table 2, with an expectation cost in the range [10, 50]. The control parameters of the three algorithms were the same, and the execution time and execution cost were compared.
As can be seen from Figure 1, when the number of tasks is relatively small, GA has strong search ability and takes little time, but as the number of tasks increases, its time increases rapidly. The ACO algorithm converges rapidly to the optimal solution as pheromone accumulates, and its time grows more slowly than that of the AGA algorithm. The AGA-ACO algorithm combines the advantages of the two algorithms: its advantage is not obvious when the number of tasks is small, but becomes prominent as the number of tasks increases.
Figure 2 compares the number of iterations that AGA, ACO, and AGA-ACO need to obtain an optimal solution for different numbers of tasks. The comparison shows that the AGA algorithm requires the most iterations and is therefore the most time-consuming. AGA-ACO uses the global search ability of the GA in the early stage to quickly obtain the initial pheromone distribution and, after the optimal fusion point, uses the positive feedback mechanism of the ACO algorithm to find the optimal solution quickly, so it requires fewer iterations than the GA and ACO algorithms.

Conclusion
To address the Hadoop job scheduling problem, this paper combines the adaptive genetic algorithm with the ant colony algorithm to find the optimal task allocation. In the early stage, the fast global search ability of AGA is used; in the later stage, the positive feedback mechanism of ACO is used, so that the two algorithms complement each other. During execution, the algorithm considers both the resource capacity of each node and the estimated completion time of each task, and then allocates each task to the most suitable node. Compared with the three common schedulers of Hadoop, this algorithm improves the efficiency of job scheduling and reduces the administrator's cost of managing nodes.