Storm Scheduling Based on Non-cooperative Game

With the continuous development of Internet technology, large volumes of valuable data are being generated. Delayed analysis reduces the timeliness of this data, so real-time stream processing is becoming increasingly important. Storm is a pure streaming data processing framework: it extracts source data in real time through the “Spout” component, passes it to the logical processing component “Bolt”, and finally loads it into the target warehouse. By default, however, Storm uses a polling (round-robin) scheduling algorithm, which ignores both the network communication overhead between workers and cluster load balancing. To address this shortcoming of the default scheduler, a non-cooperative game-based Storm scheduling algorithm (G-Storm) is proposed. The experimental results show that the game scheduling algorithm proposed in this paper reduces the system processing delay by 28.6% compared with the default scheduling algorithm.


Introduction
With the continuous development of Internet technology, massive amounts of valuable data are being generated, and delayed analysis undermines the real-time usefulness of that data. For example, Facebook generates more than 10 petabytes of data per month, and more than 100,000 tweets are posted on Twitter per minute. The data continuously generated by such activity accumulates in the source database, and much of it becomes stale, unprocessed data. This hinders real-time processing, analysis, and mining of the source data, so decision makers cannot be given strong information support. Massive amounts of data therefore need to be processed in real time. Storm is a pure streaming data processing framework, but it uses polling scheduling by default, which leaves room for improvement in network communication between nodes and in cluster load balancing.
Reference [1] proposed two adaptive scheduling algorithms for the Storm environment, one offline and one online. The online algorithm formulates its scheduling strategy from real-time monitoring of node loads and cluster communication loads, which makes up for the shortcomings of offline scheduling; for complex topologies, however, it easily falls into a local optimum. Xiong Anping and others introduced the concept of topological hot edges [2]. Their algorithm migrates task pairs connected by high-frequency hot edges to the same working node, but it only considers optimizing high-frequency hot-edge communication within the topology. Lu Liang proposed a weight-based task scheduling algorithm [3], which designs an edge weight gain model and moves tasks to the nodes with larger edge weight gain values. However, it can leave one working node with a load far lower than that of the remaining nodes. Liu Su proposed a topology-based task scheduling strategy [4], which first moves the threads corresponding to the components with the largest topology to the nodes with sufficient CPU resources, but moving the tasks to a node as far as possible must affect

Related definitions
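In Storm scheduling, the goal is to maximize the traffic inside the nodes (which in turn minimizes the traffic between nodes) while keeping the cluster load balanced. A revenue function (game function) is introduced to find the optimal solution of the system scheduling. For this model, the following definitions are made.
Define $f$ as the weighted value of the total data transmission within the nodes, as shown in equation (1):
$$f = \alpha F, \qquad F = \sum_{k \in S} \sum_{i,j} r_{ij}^{S_k} \tag{1}$$
Where: $\alpha$ is the weight of the internal traffic of the nodes, $F$ is the total amount of internal data transmission of the nodes, $S$ is the cluster slot set, $r_{ij}$ is the data flow between tasks $i$ and $j$, and $r_{ij}^{S_k}$ is the data flow $r_{ij}$ that stays inside the $k$-th slot.
Define $\sigma_W$ and $\sigma_M$ as the standard deviations of the CPU and memory load of the work nodes, as shown in formulas (2) and (3):
$$\sigma_W = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left( W_k - \overline{W} \right)^2}, \qquad \overline{W} = \frac{1}{n} \sum_{k=1}^{n} W_k \tag{2}$$
$$\sigma_M = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left( M_k - \overline{M} \right)^2}, \qquad \overline{M} = \frac{1}{n} \sum_{k=1}^{n} M_k \tag{3}$$
Where: $W_k$ and $M_k$ are the CPU and memory load of work node $k$, $n$ is the number of work nodes, $\overline{W}$ is the average CPU load of the cluster, and $\overline{M}$ is the average memory load of the cluster.
The scheduling optimization problem can then be transformed into the problem of maximizing the revenue function $u$ under the conditions of formulas (4), (5) and (6).
Where: $g$ represents the weighted sum of the cluster CPU and memory load standard deviations, $g = \beta \sigma_W + \gamma \sigma_M$, and $\beta$ and $\gamma$ represent the weights of $\sigma_W$ and $\sigma_M$ respectively. Generally $\beta > \gamma$, because the impact of CPU load on the system is the main consideration.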
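The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function and parameter names are hypothetical, the default weights are the values 6, 4 and 2 reported in the experiments, and the combination `u = f - g` is an assumption, since the text only states that internal traffic should be maximized while load imbalance is taken into account.

```python
from math import sqrt


def internal_traffic(flows, assignment):
    """F: sum of the flows r_ij whose two tasks are placed in the same slot.

    flows: dict mapping (task_i, task_j) to the data flow between them.
    assignment: dict mapping each task to its slot.
    """
    return sum(
        rate
        for (task_i, task_j), rate in flows.items()
        if assignment[task_i] == assignment[task_j]
    )


def load_std(loads):
    """Population standard deviation of per-node loads."""
    mean = sum(loads) / len(loads)
    return sqrt(sum((load - mean) ** 2 for load in loads) / len(loads))


def revenue(flows, assignment, cpu_loads, mem_loads,
            alpha=6.0, beta=4.0, gamma=2.0):
    """Revenue u for one placement.

    ASSUMPTION: u = f - g. The text does not give the exact form of u.
    """
    f = alpha * internal_traffic(flows, assignment)
    g = beta * load_std(cpu_loads) + gamma * load_std(mem_loads)
    return f - g
```

In practice the traffic term and the load terms are measured in different units, so some normalization would probably be needed before the weights become meaningful; that step is left out of this sketch.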

Game modeling
This paper constructs a non-cooperative game model of Storm topology control. The players (game participants) in the model are the tasks in the scheduling system. A participant adopts a new strategy when changing its own strategy, while the other participants' strategies remain unchanged, improves the overall performance of the system, as shown in equation (7). The vector of all participants' strategies is called the scheduling strategy vector. According to the definition of a Nash equilibrium point, when the constructed Storm task scheduling control model reaches Nash equilibrium through the game, the system revenue can be considered to have reached a steady state: no task can improve its own utility or the overall utility of the system by changing only its own scheduling strategy. A strategy set is a Nash equilibrium point of the proposed multi-task resource allocation game model if and only if it satisfies formula (8).
The existence of a Nash equilibrium in the proposed game model is established using the Nash equilibrium existence theorem II [6]. First, $f$ is a non-empty bounded closed convex set on Euclidean space. As the game model is optimized, the load standard deviation gradually decreases and stabilizes; therefore, $g$ is a non-empty bounded closed concave set on Euclidean space. In summary, the revenue function $u$ is a bounded closed convex set, so a maximum of $u$ exists for topology task scheduling, and hence there is a Nash equilibrium point in the game model of the Storm cluster topology scheduling problem.
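For reference, the strategy vector and the equilibrium condition discussed in this section can be written in standard game-theoretic notation. The symbols $s_i$, $s_{-i}$ and $N$ are assumed here for illustration and may differ from the notation of the paper's own equations (7) and (8); $S$ is the cluster slot set and $u$ is the revenue function.

```latex
% Assumed notation: s_i is the slot chosen by task i,
% s_{-i} collects the choices of all other tasks, N is the number of tasks.
s = (s_1, s_2, \dots, s_N), \qquad s_i \in S

% Nash equilibrium s^*: no task can raise the revenue u
% by changing only its own slot.
u\left(s_i^{*},\, s_{-i}^{*}\right) \;\ge\; u\left(s_i,\, s_{-i}^{*}\right)
\qquad \forall\, i \in \{1, \dots, N\},\ \forall\, s_i \in S
```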

Algorithm Design
The premise of this algorithm is to monitor the operating status of the cluster and to consider load balancing while maximizing the internal traffic of the nodes. The algorithm first runs the default scheduling algorithm. Once operation is stable, it collects and stores the data flow between the tasks in each node together with the CPU load and memory load of each node. While the default algorithm is running, if the monitored CPU load stays above the load threshold for longer than the duration threshold, the Storm scheduling algorithm based on the non-cooperative game is triggered. During Storm topology operation, if the CPU load of the nodes in the cluster remains uneven, that is, if within a certain interval the difference between the maximum and minimum CPU load of the working nodes is greater than the threshold, the game scheduling algorithm is triggered again.
The game algorithm first collects the default scheduling information and then determines whether to trigger game scheduling. If the number of tasks initially scheduled on the cluster nodes is unbalanced, the initialization is repeated. The algorithm then iterates through all the tasks in each node, moves each task to the different nodes in turn, and records the optimal revenue value. The total number of traversals is M (50 to 100); if the optimal value has not changed, it is taken as the final optimal value. This scheduling algorithm takes an adaptive perspective and uses an iterative approach to maximize the cluster's revenue, thereby deriving the optimal scheduling strategy. Because each task must traverse the slots, calculating the optimal revenue consumes a certain amount of time.
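The two trigger conditions described above could be checked roughly as follows. This is a sketch under stated assumptions: the function names and the sample format are hypothetical, the 60% load threshold and 80-second duration are the values reported in the experimental setup, and the imbalance threshold has no default because the text does not give one.

```python
def stayed_above(samples, threshold, duration):
    """Return True if the most recent run of readings above `threshold`
    spans at least `duration` seconds.

    samples: list of (timestamp_seconds, value) pairs in time order.
    """
    if not samples:
        return False
    latest_time = samples[-1][0]
    run_start = None
    # Walk backwards from the newest reading until the load drops.
    for timestamp, value in reversed(samples):
        if value <= threshold:
            break
        run_start = timestamp
    return run_start is not None and latest_time - run_start >= duration


def node_overloaded(cpu_samples, load_threshold=0.60, duration=80.0):
    """First trigger: one node's CPU load stays above the threshold."""
    return stayed_above(cpu_samples, load_threshold, duration)


def cluster_imbalanced(history, gap_threshold, duration):
    """Re-trigger: the max-min CPU load gap across nodes stays too large.

    history: list of (timestamp_seconds, [cpu load of each node]) pairs.
    """
    gaps = [(t, max(loads) - min(loads)) for t, loads in history]
    return stayed_above(gaps, gap_threshold, duration)
```

Checking only the most recent run of readings means a brief dip below the threshold resets the timer, which is one possible reading of "continues to be uneven".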
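The iterative search described above resembles a best-response loop, which can be sketched as follows. The function name and signature are hypothetical, `revenue` stands for any callable that scores a full placement, and treating M as an upper bound on the number of rounds with early stopping when nothing improves is one interpretation of the text rather than a confirmed detail of G-Storm.

```python
def best_response_schedule(tasks, slots, assignment, revenue, max_rounds=50):
    """Move one task at a time to the slot that most increases revenue.

    tasks: iterable of task ids.
    slots: iterable of slot ids.
    assignment: dict mapping each task to its initial slot.
    revenue: callable taking an assignment dict and returning a number.
    max_rounds: the bound M on full traversals (the text suggests 50 to 100).
    """
    assignment = dict(assignment)  # do not mutate the caller's placement
    best = revenue(assignment)
    for _ in range(max_rounds):
        improved = False
        for task in tasks:
            original = assignment[task]
            best_slot = original
            for slot in slots:
                if slot == original:
                    continue
                assignment[task] = slot
                value = revenue(assignment)
                if value > best:
                    best, best_slot = value, slot
                    improved = True
            # Keep the best slot found for this task before moving on.
            assignment[task] = best_slot
        if not improved:
            # No task can gain by moving alone: an equilibrium placement.
            break
    return assignment, best
```

A placement returned after a round with no improvement is stable against single-task moves, but it is not guaranteed to be the global maximum of the revenue function.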

Lab environment
The test environment is a Storm 1.2.2 cluster built from 13 PCs. Each computer is equipped with an Intel Core i7-9700K CPU @ 3.6 GHz × 8 processor and 8 GB of RAM. All machines run 64-bit CentOS 7.0 and are interconnected by 1 Gbit/s switched Ethernet in a LAN. Three of them jointly run the Zookeeper cluster and the Kafka cluster, and one of these three machines also runs Nimbus, the UI process, and the MySQL database. The remaining 10 nodes run the Supervisor daemon.

Experimental program
G-Storm is implemented as a custom scheduler using the pluggable task scheduler interface that the Storm framework provides for developers. The system's built-in Load Monitor, which runs as a background process, is used to monitor cluster operation. The experimental parameters are the values determined after several experiments and fine-tuning. The experiment uses 10 worker nodes and 10 worker processes, that is, only one worker process is deployed on each worker node, which effectively reduces the communication overhead between worker nodes. The game weights $\alpha$, $\beta$, and $\gamma$ are 6, 4, and 2, respectively. Scheduling is triggered when a node's CPU load stays above 60% for 80 seconds. Table 1 lists some experimental parameters. To evaluate the performance of the proposed non-cooperative game algorithm G-Storm, a data generator program was developed to generate the source data: $1 \times 10^{9}$ tuples, each containing 20 alphabetic fields. The processing flow is as follows. The data set is stored in a Kafka message queue (topic=3, partition=12); Storm reads the data from Kafka, processes it, and loads it into a MySQL database.
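A data generator of the kind described above might look like the following sketch. Only the 20 alphabetic fields per tuple come from the text; the function names, the field length of 8, the lowercase alphabet, and the fixed seed are illustrative assumptions.

```python
import random
import string


def generate_tuple(rng, num_fields=20, field_length=8):
    """Build one tuple of alphabetic fields.

    ASSUMPTION: field_length and the lowercase alphabet are illustrative;
    the text only specifies 20 alphabetic fields per tuple.
    """
    return [
        "".join(rng.choices(string.ascii_lowercase, k=field_length))
        for _ in range(num_fields)
    ]


def generate_tuples(count, seed=0):
    """Lazily yield `count` tuples so that large runs do not fill memory."""
    rng = random.Random(seed)
    for _ in range(count):
        yield generate_tuple(rng)
```

Yielding tuples one at a time matters at the scale used in the experiment, since the full data set would not fit in the 8 GB of RAM available on each machine.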

Experimental results
On the data set designed for this experiment, the non-cooperative game-based Storm scheduling algorithm G-Storm proposed in this paper outperforms the default polling scheduling algorithm adopted by Storm in terms of data processing latency. As shown in Figure 1, the proposed algorithm reduces the system latency by about 28.6% compared with the default algorithm.

Conclusion
There is a growing need to improve the performance of real-time stream processing. Storm uses the polling scheduling algorithm by default, which leaves room for optimization in network communication overhead between nodes and in cluster load balancing. This paper proposes a Storm scheduling algorithm based on a non-cooperative game. Experiments show that the improved scheduling algorithm reduces system processing delay. The proposed scheduling does not yet consider network bandwidth or the impact of hardware resources on data transmission.