Research on Routing Algorithm Based on Reinforcement Learning in SDN

Reinforcement learning is a learning method that requires no labeled supervision and has been applied in many fields. In essence, reinforcement learning solves a decision-making problem: an agent repeatedly interacts with its environment, obtains a different feedback value from each interaction, and then adjusts its trial strategy according to that feedback. In this paper, we apply reinforcement learning to software-defined network (SDN) routing and propose a routing algorithm based on Q-learning. By combining reinforcement learning with a neural network, that is, by replacing the Q-table in Q-learning with a neural network, we further present a routing algorithm based on Deep Q-learning.

As mentioned above, the SDN network does not take the quality-of-service (QoS) requirements of links fully into account. Therefore, to ensure service deployment and optimal utilization of network links under the SDN architecture, this paper applies reinforcement learning to SDN routing, using QoS requirements as the criterion for judging whether one path is better than another. Without prior knowledge, reinforcement learning can continuously explore the surrounding environment through an agent, become familiar with the entire environment after a number of training cycles, and finally make the right choice. Unlike the Dijkstra algorithm [12] used in traditional network routing, we propose and implement a routing algorithm based on Q-learning. Furthermore, because Q-learning suffers from the defect of requiring manually designed features, we replace the Q-table with a neural network and propose a routing algorithm based on Deep Q-learning.

Research on routing algorithm based on Q-learning
Q-learning is a model-free reinforcement learning method that relies on Q-value iteration to balance exploration and exploitation. During the iterative process, the action with the largest Q-value is generally selected; however, not every iteration follows this principle. As multiple experiments show, the algorithm selects the action with the largest Q-value only with a certain probability, so that other actions still have a chance of being explored. Q-learning involves several common terms: states, actions, and rewards. The goal of Q-learning is to achieve the maximum reward value through Q-value iteration.
When designing the routing algorithm, a numerical iterative method is used to approximate the optimal value. The complex network environment in which a new packet is located is a finite-state Markov process. When a new packet selects an action from the action set, the network environment accepts the action, transitions to a new state, and at the same time gives the decision a reward value. The greater the reward a packet obtains from an action, the higher the probability that the packet will select that action again. At time t, the state of the packet is s_t; the packet selects an action a_t, the network environment transfers from state s_t to s_{t+1}, and the reward value given by the environment is r_t. The routing decision process in this environment is shown in equation (1).
In the learning process, the optimal value function is approximated by iteratively calculating Q(s, a). The basic transfer rules of the Q-learning algorithm are shown in equation (2) and equation (3), where s and a respectively represent the current state and action of the packet, s' and a' respectively represent the next state and action, and the learning parameter γ is a constant satisfying 0 < γ < 1.
If the learning parameter tends to 0, the packet's routing decision cares more about the reward value of the current action; if it tends to 1, the routing decision takes the reward value of future actions into account.
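Although equations (2) and (3) are not reproduced in this excerpt, the one-step transfer rule they conventionally denote in Q-learning can be stated for reference (a standard form, using the symbols defined above):

```latex
Q(s, a) \leftarrow r(s, a) + \gamma \max_{a'} Q(s', a')
```

In the variant with an explicit learning-rate term α, this becomes Q(s, a) ← Q(s, a) + α [ r(s, a) + γ max_{a'} Q(s', a') − Q(s, a) ].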
Without supervision, the packet learns relying only on its current experience, constantly moving from one state to another to explore until it reaches the target state. Each such exploration is called an episode; an episode ends when the packet arrives at the target state from its initial state.
The basic steps of the Q-learning routing algorithm can be summarized as follows:
(1) Give the learning parameter γ and the reward value matrix R.
(2) Initialize the Q-matrix by setting all of its elements to 0.
(3) For each episode:
(3.1) Select a state s randomly from the state set as the initial state.
(3.2) Before reaching the target state, the following steps are performed:
Once the Q-matrix training is completed, a newly transmitted data packet can easily find a forwarding path from an arbitrary state s_0 to the target state using the trained Q-matrix. The specific process is as follows:
(1) Let the current state s = s_0.
(2) Determine the action a in the current state that satisfies Q(s, a) = max {Q(s, a')}.
(3) Let the current state s = s' (where s' denotes the next state corresponding to a).
(4) Repeat (2) and (3) until s becomes the target state.
The iterative process of each route attempt of a packet under the Q-learning routing algorithm is shown in figure 1: S_0 represents the start state of the packet in one iteration, S_n represents the n-th state of that iteration, and the arrows indicate the direction of the packet's state transitions. It can be seen that under the Q-learning routing algorithm, each state transition of the packet occurs between two adjacent states, so this algorithm is also called a single-step Q-learning algorithm. The single-step Q-learning algorithm occupies a relatively large amount of memory and, at the same time, converges slowly.
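As an illustration, the training and path-extraction steps above can be sketched in Python on a hypothetical toy topology (the node numbering, the reward of 100 at the target, and γ = 0.8 are illustrative assumptions, not the paper's actual settings):

```python
import random

# Hypothetical toy topology (NOT the paper's 15-switch network): an adjacency
# list of switch indices, with switch 4 as the target state.
ADJ = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
TARGET = 4
GAMMA = 0.8  # learning parameter, 0 < gamma < 1

def reward(s_next):
    # illustrative reward matrix R: 100 for reaching the target, 0 otherwise
    return 100.0 if s_next == TARGET else 0.0

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    # step (2): initialize all Q-matrix entries to 0
    Q = {(s, a): 0.0 for s in ADJ for a in ADJ[s]}
    for _ in range(episodes):                            # step (3)
        s = rng.choice([n for n in ADJ if n != TARGET])  # step (3.1)
        while s != TARGET:              # step (3.2): explore until target
            a = rng.choice(ADJ[s])      # random exploratory next hop
            # transfer rule: Q(s, a) = R + gamma * max_a' Q(s', a')
            Q[(s, a)] = reward(a) + GAMMA * max(Q[(a, a2)] for a2 in ADJ[a])
            s = a
    return Q

def extract_path(Q, start):
    # greedy walk: repeatedly take the action with the largest Q-value
    path, s = [start], start
    while s != TARGET:
        s = max(ADJ[s], key=lambda a: Q[(s, a)])
        path.append(s)
    return path

Q = train()
print(extract_path(Q, 0))  # a loop-free path ending at the target switch 4
```

On this small graph the Q-values converge to R + γ·(discounted future reward), so the greedy walk reads the forwarding path directly off the trained Q-matrix, exactly as the four-step extraction process describes.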
Figure 1. Iterative training process of the single-step Q-learning routing algorithm (step one, step two, step three, ..., step n).

Research on routing algorithm based on deep Q-learning
The Q-learning algorithm can learn from the surrounding environment, but it requires manually designed features to train a convergent Q-matrix, whereas neural networks have good processing power for massive data. Replacing the Q-matrix in Q-learning with a neural network yields the Deep Q-learning algorithm. A Deep Q-learning model combines a multi-layer neural network model with a reinforcement learning model: given a state s, it outputs a vector of action values Q(s, a; θ), where θ denotes the network parameters. As a function, the neural network maps an n-dimensional state space to an m-dimensional action space. The Deep Q-learning algorithm has two main points: one is the target network, and the other is experience replay, as shown in equation (4). For experience replay, the observed state transitions are first stored; after the samples have accumulated to a certain extent, they are sampled randomly to update the network. Experience replay is a very important part of deep reinforcement learning and can greatly improve its system performance.
There are two key elements of the Deep Q-learning routing algorithm: one is the calculation of the loss function, and the other is the extraction of training samples during neural network training.
Calculation of the loss function: the relationship between the target Q-value and the old Q-value in Q-learning coincides with the relationship between the output value and the target value in neural network learning, so the loss function is calculated as shown in equation (5).
Training sample extraction: based on the concept of experience replay, random sampling is used because the samples obtained while a packet explores the surrounding environment form a time-associated sequence with temporal correlation. If such data were used directly to train and update the network, this temporal correlation would greatly affect the convergence of the system. Therefore, we adopt random sampling to break the temporal correlation.
The complete process of the Deep Q-learning routing algorithm is as follows:
(1) Randomly initialize the packet in one state, initialize the memory pool, and set the observation value.
(2) Loop traversal:
(2.1) The packet policy selects an action a.
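A minimal sketch of this process, substituting a linear approximator for the paper's neural network so the example stays self-contained (the topology, rewards, and hyperparameters are illustrative assumptions, not the paper's settings), might look like:

```python
import random
from collections import deque

# Hypothetical toy topology (illustrative, not the paper's network);
# switch 4 is the destination.
ADJ = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
TARGET, GAMMA, LR = 4, 0.8, 0.1
N = len(ADJ)

def features(s, a):
    x = [0.0] * (N * N)
    x[s * N + a] = 1.0       # one-hot index of the (state, action) pair
    return x

def q_value(w, s, a):
    return sum(wi * xi for wi, xi in zip(w, features(s, a)))

def train(episodes=400, batch=16, sync_every=50, seed=1):
    rng = random.Random(seed)
    w = [0.0] * (N * N)          # online network parameters
    w_target = list(w)           # target network parameters (point one)
    memory = deque(maxlen=1000)  # experience replay memory pool (point two)
    steps = 0
    for _ in range(episodes):
        s = rng.choice([n for n in ADJ if n != TARGET])  # step (1)
        while s != TARGET:                               # step (2)
            a = rng.choice(ADJ[s])                       # step (2.1)
            r = 100.0 if a == TARGET else 0.0
            memory.append((s, a, r, a))  # store transition (s, a, r, s')
            s = a
            steps += 1
            if len(memory) >= batch:
                # random minibatch breaks the temporal correlation of samples
                for (si, ai, ri, sn) in rng.sample(list(memory), batch):
                    if sn == TARGET:
                        y = ri   # target Q-value at a terminal state
                    else:
                        y = ri + GAMMA * max(
                            q_value(w_target, sn, an) for an in ADJ[sn])
                    err = y - q_value(w, si, ai)  # loss is err ** 2
                    for j, xj in enumerate(features(si, ai)):
                        w[j] += LR * err * xj     # gradient step on the loss
            if steps % sync_every == 0:
                w_target = list(w)   # periodically sync the target network
    return w

w = train()
path, s = [0], 0
for _ in range(N):                   # bounded greedy walk from switch 0
    if s == TARGET:
        break
    s = max(ADJ[s], key=lambda a: q_value(w, s, a))
    path.append(s)
print(path)
```

The sketch shows how the two key elements interact: the loss is computed against the lagged target network (equation (5)'s relationship between target and old Q-values), and minibatches are drawn uniformly from the memory pool rather than in temporal order.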

Routing algorithm simulation environment
The tests of the two algorithm models are based on the network topology shown in figure 2. In this experiment, each link is graded by a certain QoS standard: the higher the grade, the better the link's QoS, and the stronger the tendency to select that link. The ultimate goal of the experiment is to select a forwarding path with better QoS performance for the packet.

Routing Algorithm Status Description
The main task of reinforcement learning is to identify the states of the environmental system and the corresponding executable actions. According to the simulation environment model above, the full state set under the adopted topology can be described as S = [S1, S2, S3, ..., S15, L*_1, L*_2, L*_3]. S1 to S15 represent the 15 switches in the experimental environment, and the * placeholder represents a link between any two switches. In order to represent all possible link conditions in the entire network environment, each link's performance is divided into three levels according to the QoS standard: _1, _2, and _3. The link performance corresponding to each level is shown in Table 1; the higher the percentage of link QoS performance, the better the link status. Since a switch has two or more links and each link is divided into three levels, each (link, level) pair corresponds to one actual state, and all states can be materialized according to the three levels of link state. Take switch S3 as an example: three links connect to it, L2, L5, and L8, and each link can be in one of three performance levels, so the three links of this switch correspond to nine specific states, as shown in Table 2. For instance, [3, L2_1] represents the state s1: the switch is S3, the link is L2, and the link level is _1. After all the states are determined, the optional action set of the experimental environment is determined as well. An optional action refers to moving from one state to another; in this paper, an action is defined as A = {s[i], s[t]}, meaning a transfer from state i to state t.
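The composite state construction for switch S3 described above can be reproduced in a few lines (the switch and link names follow the paper's example, and the bracketed string format mirrors Table 2):

```python
# Composite states for switch S3, following the paper's example: three
# attached links (L2, L5, L8), each at one of three QoS levels.
links = ["L2", "L5", "L8"]
levels = ["_1", "_2", "_3"]

# each (link, level) pair is one concrete state: 3 x 3 = 9 states in total
states = [f"[3, {link}{level}]" for link in links for level in levels]
print(len(states))  # 9
print(states[0])    # [3, L2_1]
```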

Routing Algorithm Reward Function Definition
The purpose of the routing algorithm is to select a link with better comprehensive performance from among the many optional links. In this paper, the network QoS standard is the main consideration, so the goal of the routing algorithm is to select a path with better overall QoS. Since the reinforcement learning algorithm we adopt ultimately selects the link with the larger reward value as the better link after training, the setting of the reward value R is particularly critical.
Since the performance of a link has been divided into three levels, the reward value can be designed according to the link's performance level. Generally speaking, we choose the link with the higher performance value as the better link, so the reward values corresponding to the three levels are set as follows:
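To illustrate the idea, a level-to-reward mapping might be written as follows; the numbers here are hypothetical placeholders, since the paper's actual reward values appear in its own table and are not reproduced in this excerpt:

```python
# Hypothetical reward values per QoS level -- placeholder numbers only,
# NOT the paper's actual settings.
REWARDS = {"_1": 10, "_2": 50, "_3": 100}  # higher level => larger reward

def reward_for(state):
    # states carry their QoS level as a suffix, e.g. "L2_3" is level _3
    return REWARDS[state[state.rindex("_"):]]

print(reward_for("L2_3"))  # 100
```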

Deep reinforcement learning algorithm parameter setting
Deep Q-learning uses a neural network to replace the Q-table in Q-learning. In this paper, we design a simple three-layer neural network. After parameter tuning and operation, if the loss function value continues to decrease, as shown in figure 3, we confirm that the neural network operates normally. The loss function, which can also be called the error function, represents how close the sample's predicted value is to its actual value. The learning rate, the number of layers, and the number of neurons in the neural network can be adjusted according to the experimental results.

Routing Loop Problem
In the above experimental scheme, a routing loop problem may arise. The traditional RIP protocol has four main methods to solve routing loops [13]: triggered update, split horizon, poison reverse, and hold-down timers. We adopt a kind of split-horizon method: when a packet passes through a switch node, a variable Node = [S1, S2, S3, ..., Sn] records the switches the packet has already traversed. The next hop of the packet can only be selected from the remaining switch nodes; the packet cannot return to a node it has already visited. The strategy flow chart for solving the loop problem is shown in figure 4. The final result of the Q-learning algorithm is a convergent Q-matrix, through which the link from a given starting point to the end point can be selected directly. The Q-matrix solved by the Q-learning algorithm in this experiment is shown in Table 3. Through the trained Q-matrix, it is easy to obtain the full path from any starting point to end point 15, as shown in Table 4. Based on the same topology, the Deep Q-learning routing algorithm can also easily find the complete path from any starting point to end point 15; partial search results after model training are shown in Table 5.
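The visited-node record described above can be sketched as follows; the adjacency map of switch names is illustrative, not the figure-2 topology:

```python
# Sketch of the split-horizon record used to avoid routing loops; the
# adjacency map of switch names below is illustrative only.
ADJ = {"S1": ["S2", "S3"], "S2": ["S1", "S3", "S4"],
       "S3": ["S1", "S2", "S4"], "S4": ["S2", "S3"]}

def next_hop_candidates(current, visited):
    # Node = [S1, S2, ...] records switches already traversed; the next hop
    # may only be chosen from nodes the packet has not yet visited
    return [n for n in ADJ[current] if n not in visited]

print(next_hop_candidates("S2", ["S1", "S2"]))  # ['S3', 'S4']
```

Filtering the candidate set this way guarantees the greedy walk visits each switch at most once, so no forwarding loop can form.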

Analysis of experimental results of routing algorithm
Based on the given network topology, we analyse the number of steps required to reach the end point during training with the Q-learning routing algorithm. The convergence steps of partial cycles are shown in table 6. It can be seen from figure 5 that, under the Q-learning routing algorithm, proximity to the end state does not change the convergence speed of the current state. In terms of accuracy, Q-learning is an unsupervised method: each early training period is spent trying actions and training the Q-matrix, so the accuracy is very low at first, but as the number of training cycles increases, the Q-matrix gradually converges. During this process the training accuracy rises rapidly, as shown in Table 7, and figure 6 shows that the accuracy of Q-learning reaches a stable value quickly. The environment used in this training process is not complicated, and the final accuracy of the algorithm reaches almost 100%.
Figure 5. The average number of steps to the end point as a function of the training period. Figure 6. Accuracy rate as a function of the training cycle.

Conclusion
In this work, we apply reinforcement learning to SDN network routing and design a routing algorithm based on Q-learning. The design of this algorithm mainly considers the QoS standard of network links, so the algorithm can make full use of those links. The experimental results show that, after a certain period of training, the algorithm finds routes with better QoS performance with nearly 100% accuracy. The Q-learning-based routing algorithm has one disadvantage: the corresponding features need to be designed manually. In a real application environment, however, the features are often much more complicated; extracting and designing them manually is very difficult, and the effect is not satisfactory. Therefore, we combine a neural network with reinforcement learning, replace the Q-table with an approximate function trained by the neural network, and propose a routing algorithm based on deep reinforcement learning. The experimental results show that the routing algorithm based on Deep Q-learning also performs well.