Reinforcement Learning-Based Service-Oriented Dynamic Multipath Routing in SDN

The increasing quality and varied requirements of network services can be guaranteed thanks to the advancement of the emerging software-defined networking (SDN) paradigm, which benefits from a centralized and software-defined architecture. SDN not only facilitates the configuration of network policies for traffic engineering but also makes it convenient to obtain the network state. The traffic of numerous services is transmitted within a network, and each service may demand different network metrics, such as low latency or a low packet loss rate. Corresponding quality of service policies must be enforced to meet the requirements of different services, and balanced link utilization is also indispensable. In this research, Reinforcement Discrete Learning-Based Service-Oriented Multipath Routing (RED-STAR) is proposed to learn the policy of distributing an optimal path for each service. RED-STAR takes the network state and service type as input values to dynamically select the path on which a service must be forwarded. Custom protocols are designed to obtain the network state, and a deep learning-based traffic classification model is integrated to identify network services. With a differentiated reward scheme for every service type, the reinforcement learning model in RED-STAR gradually achieves high reward values in various scenarios. The experimental results show that RED-STAR can adapt to a dynamic network environment, obtaining the highest average reward value of 1.8579 and the lowest average maximum bandwidth utilization of 0.3601 among all path distribution schemes in a real-case scenario.


Introduction
As the diversity of network services increases, users accordingly demand high quality of service (QoS) [1]. Each service may pursue different network metrics, such as less response time for voice over Internet protocol (VoIP) and a low packet loss rate for file transmission. While the traffic of various services is transmitted within a network, a traffic engineering scheme must forward the traffic of each service to a suitable route. Routes within a network may have different attributes; thus, the service traffic must be appropriately routed to the corresponding paths. Meanwhile, the utilization of every link must also be balanced. With the abovementioned issues, the main problem can be formulated as follows: given a set of services and the network link states, an optimal path must be assigned to the traffic of each network service to meet its QoS requirements, and the link utilization must be balanced as much as possible.
The emerging software-defined networking (SDN) paradigm is a potential solution for dynamically assigning paths and obtaining network link states. SDN, with its global view of the network, enables the rapid and dynamic deployment of network policies [2,3], which are widely used in enterprise and wide area networks. The control plane within a traditional network device is separated from the data plane in SDN, and a logically centralized controller comes into being. The controller communicates with the network devices via an open-standard protocol, namely, OpenFlow, and the switches that support OpenFlow are known as OpenFlow Switches (OFSs) [4]. By adding flow rules via OpenFlow to an OFS, the OFS can execute the instructions designated by the controller. The flow rules are in charge of either modifying the packet header fields or forwarding the traffic, which can be used to carry out policies to meet the QoS requirements [5].
In the SDN environment, a corresponding traffic engineering algorithm or mechanism can be used in the controller to enforce policies [6,7]. However, before we distribute the paths for each service, the service traffic must first be identified, which is known as a traffic classification (TC) task. TC can be approximately categorized into three approaches [8,9]: (a) port-based, (b) payload-based, and (c) machine learning-based. Traditional port-based methods identify packets by the well-known port numbers [10] assigned by the Internet Assigned Numbers Authority [11], which is an instant TC scheme but suffers from dynamic port number utilization [12,13]. Payload-based approaches inspect the payload within a packet against predefined patterns, which can handle dynamic port numbers but are weak at processing encrypted traffic [14,15]. Machine learning-based approaches exploit various algorithms to classify service traffic by taking either statistical information or packet bytes as input values [16][17][18]. In this research, a deep learning TC model constituted of an autoencoder and a 1D convolutional neural network (CAPC) is used in the SDN controller.
The data collection and processing methods, model construction and training, and performance evaluation of the TC model are presented in our previous work [19]. This study is the first to integrate the deep learning TC model within a network environment.
After service classification, an identified service can be assigned to an ideal route by the learning-based algorithm.
This work aims to support services with their required QoS and simultaneously balance the traffic load. The Reinforcement Discrete Learning Service-Oriented Multipath Routing (RED-STAR) mechanism is proposed to dynamically distribute the routes in a network to every service and thereby tackle the problem. As a deep reinforcement learning (DRL) [20] method, RED-STAR considers the network metrics, that is, bandwidth utilization, link latency, and packet loss rate, as the environment state. The metrics are periodically measured and updated for the route distribution task. However, errors in the obtained measurements may occasionally occur because of software simulation or hardware defects, so a metric regularization scheme is included in this work. The deep neural network (DNN) model in RED-STAR takes the regularized environment attributes as input values and generates the output via its inner neural network (NN) computation. Each output value of the DNN model represents the reward value of a route, also known as an action, and the action with the highest reward value is the best route in the DNN's perception. The reward scheme differs across genres of services because of their varying QoS requirements. For example, VoIP services attach great importance to latency; thus, high latency results in a low reward. Text messages are more concerned with the packet loss rate; thus, a high packet loss rate also leads to a low reward. The differentiated reward scheme prompts RED-STAR to allocate the appropriate path to the corresponding service traffic. In addition, highly unbalanced link utilization incurs a low reward so as to balance the utilization of links.
The "discrete" learning in RED-STAR is a slight modification of a typical DRL scheme, which will be further discussed in the following sections. The major contributions of this research are summarized as follows: (1) A deep learning TC model is integrated within a network environment to classify the incoming packets encapsulated in packet-in messages, which is an innovative implementation. (2) Custom protocols for network metric obtainment are designed, and RED-STAR regularizes the measured metrics to provide the DRL model with stable input data. (3) The reward scheme considers the different QoS requirements of services and load balancing issues, and RED-STAR distributes routes to services relying on this custom reward scheme. (4) The DRL mechanism is applied in the SDN, takes the network metrics and service type as the environment, and considers the routes in the network as the action set, which is a novel traffic engineering paradigm. (5) The experiments are implemented with real service traffic (i.e., PCAP file traffic replayed by Bit-Twist [21]) instead of simulated traffic (e.g., randomized packet payloads generated by iPerf [22]). The results show that the proposed method performs better than other route distribution schemes when considering load balancing and QoS requirements.

The remainder of this work is organized as follows. Related work and background are discussed in Section 2. The system architecture is illustrated in Section 3. The system workflow is elaborated in Section 4. The proposed RED-STAR route distribution is detailed in Section 5. Experimental results are demonstrated in Section 6. Finally, Section 7 concludes this work.

Related Work and Background
In this research, two main issues are targeted for route distribution: (a) the QoS guarantee of network services and (b) bandwidth utilization offloading and balancing. Existing works on both topics are discussed in the following paragraphs, and the background and applications of DRL are also investigated.

QoS Guarantee of Network Services.
The traditional network architecture cannot thoroughly offer a QoS guarantee for each service, whereas the emergence of SDN enables flexible flow rule addition and accelerates the deployment of QoS routing policies [23,24]. The common QoS strategy is to reserve bandwidth for specific services, which guarantees the least available bandwidth for each service. Oliveira et al. [25] used the Resource Reservation Protocol and OpenFlow to set up a dedicated channel between a service requester and a service provider, with a static bandwidth threshold to guarantee the file transfer time. Tomovic et al. [26] utilized the SDN mechanism to offer priority flow bandwidth guarantees, designed an algorithm for route calculation and bandwidth reservation, and compared the performance with best-effort and shortest path routing and IntServ. However, the requirements of services are not limited to a minimum bandwidth guarantee but also involve maximum latency and packet loss rate tolerance. Links may have different network metrics, and therefore a superior path distribution algorithm for service traffic is required. Tseng et al. [27] proposed a multiobjective genetic algorithm (GA) to dynamically forecast the resource utilization and energy consumption in the cloud data center. The GA forecasts the resource requirement of the next time slot according to the historical data of previous time slots. Li et al. [28] presented a novel service function (SF) deployment management platform that allows users to dynamically deploy edge computing service applications with the lowest network latency and service deployment costs in edge computing network environments. Tseng et al. [29] proposed a gateway-based edge computing service model to reduce the latency of data transmission and the network bandwidth from and to the cloud. On-demand computing resource allocation can be achieved by adjusting the task schedule of the edge gateway via lightweight virtualization technology.

Bandwidth Utilization Offloading and Balancing.
Apart from QoS guarantees, traffic offloading and balancing are inevitable issues, which often occur in a multipath network environment [30]. The SDN controller, with its global view of a network, can observe the network state and dynamically formulate a strategy to optimize traffic forwarding. Traffic offloading is essential when congestion occurs, and increasing throughput is the primary goal. Chiang et al. [31] proposed a traffic distribution method to offload incoming traffic. They utilized the Link Layer Discovery Protocol (LLDP) to find disjoint paths and Dijkstra's algorithm to find the shortest path with minimum hop counts to increase the overall throughput of the multipath network. Yahya et al. [32] pointed out the defect of the currently prevalent open shortest path first (OSPF) algorithms, which are prone to selecting merely one single best path for traffic forwarding and thus likely to incur traffic congestion. The authors developed a depth-first search algorithm to select several best paths according to link utilization, and the group action feature of the OFS is used to distribute traffic across multiple paths. Even in an uncongested network, traffic balancing is still desirable to prevent future congestion. Challa et al. [33] proposed the CentFlow routing algorithm to enhance node and link utilization depending on centrality measures and temporal node degree. Tseng et al. [34] integrated the hypervisor technique with container virtualization and constructed an integrated virtualization (IV) fog platform for deploying industrial applications based on virtual network functions. Tseng et al. [35] addressed the design pattern of the 5G micro operator (μO) and proposed a Decision Tree-Based Flow Redirection (DTBFR) mechanism to redirect traffic flows to neighboring service nodes. The DTBFR mechanism allows different μOs to share network resources and speeds up the development of edge computing in the future.

Reinforcement Learning.
A typical reinforcement learning (RL) [36] scenario involves three essential elements: an environment, an agent, and an action set. The agent is the learning entity that receives the state from the environment in a sequence of discrete time steps, t = 0, 1, 2, . . ., where s_t is the state obtained at time t. After receiving s_t, the agent will select an action, a_t, to be performed according to its policy and the gained information.
The environment state s_t will be influenced by a_t, thereby transforming into s_{t+1}. A reward value, r_t, standing for the score of performing a_t under s_t, will accordingly be generated by the environment and given back to the agent. On the basis of r_t, the agent evaluates the performance of the previous action and tunes its inner algorithm, attempting to obtain high reward values under the following states.
Q-learning [37] is a representative RL paradigm that has a Q-function to estimate the expected reward value of performing an action under a state (i.e., the Q value):

Q^π(s, a) = r + λ Q^π(s', a'),  (1)

where Q^π(s, a) represents the Q value of an action. The policy π determines the action to be performed, and r is the reward value of performing a under s. After the state-action pair (s, a), a new state s' comes out. A discount factor λ is multiplied by Q^π(s', a') to reduce the impact of events over time. The Q-function in Q-learning is implemented with a Q table, storing the expected reward value of each action under each state. Once the reward is obtained, the Q table updates its stored value as follows:

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)],  (2)

where max_{a'} Q(s', a') is the maximum expected reward under the next state, which is multiplied by an adjustable discount factor γ. The sum of r and γ max_{a'} Q(s', a'), with the original Q(s, a) subtracted, indicates the error of the Q value predicted by the Q-function. This difference is multiplied by a learning rate α and added to Q(s, a) to update the Q-function.
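As a concrete illustration of update rule (2), the following is a minimal tabular Q-learning sketch; the two-state, two-action toy problem and its reward are invented purely for demonstration.

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount factor

q_table = defaultdict(float)  # maps (state, action) -> expected reward

def update(state, action, reward, next_state, actions):
    """Apply the Q-learning update rule from equation (2)."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_error = reward + GAMMA * best_next - q_table[(state, action)]
    q_table[(state, action)] += ALPHA * td_error

# Toy usage: one update on a two-state, two-action problem.
actions = ["path_a", "path_b"]
update("s0", "path_a", 1.0, "s1", actions)
print(round(q_table[("s0", "path_a")], 3))  # 0.1 after a single update
```

With an initially empty table, the temporal-difference error is simply the reward, so the first update moves the entry by α × r.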
Although traditional RL works well in simple tasks, it cannot handle high input dimensionality and suffers from slow convergence. Combined with emerging deep learning, which mitigates the abovementioned problems, DRL appeared. DRL has recently been applied in several fields, such as video gaming [20], self-driving systems [38], and even computer networking [39,40]. Hossain et al. [41] raised the issue of situation-aware management to ensure application-driven QoS and utilized link delay and packet loss rate as QoS metrics. A DRL-based intelligent routing decision-making scheme is proposed to optimize routing paths, with delay and loss rate as the observation space and the weighted delay and loss rate as the reward scheme. Lin et al. [42] used RL adaptive routing in a hierarchical SDN network. The customized reward function is calculated from delay, packet loss rate, and bandwidth multiplied by the corresponding weight parameters. The parameters are tunable and configured according to the requirements of services.

Wireless Communications and Mobile Computing
There are many RL methods that learn some weights and then employ conventional routing algorithms. Yu et al. [43] proposed a deep deterministic policy gradient routing optimization mechanism (DROM) for SDN to achieve universal and customizable routing optimization. DROM simplifies network operation and maintenance by improving network performance, such as delay and throughput, with black-box optimization in continuous time. Sun et al. [44] built an intelligent network control architecture, TIDE (time-relevant deep reinforcement learning for routing optimization), to realize an automatic routing strategy in SDN. An intact "collection-decision-adjustment" loop is proposed to perform intelligent routing control of a transmitting network. Stampa et al. [45] designed a DRL agent that optimizes routing.
The DRL agent adapts automatically to current traffic conditions and proposes tailored configurations that attempt to minimize the network delay. Pham et al. [46] exploited a DRL agent with convolutional neural networks in the context of knowledge-defined networking (KDN) to enhance the performance of QoS-aware routing. Guo et al. [47] proposed a DRL-based QoS-aware secure routing protocol (DQSP). While guaranteeing the QoS, DQSP can extract knowledge from historical traffic demands by interacting with the underlying network environment and dynamically optimize the routing policy. Rischke et al. [48] designed a classical tabular RL approach (QR-SDN) that directly represents the routing paths of individual flows in its state-action space. QR-SDN is the first RL SDN routing approach to enable multiple routing paths between a given source (ingress) switch and destination (egress) switch pair while preserving flow integrity. Ibrar et al. [49] proposed an intelligent solution for improved performance of reliable and time-sensitive flows in hybrid SDN-based fog computing Internet of Things (IoT) systems (IHSF). IHSF solves several problems related to task offloading from IoT devices in a multihop hybrid SDN-F network context.
Based on the abovementioned related research, this study proposes reinforcement discrete learning-based service-oriented multipath routing to learn the policy of distributing an optimal path for each service. RED-STAR takes the network state and service type as input values to dynamically select the path on which a service must be forwarded. Seven papers related to the motivations and problems to be solved in this study are compared; the comparison is shown in Table 1. Compared with IHSF and the other methods, our proposed method considers the type of traffic and selects the best routing path.

System Architecture
The overall system is an SDN paradigm, which can be divided into two parts: the data plane and the control plane. The data plane is in charge of forwarding traffic; the control plane is the primary site to deploy the custom modules, which are the core components of the architecture. The details of both parts are illustrated in Figure 1 as a UML diagram. The data plane is composed of OFSs, forwarding the traffic between the server and the client according to the rules deployed in the flow table. The flow table stores the commands delivered by the controller, and the OFSs forward packets or modify header fields according to the rules.
The OFSs communicate with the control plane via OpenFlow channels, whether for flow deployment or statistical reporting. As for the control plane, several custom modules constitute the controller, as described in the following.

Controller.
The controller object is responsible for maintaining the information of the router, link, and service objects within the network. A router stands for an OFS; a link is a path between two OFSs; and a service is the traffic type being transmitted. In addition, the controller periodically requests the network metrics and regularizes and updates the obtained metrics. The controller also periodically reallocates the paths for services and trains the DRL agent to improve the path allocation. The method proposed in this study is applicable regardless of whether the communication between the switch and the controller is out-of-band or in-band.

Router.
A router object is an entity in the control plane representing an OFS. When an OFS is activated and notifies the controller, a router object will be instantiated. A router object collects the information of each port by sending port request messages and explores the topology by sending LLDP messages. Moreover, whenever a packet that matches no flow rule is sent to the controller, the router will normalize the packet and classify it into a service type with the CAPC deep learning model. A router is also in charge of the communication between the OFSs and the control plane, such as flow addition and port statistics requests.

Classifier.
The classifier is a component of a router object that normalizes and classifies packets, and the classification result is returned to the router. The normalization and training processes are detailed in our previous work [19].

Link.
A link object consists of three network metrics: bandwidth utilization, latency, and packet loss rate. The controller is responsible for maintaining and updating the links, and the metrics are regularized before being updated. The metrics are the main factors and constitute the state for the subsequent DRL path distribution.

Service.
A service object records the service type and its specific reward calculation policy. After the controller distributes a path for a service, the allocated path (last action) will be recorded. Once the network metrics are obtained, they become the input values of the reward calculation method to generate the reward for the last action.

Agent.

The agent object belongs to the controller and has an experience replay memory to train the NN model. The NN model is used to select a path for a service and is trained with the transitions in the replay memory. After path allocation, the previous state, the selected path, the reward, and the current state are saved as a transition into the replay memory.

System Workflow
A few procedures must be accomplished to model the QoS path distribution as an RL problem. This section details each procedure, including network state observation, path distribution, and the reward mechanism. In the typical framing of an RL scenario, an agent takes actions in an environment; the environment returns a reward and a representation of the state, which are fed back into the agent.
The state of the environment includes the type of service, the current bandwidth utilization, the packet loss rate, and the delay of each link. The description of each procedure is shown in Figure 2, where the observation is to learn the changing environment of the network for the RL agent.
The observation includes the service type, the current bandwidth utilization, the packet loss rate, and the latency of each link. After receiving a state as the input, the agent will select a path for a service and add a flow to the OFSs. Subsequently, the reward value is generated on the basis of the distribution and the service type. The three kinds of data, that is, the state, action, and reward, will be stored in the replay memory for agent training.

Observation.
A few steps must be completed to form a state, including topology discovery, metric measurement, and metric regularization. The task that must be accomplished first is topology construction, to offer the paths (action set) for distribution. The LLDP is used to explore the link status of each port. An LLDP packet is crafted with a designated chassis ID, that is, the ID of an OFS, and a port number, and is thereafter sent out from that port of the OFS. The connected port on the other side will receive the packet, encapsulate it in an OpenFlow packet-in message, and forward it back to the controller. The controller can thereby construct the topology of the network. Note that link-layer discovery yields only a connectivity topology that gives the links between individual network nodes but does not yet construct end-to-end paths. The entire process of topology discovery is shown in Figure 3, and the notations used in the figure are explained in Table 2.
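The discovery procedure above can be sketched as follows; the flattened event arguments (chassis ID, sending port, receiving switch, receiving port) are an assumed simplification of parsing the chassis ID and port TLVs out of the packet-in payload.

```python
from collections import defaultdict

# Connectivity topology built from LLDP packet-in events:
# switch id -> set of (local port, peer switch, peer port).
links = defaultdict(set)

def on_lldp_packet_in(chassis_id, port_no, rx_switch, rx_port):
    """Record both directions of the discovered link."""
    links[chassis_id].add((port_no, rx_switch, rx_port))
    links[rx_switch].add((rx_port, chassis_id, port_no))

# Example: the probe sent from switch 1 port 2 arrives at switch 3 port 1.
on_lldp_packet_in(1, 2, 3, 1)
print(sorted(links[1]))  # [(2, 3, 1)]
```

End-to-end paths for the action set would then be derived from this adjacency structure in a separate step.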

Metric Measurement.
The network metrics are measured periodically as the state of the DRL model. The measurement approaches for bandwidth utilization, latency, and packet loss rate are described in the following order.
(1) Bandwidth Utilization Measurement. Every OFS keeps its accumulated transmission byte count up to date. Whenever it receives a port request message, the OFS will answer with a port reply to the controller containing the accumulated transmitted bytes. The controller can thereafter calculate the difference between the previous and the current values, thereby obtaining the bandwidth utilization over the elapsed interval. The detailed process of the measurement is depicted in Figure 4, and the notations used in the figure can be referred to in Table 3.
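A minimal sketch of this counter-delta calculation, assuming a fixed polling interval and a known link capacity (the port naming and units below are illustrative):

```python
# The controller keeps the last reported byte counter per port and
# derives utilization from the delta between successive port replies.
last_tx_bytes = {}

def utilization(port_id, tx_bytes, interval_s, capacity_bps):
    """Return link utilization in [0, 1] from two successive counter reads."""
    prev = last_tx_bytes.get(port_id, tx_bytes)  # first read: no history yet
    last_tx_bytes[port_id] = tx_bytes
    return (tx_bytes - prev) * 8 / (interval_s * capacity_bps)

utilization("s1-eth1", 0, 5, 10_000_000)                 # baseline sample
print(utilization("s1-eth1", 6_250_000, 5, 10_000_000))  # 1.0: link saturated
```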
(2) Link Latency Measurement. A custom protocol is designed to measure the latency. The packet format of the protocol complies with the Ethernet frame, with the ether type set to an arbitrary value (0x8787). The destination MAC address remains blank, and the source MAC address is filled with the timestamp at sending time. The length of the custom probing packet is thus 14 bytes, less than that used in other research [41,50]. Whenever an OFS receives a packet with the 0x8787 ether type, it sends the packet within a packet-in message to the controller. Afterward, the controller extracts the timestamp from the packet and calculates the difference between the current time and that timestamp. Finally, the latency between the OFSs and the controller is subtracted from the outcome, and the latency of the link is obtained. The overall process of latency measurement is shown in Figure 5, and the notations used in the figure can be referred to in Table 4.
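The probe could be crafted as below; encoding a millisecond timestamp into the 6-byte source MAC field is an assumption about the exact on-wire encoding, which the description above does not pin down.

```python
import struct
import time

ETH_TYPE_LAT = 0x8787  # custom ether type for latency probes

def craft_probe(now_ms=None):
    """Build the 14-byte probe: blank dst MAC, timestamp as src MAC."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    dst = b"\x00" * 6
    src = now_ms.to_bytes(6, "big")  # millisecond timestamp fits in 6 bytes
    return dst + src + struct.pack("!H", ETH_TYPE_LAT)

def link_latency_ms(frame, now_ms, ctrl_latency_ms):
    """Recover the timestamp and subtract the controller-switch latency."""
    sent_ms = int.from_bytes(frame[6:12], "big")
    return (now_ms - sent_ms) - ctrl_latency_ms

frame = craft_probe(now_ms=1_000)
print(len(frame), link_latency_ms(frame, now_ms=1_012, ctrl_latency_ms=2))  # 14 10
```

Six bytes of milliseconds cover roughly 8900 years, so the field never overflows in practice.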
(3) Packet Loss Rate Measurement. Similar to the latency measurement, a custom packet format is used for the packet loss rate. The ether type of the packet is set to 0x7878, and the destination and source MAC addresses remain blank. Initially, the controller sends out a fixed number of probing messages to the OFSs. Whenever an OFS receives a packet with the 0x7878 ether type, it drops the packet immediately. After a fixed period, the OFS reports the number of 0x7878 packets it received from the controller. The controller then calculates the difference between the received number and the original quantity of probing packets sent, from which the packet loss rate is obtained. The entire process of packet loss rate measurement is illustrated in Figure 6, and the notations used in the figure can be referred to in Table 5.
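The controller-side bookkeeping reduces to a simple ratio; the probe count per measurement period below is an illustrative value, not one given in the text.

```python
# Loss rate = fraction of 0x7878 probes that never reached the OFS.
PROBES_SENT = 100  # fixed probe count per measurement period (assumed)

def loss_rate(probes_received):
    """Compare the OFS-reported count with the number of probes sent."""
    return (PROBES_SENT - probes_received) / PROBES_SENT

print(loss_rate(97))  # 0.03
```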

Metric Regularization.
Two aspects must be considered before the obtained network metrics can be used. The first aspect is that the same value of different metrics has different meanings; for example, latency is presented in milliseconds while bandwidth utilization and packet loss rate are presented as ratios, and 50% bandwidth utilization is definitely better than a 50% packet loss rate. The second aspect is that the obtained metrics occasionally go wrong.
For the first problem, a mechanism is required to map each metric into a similar scale range (0-1), where a larger value is better than a smaller one. In normalizing a bandwidth utilization value, if the value is originally 0%, the normalized value will be 1; if the value is originally 100%, the normalized value will be 0:

bw_norm_lnk(m, k, l) = −bw_lnk(m, k, l) + 1,

where bw_lnk(m, k, l) is the original bandwidth utilization and bw_norm_lnk(m, k, l) is the normalized value. For latency normalization, if the original latency is 0 ms, the normalized value will be 1; if the original latency is 100 ms, the normalized value will be 0:

lat_norm_lnk(m, k, l) = (−0.01) × lat_lnk(m, k, l) + 1,

where lat_lnk(m, k, l) is the original latency and lat_norm_lnk(m, k, l) is the normalized value. Finally, if the original packet loss rate is 0%, the normalized value will be 1; if the original packet loss rate is 10%, the normalized value will be 0:

loss_norm_lnk(m, k, l) = (−10) × loss_lnk(m, k, l) + 1,

where loss_lnk(m, k, l) is the original loss rate and loss_norm_lnk(m, k, l) denotes the normalized one. The second problem is solved by a custom mechanism that evaluates the difference between the new value and the mean of the previous values. If the difference is greater than the standard deviation, the new value is determined to be an anomaly and is regularized to the mean value. The dynamic standard deviation can be obtained as follows:

σ(X) = sqrt(Var(X)), with Var(X) = E[(X − E[X])^2],

where Var(X) is the variance and is used to calculate the standard deviation.

Figure 2: Progress of modeling an RL path distribution problem (online learning multipath routing). Observation: get the current state of the environment, that is, the service type of the incoming packets and the current bandwidth, loss, and latency of each link. Action: decide on which path the incoming packet should be forwarded, based on the result of the classifier and the previous observation. Reward: calculate the reward value based on the service category; the reward, the taken action, and the observation form the input record of the DRL model.

Figure 3: Flowchart of link-layer topology discovery (notations explained in Table 2).
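The three normalizations and the anomaly filter can be sketched together; keeping a per-link history list and using the population standard deviation are illustrative implementation choices.

```python
import statistics

# Normalizations from the equations above: larger is always better.
def norm_bw(bw):      return -bw + 1             # 0% -> 1, 100% -> 0
def norm_lat(lat_ms): return -0.01 * lat_ms + 1  # 0 ms -> 1, 100 ms -> 0
def norm_loss(loss):  return -10 * loss + 1      # 0% -> 1, 10% -> 0

def regularize(history, new_value):
    """Replace an outlier (deviation beyond one std dev) with the mean."""
    if len(history) >= 2:
        mean = statistics.mean(history)
        if abs(new_value - mean) > statistics.pstdev(history):
            new_value = mean
    history.append(new_value)
    return new_value

print(norm_bw(0.5), norm_lat(25), norm_loss(0.02))  # 0.5 0.75 0.8
```

A glitched sample such as a sudden 90% utilization reading on a link that has hovered around 10% would thus be clamped back to the running mean before entering the DRL state.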

Action Selection.
With the gained input values from the observation, the DRL model can determine a path on which to forward the traffic of a service. The proposed model follows a complete-path approach with multipath capability. The proposed RED-STAR adopts an ε-greedy scheme: with probability ε, an action is randomly selected to be performed, and with probability 1 − ε, the decision is made on the basis of the calculation result of the DRL model.
Once an OFS receives a packet that matches no installed flow, the packet will be sent to the controller within a packet-in message. Thereafter, the CAPC model is used to classify the service type to which the packet belongs. Then, a service object is instantiated and added to the service list of the controller. The agent within the controller subsequently allocates a path for each service with the ε-greedy approach, thereby training its NN model toward good allocations. A forwarding flow for the corresponding service will be added to the OFS. A flow is composed of a matching field and an action field. The matching field is set to the IP address and the port number of that service, and the action is set to the forwarding port according to the selected path.
ε in the ε-greedy approach is a variable that is initially set to a value close to 1. With training, the ε value gradually decreases according to (7). Considering that the NN model is not robust at the beginning, ε is set to a relatively high value, which favors exploration. Over time, the decisions made by the NN model become better and ε becomes smaller. The agent can then rely more on the NN model to allocate a path for a service, which is known as exploitation.
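A sketch of ε-greedy selection with decay; since equation (7) is not reproduced here, the multiplicative decay factor and floor below are illustrative assumptions.

```python
import random

EPS_DECAY, EPS_MIN = 0.995, 0.05  # illustrative decay factor and floor

def select_path(q_values, eps):
    """Explore with probability eps, otherwise exploit the best Q value."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

eps = 1.0
for _ in range(100):  # eps shrinks after every allocation round
    eps = max(EPS_MIN, eps * EPS_DECAY)
print(select_path([0.2, 1.8, 0.5], eps=0.0))  # 1: pure exploitation
```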
Reward Scheme.
Two factors are involved in the reward scheme: the QoS requirements of services and link utilization balancing. Each factor accounts for a reward value of 1; thus, the maximum reward value is 2.

QoS Reward.
Each service attaches different importance to the metrics, so a differentiated reward policy is needed. After path allocation for the services, the controller will request and receive the metrics for the next turn. Once the controller receives the new metrics, the rewards for the last path allocation of each service will be calculated on the basis of its individual policy. A total of 16 applications are classified into four main categories (Table 6). File transfer services emphasize packet quality with less corruption and loss, thereby tolerating high latency. Video streaming and VoIP services are sensitive to link latency while allowing slight packet loss; thus, their latency weight must be set higher. Remote control services demand moderate response time and packet loss rate; thus, the two weights are set equally. The reward calculation for the four service categories is formulated as follows:

r_svc(i) = (w_lat_svc(i) × lat_norm_svc(i) + b_lat_svc(i)) + (w_loss_svc(i) × loss_norm_svc(i) + b_loss_svc(i)),

where r_svc(i) is the reward value of the i-th service, and lat_norm_svc(i) and loss_norm_svc(i) are the normalized latency and loss rate of the selected path. The reward is the sum of the weighted latency and loss rate of the selected path for the service plus the latency and loss base values. The latency weight w_lat_svc(i) is set higher for latency-sensitive services (streaming and VoIP), and the latency base b_lat_svc(i) is set lower. The loss weight w_loss_svc(i) is set lower, and the loss base is set higher for the latency-sensitive services. The weight and base values of the other categories are also set according to their QoS requirements. The actual weight and base values for the services are depicted in Figure 7. The latency and packet loss rate each take half of the QoS reward (a maximum of 0.5 each).
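A sketch of the per-category QoS reward; the weight and base values below are hypothetical stand-ins for those in Figure 7, chosen only so that latency and loss each contribute at most 0.5.

```python
# Per-category policy: (w_lat, b_lat, w_loss, b_loss).
# These numbers are illustrative, not the paper's actual values.
POLICY = {
    "file_transfer": (0.10, 0.40, 0.40, 0.10),  # loss-sensitive
    "streaming":     (0.40, 0.10, 0.10, 0.40),  # latency-sensitive
    "voip":          (0.40, 0.10, 0.10, 0.40),  # latency-sensitive
    "remote":        (0.25, 0.25, 0.25, 0.25),  # balanced
}

def qos_reward(service, lat_norm, loss_norm):
    """r_svc = (w_lat * lat_norm + b_lat) + (w_loss * loss_norm + b_loss)."""
    w_lat, b_lat, w_loss, b_loss = POLICY[service]
    return (w_lat * lat_norm + b_lat) + (w_loss * loss_norm + b_loss)

print(qos_reward("voip", 1.0, 1.0))  # 1.0: perfect path for VoIP
```

With these weights, a lossless but slow path still earns a file transfer service most of its QoS reward, while the same path earns a VoIP service much less.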

Link Utilization Reward.
An unbalanced path allocation results in a low reward, encouraging full use of the network's bandwidth. The utilization of each link is gathered, and the utilizations of the most and least used links are taken for the reward calculation. The smaller value is subtracted from the larger one; a large difference leads to a low reward value, and vice versa:

r_bw_lnk(k, l) = 1 − (u_k − u_l),

where r_bw_lnk(k, l) is the reward value of utilization balancing, u_k is the maximum link utilization, and u_l is the minimum. The difference between the maximum and minimum utilization is negated and added to 1. Thereafter, the QoS reward and the balancing reward are summed as the final reward value of a path distribution:

r = r_svc(i) + r_bw_lnk(k, l).
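The balancing reward and the final reward of a path distribution can be sketched as follows (a simplified reading of the scheme described above):

```python
def utilization_reward(utilizations):
    """1 minus the spread between the most and least used links:
    a balanced allocation (small spread) earns a reward close to 1."""
    return 1.0 - (max(utilizations) - min(utilizations))

def total_reward(qos_reward_value, utilizations):
    """Final reward of a path distribution: QoS reward plus balancing reward."""
    return qos_reward_value + utilization_reward(utilizations)
```

With three equally loaded links the balancing term reaches its maximum of 1, so a service that also meets its QoS policy perfectly earns the maximum total reward of 2.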

DRL Route Distribution
In a general DRL case, a state s_t acted on by a_t yields r_t and s_{t+1}, and the transition (s_t, a_t, r_t, s_{t+1}) is saved in the replay memory. The memory contains several transitions, from which the agent randomly selects n transitions to train its NN model. The proposed RED-STAR mechanism is a slight modification, called Reinforcement Discrete learning (RED), of a classic DRL model, the deep Q-network (DQN): RED-STAR considers only the actual reward r_t resulting from s_t and a_t, without s_{t+1} (Figure 8).
The main idea of RED is that the route distributions of services do not directly influence one another. For example, if the first distribution targets the Skype VoIP service and the next targets the LINE VoIP service, the two distributions have no correlation: the state s_t of Skype does not lead to s_{t+1} of LINE. Therefore, a transition stored in the replay memory consists only of s_t, a_t, and r_t. A typical DRL model involves two NN models: one for action selection and one for calculating target values. The s_t of the transitions randomly selected from the replay memory is fed into the prediction NN model, whose output is the action to be performed in this round. The s_{t+1} of the transitions is fed into the target NN model, whose output is the target value to be approached by the prediction model. The output of the prediction model is regarded as the estimated expected reward of performing a_t under s_t, whereas the output of the target model multiplied by a discount factor γ and added to r_t is considered the practical expected reward (Figure 9). The prediction NN model trains and updates itself with the practical expected reward as the target value.
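The difference between the two target computations can be sketched as follows; this is a simplified illustration, not the authors' code:

```python
import random
from collections import deque

class ReplayMemory:
    """RED stores only (s_t, a_t, r_t): route distributions for different
    services are independent, so no successor state s_{t+1} is recorded."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, state, action, reward):
        self.buffer.append((state, action, reward))

    def sample(self, n):
        return random.sample(self.buffer, min(n, len(self.buffer)))

def dqn_target(reward, next_q_values, gamma=0.9):
    # Classic DQN: observed reward plus discounted bootstrap estimate
    # from the target network's outputs for s_{t+1}.
    return reward + gamma * max(next_q_values)

def red_target(reward):
    # RED: the network regresses directly on the observed reward.
    return reward
```

Dropping the bootstrap term removes the need for a separate target network, since the training signal no longer depends on the model's own future estimates.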
Different from the traditional DRL operation, the output of the NN model in the RED-STAR mechanism stands for the actual reward value r_t for performing a_t under s_t, rather than the expected reward. During training, s_t is the input to the NN model and r_t is set as the target value (Figure 10). The model updates its parameters to approximate the target value. The structure of the RED-STAR NN model is depicted in Figure 11: a three-layer deep learning structure. The input layer at the top receives 28 features, including the one-hot-encoded service type, the network metrics of all routes, and the route allocation state. The output layer contains as many neurons as there are routes, and the value of each neuron represents the reward value for the corresponding path. The mean square error (11) is set as the loss function of the NN model, the criterion that determines how "bad" the model is. Adam [51], a gradient-descent-based optimization method, is used to update the parameters of the NN model.
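The described network can be sketched in Keras as follows. The hidden-layer widths are assumptions; the paper specifies only the 28 input features, the three-layer structure, one output per route, the MSE loss, and the Adam optimizer:

```python
import numpy as np
from tensorflow import keras

def build_red_star_model(n_features=28, n_routes=3):
    """28 input features (one-hot service type, per-route metrics,
    route allocation state) -> one output neuron per route, each
    estimating the reward of forwarding the service on that route."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),   # hidden widths assumed
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(n_routes, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")  # Adam + mean square error
    return model
```

At inference time, the route with the highest output value is chosen (subject to the ε-greedy policy), and at training time the observed reward r_t is the regression target for the chosen route's output.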
To date, the overall contour of the route distribution mechanism has been illustrated. The process can be briefly presented in three procedures: (a) the router objects request their port statistics from the OFSs, thereby obtaining the link states of the topology, and the controller then updates the link states received from the routers, which is the observation step described in Section 4.1; (b) the agent allocates routes to the services on the basis of its policy, and the controller calculates the reward values of the services according to their reward policies, which correspond to the action and reward steps; (c) the agent samples transitions from the replay memory to train its NN model.

Experiments and Evaluation
Several scenarios are simulated to evaluate the effectiveness and performance of the proposed RED-STAR mechanism. The environment settings, that is, the hardware and software specifications, SDN network construction, and link attribute settings, are first introduced. Thereafter, the performance of the RED-STAR mechanism is compared with that of two other route distribution schemes (i.e., shortest path distribution (SPD) and DQN) in different scenarios.

Environment Settings.
The specifications are listed in Table 7. An Ubuntu virtual machine is installed atop a VMware ESXi hypervisor, running Mininet [52] as the network simulator. The Bit-Twist traffic replay toolkit [21] is used for service traffic generation; thus, the traffic in the simulation comes from actual PCAP captures of the services. TensorFlow and Keras are used for NN model construction and training. The network topology and configuration are kept simple (Figure 12): two hosts are in charge of traffic transmission, and the links are featured differently. The delay value and packet loss rate are set inversely: a higher delay value is configured along with a lower packet loss rate (Link 3), and vice versa (Link 1). The maximum bandwidth values are all set to 100 Mbps. This configuration allows each scheme to determine the route distribution for all services in transmission on the basis of their QoS requirements.

Reward Scheme.
In this scenario, a LINE VoIP service is transmitted at 10 Mbps in the network. VoIP is a latency-sensitive, loss-rate-tolerant service, from which we can directly identify that the first link must be the best route for LINE VoIP. A random path distribution scheme is tested first; its reward values are shown in Figure 13, oscillating between 1.6 and 1.8.
Under the ε-greedy policy, the RED-STAR model tends to select routes arbitrarily at the beginning and relies increasingly on the NN model as ε gradually declines. The reward value gained by RED-STAR is shown in Figure 14(a). The NN model approximately converges after the 200th second and obtains high reward values. Initially, the model selects the third link, causing the expected reward of the third link to grow more quickly than the others (Figure 14(b)). After convergence, the model has learned that the first link is more suitable for the LINE VoIP service, thereby fixing its route allocation to select the first link more often and obtain higher rewards.

Composition of Different Services.
The traffic of three services is replayed to the network simultaneously to evaluate the performance of each scheme; the bandwidth consumption of every service is equally set to 10 Mbps in this scenario. The schemes deployed on the controller are in charge of distributing a route for each service. The rewards gained by the three schemes are shown in Figures 15(a)-15(c): the value of SPD remains constant, whereas the two learning models obtain higher values over time.
RED-STAR obtains the highest average value for every service. The DQN model performs worse than RED-STAR, and SPD does not improve with time. The average reward value of the three services under each scheme is presented in Figure 15(d), where RED-STAR converges faster than DQN and reaches the greatest value of 1.8533. Apart from QoS, load balancing is one of the key factors of the reward and can be discussed separately. In Figure 15(e), the maximum bandwidth utilization of RED-STAR decreases with time, reaching its lowest point at around the 500th second and remaining at a relatively low value of around 0.1. Therefore, RED-STAR has the lowest average utilization (0.1181), indicating that it uses the bandwidth resources in the most effective way.

Real-Case Scenario.
Six services are involved in this case, a more complex route distribution task. The reward gained by SPD remains stable (Figure 16(a)); the DQN model approximately converges after the 500th second (Figure 16(b)), obtaining higher values on all services than SPD; RED-STAR also converges, at the 600th second (Figure 16(c)), achieves higher average values than DQN on three services, and becomes steady after convergence. The average reward of the six services under each scheme is presented in Figure 16(d).
The SPD scheme remains at approximately 1.6, whereas the two learning-based models grow steadily from 1.4 to 1.9. RED-STAR and DQN achieve similar average rewards of 1.8579 and 1.8568, respectively, at the end. Both are able to adapt to the realistic traffic scenario, even though the models neither have knowledge of service bandwidth consumption nor take the consumption as input data. Regarding load balancing, RED-STAR has a lower maximum utilization rate than DQN at the end (Figure 16), indicating that DQN gains more reward on the QoS requirements and thereby achieves an average reward similar to RED-STAR's.

Conclusions
RED-STAR considers the QoS requirements of different services and balances link utilization, achieving an average reward value of 1.8533, greater than that of the other two schemes, and the lowest average maximum bandwidth utilization of 0.1181 in the three-service traffic scenario. Moreover, in the realistic six-service traffic scenario, RED-STAR still achieves the best average reward of 1.8579 and the lowest average maximum bandwidth utilization of 0.3601 among all schemes.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.