Reinforcement learning based routing for time-aware shaper scheduling in time-sensitive networks

To guarantee the real-time performance and quality of service (QoS) of time-critical industrial systems, the time-aware shaper (TAS) in time-sensitive networking (TSN) controls frame transmission times in a bridged network using a scheduled gate control mechanism. However, most TAS scheduling methods generate schedules based on pre-configured routes without exploring alternatives for better schedulability, while methods that jointly consider routing and scheduling require enormous runtime and computing resources. To address this problem, we propose a TSN Scheduler with Reinforcement Learning-based Routing (TSLR) that identifies load-balanced routes for higher schedulability with acceptable complexity using distributional reinforcement learning. We evaluate TSLR through TSN simulations and compare it against state-of-the-art algorithms to demonstrate that TSLR effectively improves TAS schedulability and link utilization in TSN with lower complexity. Specifically, TSLR shows a more than 66% increase in schedulability compared to the other algorithms and reduces scheduling time by more than one hour. It also keeps flows' transmission latency below 25% of their latency-deadline requirement and reduces maximum link utilization by approximately 50%.


Introduction
Time-Sensitive Networking (TSN) [1,2] is a general-purpose real-time Ethernet standard that aims to solve not only the real-time requirements but also the compatibility issues of proprietary Ethernet extensions such as EtherCAT, PROFINET, and SERCOS III. Its goal is to provide standards-based deterministic, ultra-low-latency, ultra-low-jitter, zero-congestion-loss data communication in an integrated network that supports both time-sensitive and best-effort traffic simultaneously. TSN yields a next-generation local area network (LAN) technology for the coexistence of information technology (IT) and operational technology (OT), especially for industrial automation, in-vehicle, and avionic networks.
TSN consists of several standards to ensure the stringent timing requirements of real-time systems. In particular, IEEE 802.1Qbv [3] serves as one of TSN's core standard amendments, defining the time-aware shaping (TAS) mechanism, which aims to schedule precise frame transmission timing in bridged networks. TAS guarantees deterministic latency for time-sensitive traffic flows by controlling the transmission gates of egress queues within switches according to computed schedules (Fig. 1). However, calculating correct and coordinated TAS schedules for a network poses a challenging problem and requires substantial computation because it must account for various factors such as traffic configuration, flow requirements, paths, link capacity, and utilization [4,5]. Optimal TAS scheduling is known to be an NP-complete problem [6,7].
Problem. Prior works have proposed constraints and methods for TAS scheduling (related works in Section 2). Many of these works focus on scheduling under the assumption that the routes of the input flows are known in advance [8-14]; they do not explore the dependency between routes and schedules. Yet if a flow passes through a path, it affects the schedules of the other flows on that path.
• We evaluate TSLR through TSN simulations to demonstrate improved schedulability and link utilization of TAS compared to several state-of-the-art algorithms in the literature.
The remainder of this paper is structured as follows. In Section 2, we first summarize related literature. Then, we introduce TAS and reinforcement learning and justify the selection of C-DQN for our work in Section 3. We present the TSLR design in Section 4 and evaluate TSLR in Section 5. We summarize the work and conclude in Section 6.

Related work
Several prior studies have attempted to solve the problem of optimal or efficient TAS scheduling in TSN. Craciunas et al. [8] propose formal scheduling constraints for calculating valid GCLs, considering the influencing factors of real-time communication, and compute schedules by applying satisfiability modulo theories (SMT). Dobrin et al. [9] propose a fault-tolerant scheduling scheme that ensures the deadlines of time-sensitive traffic considering transmission failures and retransmissions. Jin et al. [10] employ SMT and optimization modulo theories (OMT) to schedule more real-time flows and reduce execution time. Ansah et al. [11] propose a schedulability analysis algorithm that verifies whether schedules of periodic TSN applications and bridges can be computed. Durr et al. [12] propose a heuristic scheduling algorithm based on Tabu search and a compression algorithm to reduce the bandwidth wasted by guard bands. Kai et al. [22] propose TSN Chained Flow Scheduling (TCFS) as an efficient scheduling mechanism in a multi-level topology. Their ILP-based approach is designed to solve offline scheduling problems (typical TAS scheduling cases), and a Tabu-search-based approach is designed to solve online scheduling problems. However, these works limit the search space by assuming fixed, given routes with simplified constraints, without considering alternative routes for improved schedulability.
Some studies have considered routing jointly with scheduling in order to improve schedulability. Nayak et al. [23] propose a time-sensitive software-defined network (TSSDN) that exploits the logically centralized paradigm of SDN to provide ILP formulations for solving the combined problem of routing and scheduling time-triggered traffic. Subsequently, the authors improve performance by proposing an ILP-based incremental scheduling algorithm that dynamically adds schedules whenever new flows occur [15]. Similarly, two recent works [24,25] propose methods for dynamic (online) re-configuration of TAS scheduling and routing. In contrast, our work aims to solve the offline scheduling and routing problem in TAS.
Schweissguth et al. [16] propose an ILP-based joint routing and scheduling method that formalizes the network structure, routing, scheduling, and application requirements. Smirnov et al. [17] propose an approach that generates a valid route and schedule using a set of pseudo-Boolean constraints for automated optimization of mixed-criticality networks with time-triggered traffic. Xu et al. [18] utilize SMT and OMT to solve the co-design constraint set of scheduling and routing in TSN. Alnajim et al. [19] propose a QoS-aware path selection and scheduling algorithm that calculates the route and schedule incrementally to minimize queueing delay and preserve QoS. Hellmanns et al. [20] propose an ILP-based routing and scheduling method, improving its performance through various optimization techniques. However, these works handle only simple constraints (simplifying requirements such as deadline, network load, and topology) and suffer scaling issues for more complex problems with a larger number of flows.
Most recently, there have been attempts to adopt reinforcement learning for TSN. Yang et al. [26] propose a graph convolutional network-based routing and TAS scheduling scheme, and Yu et al. [27] propose a branching dueling Q-network-based scheme. However, they have not evaluated their proposals against ILP- or metaheuristic-based approaches, which are widely adopted in TSN. Furthermore, their evaluations are conducted with very simple network requirements. For example, all flows have the same, loose transmission interval (e.g., 5 ms) [26], or only small network topologies with under ten nodes are considered [27]. More importantly, none of these works has investigated applying distributional reinforcement learning to increase TAS schedulability. Table 1 summarizes the most relevant prior works and their key differences from our work.

Background
We begin with an overview of TSN's time-aware shaper mechanism and of reinforcement learning.

IEEE 802.1Qbv Time-Aware Shaping (TAS)
One of TSN's goals is to support a variety of traffic types in a converged network. These are generally classified into scheduled time-critical (ST), semi time-sensitive audio-video bridging (AVB), and best-effort (BE) traffic based on their requirements. An ST flow is a periodic flow that requires deterministic ultra-low latency (with hard deadlines) and low jitter without congestion loss. To guarantee these requirements, TAS in TSN isolates those flows' transmission times from other traffic types using a gating mechanism, which schedules the transmission gates of egress queues within each switch based on advance knowledge of flow information.
Fig. 1 illustrates an example of TAS operation in a TSN switch. A switch can have up to eight queues in each egress port, and each queue corresponds to a traffic class determined from the priority code point (PCP) in the VLAN tag according to the IEEE 802.1Q mapping [28]. Each queue may have an individual transmission selection algorithm (e.g., credit-based shaper (CBS) [29], asynchronous traffic shaper (ATS) [30], or a simple FIFO) that can throttle transmissions based on stream reservation criteria [31] (or the lack thereof). Additionally, each queue has a transmission gate with two states, open or closed. Frames in a queue can be transmitted only through an open gate. When multiple gates are open simultaneously (e.g., at time t_1 in Fig. 1), the transmission selection stage selects and transmits frames in descending order of traffic class among the open queues according to the strict-priority rule [14].
The gates are controlled by the gate control list (GCL). To compute this GCL, TAS first allocates ST time windows that can accommodate the transmission times of the time-critical flows that must be transmitted at their given intervals. Furthermore, to prevent ST frames from being delayed by non-ST frames, TAS allocates a guard band (GB) that closes all gates before every ST window, as shown at the bottom of Fig. 1. Finally, all remaining time outside the ST and GB windows is assigned to non-scheduled traffic (NST) windows for the transmission of all other traffic types, such as AVB and BE.

Reinforcement Learning (RL)
RL aims to learn an agent's optimal behavior by taking actions that maximize reward in a specific environment. Q-learning [32] is an RL algorithm that uses a Markov Decision Process (MDP) as its probability model (Fig. 2). In our work, the input traffic flows become the agent in the MDP, and the current network state is the environment. In Q-learning, the agent learns the optimal policy by predicting an action's future reward value (Q-value) for a specific state in the MDP. The Q-function that generates the Q-value is expressed as:

Q^π(s_t, a_t) = E[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | s_t, a_t ]   (1)

When the agent follows a policy π, Q^π(s_t, a_t) returns the sum of expected rewards obtainable by taking action a_t in state s_t. R_{t+k+1} is the reward obtained after k time units, and γ is the discount rate that expresses how important the reward of the currently selected action is compared to a future reward. γ takes a value between 0 and 1 and is designed such that a distant future reward has less effect than the present one. Initially, the Q-function is set to an arbitrary value and then learns using the following update:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ R_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]   (2)

The squared difference between the predicted Q-value and the sum of the present and discounted future rewards constitutes the cost, and learning proceeds such that this cost converges to zero.
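The update rule above can be sketched as minimal tabular Q-learning on a toy routing graph. The graph, rewards, and hyperparameters here are illustrative only, not from the paper:

```python
import random

def q_learning(neighbors, rewards, start, goal,
               episodes=500, alpha=0.1, gamma=0.9, eps=0.2):
    Q = {}  # Q[(state, action)] -> predicted future reward (Q-value)
    for _ in range(episodes):
        s = start
        while s != goal:
            acts = neighbors[s]
            # epsilon-greedy: mostly exploit the best-known action
            if random.random() < eps:
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda x: Q.get((s, x), 0.0))
            r = rewards[(s, a)]
            nxt = a  # taking an action moves the agent to the chosen node
            best_next = max((Q.get((nxt, x), 0.0) for x in neighbors[nxt]),
                            default=0.0)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            old = Q.get((s, a), 0.0)
            Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
            s = nxt
    return Q

random.seed(0)
# Two routes from A to D; the B-route's final hop is cheaper, so the
# learned Q-values should come to prefer action B in state A.
neighbors = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
rewards = {('A', 'B'): -1, ('A', 'C'): -1, ('B', 'D'): 0, ('C', 'D'): -5}
Q = q_learning(neighbors, rewards, 'A', 'D')
```

After training, the greedy policy at A picks the action with the larger Q-value, i.e., the cheaper route via B.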
Categorical Deep Q-Network (C-DQN) [21] is the RL algorithm that we adopt for TSLR. Admittedly, there are many other RL algorithms from which we could choose. However, our intuition is that C-DQN is better suited for TSLR in solving the load-balanced routing and TAS scheduling problem of TSN for the following reasons:
• C-DQN, while similar to DQN, utilizes the Bellman equation to learn approximate value distributions. DQN's prediction of a single scalar value for a particular state of a complex, partially observed environment does not adequately reflect the variations in real systems. Therefore, expressing the reward as a distribution, as in C-DQN, can help predict the future reward more accurately.
• TSLR's action space can easily be expressed as discrete, since it involves selecting a switch. Therefore, we require a DQN-style algorithm that performs efficiently in a discrete action space and is less sensitive to model fine-tuning than Actor-Critic methods.
• C-DQN's distributional characteristic allows more flexibility in making assumptions and stronger inferences in learning problems. For this reason, C-DQN allows TSLR to adapt easily to, and perform better in, a variety of environments (topologies and flow sets) without fine-tuning model hyperparameters.
• C-DQN offers superior performance in environments with bimodal or multimodal value distributions [34]. Since there may be multiple routing results that yield proper scheduling and load balancing, this feature aligns well with our problem.
• C-DQN shows better performance than other DQN-based algorithms such as Double DQN or Dueling DQN [33].
Therefore, we adopt C-DQN in the design of TSLR; the C51 algorithm [21], as implemented in TensorFlow's TF-Agents [34], provides the C-DQN implementation for our work.
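The core C-DQN/C51 idea can be illustrated with a small sketch (not TSLR's code): the return is modeled as a categorical distribution over a fixed support of atoms in [V_MIN, V_MAX], and the scalar Q-value used for action selection is that distribution's expectation. The bounds below are arbitrary:

```python
V_MIN, V_MAX, N_ATOMS = -10.0, 10.0, 51
SUPPORT = [V_MIN + i * (V_MAX - V_MIN) / (N_ATOMS - 1) for i in range(N_ATOMS)]

def expected_q(probs):
    # probs: probability mass assigned to each support atom (sums to 1)
    return sum(p * z for p, z in zip(probs, SUPPORT))

# A uniform distribution over a symmetric support has expectation 0; a
# bimodal distribution with mass on the two extreme atoms has the same
# expectation but a very different shape, which is exactly the information
# that DQN's single scalar prediction discards.
uniform = [1.0 / N_ATOMS] * N_ATOMS
bimodal = [0.0] * N_ATOMS
bimodal[0] = bimodal[-1] = 0.5
```

Both distributions above yield the same expected Q-value, which is why a distributional agent can distinguish situations that look identical to a scalar-valued DQN.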

Design
This section presents the TSLR design, which consists of the DRR and PSS algorithms.

DRR - Distributional RL-based Routing
TSLR's goal is to discover a routing path set that satisfies the deadlines of all flows. While doing so, DRR balances the load on links to the extent possible to improve the network's effective capacity. The objective function therefore combines a schedule score function S(t), which confirms how well a route set (when scheduled) satisfies the flow deadlines, a load-balancing score function L(t), and a punishment function P(t) for invalid actions, where t is the episode's time-step. The end-to-end routing of one flow is considered one episode. For each flow, the three functions are combined as in Eq. (4) to account for all rewards related to scheduling, load balancing, and punishment. DRR obtains the reward value from the combination of the three functions when an episode ends. However, to validate whether flows meet their deadlines and to calculate the generated routes' reward, scheduling must first be performed for each route set. For this purpose, TSLR executes its scheduling algorithm PSS (Section 4.2). DRR addresses the complexity problem using RL; the main challenge is to define the state, action, and reward/penalty of RL so as to achieve our goal.

STATE:
The state matrix s(t) of DRR is composed of three column vectors and a matrix:

s(t) = { reach(t), tx(t), int(t), sched(t) }   (5)

where each row corresponds to a link. Fig. 4 provides an example state when flow f_2 tries to route from SW_4 in Fig. 3. In this example scenario, the assumed frame size of each flow is 125 bytes and the link bandwidth is 100 Mbps. reach(t) is a vector that represents the location, the destination, and the current flow's path up to t. reach(t) distinguishes links into three cases. First are the links that can be reached from the current switch, including those that connect directly toward the flow's destination. If f_2's current location in Fig. 3 is SW_4, it can move to SW_2. In this case, as exemplified in Fig. 4, a relatively large value (e.g., 10000) marks the reachable links and clearly emphasizes the sender and receiver at t. Second are the links toward switches already visited by the flow, assigned a value of 0 to prevent further consideration. Finally, the rest are initialized to the tx(t) of the corresponding link to indicate the minimum potential free capacity on the link. We observe that learning instability due to deviations among these values can be suppressed through C-DQN's distributional learning strategy with mini-batch learning.
tx(t) depicts the packet size (per transmission) of the current flow to be scheduled, represented as the transmission time required for one packet on each link. tx(t) is set to ⌈frame size ÷ link bandwidth⌉. This conversion expresses tx(t) in time units equal to those of the schedule information sched(t). tx(t) must be expressed explicitly in every row because the bandwidth of each link may differ (e.g., 100 Mbps, 1 Gbps, etc.).
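A small worked example of this conversion, using the running example's assumed values (125-byte frames on 100 Mbps links) and microsecond time units:

```python
from math import ceil

def tx_time_us(frame_bytes, link_mbps):
    # Per-frame transmission time: frame size divided by link bandwidth.
    # 1 Mbps carries exactly 1 bit per microsecond, so Mbps doubles as
    # bits-per-microsecond and the result is already in microseconds.
    bits = frame_bytes * 8
    return ceil(bits / link_mbps)

t_fast = tx_time_us(125, 100)    # 1000 bits / 100 Mbps -> 10 us
t_gig = tx_time_us(125, 1000)    # same frame on a 1 Gbps link -> 1 us
```

The ceiling keeps the per-row entries in whole time units, matching the granularity of the schedule information.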
int(t) embeds the current flow's transmission interval (period), given by the ratio of T_cycle to the flow's interval, in all rows.
The sched(t) matrix represents the current scheduling information of all network links. To reduce the state matrix's size for faster runtime, sched(t) is expressed using T_cycle ÷ ρ columns, with an adaptive compression value ρ for shrinking sched(t), as follows:

sched(t)_{i,j} = ρ − Σ_{x=jρ}^{(j+1)ρ−1} res_i(t, x)   (6)

where i and j represent the row (link) and the column (time), respectively; jρ and (j+1)ρ − 1 are therefore the beginning and end of the time range assigned to column j. All columns of sched(t) are initialized to ρ, the size of their allocated time range. res_i(t, x) indicates in binary whether link i is scheduled at time x; for example, if link i is reserved at 0 μs, res_i(t, 0) equals 1, and otherwise 0. Therefore, as reservations progress, sched(t)_{i,j} decreases from ρ to 0.
A larger ρ yields a smaller sched(t) matrix, seeking faster learning. However, the more the sched(t) size is compressed by ρ, the more abstract the scheduling information becomes. Therefore, by adjusting ρ according to the environment in which DRR is to be run, it is possible to control the trade-off between accuracy and latency, i.e., accurate expression of scheduling information versus faster DRR operation. For example, if DRR runs in a high-performance GPU environment, ρ could be set to 1 to reflect accurate scheduling information; otherwise, ρ could exceed 1 to reduce the execution time.
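The compression described above can be sketched as follows (a hedged illustration, not the paper's code): each row's T_cycle binary slots collapse into T_cycle ÷ ρ columns, each holding its remaining free time, i.e., ρ minus the number of reserved slots in its range:

```python
def compress_schedule(reserved, t_cycle, rho):
    # reserved: set of reserved time units of one link within [0, t_cycle)
    cols = t_cycle // rho
    return [rho - sum(1 for x in range(j * rho, (j + 1) * rho) if x in reserved)
            for j in range(cols)]

# Hypothetical link with 20 time units and rho = 5: units 0-2 and 7 reserved.
row = compress_schedule({0, 1, 2, 7}, t_cycle=20, rho=5)  # -> [2, 4, 5, 5]
```

With ρ = 1 every column maps to one time unit (exact information); larger ρ shrinks the row at the cost of abstraction, which is the trade-off discussed above.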
In our scheme, we assume that each end system is integrated with a switch ('talker' and 'listener' in TSN terminology). If end systems and switches are separated, the rows of the state matrix s(t) should include the additional links connected to the end systems. However, if an end system connects to only one switch, that link can safely be excluded from s(t) to reduce size and complexity. In this case, the source of a flow from such an end system is replaced by its directly connected switch.
ACTION: An action a(t) is the choice of the next switch to visit for routing. Therefore, if switches are indexed by integers starting from 0, the range of a(t) that the agent can take is between 0 and 'number of switches − 1'. However, selecting a switch not directly connected to the current location would be an invalid action. DRR avoids this by removing such actions from the action space; i.e., the action space is defined by the links toward each neighbor of the current location. As exemplified in Figs. 3 and 4, if flow f_2 tries to route from SW_3, a valid action is either the link to SW_1 or the link to SW_4.
There are also other invalid actions that cannot be known in advance, before routing or scheduling, and thus must be handled with a penalty in the reward. Selecting a switch that creates a routing loop is an invalid action. Likewise, an action that forces link utilization to exceed 100% of the link bandwidth is obviously invalid. Finally, to achieve global load balancing, DRR compares the maximum link utilization of the current path set to that of previously scheduled sets (considering only sets that successfully scheduled all flows); an action that increases the maximum link utilization is also defined as invalid. These invalid actions are punished with a penalty in the reward.
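The action-space restriction and loop avoidance described above can be sketched as a simple mask over switch indices. The adjacency map below is a hypothetical fragment chosen to match the SW_3 example (neighbors SW_1 and SW_4), not the full Fig. 3 graph:

```python
def valid_actions(adjacency, current, visited):
    # Keep only links toward neighbors of the current switch, and mask out
    # switches already on the flow's path to prevent routing loops.
    return [sw for sw in adjacency[current] if sw not in visited]

adjacency = {3: [1, 4], 1: [0, 2, 3], 4: [2, 3]}
acts = valid_actions(adjacency, current=3, visited={3})  # -> [1, 4]
```

The remaining invalid cases (bandwidth overrun, increased maximum link utilization) cannot be masked in advance and are instead punished through the reward, as stated above.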

REWARD:
The reward function r(t) is composed as follows:

r(t) = { p_thresh,  if a(t) is invalid;  S(t) + L(t),  otherwise }   (7)

DRR first determines whether an action is valid or invalid and assigns the reward r(t) in Eq. (7) accordingly. If an action is invalid, the reward function takes a penalty value p_thresh, set one unit below the most negative reward that a valid action can receive, where S(t) is the reward from the schedule score function in Eq. (8) and L(t) is the reward from the load-balancing score function in Eq. (9). The worst reward that can theoretically be received without falling into the penalty state is the minimum value of S(t) + L(t); the minimum of S(t) is 0 (when its binary indicators are 0), so p_thresh must lie below the minimum of L(t) given by Eq. (9). However, if p_thresh is made far smaller, the reward range becomes too large and the value-distribution precision of C-DQN decreases. Therefore, p_thresh is set only one unit below that minimum to keep the reward range narrow. DRR's primary goal is to create a routing path set that satisfies the deadlines of as many flows as possible when scheduled. For this purpose, S(t) in Eq. (8) reflects whether the scheduling result complies with the flows' deadlines; i.e., the more flows comply with their deadlines, the greater the reward. The binary indicators in Eq. (8) are 1 if their respective conditions are satisfied and 0 otherwise. N denotes the number of switches in the topology and M the number of flows. ComplianceRate(t) represents the ratio of flows that succeed in meeting their deadlines among the total flow set. DRR accepts a routing result when it does not decrease ComplianceRate(t). S(t) carries a base value by default to reward the successful routing of one flow, and it awards an additional reward as a function of the overall scheduling result via ComplianceRate(t). However, for the reward to have a meaningful distribution, its value must be clearly distinguished according to the scheduling result. The simplest method is to impose an extremely large change in the reward's absolute value according to the scheduling result, but this would make the other rewards meaningless. Therefore, DRR takes N·M as the base of the exponential term in the reward equation for the scheduling result, so that a higher ComplianceRate(t) produces a greater change in the reward; finally, N is multiplied in to increase the maximum value accordingly.
DRR has two policies for load balancing: global and local. The global load-balancing policy employs min-max fairness criteria to achieve "minimizing the maximum link utilization" via the penalty function for invalid actions. The local load-balancing policy, in turn, is added as L(t) under the intuition that avoiding links with higher utilization helps future scheduling, even in the local scope. Note that if L(t) were positive, the reward could increase with path length, which would incorrectly encourage DRR to select unnecessarily long paths. To prevent this, L(t) is always negative, as follows:

L(t) = −M · 1024^{u(t)}   (9)

where u(t) is the utilization of the current link. When link utilization increases by 10% (0.1), the agent receives a reward worse by a factor of two (1024^0.1 = 2). The multiplication by M relates to the reward's range: according to Eqs. (7) and (8), the range of rewards increases with M, and if L(t) had a fixed range, it would hardly affect the overall reward value and its influence on the value distribution could disappear. Therefore, M is multiplied in to increase L(t)'s influence on the reward function.

Fig. 5 illustrates an overview of TSLR. When a flow list on a network topology is given as input and the flows' paths are initialized to the shortest paths, DRR iteratively finds alternate routes for each flow and replaces them according to its policy. DRR learns for a certain period by evaluating the alternate routes' rewards in terms of TAS scheduling success rate and load balancing according to Eq. (7). DRR training and evaluation repeat continuously, and the actual routes are updated gradually during this process. Because the problem's form can vary significantly with the network topology or flow set, DRR trains without a neural network pre-trained on other networking scenarios.
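The scaling of the local load-balancing penalty described above can be checked numerically: with 1024 as the base, every additional 10% of link utilization doubles the magnitude of the (negative) reward, since 1024 = 2^10. This sketch assumes the shape of the penalty as stated in the text; the paper's exact equation may differ in detail:

```python
def local_lb_penalty(utilization, n_flows):
    # Negative by construction, so longer paths never gain reward; scaled
    # by the number of flows so its influence grows with the reward range.
    return -n_flows * (1024 ** utilization)

r10 = local_lb_penalty(0.10, n_flows=100)
r20 = local_lb_penalty(0.20, n_flows=100)
# r20 is twice as punishing as r10: each +10% utilization doubles the penalty.
```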

PSS - Path Step Scheduling
The PSS algorithm receives the routing result generated by DRR as input and generates the network's schedule. Algorithm 1 is the pseudo-code of PSS. We use the example scenario in Fig. 3 to describe PSS operation and assume that flow f_0 has a 500 μs interval while f_1 and f_2 have 250 μs intervals. Each flow's deadline matches its interval.
Initialization (Lines 1-5): scheduleSet is initialized as a set of schedule lists (i.e., GCLs repeated with period T_cycle) for each switch in the network. T_cycle is calculated as the least common multiple of the intervals of the flows in flowList, and maxLength is set to the maximum path length (hop count) over all flows. Finally, flows with the same interval are grouped together, and these groups are sorted in ascending order of interval. For the example in Fig. 3, initialization proceeds accordingly.
Lines 6-8: Scheduling is performed in group order, from smaller intervals first, proceeding as many times as the number of transmissions (itrNum) within one T_cycle. For example, if T_cycle is 500 μs and a flow's interval is 250 μs, that flow transmits twice within T_cycle (i.e., itrNum = 2 in line 8). Furthermore, not all of a flow's transmissions are scheduled at once; instead, the n-th transmissions of the same flow group are scheduled in a batch. Therefore, in the example of Fig. 3, scheduling proceeds in the order of (1) the first transmissions of f_1 and f_2, (2) their second transmissions, and (3) the transmission of f_0. The reason for scheduling flows with smaller intervals first is that their constraints (deadlines) are tighter, and when multiple transmissions occur within T_cycle, later transmissions are affected by the time occupied by earlier ones. Furthermore, the reason not to schedule all transmissions of one flow simultaneously is to ensure fairness to other flows. For example, consider Fig. 6(a), where T_cycle is 500 μs and the intervals and deadlines of flows f_{i−1} and f_i, and of the remaining flow groups, are all 250 μs. When flow f_{i−1} utilizes links in the order 1-2-3, flow f_i is unable to comply with its deadline (i.e., f_i's first transmission is not delivered within 250 μs) due to the second transmission of f_{i−1}. This shows that scheduling all interval transmissions of one flow at once imposes a bigger constraint on the scheduling of subsequent flows. Therefore, PSS first schedules all n-th interval transmissions within the same flow group and then schedules the (n+1)-th interval transmissions. This method, for example, enables flow f_i to comply with its deadline by scheduling its first transmission earlier than f_{i−1}'s second transmission on link 3, as shown in Fig. 6(b).
Line 9: As the name suggests, PSS schedules step by step, where a step indicates a hop along the path. After scheduling the first hop of all flows in one iteration, PSS schedules the second hop of each flow, and so on. In Fig. 3, {f_1: 0 → 1} and {f_2: 3 → 4} are scheduled first, and then the scheduling of {f_1: 1 → 2} and {f_2: 4 → 2} follows.
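The initialization above (hyper-period as the LCM of flow intervals, grouping by interval, ascending sort, and per-group transmission counts) can be sketched as follows, using the Fig. 3 flow intervals; names such as init_pss are illustrative:

```python
from math import lcm

def init_pss(flow_intervals):
    # Hyper-period: least common multiple of all flow intervals.
    t_cycle = lcm(*flow_intervals.values())
    # Group flows by interval; smaller intervals (tighter deadlines) first.
    groups = {}
    for flow, interval in flow_intervals.items():
        groups.setdefault(interval, []).append(flow)
    ordered = sorted(groups.items())
    # Each flow transmits t_cycle // interval times per hyper-period.
    itr_num = {iv: t_cycle // iv for iv in groups}
    return t_cycle, ordered, itr_num

# Fig. 3 example: f0 has a 500 us interval, f1 and f2 have 250 us intervals.
t_cycle, ordered, itr_num = init_pss({'f0': 500, 'f1': 250, 'f2': 250})
```

Here t_cycle is 500 μs, the 250 μs group {f1, f2} comes first, and itr_num records two transmissions per hyper-period for that group and one for f0, matching the scheduling order described above.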
Lines 10-17: This part determines which flow should be scheduled on which link in the current step. The algorithm identifies the link at the current step (hop) of each flow's path and stores the flow that should be reserved on the corresponding link. When this process is completed, the flow-link pairs that need to be scheduled at the current step are stored in curFlows.
Lines 18-22: PSS reads the list of flows and links to be scheduled and performs the actual scheduling. The list of flows is sorted in ascending order of remaining deadline per remaining hop, so that flows with imminent deadlines are scheduled first. Then, getTimeSlots finds the free spaces (called timeslots) within T_cycle that can be scheduled on the current link. In setSchedule, a flow is scheduled in a suitable timeslot under consideration of its constraints. If there are multiple timeslots, the algorithm checks sequentially, from the front, whether a timeslot has sufficient space and satisfies the flow's constraints. If so, the scheduler selects that timeslot; otherwise, it checks the next one, and if no timeslot is available, it stops scheduling.

getTimeSlots() in line 21:
In most TAS scheduling approaches, a timeslot can have a maximum length of T_cycle, as shown in Fig. 7. In general, either a timeslot of length T_cycle is divided into several sub-timeslots in advance and schedules are assigned to these sub-timeslots, or each timeslot is split in two around the newly scheduled part. In both approaches, it is assumed that a GCL schedule cannot extend beyond the endpoint of the original timeslot of length T_cycle. PSS is similar to the latter approach. However, getTimeSlots() of PSS exploits the fact that the TAS schedule repeats. For example, the new timeslot at step 2 in Fig. 7 ends at t_1, beyond t_0, the endpoint of T_cycle. This allows PSS to expand the scheduling space, increasing schedulability.
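A hedged sketch of this getTimeSlots-style search (not the paper's code): given reserved [start, end) intervals on a link within one hyper-period, it returns the free slots, and because the TAS schedule repeats every T_CYCLE, the free interval touching the end of the cycle is merged with the one starting at 0, letting a reservation extend past the cycle boundary as described above:

```python
T_CYCLE = 500  # microseconds (hypothetical hyper-period)

def get_time_slots(reservations, t_cycle=T_CYCLE):
    busy = sorted(reservations)
    free, cursor = [], 0
    for start, end in busy:
        if start > cursor:
            free.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < t_cycle:
        free.append((cursor, t_cycle))
    # Wrap-around: a slot ending at t_cycle continues into the slot that
    # begins at 0 in the next repetition of the schedule.
    if len(free) >= 2 and free[0][0] == 0 and free[-1][1] == t_cycle:
        first, last = free[0], free[-1]
        free = free[1:-1] + [(last[0], t_cycle + first[1])]
    return free

slots = get_time_slots([(100, 200), (300, 400)])
```

Without the wrap-around merge, the free slots would be (0, 100), (200, 300), and (400, 500); with it, the last slot becomes (400, 600), which is the extra scheduling space PSS gains.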
The computational complexity of PSS depends mainly on the number of link-scheduling operations. There are additional operations, such as flow sorting, but they are relatively insignificant. The number of link-scheduling operations N_s can be expressed as:

N_s = Σ_{f ∈ F} hop(path_f) · (T_cycle ÷ interval_f)

The number of links to be scheduled per flow depends on the number of links in the flow's path (hop(path_f)) and on how many times each link must be scheduled according to the flow's transmission interval (T_cycle ÷ interval_f). Since scheduling must be performed T_cycle ÷ interval_f times for each link, the number of link-scheduling operations required by one flow is hop(path_f) · (T_cycle ÷ interval_f). Aggregated over all flows f in the entire flow set F, this yields the total number of link-scheduling operations N_s performed by PSS. Therefore, PSS has O(N_s) complexity, which depends on the transmission intervals and flow path lengths.
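A worked instance of this operation count: N_s sums, over all flows, the flow's hop count times its transmissions per hyper-period. The flow set below uses the Fig. 3 intervals; the two-hop path lengths are assumed for illustration:

```python
def link_scheduling_ops(flows, t_cycle):
    # flows: list of (hop_count, interval) pairs; each link on a flow's
    # path is scheduled once per transmission within the hyper-period.
    return sum(hops * (t_cycle // interval) for hops, interval in flows)

# f0: 2 hops at 500 us; f1 and f2: 2 hops each at 250 us; T_cycle = 500 us.
n_s = link_scheduling_ops([(2, 500), (2, 250), (2, 250)], t_cycle=500)
# -> 2*1 + 2*2 + 2*2 = 10 link-scheduling operations
```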

Evaluation
We evaluate TSLR against two state-of-the-art algorithms, Joint Routing and Scheduling (JRaS) [16] and ILP-Red+Conf+Adv (IRCA) [20], and also against simulated-annealing-based dynamic routing with PSS (SA/PSS) for comparison with a heuristic approach. SA/PSS is a metaheuristic algorithm that seeks a better solution by changing the current route to other routes from the entire route set while applying the same PSS for scheduling. The main point of the comparison with SA/PSS is whether DRR's RL-based routing proves effective.
We conduct an extensive set of simulations using multiple different network topologies to evaluate TSLR's scalability. We compare the scheduling time, schedulability, scheduled flow latency and jitter, maximum link utilization, and path length distribution.
We use ten different network topologies for evaluation. The first is NASA's Orion Crew Exploration Vehicle (CEV) network topology in Fig. 8 [35-37], which consists of 15 switches. The five topologies in Fig. 9 are randomly generated according to the specifications in Table 2 to show that our scheme generalizes to non-specific topologies. For these random topologies, we provision two links per node. This is a popular approach to support two-link redundancy using industrial fault-tolerance protocols such as high-availability seamless redundancy (HSR) and the parallel redundancy protocol (PRP) specified in IEC 62439-3, or TSN's frame replication and elimination for reliability (FRER) in IEEE 802.1CB [38]. In addition, to observe the impact of the node-to-link ratio on TSLR, four additional random topologies having the same number of nodes or links as the T30 graph (Fig. 9(c)) are used, as shown in Fig. 11. All links are full-duplex with 100 Mbps bandwidth, and all schemes use up to 75% of link bandwidth for scheduling to comply with the TSN standard.
Five ST flow types are used as described in Table 3, and we vary the total number of flows from 40 to 200 in the CEV scenario and from 100 to 500 in the random topology scenarios; one fifth of the flows are of each type. For the topologies in Fig. 11, the same number of flows as in the T30 scenario (300 flows) is used. We set the flow deadlines to 100 μs according to industry standards [39,40] and use five different flow intervals to simulate a complex IIoT scenario. The DRR parameter settings are listed in Table 4. For ρ, we adopt 5 as the default value in the CEV scenarios. However, in the random topology scenarios, ρ is adjusted to reduce the size of TSLR's state matrix: in T50, ρ is 20, and in the other random topology scenarios, ρ is 10.

Performance of routing and scheduling
Scheduling time is defined as the earliest time at which the schedules for all flows are created while satisfying the deadline and bandwidth requirements. Scheduling time is very important for TSN since it directly impacts the network's schedulability within a given time limit. Since TSLR is not a pre-trained model, the scheduling time of TSLR includes both training time and evaluation (testing) time. Note that a scheduling time result exists if and only if scheduling succeeds; in other words, there is no scheduling time result if scheduling fails. Thus, the scheduling time results also reflect schedulability.
The scheduling time is influenced not only by the size and bandwidth of the network topology but also by the number, size, and requirements of the flows, i.e., the traffic load. Fig. 10(a) plots the scheduling time as we increase the traffic load on the CEV topology. TSLR is the fastest in all cases, and the execution time of the other algorithms increases dramatically as the number of flows grows. For example, in a simulation with 120 flows, TSLR's scheduling time is about 3605 s shorter than that of JRaS and 2478 s shorter than that of SA/PSS. In addition, JRaS fails to schedule 160 flows and beyond due to an out-of-memory (OOM) error, and IRCA is unable to find a schedule beyond 80 flows. On the other hand, TSLR schedules about 66% more flows than JRaS and 150% more flows than IRCA while successfully scheduling 200 flows, showing significantly improved schedulability.
These results of JRaS and IRCA are worse than those presented in [20] because IRCA is designed for flow sets with identical intervals. According to these results, the computing resources required by the ILP-based approaches increase rapidly with the number of flows to be scheduled (with multiple transmissions due to intervals and a complicated hyper-period), and the chances of finding a solution drop rapidly. SA/PSS shows results similar to the ILP-based approaches. Although OOM errors do not occur, SA/PSS cannot find solutions within ten hours for 160 flows and beyond. These results illustrate that the ILP-based methods and the heuristic method of changing the path from the entire path set are inefficient. TSLR succeeds in generating schedules for a complex-interval flow set in a relatively short time in all cases, showing significantly improved schedulability and scheduling time. This confirms the efficacy of an RL-based routing method like DRR. Fig. 10(b) plots the scheduling time results for the random topologies as we increase the network size and the number of flows. In order to maintain a similar proportional traffic load, both the network size and the number of flows are increased at the same rate, as in Table 2. It can be seen that the increase in network size (which increases the average path length of flows) and the number of flows have a significant impact on the scheduling complexity and, thus, the scheduling time. Nevertheless, TSLR still significantly outperforms the ILP-based schemes. JRaS and IRCA consume substantial time on T10 and return out-of-memory errors on larger scenarios. SA/PSS succeeds in scheduling very quickly in T10 and T20; however, this is because PSS schedules successfully based on the initial routing result without executing the SA-based routing algorithm. In the other scenarios, SA/PSS returns OOM. This shows that even generating a routing set is difficult in a scenario with a relatively large topology, and that TSLR is more effective in terms of both schedulability and scheduling time than the other approaches across different topologies and flow sets. TSLR succeeds in routing and scheduling input flows in various topologies without a pre-trained model because the RL-based routing of DRR can find reasonable routes even when unable to view the entire routing set.
Finally, to understand the impact of the node-link ratio on TSLR, we used the T30 topology as a reference and varied the number of nodes and links. Specifically, we used a fixed number of nodes from T30 with different numbers of links, and vice versa, as shown in Fig. 11. Fig. 12 plots the scheduling results on these topologies. The bar graph represents the number of flows whose schedules meet their deadline requirements, and the line graph represents the scheduling time for the best result within a time limit (10 h, red dotted line). In general, when the number of links increases or the number of nodes decreases, there are abundant links, such that the shortest path is sufficient for scheduling without the need for the alternative-route search by RL. On the other hand, when the number of nodes increases or the number of links decreases, there is an insufficient number of links to fully support (schedule) the given set of flows. Specifically, in the N40-L60 scenario (Fig. 11(b)) with 40 nodes and 60 links, only 295 flows were scheduled successfully within the time limit, and only 270 flows succeeded in the N30-L40 topology (Fig. 11(d)). This shows that the node-link ratio of the network greatly affects the scheduling performance given a fixed number of flows: if there are too many links, alternate routes are not necessary, and if there are too few links, the number of flows that can be scheduled must be reduced accordingly.
Fig. 11. Fixed random network topologies with a fixed number of nodes or links. In each subfigure's name, the number after N is the number of nodes, and the number after L is the number of links.
Fig. 12. Scheduling result of TSLR in fixed random graphs. The grey bars mean that TSLR fails to configure successful TAS schedules for 300 flows.
Overall, the evaluation results show that it is impractical to have all paths as a search space, yet alternate paths should be explored to increase the schedulability of TAS, and TSLR addresses this problem appropriately. TSLR succeeds in routing and TAS scheduling with significantly less running time and memory usage than the existing methods in all evaluated scenarios.
Latency/Jitter: Fig. 13 plots the latency of ST flows, showing the latency results at the moment the scheduling time was measured, i.e., when the first successful schedule was created. The red dotted horizontal line is the deadline that the flows must satisfy. Since we define successful scheduling as generating a schedule that satisfies the latency requirements for all time-critical flows, no flow violates the deadline. Whether such a schedule can be found was already reflected in the aforementioned scheduling time and schedulability results. Nevertheless, TSLR's overall average latency is similar to that of JRaS and IRCA and considerably lower than that of SA/PSS. For example, the average latency of TSLR in all CEV scenarios is below 25 μs, which is 25% of the flows' latency requirement. (The reason TSLR and SA/PSS have the same latency in most smaller scenarios is that PSS successfully scheduled on the initial routing without needing to explore alternatives.) This implies that TSLR effectively generates a schedule considering deadlines. The optimization functions of JRaS and IRCA aim to reduce latency, whereas TSLR considers detour (possibly longer) paths for load balancing. Even so, in the T10 scenario, TSLR's latency is lower than that of JRaS and IRCA, and it is markedly lower than that of SA/PSS in the CEV 120-flow scenario. Finally, Tables 5 and 6 show that TSLR maintains acceptable average jitter, staying within 5 μs in most scenarios. Although this is not as good as JRaS and IRCA, which achieve zero jitter, their schedulabilities are poor. The main result is that TSLR sacrifices a little latency and acceptable jitter for significantly improved schedulability. Therefore, we conclude that TSLR is competitive when considering schedulability, latency, and jitter jointly.
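The latency and jitter metrics discussed above can be computed from per-transmission records. A minimal sketch follows; the record format and function name are our own assumptions: latency is arrival minus release time, jitter is the max-min spread of a flow's latencies, and a flow is schedulable only if its worst-case latency meets the deadline.

```python
def latency_and_jitter(records, deadline_us):
    """records: {flow_id: [(release_us, arrival_us), ...]}, one pair per
    transmission.  Returns per-flow average latency, jitter, and whether
    every transmission met the deadline."""
    stats = {}
    for fid, txs in records.items():
        lats = [arr - rel for rel, arr in txs]
        assert all(l >= 0 for l in lats), "arrival must follow release"
        stats[fid] = {
            "avg_latency": sum(lats) / len(lats),
            "jitter": max(lats) - min(lats),          # worst-case spread
            "meets_deadline": max(lats) <= deadline_us,
        }
    return stats

# Toy example: one flow, two transmissions with 20 us and 24 us latency.
stats = latency_and_jitter({"f1": [(0, 20), (100, 124)]}, deadline_us=100)
```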
Load balancing and path length (Hops): Fig. 14(a) displays the load balancing performance of TSLR. Maximum utilization is the utilization of the bandwidth allocated to TSN. According to Fig. 10(a), TSLR generates successful schedules in less than an hour on average for all CEV scenarios. However, in this simulation, we fix the execution time of TSLR to one hour to obtain improved load balancing results. In all scenarios of Fig. 14(a), TSLR achieves the smallest maximum utilization among all methods. In particular, in the CEV-120 scenario, the maximum utilization of TSLR is approximately 50% lower than that of SA/PSS. This indicates that the flows have been distributed over alternate links for a balanced traffic load, leaving more available bandwidth for later flows, including best-effort traffic.
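The maximum link utilization metric can be computed by accumulating each flow's rate over the links of its route and dividing by the TSN share of the link capacity (75% of 100 Mbps in our setup). A sketch with made-up flows; the data format is our own assumption:

```python
def max_link_utilization(flows, link_capacity_bps, tsn_share=0.75):
    """flows: list of (route, frame_bits, interval_s); a route is a list
    of directed links.  Utilization per link is the summed rate of flows
    crossing it, relative to the TSN share of the link capacity."""
    budget = link_capacity_bps * tsn_share
    load = {}
    for route, bits, interval in flows:
        rate = bits / interval                 # average bits per second
        for link in route:
            load[link] = load.get(link, 0.0) + rate
    util = {link: r / budget for link, r in load.items()}
    return max(util.values()), util

# Two hypothetical flows sharing link A->B on a 100 Mbps network.
flows = [
    ([("A", "B"), ("B", "C")], 12_000, 0.001),   # 12 Mbps along A->B->C
    ([("A", "B")], 6_000, 0.001),                # 6 Mbps on A->B only
]
worst, per_link = max_link_utilization(flows, link_capacity_bps=100_000_000)
```

Minimizing `worst` across candidate routes is the min-max load balancing objective the routing agent pursues.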
Fig. 14(b) plots the empirical cumulative distribution function (eCDF) of flow path lengths for the CEV 120-flow scenario in Fig. 14(a). IRCA is excluded because its scheduling failed in this scenario. Since TSLR considers detour routes (deviating from shortest paths) for load balancing, path length increases are inevitable. However, compared to JRaS, the increases are small and insignificant considering the benefits gained. Additionally, compared to SA/PSS, the overall lengths of the TSLR-generated paths are notably shorter. This means that TSLR's negative reward policy for routing effectively suppresses the increase in path lengths while pursuing the schedulability and load balancing goals.
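The eCDF in Fig. 14(b) is a simple construction; a sketch over a hypothetical list of per-flow hop counts (the sample values are illustrative only):

```python
def ecdf(samples):
    """Empirical CDF: for each distinct value, the fraction of samples
    less than or equal to it."""
    n = len(samples)
    return [(x, sum(1 for s in samples if s <= x) / n)
            for x in sorted(set(samples))]

hops = [2, 3, 3, 4, 5]   # hypothetical per-flow path lengths in hops
curve = ecdf(hops)
```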

Performance of the PSS scheduler with static routing
To understand the performance of the scheduler, we isolate the scheduling parts of the three schemes and compare them on a fixed route set generated by DRR, as in Table 7. Fig. 15 plots the scheduling time of each algorithm on the CEV topology, showing that PSS completes TAS scheduling in less than 1 s, even for 200 flows. This is because PSS derives only one result for each input according to the criteria of the greedy algorithm. On the other hand, the scheduling algorithms of JRaS and IRCA take a considerable amount of time, and out-of-memory errors occur as the number of flows increases. Due to the characteristic of TSLR, which treats a set of TAS schedules as an environment for RL, a single scheduling execution on a route set produced by DRR must complete quickly. Thus, it is not feasible for TSLR to adopt the scheduling algorithms of JRaS or IRCA for its purpose. Furthermore, despite being a greedy algorithm, PSS achieves comparable or even superior flow latencies to the other two on the same route set, as portrayed in Fig. 16. Jitter results are similar to those with dynamic routing (see Table 8).
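PSS's single-pass behavior can be illustrated with a generic greedy as-soon-as-possible scheduler. This is a sketch only; PSS's actual criteria, per-hop timing model, and data structures differ. Flows are taken in deadline order, and each frame gets the earliest start time at which every link on its route is free, advancing one hop per time unit.

```python
def greedy_schedule(flows, hop_delay_us=1):
    """flows: list of (flow_id, route, release_us, deadline_us).
    Greedy ASAP: one pass, earliest feasible start per flow, no backtracking."""
    busy = {}        # link -> set of occupied time slots
    schedule = {}
    for fid, route, release, deadline in sorted(flows, key=lambda f: f[3]):
        t = release
        while True:
            # Frame occupies link i during slot t + i (one hop per time unit).
            slots = [(link, t + i * hop_delay_us) for i, link in enumerate(route)]
            if all(s not in busy.get(link, set()) for link, s in slots):
                break
            t += 1
            if t + len(route) * hop_delay_us > deadline:
                return None   # infeasible under this greedy order
        for link, s in slots:
            busy.setdefault(link, set()).add(s)
        schedule[fid] = t
    return schedule

# Two toy flows contending on link A->B.
flows = [
    ("f1", [("A", "B"), ("B", "C")], 0, 10),
    ("f2", [("A", "B")], 0, 10),
]
sched = greedy_schedule(flows)
```

Because each flow is placed exactly once, the runtime is low and deterministic, which is what makes such a scheduler usable inside an RL loop that must evaluate many candidate route sets.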

Conclusion
In this paper, we proposed TSLR for TSN. It explores alternative routes for improved TAS scheduling and addresses the problem's complexity using RL. As such, it accounts for both network load balancing and the flows' deadlines to select a good scheduling option depending on the input flow set. We evaluated TSLR on various topologies with a diverse set of flows and compared it against two state-of-the-art algorithms, JRaS and IRCA, as well as the heuristic algorithm SA/PSS, to show that the scheduling performance improves while achieving 'min-max fair' load balancing with negligible increases in path length.
However, our approach has a couple of limitations. First, when the network size grows, the learning performance deteriorates. Second, TSLR must be trained anew for each network scenario. We plan to investigate solutions for these challenges in future work. Nonetheless, we believe that this work provides a reference for studies that attempt to graft machine learning onto TSN. As further future work, we plan to explore multi-path sub-stream routing with IEEE 802.1CB frame replication and elimination for reliability for robust TSN.

Fig. 10. Comparison of scheduling time for the four algorithms. (A missing data point means scheduling failed to complete; only TSLR succeeds beyond 160 flows or beyond the T30 scenario, respectively.)

Fig. 13. Latency results of flows for simulations of routing and scheduling. Missing results mean scheduling failed.

Fig. 16. Latency results of flows with static routes in CEV.

Table 1
Summary of related works.

Table 2
Random topology specifications for the simulation.

Table 3
Traffic specifications for the simulation.

Table 5
Average jitter in CEV scenarios. '-' means no result can be obtained because scheduling all flows failed.

Table 6
Average jitter in random topology scenarios. '-' means scheduling failed.

Table 7
Routing results in the CEV 40-flow scenario.

Table 8
Average jitter with static routing in CEV.