Intelligent Decision-Making of Scheduling for Dynamic Permutation Flowshop via Deep Reinforcement Learning

Dynamic scheduling problems have received increasing attention in recent years due to their practical implications. To realize real-time, intelligent decision-making in dynamic scheduling, we studied the dynamic permutation flowshop scheduling problem (PFSP) with new job arrival using deep reinforcement learning (DRL). A system architecture for solving the dynamic PFSP using DRL is proposed, and a mathematical model that minimizes total tardiness cost is established. Additionally, the intelligent scheduling system based on DRL is modeled, with state features, actions, and reward designed. Moreover, the advantage actor-critic (A2C) algorithm is adapted to train the scheduling agent. The learning curve indicates that the scheduling agent learned to generate better solutions efficiently during training. Extensive experiments are carried out to compare the A2C-based scheduling agent with every single action, other DRL algorithms, and meta-heuristics. The results show the good performance of the A2C-based scheduling agent in terms of solution quality, CPU time, and generalization. Notably, the trained agent generates a scheduling action in only 2.16 ms on average, which is almost instantaneous and can be used for real-time scheduling. Our work can help to build a self-learning, real-time optimizing, and intelligent decision-making scheduling system.


Introduction
Production scheduling is one of the important issues to consider for a production workshop. Since the flowshop is one of the most widely used workshop types, the permutation flowshop scheduling problem (PFSP) has been studied extensively over the past several decades. According to a review by Fernandez-Viagas et al. [1], hundreds of heuristic and metaheuristic methods have been proposed to solve the PFSP over the past 60 years. Studies of the PFSP have a significant impact on the scheduling field. The permutation constraint in the PFSP requires that the job sequence on the first machine is maintained for all successive machines, i.e., the job sequence is the same for all machines in the flowshop. The PFSP has been proved non-deterministic polynomial-time hard (NP-hard) when the number of machines is more than two [1]. To solve this NP-hard problem, many hybrid meta-heuristics have been proposed [2,3]. The objective of the PFSP is to find a good job sequence that minimizes makespan [4][5][6], flow time [7][8][9][10], tardiness [11][12][13][14], multiple objectives [15][16][17][18][19][20], etc.
Among these dynamic characteristics, new job arrival has recently received increasing attention. In this problem, jobs arrive at the production system randomly rather than at time zero, as generally assumed for the static PFSP. Rescheduling is needed when a new job arrives. This problem is described as a real-time scheduling problem in some literature [25,38,39,40].
Several researchers have studied the dynamic PFSP with new job arrival using heuristics and meta-heuristics. Rahman, Sarker and Essam [38] investigated the PFSP with real-time order arrival and used GA to repeatedly re-optimize the solution as each new order arrives. Liu et al. [41] proposed a scheduling strategy by integrating match-up and real-time strategy and provided eleven new heuristics with ten existing and one new priority rule. Li et al. [42] solved the flow shop scheduling problem with new job arrival using the constructive heuristic method. The heuristic method successively puts each job in the current best position and reinserts several jobs based on weight calculations. Liu et al. [43] studied the dynamic PFSP with new job arrival using an improved IG, which is equipped with a novel repair method and new enhanced destruction and reconstruction.
With the advance of artificial intelligence (AI), DRL and reinforcement learning (RL) have also been used to optimize scheduling problems. The DRL method trains a scheduling agent to choose the best job to process next based on the current production status. Since a scheduling problem can be regarded as a sequential decision problem, it is suitable to be solved by DRL [44]. Many scheduling problems, such as the flowshop scheduling problem (FSP), job shop scheduling problem (JSP), assembly shop scheduling problem (ASP), and parallel machine scheduling problem, have been solved by DRL or RL. For the FSP, Zhang et al. [45] used an online TD algorithm to minimize makespan. For the JSP, several studies have been carried out for both the static and the dynamic JSP. Lin et al. [46] studied the JSP under an edge computing framework and used a multiclass deep Q network (DQN) to generate scheduling decisions for multiple edge devices. Liu et al. [47] solved the JSP using an actor-critic algorithm. Zhang et al. [48] used DRL to automatically learn priority dispatching rules for the JSP. They exploited the disjunctive graph representation of the JSP and proposed a graph-neural-network-based scheme to embed states. Han and Yang [49] studied the adaptive JSP using dueling double DQN. Luo [44] studied the dynamic JSP with new job insertions to minimize total tardiness using DRL. A DQN is used to select appropriate dispatching rules at each rescheduling point, and seven generic state features are extracted to represent production status. For other types of workshops, Wang et al. [50] investigated adaptive scheduling for an assembly job shop with uncertain assembly times using a dual Q-learning algorithm, which contains top-level and bottom-level Q-learning methods. Shiue, Lee and Su [39] studied an RL-based real-time scheduling problem using a multiple dispatching rules strategy to respond to changes in a manufacturing system. Shiue et al. [51] studied the dynamic scheduling of a flexible manufacturing system and semiconductor wafer fabrication using RL. Zhang et al. [52] studied the scheduling of unreliable parallel machines to minimize mean weighted tardiness using RL.
In addition to the DRL, other AI approaches, such as machine learning, deep learning, and a combination of machine learning and meta-heuristics, are also used to solve scheduling problems in recent years. Jun and Lee [53] addressed the dynamic single-machine scheduling problem to minimize total weighted tardiness by learning dispatching rules from schedules. They proposed a decision-tree-based machine learning method to extract dispatching rules from existing schedules. Wu et al. [54] used deep learning to solve unreliable machines' dynamic dispatching in re-entrant production systems. They combine a deep neural network (DNN) and Markov decision processes (MDP) to assign different priorities to job groups to minimize cycle time or maximize throughput. Li et al. [55] studied the flexible job-shop scheduling problem (FJSP) with sequence-dependent setup times and limited dual resources using machine learning and meta-heuristics. A hybrid metaheuristic is proposed to solve the FJSP, and the machine learning classification model is trained to identify rescheduling patterns. Chen et al. [56] used RL to control key parameters of GA during evolution in FJSP.
The above literature review shows that the dynamic PFSP with new job arrival has become an active topic in recent years and has been studied using several meta-heuristics. Recently, DRL and other AI techniques have emerged as a new approach to solving scheduling problems. However, the dynamic PFSP with new job arrival and the total tardiness cost criterion has not been solved by DRL. Since the modeling of a dynamic flowshop is different from that of other workshops, it is necessary to propose a DRL-based modeling method for the dynamic PFSP. In addition, the DRL-based approaches are not compared with traditional meta-heuristics in most of the literature. It is unclear whether DRL algorithms outperform traditional meta-heuristics in terms of solution quality and CPU time.
This paper studies the dynamic PFSP with new job arrival to minimize total tardiness cost using DRL. Our study aims to realize real-time optimization and intelligent decision-making of scheduling for a dynamic flowshop using DRL. The procedure for solving the dynamic PFSP using DRL is illustrated, and the mathematical model is formulated. The scheduling agent is then modeled by designing state features, actions, and reward for the DRL agent. The A2C algorithm is adapted to train the scheduling agent to choose appropriate actions under different production statuses. The results show the good performance of the A2C-based scheduling agent compared with other DRL algorithms and meta-heuristics.
In particular, the contributions of this paper are as follows.
(1) To the best of our knowledge, this is the first attempt to solve the dynamic PFSP with new job arrival to minimize total tardiness cost using DRL. Our work can fill the research gap in solving the dynamic PFSP by DRL.
(2) An intelligent decision-making scheduling system based on DRL is designed and modeled. The system architecture of intelligent scheduling is proposed. State features, actions, and reward for the scheduling agent and system are designed.
(3) Our work shows that the DRL-based scheduling method outperforms traditional meta-heuristics (IG and GA) in solution quality and CPU time by a large margin for the dynamic PFSP.

Problem Description
In this section, the system architecture for solving the dynamic PFSP using DRL is proposed. As shown in Figure 1, new jobs arrive at the system randomly. If a job cannot be processed immediately, it is stored in a buffer BF. When a job is finished on $M_1$, the flowshop uses the scheduling agent SCD to determine which job to process next. The current state features, including job information and production status, are input to SCD to generate a scheduling action. Since the job sequence is maintained for all machines in the PFSP, only $M_1$ needs to determine which job will be processed next. Thus, for step 7 in Figure 1, the system time is only pushed forward to time $t_{s'}$ (when $M_1$ finishes job $j$), at which point SCD determines the next job for $M_1$. At each decision point $t_{s'}$, the jobs finished on all machines are moved to the finished jobs area, denoted as FNS.

Mathematical Model
The mathematical model is established to minimize the total tardiness cost of all jobs arriving at the system. Some notations are listed as follows.
Notations:
$j$: index of jobs, $j = 1, 2, \ldots, n$
$i$: index of machines, $i = 1, 2, \ldots, m$
$t_{ij}$: processing time of job $j$ on machine $i$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$
$C_{ij}$: completion time of job $j$ on machine $i$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$
$C_j$: completion time of job $j$, $j = 1, 2, \ldots, n$
$d_j$: due date of job $j$, $j = 1, 2, \ldots, n$
$CP$: an indicator of completion time of all jobs
$TF$: tardiness factor
$RDD$: the relative range of due dates
$\alpha_j$: unit (per second) tardiness cost of job $j$
$AT_j$: arrival time of job $j$
Based on the notations above and some related formulation works [57][58][59], the objective function is formulated as Equation (1).
Subject to:
$$C_{i0} = 0, \quad \forall i \tag{8}$$
According to [58,60,61], the due date $d_j$ is set by constraint (2). $d_j$ follows a uniform distribution and is controlled by $CP$, $TF$, and $RDD$. $TF$ and $RDD$ are constants, both set to 0.5 [58,60]. $CP$ is calculated by Equation (3). Equation (4) defines the completion time of a job in the system. Equation (5) gives the completion time of a job on a machine. Equation (6) ensures that a job can be processed by the first machine only after the job arrives at the system. Equations (7) and (8) provide the initial values of the completion time $C_{ij}$.
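The roles of the constraints described above can be illustrated with a minimal Python sketch. It assumes the standard permutation flowshop recursion (a job starts on a machine once both the previous machine has finished it and the machine has finished the previous job), a job cannot start on the first machine before its arrival time, and the objective is the sum of $\alpha_j \max(C_j - d_j, 0)$. The function names and data layout are hypothetical and are not taken from the paper.

```python
def completion_times(sequence, proc, arrival):
    """Return {job: C_j} for jobs processed in the given order on all machines.

    proc[j][i] is the processing time of job j on machine i (assumed layout);
    arrival[j] is the arrival time AT_j.
    """
    if not sequence:
        return {}
    m = len(proc[sequence[0]])
    prev = [0.0] * m  # completion times of the previous job, C_{i0} = 0
    finished = {}
    for j in sequence:
        cur = [0.0] * m
        for i in range(m):
            # Ready time: arrival for the first machine, else the previous machine.
            ready = arrival[j] if i == 0 else cur[i - 1]
            cur[i] = max(ready, prev[i]) + proc[j][i]
        finished[j] = cur[-1]  # C_j is the completion time on the last machine
        prev = cur
    return finished


def total_tardiness_cost(sequence, proc, arrival, due, alpha):
    """Assumed objective: sum of alpha_j * max(C_j - d_j, 0) over all jobs."""
    c = completion_times(sequence, proc, arrival)
    return sum(alpha[j] * max(c[j] - due[j], 0.0) for j in sequence)
```

For example, with two jobs and two machines, `total_tardiness_cost([0, 1], [[2, 3], [1, 2]], [0, 0], [4, 10], [2, 1])` evaluates the sequence that processes job 0 first.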

Modelling of the Intelligent Scheduling System
This section models the intelligent scheduling system by designing state features, actions, and reward for the scheduling agent and workshop environment. Whenever $M_1$ finishes a job, the scheduling agent SCD generates a scheduling action based on the current state features, and a job is selected based on the scheduling action and processed on $M_1$. The workshop environment returns a reward for this scheduling action to update the SCD parameters. The reward, state features, and actions are designed as follows.

Reward
Most of the literature on using DRL to solve scheduling problems aims to minimize makespan, and the corresponding reward functions cannot be used to minimize the total tardiness cost. We therefore designed a new reward function for the total tardiness cost criterion in this section.
Recall that the aim of solving the PFSP is to find a job sequence with minimized total tardiness cost. The scheduling agent SCD should learn to choose an appropriate job at every decision point so that the total tardiness cost is minimized after all jobs are finished. Each action that SCD takes should increase the total tardiness cost as little as possible. Given that DRL is designed to maximize the cumulative reward obtained over all steps, the reward in this problem is defined as the inverse of the unit newly added total tardiness cost of unfinished jobs in the system during the time step. The newly added tardiness cost comes from work-in-progress (WIP) and jobs in the buffer (BF). Therefore, the reward of time step $[t_s, t_{s'}]$ is calculated by Equation (9).
where $tp_{WIP}$ and $tp_{BF}$ denote the newly added tardiness penalty cost from WIP and from jobs in BF, respectively, during the current time step $[t_s, t_{s'}]$. $tp_{BF}$ and $tp_{WIP}$ are calculated as follows.
where $n_{BF}$ denotes the number of jobs in BF and $n_{WIP}$ denotes the number of WIP jobs. $z_{js}$, defined by Equation (12), indicates whether job $j$ is overdue at $t_{s'}$, the end time of this step. Only overdue jobs generate tardiness cost. $t_{js}$, calculated by Equation (13), denotes the actual end time for job $j$ compared with $t_{s'}$. If job $j$ is completed, i.e., finished by all machines, before $t_{s'}$, the time interval for calculating its tardiness cost is $[t_s, C_j]$ rather than $[t_s, t_{s'}]$.
$$t_{js} = \begin{cases} C_j, & \text{if job } j \text{ has been completed at } t_{s'} \\ t_{s'}, & \text{otherwise} \end{cases} \tag{13}$$
Sensors 2021, 21, 1019
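The reward logic described in this subsection can be sketched in Python as follows. This is only an illustration under stated assumptions: "inverse" is read as negation, "unit" is read as per unit of elapsed time in the step, and the tardiness added by one job is its unit cost times the overdue part of the interval from the step start to $t_{js}$. The dictionary keys and function names are hypothetical.

```python
def added_tardiness_cost(alpha, due, t_start, t_end, completion=None):
    """Tardiness cost one job adds during the step [t_start, t_end] (assumed form)."""
    # t_js as in Equation (13): the completion time if the job finished within
    # the step, otherwise the end of the step.
    t_js = completion if completion is not None and completion <= t_end else t_end
    if t_js <= due:  # z_js = 0: the job is not overdue, so it adds no cost
        return 0.0
    # Only the part of the interval after the due date is charged.
    return alpha * max(0.0, t_js - max(t_start, due))


def step_reward(wip_jobs, buffer_jobs, t_start, t_end):
    """Assumed reward: negative newly added tardiness cost per unit time."""
    if t_end <= t_start:
        raise ValueError("the step must have positive duration")
    tp_wip = sum(
        added_tardiness_cost(job["alpha"], job["due"], t_start, t_end, job.get("completion"))
        for job in wip_jobs
    )
    tp_bf = sum(
        added_tardiness_cost(job["alpha"], job["due"], t_start, t_end)
        for job in buffer_jobs
    )
    return -(tp_wip + tp_bf) / (t_end - t_start)
```

With this reading, a step in which no job is overdue yields a reward of zero, and every overdue job pushes the reward further below zero.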

State Features
The scheduling agent SCD determines a scheduling action based on the current state features. State features should fully and efficiently reflect current job information and production status. Specifically, state features should provide sufficient information for selecting an action at each rescheduling point. The state features in most of the existing literature are designed to solve the JSP to minimize the makespan. These features are not applicable to our problem. Thus, we designed five state features for the PFSP under the total tardiness cost criterion, considering both the jobs in BF and the current production status. Unlike the existing literature, which provides only the average value, we calculated four statistical characteristics (maximum, minimum, average, and standard deviation) for several state features to better reflect each feature's characteristics.
(1) $f_{t1} = \psi_j$, $j \in BF$. The current unit tardiness cost of each job in BF, where $\psi_j$ is the unit tardiness cost generated by job $j$ at present and is calculated by Equation (14), and $ST_j$ is determined by Equation (15). $ST_j$ reflects how much time will be left before the due date $d_j$ when job $j$ is finished, if job $j$ begins to be processed at the present time $t_c$.
(3) The total processing time on all machines for each job in BF.
(4) $f_{t4} = u_j$, $j \in BF$. The estimated utilization rate of each job in BF. $u_j$ is the estimated utilization rate of job $j$ and is calculated by Equation (16). Each job in BF is assumed to be processed under the present production status, and $u_j$ is calculated based on the waiting times $WT_{ij}$ on all machines when job $j$ is processed, where $WT_{ij}$ is the waiting time of job $j$ on machine $i$. $WT_{ij}$ arises when job $j$ is finished on machine $i - 1$ but cannot be processed immediately on machine $i$ because machine $i$ has not finished its current job $j - 1$. Note that the first machine $M_1$ does not have a waiting time because $M_1$ is always idle at a decision point.
(5) $f_{t5} = n_{BF}$. The number of jobs in BF at present.
For the first four state features, which are arrays, four statistical characteristics (maximum, minimum, average, and standard deviation) are calculated. Thus, the total dimension of the state features is 4 × 4 + 1 = 17. In addition, all features are normalized to facilitate the learning process.
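As a rough illustration of how the 17-dimensional state could be assembled, the following Python sketch summarizes each of the four per-job arrays by its maximum, minimum, average, and standard deviation and appends the buffer size. The population standard deviation, the zero vector for an empty buffer, and the function names are assumptions of this sketch. Normalization is omitted because its exact form is not given in the text.

```python
from statistics import fmean, pstdev


def summarize(values):
    """Return [max, min, mean, std] of one per-job feature array."""
    if not values:
        # Assumption: an empty buffer is encoded as zeros.
        return [0.0, 0.0, 0.0, 0.0]
    return [float(max(values)), float(min(values)), fmean(values), pstdev(values)]


def build_state(per_job_features, n_buffer):
    """Concatenate the statistics of four feature arrays with the buffer size.

    per_job_features is a list of four lists, one value per job in BF.
    """
    if len(per_job_features) != 4:
        raise ValueError("expected four per-job feature arrays")
    state = []
    for feature in per_job_features:
        state.extend(summarize(feature))
    state.append(float(n_buffer))
    return state  # 4 * 4 + 1 = 17 entries
```

A fixed-length summary of this kind keeps the network input size independent of how many jobs are waiting in BF.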

Actions
Each action corresponds to a single dispatching rule (SDR) for selecting a candidate job from BF to be processed on $M_1$. The action space should provide sufficient and effective dispatching strategies under different production statuses. To lower the learning difficulty, the action space should not be too large. We designed five actions ($a_1$-$a_5$) for SCD, considering different production statuses and well-known dispatching rules.
(1) Select the job $j$ that has the maximum current unit tardiness cost $\psi_j$, where $\psi_j$ is calculated by Equation (14).
$a_1$ is intuitive because the overdue job with the maximum unit tardiness cost should be processed first to limit the increase of tardiness cost in the system. Otherwise, that job will generate tardiness cost every second, and this cost is the largest among the jobs in BF.
(2) Select the job $j$ that has the minimum safe time $ST_j$, where $ST_j$ is calculated by Equation (15).
$a_2$ may be appropriate when no jobs in BF are overdue or when the other actions do not have a clear advantage.
(3) Select the job that has the shortest processing time (SPT).
(4) Select the job that has the longest processing time (LPT). SPT and LPT are well-known dispatching rules and have been used as scheduling actions in [44,46,48].
(5) Select the job that obtains the maximum utilization rate $u_j$: $a_5 = \arg\max_j (u_j)$, $j \in BF$.
$a_5$ selects the most suitable job for the current machine status, considering the utilization of machines. $a_5$ may have higher priority when no jobs are overdue or when $\psi_j$ has a small standard deviation, etc.
The five scheduling actions above can provide efficient dispatching strategies under different production status. If more than one job in BF fulfills an action, a job is selected randomly among those candidate jobs.
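The five rules and the random tie-breaking can be sketched in Python as below. The sketch assumes each job in BF is a dictionary with precomputed values under the hypothetical keys `psi`, `safe_time`, `total_proc`, and `util`, and that SPT and LPT compare the total processing time over all machines, which the text does not state explicitly.

```python
import random

# Each action maps to (key function, min or max), following the five rules above.
RULES = {
    1: (lambda job: job["psi"], max),         # a1: maximum unit tardiness cost
    2: (lambda job: job["safe_time"], min),   # a2: minimum safe time
    3: (lambda job: job["total_proc"], min),  # a3: SPT (assumed: total processing time)
    4: (lambda job: job["total_proc"], max),  # a4: LPT (assumed: total processing time)
    5: (lambda job: job["util"], max),        # a5: maximum utilization rate
}


def select_job(action, buffer_jobs, rng=random):
    """Pick one job from BF according to the chosen dispatching rule."""
    if not buffer_jobs:
        raise ValueError("the buffer is empty")
    key, pick = RULES[action]
    best = pick(key(job) for job in buffer_jobs)
    # If several jobs fulfil the rule, one of them is chosen at random.
    candidates = [job for job in buffer_jobs if key(job) == best]
    return rng.choice(candidates)
```

Keeping the rules in a table makes it straightforward to swap in a more effective rule or drop a weak one without touching the selection logic.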

A2C
In this section, we adapt A2C to solve the dynamic PFSP with new job arrival. A2C is a DRL algorithm that is equipped with an actor network $\pi_\theta$ and a critic network $V_\phi$ [62]. At state $s_t$, A2C selects an action with the probability $\pi_\theta(s_t)$ generated by the actor network. The critic network is used to estimate the state value of $s_t$ in the learning process. Recently, Liu, Chang and Tseng [47] used A2C to solve the JSP to minimize makespan and achieved a good balance between makespan and execution time. However, A2C has not been used to solve the PFSP in the currently published literature. We adapted A2C to solve the dynamic PFSP, and the A2C-based training method is shown in Algorithm 1. In Algorithm 1, the new job arrival operation adds the jobs that newly arrived during the step to BF. The operation is executed at the beginning and whenever a job is finished on $M_1$. When a job is finished on $M_1$, the WIP is updated by removing jobs that have already been finished on all machines.
The critic and actor networks are updated every $T$ steps using the gradients shown in lines 15 and 16 of Algorithm 1 [62]. $dr_t$ denotes the discounted reward of step $t$ and is used as the target state value of step $t$. The estimated state value of step $t$ is $V_\phi(s_t)$. Thus, the gradient for the critic network is calculated by $\nabla_\phi (dr_t - V_\phi(s_t))^2$. For the actor network, the difference between the target and estimated state values is used to update the probability of selecting actions at state $s_t$. Also, an entropy term $H(\pi_\theta)$, calculated by Equation (18), is included to encourage exploration.
$$H(\pi_\theta(s_t)) = -\sum_{a \in A} \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t) \tag{18}$$
where $A$ is the action space of the agent. Algorithm 1 shows that an instance with $n$ jobs is used to train the DRL agent for $EP$ epochs, and the agent's parameters are updated every $T$ steps. Thus, the computational complexity of the training process for a single instance is $O(EP \times n \times T)$. However, after training, the computational complexity for solving an instance is reduced to $O(n)$.
Algorithm 1. The A2C-based training method.

1: Initialize actor and critic networks $\pi_\theta$, $V_\phi$
2: for epoch = 1 : EP do
3:     Perform new job arrival at time zero (current system time)
4:     Get current state $s_t$
5:     for step = 1 : n do    // n is the number of jobs for the selected instance
6:         Determine an action $a_t$ based on probability $\pi_\theta(s_t)$ at state $s_t$
7:         Select a job $j$ from BF using action $a_t$, process job $j$ on all machines, obtain the finish time of job $j$ on each machine
8:         Push forward the system time only to the time when job $j$ is finished on $M_1$
9:         Perform new job arrival and update WIP at current system time
10:        Get current state $s_{t+1}$ and reward $r_t$
11:        Store transition $\{s_t, a_t, r_t, s_{t+1}\}$ of this step
12:        $s_t \leftarrow s_{t+1}$
13:        if step % T == 0 then
14:            Calculate discounted reward $dr_t$ of the $T$ steps in reverse order using data in transitions: $dr_t = r_t + \gamma V_\phi(s_{t+1})$ for the $T$th step, and $dr_t = r_t + \gamma\, dr_{t+1}$ for the first $T - 1$ steps
15:            Update critic network $V_\phi$ using gradient $\nabla_\phi (dr_t - V_\phi(s_t))^2$
16:            Update actor network $\pi_\theta$ using gradient $\nabla_\theta \log \pi_\theta(a_t \mid s_t)[dr_t - V_\phi(s_t)] + \beta \nabla_\theta H(\pi_\theta(s_t; \theta))$
17:        end if
18:    end for
19: end for
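Lines 14 to 16 of Algorithm 1 can be illustrated with a small framework-free Python sketch. It computes the bootstrapped discounted rewards in reverse order, the policy entropy, and scalar losses whose gradients correspond to the update directions in lines 15 and 16. Averaging over the $T$ steps and the function names are assumptions of this sketch, and a real implementation would obtain the gradients through an automatic differentiation library while treating the advantage as a constant in the actor loss.

```python
import math


def discounted_returns(rewards, bootstrap_value, gamma):
    """dr_t for T stored steps; bootstrap_value is V(s_{t+1}) after the last step."""
    returns = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # dr_t = r_t + gamma * dr_{t+1}
        returns[t] = running
    return returns


def policy_entropy(probs):
    """H = -sum_a pi(a|s) log pi(a|s); zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def a2c_losses(returns, values, log_probs, entropies, beta):
    """Scalar critic and actor losses, averaged over the batch (assumed)."""
    n = len(returns)
    advantages = [g - v for g, v in zip(returns, values)]
    critic_loss = sum(a * a for a in advantages) / n
    # Minimizing this loss ascends log pi * advantage plus beta * entropy.
    actor_loss = -sum(lp * a for lp, a in zip(log_probs, advantages)) / n
    actor_loss -= beta * sum(entropies) / n
    return critic_loss, actor_loss
```

For instance, `discounted_returns([1, 2, 3], 10, 0.5)` bootstraps the last step from the critic value 10 and then propagates the result backwards through the two earlier steps.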

Numerical Experiments
In this section, a large range of instances is generated to train the A2C-based scheduling agent SCD. After training, the trained model of SCD is saved and used to provide intelligent decision-making of scheduling for dynamic PFSP. To evaluate the performance of A2C, we compared it with SDR, two other DRL algorithms, DQN and double DQN (DDQN), and two traditional meta-heuristics, IG and GA. We also tested the trained agent on a new set of extended instances with larger production configurations to verify the generalization of A2C.
A total of 84 instances are generated for the dynamic PFSP with new job arrival using different parameter settings of $n$ and $m$, following [43,44]. The specific parameters for the instances are listed in Table 1. As mentioned above, the due dates $d_j$ are determined by Equation (2). We noticed that when $|r| < 10$, the agent can be trained more efficiently. Based on this consideration, the $\alpha_j$ is set.
The configuration of $n$ and $m$ in Table 1 is repeated three times. Hence, the total number of instances is 7 × 4 × 3 = 84. For each instance, three jobs are set as initial jobs, which arrive at time zero. The other jobs arrive following a Poisson process, which means the interval between two successive arrivals follows an exponential distribution. The mean value of the exponential distribution is set as 30. All algorithms and workshop environments are coded in Python 3.8. The training and test experiments are performed on a PC with an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz and 12 GB RAM.
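The arrival process described above can be sketched with Python's standard library. Three jobs arrive at time zero and the remaining inter-arrival times are drawn from an exponential distribution with mean 30. The function name, parameter names, and the seed argument are hypothetical, and the time unit is whatever unit the instance data uses.

```python
import random


def generate_arrival_times(n_jobs, n_initial=3, mean_interval=30.0, seed=None):
    """Return a non-decreasing list of arrival times AT_j for one instance."""
    rng = random.Random(seed)
    # Initial jobs are available at time zero.
    arrivals = [0.0] * min(n_initial, n_jobs)
    t = 0.0
    for _ in range(n_jobs - len(arrivals)):
        # expovariate takes the rate, i.e., the reciprocal of the mean interval.
        t += rng.expovariate(1.0 / mean_interval)
        arrivals.append(t)
    return arrivals
```

Passing a fixed `seed` makes an instance reproducible, which is convenient when several algorithms are compared on the same arrival pattern.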

Training Process of A2C
The A2C-based scheduling agent SCD is trained using the aforementioned training instances. Parameter settings for A2C are shown in Table 2. Note that the coefficient $\beta$ is linearly decreased from 0.005 to 0.0005 during the first 70% of the epochs and remains at 0.0005 after the 2100th epoch. The average total tardiness cost on all test instances at each training epoch is shown in Figure 2. Figure 2 shows that the average total tardiness cost decreases dramatically as the number of training epochs increases, meaning that SCD effectively learned to choose a good scheduling strategy and generalized well to the unseen test instances. In particular, over the first 1400 epochs, the average total tardiness cost decreased from 163.55 to 91.88, which is a 43.82% improvement. The learning curve remains relatively stable after the 2000th epoch, even with some fluctuations. This indicates that SCD can provide reasonable scheduling solutions at a relatively stable level after training. Note that the agent does not seem to stay at the best solution found around the 1300th epoch. This might be due to the exploration of A2C. The fluctuations and the drift away from the best solution found can be reduced by increasing the number of training epochs.
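The schedule of the entropy coefficient $\beta$ can be written as a short Python function. The total of 3000 epochs is inferred from the statement that 70% of the epochs corresponds to the 2100th epoch; the zero-based epoch index and the function name are assumptions of this sketch.

```python
def entropy_coefficient(epoch, total_epochs=3000, start=0.005, end=0.0005,
                        decay_fraction=0.7):
    """Linearly decay beta from start to end, then hold it constant."""
    decay_epochs = round(total_epochs * decay_fraction)  # 2100 for the defaults
    if epoch >= decay_epochs:
        return end
    return start + (end - start) * epoch / decay_epochs
```

A larger $\beta$ early in training favors exploration, while the smaller final value lets the policy commit to the actions it has learned to prefer.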
The average episode reward obtained by SCD during the training epochs is shown in Figure 3, which can reflect the learning effect more directly. Figure 3 shows that the episode reward increases significantly with the increase of training epochs. Recall that a DRL agent learns to maximize the accumulated reward obtained during an episode. Thus, Figure 3 confirms the SCD indeed learned to perform better during the training process. The curve tendency of average episode reward is similar to that of average total tardiness cost, indicating the reward of SCD has a high correlation with the objective function of the studied scheduling problem, verifying the correctness of our reward design for SCD.

Comparison with SDR
Recall that the scheduling agent learns to choose the best scheduling action, i.e., SDR, at every rescheduling point. To evaluate whether the scheduling agent has learned to choose the best scheduling actions, we compared A2C with each SDR and with random action strategies, as shown in Figure 4. Figure 4 shows that $a_1$ obtains the best results. This may be because, for the tardiness penalty criterion, always selecting the job with the maximum current unit tardiness cost may be the best strategy, typically when the due dates are tight and many jobs have waited in the buffer. It indicates that simple dispatching rules exist that generate high-quality solutions for the dynamic PFSP with the total tardiness cost criterion. This kind of simple SDR is useful and efficient in practical production scheduling.

The results also indicate that A2C fails to find the best actions at all rescheduling points. Among all actions, $a_1$ is the best one. However, in theory, there are situations in which other actions are more appropriate than $a_1$, so always selecting $a_1$ may not be the best strategy. Since A2C is worse than $a_1$, the DRL agent did not learn to choose the best action at every rescheduling point. The trained DRL agent may not outperform its best pure action significantly when the performance of the actions differs greatly. For example, the results in Lin, Deng, Chih and Chiu [46] show that the trained DRL agent MDQN only outperforms its best pure action MOPRN by 14.33%.
The random action strategy is also compared with the SDRs. Figure 4 shows that the random action results are not the worst, but are approximately the average of those of the other actions. This may be because the random action is expected to generate a solution that equals the mean value of the other actions' solutions. To verify this conjecture, we tested rand_-$a_1$, the random action strategy that excludes $a_1$. The results of rand_-$a_1$ get worse when the most efficient action $a_1$ is removed. Also, rand_-$a_1$ obtains results close to the average results of all the other actions excluding $a_1$.
Among all actions, $a_1$ is the best one, followed by $a_5$, i.e., selecting the job with the maximum utilization rate $u_j$, and the worst one is $a_2$, i.e., selecting the job with the minimum safe time $ST_j$. Surprisingly, $a_2$ is worse than $a_3$ (SPT) and $a_4$ (LPT). This may be because SPT and LPT are more related to the utilization, which is more important than the due date information that influences the safe time. The best action $a_1$ outperforms the worst one $a_2$ by 73.55%, which is a large margin. To improve the performance of DRL, more efficient actions should be designed, and less efficient actions should be replaced. In addition, more efficient DRL algorithms should be proposed since A2C fails to choose the best action in every situation.

Comparison with DRL and Meta-Heuristics
To evaluate the performance of A2C, we compared it with two other widely used DRL algorithms, DQN and DDQN, and two traditional meta-heuristics, IG and GA.

Training Process of DQN and DDQN
As mentioned above, DQN has been widely used to solve scheduling problems [44,46,49]. We adapt DQN [63] and its variant DDQN [64] to solve the dynamic PFSP. The training procedure and instance settings are the same as those of A2C. After tuning, the hyperparameters of DQN and DDQN are set as follows. The learning rates for DQN and DDQN are $1 \times 10^{-5}$ and $1 \times 10^{-6}$, respectively. The memory size and batch size are 1000 and 32, respectively.
The training curves of DQN and DDQN are shown in Figures 5 and 6. Figure 5 shows that the average total tardiness cost decreases during the training epochs, indicating that the two DRL agents have learned to choose better actions at rescheduling points. Figure 6 shows that the average episode reward follows an increasing trend, verifying that the agents indeed learned to generate better solutions. However, the learning curves of DQN and DDQN fluctuate more than those of A2C, indicating that A2C is easier to train. Of the two DQN algorithms, DDQN has the smoother training curve.

IG and GA
To test the performance of A2C in solving dynamic PFSP, we also compared it with traditional meta-heuristics, which are extensively used in solving scheduling problems. From the review of [1], the IG-based algorithms show a very efficient performance compared with other meta-heuristics on the PFSP. In addition, GA is one of the most commonly used meta-heuristics in solving PFSP [1]. Thus, IG and GA are selected as traditional meta-heuristics to compare with our A2C-based scheduling approach.
For the dynamic PFSP, jobs arrive randomly. When new jobs arrive, a rescheduling procedure is required to schedule them together with the jobs already in the system. Following [16], we set the number of reschedulings to 5 for each instance. Until a rescheduling point is reached, newly arrived jobs are appended to the current job sequence to obtain a new job sequence. At each rescheduling point, the new job sequence is optimized by the meta-heuristics, i.e., IG and GA.
IG starts from an initial solution, applies destruction, reconstruction, local search, and an acceptance criterion to the solution sequence for several iterations, and returns the optimized solution [21]. At the beginning of an instance, the initial solution π_T is generated by placing all jobs that arrived at time zero into a job sequence in random order. Jobs in π_T are processed successively. When a rescheduling point is reached, the current job sequence π_T is optimized through destruction, reconstruction, local search, and the acceptance criterion for several iterations. In the destruction, |π_T| × p_desJ jobs are selected randomly from the current job sequence π_T and moved from π_T to the insertion job sequence π_R, where |π_T| denotes the number of jobs in π_T. After the destruction, the insertion job sequence π_R is obtained, and the job sequence remaining in π_T is denoted as π_D. In the reconstruction, jobs in π_R are selected randomly, tested in all possible positions of π_D, and finally placed in the position with the minimum total tardiness cost to obtain a new job sequence. In the local search, the jobs in the current job sequence are randomly selected without replacement. The selected jobs are then tested in all possible positions of the current job sequence and placed in the position with the minimum total tardiness cost. When more than one position obtains the minimum total tardiness cost, one of those best positions is selected randomly. For the acceptance criterion, worse solutions are generally accepted with a certain probability to expand the search scope and escape from local optima [58,65]. In this paper, worse solutions are accepted with probability τ. The parameters of IG are tuned using the design of experiments (DOE) and analysis of variance (ANOVA), as performed in [58]. After tuning, the parameter settings of IG are p_desJ = 0.2 and τ = 0.05.
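The IG steps described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the implementation used in the experiments: all jobs are treated as available at time zero, the weighted tardiness function `total_tardiness` is a generic flowshop evaluation standing in for the paper's cost model, at least one job is removed in each destruction, and all function names are hypothetical.

```python
import random


def total_tardiness(seq, proc, due, weight):
    """Weighted total tardiness of a job permutation in a flowshop.

    proc[j][k] is the processing time of job j on machine k.
    Simplification (assumption): all jobs are available at time zero.
    """
    if not seq:
        return 0.0
    finish = [0.0] * len(proc[seq[0]])  # last completion time on each machine
    cost = 0.0
    for j in seq:
        done = 0.0  # completion time of job j on the previous machine
        for k, p in enumerate(proc[j]):
            done = max(finish[k], done) + p
            finish[k] = done
        cost += weight[j] * max(0.0, done - due[j])
    return cost


def best_insertion(partial, job, cost_fn, rng):
    """Try `job` in every position; return a minimum-cost sequence (random tie-break)."""
    trials = [partial[:i] + [job] + partial[i:] for i in range(len(partial) + 1)]
    costs = [cost_fn(t) for t in trials]
    best = min(costs)
    return rng.choice([t for t, c in zip(trials, costs) if c == best])


def destruct_reconstruct(seq, cost_fn, p_des, rng):
    """Remove |seq| * p_des random jobs (at least one), then greedily reinsert them."""
    removed = rng.sample(seq, max(1, int(len(seq) * p_des)))
    partial = [j for j in seq if j not in removed]
    for job in removed:  # rng.sample already yields a random order
        partial = best_insertion(partial, job, cost_fn, rng)
    return partial


def local_search(seq, cost_fn, rng):
    """Visit every job once in random order and move it to its best position."""
    current = list(seq)
    for job in rng.sample(current, len(current)):
        rest = [j for j in current if j != job]
        current = best_insertion(rest, job, cost_fn, rng)
    return current


def iterated_greedy(seq, cost_fn, iterations=50, p_des=0.2, tau=0.05, seed=0):
    """Return the best sequence found and its cost."""
    if len(seq) < 2:
        return list(seq), cost_fn(list(seq))
    rng = random.Random(seed)
    current, current_cost = list(seq), cost_fn(list(seq))
    best, best_cost = current, current_cost
    for _ in range(iterations):
        cand = destruct_reconstruct(current, cost_fn, p_des, rng)
        cand = local_search(cand, cost_fn, rng)
        cand_cost = cost_fn(cand)
        if cand_cost < best_cost:
            best, best_cost = cand, cand_cost
        # Non-worsening candidates are always accepted, worse ones with probability tau.
        if cand_cost <= current_cost or rng.random() < tau:
            current, current_cost = cand, cand_cost
    return best, best_cost


# Toy instance with made-up data: 4 jobs, 2 machines.
proc = [[2, 3], [1, 2], [4, 1], [3, 3]]
due = [6, 4, 12, 9]
weight = [1.0, 2.0, 1.0, 1.5]
best, best_cost = iterated_greedy(
    [0, 1, 2, 3], lambda s: total_tardiness(s, proc, due, weight)
)
print(best, best_cost)
```

Each call to `best_insertion` evaluates every candidate position with a full cost computation, which is why the CPU time of IG grows quickly with the number of jobs and iterations.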
GA starts with an initial population, performs crossover, mutation, and selection on the population or on individuals for several iterations, and returns the optimized solution. Similar to IG, the initial solution is generated randomly using the jobs that arrived at time zero. At each rescheduling point, the current job sequence π_T is permuted randomly to generate a population whose size is min(N, |π_T| × (|π_T| − 1)), where N is the given population size. The population is evolved through crossover, mutation, and selection operations. In the crossover, the individuals are paired randomly. The two individuals in a pair perform crossover with probability p_c. For the pairs that perform crossover, a crossover point is selected randomly. The sequences before the crossover point are swapped between the two individuals in the same pair, and two new child job sequences are obtained. The repeated jobs in each child job sequence are replaced with jobs that appear in the parent job sequence but not in the child's sequence. In the mutation, every individual l in the population mutates with probability p_m. If an individual mutates, two jobs in its job sequence are selected randomly and swapped with each other. In the selection, a new population of the same size is selected from the current population with replacement. Every individual is selected with probability p_l, which is calculated by Equation (19),
where f_l is the fitness (total tardiness cost) of individual l. The parameters of GA are tuned using the aforementioned DOE and ANOVA. The parameter settings of GA are N = 50, p_c = 0.80, and p_m = 0.10.
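The GA operators described above can be sketched as follows. This is an illustrative sketch and not the implementation used in the experiments. Two details are assumptions: the text does not specify which occurrence of a repeated job is replaced or in what order the missing jobs are filled in, so here later occurrences are replaced in parent order, and the selection probabilities of Equation (19) are not reproduced but passed in through the hypothetical callback `select_prob`. Sequences are assumed to contain at least two jobs.

```python
import random


def repair(child, parent):
    """Replace repeated jobs in `child` with the jobs of `parent` that it lacks."""
    present = set(child)
    missing = iter([j for j in parent if j not in present])
    seen, fixed = set(), []
    for job in child:
        if job in seen:
            fixed.append(next(missing))  # later occurrence -> next missing job
        else:
            seen.add(job)
            fixed.append(job)
    return fixed


def crossover(p1, p2, point):
    """Swap the heads before `point` between two parents, then repair duplicates."""
    child1 = repair(p2[:point] + p1[point:], p1)
    child2 = repair(p1[:point] + p2[point:], p2)
    return child1, child2


def mutate(seq, rng):
    """Swap two randomly chosen jobs of a sequence."""
    child = list(seq)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


def evolve(population, select_prob, p_c=0.80, p_m=0.10, rng=random):
    """One generation: random pairing, crossover, mutation, selection with replacement."""
    pop = [list(ind) for ind in population]
    rng.shuffle(pop)  # random pairing: neighbours in the shuffled list form a pair
    for a in range(0, len(pop) - 1, 2):
        if rng.random() < p_c:
            point = rng.randrange(1, len(pop[a]))
            pop[a], pop[a + 1] = crossover(pop[a], pop[a + 1], point)
    pop = [mutate(ind, rng) if rng.random() < p_m else ind for ind in pop]
    weights = select_prob(pop)  # stands in for Equation (19)
    return rng.choices(pop, weights=weights, k=len(pop))


# Usage with uniform selection weights (a placeholder, not Equation (19)).
rng = random.Random(0)
jobs = [0, 1, 2, 3, 4]
population = [rng.sample(jobs, len(jobs)) for _ in range(6)]
next_gen = evolve(population, lambda pop: [1.0] * len(pop), rng=rng)
print(next_gen)
```

Because the objective is minimized, whatever weights are supplied should give individuals with lower total tardiness cost a higher chance of being selected.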

Comparison with DQN, DDQN, IG, and GA
In this section, A2C is compared with two other DRL algorithms, DQN and DDQN, and two meta-heuristics, IG and GA. After training, the trained models of all DRL agents are saved and used to generate scheduling actions during production. Figure 7 presents the average total tardiness cost of all compared algorithms on all test instances. As can be seen in Figure 7, A2C obtains the best results among all compared algorithms. In particular, A2C outperforms the two traditional meta-heuristics, IG and GA, by a large margin. We can also see that the three DRL algorithms outperform the two meta-heuristics. This may result from the efficient scheduling action a_1. One surprising result is that DDQN is worse than DQN. This may be because the two algorithms have not fully converged within the given epochs, and their results may change when more training epochs are given. The two meta-heuristics, IG and GA, are tested at two search iteration levels, 50 iterations and 300 iterations. When the number of iterations increases from 50 to 300, IG shows a statistically significant improvement, while GA does not. Overall, the two meta-heuristics at the two search iteration levels do not differ significantly in solution quality.
The specific average total tardiness cost and CPU times of the compared algorithms are provided in Tables 3 and 4. Table 3 shows that A2C and DQN generate the best results across the different production configurations. A2C outperforms IG and GA in both solution quality and CPU time under all production configurations considered. Specifically, A2C, on average, outperforms IG and GA by 31.36% and 31.31% in solution quality while using only 0.04% and 1.35% of the CPU time spent by IG and GA, respectively, when IG and GA are given 50 search iterations. When the iterations of IG and GA increase to 300, A2C outperforms them by 30.34% and 30.98%, using merely 0.01% and 0.23% of their CPU time. For the two meta-heuristics, when the iterations increase from 50 to 300, the CPU times increase by 279.71% and 493.01%, but the solution quality of IG and GA improves only slightly. IG obtains slightly better solution quality than GA at the cost of more CPU time.
Table 3. Average total tardiness cost grouped by the number of jobs n and machines m for compared DRL and meta-heuristics. IG and GA are tested under two search iteration levels, i.e., 50 iterations and 300 iterations. The best results are highlighted in bold.
Tables 3 and 4 further show that, as the production configuration n and m increases, the disparity in total tardiness cost between A2C and the meta-heuristics increases. For example, A2C outperforms IG and GA with 50 iterations by 8.21% and 7.24% when m is 5. However, when m is 20, A2C outperforms them by 28.05% and 26.67%. This indicates that the A2C-based method has more advantages in solving large instances.

Note that the average CPU time of A2C for solving a test instance is 0.21 s. Considering the number of jobs in the test instances, the average decision time for taking a scheduling action is 2.16 ms. This is almost instantaneous and can be used in real-time scheduling.
The DRL method models the scheduling problem as an MDP in which a DRL agent successively determines the job to be processed next [44]. The optimization of scheduling is realized by the successive decisions of the DRL agent rather than by searching the job sequence directly, as some meta-heuristics do. The generalization ability of the NN enables the DRL agent to behave well on unseen instances. In addition, the DRL agent can generate a scheduling decision instantly once trained. Thus, DRL can serve as an alternative approach for dynamic scheduling problems.
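The successive decision process described above can be illustrated with a short sketch. It is a simplified illustration under stated assumptions, not the system used in the experiments: only the SPT and LPT rules are included, both are assumed to rank jobs by total processing time over all machines, the state is a placeholder rather than the 17-dimensional feature vector, and `policy` is a hypothetical stand-in for the trained agent.

```python
def spt(waiting, proc):
    """Shortest processing time: job with the smallest total processing time."""
    return min(waiting, key=lambda j: sum(proc[j]))


def lpt(waiting, proc):
    """Longest processing time: job with the largest total processing time."""
    return max(waiting, key=lambda j: sum(proc[j]))


RULES = [spt, lpt]  # action index -> dispatching rule (illustrative subset)


def run_episode(jobs, proc, policy):
    """Build a job sequence through successive agent decisions."""
    waiting = list(jobs)
    sequence = []
    while waiting:
        state = (len(waiting),)           # placeholder state features
        action = policy(state)            # agent picks a rule, not a job
        job = RULES[action](waiting, proc)
        waiting.remove(job)
        sequence.append(job)
    return sequence


# Made-up data: processing times of three jobs on two machines.
proc = {0: [3, 3], 1: [1, 1], 2: [2, 2]}
print(run_episode([0, 1, 2], proc, policy=lambda state: 0))  # always SPT
```

Each decision costs one policy evaluation plus one pass over the waiting jobs, which is why a trained agent can respond much faster than an iterative search over whole job sequences.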

Generalization to Larger Instances
To further verify the generalization ability of DRL, we tested the trained A2C agent on a new set of extended instances with larger production configurations. This experiment aims to find out whether the performance deteriorates greatly when the production settings differ from those of the original training and test instances.
In the extended instances, the number of jobs n is increased from the original setting n = {20, 50, …, 200} to n = {300, 500, 800}. The number of machines m is set to 10. Other production settings are the same as in the original instances. Each configuration of n and m is repeated three times. Thus, a total of 9 extended large instances are generated.
The SDR is used as a baseline to evaluate whether the performance of A2C deteriorates when the production configuration becomes larger. The results of A2C and SDR on the extended instances are shown in Figure 8. Figure 8 shows that A2C is worse than a_1 and a_5. Action a_5 becomes the best action, which may be because utilization becomes more critical when too many jobs exist in the system.
To show how the performance of A2C changes for larger production configurations, we provide the percentages by which A2C outperforms SDR on the original and extended instances in Figure 9. As shown in Figure 9, the percentages on the extended instances are larger than those on the original instances for most SDR strategies, except for a_5. This indicates that, on the whole, DRL has a better generalization ability on larger production configurations compared with SDR.
Figure 9. The percentages by which A2C outperforms SDR considering average total tardiness cost on the original and extended instances.

Conclusions
This paper solved the dynamic PFSP with new job arrival to minimize total tardiness cost using DRL. This study aims to establish an intelligent decision-making scheduling system that provides real-time optimization for dynamic scheduling problems. The whole procedure of solving the dynamic PFSP using DRL is illustrated, and the mathematical model is established. The DRL-based scheduling system is proposed with state features, actions, and reward designed for the scheduling agent and workshop environment. Five scheduling actions are designed to perform well at different production states. The state features with 17 dimensions are designed to provide directly related information for choosing scheduling actions. The A2C algorithm is adapted to train the scheduling agent SCD to learn an appropriate scheduling action at different states. A total of 84 instances are generated to train the A2C-based scheduling agent, and extensive comparisons are carried out. The trained scheduling agent is compared with SDR, two other DRL algorithms (DQN and DDQN), and two traditional meta-heuristics (IG and GA). The training curve shows that the average total tardiness cost decreases significantly as the training epochs increase, indicating that the scheduling agent has learned to choose appropriate scheduling actions. The comparison experiments show that our A2C-based scheduling agent performs best among all compared DRL algorithms and meta-heuristics. In particular, the A2C-based scheduling agent outperforms IG and GA by a large margin in both solution quality and CPU time. Our A2C-based scheduling agent can generate a scheduling action in 2.16 ms on average, which is fast enough for real-time scheduling and online decision-making in a real production workshop. In addition, the A2C-based scheduling agent generalizes well to unseen instances with larger production configurations. The scheduling agent can still be trained during production. Our study contributes to establishing a self-organizing and self-learning scheduling system.
Future research can consider more realistic characteristics, such as machine breakdown and transportation time, and solve real-time scheduling of more complex workshops, such as reconfigurable workshops, hybrid flowshops, and distributed workshops. In addition, further studies could design more effective state features, action spaces, and rewards for the scheduling system. Moreover, more DRL algorithms could be studied to improve the scheduling agent's learning effect and solution performance.
Supplementary Materials: All instances, training and test results, and the video of solving dynamic PFSP using the trained agent and SDR are available at https://osf.io/tynrd/?view_only=bbea6a78dff042e3a5952c3283031676.