A Multi-Agent Reinforcement Learning Approach to the Dynamic Job Shop Scheduling Problem

Abstract: In a production environment, scheduling decides job and machine allocations and the operation sequence. In a job shop production system, the wide variety of jobs, complex routes, and real-life events make scheduling challenging. New, unexpected events disrupt the production schedule and require event-based dynamic scheduling updates. To solve the dynamic scheduling problem, we propose a multi-agent system with reinforcement learning aimed at minimizing tardiness and flow time to improve dynamic scheduling techniques. The performance of the proposed multi-agent system is compared with the first-in–first-out, shortest processing time, and earliest due date dispatching rules in terms of the proportion of tardy jobs, mean tardiness, maximum tardiness, mean earliness, maximum earliness, mean flow time, maximum flow time, work in process, and makespan. Five scenarios are generated with different arrival intervals of the jobs to the job shop production system. The results of the experiments, performed for the 3 × 3, 5 × 5, and 10 × 10 problem sizes, show that our multi-agent system outperforms the dispatching rules as the workload of the job shop increases. Under a heavy workload, the proposed multi-agent system gives the best results for five performance criteria: the proportion of tardy jobs, mean tardiness, maximum tardiness, mean flow time, and maximum flow time.


Introduction
Scheduling is one of the critical activities in production management to enhance a production system's performance. Scheduling determines the jobs produced on a machine and their production sequence [1][2][3][4][5][6]. When the arrival times of the jobs are known in advance, all of the jobs in a process can be organized once by static scheduling. However, the arrival time of each job can rarely be foreseen in practice, so the production schedule must be updated dynamically while the system is running.
In practice, many dynamic events, such as arrival times, processing times, machine breakdowns, order cancellations, and due date changes, can occur. The actual times of these events cannot be known precisely in advance. Random events continuously corrupt the current schedule, so a revised schedule is needed every time a new event occurs. In this study, we propose a dynamic scheduling method based on an event-based simulation to model the rescheduling issue.
In dynamic scheduling problems, production systems are classified as job shop, flow shop, mixed shop, open shop, and group shop [7][8][9][10][11][12][13][14]. In a job shop production system, the variety of products is high, and the batch volume is low because of varying customer orders. In a dynamic job shop, new orders constantly arrive at the system to be produced, and completed orders leave the system. The continuous arrival of jobs that require different routes makes the scheduling environment dynamic.

Literature Summary
In this section, we review the relevant studies published during 2010–2022. For the literature review, we use "job shop scheduling", "dynamic job shop scheduling", "agent", "multi-agent system", and "reinforcement learning" as the keywords. We include the research papers indexed in the Science Citation Index (SCI) and Science Citation Index Expanded (SCIE). We examine the DJSP characteristics and solution approaches in the field. Interested readers are referred to recent review papers [26][27][28][29][30][31][32][33][34][35] in the field. The relevant studies in the literature are summarized under three main categories: the static problem, the dynamic problem, and the DJSP.

Static Problem
In a static scheduling problem, it is assumed that all jobs are ready at the beginning of the scheduling period. The production is scheduled once, so no unexpected events can interfere with the schedule. Studies on the static problem using MAS are summarized in Table 1 in terms of problem/environment and solution method.

Table 1. Literature summarized by static problem.

Paper   Problem/Environment               Solution Method
[36]    JSP                               MAS
[37]    Single machine                    MAS
[38]    IMS                               MAS
[39]    Two identical parallel machines   MAS
[40]    JSP                               MAS with ACO
[41]    Flow shop                         MAS with RL
[42]    Personalized manufacturing        MAS with RL

JSP: job shop scheduling problem; IMS: intelligent manufacturing systems; MAS: multi-agent system; ACO: ant colony optimization; RL: reinforcement learning.

Komma et al. (2011) [36] present the pioneering research in the field. They prepared a guide on designing agent architectures for different production systems using the Java Agent Development Framework and built a discrete event simulation by modeling the components of a production system. Owliya et al. (2012) [37] designed a MAS for general use and tested the MAS structure on a single machine scheduling problem, using cost and resource utilization rates as the performance criteria. Leitao et al. (2015) [38] designed a MAS and represented the agents' communication with block diagrams that show the general behavior of the agents. Yu et al. (2018) [39] developed MAS-based scheduling on two identical parallel machines. They defined the operations and machines as agents and took the makespan and total tardiness as the criteria. Wong et al. (2012) [40] designed a MAS using ACO for the process planning and integrated scheduling problem. They took the makespan, average flow time, and resource utilization rates as the criteria. Wang et al. (2019) [41] developed a MAS in which the agents communicate with each other using a game theory method. They tested this MAS architecture on a simulation model of a smart workshop, taking the makespan, machine workloads, and energy consumption as the criteria, and showed that the MAS architecture yielded better results than the FIFO-based and SPT-based approaches. Kim et al. (2020) [42] designed a MAS for personalized manufacturing. They used the makespan and maximum tardiness as the performance criteria, compared the designed MAS with the dispatching rules frequently used in the literature, and used an RL algorithm to develop the decision mechanism.
While static scheduling problems are ideal for testing new solution methods, real-world scheduling problems are dynamic. For this reason, a technique that offers feasible solutions to the static problem may not provide a feasible solution to dynamic real-world problems.

Dynamic Problem
In a dynamic scheduling problem, the literature focuses on the flow shop and job shop production systems. If there are dynamic events in a flow shop production system, the problem is called the "D-Flow shop"; if there are dynamic events in job-shop-type production, the problem is called the "DJSP"; and if there are dynamic events in flexible job-shop-type production, the problem is called the dynamic flexible job shop scheduling problem (DFJSP). In this study, a solution method is proposed for the DJSP, but there are different MAS and AI approaches designed for other problem types, such as CPPS, DFJSP, D-Flow Shop, and SHFS, in the literature. The studies on the dynamic problem are summarized in Table 2 in terms of the problem/environment, the solution method, and dynamic factors. The authors of [43] designed a structure with two rival agents in a system with two parallel machines. The agents' goal was to minimize the makespan, and the structure was compared with the GA. Ahmadi et al. (2016) [44] used NSGA-II and NRGA for the DFJSP, considering machine breakdowns. Shiue et al. (2018) [45] designed a structure that changes the dispatching rules by using RL for the DFJSP. They chose the average flow time and the number of tardy jobs as the criteria. Sahin et al. (2017) [46] designed a MAS for the DFJSP in which each agent tried to achieve its own goal. They performed both dynamic and static scheduling and achieved satisfactory results. Maoudj et al. (2019) [47] designed a MAS architecture for a robotic flexible assembly cell, which is considered a DFJSP. The MAS architecture created the schedule by switching between the dispatching rules. They used the makespan as a criterion and showed that the agent architecture they designed could yield better results than the metaheuristics they compared it with. Huang and Liao (2012) [48] designed a MAS architecture for the dynamic parallel machine scheduling problem.
In the MAS structure, which consists of work, machine, and management agents, the communication between agents is examined in detail. As the criteria, they considered total tardiness, flow time, resource utilization rates, and revenue. Y. Liu et al. (2018) [49] studied cloud manufacturing. They created a MAS-based scheduling mechanism, tested it in a sample study using simulation, and explained the communication between the agents in detail. Jiang et al. (2017) [50] worked on dynamic scheduling in CPPS. They established a double-layered decision-making mechanism that performed the rescheduling activity with a GA. The agents both collected information for and took actions from the decision-making mechanism. S. Zhang and Wong (2017) [51] simulated different dynamic factors in different scenarios in the DFJSP. They hybridized their MAS-based approach with ACO. The makespan was considered as a criterion. Barenji et al. (2017) [52] worked on a MAS-based DSS for solving the D-Flow Shop problem. They tested the MAS-based DSS by modeling a small- and medium-sized real-life system in a simulation environment. They proposed a system that can perform both static and dynamic scheduling and used the makespan as a criterion. Shi et al. (2021) [53] designed a MAS that updates the priorities of the jobs with different types of GA. They tested the MAS structure in sustainable hybrid-flow-type production, taking the makespan, energy consumption, and carbon emissions as the criteria. The proposed MAS structure increased the computation time as the problem size increased but gave better results than the compared algorithms. Luo (2020) [54] designed a MAS with the RL approach for the DFJSP. The makespan was used as a criterion, and the designed MAS was compared with the dispatching rules frequently used in the literature.
Since dynamic scheduling problems reflect real production systems, they are divided into many categories. It would not be correct to say that a method that gives feasible solutions for one category necessarily gives feasible solutions in other categories.
In this study, we propose a MAS-RL for the DJSP. The studies conducted on the DJSP are summarized in Table 3 according to the problem/environment, the solution method, and dynamic factors. Baykasoglu and Karaslan (2017) [55] developed a GRASP-based approach to the DJSP. They used goal programming logic to reach better performance criteria and showed that their proposed GRASP-based approach yields viable solutions. Sel and Hamzadayı (2018) [56] proposed an SA-based simulation optimization approach for the DJSP. They considered the average flow time and average tardiness as the criteria. The proposed simulation optimization approach yielded better results than EDD and FIFO. C. L. Zhang et al. (2019) [57] designed a two-agent structure for the DJSP to minimize the makespan. They showed that the proposed two-agent structure gives better results as the problem scale becomes larger. Turker et al. (2019) [58] proposed a DSS for the DJSP, designed to increase the performance of dispatching rules for dynamic scheduling by using real-time data. They used the average machine utilization, average waiting time, work in process, number of tardy jobs, average tardiness, and average earliness as the performance criteria. They conducted the experiments in different scenarios with different job arrival rates. The proposed DSS decided about a job by considering not only the workload of the machine it is currently on but also the workloads of the machines it visits in the following steps. Aydin and Öztemel (2000) [59] designed a single agent with RL for the DJSP. The agent was trained with RL and was able to carry out scheduling activities. The designed system consisted of two parts: an agent and a simulator. The agent determined the most appropriate dispatching rule by reading the data from the shop floor, and the simulator applied the dispatching rule determined by the agent. Kardos et al. (2020) [60] designed a MAS structure using RL for the DJSP. OA was taken as the dynamic factor in the problem. They compared this structure with SPT and other dispatching rules in the literature. The average lead time was considered as the objective function. They determined that the complexity of the production environment is important because, as it increases, using RL for dynamic scheduling becomes more effective. Erol et al. (2012) [61] developed a MAS-based scheduling approach for AGVs and machines in a dynamic production system. Jana et al. (2013) [62] designed a MAS using fuzzy multi-criteria decision making and ratio-based multi-objective optimization techniques. They tested the designed MAS in different scenarios. The authors of [63] studied dynamic scheduling in smart workshop environments within the scope of Industry 4.0. They aimed to collect real-time data with the help of the IoT and RFID and to make decisions using a MAS structure based on these data. The authors of [64] studied data received from a real production system. They aimed to reduce the workload of the job shop with their designed MAS, which yielded better results than the job shop's current scheduling strategy.
As can be seen from the literature summary, there are studies using MAS for the DJSP, but only one study was found that trained a MAS with RL on the DJSP. We examined both the methods and the problem characteristics of the studies in the literature and summarized them in the following list, describing the similarities and differences between our study and the literature. As a result of the literature review, we identified the following improvements to the literature.

1.
In our study, each job type can have different priority values on each machine, which constitutes a unique scheduling method. We found no other study in the literature using exactly this scheduling method. With this method, we aim to give flexibility to the production schedule. This unique scheduling method is explained in detail in the following sections. Researchers can adapt this scheduling method to their own studies and perhaps improve it further.

2.
No other study using a MAS with RL for the DJSP was found in the literature. However, there is one study using a single agent with RL, which is Kardos et al. (2020) [60].
Since there are insufficient studies on this specific combination of problem and solution method, our study can be considered a novel study in this area. While Kardos et al. (2020) [60] considered only OA as a dynamic factor, we extend the dynamic factors to OA, PT, and DD. Since using more dynamic events together makes the problem more difficult to solve, we improve the literature in this aspect.

3.
While Kardos et al. (2020) [60] took into account only the average lead time as a performance criterion, we extend the number of performance criteria to nine, which are the proportion of tardy jobs, mean tardiness, maximum tardiness, mean earliness, maximum earliness, mean flow time, maximum flow time, work in process, and makespan. With the expansion of the performance criteria, the results in this specific area can be examined in a wider range, making it possible to reach conclusions from various aspects.
In addition to these improvements to the literature, we aimed to make it easier for researchers who are not experts in MAS to understand it and develop their own studies on this subject. For this reason, the MAS-RL in this study was designed to be as understandable as possible, and each agent's working principle is explained in detail. In this way, we hope to encourage future MAS studies.

Problem Statement
The main framework of the DJSP is to use a limited number of machines (or service providers) to process a specified number of jobs (or tasks) while trying to optimize specified objectives such as the makespan or tardiness. Each of these jobs has a specified operation sequence, or route, through the machines, with a specified processing time on the corresponding machine. When the job completes the last operation in its sequence, it is considered finished.
The DJSP also has other constraints that need to be taken care of. In some studies in the literature, the problem is solved with mathematical programming, while other studies use simulation programs specially designed for scheduling problems. The advantages of using a simulation program are that the production schedule can be stopped and examined at any time, the workflows can be followed visually, and no mathematical model is needed. In our study, the Arena® package program was used as the simulation tool. Within the modules of the program, the following constraints are modeled:
1. Different operations are performed on different machines;
2. A machine operates on only one job at a time;
3. A job is operated on by only one machine at a time;
4. Operations that have started cannot be interrupted or paused;
5. Jobs must follow their routes in the specified order;
6. The queue capacity of every machine is unlimited.
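Under these constraints, a small instance such as the 3 × 3 job shop of Figure 1 can be represented with plain data structures. The sketch below uses the routes described for Figure 1 (jt1: M1-M2-M3, jt2: M2-M3-M1, jt3: M3-M1-M2); the processing-time values and function names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a 3x3 job shop instance. Routes follow the Figure 1
# example; the mean processing times are illustrative placeholders.

routes = {
    "jt1": ["M1", "M2", "M3"],
    "jt2": ["M2", "M3", "M1"],
    "jt3": ["M3", "M1", "M2"],
}

# Mean processing time of each job type on each machine (illustrative).
processing_times = {
    ("jt1", "M1"): 5.0, ("jt1", "M2"): 3.0, ("jt1", "M3"): 4.0,
    ("jt2", "M1"): 2.0, ("jt2", "M2"): 6.0, ("jt2", "M3"): 3.0,
    ("jt3", "M1"): 4.0, ("jt3", "M2"): 2.0, ("jt3", "M3"): 5.0,
}

def next_machine(job_type, operations_done):
    """Return the next machine on the job's route, or None if the route
    is complete (constraint 5: routes are followed in order)."""
    route = routes[job_type]
    return route[operations_done] if operations_done < len(route) else None
```

A job of type jt2 that has completed no operations is next routed to M2; after its third operation, its route is complete and it leaves the system.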
Job shop scheduling problems can be of different sizes. The problem size is expressed as the number of job types and the number of machines (j × m). The machine and job-type thresholds that determine the complexity of a job shop instance are not generally agreed upon in the literature [65]. In addition, there are studies in the literature mentioning that the problem size does not make a difference in the performance of the dispatching rules [66,67].
In this study, we performed experiments for the 3 × 3, 5 × 5, and 10 × 10 problem sizes to show that our proposed approach works well for different problem sizes. The routes that jobs follow in a job-shop environment are complex and difficult to follow. To illustrate this, a visual representation of the 3 × 3 problem size dynamic job-shop environment is given in Figure 1.
"M" means machine, and "jt" means job type. Job type 1 is marked in red, job type 2 is marked in green, and job type 3 is marked in blue. Each job type visits each machine according to its route. For example, jt1's route (red) is M1-M2-M3, jt2's route (green) is M2-M3-M1, and jt3's route (blue) is M3-M1-M2. A machine can have more than one job of the same job type in the queue.
The DJSP is an NP-hard problem due to its complexity. The increase in the diversity of machines and job types and in the complexity of the jobs' routes makes it almost impossible to reach the optimum solution of the problem in polynomial time. Due to the stochastic and dynamic nature of job arrivals to the system, producing reliable solutions can be computationally burdensome even for the 3 × 3 problem size.
In Table 4, the model notations are described. In the job-shop-type production system, jobs (j) arrive randomly at the shop floor according to an exponential distribution. The arrival time of each job is recorded as A_j. The processing times of the job on each machine (m) are determined according to the normal distribution and are recorded as P_j,m. The due date of each job is assigned as D_j. After the assignments, the jobs are directed to the first machines on their routes. If a machine is idle and its queue is empty, the job's processing starts immediately; otherwise, the job is directed to the machine's queue and waits for the machine to become idle. When the machine becomes idle and the job is chosen to be next, it enters the machine and is processed for the time P_j,m. The job is then routed to the next machine according to its route, and this sequence repeats until the job's route is complete. Here, we assume that the transportation times between the machines can be neglected. The completion time of each job is recorded as C_j and is used to calculate the flow time (F_j) and the deviation from the due date (Dev_j). These formulations are presented below. The flow time is calculated by Equation (1):

F_j = C_j − A_j (1)

Equation (2) calculates the deviation from the due date:

Dev_j = C_j − D_j (2)

A positive deviation indicates that the job is tardy (or late), and the outcome of the equation is considered tardiness (T_j). If the deviation is negative, it corresponds to earliness (E_j). It is undesirable for a job leaving the system to be early or tardy. A job that is early indirectly causes other jobs in the system to be tardy. Ideally, every job is completed exactly on its due date [68].
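The relationships above can be stated directly in code. This is a minimal sketch using the standard definitions implied by the text: flow time is completion time minus arrival time (Equation (1)), and the deviation from the due date is completion time minus due date (Equation (2)), with its positive part taken as tardiness and its negative part as earliness.

```python
def flow_time(C_j, A_j):
    """Equation (1): F_j = C_j - A_j."""
    return C_j - A_j

def deviation(C_j, D_j):
    """Equation (2): Dev_j = C_j - D_j; positive means tardy, negative early."""
    return C_j - D_j

def tardiness_earliness(C_j, D_j):
    """Split the deviation into tardiness T_j and earliness E_j."""
    dev = deviation(C_j, D_j)
    return max(dev, 0.0), max(-dev, 0.0)
```

For example, a job arriving at time 3 with due date 8 and completion time 10 has a flow time of 7, a tardiness of 2, and an earliness of 0.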

Dispatching Rules Used
Dispatching rules, also known as priority rules, arrange the jobs in a machine's queue according to a specific rule and determine the job that starts processing on the machine as soon as the machine becomes idle. In this study, the FIFO, SPT, and EDD rules, which are the most frequently used rules in the literature, were compared to the MAS-RL.

FIFO Rule
The way FIFO works is based on giving priority to the job that arrives at the machine first. The FIFO rule is one of the most frequently used rules in the literature. In addition, it is easy to apply in theory as well as in practice. FIFO fits systems where the entities are perishable goods, such as food with a short expiration date, or people. In fact, we unknowingly use FIFO in most of the public zones where we encounter queues, such as supermarkets, restaurants, banks, and counters, in our daily life. In order to apply FIFO mathematically, it is necessary to know the arrival times of the jobs in the queue of the machine (A_j,m). The job with the minimum arrival time should be selected and processed on the machine as a priority. FIFO is formulated in Equation (3).

SPT Rule
SPT is based on giving priority to the job with the shortest processing time. An example of a real-life use of SPT is when there is a long queue in front of a cash register in a supermarket, but customers with only a few items take the lead. SPT minimizes the mean flow time. In order to apply SPT mathematically, it is necessary to know the processing times of the jobs in the queue of the machine (P_j,m). The job with the minimum processing time should be selected and processed on the machine as a priority. SPT is formulated in Equation (4).

EDD Rule
EDD is based on prioritizing the job with the earliest due date. For example, a real-life use of EDD is when a customer with very urgent business takes the lead while there is a long queue in front of a cash register in a supermarket. EDD is generally used for minimizing tardiness. In order to apply EDD mathematically, it is necessary to know the due dates of the jobs (D_j). The job with the minimum due date should be selected and processed on the machine as a priority. EDD is formulated in Equation (5).
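The three rules above can each be sketched as a selector over a machine's queue. In this sketch, a queued job is a dict holding its arrival time at the machine (A_j,m), its processing time on the machine (P_j,m), and its due date (D_j); the field names are illustrative assumptions, not taken from the paper.

```python
# Dispatching rules as queue selectors corresponding to Equations (3)-(5).

def fifo(queue):
    """FIFO (Equation (3)): pick the job with the minimum arrival time."""
    return min(queue, key=lambda job: job["arrival"])

def spt(queue):
    """SPT (Equation (4)): pick the job with the minimum processing time."""
    return min(queue, key=lambda job: job["processing"])

def edd(queue):
    """EDD (Equation (5)): pick the job with the earliest due date."""
    return min(queue, key=lambda job: job["due"])
```

On the same queue, the three rules can select three different jobs, which is why the choice of dispatching rule changes the schedule's performance criteria.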

Proposed MAS-RL Approach
In this section, we introduce the MAS structure and the RL mechanism. In the MAS structure, we explain the agents in the system and the relationships between the agents. The RL mechanism, which provides a learning function to the MAS, is examined.

Multi-Agent System Structure
The MAS is an AI approach that distributes intelligence across individual agents. The MAS is a computerized system that consists of multiple intelligent agents communicating with each other. An agent has a goal, sensors, and actuators. When an agent is placed into an environment, it should be able to change the environment toward its intended goal.
In the MAS, there are multiple agents that can have different goals and different ways of changing the environment. The MAS can be compared to a bee colony. The different types of bees are specialized for different goals, such as searching for resources, making honey, or defending the colony. Each bee type has different actuators to achieve its own goal. The natural swarm intelligence of the bee colony can be imitated with the MAS.
The MAS should be intelligent in order to provide feasible solutions. Intelligence is defined as the "ability to learn". There are three main machine learning methods: supervised, unsupervised, and RL. We used RL to give the MAS the ability to learn. The aforementioned three types of machine learning are explained in the next section.
The proposed MAS-RL structure designed in this study is shown in Figure 2. Five different types of agents are designed for the MAS-RL, and they all have different goals. All of them have the ability to communicate with each other and take actions toward a goal. In the MAS-RL structure, jobs are created by a Job Agent, and a chain reaction begins when an order arrives. The reaction continues until the maximum number of jobs has been reached. We describe the goals, decisions, and internal mechanisms of the agents.


Job Agents
The main goal of a Job Agent is to create jobs according to incoming orders and report the job's information to the Database Agent. The first aspect of the job information is the job's arrival time to the system. The information about job arrivals is obtained by the Job Agent, generating a random number from the exponential distribution. Another task of the Job Agent is to determine the job type corresponding to the incoming order and to assign processing times for the considered job type. These processing times are obtained by the Job Agent by generating a random number from the normal distribution.
As soon as a job enters the system, its due date is determined. After the assignments, the Job Agent sends jobs to the Queue Agent and reports the job information to the Database Agent.

Queue Agents
A Queue Agent sends the selected job to a Machine Agent when the machine is in idle state. It also provides the current state of the queue to the relevant agents.
Information such as how many jobs are in the queue at the moment, how long the total processing time of these jobs is, and what these jobs' types are, are continuously transferred to the Database Agent. When a job is removed from the queue and sent to the Machine Agent, the information changes are recalculated and reported to the Database Agent.

Machine Agents
A Machine Agent sends the jobs to the queue of another machine when a job is completed on the current machine. The Machine Agent monitors the machine's status as busy or idle and notifies the Database Agent when a change occurs.

Database Agent
A Database Agent stores the incoming information and forwards the information to the other agents. The agent that needs to acquire information requests the information from the Database Agent. The Database Agent acts as an information center for the other agents.

Decision Agent
A Decision Agent decides which job in the queue has the highest priority using the current values in the priority table. After determining the prior job, the Decision Agent transmits the information to the Database Agent and the RL mechanism. The Decision Agent informs the RL mechanism by sending the values from the priority table that it used while determining the prior job. Then, the RL mechanism updates the priority table according to the deviations from the due date of the jobs.
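The Decision Agent's behavior can be sketched as a lookup in a priority table keyed by job type and machine, followed by a reward-driven adjustment. This is a hypothetical sketch: the table keys, field names, and the signed update rule below are illustrative assumptions, not the paper's exact mechanism.

```python
# Hypothetical priority table: one value per (job type, machine) pair.
priority_table = {
    ("jt1", "M1"): 0.5,
    ("jt2", "M1"): 0.5,
    ("jt3", "M1"): 0.5,
}

def choose_job(queue, machine):
    """Decision Agent: pick the queued job whose type currently has the
    highest priority value on this machine."""
    return max(queue, key=lambda job: priority_table[(job["type"], machine)])

def update_priority(job_type, machine, dev, lr=0.1):
    """One plausible RL-style update driven by the due-date deviation:
    if a job of this type left the system tardy (dev > 0), raise the
    type's priority on this machine; if early (dev < 0), lower it."""
    priority_table[(job_type, machine)] += lr * dev
```

With such a rule, job types that repeatedly finish tardy gradually gain priority on the machines where they were chosen, nudging future schedules toward completing them closer to their due dates.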

Reinforcement Learning Mechanism
In supervised learning, machine learning algorithms receive historical input and output data. Supervised learning allows the algorithm to produce outputs as close to the desired result as possible by updating the model for each input/output pair. Common supervised learning algorithms include decision trees, neural networks, support vector machines, and linear regression.
Labeled training data are not used in unsupervised learning. Instead, the machine searches the data for less obvious patterns and makes decisions based on the patterns it finds. K-means, hidden Markov models, Gaussian mixture models, and hierarchical clustering are common unsupervised learning algorithms.
RL is a type of machine learning that mirrors the human learning mechanism. The agent learns by interacting with the environment and receives a positive reward or a negative reward (punishment). The agent is programmed to seek a long-term reward to reach the goal [69]. The RL mechanism is illustrated in Figure 3. The agent takes an action by looking at the state; the environment changes according to the action taken; the agent receives a reward according to this change; then the loop starts over by looking at the state again.
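The state–action–reward loop of Figure 3 can be sketched as follows. The toy environment, the two states and actions, and the reward rule here are placeholders chosen only to make the loop runnable; they are not the paper's job-shop model.

```python
import random

random.seed(0)

def step(state, action):
    """Toy environment: positive reward when the action matches the state."""
    reward = 1.0 if action == state else -1.0
    next_state = random.choice([0, 1])
    return next_state, reward

values = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}  # state-action value table
alpha = 0.1  # learning rate

state = 0
for _ in range(1000):
    # The agent looks at the state and takes the currently best-valued action.
    action = max((0, 1), key=lambda a: values[(state, a)])
    next_state, reward = step(state, action)
    # A positive reward reinforces the action taken; a punishment weakens it.
    values[(state, action)] += alpha * (reward - values[(state, action)])
    state = next_state
```

After enough iterations, the value of the rewarded action in each state approaches the long-term reward, which is the convergence behavior the loop is meant to illustrate.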

To use the RL mechanism, a priority table is needed, as in the dispatching rules. Therefore, a priority value, denoted W, is defined for each job type on each machine. The W values are shown in Table 5 only for the 3 × 3 problem, since the table grows as the number of job types and machines increases; for the 5 × 5 and 10 × 10 problems, the table expands with the job types (i) and machines (m). When scheduler agents need to select a job from the corresponding machine's queue, they give priority to the job type with the highest W value. This structure is formulated in Equation (6).
In cases where a queue has more than one job of the same job type, the W values of those jobs are equal, and a tie occurs when giving priority to one of them. To break the tie, the FIFO rule is used: the job that arrived in the queue earliest is selected.
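The selection step (Equation (6) plus the FIFO tie-break) can be sketched as below. The `Job` structure and the example W values are illustrative assumptions, not data from the paper.

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    job_type: int
    arrival_time: float

def select_job(queue, W, machine):
    """Pick the job whose type has the highest W on this machine; among equal
    W values, the earliest arrival wins (FIFO tie-break via -arrival_time)."""
    return max(queue, key=lambda j: (W[(j.job_type, machine)], -j.arrival_time))

# Illustrative values: job type 2 has the highest priority on machine 0.
W = {(1, 0): 0.4, (2, 0): 0.9}
queue = [Job(1, 1, 0.0), Job(2, 2, 5.0), Job(3, 2, 3.0)]
chosen = select_job(queue, W, machine=0)
```

Here jobs 2 and 3 share the highest-priority type, so the tie is broken in favor of job 3, which arrived earlier.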
The W values are updated for every job leaving the system. A job leaving the system changes the priority values of the jobs of the same type in the system. For example, a job of Type 2 updates W21, W22, and W23 when leaving the system. The magnitude of the change is given by Equation (7) for tardy jobs and by Equation (8) for early jobs.
α: total number of tardy jobs; Tj: tardiness of the jth job; Ni: total number of Type i jobs in the system; Mm: total number of jobs waiting in the queue of the mth machine; β: total number of early jobs; Ej: earliness of the jth job. By dividing each variable by its maximum value, the effect of these variables on the W value is normalized. This reduces the effect of spike values on the W values and makes the W values more stable over time.
The update occurs at time t; in other words, the max(x) functions in Equations (7) and (8) are taken as the greatest value of x measured up to time t. Accordingly, the W values increase by a maximum of 3 in a single update, and the update rate slows down over time.
Since early jobs and tardy jobs should affect the W value in opposite directions, and earliness and tardiness are expressed with different notations, two separate equations are presented as W update equations.
The logic of the change in W values is explained as follows.

1. A tardy job affects the W values in proportion to its delay time (the same applies to early jobs);
2. A tardy job affects the W values in proportion to the number of jobs of the same type in the system;
3. A tardy job causes a larger update to the W values for machines with a long queue length.
The magnitude of the W value shows the priority of the job on the machine. The higher the W value is, the higher the priority of the job type is. The W value increases in the case of tardiness and decreases in the case of earliness. Tardy jobs are more undesirable than early jobs in production systems. For this reason, the change in the W value is greater for tardy jobs than early jobs. While there are three factors affecting the W value for tardy jobs in Equation (7), it is seen that there is only one factor for early jobs in Equation (8).
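The update logic above can be sketched in code. The exact functional forms of Equations (7) and (8) are not reproduced here; this sketch assumes the normalized factors are summed, which is consistent with the stated properties (three factors and a maximum increase of 3 per update for tardy jobs, a single factor for early jobs), but the paper's equations may differ in detail.

```python
def update_tardy(w, T_j, max_T, N_i, max_N, M_m, max_M):
    """Increase W for a tardy job (sketch of Equation (7)): three factors,
    each normalized by its running maximum, so one update adds at most 3."""
    return w + T_j / max_T + N_i / max_N + M_m / max_M

def update_early(w, E_j, max_E):
    """Decrease W for an early job (sketch of Equation (8)): one normalized factor."""
    return w - E_j / max_E
```

As in the text, a tardy job raises the priority of its job type, with the largest increases for long queues, while an early job lowers it by a smaller amount.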

Simulation Model
In order to simulate a real job-shop environment, all input data need to be obtained dynamically and stochastically throughout the simulation period. For this reason, while the simulation is running, input data such as the jobs' arrival times, processing times, and due dates are generated from probability distributions when needed. The routes and processing times of the jobs used in the 3 × 3 problem simulation model are given in Table 6. The processing times are randomly generated from the normal distribution. The job type (i) rows of the table expand to 5 lines for the 5 × 5 problem and 10 lines for the 10 × 10 problem; the processing times used for the 5 × 5 and 10 × 10 problems are given in Tables 7 and 8, respectively.
The jobs' arrival rates are assigned separately for five different scenarios. As the time between arrivals becomes shorter, jobs arrive more frequently and the workload of the system increases. The scenarios, representing very low, low, moderate, heavy, and very heavy workloads, are presented in Table 9. The time between arrivals is randomly generated from the exponential distribution.
There are different due date assignment methods in the literature for job-shop scheduling problems, none of which has a clear advantage over the others. Due to its ease of implementation, one of the "processing time multiplying" methods in [70] was used. In this method, the due date is assigned using the uniform distribution for each job, as shown in Equation (9). When calculating the due date, the arrival time of the job and the estimated total processing time are also taken into account.
Dj = Pj × U(10, 20) + Aj, ∀j, (9)
After the due date assignment, jobs go to the first machine on their routes. A job that finishes its route on the machines is completed. For completed jobs, the flow time (Fj) and the deviation from the due date (Devj) are calculated by Equations (1) and (2), respectively. If the Devj value is positive, the job is tardy and updates the W values using Equation (7); if the Devj value is negative, the job is early and updates the W values using Equation (8).
The simulation model is prepared in the Arena® ver. 13.50 package program. Due to its large size, only a small section of the 3 × 3 problem simulation model is given in Figure 4. The designed MAS-RL structure is transferred to the event-based simulation model. The momentary status of each job, machine, or queue can be observed by stopping the model at any time. In this way, the job shop can be continuously monitored to determine whether there is a problem with any component.
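The due date assignment and the per-job bookkeeping described above can be sketched as follows. Equation (9) is taken directly from the text; the forms Fj = Cj − Aj and Devj = Cj − Dj for Equations (1) and (2) are assumed standard definitions, since those equations appear earlier in the paper and are not quoted in this section.

```python
import random

def assign_due_date(P_j, A_j, rng=random):
    """Equation (9): D_j = P_j * U(10, 20) + A_j, where P_j is the estimated
    total processing time and A_j the arrival time of job j."""
    return P_j * rng.uniform(10, 20) + A_j

def on_completion(C_j, A_j, D_j):
    """Bookkeeping for a completed job with completion time C_j; the forms
    F_j = C_j - A_j and Dev_j = C_j - D_j are assumed for Equations (1)-(2)."""
    F_j = C_j - A_j    # flow time
    Dev_j = C_j - D_j  # > 0: tardy (update via Eq. (7)); < 0: early (Eq. (8))
    return F_j, Dev_j

random.seed(42)
D_j = assign_due_date(P_j=2.0, A_j=5.0)   # falls somewhere in [25.0, 45.0]
F_j, Dev_j = on_completion(C_j=30.0, A_j=5.0, D_j=28.0)
```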

Experimental Results
Since the simulation model works with random numbers, it is necessary to eliminate the effects of extreme values. For this reason, the number of replications is set to 30, and each simulation run ends when 5000 jobs are completed. The average results of the 30 replications for the 3 × 3, 5 × 5, and 10 × 10 problems are presented in Tables 10, 11, and 12, respectively.

Tables 10-12 should each be evaluated on their own, and each scenario should be evaluated separately. The values highlighted in bold show the best result among FIFO, SPT, EDD, and the MAS-RL for that scenario. For example, for the 3 × 3 problem in Scenario 5, the MAS-RL gave the best result for PC1 with 8.2454%. This also holds for the 5 × 5 and 10 × 10 problems: the MAS-RL gave the best results for PC1 in Scenario 5.
For each problem size in each scenario, the method that is best in most of the nine performance criteria is highlighted in bold. For example, for the 3 × 3 problem in Scenario 5, the MAS-RL gave the best results in five of the nine criteria; the same was observed in Scenario 4. In total, the MAS-RL was the best method in eight scenario-problem combinations: two for the 3 × 3 problem, three for the 5 × 5 problem, and three for the 10 × 10 problem.
Examining the MAS-RL alone, from Scenario 1 to Scenario 5, the number of best results given by the MAS-RL was 0-1-2-5-5 for the 3 × 3 problem, 0-0-4-4-4 for the 5 × 5 problem, and 0-0-4-4-4 for the 10 × 10 problem. The performance of the MAS-RL thus improved from Scenario 1 to Scenario 5: the MAS-RL gave better results as the workload increased.
Since PC1, PC2, and PC3 are all related to tardiness, these criteria usually move together. Looking at these criteria, EDD was generally expected to give the best results. However, for all the problem sizes, EDD only gave the best results in Scenarios 1 and 2. The reason may be that the workload of the job shop increases considerably from Scenario 3 onward, which can cause bottlenecks; in systems with bottlenecks, standard dispatching rules may not yield the expected results. In Scenarios 4 and 5, the MAS-RL excelled in tardiness.
Since PC4 and PC5 are related to earliness, these criteria were examined together. Earliness and tardiness factors are expected to act in opposition to each other. When examined for all problem sizes, it can be said that EDD gave good results for PC4 and PC5. It was observed that the MAS-RL did not achieve good results in terms of earliness.
SPT was expected to give the best results for the flow-time criteria PC6 and PC7. It is well known that SPT minimizes the flow time in the single-machine scheduling problem; however, this is not the case in the DJSP. For all problem sizes, SPT gave the best PC6 results under low workloads (2 out of 5 scenarios) and the MAS-RL under heavy workloads (3 out of 5). For PC7, FIFO and EDD shared first place, while the MAS-RL only gave the best results for the 3 × 3 problem in Scenarios 4 and 5.
PC8 measures the number of jobs in the job shop at any given moment and is an important indicator of chaos in the job shop: the higher it is, the harder it is to keep track of jobs and scheduling activities. For this reason, a low PC8 is desired. The rules expected to give the best results for PC8 in the literature are SPT and its derivatives. As expected, SPT gave the best results in almost every situation.
One of the frequently used performance criteria for scheduling problems in the literature is PC9. The makespan indicates how long it takes to complete a certain number of jobs. In other words, it shows when the last job exited the system. For all problem sizes, it is seen that SPT gave the best results as the workload increased in Scenarios 3, 4, and 5.
An event-based graph of the W values in the 3 × 3 problem is given in Figure 5 to examine the curve. As seen in the graph, the W values start from 0 at the beginning of the simulation and spike to extreme values. After these peaks, the rate of change in the W values gradually decreases and stabilizes. Similar graphs were obtained for the 5 × 5 and 10 × 10 problems. The graphs in Figure 5 exhibit a learning curve (LC), which is very common in machine learning studies in the literature. The LC is known for initially making hard peaks and stabilizing as time passes [71]; it describes a system's performance on a task as a function of some resource spent solving that task, as shown in Figure 6. In machine learning studies, performance criteria such as the Mean Squared Error (MSE) or the Mean Absolute Percentage Error (MAPE) are often used. In our study, instead of using these, a strategy based on instantly correcting errors as they occur was adopted: the MAS-RL constantly monitors the system and updates the W values according to the magnitude of the errors, so the errors and the W values move in step with each other.


Conclusions
In this paper, a MAS-RL approach was proposed to solve the DJSP. The performance of the proposed approach was compared with the FIFO, SPT, and EDD dispatching rules from the literature. Five different scenarios with increasing job arrival rates and nine different performance criteria were used for the comparison. Experiments were performed for the 3 × 3, 5 × 5, and 10 × 10 problem sizes. The following conclusions were drawn from the experimental results.

1. As the workload increases, the MAS-RL performs better. From Scenario 1 to Scenario 5, the workload increases along with the performance of the MAS-RL. Two factors appear to cause this. First, as the workload increases, the number of jobs in the system also increases, so more scheduling decisions are needed; the MAS-RL quickly examines the status of all jobs in the system and makes the most appropriate choices. Second, the MAS-RL starts to make more effective decisions after completing its learning stage; having many jobs in the system at the same time enables the MAS-RL to learn faster and to apply what it has learned to more jobs.

2. The MAS-RL can successfully overcome tardiness. For all problem sizes, the MAS-RL gave the best results in Scenarios 4 and 5 for all the performance criteria related to tardiness. In the literature, the dispatching rules that work best for tardiness are known to be EDD and its derivatives. However, the MAS-RL showed promising results for tardiness, outperforming EDD under heavy workloads.

3. The MAS-RL can reduce the flow time. For the 3 × 3 problem in Scenarios 4 and 5, the MAS-RL gave the best results for all the performance criteria related to flow time. For the 5 × 5 and 10 × 10 problems, the MAS-RL only gave the best results for PC6 (mean flow time).
4. The MAS-RL can give feasible solutions with respect to the makespan. The makespan shows how long it takes to complete a certain number of jobs. For all the problem sizes in Scenario 2, the MAS-RL gave the best results for the makespan. For real businesses, it is not very meaningful to look only at the makespan: even when the makespan is optimal, if orders exceed their due dates, the business will soon lose its customers. We still included the makespan in our study, as it has been calculated since the earliest studies of scheduling problems.

5. There is no remarkable relation between the size of the problem and the performance of the MAS-RL. Except for minor differences, the solution methods yield similar results for all the problem sizes.
Another unique contribution of this study to the literature is that each job type could receive a different priority on each machine. In addition, the priorities were coupled with the RL mechanism so that they could change over time. This technique allowed more flexible changes in the production schedule.
In future studies, the MAS-RL can be tested in even larger or smaller systems. Dynamic events such as machine failures and order cancellations can be incorporated in future research. Different parameters can be used in the calculation of the W values, which may change the duration of the MAS-RL's learning period. Researchers can adapt this scheduling method to different problem types in their own studies.

Conflicts of Interest:
The authors declare no conflict of interest.