Solving the blocking flowshop scheduling problem with the makespan criterion using Q-learning-based iterated greedy algorithms

This study proposes Q-learning-based iterated greedy (IGQ) algorithms to solve the blocking flowshop scheduling problem with the makespan criterion. Q-learning is a model-free machine learning technique, which is adapted into the traditional iterated greedy (IG) algorithm to determine its parameters, namely, the destruction size and the temperature scale factor, adaptively during the search process. Besides the IGQ algorithms, two different mathematical modeling techniques are presented. One of these is the constraint programming (CP) model, which is known to work well on scheduling problems. The other is the mixed integer linear programming (MILP) model, which provides the mathematical definition of the problem. These mathematical models support the validation of the IGQ algorithms and allow a comparison between different exact solution methodologies. To measure and compare the performance of the IGQ algorithms and the mathematical models, extensive computational experiments have been performed on both small and large VRF benchmarks available in the literature. Computational results and statistical analyses indicate that the IGQ algorithms generate substantially better results than non-learning IG algorithms.


Introduction
The permutation flowshop scheduling problem (PFSP), the simplest version of the flowshop scheduling problems, was the first flowshop variant proposed in the literature. In PFSP, each job is processed on several machines, the route of each job is the same, and all jobs follow the same permutation on each machine. PFSP has a variety of applications in industry (Blazewicz et al., 2007) and can be applied to sectors such as textile, plastic, chemical, and semiconductor manufacturing (Pan & Ruiz, 2012). PFSP has also received significant attention in the literature over the years (Fernandez-Viagas et al., 2016, 2017) and was proved to be NP-hard when the objective is minimizing the makespan (Garey et al., 1976). Over the last fifty years, various mathematical models have been developed for several extensions of PFSP to meet the needs of industry (Cheng et al., 2019), and a wide range of objective functions has been considered (Birgin et al., 2020; Ramezanian et al., 2019). In PFSP, the buffer spaces between machines are assumed to be unlimited. However, in some production plants, it may not be possible to have unlimited buffer space (Miyata & Nagano, 2019). The blocking permutation flowshop scheduling problem (BFSP) has therefore arisen as a variant of PFSP. In BFSP, there is no buffer space between consecutive machines, so a job cannot move to the next machine while that machine is occupied. Since there is no buffer area, a job that has completed processing must stay on its current machine, blocking it; during that period, no other job can be processed on the blocked machine. When the downstream machine becomes available, the job leaves the current machine and allows the next job to be processed. Such blocking situations occur in several production settings, such as robotic cells (Carlier et al., 2010; Ribas et al., 2015; Ribas & Companys, 2015) and modern industrial production systems (Shao et al., 2018a). Furthermore, several sectors have blocking constraints and actively apply BFSP, e.g., the chemical and pharmaceutical sectors (Merchan & Maravelias, 2016), where buffer areas are not allowed in production for health and hygiene reasons: waiting in a buffer area causes the structure of chemical products to deteriorate and the effects of drugs to disappear. BFSP is also applied in the iron and steel industry (Gong et al., 2010), in which products waiting in buffer areas are damaged by oxidation. In addition, BFSP is suitable for electronic manufacturing shops (Chen et al., 2014), where waiting periods between production processes may damage some electronic products.
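The blocking timing described above can be made concrete with a short recursion over departure times (an illustrative sketch, not code from the study): a job leaves machine i only when it has both finished processing and the next machine has become free.

```python
def blocking_makespan(perm, p):
    """Makespan of job sequence `perm` in a blocking flowshop.

    p[j][i] is the processing time of job j on machine i.
    prev[i] holds the departure time of the previous job from machine i:
    a job may leave machine i only once machine i+1 is free.
    """
    m = len(p[0])
    prev = [0] * m
    for j in perm:
        cur = [0] * m
        t = prev[0]                       # job enters machine 0
        for i in range(m - 1):
            # leave machine i at max(processing finish, machine i+1 freeing up)
            t = max(t + p[j][i], prev[i + 1])
            cur[i] = t
        cur[m - 1] = t + p[j][m - 1]      # no blocking after the last machine
        prev = cur
    return prev[m - 1]

# Four jobs, two machines: job 3 is delayed by the blocking cascade, so the
# makespan is 23, whereas with unlimited buffers it would be 14.
print(blocking_makespan([0, 1, 2, 3], [[1, 10], [1, 1], [1, 1], [10, 1]]))  # → 23
```

The same recursion (with speed-up bookkeeping) is what IG-type heuristics evaluate thousands of times per second.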
Moreover, some studies address specific needs of the industry, such as sequence-dependent setup times integrated into the BFSP by Shao et al. (2020). Time constraints for the BFSP are considered by Chen et al. (2014). Multi-objective optimization of an energy-efficient BFSP is considered by Kizilay et al. (2019), and, very recently, Han et al. (2020) added setup times to similar considerations. A BFSP group scheduling problem integrated with transfer times is considered by Yuan et al. (2020). A hybrid BFSP is handled in two studies (Aqil & Allali, 2021; Elmi & Topaloglu, 2013), while a parallel BFSP is also considered in the literature (Ribas et al., 2017, 2019). Furthermore, a lot-streaming BFSP is presented by Gong et al. (2018). In recent years, a distributed BFSP with makespan has been considered by Zhang et al. (2018) and solved using discrete differential evolution algorithms, while Shao et al. (2020) consider a fuzzy distributed BFSP. A detailed review of the BFSP literature is provided by Miyata & Nagano (2019).
Very recently, Karimi-Mamaghan et al. (2022) incorporated multiple perturbation operators into their iterated greedy (IG) algorithm, denoted as QIG. They employed the Q-learning approach, one of the machine learning techniques, to select the perturbation strength of the destruction and construction operator for the PFSP, and showed that QIG outperforms even the algorithms that achieve the best results in the literature for scheduling problems to date. Note that a similar Q-learning approach was employed to solve the no-idle PFSP (Öztop et al., 2020, 2022), as mentioned by Karimi-Mamaghan et al. (2022). In this study, we utilize the Q-learning approach to solve the BFSP and develop our Q-learning-based IG algorithms, denoted as IGQ1 and IGQ2, which we compare to the IG, IGALL, and QIG algorithms. Computational results indicate that the IG algorithms with Q-learning, namely IGQ1, IGQ2, and QIG, substantially outperform the traditional IG algorithm.
The remainder of the article is organized as follows. Section 2 presents the formulations of the CP and MILP models proposed for the problem. Section 3 explains the IG and IGALL algorithms, which are traditional metaheuristic approaches in the literature, together with the local search (LS) procedure used in these algorithms. Section 4 summarizes the reinforcement learning (RL) and Q-learning approaches, and Section 5 presents the proposed IGQ algorithms. Section 6 reports the comparative results and performances of all developed models and algorithms. Finally, Section 7 summarizes the findings and outlines future work.

Mathematical Formulations
The BFSP is formulated using MILP and CP models. Both models use the same parameters: the sets of jobs and machines, the processing time of each job on each machine, and a sufficiently big integer used only in the CP model. We use the MILP formulation of Ronconi & Armentano (2001) and develop a CP model. Both models, including their objectives and constraints, are explained in the following parts.

Mixed-Integer Linear Programming Model
The MILP model is constructed by introducing specific decision variables, an objective function, and various constraints.
The decision variables consist of two integer variables, representing the completion time of each job on each machine and the maximum completion time, respectively. Additionally, a binary variable denotes the position of each job in the sequence, ensuring that the job permutation remains consistent across all machines. The model optimizes the job schedules on the machines, considering completion times and processing orders, with the goal of minimizing the makespan.

Decision variables:
C_{j,i}: the completion time of job j ∈ J on machine i ∈ M
x_{j,k} = 1 if job j is processed at position k in the sequence, 0 otherwise
C_max: the maximum completion time over all jobs

Objective Function

Minimize C_max     (1)

The objective (1) is to minimize the makespan, i.e., the completion time of the last processed job on the last machine, as stated in constraint (2). Constraints (3)-(5) compute the jobs' completion times with respect to the blocking variant of the problem. Constraint (6) states that each job is processed through the series of machines in order. Assignment constraints (7)-(8) ensure that each job is fixed to a single position in the sequence and that each position holds exactly one job on each machine.
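For concreteness, a plausible reconstruction of this position-based formulation, in the spirit of Ronconi & Armentano (2001), is given below; the exact indices and constraint numbering of the authors' model may differ. Here D_{k,i} is the departure time of the job in position k from machine i.

```latex
\begin{align*}
\min\quad & C_{\max} \\
\text{s.t.}\quad
& C_{\max} \ge D_{n,m} && \text{(makespan, cf. (2))}\\
& D_{k,i} \ge D_{k,i-1} + \sum_{j=1}^{n} p_{j,i}\, x_{j,k}
  && \forall k,\ i = 1,\dots,m \quad \text{(processing)}\\
& D_{k,i} \ge D_{k-1,i+1}
  && \forall k \ge 2,\ i = 1,\dots,m-1 \quad \text{(blocking)}\\
& D_{1,0} = 0, \qquad D_{k,0} \ge D_{k-1,1}
  && \forall k \ge 2 \quad \text{(entry to the first machine)}\\
& \sum_{k=1}^{n} x_{j,k} = 1 \ \ \forall j, \qquad
  \sum_{j=1}^{n} x_{j,k} = 1 \ \ \forall k \quad \text{(assignment, cf. (7)--(8))}\\
& x_{j,k} \in \{0,1\}, \qquad D_{k,i} \ge 0
\end{align*}
```

The blocking inequality is the key difference from the standard PFSP model: the job in position k cannot leave machine i before its predecessor has left machine i+1.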

Constraint Programming Model
The CP model incorporates both interval and sequence variables, which are the expressions that make it easier to build scheduling models.Interval variables represent the start time of the process, duration of the process, and end time of each job's process on the machines.Specifically, we define interval variables for each job and machine to track the temporal characteristics of their processing.On the other hand, sequence variables are established for each machine, detailing the set of interval variables associated with the respective machines.This approach allows us to model and optimize the sequencing of jobs on the machines effectively, considering both their temporal properties and the overall operations on each machine.

Decision variables
x_{j,i}: interval variable denoting the processing of job j ∈ J on machine i ∈ M, with a duration between p_{j,i} and D
s_i: sequence variable of machine i defined over the interval variables x_{j,i}
Objective Function
As the equations show, the CP model is compact compared to the MILP model. Since CP offers interval and sequence variables, formulating scheduling problems is straightforward and readable. The objective function (9) minimizes the makespan, obtained from the interval variables: since the end point of an interval variable is directly accessible, minimizing the maximum end time of each job on the last machine yields the objective value. To enforce the blocking requirement that each job must start processing on a machine immediately after leaving the previous machine, constraint (10) is introduced; to allow for the blocking time itself, each interval variable has a duration between the job's processing time and a suitably large constant D. Constraint (11) states that a machine can process only one job at a time. It is written with the help of the global constraint noOverlap, which prevents the interval variables on a machine from overlapping in time, ensuring that jobs on a machine are not executed simultaneously. Furthermore, constraint (12) again uses a global constraint, which ensures that all jobs are processed in the same order on all machines; in short, it is the constraint that keeps the permutation identical across machines.
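Since the models were implemented in OPL CPLEX Studio, the CP model described above can be sketched roughly as follows; the identifiers and data declarations are illustrative, not the authors' code, and the constraint comments map to the numbering in the text.

```
using CP;

int n = ...;                       // number of jobs (supplied by a data file)
int m = ...;                       // number of machines
range Jobs = 1..n;
range Mach = 1..m;
int p[Jobs][Mach] = ...;           // processing times
int D = ...;                       // sufficiently large horizon

// x[j][i] occupies machine i from the start of processing of job j until
// the job actually leaves: its length is processing plus any blocking time.
dvar interval x[j in Jobs][i in Mach] size p[j][i]..D;
dvar sequence s[i in Mach] in all(j in Jobs) x[j][i];

minimize max(j in Jobs) endOf(x[j][m]);                    // (9) makespan
subject to {
  forall(j in Jobs, i in 1..m-1)
    startAtEnd(x[j][i+1], x[j][i]);                        // (10) blocking
  forall(i in Mach) noOverlap(s[i]);                       // (11) one job at a time
  forall(i in 1..m-1) sameSequence(s[i], s[i+1]);          // (12) same permutation
}
```

Letting the interval stretch beyond its processing time, combined with startAtEnd between consecutive machines, is what models the job physically waiting on a blocked machine.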

Iterated Greedy Algorithms
In this section, the traditional IG algorithm, first presented by Ruiz and Stützle (2007b), is explained. There are four important parts at the core of the IG algorithm: the initial solution, the destruction-construction procedure, the LS, and the acceptance criterion. How these four parts are handled is important. In this article, the NEH algorithm (Nawaz et al., 1983), a well-known constructive heuristic in the literature, is used for the initial solution. After obtaining the initial solution, the destruction step randomly removes d jobs from the job sequence; the removed jobs and the remaining jobs are kept in two separate lists. The order of the remaining jobs constitutes the partial solution, to which LS is applied. In the traditional IG structure, applying LS at this point is optional rather than mandatory (Dubois-Lacoste et al., 2017). During the construction phase, the previously removed jobs are reinserted into the partial solution one by one, in the order in which they were removed, each into its best position. After all removed jobs are reinserted, the sequence is complete and a full solution is obtained, to which LS is applied again. The applied LS procedure is given in Algorithm 1: the insertion LS with the speed-up technique previously presented by Tasgetiren et al. (2017) and inspired by the speed-up techniques developed by Taillard (1990); its purpose is to improve solution quality quickly by exploiting these speed-ups. If the current solution improves, LS continues from the improved solution; otherwise, from the current one. After the LS phase, if the obtained solution is better than the incumbent, it is saved as the new current solution; if not, it may still be accepted according to a simulated annealing (SA) type acceptance criterion. A worse solution π' is accepted with probability exp(-(f(π') - f(π))/Temp), where the constant temperature is Temp = τ × (Σ_j Σ_i p_{j,i}) / (10 × n × m) and τ is a parameter to be determined (Osman & Potts, 1989). Finally, Algorithm 2 shows the pseudocode of the traditional IG algorithm implemented in this work.
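The destruction-construction loop with the SA-type acceptance can be sketched as follows. This is a simplified illustration: the study's implementation is in C++ and uses the NEH initial solution and Taillard-style speed-ups, both omitted here; a random initial permutation stands in for NEH, and the intermediate local searches are skipped.

```python
import math
import random

def blocking_makespan(perm, p):
    """Makespan of sequence `perm` in a blocking flowshop (p[j][i]: proc. times)."""
    m = len(p[0])
    prev = [0] * m                      # departure times of the previous job
    for j in perm:
        cur = [0] * m
        t = prev[0]
        for i in range(m - 1):
            t = max(t + p[j][i], prev[i + 1])   # blocked until machine i+1 frees
            cur[i] = t
        cur[m - 1] = t + p[j][m - 1]
        prev = cur
    return prev[m - 1]

def iterated_greedy(p, d_size=2, tau=0.4, iters=200, seed=1):
    """Basic IG: destruction, greedy best-position reinsertion, SA acceptance."""
    rng = random.Random(seed)
    n = len(p)
    m = len(p[0])
    temp = tau * sum(map(sum, p)) / (10 * n * m)   # constant SA temperature
    cur = list(range(n))                           # random start (NEH in the paper)
    rng.shuffle(cur)
    cur_c = blocking_makespan(cur, p)
    best, best_c = cur[:], cur_c
    for _ in range(iters):
        # Destruction: remove d_size randomly chosen jobs.
        cand = cur[:]
        removed = [cand.pop(rng.randrange(len(cand))) for _ in range(d_size)]
        # Construction: reinsert each removed job at its best position.
        for job in removed:
            costs = [(blocking_makespan(cand[:k] + [job] + cand[k:], p), k)
                     for k in range(len(cand) + 1)]
            cand.insert(min(costs)[1], job)
        cand_c = blocking_makespan(cand, p)
        # SA-type acceptance of worse solutions.
        if cand_c < cur_c or rng.random() < math.exp(-(cand_c - cur_c) / temp):
            cur, cur_c = cand, cand_c
            if cur_c < best_c:
                best, best_c = cur[:], cur_c
    return best, best_c
```

In the full algorithm, local search is additionally applied to the partial solution after destruction and to the complete solution after construction.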

Reinforcement Learning and Q-learning
The RL technique is based on machine learning, and its basis is to reward good behavior and punish bad behavior. An RL agent can perceive and interpret its environment, take actions, and learn through trial and error to achieve a specific goal (Kaelbling et al., 1996; Sutton & Barto, 2018). The aim is to enable learning through interaction with the environment; as the environment is learned, the reward obtained approaches its maximum level. Most known RL methods require a model that includes all possible states, the actions available in each state, the transition probabilities between states, and the expected rewards. However, a complete model is often unavailable, or obtaining it may take a long time. For such situations, a model-free RL algorithm called Q-learning has been developed (Watkins, 1989). Q-learning is based on temporal differences. The algorithm works with a state space S, an action space A, state-action pairs (s, a), and an expected gain score Q(s, a) obtained as a result of the action chosen in each state. Q(s, a) is updated as follows:

Q(s, a) := Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]     (14)

In this equation, s indicates the current state and a the action applied in state s; s' is the next state and a' an action applicable in it. While the score is updated, a term multiplied by the learning rate α (0 < α ≤ 1) is added to the existing score. Within this term, the maximum of the future scores is multiplied by the discount factor γ (0 < γ ≤ 1), the current score is subtracted, and the reward r for choosing action a is added.
In choosing an action for the current state, both exploration and exploitation are important. The Q-learning technique balances the two, while still allowing state-action pairs that have never been tried before to be discovered. The ϵ-greedy strategy below is implemented using a fixed probability value (Sutton & Barto, 2018):

a = argmax_{a∈A} Q(s, a) with probability 1 - ϵ; a uniformly random action from A with probability ϵ
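The update rule (14) and the ϵ-greedy selection can be written compactly. The sketch below is generic, not the study's code, and the single-state toy task at the end is hypothetical, used only to exercise the two functions.

```python
import random

def epsilon_greedy(Q, s, actions, eps, rng):
    """Pick a random action with probability eps, else the greedy argmax-Q action."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(Q, s, a, r, s_next, actions, alpha=0.6, gamma=0.8):
    """One temporal-difference update of Q(s, a), as in Eq. (14)."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Toy single-state task: action 1 always yields reward 1, action 0 yields 0.
rng = random.Random(0)
actions = [0, 1]
Q = {(0, a): 0.0 for a in actions}
for _ in range(200):
    a = epsilon_greedy(Q, 0, actions, eps=0.3, rng=rng)
    q_update(Q, 0, a, r=float(a == 1), s_next=0, actions=actions)
print(Q[(0, 1)] > Q[(0, 0)])  # → True: the agent learns to prefer action 1
```

With γ = 0.8 the Q values here converge toward 1/(1 - γ) = 5 for the rewarding action, illustrating how the discount factor bounds the accumulated score.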

IGQ Algorithms with Q-learning
In this article, the Q-learning algorithm and the IG algorithm explained in the previous sections are combined into the IGQ algorithms. The learning mechanism of Q-learning is used to determine the parameter values of the IG algorithm self-adaptively. The Q value is calculated for each parameter-action pair in the IGQ algorithm; Eq. (15) is obtained by using a parameter p in place of the state s of equation (14). We define Q(p, a), a function that determines both the temperature scale factor τ and the destruction size d for the IGQ algorithms, as follows:

Q(p, a) := Q(p, a) + α [r + γ max_{a'} Q(p', a') - Q(p, a)]     (15)

In the IGQ algorithm, the parameter set consists of the temperature scale factor and destruction size values. The reward r is defined as 1/C_max, since the objective is to minimize C_max and a smaller C_max should lead to a larger reward. Moreover, if the objective value of a new solution is worse than the current one, it may still be accepted by the SA-type acceptance criterion; in that case, a lower reward is obtained for the action performed, because a higher C_max than the current value has been reached. Sets of candidate values are defined for τ and d: τ ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} and d ∈ {1, 2, 3}. Note that the IGQ1 algorithm employs the action list d ∈ {2, 3, 4, 5, 6, 7}, whereas the IGQ2 algorithm uses d ∈ {1, 2, 3}; in both algorithms, τ ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. The other parameters are taken as ϵ := 0.8, β := 0.996, α := 0.6, and γ := 0.8, which were experimentally determined by Karimi-Mamaghan et al. (2022); we borrow those values for our IGQ algorithms.
The proposed IGQ algorithms are almost the same as the traditional IG algorithm; however, the parameter pair (τ, d) is determined by the Q-learning algorithm at each iteration. In addition, we employ the referenced local search (RLS) of Tasgetiren et al. (2009), given in Algorithm 3, with the speed-up methods of Tasgetiren et al. (2017), in the Q-learning-based IGQ and QIG algorithms. Finally, Algorithm 4 outlines the IGQ algorithms.
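How Q-learning drives the parameter choice can be illustrated schematically: at each IG iteration the agent picks a (τ, d) action by ϵ-greedy selection, runs one destruction-construction cycle with those parameters, and feeds 1/C_max back as the reward. The sketch below is illustrative only; the action lists follow the sets quoted above, `run_ig_iteration` is a stand-in for a full IG cycle, and the single abstract state is a simplification.

```python
import random

# Candidate values as described above (IGQ2's action list; IGQ1 would use d in {2..7}).
TAUS = [round(0.1 * k, 1) for k in range(1, 10)]
DS = [1, 2, 3]
ACTIONS = [(tau, d) for tau in TAUS for d in DS]

def igq_step(run_ig_iteration, Q, state, eps, rng, alpha=0.6, gamma=0.8):
    """One IGQ iteration: epsilon-greedy choice of (tau, d), then a TD update.

    `run_ig_iteration(tau, d)` is assumed to perform one destruction-
    construction-local-search cycle with those parameters and to return
    the makespan of the resulting solution.
    """
    if rng.random() < eps:
        action = rng.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    cmax = run_ig_iteration(*action)
    reward = 1.0 / cmax                  # smaller makespan -> larger reward
    best_next = max(Q[(state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return action, cmax

# Demo with a stand-in IG iteration that simply returns a fixed makespan.
rng = random.Random(7)
Q = {(0, a): 0.0 for a in ACTIONS}
action, cmax = igq_step(lambda tau, d: 1000, Q, state=0, eps=0.8, rng=rng)
```

In the full algorithm, ϵ would presumably be decayed over the run, shifting the search from exploration to exploitation; the β = 0.996 in the parameter list above plausibly serves as such a decay factor.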

Computational Results
This study handles the BFSP and presents mathematical formulations of the problem using the MILP and CP models. In addition, several heuristic algorithms were developed to obtain good solutions in a short computational time. All models and heuristics were run on the small VRF instances, a well-known benchmark proposed by Vallada et al. (2015). Both the MILP and CP models were coded in OPL CPLEX Studio IDE 12.10 and given a 1-hour time limit to reach optimality or an upper bound with a gap from optimality. Once results were obtained from both MILP and CP, the better of the two was chosen as the reference bound for each small VRF instance. The metaheuristic algorithms were implemented in C++ on the Microsoft Visual Studio platform. The results of all heuristic algorithms were obtained by running them for 25 × n × m milliseconds with five replications. All experiments, for both the mathematical models and the heuristics, were carried out on an Intel(R) Core(TM) i7-2600 desktop with a 3.40 GHz CPU and 8 GB RAM. All best-known solutions (BKS) for both the small and large VRF instances are given in Appendix A1 and Appendix A2.

Comparisons on the Small VRF Instances
In this section, the results of the CP and MILP models and of the heuristic algorithms (IG, IGALL, IGQ1, IGQ2, and QIG) are compared in terms of relative deviations on the small VRF instances. The relative percent deviation of a solution is calculated as

RPD = 100 × (f(x) - f(best)) / f(best)     (16)

where f(x) is the obtained solution and f(best) is the optimal or best solution found by the MILP and CP models; thus, the RPD of each algorithm is measured from the best or optimal model result. For the metaheuristics, we also report the relative percent improvement:

RPI = 100 × (f(x) - UB) / UB     (17)

where the upper bound (UB) is taken from the MILP or CP results and f(x) is the solution obtained by the IG, IGALL, IGQ1, QIG, or IGQ2 algorithm, so that negative values indicate an improvement over the model's bound. The small VRF instances comprise six job sizes (10, 20, 30, 40, 50, 60) and four machine counts (5, 10, 15, 20), with ten instances for each n × m combination. The RPD and RPI values of the instances are gathered, and the average over the ten instances of each n × m combination is reported as the average relative percent deviation (ARPD) and average relative percent improvement (ARPI).
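As a worked illustration of these two measures (the numbers below are made up; the sign convention follows the tables, where negative RPI means the heuristic improved on the model's bound):

```python
def rpd(f_x, f_best):
    """Relative percent deviation from the best/optimal model value, Eq. (16)."""
    return 100.0 * (f_x - f_best) / f_best

def rpi(f_x, ub):
    """Relative percent improvement over the model's upper bound, Eq. (17)."""
    return 100.0 * (f_x - ub) / ub

# A heuristic makespan of 984 against a CP upper bound of 1000:
print(round(rpi(984, 1000), 2))   # → -1.6  (1.6% better than the bound)
# A model solution of 1050 against a best value of 1000:
print(round(rpd(1050, 1000), 2))  # → 5.0
```

Averaging these per-instance values over the ten instances of an n × m combination gives the ARPD and ARPI entries reported in the tables.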
The heuristic results were obtained by running each algorithm for 25 × n × m milliseconds with five replications per instance. For each n × m combination, Table 1 presents the ARPI values of the average solutions of these five replications, as well as the ARPD values of the MILP and CP models; the overall average values are written in bold. The CPU columns report the average computation times of the MILP and CP models, respectively. The GAP value represents the difference between the lower and upper bounds obtained when solving the MILP and CP models within the time limit: the solution value of the minimization problem is the upper bound, and the GAP is the difference between the upper and lower bounds divided by the upper bound, as calculated by the solver. If a result is optimal, its GAP value is zero. Table 1 shows that, as the job size increases, the ARPD% values also increase for both the models and the heuristics. The same cannot be said of the number of machines, for which the ARPD% values do not follow a smooth pattern. Regarding the overall averages, the best-performing algorithm is IGQ2 with -1.79 ARPD%, followed by IGALL and QIG with very small differences. All heuristics have relatively close ARPD% values, but the models clearly perform worse. Between the models, the CP model generated better results than the MILP model with respect to solution quality. Although the models were given 3600 seconds, they could not find good results due to the computational complexity of the problem, especially the MILP model; thus, the MILP model was not employed for the large-size instances.
Table 2 summarizes the results of all models and algorithms, indicating the number of optimal solutions, the number of best solutions, ARPD/ARPI, and average CPU times. The best solution found for each instance by any model or algorithm is recorded, and Table 2 reports, as "# of best", in how many of the 240 instances each model or algorithm found that best solution. Both the MILP and CP models obtained optimal solutions for all of the 10-job instance sets, which corresponds to 40 of the 240 instances. However, once the job size reaches 20, the models cannot find optimal results within 3600 seconds and provide sub-optimal solutions. For the 10-job instances, all heuristic algorithms found all optimal results except the IGQ2 and IG algorithms, each of which missed the optimum on only one instance, and by a very small margin. Most of the best results are found by the IGQ2 algorithm, followed by the IGALL and QIG algorithms. However, since the ARPI values of all algorithms are very close, an interval plot is provided in Fig. 1 to show whether their results are statistically different. Fig. 1 compares the means of the ARPI values of the algorithms under a 95% confidence interval. As seen from Fig. 1, the intervals of all algorithms except IG intersect, so those algorithms are not statistically different from each other. The IG algorithm's interval does not intersect with those of the IGALL, QIG, and IGQ2 algorithms, so IG is statistically significantly worse than these algorithms; however, the IG and IGQ1 results are not significantly different.

Comparisons on the Large VRF Instances
This section compares the algorithms on the large VRF instances. The IG, IGALL, IGQ1, QIG, and IGQ2 algorithms were compared to each other. The large VRF instances include eight job sizes (100, 200, 300, 400, 500, 600, 700, 800) and three machine counts (20, 40, 60), with ten instances for each job and machine combination. The CP model was run for 1 hour on these data sets; since it cannot provide good solutions under the time limit, its results are taken as upper bounds (UB), and the deviations of all algorithms from these upper bounds were calculated. Each algorithm was run with five replications for each n × m combination. Table 3 shows the ARPI of the averages of these replications for all algorithms. As seen in Table 3, all five algorithms provide similar results, with small differences in their overall ARPI values. The best-performing algorithms are QIG and IGQ2, with -8.43 and -8.42 overall ARPI values, respectively. As the number of jobs increases, the improvement achieved by the algorithms also increases, because the CP model yields worse results within 1 hour as the number of jobs grows. For 500 or more jobs, the improvement percentages of the algorithms decrease as the number of machines increases; for fewer than 500 jobs, no significant trend with the number of machines was observed. It is nevertheless clear that the learning-based algorithms perform best. In the interval plots, the differences in RPIs are statistically significant whenever the confidence intervals of the two compared algorithms do not overlap. For each machine size, all metaheuristics follow a similar pattern for 20, 40, and 60 machines, as seen in Fig. 2.
For 20 machines, the confidence intervals of the IGQ1, QIG, and IGQ2 algorithms do not intersect with the IG algorithm's confidence interval and have only a small intersection with the IGALL algorithm's; hence, their differences are statistically significant when compared to the traditional IG algorithm, and largely so compared to IGALL. Even the IGALL algorithm is statistically significantly better than the traditional IG algorithm. For 40 machines, the results of the IGQ1, QIG, and IGQ2 algorithms are statistically significant when compared to the IG and IGALL algorithms, since their confidence intervals do not coincide, and a similar pattern can be observed for 60 machines. Ultimately, it can be concluded that the proposed IG algorithms with Q-learning outperform the traditional IG algorithm.
Fig. 3 presents the interval plot of the metaheuristic algorithms for 200 jobs with 20, 40, and 60 machines under a 95% confidence interval. For each machine size, all metaheuristics follow a pattern similar to that in Fig. 2. For each machine combination, the confidence intervals of the IGALL, IGQ1, QIG, and IGQ2 algorithms do not intersect with the IG algorithm's confidence interval, so their differences are statistically significant compared to the traditional IG. The IG algorithms with Q-learning generate better results than the IGALL algorithm, although there is a small intersection between their confidence intervals. Ultimately, the proposed IG algorithms with Q-learning again outperform the traditional IG algorithm. Figs. 6 and 7 present the interval plots for 500 and 600 jobs, respectively, with 20, 40, and 60 machines under a 95% confidence interval. These two figures follow the same pattern: the traditional IG algorithm performs statistically worse than the other algorithms for all machine sizes. Also, the confidence intervals of the QIG and IGQ2 algorithms do not intersect with the IGALL algorithm's in any machine combination, showing that their results are statistically better than those of IGALL. Since the IGQ1 and IGALL intervals have small intersections in all combinations, we cannot conclude that their solutions are statistically different. Figs. 8 and 9 present the interval plots for 700 and 800 jobs, respectively, with 20, 40, and 60 machines under a 95% confidence interval. These figures follow a pattern similar to the previous ones, except that the QIG algorithm's confidence interval does not intersect with any other algorithm's except IGQ2's, indicating that QIG and IGQ2 are the best-performing algorithms for 700 and 800 jobs. The large gap produced by the traditional IG algorithm marks it as the worst performer among all algorithms, and IGALL statistically performs better than IG. Generally, the Q-learning-based algorithms generate the best results with respect to the traditional IG and IGALL algorithms, although the results of IGQ1, QIG, and IGQ2 are statistically indistinguishable in many job and machine combinations. Last of all, it can be concluded that the proposed IG algorithms with Q-learning outperform the traditional IG algorithms and generate the best results.

Conclusion and future research
This study considers the BFSP with the objective of minimizing the makespan. Two types of mathematical models, namely MILP and CP, were developed to solve the problem and to verify the results of the metaheuristic algorithms against optimal solutions. Sets of parameter values frequently used in the literature were created so that the parameter values of the IG algorithms could be learned while the algorithm is running; using the Q-learning algorithm, a mechanism was developed to learn, from the values in these sets, the parameters best suited to the problem. Thus, besides the traditional IG and IGALL algorithms, the IGQ1, QIG, and IGQ2 algorithms were developed. The performances of all models and metaheuristics were analyzed and compared on the small and large VRF instances, and the best-known solutions were reported. In the analysis of the mathematical models, when the job size is 10, both models find all solutions optimally, but the CPU time of the MILP model is considerably lower than that of the CP model. However, as the job size increases from 20 to 60, both models have difficulty reaching optimal solutions within the 1-hour time limit, and the CP model starts to outperform the MILP model in solution quality. Thus, for the large VRF instances, only the CP model was run to provide a comparison point for the metaheuristics. When the metaheuristics are compared on the small VRF instances, all algorithms perform similarly except the traditional IG algorithm, which is statistically the worst performer. Similar results were obtained for the large VRF instances. These results indicate that the Q-learning-based IG algorithms are not statistically different from each other, but they perform better than the traditional IG and IGALL algorithms.
This study demonstrates the robust performance of Q-learning-based IG algorithms on the BFSP. In future studies, these algorithms may yield better results when applied to different scheduling problems. In addition, other Q-learning-based metaheuristics can be developed, such as Q-learning-based iterated local search or variable neighborhood search; self-adaptive learning of algorithm parameter values is expected to perform well on scheduling problems. Beyond the makespan objective, the approach should also be tested on other objective functions, e.g., total flow time or tardiness minimization. There are gaps in the literature in this area, and we believe the literature will move in this direction in the future.

Fig. 2. Interval plot for 100 jobs and 20, 40, and 60 machines
Fig. 3. Interval plot for 200 jobs and 20, 40, and 60 machines

Table 1
ARPD of the results for small VRF instances

Table 2
Summary of the results for small VRF instances

Table 3
ARPI of the results for large VRF instances