1 Introduction

The complexity of many industrial scheduling problems prevents their optimal solution and requires the application of heuristics or metaheuristics to calculate solutions efficiently. One such hard-to-solve scheduling problem is considered in this contribution: scheduling parallel serial-batch processing machines with incompatible job families, restricted batch capacities, arbitrary batch capacity demands, and sequence-dependent setup times. This problem arises, for example, in the metal processing industry, where the (laser) cutting of metal sheets is one of the first and mandatory steps in the process (cf., e.g., Helo et al., 2019; Gahm et al., 2022b). The same scheduling problem also arises in the emerging area of additive manufacturing (i.e., industrial 3D printing; cf., e.g., Li & Zhang, 2018; Zhang et al., 2020; Toksarı & Toğa, 2022).

The specific problem with the weighted tardiness minimization objective was first described by Gahm et al. (2022b) and classified as the \(P|sb,\,\,if^*,\,\,b,\,\,\,a_{j} ,\,\,s_{fg} |wT\)-problem. To solve this NP-hard problem, the authors formulate a mixed-integer linear program, propose a set of new heuristics, and benchmark those against (adapted) heuristics from the literature. Their evaluation shows that only small problem instances with up to 30 jobs can be solved to optimality. Since instances of up to 30 jobs are far from industrial-scale problems, which are typically in the order of several thousand jobs, mixed-integer linear programming cannot be considered for solving practical problems. Among the proposed heuristics, the ATCS–BATCS(β) heuristic performs best regarding the trade-off between solution quality and computation time (to simplify reading, we name this heuristic BATCS-b in the following). The BATCS-b heuristic is based on the ATCS priority rule (Lee & Pinedo, 1997) and uses a multi-start approach to perform a full grid search on the (discretized) parameter space spanned by three parameters (cf. Section 2.2). Due to this design, the computation time of the construction heuristic increases strongly as problem instances become large (real-world instances can consist of several thousand jobs; cf. Gahm et al., 2022b). This effect of multi-start full-grid searches is well known and grows with the number of evaluated parameter configurations. However, the performance of many heuristics strongly depends on applying appropriate parameters, and multi-start full-grid searches are therefore very effective. Nevertheless, this effectiveness comes at the cost of restarting the heuristic multiple times with different parameter configurations, resulting in long computation times. This makes the application of multi-start heuristics in real-time decision scenarios, e.g., when the problem changes due to machine breakdowns or newly arriving urgent customer orders, very challenging and thus prevents tardiness improvements of up to 83% (as reported in Gahm et al., 2022b).

Our contribution to the literature is a new approach to improve the efficiency of multi-start heuristics through a machine learning enhanced grid search (MLGS). To that end, we propose to estimate the performance of individual parameter configurations on a single problem instance by machine learning models and to use the performance predictions to create a ranking of parameter configurations. Based on this ranking, we develop and analyze various strategies to select the most promising configurations and to create appropriately small grids. To the best of our knowledge, this ranking-based approach to determining a set of most suitable parameter configurations is completely new in the literature. Regarding the machine learning model, an Artificial Neural Network (NN), we develop and evaluate different “subsampling” strategies to determine the samples that are most efficient for training the NN. In addition, we recommend controlling the size of the reduced grid with regard to the size of an instance (i.e., the number of jobs) to keep computation times manageable. The resulting heuristic with the machine learning enhanced grid search is called BATCS-b-MLGS in the following.

We would also like to point out that the application of our ranking-based “parameter tuning” approach is not limited to BATCS-b; it can be applied to or adapted for other parametrized heuristics or metaheuristics. In addition, the ranking of parameter configurations could be used in a different way: for example, to compute a set of promising “initial” solutions for population-based metaheuristics (e.g., to initialize the population of a Genetic Algorithm).

The structure of the paper is as follows. In Sect. 2, we first describe the scheduling problem and the BATCS-b heuristic in detail and then analyze the most relevant literature. In Sect. 3, all components of the developed machine learning enhanced grid search are presented. Section 4 presents the results of the hyperparameter tuning for the machine learning model and the final testing results on a large test set of problem instances. In the closing Sect. 5, we summarize the findings and give an outlook on future research topics.

2 Related work

2.1 The scheduling problem

The basic task of the \(P|sb,\,\,if^*,\,\,b,\,\,\,a_{j} ,\,\,s_{fg} |wT\)-problem is the grouping of n jobs (\(J = \{ J_{j} \left| {\,j = 1,\;...,\;n \in {\mathbb{Z}}^{ > } } \right.\}\)) into o batches (\(B = \{ B_{b} \left| {\,b = 1,\;...,\;o \in {\mathbb{Z}}^{ > } } \right.\}\)) and the machine allocation and sequencing (scheduling) of those batches on a set of m identical parallel machines (\(M = \{ M_{i} \left| {\,i = 1,\;...,\;m \in {\mathbb{Z}}^{ > } } \right.\}\)) with the objective of minimizing the total weighted tardiness. The maximum batch capacity bc is identical for all machines. Each job has an individual weight \(w_{j}\), processing time \(p_{j}\), due date \(d_{j}\), and an individual (arbitrary) batch capacity requirement \(cr_{j}\) (size; whereby \(cr_{j} \le bc\) must hold). Furthermore, each job belongs to a job family \(f\) (\(F = \{ F_{f} \left| {\,f = 1,\;...,\;q \in {\mathbb{Z}}^{ > } } \right.\}\)), whereby job families are “incompatible” (i.e., jobs of different families cannot be processed together in one batch, e.g., due to technical or material restrictions). Because setups are required between the processing of two batches (jobs), jobs of the same family are batched to reduce setup efforts. Hereby, the setup times are family- and sequence-dependent: \(s_{f,g}\) defines the setup time for a setup from a family \(f\) batch to a family \(g\) batch. Note that \(s_{0,f}\) denotes the initial setup time for family f at the beginning of a schedule.

Other assumptions include that each machine can process at most one batch at a time, that a batch can only be processed by one machine at a time, that all jobs are available for processing at the start (i.e., no release dates), that batch processing cannot be interrupted (i.e., no preemption), and that jobs cannot be added or removed once processing of a batch has started (i.e., batch availability). Consequently, a job's completion time equals the completion time of the batch to which a job is assigned.

2.2 The BATCS-b heuristic

To solve the aforementioned problem, Gahm et al. (2022b) developed the BATCS-b multi-start heuristic, which is based on the “apparent tardiness cost with setups” (ATCS) priority rule to approximate the urgency of a single job and on the summation of the urgencies of the jobs grouped in a batch to a “batch urgency”. The idea of the ATCS rule goes back to the ATC rule proposed by Vepsalainen and Morton (1987) for a job shop problem with the weighted tardiness objective. Lee and Pinedo (1997) extended this rule by integrating sequence-dependent setup times for a parallel machine environment, also with the weighted tardiness objective. Since these publications, the ATC(S) rule has been used by many authors to efficiently calculate schedules for different kinds of problems (cf., e.g., Heger et al., 2016; Lee, 2018; Maecker & Shen, 2020).

The BATCS-b heuristic computes the urgency \(ATCS_{j}\) of job j at time step t based on job attributes, the family f of the last scheduled batch, and the two look-ahead (scaling) parameters \(\kappa_{1}\) and \(\kappa_{2}\):

$$ ATCS_{j} (t,\,\,f) = \frac{{w_{j} }}{{p_{j} }} \cdot \exp \left( { - \frac{{\max \{ d_{j} - p_{j} - t,\,\,0\} }}{{\kappa_{1} \cdot \overline{p}_{t} }}} \right) \cdot \exp \left( { - \frac{{s_{{f,f_{j} }} }}{{\kappa_{2} \cdot \overline{s}_{t} }}} \right) $$
(1)

In Eq. (1), \(\overline{p}_{t}\) is the mean processing time of all unscheduled jobs, \(\overline{s}_{t}\) the mean setup time between all families with unscheduled jobs, and \(f_{j}\) the family of job j. Knowing each job’s ATCS-value, we can derive the batch ATCS (BATCS) value \(\Pi_{b} (t,f)\) for a given time step and family by summing up the ATCS-values of the jobs belonging to batch b: \(\Pi_{b} (t,f): = \sum\nolimits_{j \in b} {ATCS_{j} (t,f)}\).
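To make the rule concrete, the following minimal Python sketch computes Eq. (1) and the batch urgency \(\Pi_{b} (t,f)\); the dictionary-based job representation and the function names are our own illustration, not part of the original implementation:

```python
import math

def atcs(w, p, d, t, s_ffj, p_bar_t, s_bar_t, kappa1, kappa2):
    """ATCS urgency of a single job (Eq. 1): weighted shortest processing time,
    discounted by the job's slack and by the setup time from the current family."""
    slack_term = math.exp(-max(d - p - t, 0.0) / (kappa1 * p_bar_t))
    setup_term = math.exp(-s_ffj / (kappa2 * s_bar_t))
    return (w / p) * slack_term * setup_term

def batch_urgency(batch_jobs, t, f, setup, p_bar_t, s_bar_t, kappa1, kappa2):
    """BATCS value Pi_b(t, f): sum of the ATCS values of all jobs in the batch."""
    return sum(
        atcs(j["w"], j["p"], j["d"], t, setup[(f, j["family"])],
             p_bar_t, s_bar_t, kappa1, kappa2)
        for j in batch_jobs
    )
```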

In its basic version, the BATCS-b heuristic allocates jobs (ordered by their priorities) to a batch until no more jobs can be allocated without violating the batch capacity. However, using the total capacity of a batch may lead to worse solutions, as jobs with a low urgency (priority value) are added to batches at the beginning of the schedule. This is because the batch processing time in serial-batching is the sum of the processing times of its jobs, so adding more jobs to a batch increases the batch processing time. To circumvent this, Gahm et al. (2022b) introduced the parameter \(\beta \in (0,1]\) to control the maximum batch utilization. This parameter defines how much of the batch capacity can be utilized and is given by \(bc^{ACT} : = \max \{ \beta \cdot bc,\max \{ cr_{j} \left| {\,\forall j \in J} \right.\} \}\). The \(\max \{ cr_{j} \left| {\,\forall j \in J} \right.\}\) term ensures that the “actualized” maximum batch capacity is not less than any job capacity requirement and that all jobs can be placed on a metal slide. Gahm et al. report that introducing the parameter \(\beta\) remarkably improves the objective value.

Determining appropriate values for the parameters is necessary to achieve a satisfactory solution quality with ATC(S)-based heuristics. Because there is no general rule or analytical method for obtaining good parameter configurations (containing values for \(\beta\), \(\kappa_{1}\), and \(\kappa_{2}\)) for the considered scheduling problem, Gahm et al. (2022b) use a multi-start grid search in their BATCS-b heuristic. For the multi-start heuristic, the authors define the sets for the parameters \(\beta\), \(\kappa_{1}\), and \(\kappa_{2}\) as \({\rm B}: = \{ 0.5,0.55, \ldots ,1\}\), \({\rm K}_{1} : = \{ 0.5,1, \ldots ,5\}\), and \({\rm K}_{2} : = \{ 0.1,0.2, \ldots ,1.6\}\), respectively, and obtain a combined set of parameter configurations \({\rm B} \times {\rm K}_{1} \times {\rm K}_{2}\) with 1760 (= \(11 \cdot 10 \cdot 16\) = \(\left| {\rm B} \right| \cdot \left| {{\rm K}_{1} } \right| \cdot \left| {{\rm K}_{2} } \right|\)) configurations. Furthermore, they use the instance characteristics-related calculation procedures proposed by Lee and Pinedo (1997) to determine additional \(\tilde{\kappa }_{1}\) and \(\tilde{\kappa }_{2}\) values. Since these procedures do not output a value for \(\beta\), the estimated \(\tilde{\kappa }_{1}\) and \(\tilde{\kappa }_{2}\) are combined with all \(\beta\) values into 11 new parameter configurations. In total, the BATCS-b heuristic evaluates 1771 parameter configurations to obtain the best solution.
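For illustration, the full grid can be enumerated in a few lines; the following sketch assumes that the Lee-Pinedo estimates \(\tilde{\kappa }_{1}\) and \(\tilde{\kappa }_{2}\) have already been computed per instance by the procedures of Lee and Pinedo (1997) and are passed in as given values:

```python
import itertools

# Discretized parameter sets as defined by Gahm et al. (2022b).
B  = [round(0.50 + 0.05 * i, 2) for i in range(11)]  # 0.5, 0.55, ..., 1.0
K1 = [round(0.5 + 0.5 * i, 1) for i in range(10)]    # 0.5, 1.0, ..., 5.0
K2 = [round(0.1 + 0.1 * i, 1) for i in range(16)]    # 0.1, 0.2, ..., 1.6

full_grid = list(itertools.product(B, K1, K2))       # 11 * 10 * 16 = 1760

def lee_pinedo_configurations(kappa1_est, kappa2_est):
    # Combine the instance-specific estimates with all beta values (11 extras).
    return [(beta, kappa1_est, kappa2_est) for beta in B]

# 1760 grid configurations + 11 Lee-Pinedo configurations = 1771 in total.
```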

The evaluation of the BATCS-b heuristic shows that it achieves a competitive solution quality compared to other heuristics. However, the multi-start full-grid search with 1771 parameter configurations results in very high computation times as the number of jobs increases (cf. Gahm et al., 2022b). Because of this lack of efficiency, we propose using machine learning models to determine a single best parameter configuration (or at least a much smaller grid) to increase efficiency while preserving solution quality.

2.3 Literature review on machine learning enhanced scheduling methods

Using machine learning models to improve the efficiency and solution quality of heuristics in machine scheduling dates back several decades. Since then, a large body of literature has been published that goes in different methodological directions. We do not aim to give a comprehensive survey of applications of machine learning models within scheduling but give the reader a brief introduction to the directions most closely related to our approach to allow for categorizing this paper. We categorize previous research into approaches that: (1) choose the proper scheduling method from a set of methods, (2) predict the objective value achieved by a specific scheduling method, and (3) determine “good” parameters for a solution method.

To avoid misunderstandings in what follows, we want to clarify the ambiguous terms online and offline for machine learning models. We say that a model's learning mode is “online” when the learning happens while solving a single problem instance, and “offline” when the model performs the learning on a set of already solved instances before solving the actual problem instance. One major benefit of offline models is the faster response time during model application since the learning task is already performed beforehand.

1. The problem of choosing a scheduling method is often seen in dynamic production environments (e.g., when job arrival times are stochastic). Since these problems usually arise in a time-critical environment, offline-trained machine learning models are frequently used: cf., e.g., Piramuthu et al. (1994), Lee et al. (1997a), Priore et al. (2006), Mouelhi-Chibani and Pierreval (2010), Shiue et al. (2012), Priore et al. (2018), Stricker et al. (2018), Waschneck et al. (2018), or Park et al. (2021). These works choose a proper scheduling method based on the characteristics of the problem instance to solve, either by machine learning models (often an inductive decision tree or an artificial neural network, NN) or by a genetic algorithm (which can be classified as a reinforcement learning approach in a broader sense). Even if not as prominent, method selection approaches are also considered in static production environments: cf., e.g., Azadeh et al. (2012), Azadeh et al. (2013), Shiue et al. (2018), or Lin et al. (2019). Since we do not focus on choosing the best method from a set of existing methods for a problem instance, we do not discuss these works in more detail.

2. Commonly in the scheduling literature, objective value predictions are used to predict a makespan as a “metric” for fast order acceptance decisions or delivery date confirmations (cf., e.g., Akyol, 2004; Raaymakers & Weijters, 2003; Shafaei et al., 2011). Examples of objective value predictions not directly related to scheduling problems are Neuenfeldt Júnior et al. (2019), who predict objective values to estimate lower bounds for 2D strip-packing problems, and Gahm et al. (2022a), who predict heights of 2D nesting problems to check constraints in a hierarchical production planning problem.

3. Literature in the third category, determining “good” parameters for a solution method for scheduling-related problems, is of most interest for the purpose of our approach. Lee and Pinedo (1997) developed rules through experiments that determine values for both look-ahead parameters of the ATCS rule. They use a curve-fitting method that takes the due date tightness, due date range, setup time severity, and a job-machine factor of an instance to obtain proper values for the parameters \(\kappa_{1}\) and \(\kappa_{2}\). Kim et al. (1995) proposed a NN to predict proper look-ahead parameters for the ATC(S) rule based on problem characteristics (such as due-date tightness and due-date range). Their computational tests show that the proposed approach outperforms the ATC rule with a fixed parameter configuration (i.e., the heuristic computes every instance with the same heuristic parameter configuration). Also, their extension for the scheduling problem with sequence-dependent setup times outperforms the method based on curve-fitted parameters by Lee and Pinedo (1997) (note that Lee et al. first published their work as a technical report in 1992 and as a journal article in 1997). Park et al. (2000) also use NNs to predict suitable values for the two look-ahead parameters of the ATCS rule for a scheduling problem with identical parallel machines and sequence-dependent setup times. The authors propose using two independent NNs to predict \(\kappa_{1}\) and \(\kappa_{2}\) independently and investigate two feature vectors with four and five features, respectively. Their results reveal a positive influence of the new setup time-related feature and thus emphasize the importance of feature engineering. Mönch et al. (2006) extend the ATC rule for a parallel-batch scheduling problem. Like the BATCS-b heuristic, they aggregate each job’s ATC value in a batch by summing the individual ATC values and use the sum to prioritize each batch. For determining an appropriate value for the single look-ahead parameter \(\kappa\), they train an inductive decision tree and a NN (multi-layer perceptron) with a feature vector consisting of five instance characteristics. They show in a computational study with generated instances that a successful choice of the look-ahead parameter via machine learning models is possible. El-Bouri (2012) uses a multiple regression for determining a machine’s proximity factor in a composite dispatching rule to solve a dynamic flow shop problem. In his computational study, the proposed method outperforms (or is at least as good as) six other methods that are typically favored for tardiness-related objectives. Heger et al. (2016) study dynamically changing (e.g., new job arrivals, stochastic processing times, or machine breakdowns) flow shop problems with sequence-dependent setup times. Due to the changing nature of the problem, they dynamically switch the heuristic depending on the current situation. For that, the authors use the ATCS rule and adjust the parameters \(\kappa_{1}\) and \(\kappa_{2}\) according to the current state. A Gaussian process regression model is trained offline on simulation runs to estimate the best parameter configuration for \(\kappa_{1}\) and \(\kappa_{2}\) online. Their computational study shows the superiority of the presented approach compared to fixed parameters. Shahrabi et al. (2017) use reinforcement learning to predict proper parameters for a variable neighborhood search procedure in a dynamic job shop environment. They report that their approach outperforms priority rule-based heuristics and a general variable neighborhood search. In addition, the reinforcement learning approach leads to ongoing learning over time.

Similarly to the previous contribution, Heger and Voss (2021) use the ATCS rule to solve a flexible flow shop in a dynamic environment and apply reinforcement learning to find its parameters. Involving reinforcement learning in the solution method enables the advantages of online training. This advantage can be observed in the evaluation, in which the authors' approach reduces the mean tardiness by up to 5%.

The analysis of the literature on determining "good" parameters shows that all of these approaches use a single estimated parameter configuration when the solution method makes a scheduling decision. However, since parameter estimates are by definition uncertain, it seems more suitable to use a few of the most appropriate parameter configurations rather than a single one. Therefore, in contrast to the existing literature, we propose to use a (small) set of most appropriate parameter configurations. Hereby, the challenge is to estimate which parameter configurations perform best and how many of them should be used in a reduced grid search to compute sufficiently good solutions in a reasonable time.

In addition, due to our ranking-based “parameter tuning” approach, only the relative performance between parameter configurations must be estimated (and not the absolute one) and thus, the estimation is more robust regarding prediction errors.

2.4 Summary

The \(P|sb,\,\,if^*,\,\,b,\,\,\,a_{j} ,\,\,s_{fg} |wT\)-problem can be efficiently solved by the BATCS-b multi-start heuristic if the number of jobs is not too large. If the number of jobs increases, the full grid search with 1771 parameter configurations becomes inefficient and a smaller grid or even a single parameter configuration should be used. To determine appropriate parameter configurations, machine learning models can be used. Hereby, several aspects must be considered:

First, the machine learning model and/or the training data preparation must consider the fact that different parameter configurations can lead to the best-known objective value.

Second, the three parameters of the BATCS-b heuristic are not independent; thus, complete parameter configurations should be predicted rather than single values (as done, for instance, by Park et al., 2000). As multiple parameter configurations can lead to the same best objective value, considering the parameter interdependency is especially important. Consider, for example, a situation in which the parameter configurations (1.2, 2.5) and (2, 0.8) lead to the same best-known objective value for a problem instance. If the dependency is not considered, the machine learning models could output values like (1.2, 0.8), which are good predictions if considered independently but (may) lead to bad solutions when applied in the heuristic. Therefore, the machine learning models used should incorporate the interdependent nature of the heuristic parameters.

Third, learning should take place offline to achieve high efficiency; thus, online learning approaches are not appropriate.

Fourth, because priority rule-based heuristics (like the BATCS-b heuristic) are sensitive to appropriate parameters and (parameter) predictions are uncertain by definition, not only a single prediction-based parameter configuration should be used but mechanisms to determine a (small) set of most suitable parameter configurations should be developed. For that, the trade-off between solution quality and computation time must be considered.

Due to these aspects, we propose a machine learning enhanced grid search that uses a ranking of parameter configurations (determined by estimated objective values) and investigates different ranking application strategies for determining grids. Hereby, the machine learning model considers the interdependency between the parameters as we use instance characteristics and parameters as features for the predictions.

3 Machine learning enhanced grid search (MLGS)

The developed machine learning enhanced grid search (MLGS) approach must not only consider the previously discussed aspects but also be applicable to any type of problem instance (i.e., a suitable generalization of the machine learning model is required) and provide a reasonable trade-off between solution quality and computation time. To achieve this, our approach consists of several components depicted in Fig. 1 and described in the following subsections.

Fig. 1: Components of the machine learning enhanced grid search

First, to provide a sufficient data basis for the machine learning model, an exhaustive, diverse set of serial-batch problem instances must be prepared and solved by the BATCS-b heuristic.

For the machine learning, a problem instance must be converted into a numerical representation. To that end, the feature engineering provides two feature vectors with different elements. In addition to the problem instance-related features, the three parameters of the BATCS-b heuristic must be part of the vectors to enable the prediction of their performance. Because training the machine learning model with all combinations of instances and parameter configurations from the full grid search is not manageable, we present different “subsampling” strategies to choose suitable subsets of parameter configurations for speeding up the training. The combination of a subsampling strategy and one of the feature vectors is called a “data pipeline configuration”, and we investigate different combinations.

The data provided by one of the pipelines form the input of the machine learning model. From our summary in Sect. 2.4, we conclude that the only requirement for selecting an appropriate machine learning model is that it must be capable of performing regressions. As machine learning models' accuracy highly depends on their hyperparameters, we recommend performing a comprehensive hyperparameter tuning with one or more selected machine learning models. The tuned models with the highest accuracy are then used to estimate the performance (objective value) of all parameter configurations (of the full grid) for a given instance. Based on these estimations, a ranking of the parameter configurations is created.

To use this ranking in the most suitable way regarding solution quality and efficiency, different “ranking application strategies” are proposed. In the validation, combinations of data pipeline configurations, tuned models, and ranking application strategies are used to evaluate the performance of the resulting, differently parametrized BATCS-b-MLGS variants. Finally, the most promising BATCS-b-MLGS variants are evaluated by testing their performance on problem instances never used before in the development process.

3.1 Data preparation

The accuracy of a machine learning model depends on many factors (e.g., model type, feature engineering, and hyperparameters). One factor that is hard to compensate for by the remaining components is the data set used for training the model. In many applications, providing a “good” data set is the major challenge when developing machine learning models. A fundamental assumption in machine learning theory is that training and test data points (i.e., data points that have to be predicted) are independent and identically distributed (cf. Bishop, 2006) and represent the “real world”. Therefore, it is indispensable to select a training data set that is representative of the test data set. Besides the training data set's representativeness, its size also plays an essential role in obtaining an accurate machine learning model. If the dependency between input and output of the learning problem is complex, a model with high capacity is needed (cf. Hastie et al., 2017 for model capacity/complexity). A high capacity in model-based learning (e.g., neural networks, linear regression, decision trees) often means that the model possesses many parameters (weights). Unfortunately, if the number of parameters is relatively high compared to the size of the data set and no terms penalizing high model complexity are considered, the model is capable of “memorizing” the training data instead of finding a “generalization”. This well-known phenomenon leads to highly accurate predictions on the training data but large errors on predictions for the test data (i.e., the model has a high variance; cf. the bias-variance trade-off). A common way to combat this issue is to add a term penalizing a model's complexity or to use a large data set.

With these considerations, we build our data set I for the machine learning on a large set of instances that is provided for download at Mendeley Data by Gahm (2022). This data set consists of 93,360 generated scheduling instances for the \(P|sb,\,\,if^*,\,\,b,\,\,\,a_{j} ,\,\,s_{fg} |wT\)-problem. The problem instances are divided into 18,672 instance classes, whereby each instance class contains five instances. The instance classes result from different combinations of instance attributes and thus provide a vast variety of instances. The most important instance attributes are the number of jobs n \(\in\) {15, 30, 60, 100, 200, 400, 800, 1600, 3200}, the number of machines m \(\in\) {1, 3, 4, 5, 10, 20}, and the number of incompatible job families q \(\in\) {3, 5, 10, 20, 40}. Further attributes are, for instance, the setup time severity, tardiness factor, and due date range. These attributes primarily define how the job characteristics (e.g., processing times, due dates) are drawn from a distribution. More details on the data set can be found in the online description.

As the data set provides a high degree of diversity in its problem instances (e.g., the number of jobs ranges from 15 to 3200, the number of machines from 1 to 20, and different types of distributions are used for drawing further job characteristics), we assume that the requirements on the data set described above are fulfilled. To make our study reproducible for the scientific community, we did not integrate any real-world instances, which could not be made available due to confidentiality issues. However, showing the effectiveness of our approach on this variety of problem instances justifies its applicability to real-world problems, since manufacturers usually do not have such a high fluctuation in the attributes (e.g., the number of machines is mostly fixed, and the number of jobs does not vary as much as in the data set).

As strongly recommended for machine learning, different subsets of the total available data set are used for the different development phases (usually, 80% of the data for training and validation and 20% for testing). Therefore, we split the complete set of scheduling instances into subsets and assign the different instances per instance class to different subsets: \(D_{P} \subseteq I\) (with \(P \subseteq \{ 1,\ldots,5\}\)) indicates which of the five instances per class \(D_{P}\) contains (e.g., subset \(D_{{\{ 1,2\} }}\) contains the first and the second instance of every instance class).

All 93,360 scheduling instances have been solved with the BATCS-b heuristic (with the full grid search that uses 1771 parameter configurations), and thus, 165,340,560 samples are available in total.

3.2 Feature engineering

The first element of a data pipeline configuration is the feature vector representing a problem instance. Here, we follow two basic approaches: the first uses a small number of features based on complex instance characteristics (called the “complex feature” vector; CF), and the second uses a large number of aggregated simple instance characteristics (called the “aggregated feature” vector; AF). The CF-vector consists of 12 features that are listed in Table 1 and partly taken from or inspired by the literature (cf., Park et al., 2000; Mönch et al., 2006).

Table 1 Feature definitions of the CF-vector

Most of the features are self-explanatory, but the approximated makespan \(\tilde{M}\) requires further explanation. This feature needs the approximated total setup time \(tst = tns \cdot \overline{s}\), which in turn requires the approximated total number of setups \(tns = n/\tilde{n}_{b}\), the mean setup time \(\overline{s}\), the approximated mean number of jobs per batch \(\tilde{n}_{b} = \left\lceil bc/\overline{r} \right\rceil\), and the approximated mean batch capacity requirement \(\overline{r} = \tfrac{1}{n}\sum\nolimits_{j \in J} cr_{j}\). With these approximations, one can compute the makespan approximation according to Gahm et al. (2022b):

$$ \tilde{M} = (n \cdot \overline{p} + tst)/m $$
(2)
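As a worked illustration, Eq. (2) and its auxiliary quantities translate directly into code; the following sketch takes raw instance data as input (the argument names are ours):

```python
import math

def approx_makespan(p, cr, s_bar, bc, m):
    """Makespan approximation M~ of Eq. (2), built from the chain of
    approximations described above."""
    n = len(p)
    r_bar = sum(cr) / n             # approx. mean batch capacity requirement
    n_b = math.ceil(bc / r_bar)     # approx. mean number of jobs per batch
    tns = n / n_b                   # approx. total number of setups
    tst = tns * s_bar               # approx. total setup time
    return (sum(p) + tst) / m       # n * p_bar equals sum(p)
```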

The AF-vector consists of 85 features. Five features are identical to the first five features of Table 1 (indicated by an asterisk), and the other 80 are calculated from eight basic instance characteristics and ten aggregation functions. The basic instance characteristics are the job processing times (\(p_{j}\)), job due dates (\(d_{j}\)), job weights (\(w_{j}\)), job capacity requirements (\(cr_{j}\)), job capacity requirement to processing time ratios (\(cr_{j}/p_{j}\)), job due date to weight ratios (\(d_{j}/w_{j}\)), setup times (\(s_{f,g}\)), and the numbers of jobs per family (\(n_{f}\)). The corresponding values of a single instance are aggregated to features by the following aggregation functions: MIN, MAX, SUM, MED (median), VAR (variance), Q1 (first quartile), Q3 (third quartile), P10 (10% percentile), P90 (90% percentile), and SKEW (Fisher-Pearson coefficient of skewness).
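One possible way to generate these 80 features is sketched below (8 basic characteristics × 10 aggregation functions); the dictionary-based layout is an illustrative choice, not the original implementation:

```python
import numpy as np
from scipy.stats import skew

AGGREGATORS = {
    "MIN": np.min, "MAX": np.max, "SUM": np.sum, "MED": np.median,
    "VAR": np.var,
    "Q1": lambda x: np.percentile(x, 25),
    "Q3": lambda x: np.percentile(x, 75),
    "P10": lambda x: np.percentile(x, 10),
    "P90": lambda x: np.percentile(x, 90),
    "SKEW": skew,  # Fisher-Pearson coefficient of skewness
}

def aggregate_features(characteristics):
    # Map each basic characteristic (name -> list of values) to 10 features.
    return {
        f"{name}_{agg}": float(fn(np.asarray(values)))
        for name, values in characteristics.items()
        for agg, fn in AGGREGATORS.items()
    }
```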

For the actual prediction task, we extend the CF- and AF-vectors by the candidate parameter configuration (\(\beta\), \(\kappa_{1}\), \(\kappa_{2}\)), so in total, the CF- and AF-vectors have 15 and 88 features, respectively.

3.3 Subsampling

To achieve a sufficient generalization of the machine learning model, we must use the complete range of scheduling instances for training. At the same time, we must ensure that the training effort remains manageable. Therefore, we use the scheduling instance subset \(D_{{\{ 1,2\} }}\) for the initial model training phase. However, using this scheduling instance subset with 37,344 instances results in 66,136,224 data points (due to the BATCS-b grid search with 1771 parameter configurations). Although innovations in hardware (e.g., parallelized computing) and theory (e.g., using rectified linear units to overcome the vanishing gradient problem) have made it possible to train large and complex models with large data sets, the computational effort for learning with this number of data points becomes unmanageably high (as preliminary experiments have shown).

To reduce the learning effort, we introduce subsampling strategies that limit the number of parameter configurations (per scheduling instance) used for learning while still achieving sufficient prediction accuracy. Hereby, we differentiate between the set of parameter configurations that achieve best-known solutions (“best configurations”, BC) and the set containing those that lead to solutions with higher (worse) objective values (“worse configurations”, WC). Because there can be more than one parameter configuration in set BC, we define a special representative that is in the “center” of all parameter configurations in BC. The procedure to determine this “central best configuration” (cBC) is described by the following pseudo-code:

Pseudo-code a: Determination of the central best configuration (cBC)

The sequence of the calculation (\(\beta \to \kappa_{1} \to \kappa_{2}\)) is used because we observed that the performance of the heuristic is generally more sensitive to \(\beta\) than to the two other parameters, and more sensitive to \(\kappa_{1}\) than to \(\kappa_{2}\).
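Because the pseudo-code itself is given as a figure, the following Python sketch is only a hedged reconstruction of the idea: we assume that the “center” in each dimension is the value closest to the median of the remaining candidates, applied in the order \(\beta \to \kappa_{1} \to \kappa_{2}\):

```python
import statistics

def central_best_configuration(bc_set):
    """Sequentially narrow down BC to a single central representative.
    The median-based notion of 'center' is our assumption."""
    candidates = list(bc_set)            # list of (beta, kappa1, kappa2) tuples
    for dim in range(3):                 # order: beta -> kappa1 -> kappa2
        center = statistics.median(pc[dim] for pc in candidates)
        best_dist = min(abs(pc[dim] - center) for pc in candidates)
        candidates = [pc for pc in candidates
                      if abs(pc[dim] - center) == best_dist]
    return candidates[0]
```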

In summary, three types of parameter configurations (cBC, BC, and WC) are available to perform subsampling on parameter configurations.

In the proposed subsampling strategies, we generally limit the number of parameter configurations to ten and thus obtain different training data sets containing 373,440 data points each. The strategies differ in the number of samples randomly drawn (by a uniform distribution) from BC and WC and in whether cBC is used. Table 2 depicts the eight subsampling strategies.

Table 2 Subsampling strategies

These strategies are defined to analyze the impact of the cBC and of the WC configurations on the accuracy of the models.
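A sketch of one subsampling step follows; we read a strategy label such as [1,2,7] as the triple (use cBC, draws from BC, draws from WC), which is our interpretation of Table 2:

```python
import random

def subsample(cbc, bc_set, wc_set, strategy):
    """Draw (up to) ten parameter configurations for one scheduling instance."""
    use_cbc, n_bc, n_wc = strategy       # e.g., (1, 2, 7)
    sample = [cbc] if use_cbc else []
    sample += random.sample(sorted(bc_set), min(n_bc, len(bc_set)))
    sample += random.sample(sorted(wc_set), min(n_wc, len(wc_set)))
    return sample
```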

The eight resulting sample sets (named according to the subsampling strategies) and the two feature vectors are combined into 16 data pipeline configurations, identified, for example, by [0,1,9]-CF.

3.4 Model selection and hyperparameter tuning

Multi-layered neural networks (NNs) are commonly chosen as machine learning models when high accuracy is needed and interpretability is unimportant for the application. Since the interpretability of results is not necessary for our application, we use NNs (implemented with the TensorFlow package in Python) for the prediction task. Of course, other machine learning models like regression trees or support vector machines could also be used, but the investigation of different models is out of the scope of this paper.

Besides the training of the weights itself, many hyperparameter decisions (e.g., number of neurons in the hidden layers, regularization, drop-out factor) must be made for the NN. Initial tests have shown that the complexity of NNs with two hidden layers is too restricted (a grid search returned NNs with the highest possible complexity, namely both hidden layers with 1024 neurons). Thus, we extended the NN by an additional third layer to avoid biasing by too simple models. This additional layer enlarges the number of weights in the NN and thus its capacity to represent more complex structures. We perform a hyperparameter grid search for all combinations of hyperparameter values depicted in Table 3 to identify the best hyperparameter setting for the model. To measure the error in this phase, we use the root mean squared error to assess the deviation between the predicted and actual objective values. The learning rate for the Adadelta optimizer is set to 1.0 and the number of epochs to 50 during the hyperparameter tuning phase.

Table 3 Hyperparameter space for NN

To obtain models with highest prediction accuracy, we tune these hyperparameters (defining 576 combinations) with data set \(D_{{\{ 3\} }}\) for each of the 16 data pipelines individually.
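For reference, a minimal TensorFlow/Keras sketch of such a three-hidden-layer regression NN follows; the concrete layer sizes, drop-out factor, and regularization strength are placeholders standing in for one point of the Table 3 grid:

```python
import tensorflow as tf

def build_model(n_features, n1=1024, n2=512, n3=512, dropout=0.1, l2=1e-4):
    # Three hidden layers with ReLU activations; the single output neuron
    # predicts the objective value for one (instance, configuration) pair.
    reg = tf.keras.regularizers.l2(l2)
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(n1, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(n2, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(n3, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=1.0),
                  loss="mse")
    return model
```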

After identifying the best hyperparameter setting for each pipeline configuration, we train the machine learning models on 60% of the data (\(D_{{\{ 1,\,2,\,3\} }}\)).

3.5 Ranking application strategies

With the trained and tuned models, we are able to predict the objective value for each parameter configuration of the BATCS-b grid and a given scheduling instance i. Due to the inherent uncertainty of predictions, we are not only interested in the best predicted configuration but use the predictions to create a complete ranking \(R_i\) of the parameter configurations of the BATCS-b grid for instance i. To handle the uncertainty and simultaneously achieve a sufficient trade-off between solution quality and computation times, we propose different ranking application strategies to form a final grid \(FG_i\) that is much smaller than the original grid. The BATCS-b-MLGS heuristic then uses this reduced \(FG_i\) to perform a multi-start grid search.

The first strategy B1 assumes a perfect prediction and only uses the first (best) parameter configuration \(pc_1\) of the ranking \(R_i\) to define \(FG_i\). This strategy can be seen as a reference for the following, more sophisticated ones.

The second strategy B1-G uses the best prediction \(pc_1\) and stretches a grid around this configuration. This strategy aims to compensate for a prediction's deviation if it is close to a best parameter configuration but does not match it exactly. Since the number of jobs n has the most substantial influence on the computation time of BATCS-b, we vary the neighborhood (grid) size based on the number of jobs n of an instance to control the computation time. More precisely, we define a grid size parameter \(\zeta \in \{ 1,2,3\}\) that indicates how many neighboring values (below and above a predicted heuristic parameter, based on \({\rm B},\,\,{\rm K}_{1}\), and \({\rm K}_{2}\)) are added to the neighborhood, resulting in up to \((2\zeta + 1)^{3}\) configurations. For example, if the model predicts \(\widehat{pc}\) = \((\hat{\beta },\hat{\kappa }_{1} ,\hat{\kappa }_{2} )\) = (0.75, 1.5, 1.0) and we have \(\zeta = 2\), the neighborhood of \(\hat{\beta }\) is {0.65, 0.7, 0.75, 0.8, 0.85}, of \(\hat{\kappa }_{1}\) is {0.5, 1.0, 1.5, 2.0, 2.5}, and of \(\hat{\kappa }_{2}\) is {0.8, 0.9, 1.0, 1.1, 1.2}. So, in total, 27 parameter configurations are in the neighborhood grid (\(NG_{pc}\)) when \(\zeta = 1\), 125 when \(\zeta = 2\), and 343 when \(\zeta = 3\). For different numbers of jobs, the grid size parameter \(\zeta\) is defined by \(\zeta = 3\) for \(n_{i} \in [0,\,\,100)\), \(\zeta = 2\) for \(n_{i} \in [100,\,\,1000)\), and \(\zeta = 1\) for \(n_{i} \in [1000,\,\,3200]\). This means that the neighborhood (or reduced grid, as we refer to it) decreases as the number of jobs increases. Note that these settings have been developed based on preliminary tests to create final grids with desired sizes that achieve a sufficient solution quality and keep computation times below 240 s for instances with 3,200 jobs. Since the instance characteristic-related calculation procedures proposed by Lee and Pinedo (1997) to determine additional \(\tilde{\kappa }_{1}\) and \(\tilde{\kappa }_{2}\) values have shown their benefits, we use the 11 parameter configurations combining \(\tilde{\kappa }_{1}\) and \(\tilde{\kappa }_{2}\) with β to form \(LPG_i\) (Lee-Pinedo grid for instance i). Summarizing, strategy B1-G defines a final grid \(FG_i\) consisting of \(NG_{{pc_{1} }}\) and \(LPG_i\).
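The neighborhood construction itself is straightforward; the sketch below assumes the predicted configuration lies on the discretized grid and simply clips the neighborhood at the boundaries of \({\rm B}\), \({\rm K}_{1}\), and \({\rm K}_{2}\) (the clipping behavior is our assumption):

```python
import itertools

def neighborhood(value, ordered_set, zeta):
    # zeta neighboring values below and above `value` within the parameter set.
    idx = ordered_set.index(value)
    return ordered_set[max(idx - zeta, 0): idx + zeta + 1]

def neighborhood_grid(pc, B, K1, K2, zeta):
    # NG_pc: Cartesian product of the three per-parameter neighborhoods,
    # i.e., up to (2 * zeta + 1) ** 3 configurations.
    beta, k1, k2 = pc
    return list(itertools.product(neighborhood(beta, B, zeta),
                                  neighborhood(k1, K1, zeta),
                                  neighborhood(k2, K2, zeta)))
```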

In the third strategy Bx, we not only use the overall best-ranked parameter configuration but add the first x parameter configurations \((pc_{j} )_{1 \le j \le x}\) from ranking \(R_i\) to the final grid \(FG_i\), which also contains \(LPG_i\). To achieve the desired final grid sizes, we use \(x = (2\zeta + 1)^{3}\) with \(\zeta\) as defined before.

The fourth strategy B(k)-G creates neighborhood grids based on the best k parameter configurations from ranking \(R_i\) to form the final grid \(FG_i\). This follows the idea of “exploring” the parameter space by increasing k. As we use the same approach to create neighborhood grids as before, the size of \(FG_i\) would become very large as k becomes larger, and thus, computation times would also increase. To avoid this effect, we reduce the size of the neighborhood grids to \(\left\lceil (2\zeta + 1)^{3} /k \right\rceil\) (with \(\zeta\) as determined before). Since the different neighborhood grids may contain duplicates, we create the neighborhood grids one after the other (\(NG_{{pc_{k} }}\), starting with k = 1), eliminate duplicated parameter configurations after creating each grid, and increment k until the desired size of the final grid is achieved. For the same reason, we skip those parameter configurations in the ranking for creating neighborhoods which are already part of the final grid. The following pseudo-code illustrates this procedure:

Pseudo-code b: Construction of the final grid \(FG_i\) with strategy B(k)-G
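As the pseudo-code is reproduced only as a figure, the following is a hedged sketch of the procedure just described (it reuses neighborhood_grid from the B1-G sketch above; how the \(\left\lceil (2\zeta + 1)^{3} /k \right\rceil\) configurations are picked from each neighborhood grid is not specified, so we simply take the first ones):

```python
import math

def bkg_final_grid(ranking, B, K1, K2, zeta, k, lpg):
    # B(k)-G: span reduced neighborhood grids around top-ranked configurations
    # that are not yet in the final grid, until the target size is reached.
    target = (2 * zeta + 1) ** 3
    per_grid = math.ceil(target / k)
    final_grid = list(lpg)              # start from the Lee-Pinedo grid LPG_i
    seen = set(final_grid)
    added = 0
    for pc in ranking:                  # best-ranked configurations first
        if added >= target:
            break
        if pc in seen:                  # skip configurations already covered
            continue
        for cand in neighborhood_grid(pc, B, K1, K2, zeta)[:per_grid]:
            if cand not in seen:
                seen.add(cand)
                final_grid.append(cand)
                added += 1
    return final_grid
```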

As one may notice, B1-G is a special case of B(k)-G with \(k = 1\). Nevertheless, we want to distinguish between these two strategies since they are conceptually different. B1-G selects the best-ranked parameter configuration and invests all efforts to exploit the neighboring configurations. In contrast, B(k)-G aims to find a trade-off between exploration and exploitation by varying \(k\). A higher \(k\) results in a higher exploration of the parameter configuration space but the reduced grids of the parameter configurations are “exploited” less. This behavior is enforced by the decision to only span reduced neighborhood grids on top-ranked parameter configurations that are not yet in the final grid. In the validation, we use the B(k)-G strategy with \(k \in \{ 3,5,9\}\) to investigate the trade-off between exploration and exploitation.

3.6 Validation and testing

In the validation, the 16 data pipeline configurations are combined with the six ranking application strategies B1, B1-G, Bx, B(3)-G, B(5)-G, and B(9)-G to determine the most suitable combinations. Hereby, a subset of scheduling instances of the previously unused data set \(D_{{\{ 4\} }}\) is used. After the validation, the most promising data pipeline configurations are trained with data set \(D_{{\{ 1,2,3,4\} }}\). In the final testing, the complete, previously unused data set \(D_{{\{ 5\} }}\) is used to assess the performance of the most promising combinations identified in the validation.

4 Experimental results

The application of most machine learning methods requires a hyperparameter tuning; in our case, the tuning of the NN parameters described in the first subsection. In the following validation, we investigate the performance of the 16 pipeline configurations combined with the six ranking application strategies to determine the best-performing combinations in terms of solution quality and computation time. Finally, we compare the best nine BATCS-b-MLGS variants (i.e., the best three pipeline configurations combined with the best three ranking strategies) with BATCS-b.

For assessing the solution quality, we use the key figure “mean relative improvement versus the worst objective function value” (MRIW; cf., Valente & Schaller, 2012; Gahm et al., 2022b). This metric is used because the objective values of our weighted tardiness problem may be equal or very close to zero, and setting the objective values in relation to a value close to zero distorts the interpretability of other metrics. For that, we define a set \(S\) of solution methods and are interested in the objective value \(ov_{i,s}\) of solution method \(s \in S\) for an instance \(i\). For instance, \(ov_{i,BATCS-b}\) is the best objective value obtained by the BATCS-b heuristic (with all 1771 parameter configurations), and \(ov_{i,\,BATCS-b-MLGS\,([0,5,5],\,AF,\,B(5)-G)}\) is the objective value obtained after running the BATCS-b-MLGS heuristic (with pipeline [0,5,5]-AF and ranking application strategy B(5)-G). The relative improvement of solution method \(s\) on instance \(i\) over the worst objective value \(ov_{i}^{WORST} = \max_{s \in S} \left\{ ov_{i,s} \right\}\) is calculated as follows: if \(ov_{i}^{WORST} > 0\): \(RIW_{i,s} = 100 \cdot (ov_{i}^{WORST} - ov_{i,s})/ov_{i}^{WORST}\); otherwise: \(RIW_{i,s} = 0\). We obtain \(MRIW_{s}\) for solution method s by calculating the mean of all \(RIW_{i,s}\) for a given set of instances.
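The MRIW computation translates directly into code; in the small sketch below, the objective values are passed as a mapping from method name to a per-instance list (the data layout is our choice):

```python
def mriw(objective_values):
    """Mean relative improvement versus the worst objective value, per method.
    objective_values[s][i] holds ov_{i,s}."""
    methods = list(objective_values)
    n_instances = len(objective_values[methods[0]])
    riw = {s: [] for s in methods}
    for i in range(n_instances):
        worst = max(objective_values[s][i] for s in methods)
        for s in methods:
            riw[s].append(0.0 if worst <= 0 else
                          100.0 * (worst - objective_values[s][i]) / worst)
    return {s: sum(vals) / len(vals) for s, vals in riw.items()}
```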

4.1 Hyperparameter tuning

Based on the hyperparameter space for the NN depicted in Table 3, we perform a full grid search for all 16 pipeline configurations to find the best hyperparameter setting for the NNs. For the initial training phase, we use instance subset \(D_{{\{ 1,\,2\} }}\) with 40% of the data, and in the hyperparameter tuning phase, we use instance subset \(D_{{\{ 3\} }}\) with 20% of the data to test the performance of the machine learning models with an epoch size of 50 for the NN. Table 4 shows the hyperparameter settings that minimize the expected loss between the predicted and the actual objective values. We use the mean squared error (MSE) to quantify the expected loss because of its advantageous numerical properties and because we want to penalize higher deviations in the predictions more severely.

Table 4 Tuned hyperparameters

The hyperparameter tuning was conducted on three workstations and training all 576 hyperparameter combinations (cf. Table 3) for a single pipeline configuration takes 331 h on average. Differences in computation time between AF and CF vectors are not remarkable (334.9 h and 327.1 h, respectively). In total, all hyperparameter tuning efforts sum up to 5,297 h.

Even though finding the best-performing NN architecture is a very effortful task, it only has to be performed once for a manufacturing company and the architecture decision obtained can be used in the future (unless there are major changes in the structure of the problem instances of the company).

4.2 Validation

To evaluate the performance of the 16 data pipelines combined with the six ranking application strategies, we use 162 randomly selected instances from set \(D_{{\{ 4\} }}\) (18 instances for each value of n). Since the hyperparameter tuning is completed in this stage, we use the set \(D_{{\{ 1,2,3\} }}\) to train the NN. Table 5 shows the MRIWs [%] and computation times (CT; [s]) for BATCS-b-MLGS for each pipeline configuration and ranking application strategy combination.

Table 5 Key figures by pipeline configuration and ranking application strategy

The validation results depicted in Table 5 show that training the NN only with the parameter configurations calculating best-known solutions ([1,9,0]-CF and [1,9,0]-AF) is outperformed by all other configurations. Based on this observation, we conclude that the model needs data points with “worse” objective values to achieve a more accurate ranking. An explanation for this phenomenon could be that the model needs to reduce the uncertainty for worse-performing parameter configurations in order not to overestimate their performance.

Another interesting observation is that the pipeline configurations [0,0,10]-CF and [0,0,10]-AF achieve a good solution quality without learning from parameter configurations achieving best-known solutions. This supports the previous conclusion that “worse” solutions are important for training the models. Furthermore, we can see that the combination of parameter configurations calculating best-known solutions and randomly selected parameter configurations seems to be favorable and that integrating the “central best configuration” into the training data provides some benefits.

We also observed that pipeline configurations with the AF-vector outperform the CF-vector pipeline configurations. From this, we can conclude that the NN is able to automatically “compute” the complex features from the CF-vector (which are based on problem specific expert knowledge).

Summarizing, the pipeline configurations [0,5,5]-AF, [1,0,9]-AF, and [1,2,7]-AF and the ranking application strategies Bx, B(5)-G, and B(9)-G achieved the highest and most robust solution qualities and thus, we use those in the final testing.

4.3 Testing

The final performance testing of the proposed BATCS-b-MLGS heuristic with different pipeline configurations and ranking application strategies is based on test data set \(D_{{\{ 5\} }}\) comprising 18,672 instances never used before in the development process. Furthermore, the NNs are trained on the data set \(D_{{\{ 1,2,3,4\} }}\) with an epoch size of 200 to achieve a more accurate prediction. With the larger data set and higher epoch size, the three best-performing pipeline configurations [0,5,5]-AF, [1,0,9]-AF, and [1,2,7]-AF need training times of 2.1 h, 5.1 h, and 5.2 h, respectively. The differences in training time result from the different hyperparameter choices (cf., Table 4), leading to different numbers of weights to be learned (1024·128 + 128·1024 = 262,144; 1024·512 + 512·512 = 786,432; and 512·512 + 512·1024 = 786,432, respectively).

Table 6 depicts the MRIWs of the nine best BATCS-b-MLGS variants and BATCS-b. For the nine BATCS-b-MLGS variants, only the mean computation times are listed as deviations between their computation times result from operating system activities and not from their configurations (all have identical grid sizes related to n). We want to point out that the MRIW values in this stage are very different from the validation phase (Sect. 4.2) due to the new reference points of the MRIW metric. Since we have limited our final testing evaluation to the three best data pipeline configurations, the worst objective value (i.e., the highest for the weighted tardiness) among all solution methods for an instance is lower compared to the worst objective value in the validation stage. Of course, BATCS-b has the highest MRIWs because it runs the heuristic with all parameter configurations. The relatively low MRIW of BATCS-b of 7.28% over all instances indicates the overall good solution quality of the BATCS-b-MLGS variants (compared to BATCS-b).

Table 6 Performance comparison

The results in Table 6 show that BATCS-b-MLGS with pipeline [1,2,7]-AF and ranking application strategy Bx achieves the highest mean solution quality among the nine variants. The small deviation of 2.46 percentage points between the mean MRIWs of BATCS-b-MLGS([1,2,7]-AF, Bx) and BATCS-b illustrates the high solution quality. This value is all the more impressive considering the computation time savings of about 89.2% across all kinds of instances and 98.0% for instances with 3,200 jobs (with a decrease in solution quality of only 2.74 percentage points). These results also emphasize the previous finding that the composition of the training data by the different types of parameter configurations (cBC, BC, and WC) influences the solution quality: [1,2,7] clearly outperforms the other pipelines.

In addition to the desired improvement in solution efficiency, the results show that different combinations of data pipeline configurations and ranking application strategies are preferable depending on the number of jobs (however, the performance differences are relatively low): For example, for instances with 3,200 jobs, BATCS-b-MLGS([0,5,5]-AF, Bx) performs best, and for instances with 1,600 jobs, BATCS-b-MLGS([1,0,9]-AF, B(9)-G) performs best. For the practical application of our ranking-based grid search in a manufacturing company, this means that, depending on the number of jobs, the most preferred combination should be determined in a process as described and applied in this paper. For “small” instances with up to 400 jobs, the absolute time savings of our approach seem to be negligible for practical applications, and thus, BATCS-b could be implemented.

5 Conclusions and avenues for further research

The literature on approaches that combine machine learning models with heuristic scheduling methods shows different methodological directions. Two directions are useful for the purpose of our paper, which is to select the most promising parameters for a heuristic: approaches that directly determine the parameters of a scheduling method and approaches that predict objective values. The analysis of the literature shows that none of the existing approaches considers a potential interdependence of parameters (as required for the BATCS-b heuristic) or the inherent uncertainty of single parameter predictions. Therefore, our approach fills a gap in the literature: the proposed machine learning enhanced grid search (applying multiple parameter configurations to cope with the uncertainty) uses a prediction-based ranking that takes into account interdependencies between parameters. Our extensive computational study shows that the BATCS-b-MLGS heuristic needs only a fraction of the computation time required for the full grid search used by BATCS-b (a reduction of 89.2%) and achieves a reasonably good solution quality (only 2.46 percentage points lower than BATCS-b).

Based on the experimental results, we can report several interesting findings:

  • The way in which the ranking is used by the ranking application strategy is of minor importance but using only the best parameter configuration is not recommended (solution quality decreases remarkably) as the uncertainty of the predictions is too high.

  • To reduce the learning effort, we developed and analyzed different subsampling strategies to provide the data set for training. Regarding these strategies, we can report that the training of the NN should not only be based on best-known solutions and that randomly selected “worse” solutions improve the accuracy of the resulting rankings. Furthermore, not using best-known solutions at all still leads to a reasonably good solution quality (which is particularly interesting when the computation of best-known solutions is very effortful).

  • In this context, it is also worth mentioning that the computation of a “central” best parameter configuration (based on all available best-known solutions) and its integration into the training data set is recommended.

  • Depending on the characteristics of a problem instance (e.g., the number of jobs), different subsampling strategies achieve the best solution quality. Thus, the validation and testing process as applied in our study should be used to achieve the best results in real-world applications.

However, our analysis is not complete and could be extended in several ways: It should be investigated under which circumstances and in which application contexts the integration of “worse” sample solutions into the training positively impacts prediction accuracy. Also, other machine learning models capable of performing regressions (e.g., random forests or gradient-boosted trees) could be analyzed; on a higher level, we believe that analyzing the applicability of reinforcement learning approaches to the problem at hand and their impact on the computation time is interesting. Furthermore, our MLGS approach works on a discrete parameter space of 1771 parameter configurations. As it is generally capable of predicting objective values for “continuous” parameters, it seems worthwhile to investigate using the prediction model to improve parameters in a more dynamic and flexible way. Furthermore, research on applying the MLGS approach to other parameterized heuristics would be interesting to show its general applicability. For population-based metaheuristics, the ranking could be used to compute promising initial solutions and also to determine “bad” solutions supporting the diversity of the initial population. One rather evident application of the performance prediction-based ranking approach is to evaluate multiple heuristics/algorithms individually for every problem instance and only execute the top-ranked methods (machine learning-based algorithm selection).