Parallel Algorithm with Blocks for a Single-Machine Total Weighted Tardiness Scheduling Problem

In this paper, the weighted tardiness single-machine scheduling problem is considered. To solve it, an approximate (tabu search) algorithm is used, which works by improving the current solution through searching its neighborhood. Methods of eliminating bad solutions from the neighborhood (the so-called block elimination properties) are also presented and implemented in the algorithm. Blocks allow a significant shortening of the process of searching the neighborhood generated by insert-type moves. The designed parallel tabu search algorithm was implemented using the MPI (Message Passing Interface) library. The obtained speedups are very large (over 60,000×) and superlinear. This may be a sign that the parallel algorithm is superior to the sequential one, as the sequential algorithm is not able to effectively search the solution space of the problem under consideration. Only the introduction of a diversification process through parallelization can provide adequate coverage of the entire search space. Current methods of parallelizing metaheuristics give a speedup which strongly depends on the problem's instances and is rarely greater than the number of parallel processors used. The method proposed here allows obtaining huge speedup values (over 60,000×), but only when the so-called blocks are used. The above-mentioned speedup values can be obtained on high-performance computing infrastructures, such as clusters, with the use of the MPI library.


Introduction
Problems of scheduling tasks on a single machine with cost goal functions, despite the simplicity of their formulation, mostly belong to the class of the most difficult (NP-hard) discrete optimization problems. The literature considers many types of such problems, differing in task parameters, functional properties of machines and criteria: from the simplest ones, minimizing the number of late tasks (the problem denoted by 1|| ∑ U_i), to complex ones with machine setups and time windows, where machine downtime may occur. Their optimization boils down to determining task starting times (or their order) that minimize the sum of penalties (costs of performing tasks). Optimal algorithms solve, within a reasonable time, instances with a number of tasks not exceeding 50 (80 in a multiprocessor environment, see [1]); therefore, in practice, almost exclusively approximate algorithms are used. In the author's opinion, the best of them are based on local search methods (as opposed to randomized methods, which may give a different result each time they are run, sometimes very good and sometimes poor).
In the problem considered in this work there is a set of tasks that must be performed on one machine. Each task has its requested execution time, a due date and a tardiness weight in the objective function. One should determine the order of performing the tasks that minimizes the sum of tardiness costs. It is undoubtedly one of the most studied problems of scheduling theory, and it belongs to the class of strongly NP-hard problems. The first work on this subject, by Rinnooy Kan et al. [2], was published in the 1970s. Despite the passage of over 40 years, this problem is still attractive to many researchers, and its various variants are considered in papers published in recent years, e.g., Cordone and Hosteins [3].

Here we propose the use of a parallel, multiple-walk tabu search algorithm based on the MPSS model (Multiple starting Points, Single Strategy). The MPI (Message Passing Interface) library is used for communication between computing processes run on several processor cores, each executing its own tabu search process.

Contributions
To sum up, the paper presents a new method of generating sub-neighborhoods based on the elimination properties of blocks in a solution. Searching sub-neighborhoods in the parallel tabu search algorithm gives a significant acceleration of calculations without losing the quality of the generated solutions. Compared to paper [1], where blocks were introduced, here we consider the division of the entire permutation into blocks, and not only of its middle fragment (for unfixed tasks, as in [1]). Moreover, the properties of blocks are used to eliminate solutions from the neighborhood.

Formulation of the Problem
In the formulation part and throughout the paper, we use the following notation:

- n - number of tasks,
- J - set of tasks,
- p_i - execution time of task i,
- w_i - tardiness cost factor (weight) of task i,
- d_i - requested completion time (due date) of task i,
- π - permutation of tasks,
- Φ - set of all permutations of elements from J,
- N(π) - neighborhood of solution π,
- S_i - starting time of task i ∈ J,
- C_i - completion time of task i ∈ J,
- T_i - tardiness of task i,
- π_T - semi-block of early tasks,
- π_D - semi-block of tardy tasks,
- B - partition of a permutation into blocks,
- F(π) - sum of tardiness costs (criterion).

The Total Weighted Tardiness problem (in short, TWT) can be formulated as follows.

TWT problem: Each task from the set J = {1, 2, . . . , n} is to be executed on one machine, and the following restrictions must be met: (a) all jobs are available at time zero, (b) the machine can process at most one job at a time, (c) preemption of the jobs is not allowed, (d) associated with each job j ∈ J there is (i) a processing time p_j, (ii) a due date d_j, (iii) a positive weight w_j.
The order in which the tasks are to be performed must be determined so as to minimize the sum of tardiness costs. As in [22], we denote the problem by 1|| ∑ w_i T_i.
Any solution to the considered problem (the order in which the tasks are performed on the machine) can be represented by the permutation of tasks (elements of the set J ). By Φ we denote the set of all such permutations.
Let π ∈ Φ be some permutation of tasks. For task π(i) (i = 1, 2, . . . , n), let C_π(i) = ∑_{j=1}^{i} p_π(j) be its completion time and T_π(i) = max{0, C_π(i) − d_π(i)} its tardiness. In the considered problem, one should determine the order in which the tasks will be performed by the machine (a permutation π ∈ Φ) minimizing the sum of tardiness costs, i.e.,

F(π) = ∑_{i=1}^{n} w_π(i) · T_π(i).   (1)

The problem of minimizing the cost of tardiness is NP-hard (Lawler [23] and Lenstra et al. [22]). Many papers devoted to researching this problem have been published. Emmons [24] introduced a partial order relation on the set of tasks, thus limiting the search for optimal solutions to some subset of the set of solutions. These properties are used in the best metaheuristic algorithms. Optimal algorithms (based on the dynamic programming method or branch and bound) were published by Rinnooy Kan et al. [25], Potts and Van Wassenhove [26] and Wodecki [1]; some of them were presented in the review of Abdul-Razaq et al. [27]. They are, however, not very useful in solving most instances found in practice, because the calculation time increases exponentially with the number of tasks. Hence, within a reasonable time one can solve only instances with the number of tasks not exceeding 50 (80 with the use of a parallel algorithm [1]). There is extensive literature devoted to algorithms determining approximate solutions within acceptable time. The methods on which the constructions of these algorithms are based can be divided into construction and correction methods.
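For clarity, the criterion (1) can be evaluated in O(n) for a given permutation. The sketch below (function name and 0-based job indexing are ours, not from the paper) accumulates completion times along the permutation:

```python
def total_weighted_tardiness(perm, p, w, d):
    """F(pi) = sum over i of w[pi(i)] * max(0, C[pi(i)] - d[pi(i)])."""
    t = 0      # completion time of the current task
    cost = 0   # accumulated weighted tardiness
    for job in perm:
        t += p[job]
        cost += w[job] * max(0, t - d[job])
    return cost
```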
Construction algorithms usually have low computational complexity. However, the solutions they determine may differ significantly (even by several hundred percent) from the optimal ones. The construction algorithms most commonly used in solving the TWT problem are presented in the works of Fischer [28], Morton and Pentico [29], and in the review by Potts and Van Wassenhove [30].
In the correction algorithms, we start with a solution (or a set of solutions) and try to improve it by local search. The solution obtained in this way is the starting point for the next iteration of the algorithm. The best-known implementations of the correction method for the TWT problem are metaheuristics: tabu search (Crauwels et al. [31]), simulated annealing (Potts and Van Wassenhove [30], Matsuo et al. [32]), a genetic algorithm (Crauwels et al. [31]) and an ant algorithm (Den Besten et al. [33,34]). A very interesting and effective implementation was also presented in the work of Congram et al. [35] and then developed by Grosso et al. [36]. Its main advantage is a procedure that browses a neighborhood with an exponential number of elements in polynomial time.

Definitions and Properties of the Problem
For a permutation π ∈ Φ, C_π(i) = ∑_{j=1}^{i} p_π(j) is the completion time of task π(i) (i = 1, 2, . . . , n) in π. The task π(i) is early if its completion time is not greater than its requested completion time (i.e., C_π(i) ≤ d_π(i)) and late (tardy) if this time is greater than the requested completion time, i.e., C_π(i) > d_π(i).
First, we introduce certain methods of aggregating tasks for generating blocks. In any permutation π ∈ Φ there are subpermutations (subsequences of consecutive tasks) for which: (1) the execution of each task from the subpermutation ends before its due date (all tasks are early), or (2) the execution of each task from the subpermutation ends after its due date (all tasks are tardy).

In the following, we present two types of blocks: blocks of early tasks and blocks of tardy tasks. They will be used to eliminate worse solutions.

Blocks of Tasks
This section briefly introduces the definitions and properties of blocks and algorithms for their determination. They were described in detail in the work by Wodecki [1].
Blocks of early tasks. A subpermutation of tasks π_T in permutation π ∈ Φ is a T-block if: (a) each task j ∈ π_T is early and d_j ≥ C_last, where C_last is the completion time of the last task in π_T, (b) π_T is a maximal subpermutation (in the sense of the number of elements) satisfying restriction (a).
It is easy to see that if π_T is a T-block, then the inequality min{d_j : j ∈ π_T} ≥ C_last is satisfied. Therefore, after any reordering of the elements of π_T, every task of π_T in permutation π remains early. Using this property, we present an algorithm determining the first T-block in permutation π.
The input of Algorithm 1 is a permutation π, and the output is the first T-block of this permutation. In line 1 the first on-time (early) job is determined. Next, in lines 4-7, it is checked whether adding the next job to the block preserves the following property: in any permutation of these tasks, all tasks are on time. The computational complexity of the algorithm is O(n).
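A minimal sketch of Algorithm 1 (determining the first T-block) following the definition above; the function name, the returned (start, end) pair and the 0-based indexing are our conventions, not the paper's:

```python
def first_t_block(perm, p, d):
    """Return (start, end) positions of the first T-block of perm, or None.

    A T-block is a maximal run of early tasks whose minimal due date is not
    smaller than the completion time of the block's last task, so the tasks
    stay early under any reordering inside the block.
    """
    n, t = len(perm), 0
    start = None
    for i, job in enumerate(perm):       # line 1: find the first early job
        t += p[job]
        if t <= d[job]:
            start = i
            break
    if start is None:
        return None
    c_last, min_d, end = t, d[perm[start]], start
    for j in range(start + 1, n):        # lines 4-7: extend while the property holds
        c_new = c_last + p[perm[j]]
        if min(min_d, d[perm[j]]) >= c_new:
            c_last, min_d, end = c_new, min(min_d, d[perm[j]]), j
        else:
            break
    return start, end
```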
It is easy to see that in any permutation of the elements of π_D, each task belonging to π_D remains tardy in permutation π.

Similarly as for the T-block, following the above definition, we present an algorithm determining the first D-block in permutation π.

The input of Algorithm 2 is a permutation π, and the output is the first D-block of this permutation. In line 1 the first tardy job is determined. Next, in lines 5-8, it is checked whether adding the next tardy job to the block preserves the following property: in any permutation of these tasks, all tasks are tardy. The computational complexity of the algorithm is O(n).
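Analogously, a sketch of Algorithm 2 (determining the first D-block). We assume here, consistently with the property quoted above, that a task may join the block only if it would be tardy even when scheduled first within the block; names and indexing are ours:

```python
def first_d_block(perm, p, d):
    """Return (start, end) positions of the first D-block of perm, or None.

    Assumption: a D-block is a maximal run of tasks each of which is tardy
    in every reordering of the block, i.e., tardy even when placed first.
    """
    n, t = len(perm), 0
    start = None
    for i, job in enumerate(perm):       # line 1: find the first tardy job
        if t + p[job] > d[job]:
            start = i
            break
        t += p[job]
    if start is None:
        return None
    c_before, end = t, start             # completion time just before the block
    for j in range(start + 1, n):        # lines 5-8: extend while tasks stay tardy
        if c_before + p[perm[j]] > d[perm[j]]:
            end = j
        else:
            break
    return start, end
```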
Property 1. From the block definitions and from Theorem 1:

1. Each task belongs to a certain T-block or D-block.
2. Two T-blocks or D-blocks can appear directly next to each other.
3. A block can contain only one task.
4. The partition of a permutation into blocks is not unique.

According to the block definitions and Theorem 1: if π_D is a D-block in permutation π, then every task of π_D remains tardy in any reordering of π_D, so the cost of performing each such task is a linear function of its completion time. It follows from Smith's theorem [37] that the tasks of the subpermutation π_D = (π(a), π(a + 1), . . . , π(b)) occur in an optimal order if and only if

w_π(i−1)/p_π(i−1) ≥ w_π(i)/p_π(i),   i = a + 1, . . . , b.   (3)

A permutation π ∈ Φ is ordered (in short, D-OPT) with respect to the partition into blocks if in each D-block each pair of neighboring tasks satisfies relation (3), i.e., the tasks appear in the optimal order.

Theorem 2 ([1]). A change in the order of tasks within any block of a D-OPT permutation does not generate a permutation with a smaller value of the criterion function.
From the above statement follows the so-called block elimination property. It will be used while generating neighborhoods.
Corollary 1 (block elimination property). Let π ∈ Φ be a D-OPT permutation. If a permutation β ∈ Φ obtained from π satisfies F(β) < F(π), then in permutation β at least one task of some block from the partition of π was moved before the first or after the last task of this block.
Therefore, when generating new solutions to the TWT problem from a D-OPT permutation π ∈ Φ, we will only move an element of a block before the first or after the last element of this block.
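Relation (3) is Smith's weighted-shortest-processing-time rule restricted to a block: since all tasks of a D-block stay tardy, sorting them by non-increasing w/p ratio is optimal. A D-OPT ordering inside a D-block can therefore be obtained or verified as below (helper names are ours):

```python
def order_d_block(block, p, w):
    # Smith's rule: inside a block of tasks that remain tardy, the weighted
    # tardiness is linear in the completion times, so sorting the tasks by
    # non-increasing w/p ratio is optimal (relation (3)).
    return sorted(block, key=lambda j: w[j] / p[j], reverse=True)

def is_d_opt_block(block, p, w):
    # check relation (3) for each pair of neighboring tasks
    ratios = [w[j] / p[j] for j in block]
    return all(ratios[i - 1] >= ratios[i] for i in range(1, len(ratios)))
```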

Moves and Neighborhoods
The essential element of approximate algorithms solving NP-hard optimization problems based on the local search method is the neighborhood: a mapping N : Φ → 2^Φ assigning to each permutation a subset of the solution set. The number of elements of the neighborhood and the method of their determination and browsing have a decisive impact on the efficiency (calculation time and criterion values) of an algorithm based on the local search method. Classic neighborhoods are generated by transformations commonly known as moves, i.e., "minor" changes of certain permutation elements, consisting of:

1. Swapping the positions of two elements in the permutation: the swap move s^k_l exchanges the elements π(k) and π(l) (at positions k and l in π, respectively), generating the permutation s^k_l(π) = π^k_l. In short, it will be called an s-move. The computational complexity of executing an s-move is O(1).

2. Moving an element of the permutation to a different position: the insert move i^k_l moves the element π(k) (from position k in π) to position l, generating the permutation i^k_l(π) = π^k_l. The insert-type move will be abbreviated as an i-move. Its computational complexity is O(1).
One way to determine the neighborhood of a permutation is to define the set of moves that generates it. If M(π) is a certain set of moves specified for permutation π ∈ Φ, then N(π) = {m(π) : m ∈ M(π)} is the neighborhood of π generated by the moves from M(π).
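Both move types can be sketched directly (0-based positions and function names are our conventions):

```python
def s_move(pi, k, l):
    # swap move s^k_l: exchange the elements at positions k and l
    out = list(pi)
    out[k], out[l] = out[l], out[k]
    return out

def i_move(pi, k, l):
    # insert move i^k_l: remove the element at position k and reinsert it
    # at position l
    out = list(pi)
    job = out.pop(k)
    out.insert(l, job)
    return out
```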
In each iteration of an algorithm based on a local search method, a subset of the solution set, a neighborhood, is determined using the move generator. Let B be a partition of an ordered (D-OPT) permutation π into blocks.

We consider a task π(j) belonging to a certain block B_k of the partition B. Moves that can improve the criterion value consist of moving the task π(j) before the first or after the last task of this block. Let M^j_bf and M^j_af be the sets of these moves (i.e., respectively, all such i-moves and s-moves). These sets are shown symbolically in Figure 1. The task π(a_k) is the first, whereas π(b_k) is the last element of the block B_k which contains the considered task π(j).
Let M(π) = ⋃_{j=1}^{n} (M^j_bf ∪ M^j_af) be the set of all moves that can bring an improvement (see Corollary 1), i.e., moves before or after the blocks of a permutation π. Some properties of i-moves and s-moves were proven which can be used to determine sub-neighborhoods. These are both elimination criteria and procedures for determining sets of moves and their representatives.

Properties of Insert Moves
A move m* ∈ M(π) is a representative of a certain set of moves W ⊂ M(π) if ∀r ∈ W, F(r(π)) ≥ F(m*(π)).
Let us assume that the permutation π ∈ Φ is D-OPT and the neighborhood is generated by insert-type moves (i-moves). If i^k_l is a representative of a set of moves M_k ⊂ M(π) (i^k_l ∈ M_k), then the moves belonging to M_k are removed from the sets M^k_bf and M^k_af, i.e., the sets are modified accordingly. This procedure makes it possible, in the process of generating the neighborhood, to omit elements that do not directly improve the value of the criterion.

Theorem 3. If the task π(k), after being moved to position l (after the move i^k_l), 1 ≤ k < l ≤ n, is early in the resulting permutation (i.e., C_π(l) ≤ d_π(k)), then for the pair of moves i^k_{l−1}, i^k_l we have F(i^k_l(π)) ≤ F(i^k_{l−1}(π)).

Proof. Let us assume that the task π(k), after being moved to position l (i.e., after executing the move i^k_l), is early in permutation π^k_l. We consider two moves: i^k_l and i^k_{l−1}. In the permutations π^k_l and π^k_{l−1} generated by these moves we have: π^k_l(j) = π^k_{l−1}(j) for j = 1, 2, . . . , l − 2, l + 1, . . . , n; π^k_{l−1}(l − 1) = π^k_l(l) = π(k); π^k_l(l − 1) = π^k_{l−1}(l) = π(l). We present both permutations.
Therefore, the move i^k_{τ(k)} is a representative of the set {i^k_1, i^k_2, . . . , i^k_{τ(k)−1}}; hence these moves can be omitted by modifying the sets accordingly.

Corollary 3. Let π(a) and π(b) be the first and the last task of a T-block in permutation π. If 1 ≤ k < a and a ≤ τ(k) ≤ b, where the parameter τ(k) is defined in (6), then the moves i^k_l, l = τ(k) + 1, τ(k) + 2, . . . , b, can be omitted.

Proof. The move i^k_l, l = τ(k) + 1, τ(k) + 2, . . . , b, generates a permutation π^k_l in which the task π(k) is tardy. Using Theorem 4, it is easy to show that these moves bring no improvement.

The computational complexity of the algorithms checking each of the proven properties (Corollaries 2-3) is O(n).
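Under the observation that moving π(k) to a position l > k gives it the completion time C_π(l), the largest position τ(k) at which π(k) remains early can be found by a binary search over the prefix completion times. The sketch below uses our own naming and 0-based indexing:

```python
import bisect

def tau(perm, p, d, k):
    # prefix completion times: C[l] = p[perm[0]] + ... + p[perm[l]]
    C, t = [], 0
    for job in perm:
        t += p[job]
        C.append(t)
    # after the move i^k_l (k < l) the task perm[k] completes at time C[l],
    # so tau(k) is the largest l with C[l] <= d[perm[k]]
    due = d[perm[k]]
    return bisect.bisect_right(C, due) - 1
```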

Properties of Swap Moves
To eliminate some s-moves from neighborhoods, blocks and elimination criteria will be used.
A partial order relation '→' is introduced on the set of tasks J. Let Γ−(i) and Γ+(i) be, respectively, the sets of predecessors and successors of task i ∈ J in the relation →. Properties enabling the determination of the elements of the relation → are called elimination criteria in the literature.

Theorem 5. If one of the conditions is met:

(a) p_r ≤ p_j, w_r ≥ w_j, d_r ≤ d_j, or one of the conditions (b), (c),

then there is an optimal solution in which the task r precedes the task j, i.e., r → j.

Proof. Condition (a) was proved by Shwimer [38]. Conditions (b) and (c) are a generalized version of Theorems 1 and 2 from Emmons' work [24].
After establishing a new relation between tasks i and j, to avoid a cycle, the transitive closure of the relation should be computed, i.e., the sets Γ−, Γ+ should be modified accordingly. Using the relation → (by means of Theorem 5), some elements will be removed from the set of s-moves generating the neighborhood.

Corollary 4. Any s-move s^k_l whose execution would violate the relation → can be removed from the set of moves generating the neighborhood.

Proof. It follows directly from Theorem 5.

Construction of Algorithm
The local search algorithm starts from some startup solution. Then its neighborhood is generated and a certain element is selected from it, which is taken as the starting solution for the next iteration. Thus, the solution space is browsed by "moving" from one element to another. This process continues until a certain stop criterion is met. In this way, a sequence (trajectory) of solutions is created, of which the best element is the result of the algorithm's run.
One of the deterministic and most commonly used implementations of the local search method is tabu search (TS for short). Its main ideas were presented by Glover in the works [39,40] and in the monograph by Glover and Laguna [41]. To avoid "looping" (going back to the same solution), a short-term memory mechanism, the tabu list (of prohibited solutions or moves), is introduced. After performing a move, which determines the starting solution for the next iteration, its attributes are remembered on the list. Newly generated neighborhood solutions whose attributes are on the list are omitted, except for those meeting the so-called aspiration criterion (i.e., "exceptionally favorable" ones). The basic elements of this method are: a startup solution x ∈ X, a tabu list LT, and a selection criterion Ψ (i.e., a function enabling comparison of the elements of the neighborhood), usually the goal function F.
The computational complexity of a single iteration of the algorithm depends on the number of elements of the neighborhood, the procedure generating its elements and the complexity of the function calculating the criterion value. A detailed description of the implementation of this algorithm for a single-machine task scheduling problem is presented in the work of Bożejko et al. [42].
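The scheme described above can be condensed into a generic sketch. This is a simplified illustration, not the paper's algorithm: here whole solutions serve as tabu attributes, whereas the paper's algorithm stores move attributes (see the tabu list description later in the text):

```python
from collections import deque

def tabu_search(start, neighborhood, F, max_iter=100, tabu_len=7):
    best = current = start
    tabu = deque(maxlen=tabu_len)        # short-term FIFO memory
    for _ in range(max_iter):
        # keep solutions that are not tabu or satisfy the aspiration criterion
        cands = [s for s in neighborhood(current)
                 if tuple(s) not in tabu or F(s) < F(best)]
        if not cands:
            break
        current = min(cands, key=F)      # selection criterion: goal function
        tabu.append(tuple(current))
        if F(current) < F(best):
            best = current
    return best
```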

Construction Algorithms
Most construction algorithms are quick and simple to implement; unfortunately, the solutions they determine are usually "far from optimal". Hence, they are almost exclusively used to determine "good" startup solutions for other algorithms. In the case of the TWT problem, the following algorithms (or their simple modifications and various hybrids) have been used for years: SWPT, EDD, COVERT and AU (see [30]).
The first two have a static priority function and a computational complexity of O(n log n), whereas the other two use a dynamic priority function and their computational complexity is O(n^2).
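The two static rules can be sketched in a few lines. The dynamic COVERT and AU rules depend on the current time and are omitted here; function names and the exact tie-breaking are our assumptions:

```python
def edd(jobs, d):
    # Earliest Due Date: static priority, sort by non-decreasing due date
    return sorted(jobs, key=lambda j: d[j])

def swpt(jobs, p, w):
    # Shortest Weighted Processing Time: sort by non-decreasing p/w ratio
    return sorted(jobs, key=lambda j: p[j] / w[j])
```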

Tabu Search Algorithm
The problem considered in this work was solved using a tabu search algorithm. Below, we describe each element of the algorithm in more detail.
Neighborhood. The TS algorithm uses a neighborhood generated by swap and insert moves. Let B be a partition of a D-OPT permutation π into blocks. Using the block elimination properties for i-moves, we determine (Corollaries 2-3) the sets of moves before and after each block. Next:

1. According to Corollaries 2 and 3, we remove from M(π) some subsets of i-moves, leaving only their representatives.
2. We remove the s-moves whose execution would violate one of the conditions of Theorem 5.

Therefore, the neighborhood of the permutation π is N(π) = {m^k_l(π) : m^k_l ∈ M(π)}.
The procedure determining the neighborhood has a complexity of O(n^2).

Startup solution. For each instance, the startup solutions of the TS algorithm were determined by the best construction algorithms: SWPT, EDD, COVERT, AU and META. Their description is presented in Section 5.1.
Stop condition. The stop condition of both algorithms was reaching the maximum number of iterations. In the parallel implementation of the TS algorithm, a strategy of many computing processes working in parallel was used. The maximum number of iterations for each processor was 1000/p, where p is the number of processors.
Tabu list in the TS algorithm. To prevent cycles from forming too quickly (i.e., a return to the same permutation after a small number of iterations of the algorithm), some attributes of each move are remembered on the so-called tabu list (abbreviated LT). It is maintained as a FIFO queue. When making a move m^r_j ∈ M(π) (i.e., generating the permutation π^r_j from π ∈ Φ), we write some attributes of this move on the list, namely the triple (π(r), j, F(π^r_j)). Suppose we are considering a move m^k_l ∈ M(β) generating the permutation β^k_l. If the list LT contains a triple (r, j, Y) such that β(k) = r, l = j and F(β^k_l) ≥ Y, then this move is eliminated (removed) from the set M(β).
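The tabu list described above can be sketched as a fixed-length FIFO of move-attribute triples (the class name is ours):

```python
from collections import deque

class TabuList:
    """FIFO list of move attributes (job, target_position, criterion_value)."""

    def __init__(self, length=7):
        self._items = deque(maxlen=length)   # the oldest triple drops out first

    def add(self, job, pos, value):
        self._items.append((job, pos, value))

    def forbids(self, job, pos, value):
        # the move is tabu if a matching triple (r, j, Y) is on the list
        # and the move does not improve on the recorded value Y
        return any(r == job and j == pos and value >= y
                   for (r, j, y) in self._items)
```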

Parallelization of Algorithms
The proposed tabu search algorithm (in short, TSA) was parallelized using the MPI library according to the scheme of independent search processes with different starting points (MPSS in Voß's classification [19]). On the cluster platform, parallel non-cooperating processes were implemented with a mechanism for diversifying startup solutions based on the Scatter Search idea. Each of the processors modified the solution generated by the META algorithm by performing a certain number of swap moves proportional to the size of the problem and the number of the processor. A parallel reduction mechanism (MPI_Bcast) was used to collect the results. The pseudocode of the parallel MWPTSA (Multiple-Walk Parallel Tabu Search Algorithm) is given in Algorithm 3, and its scheme in Figure 2. In each iteration, process i generates the neighborhood N(π_i) taking into consideration the tabu list LT_i as well as the aspiration criterion, and adds the move attributes to the local tabu list LT_i, until the stop criterion (achieving a solution with the assumed or better value of the cost function) is met; finally, the solutions π*_i, i = 1, 2, . . . , p, are reduced to the best one using a tree-based parallel calculation scheme, and Θ* is returned.
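The overall multiple-walk scheme, independent searches from diversified starting points followed by a reduction to the best result, can be sketched as below. This is an illustration only: a thread pool stands in for the MPI processes, and the perturbation is a simplified stand-in for the paper's Scatter-Search-style diversification:

```python
from concurrent.futures import ThreadPoolExecutor

def diversified_start(base, rank):
    # process `rank` perturbs the common startup solution with a number of
    # swap moves proportional to its rank
    s = list(base)
    n = len(s)
    for k in range(rank):
        i, j = k % n, (k + 1) % n
        s[i], s[j] = s[j], s[i]
    return s

def parallel_multistart(base, search, F, p=4):
    starts = [diversified_start(base, r) for r in range(p)]
    with ThreadPoolExecutor(max_workers=p) as ex:
        results = list(ex.map(search, starts))   # independent walks
    return min(results, key=F)                   # reduction to the best solution
```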

Computational Experiments
The parallel tabu search algorithm MWPTSA was implemented in C++ using the MPI library. The calculations were made on the BEM cluster installed in the Wrocław Centre for Networking and Supercomputing. Parallel computing tasks were run on Intel Xeon E5-2670 2.30 GHz processors under the control of the PBS queue system.
Examples of test data of various sizes and degrees of difficulty, on which the calculations were made, were divided into two groups: (a) the first group includes 375 instances of three different sizes (n = 40, 50, 100); together with the best-known solutions, they are available on the OR-Library website [44];
(b) the second group, embracing test instances for n = 200, 500, 1000, was generated in the following way:

- p_i - a random integer from the range [1, 100] with uniform distribution,
- w_i - a random integer from the range [1, 10] with uniform distribution,
- d_i - a random integer from the range [P(1 − FT − R_DD/2), P(1 − FT + R_DD/2)] with uniform distribution,

where P = ∑_{i=1}^{n} p_i, R_DD = 0.2, 0.4, 0.6, 0.8, 1.0 (relative range of due dates) and FT = 0.2, 0.4, 0.6, 0.8, 1.0 (average tardiness factor). For each of the 25 pairs of values of R_DD and FT, five instances were generated. Overall, 375 instances were generated, 125 for each value of n. The test instances were published on the web page [45].
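The generation scheme for group (b) can be sketched as follows. The due-date window follows the R_DD/FT scheme described above; clamping the lower bound to zero and the integer rounding are our assumptions:

```python
import random

def generate_instance(n, rdd, ft, seed=0):
    rng = random.Random(seed)
    p = [rng.randint(1, 100) for _ in range(n)]   # processing times
    w = [rng.randint(1, 10) for _ in range(n)]    # tardiness weights
    P = sum(p)
    lo = max(0, int(P * (1 - ft - rdd / 2)))      # due-date window bounds
    hi = max(lo, int(P * (1 - ft + rdd / 2)))
    d = [rng.randint(lo, hi) for _ in range(n)]   # due dates
    return p, w, d
```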
Computational experiments with the algorithms presented in the work were carried out in several stages. First, the construction algorithms SWPT, EDD, COVERT, AU and META, presented in Section 5.1, were compared. Calculations were made on the instances from group (a). The obtained results (percentage relative error (11)) are presented in Table 1, compared to the reference solutions (and benchmark data) taken from the OR-Library [44]. The best solutions were determined by the META algorithm [30]; its solutions are the starting points of the TSA algorithm. When running the algorithms on multiple processors, each of them started from a different initial solution. The diversification of initial solutions was achieved by running the sequential algorithm with a small number of iterations, 10i, i = 1, 2, . . . , p, where p is the number of processors.
To determine the parameter values of the approximation algorithms, preliminary calculations were made on a small number of randomly selected instances. Based on the analysis of the results obtained, the following settings were adopted:

- length of the tabu list: 7,
- length of the long-term memory list: 5,
- algorithm determining the startup solution: META,
- stop condition: computation time t = 120 s.
The solution of the parallel algorithm was the best solution obtained by the individual processors. For each solution, the percentage relative error was determined:

PRD = 100% · (F_alg − F_ref)/F_ref,   (11)

where F_ref is the value of the reference solution obtained with the META algorithm and F_alg is the value of the solution determined by the tested algorithm.

Table 2 contains PRD improvement values relative to the solution obtained by the META algorithm, which is also the starting solution for TSA (that is why they are negative), for the bigger instances (n = 200, 500 and 1000). Table 2. Computational results for the sequential tabu search algorithm (t = 120 s, META as reference).

Table 3 contains PRD improvement values relative to the solution obtained by the META algorithm, which is also the starting solution for the parallel TSA (that is why they are negative), for the bigger instances (n = 200, 500 and 1000). Experiments were conducted for the numbers of processors p = 8, 16, 32 and 64. One can observe that increasing the number of processors improves the quality of the obtained solutions, especially for the algorithm with blocks. From the results presented in the table, it can be seen that high speedups are obtained for problems of smaller size (s_TSb values over 60,000 for n = 40). The block algorithm gives better acceleration (on average more than 2600 times better). Such results probably stem from the fact that the size of the problem's solution space (40!, 50!, 100!, 200!, 500!, 1000!) grows much faster than the number of processors (8, 16, 32, 64). The increase in speedup within a single problem size is also not large, as shown above. The large disproportion between the speedups of the algorithms with and without blocks results from the fact that the block algorithm strongly limits the space of solutions being browsed.

As one can see, increasing the size of the problem reduces the value of the speedup obtained, which is probably due to the limited computation time relative to the larger solution space to be searched; however, increasing the number of processors does not have a big impact on the speedup values, even when the size of the problem increases.

Conclusions
Based on the problem analysis and the computational experiments performed, we can draw the following conclusions. The tabu search method using block properties allows solving the problem not only faster than the classic tabu search metaheuristic without these properties, but also with huge speedups. The speedups achieved by the parallel tabu search algorithm with blocks are much greater than those of the classic tabu search algorithm, which confirms the legitimacy of using the block elimination properties. The fact of achieving such huge speedups should be the subject of further research. The proposed block elimination criteria can also be used to construct efficient parallel algorithms solving other NP-hard scheduling problems.

Data Availability Statement: Test instances for the single-machine total weighted tardiness scheduling problem generated during the study are available online: https://zasobynauki.pl/zasoby/51561 (accessed on 25 February 2021).