A Dual Heterogeneous Island Genetic Algorithm for Solving Large Size Flexible Flow Shop Scheduling Problems on Hybrid multi-core CPU and GPU Platforms

The flexible flow shop scheduling problem is NP-hard, and finding optimal or even adequate solutions requires significant resolution time for large size instances. This paper therefore proposes a dual island genetic algorithm consisting of a parallel cellular model and a parallel pseudo model. This two-level parallelization is highly consistent with the underlying architecture and is well suited for parallelizing inside or between GPUs and a multi-core CPU. At the higher level, the efficiency of island GAs is improved by exploring new regions of the search space with different methods. At the lower level, the cellular model maintains population diversity through decentralization, and the pseudo model enhances the search ability through the complementary parent strategy. To encourage information sharing between islands, a penetration-inspired migration policy is designed that sets the topology, the rate, the interval and the strategy adaptively. Finally, the proposed method is tested on large size flexible flow shop scheduling instances and compared with other parallel algorithms. The computational results show that it not only obtains competitive results but also reduces execution time.


Introduction
The Flexible Flow Shop scheduling problem (FFS) focuses on improving machine utilization and reducing makespan. Some works on small size FFS instances rely on exact methods [1][2] to find optimal solutions. However, conventional optimization techniques often fail in industrial applications because real-world problem sizes are much larger and the computational cost grows accordingly. Therefore, there is a growing interest in developing heuristic methods to solve large, complex FFS problems [3][4]. Although these approaches cannot guarantee optimal solutions, there is a sizable probability that an adequate solution is found in a reasonable time.
The Genetic Algorithm (GA) is one of the most widely known heuristic methods and one of the best approaches for solving FFS problems. However, when GAs are applied to large or complex problems, there is a conflict between searching for better solutions and limiting execution time. In contrast to classical GAs, the island GA [5] divides the population into a few relatively large subpopulations. Each of them works as an island and is free to converge towards its own sub-optimum. At some points, a migration operator is used to exchange individuals among islands. This imitates nature more closely and increases search diversification [6]. Furthermore, it is one of the best-known models for exploiting parallelism in GAs. Nevertheless, because every island uses the same genetic operator configuration, island GAs are prone to premature convergence. Moreover, the design has to take the underlying architecture into account for a parallel implementation.
With the unprecedented evolution of GPUs and multi-core CPUs, almost all modern computers are equipped with both. Some comparisons of their performance for GA applications have been discussed [7], but cooperation between the two in this domain has rarely been addressed. These facts have motivated the design of a heterogeneous island GA that keeps better population diversity and is well suited for parallelization on GPUs and a multi-core CPU. In this paper, we address this design and its application to a large size FFS problem. Specifically, the contributions of our work are summarized as follows: 1. a dual heterogeneous island model is proposed in which the 2D variable space of the cellular GA and the complementary parent strategy of the pseudo GA keep the population diversity; 2. a two-level parallelization highly consistent with the underlying architecture is implemented that is well suited for parallelizing inside or between GPUs and a multi-core CPU; 3. a penetration-inspired migration policy is designed so that good individuals are shared effectively by setting the topology, the rate, the interval and the strategy adaptively.
The remaining sections of this paper are organized as follows. Section 2 introduces related works. Section 3 describes the research problem. Section 4 presents the design of the dual heterogeneous island GA on hybrid multi-core CPU and GPU platforms. Section 5 presents the numerical experiments and result analysis. Finally, Section 6 states the conclusions.

Related Works
When the population size is N and there are n islands, only N/n individuals work with GA operators in one island. Moreover, the selection and the elitist strategy in GAs decrease the subpopulation diversity in one island after several generations. Although migration at some points can help create new individuals, its influence is restricted because the GA operators in each island function in the same way. What is worse, an inappropriate implementation of the migration mechanism may cause genetic drift and lead to convergence toward a local optimum. One approach for dealing with this problem is the heterogeneous island GA, which distinguishes subpopulations by giving them different configurations. Herrera et al. [8] presented the gradual distributed real-coded GA that applied different crossover operators to different subpopulations. Alba et al. [9] carried out the actual parallelization of the gradual distributed real-coded GA on a cluster of 8 homogeneous PCs. In [10], Miki et al. designed a parallel GA using nCUBE-2E where different islands had different parameter settings. Although these heterogeneous algorithms have improved solution quality, the implementation is usually executed on a homogeneous architecture or even on a single processor. In these cases, different islands can work in parallel, but GA operations inside one island are executed sequentially.
In addition to proposing heterogeneous island GAs, some works have evaluated the performance of heterogeneous computing architectures for island GAs. In [11], a homogeneous island GA was run at the same time on different types of machines and obtained super-linear speedup. García-Sánchez et al. [12] studied the benefits of setting the subpopulation sizes according to each heterogeneous node's computational power. García-Valdez et al. [13] tested the randomized parameter setting strategy for heterogeneous workers in pool-based GAs. Despite promising results from leveraging the computational capabilities of a heterogeneous cluster, these methods must face some common challenges such as lost connections, low bandwidth, abandoned work, security and privacy. Moreover, the above-mentioned designs generally have difficulty exploiting the computation capability of GPUs or of heterogeneous environments that mix multi-core and many-core processors.
Since the cooperation between GPUs and a multi-core CPU is stable and secure, some efforts have sought to use both and exploit their compute capabilities to the fullest. Dabah et al. [14] proposed 5 accelerated branch and bound algorithms for solving the blocking job shop scheduling problem, two of which hybridized the multi-core CPU approach and the GPU-based parallelization approach. Benner et al. [15] discussed a hybrid Lyapunov solver in which the intensive parts of the computation were accelerated using GPUs while the remaining operations were executed on a multi-core CPU. In [16], Bilel et al. introduced a CPU-GPU co-simulation framework where synchronization and experiment design were performed on the CPU and the nodes' processes were executed in parallel on GPUs. These studies have confirmed the interest of designing a scheme that exploits GPUs and a multi-core CPU efficiently. However, simultaneous parallelization on both sides and its implementation for island GAs have not yet been addressed.
Several studies have applied island GAs to shop scheduling problems, either to improve solution quality [17][18] or to decrease the execution time [19][20]. However, to the best of our knowledge, none of them has so far considered heterogeneous island GAs parallelized on GPUs and a multi-core CPU. All the above-mentioned efforts provide us with a starting point for designing a dual heterogeneous island GA that keeps better population diversity and is well suited for parallelization on hybrid multi-core CPU and GPU platforms.

Problem Definition
The FFS is a multistage production process, as illustrated in Figure 1. An instance of the FFS problem considers a set of $J$ jobs ($1 \le j \le J$). Each of them consists of a set of $S$ stages ($2 \le s \le S$). At every stage, there is a set of $M_s$ machines ($1 \le m \le M_s$), and at least one stage has more than one machine. All jobs need to go through all stages in the same order, and only one machine is selected for processing at each stage. There is no precedence between operations of different jobs, but there is precedence among operations due to the jobs' processing cycles. Preemptive operations are not allowed.
A feasible solution is described by the jobs' sequence on the target machines $M_{js}$. The processing time of job $j$ at stage $s$ on machine $m$ is denoted $P_{jsm}$. Usually, it is known together with the release time $R_j$ and the due time $D_j$. The objective, to minimize the total tardiness and the makespan, is represented by $WT \cdot \sum T_j + C_{\max}$ using the classification scheme of Bruzzone et al. [21], where $WT$ indicates the priority of the first objective. Since this is a minimization problem, the fitness function of an individual is derived from the objective function as $\max\left(E_{\max} - \left(WT \cdot \sum T_j + C_{\max}\right),\ 0\right)$, where $E_{\max}$ is the estimated maximum value of the objective function. The FFS problem is NP-hard and is thus difficult to solve [22]. When dealing with large size instances, it requires huge resolution time to find optimal or even adequate solutions.
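The objective and the fitness transformation described above can be illustrated with a minimal Python sketch. It assumes the usual definition of tardiness, T_j = max(C_j − D_j, 0), where C_j is the completion time of job j at the last stage; this definition, the function names and the sample values are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def objective(completion: Sequence[float], due: Sequence[float], wt: float) -> float:
    """Return WT * sum(T_j) + C_max for one schedule.

    Assumption: T_j = max(C_j - D_j, 0), with C_j the completion time
    of job j at the last stage.
    """
    if len(completion) != len(due):
        raise ValueError("completion and due must have the same length")
    total_tardiness = sum(max(c - d, 0.0) for c, d in zip(completion, due))
    makespan = max(completion)  # C_max: completion time of the last job
    return wt * total_tardiness + makespan


def fitness(completion: Sequence[float], due: Sequence[float],
            wt: float, e_max: float) -> float:
    """Return max(E_max - objective, 0), so that larger values are better."""
    return max(e_max - objective(completion, due, wt), 0.0)


# Hypothetical three-job schedule: tardiness is 0 + 2 + 2 = 4, makespan is 20.
example_objective = objective([10, 14, 20], [12, 12, 18], wt=2.0)
example_fitness = fitness([10, 14, 20], [12, 12, 18], wt=2.0, e_max=100.0)
```

Clamping at zero keeps the fitness non-negative even if the estimate E_max turns out to be smaller than the objective value of a poor individual.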

Dual Heterogeneous Island Strategy
The general framework of the proposed dual heterogeneous island strategy is shown in Fig. 2. Each island holds the same number of individuals: island A runs the cellular GA [23] and island B runs the pseudo GA [24]. Moreover, in addition to the parallelization on GPUs or on a multi-core CPU at the lower level, the GPUs and the multi-core CPU can work concurrently at the higher level to make maximal use of the computing resources.
• High consistency with the proposed GA: The cellular GA maps individuals on a 2D grid and the CUDA threads are grouped into 2D blocks that are organized in a 2D grid, using the local memory, the shared memory and the global memory respectively [26]. Thus, the cellular GA can be entirely parallelized on GPUs. On the other hand, only the crossover, the fitness evaluation and the replacement are kept in the pseudo GA. The crossover is performed between fixed complementary parents. The fitness evaluations of individuals are independent. Since no global information is required, all for loops in the above two steps can be easily handled on a multi-core CPU in parallel.
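Because each fitness evaluation depends only on its own individual, the evaluation loop maps directly onto a worker pool. The sketch below is only an illustration of that independence; the function name `evaluate_population` and the placeholder fitness are assumptions. It uses Python threads for portability, which do not give a real speedup for CPU-bound work in CPython, so a native implementation would more likely rely on OpenMP-style parallel loops or a process pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def evaluate_population(population: Sequence[T],
                        fitness_fn: Callable[[T], float],
                        workers: int = 4) -> List[float]:
    """Evaluate every individual independently on a pool of workers.

    No individual needs information about another one, so the loop has
    no cross-iteration dependency. Executor.map returns results in the
    same order as the input population.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness_fn, population))


# Hypothetical usage with a placeholder fitness (sum of the genes).
scores = evaluate_population([[3, 1, 2], [2, 2, 5], [0, 0, 1]],
                             lambda genes: float(sum(genes)))
```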
As the texture caches of CUDA are designed to gain an increase in performance for accelerating access patterns with spatial locality [27], we design the neighborhood area of the cellular GA as shown in

Migration Policy
The migration between islands is controlled by the topology, the rate, the interval and the strategy. To decrease the number of parameters that need to be set manually, we develop a migration policy inspired by the penetration theory [11], in which a migration threshold value θ is set (0 ≤ θ ≤ 1). Whether the migration is executed is decided by this value, and individuals are most likely to migrate when θ = 1. Moreover, the migration rate α and the migration direction indicator β are formulated as in equation (1) and equation (2), respectively: Here, $fit_A$ and $fit_B$ are the best individual's fitness values of subpopulation A on island A and subpopulation B on island B. In a certain generation, we calculate the above functions and carry out three steps as follows:
• If 1 − β < θ, the migration is executed. Otherwise, nothing is done.
• The topology of migration is determined by the ratio of $fit_A$ to $fit_B$. If $fit_A / fit_B > 1$, the migration is from subpopulation A to subpopulation B. If $fit_A / fit_B < 1$, the migration direction is reversed. If $fit_A / fit_B = 1$, no migration is implemented.
• When the migration is carried out, the α×N individuals with the best fitness values in the emigrant subpopulation are selected to replace the α×N individuals with the worst fitness values in the immigrant subpopulation.
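The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: α and β are passed in as already computed values because equations (1) and (2) are not reproduced here, N is taken to be the subpopulation size, α×N is rounded down, emigrants are copied rather than removed from their island, and the name `migrate` is hypothetical.

```python
from typing import Any, List, Tuple

Individual = Tuple[float, Any]  # (fitness, chromosome); higher fitness is better


def migrate(sub_a: List[Individual], sub_b: List[Individual],
            alpha: float, beta: float,
            theta: float) -> Tuple[List[Individual], List[Individual]]:
    """Apply one migration step and return the new (sub_a, sub_b)."""
    # Step 1: migrate only if 1 - beta < theta.
    if not (1.0 - beta < theta):
        return sub_a, sub_b

    # Step 2: the direction follows the ratio fit_A / fit_B. The values are
    # compared directly to avoid dividing by a fitness equal to zero.
    fit_a = max(f for f, _ in sub_a)
    fit_b = max(f for f, _ in sub_b)
    if fit_a == fit_b:
        return sub_a, sub_b
    a_to_b = fit_a > fit_b
    source, target = (sub_a, sub_b) if a_to_b else (sub_b, sub_a)

    # Step 3: the alpha*N best emigrants replace the alpha*N worst immigrants.
    # Assumption: N is the subpopulation size and alpha*N is rounded down.
    k = min(int(alpha * len(target)), len(source))
    emigrants = sorted(source, key=lambda ind: ind[0], reverse=True)[:k]
    survivors = sorted(target, key=lambda ind: ind[0], reverse=True)[:len(target) - k]
    new_target = survivors + emigrants

    return (sub_a, new_target) if a_to_b else (new_target, sub_b)


# Hypothetical islands of four individuals each.
island_a = [(9.0, "a1"), (5.0, "a2"), (1.0, "a3"), (7.0, "a4")]
island_b = [(3.0, "b1"), (8.0, "b2"), (2.0, "b3"), (4.0, "b4")]
new_a, new_b = migrate(island_a, island_b, alpha=0.5, beta=0.5, theta=1.0)
```

In this example the best fitness on island A (9.0) exceeds that on island B (8.0), so the two best individuals of A replace the two worst individuals of B, while A is left unchanged.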
The migration policy is executed by the CPU, and the results of the cellular GA on the GPUs are sent back to the CPU at this moment. With this design, the topology, the rate, the interval and the strategy no longer need to be set manually. Newly merged individuals with good genes can be transferred quickly, and execution time is saved by preventing ineffective information sharing.

Numerical Experiments
To analyze the performance of the proposed algorithm, we compare its solutions' quality and execution time with the parallel cellular GA and the parallel pseudo GA.
The population size is kept at 512 for all tested GAs, while the subpopulation size for each island of the heterogeneous GA is 256. The crossover rate and the mutation rate of the cellular GA are set to 1.00 and 0.05 respectively [23], while the crossover rate of the pseudo GA is equal to 0.75 [24]. The cellular GA within the dual heterogeneous GA keeps the same crossover rate and mutation rate as the standalone cellular GA. Similarly, the pseudo GA within the dual heterogeneous GA keeps the same crossover rate as the standalone pseudo GA. Moreover, to better check the influence of migration, the migration threshold is fixed at 1.00. As a large size FFS is considered in this paper, all analyzed instances are characterized by 300 jobs with 4 stages, and there are 2 available machines at each stage. Other experimental data are defined in Table 1.

Test on Migration Policy Execution Gap
Even though the topology, the rate, the interval and the strategy are set adaptively when the migration policy is carried out in a certain generation, we still need to test when to execute it, since the migration policy needs to call back results from the GPUs and too frequent data exchange between the device and the host may weaken the performance of the proposed method. As displayed in Fig. 4, the migration policy execution gap is increased from 10 generations to 800 generations, and the island GA risks falling into a local optimum if this value is either too small or too big. This shows that an inappropriate migration can also lead to premature convergence, in addition to homogeneous genetic operator configurations and limited subpopulation sizes.
Following the polynomial fitting values, the best performance for the tested instance is obtained when the migration policy execution gap is around 500 generations, and we keep this setting for the remaining tests in this paper.
Fig. 4 The influence of the migration policy execution gap for the heterogeneous GA

Comparison Test on Solutions' Quality
The solution quality of the different GAs is shown in Table 2. Although the specific designs of the cellular GA and the pseudo GA each help increase population diversity, the proposed method combines the merits of both and improves performance through independent evolution and penetration migration. Thus, the heterogeneous GA outperforms them with better solutions and less variance. This effect is also confirmed by the convergence trends of the three GAs in Fig. 5. Moreover, there are elbows in the convergence curve of the designed approach, and they always appear around the generations where the migration policy is executed. This phenomenon illustrates how premature convergence is avoided thanks to two heterogeneous islands connected by the penetration migration.

Conclusions
A dual heterogeneous island GA was proposed in this paper. It was composed of a cellular GA on GPUs and a pseudo GA on a multi-core CPU, where the 2D variable space of the cellular GA and the complementary parent strategy of the pseudo GA kept the population diversity. This structure was highly consistent with the underlying architecture and can be parallelized inside or between GPUs and a multi-core CPU.
Since the two islands evolved independently in different ways, a penetration-inspired migration was designed to share information between them and to decrease the risk of premature convergence. In solving some large instances of the FFS problem, the first test revealed the importance of an appropriate migration implementation: otherwise, the migration could cause genetic drift and lead to convergence towards a local optimum. The second test showed that the proposed method obtained better solutions with less variance because of the merits of two different islands, and confirmed the efficiency of the penetration migration. Finally, the effectiveness of the dual heterogeneous island GA was demonstrated by comparison tests with other parallel methods, which also indicated that the balance of computation capability between the host and the device had a great influence on its overall performance.