Improving the Performance of Whale Optimization Algorithm through OpenCL-Based FPGA Accelerator

Whale optimization algorithm (WOA), a novel nature-inspired swarm optimization algorithm, demonstrates superiority in handling global continuous optimization problems. However, its performance deteriorates when applied to large-scale complex problems because of the rapidly increasing execution time required for the huge computational load. Because WOA is built on interactions within the population, it is naturally amenable to parallelism, which offers an effective approach to mitigate the drawbacks of sequential WOA. In this paper, a field programmable gate array (FPGA) is used as an accelerator, whose high-level synthesis employs the open computing language (OpenCL) as a general programming paradigm for heterogeneous System-on-Chip. On this platform, a novel parallel framework of WOA named PWOA is presented. The proposed framework comprises two feasible parallel models, called partial parallel and all-FPGA parallel, respectively. Experiments are conducted by performing WOA on CPU and PWOA on the OpenCL-based FPGA heterogeneous platform to solve ten well-known benchmark functions. Meanwhile, two other classic algorithms, particle swarm optimization (PSO) and the competitive swarm optimizer (CSO), are adopted for comparison. Numerical results show that the proposed approach achieves promising computational performance coupled with efficient optimization on relatively large-scale complex problems.


Introduction
Swarm optimization and evolutionary algorithms have demonstrated their significance in a wide range of scientific and practical problems [1][2][3][4][5]. In recent years, research has increasingly focused on multiobjective problems and artificial intelligence [6][7][8][9]. Whale optimization algorithm (WOA), a novel swarm intelligence-based metaheuristic algorithm, was proposed by Mirjalili and Lewis in 2016 [10]. Inspired by the special hunting behavior of humpback whales, WOA shows better performance than several existing popular methods and has drawn great research attention. Typically, Abdel-Basset et al. [11] integrated WOA with a local search strategy for tackling the permutation flow shop scheduling problem. Mafarja and Mirjalili [12] proposed a hybrid WOA with simulated annealing for feature extraction. Aljarah et al. [13] introduced a WOA-based trainer to train multilayer perceptron (MLP) neural networks. Moreover, other bodies of research have tried to tackle diverse problems using WOA, such as multiobjective optimization [14][15][16], image processing [17][18][19], software testing [20], and power system applications [21,22].
However, large-scale, multiconstraint, and complex scenarios usually appear in real engineering optimization problems, such as job shop scheduling, the mixed unit commitment problem, and automatic path planning. Furthermore, high requirements on response speed and real-time performance must be satisfied when solving such problems. In this situation, most optimization algorithms, including WOA, may get stuck in an execution dilemma. As the scale and complexity of the problem increase, the execution time of WOA grows rapidly, which leads to deteriorating time performance [23]. Given the inherent parallelism of WOA, this problem can be tackled by applying parallel algorithms developed for specific accelerating platforms. In recent years, experts and scholars have implemented various swarm optimization algorithms using state-of-the-art technologies such as multicore (message passing interface-MPI, OpenMP), distributed (MapReduce, Spark), and heterogeneous computing-based parallel platforms (graphics processing unit-GPU, FPGA).
Heterogeneous computing refers to using dedicated hardware devices with different architectures to execute time-consuming tasks, balancing the computational load of the CPU. GPU is a classical parallel computing device, widely used in graphics visualization, image/video processing, scientific computing, deep learning, and so on. Nevertheless, with the increasing deployment of GPUs, energy consumption and heat dissipation have become severe limitations for system extension and place heavy environmental pressure on society [24]. In light of this, some researchers have begun to choose other hardware devices to alleviate the pressure caused by GPUs. FPGA, a novel parallel accelerator, possesses powerful parallel computing capability and flexible programmability while maintaining the advantage of low power consumption [25]. Traditional FPGA design, however, suffers from high development difficulty and time consumption. Recently, Intel has provided a development kit for software users, making it possible to deploy OpenCL programs on FPGA. Consequently, developers can rapidly implement FPGA-based heterogeneous applications through the OpenCL API, thus reducing development cost and time-to-market.
This research proposes a high-performance parallel WOA (PWOA), implemented on FPGA, to effectively solve large-scale and complex optimization problems. More specifically, the main contributions of this paper are as follows: (1) A novel heterogeneous parallel framework of WOA based on an OpenCL-based FPGA accelerator. (2) Two efficient models, a partial parallel model and an all-FPGA parallel model, with program flow design and dataflow analysis. (3) Several diverse numerical experiments conducted on ten selected benchmark functions. Compared with sequential WOA executed on CPU, the proposed PWOA based on the two parallel models achieves higher execution performance. The rest of this paper is organized as follows: Section 2 presents a substantial literature review of parallel optimization algorithms. The theory of WOA and the OpenCL-based FPGA heterogeneous accelerating platform is introduced in Section 3. Section 4 describes the FPGA implementation of the proposed PWOAs with two parallel models, followed by the experimental results and statistical analysis in Section 5. Finally, conclusions are given in Section 6.

Related Work
Swarm optimization algorithms, including WOA, face the challenge that optimization performance degrades due to extensive computational cost when solving high-dimensional problems with complex mathematical models. To overcome this challenge, researchers have designed parallel swarm algorithms implemented on various platforms. In recent years, distributed and parallel particle swarm optimization (PSO) has been implemented. Some studies [26][27][28][29][30] applied GPUs to parallelize PSO, putting forward diverse parallel strategies. Hajewski and Oliveira [31] developed a fast cache-aware parallel PSO relying on OpenMP. Ant colony optimization (ACO) [32] and artificial bee colony (ABC) [33] have also been parallelized on GPU. Concerning brain storm optimization (BSO), Jin and Qin [34] presented a GPU-based implementation, whilst Ma et al. [35] proposed a parallelized BSO algorithm based on the Spark framework for association rule mining. Similar works in [36,37] used GPU and FPGA to accelerate the genetic algorithm (GA). Notably, Garcia et al. [38] achieved parallel implementation and comparison of teaching-learning-based optimization (TLBO) and Jaya on many-core GPUs. As for WOA, Khalil et al. [39] proposed a simple and robust distributed WOA using Hadoop MapReduce, reaching a promising speedup.
It can be concluded that there are several typical kinds of parallel techniques, including OpenMP, MapReduce, Spark, and heterogeneous architectures based on dedicated accelerators such as GPU and FPGA. GPU has become popular for general-purpose parallel computing, as developing parallel swarm intelligence algorithms on GPU has successfully gained remarkable performance improvements [40]. Recently, FPGA has gradually been applied to heterogeneous computing and algorithm acceleration based on OpenCL, benefiting from its high parallelism, better energy efficiency, and flexible programmability [41][42][43][44]. The experiments conducted in [45] showed that swarm algorithms on FPGAs achieved a better speedup than on GPUs and multicore CPUs. Nevertheless, designing a near-optimal accelerator is not an easy task. Implementing CPU-oriented code on FPGA rarely improves performance and may even reduce it compared to the CPU. Therefore, it requires not only digital design expertise but also software skills to form appropriate OpenCL code [46].
Little research has investigated FPGA implementations of swarm optimization algorithms, especially WOA. Our prior work [47] explored WOA based on a partial parallel scheme and deployed it on an FPGA heterogeneous platform. Empirical results using classic benchmarks demonstrated the advantage of the proposed methodology in execution performance and convergence speed. In this paper, motivated by the previous studies, a novel PWOA scheme encompassing two parallel models is further developed on FPGA. Meanwhile, more diverse benchmarks are used to verify the effectiveness of PWOA based on the FPGA parallel framework and its computational performance on large-scale complex problems.

Whale Optimization Algorithm and Acceleration Platform
3.1. Basic WOA Algorithm. The WOA algorithm consists of two main phases, exploitation and exploration, emulating shrinking encircling, bubble-net attacking, and searching for prey. The following subsections explain the mathematical model of each phase in detail.

Exploitation Phase (Encircling and Bubble-Net Attacking). To hunt prey, humpback whales first recognize the location of the prey and encircle it. The mathematical model of shrinking encircling is represented by the following equations:

D = |C · X*(t) − X(t)|,    (1)
X(t + 1) = X*(t) − A · D,    (2)

where X is the position vector, X* represents the position of the best solution obtained so far, t indicates the current iteration number, | | denotes the absolute value, and · means an element-by-element multiplication. A and C are two coefficients, which are calculated as follows:

A = 2a · r − a,    (3)
C = 2r,    (4)

where a linearly decreases from 2 to 0 over the iterations (in both the exploitation and exploration phases) and r is a random number in [0, 1]. The value of a is calculated by a = 2 − t(2/MaxIter), where MaxIter is the maximum number of iterations. Another method used in the exploitation phase is spiral position updating, which together with the aforementioned shrinking encircling constitutes the bubble-net attacking strategy of humpback whales. The mathematical equations are as follows:

D′ = |X*(t) − X(t)|,    (5)
X(t + 1) = D′ · e^(bl) · cos(2πl) + X*(t),    (6)

where b is a constant determining the shape of the logarithmic spiral and l is a random number in [−1, 1]. Shrinking encircling and spiral position updating are used simultaneously during the exploitation phase. The mathematical model is as follows:

X(t + 1) = X*(t) − A · D, if p < 0.5; X(t + 1) = D′ · e^(bl) · cos(2πl) + X*(t), if p ≥ 0.5,    (7)

where p is a random value in [0, 1] which stands for a probability of 50% to choose either the shrinking encircling method or the spiral-shaped mechanism to update the positions of whales during the optimization process.

Exploration Phase (Searching for Prey).
In addition to the exploitation phase, a stochastic searching technique is adopted to enhance exploration in WOA. Unlike exploitation, a random whale X_rand is selected from the swarm to navigate the search space, so as to find a better solution (prey) than the existing one. This phase efficiently prevents the algorithm from stagnating in local optima. Based on the parameter A, a decision is made on which mechanism to use for updating the positions of whales: exploration is performed if |A| ≥ 1, and exploitation if |A| < 1. The optimization process is mathematically described as follows:

D = |C · X_rand − X|,    (8)
X(t + 1) = X_rand − A · D,    (9)

where X_rand is the position of a whale randomly chosen from the current population and C is calculated by equation (4). Algorithm 1 presents the pseudocode of WOA. At the beginning of the algorithm, an initial random population is generated, each individual is evaluated by the fitness function, and X* is set to the current best solution. Then, the algorithm is executed repeatedly until the stop condition is satisfied. At each iteration, search agents update their positions according to either a randomly chosen individual when |A| ≥ 1, or the best solution obtained so far when |A| < 1. Depending on p, the WOA algorithm decides whether to use circular or spiral movement.

OpenCL-Based FPGA Heterogeneous Computing Platform

OpenCL and FPGA. OpenCL, maintained by the Khronos Group, is an open standard for general-purpose parallel computing [48]. Various hardware devices, such as CPUs, FPGAs, GPUs, and DSPs, are supported for implementing highly efficient parallel algorithms across heterogeneous computing platforms. Additionally, OpenCL specifies a C99-based programming API for the convenience of software developers. A typical OpenCL program consists of a host section and a kernel section. FPGA is a configurable integrated circuit that can be repeatedly reconfigured to perform a huge number of logic functions. It generally includes programmable core logic, hierarchical reconfigurable interconnects, I/O elements, memory blocks, and DSPs. With these substantial logic resources, FPGA achieves increased programming flexibility compared to application-specific integrated circuits (ASICs). However, the traditional development flow on FPGA heavily relies on register transfer level (RTL) descriptions such as Verilog and the very high speed integrated circuit hardware description language (VHDL), which incurs high development and verification cost. To address this problem, FPGA vendors such as Intel and Xilinx have released OpenCL-based development flows that allow software developers to design FPGA-based applications more efficiently.

Intel FPGA SDK for OpenCL.
The Intel FPGA SDK for OpenCL [49] enables developers to create high-level FPGA implementations with OpenCL. The SDK provides a heterogeneous computing environment in which OpenCL kernels are compiled by the Altera Offline Compiler (AOC) for programming the FPGA at runtime. In this paradigm, Intel achieves design optimization while hiding the low-level hardware details of the FPGA. Consequently, FPGA has gradually been applied to a wide range of fields such as image and video processing [42,50], deep learning [51][52][53], and intelligent optimization algorithms [46].
The OpenCL-based FPGA logic framework is illustrated in Figure 1, where several modules are explained as follows: (1) Kernel pipeline: the core module of the entire framework, which is an implementation of specific functions. The kernel code is compiled by the AOC offline compiler and synthesized into a highly parallel optimized logic circuit according to the internal architecture of the FPGA. (2) Processor: a host processor, typically a CPU, used to control programs running on the FPGA device.
(3) DDR: off-chip memory, comprising the global and constant memory of the OpenCL memory model. The Intel Cyclone V FPGA device used in this work has a DDR3 with a capacity of 1 GB. By default, the constant cache size is 16 KB and can be modified according to practical requirements. (4) PCI-e: a high-speed data exchange interface, responsible for transporting data and instructions between host and device. (5) On-chip memory: internal memory of the FPGA device, equivalent to the local and private memory of the OpenCL memory model. With small capacity but high speed, it is mainly used for storing temporary input and output data, reducing the number of accesses to global memory. Thus, we may take advantage of on-chip memory to improve the efficiency of an OpenCL program. (6) Local memory interconnect: a bridge between the executing units and memory. (7) External memory controller and PHY: a controller in charge of sending and receiving data via DDR.

Parallel Whale Optimization Algorithm Based on FPGA
With the descriptions and definitions above, the framework of WOA can be summarized as shown in the left flowchart in Figure 2. Note that the right flowchart is a simplified framework of WOA, mainly composed of initialization, swarm updating, fitness calculation, and swarm evaluation. Similar to other swarm optimization algorithms, WOA unavoidably suffers from time-consuming operations such as updating the swarm and calculating fitness, which greatly limits its execution speed [45]. Thanks to its natural parallelism, the components implementing swarm updating and fitness calculation in WOA can be executed concurrently. Within the swarm updating phase, the positions of the searching whales are updated separately by their corresponding moving mechanisms, more biologically simulating a real hunting process. For the remaining two phases, the initialization keeps its original form in this work since it has little effect on computational performance, whereas the evaluation is synchronous and cannot be parallelized. This section proposes a parallel WOA based on the FPGA heterogeneous computing platform. To reach efficient acceleration, some compute-intensive tasks of WOA are transferred to the FPGA side for parallel execution while the CPU performs the remaining tasks. The parallel model can be divided into partial parallel and all-FPGA parallel by assigning different tasks to the CPU and FPGA. Below, the PWOA implementation is described in two aspects: program flow design and dataflow analysis.

Initialization.
Initialization mainly prepares the basic data needed during the whole run of WOA, including generating random numbers and the initial population. This process is carried out at the beginning of WOA and executed only once. A dedicated C/C++ library for random number generation is applied, as OpenCL does not provide a native random number generator. In this paper, a generic methodology, placing the computational task of initialization on the CPU side, is adopted in both proposed parallel models, so as to make full use of the computational horsepower of the CPU.
Random number generation is a crucial component of WOA. On the one hand, the initial population is composed of whales with random positions, and the value of each random position must lie within the range of the decision variables of the specific objective function. On the other hand, several random numbers are used as coefficients (a, A, C, l, and p) for updating the positions of whales, which play a significant role in optimization performance. Moreover, these coefficients are needed in every iteration, meaning that data transportation between FPGA and CPU would also occur in every iteration and become a bottleneck for the running speed of PWOA. To alleviate this drawback, all required random numbers as well as the initial population are generated on the CPU side and then sent to the FPGA side once via OpenCL global memory. This approach substantially reduces the time overhead of PWOA.

Program Flow Design.
The partial parallel model executes several algorithmic sections in parallel, following the so-called master-slave model. The partial parallel model-based PWOA (PWOA-PPM) on FPGA is presented in Figure 3.

Host Program Flow. The host program running on the CPU undertakes the initialization of PWOA and transfers the related basic data to the kernel side via OpenCL global memory. Due to the restriction of synchronization, swarm evaluation is placed on the CPU for sequential execution in this model. The host program thus maintains the basic framework of WOA: it allocates tasks to the FPGA, reads computation results from the FPGA, and evaluates the swarm in each iteration. The evaluation result is also sent to the FPGA when the host program enqueues task commands that drive the kernel function to be executed on the FPGA. Such task allocation makes better use of the processing power of the CPU but correspondingly incurs extra communication overhead between CPU and FPGA.
Kernel Program Flow. The FPGA device is used to deploy and accelerate the kernel program. The host offloads computationally intensive tasks onto the FPGA for parallel computing. Based on the OpenCL programming model, the parallel parts of the algorithm are mapped to kernel functions executed by threads (or work items) independently [40,45]. In the proposed model, a fine-grained strategy is adopted, where each thread takes charge of one individual, calculating its fitness and updating its position. According to the coefficients (A and p), each thread (individual) performs one of the mechanisms: shrinking encircling, spiral updating, or stochastic searching. Once the kernel program finishes executing, the final results are written back to global memory.

Dataflow Analysis between Host and Kernel.
In the proposed implementation, the dataflow between host and kernel mostly depends on global memory bandwidth. On the host side, memory buffers are created and the data used are mapped to these buffers, which are then sent to the global memory of the kernel via PCI-e. On the kernel side, each thread works as a basic processing element, reading data from global memory and completing the kernel function. As illustrated in Figure 4, the data set contains the positions and fitnesses of all search agents, the global optimum X*, and the coefficients (a, A, C, l, and p). One block in the "positions" and "fitnesses" memory regions represents the multidimensional position information and the fitness value of one whale individual, respectively. In the "coefficients" memory region, all coefficients required for one whale across the whole run are stored in one block, whereas the "optima" memory region holds only the position of the global best whale.

Program Flow Design.
In the all-FPGA parallel model, most constituent parts of WOA, except for initialization, are ported to the FPGA. The all-FPGA parallel model-based PWOA (PWOA-AFPM) is designed as shown in Figure 5.
Host Program Flow. On the host side (CPU), similar to the partial parallel model, the host program undertakes the initialization of WOA, and the related basic data are offloaded onto the kernel side via OpenCL global memory. However, it no longer controls the basic framework of WOA in this parallel model, resulting in a relatively low workload for the CPU but a greater computational load for the FPGA. After completing these two operations, the host program enqueues task commands to start the kernel program on the FPGA and finally reads the results from global memory. A dramatic advantage of this design, in comparison with the partial parallel model, is minimal communication overhead between CPU and FPGA.
Kernel Program Flow. Within this model, the kernel program running on the FPGA becomes more complex than in the previous model. In addition to receiving data and writing results back to global memory, the evolutionary framework, which contains swarm updating, fitness calculation, and swarm evaluation, is controlled by the kernel. Similarly, the fine-grained model is applied so that multiple threads execute the kernel function in parallel. Nevertheless, care must be taken when evaluating the swarm because all threads share one global optimal solution. To ensure the accuracy of the algorithm, we enforce memory consistency across threads with memory fences [54,55]. As depicted in Figure 5, the process in the red dotted line is performed as a synchronizer: not only do all threads reach a synchronized state before this process, but replacing the global optimal solution with a better solution obtained by any thread is also an atomic operation. In this way, all threads are executed in order, thereby guaranteeing correct evaluation results.

Dataflow Analysis between Host and Kernel.
In this model, the dataflow between host and kernel involves both global memory and on-chip memory (local memory), as presented in Figure 6. As in the previous model, positions and coefficients are transmitted via global memory. Transmitting the final result from the kernel to the host also requires global memory; therefore, a memory buffer is created by the host program at the beginning, requesting a global memory space for the global optimum X*. The use of on-chip FPGA memory is a noticeable difference between the dataflow of this model and that of the prior model illustrated in Figure 4. This is because most operations of WOA are executed by the FPGA, making it a rational strategy to utilize on-chip memory, which comprises local and private memory. Besides, this kind of memory can be directly and efficiently requested during kernel execution. Thus, intermediate results, such as the optimum and the fitnesses of all individuals, are stored in local memory. Furthermore, a more efficient synchronous evaluation process also benefits from keeping the dataset in local memory.

Experimental Results and Analysis

In this paper, ten general benchmark functions [56], listed in Table 1, are used to make performance comparisons between the serial WOA (CPU implementation) and the two parallel model-based PWOAs (FPGA implementation). Among these benchmark functions, f1-f5 are unimodal while f6-f10 are multimodal.
Concerning other parameters of the canonical WOA algorithm, the coefficient b in the spiral-updating model is held constant during the whole evaluation process and set to 1.0. Dimensions of 64 D, 128 D, 256 D, and 512 D are used for the optimization tests, and the population size of WOA is set to twice the dimension. To verify the performance of the proposed PWOAs, two other canonical algorithms, PSO [57] and the competitive swarm optimizer (CSO) [58], are selected for comparison. Additionally, for each implementation with a specific dimension setting, 30 independent runs are executed and the average performance is reported. For each independent run, the maximum number of fitness evaluations (FEs) is set to 1000 × D, where D is the search dimension of the test function.

Optimization Result and Running Time on Benchmark Functions.
Using the three WOA variants with different schemes and the two state-of-the-art algorithms to optimize the 10 benchmark functions, the experimental data listed in Table 2 are obtained. Based on these numerical values, it can be noticed that WOA and the PWOAs constructed from the two parallel models (PWOA-PPM and PWOA-AFPM) present higher problem-solving efficacy than CSO and PSO when optimizing all benchmark functions at all tested dimensions. As for the mean results of all 10 test cases, WOA and the proposed PWOAs obtain more accurate values than the other two algorithms. When optimizing f1, f2, f4, f8, and f10, the results of WOA and the PWOAs maintain only a tiny gap with the optimal value (0). The proposed algorithms, in particular, converge to the theoretical optimal value (f_min = 0) for f7 and f9 at any scale. CSO obtains more reliable solutions for f1, f2, f8, and f9, which, however, are still less accurate than those of the proposed algorithms. Relatively speaking, PSO hardly converges to an accurate value for most benchmarks. The comparison between WOA and the PWOAs shows that the proposed FPGA-based parallel framework for WOA maintains the intrinsic outstanding global convergence. Moreover, as both the dimension and the population size increase, the performance of the proposed algorithms improves for most benchmark functions except f3, f5, and f8, which indicates that the dimension setting affects optimization performance to some extent. Concerning running time, two perspectives, function type and scale settings, are considered. From the perspective of function type, since multimodal functions (f6-f10) generally have higher arithmetic complexity than unimodal functions (f1-f5) [27,40], there is an obvious time gap between unimodal and multimodal functions for all algorithms in the tables. For the classic algorithms, WOA and PSO have relatively close running times, especially for unimodal functions.
This is because the two algorithms essentially have similar structure and complexity. CSO, on the contrary, maintains faster performance than WOA and PSO. In brief, the proposed PWOA-PPM and PWOA-AFPM execute more stably than WOA, which benefits from the hardware-accelerated performance of the FPGA due to its built-in dedicated arithmetic units and the modular pipeline design.

Speedup Analysis.
In this section, speedup is calculated based on the running times at different problem scales as follows:

Speedup = T_WOA / T_PWOA,

where T_WOA and T_PWOA denote the running time of the serial WOA and of the FPGA implementation of the parallel WOA, respectively.
The speedup produced by the PWOAs on the various benchmark functions is shown in Figure 7 and analyzed as follows. Note that the PWOAs achieve a certain degree of execution improvement, and the speedup of both PWOA-PPM and PWOA-AFPM on multimodal functions is better than on unimodal functions at all problem scales. From Figure 7(a), for both unimodal and multimodal functions, the greater the dimension of the search space, the higher the speedup ratio PWOA-PPM obtains. Moreover, PWOA-PPM holds noticeable acceleration when solving the most complex function f10, and the maximum speedup reaches around 18x at dimension 512 D. As for PWOA-AFPM, Figure 7(b) shows that it exhibits unstable computational performance: the speedup for all functions decreases in the 256 D case, while it shows relatively better acceleration in the 128 D and 512 D cases. In addition, the speedup ratio obtained by PWOA-AFPM for optimizing f10, contrary to PWOA-PPM, shows a slight downward trend as the problem scale increases. The maximum speedup produced by PWOA-AFPM is up to 10x (solving f9 at dimension 512 D).
Four bar graphs, depicted in Figure 8, intuitively compare the speedup of PWOA-PPM and PWOA-AFPM at different problem scales. At small scales, including 64 D and 128 D, the speedup of PWOA-PPM is not as good as that of PWOA-AFPM, especially when solving all functions in the 64 D case and f5-f9 in the 128 D case. Note, however, that the running efficiency of PWOA-PPM steadily rises as the scale increases. In a few words, PWOA-PPM has more advantages in solving medium-scale and large-scale problems, while PWOA-AFPM has better computational performance on small-scale problems. From the above experimental analysis, it can be seen that the two PWOA models have different influences on acceleration, which is mainly caused by the different frameworks instructing the implementation of PWOA-PPM and PWOA-AFPM on the FPGA heterogeneous platform. PWOA-PPM utilizes a partial parallel model, and the extra overhead of frequent communication between CPU and FPGA becomes a bottleneck leading to worse performance at small scales. Unlike PWOA-PPM, PWOA-AFPM transfers most of the work of WOA to the FPGA side for execution. Additionally, the synchronous operation using memory fences requires more hardware to implement and may degrade kernel performance on the FPGA side [55]. This, in turn, makes PWOA-AFPM more inefficient as the benchmark complexity and problem scale increase.

Conclusion
Demonstrating its excellence in global optimization, WOA has drawn significant research interest in the last few years. An unavoidable reality is that performance degradation takes place in WOA when facing large-scale complex optimization problems. Many proposals exist to address this issue; most of them, however, are based on classic algorithms such as the genetic algorithm and particle swarm optimization, while very little literature on parallel WOA can be found. Based on an FPGA accelerator, this study proposes two well-designed parallel models to implement the parallel WOA (PWOA) using the OpenCL framework, with a demonstration on an Intel heterogeneous platform. Finally, the performance of the two parallel model-based PWOAs (PWOA-PPM and PWOA-AFPM) has been evaluated using 10 benchmark functions.
For future work, it is essential to apply this algorithm to real engineering problems to verify its practical benefits. Besides, more types of devices, such as GPU and DSP, need to be investigated to build a multidevice heterogeneous platform. This platform would be an efficient cooperative running environment in which a highly computational task can be decomposed into several parts and assigned to different devices. Therefore, the proposed parallel scheme has potential for real applications.

Data Availability
The data used to support the findings of this study are included within the article.