A Tabu Search-Based Memetic Algorithm for Hardware/Software Partitioning

1 Department of Mathematics, Minjiang University, Fuzhou 350108, China
2 Center for Discrete Mathematics and Theoretical Computer Science, Fuzhou University, Fuzhou 350108, China
3 School of Computational and Applied Mathematics, Faculty of Science, University of the Witwatersrand (Wits), Johannesburg 2050, South Africa
4 TCSE, Faculty of Engineering and the Built Environment, University of the Witwatersrand (Wits), Johannesburg 2050, South Africa


Introduction
Embedded systems have become omnipresent in a wide variety of applications and typically consist of application-specific hardware components and programmable components. With the growing complexity of embedded systems, hardware/software codesign has become an effective way of improving design quality, in which HW/SW partitioning is the most critical step.
HW/SW partitioning decides which tasks of an embedded system should be implemented in hardware and which ones should be in software. A task implemented with a hardware module is faster but more expensive, while a task implemented with a software module is slower but cheaper. For this reason, the main target of HW/SW partitioning is to balance all the tasks to optimize some objectives of the system under some constraints [1].
In recent years, there has been an increasing interest in the study of the HW/SW partitioning problem. Several exact approaches [2][3][4][5] to the problem have been developed. However, since most of the exact formulations of the HW/SW partitioning problem are NP-hard [6], the practical usefulness of these approaches is limited to fairly small problem instances only. For larger instances, a number of heuristic algorithms, both traditional and general purpose, have been proposed.
There are two traditional families of heuristics: the software-oriented heuristic and the hardware-oriented heuristic. The software-oriented heuristic starts with a complete software solution and parts of the system are migrated to hardware until all constraints are satisfied. On the other hand, the hardware-oriented heuristic starts with a complete hardware solution and moves parts of the system to software until a constraint is violated.
It must be mentioned that the HW/SW partitioning problem is to minimize a cost while satisfying some design constraints. In fact, the problem is a discrete constrained optimization problem, and the optimal solutions usually lie on the boundary of the feasible region. Hence, it is necessary to develop effective techniques for handling the constraints.
This paper presents a tabu search-based memetic algorithm (TSMA) for solving the HW/SW partitioning problem. The TSMA uses an adaptive penalty function to convert the original problem into an equivalent unconstrained integer programming problem; both problems have the same discrete global minimizers. The TSMA is then applied to the unconstrained integer programming problem in order to solve the original problem. The TSMA integrates a tabu search procedure, a path relinking procedure, and a population updating strategy. These strategies achieve a balance between intensification and diversification of the search process. The algorithm has been tested on four benchmark test instances from the literature. Experimental results and comparisons show that the proposed algorithm is effective and robust.
The remaining part of this paper is organized as follows. Section 2 gives the system model and problem formulation. Section 3 presents the parameter-free adaptive penalty function used in converting the original problem into the unconstrained problem. Section 3 also presents some theoretical properties of the resulting unconstrained problem. In Section 4, a full description of TSMA is presented. Experimental results on four benchmark test instances are presented in Section 5. Final remarks are given in Section 6.

Hardware/Software Partitioning Model and Formulation
The HW/SW partitioning model discussed in this paper is similar to that presented in [16]. In order to make the paper self-contained, this section briefly reviews the description of the HW/SW partitioning model considered in [16].

Function Description.
The main function of the system is usually described in a high-level programming language and is then mapped into a control-data flow graph (CDFG). The CDFG is composed of nodes and arcs [20,21]. Generally, the CDFG is a directed acyclic graph (DAG). In the DAG, the nodes are determined by the model granularity, that is, the semantics of a node. A node can represent a short sequence of instructions, a basic block, a function, or a procedure [12]. The arcs between nodes denote their relationship. Each node in the DAG can receive data from its previous nodes and can send data to its next nodes.

Binary Integer Programming Formulation.
In a directed acyclic graph (DAG) $G = (V, E)$ with node set $V = \{1, 2, \ldots, n\}$ and arc set $E$, every node is labeled with several attributes. The attributes for a node $i$ are defined as follows.
(i) $c_i^s$ denotes the cost of $i$ in software implementation.
(ii) $t_i^s$ denotes the executing time of $i$ in software implementation.
(iii) $c_i^h$ denotes the cost of $i$ in hardware implementation.
(iv) $t_i^h$ denotes the executing time of $i$ in hardware implementation.
(v) $\mathrm{in}_i$ is an array, which denotes the set of incoming nodes of $i$ and stores the set of corresponding communication times.
(vi) $\mathrm{out}_i$ is an array, which denotes the set of outgoing nodes of $i$ and stores the set of corresponding communication times.
(vii) $\mathrm{num}_i$ denotes the number of times that node $i$ is executed.
Figure 1 [16] shows the partition model used in this paper, where "HA1", "HA2", and so on denote the hardware nodes implemented by an application-specific integrated circuit (ASIC) or a field programmable gate array (FPGA). The software nodes are executed in a programmable processor (CPU). All the nodes exchange data through a shared bus, and they share the common memory to store the interim data. All the hardware nodes without dependency can be executed concurrently [16].
After all nodes are determined to be implemented by hardware or software, all the nodes are scheduled. The total executing time can be evaluated by the list schedule algorithm [11,22], and the cost of the design can be evaluated. The detailed steps of the list schedule algorithm are shown in Algorithm 1.
In this paper, the objective is to minimize the total cost of the system under the time constraint. Since the cost of software is usually negligible, the HW/SW partitioning problem can be described as follows [11]: Mathematical Problems in Engineering

$$\text{(P)} \quad \min \sum_{i \in H} c_i^h \quad \text{subject to} \quad T(H) \le \text{TimeReq},$$

where $c_i^h$ is the hardware cost of node $i$, $i \in H$ means that node $i$ is implemented by hardware, $T(H)$ denotes the total executing time, and TimeReq denotes the required time which is given in advance by the designer. Let $x = (x_1, \ldots, x_n) \in \{0, 1\}^n$ denote a solution of problem (P), where $x_i = 1$ ($x_i = 0$) indicates that node $i$ is implemented by hardware (software). Then problem (P) can be formulated as the following constrained binary programming problem:

$$\text{(IP)} \quad \min f(x) = \sum_{i=1}^{n} c_i^h x_i \quad \text{subject to} \quad T(x) \le \text{TimeReq}, \quad x \in \{0, 1\}^n,$$

where $T(x)$ denotes the total executing time of $x$, which can be evaluated by the list schedule algorithm, that is, Algorithm 1. We suppose without loss of generality that all node attributes (costs, executing times, communication times, and execution counts) and TimeReq are nonnegative integers.

Algorithm 1: List schedule algorithm [16,22].
(1) Initialization. Set two attributes for every node, start and finish, which are used for recording the node's starting time and finishing time. Choose all nodes without data dependency from the node set and put them into a ready queue. Choose one software node from the queue to be executed by the CPU, and all hardware nodes to be executed at the same time. Set 0 as the value of the start attribute for the running nodes. Set the sum of the executing time and the communication time as the value of the finish attribute. If there is a software node being executed by the CPU, set CPU = BUSY.
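As a rough illustration of the formulation (IP), the following sketch evaluates the hardware cost of a partition and checks the time constraint. The timing function is a deliberately simplified stand-in for Algorithm 1, not a reproduction of it: it assumes that software nodes run one at a time on a single CPU, that hardware nodes start as soon as their predecessors finish, and that communication times are ignored. The function names and the predecessor-list representation are hypothetical.

```python
from collections import deque

def cost(x, c_h):
    """Hardware cost of partition x (x[i] == 1 means node i is in hardware)."""
    return sum(ch for xi, ch in zip(x, c_h) if xi == 1)

def total_time(x, t_s, t_h, preds):
    """Simplified schedule length (an assumption, not Algorithm 1):
    one CPU serializes software nodes, hardware nodes run concurrently,
    and communication times are ignored."""
    n = len(x)
    succs = [[] for _ in range(n)]
    indeg = [len(preds[i]) for i in range(n)]
    for i in range(n):
        for j in preds[i]:
            succs[j].append(i)
    ready = deque(i for i in range(n) if indeg[i] == 0)
    finish = [0] * n
    cpu_free = 0  # time at which the CPU becomes idle
    while ready:
        i = ready.popleft()
        start = max((finish[j] for j in preds[i]), default=0)
        if x[i] == 1:
            finish[i] = start + t_h[i]
        else:
            start = max(start, cpu_free)
            finish[i] = start + t_s[i]
            cpu_free = finish[i]
        for k in succs[i]:
            indeg[k] -= 1
            if indeg[k] == 0:
                ready.append(k)
    return max(finish)

def is_feasible(x, t_s, t_h, preds, time_req):
    """Time constraint of (IP): total executing time must not exceed TimeReq."""
    return total_time(x, t_s, t_h, preds) <= time_req
```

For a four-node diamond-shaped DAG, moving the two middle nodes to hardware shortens the schedule at the price of their hardware costs, which is the trade-off that (IP) captures.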

The Adaptive Penalty Function
In this section, we introduce the basic form of the penalty function for constrained optimization problems and present an adaptive penalty function that converts problem (IP) into an unconstrained binary optimization problem.

The Basic Penalty Function.
A variety of penalty functions have been proposed to deal with constraints. Of them, the exact penalty function and the quadratic penalty function are the most basic and widely used [23][24][25]. Using these penalty functions, problem (IP) can be converted into the following unconstrained problem:

$$\text{(UIP1)} \quad \min_{x \in \{0,1\}^n} L(x, \mu, \alpha) = f(x) + \mu \, [g(x)]^{\alpha}, \tag{1}$$

where $f(x)$ is the objective function of (IP), $g(x) = \max\{0, T(x) - \text{TimeReq}\}$, $\mu$ is the penalty parameter, and $\alpha \ge 1$. For $\alpha = 1$ (resp., $\alpha = 2$), the optimization method that uses (1) is known as the exact (resp., the quadratic) penalty function method.
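The penalty construction can be sketched in a few lines, assuming the usual form in which the cost is increased by the penalty parameter times the constraint violation raised to a power (power 1 for the exact penalty, power 2 for the quadratic penalty). The function and argument names are hypothetical.

```python
def violation(total_time, time_req):
    """Amount by which the schedule exceeds the required time (0 if feasible)."""
    return max(0, total_time - time_req)

def penalized_cost(cost, total_time, time_req, mu, alpha=1):
    """Cost plus mu * violation**alpha; alpha=1 is the exact penalty,
    alpha=2 the quadratic penalty."""
    return cost + mu * violation(total_time, time_req) ** alpha
```

With a required time of 10, a feasible partition of cost 15 and time 8 keeps the value 15, whereas an all-software partition of cost 0 and time 18 receives 0 + 50 * 8 = 400 when the penalty parameter is 50. With a penalty parameter of 1 the same infeasible partition would score only 8 and would beat the feasible one, which illustrates why the choice of this parameter matters.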

The Proposed Adaptive Penalty Function.
It is a well-known fact that the penalty parameter $\mu$ in (1) is sensitive and problem-dependent. Hence, it is difficult to choose a value for $\mu$ [26] that can be used for a wide range of optimization problems. Therefore, a number of authors [27][28][29][30][31] have suggested parameter-free penalty functions to do away with the sensitive parameter $\mu$. Following the same reasoning, we propose a parameter-free adaptive penalty function, given in (2), which depends on an upper bound $U$ on the global minimum value of problem (IP). Since the cost of the system does not exceed $\sum_{i=1}^{n} c_i^h$, the total hardware cost of all nodes, we can initialize $U = \sum_{i=1}^{n} c_i^h$. The value of $U$ needs to be updated by the current best known objective function value at a feasible solution.
Based on this adaptive penalty function, we construct the unconstrained problem (UIP2), which minimizes the adaptive penalty function (2) over all binary vectors.

Case (i). From (2) together with (4), (5), and (6), it is obvious that $x^*$ is a discrete global minimizer of problem (UIP2).

The Proposed Memetic Algorithm for HW/SW Partitioning
Memetic algorithms (MAs) are a kind of global search technique derived from Darwinian principles of natural evolution and Dawkins' notion of memes [32]. They are genetic algorithms that use local search procedures to intensify the search [33]. MAs have been applied to solve a variety of optimization problems [34][35][36][37][38][39][40].
In this section, a tabu search-based memetic algorithm, TSMA, is presented to solve the parameter-free unconstrained problem (UIP2). First, a general framework of TSMA is presented, followed by detailed descriptions of various components of the algorithm.

General Framework of the Hybrid Memetic Algorithm.
The steps of the general framework of TSMA for solving (UIP2) are shown in Algorithm 2. Throughout its execution, the TSMA updates the upper bound $U$ with the function value at the current best solution $x^*$.
Before the TSMA begins its iterative process, it performs three main steps; see Algorithm 2. In step 1, it initializes the first upper bound $U$ of (UIP2) with the known trivial solution where $x_i = 1$, $i = 1, 2, \ldots, n$. In step 2, it generates the initial random population set $P$ of size $p$; that is, $P = \{x^1, \ldots, x^p\}$, where each $x^j$, $j = 1, 2, \ldots, p$, is a binary vector of length $n$. In step 4, the TSMA refines each member of the initial population by the local search, that is, the tabu search, and treats the refined $P$ as the starting population set.
As the iteration process begins, the TSMA repeatedly performs three main steps: crossover followed by tabu search, path relinking, and the updating of $P$. In step 10, a new solution $x^0$ is identified using the crossover operation followed by tabu search. In step 17, the path relinking procedure may be applied to $x^0$ in an attempt to generate an improved solution $x$ from $x^0$. Finally, in step 22, the population updating process is performed to decide whether $x$ should be inserted into $P$ and which solution should be replaced, provided that such an improved $x$ has been generated. The remaining steps of Algorithm 2 are used to update $U$ and $x^*$. This iterative process repeats itself until some stopping conditions are met.
The main iterative components of TSMA, the crossover followed by the tabu search procedure, the path relinking procedure, and the population updating strategy are explained in the subsequent subsections.
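The iteration described above can be summarized by the following skeleton. It is only a sketch of the control flow: the crossover, local search, path relinking, and population updating components are passed in as callables, and every name and signature here is hypothetical rather than taken from Algorithm 2.

```python
import random

def memetic_loop(population, obj, crossover, local_search, relink, update, max_gen):
    """Skeleton of one possible reading of the main loop:
    crossover + local search, optional path relinking, population update."""
    best = min(population, key=obj)
    upper = obj(best)                      # upper bound on the optimum
    for _ in range(max_gen):
        a, b = random.sample(population, 2)
        x0 = local_search(crossover(a, b))
        if obj(x0) < upper:                # new best solution found
            upper, best = obj(x0), x0
        # relink only when the offspring differs from the current best
        x = x0 if x0 == best else relink(x0, best)
        if obj(x) < upper:
            upper, best = obj(x), x
        update(population, x)              # quality/diversity management
    return best, upper
```

Passing the components as arguments keeps the skeleton independent of how each component is realized, so that, for instance, a different local search can be substituted without touching the loop.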

Crossover Operator.
We adopt the fixed crossover operator [41] to generate a new offspring. First, a pair of individuals is selected randomly from the population. Next, at every position where the two selected individuals have the same bit value, their offspring inherits it. Otherwise, the offspring takes the value 0 or 1 randomly at that position. The pseudocode of the crossover operator, applied to the two selected individuals, is given in Algorithm 3.
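The crossover rule described above can be sketched as follows. The function name and the optional `rng` argument are hypothetical; the rule itself (inherit agreeing bits, randomize disagreeing bits) follows the description in the text.

```python
import random

def crossover(parent_a, parent_b, rng=random):
    """Offspring inherits bits on which the parents agree;
    every other bit is drawn uniformly from {0, 1}."""
    child = []
    for a, b in zip(parent_a, parent_b):
        if a == b:
            child.append(a)
        else:
            child.append(rng.randint(0, 1))
    return child
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing parameter settings.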
Algorithm 2 (steps 7 to 20).
…
(7) end if
(8) end for
(9) while the termination criteria (see Section 4.6) are not met do
(10) Use the crossover operator (see Section 4.2) to generate a new offspring $x^0$, and apply the local search procedure to refine $x^0$; denote the refined solution again by $x^0$.
(11) if $F(x^0) < U$ then
(12) Let $U = F(x^0)$ and $x^* = x^0$.
(13) end if
(14) if $x^0 = x^*$ then
(15) $x = x^0$.
(16) else
(17) Apply the path relinking procedure PR (see Section 4.4) to the solution $x^0$ and the current best solution $x^*$ to obtain a solution $x$.
(18) end if
(19) if $F(x) < U$ then
(20) Let $U = F(x)$ and $x^* = x$.
…

Here $F$ denotes the objective function of (UIP2) and $U$ the current upper bound.

Tabu Search Procedure.
Tabu search (TS) is designed to keep the search from being trapped in a local minimizer and to prevent revisiting recently generated solutions [46]. The neighborhood structure and the tabu tenure with which TS is implemented are fundamental to achieving the above objectives. A new solution $x^0$ is generated from the existing solution $x$, within a defined neighborhood of $x$, by the operation called a "move." The solution $x^0$ is not generated again for a number of iterations (the tabu tenure).
The tabu search procedure presented here differs mainly in solution generation. While we preserve the tabu tenure feature of TS, we do away with the neighborhood structure in generating $x^0$ from $x$. The solution generation procedure has been suggested here to conform to the problem at hand and to make the search greedier. In particular, a sequence of flips (each flip returns a solution) on the (allowed) components of $x$ is carried out. The best solution from the sequence is considered as $x^0$. More specific to the problem considered, a flip gain $\mathrm{gain}(i, x)$ of a node $i$ (node $i$ corresponds to the variable $x_i$) is defined as follows:

$$\mathrm{gain}(i, x) = F(x) - F(x_1, \ldots, x_{i-1}, 1 - x_i, x_{i+1}, \ldots, x_n), \tag{11}$$

where $F$ denotes the objective function of problem (UIP2). A positive gain means that the objective value of problem (UIP2) decreases if the component (variable) $x_i$ is flipped from its current value to its complement; that is, if $x_i = 1$, then its value is changed to $x_i = 0$, and vice versa. For each variable $x_i$ we keep an associated tabu tenure $t_i$ which prevents $x_i$ from being flipped again until $t_i$ diminishes ($t_i \le 0$). Hence, the $i$th gain ($i = 1, 2, \ldots, n$) is calculated using (11) when it is permitted by the tabu tenure or when some aspiration criterion is satisfied.
Algorithm 4 (steps up to 12).
…
Let $x^0_k = 1 - x^0_k$. Rename the solution corresponding to this change as $x^0$ again. Calculate the new tabu tenure by (12) and assign it to $t_k$.
(6) if $F(x^0) < F(x^{\text{best}})$ then
(7) Let $x^{\text{best}} = x^0$.
(8) end if
(9) $\mathit{iter} = \mathit{iter} + 1$.
(10) for $i = 1, \ldots, n$ do
(11) if $t_i > 0$ then
(12) Let $t_i = t_i - 1$.
…

At the beginning of the tabu search procedure, the tenures $t_i$ ($i = 1, 2, \ldots, n$) are initialized by the user, and if $x_i$ is flipped the corresponding $t_i$ is updated by the rule (12), as suggested by Cai et al. [42]. The updating rule (12) is called the feedback mechanism in Cai et al. [42], as this updating prevents $t_i$ from being too large or too small. We have proposed a simple aspiration criterion that permits a variable to be flipped in spite of being tabu if it leads to a solution better than the best solution $x^{\text{best}}$ produced by the tabu search procedure thus far. A number of parameters need to be initialized. These are the maximum number of iterations, $\mathit{MaxIter}$, and the initial values $t_i$ ($i = 1, 2, \ldots, n$). The tabu search procedure we have suggested here is summarized in Algorithm 4. Notice that the index $k$ chosen in step 4 is such that either node $k$ is not in the tabu list ($t_k = 0$) or the aspiration criterion is met (which is a positive gain). The tabu search procedure stops when the maximum number $\mathit{MaxIter}$ of iterations is reached and returns the best solution found in the process.
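A flip-based tabu search in the spirit of the description above might look as follows. This is a minimal sketch under stated assumptions: a fixed tenure replaces the feedback rule (12), whose exact form is not reproduced here, the objective is any callable on binary lists, and all names are hypothetical.

```python
def tabu_search(x, obj, max_iter, tenure=3):
    """Greedy one-flip tabu search on a binary vector (minimization).
    Assumption: a constant tenure is used instead of the feedback rule."""
    n = len(x)
    cur = list(x)
    best, best_val = list(cur), obj(cur)
    tabu = [0] * n                       # remaining tabu iterations per variable
    for _ in range(max_iter):
        chosen, chosen_val = None, None
        for i in range(n):
            cur[i] = 1 - cur[i]          # tentatively flip variable i
            val = obj(cur)
            cur[i] = 1 - cur[i]          # undo the flip
            # a flip is allowed if it is not tabu, or if it beats the best
            # solution found so far (aspiration criterion)
            allowed = tabu[i] == 0 or val < best_val
            if allowed and (chosen is None or val < chosen_val):
                chosen, chosen_val = i, val
        tabu = [t - 1 if t > 0 else 0 for t in tabu]   # age the tabu list
        if chosen is None:               # every move is tabu in this iteration
            continue
        cur[chosen] = 1 - cur[chosen]    # perform the highest-gain flip
        tabu[chosen] = tenure
        if chosen_val < best_val:
            best, best_val = list(cur), chosen_val
    return best, best_val
```

Choosing the flip with the smallest resulting objective value is the same as choosing the flip with the largest gain, since the gain is the current value minus the value after the flip.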

Path Relinking (PR) Procedure.
The path relinking technique was originally proposed by Glover et al. [47] to explore possible trajectories connecting high quality solutions obtained by heuristics. It has been applied successfully to solve a number of optimization problems such as data mining [48], unconstrained binary quadratic programming [49], capacitated clustering [50], and the multiobjective knapsack problem [51].
The PR procedure presented here explores solution trajectories using two good solutions. Within the iteration process of TSMA, there are always two such solutions: the current best solution $x^*$ and the solution $x^0$ produced by the tabu search procedure. When $x^* \ne x^0$, PR is executed in an attempt to improve the current best solution $x^*$. At the beginning, PR identifies the variables whose values differ in $x^0$ and $x^*$. The indices of these variables are then kept in a set $D$. PR then starts its iteration process. For each index in $D$ an iteration is executed. PR stops if an improved $x^*$ has been found during an iteration.
An iteration starts with identifying the index $k$ corresponding to the highest gain; that is, $k = \arg\max\{\mathrm{gain}(i, x^0) \mid i \in D\}$. Let the solution corresponding to the flipping of $x^0_k$ be $x^0$ again. The following tasks are then performed to see (a) whether PR has obtained a new best solution or (b) whether PR is approaching the current $x^*$. PR then proceeds with the next iteration for the next index in $D$. The pseudocode of the PR is illustrated in Algorithm 5.
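The walk from one solution towards the other can be sketched as below. Two points are assumptions rather than details taken from Algorithm 5: the walk stops one flip short of the guiding solution (the last flip would merely reproduce it), and the guiding solution itself is returned when no improvement is found. All names are hypothetical.

```python
def path_relinking(x0, x_best, obj):
    """Walk from x0 towards x_best, flipping at each step the differing
    variable with the highest gain; stop early on an improvement."""
    cur = list(x0)
    target_val = obj(x_best)
    diff = [i for i in range(len(cur)) if cur[i] != x_best[i]]

    def value_after_flip(i):
        cur[i] = 1 - cur[i]
        val = obj(cur)
        cur[i] = 1 - cur[i]              # leave cur unchanged
        return val

    while len(diff) > 1:
        # highest gain == smallest objective value after the flip
        k = min(diff, key=value_after_flip)
        cur[k] = 1 - cur[k]
        diff.remove(k)
        if obj(cur) < target_val:
            return list(cur)             # improved solution found on the path
    return list(x_best)                  # assumption: fall back to x_best
```

Because only variables on which the two solutions disagree are flipped, every intermediate point lies between them in Hamming distance, which is what makes the procedure an intensification step.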

Population Updating Strategy.
A combination of intensification and diversification is essential for designing high quality hybrid heuristic algorithms [52,53]. We have presented how the TSMA implements its search intensification strategies via the tabu search procedure and PR; we now present its search diversification strategies. This is achieved by updating the population set $P$ with the solution $x$ obtained by the tabu search procedure or PR. In this subsection, $x$ will be referred to as the trial solution.

Algorithm 5 (PR).
Require: An initiating solution $x^0$, the current best solution $x^*$.
Ensure: A solution $x$.
…
Break.
(13) else
(14) $k = \arg\max\{\mathrm{gain}(i, x^0) \mid i \in D\}$.
…
Like other population-based algorithms [35,38], the TSMA uses its population updating to cater for both the quality and the diversity of the members of $P$. A measure is needed to decide whether the trial solution $x$ should be inserted into $P$ and, if so, which solution it should replace. To this end, a measure $h(x, P)$, given in (13), is suggested when $x \notin P$. In (13), $x^* \in P$ is the current best solution and $\beta > 0$ is a parameter which balances the relative importance of the solution quality and the diversity of the population; the measure also involves a distance $d(x, P)$ between $x \notin P$ and the population.
In the updating process, we want to achieve two objectives: solution quality and population diversity. At early stages of TSMA, points in $P$ are scattered and diversely distributed. Therefore, further diversification is not needed. Hence, the decision whether $x$ enters $P$ and which solution it replaces should be based on the contribution of the function value $F(x)$ in $h(x, P)$ and not purely on the location of $x$ in the solution space with respect to the points in $P$ (i.e., the density of the points $\{x\} \cup P$). This holds in (13) when the function value term is likely to dominate the distance term in $h(x, P)$. At later stages of TSMA, points in $P$ are likely to be clustered together. Hence, diversification of the points in $P$ is needed. In that situation $F(x) - F(x^*)$ is likely to be small (since $x$ is more likely to be close to $x^*$), ensuring less contribution from it and more from the distance term in $h(x, P)$.
With the functional form (13) of $h(x, P)$, we now devise a strategy to achieve both objectives. Central to this strategy is the comparison of $p + 1$ measure values, namely, $\{h(x^i, \hat{P}_i), h(x, P)\}$, $\hat{P}_i = (P \setminus \{x^i\}) \cup \{x\}$, $i = 1, 2, \ldots, p$. Here $h(x^i, \hat{P}_i)$ (resp., $h(x, P)$) represents a measure with respect to the function value at $x^i$ (resp., the function value at $x$) and its distance (sparsity) with respect to $\hat{P}_i$ (resp., the sparsity of $x$ with respect to $P$). The point with the minimum measure is then identified; denote it by $x^{\min}$. If $x^{\min} \ne x$, then $x$ is inserted into $P$ and $x^{\min}$ is deleted from $P$. On the other hand, if $x^{\min} = x$, then $P$ is updated with $x$ only with a small probability. This is because inclusion of $x$ in $P$ with probability one will reduce the diversity of $P$. In this case a member of $P$ is replaced by $x$ to increase the diversity in $P$. The pseudocode for the population updating procedure is presented in Algorithm 6.

Algorithm 6 (population updating).
…
(7) $x$ is inserted into $P$, and $x^{\min}$ is deleted from $P$.
(8) else
(9) if rand(0, 1) is less than the small acceptance probability then
(10) Let …
Finally, we present the time required to update the set $P$. It needs $O(n)$ to calculate the distance between $x$ and $x^i \in P$. The distance $d(x, P)$ can be calculated in time $O(pn)$, and it needs $O(p^2 n)$ to calculate $h(x^i, \hat{P}_i)$ for all $i$. Moreover, it takes $O(p)$ time to find the solutions with the smallest and the second smallest values in $\{h(x, P)\} \cup \{h(x^i, \hat{P}_i)\}$, $i = 1, \ldots, p$. Therefore, the total running time of the population updating procedure is bounded by $O(p^2 n)$.
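The operation counts above can be made concrete with a small sketch. It assumes that the distance between two binary solutions is the Hamming distance and that the distance from a solution to a population is the minimum over its members; both are illustrative assumptions, since the defining formulas are not reproduced here, and all names are hypothetical.

```python
def hamming(x, y):
    """Number of positions in which two binary vectors differ: O(n)."""
    return sum(a != b for a, b in zip(x, y))

def distance_to_population(x, population):
    """Assumed definition: minimum Hamming distance to any member, O(p*n)."""
    return min(hamming(x, y) for y in population)

def leave_one_out_distances(x, population):
    """Distance of each member to the population in which that member has
    been swapped for the trial solution x: p calls of O(p*n), so O(p^2 * n)."""
    result = []
    for i, member in enumerate(population):
        others = population[:i] + population[i + 1:] + [x]
        result.append(distance_to_population(member, others))
    return result
```

The nested structure of `leave_one_out_distances` is where the quadratic dependence on the population size comes from, which is why larger populations raise the cost of every update.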

Termination Criteria.
In this paper, we use two criteria to stop TSMA. If the maximum number of generations $G_{\max}$ is reached, then we stop the algorithm. On the other hand, if the best solution has not improved for $G_{\mathrm{no}}$ consecutive generations, then we also stop the algorithm. Therefore, the TSMA stops when either of these two criteria is met.
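The two stopping rules can be combined in a single loop, as in the sketch below. Here `step` stands for one generation and returns the best objective value it found; this callable and the other names are hypothetical.

```python
def run_generations(step, max_gen, no_improve_limit):
    """Run until max_gen generations have elapsed or the best value has not
    improved for no_improve_limit consecutive generations."""
    best = float("inf")
    stale = 0                            # generations since last improvement
    for gen in range(1, max_gen + 1):
        value = step(gen)
        if value < best:
            best, stale = value, 0
        else:
            stale += 1
        if stale >= no_improve_limit:
            return best, gen             # stopped by stagnation
    return best, max_gen                 # stopped by the generation limit
```

The function returns both the best value and the generation at which the run stopped, which makes it easy to see which of the two criteria was active.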

Computational Experiments
In this section, we present experiments performed in order to evaluate the performance of TSMA. The benchmark test instances used are introduced, followed by the parameter settings for TSMA. Finally, computational results, comparisons, and analyses are reported. The algorithm is coded in C and run on a PC with a 2.11 GHz AMD processor and 1 GB of RAM.

Test Instances.
Task graphs for free (TGFF) [54] can create directed acyclic graphs (DAGs), that is, task graphs. Given identical parameters and input seeds, it generates identical task graphs. We have used TGFF (Windows version 3.1) to generate four DAGs as hardware/software partitioning test instances. These DAGs were also used to test the performance of the artificial immune algorithm (ENSA-HSP) [16]. TGFF (Windows version 3.1) can be obtained at the website (http://ziyang.eecs.umich.edu/~dickrp/tgff/), and the parameters are given in the Appendix. Since these DAGs do not contain cycles, all nodes in each DAG are executed only once; that is, the execution count of every node equals 1. The characteristics of these test instances are given in Table 1. These are the time when all nodes are implemented by software (TimeSw), the time when all nodes are implemented by hardware (TimeHw), the required time (TimeReq), and the cost when all nodes are implemented by hardware (CostHw).

The Parameter Settings.
The tabu search parameter values used here are different from those used by Cai et al. [42] for unconstrained binary quadratic programming. Justifications for these choices are that the tabu search procedure proposed here is different from that of [42], and the scale of the test problems used in this paper is smaller than that of the test problems used in [42]. Moreover, we have chosen these values based on our numerical experiments. An important parameter of TSMA is the size $p$ of the population set $P$. The bigger the value of $p$ is, the higher the CPU time is. This is because $O(p^2 n)$ operations are needed to update $P$. The parameter $\beta$ in (13) is used by TSMA to diversify $P$, and as such it has an important influence on the performance of TSMA. We have chosen the values of these two parameters after some numerical experiments.
We first conduct a number of preliminary experiments using test instance number 3, that is, the instance with 90 nodes. We have fixed the tabu tenures $t_i = 10$, $i = 1, \ldots, n$, the maximum number of tabu search iterations $\mathit{MaxIter} = 1.5 \times n$, $G_{\max} = 200$, and $G_{\mathrm{no}} = 20$, and run TSMA with various values of $p$ and $\beta$. This experiment was conducted to get suitable ranges for both parameters.
With the values of $t_i$, $\mathit{MaxIter}$, $G_{\max}$, and $G_{\mathrm{no}}$ chosen as above, we have then conducted another series of runs of TSMA using the test instance with 90 nodes. We have used $p \in \{10, 20, 30, 40\}$. For each value of $p$, six values of $\beta$, that is, $\beta \in \{0.001, 0.003, 0.005, 0.008, 0.01, 0.012\}$, are used to constitute $(p, \beta)$ pairs. Hence, there are 24 $(p, \beta)$ pairs. For each pair, TSMA was run 5 times, making a total of 120 runs. The average results of 5 runs on each $(p, \beta)$ pair are reported in Table 2.
The results in Table 2 show that $p = 20$ and $\beta = 0.008$ are suitable values to choose. Hence, we have used these values throughout the numerical experiments. We present all parameter values used in TSMA in Table 3, where column 2 lists the subsections in which the corresponding parameter has been discussed.
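The size of the tuning experiment described above (four population sizes, six values of the diversity parameter, five runs per pair) can be checked by enumerating the grid. The variable names are hypothetical.

```python
from itertools import product

population_sizes = [10, 20, 30, 40]
diversity_weights = [0.001, 0.003, 0.005, 0.008, 0.01, 0.012]
runs_per_pair = 5

# every combination of population size and diversity weight
pairs = list(product(population_sizes, diversity_weights))
total_runs = len(pairs) * runs_per_pair
```

This gives 4 × 6 = 24 pairs and 24 × 5 = 120 runs, in agreement with the counts stated in the text.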

Numerical Comparisons.
In order to show the effectiveness of TSMA, we compare its results with those of the evolutionary negative selection algorithm (ENSA-HSP) [16] and a traditional evolutionary algorithm (EA) [16]. TSMA was run 20 times on each test instance in Table 1; hence, average results are used in the comparison. Results of ENSA-HSP and EA have been taken from [16]. Comparisons of results are presented in Table 4, where the columns under the captions "best," "mean," and "worst" report the best cost, the average cost, and the worst cost, respectively. The column under the caption "time" reports the average CPU times for TSMA.
We are unable to present the CPU times for the other two algorithms, since the CPU times of EA and ENSA-HSP were not reported in [16] and obtaining them would require running these algorithms ourselves. The last column of Table 4 reports the percentage of improvement on the best results by TSMA. The results obtained by all three algorithms on the first 4 test instances are further summarized in Figures 2, 3, 4, and 5. Figures 3-5 clearly show the superiority of TSMA over the other two algorithms. Wide differences in costs are noticeable in these figures. Indeed, it can be seen from Table 4 that the TSMA has achieved costs that are lower by 84, 65, and 176 than the corresponding previous best obtained by ENSA-HSP for instance numbers 2, 3, and 4, respectively. These improvements are quite significant.

Effectiveness of the Adaptive Penalty Function.
In order to show the effectiveness of our proposed parameter-free adaptive penalty function, we have replaced the adaptive penalty function in (2) with the exact penalty function $L(x, \mu, \alpha)$ ($\alpha = 1$) in (1) and kept the other ingredients of TSMA, including the parameter values in Table 3, intact. We denote this implementation of TSMA by TSMA2. Since the penalty parameter $\mu$ in $L(x, \mu, \alpha)$ is problem-dependent, we took $\mu = 50$, 100, and 200 and ran TSMA2 20 times on each of the test instances for each value of $\mu$. Average results including average CPU times (in seconds) are summarized in Table 5. A number of observations and comparisons using the results in Table 4 are now presented below.
(1) Dependency on $\mu$: it can be clearly seen in Table 5 that the results of TSMA2 depend on the value of $\mu$. To give an example, in terms of the "mean" value, the best result was obtained for $\mu = 200$ on instance number 2, while, for instance numbers 3 and 4, the best "mean" values were obtained using $\mu = 50$.
(2) Comparison using "mean": the TSMA performs better than TSMA2 for all instances, except for instance number 1, for which both perform the same.
(3) Comparison using "best": the TSMA performs better than TSMA2 on the last instance, while both are equally matched on the remaining instances. TSMA2 obtained these results for instances 1, 2, and 3 for $\mu = 50$.
A statistical view in terms of boxplots is now shown in Figures 6, 7, and 8 using the optimal values obtained by TSMA and TSMA2 over 20 runs. Three figures are presented since the standard deviation is zero for instance number 1. These boxplots show the ranges over which the best results obtained over 20 runs vary. All three figures clearly show that the boxplots of TSMA have less height than the corresponding boxplots obtained by TSMA2 for the three values of $\mu$. This supports the superiority of the adaptive penalty function presented in (2) over the exact penalty function (1).

Conclusion
In this paper, we have presented a memetic algorithm for the hardware/software partitioning problem. The algorithm has three main components: a local tabu search, a path relinking procedure, and population updating. The local tabu search is a "controlled local search" procedure used for search intensification. The path relinking procedure provides further exploitation of solutions, and population updating is used for diversification of solutions. Motivations for each of these components within the proposed memetic algorithm are provided. The hardware/software partitioning problem is converted into an equivalent unconstrained binary optimization problem by employing a parameter-free adaptive penalty function. We have motivated this penalty function first theoretically and then numerically, by comparing its performance with that of the exact penalty function. The memetic algorithm is then applied to this unconstrained problem, and its robustness with respect to the quality of the optimal solution is shown by comparing it with two other algorithms from the literature.

Appendix
period_mul 1
task_degree 3 3
tg_cnt 1
task_cnt 31 1
task_type_cnt 30
trans_type_cnt 30
table_cnt 1
table_label TRANS_TIME
type_attrib time 10 10 1 1
trans_write
table_cnt 1
table_label
…
… DAG with 60 nodes, 90.tgffopt is for the DAG with 90 nodes, 120.tgffopt is for the DAG with 120 nodes, 300.tgffopt is for the DAG with 300 nodes, and 500.tgffopt is for the DAG with 500 nodes. See Algorithm 7.