Speeding Up FPGA Placement via Partitioning and Multithreading

One of the main current challenges of the FPGA design flow is the long processing time of the placement and routing algorithms. In this paper, we propose a hybrid parallelization technique for the simulated annealing-based placement algorithm of VPR developed in the work of Betz and Rose (1997). The proposed technique uses balanced region-based partitioning and multithreading. In the first step of this approach, placement subproblems are created by partitioning; they are then processed concurrently by multiple worker threads that run on multiple cores of the same processor. Our main goal is to investigate the speedup that can be achieved with this simple approach compared to previous approaches that were based on distributed computing. The new hybrid parallel placement algorithm achieves an average speedup of 2.5× using four worker threads, while the total wire length and circuit delay after routing are minimally degraded.


Introduction
In the design automation domain for field-programmable gate arrays (FPGAs), placement and routing are the most processing-time-intensive steps. The processing time problem has been somewhat alleviated by the advancements in processor speeds. However, the speedup of classic placement and routing algorithms obtained by using faster processors cannot keep pace with the rate at which FPGA complexity increases.
The switch to parallel microprocessors represents a cornerstone in the history of computing [1], and the current trend is to continuously increase the number of cores per chip [2][3][4][5][6][7]. Although parallel computing has been discussed for a long time [8,9], it is still challenging to find the parallelization technique and application transformation that maximize the benefit of parallelism. Concurrency is now a pervasive topic [10], and one of the problems of parallel computing is that there are too many solutions [11]. In the FPGA domain, distributed computing and multithreading have been used as parallelization techniques for achieving efficiency. While our primary goal in this paper is to develop an efficient and practical hybrid parallel implementation of the simulated annealing placement algorithm of VPR [12], we also hope that our study will contribute toward a better understanding of the general question of which parallelization technique best suits particular placement or routing algorithms.
In the next two sections we discuss related previous work and outline our main contribution. Then, we present details of the proposed parallel implementation, which is based on mincut partitioning and multithreading. Finally, we will present our experimental results, discussions, and conclusions.

Related Work
There have been considerable efforts to develop parallel implementations of placement algorithms in the FPGA and VLSI domains. Routing algorithms have received less attention partly due to the fact that routing has been taking less computational time and partly due to the fact that routing is a more complex step from a parallelization perspective. Traditionally, simulated annealing (SA) has been one of the most popular placement approaches in the FPGA domain. Hence, most of the previous work has focused on parallelization techniques for SA.

International Journal of Reconfigurable Computing
From an algorithmic point of view, previous techniques can be classified into move-acceleration [13,14] and parallel-moves-based [15][16][17] solutions. The parallel-moves approaches include (i) techniques using one copy of the main placement problem, (ii) techniques using multiple copies of the main placement problem, and (iii) techniques using placement subproblems of the main placement. The main limitations of these approaches are the degradation of the solution quality and the increase in memory usage due to duplication of the database, which may lead to slowdowns. A remarkable characteristic of the techniques proposed in [14] is that serial equivalence is preserved (the parallel version of the algorithm gives the same result as the sequential version), which can be very useful for debugging and replication purposes.
Based on the parallel computing paradigm that is employed for parallelization, previous approaches in the FPGA domain can be classified into distributed computing [18,19] and multithreading-based [14,20] solutions. The distributed approach has the disadvantage of slower interprocessor communication, which can diminish the benefits of parallelization, especially for situations with significant and high-frequency intertask communication.
Multithreading uses a different programming approach [21] that exploits the concurrency offered by multicore processors [2][3][4][5][6][7]. It needs only a multicore processor, which is readily available today and cheaper than a network of processors. Moreover, because all communication takes place within the same processor chip (via shared variables), multithreading is capable of achieving better speedups than distributed computing, which can suffer from network delays. Multithreading is simple and offers an alternative implementation that complements previously proposed distributed implementations. That is, one can design a combined parallel implementation using distributed networked multicore processors, which can locally run multithreaded tasks to achieve further speedups. Distributed computing is now a more mature field [22] and has been extensively researched and used in many applications [23,24]. Multithreading has also been used in applications such as circuit and transient simulation [25,26].
There have also been efforts to parallelize non-SA-based placement algorithms. A parallel version of an analytical placement algorithm for FPGAs is presented in [18,19] and is based on the negotiation paradigm. Other standard-cell parallel algorithms are reported in [27][28][29][30].

Contribution
In this paper, we use mincut partitioning and multithreading for speeding up the simulated annealing placement algorithm of VPR [12]. Our solution can be classified as a technique in category (iii) (using placement region-based subproblems) discussed in the previous section. The main difference (apart from using multithreading) between our implementation and previous region-based parallel solutions [16] is the fast mincut balanced partitioning that we use. This minimizes the number of nets with terminals in different partitions and, therefore, the number of dependencies between tasks. This allows us to process tasks concurrently and independently from each other. As a result, the final quality degradation is minimal and the overall speedup is better. Our main goal is to analyze how much speedup can be achieved using this technique, which requires minimal change to the code base of the sequential algorithm. Legacy algorithm implementations may be very complex, and thus, parallelization approaches that minimally modify them are desirable. Our implementation is the first one, in its category, to achieve better speedup using four threads, with less quality degradation. Our algorithm is intended to provide an alternative rather than a replacement to previous approaches. To this end, the main contributions of this paper are as follows.
(i) We propose a hybrid parallelization technique based on mincut partitioning and multithreading. We use hMetis [31], one of the best publicly available partitioning tools, to divide the main placement problem into tasks processed concurrently by different threads. This approach leads to better speedups with minimal degradation of the solution quality.
(ii) We are the first to report results for the largest new benchmarks of the latest VPR 5.0 package [32].
Preliminary results of this work were reported in a poster [20]. Here, we provide the details of our implementation and report additional results on larger test cases.

Mincut Partitioning and Multithreading-Based Parallel Placement
First, we review the classic simulated annealing-based placement for island-style FPGAs. This will help us to better introduce our ideas later in the paper.

Classic Simulated Annealing-Based Placement.
The classic simulated annealing algorithm [33] was motivated by an analogy to annealing in solids. This algorithm simulates the cooling process by gradually lowering the temperature of the system until it converges to a steady, frozen state. The major advantage of SA is the ability to avoid being trapped at local minima. It employs a random search, which accepts not only changes that decrease the objective function but also some changes that increase it. Simulated annealing has been applied successfully to the placement of both VLSI and FPGA circuits [12,34]. In both cases, the solution space exploration (going from one feasible solution to another) is achieved by performing moves. A move typically means swapping two cells or relocating only one cell. These moves are accepted with decreasing probability as the temperature is decreased gradually (Figure 1). During placement for island-style FPGAs, combinational logic blocks (CLBs) are swapped to explore new solutions. These swaps are restricted within a distance rlim between blocks. The control parameter rlim is decreased during the annealing process from a maximum value to the minimum of 1. This way, blocks that are located as far as the entire chip width or height from each other can be swapped at the beginning of the algorithm, while toward the end of the algorithm only adjacent blocks can be swapped (Figure 2).

Figure 2: Illustration of how rlim controls moves inside the SA-based placement algorithm of VPR [12]. Initially, at high temperatures, blocks far away from each other can be easily swapped. Finally, at low temperatures, only blocks close to each other can be swapped.
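As a rough illustration of the two mechanisms described above (probabilistic acceptance of moves and range-limited swaps), the following Python sketch may help. It is not taken from VPR: the function names `accept_move` and `pick_swap_target` are hypothetical, the acceptance rule is the standard Metropolis criterion that we assume here, and the grid is assumed to be larger than 1×1.

```python
import math
import random


def accept_move(delta_cost, temperature, rng):
    """Metropolis-style acceptance (assumed rule, not copied from VPR).

    Improving or neutral moves are always accepted; a move that increases
    the cost is accepted with probability exp(-delta_cost / temperature),
    so uphill moves become rarer as the temperature drops.
    """
    if delta_cost <= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(-delta_cost / temperature)


def pick_swap_target(x, y, rlim, width, height, rng):
    """Pick a location at most rlim away from (x, y) in each dimension.

    The window is clamped to the chip, and the block's own location is
    excluded. Assumes the grid has more than one location.
    """
    r = max(1, int(rlim))
    while True:
        nx = rng.randint(max(0, x - r), min(width - 1, x + r))
        ny = rng.randint(max(0, y - r), min(height - 1, y + r))
        if (nx, ny) != (x, y):
            return nx, ny
```

With a large rlim the candidate window covers most of the chip; with rlim equal to 1 only neighboring locations can be selected, which mirrors the behavior illustrated in Figure 2.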

Parallel Simulated Annealing Placement.
In this section we describe our new multithreading-based parallelization technique. The pseudocode of our algorithm is presented in Algorithm 1. The main placement problem is decomposed into multiple balanced regions using multilevel 4-way partitioning. These regions form tasks that are placed into a common queue. Then, the worker threads process these tasks in parallel and independently. The solution of each task is placed back into the corresponding task object from the queue. These results are then read in and assembled by the main manager, which runs on the main thread. Finally, the top-level solution is further improved by an ultrafast low-temperature annealing refinement step. We use multilevel 4-way partitioning for its simplicity and because it helps in achieving balanced tasks as subproblems that resemble the original top-level problem. This allows us to reuse the same sequential annealing function for task processing. Next, we describe in more detail the main steps of our technique.
Step 1 (Partitioning and Task Creation). The partitioning step is illustrated in Figure 3. Each partition is used to construct a smaller placement subproblem (that is, a task), which has input/output (IO) pins of the top-level initial placement as well as new IO pins that account for nets that cross the partition boundaries. These nets are the nets cut during the partitioning process and have terminals located in two or more different partitions. The IO pins of the top-level initial netlist are assigned fixed locations by VPR automatically, and they represent fixed nodes of the associated graph partitioned by hMetis. The new IO pins will represent fixed anchors (during the placement in Step 2) at the boundaries of the new placement subproblems (see right-bottom part of Figure 4). This is similar to the terminal propagation technique in standard-cell placement algorithms [35]. The location of these fixed anchors is established on the fly, as each subnetlist is initially randomly placed in the corresponding region of the FPGA. For example, after the level 1 4-way partitioning, a net with terminals in the upper left and upper right partitions will have an anchor on the vertical partitioning boundary. The anchor is placed at the boundary location closest to the center of gravity of the net. If the net has terminals in diagonally opposite partitions, then the anchor will be located at the center. This is illustrated in Figure 4, where, for example, the net N1 is split by the partitioning process into two subnets N11 and N12, representing new local nets for the corresponding placement subproblems. In our experiments we also tried using floating anchors, but the final quality of results was worse. We suspect that fixing the anchor points leads to better results because anchors act as attractors that pull terminals of the same net from different partitions toward the same fixed locations. In this way, the bounding box of the top-level net will be smaller at the end of Step 2.
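The anchor rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code of the modified VPR: the names `split_net` and `anchor_location` are hypothetical, quadrants are encoded as (column, row) pairs with values 0 or 1, anchors are returned as real-valued coordinates (a real tool would snap them to a legal location), and nets touching three or four quadrants are treated like the diagonal case, which the text does not specify.

```python
def split_net(net_blocks, block_quadrant):
    """Group the blocks of one net by quadrant.

    More than one group in the result means the net was cut by the
    partitioning and each group becomes a local subnet.
    """
    subnets = {}
    for block in net_blocks:
        subnets.setdefault(block_quadrant[block], []).append(block)
    return subnets


def anchor_location(terminals, quadrants, width, height):
    """Return a fixed anchor (x, y) for a cut net after one 4-way split.

    terminals: list of (x, y) terminal positions of the net.
    quadrants: set of (col, row) quadrants the net touches.
    """
    # Center of gravity of the net's terminals.
    cx = sum(x for x, _ in terminals) / len(terminals)
    cy = sum(y for _, y in terminals) / len(terminals)
    mid_x, mid_y = width / 2.0, height / 2.0
    cols = {c for c, _ in quadrants}
    rows = {r for _, r in quadrants}
    if len(cols) == 2 and len(rows) == 2:
        # Diagonal case (assumption: also used for 3 or 4 quadrants).
        return (mid_x, mid_y)
    if len(cols) == 2:
        # Left and right neighbors: project onto the vertical boundary.
        return (mid_x, cy)
    if len(rows) == 2:
        # Upper and lower neighbors: project onto the horizontal boundary.
        return (cx, mid_y)
    raise ValueError("net lies in a single quadrant and needs no anchor")
```

Each subnet returned by `split_net` would then receive the same anchor as an extra fixed IO pin, so that both subproblems pull their terminals toward one shared location.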
Using the hMetis partitioning algorithm minimizes the number of cut nets, which translates into a reduced number of anchor points. The main benefit of this approach is that it minimizes the required synchronization between tasks and allows the threads to run independently with a minimal negative impact on the final quality of results. The number of partitioning levels is determined by the size of the FPGA as well as the number of available worker threads to be run on different cores. If the multicore processor has, for example, four cores, then four independent worker threads can be launched and used to process four tasks concurrently. In this case, for the example shown in Figure 3, each of the four threads will end up processing four tasks. However, increasing the number of partitions beyond the number of available cores may improve the load balancing among threads but will also result in lower quality of results, because simulated annealing is restricted to a smaller area in each task. This partitioning step is performed by the main thread, which is also responsible for the creation and launching of the worker threads, using a manager-worker multithreading strategy described in detail in [21,36]. Because hMetis is very fast, the processing time of this step is usually less than 1% of the processing time of the sequential VPR placement tool.

Figure 5: Illustration of Step 2. Sixteen tasks (T1 to T16) are processed concurrently by four worker threads (see also Figure 3). This step exploits parallelism to achieve overall speedup. The wall-time diagram compares sequential placement with partitioning and multithreading placement, which consists of Step 1 (main thread: partitioning and task creation), Step 2 (worker threads: parallel task processing), and Step 3 (main thread: ultrafast SA).
Step 2 (Multithreading-Based Parallel Placement). The result of the previous step is a list of tasks stored in a queue data structure, where each task represents a placement subproblem corresponding to a different region of the initial top-level FPGA. Note that the task objects stored in this queue represent the so-called shared variables in the terminology commonly used in [21,36]. In the second step of our technique, the worker threads pick out and process these tasks concurrently until all tasks are exhausted. Every worker thread performs the simulated annealing described in Figure 1, but this time on placement subproblems of smaller size than the initial top-level FPGA placement. The result of each thread is deposited back in the task object, which is marked as placed. It is in this step that parallelism by multithreading is exploited. This step is illustrated in Figure 5. In our implementation, during this step the main thread also retrieves the placement result from each task that is marked as being finished. The result of each task is copied to the top-level data structure that represents the top-level FPGA placement. That is, the location of each block from each placement subproblem is mapped back onto the corresponding location on the main initial top-level FPGA chip. In our experiments, the processing time of this step is usually 25% of the processing time of the sequential VPR placement tool.
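The manager-worker structure of this step can be sketched as follows. The sketch is in Python for brevity and only shows the control flow; it is not the C++ code of the modified VPR. The `Task` class, `place_region`, and `parallel_place` are hypothetical names, `place_region` is a trivial stand-in for the sequential annealer, and, as a simplification, the main thread collects results only after all workers have finished rather than while they run. Note also that CPython threads do not execute CPU-bound code in parallel, so this sketch illustrates the pattern rather than the speedup.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class Task:
    """One placement subproblem (hypothetical structure)."""
    region_origin: tuple            # (x0, y0) of the region on the chip
    blocks: list                    # block names assigned to the region
    local_placement: dict = field(default_factory=dict)
    placed: bool = False


def place_region(task):
    """Stand-in for the sequential annealer: lines blocks up in a row."""
    return {b: (i, 0) for i, b in enumerate(task.blocks)}


def worker(task_queue):
    """Pick tasks from the shared queue until it is exhausted."""
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return
        task.local_placement = place_region(task)
        task.placed = True          # result is deposited in the task object


def parallel_place(tasks, num_threads=4):
    """Manager: enqueue tasks, run workers, map results to the top level."""
    task_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)
    threads = [threading.Thread(target=worker, args=(task_queue,))
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    top_level = {}
    for task in tasks:
        x0, y0 = task.region_origin
        for block, (x, y) in task.local_placement.items():
            top_level[block] = (x0 + x, y0 + y)  # local to chip coordinates
    return top_level
```

Because every task owns its own region and its own result fields, the workers need no locking beyond what the queue already provides, which reflects the independence between tasks that the partitioning is meant to achieve.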
Step 3 (Low-Temperature SA Refinement). In the last step of our technique, the main thread runs the ultrafast low-temperature simulated annealing. The fast cooling scheme is realized by starting with a low initial temperature of 0.1, found by a set of calibration experiments over a set of representative test cases. This initial temperature offered a good tradeoff between speedup and degradation of solution quality. The cooling scheme is also controlled by the rlim parameter (see Figure 2). The cooling rate and the number of inner loop iterations are determined inside the modified algorithm similarly to the original VPR tool. The purpose of this sequential refinement step is to further improve the solution quality by exploring the moves that could not be explored during Step 2. These moves are restricted and involve mostly blocks located alongside the partition boundaries. During this step, nets that had terminals in different partitions will have their bounding boxes minimized (see left-bottom part of Figure 4).

Experimental Results
We implemented our technique using C++ by changing the VPR code base, which can be downloaded from [37]. We used the hMetis partitioning tool that can be downloaded from [38]. The modified VPR tool with our implementation can be downloaded from [39]. We introduced a new option called -mt place [int] which can be used in order to run our parallel implementation of the placement algorithm.
[int] specifies the desired number of worker threads to be created and used (its default value is equal to the number of cores detected on the current processor). The other options of the modified tool are the same as those of the original VPR tool. All experiments were performed using the VPR option of fixed IO pins. All our experiments were performed on a Linux machine with a 2.4 GHz Intel Quad processor and 2 GB of memory.

VPR 4.3 and VPR 5.0 Test Cases.
In this section we present experimental results obtained using our modified parallel VPR tool versus the standard sequential VPR for all twenty test cases of VPR 4.3 [37] as well as for the largest eleven benchmarks included in the latest VPR 5.0 package [32]. The FPGA architecture that we used is arch4 of the VPR package, which is used by the majority of previous works. It contains a mix of wire segments of length one, two, and six, as well as chip-width long wires. We ran our parallel VPR placement algorithm using four threads because our processor is an Intel Quad with four cores. Because of that, and because the test cases are not too big and the partitions obtained with hMetis are very well balanced, we used only level-one 4-way partitioning. The flow diagram of our experimental setup is shown in Figure 6. Each test case is processed using two different design flows. In the first design flow, each test case is placed using the proposed parallel VPR placement algorithm and then routed using the timing-driven sequential VPR routing tool. In the second design flow, each test case is placed using the standard sequential VPR placement algorithm and then also routed using the timing-driven sequential VPR routing tool. During each flow, we record the CPU runtime (processing time) and the wire length (WL) after the placement step and the wire length and the circuit delay after the routing step.
In order to have better confidence in our results, all test cases are run four times, corresponding to four different seeds of the random number generator (the seed value can be set using the available VPR options). The results are presented in Table 1 and are reported based on the averages of the four different runs. Due to space limitations, we report in Table 1 only the results obtained using the first design flow (which uses the proposed parallel VPR) and the improvement in runtime or degradation in terms of place WL, route WL, and route delay compared to the results obtained using the second design flow (which uses the sequential VPR). Because we do four different runs for each test case in both design flows, we also report the standard deviations (as a percentage of the corresponding mean) of the placement CPU, place WL, route WL, and route delay.
We observe that our new parallel VPR tool achieves an average speedup of 2.51×. The last three columns, under the label Degradation [%], report the degradation (as percentages) of the WL after placement and of the WL and circuit delay after routing. Note that in several cases the results obtained using the proposed parallel VPR are actually improved. In such cases, the results are reported as negative percentages in Table 1. The wire length after placement degraded on average by 2.89%, which translated into a 2.36% degradation of the wire length and a 3.2% degradation of the circuit delay after routing. The routing algorithm runtime remained the same.
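For readers who want to reproduce this kind of table from their own runs, the reported quantities can be computed along the following lines. This is a hedged sketch: the function names are hypothetical, the speedup is assumed to be the ratio of mean runtimes over the seeds, and the sample standard deviation is assumed, since the text does not say which estimator was used.

```python
import statistics


def speedup(seq_runtimes, par_runtimes):
    """Ratio of mean sequential runtime to mean parallel runtime."""
    return statistics.mean(seq_runtimes) / statistics.mean(par_runtimes)


def degradation_percent(seq_values, par_values):
    """Relative change of the parallel mean versus the sequential mean.

    Positive values mean the parallel flow is worse (larger WL or delay);
    negative values mean it is actually better.
    """
    seq_mean = statistics.mean(seq_values)
    return 100.0 * (statistics.mean(par_values) - seq_mean) / seq_mean


def std_percent_of_mean(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)
```

Each function takes the per-seed measurements of one test case (four values per flow in the setup described above), so one call per metric and per test case yields one table entry.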
We note an interesting trend: the speedup for individual test cases tends to increase proportionally with the circuit size. This is illustrated in Figure 7, which shows the speedup for all the test cases from Table 1. In this figure, the x-axis represents all the test cases ordered in nondecreasing order of the number of CLBs (i.e., circuit size).

Discussion.
While the idea of region-based partitioning as a parallelization technique is not new, the merit of our paper consists in the improved speedup and smaller degradation of the quality of results. In this paper, we investigated the speedup achieved using mincut partitioning as opposed to direct partitioning into vertical and horizontal strips [16]. Moreover, our implementation is based on multithreading and runs on the same chip rather than on distributed processors in a shared memory network architecture [16]. Our approach offers a better speedup and smaller quality degradation than previous region-based parallelization attempts. Because of the slight WL and delay degradation, our new modified VPR placement tool is intended to be used primarily for faster and better area and wire length estimations or as a faster placement solution when users are willing to sacrifice performance for runtime, as suggested in [40] and as demonstrated in Figure 8. In this figure, we plot the normalized wire length-runtime envelope curves for the old sequential and our new parallel VPR placement tools. Normalization is done with respect to the best sequential VPR wire length result, while runtime is controlled via the number of moves performed inside the inner loop of the annealing engine. Both curves contain five data points. The rightmost data point corresponds to the default states of the sequential and parallel VPR tools. The remaining data points, moving from right to left on both curves, are obtained by limiting the number of inner loop iterations to a fraction of only 0.5, 0.1, 0.05, and 0.001 of the number of inner loop iterations of the default state. We observe that the new modified VPR placement tool achieves better quality for a given runtime budget and, therefore, can offer more accurate and efficient estimations.
The quality of the final placement at the end of the ultrafast low-temperature SA depends not only on the initial temperature, the maximum distance between swapped cells (i.e., rlim), and the total number of moves attempted during the inner loop but also on the quality of the starting placement. The placement achieved using the combination of hMetis mincut partitioning and simulated annealing of placement subproblems represents a high-quality starting placement for the last ultrafast low-temperature SA step. It is this combination of techniques that leads to better speedup and smaller degradation in the quality of results compared to previous similar approaches.
Our current parallel implementation is applied only to the wire length/congestion-driven VPR placement algorithm. We are currently working on the timing-driven placement algorithm, which is more challenging because timing-critical paths can span multiple partitions. Handling such paths requires synchronization between worker threads, or between worker threads and the main thread, in order to maintain accurate top-level circuit delay information during the parallel processing of tasks. This additional required communication has a negative impact on the achievable speedup.

Related Work.
In this section, we qualitatively compare the proposed parallel VPR placement algorithm with previous parallel implementations of placement algorithms from the FPGA domain. We cannot do a direct comparison because none of the previous implementations is publicly available. Nevertheless, because the main figures of merit for evaluating a given parallel algorithm are the speedup and the degradation of the solution quality (compared to the sequential counterpart), which typically are reported as averages over a variety of test cases, a qualitative comparison is still possible. This comparison is presented in Table 2. We note that, among the simulated annealing (SA)-based approaches, the proposed parallel VPR placement achieves the best speedup of 2.5× on four cores. However, the proposed algorithm degrades the solution quality by 2.36% compared to the move acceleration-based SA from [14], which, however, is not as scalable, suffers from memory inefficiency, and requires considerable code change to the sequential algorithm. Among the implementations that use distributed computing as the parallelization paradigm, the analytical (not simulated annealing) placement from [18] offers one of the best speedup-solution degradation products.

Conclusion
In this paper, we implemented and studied a new parallelization technique for the simulated annealing-based FPGA placement algorithm of VPR. It is a hybrid technique that uses mincut partitioning and multithreading. The new parallel VPR placement tool achieves an average speedup of 2.5× using four threads on a four-core processor, while the total wire length and delay are degraded by about 3%.