A Hierarchical C2RTL Framework for Hardware Configurable Embedded Systems

Embedded systems are widely used in mobile computing applications. Mobility demands high performance under strict power budgets, which poses a big challenge for the traditional single-processor architecture. Hardware accelerators provide an energy-efficient solution but lack flexibility across applications. Hardware configurable embedded systems have therefore become a promising direction. For example, Intel recently announced a system-on-chip (SoC) product combining an Atom processor with an FPGA in one package (Intel Inc., 2011).


Introduction
The configurability places further demands on hardware design productivity and widens the existing gap between available transistor resources and design outcomes. To narrow this gap, the design community is moving to a higher abstraction level than the register transfer level (RTL). Compared with manual RTL design, the C-to-RTL (C2RTL) flow provides orders-of-magnitude improvements in productivity, better matching the features of modern SoC designs: extensive use of embedded processors, huge silicon capacity, reuse of behavioral IPs, wide adoption of accelerators, and growing time-to-market pressure. A rapidly rising demand for high-quality C2RTL tools has recently been observed (Cong et al., 2011).
In practice, designers have successfully developed various applications with C2RTL tools in much shorter design times, such as face detection (Schafer et al., 2010), 3G/4G wireless communication (Guo & McCain, 2006), and digital video broadcasting (Rossler et al., 2009). However, the output quality of C2RTL tools is inferior to that of hand-crafted designs, especially for large behavioral descriptions. More scalable design architectures have recently been proposed, in which small modules are connected by first-in first-out (FIFO) channels. This provides a natural way to generate a design hierarchically and tame the complexity problem.
However, the FIFO-connected architecture faces several major challenges in practice. First, current tools leave it to the user to determine the FIFO capacity between modules, which is nontrivial. As shown in Section 2, the FIFO capacity has a great impact on system performance and memory resources. Although determining the FIFO capacity via extensive RTL-level simulation may work for a few modules, the exploration space becomes prohibitively large with many modules, so RTL-level simulation is neither time-efficient nor optimal. Second, the processing rates of different modules may be badly mismatched, causing serious performance degradation; block level parallelism should be introduced to resolve these mismatches. Finally, partitioning the C program is another challenge for the hierarchical design methodology.
This chapter proposes a novel C2RTL framework for configurable embedded systems. It supports a hierarchical way to implement complex streaming applications, determines the FIFO capacities automatically, and adopts block level parallelism. Our contributions are as follows: 1) Unlike the flattened design, which treats the whole algorithm as one module, we cut a complex streaming algorithm into modules and connect them with FIFOs. Experimental results show that the hierarchical implementation provides up to 10.43 times speedup over the flattened design. 2) We formulate the parameters of modules in streaming applications and build a behavior-level simulator that determines the optimal FIFO capacities very quickly. Furthermore, we provide an algorithm that realizes block level parallelism under a given area requirement. 3) We demonstrate the proposed method on seven real applications with good results. Compared to a uniform FIFO capacity, our method saves memory resources by 14.46 times. Moreover, the algorithm optimizes the FIFO capacities in seconds, while extensive RTL-level simulation may take hours. Finally, we show that proper block level parallelism provides up to 22.94 times speedup with reasonable area overheads.
The rest of the chapter is organized as follows. Section 2 describes the motivation of our work. We present our framework in Section 3. The algorithms for optimal FIFO sizing and block level parallelism are formulated in Sections 4 and 5. Section 6 presents experimental results. Section 7 reviews previous work in this domain. Section 8 concludes the chapter.

Motivation
This section motivates the proposed hierarchical C2RTL framework for FIFO-connected streaming applications. We first compare the hierarchical approach with the flattened one, and then point out the importance of block level parallelism and FIFO sizing.

Hierarchical vs. flattened approach
The flattened C2RTL approach transforms the whole C algorithm into one large module automatically. However, it faces two challenges in practice. 1) The translation time becomes unacceptable when the algorithm reaches hundreds of lines. In our experiments, compiling algorithms of over one thousand lines into hardware description language (HDL) code could take several days or even fail. 2) The synthesized quality for large algorithms is not as good as for small ones. Although the user may adjust the coding style, unroll loops, or inline functions, the effect is usually limited.
Unlike the flattened method, the hierarchical approach splits a large algorithm into several small ones, synthesizes them separately, and connects the resulting modules with FIFOs.
This provides a flexible architecture as well as small modules with better performance. For example, we synthesized the JPEG encode algorithm into HDL with eXCite (Y Exploration Inc., 2011) both directly and with the proposed solution. The flattened design costs 42,475,202 clock cycles at a maximum clock frequency of 69.74 MHz to complete one computation, while the hierarchical one spends 4,070,603 clock cycles at a maximum clock frequency of 74.2 MHz. This implies a 10.43 times performance speedup and a 7.2% clock frequency improvement.

Performance with different block number
Among the multiple blocks of a hierarchical design, there are processing rate mismatches that greatly impact system performance. For example, Figure 1 shows parallelism of the IDCT module, the slowest block in the JPEG decoder. The decoder can be accelerated by duplicating the IDCT module. However, block level parallelism may incur nontrivial area overheads, so a balance point between area and performance must be found carefully.

Performance with different FIFO capacity
Moreover, determining the FIFO sizes becomes relevant in the hierarchical method. We demonstrate the clock cycles of a JPEG encoder under different FIFO sizes in Figure 2. As shown, the FIFO size leads to an over 50% performance difference. Interestingly, the throughput cannot be improved beyond a threshold, which varies from several to hundreds of bits across applications, as described in Section 6. However, always using large enough FIFOs (several hundred entries) is impractical due to the area overheads. Furthermore, designers need to decide the FIFO sizes iteratively when exploring different function partitions at the architecture level, and with several FIFOs in a design, the optimal sizes may interact with each other. Thus, determining proper FIFO sizes accurately and efficiently is important but complicated, and more efficient methods are preferred.

Hierarchical C2RTL framework
This section first shows the diagram of the proposed hierarchical C2RTL framework. We then define four major stages: function partition, parameter extraction, block level parallelism and FIFO interconnection.

System diagram
The framework consists of four steps, as shown in Figure 3. In Step 1, we partition the C code into appropriately sized functions. In Step 2, we use C2RTL tools to transform each function into a hardware processing element (PE) with a FIFO interface. We also extract the timing parameters of each PE to evaluate the partition of Step 1; if a partition violates the timing constraints, a design iteration is performed. In Step 3, we decide which PEs should be parallelized, and to what degree. In Step 4, we connect the PEs with properly sized FIFOs. Given a large-scale streaming algorithm, the framework generates the corresponding hardware module efficiently, with a much shorter synthesis time than the flattened approach. The hardware module can be encapsulated as an accelerator or as a component in other designs; its interface supports handshaking, bus, memory, or FIFO protocols. We denote several parameters of the module as follows: the number of PEs as N, the module's throughput as TH_all, the clock cycles to finish one computation as T_all, the clock frequency as CLK_all, and the design area as A_all.
Since C2RTL tools can synthesize small C code efficiently (Step 2), four main problems remain: how to partition the large-scale algorithm into properly sized functions (Step 1), which parameters to extract from each PE (Step 2), how to determine the parallelized PEs and their degrees (Step 3), and how to decide the optimal FIFO size between PEs (Step 4). We discuss them separately.

Function partition
The C code partition greatly impacts the final performance. On one hand, the partition affects the speed of the final hardware. For example, a very big function may lead to a very slow PE, and the whole design will be slowed down, since the system's throughput is decided by the slowest PE. We therefore need to adjust the slowest PE's partition; the simplest method is to split it into two modules. In fact, we observe that the ideal and most efficient partition gives every PE an identical throughput. On the other hand, the partition also affects the area. Too fine-grained partitions lead to many independent PEs, which not only reduces resource sharing but also increases the communication costs.
In this design flow, we use a manual partition strategy, because the lack of timing information in the C language makes automatic partitioning difficult. We therefore introduce an iterative design flow: based on the timing parameters extracted from the PEs by the C2RTL tools (defined in the next section), the designers refine the C code partition. Automating this partition flow is interesting future work.
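Since the ideal partition equalizes the PEs' throughputs, the iterative flow above can be sketched with a tiny balance check over the extracted throughputs. This is a hypothetical helper, not part of the framework's released code; the function name and the tolerance value are illustrative:

```python
def partition_balance(throughputs, tol=0.1):
    """Return the bottleneck PE index and whether the partition is unbalanced.

    throughputs: extracted throughput of each PE in the current partition.
    A relative spread above `tol` suggests splitting the slowest PE further.
    """
    bottleneck = min(range(len(throughputs)), key=lambda n: throughputs[n])
    spread = (max(throughputs) - throughputs[bottleneck]) / max(throughputs)
    return bottleneck, spread > tol
```

A designer would re-partition, re-run the C2RTL tool, and re-check until the reported spread is acceptable.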

Parameter extraction
We get each PE's timing information after the C2RTL conversion. In streaming applications, each PE has a working period T_n, under which the PE is never stalled by the overflow or underflow of a FIFO. During the period T_n, the PE reads, processes, and writes data.
We denote the input time as t_ni and the output time as t_no. In summary, we formulate the parameters of the n-th PE interface in Table 1. Based on a large number of PEs converted by eXCite, we have observed two types of interface parameters. Figure 4 shows the waveform of type II, where t_n is less than T_n. In type I, t_n equals T_n, which indicates that the idle time is zero.
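Since Table 1 is not reproduced here, a minimal sketch of the extracted interface parameters might look as follows. The field names, the data-amount fields `d_i`/`d_o`, and the assumption that t_n = t_ni + t_no are ours, not the chapter's:

```python
from dataclasses import dataclass

@dataclass
class PEParams:
    """Interface parameters of the n-th PE (illustrative reconstruction)."""
    T: int      # working period T_n, in clock cycles
    t_i: int    # input time t_ni: cycles spent reading input per period
    t_o: int    # output time t_no: cycles spent writing output per period
    d_i: int    # input words consumed per period (assumed field)
    d_o: int    # output words produced per period (assumed field)

    def input_throughput(self) -> float:
        # TH_ni: input words per cycle, averaged over one working period
        return self.d_i / self.T

    def output_throughput(self) -> float:
        # TH_no: output words per cycle
        return self.d_o / self.T

    def is_type_I(self) -> bool:
        # Type I: no idle time, i.e. t_n (taken here as t_i + t_o) reaches T_n
        return self.t_i + self.t_o >= self.T
```

These records are what the behavior-level simulator consumes instead of full RTL.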

Block level parallelism
To implement block level parallelism, we denote the n-th PE's parallelism degree as P_n, assuming that no data dependence exists among PE_n's tasks. Thus, P_n = 1 means that the design does not parallelize this PE. When P_n > 1, we implement block level parallelism using a MUX, a DEMUX, and a simple controller, as shown in Figure 5.

Embedded Systems - Theory and Design Methodology www.intechopen.com

Figure 6 illustrates the working mechanism of the n-th parallelized PE. It shows a case of two-level block parallelism with t_ni > t_no. In this case, the inputs and outputs of the parallelized blocks work serially: the second copy of PE_n must be delayed by t_ni by the controller, so as to wait for the first copy to load its input data. However, when another working period T_n starts, the first copy can start immediately without waiting for the second.
As we can see, the interface of the new PE_n after parallelism remains the same as in Table 1. However, the values of the input and output parameters must be updated to account for the parallelism, as discussed in Section 4.2.

FIFO interconnection
To deal with the FIFO interconnection, we first define the parameters of a FIFO. They will be used to analyze the performance in the next section. Figure 7 shows the signals of a FIFO.
F_clk denotes the clock signal of the FIFO F. F_we and F_re are the write and read enable signals. F_dat_i and F_dat_o are the input and output data buses. F_ful and F_emp indicate the full and empty states, and are active high. Given a FIFO, its parameters are shown in Table 2. To connect modules with FIFOs, we need to determine the depth D_(n−1)n and the width W_(n−1)n.
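The FIFO of Figure 7 can be modeled behaviorally for system-level simulation. The sketch below is our own illustration of such a model (the class and method names are not from the chapter); the comments map methods to the signals just described:

```python
from collections import deque

class Fifo:
    """Behavioral FIFO model of depth D for system-level simulation."""

    def __init__(self, depth):
        self.depth = depth
        self.q = deque()

    def full(self):          # F_ful, active high
        return len(self.q) >= self.depth

    def empty(self):         # F_emp, active high
        return not self.q

    def write(self, word):   # F_we asserted; rejected when full
        if self.full():
            return False
        self.q.append(word)
        return True

    def read(self):          # F_re asserted; returns None when empty
        return self.q.popleft() if self.q else None
```

A producer PE stalls when `write` fails and a consumer stalls when `read` yields None, which is exactly the overflow/underflow behavior that bounds each PE's working period.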

Algorithm for block level parallelism
This section formulates the block level parallelism problem. After that, we propose an algorithm to solve it for multiple PEs at the system level.

Block level parallelism formulation
Given a design with N PEs, a throughput constraint TH_ref and an area constraint A_ref, we decide the n-th PE's parallelism degree P_n. That is,

maximize TH_all subject to Σ_{n=1..N} A'_n ≤ A_ref,

where TH_all denotes the entire throughput and A'_n is PE_n's area after block level parallelism. Without loss of generality, we assume that the capacity of all FIFOs is infinite and A_ref = ∞; FIFO sizing is left to the next section.

Parameter extraction after block level parallelism
Before determining the parallelism degree of each PE, we first discuss how to extract the new interface parameters of a PE after parallelism, i.e., how to update TH'_ni/o, A'_n, T'_n, f'_n, and SoP'_n, which are calculated from P_n and the original TH_ni/o, A_n, T_n, f_n, and SoP_n.
First, we calculate TH'_ni/o. As Figure 8 shows, a larger parallelism degree does not always increase the throughput; it is limited by the input time t_ni. Assuming t_ni > t_no, when P_n ≤ ⌊T_n / t_ni⌋ we have

TH'_ni/o = P_n * TH_ni/o.

For example, in Figure 6, TH'_ni/o = 2 * TH_ni/o because P_n = 2 < ⌊T_n / t_ni⌋ = 3. When P_n ≥ ⌈T_n / t_ni⌉, we have

TH'_ni/o = (T_n / t_ni) * TH_ni/o,

where the throughput is limited by the input time t_ni; a higher parallelism degree is useless in this case. For example, in Figure 8, TH'_ni/o = (T_n / t_ni) * TH_ni/o because P_n = 2 = ⌈T_n / t_ni⌉. When t_ni < t_no we have similar conclusions. In summary,

TH'_ni/o = P_n * TH_ni/o, if P_n < p_n;
TH'_ni/o = (T_n / max{t_ni, t_no}) * TH_ni/o, otherwise,

where p_n = ⌈T_n / max{t_ni, t_no}⌉.

Second, we solve for A'_n, T'_n, and f'_n. Ignoring the area of the controller, we have A'_n = P_n * A_n, and the clock frequency is unchanged, f'_n = f_n. Based on Figures 6 and 8, we conclude

T'_n = T_n + (P_n − 1) * max{t_ni, t_no}, if P_n ≤ p_n;
T'_n = P_n * max{t_ni, t_no}, otherwise.

Equation 5 shows that TH_ni and TH_no change at the same rate. Furthermore, SoP'_n is the combination of each sub-block's SoP_n. Finally, we obtain all new parameters of a PE after parallelism; we will use them to decide the parallelism degrees in Section 4.3 and Section 5.
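The update rules above can be collected into one small routine. This is a sketch of our reconstruction (in particular, the saturation degree p_n = ⌈T_n / max{t_ni, t_no}⌉ and the unchanged clock frequency are our reading of the garbled equations), not the chapter's code:

```python
from math import ceil

def update_after_parallelism(P, T, t_i, t_o, TH, A):
    """Update a PE's parameters for parallelism degree P (reconstruction).

    P: parallelism degree P_n; T: working period T_n;
    t_i, t_o: input/output times t_ni, t_no; TH: base throughput; A: base area.
    Returns the updated (TH', A', T').
    """
    t_max = max(t_i, t_o)
    p_sat = ceil(T / t_max)           # saturation degree p_n
    if P < p_sat:
        TH_new = P * TH               # throughput scales with the degree
    else:
        TH_new = (T / t_max) * TH     # limited by the slower interface
    A_new = P * A                     # controller area ignored
    if P <= p_sat:
        T_new = T + (P - 1) * t_max   # staggered starts lengthen the period
    else:
        T_new = P * t_max             # fully serialized interface
    return TH_new, A_new, T_new
```

For the Figure 6 case (P_n = 2, T_n/t_ni = 3) this doubles the throughput, matching the text.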

Block level parallelism degree optimization
To solve the optimization problem of Section 4.1, we need the relationship between TH_all and TH'_ni/o. When PE_n is connected to the chain from PE_1 to PE_(n−1), we define the output interface's throughput of PE_n as TH'_no. This parameter differs from TH_ni/o because it accounts for the rate mismatch effects of the previous PEs:

TH'_no = min{TH_1i/o, TH_2i/o, ..., TH_ni/o}.

In fact, TH_all = TH'_No. Therefore, we can express TH_all as

TH_all = min_{1≤n≤N} TH_ni/o = TH_bi/o,

where b is the index of the slowest PE_b, the bottleneck of the system. Algorithm 1 builds on this relationship: we compute the fastest achievable TH_all in Line 4, and if the system can never approach the optimization target, we relax the target in Line 6. In the main loop, we find the bottleneck at each step in Line 16 and increase its parallelism degree; we update TH'_ni/o in Line 18 and evaluate the system again in Line 19. The loop ends when the design constraints are satisfied.
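Since the Algorithm 1 listing is not reproduced in this extraction, the greedy bottleneck-driven search can be sketched as follows. This is our reconstruction under the stated assumptions (area constraint relaxed to ∞, throughput saturating at degree p_n):

```python
def optimize_blp(TH_list, p_sat_list, TH_ref):
    """Greedy parallelism-degree search (sketch of Algorithm 1).

    TH_list: base throughput TH_ni/o of each PE.
    p_sat_list: saturation degree p_n of each PE.
    TH_ref: target system throughput.
    Returns the BLP vector (P_1, ..., P_N) as a list.
    """
    N = len(TH_list)
    P = [1] * N

    def th(n):  # effective throughput of PE n at its current degree
        return min(P[n], p_sat_list[n]) * TH_list[n]

    # Fastest achievable TH_all: every PE at full saturation (cf. Line 4).
    th_max = min(p_sat_list[n] * TH_list[n] for n in range(N))
    target = min(TH_ref, th_max)      # relax an unreachable target (Line 6)

    while min(th(n) for n in range(N)) < target:
        b = min(range(N), key=th)     # bottleneck PE (Line 16)
        if P[b] >= p_sat_list[b]:
            break                     # bottleneck already saturated
        P[b] += 1                     # raise its degree and re-evaluate
    return P
```

Each iteration raises only the bottleneck's degree, which is why the method explores a finer-grained space than duplicating the whole design.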

Algorithm for FIFO-connected blocks
This section formulates the FIFO interconnection problem. We then demonstrate that it can be solved by a binary search algorithm. Finally, we propose an algorithm to solve the FIFO interconnection problem of multiple PEs at the system level.

FIFO interconnection formulation
Given a design consisting of N PEs, we need to determine the depth D_(i−1)i of each FIFO so as to maximize the entire throughput TH_all while minimizing the total FIFO area A_FIFO_all.

That is,

minimize A_FIFO_all subject to TH_all ≥ TH_ref and A_FIFO_all ≤ A_FIFO_ref,

where TH_ref and A_FIFO_ref can be user-specified constraints or optimal values of the design. Without loss of generality, we set TH_ref = (TH_all)_max and A_FIFO_ref = ∞. We assume that F_01 never becomes empty and F_N(N+1) never becomes full; that is, ∀m, SoF_01(m) = −1 and SoF_N(N+1)(m) = 1.

FIFO capacity optimization
There is a simple relationship between TH'_ni/o and the FIFO depths. For PE_n, we define the real throughput as TH'_ni/o when it is connected to F_(n−1) of depth D_(n−1) and F_(n+1) of depth D_(n+1). Then we set

TH'_ni/o = f(D_(n−1), D_(n+1)).

A small D_(n−1) or D_(n+1) will cause TH'_ni/o < TH_ni/o; and once TH'_ni/o = TH_ni/o, a larger D_(n−1) or D_(n+1) will not increase the performance any more. Therefore, as shown in Figure 2, f is a monotone nondecreasing function with an upper bound.
Given this relationship between TH'_ni/o and the depths D_i, we can solve the FIFO capacity optimization problem with a binary search based on system-level simulations. Algorithm 2 describes this method for multiple PEs (N > 2). The end condition is checked in Line 16: when n = N, all FIFOs have their optimal capacity. The most time-consuming part of the algorithm is the getTH() function, which requires an entire simulation of the hardware. We therefore build a system-level simulator instead of an RTL-level one, which shortens the optimization greatly. The system-level simulator adopts the parameters extracted in Step 2. The C-based system-level simulator will be released on our website soon.
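The per-FIFO binary search can be sketched as below. This is our reconstruction of the idea behind Algorithm 2, not the released code; `get_th` stands in for the chapter's getTH() simulation call, and the monotonicity of throughput in each depth is what justifies the search:

```python
def size_fifos(N, depth_max, get_th, th_target):
    """Binary-search FIFO sizing (sketch of Algorithm 2).

    N: number of PEs (so N-1 internal FIFOs).
    get_th(depths): system-level simulation returning TH_all for the depths.
    th_target: required throughput, e.g. (TH_all)_max.
    Returns the list of minimal depths meeting th_target.
    """
    depths = [depth_max] * (N - 1)        # start large enough everywhere
    for i in range(N - 1):
        lo, hi = 1, depth_max
        while lo < hi:                    # lower-bound binary search
            mid = (lo + hi) // 2
            depths[i] = mid
            if get_th(depths) >= th_target:
                hi = mid                  # mid suffices; try smaller
            else:
                lo = mid + 1              # too small; grow
        depths[i] = lo                    # minimal sufficient depth
    return depths
```

Each FIFO takes at most log2(depth_max) simulations, which is where the exploration-time estimate in Section 6 comes from.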

Experiments
In this section, we first explain our experimental configuration. Then we compare the flattened approach with the hierarchical method without and with block level parallelism (BLP) on several real benchmarks. After that, we break down the advantages into two aspects: block level parallelism and FIFO sizing. We then show the effectiveness of the proposed algorithm for optimizing the parallelism degree, and finally demonstrate the advantages of the FIFO sizing method.

Experimental configurations
In our experiments, we use the C2RTL tool eXCite (Y Exploration Inc., 2011). The HDL files are simulated with Mentor Graphics' ModelSim to get the timing information. The area and clock information is obtained with Altera's Quartus II, targeting Cyclone II FPGAs. We derive seven large streaming applications from the high-level synthesis benchmark suite CHStone (Hara et al., 2008). They come from real applications and cover image processing, security, telecommunication, and digital signal processing.
• JPEG encode/decode: transforms images between the JPEG and BMP formats.
• AES encryption/decryption: AES (Advanced Encryption Standard) is a symmetric-key cryptosystem.
• GSM: LPC (Linear Predictive Coding) analysis of GSM (Global System for Mobile Communications).
• ADPCM: Adaptive Differential Pulse Code Modulation is an algorithm for voice compression.
• Filter Group: the group includes two FIR filters, an FFT and an IFFT block.

System optimization for real cases
We show the synthesis results for the seven benchmarks and compare the flattened approach with the hierarchical approach without and with BLP. Table 3 shows the clock cycles saved by the hierarchical method without and with BLP. The last column of Table 3 gives the BLP vector; the i-th element denotes the parallelism degree of PE_i. The total speedup represents the clock cycle reduction of the hierarchical approach with BLP. As we can see, the hierarchical method without BLP achieves up to 10.43 times speedup over the flattened approach, and BLP provides up to another 5 times speedup on top of the hierarchical method without BLP. It should be noted that BLP leads to area overheads to some extent; we discuss those trade-offs in the following experiments. Furthermore, Table 4 shows the maximum clock frequency of the three approaches: BLP does not introduce extra delay compared with the pure hierarchical method.

Block level parallelism
The previous experiments showed the total advantage of the hierarchical method with BLP. This section discusses the performance and area overheads of BLP alone. We show the throughput improvement and the area costs for the GSM benchmark in Figure 9, with the BLP vector on the horizontal axis. As we can see, parallelizing some PEs increases the throughput. For the BLP vector (1, 2, 1, 1, 1, 1), we duplicate PE_2; this improves the performance by only 4% with 48% area overhead, a result of the rate mismatch between PEs. It indicates that duplicating a single PE may not increase the throughput effectively while the area overhead may be quite large. Therefore, we need an algorithm to find a BLP vector that boosts the performance without introducing too much overhead. For example, the BLP vector (4, 4, 4, 1, 1, 1) leads to over 4 times performance speedup with less than 3 times area overhead.
Furthermore, we evaluate the proposed BLP algorithm against the approach of duplicating the entire hardware. Figure 10 demonstrates that our algorithm increases the throughput with less area, because the BLP algorithm does not parallelize every PE and can explore a more fine-grained design space. The BLP method thus trades off performance against area more flexibly and efficiently. In fact, as modern FPGAs provide more and more logic elements, area is not as critical as performance, which is the first-priority metric in most cases.

Table 5. Optimal FIFO capacity algorithm experimental results for the 7 real cases.

Optimal FIFO capacity
We show the simulation results for real designs with multiple PEs. First, we show the relationship between the FIFO sizes and the running time T_all. Figure 11 shows the JPEG encoding case: the FIFO sizes have a great impact on the performance of the design. In this case, the optimal FIFO capacities are D_12 = 44 and D_23 = 2. As Table 6 shows, compared to the "large enough" design strategy, the memory savings are significant. Moreover, compared to deciding FIFO capacities with an RTL-level simulator, our approach is extremely time efficient. Consider a hardware design with N FIFOs, each sized by binary search; this requires log2(p) simulations per FIFO with an initial depth D_(n−1)n = p. Assuming the average ModelSim RTL-level simulation takes C seconds, the entire exploration time is N * log2(p) * C. For the Filter Group case with N = 5, p = 128, and C = 170 seconds, which are typical values on a normal PC, we would have to wait about 100 minutes to find the optimal FIFO sizes. Our system-level solution finishes the exploration in seconds.
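The exploration-time estimate above is easy to verify with the chapter's own numbers:

```python
from math import log2

# Cost of RTL-level FIFO exploration for the Filter Group case:
# N FIFOs, binary search over an initial depth p, C seconds per ModelSim run.
N, p, C = 5, 128, 170
rtl_seconds = N * log2(p) * C    # log2(128) = 7 simulations per FIFO
rtl_minutes = rtl_seconds / 60
print(round(rtl_minutes, 1))     # about 99 minutes, i.e. roughly 100
```

The same search driven by the behavior-level simulator replaces each 170-second ModelSim run with a sub-second evaluation, which is why it completes in seconds.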

Related works
Many C2RTL tools (Gokhale et al., 2000; Lhairech-Lebreton et al., 2010; Mencer, 2006; Villarreal et al., 2010) focus on streaming applications. They create design architectures in which different modules are connected by first-in first-out (FIFO) channels. Other tools target general-purpose applications; for example, Catapult C (Mentor Graphics, 2011) takes different timing and area constraints to generate Pareto-optimal solutions from common C algorithms. However, the limited control over the architecture leads to suboptimal results. As (Agarwal, 2009) has shown, the FIFO-connected architecture generates much faster and smaller results for streaming applications.
Among C2RTL tools for streaming applications, GAUT (Lhairech-Lebreton et al., 2010) transforms C functions into pipelined modules consisting of processing units, memory units, and communication units; globally asynchronous, locally synchronous interconnections connect modules with multiple clocks. ROCCC (Villarreal et al., 2010) creates efficient pipelined circuits from C that can be reused in other modules or system codes. Impulse C (Gokhale et al., 2000) provides a C language extension to define parallel processes and communication channels among modules. ASC (Mencer, 2006) provides a design environment in which users optimize systems from the algorithm level to the gate level, all within the same C++ program. However, these works leave the problem of efficiently determining the FIFO capacity unsolved. Most recently, (Li et al., 2012) presented a hierarchical C2RTL framework with analytical formulas to determine the FIFO capacity; however, it does not support block level parallelism, and its FIFO sizing method is limited to PEs with certain input/output interfaces.
A key step in the hierarchical C2RTL flow is partitioning a large C program into several functions, and plenty of work has been done in this field. Many C-based high-level synthesis tools, such as SPARK (Gupta et al., 2004), eXCite (Y Exploration Inc., 2011), Cyber (NEC Inc., 2011), and CCAP (Nishimura et al., 2006), can partition the input code into several functions, each with a corresponding hardware module. However, this incurs a nontrivial datapath area overhead because it eliminates resource sharing among modules. Conversely, function inlining can reduce the datapath area via resource sharing, but the fast-growing complexity of the controller makes this method inefficient. Appropriate function clustering (Okada et al., 2002) into submodules provides a more elegant way to solve the partition problem, but it is hard to find a proper clustering rule; for example, too many functions in one cluster again leads to prohibitive controller complexity. In practice, architects often help the partitioning program by dividing the C algorithms manually.
Similar to the hierarchical C2RTL flow, multiple FIFO-connected processing elements (PEs) are used to process audio and video streams in mobile embedded devices. Researchers have investigated the input streaming rates to ensure that the FIFOs between PEs do not overflow while the real-time processing requirements are met. On-chip traffic analysis for SoC architectures has been explored (Lahiri et al., 2001), but such simulation-based approaches suffer from long execution times and fail to explore a large design space. A mathematical framework of rate analysis for streaming applications was proposed in (Cruz, 1995). Based on network calculus, (Maxiaguine et al., 2004) extended service curves to show how to shape an input stream to meet buffer constraints, and (Liu et al., 2006) discussed generalized rate analysis for multimedia processing platforms. However, all of them adopt a more complicated behavioral model of PE streams than is necessary in the hierarchical C2RTL framework.

Conclusion
Improving the emerging C2RTL design methodology to make it more widely usable is the goal of many researchers, and our framework achieves such an improvement. We first propose a hierarchical C2RTL design flow that outperforms the traditional flattened one. Moreover, we propose a block level parallelism method to increase throughput, together with an algorithm to decide the parallelism degrees. Finally, we develop a heuristic algorithm to find the optimal FIFO capacities in a multiple-module design. Experimental results show that the hierarchical approach improves performance by up to 10.43 times, and block level parallelism contributes an extra 4 times speedup with 194% area overhead. Moreover, the framework determines the optimal FIFO capacities accurately and quickly. Future work includes automatic C code partitioning in the hierarchical C2RTL framework and applying our optimization algorithms to more complex architectures with feedback and branches.

Acknowledgement
The authors would like to thank the reviewers for their helpful suggestions to improve the chapter. This work was supported in part by the NSFC under grants 60976032 and 61021001,